Classification Methods
1. Manual classification
(1) Very accurate when job is done by experts
(2) Consistent when the problem size and team is small
(3) Difficult and expensive to scale
2. Automatic document classification: Hand-coded rule-based systems
3. Supervised learning of a document-label assignment function
Many systems partly rely on machine learning models, such as SVM and KNN
Naive Bayes Assumption: assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities P(xi|cj).(conditional independence assumption)
Feature Selection
(1) Text collections have a large number of features
(2) May make using a particular classifier feasible
(3) Reduces training time
(4) Can improve generalization (performance)
Evaluating Categorization
(1) Evaluation must be done on test data that are independent of the training data (usually a disjoint set of instances).
(2) Results can vary based on sampling error due to different training and test sets.
(3) Average results over multiple training and test sets (splits of the overall data) for the best results.
Vector space methods for Text Classification
(1) K Nearest Neighbors
(2) Decision boundaries
(3) Vector space classification using centroids
(4) Decision Trees (briefly)
k Nearest Neighbor Classification
(1) To classify document d into class c:
(2) Define k-neighborhood N as k nearest neighbors of d
(3) Count number of documents i in N that belong to c
(4) Estimate P(c|d) as i/k
(5) Choose as class argmaxc P(c|d) [ = majority class]
No comments:
Post a Comment