But it is nevertheless highly important to not look at the test data and to run systems on it as sparingly as possible. Beginners often violate this rule, and their results lose validity because they have implicitly tuned their system to the test data simply by running many variant systems and keeping the tweaks to the system that worked best on the test set.

Exercise 13.6 [⋆⋆]
Assume a situation where every document in the test collection has been assigned exactly one class, and that a classifier also assigns exactly one class to each document. This setup is called one-of classification (Section 14.5, page 306). Show that in one-of classification (i) the total number of false positive decisions equals the total number of false negative decisions and (ii) microaveraged F1 and accuracy are identical.

Exercise 13.7
The class priors in Figure 13.2 are computed as the fraction of documents in the class as opposed to the fraction of tokens in the class. Why?

Exercise 13.8
The function APPLYMULTINOMIALNB in Figure 13.2 has time complexity Θ(La + |C| La). How would you modify the function so that its time complexity is Θ(La + |C| Ma)?
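For orientation on Exercise 13.8, here is a minimal Python sketch in the style of the APPLYMULTINOMIALNB pseudocode of Figure 13.2; the dictionary-based prior/condprob structures and variable names are assumptions of this sketch, not the book's code. The inner loop visits all La tokens of the test document once per class, which is where the Θ(|C| La) term comes from.

```python
import math

def apply_multinomial_nb(classes, vocab, prior, condprob, doc_tokens):
    """Score a test document in the style of the ApplyMultinomialNB
    pseudocode (Figure 13.2). prior[c] and condprob[t][c] are assumed
    to have been estimated during training."""
    # Keep only tokens that are in the vocabulary: Theta(La) work.
    tokens = [t for t in doc_tokens if t in vocab]
    score = {}
    for c in classes:
        score[c] = math.log(prior[c])
        # This loop touches all La tokens once per class,
        # which gives the Theta(|C| * La) part of the running time.
        for t in tokens:
            score[c] += math.log(condprob[t][c])
    # Return the class with the highest log score.
    return max(score, key=score.get)
```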
Exercise 13.9
Based on the data in Table 13.10, (i) estimate a multinomial Naive Bayes classifier, (ii) apply the classifier to the test document, (iii) estimate a Bernoulli NB classifier, (iv) apply the classifier to the test document. You need not estimate parameters that you don't need for classifying the test document.

Exercise 13.10
Your task is to classify words as English or not English. Words are generated by a source with the following distribution:

event   word   English?   probability
  1     ozb      no          4/9
  2     uzu      no          4/9
  3     zoo      yes         1/18
  4     bun      yes         1/18

(i) Compute the parameters (priors and conditionals) of a multinomial NB classifier that uses the letters b, n, o, u, and z as features. Assume a training set that reflects the probability distribution of the source perfectly. Make the same independence assumptions that are usually made for a multinomial classifier that uses terms as features for text classification. Compute parameters using smoothing, in which computed-zero probabilities are smoothed into probability 0.01, and computed-nonzero probabilities are untouched. (This simplistic smoothing may cause P(A) + P(Ā) > 1. Solutions are not required to correct this.) (ii) How does the classifier classify the word zoo? (iii) Classify the word zoo using a multinomial classifier as in part (i), but do not make the assumption of positional independence. That is, estimate separate parameters for each position in a word. You only need to compute the parameters you need for classifying zoo.
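One possible way to set up the part (i) computation, not offered as the official solution: the sketch below builds a training set of 18 words that reflects the source distribution exactly (8 × ozb, 8 × uzu, 1 × zoo, 1 × bun), counts letter occurrences per class, and applies the exercise's simplistic smoothing rule.

```python
from collections import Counter

# A training set that reflects the source distribution perfectly:
# 16 of 18 words are ozb/uzu (not English), one each of zoo and bun (English).
training_set = [("ozb", "no")] * 8 + [("uzu", "no")] * 8 + [("zoo", "yes"), ("bun", "yes")]
letters = "bnouz"

# Priors: fraction of words in each class.
n_total = len(training_set)
priors = {c: sum(1 for _, cls in training_set if cls == c) / n_total
          for c in ("yes", "no")}

# Multinomial conditionals: fraction of letter occurrences within each class,
# with computed-zero probabilities smoothed to 0.01 as the exercise specifies.
condprob = {}
for c in ("yes", "no"):
    counts = Counter(ch for word, cls in training_set if cls == c for ch in word)
    total = sum(counts.values())
    condprob[c] = {ch: (counts[ch] / total if counts[ch] > 0 else 0.01)
                   for ch in letters}

print(priors)           # {'yes': 1/9, 'no': 8/9}
print(condprob["yes"])  # e.g. P(o | English) = 2/6
print(condprob["no"])   # e.g. P(n | not English) = 0.01 after smoothing
```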
Exercise 13.11
What are the values of I(Ut; Cc) and X²(D, t, c) if term and class are completely independent? What are the values if they are completely dependent?

Exercise 13.12
The feature selection method in Equation (13.16) is most appropriate for the Bernoulli model. Why? How could one modify it for the multinomial model?

Exercise 13.13
Features can also be selected according to information gain (IG), which is defined as:

IG(D, t, c) = H(pD) − Σ_{x ∈ {Dt+, Dt−}} (|x| / |D|) · H(px)

where H is entropy, D is the training set, and Dt+ and Dt− are the subset of D with term t and the subset of D without term t, respectively. pA is the class distribution in (sub)collection A, e.g., pA(c) = 0.25, pA(c̄) = 0.75 if a quarter of the documents in A are in class c. Show that mutual information and information gain are equivalent.
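As a quick numerical illustration of the equivalence claimed in Exercise 13.13 (not a proof), the sketch below computes expected mutual information from hypothetical 2 × 2 document counts and information gain from the corresponding class distributions; the counts and function names are invented for this example.

```python
import math

def entropy(probs):
    """Entropy (base 2) of a discrete distribution, ignoring zero entries."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(n11, n10, n01, n00):
    """Expected mutual information I(U_t; C_c) from 2x2 counts
    (n11: term present & in class, n10: present & not in class, etc.)."""
    n = n11 + n10 + n01 + n00
    total = 0.0
    for n_tc, n_t, n_c in [
        (n11, n11 + n10, n11 + n01),  # term present, in class
        (n10, n11 + n10, n10 + n00),  # term present, not in class
        (n01, n01 + n00, n11 + n01),  # term absent, in class
        (n00, n01 + n00, n10 + n00),  # term absent, not in class
    ]:
        if n_tc > 0:
            total += (n_tc / n) * math.log2(n * n_tc / (n_t * n_c))
    return total

def information_gain(n11, n10, n01, n00):
    """IG(D, t, c) = H(p_D) - sum over D_t+, D_t- of (|x|/|D|) H(p_x)."""
    n = n11 + n10 + n01 + n00
    h_d = entropy([(n11 + n01) / n, (n10 + n00) / n])           # class dist. in D
    h_tplus = entropy([n11 / (n11 + n10), n10 / (n11 + n10)])   # docs containing t
    h_tminus = entropy([n01 / (n01 + n00), n00 / (n01 + n00)])  # docs without t
    return h_d - ((n11 + n10) / n) * h_tplus - ((n01 + n00) / n) * h_tminus

# Hypothetical counts: both quantities agree.
print(mutual_information(30, 10, 20, 140))
print(information_gain(30, 10, 20, 140))
```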
Exercise 13.14
Show that the two X² formulas (Equations (13.18) and (13.19)) are equivalent.

Exercise 13.15
In the χ² example on page 276 we have |N11 − E11| = |N10 − E10| = |N01 − E01| = |N00 − E00|. Show that this holds in general.

Exercise 13.16
χ² and mutual information do not distinguish between positively and negatively correlated features. Because most good text classification features are positively correlated (i.e., they occur more often in c than in c̄), one may want to explicitly rule out the selection of negative indicators. How would you do this?
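Exercise 13.14 can likewise be checked numerically on an example (which, again, is not a proof): the sketch below evaluates X² both as a sum of (N − E)²/E terms over the four cells of the 2 × 2 table and via the closed-form product expression, assuming these are the two forms given in Equations (13.18) and (13.19); the counts are invented.

```python
def chi_square_cells(n11, n10, n01, n00):
    """X^2 as a sum of (N - E)^2 / E over the four cells of the 2x2 table,
    where E is the expected count under independence of term and class."""
    n = n11 + n10 + n01 + n00
    chi2 = 0.0
    for n_obs, row, col in [
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]:
        expected = row * col / n  # marginal product / N
        chi2 += (n_obs - expected) ** 2 / expected
    return chi2

def chi_square_closed_form(n11, n10, n01, n00):
    """Equivalent closed-form expression for the 2x2 X^2 statistic."""
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    return num / den

# Hypothetical counts: both forms give the same value.
print(chi_square_cells(30, 10, 20, 140))
print(chi_square_closed_form(30, 10, 20, 140))
```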
13.7 References and further reading

General introductions to statistical classification and machine learning can be found in (Hastie et al. 2001), (Mitchell 1997), and (Duda et al. 2000), including many important methods (e.g., decision trees and boosting) that we do not cover. A comprehensive review of text classification methods and results is (Sebastiani 2002). Manning and Schütze (1999, Chapter 16) give an accessible introduction to text classification with coverage of decision trees, perceptrons and maximum entropy models. More information on the superlinear time complexity of learning methods that are more accurate than Naive Bayes can be found in (Perkins et al. 2003) and (Joachims 2006a).

Maron and Kuhns (1960) described one of the first NB text classifiers. Lewis (1998) focuses on the history of NB classification. Bernoulli and multinomial models and their accuracy for different collections are discussed by McCallum and Nigam (1998). Eyheramendy et al. (2003) present additional NB models. Domingos and Pazzani (1997), Friedman (1997), and Hand and Yu (2001) analyze why NB performs well although its probability estimates are poor. The first paper also discusses NB's optimality when the independence assumptions are true of the data. Pavlov et al. (2004) propose a modified document representation that partially addresses the inappropriateness of the independence assumptions. Bennett (2000) attributes the tendency of NB probability estimates to be close to either 0 or 1 to the effect of document length.
Ng and Jordan (2001) show that NB is sometimes (although rarely) superior to discriminative methods because it more quickly reaches its optimal error rate. The basic NB model presented in this chapter can be tuned for better effectiveness (Rennie et al. 2003; Kołcz and Yih 2007). The problem of concept drift and other reasons why state-of-the-art classifiers do not always excel in practice are discussed by Forman (2006) and Hand (2006).

Early uses of mutual information and χ² for feature selection in text classification are Lewis and Ringuette (1994) and Schütze et al. (1995), respectively. Yang and Pedersen (1997) review feature selection methods and their impact on classification effectiveness. They find that pointwise mutual information is not competitive with other methods. Yang and Pedersen refer to expected mutual information (Equation (13.16)) as information gain (see Exercise 13.13, page 285).
(Snedecor and Cochran 1989) is a good reference for the χ² test in statistics, including the Yates' correction for continuity for 2 × 2 tables. Dunning (1993) discusses problems of the χ² test when counts are small. Nongreedy feature selection techniques are described by Hastie et al. (2001). Cohen (1995) discusses the pitfalls of using multiple significance tests and methods to avoid them. Forman (2004) evaluates different methods for feature selection for multiple classifiers. David D. Lewis defines the ModApte split at www.daviddlewis.com/resources/testcollections/reuters21578 based on Apté et al. (1994).
Lewis (1995) describes utility measures for the evaluation of text classification systems. Yang and Liu (1999) employ significance tests in the evaluation of text classification methods. Lewis et al. (2004) find that SVMs (Chapter 15) perform better on Reuters-RCV1 than kNN and Rocchio (Chapter 14).

14 Vector space classification

The document representation in Naive Bayes is a sequence of terms or a binary vector ⟨e1, . . . , e|V|⟩ ∈ {0, 1}^|V|. In this chapter we adopt a different representation for text classification, the vector space model, developed in Chapter 6. It represents each document as a vector with one real-valued component, usually a tf-idf weight, for each term. Thus, the document space X, the domain of the classification function γ, is R^|V|. This chapter introduces a number of classification methods that operate on real-valued vectors.

The basic hypothesis in using the vector space model for classification is the contiguity hypothesis.

Contiguity hypothesis. Documents in the same class form a contiguous region and regions of different classes do not overlap.
There are many classification tasks, in particular the type of text classification that we encountered in Chapter 13, where classes can be distinguished by word patterns. For example, documents in the class China tend to have high values on dimensions like Chinese, Beijing, and Mao whereas documents in the class UK tend to have high values for London, British and Queen. Documents of the two classes therefore form distinct contiguous regions as shown in Figure 14.1 and we can draw boundaries that separate them and classify new documents. How exactly this is done is the topic of this chapter.

Whether or not a set of documents is mapped into a contiguous region depends on the particular choices we make for the document representation: type of weighting, stop list etc.
To see that the document representation is crucial, consider the two classes written by a group vs. written by a single person. Frequent occurrence of the first person pronoun I is evidence for the single-person class. But that information is likely deleted from the document representation if we use a stop list. If the document representation chosen is unfavorable, the contiguity hypothesis will not hold and successful vector space classification is not possible.

◮ Figure 14.1 Vector space classification into three classes (UK, China, and Kenya).

The same considerations that led us to prefer weighted representations, in particular length-normalized tf-idf representations, in Chapters 6 and 7 also apply here. For example, a term with 5 occurrences in a document should get a higher weight than a term with one occurrence, but a weight 5 times larger would give too much emphasis to the term. Unweighted and unnormalized counts should not be used in vector space classification.
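As a small illustration of this point, the following sketch applies the sublinear tf scaling and cosine length normalization of Chapter 6 (wf = 1 + log10 tf for tf > 0, idf = log10(N/df)); the toy counts are invented. A term occurring 5 times then gets tf weight 1 + log10 5 ≈ 1.7, not 5 times the weight of a single occurrence.

```python
import math

def tfidf_vector(term_counts, df, n_docs):
    """Length-normalized tf-idf vector for one document.

    term_counts: raw term frequencies in the document,
    df: document frequencies of the terms, n_docs: collection size N."""
    weights = {}
    for term, tf in term_counts.items():
        wf = 1 + math.log10(tf) if tf > 0 else 0.0   # sublinear tf scaling
        idf = math.log10(n_docs / df[term])          # idf weighting
        weights[term] = wf * idf
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} # cosine normalization

# Toy counts with equal idf: the 5-occurrence term gets ~1.7x the weight, not 5x.
doc = {"beijing": 5, "mao": 1}
df = {"beijing": 100, "mao": 100}
print(tfidf_vector(doc, df, n_docs=10000))
```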
We introduce two vector space classification methods in this chapter, Rocchio and kNN. Rocchio classification (Section 14.2) divides the vector space into regions centered on centroids or prototypes, one for each class, computed as the center of mass of all documents in the class. Rocchio classification is simple and efficient, but inaccurate if classes are not approximately spheres with similar radii.

kNN or k nearest neighbor classification (Section 14.3) assigns the majority class of the k nearest neighbors to a test document. kNN requires no explicit training and can use the unprocessed training set directly in classification. It is less efficient than other classification methods in classifying documents. If the training set is large, then kNN can handle non-spherical and other complex classes better than Rocchio.

A large number of text classifiers can be viewed as linear classifiers – classifiers that classify based on a simple linear combination of the features (Section 14.4).
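As a preview of Sections 14.2 and 14.3, here is a minimal sketch of both decision rules on length-normalized document vectors; the NumPy representation, function names, and toy data are choices of this sketch, not the book's algorithms verbatim.

```python
from collections import Counter
import numpy as np

def train_rocchio(X, y):
    """Compute one centroid (center of mass) per class from training vectors X."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def apply_rocchio(centroids, x):
    """Assign the class whose centroid is closest to x."""
    return min(centroids, key=lambda c: np.linalg.norm(centroids[c] - x))

def apply_knn(X, y, x, k=3):
    """Assign the majority class among the k training vectors most similar to x.
    With length-normalized vectors, the dot product is cosine similarity."""
    sims = X @ x
    nearest = np.argsort(-sims)[:k]
    return Counter(y[nearest]).most_common(1)[0][0]

# Tiny example with 2-dimensional "document vectors" (unit length).
X = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
X = X / np.linalg.norm(X, axis=1, keepdims=True)
y = np.array(["China", "China", "UK", "UK"])

doc = np.array([0.7, 0.3])
doc = doc / np.linalg.norm(doc)
print(apply_rocchio(train_rocchio(X, y), doc))  # China
print(apply_knn(X, y, doc, k=3))                # China
```

Note how the sketch mirrors the efficiency trade-off described above: Rocchio does its work at training time, reducing each class to a single centroid, while kNN stores the training set as-is and defers all comparisons to classification time.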