An Introduction to Information Retrieval. Manning, Raghavan (2009), page 66
Find exceptions. Consider formulaic documents with a fixed document structure.

Exercise 13.4
Table 13.3 gives Bernoulli and multinomial estimates for the word the. Explain the difference.

13.5 Feature selection

Feature selection is the process of selecting a subset of the terms occurring in the training set and using only this subset as features in text classification.
Feature selection serves two main purposes. First, it makes training and applying a classifier more efficient by decreasing the size of the effective vocabulary. This is of particular importance for classifiers that, unlike NB, are expensive to train. Second, feature selection often increases classification accuracy by eliminating noise features. A noise feature is one that, when added to the document representation, increases the classification error on new data. Suppose a rare term, say arachnocentric, has no information about a class, say China, but all instances of arachnocentric happen to occur in China documents in our training set.
Then the learning method might produce a classifier that misassigns test documents containing arachnocentric to China. Such an incorrect generalization from an accidental property of the training set is called overfitting.

We can view feature selection as a method for replacing a complex classifier (using all features) with a simpler one (using a subset of the features). It may appear counterintuitive at first that a seemingly weaker classifier is advantageous in statistical text classification, but when discussing the bias-variance tradeoff in Section 14.6 (page 308), we will see that weaker models are often preferable when limited training data are available.

The basic feature selection algorithm is shown in Figure 13.6.
For a given class c, we compute a utility measure A(t, c) for each term of the vocabulary and select the k terms that have the highest values of A(t, c). All other terms are discarded and not used in classification. We will introduce three different utility measures in this section: mutual information, A(t, c) = I(Ut; Cc); the χ² test, A(t, c) = X²(t, c); and frequency, A(t, c) = N(t, c).

Of the two NB models, the Bernoulli model is particularly sensitive to noise features. A Bernoulli NB classifier requires some form of feature selection or else its accuracy will be low.

This section mainly addresses feature selection for two-class classification tasks like China versus not-China.
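Figure 13.6 itself is not reproduced in this excerpt, but the generic procedure it describes can be sketched in a few lines of Python. The names select_features and utility below are illustrative, not from the book; the utility argument stands for any of the three measures just listed.

```python
def select_features(vocabulary, documents, c, k, utility):
    """Return the k terms with the highest utility A(t, c) for class c.

    vocabulary: iterable of candidate terms
    documents:  training documents, passed through to the utility function
    c:          the target class
    k:          number of features to keep
    utility:    function (documents, term, c) -> float, e.g. mutual
                information, the chi-square statistic, or frequency
    """
    # Score every term in the vocabulary with the chosen utility measure.
    scored = [(utility(documents, t, c), t) for t in vocabulary]
    # Keep the k terms with the largest values; all other terms are
    # discarded and not used in classification.
    scored.sort(reverse=True)
    return [t for _, t in scored[:k]]
```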
Section 13.5.5 briefly discusses optimizations for systems with more than two classes.

13.5.1 Mutual information

A common feature selection method is to compute A(t, c) as the expected mutual information (MI) of term t and class c.⁵ MI measures how much information the presence/absence of a term contributes to making the correct classification decision on c. Formally:

(13.16)
$$ I(U;C) = \sum_{e_t \in \{1,0\}} \sum_{e_c \in \{1,0\}} P(U = e_t, C = e_c) \log_2 \frac{P(U = e_t, C = e_c)}{P(U = e_t)\,P(C = e_c)}, $$

where U is a random variable that takes values et = 1 (the document contains term t) and et = 0 (the document does not contain t), as defined on page 266, and C is a random variable that takes values ec = 1 (the document is in class c) and ec = 0 (the document is not in class c).
We write Ut and Cc if it is not clear from context which term t and class c we are referring to. For MLEs of the probabilities, Equation (13.16) is equivalent to Equation (13.17):

(13.17)
$$ I(U;C) = \frac{N_{11}}{N}\log_2\frac{N N_{11}}{N_{1.} N_{.1}} + \frac{N_{01}}{N}\log_2\frac{N N_{01}}{N_{0.} N_{.1}} + \frac{N_{10}}{N}\log_2\frac{N N_{10}}{N_{1.} N_{.0}} + \frac{N_{00}}{N}\log_2\frac{N N_{00}}{N_{0.} N_{.0}}, $$

where the Ns are counts of documents that have the values of et and ec that are indicated by the two subscripts. For example, N10 is the number of documents that contain t (et = 1) and are not in c (ec = 0). N1. = N10 + N11 is the number of documents that contain t (et = 1) and we count documents independent of class membership (ec ∈ {0, 1}).
N = N00 + N01 + N10 + N11 is the total number of documents. An example of one of the MLE estimates that transform Equation (13.16) into Equation (13.17) is P(U = 1, C = 1) = N11/N.

5. Take care not to confuse expected mutual information with pointwise mutual information, which is defined as log N11/E11, where N11 and E11 are defined as in Equation (13.18). The two measures have different properties. See Section 13.7.

✎ Example 13.3: Consider the class poultry and the term export in Reuters-RCV1. The counts of the number of documents with the four possible combinations of indicator values are as follows:

                         ec = epoultry = 1    ec = epoultry = 0
    et = eexport = 1     N11 = 49             N10 = 27,652
    et = eexport = 0     N01 = 141            N00 = 774,106

After plugging these values into Equation (13.17) we get:

$$
\begin{aligned}
I(U;C) ={}& \frac{49}{801{,}948}\log_2\frac{801{,}948 \cdot 49}{(49+27{,}652)(49+141)} + \frac{141}{801{,}948}\log_2\frac{801{,}948 \cdot 141}{(141+774{,}106)(49+141)} \\
& + \frac{27{,}652}{801{,}948}\log_2\frac{801{,}948 \cdot 27{,}652}{(49+27{,}652)(27{,}652+774{,}106)} + \frac{774{,}106}{801{,}948}\log_2\frac{801{,}948 \cdot 774{,}106}{(141+774{,}106)(27{,}652+774{,}106)} \\
\approx{}& 0.0001105
\end{aligned}
$$
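The same computation is easy to check programmatically. Below is a small Python sketch (not from the book) that evaluates Equation (13.17) directly from the four document counts; on the counts of Example 13.3 it returns roughly 0.0001105. The function name is illustrative.

```python
import math

def expected_mutual_information(n11, n10, n01, n00):
    """Equation (13.17): expected MI of a term and a class from document counts.

    n11: documents containing the term and belonging to the class
    n10: documents containing the term, not in the class
    n01: documents not containing the term, in the class
    n00: documents not containing the term, not in the class
    """
    n = n11 + n10 + n01 + n00
    mi = 0.0
    for n_tc, n_t, n_c in (
        (n11, n11 + n10, n11 + n01),   # term present, in class
        (n01, n01 + n00, n11 + n01),   # term absent,  in class
        (n10, n11 + n10, n10 + n00),   # term present, not in class
        (n00, n01 + n00, n10 + n00),   # term absent,  not in class
    ):
        if n_tc > 0:  # by convention, 0 * log 0 = 0
            mi += (n_tc / n) * math.log2(n * n_tc / (n_t * n_c))
    return mi

# Counts for class "poultry" and term "export" from Example 13.3.
print(expected_mutual_information(49, 27_652, 141, 774_106))  # ~0.0001105
```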
To select k terms t1, . . . , tk for a given class, we use the feature selection algorithm in Figure 13.6: We compute the utility measure as A(t, c) = I(Ut, Cc) and select the k terms with the largest values.

Mutual information measures how much information, in the information-theoretic sense, a term contains about the class. If a term's distribution is the same in the class as it is in the collection as a whole, then I(U; C) = 0. MI reaches its maximum value if the term is a perfect indicator for class membership, that is, if the term is present in a document if and only if the document is in the class.

Figure 13.7 shows terms with high mutual information scores for the six classes in Figure 13.1.⁶ The selected terms (e.g., london, uk, british for the class UK) are of obvious utility for making classification decisions for their respective classes.
At the bottom of the list for UK we find terms like peripherals and tonight (not shown in the figure) that are clearly not helpful in deciding whether the document is in the class.

    UK                      China                     poultry
    london      0.1925      china        0.0997       poultry       0.0013
    uk          0.0755      chinese      0.0523       meat          0.0008
    british     0.0596      beijing      0.0444       chicken       0.0006
    stg         0.0555      yuan         0.0344       agriculture   0.0005
    britain     0.0469      shanghai     0.0292       avian         0.0004
    plc         0.0357      hong         0.0198       broiler       0.0003
    england     0.0238      kong         0.0195       veterinary    0.0003
    pence       0.0212      xinhua       0.0155       birds         0.0003
    pounds      0.0149      province     0.0117       inspection    0.0003
    english     0.0126      taiwan       0.0108       pathogenic    0.0003

    coffee                  elections                 sports
    coffee      0.0111      election     0.0519       soccer        0.0681
    bags        0.0042      elections    0.0342       cup           0.0515
    growers     0.0025      polls        0.0339       match         0.0441
    kg          0.0019      voters       0.0315       matches       0.0408
    colombia    0.0018      party        0.0303       played        0.0388
    brazil      0.0016      vote         0.0299       league        0.0386
    export      0.0014      poll         0.0225       beat          0.0301
    exporters   0.0013      candidate    0.0202       game          0.0299
    exports     0.0013      campaign     0.0202       games         0.0284
    crop        0.0012      democratic   0.0198       team          0.0264

◮ Figure 13.7  Features with high mutual information scores for six Reuters-RCV1 classes.

6. Feature scores were computed on the first 100,000 documents, except for poultry, a rare class, for which 800,000 documents were used. We have omitted numbers and other special words from the top ten lists.
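A per-class ranking like the one in Figure 13.7 can be produced by counting, for each term, the four document counts with respect to a class and scoring the term with Equation (13.17). The Python sketch below shows one possible way to do this; it reuses the expected_mutual_information function from the earlier snippet, and all other names are illustrative.

```python
from collections import Counter

def rank_terms_by_mi(docs, labels, target_class, k=10):
    """Rank vocabulary terms by expected MI with target_class.

    docs:   list of documents, each an iterable of terms
    labels: list of class labels, aligned with docs
    Returns the k (score, term) pairs with the highest MI.
    Requires expected_mutual_information from the earlier sketch.
    """
    n = len(docs)
    n_class = sum(1 for y in labels if y == target_class)
    df_term = Counter()          # documents containing the term
    df_term_class = Counter()    # documents containing the term AND in the class
    for terms, y in zip(docs, labels):
        for t in set(terms):
            df_term[t] += 1
            if y == target_class:
                df_term_class[t] += 1
    scored = []
    for t, n1_dot in df_term.items():
        n11 = df_term_class[t]
        n10 = n1_dot - n11
        n01 = n_class - n11
        n00 = n - n11 - n10 - n01
        scored.append((expected_mutual_information(n11, n10, n01, n00), t))
    return sorted(scored, reverse=True)[:k]
```

The same four contingency counts also suffice for the χ² statistic, so a different utility measure only requires swapping the scoring call.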
As you might expect, keeping the informative terms and eliminating the non-informative ones tends to reduce noise and improve the classifier's accuracy.

Such an accuracy increase can be observed in Figure 13.8, which shows F1 as a function of vocabulary size after feature selection for Reuters-RCV1.⁷ Comparing F1 at 132,776 features (corresponding to selection of all features) and at 10–100 features, we see that MI feature selection increases F1 by about 0.1 for the multinomial model and by more than 0.2 for the Bernoulli model. For the Bernoulli model, F1 peaks early, at ten features selected. At that point, the Bernoulli model is better than the multinomial model. When basing a classification decision on only a few features, it is more robust to consider binary occurrence only. For the multinomial model (MI feature selection), the peak occurs later, at 100 features, and its effectiveness recovers somewhat at the end when we use all features. The reason is that the multinomial takes the number of occurrences into account in parameter estimation and classification and therefore better exploits a larger number of features than the Bernoulli model.

7. We trained the classifiers on the first 100,000 documents and computed F1 on the next 100,000. The graphs are averages over five classes.

[Figure 13.8, plot not reproducible from the extracted text: F1 measure (y-axis) against the number of features selected (x-axis, 1 to 10,000, log scale) for four runs: multinomial, MI; multinomial, chisquare; multinomial, frequency; binomial, MI.]
◮ Figure 13.8  Effect of feature set size on accuracy for multinomial and Bernoulli models.