An Introduction to Information Retrieval. Manning, Raghavan (2009), page 67
Regardless of the differences between the two methods, using a carefully selected subset of the features results in better effectiveness than using all features.

13.5.2 χ² Feature selection

Another popular feature selection method is χ². In statistics, the χ² test is applied to test the independence of two events, where two events A and B are defined to be independent if P(AB) = P(A)P(B) or, equivalently, P(A|B) = P(A) and P(B|A) = P(B). In feature selection, the two events are occurrence of the term and occurrence of the class. We then rank terms with respect to the following quantity:

(13.18)   X^2(D, t, c) = \sum_{e_t \in \{0,1\}} \sum_{e_c \in \{0,1\}} \frac{(N_{e_t e_c} - E_{e_t e_c})^2}{E_{e_t e_c}}

Online edition (c) 2009 Cambridge UP

where e_t and e_c are defined as in Equation (13.16).
N is the observed frequency in D and E the expected frequency. For example, E_{11} is the expected frequency of t and c occurring together in a document assuming that term and class are independent.

✎ Example 13.4: We first compute E_{11} for the data in Example 13.3:

E_{11} = N \times P(t) \times P(c) = N \times \frac{N_{11} + N_{10}}{N} \times \frac{N_{11} + N_{01}}{N} = N \times \frac{49 + 27652}{N} \times \frac{49 + 141}{N} \approx 6.6

where N is the total number of documents as before.

We compute the other E_{e_t e_c} in the same way:

                      e_export = 1                  e_export = 0
e_poultry = 1    N_11 = 49       E_11 ≈ 6.6       N_01 = 141       E_01 ≈ 183.4
e_poultry = 0    N_10 = 27,652   E_10 ≈ 27,694.4  N_00 = 774,106   E_00 ≈ 774,063.6

Plugging these values into Equation (13.18), we get an X² value of 284:

X^2(D, t, c) = \sum_{e_t \in \{0,1\}} \sum_{e_c \in \{0,1\}} \frac{(N_{e_t e_c} - E_{e_t e_c})^2}{E_{e_t e_c}} \approx 284

X² is a measure of how much expected counts E and observed counts N deviate from each other.
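The computation in Example 13.4 can be reproduced directly from the four counts of the contingency table. The following sketch implements Equation (13.18); the function name and structure are our own, not from the text:

```python
# Chi-square score for a (term, class) pair from a 2x2 contingency table.
# The counts used below are the export/poultry example from the text.

def chi2_score(n11, n10, n01, n00):
    """X^2 per Equation (13.18): sum of (N - E)^2 / E over the four
    cells, where E is the expected count under independence."""
    n = n11 + n10 + n01 + n00
    score = 0.0
    for observed, row, col in [
        (n11, n11 + n10, n11 + n01),   # e_t = 1, e_c = 1
        (n10, n11 + n10, n10 + n00),   # e_t = 1, e_c = 0
        (n01, n01 + n00, n11 + n01),   # e_t = 0, e_c = 1
        (n00, n01 + n00, n10 + n00),   # e_t = 0, e_c = 0
    ]:
        expected = row * col / n       # E = N * P(e_t) * P(e_c)
        score += (observed - expected) ** 2 / expected
    return score

x2 = chi2_score(49, 27652, 141, 774106)
print(round(x2))  # 284, matching the value in the text
```

Each expected count E is the product of the two marginal totals divided by N, so the four observed counts are all that is needed.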
A high value of X² indicates that the hypothesis of independence, which implies that expected and observed counts are similar, is incorrect. In our example, X² ≈ 284 > 10.83. Based on Table 13.6, we can reject the hypothesis that poultry and export are independent with only a 0.001 chance of being wrong.⁸ Equivalently, we say that the outcome X² ≈ 284 > 10.83 is statistically significant at the 0.001 level. If the two events are dependent, then the occurrence of the term makes the occurrence of the class more likely (or less likely), so it should be helpful as a feature.
This is the rationale of χ² feature selection.

An arithmetically simpler way of computing X² is the following:

(13.19)   X^2(D, t, c) = \frac{(N_{11} + N_{10} + N_{01} + N_{00}) \times (N_{11} N_{00} - N_{10} N_{01})^2}{(N_{11} + N_{01}) \times (N_{11} + N_{10}) \times (N_{10} + N_{00}) \times (N_{01} + N_{00})}

This is equivalent to Equation (13.18) (Exercise 13.14).

8. We can make this inference because, if the two events are independent, then X² ∼ χ², where χ² is the χ² distribution. See, for example, Rice (2006).

◮ Table 13.6  Critical values of the χ² distribution with one degree of freedom. For example, if the two events are independent, then P(X² > 6.63) < 0.01.
So for X² > 6.63 the assumption of independence can be rejected with 99% confidence.

p        χ² critical value
0.1      2.71
0.05     3.84
0.01     6.63
0.005    7.88
0.001    10.83

✄ Assessing χ² as a feature selection method

From a statistical point of view, χ² feature selection is problematic. For a test with one degree of freedom, the so-called Yates correction should be used (see Section 13.7), which makes it harder to reach statistical significance. Also, whenever a statistical test is used multiple times, then the probability of getting at least one error increases.
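The critical values of Table 13.6 can be used to turn an X² score into a rejection decision. A minimal sketch; the list and function names are our own encoding of the table:

```python
# Significance lookup against the chi-square critical values of Table 13.6
# (one degree of freedom), ordered from strictest p level upward.

CRITICAL_VALUES = [
    (0.001, 10.83),
    (0.005, 7.88),
    (0.01, 6.63),
    (0.05, 3.84),
    (0.1, 2.71),
]

def significance_level(x2):
    """Smallest tabulated p at which independence can be rejected,
    or None if x2 does not exceed even the 0.1 critical value."""
    for p, critical in CRITICAL_VALUES:
        if x2 > critical:
            return p
    return None

print(significance_level(284))   # 0.001: reject independence at the 0.001 level
print(significance_level(1.5))   # None: cannot reject independence
```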
If 1,000 hypotheses are rejected, each with 0.05 error probability, then 0.05 × 1,000 = 50 calls of the test will be wrong on average. However, in text classification it rarely matters whether a few additional terms are added to the feature set or removed from it. Rather, what matters is the relative importance of features. As long as χ² feature selection only ranks features with respect to their usefulness and is not used to make statements about statistical dependence or independence of variables, we need not be overly concerned that it does not adhere strictly to statistical theory.

13.5.3 Frequency-based feature selection

A third feature selection method is frequency-based feature selection, that is, selecting the terms that are most common in the class. Frequency can be either defined as document frequency (the number of documents in the class c that contain the term t) or as collection frequency (the number of tokens of t that occur in documents in c).
Document frequency is more appropriate for the Bernoulli model, collection frequency for the multinomial model.

Frequency-based feature selection selects some frequent terms that have no specific information about the class, for example, the days of the week (Monday, Tuesday, ...), which are frequent across classes in newswire text. When many thousands of features are selected, then frequency-based feature selection often does well. Thus, if somewhat suboptimal accuracy is acceptable, then frequency-based feature selection can be a good alternative to more complex methods.
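Both variants of frequency-based selection can be sketched in a few lines. The toy documents and the function name below are illustrative, not from the text:

```python
# Frequency-based feature selection: pick the k terms most common in the
# class, counted either by document frequency (Bernoulli model) or by
# collection frequency (multinomial model).

from collections import Counter

def select_by_frequency(docs_in_class, k, mode="df"):
    """docs_in_class: list of tokenized documents belonging to class c.
    mode 'df' = document frequency, 'cf' = collection frequency."""
    counts = Counter()
    for doc in docs_in_class:
        if mode == "df":
            counts.update(set(doc))   # each document counted once per term
        else:
            counts.update(doc)        # every token occurrence counted
    return [term for term, _ in counts.most_common(k)]

docs = [["poultry", "export", "ban"],
        ["poultry", "prices", "poultry"],
        ["export", "ban", "lifted"]]
print(select_by_frequency(docs, 1, mode="cf"))  # ['poultry']
```

Note that ties in document frequency are broken arbitrarily here; in practice the choice among equally frequent terms rarely matters.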
However, Figure 13.8 is a case where frequency-based feature selection performs a lot worse than MI and χ² and should not be used.

13.5.4 Feature selection for multiple classifiers

In an operational system with a large number of classifiers, it is desirable to select a single set of features instead of a different one for each classifier. One way of doing this is to compute the X² statistic for an n × 2 table where the columns are occurrence and nonoccurrence of the term and each row corresponds to one of the classes. We can then select the k terms with the highest X² statistic as before.

More commonly, feature selection statistics are first computed separately for each class on the two-class classification task c versus its complement c̄ and then combined.
One combination method computes a single figure of merit for eachfeature, for example, by averaging the values A(t, c) for feature t, and thenselects the k features with highest figures of merit. Another frequently usedcombination method selects the top k/n features for each of n classifiers andthen combines these n sets into one global feature set.Classification accuracy often decreases when selecting k common featuresfor a system with n classifiers as opposed to n different sets of size k.
But even if it does, the gain in efficiency owing to a common document representation may be worth the loss in accuracy.

13.5.5 Comparison of feature selection methods

Mutual information and χ² represent rather different feature selection methods. The independence of term t and class c can sometimes be rejected with high confidence even if t carries little information about membership of a document in c. This is particularly true for rare terms. If a term occurs once in a large collection and that one occurrence is in the poultry class, then this is statistically significant.
But a single occurrence is not very informativeaccording to the information-theoretic definition of information. Becauseits criterion is significance, χ2 selects more rare terms (which are often lessreliable indicators) than mutual information. But the selection criterion ofmutual information also does not necessarily select the terms that maximizeclassification accuracy.Despite the differences between the two methods, the classification accuracy of feature sets selected with χ2 and MI does not seem to differ systematically. In most text classification problems, there are a few strong indicatorsand many weak indicators. As long as all strong indicators and a large number of weak indicators are selected, accuracy is expected to be good.
Both methods do this.

Figure 13.8 compares MI and χ² feature selection for the multinomial model. Peak effectiveness is virtually the same for both methods. χ² reaches this peak later, at 300 features, probably because the rare, but highly significant features it selects initially do not cover all documents in the class. However, features selected later (in the range of 100–300) are of better quality than those selected by MI.

All three methods – MI, χ² and frequency-based – are greedy methods. They may select features that contribute no incremental information over previously selected features.
In Figure 13.7, kong is selected as the seventh term even though it is highly correlated with previously selected hong and therefore redundant. Although such redundancy can negatively impact accuracy, non-greedy methods (see Section 13.7 for references) are rarely used in text classification due to their computational cost.

Exercise 13.5  Consider the following frequencies for the class coffee for four terms in the first 100,000 documents of Reuters-RCV1:

term        N00      N01    N10     N11
brazil      98,012   102    1,835   51
council     96,322   133    3,525   20
producers   98,524   119    1,118   34
roasted     99,824   143    23      10

Select two of these four terms based on (i) χ², (ii) mutual information, (iii) frequency.

13.6 Evaluation of text classification

Historically, the classic Reuters-21578 collection was the main benchmark for text classification evaluation.
This is a collection of 21,578 newswire articles, originally collected and labeled by Carnegie Group, Inc. and Reuters, Ltd. in the course of developing the CONSTRUE text classification system. It is much smaller than and predates the Reuters-RCV1 collection discussed in Chapter 4 (page 69). The articles are assigned classes from a set of 118 topic categories. A document may be assigned several classes or none, but the commonest case is single assignment (documents with at least one class received an average of 1.24 classes). The standard approach to this any-of problem (Chapter 14, page 306) is to learn 118 two-class classifiers, one for each class, where the two-class classifier for class c is the classifier for the two classes c and its complement c̄.

For each of these classifiers, we can measure recall, precision, and accuracy.
In recent work, people almost invariably use the ModApte split, which includes only documents that were viewed and assessed by a human indexer, and comprises 9,603 training documents and 3,299 test documents. The distribution of documents in classes is very uneven, and some work evaluates systems on only documents in the ten largest classes. They are listed in Table 13.7.

◮ Table 13.7  The ten largest classes in the Reuters-21578 collection with number of documents in training and test sets.

class          # train   # test      class       # train   # test
earn           2877      1087        trade       369       119
acquisitions   1650      179         interest    347       131
money-fx       538       179         ship        197       89
grain          433       149         wheat       212       71
crude          389       189         corn        182       56
A typical document with topics is shown in Figure 13.9.

In Section 13.1, we stated as our goal in text classification the minimization of classification error on test data. Classification error is 1.0 minus classification accuracy, the proportion of correct decisions, a measure we introduced in Section 8.3 (page 155). This measure is appropriate if the percentage of documents in the class is high, perhaps 10% to 20% and higher. But as we discussed in Section 8.3, accuracy is not a good measure for "small" classes because always saying no, a strategy that defeats the purpose of building a classifier, will achieve high accuracy. The always-no classifier is 99% accurate for a class with relative frequency 1%. For small classes, precision, recall and F1 are better measures.

We will use effectiveness as a generic term for measures that evaluate the quality of classification decisions, including precision, recall, F1, and accuracy.
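These measures can all be computed from the four counts of a two-class contingency table. The sketch below uses our own helper functions and reproduces the always-no case: 99% accuracy but an F1 of 0 for a class with relative frequency 1%:

```python
# Effectiveness measures for a two-class classifier from its confusion
# counts tp, fp, fn, tn. Shows why accuracy misleads on small classes.

def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

def f1(tp, fp, fn):
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if p + r else 0.0

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

# Always-no classifier on 1,000 documents, class frequency 1%:
tp, fp, fn, tn = 0, 0, 10, 990
print(accuracy(tp, fp, fn, tn))  # 0.99 -- looks good
print(f1(tp, fp, fn))            # 0.0  -- reveals the classifier is useless
```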