An Introduction to Information Retrieval. Manning, Raghavan & Schütze (2009)
13.6 Evaluation of text classification
Performance refers to the computational efficiency of classification and IR systems in this book. However, many researchers mean effectiveness, not efficiency of text classification when they use the term performance.

When we process a collection with several two-class classifiers (such as Reuters-21578 with its 118 classes), we often want to compute a single aggregate measure that combines the measures for individual classifiers. There are two methods for doing this. Macroaveraging computes a simple average over classes. Microaveraging pools per-document decisions across classes, and then computes an effectiveness measure on the pooled contingency table.
Table 13.8 gives an example. The differences between the two methods can be large. Macroaveraging gives equal weight to each class, whereas microaveraging gives equal weight to each per-document classification decision. Because the F1 measure ignores true negatives and its magnitude is mostly determined by the number of true positives, large classes dominate small classes in microaveraging. In the example, microaveraged precision (0.83) is much closer to the precision of c2 (0.9) than to the precision of c1 (0.5) because c2 is five times larger than c1.

<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN"
  CGISPLIT="TRAINING-SET" OLDID="12981" NEWID="798">
<DATE> 2-MAR-1987 16:51:43.42</DATE>
<TOPICS><D>livestock</D><D>hog</D></TOPICS>
<TEXT>
<TITLE>AMERICAN PORK CONGRESS KICKS OFF TOMORROW</TITLE>
<DATELINE> CHICAGO, March 2 - </DATELINE>
<BODY>The American Pork Congress kicks off tomorrow, March 3, in
Indianapolis with 160 of the nations pork producers from 44 member
states determining industry positions on a number of issues,
according to the National Pork Producers Council, NPPC.
    Delegates to the three day Congress will be considering 26
resolutions concerning various issues, including the future
direction of farm policy and the tax law as it applies to the
agriculture sector.
    The delegates will also debate whether to endorse concepts of a
national PRV (pseudorabies virus) control and eradication program,
the NPPC said.
    A large trade show, in conjunction with the congress, will
feature the latest in technology in all areas of the industry, the
NPPC added. Reuter &#3;</BODY></TEXT></REUTERS>

◮ Figure 13.9 A sample document from the Reuters-21578 collection.
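Returning to the averaging example above (the counts of Table 13.8), the two schemes are easy to express directly in terms of per-class contingency counts. The following Python sketch is ours, not from the book, and is purely illustrative; it reproduces the values 0.7 and 0.83 quoted in the text.

# Macro- vs. microaveraging over per-class contingency tables.
# Illustrative sketch (not from the book); counts are those of Table 13.8.

contingency = {
    "c1": {"tp": 10, "fp": 10, "fn": 10, "tn": 970},
    "c2": {"tp": 90, "fp": 10, "fn": 10, "tn": 890},
}

def precision(tp, fp):
    return tp / (tp + fp)

# Macroaveraging: compute precision per class, then take the simple average.
macro_p = sum(precision(c["tp"], c["fp"]) for c in contingency.values()) / len(contingency)

# Microaveraging: pool the counts across classes, then compute precision
# once on the pooled contingency table.
pooled_tp = sum(c["tp"] for c in contingency.values())
pooled_fp = sum(c["fp"] for c in contingency.values())
micro_p = precision(pooled_tp, pooled_fp)

print(f"macroaveraged precision: {macro_p:.2f}")  # 0.70
print(f"microaveraged precision: {micro_p:.2f}")  # 0.83

Because the pooled counts are dominated by c2, the microaveraged value tracks the precision of the larger class, which is exactly the behaviour discussed above.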
Microaveraged results are therefore really a measure of effectiveness on the large classes in a test collection. To get a sense of effectiveness on small classes, you should compute macroaveraged results. In one-of classification (Section 14.5, page 306), microaveraged F1 is the same as accuracy (Exercise 13.6).

Table 13.9 gives microaveraged and macroaveraged effectiveness of Naive Bayes for the ModApte split of Reuters-21578. To give a sense of the relative effectiveness of NB, we compare it with linear SVMs (rightmost column; see Chapter 15), one of the most effective classifiers, but also one that is more expensive to train than NB.
NB has a microaveraged F1 of 80%, which is 9% less than the SVM (89%), a 10% relative decrease (row “micro-avg-L (90 classes)”). So there is a surprisingly small effectiveness penalty for its simplicity and efficiency. However, on small classes, some of which only have on the order of ten positive examples in the training set, NB does much worse. Its macroaveraged F1 is 13% below the SVM, a 22% relative decrease (row “macro-avg (90 classes)”).

The table also compares NB with the other classifiers we cover in this book: Rocchio and kNN. In addition, we give numbers for decision trees, an important classification method we do not cover. The bottom part of the table shows that there is considerable variation from class to class. For instance, NB beats kNN on ship, but is much worse on money-fx.

◮ Table 13.8 Macro- and microaveraging. “Truth” is the true class and “call” the decision of the classifier. In this example, macroaveraged precision is [10/(10+10) + 90/(10+90)]/2 = (0.5 + 0.9)/2 = 0.7. Microaveraged precision is 100/(100+20) ≈ 0.83.

                class 1                 class 2                 pooled table
                truth:yes   truth:no    truth:yes   truth:no    truth:yes   truth:no
  call: yes         10          10          90          10         100          20
  call: no          10         970          10         890          20        1860

◮ Table 13.9 Text classification effectiveness numbers on Reuters-21578 for F1 (in percent). Results from Li and Yang (2003) (a), Joachims (1998) (b: kNN) and Dumais et al. (1998) (b: NB, Rocchio, trees, SVM).

  (a)                           NB   Rocchio   kNN            SVM
  micro-avg-L (90 classes)      80      85      86             89
  macro-avg (90 classes)        47      59      60             60

  (b)                           NB   Rocchio   kNN   trees    SVM
  earn                          96      93      97     98      98
  acq                           88      65      92     90      94
  money-fx                      57      47      78     66      75
  grain                         79      68      82     85      95
  crude                         80      70      86     85      89
  trade                         64      65      77     73      76
  interest                      65      63      74     67      78
  ship                          85      49      79     74      86
  wheat                         70      69      77     93      92
  corn                          65      48      78     92      90
  micro-avg (top 10)            82      65      82     88      92
  micro-avg-D (118 classes)     75      62     n/a    n/a      87

Comparing parts (a) and (b) of the table, one is struck by the degree to which the cited papers’ results differ. This is partly due to the fact that the numbers in (b) are break-even scores (cf. page 161) averaged over 118 classes, whereas the numbers in (a) are true F1 scores (computed without any knowledge of the test set) averaged over ninety classes.
This is unfortunately typical of what happens when comparing different results in text classification: there are often differences in the experimental setup or the evaluation that complicate the interpretation of the results.

These and other results have shown that the average effectiveness of NB is uncompetitive with classifiers like SVMs when trained and tested on independent and identically distributed (i.i.d.) data, that is, uniform data with all the good properties of statistical sampling. However, these differences may often be invisible or even reverse themselves when working in the real world where, usually, the training sample is drawn from a subset of the data to which the classifier will be applied, the nature of the data drifts over time rather than being stationary (the problem of concept drift we mentioned on page 269), and there may well be errors in the data (among other problems). Many practitioners have had the experience of being unable to build a fancy classifier for a certain problem that consistently performs better than NB.

Our conclusion from the results in Table 13.9 is that, although most researchers believe that an SVM is better than kNN and kNN better than NB, the ranking of classifiers ultimately depends on the class, the document collection, and the experimental setup.
In text classification, there is always more to know than simply which machine learning algorithm was used, as we further discuss in Section 15.3 (page 334).

When performing evaluations like the one in Table 13.9, it is important to maintain a strict separation between the training set and the test set. We can easily make correct classification decisions on the test set by using information we have gleaned from the test set, such as the fact that a particular term is a good predictor in the test set (even though this is not the case in the training set).
A more subtle example of using knowledge about the test set is to try a large number of values of a parameter (e.g., the number of selected features) and select the value that is best for the test set. As a rule, accuracy on new data – the type of data we will encounter when we use the classifier in an application – will be much lower than accuracy on a test set that the classifier has been tuned for.
We discussed the same problem in ad hoc retrieval in Section 8.1 (page 153).

In a clean statistical text classification experiment, you should never run any program on or even look at the test set while developing a text classification system. Instead, set aside a development set for testing while you develop your method.
When such a set serves the primary purpose of finding a good value for a parameter, for example, the number of selected features, then it is also called held-out data. Train the classifier on the rest of the training set with different parameter values, and then select the value that gives best results on the held-out part of the training set. Ideally, at the very end, when all parameters have been set and the method is fully specified, you run one final experiment on the test set and publish the results. Because no information about the test set was used in developing the classifier, the results of this experiment should be indicative of actual performance in practice. This ideal often cannot be met; researchers tend to evaluate several systems on the same test set over a period of several years.

◮ Table 13.10 Data for parameter estimation exercise.

                  docID   words in document        in c = China?
  training set      1     Taipei Taiwan            yes
                    2     Macao Taiwan Shanghai    yes
                    3     Japan Sapporo            no
                    4     Sapporo Osaka Taiwan     no
  test set          5     Taiwan Taiwan Sapporo    ?
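The training/development/test discipline described above can be made concrete with a short sketch. The following Python code is illustrative only and is not from the book: the toy corpus, the choice of scikit-learn, and the multinomial NB pipeline with chi-square feature selection are all assumptions made for the example. Candidate numbers of selected features are compared on a held-out development split, and the test set is used exactly once, at the very end.

# Sketch of a clean evaluation protocol: parameter selection on a
# held-out development set, one final run on the test set.
# Hypothetical toy data and scikit-learn pipeline, for illustration only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

docs = [
    "Chinese Beijing Chinese", "Chinese Chinese Shanghai",
    "Chinese Macao", "Beijing Chinese trade",
    "Shanghai Chinese market", "Macao Beijing Chinese",
    "Tokyo Japan yen", "Japan Sapporo snow",
    "Sapporo Osaka Japan", "Kyoto Japan temple",
    "Osaka Tokyo market", "Tokyo yen trade",
]
labels = ["china"] * 6 + ["other"] * 6

def make_model(k):
    # Vectorize, keep the k best features by chi-square, then train NB.
    return Pipeline([
        ("vec", CountVectorizer()),
        ("select", SelectKBest(chi2, k=k)),
        ("nb", MultinomialNB()),
    ])

# 1. Split off the test set once and do not look at it while developing.
docs_trainval, docs_test, y_trainval, y_test = train_test_split(
    docs, labels, test_size=4, stratify=labels, random_state=0)

# 2. Hold out part of the remaining data as a development set,
#    used only to pick the number of selected features.
docs_train, docs_dev, y_train, y_dev = train_test_split(
    docs_trainval, y_trainval, test_size=2, stratify=y_trainval, random_state=0)

best_k, best_f1 = None, -1.0
for k in (1, 2, 4):                      # candidate parameter values
    model = make_model(k).fit(docs_train, y_train)
    # In one-of classification, microaveraged F1 equals accuracy.
    f1 = f1_score(y_dev, model.predict(docs_dev), average="micro")
    if f1 > best_f1:
        best_k, best_f1 = k, f1

# 3. With all parameters fixed, run one final experiment on the test set.
final_model = make_model(best_k).fit(docs_trainval, y_trainval)
test_f1 = f1_score(y_test, final_model.predict(docs_test), average="micro")
print(f"chosen k = {best_k}, test micro-F1 = {test_f1:.2f}")

In practice, cross-validation over the training data is often used instead of a single development split, but the principle is the same: the test set is consulted only once, after the method is fully specified.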