Precision, recall, and the F measure are set-based measures: they are computed using unordered sets of documents. We need to extend these measures (or to define new measures) if we are to evaluate the ranked retrieval results that are now standard with search engines. In a ranked retrieval context, appropriate sets of retrieved documents are naturally given by the top k retrieved documents. For each such set, precision and recall values can be plotted to give a precision-recall curve, such as the one shown in Figure 8.2. Precision-recall curves have a distinctive saw-tooth shape: if the (k + 1)th document retrieved is nonrelevant then recall is the same as for the top k documents, but precision has dropped. If it is relevant, then both precision and recall increase, and the curve jags up and to the right.
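To make the saw-tooth concrete, here is a minimal Python sketch (mine, not the book's; the function name and the representation of a ranking as a list of booleans are assumptions) that computes the (recall, precision) points of such a curve:

```python
from typing import List, Tuple

def precision_recall_points(ranking: List[bool], num_relevant: int) -> List[Tuple[float, float]]:
    """Return the (recall, precision) point for each cutoff k = 1..len(ranking).
    ranking[i] records whether the document at rank i+1 is relevant;
    num_relevant is the total number of relevant documents for the query."""
    points = []
    hits = 0
    for k, is_relevant in enumerate(ranking, start=1):
        if is_relevant:
            hits += 1
        points.append((hits / num_relevant, hits / k))   # (recall, precision)
    return points

# Alternating relevant/nonrelevant results produce the saw-tooth described above.
print(precision_recall_points([True, False, True, False, False], num_relevant=3))
```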
It is often useful to remove these jiggles and the standard way to do this is with an interpolated precision: the interpolated precision p_interp at a certain recall level r is defined as the highest precision found for any recall level r′ ≥ r:

(8.7)    $p_{\text{interp}}(r) = \max_{r' \geq r} p(r')$

The justification is that almost anyone would be prepared to look at a few more documents if it would increase the percentage of the viewed set that were relevant (that is, if the precision of the larger set is higher).

Recall              0.0   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0
Interp. Precision   1.00  0.67  0.63  0.55  0.45  0.41  0.36  0.29  0.13  0.10  0.08

Table 8.1  Calculation of 11-point Interpolated Average Precision. This is for the precision-recall curve shown in Figure 8.2.
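A minimal sketch of Equation (8.7), reusing the (recall, precision) pairs produced by the previous sketch; the handling of recall levels the system never reaches is my own convention:

```python
def interpolated_precision(points, recall_level):
    """Equation (8.7): the highest precision found at any recall level
    r' >= recall_level. points is a list of (recall, precision) pairs
    for a single information need."""
    candidates = [p for r, p in points if r >= recall_level]
    # If the system never reaches recall_level (some relevant documents are
    # never retrieved), no point qualifies; returning 0.0 here is my own
    # convention, not something the text specifies.
    return max(candidates) if candidates else 0.0
```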
Interpolated precision is shown by a thinner line in Figure 8.2. With this definition, the interpolated precision at a recall of 0 is well-defined (Exercise 8.4). Examining the entire precision-recall curve is very informative, but there is often a desire to boil this information down to a few numbers, or perhaps even a single number.
The traditional way of doing this (used for instance in the first 8 TREC Ad Hoc evaluations) is the 11-point interpolated average precision. For each information need, the interpolated precision is measured at the 11 recall levels of 0.0, 0.1, 0.2, . . . , 1.0. For the precision-recall curve in Figure 8.2, these 11 values are shown in Table 8.1. For each recall level, we then calculate the arithmetic mean of the interpolated precision at that recall level for each information need in the test collection. A composite precision-recall curve showing 11 points can then be graphed. Figure 8.3 shows an example graph of such results from a representative good system at TREC 8. In recent years, other measures have become more common.
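A sketch of the 11-point computation under the same representation: per_query_points is assumed to hold one list of (recall, precision) pairs per information need, and recall levels a system never reaches are treated as contributing precision 0 (my assumption, not a rule stated here):

```python
def eleven_point_average(per_query_points):
    """11-point interpolated average precision: for each recall level
    0.0, 0.1, ..., 1.0, average the interpolated precision (Equation 8.7)
    over all information needs in the test collection."""
    levels = [i / 10 for i in range(11)]
    averages = []
    for r in levels:
        total = 0.0
        for points in per_query_points:      # one (recall, precision) list per query
            reachable = [p for recall, p in points if recall >= r]
            total += max(reachable) if reachable else 0.0
        averages.append(total / len(per_query_points))
    return averages
```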
Most standard among the TREC community is Mean Average Precision (MAP), which provides a single-figure measure of quality across recall levels. Among evaluation measures, MAP has been shown to have especially good discrimination and stability.

Figure 8.3  Averaged 11-point precision/recall graph across 50 queries for a representative TREC system. The Mean Average Precision for this system is 0.2553.

For a single information need, Average Precision is the average of the precision value obtained for the set of top k documents existing after each relevant document is retrieved, and this value is then averaged over information needs. That is, if the set of relevant documents for an information need q_j ∈ Q is {d_1, . . . , d_{m_j}} and R_{jk} is the set of ranked retrieval results from the top result until you get to document d_k, then

(8.8)    $\mathrm{MAP}(Q) = \frac{1}{|Q|} \sum_{j=1}^{|Q|} \frac{1}{m_j} \sum_{k=1}^{m_j} \mathrm{Precision}(R_{jk})$

When a relevant document is not retrieved at all, the precision value in the above equation is taken to be 0. (A system may not fully order all documents in the collection in response to a query, or at any rate an evaluation exercise may be based on submitting only the top k results for each information need.)
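As a concrete illustration, a hedged sketch of Equation (8.8); the function names are mine, and a ranking is again represented as a list of binary relevance judgments:

```python
from typing import List

def average_precision(ranking: List[bool], num_relevant: int) -> float:
    """Average Precision for one information need: the mean of the precision
    values measured each time a relevant document appears in the ranking.
    Dividing by num_relevant (rather than by the number of relevant documents
    actually retrieved) makes unretrieved relevant documents contribute 0."""
    hits = 0
    precision_sum = 0.0
    for k, is_relevant in enumerate(ranking, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / k
    return precision_sum / num_relevant

def mean_average_precision(rankings: List[List[bool]], relevant_counts: List[int]) -> float:
    """MAP (Equation 8.8): the arithmetic mean of Average Precision over all
    information needs in the test collection."""
    scores = [average_precision(r, m) for r, m in zip(rankings, relevant_counts)]
    return sum(scores) / len(scores)
```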
For a single information need, the average precision approximates the area under the uninterpolated precision-recall curve, and so the MAP is roughly the average area under the precision-recall curve for a set of queries. Using MAP, fixed recall levels are not chosen, and there is no interpolation. The MAP value for a test collection is the arithmetic mean of average precision values for individual information needs. (This has the effect of weighting each information need equally in the final reported number, even if many documents are relevant to some queries whereas very few are relevant to other queries.) Calculated MAP scores normally vary widely across information needs when measured within a single system, for instance, between 0.1 and 0.7.
Indeed, there is normally more agreement in MAP for an individual information need across systems than for MAP scores for different information needs for the same system. This means that a set of test information needs must be large and diverse enough to be representative of system effectiveness across different queries.

The above measures factor in precision at all recall levels. For many prominent applications, particularly web search, this may not be germane to users. What matters is rather how many good results there are on the first page or the first three pages. This leads to measuring precision at fixed low levels of retrieved results, such as 10 or 30 documents.
This is referred to as “Precision at k”, for example “Precision at 10”. It has the advantage of not requiring any estimate of the size of the set of relevant documents, but the disadvantages that it is the least stable of the commonly used evaluation measures and that it does not average well, since the total number of relevant documents for a query has a strong influence on precision at k.

An alternative, which alleviates this problem, is R-precision. It requires having a set of known relevant documents Rel, from which we calculate the precision of the top |Rel| documents returned. (The set Rel may be incomplete, such as when Rel is formed by creating relevance judgments for the pooled top k results of particular systems in a set of experiments.) R-precision adjusts for the size of the set of relevant documents: a perfect system could score 1 on this metric for every query, whereas even a perfect system could only achieve a precision at 20 of 0.4 if there were only 8 documents in the collection relevant to an information need.
Averaging this measure across queries thus makes more sense. This measure is harder to explain to naive users than Precision at k but easier to explain than MAP. If there are |Rel| relevant documents for a query and we examine the top |Rel| results of a system and find that r are relevant, then by definition not only is the precision (and hence R-precision) r/|Rel|, but the recall of this result set is also r/|Rel|. Thus, R-precision turns out to be identical to the break-even point, another measure which is sometimes used, defined in terms of this equality relationship holding.
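Both measures are easy to compute from a ranked list of binary judgments; a sketch under the same assumptions as the earlier ones:

```python
from typing import List

def precision_at_k(ranking: List[bool], k: int) -> float:
    """Precision at k: the fraction of the top k results that are relevant."""
    return sum(ranking[:k]) / k

def r_precision(ranking: List[bool], num_relevant: int) -> float:
    """R-precision: the precision of the top |Rel| results, where |Rel| is the
    number of known relevant documents. By the identity noted above, this value
    also equals the recall at the same cutoff (the break-even point)."""
    return sum(ranking[:num_relevant]) / num_relevant
```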
Like Precision at k, R-precision describes only one point on the precision-recall curve, rather than attempting to summarize effectiveness across the curve, and it is somewhat unclear why you should be interested in the break-even point rather than either the best point on the curve (the point with maximal F-measure) or a retrieval level of interest to a particular application (Precision at k). Nevertheless, R-precision turns out to be highly correlated with MAP empirically, despite measuring only a single point on the curve.

Figure 8.4  The ROC curve corresponding to the precision-recall curve in Figure 8.2.

Another concept sometimes used in evaluation is an ROC curve.
(“ROC” stands for “Receiver Operating Characteristics”, but knowing that doesn’t help most people.) An ROC curve plots the true positive rate or sensitivity against the false positive rate or (1 − specificity). Here, sensitivity is just another term for recall. The false positive rate is given by fp/(fp + tn). Figure 8.4 shows the ROC curve corresponding to the precision-recall curve in Figure 8.2. An ROC curve always goes from the bottom left to the top right of the graph. For a good system, the graph climbs steeply on the left side. For unranked result sets, specificity, given by tn/(fp + tn), was not seen as a very useful notion.
Because the set of true negatives is always so large, its value would be almost 1 for all information needs (and, correspondingly, the value of the false positive rate would be almost 0). That is, the “interesting” part of Figure 8.2 is 0 < recall < 0.4, a part which is compressed to a small corner of Figure 8.4. But an ROC curve could make sense when looking over the full retrieval spectrum, and it provides another way of looking at the data. In many fields, a common aggregate measure is to report the area under the ROC curve, which is the ROC analog of MAP.
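A sketch of how the points of an ROC curve and the area under it might be computed; it assumes the ranking covers the whole collection, and the helper names are mine:

```python
from typing import List, Tuple

def roc_points(ranking: List[bool], num_relevant: int, collection_size: int) -> List[Tuple[float, float]]:
    """(false positive rate, true positive rate) after each rank.
    TPR = tp / (tp + fn) is recall; FPR = fp / (fp + tn) is 1 - specificity."""
    num_nonrelevant = collection_size - num_relevant
    points = [(0.0, 0.0)]
    tp = fp = 0
    for is_relevant in ranking:
        if is_relevant:
            tp += 1
        else:
            fp += 1
        points.append((fp / num_nonrelevant, tp / num_relevant))
    return points

def area_under_curve(points: List[Tuple[float, float]]) -> float:
    """Area under the ROC curve by the trapezoidal rule (the ROC analog of MAP)."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```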
Precision-recall curves are sometimes loosely referred to as ROC curves. This is understandable, but not accurate.

A final approach that has seen increasing adoption, especially when employed with machine learning approaches to ranking (see Section 15.4, page 341), is measures of cumulative gain, and in particular normalized discounted cumulative gain (NDCG). NDCG is designed for situations of non-binary notions of relevance (cf.