Section 8.5.1). Like precision at k, it is evaluated over some number k of top search results. For a set of queries Q, let R(j, d) be the relevance score assessors gave to document d for query j. Then,

(8.9)
\[
\mathrm{NDCG}(Q, k) = \frac{1}{|Q|} \sum_{j=1}^{|Q|} Z_{kj} \sum_{m=1}^{k} \frac{2^{R(j,m)} - 1}{\log_2(1 + m)},
\]

where Z_{kj} is a normalization factor calculated so that a perfect ranking's NDCG at k for query j is 1. For queries for which k′ < k documents are retrieved, the last summation is done up to k′.
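To make Equation (8.9) concrete, here is a minimal per-query sketch; the function name and the representation of a ranking as a list of graded relevance scores in rank order are illustrative assumptions, and the NDCG over a query set Q is then just the mean of the per-query values. The normalizer Z_k is computed here from the ideal reordering of the judged documents that are passed in.

```python
import math

def ndcg_at_k(relevance_scores, k):
    """NDCG at k for one query, following Equation (8.9).

    relevance_scores: the graded relevance R(j, m) of the documents at
    ranks m = 1, 2, ... of the result list, in rank order. If fewer than
    k documents were retrieved, the sums simply run over what is there.
    """
    def dcg(scores):
        return sum((2 ** r - 1) / math.log2(1 + m)
                   for m, r in enumerate(scores, start=1))

    # Z_k normalizes by the DCG of a perfect (descending) ordering,
    # so that a perfect ranking scores exactly 1.
    ideal_dcg = dcg(sorted(relevance_scores, reverse=True)[:k])
    return dcg(relevance_scores[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0

# Example with graded judgments on a 0-3 scale for the top five results.
print(ndcg_at_k([3, 2, 3, 0, 1], k=5))   # roughly 0.96
```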
Exercise 8.4 [⋆]
What are the possible values for interpolated precision at a recall level of 0?

Exercise 8.5 [⋆⋆]
Must there always be a break-even point between precision and recall? Either show there must be or give a counter-example.

Exercise 8.6 [⋆⋆]
What is the relationship between the value of F1 and the break-even point?

Exercise 8.7 [⋆⋆]
DICE COEFFICIENT
The Dice coefficient of two sets is a measure of their intersection scaled by their size (giving a value in the range 0 to 1):

\[
\mathrm{Dice}(X, Y) = \frac{2\,|X \cap Y|}{|X| + |Y|}
\]

Show that the balanced F-measure (F1) is equal to the Dice coefficient of the retrieved and relevant document sets.

Exercise 8.8 [⋆]
Consider an information need for which there are 4 relevant documents in the collection.
Contrast two systems run on this collection. Their top 10 results are judged for relevance as follows (the leftmost item is the top ranked search result):

System 1   R N R N N N N N R R
System 2   N R N N R R R N N N

a. What is the MAP of each system? Which has a higher MAP?
b. Does this result intuitively make sense? What does it say about what is important in getting a good MAP score?
c. What is the R-precision of each system? (Does it rank the systems the same as MAP?)

Exercise 8.9 [⋆⋆]
The following list of Rs and Ns represents relevant (R) and nonrelevant (N) returned documents in a ranked list of 20 documents retrieved in response to a query from a collection of 10,000 documents. The top of the ranked list (the document the system thinks is most likely to be relevant) is on the left of the list. This list shows 6 relevant documents.
Assume that there are 8 relevant documents in total in the collection.

R R N N N N N N R N R N N N R N N N N R

a. What is the precision of the system on the top 20?
b. What is the F1 on the top 20?
c. What is the uninterpolated precision of the system at 25% recall?
d. What is the interpolated precision at 33% recall?
e. Assume that these 20 documents are the complete result set of the system. What is the MAP for the query?

Assume, now, instead, that the system returned the entire 10,000 documents in a ranked list, and these are the first 20 results returned.

f. What is the largest possible MAP that this system could have?
g. What is the smallest possible MAP that this system could have?
h. In a set of experiments, only the top 20 results are evaluated by hand. The result in (e) is used to approximate the range (f)–(g). For this example, how large (in absolute terms) can the error for the MAP be by calculating (e) instead of (f) and (g) for this query?
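All of the quantities asked for in Exercises 8.8 and 8.9 can be read off a ranked list of R/N judgments. As a working aid rather than a solution, here is a minimal sketch of precision at k, R-precision, and uninterpolated average precision for a single ranked list (MAP is then just the mean of the average precision values over the queries in the test set); the function names, the boolean-list representation, and the toy data at the end are illustrative assumptions, not the exercise data.

```python
def precision_at_k(rels, k):
    """Fraction of the top k results judged relevant.
    rels is a list of booleans in rank order (True = relevant)."""
    return sum(rels[:k]) / k

def r_precision(rels, num_relevant):
    """Precision at rank R, where R is the total number of relevant documents."""
    return precision_at_k(rels, num_relevant)

def average_precision(rels, num_relevant):
    """Uninterpolated average precision for one query: the precision at the
    rank of each retrieved relevant document, averaged over all num_relevant
    relevant documents (unretrieved relevant documents contribute 0)."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / num_relevant

# Toy list (not the exercise data): R N R N N, with 3 relevant docs in the collection.
rels = [True, False, True, False, False]
print(precision_at_k(rels, 5), r_precision(rels, 3), average_precision(rels, 3))
```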
8.5 Assessing relevance

To properly evaluate a system, your test information needs must be germane to the documents in the test document collection, and appropriate for predicted usage of the system. These information needs are best designed by domain experts. Using random combinations of query terms as an information need is generally not a good idea because typically they will not resemble the actual distribution of information needs.

Given information needs and documents, you need to collect relevance assessments. This is a time-consuming and expensive process involving human beings.
POOLING
For tiny collections like Cranfield, exhaustive judgments of relevance for each query and document pair were obtained. For large modern collections, it is usual for relevance to be assessed only for a subset of the documents for each query. The most standard approach is pooling, where relevance is assessed over a subset of the collection that is formed from the top k documents returned by a number of different IR systems (usually the ones to be evaluated), and perhaps other sources such as the results of Boolean keyword searches or documents found by expert searchers in an interactive process.
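As a small illustration of pooling (a sketch only; the run format, the pooling depth k, and the function name are assumptions, not something specified in the text), the judging pool for a query is simply the union of the top k documents from each contributing run, optionally augmented with documents from other sources such as Boolean searches or interactive expert searching:

```python
def build_pool(runs, k, extra_sources=()):
    """Form the set of documents to be judged for one query.

    runs: an iterable of ranked document-ID lists, one per IR system.
    k: pooling depth (how many top documents to take from each run).
    extra_sources: optional further document-ID collections, e.g. from
    Boolean keyword searches or interactive expert searching.
    """
    pool = set()
    for run in runs:
        pool.update(run[:k])
    for source in extra_sources:
        pool.update(source)
    return pool

# Toy example with three runs pooled to depth 2.
runs = [["d1", "d4", "d7"], ["d4", "d2", "d9"], ["d3", "d1", "d8"]]
print(sorted(build_pool(runs, k=2)))   # ['d1', 'd2', 'd3', 'd4']
```

Documents that never enter the pool are then typically treated as nonrelevant for evaluation purposes.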
A human is not a device that reliably reports a gold standard judgment of relevance of a document to a query. Rather, humans and their relevance judgments are quite idiosyncratic and variable. But this is not a problem to be solved: in the final analysis, the success of an IR system depends on how good it is at satisfying the needs of these idiosyncratic humans, one information need at a time.

KAPPA STATISTIC
Nevertheless, it is interesting to consider and measure how much agreement between judges there is on relevance judgments. In the social sciences, a common measure for agreement between judges is the kappa statistic. It is designed for categorical judgments and corrects a simple agreement rate for the rate of chance agreement.

(8.10)
\[
\kappa = \frac{P(A) - P(E)}{1 - P(E)}
\]

where P(A) is the proportion of the times the judges agreed, and P(E) is the proportion of the times they would be expected to agree by chance. There are choices in how the latter is estimated: if we simply say we are making a two-class decision and assume nothing more, then the expected chance agreement rate is 0.5.
MARGINAL
However, normally the class distribution assigned is skewed, and it is usual to use marginal statistics to calculate expected agreement.² There are still two ways to do it depending on whether one pools the marginal distribution across judges or uses the marginals for each judge separately; both forms have been used, but we present the pooled version because it is more conservative in the presence of systematic differences in assessments across judges.

² For a contingency table, as in Table 8.2, a marginal statistic is formed by summing a row or column. The marginal $a_{i \cdot k} = \sum_j a_{ijk}$.
                      Judge 2 relevance
                      Yes      No      Total
Judge 1     Yes       300      20      320
relevance   No         10      70       80
            Total     310      90      400

Observed proportion of the times the judges agreed
P(A) = (300 + 70)/400 = 370/400 = 0.925

Pooled marginals
P(nonrelevant) = (80 + 90)/(400 + 400) = 170/800 = 0.2125
P(relevant) = (320 + 310)/(400 + 400) = 630/800 = 0.7875

Probability that the two judges agreed by chance
P(E) = P(nonrelevant)² + P(relevant)² = 0.2125² + 0.7875² = 0.665

Kappa statistic
κ = (P(A) − P(E))/(1 − P(E)) = (0.925 − 0.665)/(1 − 0.665) = 0.776

◮ Table 8.2  Calculating the kappa statistic.

The calculations are shown in Table 8.2. The kappa value will be 1 if two judges always agree, 0 if they agree only at the rate given by chance, and negative if they are worse than random. If there are more than two judges, it is normal to calculate an average pairwise kappa value. As a rule of thumb, a kappa value above 0.8 is taken as good agreement, a kappa value between 0.67 and 0.8 is taken as fair agreement, and agreement below 0.67 is seen as data providing a dubious basis for an evaluation, though the precise cutoffs depend on the purposes for which the data will be used.
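The arithmetic in Table 8.2 is easy to mechanize. The following is a minimal sketch that reproduces the pooled-marginal kappa computation from a 2×2 contingency table; the function name and argument layout are assumptions made for illustration.

```python
def pooled_kappa(yes_yes, yes_no, no_yes, no_no):
    """Kappa with pooled marginals for two judges making yes/no relevance
    judgments. Arguments are the cells of the 2x2 contingency table:
    (judge 1, judge 2) = (yes, yes), (yes, no), (no, yes), (no, no)."""
    n = yes_yes + yes_no + no_yes + no_no
    p_agree = (yes_yes + no_no) / n
    # Pool the two judges' marginals before estimating chance agreement.
    p_relevant = ((yes_yes + yes_no) + (yes_yes + no_yes)) / (2 * n)
    p_nonrelevant = 1 - p_relevant
    p_chance = p_relevant ** 2 + p_nonrelevant ** 2
    return (p_agree - p_chance) / (1 - p_chance)

# The numbers from Table 8.2: kappa comes out at about 0.776.
print(round(pooled_kappa(300, 20, 10, 70), 3))
```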
Interjudge agreement of relevance has been measured within the TREC evaluations and for medical IR collections. Using the above rules of thumb, the level of agreement normally falls in the range of "fair" (0.67–0.8). The fact that human agreement on a binary relevance judgment is quite modest is one reason for not requiring more fine-grained relevance labeling from the test set creator. To answer the question of whether IR evaluation results are valid despite the variation of individual assessors' judgments, people have experimented with evaluations taking one or the other of two judges' opinions as the gold standard.
The choice can make a considerable absolute difference to reported scores, but has in general been found to have little impact on the relative effectiveness ranking of either different systems or variants of a single system which are being compared for effectiveness.

8.5.1 Critiques and justifications of the concept of relevance

The advantage of system evaluation, as enabled by the standard model of relevant and nonrelevant documents, is that we have a fixed setting in which we can vary IR systems and system parameters to carry out comparative experiments.
Such formal testing is much less expensive and allows clearer diagnosis of the effect of changing system parameters than doing user studies of retrieval effectiveness. Indeed, once we have a formal measure that we have confidence in, we can proceed to optimize effectiveness by machine learning methods, rather than tuning parameters by hand. Of course, if the formal measure poorly describes what users actually want, doing this will not be effective in improving user satisfaction.
Our perspective is that, in practice, the standard formal measures for IR evaluation, although a simplification, are good enough, and recent work in optimizing formal evaluation measures in IR has succeeded brilliantly. There are numerous examples of techniques developed in formal evaluation settings which improve effectiveness in operational settings, such as the development of document length normalization methods within the context of TREC (Sections 6.4.4 and 11.4.3) and machine learning methods for adjusting parameter weights in scoring (Section 6.1.2).

That is not to say that there are not problems latent within the abstractions used.
The relevance of one document is treated as independent of the relevance of other documents in the collection. (This assumption is actually built into most retrieval systems – documents are scored against queries, not against each other – as well as being assumed in the evaluation methods.) Assessments are binary: there aren't any nuanced assessments of relevance. Relevance of a document to an information need is treated as an absolute, objective decision.
But judgments of relevance are subjective, varying across people, as we discussed above. In practice, human assessors are also imperfect measuring instruments, susceptible to failures of understanding and attention. We also have to assume that users' information needs do not change as they start looking at retrieval results. Any results based on one collection are heavily skewed by the choice of collection, queries, and relevance judgment set: the results may not translate from one domain to another or to a different user population.

Some of these problems may be fixable. A number of recent evaluations, including INEX, some TREC tracks, and NTCIR have adopted an ordinal notion of relevance with documents divided into 3 or 4 classes, distinguishing slightly relevant documents from highly relevant documents.
See Section 10.4 (page 210) for a detailed discussion of how this is implemented in the INEX evaluations.

MARGINAL RELEVANCE
One clear problem with the relevance-based assessment that we have presented is the distinction between relevance and marginal relevance: whether a document still has distinctive usefulness after the user has looked at certain other documents (Carbonell and Goldstein 1998). Even if a document is highly relevant, its information can be completely redundant with other documents which have already been examined. The most extreme case of this is documents that are duplicates – a phenomenon that is actually very common on the World Wide Web – but it can also easily occur when several documents provide a similar precis of an event.
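The cited work of Carbonell and Goldstein (1998) operationalizes this idea as maximal marginal relevance (MMR): results are selected greedily, trading off similarity to the query against similarity to documents already chosen. The sketch below only illustrates that trade-off; the similarity inputs, the parameter lam, and the function name are assumptions, not anything defined in this chapter.

```python
def mmr_rerank(candidates, query_sim, doc_sim, lam=0.7, k=10):
    """Greedy MMR selection: at each step pick the candidate that best
    balances relevance to the query against redundancy with the
    documents already selected.

    candidates: list of document IDs.
    query_sim:  dict mapping doc ID -> similarity to the query.
    doc_sim:    function (doc_a, doc_b) -> similarity between two documents.
    lam:        trade-off between relevance (1.0) and novelty (0.0).
    """
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def mmr_score(d):
            redundancy = max((doc_sim(d, s) for s in selected), default=0.0)
            return lam * query_sim[d] - (1 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy example: "d1" and "d2" are near-duplicates, "d3" is distinct.
sims = {"d1": 0.9, "d2": 0.88, "d3": 0.5}
dup = lambda a, b: 0.95 if {a, b} == {"d1", "d2"} else 0.1
print(mmr_rerank(["d1", "d2", "d3"], sims, dup, lam=0.5, k=3))  # ['d1', 'd3', 'd2']
```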