In such circumstances, marginal relevance is clearly a better measure of utility to the user. Maximizing marginal relevance requires returning documents that exhibit diversity and novelty. One way to approach measuring this is by using distinct facts or entities as evaluation units. This perhaps more directly measures true utility to the user, but doing this makes it harder to create a test collection.

Exercise 8.10 [⋆⋆]

Below is a table showing how two human judges rated the relevance of a set of 12 documents to a particular information need (0 = nonrelevant, 1 = relevant).
Let us assume that you’ve written an IR system that for this query returns the set of documents {4, 5, 6, 7, 8}.

docID    1  2  3  4  5  6  7  8  9  10  11  12
Judge 1  0  0  1  1  1  1  1  1  0   0   0   0
Judge 2  0  0  1  1  0  0  0  0  1   1   1   1

a. Calculate the kappa measure between the two judges.
b. Calculate precision, recall, and F1 of your system if a document is considered relevant only if the two judges agree.
c. Calculate precision, recall, and F1 of your system if a document is considered relevant if either judge thinks it is relevant.
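As a worked aid (not part of the original exercise text), the sketch below shows how these quantities could be computed from the table above, using the pooled-marginals form of kappa from Section 8.5. The variable names and helper functions are illustrative only.

```python
# A minimal sketch of the computations asked for in Exercise 8.10.
judge1 = [0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]   # docIDs 1..12
judge2 = [0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1]
retrieved = {4, 5, 6, 7, 8}                      # the system's result set

def kappa(a, b):
    n = len(a)
    p_agree = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement from marginals pooled over both judges.
    p_rel = (sum(a) + sum(b)) / (2 * n)
    p_exp = p_rel ** 2 + (1 - p_rel) ** 2
    return (p_agree - p_exp) / (1 - p_exp)

def prf1(relevant_docs, retrieved_docs):
    tp = len(relevant_docs & retrieved_docs)
    p = tp / len(retrieved_docs) if retrieved_docs else 0.0
    r = tp / len(relevant_docs) if relevant_docs else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# (b) relevant only where the judges agree on relevance;
# (c) relevant where either judge marks the document relevant.
rel_and = {i + 1 for i, (x, y) in enumerate(zip(judge1, judge2)) if x == 1 and y == 1}
rel_or  = {i + 1 for i, (x, y) in enumerate(zip(judge1, judge2)) if x == 1 or y == 1}

print("kappa:", kappa(judge1, judge2))
print("P, R, F1 (agreement):", prf1(rel_and, retrieved))
print("P, R, F1 (either judge):", prf1(rel_or, retrieved))
```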
8.6 A broader perspective: System quality and user utility

Formal evaluation measures are at some distance from our ultimate interest in measures of human utility: how satisfied is each user with the results the system gives for each information need that they pose? The standard way to measure human satisfaction is by various kinds of user studies. These might include quantitative measures, both objective, such as time to complete a task, and subjective, such as a score for satisfaction with the search engine, as well as qualitative measures, such as user comments on the search interface.
In this section we will touch on other system aspects that allow quantitative evaluation and the issue of user utility.

8.6.1 System issues

There are many practical benchmarks on which to rate an information retrieval system beyond its retrieval quality. These include:

• How fast does it index, that is, how many documents per hour does it index for a certain distribution over document lengths? (cf. Chapter 4)
• How fast does it search, that is, what is its latency as a function of index size?
• How expressive is its query language? How fast is it on complex queries?
• How large is its document collection, in terms of the number of documents or the collection having information distributed across a broad range of topics?

All these criteria apart from query language expressiveness are straightforwardly measurable: we can quantify the speed or size. Various kinds of feature checklists can make query language expressiveness semi-precise.
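For instance, search latency can be measured directly by timing a batch of queries against the system at several index sizes. The sketch below is a minimal illustration only: the in-memory search function is a toy stand-in for a real engine, and the query set is invented.

```python
# A minimal sketch of measuring query latency. In practice the timed call
# would go to the real engine, repeated at several index sizes.
import time
import statistics

index = {"ir": [1, 2, 3], "evaluation": [2, 3, 4], "kappa": [3]}

def search(query):
    # Toy conjunctive lookup over the in-memory index.
    postings = [set(index.get(t, [])) for t in query.split()]
    return set.intersection(*postings) if postings else set()

queries = ["ir evaluation", "kappa", "ir kappa"] * 100

latencies = []
for q in queries:
    start = time.perf_counter()
    search(q)
    latencies.append(time.perf_counter() - start)

latencies.sort()
print("median latency (s):", statistics.median(latencies))
print("95th percentile (s):", latencies[int(0.95 * len(latencies)) - 1])
```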
8.6.2 User utility

What we would really like is a way of quantifying aggregate user happiness, based on the relevance, speed, and user interface of a system. One part of this is understanding the distribution of people we wish to make happy, and this depends entirely on the setting. For a web search engine, happy search users are those who find what they want. One indirect measure of such users is that they tend to return to the same engine.
Measuring the rate of return of users is thus an effective metric, which would of course be more effective if you could also measure how much these users used other search engines. But advertisers are also users of modern web search engines. They are happy if customers click through to their sites and then make purchases.
On an eCommerce web site, a user is likely to want to purchase something. Thus, we can measure the time to purchase, or the fraction of searchers who become buyers. On a shopfront web site, perhaps both the user’s and the store owner’s needs are satisfied if a purchase is made. Nevertheless, in general, we need to decide whether it is the end user’s or the eCommerce site owner’s happiness that we are trying to optimize. Usually, it is the store owner who is paying us.

For an “enterprise” (company, government, or academic) intranet search engine, the relevant metric is more likely to be user productivity: how much time do users spend looking for information that they need.
There are also many other practical criteria concerning such matters as information security, which we mentioned in Section 4.6 (page 80).

User happiness is elusive to measure, and this is part of why the standard methodology uses the proxy of relevance of search results. The standard direct way to get at user satisfaction is to run user studies, where people engage in tasks, and usually various metrics are measured, the participants are observed, and ethnographic interview techniques are used to get qualitative information on satisfaction. User studies are very useful in system design, but they are time consuming and expensive to do. They are also difficult to do well, and expertise is required to design the studies and to interpret the results. We will not discuss the details of human usability testing here.
8.6.3 Refining a deployed system

If an IR system has been built and is being used by a large number of users, the system’s builders can evaluate possible changes by deploying variant versions of the system and recording measures that are indicative of user satisfaction with one variant vs. others as they are being used.
This method is frequently used by web search engines.

The most common version of this is A/B testing, a term borrowed from the advertising industry. For such a test, precisely one thing is changed between the current system and a proposed system, and a small proportion of traffic (say, 1–10% of users) is randomly directed to the variant system, while most users use the current system. For example, if we wish to investigate a change to the ranking algorithm, we redirect a random sample of users to a variant system and evaluate measures such as the frequency with which people click on the top result, or any result on the first page. (This particular analysis method is referred to as clickthrough log analysis or clickstream mining. It is further discussed as a method of implicit feedback in Section 9.1.7 (page 187).)

The basis of A/B testing is running a bunch of single-variable tests (either in sequence or in parallel): for each test only one parameter is varied from the control (the current live system).
It is therefore easy to see whether varying each parameter has a positive or negative effect. Such testing of a live system can easily and cheaply gauge the effect of a change on users, and, with a large enough user base, it is practical to measure even very small positive and negative effects. In principle, more analytic power can be achieved by varying multiple things at once in an uncorrelated (random) way, and doing standard multivariate statistical analysis, such as multiple linear regression. In practice, though, A/B testing is widely used, because A/B tests are easy to deploy, easy to understand, and easy to explain to management.
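As an illustration of the mechanics (not taken from the book), the sketch below assigns each user to the control or the variant by hashing a user id, so that roughly 10% of traffic sees the variant, and then compares clickthrough rates on the top result between the two buckets. The log format and field names are hypothetical.

```python
# A minimal A/B-testing sketch: deterministic bucket assignment plus a
# comparison of top-result clickthrough rates. The log records use a
# hypothetical format (user_id, clicked_top_result), not a real system's.
import hashlib

VARIANT_FRACTION = 0.10  # say, 10% of users see the variant ranking

def bucket(user_id: str) -> str:
    # Hash the user id so that the same user always sees the same system.
    h = int(hashlib.sha1(user_id.encode()).hexdigest(), 16)
    return "variant" if (h % 100) < VARIANT_FRACTION * 100 else "control"

def top_result_ctr(log, which):
    """Fraction of logged searches in the given bucket that clicked the top result."""
    rows = [r for r in log if bucket(r["user_id"]) == which]
    return sum(r["clicked_top_result"] for r in rows) / len(rows) if rows else 0.0

# Hypothetical clickthrough log.
log = [
    {"user_id": "u1", "clicked_top_result": True},
    {"user_id": "u2", "clicked_top_result": False},
    {"user_id": "u3", "clicked_top_result": True},
]

print("control CTR:", top_result_ctr(log, "control"))
print("variant CTR:", top_result_ctr(log, "variant"))
```

As the text notes, a large enough user base is needed before small differences between the two rates become measurable; a real analysis would also test whether the observed difference is statistically significant.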
8.7 Results snippets

Having chosen or ranked the documents matching a query, we wish to present a results list that will be informative to the user. In many cases the user will not want to examine all the returned documents, and so we want to make the results list informative enough that the user can do a final ranking of the documents for themselves based on relevance to their information need.³ The standard way of doing this is to provide a snippet, a short summary of the document, which is designed so as to allow the user to decide its relevance.
Typically, the snippet consists of the document title and a short summary, which is automatically extracted. The question is how to design the summary so as to maximize its usefulness to the user.

³ There are exceptions, in domains where recall is emphasized. For instance, in many legal disclosure cases, a legal associate will review every document that matches a keyword search.

The two basic kinds of summaries are static, which are always the same regardless of the query, and dynamic (or query-dependent), which are customized according to the user’s information need as deduced from a query. Dynamic summaries attempt to explain why a particular document was retrieved for the query at hand.

A static summary generally comprises a subset of the document, metadata associated with the document, or both. The simplest form of summary takes the first two sentences or 50 words of a document, or extracts particular zones of a document, such as the title and author.
Instead of zones of a document, the summary can use metadata associated with the document. This may be an alternative way to provide an author or date, or may include elements which are designed to give a summary, such as the description metadata which can appear in the <meta> element of an HTML web page. This summary is typically extracted and cached at indexing time, in such a way that it can be retrieved and presented quickly when displaying search results, whereas having to access the actual document content might be a relatively expensive operation.

There has been extensive work within natural language processing (NLP) on better ways to do text summarization. Most such work still aims only to choose sentences from the original document to present and concentrates on how to select good sentences.
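As a small illustration of the simplest form of static summary described above, the sketch below takes the document title plus the first two sentences of the body and truncates the result to at most 50 words. The crude regex-based sentence splitting and the function name are assumptions for brevity, not a recommended implementation.

```python
# A minimal static-snippet sketch: title plus the first two sentences,
# truncated to at most 50 words. The sentence split is deliberately naive.
import re

def static_snippet(title: str, body: str, max_sentences: int = 2, max_words: int = 50) -> str:
    sentences = re.split(r"(?<=[.!?])\s+", body.strip())
    summary = " ".join(sentences[:max_sentences])
    words = summary.split()
    if len(words) > max_words:
        summary = " ".join(words[:max_words]) + " ..."
    return f"{title}: {summary}"

doc_body = ("Evaluation in information retrieval measures how well a system "
            "meets the information needs of its users. This chapter covers "
            "precision, recall, and related measures. It also discusses test collections.")
print(static_snippet("Chapter 8: Evaluation", doc_body))
```

A summary like this would normally be computed and cached at indexing time, as the text notes, so that it can be served without touching the full document at query time.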