7.1.1 Inexact top K document retrieval

We now consider schemes by which we produce K documents that are likely to be among the K highest scoring documents for a query. In doing so, we hope to dramatically lower the cost of computing the K documents we output, without materially altering the user's perceived relevance of the top K results. Consequently, in most applications it suffices to retrieve K documents whose scores are very close to those of the K best. In the sections that follow we detail schemes that retrieve K such documents while potentially avoiding computing scores for most of the N documents in the collection.

Such inexact top-K retrieval is not necessarily, from the user's perspective, a bad thing. The top K documents by the cosine measure are in any case not necessarily the K best for the query: cosine similarity is only a proxy for the user's perceived relevance.
In Sections 7.1.2–7.1.6 below, we give heuristics using which we are likely to retrieve K documents with cosine scores close to those of the top K documents. The principal cost in computing the output stems from computing cosine similarities between the query and a large number of documents.
Having a large number of documents in contention also increases the selection cost in the final stage of culling the top K documents from a heap. We now consider a series of ideas designed to eliminate a large number of documents without computing their cosine scores. The heuristics have the following two-step scheme:

1. Find a set A of documents that are contenders, where K < |A| ≪ N.
   A does not necessarily contain the K top-scoring documents for the query, but is likely to have many documents with scores near those of the top K.

2. Return the K top-scoring documents in A.

From the descriptions of these ideas it will be clear that many of them require parameters to be tuned to the collection and application at hand; pointers to experience in setting these parameters may be found at the end of this chapter.
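To make the two-step scheme concrete, here is a minimal sketch in Python. The helpers find_contenders and cosine_score are hypothetical stand-ins, not from this book: the former represents any of the pruning heuristics of Sections 7.1.2–7.1.6, the latter the cosine computation of Chapter 6.

    import heapq

    def inexact_top_k(query, K, find_contenders, cosine_score):
        # Step 1: cheaply assemble a contender set A, with K < |A| << N,
        # without computing cosine scores for most of the collection.
        A = find_contenders(query)
        # Step 2: compute cosine scores only for documents in A and
        # select the K best with a heap.
        return heapq.nlargest(K, A, key=lambda d: cosine_score(query, d))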
It should also be noted that most of these heuristics are well-suited to free text queries, but not for Boolean or phrase queries.

7.1.2 Index elimination

For a multi-term query q, it is clear we only consider documents containing at least one of the query terms. We can take this a step further using additional heuristics:
1. We only consider documents containing terms whose idf exceeds a preset threshold. Thus, in the postings traversal, we only traverse the postings for terms with high idf. This has a fairly significant benefit: the postings lists of low-idf terms are generally long; with these removed from contention, the set of documents for which we compute cosines is greatly reduced. One way of viewing this heuristic: low-idf terms are treated as stop words and do not contribute to scoring. For instance, on the query catcher in the rye, we only traverse the postings for catcher and rye.
   The cutoff threshold can of course be adapted in a query-dependent manner.

2. We only consider documents that contain many (and as a special case, all) of the query terms. This can be accomplished during the postings traversal; we only compute scores for documents containing all (or many) of the query terms. A danger of this scheme is that by requiring all (or even many) query terms to be present in a document before considering it for cosine computation, we may end up with fewer than K candidate documents in the output. This issue will be discussed further in Section 7.2.1.
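As an illustration of both heuristics, the sketch below assumes a simplified in-memory index: postings maps each term to the set of IDs of documents containing it, and idf maps each term to its inverse document frequency. The function and parameter names (index_elimination, idf_cutoff, min_terms) are invented for this sketch.

    from collections import Counter

    def index_elimination(query_terms, postings, idf, idf_cutoff, min_terms):
        # Heuristic 1: treat low-idf query terms as stop words, traversing
        # only the postings of terms whose idf clears the cutoff.
        kept = [t for t in query_terms if idf.get(t, 0.0) >= idf_cutoff]
        # Heuristic 2: keep only documents containing at least min_terms
        # of the surviving query terms (min_terms = len(kept) demands all).
        counts = Counter()
        for t in kept:
            counts.update(postings.get(t, ()))
        return {d for d, c in counts.items() if c >= min_terms}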
7.1.3 Champion lists

The idea of champion lists (sometimes also called fancy lists or top docs) is to precompute, for each term t in the dictionary, the set of the r documents with the highest weights for t; the value of r is chosen in advance. For tf-idf weighting, these would be the r documents with the highest tf values for term t. We call this set of r documents the champion list for term t.

Now, given a query q we create a set A as follows: we take the union of the champion lists for each of the terms comprising q. We now restrict cosine computation to only the documents in A. A critical parameter in this scheme is the value r, which is highly application dependent.
Intuitively, r should be large compared with K, especially if we use any form of the index elimination described in Section 7.1.2. One issue here is that the value r is set at the time of index construction, whereas K is application dependent and may not be available until the query is received; as a result we may (as in the case of index elimination) find ourselves with a set A that has fewer than K documents. There is no reason to have the same value of r for all terms in the dictionary; it could for instance be set to be higher for rarer terms.
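Both phases of the champion-list scheme are easy to sketch. In the following (illustrative names, not from the book), postings maps each term to a list of (document ID, tf) pairs, and cosine_score is assumed to close over the current query, taking just a document ID.

    import heapq

    def build_champion_lists(postings, r):
        # Index time: for each term t, retain the r documents with the
        # highest tf values for t.
        return {
            t: [d for d, tf in heapq.nlargest(r, pairs, key=lambda p: p[1])]
            for t, pairs in postings.items()
        }

    def retrieve_from_champions(query_terms, champions, K, cosine_score):
        # Query time: restrict cosine computation to the union A of the
        # champion lists of the query terms.
        A = set()
        for t in query_terms:
            A.update(champions.get(t, []))
        return heapq.nlargest(K, A, key=cosine_score)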
7.1.4 Static quality scores and ordering

We now further develop the idea of champion lists, in the somewhat more general setting of static quality scores. In many search engines, we have available a measure of quality g(d) for each document d that is query-independent and thus static. This quality measure may be viewed as a number between zero and one. For instance, in the context of news stories on the web, g(d) may be derived from the number of favorable reviews of the story by web surfers.

[Figure 7.2: A static quality-ordered index. In this example we assume that Doc1, Doc2 and Doc3 respectively have static quality scores g(1) = 0.25, g(2) = 0.5, g(3) = 1.]
Section 4.6 (page 80) provides further discussion on this topic, as does Chapter 21 in the context of web search.

The net score for a document d is some combination of g(d) together with the query-dependent score induced (say) by (6.12). The precise combination may be determined by the learning methods of Section 6.1.2, to be developed further in Section 15.4.1; but for the purposes of our exposition here, let us consider a simple sum:

(7.2)   \text{net-score}(q, d) = g(d) + \frac{\vec{V}(q) \cdot \vec{V}(d)}{|\vec{V}(q)|\,|\vec{V}(d)|}

In this simple form, the static quality g(d) and the query-dependent score from (6.10) have equal contributions, assuming each is between 0 and 1. Other relative weightings are possible; the effectiveness of our heuristics will depend on the specific relative weighting.
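Computed directly from the vector representations, (7.2) might look as follows; this is only a sketch, with numpy supplying the dot product and norms, and g_d standing for the precomputed static quality g(d).

    import numpy as np

    def net_score(g_d, v_q, v_d):
        # net-score(q, d) = g(d) + cosine(q, d), as in (7.2); e.g. if
        # g(d) = 0.5 and the cosine is 0.8, the net score is 1.3.
        cosine = np.dot(v_q, v_d) / (np.linalg.norm(v_q) * np.linalg.norm(v_d))
        return g_d + cosine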
First, consider ordering the documents in the postings list for each term by decreasing value of g(d). This allows us to perform the postings intersection algorithm of Figure 1.6. In order to perform the intersection by a single pass through the postings of each query term, the algorithm of Figure 1.6 relied on the postings being ordered by document IDs. But in fact, we only required that all postings be ordered by a single common ordering; here we rely on the g(d) values to provide this common ordering. This is illustrated in Figure 7.2, where the postings are ordered in decreasing order of g(d).
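The observation that any single common ordering suffices can be seen in a two-pointer intersection sketch. Here both postings lists are assumed sorted by the same strict total order, and key(d) returns a document's rank in that order, e.g. key = lambda d: (-g[d], d) for decreasing g(d) with ties broken by document ID (a tie-breaking rule the text leaves implicit).

    def intersect_by_common_order(p1, p2, key):
        # Single-pass intersection of two postings lists sharing one
        # sort order; compare on key(d) rather than on document IDs.
        answer, i, j = [], 0, 0
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                answer.append(p1[i])
                i, j = i + 1, j + 1
            elif key(p1[i]) < key(p2[j]):
                i += 1  # p1[i] sorts strictly earlier; it cannot match
                        # any remaining element of p2
            else:
                j += 1
        return answer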
The first idea is a direct extension of champion lists: for a well-chosen value r, we maintain for each term t a global champion list of the r documents with the highest values for g(d) + tf-idf_{t,d}. The list itself is, like all the postings lists considered so far, sorted by a common order (either by document IDs or by static quality). Then at query time, we only compute the net scores (7.2) for documents in the union of these global champion lists. Intuitively, this has the effect of focusing on documents likely to have large net scores.

We conclude the discussion of global champion lists with one further idea. We maintain for each term t two postings lists consisting of disjoint sets of documents, each sorted by g(d) values.
The first list, which we call high, contains the m documents with the highest tf values for t. The second list, which we call low, contains all other documents containing t. When processing a query, we first scan only the high lists of the query terms, computing net scores for any document on the high lists of all (or more than a certain number of) query terms. If we obtain scores for K documents in the process, we terminate. If not, we continue the scanning into the low lists, scoring documents in these postings lists.
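A minimal sketch of this tiered scan follows; high and low are assumed to map each term to its two document lists, net_score to compute (7.2) for a document ID, and min_terms to implement "all (or more than a certain number of) query terms". All names are illustrative, and for brevity the sketch checks the size of the candidate set before scoring rather than terminating mid-scan.

    import heapq
    from collections import Counter

    def tiered_retrieve(query_terms, high, low, K, min_terms, net_score):
        def candidates(tier):
            # Documents appearing on the lists of at least min_terms of
            # the query terms within this tier.
            counts = Counter()
            for t in query_terms:
                counts.update(tier.get(t, ()))
            return {d for d, c in counts.items() if c >= min_terms}

        A = candidates(high)       # scan the high lists first
        if len(A) < K:
            # Too few contenders: continue the scan into the low lists.
            A |= candidates(low)
        return heapq.nlargest(K, A, key=net_score)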