An Introduction to Information Retrieval. Manning, Raghavan (2009). Online edition (c) 2009 Cambridge UP. Page 36.
This idea is developed further in Section 7.2.1.

7.1.5 Impact ordering

In all the postings lists described thus far, we order the documents consistently by some common ordering: typically by document ID but in Section 7.1.4 by static quality scores. As noted at the end of Section 6.3.3, such a common ordering supports the concurrent traversal of all of the query terms' postings lists, computing the score for each document as we encounter it. Computing scores in this manner is sometimes referred to as document-at-a-time scoring. We will now introduce a technique for inexact top-K retrieval in which the postings are not all ordered by a common ordering, thereby precluding such a concurrent traversal.
We will therefore require scores to be "accumulated" one term at a time as in the scheme of Figure 6.14, so that we have term-at-a-time scoring.

The idea is to order the documents d in the postings list of term t by decreasing order of tf_{t,d}. Thus, the ordering of documents will vary from one postings list to another, and we cannot compute scores by a concurrent traversal of the postings lists of all query terms. Given postings lists ordered by decreasing order of tf_{t,d}, two ideas have been found to significantly lower the number of documents for which we accumulate scores: (1) when traversing the postings list for a query term t, we stop after considering a prefix of the postings list – either after a fixed number of documents r have been seen, or after the value of tf_{t,d} has dropped below a threshold; (2) when accumulating scores in the outer loop of Figure 6.14, we consider the query terms in decreasing order of idf, so that the query terms likely to contribute the most to the final scores are considered first.
This latter idea too can be adaptive at the time of processing a query: as we get to query terms with lower idf, we can determine whether to proceed based on the changes in document scores from processing the previous query term. If these changes are minimal, we may omit accumulation from the remaining query terms, or alternatively process shorter prefixes of their postings lists.

These ideas form a common generalization of the methods introduced in Sections 7.1.2–7.1.4.
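As a concrete illustration, the term-at-a-time scheme with both early-termination ideas can be sketched as follows. This is a minimal sketch, not the book's algorithm: the postings representation, the scoring function (a plain idf × tf contribution), and the parameter names r and tf_threshold are assumptions introduced here for illustration.

```python
from collections import defaultdict

def term_at_a_time_topk(query_terms, postings, idf, K, r=100, tf_threshold=2):
    """Inexact top-K retrieval over impact-ordered postings.

    postings[t] is assumed to be a list of (doc_id, tf) pairs sorted by
    decreasing tf; idf[t] is the idf of term t. Both ideas from the text
    are applied: (1) only a prefix of each postings list is scanned,
    stopping after r entries or once tf drops below tf_threshold;
    (2) query terms are processed in decreasing order of idf.
    """
    scores = defaultdict(float)  # one accumulator per document seen
    # Idea (2): high-idf terms first, so large contributions come early.
    for t in sorted(query_terms, key=lambda t: idf[t], reverse=True):
        for i, (doc, tf) in enumerate(postings.get(t, [])):
            # Idea (1): stop after a prefix of the postings list.
            if i >= r or tf < tf_threshold:
                break
            scores[doc] += idf[t] * tf  # simple tf-idf contribution
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)[:K]
```

Note that, unlike document-at-a-time scoring, a document's score here is not final until all query terms have been processed, which is why the accumulators must be kept for every document encountered.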
We may also implement a version of static ordering in which each postings list is ordered by an additive combination of static and query-dependent scores. We would again lose the consistency of ordering across postings, thereby having to process query terms one at a time, accumulating scores for all documents as we go along. Depending on the particular scoring function, the postings list for a document may be ordered by other quantities than term frequency; under this more general setting, this idea is known as impact ordering.

7.1.6 Cluster pruning

In cluster pruning we have a preprocessing step during which we cluster the document vectors. Then at query time, we consider only documents in a small number of clusters as candidates for which we compute cosine scores. Specifically, the preprocessing step is as follows:

1. Pick √N documents at random from the collection. Call these leaders.
2. For each document that is not a leader, we compute its nearest leader. We refer to documents that are not leaders as followers. Intuitively, in the partition of the followers induced by the use of √N randomly chosen leaders, the expected number of followers for each leader is ≈ N/√N = √N.

Next, query processing proceeds as follows:

1. Given a query q, find the leader L that is closest to q. This entails computing cosine similarities from q to each of the √N leaders.
2. The candidate set A consists of L together with its followers. We compute the cosine scores for all documents in this candidate set.

The use of randomly chosen leaders for clustering is fast and likely to reflect the distribution of the document vectors in the vector space: a region of the vector space that is dense in documents is likely to produce multiple leaders and thus a finer partition into sub-regions.
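The two steps above can be sketched as follows. This is an illustrative sketch under assumptions not made in the text: documents are stored as a dictionary mapping IDs to vectors, the function names are invented here, and a plain dot product stands in for cosine similarity (equivalent when the vectors are length-normalized).

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def build_clusters(docs, seed=0):
    """Preprocessing: pick ~sqrt(N) random leaders and attach each
    remaining document (a follower) to its nearest leader."""
    rng = random.Random(seed)
    n_leaders = max(1, round(math.sqrt(len(docs))))
    leaders = rng.sample(list(docs), n_leaders)
    followers = {ldr: [] for ldr in leaders}
    for d in docs:
        if d in followers:
            continue  # leaders have no leader of their own
        nearest = max(leaders, key=lambda ldr: dot(docs[d], docs[ldr]))
        followers[nearest].append(d)
    return followers

def cluster_prune_query(q, docs, followers, K):
    """Query time: find the leader L closest to q, then compute scores
    only for the candidate set A = {L} together with L's followers."""
    L = max(followers, key=lambda ldr: dot(q, docs[ldr]))
    candidates = [L] + followers[L]
    ranked = sorted(candidates, key=lambda d: dot(q, docs[d]), reverse=True)
    return ranked[:K]
```

Query processing thus computes roughly √N similarities to find L plus another ≈ √N for L's followers, rather than N similarities over the whole collection.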
This is illustrated in Figure 7.3.

◮ Figure 7.3 Cluster pruning.

Variations of cluster pruning introduce additional parameters b1 and b2, both of which are positive integers. In the preprocessing step we attach each follower to its b1 closest leaders, rather than a single closest leader. At query time we consider the b2 leaders closest to the query q. Clearly, the basic scheme above corresponds to the case b1 = b2 = 1. Further, increasing b1 or b2 increases the likelihood of finding K documents that are more likely to be in the set of true top-scoring K documents, at the expense of more computation. We reiterate this approach when describing clustering in Chapter 16 (page 354).

Exercise 7.1
We suggested above (Figure 7.2) that the postings for static quality ordering be in decreasing order of g(d). Why do we use the decreasing rather than the increasing order?

Exercise 7.2
When discussing champion lists, we simply used the r documents with the largest tf values to create the champion list for t. But when considering global champion lists, we used idf as well, identifying documents with the largest values of g(d) + tf-idf_{t,d}. Why do we differentiate between these two cases?

Exercise 7.3
If we were to only have one-term queries, explain why the use of global champion lists with r = K suffices for identifying the K highest scoring documents.
What is a simple modification to this idea if we were to only have s-term queries for any fixed integer s > 1?

Exercise 7.4
Explain how the common global ordering by g(d) values in all high and low lists helps make the score computation efficient.

Exercise 7.5
Consider again the data of Exercise 6.23 with nnn.atc for the query-dependent scoring.
Suppose that we were given static quality scores of 1 for Doc1 and 2 for Doc2. Determine under Equation (7.2) what ranges of static quality score for Doc3 result in it being the first, second or third result for the query best car insurance.

Exercise 7.6
Sketch the frequency-ordered postings for the data in Figure 6.9.

Exercise 7.7
Let the static quality scores for Doc1, Doc2 and Doc3 in Figure 6.11 be respectively 0.25, 0.5 and 1. Sketch the postings for impact ordering when each postings list is ordered by the sum of the static quality score and the Euclidean normalized tf values in Figure 6.11.

Exercise 7.8
The nearest-neighbor problem in the plane is the following: given a set of N data points on the plane, we preprocess them into some data structure such that, given a query point Q, we seek the point in N that is closest to Q in Euclidean distance. Clearly cluster pruning can be used as an approach to the nearest-neighbor problem in the plane, if we wished to avoid computing the distance from Q to every one of the data points.
Devise a simple example on the plane so that with two leaders, the answer returned by cluster pruning is incorrect (it is not the data point closest to Q).

7.2 Components of an information retrieval system

In this section we combine the ideas developed so far to describe a rudimentary search system that retrieves and scores documents. We first develop further ideas for scoring, beyond vector spaces. Following this, we will put together all of these elements to outline a complete system. Because we consider a complete system, we do not restrict ourselves to vector space retrieval in this section. Indeed, our complete system will have provisions for vector space as well as other query operators and forms of retrieval.
In Section 7.3 we will return to how vector space queries interact with other query operators.

7.2.1 Tiered indexes

We mentioned in Section 7.1.2 that when using heuristics such as index elimination for inexact top-K retrieval, we may occasionally find ourselves with a set A of contenders that has fewer than K documents. A common solution to this issue is the use of tiered indexes, which may be viewed as a generalization of champion lists. We illustrate this idea in Figure 7.4, where we represent the documents and terms of Figure 6.9. In this example we set a tf threshold of 20 for tier 1 and 10 for tier 2, meaning that the tier 1 index only has postings entries with tf values exceeding 20, while the tier 2 index only has postings entries with tf values exceeding 10. In this example we have chosen to order the postings entries within a tier by document ID.

◮ Figure 7.4 Tiered indexes. If we fail to get K results from tier 1, query processing "falls back" to tier 2, and so on. Within each tier, postings are ordered by document ID.

7.2.2 Query-term proximity

Especially for free text queries on the web (Chapter 19), users prefer a document in which most or all of the query terms appear close to each other, because this is evidence that the document has text focused on their query intent. Consider a query with two or more query terms, t1, t2, . . .