An Introduction to Information Retrieval. Manning, Raghavan and Schütze (2009)
During query processing, the scanning of many postings lists can then be terminated early because smaller weights do not change the ranking of the highest ranked k documents found so far. It is not a good idea to precompute and store weights in the index (as opposed to frequencies) because they cannot be compressed as well as integers (see Section 7.1.5, page 140).

Document compression can also be important in an efficient information retrieval system.
de Moura et al. (2000) and Brisaboa et al. (2007) describe compression schemes that allow direct searching of terms and phrases in the compressed text, which is infeasible with standard text compression utilities like gzip and compress.

Exercise 5.14 [⋆]
We have defined unary codes as being “10”: sequences of 1s terminated by a 0. Interchanging the roles of 0s and 1s yields an equivalent “01” unary code. When this 01 unary code is used, the construction of a γ code can be stated as follows: (1) Write G down in binary using b = ⌊log2 G⌋ + 1 bits. (2) Prepend (b − 1) 0s. (i) Encode the numbers in Table 5.5 in this alternative γ code. (ii) Show that this method produces a well-defined alternative γ code in the sense that it has the same length and can be uniquely decoded.

Exercise 5.15 [⋆⋆⋆]
Unary code is not a universal code in the sense defined above. However, there exists a distribution over gaps for which unary code is optimal. Which distribution is this?

Exercise 5.16
Give some examples of terms that violate the assumption that gaps all have the same size (which we made when estimating the space requirements of a γ-encoded index). What are general characteristics of these terms?

Exercise 5.17
Consider a term whose postings list has size n, say, n = 10,000.
Compare the size of the γ-compressed gap-encoded postings list if the distribution of the term is uniform (i.e., all gaps have the same size) versus its size when the distribution is not uniform. Which compressed postings list is smaller?

Exercise 5.18
Work out the sum in Equation (5.7) and show it adds up to about 251 MB. Use the numbers in Table 4.2, but do not round Lc, c, and the number of vocabulary blocks.
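To make the construction in Exercise 5.14 concrete, here is a minimal Python sketch (not from the book) of the alternative “01” γ code; it assumes gaps G ≥ 1 and, for clarity, represents codewords as strings of '0'/'1' characters.

    # Alternative gamma code of Exercise 5.14: write G in binary using
    # b = floor(log2 G) + 1 bits, then prepend (b - 1) zeros. Valid for G >= 1.

    def alt_gamma_encode(G):
        binary = bin(G)[2:]            # G in binary, using b = len(binary) bits
        return "0" * (len(binary) - 1) + binary

    def alt_gamma_decode(bits):
        """Decode a concatenation of codewords back into a list of gaps."""
        gaps, i = [], 0
        while i < len(bits):
            j = i
            while bits[j] == "0":      # count the (b - 1) leading zeros
                j += 1
            b = j - i + 1              # the binary part occupies b bits
            gaps.append(int(bits[j:j + b], 2))
            i = j + b
        return gaps

    for G in [1, 2, 9, 13]:
        print(G, alt_gamma_encode(G))
    # 1 -> "1", 2 -> "010", 9 -> "0001001", 13 -> "0001101"

    encoded = "".join(alt_gamma_encode(G) for G in [9, 13, 2])
    assert alt_gamma_decode(encoded) == [9, 13, 2]  # uniquely decodable

Note that each codeword has the same length as the corresponding standard γ code (e.g., 9 encodes in 7 bits either way), as part (ii) of the exercise asks you to show.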
6 Scoring, term weighting and the vector space model

Thus far we have dealt with indexes that support Boolean queries: a document either matches or does not match a query. In the case of large document collections, the resulting number of matching documents can far exceed the number a human user could possibly sift through.
Accordingly, it is essential for a search engine to rank-order the documents matching a query. To do this, the search engine computes, for each matching document, a score with respect to the query at hand. In this chapter we initiate the study of assigning a score to a (query, document) pair. This chapter consists of three main ideas.

1. We introduce parametric and zone indexes in Section 6.1, which serve two purposes. First, they allow us to index and retrieve documents by metadata such as the language in which a document is written. Second, they give us a simple means for scoring (and thereby ranking) documents in response to a query.

2. Next, in Section 6.2 we develop the idea of weighting the importance of a term in a document, based on the statistics of occurrence of the term.

3. In Section 6.3 we show that by viewing each document as a vector of such weights, we can compute a score between a query and each document. This view is known as vector space scoring.

Section 6.4 develops several variants of term-weighting for the vector space model.
Chapter 7 develops computational aspects of vector space scoring, and related topics.

As we develop these ideas, the notion of a query will assume multiple nuances. In Section 6.1 we consider queries in which specific query terms occur in specified regions of a matching document.
Beginning in Section 6.2 we will in fact relax the requirement of matching specific regions of a document; instead, we will look at so-called free text queries that simply consist of query terms with no specification on their relative order, importance or where in a document they should be found. The bulk of our study of scoring will be in this latter notion of a query being such a set of terms.

6.1 Parametric and zone indexes

We have thus far viewed a document as a sequence of terms.
In fact, most documents have additional structure. Digital documents generally encode, in machine-recognizable form, certain metadata associated with each document. By metadata, we mean specific forms of data about a document, such as its author(s), title and date of publication. This metadata would generally include fields such as the date of creation and the format of the document, as well as the author and possibly the title of the document. The possible values of a field should be thought of as finite – for instance, the set of all dates of authorship.

Consider queries of the form “find documents authored by William Shakespeare in 1601, containing the phrase alas poor Yorick”. Query processing then consists as usual of postings intersections, except that we may merge postings from standard inverted as well as parametric indexes.
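As an illustration of such a merge, here is a minimal Python sketch; the index contents and the names INDEX, AUTHOR_INDEX and YEAR_INDEX are invented for the example. It intersects the postings of a standard term with the postings of two parametric indexes, using the sorted-list intersection in the style of Figure 1.6.

    INDEX = {"yorick": [3, 7, 12, 19]}            # term -> sorted docIDs
    AUTHOR_INDEX = {"shakespeare": [3, 5, 12]}    # author field -> sorted docIDs
    YEAR_INDEX = {1601: [2, 3, 12, 40]}           # date field -> sorted docIDs

    def intersect(p1, p2):
        """Standard sorted-postings intersection (cf. Figure 1.6)."""
        answer, i, j = [], 0, 0
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                answer.append(p1[i]); i += 1; j += 1
            elif p1[i] < p2[j]:
                i += 1
            else:
                j += 1
        return answer

    # "documents authored by shakespeare in 1601 containing yorick":
    result = intersect(intersect(INDEX["yorick"], AUTHOR_INDEX["shakespeare"]),
                       YEAR_INDEX[1601])
    print(result)  # -> [3, 12]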
There is one parametric index for each field (say, date of creation); it allows us to select only the documents matching a date specified in the query. Figure 6.1 illustrates the user’s view of such a parametric search. Some of the fields may assume ordered values, such as dates; in the example query above, the year 1601 is one such field value. The search engine may support querying ranges on such ordered values; to this end, a structure like a B-tree may be used for the field’s dictionary.

Zones are similar to fields, except the contents of a zone can be arbitrary free text. Whereas a field may take on a relatively small set of values, a zone can be thought of as an arbitrary, unbounded amount of text.
For instance, document titles and abstracts are generally treated as zones. We may build a separate inverted index for each zone of a document, to support queries such as “find documents with merchant in the title and william in the author list and the phrase gentle rain in the body”. This has the effect of building an index that looks like Figure 6.2. Whereas the dictionary for a parametric index comes from a fixed vocabulary (the set of languages, or the set of dates), the dictionary for a zone index must structure whatever vocabulary stems from the text of that zone.

In fact, we can reduce the size of the dictionary by encoding the zone in which a term occurs in the postings.
In Figure 6.3 for instance, we show how occurrences of william in the title and author zones of various documents are encoded. Such an encoding is useful when the size of the dictionary is a concern (because we require the dictionary to fit in main memory). But there is another important reason why the encoding of Figure 6.3 is useful: the efficient computation of scores using a technique we will call weighted zone scoring.

◮ Figure 6.1 Parametric search. In this example we have a collection with fields allowing us to select publications by zones such as Author and fields such as Language.

◮ Figure 6.2 Basic zone index; zones are encoded as extensions of dictionary entries:

    william.abstract → 11 → 121 → 1441 → 1729
    william.title    → 2 → 4 → 8 → 16
    william.author   → 2 → 3 → 5 → 8

◮ Figure 6.3 Zone index in which the zone is encoded in the postings rather than the dictionary:

    william → 2.author, 2.title → 3.author → 4.title → 5.author
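A minimal Python sketch of the Figure 6.3 encoding follows (illustrative; the builder function and its document representation are assumptions, not the book’s code). Each term has a single dictionary entry, and the zone is carried inside each posting as a (docID, zone) pair.

    from collections import defaultdict

    def build_zone_index(docs):
        """docs: {docID: {zone_name: text}} -> {term: [(docID, zone), ...]}"""
        index = defaultdict(list)
        for doc_id in sorted(docs):            # postings stay sorted by docID
            for zone, text in docs[doc_id].items():
                for term in set(text.lower().split()):
                    index[term].append((doc_id, zone))
        return index

    docs = {
        2: {"author": "william shakespeare", "title": "king william memoir"},
        3: {"author": "william blake", "title": "songs of innocence"},
    }
    index = build_zone_index(docs)
    print(index["william"])  # [(2, 'author'), (2, 'title'), (3, 'author')]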
6.1.1 Weighted zone scoring

Thus far in Section 6.1 we have focused on retrieving documents based on Boolean queries on fields and zones. We now turn to a second application of zones and fields.

Given a Boolean query q and a document d, weighted zone scoring assigns to the pair (q, d) a score in the interval [0, 1], by computing a linear combination of zone scores, where each zone of the document contributes a Boolean value. More specifically, consider a set of documents each of which has ℓ zones. Let g1, . . . , gℓ ∈ [0, 1] such that ∑_{i=1}^{ℓ} gi = 1. For 1 ≤ i ≤ ℓ, let si be the Boolean score denoting a match (or absence thereof) between q and the ith zone. For instance, the Boolean score from a zone could be 1 if all the query term(s) occur in that zone, and zero otherwise; indeed, it could be any Boolean function that maps the presence of query terms in a zone to {0, 1}. Then, the weighted zone score is defined to be

(6.1)    ∑_{i=1}^{ℓ} gi si .

Weighted zone scoring is sometimes referred to also as ranked Boolean retrieval.

Example 6.1: Consider the query shakespeare in a collection in which each document has three zones: author, title and body.
The Boolean score function for a zone takes on the value 1 if the query term shakespeare is present in the zone, and zero otherwise. Weighted zone scoring in such a collection would require three weights g1, g2 and g3, respectively corresponding to the author, title and body zones. Suppose we set g1 = 0.2, g2 = 0.3 and g3 = 0.5 (so that the three weights add up to 1); this corresponds to an application in which a match in the author zone is least important to the overall score, the title zone somewhat more, and the body contributes even more. Thus if the term shakespeare were to appear in the title and body zones but not the author zone of a document, the score of this document would be 0.8.
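As a quick check of Example 6.1, Equation (6.1) can be evaluated in a few lines of Python (an illustrative sketch):

    # Weighted zone score for Example 6.1: g = (0.2, 0.3, 0.5) for the
    # author, title and body zones; si = 1 if shakespeare occurs in zone i.
    g = {"author": 0.2, "title": 0.3, "body": 0.5}
    s = {"author": 0, "title": 1, "body": 1}      # term absent from author zone

    score = sum(g[zone] * s[zone] for zone in g)  # Equation (6.1)
    print(score)  # 0.8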
How do we implement the computation of weighted zone scores? A simple approach would be to compute the score for each document in turn, adding in all the contributions from the various zones. However, we now show how we may compute weighted zone scores directly from inverted indexes. The algorithm of Figure 6.4 treats the case when the query q is a two-term query consisting of query terms q1 and q2, and the Boolean function is AND: 1 if both query terms are present in a zone and 0 otherwise. Following the description of the algorithm, we describe the extension to more complex queries and Boolean functions.

    ZONESCORE(q1, q2)
      float scores[N] = [0]
      constant g[ℓ]
      p1 ← postings(q1)
      p2 ← postings(q2)
      // scores[] is an array with a score entry for each document, initialized to zero.
      // p1 and p2 are initialized to point to the beginning of their respective postings.
      // Assume g[] is initialized to the respective zone weights.
      while p1 ≠ NIL and p2 ≠ NIL
        do if docID(p1) = docID(p2)
             then scores[docID(p1)] ← WEIGHTEDZONE(p1, p2, g)
                  p1 ← next(p1)
                  p2 ← next(p2)
             else if docID(p1) < docID(p2)
                    then p1 ← next(p1)
                    else p2 ← next(p2)
      return scores

◮ Figure 6.4 Algorithm for computing the weighted zone score from two postings lists. Function WEIGHTEDZONE (not shown here) is assumed to compute the inner loop of Equation 6.1.

The reader may have noticed the close similarity between this algorithm and that in Figure 1.6. Indeed, they represent the same postings traversal, except that instead of merely adding a document to the set of results for a Boolean AND query, we now compute a score for each such document. Some literature refers to the array scores[] above as a set of accumulators. The reason for this will be clear as we consider more complex Boolean functions than the AND; thus we may assign a non-zero score to a document even if it does not contain all query terms.
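For concreteness, here is one possible Python transcription of Figure 6.4 (a sketch under assumed representations, not the book’s code): each posting is taken to be a (docID, zones) pair, and weighted_zone computes the inner sum of Equation (6.1) for the AND Boolean function.

    def weighted_zone(entry1, entry2, g):
        """Inner loop of Equation (6.1) with the AND Boolean function:
        si = 1 iff both query terms occur in zone i of this document."""
        return sum(w for zone, w in g.items()
                   if zone in entry1[1] and zone in entry2[1])

    def zone_score(postings1, postings2, g, N):
        scores = [0.0] * N            # one accumulator per document
        i, j = 0, 0
        while i < len(postings1) and j < len(postings2):
            d1, d2 = postings1[i][0], postings2[j][0]
            if d1 == d2:              # both terms occur in this document
                scores[d1] = weighted_zone(postings1[i], postings2[j], g)
                i += 1; j += 1
            elif d1 < d2:
                i += 1
            else:
                j += 1
        return scores

    g = {"author": 0.2, "title": 0.3, "body": 0.5}
    p1 = [(2, {"title", "body"}), (5, {"author"})]   # postings for q1
    p2 = [(2, {"body"}), (7, {"title"})]             # postings for q2
    print(zone_score(p1, p2, g, N=10))  # doc 2 scores 0.5 (both terms in body)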
6.1.2 Learning weights

How do we determine the weights gi for weighted zone scoring? These weights could be specified by an expert (or, in principle, the user); but increasingly, these weights are “learned” using training examples that have been judged editorially. This latter methodology falls under a general class of approaches to scoring and ranking in information retrieval, known as machine-learned relevance.
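Although the development of learned weights continues beyond this excerpt, a toy sketch conveys the flavor: given editorially judged examples, pick the weight that minimizes the squared error between the weighted zone score and the relevance judgment. The two-zone setup (score = g·s_title + (1 − g)·s_body) and all data below are invented for illustration.

    # Each training example: (s_title, s_body, relevance judgment in {0, 1}).
    examples = [(1, 1, 1), (1, 0, 1), (0, 1, 0), (1, 0, 1), (0, 1, 1), (0, 0, 0)]

    def squared_error(g, examples):
        return sum((g * st + (1 - g) * sb - r) ** 2 for st, sb, r in examples)

    # Simple grid search for the g in [0, 1] minimizing total squared error.
    best_g = min((i / 100 for i in range(101)),
                 key=lambda g: squared_error(g, examples))
    print(round(best_g, 2))  # 0.75: title matches predict relevance better here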