An Introduction to Information Retrieval. Manning, Raghavan (2009)
These are based on raw term frequency only and are normalized as if these were the only terms in the collection. (Since affection and jealous occur in all three documents, their tf-idf weight would be 0 in most formulations.) Of course, there are many other terms occurring in each of these novels.
In this example we represent each of these novels as a unit vector in three dimensions, corresponding to these three terms (only); we use raw term frequencies here, with no idf multiplier. The resulting weights are as shown in Figure 6.13. Now consider the cosine similarities between pairs of the resulting three-dimensional vectors. A simple computation shows that sim(v(SaS), v(PaP)) is 0.999, whereas sim(v(SaS), v(WH)) is 0.888; thus, the two books authored by Austen (SaS and PaP) are considerably closer to each other than to Brontë's Wuthering Heights. In fact, the similarity between the first two is almost perfect (when restricted to the three terms we consider). Here we have considered tf weights, but we could of course use other term weight functions.
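To make the computation concrete, here is a minimal Python sketch of this similarity calculation. The raw term frequencies on the axes (affection, jealous, gossip) are those of Figure 6.12, reproduced here as an assumption since the figure itself is not shown in this excerpt; length-normalizing each count vector and taking dot products gives the similarities quoted above.

import math

# Raw term frequencies on the axes (affection, jealous, gossip), as given in
# Figure 6.12 of the book; the exact counts are reproduced here as assumptions.
tf = {
    "SaS": [115, 10, 2],   # Sense and Sensibility
    "PaP": [58, 7, 0],     # Pride and Prejudice
    "WH":  [20, 11, 6],    # Wuthering Heights
}

def unit(v):
    """Euclidean (length) normalization: divide each component by the L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def sim(u, v):
    """Cosine similarity; for unit vectors this is simply the dot product."""
    return sum(a * b for a, b in zip(unit(u), unit(v)))

print(round(sim(tf["SaS"], tf["PaP"]), 3))  # 0.999
print(round(sim(tf["SaS"], tf["WH"]), 3))   # 0.889 (the text's 0.888 uses the rounded weights of Figure 6.13)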
Viewing a collection of N documents as a collection of vectors leads to a natural view of a collection as a term-document matrix: this is an M × N matrix whose rows represent the M terms (dimensions) of the N columns, each of which corresponds to a document. As always, the terms being indexed could be stemmed before indexing; for instance, jealous and jealousy would under stemming be considered as a single dimension. This matrix view will prove to be useful in Chapter 18.
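As a small illustration of this matrix view, the sketch below assembles an M × N term-document matrix from per-document term frequency dictionaries; the document names and counts are illustrative assumptions (the same toy counts as in the sketch above), and a term missing from a document simply contributes a zero entry.

# Rows are the M terms (dimensions), columns are the N documents; a missing
# entry means the term does not occur in that document. Counts are illustrative.
docs = {
    "SaS": {"affection": 115, "jealous": 10, "gossip": 2},
    "PaP": {"affection": 58, "jealous": 7},
    "WH":  {"affection": 20, "jealous": 11, "gossip": 6},
}

terms = sorted({t for counts in docs.values() for t in counts})  # the M row labels
doc_ids = list(docs)                                             # the N column labels

# term_doc[i][j] holds the tf of terms[i] in doc_ids[j]
term_doc = [[docs[d].get(t, 0) for d in doc_ids] for t in terms]

for t, row in zip(terms, term_doc):
    print(f"{t:10s} {row}")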
6.3.2 Queries as vectors

There is a far more compelling reason to represent documents as vectors: we can also view a query as a vector. Consider the query q = jealous gossip. This query turns into the unit vector v(q) = (0, 0.707, 0.707) on the three coordinates of Figures 6.12 and 6.13. The key idea now: to assign to each document d a score equal to the dot product v(q) · v(d). In the example of Figure 6.13, Wuthering Heights is the top-scoring document for this query with a score of 0.509, with Pride and Prejudice a distant second with a score of 0.085, and Sense and Sensibility last with a score of 0.074. This simple example is somewhat misleading: the number of dimensions in practice will be far larger than three: it will equal the vocabulary size M. To summarize, by viewing a query as a "bag of words", we are able to treat it as a very short document.
As a consequence, we can use the cosine similarity between the query vector and a document vector as a measure of the score of the document for that query. The resulting scores can then be used to select the top-scoring documents for a query.
Thus we have

(6.12)    score(q, d) = \frac{\vec{V}(q) \cdot \vec{V}(d)}{|\vec{V}(q)|\,|\vec{V}(d)|}

A document may have a high cosine score for a query even if it does not contain all query terms. Note that the preceding discussion does not hinge on any specific weighting of terms in the document vector, although for the present we may think of them as either tf or tf-idf weights. In fact, a number of weighting schemes are possible for query as well as document vectors, as illustrated in Example 6.4 and developed further in Section 6.4.

Computing the cosine similarities between the query vector and each document vector in the collection, sorting the resulting scores and selecting the top K documents can be expensive: a single similarity computation can entail a dot product in tens of thousands of dimensions, demanding tens of thousands of arithmetic operations.
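The following is a brute-force rendering of (6.12) as a hedged Python sketch: it length-normalizes the query vector and every document vector and takes dot products, exactly the per-document work whose cost motivates the inverted-index techniques of Section 7.1. The document vectors reuse the raw tf counts assumed earlier for the three novels.

import math

def unit(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v

def score(q, d):
    """Eq. (6.12): cosine of the angle between the query and document vectors."""
    return sum(a * b for a, b in zip(unit(q), unit(d)))

# Query "jealous gossip" on the (affection, jealous, gossip) axes; the document
# vectors reuse the raw tf counts assumed above for the three novels.
query = [0, 1, 1]            # normalizes to (0, 0.707, 0.707)
docs = {"SaS": [115, 10, 2], "PaP": [58, 7, 0], "WH": [20, 11, 6]}

for name, vec in docs.items():
    print(name, round(score(query, vec), 3))
# SaS 0.073, PaP 0.085, WH 0.509 (the text's 0.074 for SaS comes from rounded weights)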
In Section 7.1 we study how to use an inverted index for this purpose, followed by a series of heuristics for improving on this.

Example 6.4: We now consider the query best car insurance on a fictitious collection with N = 1,000,000 documents where the document frequencies of auto, best, car and insurance are respectively 5000, 50000, 10000 and 1000.

term        query                          document              product
            tf   df      idf   w_{t,q}     tf   wf   w_{t,d}
auto        0    5000    2.3   0           1    1    0.41        0
best        1    50000   1.3   1.3         0    0    0           0
car         1    10000   2.0   2.0         1    1    0.41        0.82
insurance   1    1000    3.0   3.0         2    2    0.82        2.46

In this example the weight of a term in the query is simply the idf (and zero for a term not in the query, such as auto); this is reflected in the column headed w_{t,q} (the entry for auto is zero because the query does not contain the term auto). For documents, we use tf weighting with no use of idf but with Euclidean normalization. The former is shown under the column headed wf, while the latter is shown under the column headed w_{t,d}. Invoking (6.9) now gives a net score of 0 + 0 + 0.82 + 2.46 = 3.28.
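The arithmetic of Example 6.4 can be checked with a short sketch. The weighting choices follow the example (idf weights on the query side, Euclidean-normalized raw tf on the document side, no idf for documents), and the document's term frequencies are read off the table above.

import math

N = 1_000_000
df       = {"auto": 5000, "best": 50000, "car": 10000, "insurance": 1000}
query_tf = {"auto": 0,    "best": 1,     "car": 1,     "insurance": 1}
doc_tf   = {"auto": 1,    "best": 0,     "car": 1,     "insurance": 2}   # the fictitious document

# Query side: w_{t,q} = idf_t for terms that occur in the query, 0 otherwise.
w_q = {t: (math.log10(N / df[t]) if query_tf[t] else 0.0) for t in df}

# Document side: raw tf with Euclidean (length) normalization, no idf.
doc_len = math.sqrt(sum(v * v for v in doc_tf.values()))
w_d = {t: doc_tf[t] / doc_len for t in df}

print({t: round(w, 2) for t, w in w_d.items()})     # {'auto': 0.41, 'best': 0.0, 'car': 0.41, 'insurance': 0.82}
print(round(sum(w_q[t] * w_d[t] for t in df), 2))   # 3.27; the table's 3.28 sums the rounded products 0.82 + 2.46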
6.3.3 Computing vector scores

In a typical setting we have a collection of documents each represented by a vector, a free text query represented by a vector, and a positive integer K.
We seek the K documents of the collection with the highest vector space scores on the given query.

COSINESCORE(q)
1  float Scores[N] = 0
2  Initialize Length[N]
3  for each query term t
4  do calculate w_{t,q} and fetch postings list for t
5     for each pair (d, tf_{t,d}) in postings list
6     do Scores[d] += wf_{t,d} × w_{t,q}
7  Read the array Length[d]
8  for each d
9  do Scores[d] = Scores[d]/Length[d]
10 return Top K components of Scores[]

Figure 6.14  The basic algorithm for computing vector space scores.
We now initiate the study of determining the K documents with the highest vector space scores for a query. Typically, we seek these K top documents ordered by decreasing score; for instance, many search engines use K = 10 to retrieve and rank-order the first page of the ten best results.
Here we give the basic algorithm for this computation; we develop a fuller treatment of efficient techniques and approximations in Chapter 7. Figure 6.14 gives the basic algorithm for computing vector space scores. The array Length holds the lengths (normalization factors) for each of the N documents, whereas the array Scores holds the scores for each of the documents. When the scores are finally computed in Step 9, all that remains in Step 10 is to pick off the K documents with the highest scores. The outermost loop beginning Step 3 repeats the updating of Scores, iterating over each query term t in turn. In Step 4 we calculate the weight in the query vector for term t.
Steps 5-6 update the score of each document by adding in the contribution from term t. This process of adding in contributions one query term at a time is sometimes known as term-at-a-time scoring or accumulation, and the N elements of the array Scores are therefore known as accumulators.
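A minimal Python sketch of this term-at-a-time accumulation, following the structure of Figure 6.14. The toy postings lists, the document count, the Length values, and the choice of idf for w_{t,q} and raw tf for wf_{t,d} are all illustrative assumptions, not the book's data.

import heapq
import math

# Toy inverted index: term -> postings list of (docID, tf_{t,d}) pairs. The index
# contents, document count and Length values are illustrative assumptions.
postings = {
    "best":      [(1, 1), (3, 2)],
    "car":       [(1, 1), (2, 3)],
    "insurance": [(2, 1), (3, 2)],
}
N = 4                                   # docIDs 0..N-1
Length = [1.0, 1.7, 3.2, 2.8]           # per-document normalization factors (assumed)

def cosine_score(query_terms, K):
    scores = [0.0] * N                            # Step 1: one accumulator per document
    for t in query_terms:                         # Step 3: term-at-a-time loop
        plist = postings.get(t, [])
        w_tq = math.log10(N / len(plist)) if plist else 0.0   # Step 4: here w_{t,q} = idf_t
        for d, tf in plist:                       # Steps 5-6: add this term's contribution
            scores[d] += tf * w_tq                #   wf_{t,d} x w_{t,q}, with wf = raw tf
    for d in range(N):                            # Steps 8-9: length-normalize
        scores[d] /= Length[d]
    return heapq.nlargest(K, enumerate(scores), key=lambda x: x[1])   # Step 10: top K via a heap

print(cosine_score(["best", "car", "insurance"], K=2))   # [(docID, score), ...]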
For this purpose, it would appear necessary to store, with each postings entry, the weight wf_{t,d} of term t in document d (we have thus far used either tf or tf-idf for this weight, but leave open the possibility of other functions to be developed in Section 6.4). In fact this is wasteful, since storing this weight may require a floating point number. Two ideas help alleviate this space problem. First, if we are using inverse document frequency, we need not precompute idf_t; it suffices to store N/df_t at the head of the postings for t. Second, we store the term frequency tf_{t,d} for each postings entry. Finally, Step 10 extracts the top K scores; this requires a priority queue data structure, often implemented using a heap.
Such a heap takes no more than 2N comparisons to construct, following which each of the K top scores can be extracted from the heap at a cost of O(log N) comparisons.

Note that the general algorithm of Figure 6.14 does not prescribe a specific implementation of how we traverse the postings lists of the various query terms; we may traverse them one term at a time as in the loop beginning at Step 3, or we could in fact traverse them concurrently as in Figure 1.6. In such a concurrent postings traversal we compute the scores of one document at a time, so that it is sometimes called document-at-a-time scoring. We will say more about this in Section 7.1.5.
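For contrast, here is a hedged sketch of document-at-a-time scoring over the same toy index as above: contributions are grouped by document so that each document's complete score is produced before the next document is considered. (A production system would instead advance a cursor through each postings list in docID order; the grouping below is only a simplified stand-in for that concurrent traversal.)

import math
from itertools import groupby

# Same toy index as in the term-at-a-time sketch above (illustrative assumptions).
postings = {"best": [(1, 1), (3, 2)], "car": [(1, 1), (2, 3)], "insurance": [(2, 1), (3, 2)]}
N, Length = 4, [1.0, 1.7, 3.2, 2.8]

def document_at_a_time(query_terms, K):
    # Collect every (docID, contribution) pair for the query, then group by docID
    # so that each document is scored completely before the next one is touched.
    entries = []
    for t in query_terms:
        plist = postings.get(t, [])
        w_tq = math.log10(N / len(plist)) if plist else 0.0
        entries.extend((d, tf * w_tq) for d, tf in plist)
    entries.sort(key=lambda e: e[0])
    scored = [(d, sum(c for _, c in group) / Length[d])
              for d, group in groupby(entries, key=lambda e: e[0])]
    return sorted(scored, key=lambda e: e[1], reverse=True)[:K]

print(document_at_a_time(["best", "car", "insurance"], K=2))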
Exercise 6.14
If we were to stem jealous and jealousy to a common stem before setting up the vector space, detail how the definitions of tf and idf should be modified.

Exercise 6.15
Recall the tf-idf weights computed in Exercise 6.10.