An Introduction to Information Retrieval. Manning, Raghavan & Schütze (2009)
Intuitively, we want the few documents that contain insurance to get a higher boost for a query on insurance than the many documents containing try get from a query on try. How is the document frequency df of a term used to scale its weight? Denoting as usual the total number of documents in a collection by N, we define the inverse document frequency (idf) of a term t as follows:

(6.7)   \mathrm{idf}_t = \log \frac{N}{\mathrm{df}_t}.

Thus the idf of a rare term is high, whereas the idf of a frequent term is likely to be low. Figure 6.8 gives an example of idf's in the Reuters collection of 806,791 documents; in this example logarithms are to the base 10.
In fact, as we will see in Exercise 6.12, the precise base of the logarithm is not material to ranking. We will give on page 227 a justification of the particular form in Equation (6.7).

term        df_t      idf_t
car         18,165    1.65
auto         6,723    2.08
insurance   19,241    1.62
best        25,235    1.5

◮ Figure 6.8 Example of idf values. Here we give the idf's of terms with various frequencies in the Reuters collection of 806,791 documents.
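To make Equation (6.7) concrete, here is a minimal Python sketch (our illustration, not the book's) that recomputes the idf values of Figure 6.8 from the document frequencies shown there:

```python
import math

N = 806_791  # total number of documents in the Reuters collection (Figure 6.8)

# Document frequencies df_t, copied from Figure 6.8.
df = {"car": 18_165, "auto": 6_723, "insurance": 19_241, "best": 25_235}

# Equation (6.7) with base-10 logarithms, as used in the figure.
idf = {t: math.log10(N / df_t) for t, df_t in df.items()}

for term, weight in idf.items():
    print(f"{term:10s} idf = {weight:.2f}")
# Prints roughly 1.65 (car), 2.08 (auto), 1.62 (insurance), 1.50 (best),
# matching Figure 6.8.
```

Note that the rarest of the four terms, auto, receives the highest idf.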
6.2.2 Tf-idf weighting

We now combine the definitions of term frequency and inverse document frequency to produce a composite weight for each term in each document.

TF-IDF
The tf-idf weighting scheme assigns to term t a weight in document d given by

(6.8)   \mbox{tf-idf}_{t,d} = \mathrm{tf}_{t,d} \times \mathrm{idf}_t.

In other words, tf-idf_{t,d} assigns to term t a weight in document d that is

1. highest when t occurs many times within a small number of documents (thus lending high discriminating power to those documents);
2. lower when the term occurs fewer times in a document, or occurs in many documents (thus offering a less pronounced relevance signal);
3. lowest when the term occurs in virtually all documents.

DOCUMENT VECTOR
At this point, we may view each document as a vector with one component corresponding to each term in the dictionary, together with a weight for each component that is given by (6.8).
For dictionary terms that do not occur in a document, this weight is zero. This vector form will prove to be crucial to scoring and ranking; we will develop these ideas in Section 6.3. As a first step, we introduce the overlap score measure: the score of a document d is the sum, over all query terms, of the number of times each of the query terms occurs in d. We can refine this idea so that we add up not the number of occurrences of each query term t in d, but instead the tf-idf weight of each term in d:

(6.9)   \mathrm{Score}(q, d) = \sum_{t \in q} \mbox{tf-idf}_{t,d}.

In Section 6.3 we will develop a more rigorous form of Equation (6.9).
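As a sketch of how Equations (6.7) through (6.9) fit together, the following Python fragment scores a tiny collection; the three documents and the query are invented stand-ins for illustration, not examples from the book:

```python
import math
from collections import Counter

# A toy collection; the documents are invented for illustration.
docs = {
    "d1": "new home sales top forecasts",
    "d2": "home sales rise in july",
    "d3": "increase in home sales in july",
}
N = len(docs)

# Term frequencies tf_{t,d} and document frequencies df_t.
tf = {d: Counter(text.split()) for d, text in docs.items()}
df = Counter(t for counts in tf.values() for t in counts)

# Equation (6.7): idf_t = log(N / df_t); the base is immaterial to ranking.
idf = {t: math.log10(N / n) for t, n in df.items()}

def score(query: str, d: str) -> float:
    """Equation (6.9): sum the tf-idf weights (6.8) of the query terms in d."""
    return sum(tf[d][t] * idf.get(t, 0.0) for t in query.split())

for d in docs:
    print(d, round(score("home sales july", d), 3))
# Terms occurring in every document (home, sales) get idf 0 and so
# contribute nothing to the score -- compare Exercise 6.9 on stop words.
```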
? Exercise 6.8
Why is the idf of a term always finite?

Exercise 6.9
What is the idf of a term that occurs in every document? Compare this with the use of stop word lists.

            car   auto   insurance   best
Doc1         27      3           0     14
Doc2          4     33          33      0
Doc3         24      0          29     17

◮ Figure 6.9 Table of tf values for Exercise 6.10.

Exercise 6.10
Consider the table of term frequencies for 3 documents denoted Doc1, Doc2, Doc3 in Figure 6.9. Compute the tf-idf weights for the terms car, auto, insurance, best, for each document, using the idf values from Figure 6.8.

Exercise 6.11
Can the tf-idf weight of a term in a document exceed 1?

Exercise 6.12
How does the base of the logarithm in (6.7) affect the score calculation in (6.9)? How does the base of the logarithm affect the relative scores of two documents on a given query?

Exercise 6.13
If the logarithm in (6.7) is computed base 2, suggest a simple approximation to the idf of a term.

6.3 The vector space model for scoring

VECTOR SPACE MODEL
In Section 6.2 (page 117) we developed the notion of a document vector that captures the relative importance of the terms in a document.
The representation of a set of documents as vectors in a common vector space is known as the vector space model and is fundamental to a host of information retrieval operations, ranging from scoring documents on a query to document classification and document clustering. We first develop the basic ideas underlying vector space scoring; a pivotal step in this development is the view (Section 6.3.2) of queries as vectors in the same vector space as the document collection.

6.3.1 Dot products

We denote by \vec{V}(d) the vector derived from document d, with one component in the vector for each dictionary term.
Unless otherwise specified, the reader may assume that the components are computed using the tf-idf weighting scheme, although the particular weighting scheme is immaterial to the discussion that follows. The set of documents in a collection then may be viewed as a set of vectors in a vector space, in which there is one axis for each term. This representation loses the relative ordering of the terms in each document; recall our example from Section 6.2 (page 117), where we pointed out that the documents Mary is quicker than John and John is quicker than Mary are identical in such a bag of words representation.

◮ Figure 6.10 Cosine similarity illustrated: the query vector \vec{v}(q) and document vectors \vec{v}(d_1), \vec{v}(d_2), \vec{v}(d_3) plotted in the plane spanned by the axes jealous and gossip; \mathrm{sim}(d_1, d_2) = \cos \theta.

How do we quantify the similarity between two documents in this vector space? A first attempt might consider the magnitude of the vector difference between two document vectors.
This measure suffers from a drawback: two documents with very similar content can have a significant vector difference simply because one is much longer than the other. Thus the relative distributions of terms may be identical in the two documents, but the absolute term frequencies of one may be far larger.

COSINE SIMILARITY
To compensate for the effect of document length, the standard way of quantifying the similarity between two documents d_1 and d_2 is to compute the cosine similarity of their vector representations \vec{V}(d_1) and \vec{V}(d_2):

(6.10)   \mathrm{sim}(d_1, d_2) = \frac{\vec{V}(d_1) \cdot \vec{V}(d_2)}{|\vec{V}(d_1)|\,|\vec{V}(d_2)|},

DOT PRODUCT
where the numerator represents the dot product (also known as the inner product) of the vectors \vec{V}(d_1) and \vec{V}(d_2), while the denominator is the product of their Euclidean lengths. The dot product \vec{x} \cdot \vec{y} of two vectors is defined as \sum_{i=1}^{M} x_i y_i.
EUCLIDEAN LENGTH
Let \vec{V}(d) denote the document vector for d, with M components \vec{V}_1(d), \ldots, \vec{V}_M(d). The Euclidean length of d is defined to be \sqrt{\sum_{i=1}^{M} \vec{V}_i^2(d)}.

LENGTH NORMALIZATION
The effect of the denominator of Equation (6.10) is thus to length-normalize the vectors \vec{V}(d_1) and \vec{V}(d_2) to unit vectors \vec{v}(d_1) = \vec{V}(d_1)/|\vec{V}(d_1)| and \vec{v}(d_2) = \vec{V}(d_2)/|\vec{V}(d_2)|.

            car    auto   insurance   best
Doc1        0.88   0.10   0           0.46
Doc2        0.09   0.71   0.71        0
Doc3        0.58   0      0.70        0.41

◮ Figure 6.11 Euclidean normalized tf values for documents in Figure 6.9.
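The arithmetic of length normalization is easy to check in code. Here is a small Python sketch (ours, not the book's) that normalizes the raw tf vectors of Figure 6.9 and recovers the values of Figure 6.11 up to rounding, then computes cosine similarity as a dot product of the unit vectors just defined:

```python
import math

# Raw tf values from Figure 6.9, in the term order car, auto, insurance, best.
tf = {
    "Doc1": [27, 3, 0, 14],
    "Doc2": [4, 33, 33, 0],
    "Doc3": [24, 0, 29, 17],
}

def euclidean_length(v):
    return math.sqrt(sum(x * x for x in v))

def unit(v):
    """Length-normalize v, as the denominator of Equation (6.10) does."""
    length = euclidean_length(v)
    return [x / length for x in v]

def cosine(u, v):
    # Cosine similarity as a dot product of unit vectors.
    return sum(a * b for a, b in zip(unit(u), unit(v)))

for d, v in tf.items():
    print(d, round(euclidean_length(v), 2), [round(x, 2) for x in unit(v)])
# Lengths 30.56, 46.84, 41.30 (as in Example 6.2 below); the unit vectors
# agree with Figure 6.11 up to rounding.

print(round(cosine(tf["Doc1"], tf["Doc2"]), 3))  # ~0.145: Doc1, Doc2 overlap little
```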
We can then rewrite (6.10) as

(6.11)   \mathrm{sim}(d_1, d_2) = \vec{v}(d_1) \cdot \vec{v}(d_2).

✎ Example 6.2: Consider the documents in Figure 6.9. We now apply Euclidean normalization to the tf values from the table, for each of the three documents in the table. The quantity \sqrt{\sum_{i=1}^{M} \vec{V}_i^2(d)} has the values 30.56, 46.84 and 41.30 respectively for Doc1, Doc2 and Doc3. The resulting Euclidean normalized tf values for these documents are shown in Figure 6.11.

Thus, (6.11) can be viewed as the dot product of the normalized versions of the two document vectors. This measure is the cosine of the angle θ between the two vectors, shown in Figure 6.10. What use is the similarity measure sim(d_1, d_2)? Given a document d (potentially one of the d_i in the collection), consider searching for the documents in the collection most similar to d. Such a search is useful in a system where a user may identify a document and seek others like it – a feature available in the results lists of search engines as a more like this feature.
We reduce the problem of finding the document(s) most similar to d to that of finding the d_i with the highest dot products (sim values) \vec{v}(d) \cdot \vec{v}(d_i). We could do this by computing the dot products between \vec{v}(d) and each of \vec{v}(d_1), \ldots, \vec{v}(d_N), then picking off the highest resulting sim values.

✎ Example 6.3: Figure 6.12 shows the number of occurrences of three terms (affection, jealous and gossip) in each of the following three novels: Jane Austen's Sense and Sensibility (SaS) and Pride and Prejudice (PaP) and Emily Brontë's Wuthering Heights (WH).

        affection   jealous   gossip
SaS           115        10        2
PaP            58         7        0
WH             20        11        6

◮ Figure 6.12 Term frequencies in three novels. The novels are Austen's Sense and Sensibility, Pride and Prejudice and Brontë's Wuthering Heights.

        affection   jealous   gossip
SaS         0.996     0.087    0.017
PaP         0.993     0.120    0
WH          0.847     0.466    0.254

◮ Figure 6.13 Term vectors for the three novels of Figure 6.12.
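To close the loop on Example 6.3, here is a short Python sketch (ours) that derives the normalized term vectors of Figure 6.13 from the raw counts in Figure 6.12 and compares the novels pairwise:

```python
import math

# Term frequencies from Figure 6.12 (terms: affection, jealous, gossip).
tf = {"SaS": [115, 10, 2], "PaP": [58, 7, 0], "WH": [20, 11, 6]}

def unit(v):
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

# Length-normalized vectors; printing them reproduces Figure 6.13.
vec = {novel: unit(v) for novel, v in tf.items()}
for novel, v in vec.items():
    print(novel, [round(x, 3) for x in v])

def sim(a, b):
    # Equation (6.11): cosine similarity as a dot product of unit vectors.
    return sum(x * y for x, y in zip(vec[a], vec[b]))

print(round(sim("SaS", "PaP"), 3))  # ~0.999
print(round(sim("SaS", "WH"), 3))   # ~0.889
# On these three terms, the two Austen novels are far closer to each
# other than either is to Wuthering Heights.
```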