An Introduction to Information Retrieval — Manning, Raghavan, Schütze (2009)
Compute the Euclidean normalized document vectors for each of the documents, where each vector has four components, one for each of the four terms.

Exercise 6.16
Verify that the sum of the squares of the components of each of the document vectors in Exercise 6.15 is 1 (to within rounding error). Why is this the case?

Exercise 6.17
With term weights as computed in Exercise 6.15, rank the three documents by computed score for the query car insurance, for each of the following cases of term weighting in the query:

1. The weight of a term is 1 if present in the query, 0 otherwise.
2. Euclidean normalized idf.
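As an illustration of the computation in Exercises 6.15 and 6.16, here is a minimal Python sketch of Euclidean normalization; the weights below are hypothetical placeholders, not the tf-idf values the exercises refer to.

    import math

    def euclidean_normalize(weights):
        # Divide every component by the vector's Euclidean (L2) length,
        # so that the resulting vector has length 1.
        length = math.sqrt(sum(w * w for w in weights))
        return [w / length for w in weights] if length > 0 else list(weights)

    # Hypothetical tf-idf weights of one document over four terms.
    doc = [1.3, 0.0, 2.0, 0.5]
    unit_doc = euclidean_normalize(doc)
    print(unit_doc)
    # Exercise 6.16: the squares of the components sum to 1, because each
    # component was divided by the square root of that very sum of squares.
    print(sum(w * w for w in unit_doc))  # ~1.0, up to rounding error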
6.4 Variant tf-idf functions

For assigning a weight for each term in each document, a number of alternatives to tf and tf-idf have been considered. We discuss some of the principal ones here; a more complete development is deferred to Chapter 11. We will summarize these alternatives in Section 6.4.3 (page 128).

6.4.1 Sublinear tf scaling

It seems unlikely that twenty occurrences of a term in a document truly carry twenty times the significance of a single occurrence. Accordingly, there has been considerable research into variants of term frequency that go beyond counting the number of occurrences of a term. A common modification is to use instead the logarithm of the term frequency, which assigns a weight given by

(6.13)
\[
\mathrm{wf}_{t,d} =
\begin{cases}
1 + \log \mathrm{tf}_{t,d} & \text{if } \mathrm{tf}_{t,d} > 0, \\
0 & \text{otherwise.}
\end{cases}
\]

In this form, we may replace tf by some other function wf as in (6.13), to obtain:

(6.14)
\[
\text{wf-idf}_{t,d} = \mathrm{wf}_{t,d} \times \mathrm{idf}_{t}.
\]

Equation (6.9) can then be modified by replacing tf-idf by wf-idf as defined in (6.14).
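A minimal Python sketch of sublinear tf scaling as in Equations (6.13) and (6.14); the function names and example counts are ours, and the base of the logarithm (natural log here) is not fixed by the text.

    import math

    def wf(tf):
        # Equation (6.13): 1 + log(tf) if tf > 0, and 0 otherwise.
        return 1.0 + math.log(tf) if tf > 0 else 0.0

    def wf_idf(tf, df, N):
        # Equation (6.14): wf-idf = wf(tf) * idf, with idf = log(N / df).
        return wf(tf) * math.log(N / df)

    # Twenty occurrences now carry roughly 4x the weight of one, not 20x.
    print(wf(1), wf(20))                       # 1.0 vs. ~4.0
    print(wf_idf(tf=20, df=100, N=1_000_000))  # sublinear tf times idf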
6.4.2 Maximum tf normalization

One well-studied technique is to normalize the tf weights of all terms occurring in a document by the maximum tf in that document. For each document d, let tf_max(d) = max_{τ∈d} tf_{τ,d}, where τ ranges over all terms in d. Then, we compute a normalized term frequency for each term t in document d by

(6.15)
\[
\mathrm{ntf}_{t,d} = a + (1 - a)\,\frac{\mathrm{tf}_{t,d}}{\mathrm{tf}_{\max}(d)},
\]

where a is a value between 0 and 1 and is generally set to 0.4, although some early work used the value 0.5. The term a in (6.15) is a smoothing term whose role is to damp the contribution of the second term, which may be viewed as a scaling down of tf by the largest tf value in d.
We will encounter smoothing further in Chapter 13 when discussing classification; the basic idea is to avoid a large swing in ntf_{t,d} from modest changes in tf_{t,d} (say from 1 to 2). The main idea of maximum tf normalization is to mitigate the following anomaly: we observe higher term frequencies in longer documents, merely because longer documents tend to repeat the same words over and over again. To appreciate this, consider the following extreme example: suppose we were to take a document d and create a new document d′ by simply appending a copy of d to itself. While d′ should be no more relevant to any query than d is, the use of (6.9) would assign it twice as high a score as d. Replacing tf-idf_{t,d} in (6.9) by ntf-idf_{t,d} eliminates the anomaly in this example.
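A minimal Python sketch of maximum tf normalization, Equation (6.15); the smoothing value a = 0.4 follows the text, while the toy counts and dictionary representation are ours.

    def ntf(tf_counts, a=0.4):
        # Equation (6.15): ntf = a + (1 - a) * tf / tf_max(d).
        tf_max = max(tf_counts.values())
        return {t: a + (1 - a) * tf / tf_max for t, tf in tf_counts.items()}

    # Appending a copy of d to itself doubles every tf, but leaves ntf
    # unchanged, because tf_max(d) doubles as well.
    d  = {"car": 3, "insurance": 1, "best": 2}
    d2 = {t: 2 * tf for t, tf in d.items()}
    print(ntf(d))
    print(ntf(d) == ntf(d2))  # True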
Maximum tf normalization does suffer from the following issues:

1. The method is unstable in the following sense: a change in the stop word list can dramatically alter term weightings (and therefore ranking). Thus, it is hard to tune.

2. A document may contain an outlier term with an unusually large number of occurrences of that term, not representative of the content of that document.

3. More generally, a document in which the most frequent term appears roughly as often as many other terms should be treated differently from one with a more skewed distribution.

Term frequency
  n (natural)        tf_{t,d}
  l (logarithm)      1 + log(tf_{t,d})
  a (augmented)      0.5 + (0.5 × tf_{t,d}) / max_t(tf_{t,d})
  b (boolean)        1 if tf_{t,d} > 0, 0 otherwise
  L (log ave)        (1 + log(tf_{t,d})) / (1 + log(ave_{t∈d}(tf_{t,d})))

Document frequency
  n (no)             1
  t (idf)            log(N / df_t)
  p (prob idf)       max{0, log((N − df_t) / df_t)}

Normalization
  n (none)           1
  c (cosine)         1 / sqrt(w_1^2 + w_2^2 + ... + w_M^2)
  u (pivoted unique) 1/u (Section 6.4.4)
  b (byte size)      1/CharLength^α, α < 1

◮ Figure 6.15 SMART notation for tf-idf variants. Here CharLength is the number of characters in the document.

6.4.3 Document and query weighting schemes

Equation (6.12) is fundamental to information retrieval systems that use any form of vector space scoring. Variations from one vector space scoring method to another hinge on the specific choices of weights in the vectors V(d) and V(q). Figure 6.15 lists some of the principal weighting schemes in use for each of V(d) and V(q), together with a mnemonic for representing a specific combination of weights; this system of mnemonics is sometimes called SMART notation, following the authors of an early text retrieval system.
The mnemonic for representing a combination of weights takes the form ddd.qqq, where the first triplet gives the term weighting of the document vector, while the second triplet gives the weighting in the query vector. The first letter in each triplet specifies the term frequency component of the weighting, the second the document frequency component, and the third the form of normalization used. It is quite common to apply different normalization functions to V(d) and V(q). For example, a very standard weighting scheme is lnc.ltc, where the document vector has log-weighted term frequency, no idf (for both effectiveness and efficiency reasons), and cosine normalization, while the query vector uses log-weighted term frequency, idf weighting, and cosine normalization.
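To make the ddd.qqq mnemonics concrete, here is a minimal Python sketch of lnc.ltc scoring over a toy in-memory collection; the data structures and helper names are ours, not part of the SMART system, and natural logarithms are used throughout.

    import math
    from collections import Counter

    def cosine_normalize(weights):
        # The 'c' component: divide all weights by the vector's Euclidean length.
        length = math.sqrt(sum(w * w for w in weights.values()))
        return {t: w / length for t, w in weights.items()} if length else weights

    def lnc(doc_tokens):
        # Document weighting 'lnc': log tf, no idf, cosine normalization.
        tf = Counter(doc_tokens)
        return cosine_normalize({t: 1 + math.log(c) for t, c in tf.items()})

    def ltc(query_tokens, df, N):
        # Query weighting 'ltc': log tf, idf, cosine normalization.
        tf = Counter(query_tokens)
        w = {t: (1 + math.log(c)) * math.log(N / df[t])
             for t, c in tf.items() if t in df}
        return cosine_normalize(w)

    def score(doc_vec, query_vec):
        # Dot product of the two normalized vectors (cosine similarity).
        return sum(w * doc_vec.get(t, 0.0) for t, w in query_vec.items())

    # Toy collection.
    docs = ["car insurance best rates".split(),
            "auto insurance quotes".split(),
            "best car deals".split()]
    N = len(docs)
    df = Counter(t for d in docs for t in set(d))

    doc_vecs = [lnc(d) for d in docs]
    q = ltc("car insurance".split(), df, N)
    for i, dv in enumerate(doc_vecs):
        print(i, round(score(dv, q), 3))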
6.4.4 Pivoted normalized document length

In Section 6.3.1 we normalized each document vector by the Euclidean length of the vector, so that all document vectors turned into unit vectors. In doing so, we eliminated all information on the length of the original document; this masks some subtleties about longer documents. First, longer documents will – as a result of containing more terms – have higher tf values. Second, longer documents contain more distinct terms. These factors can conspire to raise the scores of longer documents, which (at least for some information needs) is unnatural. Longer documents can broadly be lumped into two categories: (1) verbose documents that essentially repeat the same content – in these, the length of the document does not alter the relative weights of different terms; (2) documents covering multiple different topics, in which the search terms probably match small segments of the document but not all of it – in this case, the relative weights of terms are quite different from a single short document that matches the query terms.
Compensating for this phenomenon is a form of document length normalization that is independent of term and document frequencies. To this end, we introduce a form of normalizing the vector representations of documents in the collection, so that the resulting “normalized” documents are not necessarily of unit length. Then, when we compute the dot product score between a (unit) query vector and such a normalized document, the score is skewed to account for the effect of document length on relevance.
This form of compensation for document length is known as pivoted document length normalization.

Consider a document collection together with an ensemble of queries for that collection. Suppose that we were given, for each query q and for each document d, a Boolean judgment of whether or not d is relevant to the query q; in Chapter 8 we will see how to procure such a set of relevance judgments for a query ensemble and a document collection. Given this set of relevance judgments, we may compute a probability of relevance as a function of document length, averaged over all queries in the ensemble.
The resulting plot may look like the curve drawn in thick lines in Figure 6.16. To compute this curve, we bucket documents by length and compute the fraction of relevant documents in each bucket, then plot this fraction against the median document length of each bucket. (Thus even though the “curve” in Figure 6.16 appears to be continuous, it is in fact a histogram of discrete buckets of document length.)

On the other hand, the curve in thin lines shows what might happen with the same documents and query ensemble if we were to use relevance as prescribed by the cosine normalization of Equation (6.12) – thus, cosine normalization has a tendency to distort the computed relevance vis-à-vis the true relevance, at the expense of longer documents. The thin and thick curves cross over at a point p corresponding to document length ℓ_p, which we refer to as the pivot length; dashed lines mark this point on the x- and y-axes.

◮ Figure 6.16 Pivoted document length normalization. [Plot of relevance against document length; the thick and thin curves cross at the pivot p, at document length ℓ_p.]
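A minimal Python sketch of how the thick curve could be computed from pooled relevance judgments; the bucketing scheme and the toy data are hypothetical.

    import statistics

    def relevance_vs_length(judgments, num_buckets=10):
        # judgments: (doc_length, is_relevant) pairs pooled over all queries.
        # Returns one (median_length, fraction_relevant) point per bucket of
        # documents, ordered by document length.
        ordered = sorted(judgments)
        size = max(1, len(ordered) // num_buckets)
        points = []
        for i in range(0, len(ordered), size):
            bucket = ordered[i:i + size]
            median_len = statistics.median(length for length, _ in bucket)
            frac_rel = sum(rel for _, rel in bucket) / len(bucket)
            points.append((median_len, frac_rel))
        return points

    # Hypothetical pooled judgments: (document length in tokens, 1 if relevant).
    judgments = [(120, 0), (150, 1), (300, 0), (450, 1), (800, 1),
                 (900, 0), (1200, 1), (2500, 1), (3000, 0), (4000, 1)]
    print(relevance_vs_length(judgments, num_buckets=5))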
The idea of pivoted document length normalization would then be to “rotate” the cosine normalization curve counter-clockwise about p so that it more closely matches the thick line representing the relevance vs. document length curve. As mentioned at the beginning of this section, we do so by using in Equation (6.12) a normalization factor for each document vector V(d) that is not the Euclidean length of that vector, but instead one that is larger than the Euclidean length for documents of length less than ℓ_p, and smaller for longer documents.

To this end, we first note that the normalizing term for V(d) in the denominator of Equation (6.12) is its Euclidean length, denoted |V(d)|. In the simplest implementation of pivoted document length normalization, we use a normalization factor in the denominator that is linear in |V(d)|, but one of slope < 1 as in Figure 6.17.
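As a hedged illustration of such a factor (the precise parameterization, developed around Figure 6.17 in the remainder of the section, is not reproduced here), one linear form with the stated properties is a·|V(d)| + (1 − a)·piv with slope a < 1: it coincides with the Euclidean length at the pivot value piv, exceeds it for shorter documents, and falls below it for longer ones. A minimal Python sketch under that assumption:

    import math

    def pivoted_norm_factor(doc_vec, piv, a=0.75):
        # A linear factor with slope a < 1 that equals the Euclidean length
        # exactly at the pivot value piv; larger than |V(d)| below the pivot,
        # smaller above it. The form and a = 0.75 are illustrative only.
        euclidean = math.sqrt(sum(w * w for w in doc_vec.values()))
        return a * euclidean + (1 - a) * piv

    def pivoted_score(query_vec, doc_vec, piv, a=0.75):
        # Dot product of the unnormalized document vector with the (unit) query
        # vector, divided by the pivoted factor instead of the Euclidean length.
        dot = sum(w * doc_vec.get(t, 0.0) for t, w in query_vec.items())
        return dot / pivoted_norm_factor(doc_vec, piv, a)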