2. Determine a guess for the size of the relevant document set. If unsure, a conservative (too small) guess is likely to be best. This motivates use of a fixed-size set V of highest ranked documents.

3. Improve our guesses for $p_t$ and $u_t$. We choose from the methods of Equations (11.23) and (11.25) for re-estimating $p_t$, except now based on the set V instead of VR. If we let $V_t$ be the subset of documents in V containing $x_t$ and use add-$\frac{1}{2}$ smoothing, we get:

(11.26)   $p_t = \frac{|V_t| + \frac{1}{2}}{|V| + 1}$

and if we assume that documents that are not retrieved are nonrelevant, then we can update our $u_t$ estimates as:

(11.27)   $u_t = \frac{df_t - |V_t| + \frac{1}{2}}{N - |V| + 1}$

(A short code sketch of this re-estimation step follows the list.)

4. Go to step 2 until the ranking of the returned results converges.
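As a minimal sketch of one iteration of this loop, assuming Python and a toy in-memory layout (the container names and shapes are ours, not from the text):

```python
from math import log

def reestimate(query_terms, top_docs, df, N):
    """One pass of pseudo-relevance feedback re-estimation.

    top_docs: the fixed-size set V of highest ranked documents, each
              represented as a set of its terms (a hypothetical layout).
    df:       dict term -> document frequency; N: collection size.
    Returns add-1/2 smoothed estimates per Equations (11.26) and (11.27).
    """
    V = len(top_docs)
    p, u = {}, {}
    for t in query_terms:
        Vt = sum(1 for d in top_docs if t in d)   # |V_t|
        p[t] = (Vt + 0.5) / (V + 1)               # Equation (11.26)
        u[t] = (df[t] - Vt + 0.5) / (N - V + 1)   # Equation (11.27)
    return p, u

def ct_weights(p, u):
    """Term weights c_t = log(p_t/(1-p_t)) + log((1-u_t)/u_t), as in
    Equation (11.18), used to re-rank before the next iteration."""
    return {t: log(p[t] / (1 - p[t])) + log((1 - u[t]) / u[t]) for t in p}
```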
Once we have a real estimate for $p_t$, the $c_t$ weights used in the RSV value look almost like a tf-idf value. For instance, using Equation (11.18), Equation (11.22), and Equation (11.26), we have:

(11.28)   $c_t = \log \left[ \frac{p_t}{1 - p_t} \cdot \frac{1 - u_t}{u_t} \right] \approx \log \left[ \frac{|V_t| + \frac{1}{2}}{|V| - |V_t| + 1} \cdot \frac{N}{df_t} \right]$

But things aren't quite the same: $p_t/(1 - p_t)$ measures the (estimated) proportion of relevant documents that the term $t$ occurs in, not term frequency. Moreover, if we apply log identities:

(11.29)   $c_t = \log \frac{|V_t| + \frac{1}{2}}{|V| - |V_t| + 1} + \log \frac{N}{df_t}$

we see that we are now adding the two log-scaled components rather than multiplying them.

Exercise 11.1
Work through the derivation of Equation (11.20) from Equations (11.18) and (11.19).

Exercise 11.2
What are the differences between standard vector space tf-idf weighting and the BIM probabilistic retrieval model (in the case where no document relevance information is available)?

Exercise 11.3   [⋆⋆]
Let $X_t$ be a random variable indicating whether the term $t$ appears in a document. Suppose we have $|R|$ relevant documents in the document collection and that $X_t = 1$ in $s$ of the documents.
Take the observed data to be just these observations of $X_t$ for each document in $R$. Show that the MLE for the parameter $p_t = P(X_t = 1 \mid R = 1, \vec{q})$, that is, the value for $p_t$ which maximizes the probability of the observed data, is $p_t = s/|R|$.

Exercise 11.4
Describe the differences between vector space relevance feedback and probabilistic relevance feedback.

11.4 An appraisal and some extensions

11.4.1 An appraisal of probabilistic models

Probabilistic methods are one of the oldest formal models in IR. Already in the 1970s they were held out as an opportunity to place IR on a firmer theoretical footing, and with the resurgence of probabilistic methods in computational linguistics in the 1990s, that hope has returned; probabilistic methods are again among the hottest topics in IR.
Traditionally, probabilistic IR has had neat ideas, but the methods have never won on performance. Getting reasonable approximations of the needed probabilities for a probabilistic IR model is possible, but it requires some major assumptions. In the BIM these are:

• a Boolean representation of documents/queries/relevance
• term independence
• terms not in the query don't affect the outcome
• document relevance values are independent

It is perhaps the severity of the modeling assumptions that makes achieving good performance difficult. A general problem seems to be that probabilistic models either require partial relevance information or else only allow for deriving apparently inferior term weighting models.

Things started to change in the 1990s when the BM25 weighting scheme, which we discuss in the next section, showed very good performance and started to be adopted as a term weighting scheme by many groups.
The difference between “vector space” and “probabilistic” IR systems is not that great: in either case, you build an information retrieval scheme in the exact same way that we discussed in Chapter 7. For a probabilistic IR system, it's just that, at the end, you score queries not by cosine similarity and tf-idf in a vector space, but by a slightly different formula motivated by probability theory. Indeed, sometimes people have changed an existing vector-space IR system into an effectively probabilistic system simply by adopting term weighting formulas from probabilistic models. In this section, we briefly present three extensions of the traditional probabilistic model, and in the next chapter, we look at the somewhat different probabilistic language modeling approach to IR.

11.4.2 Tree-structured dependencies between terms

Some of the assumptions of the BIM can be removed.
For example, we can remove the assumption that terms are independent. This assumption is very far from true in practice. A case that particularly violates this assumption is term pairs like Hong and Kong, which are strongly dependent. But dependencies can occur in various complex configurations, such as between the set of terms New, York, England, City, Stock, Exchange, and University. van Rijsbergen (1979) proposed a simple, plausible model which allowed a tree structure of term dependencies, as in Figure 11.1. In this model each term can be directly dependent on only one other term, giving a tree structure of dependencies. When it was invented in the 1970s, estimation problems held back the practical success of this model, but the idea was reinvented as the Tree Augmented Naive Bayes model by Friedman and Goldszmidt (1996), who used it with some success on various machine learning data sets.

Figure 11.1  A tree of dependencies between terms. In this graphical model representation, a term $x_i$ is directly dependent on a term $x_k$ if there is an arrow $x_k \rightarrow x_i$.
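To make the factorization concrete, here is a minimal sketch (ours, not van Rijsbergen's estimation procedure) of how a document's probability decomposes over such a tree. The parent pointers mirror the shape of Figure 11.1, and the probability tables are assumed to be given:

```python
from math import log

# Hypothetical parent-pointer encoding of a dependency tree shaped like
# Figure 11.1: each term depends on at most one other term; the root's
# parent is None. Term names and all probabilities here are illustrative.
parent = {"x1": None, "x2": "x1", "x3": "x2", "x4": "x2",
          "x5": "x1", "x6": "x5", "x7": "x5"}

def doc_log_prob(terms_present, parent, p_root, p_cond):
    """log P(x_1, ..., x_n) = log P(x_root) + sum_i log P(x_i | x_parent(i)).

    p_root: P(X_root = 1).
    p_cond: dict mapping (term, parent_is_present) to P(X_term = 1 | parent);
            in a real system these tables would be estimated from data.
    """
    lp = 0.0
    for t, par in parent.items():
        p1 = p_root if par is None else p_cond[(t, par in terms_present)]
        lp += log(p1) if t in terms_present else log(1.0 - p1)
    return lp
```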
11.4.3 Okapi BM25: a non-binary model

The BIM was originally designed for short catalog records and abstracts of fairly consistent length, and it works reasonably in these contexts, but for modern full-text search collections, it seems clear that a model should pay attention to term frequency and document length, as in Chapter 6.
The BM25 weighting scheme, often called Okapi weighting, after the system in which it was first implemented, was developed as a way of building a probabilistic model sensitive to these quantities while not introducing too many additional parameters into the model (Spärck Jones et al. 2000). We will not develop the full theory behind the model here, but just present a series of forms that build up to the standard form now used for document scoring. The simplest score for document $d$ is just idf weighting of the query terms present, as in Equation (11.22):

(11.30)   $RSV_d = \sum_{t \in q} \log \frac{N}{df_t}$

Sometimes, an alternative version of idf is used. If we start with the formula in Equation (11.21) but in the absence of relevance feedback information estimate that $S = s = 0$, then we get an alternative idf formulation as follows:

(11.31)   $RSV_d = \sum_{t \in q} \log \frac{N - df_t + \frac{1}{2}}{df_t + \frac{1}{2}}$

This variant behaves slightly strangely: if a term occurs in over half the documents in the collection, then this model gives a negative term weight, which is presumably undesirable. But, assuming the use of a stop list, this normally doesn't happen, and the value for each summand can be given a floor of 0.
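As a small sketch (the function and variable names are ours), the two idf variants and the suggested floor:

```python
from math import log

def idf(t, N, df):
    """Plain idf of Equation (11.30)."""
    return log(N / df[t])

def idf_alt(t, N, df):
    """Alternative idf of Equation (11.31), from (11.21) with S = s = 0.
    It goes negative when df_t > N/2, so we floor each summand at 0."""
    return max(0.0, log((N - df[t] + 0.5) / (df[t] + 0.5)))
```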
We can improve on Equation (11.30) by factoring in the frequency of each term and document length:

(11.32)   $RSV_d = \sum_{t \in q} \log \frac{N}{df_t} \cdot \frac{(k_1 + 1) \, tf_{td}}{k_1 ((1-b) + b \times (L_d / L_{ave})) + tf_{td}}$

Here, $tf_{td}$ is the frequency of term $t$ in document $d$, and $L_d$ and $L_{ave}$ are the length of document $d$ and the average document length for the whole collection.
The variable $k_1$ is a positive tuning parameter that calibrates the document term frequency scaling. A $k_1$ value of 0 corresponds to a binary model (no term frequency), and a large value corresponds to using raw term frequency. $b$ is another tuning parameter ($0 \le b \le 1$) which determines the scaling by document length: $b = 1$ corresponds to fully scaling the term weight by the document length, while $b = 0$ corresponds to no length normalization.

If the query is long, then we might also use similar weighting for query terms. This is appropriate if the queries are paragraph-long information needs, but unnecessary for short queries:

(11.33)   $RSV_d = \sum_{t \in q} \log \frac{N}{df_t} \cdot \frac{(k_1 + 1) \, tf_{td}}{k_1 ((1-b) + b \times (L_d / L_{ave})) + tf_{td}} \cdot \frac{(k_3 + 1) \, tf_{tq}}{k_3 + tf_{tq}}$

with $tf_{tq}$ being the frequency of term $t$ in the query $q$, and $k_3$ being another positive tuning parameter that this time calibrates term frequency scaling of the query.
In the equation presented, there is no length normalization of queries (it is as if $b = 0$ here). Length normalization of the query is unnecessary because retrieval is being done with respect to a single fixed query. The tuning parameters of these formulas should ideally be set to optimize performance on a development test collection (see page 153). That is, we can search for values of these parameters that maximize performance on a separate development test collection (either manually or with optimization methods such as grid search or something more advanced), and then use these parameters on the actual test collection. In the absence of such optimization, experiments have shown that reasonable values are to set $k_1$ and $k_3$ to a value between 1.2 and 2, and $b = 0.75$.

If we have relevance judgments available, then we can use the full form of (11.21) in place of the approximation $\log(N/df_t)$ introduced in (11.22):

(11.34)   $RSV_d = \sum_{t \in q} \log \left[ \frac{(|VR_t| + \frac{1}{2}) / (|VNR_t| + \frac{1}{2})}{(df_t - |VR_t| + \frac{1}{2}) / (N - df_t - |VR| + |VR_t| + \frac{1}{2})} \right] \times \frac{(k_1 + 1) \, tf_{td}}{k_1 ((1-b) + b (L_d / L_{ave})) + tf_{td}} \times \frac{(k_3 + 1) \, tf_{tq}}{k_3 + tf_{tq}}$

Here, $VR_t$, $VNR_t$, and $VR$ are used as in Section 11.3.4. The first part of the expression reflects relevance feedback (or just idf weighting if no relevance information is available), the second implements document term frequency and document length scaling, and the third considers term frequency in the query.
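Since the tuning parameters $k_1$, $b$, and $k_3$ are set empirically, the grid search mentioned above can be sketched in a few lines; the `evaluate` callback here is hypothetical and stands in for running the ranking over the development queries:

```python
import itertools

def grid_search_bm25(evaluate, k1_grid=(1.2, 1.5, 1.8, 2.0),
                     b_grid=(0.5, 0.75, 1.0)):
    """Pick (k1, b) on a development test collection by exhaustive search.

    `evaluate` is a hypothetical callback: given (k1, b), it runs retrieval
    over the development queries and returns an effectiveness score such as
    MAP. The winning parameters are then frozen for the real test collection.
    """
    return max(itertools.product(k1_grid, b_grid),
               key=lambda params: evaluate(*params))
```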
Rather than just providing a term weighting method for terms in a user's query, relevance feedback can also involve augmenting the query (automatically or with manual review) with some (say, 10–20) of the top terms in the known-relevant documents as ordered by the relevance factor $\hat{c}_t$ from Equation (11.21), and the above formula can then be used with such an augmented query vector $\vec{q}$.

The BM25 term weighting formulas have been used quite widely and quite successfully across a range of collections and search tasks.
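Putting the pieces together, here is a compact sketch of the scoring in Equation (11.33); the function name, argument layout, and toy index representation are ours, and the defaults follow the parameter ranges suggested above:

```python
from math import log

def bm25_score(query, doc_tf, doc_len, avg_len, N, df,
               k1=1.5, b=0.75, k3=1.5):
    """Score one document for a query, following Equation (11.33).

    query:   list of query terms (duplicates give the query term frequency)
    doc_tf:  dict term -> frequency of the term in this document
    doc_len: length of this document; avg_len: average document length
    N:       number of documents in the collection
    df:      dict term -> document frequency
    """
    length_norm = k1 * ((1 - b) + b * (doc_len / avg_len))
    score = 0.0
    for t in set(query):
        if df.get(t, 0) == 0:
            continue                                  # term unseen in collection
        tf_td = doc_tf.get(t, 0)
        tf_tq = query.count(t)
        idf = log(N / df[t])                          # as in Equation (11.30)
        doc_part = (k1 + 1) * tf_td / (length_norm + tf_td)
        query_part = (k3 + 1) * tf_tq / (k3 + tf_tq)  # matters only for long queries
        score += idf * doc_part * query_part
    return score
```

In a full system, df, avg_len, and the per-document term frequencies would be precomputed from the inverted index rather than passed in as dictionaries, and the query-frequency factor could be dropped for short queries.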