Especially in the TREC evaluations, they performed well and were widely adopted by many groups. See Spärck Jones et al. (2000) for extensive motivation and discussion of experimental results.

11.4.4 Bayesian network approaches to IR

Turtle and Croft (1989; 1991) introduced into information retrieval the use of Bayesian networks (Jensen and Jensen 2001), a form of probabilistic graphical model. We skip the details because fully introducing the formalism of Bayesian networks would require much too much space, but conceptually, Bayesian networks use directed graphs to show probabilistic dependencies between variables, as in Figure 11.1, and have led to the development of sophisticated algorithms for propagating influence so as to allow learning and inference with arbitrary knowledge within arbitrary directed acyclic graphs.

Turtle and Croft used a sophisticated network to better model the complex dependencies between a document and a user's information need. The model decomposes into two parts: a document collection network and a query network.
The document collection network is large, but can be precomputed: it maps from documents to terms to concepts. The concepts are a thesaurus-based expansion of the terms appearing in the document. The query network is relatively small, but a new network needs to be built each time a query comes in, and then attached to the document network. The query network maps from query terms, to query subexpressions (built using probabilistic or "noisy" versions of AND and OR operators), to the user's information need. The result is a flexible probabilistic network which can generalize various simpler Boolean and probabilistic models.
Indeed, this is the primary case of a statistical ranked retrieval model that naturally supports structured query operators. The system allowed efficient large-scale retrieval, and was the basis of the InQuery text retrieval system, built at the University of Massachusetts.
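The probabilistic or "noisy" AND and OR operators mentioned above can be illustrated with the standard noisy-AND/noisy-OR combinations of per-term beliefs under an independence assumption. The sketch below shows only this general idea, not the exact operator definitions used in the Turtle and Croft model or in InQuery; the function names and numbers are purely illustrative.

```python
from math import prod

def noisy_or(beliefs):
    # Soft OR: satisfied if at least one term "fires", assuming independence.
    return 1.0 - prod(1.0 - p for p in beliefs)

def noisy_and(beliefs):
    # Soft AND: all terms must "fire", assuming independence.
    return prod(beliefs)

# Illustrative per-term beliefs that a document matches each query term.
beliefs = [0.6, 0.3, 0.8]
print(noisy_or(beliefs))   # ~0.944
print(noisy_and(beliefs))  # ~0.144
```

Combinations like these let a structured query such as (t1 AND t2) OR t3 be evaluated bottom-up over per-term beliefs, which is the sense in which such a query network generalizes simpler Boolean retrieval.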
This system performed very well in TREC evaluations and for a time was sold commercially. On the other hand, the model still used various approximations and independence assumptions to make parameter estimation and computation possible. There has not been much follow-on work along these lines, but we would note that this model was built very early in the modern era of Bayesian networks; there have been many subsequent developments in the theory, and the time is perhaps right for a new generation of Bayesian network-based information retrieval systems.

11.5 References and further reading

Longer introductions to probability theory can be found in most introductory probability and statistics books, such as (Grinstead and Snell 1997, Rice 2006, Ross 2006).
An introduction to Bayesian utility theory can be found in (Ripley 1996).

The probabilistic approach to IR originated in the UK in the 1950s. The first major presentation of a probabilistic model is Maron and Kuhns (1960). Robertson and Jones (1976) introduce the main foundations of the BIM, and van Rijsbergen (1979) presents in detail the classic BIM probabilistic model. The idea of the PRP is variously attributed to S. E. Robertson, M. E. Maron and W. S. Cooper (the term "Probabilistic Ordering Principle" is used in Robertson and Jones (1976), but PRP dominates in later work). Fuhr (1992) is a more recent presentation of probabilistic IR, which includes coverage of other approaches such as probabilistic logics and Bayesian networks. Crestani et al. (1998) is another survey. Spärck Jones et al. (2000) is the definitive presentation of probabilistic IR experiments by the "London school", and Robertson (2005) presents a retrospective on the group's participation in TREC evaluations, including detailed discussion of the Okapi BM25 scoring function and its development.
Robertson et al. (2004) extend BM25 to the case of multiple weighted fields. The open-source Indri search engine, which is distributed with the Lemur toolkit (http://www.lemurproject.org/), merges ideas from Bayesian inference networks and statistical language modeling approaches (see Chapter 12), in particular preserving the former's support for structured query operators.

12 Language models for information retrieval

A common suggestion to users for coming up with good queries is to think of words that would likely appear in a relevant document, and to use those words as the query. The language modeling approach to IR directly models that idea: a document is a good match to a query if the document model is likely to generate the query, which will in turn happen if the document contains the query words often. This approach thus provides a different realization of some of the basic ideas for document ranking which we saw in Section 6.2 (page 117).
Instead of overtly modeling the probability P(R = 1|q, d) of relevance of a document d to a query q, as in the traditional probabilistic approach to IR (Chapter 11), the basic language modeling approach builds a probabilistic language model M_d from each document d, and ranks documents based on the probability of the model generating the query: P(q|M_d).
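As a preview of how this ranking criterion can be applied (the Query Likelihood Model and the estimation details, including the need for smoothing, are the topic of Section 12.2), here is a deliberately naive sketch that scores each document by the product of maximum-likelihood unigram estimates of the query terms. The documents, query, and function names are ours, purely for illustration.

```python
from collections import Counter

def query_likelihood(query_terms, doc_terms):
    """P(q | M_d) under a maximum-likelihood unigram model of the document.
    Deliberately naive: any query term absent from the document zeroes the
    score, which is the problem that smoothing (Section 12.2) addresses."""
    counts = Counter(doc_terms)
    n = len(doc_terms)
    p = 1.0
    for t in query_terms:
        p *= counts[t] / n
    return p

docs = {
    "d1": "frog said that toad likes frog".split(),
    "d2": "toad said that dog likes cat".split(),
}
query = "frog likes toad".split()
ranking = sorted(docs, key=lambda d: query_likelihood(query, docs[d]), reverse=True)
print(ranking)  # d1 ranks above d2, since d1 contains all three query terms
```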
In this chapter, we first introduce the concept of language models (Section 12.1) and then describe the basic and most commonly used language modeling approach to IR, the Query Likelihood Model (Section 12.2). After some comparisons between the language modeling approach and other approaches to IR (Section 12.3), we finish by briefly describing various extensions to the language modeling approach (Section 12.4).

12.1 Language models

12.1.1 Finite automata and language models

What do we mean by a document model generating a query? A traditional generative model of a language, of the kind familiar from formal language theory, can be used either to recognize or to generate strings. For example, the finite automaton shown in Figure 12.1 can generate strings that include the examples shown. The full set of strings that can be generated is called the language of the automaton.¹

◮ Figure 12.1 A simple finite automaton and some of the strings in the language it generates (for example: "I wish", "I wish I wish", "I wish I wish I wish", ...; it cannot generate "wish I wish"). → shows the start state of the automaton and a double circle indicates a (possible) finishing state.

◮ Figure 12.2 A one-state finite automaton that acts as a unigram language model: a single state q1 with stop probability P(STOP|q1) = 0.2 and a partial specification of the state emission probabilities (the 0.2, a 0.1, frog 0.01, toad 0.01, said 0.03, likes 0.02, that 0.04, ...).

If instead each node has a probability distribution over generating different terms, we have a language model. The notion of a language model is inherently probabilistic.
A language model is a function that puts a probability measure over strings drawn from some vocabulary. That is, for a language model M over an alphabet Σ:

(12.1)    ∑_{s ∈ Σ*} P(s) = 1

One simple kind of language model is equivalent to a probabilistic finite automaton consisting of just a single node with a single probability distribution over producing different terms, so that ∑_{t ∈ V} P(t) = 1, as shown in Figure 12.2. After generating each word, we decide whether to stop or to loop around and then produce another word, and so the model also requires a probability of stopping in the finishing state.
Such a model places a probability distribution over any sequence of words. By construction, it also provides a model for generating text according to its distribution.

¹ Finite automata can have outputs attached to either their states or their arcs; we use states here, because that maps directly on to the way probabilistic automata are usually formalized.

◮ Figure 12.3 Partial specification of two unigram language models.

    Model M1                 Model M2
    the      0.2             the      0.15
    a        0.1             a        0.12
    frog     0.01            frog     0.0002
    toad     0.01            toad     0.0001
    said     0.03            said     0.03
    likes    0.02            likes    0.04
    that     0.04            that     0.04
    dog      0.005           dog      0.01
    cat      0.003           cat      0.015
    monkey   0.001           monkey   0.002
    ...      ...             ...      ...
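As a toy illustration of generating text from such a model, the sketch below repeatedly samples a word from a unigram emission distribution and then decides whether to stop. It uses the partial M1 probabilities from Figure 12.3, with the remaining probability mass lumped into a single placeholder token; that lumping, and the function and variable names, are our own simplifications.

```python
import random

# Partial unigram model M1 from Figure 12.3; the leftover probability mass is
# assigned to an "<other>" pseudo-token so the emission probabilities sum to 1.
M1 = {"the": 0.2, "a": 0.1, "frog": 0.01, "toad": 0.01, "said": 0.03,
      "likes": 0.02, "that": 0.04, "dog": 0.005, "cat": 0.003, "monkey": 0.001}
M1["<other>"] = 1.0 - sum(M1.values())

P_STOP = 0.2  # probability of stopping after each word, as in Figure 12.2

def generate(model, p_stop=P_STOP):
    """Sample a word sequence from a unigram model with a stop probability."""
    terms, weights = zip(*model.items())
    words = []
    while True:
        words.append(random.choices(terms, weights=weights)[0])
        if random.random() < p_stop:  # decide to stop or continue after each word
            return " ".join(words)

print(generate(M1))
```

Each run yields a different string, and strings built from high-probability words such as "the" are produced far more often than ones containing "monkey" or "toad".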
For example,Example 12.1:P (frog said that toad likes frog)=≈LIKELIHOOD RATIO✎(0.01 × 0.03 × 0.04 × 0.01 × 0.02 × 0.01)×(0.8 × 0.8 × 0.8 × 0.8 × 0.8 × 0.8 × 0.2)0.000000000001573As you can see, the probability of a particular string/document, is usually a verysmall number! Here we stopped after generating frog the second time.
The first line of numbers gives the term emission probabilities, and the second line gives the probability of continuing or stopping after generating each word. An explicit stop probability is needed for a finite automaton to be a well-formed language model according to Equation (12.1). Nevertheless, most of the time, we will omit the STOP and (1 − STOP) probabilities (as do most other authors). To compare two models for a data set, we can calculate their likelihood ratio, which results from simply dividing the probability of the data according to one model by the probability of the data according to the other model.
Providing that the stop probability is fixed, its inclusion will not alter the likelihood ratio that results from comparing the likelihood of two language models generating a string. Hence, it will not alter the ranking of documents.² Nevertheless, formally, the numbers will no longer truly be probabilities, but only proportional to probabilities. See Exercise 12.4.

Example 12.2: Suppose, now, that we have two language models M1 and M2, shown partially in Figure 12.3.
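To make the computation of Example 12.1 and the likelihood-ratio comparison concrete, here is a minimal sketch using the partial probabilities of Figure 12.3 and the string from Example 12.1; the function name and the include_stop switch are ours.

```python
M1 = {"the": 0.2, "a": 0.1, "frog": 0.01, "toad": 0.01, "said": 0.03,
      "likes": 0.02, "that": 0.04, "dog": 0.005, "cat": 0.003, "monkey": 0.001}
M2 = {"the": 0.15, "a": 0.12, "frog": 0.0002, "toad": 0.0001, "said": 0.03,
      "likes": 0.04, "that": 0.04, "dog": 0.01, "cat": 0.015, "monkey": 0.002}

P_STOP = 0.2

def sequence_probability(words, model, p_stop=P_STOP, include_stop=True):
    """Multiply the emission probability of each word, optionally including the
    continue/stop factors as in Equation (12.2)."""
    p = 1.0
    for w in words:
        p *= model[w]
    if include_stop:
        p *= (1.0 - p_stop) ** (len(words) - 1) * p_stop
    return p

s = "frog said that toad likes frog".split()
p1 = sequence_probability(s, M1)   # ~1.57e-12, matching Example 12.1
p2 = sequence_probability(s, M2)
print(p1, p2, p1 / p2)             # p1 / p2 is the likelihood ratio of M1 to M2
```

Because the continue/stop factor depends only on the length of the string, it cancels in the ratio p1 / p2, which is why fixing the stop probability leaves the ranking unchanged, as noted above.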