An Introduction to Information Retrieval. Manning, Raghavan (2009)
10) and (Korfhage 1997), and collections focused on cognitive aspects (Spink and Cole 2005).

Online edition (c) 2009 Cambridge UP
DRAFT! © April 1, 2009 Cambridge University Press. Feedback welcome.

9 Relevance feedback and query expansion

In most collections, the same concept may be referred to using different words. This issue, known as synonymy, has an impact on the recall of most information retrieval systems. For example, you would want a search for aircraft to match plane (but only for references to an airplane, not a woodworking plane), and for a search on thermodynamics to match references to heat in appropriate discussions.
Users often attempt to address this problem themselves by manually refining a query, as was discussed in Section 1.4; in this chapter we discuss ways in which a system can help with query refinement, either fully automatically or with the user in the loop.

The methods for tackling this problem split into two major classes: global methods and local methods. Global methods are techniques for expanding or reformulating query terms independent of the query and results returned from it, so that changes in the query wording will cause the new query to match other semantically similar terms. Global methods include:

• Query expansion/reformulation with a thesaurus or WordNet (Section 9.2.2)
• Query expansion via automatic thesaurus generation (Section 9.2.3)
• Techniques like spelling correction (discussed in Chapter 3)

Local methods adjust a query relative to the documents that initially appear to match the query.
The basic methods here are:

• Relevance feedback (Section 9.1)
• Pseudo relevance feedback, also known as blind relevance feedback (Section 9.1.6)
• (Global) indirect relevance feedback (Section 9.1.7)

In this chapter, we will mention all of these approaches, but we will concentrate on relevance feedback, which is one of the most used and most successful approaches.

9.1 Relevance feedback and pseudo relevance feedback

The idea of relevance feedback (RF) is to involve the user in the retrieval process so as to improve the final result set. In particular, the user gives feedback on the relevance of documents in an initial set of results. The basic procedure is:

• The user issues a (short, simple) query.
• The system returns an initial set of retrieval results.
• The user marks some returned documents as relevant or nonrelevant.
• The system computes a better representation of the information need based on the user feedback.
• The system displays a revised set of retrieval results.

Relevance feedback can go through one or more iterations of this sort.
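The procedure above can be sketched in a few lines of Python. The term-overlap scorer and the tiny document set below are invented stand-ins for a real ranked retrieval system, and the user's relevance judgments are simulated; the point is only to show the shape of the feedback loop, not any particular system's implementation.

```python
# Sketch of the relevance feedback loop with a toy term-overlap ranker
# (hypothetical example, not the book's system).

def score(query_terms, doc):
    # Count how many query terms occur in the document.
    return len(query_terms & set(doc.split()))

def retrieve(query_terms, docs, k=3):
    # Return the top-k documents by score (Python's sort is stable).
    return sorted(docs, key=lambda d: score(query_terms, d), reverse=True)[:k]

docs = [
    "space satellite launch nasa",
    "satellite broadcast television",
    "woodworking plane tools",
    "nasa space probe instrument",
]

# Step 1-2: the user issues a short query; the system returns results.
query = {"space", "satellite"}
results = retrieve(query, docs)

# Step 3-4: the user marks some results as relevant (simulated here);
# the system builds a better query representation by folding in terms
# from the judged-relevant documents.
relevant = [results[0], docs[3]]
for d in relevant:
    query |= set(d.split())

# Step 5: the system displays a revised result set.
revised = retrieve(query, docs)
print(revised)
```

Each pass through the loop is one iteration of relevance feedback; a real system would reweight terms rather than simply adding them, as the Rocchio algorithm in Section 9.1.1 does.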
The process exploits the idea that it may be difficult to formulate a good query when you don’t know the collection well, but it is easy to judge particular documents, and so it makes sense to engage in iterative query refinement of this sort. In such a scenario, relevance feedback can also be effective in tracking a user’s evolving information need: seeing some documents may lead users to refine their understanding of the information they are seeking.

Image search provides a good example of relevance feedback.
Not only is it easy to see the results at work, but this is a domain where a user can easily have difficulty formulating what they want in words, but can easily indicate relevant or nonrelevant images. After the user enters an initial query for bike on the demonstration system at http://nayana.ece.ucsb.edu/imsearch/imsearch.html, the initial results (in this case, images) are returned.
In Figure 9.1 (a), the user has selected some of them as relevant. These will be used to refine the query, while other displayed results have no effect on the reformulation. Figure 9.1 (b) then shows the new top-ranked results calculated after this round of relevance feedback.

Figure 9.2 shows a textual IR example where the user wishes to find out about new applications of space satellites.

9.1.1 The Rocchio algorithm for relevance feedback

The Rocchio algorithm is the classic algorithm for implementing relevance feedback.
It models a way of incorporating relevance feedback information into the vector space model of Section 6.3.

◮ Figure 9.1 Relevance feedback searching over images. (a) The user views the initial query results for a query of bike, selects the first, third and fourth result in the top row and the fourth result in the bottom row as relevant, and submits this feedback. (b) The user sees the revised result set. Precision is greatly improved. From http://nayana.ece.ucsb.edu/imsearch/imsearch.html (Newsam et al. 2001).

◮ Figure 9.2 Example of relevance feedback on a text collection. (a) The initial query. (b) The user marks some relevant documents (shown with a plus sign). (c) The query is then expanded by 18 terms with weights as shown. (d) The revised top results are then shown. A * marks the documents which were judged relevant in the relevance feedback phase.

(a) Query: New space satellite applications

(b) Initial top results:
1. 0.539, 08/13/91, NASA Hasn’t Scrapped Imaging Spectrometer
2. 0.533, 07/09/91, NASA Scratches Environment Gear From Satellite Plan
3. 0.528, 04/04/90, Science Panel Backs NASA Satellite Plan, But Urges Launches of Smaller Probes
4. 0.526, 09/09/91, A NASA Satellite Project Accomplishes Incredible Feat: Staying Within Budget
5. 0.525, 07/24/90, Scientist Who Exposed Global Warming Proposes Satellites for Climate Research
6. 0.524, 08/22/90, Report Provides Support for the Critics Of Using Big Satellites to Study Climate
7. 0.516, 04/13/87, Arianespace Receives Satellite Launch Pact From Telesat Canada
8. 0.509, 12/02/87, Telecommunications Tale of Two Companies

(c) Expansion terms and weights:
 2.074 new          15.106 space
30.816 satellite     5.660 application
 5.991 nasa          5.196 eos
 4.196 launch        3.972 aster
 3.516 instrument    3.446 arianespace
 3.004 bundespost    2.806 ss
 2.790 rocket        2.053 scientist
 2.003 broadcast     1.172 earth
 0.836 oil           0.646 measure

(d) Revised top results:
1. 0.513, 07/09/91, NASA Scratches Environment Gear From Satellite Plan
2. 0.500, 08/13/91, NASA Hasn’t Scrapped Imaging Spectrometer
3. 0.493, 08/07/89, When the Pentagon Launches a Secret Satellite, Space Sleuths Do Some Spy Work of Their Own
4. 0.493, 07/31/89, NASA Uses ‘Warm’ Superconductors For Fast Circuit
5. 0.492, 12/02/87, Telecommunications Tale of Two Companies
6. 0.491, 07/09/91, Soviets May Adapt Parts of SS-20 Missile For Commercial Use
7. 0.490, 07/12/88, Gaping Gap: Pentagon Lags in Race To Match the Soviets In Rocket Launchers
8. 0.490, 06/14/90, Rescue of Satellite By Space Agency To Cost $90 Million

◮ Figure 9.3 The Rocchio optimal query for separating relevant and nonrelevant documents.

The underlying theory. We want to find a query vector, denoted as ~q, that maximizes similarity with relevant documents while minimizing similarity with nonrelevant documents.
If C_r is the set of relevant documents and C_nr is the set of nonrelevant documents, then we wish to find:¹

(9.1)  \vec{q}_{opt} = \arg\max_{\vec{q}} \left[ \mathrm{sim}(\vec{q}, C_r) - \mathrm{sim}(\vec{q}, C_{nr}) \right]

where sim is defined as in Equation 6.10. Under cosine similarity, the optimal query vector ~q_opt for separating the relevant and nonrelevant documents is:

(9.2)  \vec{q}_{opt} = \frac{1}{|C_r|} \sum_{\vec{d}_j \in C_r} \vec{d}_j - \frac{1}{|C_{nr}|} \sum_{\vec{d}_j \in C_{nr}} \vec{d}_j

That is, the optimal query is the vector difference between the centroids of the relevant and nonrelevant documents; see Figure 9.3. However, this observation is not terribly useful, precisely because the full set of relevant documents is not known: it is what we want to find.

¹ In the equation, arg max_x f(x) returns a value of x which maximizes the value of the function f(x). Similarly, arg min_x f(x) returns a value of x which minimizes the value of the function f(x).

◮ Figure 9.4 An application of Rocchio’s algorithm. Some documents have been labeled as relevant and nonrelevant and the initial query vector is moved in response to this feedback.

The Rocchio (1971) algorithm. This was the relevance feedback mechanism introduced in and popularized by Salton’s SMART system around 1970. In a real IR query context, we have a user query and partial knowledge of known relevant and nonrelevant documents.
The algorithm proposes using the modified query ~q_m:

(9.3)  \vec{q}_m = \alpha \vec{q}_0 + \beta \frac{1}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j - \gamma \frac{1}{|D_{nr}|} \sum_{\vec{d}_j \in D_{nr}} \vec{d}_j

where ~q_0 is the original query vector, D_r and D_nr are the set of known relevant and nonrelevant documents respectively, and α, β, and γ are weights attached to each term. These control the balance between trusting the judged document set versus the query: if we have a lot of judged documents, we would like a higher β and γ. Starting from ~q_0, the new query moves you some distance toward the centroid of the relevant documents and some distance away from the centroid of the nonrelevant documents.
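Equation (9.3) translates almost line for line into code. The sketch below uses plain Python lists as term-weight vectors; the three-term vocabulary and the document weights are invented for illustration, and negative term weights are set to zero, as the Rocchio algorithm prescribes.

```python
# Rocchio relevance feedback (Eq. 9.3): move the query toward the centroid
# of the judged-relevant documents and away from the centroid of the
# judged-nonrelevant documents.

def rocchio(q0, d_rel, d_nonrel, alpha=1.0, beta=0.75, gamma=0.15):
    qm = []
    for i in range(len(q0)):
        # Centroid components of the relevant and nonrelevant sets.
        pos = sum(d[i] for d in d_rel) / len(d_rel) if d_rel else 0.0
        neg = sum(d[i] for d in d_nonrel) / len(d_nonrel) if d_nonrel else 0.0
        w = alpha * q0[i] + beta * pos - gamma * neg
        qm.append(max(w, 0.0))  # negative term weights are ignored (set to 0)
    return qm

# Hypothetical three-term example: the query weights only term 0; the
# relevant documents also weight term 2; the nonrelevant one weights term 1.
q0 = [1.0, 0.0, 0.0]
d_rel = [[0.5, 0.0, 1.0], [0.7, 0.0, 0.8]]
d_nonrel = [[0.0, 1.0, 0.0]]
result = rocchio(q0, d_rel, d_nonrel)
print(result)
```

Note how term 1, which appears only in the nonrelevant document, would receive a negative weight and is therefore clipped to zero, while term 2 enters the query purely through the relevant-document centroid.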
This new query can be used for retrieval in the standard vector space model (see Section 6.3). We can easily leave the positive quadrant of the vector space by subtracting off a nonrelevant document’s vector. In the Rocchio algorithm, negative term weights are ignored. That is, the term weight is set to 0. Figure 9.4 shows the effect of applying relevance feedback.

Relevance feedback can improve both recall and precision. But, in practice, it has been shown to be most useful for increasing recall in situations where recall is important. This is partly because the technique expands the query, but it is also partly an effect of the use case: when they want high recall, users can be expected to take time to review results and to iterate on the search. Positive feedback also turns out to be much more valuable than negative feedback, and so most IR systems set γ < β. Reasonable values might be α = 1, β = 0.75, and γ = 0.15. In fact, many systems, such as the image search system in Figure 9.1, allow only positive feedback, which is equivalent to setting γ = 0. Another alternative is to use only the marked nonrelevant document which received the highest ranking from the IR system as negative feedback (here, |D_nr| = 1 in Equation (9.3)). While many of the experimental results comparing various relevance feedback variants are rather inconclusive, some studies have suggested that this variant, called Ide dec-hi, is the most effective or at least the most consistent performer.

9.1.2 Probabilistic relevance feedback

Rather than reweighting the query in a vector space, if a user has told us some relevant and nonrelevant documents, then we can proceed to build a classifier.