An introduction to information retrieval. Manning_ Raghavan (2009) (811397), страница 14
Текст из файла (страница 14)
O’Rourke[ ⋆]Exercise 2.3The following pairs of words are stemmed to the same form by the Porter stemmer.Which pairs would you argue shouldn’t be conflated. Give your reasoning.a. abandon/abandonmentb. absorbency/absorbentc. marketing/marketsd. university/universee. volume/volumes[ ⋆]Exercise 2.4For the Porter stemmer rule group shown in (2.1):a. What is the purpose of including an identity rule such as SS → SS?b.
Applying just this rule group, what will the following words be stemmed to?circus canariesbossc. What rule should be added to correctly stem pony?d. The stemming for ponies and pony might seem strange. Does it have a deleteriouseffect on retrieval? Why or why not?Online edition (c) 2009 Cambridge UP362 The term vocabulary and postings lists◮ Figure 2.9 Postings lists with skip pointers. The postings intersection can use askip pointer when the end point is still less than the item on the other list.2.3SKIP LISTFaster postings list intersection via skip pointersIn the remainder of this chapter, we will discuss extensions to postings listdata structures and ways to increase the efficiency of using postings lists.
Recall the basic postings list intersection operation from Section 1.3 (page 10):we walk through the two postings lists simultaneously, in time linear in thetotal number of postings entries. If the list lengths are m and n, the intersection takes O(m + n) operations. Can we do better than this? That is, empirically, can we usually process postings list intersection in sublinear time? Wecan, if the index isn’t changing too fast.One way to do this is to use a skip list by augmenting postings lists withskip pointers (at indexing time), as shown in Figure 2.9. Skip pointers areeffectively shortcuts that allow us to avoid processing parts of the postingslist that will not figure in the search results. The two questions are then whereto place skip pointers and how to do efficient merging using skip pointers.Consider first efficient merging, with Figure 2.9 as an example.
Supposewe’ve stepped through the lists in the figure until we have matched 8 oneach list and moved it to the results list. We advance both pointers, giving us16 on the upper list and 41 on the lower list. The smallest item is then theelement 16 on the top list. Rather than simply advancing the upper pointer,we first check the skip list pointer and note that 28 is also less than 41. Hencewe can follow the skip list pointer, and then we advance the upper pointerto 28 . We thus avoid stepping to 19 and 23 on the upper list.
A numberof variant versions of postings list intersection with skip pointers is possibledepending on when exactly you check the skip pointer. One version is shownOnline edition (c) 2009 Cambridge UP2.3 Faster postings list intersection via skip pointers37I NTERSECT W ITH S KIPS ( p1 , p2 )1 answer ← h i2 while p1 6= NIL and p2 6= NIL3 do if docID ( p1) = docID ( p2)4then A DD ( answer, docID ( p1 ))5p1 ← next( p1 )6p2 ← next( p2 )7else if docID ( p1) < docID ( p2)8then if hasSkip( p1 ) and (docID (skip( p1)) ≤ docID ( p2 ))9then while hasSkip( p1 ) and (docID (skip( p1)) ≤ docID ( p2))10do p1 ← skip( p1 )11else p1 ← next( p1 )12else if hasSkip( p2 ) and (docID (skip( p2)) ≤ docID ( p1 ))13then while hasSkip( p2 ) and (docID (skip( p2)) ≤ docID ( p1))14do p2 ← skip( p2 )15else p2 ← next( p2 )16 return answer◮ Figure 2.10 Postings lists intersection with skip pointers.in Figure 2.10.
Skip pointers will only be available for the original postingslists. For an intermediate result in a complex query, the call hasSkip( p) willalways return false. Finally, note that the presence of skip pointers only helpsfor AND queries, not for OR queries.Where do we place skips? There is a tradeoff. More skips means shorterskip spans, and that we are more likely to skip. But it also means lots ofcomparisons to skip pointers, and lots of space storing skip pointers. Fewerskips means few pointer comparisons, but then long skip spans which meansthat there will be fewer opportunities to skip.
A simple heuristic for placingskips, which has been√ found to work well in practice, is that for a postingslist of length P, use P evenly-spaced skip pointers. This heuristic can beimproved upon; it ignores any details of the distribution of query terms.Building effective skip pointers is easy if an index is relatively static; itis harder if a postings list keeps changing because of updates.
A maliciousdeletion strategy can render skip lists ineffective.Choosing the optimal encoding for an inverted index is an ever-changinggame for the system builder, because it is strongly dependent on underlying computer technologies and their relative speeds and sizes. Traditionally,CPUs were slow, and so highly compressed techniques were not optimal.Now CPUs are fast and disk is slow, so reducing disk postings list size dominates. However, if you’re running a search engine with everything in mem-Online edition (c) 2009 Cambridge UP382 The term vocabulary and postings listsory then the equation changes again.
We discuss the impact of hardwareparameters on index construction time in Section 4.1 (page 68) and the impact of index size on system speed in Chapter 5.?[⋆]Exercise 2.5Why are skip pointers not useful for queries of the form xORy?[⋆]Exercise 2.6We have a two-word query. For one term the postings list consists of the following 16entries:[4,6,10,12,14,16,18,20,22,32,47,81,120,122,157,180]and for the other it is the one entry postings list:[47].Work out how many comparisons would be done to intersect the two postings listswith the following two strategies. Briefly justify your answers:a. Using standard postings listsb.
Using postings lists stored with skip pointers, with a skip length ofgested in Section 2.3.√P, as sug[⋆]Exercise 2.7Consider a postings intersection between this postings list, with skip pointers:35915 24 39 60 68 75 81 84 89 92 96 97 100 115and the following intermediate result postings list (which hence has no skip pointers):35 89 95 9799 100101Trace through the postings intersection algorithm in Figure 2.10 (page 37).a. How often is a skip pointer followed (i.e., p1 is advanced to skip( p1 ))?b.
How many postings comparisons will be made by this algorithm while intersecting the two lists?c. How many postings comparisons would be made if the postings lists are intersected without the use of skip pointers?Online edition (c) 2009 Cambridge UP392.4 Positional postings and phrase queries2.4PHRASE QUERIES2.4.1BIWORD INDEXPositional postings and phrase queriesMany complex or technical concepts and many organization and productnames are multiword compounds or phrases. We would like to be able topose a query such as Stanford University by treating it as a phrase so that asentence in a document like The inventor Stanford Ovshinsky never went to university.
is not a match. Most recent search engines support a double quotessyntax (“stanford university”) for phrase queries, which has proven to be veryeasily understood and successfully used by users. As many as 10% of webqueries are phrase queries, and many more are implicit phrase queries (suchas person names), entered without use of double quotes.
To be able to support such queries, it is no longer sufficient for postings lists to be simply listsof documents that contain individual terms. In this section we consider twoapproaches to supporting phrase queries and their combination. A searchengine should not only support phrase queries, but implement them efficiently. A related but distinct concept is term proximity weighting, where adocument is preferred to the extent that the query terms appear close to eachother in the text. This technique is covered in Section 7.2.2 (page 144) in thecontext of ranked retrieval.Biword indexesOne approach to handling phrases is to consider every pair of consecutiveterms in a document as a phrase. For example, the text Friends, Romans,Countrymen would generate the biwords:friends romansromans countrymenIn this model, we treat each of these biwords as a vocabulary term.
Beingable to process two-word phrase queries is immediate. Longer phrases canbe processed by breaking them down. The query stanford university palo altocan be broken into the Boolean query on biwords:“stanford university”AND“university palo”AND“palo alto”This query could be expected to work fairly well in practice, but there canand will be occasional false positives. Without examining the documents,we cannot verify that the documents matching the above Boolean query doactually contain the original 4 word phrase.Among possible queries, nouns and noun phrases have a special status indescribing the concepts people are interested in searching for.
But relatednouns can often be divided from each other by various function words, inphrases such as the abolition of slavery or renegotiation of the constitution. Theseneeds can be incorporated into the biword indexing model in the followingOnline edition (c) 2009 Cambridge UP402 The term vocabulary and postings listsway. First, we tokenize the text and perform part-of-speech-tagging.6 Wecan then group terms into nouns, including proper nouns, (N) and functionwords, including articles and prepositions, (X), among other classes. Nowdeem any string of terms of the form NX*N to be an extended biword. Eachsuch extended biword is made a term in the vocabulary.
For example:renegotiationNofXtheXconstitutionNTo process a query using such an extended biword index, we need to alsoparse it into N’s and X’s, and then segment the query into extended biwords,which can be looked up in the index.This algorithm does not always work in an intuitively optimal mannerwhen parsing longer queries into Boolean queries. Using the above algorithm, the querycost overruns on a power plantis parsed into“cost overruns”PHRASE INDEXAND“overruns power”AND“power plant”whereas it might seem a better query to omit the middle biword. Betterresults can be obtained by using more precise part-of-speech patterns thatdefine which extended biwords should be indexed.The concept of a biword index can be extended to longer sequences ofwords, and if the index includes variable length word sequences, it is generally referred to as a phrase index. Indeed, searches for a single term arenot naturally handled in a biword index (you would need to scan the dictionary for all biwords containing the term), and so we also need to have anindex of single-word terms.
While there is always a chance of false positivematches, the chance of a false positive match on indexed phrases of length 3or more becomes very small indeed. But on the other hand, storing longerphrases has the potential to greatly expand the vocabulary size. Maintaining exhaustive phrase indexes for phrases of length greater than two is adaunting prospect, and even use of an exhaustive biword dictionary greatlyexpands the size of the vocabulary.
However, towards the end of this section we discuss the utility of the strategy of using a partial phrase index in acompound indexing scheme.6. Part of speech taggers classify words as nouns, verbs, etc. – or, in practice, often as finergrained classes like “plural proper noun”. Many fairly accurate (c. 96% per-tag accuracy) partof-speech taggers now exist, usually trained by machine learning methods on hand-tagged text.See, for instance, Manning and Schütze (1999, ch. 10).Online edition (c) 2009 Cambridge UP2.4 Positional postings and phrase queries41to, 993427:h 1, 6:2, 5:4, 5:5, 2:7, 3:h7, 18, 33, 72, 86, 231i;h1, 17, 74, 222, 255i;h8, 16, 190, 429, 433i;h363, 367i;h13, 23, 191i; . .