An Introduction to Information Retrieval. Manning, Raghavan (2009), page 18
How many original vocabulary terms can there be in the postings list of a permuterm vocabulary term?

Exercise 3.2
Write down the entries in the permuterm index dictionary that are generated by the term mama.

Exercise 3.3
If you wanted to search for s*ng in a permuterm wildcard index, what key(s) would one do the lookup on?

Exercise 3.4
Refer to Figure 3.4; it is pointed out in the caption that the vocabulary terms in the postings are lexicographically ordered. Why is this ordering useful?

Exercise 3.5
Consider again the query fi*mo*er from Section 3.2.1. What Boolean query on a bigram index would be generated for this query? Can you think of a term that matches the permuterm query in Section 3.2.1, but does not satisfy this Boolean query?

Exercise 3.6
Give an example of a sentence that falsely matches the wildcard query mon*h if the search were to simply use a conjunction of bigrams.

3.3 Spelling correction

We next look at the problem of correcting spelling errors in queries.
For instance, we may wish to retrieve documents containing the term carrot when the user types the query carot. Google reports (http://www.google.com/jobs/britney.html) that the following are all treated as misspellings of the query britney spears: britian spears, britney's spears, brandy spears and prittany spears. We look at two steps to solving this problem: the first based on edit distance and the second based on k-gram overlap. Before getting into the algorithmic details of these methods, we first review how search engines provide spell-correction as part of a user experience.

Online edition (c) 2009 Cambridge UP

3.3.1 Implementing spelling correction

There are two basic principles underlying most spelling correction algorithms.

1. Of various alternative correct spellings for a mis-spelled query, choose the "nearest" one. This demands that we have a notion of nearness or proximity between a pair of queries.
We will develop these proximity measures in Section 3.3.3.

2. When two correctly spelled queries are tied (or nearly tied), select the one that is more common. For instance, grunt and grant both seem equally plausible as corrections for grnt. Then, the algorithm should choose the more common of grunt and grant as the correction. The simplest notion of more common is to consider the number of occurrences of the term in the collection; thus if grunt occurs more often than grant, it would be the chosen correction. A different notion of more common is employed in many search engines, especially on the web. The idea is to use the correction that is most common among queries typed in by other users. The idea here is that if grunt is typed as a query more often than grant, then it is more likely that the user who typed grnt intended to type the query grunt.

Beginning in Section 3.3.3 we describe notions of proximity between queries, as well as their efficient computation.
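The second principle can be sketched in a few lines of Python. The candidate list, the frequency values, and the function name below are illustrative assumptions, not part of the text:

```python
def pick_correction(candidates, term_frequency):
    """Among tied candidate corrections, choose the most common term
    by its number of occurrences in the collection."""
    return max(candidates, key=lambda t: term_frequency.get(t, 0))

# Hypothetical collection frequencies: suppose grunt occurs more often than grant.
freq = {"grunt": 530, "grant": 210}
print(pick_correction(["grunt", "grant"], freq))  # → grunt
```

A web search engine would instead consult query-log frequencies, but the tie-breaking logic is the same.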
Spelling correction algorithms build on these computations of proximity; their functionality is then exposed to users in one of several ways:

1. On the query carot always retrieve documents containing carot as well as any "spell-corrected" version of carot, including carrot and tarot.

2. As in (1) above, but only when the query term carot is not in the dictionary.

3. As in (1) above, but only when the original query returned fewer than a preset number of documents (say fewer than five documents).

4. When the original query returns fewer than a preset number of documents, the search interface presents a spelling suggestion to the end user: this suggestion consists of the spell-corrected query term(s). Thus, the search engine might respond to the user: "Did you mean carrot?"

3.3.2 Forms of spelling correction

We focus on two specific forms of spelling correction that we refer to as isolated-term correction and context-sensitive correction.
In isolated-term correction, we attempt to correct a single query term at a time – even when we have a multiple-term query. The carot example demonstrates this type of correction. Such isolated-term correction would fail to detect, for instance, that the query flew form Heathrow contains a mis-spelling of the term from – because each term in the query is correctly spelled in isolation.

We begin by examining two techniques for addressing isolated-term correction: edit distance, and k-gram overlap. We then proceed to context-sensitive correction.

3.3.3 Edit distance

Given two character strings s1 and s2, the edit distance between them is the minimum number of edit operations required to transform s1 into s2.
Most commonly, the edit operations allowed for this purpose are: (i) insert a character into a string; (ii) delete a character from a string; and (iii) replace a character of a string by another character. For these operations, edit distance is sometimes known as Levenshtein distance. For example, the edit distance between cat and dog is 3.
In fact, the notion of edit distance can be generalized to allow different weights for different kinds of edit operations; for instance, a higher weight may be placed on replacing the character s by the character p than on replacing it by the character a (the latter being closer to s on the keyboard). Setting weights in this way, depending on the likelihood of letters substituting for each other, is very effective in practice (see Section 3.4 for the separate issue of phonetic similarity). However, the remainder of our treatment here will focus on the case in which all edit operations have the same weight.

It is well known how to compute the (weighted) edit distance between two strings in time O(|s1| × |s2|), where |si| denotes the length of a string si. The idea is to use the dynamic programming algorithm in Figure 3.5, where the characters in s1 and s2 are given in array form.
The algorithm fills the (integer) entries in a matrix m whose two dimensions equal the lengths of the two strings whose edit distance is being computed; the (i, j) entry of the matrix will hold (after the algorithm is executed) the edit distance between the strings consisting of the first i characters of s1 and the first j characters of s2. The central dynamic programming step is depicted in lines 8–10 of Figure 3.5, where the three quantities whose minimum is taken correspond to substituting a character in s1, inserting a character in s1 and inserting a character in s2.

Figure 3.6 shows an example Levenshtein distance computation using the algorithm of Figure 3.5.
The typical cell [i, j] has four entries formatted as a 2 × 2 cell. The lower right entry in each cell is the min of the other three, corresponding to the main dynamic programming step in Figure 3.5. The other three entries are m[i − 1, j − 1] + 0 or 1 (depending on whether s1[i] = s2[j]), m[i − 1, j] + 1 and m[i, j − 1] + 1. The cells with numbers in italics depict the path by which we determine the Levenshtein distance.

EDITDISTANCE(s1, s2)
1   int m[i, j] = 0
2   for i ← 1 to |s1|
3   do m[i, 0] = i
4   for j ← 1 to |s2|
5   do m[0, j] = j
6   for i ← 1 to |s1|
7   do for j ← 1 to |s2|
8      do m[i, j] = min{m[i − 1, j − 1] + if (s1[i] = s2[j]) then 0 else 1 fi,
9         m[i − 1, j] + 1,
10        m[i, j − 1] + 1}
11  return m[|s1|, |s2|]

◮ Figure 3.5 Dynamic programming algorithm for computing the edit distance between strings s1 and s2.

        c   a   t   s
    0   1   2   3   4
f   1   1   2   3   4
a   2   2   1   2   3
s   3   3   2   2   2
t   4   4   3   2   3

◮ Figure 3.6 Example Levenshtein distance computation for s1 = fast and s2 = cats; shown here is the final value m[i, j] of each cell. In the full figure, each [i, j] entry is a 2 × 2 cell showing the three numbers whose minimum yields the fourth, and the cells in italics determine the edit distance in this example.

The spelling correction problem, however, demands more than computing edit distance: given a set V of strings (corresponding to terms in the vocabulary) and a query string q, we seek the string(s) in V of least edit distance from q. We may view this as a decoding problem, in which the codewords (the strings in V) are prescribed in advance. The obvious way of doing this is to compute the edit distance from q to each string in V, before selecting the string(s) of minimum edit distance.
This exhaustive search is inordinately expensive. Accordingly, a number of heuristics are used in practice to efficiently retrieve vocabulary terms likely to have low edit distance to the query term(s).

The simplest such heuristic is to restrict the search to dictionary terms beginning with the same letter as the query string; the hope would be that spelling errors do not occur in the first character of the query. A more sophisticated variant of this heuristic is to use a version of the permuterm index, in which we omit the end-of-word symbol $.
Consider the set of all rotations of the query string q. For each rotation r from this set, we traverse the B-tree into the permuterm index, thereby retrieving all dictionary terms that have a rotation beginning with r. For instance, if q is mase and we consider the rotation r = sema, we would retrieve dictionary terms such as semantic and semaphore that do not have a small edit distance to q.
Unfortunately, we would miss more pertinent dictionary terms such as mare and mane. To address this, we refine this rotation scheme: for each rotation, we omit a suffix of ℓ characters before performing the B-tree traversal. This ensures that each term in the set R of terms retrieved from the dictionary includes a "long" substring in common with q. The value of ℓ could depend on the length of q. Alternatively, we may set it to a fixed constant such as 2.

3.3.4 k-gram indexes for spelling correction

To further limit the set of vocabulary terms for which we compute edit distances to the query term, we now show how to invoke the k-gram index of Section 3.2.2 (page 54) to assist with retrieving vocabulary terms with low edit distance to the query q.
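As a preview of how such an index narrows the candidate set, here is a minimal in-memory sketch. The helper names, the choice k = 2, the toy vocabulary, and the use of a shared-bigram count as the filter are my own illustrative assumptions, not the book's implementation (which walks postings lists of a k-gram index rather than scanning terms):

```python
def kgrams(term: str, k: int = 2):
    """Set of k-grams of a term, with $ marking the start and end (Section 3.2.2)."""
    padded = "$" + term + "$"
    return {padded[i:i + k] for i in range(len(padded) - k + 1)}

def candidates(q: str, vocabulary, k: int = 2, min_shared: int = 2):
    """Vocabulary terms sharing at least min_shared k-grams with the query q.
    A real k-gram index would merge postings lists instead of scanning terms."""
    qgrams = kgrams(q, k)
    return [t for t in vocabulary if len(qgrams & kgrams(t, k)) >= min_shared]

vocab = ["border", "boarder", "lord", "morbid", "sordid", "aboard"]
print(candidates("bord", vocab, min_shared=3))  # morbid is filtered out
```

Only the surviving candidates need the O(|s1| × |s2|) edit distance computation, which is the point of the pre-filter.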