Once we retrieve such terms, we can then find the ones of least edit distance from q. In fact, we will use the k-gram index to retrieve vocabulary terms that have many k-grams in common with the query. We will argue that for reasonable definitions of “many k-grams in common,” the retrieval process is essentially that of a single scan through the postings for the k-grams in the query string q.

The 2-gram (or bigram) index in Figure 3.7 shows (a portion of) the postings for the three bigrams in the query bord. Suppose we wanted to retrieve vocabulary terms that contained at least two of these three bigrams.
A single scan of the postings (much as in Chapter 1) would let us enumerate all such terms; in the example of Figure 3.7 we would enumerate aboard, boardroom and border.

◮ Figure 3.7 Matching at least two of the three 2-grams in the query bord. The postings shown (a portion) are: bo → aboard, about, boardroom, border; or → border, lord, morbid, sordid; rd → aboard, ardent, boardroom, border.

This straightforward application of the linear scan intersection of postings immediately reveals the shortcoming of simply requiring matched vocabulary terms to contain a fixed number of k-grams from the query q: terms like boardroom, an implausible “correction” of bord, get enumerated. Consequently, we require more nuanced measures of the overlap in k-grams between a vocabulary term and q.
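Before refining the measure, here is a minimal sketch in Python of the simple match-count scan just described. It assumes a hypothetical in-memory k-gram index, represented as a dictionary mapping each bigram to the list of vocabulary terms containing it; the helper names are illustrative, and a real index would of course store postings far more compactly.

```python
from collections import defaultdict

def kgrams(term, k=2):
    """Return the set of k-grams of a term (no boundary padding, as in the bord example)."""
    return {term[i:i + k] for i in range(len(term) - k + 1)}

def candidates_by_match_count(query, kgram_index, k=2, min_matches=2):
    """Single pass over the postings of the query's k-grams, returning every
    vocabulary term that shares at least min_matches k-grams with the query."""
    hits = defaultdict(int)                       # term -> number of matching k-grams
    for gram in kgrams(query, k):
        for term in kgram_index.get(gram, []):    # one linear scan per postings list
            hits[term] += 1
    return sorted(t for t, n in hits.items() if n >= min_matches)

# Toy index over a few of the terms from Figure 3.7 (illustrative only).
vocab = ["aboard", "about", "ardent", "boardroom", "border"]
index = defaultdict(list)
for t in vocab:
    for g in kgrams(t):
        index[g].append(t)

print(candidates_by_match_count("bord", index))   # ['aboard', 'boardroom', 'border']
```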
The linear scan intersection can be adapted when the measure of overlap is the Jaccard coefficient for measuring the overlap between two sets A and B, defined to be |A ∩ B|/|A ∪ B|. The two sets we consider are the set of k-grams in the query q, and the set of k-grams in a vocabulary term. As the scan proceeds, we proceed from one vocabulary term t to the next, computing on the fly the Jaccard coefficient between q and t. If the coefficient exceeds a preset threshold, we add t to the output; if not, we move on to the next term in the postings.
To compute the Jaccard coefficient, we need the set of k-grams in q and t. Since we are scanning the postings for all k-grams in q, we immediately have these k-grams on hand. What about the k-grams of t? In principle, we could enumerate these on the fly from t; in practice this is not only slow but potentially infeasible since, in all likelihood, the postings entries themselves do not contain the complete string t but rather some encoding of t. The crucial observation is that to compute the Jaccard coefficient, we only need the length of the string t.
To see this, recall the example of Figure 3.7 and consider the point when the postings scan for query q = bord reaches term t = boardroom. We know that two bigrams match. If the postings stored the (pre-computed) number of bigrams in boardroom (namely, 8), we have all the information we require to compute the Jaccard coefficient to be 2/(8 + 3 − 2); the numerator is obtained from the number of postings hits (2, from bo and rd) while the denominator is the sum of the number of bigrams in bord and boardroom, less the number of postings hits.

We could replace the Jaccard coefficient by other measures that allow efficient on-the-fly computation during postings scans.
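As a sketch of this on-the-fly computation (cf. Exercise 3.9 below), the following Python fragment assumes a hypothetical postings layout in which each entry stores the term together with its pre-computed length, so that the number of its k-grams is available without decoding the term itself.

```python
from collections import defaultdict

def jaccard_candidates(query, kgram_index, k=2, threshold=0.2):
    """One pass over the postings of the query's k-grams, computing the Jaccard
    coefficient |Q ∩ T| / |Q ∪ T| from (a) the running count of postings hits
    and (b) the stored length of each vocabulary term."""
    q_grams = {query[i:i + k] for i in range(len(query) - k + 1)}
    hits = defaultdict(int)       # term -> number of shared k-grams so far
    lengths = {}                  # term -> stored string length
    for gram in q_grams:
        for term, length in kgram_index.get(gram, []):   # postings of (term, length) pairs
            hits[term] += 1
            lengths[term] = length
    results = []
    for term, h in hits.items():
        t_grams = lengths[term] - k + 1           # number of k-grams of t, derived from its length
        jacc = h / (len(q_grams) + t_grams - h)   # |Q ∩ T| / (|Q| + |T| - |Q ∩ T|)
        if jacc >= threshold:
            results.append((term, jacc))
    return sorted(results, key=lambda x: -x[1])

# For q = bord and t = boardroom (length 9, hence 8 bigrams, 2 of them shared),
# jacc = 2 / (3 + 8 - 2) = 2/9, matching the worked example above.
```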
How do we use these for spelling correction? One method that has some empirical support is to first use the k-gram index to enumerate a set of candidate vocabulary terms that are potential corrections of q. We then compute the edit distance from q to each term in this set, selecting terms from the set with small edit distance to q.
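A minimal sketch of this two-stage method, combining a candidate generator such as the scans sketched above (hypothetical here) with a dynamic-programming edit distance in the spirit of the algorithm of Figure 3.5, keeping only two rows of the array:

```python
def edit_distance(s1, s2):
    """Levenshtein distance via the standard dynamic program, two rows at a time."""
    m, n = len(s1), len(s2)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            sub = prev[j - 1] + (s1[i - 1] != s2[j - 1])    # substitution (free if characters match)
            cur[j] = min(sub, prev[j] + 1, cur[j - 1] + 1)  # vs. deletion / insertion
        prev = cur
    return prev[n]

def correct(query, candidates, max_distance=2):
    """Rank the k-gram candidates by edit distance to the query and keep the closest."""
    scored = sorted((edit_distance(query, t), t) for t in candidates)
    return [t for d, t in scored if d <= max_distance]

# correct("bord", ["aboard", "boardroom", "border"]) -> ['aboard', 'border']
# (both at edit distance 2; boardroom, at distance 5, is discarded)
```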
3.3.5 Context sensitive spelling correction

Isolated-term correction would fail to correct typographical errors such as flew form Heathrow, where all three query terms are correctly spelled. When a phrase such as this retrieves few documents, a search engine may like to offer the corrected query flew from Heathrow. The simplest way to do this is to enumerate corrections of each of the three query terms (using the methods leading up to Section 3.3.4) even though each query term is correctly spelled, then try substitutions of each correction in the phrase. For the example flew form Heathrow, we enumerate such phrases as fled form Heathrow and flew fore Heathrow. For each such substitute phrase, the search engine runs the query and determines the number of matching results.

This enumeration can be expensive if we find many corrections of the individual terms, since we could encounter a large number of combinations of alternatives.
Several heuristics are used to trim this space. In the example above, as we expand the alternatives for flew and form, we retain only the most frequent combinations in the collection or in the query logs, which contain previous queries by users. For instance, we would retain flew from as an alternative to try and extend to a three-term corrected query, but perhaps not fled fore or flea form. In this example, the biword fled fore is likely to be rare compared to the biword flew from. Then, we only attempt to extend the list of top biwords (such as flew from) to corrections of Heathrow. As an alternative to using the biword statistics in the collection, we may use the logs of queries issued by users; these could of course include queries with spelling errors.
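The pruning just described might be sketched as follows, assuming hypothetical helpers alternatives(term), which returns the isolated-term corrections of a term (including the term itself), and biword_count(w1, w2), which returns the frequency of a biword in the collection or in the query logs:

```python
def correct_phrase(terms, alternatives, biword_count, keep=10):
    """Expand a query phrase left to right, retaining at each step only the
    `keep` substitute prefixes whose final biword is most frequent."""
    prefixes = [[alt] for alt in alternatives(terms[0])]
    for term in terms[1:]:
        extended = [prefix + [alt] for prefix in prefixes for alt in alternatives(term)]
        # Score each candidate prefix by the frequency of its final biword and
        # keep only the top candidates for further extension.
        extended.sort(key=lambda p: biword_count(p[-2], p[-1]), reverse=True)
        prefixes = extended[:keep]
    return [" ".join(p) for p in prefixes]

# correct_phrase(["flew", "form", "heathrow"], alternatives, biword_count) would
# keep prefixes such as "flew from" while discarding rare ones such as "fled fore",
# eventually proposing "flew from heathrow" among the surviving phrases.
```

This is essentially a beam search over substitute phrases, with biword frequency acting as the score at each extension step.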
Exercise 3.7
If |si| denotes the length of string si, show that the edit distance between s1 and s2 is never more than max{|s1|, |s2|}.

Exercise 3.8
Compute the edit distance between paris and alice. Write down the 5 × 5 array of distances between all prefixes as computed by the algorithm in Figure 3.5.

Exercise 3.9
Write pseudocode showing the details of computing on the fly the Jaccard coefficient while scanning the postings of the k-gram index, as mentioned on page 61.

Exercise 3.10
Compute the Jaccard coefficients between the query bord and each of the terms in Figure 3.7 that contain the bigram or.

Exercise 3.11
Consider the four-term query catched in the rye and suppose that each of the query terms has five alternative terms suggested by isolated-term correction.
How many possible corrected phrases must we consider if we do not trim the space of corrected phrases, but instead try all six variants for each of the terms?

Exercise 3.12
For each of the prefixes of the query (catched, catched in and catched in the) we have a number of substitute prefixes arising from each term and its alternatives. Suppose that we were to retain only the top 10 of these substitute prefixes, as measured by their number of occurrences in the collection. We eliminate the rest from consideration for extension to longer prefixes: thus, if batched in is not one of the 10 most common 2-term queries in the collection, we do not consider any extension of batched in as possibly leading to a correction of catched in the rye. How many of the possible substitute prefixes are we eliminating at each phase?

Exercise 3.13
Are we guaranteed that retaining and extending only the 10 commonest substitute prefixes of catched in will lead to one of the 10 commonest substitute prefixes of catched in the?

3.4 Phonetic correction

Our final technique for tolerant retrieval has to do with phonetic correction: misspellings that arise because the user types a query that sounds like the target term.
Such algorithms are especially applicable to searches on the names of people. The main idea here is to generate, for each term, a “phonetic hash” so that similar-sounding terms hash to the same value. The idea owes its origins to work in international police departments from the early 20th century, seeking to match names for wanted criminals despite the names being spelled differently in different countries. It is mainly used to correct phonetic misspellings in proper nouns.

Algorithms for such phonetic hashing are commonly collectively known as soundex algorithms.
However, there is an original soundex algorithm, with various variants, built on the following scheme:

1. Turn every term to be indexed into a 4-character reduced form. Build an inverted index from these reduced forms to the original terms; call this the soundex index.
2. Do the same with query terms.
3. When the query calls for a soundex match, search this soundex index.

The variations in different soundex algorithms have to do with the conversion of terms to 4-character forms. A commonly used conversion results in a 4-character code, with the first character being a letter of the alphabet and the other three being digits between 0 and 9.

1. Retain the first letter of the term.
2. Change all occurrences of the following letters to ’0’ (zero): ’A’, ’E’, ’I’, ’O’, ’U’, ’H’, ’W’, ’Y’.
3.