An Introduction to Information Retrieval. Manning, Raghavan (2009)
Online edition (c) 2009 Cambridge UP
For instance, if the tokens anti-discriminatory and antidiscriminatory are both mapped onto the term antidiscriminatory, in both the document text and queries, then searches for one term will retrieve documents that contain either.

The advantage of just using mapping rules that remove characters like hyphens is that the equivalence classing to be done is implicit, rather than being fully calculated in advance: the terms that happen to become identical as the result of these rules are the equivalence classes. It is only easy to write rules of this sort that remove characters. Since the equivalence classes are implicit, it is not obvious when you might want to add characters.
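A character-removal mapping rule of this kind can be sketched in a few lines of Python. This is an illustrative sketch, not the book's implementation; the `normalize` function name is our own.

```python
def normalize(token: str) -> str:
    """Map a token to its term by deleting hyphens.

    The equivalence classes are implicit: tokens that become
    identical after this rule belong to the same class.
    """
    return token.replace("-", "")

# Both spellings collapse onto the same term, so a query for either
# form matches documents containing either form.
assert normalize("anti-discriminatory") == normalize("antidiscriminatory")
```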
For instance, it would be hard to know to turn antidiscriminatory into anti-discriminatory.

An alternative to creating equivalence classes is to maintain relations between unnormalized tokens. This method can be extended to hand-constructed lists of synonyms such as car and automobile, a topic we discuss further in Chapter 9. These term relationships can be achieved in two ways.
The usual way is to index unnormalized tokens and to maintain a query expansion list of multiple vocabulary entries to consider for a certain query term. A query term is then effectively a disjunction of several postings lists. The alternative is to perform the expansion during index construction.
When the document contains automobile, we index it under car as well (and, usually, also vice-versa). Use of either of these methods is considerably less efficient than equivalence classing, as there are more postings to store and merge. The first method adds a query expansion dictionary and requires more processing at query time, while the second method requires more space for storing postings. Traditionally, expanding the space required for the postings lists was seen as more disadvantageous, but with modern storage costs, the increased flexibility that comes from distinct postings lists is appealing. These approaches are more flexible than equivalence classes because the expansion lists can overlap while not being identical.

4. It is also often referred to as term normalization, but we prefer to reserve the name term for the output of the normalization process.
This means there can be an asymmetry in expansion. An example of how such an asymmetry can be exploited is shown in Figure 2.6: if the user enters windows, we wish to allow matches with the capitalized Windows operating system, but this is not plausible if the user enters window, even though it is plausible for this query to also match lowercase windows.

The best amount of equivalence classing or query expansion to do is a fairly open question. Doing some definitely seems a good idea.
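An asymmetric expansion table of the kind just described can be sketched as follows. The particular entries are an illustrative assumption in the spirit of the windows/Windows/window example, not the exact contents of Figure 2.6.

```python
# Query term -> vocabulary entries whose postings lists are merged
# (a disjunction). Note the asymmetry: "window" never expands to the
# capitalized operating-system name, but lowercase "windows" does.
EXPANSION = {
    "Windows": ["Windows"],                       # capitalized query: OS only
    "windows": ["Windows", "windows", "window"],  # lowercase: allow the OS too
    "window":  ["window", "windows"],             # no match with the OS name
}

def expand_query(term: str) -> list[str]:
    """Return the list of vocabulary entries to look up for a query term."""
    return EXPANSION.get(term, [term])
```

Because the lists overlap without being identical, this cannot be expressed as a single partition into equivalence classes.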
But doing a lot can easily have unexpected consequences of broadening queries in unintended ways. For instance, equivalence-classing U.S.A. and USA to the latter by deleting periods from tokens might at first seem very reasonable, given the prevalent pattern of optional use of periods in acronyms. However, if I put in as my query term C.A.T., I might be rather upset if it matches every appearance of the word cat in documents.5

Below we present some of the forms of normalization that are commonly employed and how they are implemented. In many cases they seem helpful, but they can also do harm. In fact, you can worry about many details of equivalence classing, but it often turns out that, provided processing is done consistently to the query and to documents, the fine details may not have much aggregate effect on performance.

Accents and diacritics.
Diacritics on characters in English have a fairly marginal status, and we might well want cliché and cliche to match, or naive and naïve. This can be done by normalizing tokens to remove diacritics. In many other languages, diacritics are a regular part of the writing system and distinguish different sounds. Occasionally words are distinguished only by their accents.
For instance, in Spanish, peña is ‘a cliff’, while pena is ‘sorrow’. Nevertheless, the important question is usually not prescriptive or linguistic but is a question of how users are likely to write queries for these words. In many cases, users will enter queries for words without diacritics, whether for reasons of speed, laziness, limited software, or habits born of the days when it was hard to use non-ASCII text on many computer systems.
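Diacritic removal is commonly implemented by decomposing characters into a base character plus combining marks and then dropping the marks. A minimal sketch using Python's standard library (one possible implementation, not the only one):

```python
import unicodedata

def strip_diacritics(token: str) -> str:
    """Remove diacritics by decomposing to NFD and dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# cliché and cliche now map to the same term, as do naïve and naive.
assert strip_diacritics("cliché") == "cliche"
```

Note that this sketch also conflates the Spanish pair peña and pena, which illustrates the cost of the normalization when accents are contrastive.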
In these cases, it might be best to equate all words to a form without diacritics.

5. At the time we wrote this chapter (Aug. 2005), this was actually the case on Google: the top result for the query C.A.T. was a site about cats, the Cat Fanciers Web Site http://www.fanciers.com/.

Capitalization/case-folding.
A common strategy is to do case-folding by reducing all letters to lower case. Often this is a good idea: it will allow instances of Automobile at the beginning of a sentence to match with a query of automobile. It will also help on a web search engine when most of your users type in ferrari when they are interested in a Ferrari car. On the other hand, such case-folding can equate words that might better be kept apart. Many proper nouns are derived from common nouns and so are distinguished only by case, including companies (General Motors, The Associated Press), government organizations (the Fed vs.
fed) and person names (Bush, Black). We already mentioned an example of unintended query expansion with acronyms, which involved not only acronym normalization (C.A.T. → CAT) but also case-folding (CAT → cat).

For English, an alternative to making every token lowercase is to just make some tokens lowercase. The simplest heuristic is to convert to lowercase words at the beginning of a sentence and all words occurring in a title that is all uppercase or in which most or all words are capitalized.
These words are usually ordinary words that have been capitalized. Mid-sentence capitalized words are left as capitalized (which is usually correct). This will mostly avoid case-folding in cases where distinctions should be kept apart. The same task can be done more accurately by a machine learning sequence model which uses more features to make the decision of when to case-fold. This is known as truecasing. However, trying to get capitalization right in this way probably doesn’t help if your users usually use lowercase regardless of the correct case of words.
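The simple sentence-initial heuristic can be sketched as follows. This is a toy illustration of the heuristic only, not a truecasing model; it assumes the input is a single pre-split sentence.

```python
def heuristic_truecase(sentence: str) -> list[str]:
    """Lowercase only the sentence-initial token; keep mid-sentence
    capitalization, which usually marks a proper noun."""
    tokens = sentence.split()
    if tokens:
        tokens[0] = tokens[0].lower()
    return tokens

# "The" is lowered as an ordinary sentence-initial word,
# while the mid-sentence proper noun "Fed" keeps its case.
assert heuristic_truecase("The Fed raised rates") == ["the", "Fed", "raised", "rates"]
```

The heuristic fails when a sentence begins with a proper noun (e.g. "Bush spoke"), which is exactly the kind of case a learned sequence model handles better.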
Thus, lowercasing everything often remains the most practical solution.

Other issues in English. Other possible normalizations are quite idiosyncratic and particular to English. For instance, you might wish to equate ne’er and never or the British spelling colour and the American spelling color. Dates, times and similar items come in multiple formats, presenting additional challenges. You might wish to collapse together 3/12/91 and Mar. 12, 1991. However, correct processing here is complicated by the fact that in the U.S., 3/12/91 is Mar.
12, 1991, whereas in Europe it is 3 Dec 1991.

Other languages. English has maintained a dominant position on the WWW; approximately 60% of web pages are in English (Gerrand 2007). But that still leaves 40% of the web, and the non-English portion might be expected to grow over time, since less than one third of Internet users and less than 10% of the world’s population primarily speak English.
And there are signs of change: Sifry (2007) reports that only about one third of blog posts are in English.

Other languages again present distinctive issues in equivalence classing.

◮ Figure 2.7 Japanese makes use of multiple intermingled writing systems and, like Chinese, does not segment words. The text is mainly Chinese characters with the hiragana syllabary for inflectional endings and function words. The part in latin letters is actually a Japanese expression, but has been taken up as the name of an environmental campaign by 2004 Nobel Peace Prize winner Wangari Maathai. Her name is written using the katakana syllabary in the middle of the first line. The first four characters of the final line express a monetary amount that we would want to match with ¥500,000 (500,000 Japanese yen).

The French word for the has distinctive forms based not only on the gender (masculine or feminine) and number of the following noun, but also depending on whether the following word begins with a vowel: le, la, l’, les.
We may well wish to equivalence class these various forms of the. German has a convention whereby vowels with an umlaut can be rendered instead as a two-vowel digraph. We would want to treat Schütze and Schuetze as equivalent.

Japanese is a well-known difficult writing system, as illustrated in Figure 2.7. Modern Japanese is standardly an intermingling of multiple alphabets, principally Chinese characters, two syllabaries (hiragana and katakana) and western characters (Latin letters, Arabic numerals, and various symbols). While there are strong conventions and standardization through the education system over the choice of writing system, in many cases the same word can be written with multiple writing systems.
For example, a word may be written in katakana for emphasis (somewhat like italics). Or a word may sometimes be written in hiragana and sometimes in Chinese characters. Successful retrieval thus requires complex equivalence classing across the writing systems. In particular, an end user might commonly present a query entirely in hiragana, because it is easier to type, just as Western end users commonly use all lowercase.

Document collections being indexed can include documents from many different languages. Or a single document can easily contain text from multiple languages.
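The German umlaut-digraph equivalence mentioned above can be sketched with a small mapping table. The table is an assumption for illustration; a real system would apply the same mapping consistently at index and query time (and might also map in the reverse direction), and this sketch assumes precomposed (NFC) input characters.

```python
# Hypothetical umlaut-to-digraph table; includes ß -> ss by the same convention.
UMLAUT_TO_DIGRAPH = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
}

def fold_umlauts(token: str) -> str:
    """Rewrite umlauted vowels as two-vowel digraphs, character by character."""
    return "".join(UMLAUT_TO_DIGRAPH.get(ch, ch) for ch in token)

# Schütze and Schuetze now normalize to the same term.
assert fold_umlauts("Schütze") == "Schuetze"
```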