An Introduction to Information Retrieval. Manning, Raghavan (2009), page 10
Day-to-day text is unvocalized (short vowels are not represented but the letter for ā would still appear) or partially vocalized, with short vowels inserted in places where the writer perceives ambiguities. These choices add further complexities to indexing.

[Figure 2.2 shows an Arabic example sentence with arrows indicating the reading order; the Arabic script is not reproduced legibly in this extraction. The sentence is glossed as: ‘Algeria achieved its independence in 1962 after 132 years of French occupation.’]

◮ Figure 2.2 The conceptual linear order of characters is not necessarily the order that you see on the page. In languages that are written right-to-left, such as Hebrew and Arabic, it is quite common to also have left-to-right text interspersed, such as numbers and dollar amounts. With modern Unicode representation concepts, the order of characters in files matches the conceptual order, and the reversal of displayed characters is handled by the rendering system, but this may not be true for documents in older encodings.
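To make the caption's point about logical versus display order concrete, here is a minimal Python sketch (not from the book; the mixed-direction string is invented for illustration, with the Hebrew letters written as escapes):

    import unicodedata

    # A made-up string mixing left-to-right Latin text and digits with
    # right-to-left Hebrew letters (alef, bet, gimel, written as escapes).
    s = "price 120 \u05d0\u05d1\u05d2"

    # The characters are stored in logical (conceptual) order; the renderer,
    # not the file, decides how they are laid out on screen.
    for position, ch in enumerate(s):
        print(position, repr(ch), unicodedata.bidirectional(ch))
    # Latin letters report class 'L', the digits 'EN' (European number),
    # and the Hebrew letters 'R' (right-to-left).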
There are many cases in which you might want to do something different. A traditional Unix (mbox-format) email file stores a sequence of email messages (an email folder) in one file, but you might wish to regard each email message as a separate document.
Many email messages now contain attached documents, and you might then want to regard the email message and each contained attachment as separate documents. If an email message has an attached zip file, you might want to decode the zip file and regard each file it contains as a separate document. Going in the opposite direction, various pieces of web software (such as latex2html) take things that you might regard as a single document (e.g., a Powerpoint file or a LATEX document) and split them into separate HTML pages for each slide or subsection, stored as separate files. In these cases, you might want to combine multiple files into a single document.
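As a rough illustration of the mbox case, the sketch below (not the book's code; the file name is a placeholder) uses Python's standard mailbox module to treat each message in one mbox file as its own document:

    import mailbox

    # Placeholder file name; point this at a real Unix mbox folder.
    mbox = mailbox.mbox("archive.mbox")

    documents = []
    for i, message in enumerate(mbox):
        subject = message.get("Subject", "")
        if message.is_multipart():
            # Messages with attachments would be split further into
            # one document per part; omitted in this sketch.
            continue
        body = message.get_payload(decode=True) or b""
        documents.append((f"msg-{i}", subject, body.decode("utf-8", "replace")))

    print(len(documents), "single-part messages treated as documents")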
More generally, for very long documents, the issue of indexing granularity arises. For a collection of books, it would usually be a bad idea to index an entire book as a document. A search for Chinese toys might bring up a book that mentions China in the first chapter and toys in the last chapter, but this does not make it relevant to the query. Instead, we may well wish to index each chapter or paragraph as a mini-document.
Matches are then more likely to be relevant, and since the documents are smaller it will be much easier for the user to find the relevant passages in the document. But why stop there? We could treat individual sentences as mini-documents. It becomes clear that there is a precision/recall tradeoff here. If the units get too small, we are likely to miss important passages because terms were distributed over several mini-documents, while if units are too large we tend to get spurious matches and the relevant information is hard for the user to find.
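One simple way to realize the mini-document idea, sketched here under the assumption that paragraphs are separated by blank lines (the book does not prescribe any particular implementation), is to split each book into paragraph-level units that keep a composite identifier:

    def split_into_minidocs(doc_id, text):
        # Treat each blank-line-separated paragraph as a mini-document,
        # keeping an identifier that maps hits back to the original book.
        paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
        return [(f"{doc_id}#p{i}", p) for i, p in enumerate(paragraphs, start=1)]

    book_text = "China is discussed in the first chapter.\n\nToys appear only in the last chapter."
    for mini_id, passage in split_into_minidocs("book-42", book_text):
        print(mini_id, "->", passage)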
The problems with large document units can be alleviated by use of explicit or implicit proximity search (Sections 2.4.2 and 7.2.2), and the tradeoffs in resulting system performance that we are hinting at are discussed in Chapter 8. The issue of index granularity, and in particular a need to simultaneously index documents at multiple levels of granularity, appears prominently in XML retrieval, and is taken up again in Chapter 10.

An IR system should be designed to offer choices of granularity. For this choice to be made well, the person who is deploying the system must have a good understanding of the document collection, the users, and their likely information needs and usage patterns. For now, we will henceforth assume that a suitable size document unit has been chosen, together with an appropriate way of dividing or aggregating files, if needed.

2.2 Determining the vocabulary of terms

2.2.1 Tokenization

Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation.
Here is an example of tokenization:

Input: Friends, Romans, Countrymen, lend me your ears;
Output: Friends Romans Countrymen lend me your ears
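A first-cut tokenizer in this spirit might simply split on whitespace and strip surrounding punctuation. The Python sketch below (not the book's code) reproduces the example above and deliberately ignores the tricky cases discussed next:

    def tokenize(text):
        # Split on whitespace, then drop leading/trailing punctuation from
        # each piece; pieces that were pure punctuation are discarded.
        return [t for t in (w.strip(".,;:!?") for w in text.split()) if t]

    print(tokenize("Friends, Romans, Countrymen, lend me your ears;"))
    # ['Friends', 'Romans', 'Countrymen', 'lend', 'me', 'your', 'ears']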
These tokens are often loosely referred to as terms or words, but it is sometimes important to make a type/token distinction. A token is an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing. A type is the class of all tokens containing the same character sequence. A term is a (perhaps normalized) type that is included in the IR system’s dictionary. The set of index terms could be entirely distinct from the tokens, for instance, they could be semantic identifiers in a taxonomy, but in practice in modern IR systems they are strongly related to the tokens in the document.
However, rather than being exactly the tokens that appear in the document, they are usually derived from them by various normalization processes which are discussed in Section 2.2.3.² For example, if the document to be indexed is to sleep perchance to dream, then there are 5 tokens, but only 4 types (since there are 2 instances of to). However, if to is omitted from the index (as a stop word, see Section 2.2.2, page 27), then there will be only 3 terms: sleep, perchance, and dream.
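The counts in this example can be checked with a small sketch (the one-word stop list is purely illustrative):

    text = "to sleep perchance to dream"
    stop_words = {"to"}           # illustrative one-word stop list

    tokens = text.split()         # every occurrence counts as a token
    types = set(tokens)           # distinct character sequences
    terms = types - stop_words    # what actually enters the dictionary

    print(len(tokens), "tokens:", tokens)          # 5 tokens
    print(len(types), "types:", sorted(types))     # 4 types
    print(len(terms), "terms:", sorted(terms))     # 3 terms: dream, perchance, sleep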
The major question of the tokenization phase is what are the correct tokens to use? In this example, it looks fairly trivial: you chop on whitespace and throw away punctuation characters. This is a starting point, but even for English there are a number of tricky cases. For example, what do you do about the various uses of the apostrophe for possession and contractions?

Mr. O’Neill thinks that the boys’ stories about Chile’s capital aren’t amusing.

For O’Neill, which of the following is the desired tokenization?

neill
oneill
o’neill
o’ neill
o neill ?

And for aren’t, is it:

aren’t
arent
are n’t
aren t ?

A simple strategy is to just split on all non-alphanumeric characters, but while o neill looks okay, aren t looks intuitively bad. For all of them, the choices determine which Boolean queries will match.
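Applied naively, the split-on-all-non-alphanumerics strategy produces exactly the o neill and aren t readings; a minimal sketch for illustration:

    import re

    def split_non_alnum(text):
        # Split on runs of non-alphanumeric characters, dropping empty pieces.
        return [t for t in re.split(r"[^0-9A-Za-z]+", text) if t]

    print(split_non_alnum("O'Neill"))   # ['O', 'Neill']
    print(split_non_alnum("aren't"))    # ['aren', 't']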
A query of neill AND capital will match in three cases but not the other two. In how many cases would a query of o’neill AND capital match? If no preprocessing of a query is done, then it would match in only one of the five cases. For either Boolean or free text queries, you always want to do the exact same tokenization of document and query words, generally by processing queries with the same tokenizer.
This guarantees that a sequence of characters in a text will always match the same sequence typed in a query.

² That is, as defined here, tokens that are not indexed (stop words) are not terms, and if multiple tokens are collapsed together via normalization, they are indexed as one term, under the normalized form. However, we later relax this definition when discussing classification and clustering in Chapters 13–18, where there is no index; in these chapters, we drop the requirement of inclusion in the dictionary, and a term means a normalized word.

These issues of tokenization are language-specific, and thus require the language of the document to be known. Language identification based on classifiers that use short character subsequences as features is highly effective; most languages have distinctive signature patterns (see page 46 for references).
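As a toy illustration of the character-subsequence idea (this is not the classifier the book has in mind; it merely scores overlap of character trigrams against profiles built from two invented training snippets):

    from collections import Counter

    def char_trigrams(text):
        # Character 3-grams, with padding spaces so word boundaries count too.
        text = " " + text.lower() + " "
        return Counter(text[i:i + 3] for i in range(len(text) - 2))

    # Tiny invented training snippets; a real system would use far more text.
    training = {
        "english": "the quick brown fox jumps over the lazy dog and the cat",
        "german": "der schnelle braune fuchs springt ueber den faulen hund und die katze",
    }
    profiles = {lang: char_trigrams(text) for lang, text in training.items()}

    def identify(text):
        grams = char_trigrams(text)
        # Score each language by how much of the text's trigram mass
        # also occurs in that language's profile.
        scores = {
            lang: sum(min(count, profile[g]) for g, count in grams.items())
            for lang, profile in profiles.items()
        }
        return max(scores, key=scores.get)

    print(identify("the dog and the fox"))      # english
    print(identify("der hund und der fuchs"))   # german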
For most languages and particular domains within them there are unusual specific tokens that we wish to recognize as terms, such as the programming languages C++ and C#, aircraft names like B-52, or a T.V. show name such as M*A*S*H – which is sufficiently integrated into popular culture that you find usages such as M*A*S*H-style hospitals.

Computer technology has introduced new types of character sequences that a tokenizer should probably tokenize as a single token, including email addresses (jblack@mail.yahoo.com), web URLs (http://stuff.big.com/new/specials.html), numeric IP addresses (142.32.48.231), package tracking numbers (1Z9999W99845399981), and more. One possible solution is to omit from indexing tokens such as monetary amounts, numbers, and URLs, since their presence greatly expands the size of the vocabulary. However, this comes at a large cost in restricting what people can search for.
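One common way to keep such sequences intact is to match them with dedicated patterns before falling back to ordinary word splitting. The sketch below uses deliberately simplified, illustrative patterns rather than anything production-grade:

    import re

    # Simplified patterns for illustration only; real email, URL, and IP
    # grammars are considerably more involved.
    TOKEN_RE = re.compile(
        r"""
        [\w.+-]+@[\w-]+(?:\.[\w-]+)+      # email address
      | https?://\S+                      # web URL
      | \d{1,3}(?:\.\d{1,3}){3}           # numeric IP address
      | \w+(?:[-']\w+)*                   # ordinary words, hyphens, apostrophes
        """,
        re.VERBOSE,
    )

    text = ("Email jblack@mail.yahoo.com or see "
            "http://stuff.big.com/new/specials.html from 142.32.48.231")
    print(TOKEN_RE.findall(text))
    # ['Email', 'jblack@mail.yahoo.com', 'or', 'see',
    #  'http://stuff.big.com/new/specials.html', 'from', '142.32.48.231']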