An Introduction to Information Retrieval. Manning, Raghavan (2009), page 6
We will then examine the Boolean retrieval model and how Boolean queries are processed (Sections 1.3 and 1.4).

1.1 An example information retrieval problem

A fat book which many people own is Shakespeare's Collected Works. Suppose you wanted to determine which plays of Shakespeare contain the words Brutus AND Caesar AND NOT Calpurnia. One way to do that is to start at the beginning and to read through all the text, noting for each play whether it contains Brutus and Caesar and excluding it from consideration if it contains Calpurnia.
The simplest form of document retrieval is for a computer to do this sort of linear scan through documents. This process is commonly referred to as grepping through text, after the Unix command grep, which performs this process. Grepping through text can be a very effective process, especially given the speed of modern computers, and often allows useful possibilities for wildcard pattern matching through the use of regular expressions.
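As a minimal sketch of such a linear scan (in Python; the play snippets here are invented stand-ins, not Shakespeare's actual text), the query above might be grepped as follows:

```python
import re

# Toy "collection": each play is one string (invented snippets for illustration).
plays = {
    "Julius Caesar": "Brutus spoke of Caesar while Calpurnia dreamed.",
    "Hamlet": "I did enact Julius Caesar: Brutus killed me.",
    "The Tempest": "Full fathom five thy father lies.",
}

def matches(text, term):
    # Whole-word, case-sensitive match, grep-style.
    return re.search(r"\b" + re.escape(term) + r"\b", text) is not None

# Linear scan: Brutus AND Caesar AND NOT Calpurnia.
answers = [title for title, text in plays.items()
           if matches(text, "Brutus") and matches(text, "Caesar")
           and not matches(text, "Calpurnia")]
print(answers)  # ['Hamlet']
```

Every document is read in full for every query, which is exactly the cost that indexing (below) avoids.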
With modern computers, for simple querying of modest collections (the size of Shakespeare's Collected Works is a bit under one million words of text in total), you really need nothing more. But for many purposes, you do need more:

1. To process large document collections quickly. The amount of online data has grown at least as quickly as the speed of computers, and we would now like to be able to search collections that total in the order of billions to trillions of words.

2. To allow more flexible matching operations. For example, it is impractical to perform the query Romans NEAR countrymen with grep, where NEAR might be defined as "within 5 words" or "within the same sentence".

3. To allow ranked retrieval: in many cases you want the best answer to an information need among many documents that contain certain words.

The way to avoid linearly scanning the texts for each query is to index the documents in advance. Let us stick with Shakespeare's Collected Works, and use it to introduce the basics of the Boolean retrieval model. Suppose we record for each document – here a play of Shakespeare's – whether it contains each word out of all the words Shakespeare used (Shakespeare used about 32,000 different words).
The result is a binary term-document incidence matrix, as in Figure 1.1. Terms are the indexed units (further discussed in Section 2.2); they are usually words, and for the moment you can think of them as words, but the information retrieval literature normally speaks of terms because some of them, such as perhaps I-9 or Hong Kong, are not usually thought of as words.

               Antony and  Julius   The      Hamlet  Othello  Macbeth
               Cleopatra   Caesar   Tempest
    Antony         1          1        0        0       0        1
    Brutus         1          1        0        1       0        0
    Caesar         1          1        0        1       1        1
    Calpurnia      0          1        0        0       0        0
    Cleopatra      1          0        0        0       0        0
    mercy          1          0        1        1       1        1
    worser         1          0        1        1       1        0
    ...

Figure 1.1  A term-document incidence matrix. Matrix element (t, d) is 1 if the play in column d contains the word in row t, and is 0 otherwise.

Now, depending on whether we look at the matrix rows or columns, we can have a vector for each term, which shows the documents it appears in, or a vector for each document, showing the terms that occur in it.[2]

To answer the query Brutus AND Caesar AND NOT Calpurnia, we take the vectors for Brutus, Caesar and Calpurnia, complement the last, and then do a bitwise AND:

    110100 AND 110111 AND 101111 = 100100

The answers for this query are thus Antony and Cleopatra and Hamlet (Figure 1.2). The Boolean retrieval model is a model for information retrieval in which we can pose any query which is in the form of a Boolean expression of terms, that is, in which terms are combined with the operators AND, OR, and NOT. The model views each document as just a set of words.

Let us now consider a more realistic scenario, simultaneously using the opportunity to introduce some terminology and notation.
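Before moving on, the bit-vector computation for Brutus AND Caesar AND NOT Calpurnia can be sketched directly (Python; the vectors are transcribed from Figure 1.1):

```python
# Term vectors from Figure 1.1, one bit per play, in column order:
plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]
vectors = {
    "Brutus":    0b110100,
    "Caesar":    0b110111,
    "Calpurnia": 0b010000,
}

N = len(plays)
mask = (1 << N) - 1  # six 1-bits, so NOT stays within the six columns

# Brutus AND Caesar AND NOT Calpurnia as bitwise operations.
result = vectors["Brutus"] & vectors["Caesar"] & (~vectors["Calpurnia"] & mask)
print(format(result, f"0{N}b"))  # 100100

# Decode the answer bits back into play titles (leftmost bit = first play).
answers = [plays[i] for i in range(N) if result & (1 << (N - 1 - i))]
print(answers)  # ['Antony and Cleopatra', 'Hamlet']
```

The mask is needed because Python integers have unbounded width, so a plain `~` would produce an infinite run of leading ones.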
Suppose we have N = 1 million documents. By documents we mean whatever units we have decided to build a retrieval system over. They might be individual memos or chapters of a book (see Section 2.1.2 (page 20) for further discussion). We will refer to the group of documents over which we perform retrieval as the (document) collection. It is sometimes also referred to as a corpus (a body of texts). Suppose each document is about 1000 words long (2–3 book pages). If we assume an average of 6 bytes per word including spaces and punctuation, then this is a document collection about 6 GB in size.

    Antony and Cleopatra, Act III, Scene ii
    Agrippa [Aside to Domitius Enobarbus]: Why, Enobarbus,
        When Antony found Julius Caesar dead,
        He cried almost to roaring; and he wept
        When at Philippi he found Brutus slain.

    Hamlet, Act III, Scene ii
    Lord Polonius: I did enact Julius Caesar: I was killed i' the
        Capitol; Brutus killed me.

Figure 1.2  Results from Shakespeare for the query Brutus AND Caesar AND NOT Calpurnia.

[2] Formally, we take the transpose of the matrix to be able to get the terms as column vectors.
Typically, there might be about M = 500,000 distinct terms in these documents. There is nothing special about the numbers we have chosen, and they might vary by an order of magnitude or more, but they give us some idea of the dimensions of the kinds of problems we need to handle. We will discuss and model these size assumptions in Section 5.1 (page 86).

Our goal is to develop a system to address the ad hoc retrieval task.
This is the most standard IR task. In it, a system aims to provide documents from within the collection that are relevant to an arbitrary user information need, communicated to the system by means of a one-off, user-initiated query. An information need is the topic about which the user desires to know more, and is differentiated from a query, which is what the user conveys to the computer in an attempt to communicate the information need.
A document is relevant if it is one that the user perceives as containing information of value with respect to their personal information need. Our example above was rather artificial in that the information need was defined in terms of particular words, whereas usually a user is interested in a topic like "pipeline leaks" and would like to find relevant documents regardless of whether they precisely use those words or express the concept with other words such as pipeline rupture. To assess the effectiveness of an IR system (i.e., the quality of its search results), a user will usually want to know two key statistics about the system's returned results for a query:

Precision: What fraction of the returned results are relevant to the information need?

Recall: What fraction of the relevant documents in the collection were returned by the system?

Detailed discussion of relevance and evaluation measures including precision and recall is found in Chapter 8.

We now cannot build a term-document matrix in a naive way.
A 500K × 1M matrix has half-a-trillion 0's and 1's – too many to fit in a computer's memory. But the crucial observation is that the matrix is extremely sparse, that is, it has few non-zero entries. Because each document is 1000 words long, the matrix has no more than one billion 1's, so a minimum of 99.8% of the cells are zero.
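The sparsity arithmetic can be checked directly (Python, using the collection sizes assumed above):

```python
# Collection sizes assumed in the text.
M = 500_000       # distinct terms
N = 1_000_000     # documents
words_per_doc = 1000

cells = M * N                  # 5e11 matrix cells ("half a trillion")
max_ones = N * words_per_doc   # at most 1e9 ones: each word token in a
                               # document can switch on at most one cell
min_zero_fraction = 1 - max_ones / cells

print(cells)              # 500000000000
print(max_ones)           # 1000000000
print(min_zero_fraction)  # 0.998
```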
A much better representation is to record only the things that do occur, that is, the 1 positions.

This idea is central to the first major concept in information retrieval, the inverted index. The name is actually redundant: an index always maps back from terms to the parts of a document where they occur. Nevertheless, inverted index, or sometimes inverted file, has become the standard term in information retrieval.[3] The basic idea of an inverted index is shown in Figure 1.3. We keep a dictionary of terms (sometimes also referred to as a vocabulary or lexicon; in this book, we use dictionary for the data structure and vocabulary for the set of terms). Then for each term, we have a list that records which documents the term occurs in.
Each item in the list – which records that a term appeared in a document (and, later, often, the positions in the document) – is conventionally called a posting.[4] The list is then called a postings list (or inverted list), and all the postings lists taken together are referred to as the postings.
The dictionary in Figure 1.3 has been sorted alphabetically and each postings list is sorted by document ID. We will see why this is useful in Section 1.3, below, but later we will also consider alternatives to doing this (Section 7.1.5).

1.2 A first take at building an inverted index

To gain the speed benefits of indexing at retrieval time, we have to build the index in advance. The major steps in this are:

1. Collect the documents to be indexed:
   Friends, Romans, countrymen. So let it be with Caesar . . .

2. Tokenize the text, turning each document into a list of tokens:
   Friends Romans countrymen So . .
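A minimal sketch of this build process in Python, under simplifying assumptions: the toy documents and their IDs are invented, and tokenization is just whitespace splitting with surrounding punctuation stripped, a stand-in for the fuller treatment in Chapter 2. The remaining work, grounded in the description above, is to build the dictionary and the postings lists, each sorted by document ID:

```python
from collections import defaultdict
import string

# Step 1: collect the documents to be indexed (toy collection, invented IDs).
docs = {
    1: "Friends, Romans, countrymen. So let it be with Caesar",
    2: "So let it be with Caesar. The noble Brutus hath told you",
}

def tokenize(text):
    # Step 2: split on whitespace and strip surrounding punctuation.
    return [tok.strip(string.punctuation) for tok in text.split()]

# Remaining steps: build the dictionary of terms and, for each term,
# a postings list of the documents it occurs in, sorted by document ID.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in tokenize(text):
        index[term].add(doc_id)

postings = {term: sorted(ids) for term, ids in sorted(index.items())}
print(postings["Caesar"])  # [1, 2]
print(postings["Brutus"])  # [2]
```

Using sets while collecting and sorting at the end gives duplicate-free, docID-ordered postings lists, matching the structure described for Figure 1.3.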