An introduction to information retrieval. Manning_ Raghavan (2009) (811397), страница 7

Файл №811397 An introduction to information retrieval. Manning_ Raghavan (2009) (An introduction to information retrieval. Manning_ Raghavan (2009).pdf) 7 страницаAn introduction to information retrieval. Manning_ Raghavan (2009) (811397) страница 72020-08-252020-08-25СтудИзба

An introduction to information retrieval. Manning_ Raghavan (2009).pdf

Просмтор этого файла доступен только зарегистрированным пользователям. Но у нас супер быстрая регистрация: достаточно только электронной почты!

Регистрация/авторизация

Текст из файла (страница 7)

.3. Some information retrieval researchers prefer the term inverted file, but expressions like index construction and index compression are much more common than inverted file construction andinverted file compression. For consistency, we use (inverted) index throughout this book.4. In a (non-positional) inverted index, a posting is just a document ID, but it is inherentlyassociated with a term, via the postings list it is placed on; sometimes we will also talk of a(term, docID) pair as a posting.Online edition (c) 2009 Cambridge UP71.2 A first take at building an inverted indexBrutus−→124113145173174Caesar−→124561657132Calpurnia−→23154101......| {z }Dictionary|{zPostings}◮ Figure 1.3 The two parts of an inverted index.

The dictionary is commonly keptin memory, with pointers to each postings list, which is stored on disk.3. Do linguistic preprocessing, producing a list of normalized tokens, whichare the indexing terms: friend roman countryman so . . .4. Index the documents that each term occurs in by creating an inverted index, consisting of a dictionary and postings.DOC IDSORTINGDOCUMENTFREQUENCYWe will define and discuss the earlier stages of processing, that is, steps 1–3,in Section 2.2 (page 22). Until then you can think of tokens and normalizedtokens as also loosely equivalent to words. Here, we assume that the first3 steps have already been done, and we examine building a basic invertedindex by sort-based indexing.Within a document collection, we assume that each document has a uniqueserial number, known as the document identifier (docID).

During index construction, we can simply assign successive integers to each new documentwhen it is first encountered. The input to indexing is a list of normalizedtokens for each document, which we can equally think of as a list of pairs ofterm and docID, as in Figure 1.4. The core indexing step is sorting this listso that the terms are alphabetical, giving us the representation in the middlecolumn of Figure 1.4. Multiple occurrences of the same term from the samedocument are then merged.5 Instances of the same term are then grouped,and the result is split into a dictionary and postings, as shown in the rightcolumn of Figure 1.4. Since a term generally occurs in a number of documents, this data organization already reduces the storage requirements ofthe index.

The dictionary also records some statistics, such as the number ofdocuments which contain each term (the document frequency, which is herealso the length of each postings list). This information is not vital for a basic Boolean search engine, but it allows us to improve the efficiency of the5.

Unix users can note that these steps are similar to use of the sort and then uniq commands.Online edition (c) 2009 Cambridge UP81 Boolean retrievalDoc 1I did enact Julius Caesar: I was killedi’ the Capitol; Brutus killed me.Doc 2So let it be with Caesar. The noble Brutushath told you Caesar was ambitious:termdocIDtermdocIDI1ambitious2term doc. freq.did1be2ambitious 1enact1brutus1be 1julius1brutus2brutus 2caesar1capitol1I1caesar1capitol 1was1caesar2caesar 2killed1caesar2did 1i’1did1enact 1the1enact1hath 1capitol1hath1brutus1I1I 1killed1I1i’ 1me1i’1=⇒=⇒ it 1so2it2julius 1let2julius1killed 1it2killed1be2killed1let 1with2let2me 1caesar2me1noble 1the2noble2so 1noble2so2the 2brutus2the1told 1hath2the2told2told2you 1you2you2was 2caesar2was1with 1was2was2ambitious2with2→→→→→→→→→→→→→→→→→→→→→→→postings lists221 → 211 → 21121121121221 → 2221 → 22◮ Figure 1.4 Building an index by sorting and grouping.

The sequence of termsin each document, tagged by their documentID (left) is sorted alphabetically (middle). Instances of the same term are then grouped by word and then by documentID.The terms and documentIDs are then separated out (right). The dictionary storesthe terms, and has a pointer to the postings list for each term.

It commonly alsostores other summary information such as, here, the document frequency of eachterm. We use this information for improving query time efficiency and, later, forweighting in ranked retrieval models. Each postings list stores the list of documentsin which a term occurs, and may store other information such as the term frequency(the frequency of each term in each document) or the position(s) of the term in eachdocument.Online edition (c) 2009 Cambridge UP1.2 A first take at building an inverted index9search engine at query time, and it is a statistic later used in many ranked retrieval models. The postings are secondarily sorted by docID.

This providesthe basis for efficient query processing. This inverted index structure is essentially without rivals as the most efficient structure for supporting ad hoctext search.In the resulting index, we pay for storage of both the dictionary and thepostings lists. The latter are much larger, but the dictionary is commonlykept in memory, while postings lists are normally kept on disk, so the sizeof each is important, and in Chapter 5 we will examine how each can beoptimized for storage and access efficiency. What data structure should beused for a postings list? A fixed length array would be wasteful as somewords occur in many documents, and others in very few.

For an in-memorypostings list, two good alternatives are singly linked lists or variable lengtharrays. Singly linked lists allow cheap insertion of documents into postingslists (following updates, such as when recrawling the web for updated documents), and naturally extend to more advanced indexing strategies such asskip lists (Section 2.3), which require additional pointers. Variable length arrays win in space requirements by avoiding the overhead for pointers and intime requirements because their use of contiguous memory increases speedon modern processors with memory caches.

Extra pointers can in practice beencoded into the lists as offsets. If updates are relatively infrequent, variablelength arrays will be more compact and faster to traverse. We can also use ahybrid scheme with a linked list of fixed length arrays for each term. Whenpostings lists are stored on disk, they are stored (perhaps compressed) as acontiguous run of postings without explicit pointers (as in Figure 1.3), so asto minimize the size of the postings list and the number of disk seeks to reada postings list into memory.?Exercise 1.1[ ⋆]Draw the inverted index that would be built for the following document collection.(See Figure 1.3 for an example.)Doc 1Doc 2Doc 3Doc 4new home sales top forecastshome sales rise in julyincrease in home sales in julyjuly new home sales riseExercise 1.2Consider these documents:Doc 1Doc 2Doc 3Doc 4breakthrough drug for schizophrenianew schizophrenia drugnew approach for treatment of schizophrenianew hopes for schizophrenia patientsa.

Draw the term-document incidence matrix for this document collection.Online edition (c) 2009 Cambridge UP[ ⋆]101 Boolean retrievalBrutusCalpurniaIntersection−→1 → 2 → 4 → 11 → 31 → 45 → 173 → 174−→2 → 31 → 54 → 101=⇒2 → 31◮ Figure 1.5 Intersecting the postings lists for Brutus and Calpurnia from Figure 1.3.b. Draw the inverted index representation for this collection, as in Figure 1.3 (page 7).Exercise 1.3[⋆]For the document collection shown in Exercise 1.2, what are the returned results forthese queries:a.

schizophrenia AND drugb. for AND NOT (drug OR approach)1.3SIMPLE CONJUNCTIVEQUERIES(1.1)Processing Boolean queriesHow do we process a query using an inverted index and the basic Booleanretrieval model? Consider processing the simple conjunctive query:Brutus AND Calpurniaover the inverted index partially shown in Figure 1.3 (page 7). We:1. Locate Brutus in the Dictionary2. Retrieve its postings3. Locate Calpurnia in the Dictionary4. Retrieve its postings5.

Intersect the two postings lists, as shown in Figure 1.5.POSTINGS LISTINTERSECTIONPOSTINGS MERGEThe intersection operation is the crucial one: we need to efficiently intersectpostings lists so as to be able to quickly find documents that contain bothterms. (This operation is sometimes referred to as merging postings lists:this slightly counterintuitive name reflects using the term merge algorithm fora general family of algorithms that combine multiple sorted lists by interleaved advancing of pointers through each; here we are merging the listswith a logical AND operation.)There is a simple and effective method of intersecting postings lists usingthe merge algorithm (see Figure 1.6): we maintain pointers into both listsOnline edition (c) 2009 Cambridge UP1.3 Processing Boolean queries11I NTERSECT ( p1, p2 )1 answer ← h i2 while p1 6= NIL and p2 6= NIL3 do if docID ( p1) = docID ( p2)4then A DD ( answer, docID ( p1 ))5p1 ← next( p1 )6p2 ← next( p2 )7else if docID ( p1) < docID ( p2)8then p1 ← next( p1 )9else p2 ← next( p2 )10 return answer◮ Figure 1.6 Algorithm for the intersection of two postings lists p1 and p2 .and walk through the two postings lists simultaneously, in time linear inthe total number of postings entries.

At each step, we compare the docIDpointed to by both pointers. If they are the same, we put that docID in theresults list, and advance both pointers. Otherwise we advance the pointerpointing to the smaller docID. If the lengths of the postings lists are x andy, the intersection takes O( x + y) operations. Formally, the complexity ofquerying is Θ( N ), where N is the number of documents in the collection.6Our indexing methods gain us just a constant, not a difference in Θ timecomplexity compared to a linear scan, but in practice the constant is huge.To use this algorithm, it is crucial that postings be sorted by a single globalordering.

Using a numeric sort by docID is one simple way to achieve this.We can extend the intersection operation to process more complicated querieslike:(1.2)QUERY OPTIMIZATION(1.3)(Brutus OR Caesar) AND NOT CalpurniaQuery optimization is the process of selecting how to organize the work of answering a query so that the least total amount of work needs to be done bythe system.

A major element of this for Boolean queries is the order in whichpostings lists are accessed. What is the best order for query processing? Consider a query that is an AND of t terms, for instance:Brutus AND Caesar AND CalpurniaFor each of the t terms, we need to get its postings, then AND them together.The standard heuristic is to process terms in order of increasing document6. The notation Θ(·) is used to express an asymptotically tight bound on the complexity ofan algorithm. Informally, this is often written as O (·), but this notation really expresses anasymptotic upper bound, which need not be tight (Cormen et al. 1990).Online edition (c) 2009 Cambridge UP121 Boolean retrievalI NTERSECT (ht1, . .

. , tn i)1 terms ← S ORT B Y I NCREASING F REQUENCY (ht1 , . . . , tn i)2 result ← postings( f irst(terms))3 terms ← rest(terms)4 while terms 6= NIL and result 6= NIL5 do result ← I NTERSECT (result, postings( f irst(terms)))6terms ← rest(terms)7 return result◮ Figure 1.7 Algorithm for conjunctive queries that returns the set of documentscontaining each term in the input list of terms.frequency: if we start by intersecting the two smallest postings lists, then allintermediate results must be no bigger than the smallest postings list, and weare therefore likely to do the least amount of total work.

So, for the postingslists in Figure 1.3 (page 7), we execute the above query as:(1.4)(Calpurnia AND Brutus) AND CaesarThis is a first justification for keeping the frequency of terms in the dictionary:it allows us to make this ordering decision based on in-memory data beforeaccessing any postings list.Consider now the optimization of more general queries, such as:(1.5)(madding OR crowd) AND (ignoble OR strife) AND (killed OR slain)As before, we will get the frequencies for all terms, and we can then (conservatively) estimate the size of each OR by the sum of the frequencies of itsdisjuncts. We can then process the query in increasing order of the size ofeach disjunctive term.For arbitrary Boolean queries, we have to evaluate and temporarily storethe answers for intermediate expressions in a complex expression.

Характеристики

Тип файла

PDF-файл

Размер

6,58 Mb

Материал

An introduction to information retrieval. Manning_ Raghavan (2009).pdf

Тип материала

Книга

Предмет

Анализ текстовых данных и информационный поиск

Высшее учебное заведение

МГУ им. Ломоносова

Список файлов книги

an-introduction-to-information-retrieval.-manning_-raghavan-2009.pdf.rar

An introduction to information retrieval. Manning_ Raghavan (2009).pdf

Поделитесь ссылкой:

Ставлю 10/10
Все нравится, очень удобный сайт, помогает в учебе. Кроме этого, можно заработать самому, выставляя готовые учебные материалы на продажу здесь. Рейтинги и отзывы на преподавателей очень помогают сориентироваться в начале нового семестра. Спасибо за такую функцию. Ставлю максимальную оценку.

Лучшая платформа для успешной сдачи сессии
Познакомился со СтудИзбой благодаря своему другу, очень нравится интерфейс, количество доступных файлов, цена, в общем, все прекрасно. Даже сам продаю какие-то свои работы.

Студизба ван лав ❤
Очень офигенный сайт для студентов. Много полезных учебных материалов. Пользуюсь студизбой с октября 2021 года. Серьёзных нареканий нет. Хотелось бы, что бы ввели подписочную модель и сделали материалы дешевле 300 рублей в рамках подписки бесплатными.

Отличный сайт
Лично меня всё устраивает - и покупка, и продажа; и цены, и возможность предпросмотра куска файла, и обилие бесплатных файлов (в подборках по авторам, читай, ВУЗам и факультетам). Есть определённые баги, но всё решаемо, да и администраторы реагируют в течение суток.

Маленький отзыв о большом помощнике!
Студизба спасает в те моменты, когда сроки горят, а работ накопилось достаточно. Довольно удобный сайт с простой навигацией и огромным количеством материалов.

Студ. Изба как крупнейший сборник работ для студентов
Тут дофига бывает всего полезного. Печально, что бывают предметы по которым даже одного бесплатного решения нет, но это скорее вопрос к студентам. В остальном всё здорово.

Спасательный островок
Если уже не успеваешь разобраться или застрял на каком-то задание поможет тебе быстро и недорого решить твою проблему.

Всё и так отлично
Всё очень удобно. Особенно круто, что есть система бонусов и можно выводить остатки денег. Очень много качественных бесплатных файлов.

Отзыв о системе "Студизба"
Отличная платформа для распространения работ, востребованных студентами. Хорошо налаженная и качественная работа сайта, огромная база заданий и аудитория.

Отличный помощник
Отличный сайт с кучей полезных файлов, позволяющий найти много методичек / учебников / отзывов о вузах и преподователях.

Отлично помогает студентам в любой момент для решения трудных и незамедлительных задач
Хотелось бы больше конкретной информации о преподавателях. А так в принципе хороший сайт, всегда им пользуюсь и ни разу не было желания прекратить. Хороший сайт для помощи студентам, удобный и приятный интерфейс. Из недостатков можно выделить только отсутствия небольшого количества файлов.

Спасибо за шикарный сайт
Великолепный сайт на котором студент за не большие деньги может найти помощь с дз, проектами курсовыми, лабораторными, а также узнать отзывы на преподавателей и бесплатно скачать пособия.