An introduction to information retrieval. Manning_ Raghavan (2009) (811397), страница 15

Файл №811397 An introduction to information retrieval. Manning_ Raghavan (2009) (An introduction to information retrieval. Manning_ Raghavan (2009).pdf) 15 страницаAn introduction to information retrieval. Manning_ Raghavan (2009) (811397) страница 152020-08-252020-08-25СтудИзба

An introduction to information retrieval. Manning_ Raghavan (2009).pdf

Просмтор этого файла доступен только зарегистрированным пользователям. Но у нас супер быстрая регистрация: достаточно только электронной почты!

Регистрация/авторизация

Текст из файла (страница 15)

. ibe, 178239:h 1, 2: h17, 25i;4, 5: h17, 191, 291, 430, 434i;5, 3: h14, 19, 101i; . . . i◮ Figure 2.11 Positional index example. The word to has a document frequency993,477, and occurs 6 times in document 1 at positions 7, 18, 33, etc.2.4.2POSITIONAL INDEX✎Positional indexesFor the reasons given, a biword index is not the standard solution. Rather,a positional index is most commonly employed. Here, for each term in thevocabulary, we store postings of the form docID: hposition1, position2, . .

. i,as shown in Figure 2.11, where each position is a token index in the document. Each posting will also usually record the term frequency, for reasonsdiscussed in Chapter 6.To process a phrase query, you still need to access the inverted index entries for each distinct term.

As before, you would start with the least frequentterm and then work to further restrict the list of possible candidates. In themerge operation, the same general technique is used as before, but ratherthan simply checking that both terms are in a document, you also need tocheck that their positions of appearance in the document are compatible withthe phrase query being evaluated.

This requires working out offsets betweenthe words.Example 2.1: Satisfying phrase queries. Suppose the postings lists for to andbe are as in Figure 2.11, and the query is “to be or not to be”. The postings lists to accessare: to, be, or, not. We will examine intersecting the postings lists for to and be. Wefirst look for documents that contain both terms. Then, we look for places in the listswhere there is an occurrence of be with a token index one higher than a position of to,and then we look for another occurrence of each word with token index 4 higher thanthe first occurrence. In the above lists, the pattern of occurrences that is a possiblematch is:to: h.

. . ; 4:h. . . ,429,433i; . . . ibe: h. . . ; 4:h. . . ,430,434i; . . . iOnline edition (c) 2009 Cambridge UP422 The term vocabulary and postings listsP OSITIONAL I NTERSECT ( p1 , p2, k)1 answer ← h i2 while p1 6= NIL and p2 6= NIL3 do if docID ( p1) = docID ( p2)4then l ← h i5pp1 ← positions( p1 )6pp2 ← positions( p2 )7while pp1 6= NIL8do while pp2 6= NIL9do if | pos( pp1) − pos( pp2)| ≤ k10then A DD (l, pos( pp2))11else if pos( pp2) > pos( pp1)12then break13pp2 ← next( pp2 )14while l 6= h i and |l [0] − pos( pp1)| > k15do D ELETE(l [0])16for each ps ∈ l17do A DD ( answer, hdocID ( p1 ), pos( pp1), psi)18pp1 ← next( pp1 )19p1 ← next( p1 )20p2 ← next( p2 )21else if docID ( p1) < docID ( p2)22then p1 ← next( p1 )23else p2 ← next( p2 )24 return answer◮ Figure 2.12 An algorithm for proximity intersection of postings lists p1 and p2 .The algorithm finds places where the two terms appear within k words of each otherand returns a list of triples giving docID and the term position in p1 and p2 .The same general method is applied for within k word proximity searches,of the sort we saw in Example 1.1 (page 15):employment /3 placeHere, /k means “within k words of (on either side)”.

Clearly, positional indexes can be used for such queries; biword indexes cannot. We show inFigure 2.12 an algorithm for satisfying within k word proximity searches; itis further discussed in Exercise 2.12.Positional index size. Adopting a positional index expands required postings storage significantly, even if we compress position values/offsets as weOnline edition (c) 2009 Cambridge UP2.4 Positional postings and phrase queries43will discuss in Section 5.3 (page 95). Indeed, moving to a positional indexalso changes the asymptotic complexity of a postings intersection operation,because the number of items to check is now bounded not by the number ofdocuments but by the total number of tokens in the document collection T.That is, the complexity of a Boolean query is Θ( T ) rather than Θ( N ).

However, most applications have little choice but to accept this, since most usersnow expect to have the functionality of phrase and proximity searches.Let’s examine the space implications of having a positional index. A posting now needs an entry for each occurrence of a term. The index size thusdepends on the average document size. The average web page has less than1000 terms, but documents like SEC stock filings, books, and even some epicpoems easily reach 100,000 terms. Consider a term with frequency 1 in 1000terms on average.

The result is that large documents cause an increase of twoorders of magnitude in the space required to store the postings list:Document size1000100,000Expectedpostings11Expected entriesin positional posting1100While the exact numbers depend on the type of documents and the languagebeing indexed, some rough rules of thumb are to expect a positional index tobe 2 to 4 times as large as a non-positional index, and to expect a compressedpositional index to be about one third to one half the size of the raw text(after removal of markup, etc.) of the original uncompressed documents.Specific numbers for an example collection are given in Table 5.1 (page 87)and Table 5.6 (page 103).2.4.3Combination schemesThe strategies of biword indexes and positional indexes can be fruitfullycombined. If users commonly query on particular phrases, such as MichaelJackson, it is quite inefficient to keep merging positional postings lists.

Acombination strategy uses a phrase index, or just a biword index, for certainqueries and uses a positional index for other phrase queries. Good queriesto include in the phrase index are ones known to be common based on recent querying behavior. But this is not the only criterion: the most expensivephrase queries to evaluate are ones where the individual words are common but the desired phrase is comparatively rare. Adding Britney Spears asa phrase index entry may only give a speedup factor to that query of about3, since most documents that mention either word are valid results, whereasadding The Who as a phrase index entry may speed up that query by a factorof 1000.

Hence, having the latter is more desirable, even if it is a relativelyless common query.Online edition (c) 2009 Cambridge UP442 The term vocabulary and postings listsNEXT WORD INDEX?Williams et al. (2004) evaluate an even more sophisticated scheme whichemploys indexes of both these sorts and additionally a partial next wordindex as a halfway house between the first two strategies. For each term, anext word index records terms that follow it in a document. They concludethat such a strategy allows a typical mixture of web phrase queries to becompleted in one quarter of the time taken by use of a positional index alone,while taking up 26% more space than use of a positional index alone.Exercise 2.8[⋆]Assume a biword index.

Give an example of a document which will be returnedfor a query of New York University but is actually a false positive which should not bereturned.Exercise 2.9[⋆]Shown below is a portion of a positional index in the format: term: doc1: hposition1,position2, . . . i; doc2: hposition1, position2, . . . i; etc.angels: 2: h36,174,252,651i; 4: h12,22,102,432i; 7: h17i;fools: 2: h1,17,74,222i; 4: h8,78,108,458i; 7: h3,13,23,193i;fear: 2: h87,704,722,901i; 4: h13,43,113,433i; 7: h18,328,528i;in: 2: h3,37,76,444,851i; 4: h10,20,110,470,500i; 7: h5,15,25,195i;rush: 2: h2,66,194,321,702i; 4: h9,69,149,429,569i; 7: h4,14,404i;to: 2: h47,86,234,999i; 4: h14,24,774,944i; 7: h199,319,599,709i;tread: 2: h57,94,333i; 4: h15,35,155i; 7: h20,320i;where: 2: h67,124,393,1001i; 4: h11,41,101,421,431i; 7: h16,36,736i;Which document(s) if any match each of the following queries, where each expressionwithin quotes is a phrase query?a.

“fools rush in”b. “fools rush in” AND “angels fear to tread”Exercise 2.10[⋆]Consider the following fragment of a positional index with the format:word: document: hposition, position, . . . i; document: hposition, . . . i...Gates: 1: h3i; 2: h6i; 3: h2,17i; 4: h1i;IBM: 4: h3i; 7: h14i;Microsoft: 1: h1i; 2: h1,21i; 3: h3i; 5: h16,22,51i;The /k operator, word1 /k word2 finds occurrences of word1 within k words of word2 (oneither side), where k is a positive integer argument. Thus k = 1 demands that word1be adjacent to word2.a.

Describe the set of documents that satisfy the query Gates /2 Microsoft.b. Describe each set of values for k for which the query Gates /k Microsoft returns adifferent set of documents as the answer.Online edition (c) 2009 Cambridge UP2.5 References and further readingExercise 2.1145[⋆⋆]Consider the general procedure for merging two positional postings lists for a givendocument, to determine the document positions where a document satisfies a /kclause (in general there can be multiple positions at which each term occurs in a single document).

We begin with a pointer to the position of occurrence of each termand move each pointer along the list of occurrences in the document, checking as wedo so whether we have a hit for /k. Each move of either pointer counts as a step.

LetL denote the total number of occurrences of the two terms in the document. What isthe big-O complexity of the merge procedure, if we wish to have postings includingpositions in the result?Exercise 2.12[⋆⋆]Consider the adaptation of the basic algorithm for intersection of two postings lists(Figure 1.6, page 11) to the one in Figure 2.12 (page 42), which handles proximityqueries. A naive algorithm for this operation could be O( PLmax 2 ), where P is thesum of the lengths of the postings lists (i.e., the sum of document frequencies) andLmax is the maximum length of a document (in tokens).a. Go through this algorithm carefully and explain how it works.b.

What is the complexity of this algorithm? Justify your answer carefully.c. For certain queries and data distributions, would another algorithm be more efficient? What complexity does it have?Exercise 2.13[⋆⋆]Suppose we wish to use a postings intersection procedure to determine simply thelist of documents that satisfy a /k clause, rather than returning the list of positions,as in Figure 2.12 (page 42). For simplicity, assume k ≥ 2. Let L denote the totalnumber of occurrences of the two terms in the document collection (i.e., the sum oftheir collection frequencies). Which of the following is true? Justify your answer.a.

The merge can be accomplished in a number of steps linear in L and independentof k, and we can ensure that each pointer moves only to the right.b. The merge can be accomplished in a number of steps linear in L and independentof k, but a pointer may be forced to move non-monotonically (i.e., to sometimesback up)c. The merge can require kL steps in some cases.Exercise 2.14[⋆⋆]How could an IR system combine use of a positional index and use of stop words?What is the potential problem, and how could it be handled?2.5E AST A SIANLANGUAGESReferences and further readingExhaustive discussion of the character-level processing of East Asian languages can be found in Lunde (1998). Character bigram indexes are perhapsthe most standard approach to indexing Chinese, although some systems useword segmentation.

Характеристики

Тип файла

PDF-файл

Размер

6,58 Mb

Материал

An introduction to information retrieval. Manning_ Raghavan (2009).pdf

Тип материала

Книга

Предмет

Анализ текстовых данных и информационный поиск

Высшее учебное заведение

МГУ им. Ломоносова

Список файлов книги

an-introduction-to-information-retrieval.-manning_-raghavan-2009.pdf.rar

An introduction to information retrieval. Manning_ Raghavan (2009).pdf

Поделитесь ссылкой:

Ставлю 10/10
Все нравится, очень удобный сайт, помогает в учебе. Кроме этого, можно заработать самому, выставляя готовые учебные материалы на продажу здесь. Рейтинги и отзывы на преподавателей очень помогают сориентироваться в начале нового семестра. Спасибо за такую функцию. Ставлю максимальную оценку.

Лучшая платформа для успешной сдачи сессии
Познакомился со СтудИзбой благодаря своему другу, очень нравится интерфейс, количество доступных файлов, цена, в общем, все прекрасно. Даже сам продаю какие-то свои работы.

Студизба ван лав ❤
Очень офигенный сайт для студентов. Много полезных учебных материалов. Пользуюсь студизбой с октября 2021 года. Серьёзных нареканий нет. Хотелось бы, что бы ввели подписочную модель и сделали материалы дешевле 300 рублей в рамках подписки бесплатными.

Отличный сайт
Лично меня всё устраивает - и покупка, и продажа; и цены, и возможность предпросмотра куска файла, и обилие бесплатных файлов (в подборках по авторам, читай, ВУЗам и факультетам). Есть определённые баги, но всё решаемо, да и администраторы реагируют в течение суток.

Маленький отзыв о большом помощнике!
Студизба спасает в те моменты, когда сроки горят, а работ накопилось достаточно. Довольно удобный сайт с простой навигацией и огромным количеством материалов.

Студ. Изба как крупнейший сборник работ для студентов
Тут дофига бывает всего полезного. Печально, что бывают предметы по которым даже одного бесплатного решения нет, но это скорее вопрос к студентам. В остальном всё здорово.

Спасательный островок
Если уже не успеваешь разобраться или застрял на каком-то задание поможет тебе быстро и недорого решить твою проблему.

Всё и так отлично
Всё очень удобно. Особенно круто, что есть система бонусов и можно выводить остатки денег. Очень много качественных бесплатных файлов.

Отзыв о системе "Студизба"
Отличная платформа для распространения работ, востребованных студентами. Хорошо налаженная и качественная работа сайта, огромная база заданий и аудитория.

Отличный помощник
Отличный сайт с кучей полезных файлов, позволяющий найти много методичек / учебников / отзывов о вузах и преподователях.

Отлично помогает студентам в любой момент для решения трудных и незамедлительных задач
Хотелось бы больше конкретной информации о преподавателях. А так в принципе хороший сайт, всегда им пользуюсь и ни разу не было желания прекратить. Хороший сайт для помощи студентам, удобный и приятный интерфейс. Из недостатков можно выделить только отсутствия небольшого количества файлов.

Спасибо за шикарный сайт
Великолепный сайт на котором студент за не большие деньги может найти помощь с дз, проектами курсовыми, лабораторными, а также узнать отзывы на преподавателей и бесплатно скачать пособия.