An Introduction to Information Retrieval. Manning, Raghavan (2009) — excerpt from Chapter 5, Index compression

The treatment of the most frequent words is also important. The rule of 30 states that the 30 most common words account for 30% of the tokens in written text (31% in the table). Eliminating the 150 most common words from indexing (as stop words; cf. Section 2.2.2, page 27) cuts 25% to 30% of the nonpositional postings. But, although a stop list of 150 words reduces the number of postings by a quarter or more, this size reduction does not carry over to the size of the compressed index.

As we will see later in this chapter, the postings lists of frequent words require only a few bits per posting after compression. The deltas in the table are in a range typical of large collections. Note, however, that the percentage reductions can be very different for some text collections. For example, for a collection of web pages with a high proportion of French text, a lemmatizer for French reduces vocabulary size much more than the Porter stemmer does for an English-only collection because French is a morphologically richer language than English.

◮ Table 5.1 The effect of preprocessing on the number of terms, nonpositional postings, and tokens for Reuters-RCV1. “∆%” indicates the reduction in size from the previous line, except that “30 stop words” and “150 stop words” both use “case folding” as their reference line. “T%” is the cumulative (“total”) reduction from unfiltered. We performed stemming with the Porter stemmer (Chapter 2, page 33).

                 (distinct) terms        nonpositional postings       tokens (= number of position
                                                                      entries in postings)
                 number    ∆%   T%       number        ∆%   T%       number        ∆%   T%
unfiltered       484,494                 109,971,179                 197,879,290
no numbers       473,723   −2   −2       100,680,242   −8   −8       179,158,204   −9   −9
case folding     391,523   −17  −19       96,969,056   −3   −12      179,158,204   −0   −9
30 stop words    391,493   −0   −19       83,390,443   −14  −24      121,857,825   −31  −38
150 stop words   391,373   −0   −19       67,001,847   −30  −39       94,516,599   −47  −52
stemming         322,383   −17  −33       63,812,300   −4   −42       94,516,599   −0   −52
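The ∆% and T% columns are ordinary percentage reductions; the only subtlety is the reference line for ∆%. As a check, the short Python sketch below recomputes both columns for the nonpositional postings from the raw counts in the table; the variable names are mine, and the printed values carry one decimal place, so they differ slightly from the rounded figures above.

```python
# Recompute the Delta% and T% columns of Table 5.1 for nonpositional postings.
counts = {
    "unfiltered":     109_971_179,
    "no numbers":     100_680_242,
    "case folding":    96_969_056,
    "30 stop words":   83_390_443,
    "150 stop words":  67_001_847,
    "stemming":        63_812_300,
}

# Reference line for Delta%: normally the previous line, except that both
# stop word rows are compared against "case folding", as the caption states.
reference = {
    "no numbers":     "unfiltered",
    "case folding":   "no numbers",
    "30 stop words":  "case folding",
    "150 stop words": "case folding",
    "stemming":       "150 stop words",
}

def reduction(new, old):
    """Percentage change of new relative to old (negative means smaller)."""
    return 100.0 * (new - old) / old

for line, n in counts.items():
    total = reduction(n, counts["unfiltered"])
    if line in reference:
        delta = reduction(n, counts[reference[line]])
        print(f"{line:15s} {n:>12,d}  delta% = {delta:6.1f}  T% = {total:6.1f}")
    else:
        print(f"{line:15s} {n:>12,d}")
```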

The compression techniques we describe in the remainder of this chapter are lossless, that is, all information is preserved. Better compression ratios can be achieved with lossy compression, which discards some information. Case folding, stemming, and stop word elimination are forms of lossy compression. Similarly, the vector space model (Chapter 6) and dimensionality reduction techniques like latent semantic indexing (Chapter 18) create compact representations from which we cannot fully restore the original collection. Lossy compression makes sense when the “lost” information is unlikely ever to be used by the search system.

For example, web search is characterized by a large number of documents, short queries, and users who only look at the first few pages of results. As a consequence, we can discard postings of documents that would only be used for hits far down the list. Thus, there are retrieval scenarios where lossy methods can be used for compression without any reduction in effectiveness.

Before introducing techniques for compressing the dictionary, we want to estimate the number of distinct terms M in a collection.

It is sometimes said that languages have a vocabulary of a certain size. The second edition of the Oxford English Dictionary (OED) defines more than 600,000 words. But the vocabulary of most large collections is much larger than the OED. The OED does not include most names of people, locations, products, or scientific entities like genes. These names need to be included in the inverted index, so our users can search for them.

◮ Figure 5.1 Heaps’ law. Vocabulary size M as a function of collection size T (number of tokens) for Reuters-RCV1. For these data, the dashed line log10 M = 0.49 · log10 T + 1.64 is the best least-squares fit. Thus, k = 10^1.64 ≈ 44 and b = 0.49. [Log–log plot of log10 M against log10 T; plot not reproduced.]

5.1.1 Heaps’ law: Estimating the number of terms

A better way of getting a handle on M is Heaps’ law, which estimates vocabulary size as a function of collection size:

(5.1)    M = k T^b

where T is the number of tokens in the collection.

Typical values for the parameters k and b are: 30 ≤ k ≤ 100 and b ≈ 0.5. The motivation for Heaps’ law is that the simplest possible relationship between collection size and vocabulary size is linear in log–log space, and the assumption of linearity is usually borne out in practice, as shown in Figure 5.1 for Reuters-RCV1. In this case, the fit is excellent for T > 10^5 = 100,000, for the parameter values b = 0.49 and k = 44.

For example, for the first 1,000,020 tokens Heaps’ law predicts 38,323 terms:

    44 × 1,000,020^0.49 ≈ 38,323.

The actual number is 38,365 terms, very close to the prediction. The parameter k is quite variable because vocabulary growth depends a lot on the nature of the collection and how it is processed. Case-folding and stemming reduce the growth rate of the vocabulary, whereas including numbers and spelling errors increase it.
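The fit and the prediction above are easy to reproduce. The sketch below estimates k and b by an ordinary least-squares fit of log10 M against log10 T and then applies Equation (5.1); the sample points are invented stand-ins rather than the actual Reuters-RCV1 measurements, and the function names are mine.

```python
import math

def fit_heaps(samples):
    """Least-squares fit of log10 M = b * log10 T + log10 k over (T, M) pairs."""
    xs = [math.log10(t) for t, _ in samples]
    ys = [math.log10(m) for _, m in samples]
    n = len(samples)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    log_k = mean_y - b * mean_x
    return 10 ** log_k, b

def heaps_predict(k, b, tokens):
    """Vocabulary size predicted by Heaps' law, M = k * T^b."""
    return k * tokens ** b

# Invented (T, M) sample points; in practice these would come from indexing
# successively larger prefixes of the collection.
samples = [(10_000, 3_000), (100_000, 10_000),
           (1_000_000, 33_000), (10_000_000, 100_000)]
k, b = fit_heaps(samples)

# With the Reuters-RCV1 parameters from the text (k = 44, b = 0.49),
# the first 1,000,020 tokens give the prediction of roughly 38,300 terms.
print(round(heaps_predict(44, 0.49, 1_000_020)))
```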

Regardless of the values of the parameters for a particular collection, Heaps’ law suggests that (i) the dictionary size continues to increase with more documents in the collection, rather than a maximum vocabulary size being reached, and (ii) the size of the dictionary is quite large for large collections. These two hypotheses have been empirically shown to be true of large text collections (Section 5.4). So dictionary compression is important for an effective information retrieval system.

5.1.2 Zipf’s law: Modeling the distribution of terms

We also want to understand how terms are distributed across documents. This helps us to characterize the properties of the algorithms for compressing postings lists in Section 5.3. A commonly used model of the distribution of terms in a collection is Zipf’s law.

It states that, if t1 is the most common term in the collection, t2 is the next most common, and so on, then the collection frequency cf_i of the ith most common term is proportional to 1/i:

(5.2)    cf_i ∝ 1/i.

So if the most frequent term occurs cf_1 times, then the second most frequent term has half as many occurrences, the third most frequent term a third as many occurrences, and so on. The intuition is that frequency decreases very rapidly with rank. Equation (5.2) is one of the simplest ways of formalizing such a rapid decrease and it has been found to be a reasonably good model.

Equivalently, we can write Zipf’s law as cf_i = c · i^k or as log cf_i = log c + k log i, where k = −1 and c is a constant to be defined in Section 5.3.2.

It is therefore a power law with exponent k = −1. See Chapter 19, page 426, for another power law, a law characterizing the distribution of links on web pages.

The log–log graph in Figure 5.2 plots the collection frequency of a term as a function of its rank for Reuters-RCV1. A line with slope −1, corresponding to the Zipf function log cf_i = log c − log i, is also shown. The fit of the data to the law is not particularly good, but good enough to serve as a model for term distributions in our calculations in Section 5.3.
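To estimate the exponent on one's own data, one can fit log cf_i against log i by least squares. The sketch below does a plain unweighted fit (Figure 5.2 uses a weighted fit) on synthetic frequencies; the function and variable names are mine.

```python
import math

def fit_zipf(collection_freqs):
    """Unweighted least-squares fit of log10 cf_i = log10 c + k * log10 i.

    collection_freqs must be sorted in decreasing order; if the data follow
    Zipf's law, the returned exponent k is close to -1.
    """
    xs = [math.log10(i) for i in range(1, len(collection_freqs) + 1)]
    ys = [math.log10(cf) for cf in collection_freqs]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    k = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    log_c = mean_y - k * mean_x
    return 10 ** log_c, k

# Synthetic frequencies that follow cf_i = c / i exactly recover k = -1.
c, k = fit_zipf([1_000_000 / i for i in range(1, 1001)])
print(round(k, 2))   # -1.0
```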

◮ Figure 5.2 Zipf’s law for Reuters-RCV1. Frequency is plotted as a function of frequency rank for the terms in the collection. The line is the distribution predicted by Zipf’s law (weighted least-squares fit; intercept is 6.95). [Log–log plot of log10 cf against log10 rank; plot not reproduced.]

Exercise 5.1 [⋆] Assuming one machine word per posting, what is the size of the uncompressed (nonpositional) index for different tokenizations based on Table 5.1? How do these numbers compare with Table 5.6?

5.2 Dictionary compression

This section presents a series of dictionary data structures that achieve increasingly higher compression ratios.

The dictionary is small compared with the postings file as suggested by Table 5.1. So why compress it if it is responsible for only a small percentage of the overall space requirements of the IR system?

One of the primary factors in determining the response time of an IR system is the number of disk seeks necessary to process a query. If parts of the dictionary are on disk, then many more disk seeks are necessary in query evaluation. Thus, the main goal of compressing the dictionary is to fit it in main memory, or at least a large portion of it, to support high query throughput.

◮ Figure 5.3 Storing the dictionary as an array of fixed-width entries.

term             document frequency   pointer to postings list
a                656,265              −→
aachen           65                   −→
...              ...                  ...
zulu             221                  −→

space needed: term 20 bytes, document frequency 4 bytes, postings pointer 4 bytes.

Although dictionaries of very large collections fit into the memory of astandard desktop machine, this is not true of many other application scenarios. For example, an enterprise search server for a large corporation mayhave to index a multiterabyte collection with a comparatively large vocabulary because of the presence of documents in many different languages.We also want to be able to design search systems for limited hardware suchas mobile phones and onboard computers. Other reasons for wanting toconserve memory are fast startup time and having to share resources withother applications. The search system on your PC must get along with thememory-hogging word processing suite you are using at the same time.5.2.1Dictionary as a stringThe simplest data structure for the dictionary is to sort the vocabulary lexicographically and store it in an array of fixed-width entries as shown inFigure 5.3.

We allocate 20 bytes for the term itself (because few terms have more than twenty characters in English), 4 bytes for its document frequency, and 4 bytes for the pointer to its postings list. Four-byte pointers resolve a 4 gigabyte (GB) address space. For large collections like the web, we need to allocate more bytes per pointer. We look up terms in the array by binary search.

For Reuters-RCV1, we need M × (20 + 4 + 4) = 400,000 × 28 = 11.2 megabytes (MB) for storing the dictionary in this scheme.
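Here is a minimal sketch of this fixed-width layout, assuming a struct-style packing of a 20-byte term field with two 4-byte integers; this is one plausible realization of the layout in Figure 5.3, not something prescribed by the text.

```python
import struct

# One fixed-width dictionary entry: a 20-byte term field, a 4-byte document
# frequency, and a 4-byte pointer to the postings list, as in Figure 5.3.
TERM_BYTES = 20
ENTRY = struct.Struct(f"<{TERM_BYTES}sII")   # 20 + 4 + 4 = 28 bytes per entry

def pack_entry(term, doc_freq, postings_ptr):
    """Pack one entry; terms longer than 20 bytes would simply be truncated."""
    return ENTRY.pack(term.encode("ascii")[:TERM_BYTES], doc_freq, postings_ptr)

M = 400_000                                  # roughly the Reuters-RCV1 vocabulary
print(ENTRY.size, "bytes per entry")         # 28
print(M * ENTRY.size / 1e6, "MB in total")   # 11.2 MB, as computed in the text
```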

Using fixed-width entries for terms is clearly wasteful. The average length of a term in English is about eight characters (Table 4.2, page 70), so on average we are wasting twelve characters in the fixed-width scheme. Also, we have no way of storing terms with more than twenty characters like hydrochlorofluorocarbons and supercalifragilisticexpialidocious. We can overcome these shortcomings by storing the dictionary terms as one long string of characters, as shown in Figure 5.4. The pointer to the next term is also used to demarcate the end of the current term.

◮ Figure 5.4 Dictionary-as-a-string storage. Pointers mark the end of the preceding term and the beginning of the next. For example, the first three terms in this example are systile, syzygetic, and syzygial.

    . . . systilesyzygeticsyzygialsyzygyszaibelyiteszecinszono . . .

freq.     postings ptr.   term ptr. (into the string above)
9         −→              −→
92        −→              −→
5         −→              −→
71        −→              −→
12        −→              −→
...       ...             ...
4 bytes   4 bytes         3 bytes

As before, we locate terms in the data structure by way of binary search in the (now smaller) table. This scheme saves us 60% compared to fixed-width storage: 12 bytes on average of the 20 bytes we allocated for terms before. However, we now also need to store term pointers. The term pointers resolve 400,000 × 8 = 3.2 × 10^6 positions, so they need to be log2(3.2 × 10^6) ≈ 22 bits or 3 bytes long. In this new scheme, we need 400,000 × (4 + 4 + 3 + 8) = 7.6 MB for the Reuters-RCV1 dictionary: 4 bytes each for frequency and postings pointer, 3 bytes for the term pointer, and 8 bytes on average for the term.
