An introduction to information retrieval. Manning_ Raghavan (2009) (811397), страница 27

Файл №811397 An introduction to information retrieval. Manning_ Raghavan (2009) (An introduction to information retrieval. Manning_ Raghavan (2009).pdf) 27 страницаAn introduction to information retrieval. Manning_ Raghavan (2009) (811397) страница 272020-08-252020-08-25СтудИзба

An introduction to information retrieval. Manning_ Raghavan (2009).pdf

Просмтор этого файла доступен только зарегистрированным пользователям. Но у нас супер быстрая регистрация: достаточно только электронной почты!

Регистрация/авторизация

Текст из файла (страница 27)

The1. Note that the origin is 0 in the table. Because we never need to encode a docID or a gap of0, in practice the origin is usually 1, so that 10000000 encodes 1, 10000101 encodes 6 (not 5 as inthe table), and so on.Online edition (c) 2009 Cambridge UP995.3 Postings file compressionUNARY CODEγ ENCODINGENTROPYsimplest bit-level code is unary code. The unary code of n is a string of n 1sfollowed by a 0 (see the first two columns of Table 5.5). Obviously, this is nota very efficient code, but it will come in handy in a moment.How efficient can a code be in principle? Assuming the 2n gaps G with1 ≤ G ≤ 2n are all equally likely, the optimal encoding uses n bits for eachG. So some gaps (G = 2n in this case) cannot be encoded with fewer thanlog2 G bits.

Our goal is to get as close to this lower bound as possible.A method that is within a factor of optimal is γ encoding. γ codes implement variable-length encoding by splitting the representation of a gap Ginto a pair of length and offset. Offset is G in binary, but with the leading 1removed.2 For example, for 13 (binary 1101) offset is 101. Length encodes thelength of offset in unary code. For 13, the length of offset is 3 bits, which is 1110in unary.

The γ code of 13 is therefore 1110101, the concatenation of length1110 and offset 101. The right hand column of Table 5.5 gives additionalexamples of γ codes.A γ code is decoded by first reading the unary code up to the 0 that terminates it, for example, the four bits 1110 when decoding 1110101. Now weknow how long the offset is: 3 bits. The offset 101 can then be read correctlyand the 1 that was chopped off in encoding is prepended: 101 → 1101 = 13.The length of offset is ⌊log2 G ⌋ bits and the length of length is ⌊log2 G ⌋ + 1bits, so the length of the entire code is 2 × ⌊log2 G ⌋ + 1 bits.

γ codes arealways of odd length and they are within a factor of 2 of what we claimedto be the optimal encoding length log2 G. We derived this optimum fromthe assumption that the 2n gaps between 1 and 2n are equiprobable. But thisneed not be the case. In general, we do not know the probability distributionover gaps a priori.The characteristic of a discrete probability distribution3 P that determinesits coding properties (including whether a code is optimal) is its entropy H ( P),which is defined as follows:H ( P) = −∑x∈XP( x ) log2 P( x )where X is the set of all possible numbers we need to be able to encode(and therefore ∑ x ∈ X P( x ) = 1.0).

Entropy is a measure of uncertainty asshown in Figure 5.9 for a probability distribution P over two possible outcomes, namely, X = { x1 , x2 }. Entropy is maximized (H ( P) = 1) for P( x1 ) =P( x2 ) = 0.5 when uncertainty about which xi will appear next is largest; and2. We assume here that G has no leading 0s. If there are any, they are removed before deletingthe leading 1.3. Readers who want to review basic concepts of probability theory may want to consult Rice(2006) or Ross (2006).

Note that we are interested in probability distributions over integers (gaps,frequencies, etc.), but that the coding properties of a probability distribution are independent ofwhether the outcomes are integers or something else.Online edition (c) 2009 Cambridge UP1000.00.20.4H(P)0.60.81.05 Index compression0.00.20.40.6P(x1)0.81.0◮ Figure 5.9 Entropy H ( P ) as a function of P ( x1 ) for a sample space with twooutcomes x1 and x2 .minimized (H ( P) = 0) for P( x1 ) = 1, P( x2 ) = 0 and for P( x1 ) = 0, P( x2) = 1when there is absolute certainty.It can be shown that the lower bound for the expected length E( L) of acode L is H ( P) if certain conditions hold (see the references).

It can furtherbe shown that for 1 < H ( P) < ∞, γ encoding is within a factor of 3 of thisoptimal encoding, approaching 2 for large H ( P):E( Lγ )1≤ 2+≤ 3.H ( P)H ( P)UNIVERSAL CODEPREFIX FREEPARAMETER FREEWhat is remarkable about this result is that it holds for any probability distribution P. So without knowing anything about the properties of the distribution of gaps, we can apply γ codes and be certain that they are within a factorof ≈ 2 of the optimal code for distributions of large entropy. A code like γcode with the property of being within a factor of optimal for an arbitrarydistribution P is called universal.In addition to universality, γ codes have two other properties that are useful for index compression. First, they are prefix free, namely, no γ code is theprefix of another.

This means that there is always a unique decoding of asequence of γ codes – and we do not need delimiters between them, whichwould decrease the efficiency of the code. The second property is that γcodes are parameter free. For many other efficient codes, we have to fit theparameters of a model (e.g., the binomial distribution) to the distributionOnline edition (c) 2009 Cambridge UP1015.3 Postings file compressionof gaps in the index.

This complicates the implementation of compressionand decompression. For instance, the parameters need to be stored and retrieved. And in dynamic indexing, the distribution of gaps can change, sothat the original parameters are no longer appropriate. These problems areavoided with a parameter-free code.How much compression of the inverted index do γ codes achieve? Toanswer this question we use Zipf’s law, the term distribution model introduced in Section 5.1.2. According to Zipf’s law, the collection frequency cfiis proportional to the inverse of the rank i, that is, there is a constant c′ suchthat:cfi =(5.3)c′.iWe can choose a different constant c such that the fractions c/i are relativefrequencies and sum to 1 (that is, c/i = cfi /T):(5.4)(5.5)1=MMc1=c∑iii =1i =1∑c== c HM1HMwhere M is the number of distinct terms and H M is the Mth harmonic number.

4 Reuters-RCV1 has M = 400,000 distinct terms and H M ≈ ln M, so wehave1111≈=≈ .c=HMln Mln 400,00013Thus the ith term has a relative frequency of roughly 1/(13i ), and the expected average number of occurrences of term i in a document of length Lis:1200 × 1315c≈L ≈iiiwhere we interpret the relative frequency as a term occurrence probability.Recall that 200 is the average number of tokens per document in ReutersRCV1 (Table 4.2).Now we have derived term statistics that characterize the distribution ofterms in the collection and, by extension, the distribution of gaps in the postings lists.

From these statistics, we can calculate the space requirements foran inverted index compressed with γ encoding. We first stratify the vocabulary into blocks of size Lc = 15. On average, term i occurs 15/i times per4. Note that, unfortunately, the conventional symbol for both entropy and harmonic number isH. Context should make clear which is meant in this chapter.Online edition (c) 2009 Cambridge UP1025 Index compressionN documentsLc mostfrequenttermsLc next mostfrequenttermsLc next mostfrequentterms...◮ Figure 5.10index.N gaps of 1 eachN/2 gaps of 2 eachN/3 gaps of 3 each...Stratification of terms for estimating the size of a γ encoded inverteddocument. So the average number of occurrences f per document is 1 ≤ f forterms in the first block, corresponding to a total number of N gaps per term.The average is 21 ≤ f < 1 for terms in the second block, corresponding toN/2 gaps per term, and 31 ≤ f < 21 for terms in the third block, corresponding to N/3 gaps per term, and so on.

(We take the lower bound because itsimplifies subsequent calculations. As we will see, the final estimate is toopessimistic, even with this assumption.) We will make the somewhat unrealistic assumption that all gaps for a given term have the same size as shownin Figure 5.10. Assuming such a uniform distribution of gaps, we then havegaps of size 1 in block 1, gaps of size 2 in block 2, and so on.Encoding the N/j gaps of size j with γ codes, the number of bits neededfor the postings list of a term in the jth block (corresponding to one row inthe figure) is:bits-per-rowN× (2 × ⌊log2 j⌋ + 1)j2N log2 j.j=≈To encode the entire block, we need ( Lc) · (2N log2 j)/j bits. There are M/( Lc)blocks, so the postings file as a whole will take up:(5.6)MLc2NLc log2 j.jj =1∑Online edition (c) 2009 Cambridge UP1035.3 Postings file compression◮ Table 5.6 Index and dictionary compression for Reuters-RCV1.

The compressionratio depends on the proportion of actual text in the collection. Reuters-RCV1 contains a large amount of XML markup. Using the two best compression schemes, γencoding and blocking with front coding, the ratio compressed index to collectionsize is therefore especially small for Reuters-RCV1: (101 + 5.9)/3600 ≈ 0.03.data structuredictionary, fixed-widthdictionary, term pointers into string∼, with blocking, k = 4∼, with blocking & front codingcollection (text, xml markup etc)collection (text)term incidence matrixpostings, uncompressed (32-bit words)postings, uncompressed (20 bits)postings, variable byte encodedpostings, γ encodedFor Reuters-RCV1,(5.7)MLcsize in MB11.27.67.15.93600.0960.040,000.0400.0250.0116.0101.0≈ 400,000/15 ≈ 27,000 and27,000∑j =12 × 106 × 15 log2 j≈ 224 MB.jSo the postings file of the compressed inverted index for our 960 MB collection has a size of 224 MB, one fourth the size of the original collection.When we run γ compression on Reuters-RCV1, the actual size of the compressed index is even lower: 101 MB, a bit more than one tenth of the size ofthe collection.

The reason for the discrepancy between predicted and actualvalue is that (i) Zipf’s law is not a very good approximation of the actual distribution of term frequencies for Reuters-RCV1 and (ii) gaps are not uniform.The Zipf model predicts an index size of 251 MB for the unrounded numbersfrom Table 4.2. If term frequencies are generated from the Zipf model anda compressed index is created for these artificial terms, then the compressedsize is 254 MB. So to the extent that the assumptions about the distributionof term frequencies are accurate, the predictions of the model are correct.Table 5.6 summarizes the compression techniques covered in this chapter.The term incidence matrix (Figure 1.1, page 4) for Reuters-RCV1 has size400,000 × 800,000 = 40 × 8 × 109 bits or 40 GB.γ codes achieve great compression ratios – about 15% better than variable byte codes for Reuters-RCV1.

Характеристики

Тип файла

PDF-файл

Размер

6,58 Mb

Материал

An introduction to information retrieval. Manning_ Raghavan (2009).pdf

Тип материала

Книга

Предмет

Анализ текстовых данных и информационный поиск

Высшее учебное заведение

МГУ им. Ломоносова

Список файлов книги

an-introduction-to-information-retrieval.-manning_-raghavan-2009.pdf.rar

An introduction to information retrieval. Manning_ Raghavan (2009).pdf

Поделитесь ссылкой:

Ставлю 10/10
Все нравится, очень удобный сайт, помогает в учебе. Кроме этого, можно заработать самому, выставляя готовые учебные материалы на продажу здесь. Рейтинги и отзывы на преподавателей очень помогают сориентироваться в начале нового семестра. Спасибо за такую функцию. Ставлю максимальную оценку.

Лучшая платформа для успешной сдачи сессии
Познакомился со СтудИзбой благодаря своему другу, очень нравится интерфейс, количество доступных файлов, цена, в общем, все прекрасно. Даже сам продаю какие-то свои работы.

Студизба ван лав ❤
Очень офигенный сайт для студентов. Много полезных учебных материалов. Пользуюсь студизбой с октября 2021 года. Серьёзных нареканий нет. Хотелось бы, что бы ввели подписочную модель и сделали материалы дешевле 300 рублей в рамках подписки бесплатными.

Отличный сайт
Лично меня всё устраивает - и покупка, и продажа; и цены, и возможность предпросмотра куска файла, и обилие бесплатных файлов (в подборках по авторам, читай, ВУЗам и факультетам). Есть определённые баги, но всё решаемо, да и администраторы реагируют в течение суток.

Маленький отзыв о большом помощнике!
Студизба спасает в те моменты, когда сроки горят, а работ накопилось достаточно. Довольно удобный сайт с простой навигацией и огромным количеством материалов.

Студ. Изба как крупнейший сборник работ для студентов
Тут дофига бывает всего полезного. Печально, что бывают предметы по которым даже одного бесплатного решения нет, но это скорее вопрос к студентам. В остальном всё здорово.

Спасательный островок
Если уже не успеваешь разобраться или застрял на каком-то задание поможет тебе быстро и недорого решить твою проблему.

Всё и так отлично
Всё очень удобно. Особенно круто, что есть система бонусов и можно выводить остатки денег. Очень много качественных бесплатных файлов.

Отзыв о системе "Студизба"
Отличная платформа для распространения работ, востребованных студентами. Хорошо налаженная и качественная работа сайта, огромная база заданий и аудитория.

Отличный помощник
Отличный сайт с кучей полезных файлов, позволяющий найти много методичек / учебников / отзывов о вузах и преподователях.

Отлично помогает студентам в любой момент для решения трудных и незамедлительных задач
Хотелось бы больше конкретной информации о преподавателях. А так в принципе хороший сайт, всегда им пользуюсь и ни разу не было желания прекратить. Хороший сайт для помощи студентам, удобный и приятный интерфейс. Из недостатков можно выделить только отсутствия небольшого количества файлов.

Спасибо за шикарный сайт
Великолепный сайт на котором студент за не большие деньги может найти помощь с дз, проектами курсовыми, лабораторными, а также узнать отзывы на преподавателей и бесплатно скачать пособия.