An Introduction to Information Retrieval. Manning, Raghavan & Schütze (2009), excerpt from Chapter 5 (Index compression)
So we have reduced the space requirements by one third, from 11.2 to 7.6 MB.

5.2.2 Blocked storage

We can further compress the dictionary by grouping terms in the string into blocks of size k and keeping a term pointer only for the first term of each block (Figure 5.5). We store the length of the term in the string as an additional byte at the beginning of the term. We thus eliminate k − 1 term pointers, but need an additional k bytes for storing the length of each term. For k = 4, we save (k − 1) × 3 = 9 bytes for term pointers, but need an additional k = 4 bytes for term lengths. So the total space requirements for the dictionary of Reuters-RCV1 are reduced by 5 bytes per four-term block, or a total of 400,000 × 1/4 × 5 = 0.5 MB, bringing us down to 7.1 MB.
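As an illustration, here is a minimal Python sketch of this layout (the function and variable names are ours, not from the book): it packs the sorted terms into one string, prefixes each term with a one-byte length, and keeps a term pointer only for the first term of each block of k terms.

def build_blocked_string(terms, k=4):
    # Pack sorted terms into one string; each term is preceded by a
    # one-byte length so we can skip from term to term within a block.
    # Assumes terms are at most 255 characters long.
    parts = []
    term_ptrs = []  # one string offset per block of k terms
    offset = 0
    for i, term in enumerate(terms):
        if i % k == 0:
            term_ptrs.append(offset)
        parts.append(chr(len(term)) + term)
        offset += 1 + len(term)
    return "".join(parts), term_ptrs

For the block in Figure 5.5, build_blocked_string(["systile", "syzygetic", "syzygial", "syzygy"]) yields a string in which each term is preceded by its length byte (7, 9, 8, 6) and a single term pointer, 0.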
◮ Figure 5.5 Blocked storage with four terms per block. [The dictionary string reads . . . 7systile 9syzygetic 8syzygial 6syzygy 11szaibelyite 6szecin . . . ; the freq., postings ptr., and term ptr. columns of the figure are omitted here.] The first block consists of systile, syzygetic, syzygial, and syzygy with lengths of seven, nine, eight, and six characters, respectively. Each term is preceded by a byte encoding its length that indicates how many bytes to skip to reach subsequent terms.

By increasing the block size k, we get better compression. However, there is a tradeoff between compression and the speed of term lookup.
For the eight-term dictionary in Figure 5.6, steps in binary search are shown as double lines and steps in list search as simple lines. We search for terms in the uncompressed dictionary by binary search (a). In the compressed dictionary, we first locate the term's block by binary search and then its position within the list by linear search through the block (b). Searching the uncompressed dictionary in (a) takes on average (0 + 1 + 2 + 3 + 2 + 1 + 2 + 2)/8 ≈ 1.6 steps, assuming each term is equally likely to come up in a query. For example, finding the two terms aid and box takes three and two steps, respectively. With blocks of size k = 4 in (b), we need (0 + 1 + 2 + 3 + 4 + 1 + 2 + 3)/8 = 2 steps on average, ≈ 25% more.
For example, finding den takes one binary search step and two steps through the block. By increasing k, we can get the size of the compressed dictionary arbitrarily close to the minimum of 400,000 × (4 + 4 + 1 + 8) = 6.8 MB, but term lookup becomes prohibitively slow for large values of k.
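A sketch of this two-stage lookup, assuming the build_blocked_string layout from the previous sketch: binary search over the per-block term pointers, then a linear scan of at most k length-prefixed terms within the block.

def first_term(s, ptr):
    # Decode the term whose length byte sits at string offset ptr.
    length = ord(s[ptr])
    return s[ptr + 1 : ptr + 1 + length]

def lookup(s, term_ptrs, k, query):
    # Stage 1: binary search for the last block whose first term <= query.
    lo, hi = 0, len(term_ptrs) - 1
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if first_term(s, term_ptrs[mid]) <= query:
            lo = mid
        else:
            hi = mid - 1
    # Stage 2: linear scan through the at most k terms of that block.
    ptr = term_ptrs[lo]
    for _ in range(k):
        if ptr >= len(s):
            break  # ran off the end of the dictionary string
        length = ord(s[ptr])
        if s[ptr + 1 : ptr + 1 + length] == query:
            return ptr  # offset of the matching entry
        ptr += 1 + length
    return None  # query is not in the dictionary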
One source of redundancy in the dictionary we have not exploited yet is the fact that consecutive entries in an alphabetically sorted list share common prefixes. This observation leads to front coding (Figure 5.7). A common prefix is identified for a subsequence of the term list and then referred to with a special character. In the case of Reuters, front coding saves another 1.2 MB, as we found in an experiment.
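A Python sketch of the idea, with one simplification relative to Figure 5.7 that is ours: instead of the ∗/⋄ markers, each entry after the first stores the length of the prefix it shares with the block's first term, plus its remaining suffix.

import os

def front_code_block(terms):
    # The first term is kept in full; later terms become
    # (shared-prefix length, remaining suffix) pairs.
    first = terms[0]
    encoded = [first]
    for term in terms[1:]:
        p = len(os.path.commonprefix([first, term]))
        encoded.append((p, term[p:]))
    return encoded

def front_decode_block(encoded):
    # Reattach the shared prefix taken from the first term.
    first = encoded[0]
    return [first] + [first[:p] + suffix for p, suffix in encoded[1:]]

For the block of Figure 5.7, front_code_block(["automata", "automate", "automatic", "automation"]) yields ["automata", (7, "e"), (7, "ic"), (7, "ion")], mirroring the prefix "automat" and the suffixes e, ic, and ion in the figure.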
◮ Figure 5.6 Search of the uncompressed dictionary (a) and a dictionary compressed by blocking with k = 4 (b). [Both parts show search trees over the eight terms aid, box, den, ex, job, ox, pit, win.]

One block in blocked compression (k = 4) . . .
8automata 8automate 9automatic 10automation
⇓
. . . further compressed with front coding.
8automat∗a 1⋄e 2⋄ic 3⋄ion

◮ Figure 5.7 Front coding. A sequence of terms with identical prefix ("automat") is encoded by marking the end of the prefix with ∗ and replacing it with ⋄ in subsequent terms. As before, the first byte of each entry encodes the number of characters.

◮ Table 5.2 Dictionary compression for Reuters-RCV1.

data structure                             size in MB
dictionary, fixed-width                    11.2
dictionary, term pointers into string      7.6
∼, with blocking, k = 4                    7.1
∼, with blocking & front coding            5.9
Other schemes with even greater compression rely on minimal perfect hashing, that is, a hash function that maps M terms onto [1, . . . , M] without collisions. However, we cannot adapt perfect hashes incrementally because each new term causes a collision and therefore requires the creation of a new perfect hash function. Therefore, they cannot be used in a dynamic environment.

Even with the best compression scheme, it may not be feasible to store the entire dictionary in main memory for very large text collections and for hardware with limited memory. If we have to partition the dictionary onto pages that are stored on disk, then we can index the first term of each page using a B-tree. For processing most queries, the search system has to go to disk anyway to fetch the postings. One additional seek for retrieving the term's dictionary page from disk is a significant but tolerable increase in the time it takes to process a query.
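A sketch of the page-location step under two assumptions of ours: the first term of every on-disk page is kept in a sorted in-memory list (standing in for the B-tree), and Python's bisect supplies the search.

import bisect

def locate_page(first_terms, query):
    # first_terms[i] is the first term stored on disk page i,
    # in sorted order; the page that can hold query is the
    # rightmost page whose first term is <= query.
    i = bisect.bisect_right(first_terms, query) - 1
    return max(i, 0)

Only this one page then needs to be fetched from disk, giving the single additional seek described above.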
Table 5.2 summarizes the compression achieved by the four dictionary data structures.

Exercise 5.2
Estimate the space usage of the Reuters-RCV1 dictionary with blocks of size k = 8 and k = 16 in blocked dictionary storage.

Exercise 5.3
Estimate the time needed for term lookup in the compressed dictionary of Reuters-RCV1 with block sizes of k = 4 (Figure 5.6, b), k = 8, and k = 16. What is the slowdown compared with k = 1 (Figure 5.6, a)?

5.3 Postings file compression

Recall from Table 4.2 (page 70) that Reuters-RCV1 has 800,000 documents, 200 tokens per document, six characters per token, and 100,000,000 postings, where we define a posting in this chapter as a docID in a postings list, that is, excluding frequency and position information.
These numbers correspond to line 3 ("case folding") in Table 5.1.

◮ Table 5.3 Encoding gaps instead of document IDs. For example, we store gaps 107, 5, 43, . . . , instead of docIDs 283154, 283159, 283202, . . . for computer. The first docID is left unchanged (only shown for arachnocentric).

                                 postings list
the             docIDs   . . .  283042  283043  283044  283045  . . .
                gaps                    1       1       1       . . .
computer        docIDs   . . .  283047  283154  283159  283202  . . .
                gaps                    107     5       43      . . .
arachnocentric  docIDs   . . .  252000  500100
                gaps     . . .  252000  248100
Document identifiers are log2 800,000 ≈ 20 bits long. Thus, the size of the collection is about 800,000 × 200 × 6 bytes = 960 MB and the size of the uncompressed postings file is 100,000,000 × 20/8 = 250 MB.

To devise a more efficient representation of the postings file, one that uses fewer than 20 bits per document, we observe that the postings for frequent terms are close together. Imagine going through the documents of a collection one by one and looking for a frequent term like computer. We will find a document containing computer, then we skip a few documents that do not contain it, then there is again a document with the term, and so on (see Table 5.3). The key idea is that the gaps between postings are short, requiring a lot less space than 20 bits to store. In fact, gaps for the most frequent terms such as the and for are mostly equal to 1. But the gaps for a rare term that occurs only once or twice in a collection (e.g., arachnocentric in Table 5.3) have the same order of magnitude as the docIDs and need 20 bits. For an economical representation of this distribution of gaps, we need a variable encoding method that uses fewer bits for short gaps.
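The transformation itself is a one-line running difference; a sketch with our own helper names, reproducing the computer row of Table 5.3:

def to_gaps(doc_ids):
    # Keep the first docID as is; replace each later docID by its
    # difference from the predecessor.
    return doc_ids[:1] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def from_gaps(gaps):
    # Invert to_gaps by keeping a running sum.
    doc_ids, total = [], 0
    for g in gaps:
        total += g
        doc_ids.append(total)
    return doc_ids

For example, to_gaps([283047, 283154, 283159, 283202]) returns [283047, 107, 5, 43], matching the gaps shown for computer.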
To encode small numbers in less space than large numbers, we look at two types of methods: bytewise compression and bitwise compression. As the names suggest, these methods attempt to encode gaps with the minimum number of bytes and bits, respectively.

5.3.1 Variable byte codes

Variable byte (VB) encoding uses an integral number of bytes to encode a gap. The last 7 bits of a byte are "payload" and encode part of the gap.
The first bit of the byte is a continuation bit. It is set to 1 for the last byte of the encoded gap and to 0 otherwise. To decode a variable byte code, we read a sequence of bytes with continuation bit 0 terminated by a byte with continuation bit 1. We then extract and concatenate the 7-bit parts. Figure 5.8 gives pseudocode for VB encoding and decoding and Table 5.4 an example of a VB-encoded postings list.

VBENCODENUMBER(n)
  bytes ← ⟨⟩
  while true
    do PREPEND(bytes, n mod 128)
       if n < 128
          then BREAK
       n ← n div 128
  bytes[LENGTH(bytes)] += 128
  return bytes

VBENCODE(numbers)
  bytestream ← ⟨⟩
  for each n ∈ numbers
    do bytes ← VBENCODENUMBER(n)
       bytestream ← EXTEND(bytestream, bytes)
  return bytestream

VBDECODE(bytestream)
  numbers ← ⟨⟩
  n ← 0
  for i ← 1 to LENGTH(bytestream)
    do if bytestream[i] < 128
          then n ← 128 × n + bytestream[i]
          else n ← 128 × n + (bytestream[i] − 128)
               APPEND(numbers, n)
               n ← 0
  return numbers
◮ Figure 5.8 VB encoding and decoding. The functions div and mod compute integer division and remainder after integer division, respectively. PREPEND adds an element to the beginning of a list, for example, PREPEND(⟨1, 2⟩, 3) = ⟨3, 1, 2⟩. EXTEND extends a list, for example, EXTEND(⟨1, 2⟩, ⟨3, 4⟩) = ⟨1, 2, 3, 4⟩.

◮ Table 5.4 VB encoding. Gaps are encoded using an integral number of bytes. The first bit, the continuation bit, of each byte indicates whether the code ends with this byte (1) or not (0).

docIDs    824                 829        215406
gaps                          5          214577
VB code   00000110 10111000   10000101   00001101 00001100 10110001
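For readers who prefer running code over pseudocode, here is a direct Python transcription of Figure 5.8, operating on lists of integer byte values (our rendering, not the book's):

def vb_encode_number(n):
    # Emit 7-bit chunks of n, most significant first, and set the
    # continuation bit (high bit) on the last byte.
    bytes_ = []
    while True:
        bytes_.insert(0, n % 128)
        if n < 128:
            break
        n = n // 128
    bytes_[-1] += 128
    return bytes_

def vb_encode(numbers):
    bytestream = []
    for n in numbers:
        bytestream.extend(vb_encode_number(n))
    return bytestream

def vb_decode(bytestream):
    numbers, n = [], 0
    for byte in bytestream:
        if byte < 128:  # continuation bit 0: more bytes follow
            n = 128 * n + byte
        else:           # continuation bit 1: last byte of this gap
            numbers.append(128 * n + (byte - 128))
            n = 0
    return numbers

Encoding the gaps of Table 5.4, vb_encode([824, 5, 214577]) produces the bytes 00000110 10111000, 10000101, and 00001101 00001100 10110001, and vb_decode recovers the original gaps.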
◮ Table 5.5 Some examples of unary and γ codes. Unary codes are only shown for the smaller numbers. Commas in γ codes are for readability only and are not part of the actual codes.

number   unary code   length        offset       γ code
0        0
1        10           0                          0
2        110          10            0            10,0
3        1110         10            1            10,1
4        11110        110           00           110,00
9        1111111110   1110          001          1110,001
13                    1110          101          1110,101
24                    11110         1000         11110,1000
511                   111111110     11111111     111111110,11111111
1025                  11111111110   0000000001   11111111110,0000000001

With VB compression, the size of the compressed index for Reuters-RCV1 is 116 MB, as we verified in an experiment.
This is a more than 50% reduction of the size of the uncompressed index (see Table 5.6).

The idea of VB encoding can also be applied to larger or smaller units than bytes: 32-bit words, 16-bit words, and 4-bit words or nibbles. Larger words further decrease the amount of bit manipulation necessary at the cost of less effective (or no) compression. Word sizes smaller than bytes get even better compression ratios at the cost of more bit manipulation. In general, bytes offer a good compromise between compression ratio and speed of decompression.

For most IR systems variable byte codes offer an excellent tradeoff between time and space. They are also simple to implement – most of the alternatives referred to in Section 5.4 are more complex. But if disk space is a scarce resource, we can achieve better compression ratios by using bit-level encodings, in particular two closely related encodings: γ codes, which we will turn to next, and δ codes (Exercise 5.9).

5.3.2 γ codes

VB codes use an adaptive number of bytes depending on the size of the gap. Bit-level codes adapt the length of the code on the finer grained bit level.
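As a preview of the construction developed next, and consistent with the length and offset columns of Table 5.5, a γ code concatenates the unary code of the offset's length with the offset, where the offset is the gap's binary representation stripped of its leading 1. A sketch (our helper names):

def unary(n):
    # Unary code for n: n ones followed by a zero.
    return "1" * n + "0"

def gamma_encode(n):
    # Gamma code for a gap n >= 1: unary-coded length, then offset.
    offset = bin(n)[3:]  # binary representation without the leading 1
    return unary(len(offset)) + offset

For example, gamma_encode(13) returns "1110101", that is, length 1110 followed by offset 101, as in Table 5.5.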