But they are expensive to decode. This is because many bit-level operations – shifts and masks – are necessary to decode a sequence of γ codes, as the boundaries between codes will usually be somewhere in the middle of a machine word. As a result, query processing is more expensive for γ codes than for variable byte codes.
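To make this cost concrete, here is a minimal Python sketch of γ encoding and decoding; the bit-string representation and the function names are our own illustrative choices, and a production decoder would operate on packed machine words with exactly the shifts and masks described above.

def gamma_encode(n: int) -> str:
    """Elias gamma code of a positive integer as a bit string:
    a unary code for the offset length, then the offset itself
    (the binary representation with the leading 1 removed)."""
    offset = bin(n)[3:]                     # e.g. 13 = 0b1101 -> "101"
    return "1" * len(offset) + "0" + offset

def gamma_decode_stream(bits: str) -> list[int]:
    """Decode a concatenation of gamma codes."""
    numbers, i = [], 0
    while i < len(bits):
        length = 0
        while bits[i] == "1":               # unary part: count 1s up to the 0
            length += 1
            i += 1
        i += 1                              # skip the terminating 0
        offset = bits[i:i + length]
        i += length
        numbers.append((1 << length) + (int(offset, 2) if offset else 0))
    return numbers

assert gamma_encode(13) == "1110101"
assert gamma_decode_stream(gamma_encode(13) + gamma_encode(1)) == [13, 1]

Note how the decoder must inspect individual bits to locate each code boundary; this is the work that makes γ decompression slower than byte-aligned schemes.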
Whether we choose variable byte or γ encoding depends on the characteristics of an application, for example, on the relative weights we give to conserving disk space versus maximizing query response time.

The compression ratio for the index in Table 5.6 is about 25%: 400 MB (uncompressed, each posting stored as a 32-bit word) versus 101 MB (γ) and 116 MB (VB). This shows that both γ and VB codes meet the objectives we stated in the beginning of the chapter.
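For comparison, here is a sketch of variable byte encoding and decoding in the same style. It follows the convention of Section 5.3.1 of setting the continuation bit in the last byte of each code; the function names are ours.

def vb_encode_number(n: int) -> bytes:
    """Variable byte code of one gap: 7 payload bits per byte,
    continuation bit set only in the final byte of the code."""
    out = []
    while True:
        out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    out[-1] += 128                          # mark the final byte
    return bytes(out)

def vb_decode(stream: bytes) -> list[int]:
    """Decode a concatenation of variable byte codes."""
    numbers, n = [], 0
    for byte in stream:
        if byte < 128:
            n = 128 * n + byte              # continuation byte: accumulate
        else:
            numbers.append(128 * n + byte - 128)
            n = 0
    return numbers

assert vb_encode_number(824) == bytes([0b00000110, 0b10111000])
assert vb_decode(vb_encode_number(824) + vb_encode_number(5)) == [824, 5]

Because every code ends on a byte boundary, the decoder never shifts bits across words, which is why query processing is cheaper with VB codes than with γ codes.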
Index compression substantially improves time and space efficiency of indexes by reducing the amount of disk space needed, increasing the amount of information that can be kept in the cache, and speeding up data transfers from disk to memory.

Exercise 5.4 [⋆]
Compute variable byte codes for the numbers in Tables 5.3 and 5.5.

Exercise 5.5 [⋆]
Compute variable byte and γ codes for the postings list ⟨777, 17743, 294068, 31251336⟩. Use gaps instead of docIDs where possible. Write binary codes in 8-bit blocks.

Exercise 5.6
Consider the postings list ⟨4, 10, 11, 12, 15, 62, 63, 265, 268, 270, 400⟩ with a corresponding list of gaps ⟨4, 6, 1, 1, 3, 47, 1, 202, 3, 2, 130⟩.
Assume that the length of the postings list is stored separately, so the system knows when a postings list is complete. Using variable byte encoding: (i) What is the largest gap you can encode in 1 byte? (ii) What is the largest gap you can encode in 2 bytes? (iii) How many bytes will the above postings list require under this encoding? (Count only space for encoding the sequence of numbers.)

Exercise 5.7
A little trick is to notice that a gap cannot be of length 0 and that the stuff left to encode after shifting cannot be 0. Based on these observations: (i) Suggest a modification to variable byte encoding that allows you to encode slightly larger gaps in the same amount of space.
(ii) What is the largest gap you can encode in 1 byte? (iii) What is the largest gap you can encode in 2 bytes? (iv) How many bytes will the postings list in Exercise 5.6 require under this encoding? (Count only space for encoding the sequence of numbers.)

Exercise 5.8 [⋆]
From the following sequence of γ-coded gaps, reconstruct first the gap sequence and then the postings sequence: 1110001110101011111101101111011.

Exercise 5.9
γ codes are relatively inefficient for large numbers (e.g., 1025 in Table 5.5) as they encode the length of the offset in inefficient unary code.
δ codes differ from γ codes in that they encode the first part of the code (length) in γ code instead of unary code. The encoding of offset is the same. For example, the δ code of 7 is 10,0,11 (again, we add commas for readability). 10,0 is the γ code for length (2 in this case) and the encoding of offset (11) is unchanged. (A short code sketch of this encoder follows the exercise.) (i) Compute the δ codes for the other numbers in Table 5.5. For what range of numbers is the δ code shorter than the γ code? (ii) γ code beats variable byte code in Table 5.6 because the index contains stop words and thus many small gaps. Show that variable byte code is more compact if larger gaps dominate.
(iii) Compare the compression ratios of δ code and variable byte code for a distribution of gaps dominated by large gaps.
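As a companion to Exercise 5.9 (a sketch of the encoder it defines, not a solution to its parts), here is the δ encoder built on top of a γ encoder; the handling of the number 1, whose offset is empty, is our own resolution of an edge case the exercise leaves open.

def gamma_encode(n: int) -> str:
    """Elias gamma code as a bit string: unary length, then offset."""
    offset = bin(n)[3:]
    return "1" * len(offset) + "0" + offset

def delta_encode(n: int) -> str:
    """Delta code as defined in Exercise 5.9: the length part is
    gamma-encoded instead of unary; the offset is unchanged."""
    offset = bin(n)[3:]
    if not offset:                # n == 1 has a zero-length offset; we emit
        return "0"                # a single 0 for it, by assumption
    return gamma_encode(len(offset)) + offset

assert delta_encode(7) == "10011"  # 10,0 (gamma code of length 2) + offset 11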
Exercise 5.10
Go through the above calculation of index size and explicitly state all the approximations that were made to arrive at Equation (5.6).

Exercise 5.11
For a collection of your choosing, determine the number of documents and terms and the average length of a document. (i) How large is the inverted index predicted to be by Equation (5.6)? (ii) Implement an indexer that creates a γ-compressed inverted index for the collection. How large is the actual index? (iii) Implement an indexer that uses variable byte encoding. How large is the variable byte encoded index?

Exercise 5.12
To be able to hold as many postings as possible in main memory, it is a good idea to compress intermediate index files during index construction. (i) This makes merging runs in blocked sort-based indexing more complicated. As an example, work out the γ-encoded merged sequence of the gaps in Table 5.7. (ii) Index construction is more space efficient when using compression. Would you also expect it to be faster?

◮ Table 5.7  Two gap sequences to be merged in blocked sort-based indexing

γ encoded gap sequence of run 1    1110110111111001011111111110100011111001
γ encoded gap sequence of run 2    11111010000111111000100011111110010000011111010101

Exercise 5.13
(i) Show that the size of the vocabulary is finite according to Zipf's law and infinite according to Heaps' law. (ii) Can we derive Heaps' law from Zipf's law?

5.4 References and further reading

Heaps' law was discovered by Heaps (1978).
See also Baeza-Yates and Ribeiro-Neto (1999). A detailed study of vocabulary growth in large collections is (Williams and Zobel 2005). Zipf's law is due to Zipf (1949). Witten and Bell (1990) investigate the quality of the fit obtained by the law. Other term distribution models, including the K mixture and the two-Poisson model, are discussed by Manning and Schütze (1999, Chapter 15).
Carmel et al. (2001), Büttcher and Clarke (2006), Blanco and Barreiro (2007), and Ntoulas and Cho (2007) show that lossy compression can achieve good compression with no or no significant decrease in retrieval effectiveness.

Dictionary compression is covered in detail by Witten et al. (1999, Chapter 4), which is recommended as additional reading.
Subsection 5.3.1 is based on (Scholer et al. 2002). The authors find that variable byte codes process queries two times faster than either bit-level compressed indexes or uncompressed indexes with a 30% penalty in compression ratio compared with the best bit-level compression method. They also show that compressed indexes can be superior to uncompressed indexes not only in disk usage, but also in query processing speed.
Compared with VB codes, "variable nibble" codes showed 5% to 10% better compression and up to one third worse effectiveness in one experiment (Anh and Moffat 2005). Trotman (2003) also recommends using VB codes unless disk space is at a premium. In recent work, Anh and Moffat (2005; 2006a) and Zukowski et al. (2006) have constructed word-aligned binary codes that are both faster in decompression and at least as efficient as VB codes.
Zhang et al. (2007) investigate the increased effectiveness of caching when a number of different compression techniques for postings lists are used on modern hardware.

δ codes (Exercise 5.9) and γ codes were introduced by Elias (1975), who proved that both codes are universal. In addition, δ codes are asymptotically optimal for H(P) → ∞.
δ codes perform better than γ codes if large numbers (greater than 15) dominate. A good introduction to information theory, including the concept of entropy, is (Cover and Thomas 1991). While Elias codes are only asymptotically optimal, arithmetic codes (Witten et al. 1999, Section 2.4) can be constructed to be arbitrarily close to the optimum H(P) for any P.
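The advantage of δ over γ for large numbers can be read off the code lengths. The following is our own derivation from the definitions in this chapter (with the δ code as defined in Exercise 5.9), not a formula quoted from the text: a gap G ≥ 2 has an offset of ⌊log₂ G⌋ bits, so

|γ(G)| = 2⌊log₂ G⌋ + 1
|δ(G)| = ⌊log₂ G⌋ + 2⌊log₂ ⌊log₂ G⌋⌋ + 1

For G = 1025, for instance, ⌊log₂ G⌋ = 10, so the γ code takes 21 bits while the δ code takes 10 + 2⌊log₂ 10⌋ + 1 = 17 bits; asymptotically δ uses about log₂ G + 2 log₂ log₂ G bits against 2 log₂ G for γ.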
Several additional index compression techniques are covered by Witten et al. (1999, Sections 3.3 and 3.4 and Chapter 5). They recommend using parameterized codes for index compression, codes that explicitly model the probability distribution of gaps for each term. For example, they show that Golomb codes achieve better compression ratios than γ codes for large collections. Moffat and Zobel (1992) compare several parameterized methods, including LLRUN (Fraenkel and Klein 1985).

The distribution of gaps in a postings list depends on the assignment of docIDs to documents. A number of researchers have looked into assigning docIDs in a way that is conducive to the efficient compression of gap sequences (Moffat and Stuiver 1996; Blandford and Blelloch 2002; Silvestri et al. 2004; Blanco and Barreiro 2006; Silvestri 2007). These techniques assign docIDs in a small range to documents in a cluster where a cluster can consist of all documents in a given time period, on a particular web site, or sharing another property. As a result, when a sequence of documents from a cluster occurs in a postings list, their gaps are small and can be more effectively compressed.
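A small self-contained Python sketch (with an invented toy collection; nothing here comes from the cited papers) illustrates the effect: numbering documents site by site makes the gaps of a site-correlated term mostly 1, so its postings list γ-compresses far better than under an arbitrary docID assignment.

import random

def gamma_bits(gap: int) -> int:
    """Bits used by the Elias gamma code of a positive gap."""
    return 2 * (gap.bit_length() - 1) + 1

def postings_bits(doc_ids: list[int]) -> int:
    """Total gamma-coded size of a postings list stored as gaps."""
    doc_ids = sorted(doc_ids)
    gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return sum(gamma_bits(g) for g in gaps)

# Toy collection: 10 web sites with 1000 pages each; some term occurs
# on every page of two of the sites.
docs = [(site, page) for site in range(10) for page in range(1000)]
matching = [d for d in docs if d[0] in (3, 7)]

random.seed(0)
shuffled = docs[:]
random.shuffle(shuffled)
arbitrary = {d: i + 1 for i, d in enumerate(shuffled)}      # arbitrary docIDs
clustered = {d: i + 1 for i, d in enumerate(sorted(docs))}  # site-by-site docIDs

print(postings_bits([arbitrary[d] for d in matching]))  # many multi-bit gaps
print(postings_bits([clustered[d] for d in matching]))  # mostly 1-bit gaps

Under the clustered assignment almost every gap is 1 and costs a single bit, while the arbitrary assignment pays several bits for a typical gap.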
Different considerations apply to the compression of term frequencies and word positions than to the compression of docIDs in postings lists. See Scholer et al. (2002) and Zobel and Moffat (2006). Zobel and Moffat (2006) is recommended in general as an in-depth and up-to-date tutorial on inverted indexes, including index compression.

This chapter only looks at index compression for Boolean retrieval. For ranked retrieval (Chapter 6), it is advantageous to order postings according to term frequency instead of docID.