An introduction to information retrieval. Manning, Raghavan (2009), page 24


Fill out the time column of the table for Reuters-RCV1 assuming a system with the parameters given in Table 4.1.

Exercise 4.7
Repeat Exercise 4.6 for the larger collection in Table 4.4. Choose a block size that is realistic for current technology (remember that a block should easily fit into main memory). How many blocks do you need?

Exercise 4.8
Assume that we have a collection of modest size whose index can be constructed with the simple in-memory indexing algorithm in Figure 1.4 (page 8). For this collection, compare memory, disk and time requirements of the simple algorithm in Figure 1.4 and blocked sort-based indexing.

Exercise 4.9
Assume that machines in MapReduce have 100 GB of disk space each. Assume further that the postings list of the term "the" has a size of 200 GB. Then the MapReduce algorithm as described cannot be run to construct the index. How would you modify MapReduce so that it can handle this case?

Online edition (c) 2009 Cambridge UP

Exercise 4.10
For optimal load balancing, the inverters in MapReduce must get segmented postings files of similar sizes. For a new collection, the distribution of key-value pairs may not be known in advance. How would you solve this problem?

Exercise 4.11
Apply MapReduce to the problem of counting how often each term occurs in a set of files. Specify map and reduce operations for this task. Write down an example along the lines of Figure 4.6.

Exercise 4.12
We claimed (on page 80) that an auxiliary index can impair the quality of collection statistics. An example is the term weighting method idf, which is defined as log(N/df_i) where N is the total number of documents and df_i is the number of documents that term i occurs in (Section 6.2.1, page 117). Show that even a small auxiliary index can cause significant error in idf when it is computed on the main index only. Consider a rare term that suddenly occurs frequently (e.g., Flossie as in Tropical Storm Flossie).

4.7 References and further reading

Witten et al. (1999, Chapter 5) present an extensive treatment of the subject of index construction and additional indexing algorithms with different tradeoffs of memory, disk space, and time. In general, blocked sort-based indexing does well on all three counts. However, if conserving memory or disk space is the main criterion, then other algorithms may be a better choice.

See Witten et al. (1999), Tables 5.4 and 5.5; BSBI is closest to “sort-based multiway merge,” but the two algorithms differ in dictionary structure and use of compression.

Moffat and Bell (1995) show how to construct an index “in situ,” that is, with disk space usage close to what is needed for the final index and with a minimum of additional temporary files (cf. also Harman and Candela (1990)). They give Lesk (1988) and Somogyi (1990) credit for being among the first to employ sorting for index construction.

The SPIMI method in Section 4.3 is from (Heinz and Zobel 2003). We have simplified several aspects of the algorithm, including compression and the fact that each term’s data structure also contains, in addition to the postings list, its document frequency and housekeeping information. We recommend Heinz and Zobel (2003) and Zobel and Moffat (2006) as up-to-date, in-depth treatments of index construction. Other algorithms with good scaling properties with respect to vocabulary size require several passes through the data, e.g., FAST-INV (Fox and Lee 1991, Harman et al. 1992).

The MapReduce architecture was introduced by Dean and Ghemawat (2004). An open source implementation of MapReduce is available at http://lucene.apache.org/hadoop/.

Ribeiro-Neto et al. (1999) and Melnik et al. (2001) describe other approaches to distributed indexing. Introductory chapters on distributed IR are (Baeza-Yates and Ribeiro-Neto 1999, Chapter 9) and (Grossman and Frieder 2004, Chapter 8). See also Callan (2000).

Lester et al. (2005) and Büttcher and Clarke (2005a) analyze the properties of logarithmic merging and compare it with other construction methods. One of the first uses of this method was in Lucene (http://lucene.apache.org). Other dynamic indexing methods are discussed by Büttcher et al. (2006) and Lester et al. (2006). The latter paper also discusses the strategy of replacing the old index by one built from scratch.

Heinz et al. (2002) compare data structures for accumulating the vocabulary in memory. Büttcher and Clarke (2005b) discuss security models for a common inverted index for multiple users. A detailed characterization of the Reuters-RCV1 collection can be found in (Lewis et al. 2004). NIST distributes the collection (see http://trec.nist.gov/data/reuters/reuters.html).

Garcia-Molina et al. (1999, Chapter 2) review computer hardware relevant to system design in depth.

An effective indexer for enterprise search needs to be able to communicate efficiently with a number of applications that hold text data in corporations, including Microsoft Outlook, IBM’s Lotus software, databases like Oracle and MySQL, content management systems like Open Text, and enterprise resource planning software like SAP.

DRAFT! © April 1, 2009 Cambridge University Press. Feedback welcome.

5 Index compression

Chapter 1 introduced the dictionary and the inverted index as the central data structures in information retrieval (IR).

In this chapter, we employ a number of compression techniques for dictionary and inverted index that are essential for efficient IR systems.

One benefit of compression is immediately clear. We need less disk space. As we will see, compression ratios of 1:4 are easy to achieve, potentially cutting the cost of storing the index by 75%.

There are two more subtle benefits of compression. The first is increased use of caching. Search systems use some parts of the dictionary and the index much more than others. For example, if we cache the postings list of a frequently used query term t, then the computations necessary for responding to the one-term query t can be entirely done in memory. With compression, we can fit a lot more information into main memory.
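As a rough illustration of this caching idea, the following Python sketch keeps postings lists in memory in compressed form and decompresses them on a cache hit. It is only a sketch under stated assumptions: the class name CompressedPostingsCache, its methods, and the byte budget are hypothetical, docIDs are stored as fixed-width unsigned integers, and zlib is a general-purpose stand-in rather than one of the index-specific compression methods of this chapter.

```python
import zlib
from array import array
from collections import OrderedDict
from typing import List, Optional


class CompressedPostingsCache:
    """Hypothetical LRU cache holding postings lists in compressed form."""

    def __init__(self, capacity_bytes: int) -> None:
        self.capacity_bytes = capacity_bytes  # memory budget (assumption)
        self.used_bytes = 0
        self._blobs = OrderedDict()           # term -> compressed bytes

    def put(self, term: str, doc_ids: List[int]) -> None:
        raw = array("I", doc_ids).tobytes()   # fixed-width unsigned ints
        blob = zlib.compress(raw)             # stand-in compressor
        if term in self._blobs:
            self.used_bytes -= len(self._blobs.pop(term))
        self._blobs[term] = blob
        self.used_bytes += len(blob)
        # Evict least recently used lists until the budget is met again.
        while self.used_bytes > self.capacity_bytes and self._blobs:
            _, evicted = self._blobs.popitem(last=False)
            self.used_bytes -= len(evicted)

    def get(self, term: str) -> Optional[List[int]]:
        blob = self._blobs.get(term)
        if blob is None:
            return None                       # miss: caller must read from disk
        self._blobs.move_to_end(term)         # mark as recently used
        doc_ids = array("I")
        doc_ids.frombytes(zlib.decompress(blob))
        return doc_ids.tolist()


cache = CompressedPostingsCache(capacity_bytes=1_000_000)
cache.put("t", list(range(0, 100_000, 7)))
hit = cache.get("t")        # served from memory and decompressed
miss = cache.get("other")   # None: this term is not cached
```

Because the budget is counted in compressed bytes, the same amount of memory holds more postings lists than it would uncompressed, at the price of one decompression per hit.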

Rather than expending a disk seek when processing a query with t, we access its postings list in memory and decompress it. As we will see below, there are simple and efficient decompression methods, so that the penalty of having to decompress the postings list is small. As a result, we are able to decrease the response time of the IR system substantially. Because memory is a more expensive resource than disk space, increased speed owing to caching – rather than decreased space requirements – is often the prime motivator for compression.

The second more subtle advantage of compression is faster transfer of data from disk to memory. Efficient decompression algorithms run so fast on modern hardware that the total time of transferring a compressed chunk of data from disk and then decompressing it is usually less than transferring the same chunk of data in uncompressed form. For instance, we can reduce input/output (I/O) time by loading a much smaller compressed postings list, even when we add on the cost of decompression. So, in most cases, the retrieval system runs faster on compressed postings lists than on uncompressed postings lists.

If the main goal of compression is to conserve disk space, then the speed of compression algorithms is of no concern. But for improved cache utilization and faster disk-to-memory transfer, decompression speeds must be high. The compression algorithms we discuss in this chapter are highly efficient and can therefore serve all three purposes of index compression.

In this chapter, we define a posting as a docID in a postings list. For example, the postings list (6; 20, 45, 100), where 6 is the termID of the list’s term, contains three postings. As discussed in Section 2.4.2 (page 41), postings in most search systems also contain frequency and position information; but we will only consider simple docID postings here.
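The definition of a posting used here can be written down as a small data structure. The Python sketch below is illustrative only: the type name PostingsList and its field names are hypothetical, and it models just the simple docID postings considered in this chapter.

```python
from typing import NamedTuple, Tuple


class PostingsList(NamedTuple):
    """Hypothetical container: one termID plus its docID postings."""
    term_id: int
    doc_ids: Tuple[int, ...]  # one posting per docID, no frequencies or positions


# The example from the text: (6; 20, 45, 100)
example = PostingsList(term_id=6, doc_ids=(20, 45, 100))
num_postings = len(example.doc_ids)  # the termID itself is not a posting
print(num_postings)  # 3
```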

See Section 5.4 for references on compressing frequencies and positions.

This chapter first gives a statistical characterization of the distribution of the entities we want to compress – terms and postings in large collections (Section 5.1). We then look at compression of the dictionary, using the dictionary-as-a-string method and blocked storage (Section 5.2). Section 5.3 describes two techniques for compressing the postings file, variable byte encoding and γ encoding.

5.1 Statistical properties of terms in information retrieval

As in the last chapter, we use Reuters-RCV1 as our model collection (see Table 4.2, page 70). We give some term and postings statistics for the collection in Table 5.1. “∆%” indicates the reduction in size from the previous line. “T%” is the cumulative reduction from unfiltered.

The table shows the number of terms for different levels of preprocessing (column 2). The number of terms is the main factor in determining the size of the dictionary. The number of nonpositional postings (column 3) is an indicator of the expected size of the nonpositional index of the collection. The expected size of a positional index is related to the number of positions it must encode (column 4).

In general, the statistics in Table 5.1 show that preprocessing affects the size of the dictionary and the number of nonpositional postings greatly. Stemming and case folding reduce the number of (distinct) terms by 17% each and the number of nonpositional postings by 4% and 3%, respectively.
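To make the two percentage columns concrete, the short Python sketch below recomputes a “∆%” and a “T%” value for each row from raw counts. The function name reduction_columns is hypothetical, the counts are invented round numbers rather than the values of Table 5.1, and reporting reductions as negative percentages is an assumption about the sign convention.

```python
from typing import List, Tuple


def reduction_columns(counts: List[int]) -> List[Tuple[int, int]]:
    """Per row: (change vs. previous row, change vs. first row), in percent."""
    rows = []
    first = counts[0]
    prev = counts[0]
    for count in counts:
        delta_pct = 100.0 * (count - prev) / prev    # the "delta %" column
        total_pct = 100.0 * (count - first) / first  # the "T %" column
        rows.append((round(delta_pct), round(total_pct)))
        prev = count
    return rows


# Hypothetical term counts after successive preprocessing steps
# (illustrative only, NOT the values of Table 5.1).
steps = ["unfiltered", "case folding", "stemming"]
counts = [500_000, 415_000, 344_450]
for step, (delta, total) in zip(steps, reduction_columns(counts)):
    print(f"{step:>12}: delta {delta:+d}%  cumulative {total:+d}%")
```

With these made-up counts, two successive 17% reductions give a cumulative reduction of about 31%, not 34%, because each step is measured against the already reduced count of the previous line.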
