An introduction to information retrieval. Manning_ Raghavan (2009) (811397), страница 21

Файл №811397 An introduction to information retrieval. Manning_ Raghavan (2009) (An introduction to information retrieval. Manning_ Raghavan (2009).pdf) 21 страницаAn introduction to information retrieval. Manning_ Raghavan (2009) (811397) страница 212020-08-252020-08-25СтудИзба

An introduction to information retrieval. Manning_ Raghavan (2009).pdf

Просмтор этого файла доступен только зарегистрированным пользователям. Но у нас супер быстрая регистрация: достаточно только электронной почты!

Регистрация/авторизация

Текст из файла (страница 21)

Some key statistics of the collection are shownin Table 4.2.Online edition (c) 2009 Cambridge UP704 Index construction◮ Table 4.2 Collection statistics for Reuters-RCV1. Values are rounded for the computations in this book. The unrounded values are: 806,791 documents, 222 tokensper document, 391,523 (distinct) terms, 6.04 bytes per token with spaces and punctuation, 4.5 bytes per token without spaces and punctuation, 7.5 bytes per term, and96,969,056 tokens. The numbers in this table correspond to the third line (“case folding”) in Table 5.1 (page 87).SymbolNLaveMTStatisticdocumentsavg.

# tokens per documenttermsavg. # bytes per token (incl. spaces/punct.)avg. # bytes per token (without spaces/punct.)avg. # bytes per termtokensValue800,000200400,00064.57.5100,000,000REUTERSYou are here: Home > News > Science > ArticleGo to a Section:U.S.InternationalBusinessMarketsPoliticsEntertainmentTechnologySportsOddly EnoughExtreme conditions create rare Antarctic cloudsTue Aug 1, 2006 3:20am ETEmail This Article | Print This Article | ReprintsSYDNEY (Reuters) - Rare, mother-of-pearl colored cloudscaused by extreme weather conditions above Antarctica are apossible indication of global warming, Australian scientists said onTuesday.[-] Text [+]Known as nacreous clouds, the spectacular formations showing delicatewisps of colors were photographed in the sky over an Australianmeteorological base at Mawson Station on July 25.◮ Figure 4.1 Document from the Reuters newswire.EXTERNAL SORTINGALGORITHMReuters-RCV1 has 100 million tokens.

Collecting all termID–docID pairs ofthe collection using 4 bytes each for termID and docID therefore requires 0.8GB of storage. Typical collections today are often one or two orders of magnitude larger than Reuters-RCV1. You can easily see how such collectionsoverwhelm even large computers if we try to sort their termID–docID pairsin memory. If the size of the intermediate files during index construction iswithin a small factor of available memory, then the compression techniquesintroduced in Chapter 5 can help; however, the postings file of many largecollections cannot fit into memory even after compression.With main memory insufficient, we need to use an external sorting algorithm, that is, one that uses disk.

For acceptable speed, the central require-Online edition (c) 2009 Cambridge UP4.2 Blocked sort-based indexing71BSBI NDEX C ONSTRUCTION ()1 n←02 while (all documents have not been processed)3 do n ← n + 14block ← PARSE N EXT B LOCK ()5BSBI-I NVERT(block)6W RITE B LOCK T O D ISK (block, f n )7 M ERGE B LOCKS ( f 1 , .

. . , f n ; f merged )◮ Figure 4.2 Blocked sort-based indexing. The algorithm stores inverted blocks infiles f 1 , . . . , f n and the merged index in f merged .BLOCKED SORT- BASEDINDEXING ALGORITHMINVERSIONPOSTINGment of such an algorithm is that it minimize the number of random diskseeks during sorting – sequential disk reads are far faster than seeks as weexplained in Section 4.1. One solution is the blocked sort-based indexing algorithm or BSBI in Figure 4.2.

BSBI (i) segments the collection into parts of equalsize, (ii) sorts the termID–docID pairs of each part in memory, (iii) stores intermediate sorted results on disk, and (iv) merges all intermediate resultsinto the final index.The algorithm parses documents into termID–docID pairs and accumulates the pairs in memory until a block of a fixed size is full (PARSE N EXT B LOCKin Figure 4.2).

We choose the block size to fit comfortably into memory topermit a fast in-memory sort. The block is then inverted and written to disk.Inversion involves two steps. First, we sort the termID–docID pairs. Next,we collect all termID–docID pairs with the same termID into a postings list,where a posting is simply a docID.

The result, an inverted index for the blockwe have just read, is then written to disk. Applying this to Reuters-RCV1 andassuming we can fit 10 million termID–docID pairs into memory, we end upwith ten blocks, each an inverted index of one part of the collection.In the final step, the algorithm simultaneously merges the ten blocks intoone large merged index. An example with two blocks is shown in Figure 4.3,where we use di to denote the ith document of the collection. To do the merging, we open all block files simultaneously, and maintain small read buffersfor the ten blocks we are reading and a write buffer for the final merged index we are writing.

In each iteration, we select the lowest termID that hasnot been processed yet using a priority queue or a similar data structure. Allpostings lists for this termID are read and merged, and the merged list iswritten back to disk. Each read buffer is refilled from its file when necessary.How expensive is BSBI? Its time complexity is Θ( T log T ) because the stepwith the highest time complexity is sorting and T is an upper bound for thenumber of items we must sort (i.e., the number of termID–docID pairs).

ButOnline edition (c) 2009 Cambridge UP724 Index constructionpostings liststo be mergedbrutuscaesarnoblewithd1,d3d1,d2,d4d5d1,d2,d3,d5brutuscaesarjuliuskilledd6,d7d8,d9d10d8brutuscaesarjuliuskillednoblewithd1,d3,d6,d7d1,d2,d4,d8,d9d10d8d5d1,d2,d3,d5mergedpostings listsdisk◮ Figure 4.3 Merging in blocked sort-based indexing.

Two blocks (“postings lists tobe merged”) are loaded from disk into memory, merged in memory (“merged postings lists”) and written back to disk. We show terms instead of termIDs for betterreadability.the actual indexing time is usually dominated by the time it takes to parse thedocuments (PARSE N EXT B LOCK) and to do the final merge (M ERGE B LOCKS).Exercise 4.6 asks you to compute the total index construction time for RCV1that includes these steps as well as inverting the blocks and writing them todisk.Notice that Reuters-RCV1 is not particularly large in an age when one ormore GB of memory are standard on personal computers. With appropriatecompression (Chapter 5), we could have created an inverted index for RCV1in memory on a not overly beefy server. The techniques we have describedare needed, however, for collections that are several orders of magnitudelarger.?Exercise 4.1If we need T log2 T comparisons (where T is the number of termID–docID pairs) andtwo disk seeks for each comparison, how much time would index construction forReuters-RCV1 take if we used disk instead of memory for storage and an unoptimized sorting algorithm (i.e., not an external sorting algorithm)? Use the systemparameters in Table 4.1.Exercise 4.2[⋆]How would you create the dictionary in blocked sort-based indexing on the fly toavoid an extra pass through the data?Online edition (c) 2009 Cambridge UP4.3 Single-pass in-memory indexing73SPIMI-I NVERT (token_stream)1 output_ f ile = N EW F ILE ()2 dictionary = N EW H ASH ()3 while (free memory available)4 do token ← next(token_stream)5if term(token) ∈/ dictionary6then postings_list = A DD T O D ICTIONARY (dictionary, term(token))7else postings_list = G ET P OSTINGS L IST (dictionary, term(token))8if f ull ( postings_list)9then postings_list = D OUBLE P OSTINGS L IST (dictionary, term(token))10A DD T O P OSTINGS L IST ( postings_list, docID (token))11 sorted_terms ← S ORT T ERMS (dictionary)12 W RITE B LOCK T O D ISK (sorted_terms, dictionary, output_ f ile)13 return output_ f ile◮ Figure 4.4 Inversion of a block in single-pass in-memory indexing4.3SINGLE - PASSIN - MEMORY INDEXINGSingle-pass in-memory indexingBlocked sort-based indexing has excellent scaling properties, but it needsa data structure for mapping terms to termIDs.

For very large collections,this data structure does not fit into memory. A more scalable alternative issingle-pass in-memory indexing or SPIMI. SPIMI uses terms instead of termIDs,writes each block’s dictionary to disk, and then starts a new dictionary for thenext block. SPIMI can index collections of any size as long as there is enoughdisk space available.The SPIMI algorithm is shown in Figure 4.4. The part of the algorithm thatparses documents and turns them into a stream of term–docID pairs, whichwe call tokens here, has been omitted. SPIMI-I NVERT is called repeatedly onthe token stream until the entire collection has been processed.Tokens are processed one by one (line 4) during each successive call ofSPIMI-I NVERT.

When a term occurs for the first time, it is added to thedictionary (best implemented as a hash), and a new postings list is created(line 6). The call in line 7 returns this postings list for subsequent occurrencesof the term.A difference between BSBI and SPIMI is that SPIMI adds a posting directly to its postings list (line 10). Instead of first collecting all termID–docIDpairs and then sorting them (as we did in BSBI), each postings list is dynamic(i.e., its size is adjusted as it grows) and it is immediately available to collectpostings. This has two advantages: It is faster because there is no sortingrequired, and it saves memory because we keep track of the term a postingsOnline edition (c) 2009 Cambridge UP744 Index constructionlist belongs to, so the termIDs of postings need not be stored.

As a result, theblocks that individual calls of SPIMI-I NVERT can process are much largerand the index construction process as a whole is more efficient.Because we do not know how large the postings list of a term will be whenwe first encounter it, we allocate space for a short postings list initially anddouble the space each time it is full (lines 8–9). This means that some memory is wasted, which counteracts the memory savings from the omission oftermIDs in intermediate data structures. However, the overall memory requirements for the dynamically constructed index of a block in SPIMI arestill lower than in BSBI.When memory has been exhausted, we write the index of the block (whichconsists of the dictionary and the postings lists) to disk (line 12).

We have tosort the terms (line 11) before doing this because we want to write postingslists in lexicographic order to facilitate the final merging step. If each block’spostings lists were written in unsorted order, merging blocks could not beaccomplished by a simple linear scan through each block.Each call of SPIMI-I NVERT writes a block to disk, just as in BSBI. The laststep of SPIMI (corresponding to line 7 in Figure 4.2; not shown in Figure 4.4)is then to merge the blocks into the final inverted index.In addition to constructing a new dictionary structure for each block andeliminating the expensive sorting step, SPIMI has a third important component: compression.

Both the postings and the dictionary terms can be storedcompactly on disk if we employ compression. Compression increases the efficiency of the algorithm further because we can process even larger blocks,and because the individual blocks require less space on disk. We refer readersto the literature for this aspect of the algorithm (Section 4.7).The time complexity of SPIMI is Θ( T ) because no sorting of tokens is required and all operations are at most linear in the size of the collection.4.4Distributed indexingCollections are often so large that we cannot perform index construction efficiently on a single machine.

This is particularly true of the World Wide Webfor which we need large computer clusters1 to construct any reasonably sizedweb index. Web search engines, therefore, use distributed indexing algorithmsfor index construction. The result of the construction process is a distributedindex that is partitioned across several machines – either according to termor according to document. In this section, we describe distributed indexingfor a term-partitioned index.

Характеристики

Тип файла

PDF-файл

Размер

6,58 Mb

Материал

An introduction to information retrieval. Manning_ Raghavan (2009).pdf

Тип материала

Книга

Предмет

Анализ текстовых данных и информационный поиск

Высшее учебное заведение

МГУ им. Ломоносова

Список файлов книги

an-introduction-to-information-retrieval.-manning_-raghavan-2009.pdf.rar

An introduction to information retrieval. Manning_ Raghavan (2009).pdf

Поделитесь ссылкой:

Ставлю 10/10
Все нравится, очень удобный сайт, помогает в учебе. Кроме этого, можно заработать самому, выставляя готовые учебные материалы на продажу здесь. Рейтинги и отзывы на преподавателей очень помогают сориентироваться в начале нового семестра. Спасибо за такую функцию. Ставлю максимальную оценку.

Лучшая платформа для успешной сдачи сессии
Познакомился со СтудИзбой благодаря своему другу, очень нравится интерфейс, количество доступных файлов, цена, в общем, все прекрасно. Даже сам продаю какие-то свои работы.

Студизба ван лав ❤
Очень офигенный сайт для студентов. Много полезных учебных материалов. Пользуюсь студизбой с октября 2021 года. Серьёзных нареканий нет. Хотелось бы, что бы ввели подписочную модель и сделали материалы дешевле 300 рублей в рамках подписки бесплатными.

Отличный сайт
Лично меня всё устраивает - и покупка, и продажа; и цены, и возможность предпросмотра куска файла, и обилие бесплатных файлов (в подборках по авторам, читай, ВУЗам и факультетам). Есть определённые баги, но всё решаемо, да и администраторы реагируют в течение суток.

Маленький отзыв о большом помощнике!
Студизба спасает в те моменты, когда сроки горят, а работ накопилось достаточно. Довольно удобный сайт с простой навигацией и огромным количеством материалов.

Студ. Изба как крупнейший сборник работ для студентов
Тут дофига бывает всего полезного. Печально, что бывают предметы по которым даже одного бесплатного решения нет, но это скорее вопрос к студентам. В остальном всё здорово.

Спасательный островок
Если уже не успеваешь разобраться или застрял на каком-то задание поможет тебе быстро и недорого решить твою проблему.

Всё и так отлично
Всё очень удобно. Особенно круто, что есть система бонусов и можно выводить остатки денег. Очень много качественных бесплатных файлов.

Отзыв о системе "Студизба"
Отличная платформа для распространения работ, востребованных студентами. Хорошо налаженная и качественная работа сайта, огромная база заданий и аудитория.

Отличный помощник
Отличный сайт с кучей полезных файлов, позволяющий найти много методичек / учебников / отзывов о вузах и преподователях.

Отлично помогает студентам в любой момент для решения трудных и незамедлительных задач
Хотелось бы больше конкретной информации о преподавателях. А так в принципе хороший сайт, всегда им пользуюсь и ни разу не было желания прекратить. Хороший сайт для помощи студентам, удобный и приятный интерфейс. Из недостатков можно выделить только отсутствия небольшого количества файлов.

Спасибо за шикарный сайт
Великолепный сайт на котором студент за не большие деньги может найти помощь с дз, проектами курсовыми, лабораторными, а также узнать отзывы на преподавателей и бесплатно скачать пособия.