An introduction to information retrieval. Manning_ Raghavan (2009) (811397), страница 22

Файл №811397 An introduction to information retrieval. Manning_ Raghavan (2009) (An introduction to information retrieval. Manning_ Raghavan (2009).pdf) 22 страницаAn introduction to information retrieval. Manning_ Raghavan (2009) (811397) страница 222020-08-252020-08-25СтудИзба

An introduction to information retrieval. Manning_ Raghavan (2009).pdf

Просмтор этого файла доступен только зарегистрированным пользователям. Но у нас супер быстрая регистрация: достаточно только электронной почты!

Регистрация/авторизация

Текст из файла (страница 22)

Most large search engines prefer a document1. A cluster in this chapter is a group of tightly coupled computers that work together closely.This sense of the word is different from the use of cluster as a group of documents that aresemantically similar in Chapters 16–18.Online edition (c) 2009 Cambridge UP4.4 Distributed indexingM AP R EDUCEMASTER NODESPLITSKEY- VALUE PAIRSMAP PHASEPARSERSEGMENT FILEREDUCE PHASE75partitioned index (which can be easily generated from a term-partitionedindex). We discuss this topic further in Section 20.3 (page 454).The distributed index construction method we describe in this section is anapplication of MapReduce, a general architecture for distributed computing.MapReduce is designed for large computer clusters.

The point of a cluster isto solve large computing problems on cheap commodity machines or nodesthat are built from standard parts (processor, memory, disk) as opposed to ona supercomputer with specialized hardware. Although hundreds or thousands of machines are available in such clusters, individual machines canfail at any time. One requirement for robust distributed indexing is, therefore, that we divide the work up into chunks that we can easily assign and– in case of failure – reassign. A master node directs the process of assigningand reassigning tasks to individual worker nodes.The map and reduce phases of MapReduce split up the computing jobinto chunks that standard machines can process in a short time.

The varioussteps of MapReduce are shown in Figure 4.5 and an example on a collectionconsisting of two documents is shown in Figure 4.6. First, the input data,in our case a collection of web pages, are split into n splits where the size ofthe split is chosen to ensure that the work can be distributed evenly (chunksshould not be too large) and efficiently (the total number of chunks we needto manage should not be too large); 16 or 64 MB are good sizes in distributedindexing.

Splits are not preassigned to machines, but are instead assignedby the master node on an ongoing basis: As a machine finishes processingone split, it is assigned the next one. If a machine dies or becomes a laggarddue to hardware problems, the split it is working on is simply reassigned toanother machine.In general, MapReduce breaks a large computing problem into smallerparts by recasting it in terms of manipulation of key-value pairs. For indexing, a key-value pair has the form (termID,docID).

In distributed indexing,the mapping from terms to termIDs is also distributed and therefore morecomplex than in single-machine indexing. A simple solution is to maintaina (perhaps precomputed) mapping for frequent terms that is copied to allnodes and to use terms directly (instead of termIDs) for infrequent terms.We do not address this problem here and assume that all nodes share a consistent term → termID mapping.The map phase of MapReduce consists of mapping splits of the input datato key-value pairs.

This is the same parsing task we also encountered in BSBIand SPIMI, and we therefore call the machines that execute the map phaseparsers. Each parser writes its output to local intermediate files, the segmentfiles (shown as a-f g-p q-z in Figure 4.5).For the reduce phase, we want all values for a given key to be stored closetogether, so that they can be read and processed quickly. This is achieved byOnline edition (c) 2009 Cambridge UP764 Index constructionsplitsassignmasterassignparsera-f g-p q-zparsera-f g-p q-zparsera-f g-p q-zmapphasesegmentfilespostingsinve rtera-finve rterg-pinve rterq-zreducephase◮ Figure 4.5 An example of distributed indexing with MapReduce.

Adapted fromDean and Ghemawat (2004).INVERTERpartitioning the keys into j term partitions and having the parsers write keyvalue pairs for each term partition into a separate segment file. In Figure 4.5,the term partitions are according to first letter: a–f, g–p, q–z, and j = 3. (Wechose these key ranges for ease of exposition. In general, key ranges need notcorrespond to contiguous terms or termIDs.) The term partitions are definedby the person who operates the indexing system (Exercise 4.10). The parsersthen write corresponding segment files, one for each term partition. Eachterm partition thus corresponds to r segments files, where r is the numberof parsers.

For instance, Figure 4.5 shows three a–f segment files of the a–fpartition, corresponding to the three parsers shown in the figure.Collecting all values (here: docIDs) for a given key (here: termID) into onelist is the task of the inverters in the reduce phase. The master assigns eachterm partition to a different inverter – and, as in the case of parsers, reassigns term partitions in case of failing or slow inverters.

Each term partition(corresponding to r segment files, one on each parser) is processed by one inverter. We assume here that segment files are of a size that a single machinecan handle (Exercise 4.9). Finally, the list of values is sorted for each key andwritten to the final sorted postings list (“postings” in the figure). (Note thatpostings in Figure 4.6 include term frequencies, whereas each posting in theother sections of this chapter is simply a docID without term frequency information.) The data flow is shown for a–f in Figure 4.5. This completes theconstruction of the inverted index.Online edition (c) 2009 Cambridge UP774.4 Distributed indexingSchema of map and reduce functionsmap: inputreduce: (k,list(v))→ list(k , v)→ outputInstantiation of the schema for index construction→ list(termID, docID)map: web collectionreduce: (〈termID1, list(docID)〉, 〈termID2 , list(docID)〉, .

. . ) → (postings list1, postings list 2 , . . . )Example for index constructionmap: d2 : C died. d1 : C came, C c’ed.reduce: ( 〈C,(d2 ,d1 ,d1 )〉, 〈died,(d2 )〉, 〈came,(d1 )〉, 〈c’ed,(d1 ) 〉)→ ( 〈C, d2 〉, 〈died,d2〉, 〈C,d1〉, 〈came,d1〉, 〈C,d1〉, 〈c’ed,d1〉)→ (〈C,(d1:2,d2:1)〉, 〈died,(d2:1)〉, 〈came,(d1:1)〉, 〈c’ed,(d1:1)〉 )◮ Figure 4.6 Map and reduce functions in MapReduce.

In general, the map function produces a list of key-value pairs. All values for a key are collected into onelist in the reduce phase. This list is then processed further. The instantiations of thetwo functions and an example are shown for index construction. Because the mapphase processes documents in a distributed fashion, termID–docID pairs need not beordered correctly initially as in this example. The example shows terms instead oftermIDs for better readability. We abbreviate Caesar as C and conquered as c’ed.Parsers and inverters are not separate sets of machines.

The master identifies idle machines and assigns tasks to them. The same machine can be aparser in the map phase and an inverter in the reduce phase. And there areoften other jobs that run in parallel with index construction, so in betweenbeing a parser and an inverter a machine might do some crawling or anotherunrelated task.To minimize write times before inverters reduce the data, each parser writesits segment files to its local disk. In the reduce phase, the master communicates to an inverter the locations of the relevant segment files (e.g., of the rsegment files of the a–f partition). Each segment file only requires one sequential read because all data relevant to a particular inverter were writtento a single segment file by the parser. This setup minimizes the amount ofnetwork traffic needed during indexing.Figure 4.6 shows the general schema of the MapReduce functions.

Input and output are often lists of key-value pairs themselves, so that severalMapReduce jobs can run in sequence. In fact, this was the design of theGoogle indexing system in 2004. What we describe in this section corresponds to only one of five to ten MapReduce operations in that indexingsystem. Another MapReduce operation transforms the term-partitioned index we just created into a document-partitioned one.MapReduce offers a robust and conceptually simple framework for implementing index construction in a distributed environment.

By providing asemiautomatic method for splitting index construction into smaller tasks, itcan scale to almost arbitrarily large collections, given computer clusters ofOnline edition (c) 2009 Cambridge UP784 Index constructionsufficient size.?Exercise 4.34.5Dynamic indexingAUXILIARY INDEXFor n = 15 splits, r = 10 segments, and j = 3 term partitions, how long woulddistributed index creation take for Reuters-RCV1 in a MapReduce architecture? Baseyour assumptions about cluster machines on Table 4.1.Thus far, we have assumed that the document collection is static.

Характеристики

Тип файла

PDF-файл

Размер

6,58 Mb

Материал

An introduction to information retrieval. Manning_ Raghavan (2009).pdf

Тип материала

Книга

Предмет

Анализ текстовых данных и информационный поиск

Высшее учебное заведение

МГУ им. Ломоносова

Список файлов книги

an-introduction-to-information-retrieval.-manning_-raghavan-2009.pdf.rar

An introduction to information retrieval. Manning_ Raghavan (2009).pdf

Поделитесь ссылкой:

Ставлю 10/10
Все нравится, очень удобный сайт, помогает в учебе. Кроме этого, можно заработать самому, выставляя готовые учебные материалы на продажу здесь. Рейтинги и отзывы на преподавателей очень помогают сориентироваться в начале нового семестра. Спасибо за такую функцию. Ставлю максимальную оценку.

Лучшая платформа для успешной сдачи сессии
Познакомился со СтудИзбой благодаря своему другу, очень нравится интерфейс, количество доступных файлов, цена, в общем, все прекрасно. Даже сам продаю какие-то свои работы.

Студизба ван лав ❤
Очень офигенный сайт для студентов. Много полезных учебных материалов. Пользуюсь студизбой с октября 2021 года. Серьёзных нареканий нет. Хотелось бы, что бы ввели подписочную модель и сделали материалы дешевле 300 рублей в рамках подписки бесплатными.

Отличный сайт
Лично меня всё устраивает - и покупка, и продажа; и цены, и возможность предпросмотра куска файла, и обилие бесплатных файлов (в подборках по авторам, читай, ВУЗам и факультетам). Есть определённые баги, но всё решаемо, да и администраторы реагируют в течение суток.

Маленький отзыв о большом помощнике!
Студизба спасает в те моменты, когда сроки горят, а работ накопилось достаточно. Довольно удобный сайт с простой навигацией и огромным количеством материалов.

Студ. Изба как крупнейший сборник работ для студентов
Тут дофига бывает всего полезного. Печально, что бывают предметы по которым даже одного бесплатного решения нет, но это скорее вопрос к студентам. В остальном всё здорово.

Спасательный островок
Если уже не успеваешь разобраться или застрял на каком-то задание поможет тебе быстро и недорого решить твою проблему.

Всё и так отлично
Всё очень удобно. Особенно круто, что есть система бонусов и можно выводить остатки денег. Очень много качественных бесплатных файлов.

Отзыв о системе "Студизба"
Отличная платформа для распространения работ, востребованных студентами. Хорошо налаженная и качественная работа сайта, огромная база заданий и аудитория.

Отличный помощник
Отличный сайт с кучей полезных файлов, позволяющий найти много методичек / учебников / отзывов о вузах и преподователях.

Отлично помогает студентам в любой момент для решения трудных и незамедлительных задач
Хотелось бы больше конкретной информации о преподавателях. А так в принципе хороший сайт, всегда им пользуюсь и ни разу не было желания прекратить. Хороший сайт для помощи студентам, удобный и приятный интерфейс. Из недостатков можно выделить только отсутствия небольшого количества файлов.

Спасибо за шикарный сайт
Великолепный сайт на котором студент за не большие деньги может найти помощь с дз, проектами курсовыми, лабораторными, а также узнать отзывы на преподавателей и бесплатно скачать пособия.