An introduction to information retrieval. Manning_ Raghavan (2009) (811397), страница 17

Файл №811397 An introduction to information retrieval. Manning_ Raghavan (2009) (An introduction to information retrieval. Manning_ Raghavan (2009).pdf) 17 страницаAn introduction to information retrieval. Manning_ Raghavan (2009) (811397) страница 172020-08-252020-08-25СтудИзба

An introduction to information retrieval. Manning_ Raghavan (2009).pdf

Просмтор этого файла доступен только зарегистрированным пользователям. Но у нас супер быстрая регистрация: достаточно только электронной почты!

Регистрация/авторизация

Текст из файла (страница 17)

The principal issue here is that of rebalancing: as terms areinserted into or deleted from the binary search tree, it needs to be rebalancedso that the balance property is maintained.To mitigate rebalancing, one approach is to allow the number of sub-treesunder an internal node to vary in a fixed interval.

A search tree commonlyused for a dictionary is the B-tree – a search tree in which every internal nodehas a number of children in the interval [ a, b], where a and b are appropriatepositive integers; Figure 3.2 shows an example with a = 2 and b = 4. Eachbranch under an internal node again represents a test for a range of char1.

So-called perfect hash functions are designed to preclude collisions, but are rather more complicated both to implement and to compute.Online edition (c) 2009 Cambridge UP3.2 Wildcard queries51◮ Figure 3.1 A binary search tree. In this example the branch at the root partitionsvocabulary terms into two subtrees, those whose first letter is between a and m, andthe rest.acter sequences, as in the binary tree example of Figure 3.1.

A B-tree maybe viewed as “collapsing” multiple levels of the binary tree into one; thisis especially advantageous when some of the dictionary is disk-resident, inwhich case this collapsing serves the function of pre-fetching imminent binary tests. In such cases, the integers a and b are determined by the sizes ofdisk blocks. Section 3.5 contains pointers to further background on searchtrees and B-trees.It should be noted that unlike hashing, search trees demand that the characters used in the document collection have a prescribed ordering; for instance, the 26 letters of the English alphabet are always listed in the specificorder A through Z. Some Asian languages such as Chinese do not alwayshave a unique ordering, although by now all languages (including Chineseand Japanese) have adopted a standard ordering system for their charactersets.3.2Wildcard queriesWildcard queries are used in any of the following situations: (1) the useris uncertain of the spelling of a query term (e.g., Sydney vs.

Sidney, whichOnline edition (c) 2009 Cambridge UP523 Dictionaries and tolerant retrieval◮ Figure 3.2 A B-tree. In this example every internal node has between 2 and 4children.WILDCARD QUERYleads to the wildcard query S*dney); (2) the user is aware of multiple variants of spelling a term and (consciously) seeks documents containing any ofthe variants (e.g., color vs. colour); (3) the user seeks documents containingvariants of a term that would be caught by stemming, but is unsure whetherthe search engine performs stemming (e.g., judicial vs. judiciary, leading to thewildcard query judicia*); (4) the user is uncertain of the correct rendition of aforeign word or phrase (e.g., the query Universit* Stuttgart).A query such as mon* is known as a trailing wildcard query, because the *symbol occurs only once, at the end of the search string.

A search tree onthe dictionary is a convenient way of handling trailing wildcard queries: wewalk down the tree following the symbols m, o and n in turn, at which pointwe can enumerate the set W of terms in the dictionary with the prefix mon.Finally, we use |W | lookups on the standard inverted index to retrieve alldocuments containing any term in W.But what about wildcard queries in which the * symbol is not constrainedto be at the end of the search string? Before handling this general case, wemention a slight generalization of trailing wildcard queries. First, considerleading wildcard queries, or queries of the form *mon. Consider a reverse B-treeon the dictionary – one in which each root-to-leaf path of the B-tree corresponds to a term in the dictionary written backwards: thus, the term lemonwould, in the B-tree, be represented by the path root-n-o-m-e-l.

A walk downthe reverse B-tree then enumerates all terms R in the vocabulary with a givenprefix.In fact, using a regular B-tree together with a reverse B-tree, we can handlean even more general case: wildcard queries in which there is a single * symbol, such as se*mon. To do this, we use the regular B-tree to enumerate the setW of dictionary terms beginning with the prefix se, then the reverse B-tree toOnline edition (c) 2009 Cambridge UP3.2 Wildcard queries53enumerate the set R of terms ending with the suffix mon. Next, we take theintersection W ∩ R of these two sets, to arrive at the set of terms that beginwith the prefix se and end with the suffix mon.

Finally, we use the standardinverted index to retrieve all documents containing any terms in this intersection. We can thus handle wildcard queries that contain a single * symbolusing two B-trees, the normal B-tree and a reverse B-tree.3.2.1General wildcard queriesWe now study two techniques for handling general wildcard queries.

Bothtechniques share a common strategy: express the given wildcard query qw asa Boolean query Q on a specially constructed index, such that the answer toQ is a superset of the set of vocabulary terms matching qw . Then, we checkeach term in the answer to Q against qw , discarding those vocabulary termsthat do not match qw . At this point we have the vocabulary terms matchingqw and can resort to the standard inverted index.Permuterm indexesPERMUTERM INDEXOur first special index for general wildcard queries is the permuterm index,a form of inverted index. First, we introduce a special symbol $ into ourcharacter set, to mark the end of a term.

Thus, the term hello is shown here asthe augmented term hello$. Next, we construct a permuterm index, in whichthe various rotations of each term (augmented with $) all link to the originalvocabulary term. Figure 3.3 gives an example of such a permuterm indexentry for the term hello.We refer to the set of rotated terms in the permuterm index as the permuterm vocabulary.How does this index help us with wildcard queries? Consider the wildcardquery m*n.

The key is to rotate such a wildcard query so that the * symbolappears at the end of the string – thus the rotated wildcard query becomesn$m*. Next, we look up this string in the permuterm index, where seekingn$m* (via a search tree) leads to rotations of (among others) the terms manand moron.Now that the permuterm index enables us to identify the original vocabulary terms matching a wildcard query, we look up these terms in the standard inverted index to retrieve matching documents.

We can thus handleany wildcard query with a single * symbol. But what about a query such asfi*mo*er? In this case we first enumerate the terms in the dictionary that arein the permuterm index of er$fi*. Not all such dictionary terms will havethe string mo in the middle - we filter these out by exhaustive enumeration, checking each candidate to see if it contains mo. In this example, theterm fishmonger would survive this filtering but filibuster would not. We thenOnline edition (c) 2009 Cambridge UP543 Dictionaries and tolerant retrieval◮ Figure 3.3 A portion of a permuterm index.run the surviving terms through the standard inverted index for documentretrieval.

One disadvantage of the permuterm index is that its dictionarybecomes quite large, including as it does all rotations of each term.Notice the close interplay between the B-tree and the permuterm indexabove. Indeed, it suggests that the structure should perhaps be viewed asa permuterm B-tree. However, we follow traditional terminology here indescribing the permuterm index as distinct from the B-tree that allows us toselect the rotations with a given prefix.3.2.2k- GRAM INDEXk-gram indexes for wildcard queriesWhereas the permuterm index is simple, it can lead to a considerable blowupfrom the number of rotations per term; for a dictionary of English terms, thiscan represent an almost ten-fold space increase. We now present a secondtechnique, known as the k-gram index, for processing wildcard queries.

Wewill also use k-gram indexes in Section 3.3.4. A k-gram is a sequence of kcharacters. Thus cas, ast and stl are all 3-grams occurring in the term castle.We use a special character $ to denote the beginning or end of a term, so thefull set of 3-grams generated for castle is: $ca, cas, ast, stl, tle, le$.In a k-gram index, the dictionary contains all k-grams that occur in any termin the vocabulary. Each postings list points from a k-gram to all vocabularyOnline edition (c) 2009 Cambridge UP553.2 Wildcard queriesetr- beetroot-metric-petrify- retrieval◮ Figure 3.4 Example of a postings list in a 3-gram index.

Here the 3-gram etr isillustrated. Matching vocabulary terms are lexicographically ordered in the postings.terms containing that k-gram. For instance, the 3-gram etr would point to vocabulary terms such as metric and retrieval. An example is given in Figure 3.4.How does such an index help us with wildcard queries? Consider thewildcard query re*ve. We are seeking documents containing any term thatbegins with re and ends with ve. Accordingly, we run the Boolean query $reAND ve$.

This is looked up in the 3-gram index and yields a list of matchingterms such as relive, remove and retrieve. Each of these matching terms is thenlooked up in the standard inverted index to yield documents matching thequery.There is however a difficulty with the use of k-gram indexes, that demandsone further step of processing. Consider using the 3-gram index describedabove for the query red*. Following the process described above, we firstissue the Boolean query $re AND red to the 3-gram index.

This leads to amatch on terms such as retired, which contain the conjunction of the two 3grams $re and red, yet do not match the original wildcard query red*.To cope with this, we introduce a post-filtering step, in which the terms enumerated by the Boolean query on the 3-gram index are checked individuallyagainst the original query red*. This is a simple string-matching operationand weeds out terms such as retired that do not match the original query.Terms that survive are then searched in the standard inverted index as usual.We have seen that a wildcard query can result in multiple terms beingenumerated, each of which becomes a single-term query on the standard inverted index. Search engines do allow the combination of wildcard queriesusing Boolean operators, for example, re*d AND fe*ri. What is the appropriatesemantics for such a query? Since each wildcard query turns into a disjunction of single-term queries, the appropriate interpretation of this exampleis that we have a conjunction of disjunctions: we seek all documents thatcontain any term matching re*d and any term matching fe*ri.Even without Boolean combinations of wildcard queries, the processing ofa wildcard query can be quite expensive, because of the added lookup in thespecial index, filtering and finally the standard inverted index.

A search engine may support such rich functionality, but most commonly, the capabilityis hidden behind an interface (say an “Advanced Query” interface) that mostOnline edition (c) 2009 Cambridge UP563 Dictionaries and tolerant retrievalusers never use. Exposing such functionality in the search interface often encourages users to invoke it even when they do not require it (say, by typinga prefix of their query followed by a *), increasing the processing load on thesearch engine.?Exercise 3.1In the permuterm index, each permuterm vocabulary term points to the original vocabulary term(s) from which it was derived.

Характеристики

Тип файла

PDF-файл

Размер

6,58 Mb

Материал

An introduction to information retrieval. Manning_ Raghavan (2009).pdf

Тип материала

Книга

Предмет

Анализ текстовых данных и информационный поиск

Высшее учебное заведение

МГУ им. Ломоносова

Список файлов книги

an-introduction-to-information-retrieval.-manning_-raghavan-2009.pdf.rar

An introduction to information retrieval. Manning_ Raghavan (2009).pdf

Поделитесь ссылкой:

Ставлю 10/10
Все нравится, очень удобный сайт, помогает в учебе. Кроме этого, можно заработать самому, выставляя готовые учебные материалы на продажу здесь. Рейтинги и отзывы на преподавателей очень помогают сориентироваться в начале нового семестра. Спасибо за такую функцию. Ставлю максимальную оценку.

Лучшая платформа для успешной сдачи сессии
Познакомился со СтудИзбой благодаря своему другу, очень нравится интерфейс, количество доступных файлов, цена, в общем, все прекрасно. Даже сам продаю какие-то свои работы.

Студизба ван лав ❤
Очень офигенный сайт для студентов. Много полезных учебных материалов. Пользуюсь студизбой с октября 2021 года. Серьёзных нареканий нет. Хотелось бы, что бы ввели подписочную модель и сделали материалы дешевле 300 рублей в рамках подписки бесплатными.

Отличный сайт
Лично меня всё устраивает - и покупка, и продажа; и цены, и возможность предпросмотра куска файла, и обилие бесплатных файлов (в подборках по авторам, читай, ВУЗам и факультетам). Есть определённые баги, но всё решаемо, да и администраторы реагируют в течение суток.

Маленький отзыв о большом помощнике!
Студизба спасает в те моменты, когда сроки горят, а работ накопилось достаточно. Довольно удобный сайт с простой навигацией и огромным количеством материалов.

Студ. Изба как крупнейший сборник работ для студентов
Тут дофига бывает всего полезного. Печально, что бывают предметы по которым даже одного бесплатного решения нет, но это скорее вопрос к студентам. В остальном всё здорово.

Спасательный островок
Если уже не успеваешь разобраться или застрял на каком-то задание поможет тебе быстро и недорого решить твою проблему.

Всё и так отлично
Всё очень удобно. Особенно круто, что есть система бонусов и можно выводить остатки денег. Очень много качественных бесплатных файлов.

Отзыв о системе "Студизба"
Отличная платформа для распространения работ, востребованных студентами. Хорошо налаженная и качественная работа сайта, огромная база заданий и аудитория.

Отличный помощник
Отличный сайт с кучей полезных файлов, позволяющий найти много методичек / учебников / отзывов о вузах и преподователях.

Отлично помогает студентам в любой момент для решения трудных и незамедлительных задач
Хотелось бы больше конкретной информации о преподавателях. А так в принципе хороший сайт, всегда им пользуюсь и ни разу не было желания прекратить. Хороший сайт для помощи студентам, удобный и приятный интерфейс. Из недостатков можно выделить только отсутствия небольшого количества файлов.

Спасибо за шикарный сайт
Великолепный сайт на котором студент за не большие деньги может найти помощь с дз, проектами курсовыми, лабораторными, а также узнать отзывы на преподавателей и бесплатно скачать пособия.