7.1.1 Inexact top K document retrieval

We now consider schemes by which we produce K documents that are likely to be among the K highest scoring documents for a query. In doing so, we hope to dramatically lower the cost of computing the K documents we output, without materially altering the user's perceived relevance of the top K results. Consequently, in most applications it suffices to retrieve K documents whose scores are very close to those of the K best. In the sections that follow we detail schemes that retrieve K such documents while potentially avoiding computing scores for most of the N documents in the collection.

Such inexact top-K retrieval is not necessarily, from the user's perspective, a bad thing. The top K documents by the cosine measure are in any case not necessarily the K best for the query: cosine similarity is only a proxy for the user's perceived relevance.
In Sections 7.1.2–7.1.6 below, we give heuristics using which we are likely to retrieve K documents with cosine scores close to those of the top K documents. The principal cost in computing the output stems from computing cosine similarities between the query and a large number of documents.
Having a large number of documents in contention also increases the selection cost in the final stage of culling the top K documents from a heap. We now consider a series of ideas designed to eliminate a large number of documents without computing their cosine scores. The heuristics have the following two-step scheme:

1. Find a set A of documents that are contenders, where K < |A| ≪ N.
   A does not necessarily contain the K top-scoring documents for the query, but is likely to have many documents with scores near those of the top K.

2. Return the K top-scoring documents in A.

From the descriptions of these ideas it will be clear that many of them require parameters to be tuned to the collection and application at hand; pointers to experience in setting these parameters may be found at the end of this chapter.
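To make the two-step scheme concrete, here is a minimal sketch in Python. The helpers find_contenders and cosine_score are hypothetical stand-ins, not from this book: the former represents any of the pruning heuristics of Sections 7.1.2–7.1.6, the latter the cosine computation of Chapter 6.

    import heapq

    def inexact_top_k(query, K, find_contenders, cosine_score):
        # Step 1: cheaply assemble a contender set A, with K < |A| << N,
        # without computing cosine scores for most of the collection.
        A = find_contenders(query)
        # Step 2: compute cosine scores only for documents in A and
        # select the K best with a heap.
        return heapq.nlargest(K, A, key=lambda d: cosine_score(query, d))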
It should also be noted that most of these heuristics are well-suited to free text queries, but not for Boolean or phrase queries.

7.1.2 Index elimination

For a multi-term query q, it is clear we only consider documents containing at least one of the query terms. We can take this a step further using additional heuristics:
1. We only consider documents containing terms whose idf exceeds a preset threshold. Thus, in the postings traversal, we only traverse the postings for terms with high idf. This has a fairly significant benefit: the postings lists of low-idf terms are generally long; with these removed from contention, the set of documents for which we compute cosines is greatly reduced. One way of viewing this heuristic: low-idf terms are treated as stop words and do not contribute to scoring. For instance, on the query catcher in the rye, we only traverse the postings for catcher and rye.
   The cutoff threshold can of course be adapted in a query-dependent manner.

2. We only consider documents that contain many (and as a special case, all) of the query terms. This can be accomplished during the postings traversal; we only compute scores for documents containing all (or many) of the query terms. A danger of this scheme is that by requiring all (or even many) query terms to be present in a document before considering it for cosine computation, we may end up with fewer than K candidate documents in the output. This issue will be discussed further in Section 7.2.1.
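As an illustration of both heuristics, the sketch below assumes a simplified in-memory index: postings maps each term to the set of IDs of documents containing it, and idf maps each term to its inverse document frequency. The function and parameter names (index_elimination, idf_cutoff, min_terms) are invented for this sketch.

    from collections import Counter

    def index_elimination(query_terms, postings, idf, idf_cutoff, min_terms):
        # Heuristic 1: treat low-idf query terms as stop words, traversing
        # only the postings of terms whose idf clears the cutoff.
        kept = [t for t in query_terms if idf.get(t, 0.0) >= idf_cutoff]
        # Heuristic 2: keep only documents containing at least min_terms
        # of the surviving query terms (min_terms = len(kept) demands all).
        counts = Counter()
        for t in kept:
            counts.update(postings.get(t, ()))
        return {d for d, c in counts.items() if c >= min_terms}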
7.1.3 Champion lists

The idea of champion lists (sometimes also called fancy lists or top docs) is to precompute, for each term t in the dictionary, the set of the r documents with the highest weights for t; the value of r is chosen in advance. For tf-idf weighting, these would be the r documents with the highest tf values for term t. We call this set of r documents the champion list for term t.

Now, given a query q we create a set A as follows: we take the union of the champion lists for each of the terms comprising q. We now restrict cosine computation to only the documents in A. A critical parameter in this scheme is the value r, which is highly application dependent.
Intuitively, r should be large compared with K, especially if we use any form of the index elimination described in Section 7.1.2. One issue here is that the value r is set at the time of index construction, whereas K is application dependent and may not be available until the query is received; as a result we may (as in the case of index elimination) find ourselves with a set A that has fewer than K documents. There is no reason to have the same value of r for all terms in the dictionary; it could for instance be set to be higher for rarer terms.
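Both phases of the champion-list scheme are easy to sketch. In the following (illustrative names, not from the book), postings maps each term to a list of (document ID, tf) pairs, and cosine_score is assumed to close over the current query, taking just a document ID.

    import heapq

    def build_champion_lists(postings, r):
        # Index time: for each term t, retain the r documents with the
        # highest tf values for t.
        return {
            t: [d for d, tf in heapq.nlargest(r, pairs, key=lambda p: p[1])]
            for t, pairs in postings.items()
        }

    def retrieve_from_champions(query_terms, champions, K, cosine_score):
        # Query time: restrict cosine computation to the union A of the
        # champion lists of the query terms.
        A = set()
        for t in query_terms:
            A.update(champions.get(t, []))
        return heapq.nlargest(K, A, key=cosine_score)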
7.1.4 Static quality scores and ordering

We now further develop the idea of champion lists, in the somewhat more general setting of static quality scores. In many search engines, we have available a measure of quality g(d) for each document d that is query-independent and thus static. This quality measure may be viewed as a number between zero and one. For instance, in the context of news stories on the web, g(d) may be derived from the number of favorable reviews of the story by web surfers.

[Figure 7.2: A static quality-ordered index. In this example we assume that Doc1, Doc2 and Doc3 respectively have static quality scores g(1) = 0.25, g(2) = 0.5, g(3) = 1.]
Section 4.6 (page 80) provides further discussion on this topic, as does Chapter 21 in the context of web search.

The net score for a document d is some combination of g(d) together with the query-dependent score induced (say) by (6.12). The precise combination may be determined by the learning methods of Section 6.1.2, to be developed further in Section 15.4.1; but for the purposes of our exposition here, let us consider a simple sum:

(7.2)   \text{net-score}(q, d) = g(d) + \frac{\vec{V}(q) \cdot \vec{V}(d)}{|\vec{V}(q)|\,|\vec{V}(d)|}

In this simple form, the static quality g(d) and the query-dependent score from (6.10) have equal contributions, assuming each is between 0 and 1. Other relative weightings are possible; the effectiveness of our heuristics will depend on the specific relative weighting.
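Computed directly from the vector representations, (7.2) might look as follows; this is only a sketch, with numpy supplying the dot product and norms, and g_d standing for the precomputed static quality g(d).

    import numpy as np

    def net_score(g_d, v_q, v_d):
        # net-score(q, d) = g(d) + cosine(q, d), as in (7.2); e.g. if
        # g(d) = 0.5 and the cosine is 0.8, the net score is 1.3.
        cosine = np.dot(v_q, v_d) / (np.linalg.norm(v_q) * np.linalg.norm(v_d))
        return g_d + cosine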
First, consider ordering the documents in the postings list for each term by decreasing value of g(d). This allows us to perform the postings intersection algorithm of Figure 1.6. In order to perform the intersection by a single pass through the postings of each query term, the algorithm of Figure 1.6 relied on the postings being ordered by document IDs. But in fact, we only required that all postings be ordered by a single common ordering; here we rely on the g(d) values to provide this common ordering. This is illustrated in Figure 7.2, where the postings are ordered in decreasing order of g(d).
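The observation that any single common ordering suffices can be seen in a two-pointer intersection sketch. Here both postings lists are assumed sorted by the same strict total order, and key(d) returns a document's rank in that order, e.g. key = lambda d: (-g[d], d) for decreasing g(d) with ties broken by document ID (a tie-breaking rule the text leaves implicit).

    def intersect_by_common_order(p1, p2, key):
        # Single-pass intersection of two postings lists sharing one
        # sort order; compare on key(d) rather than on document IDs.
        answer, i, j = [], 0, 0
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                answer.append(p1[i])
                i, j = i + 1, j + 1
            elif key(p1[i]) < key(p2[j]):
                i += 1  # p1[i] sorts strictly earlier; it cannot match
                        # any remaining element of p2
            else:
                j += 1
        return answer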
The first idea is a direct extension of champion lists: for a well-chosen value r, we maintain for each term t a global champion list of the r documents with the highest values for g(d) + tf-idf_{t,d}. The list itself is, like all the postings lists considered so far, sorted by a common order (either by document IDs or by static quality). Then at query time, we only compute the net scores (7.2) for documents in the union of these global champion lists. Intuitively, this has the effect of focusing on documents likely to have large net scores.

We conclude the discussion of global champion lists with one further idea. We maintain for each term t two postings lists consisting of disjoint sets of documents, each sorted by g(d) values.
The first list, which we call high, contains the m documents with the highest tf values for t. The second list, which we call low, contains all other documents containing t. When processing a query, we first scan only the high lists of the query terms, computing net scores for any document on the high lists of all (or more than a certain number of) query terms. If we obtain scores for K documents in the process, we terminate. If not, we continue the scanning into the low lists, scoring documents in these postings lists.
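A minimal sketch of this tiered scan follows; high and low are assumed to map each term to its two document lists, net_score to compute (7.2) for a document ID, and min_terms to implement "all (or more than a certain number of) query terms". All names are illustrative, and for brevity the sketch checks the size of the candidate set before scoring rather than terminating mid-scan.

    import heapq
    from collections import Counter

    def tiered_retrieve(query_terms, high, low, K, min_terms, net_score):
        def candidates(tier):
            # Documents appearing on the lists of at least min_terms of
            # the query terms within this tier.
            counts = Counter()
            for t in query_terms:
                counts.update(tier.get(t, ()))
            return {d for d, c in counts.items() if c >= min_terms}

        A = candidates(high)       # scan the high lists first
        if len(A) < K:
            # Too few contenders: continue the scan into the low lists.
            A |= candidates(low)
        return heapq.nlargest(K, A, key=net_score)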