An introduction to information retrieval. Manning_ Raghavan (2009) (811397), страница 30

Файл №811397 An introduction to information retrieval. Manning_ Raghavan (2009) (An introduction to information retrieval. Manning_ Raghavan (2009).pdf) 30 страницаAn introduction to information retrieval. Manning_ Raghavan (2009) (811397) страница 302020-08-252020-08-25СтудИзба

An introduction to information retrieval. Manning_ Raghavan (2009).pdf

Просмтор этого файла доступен только зарегистрированным пользователям. Но у нас супер быстрая регистрация: достаточно только электронной почты!

Регистрация/авторизация

Текст из файла (страница 30)

We provide a brief introduction to this topic herebecause weighted zone scoring presents a clean setting for introducing it; acomplete development demands an understanding of machine learning andis deferred to Chapter 15.1. We are provided with a set of training examples, each of which is a tuple consisting of a query q and a document d, together with a relevanceOnline edition (c) 2009 Cambridge UP1146 Scoring, term weighting and the vector space modeljudgment for d on q. In the simplest form, each relevance judgments is either Relevant or Non-relevant.

More sophisticated implementations of themethodology make use of more nuanced judgments.2. The weights gi are then “learned” from these examples, in order that thelearned scores approximate the relevance judgments in the training examples.For weighted zone scoring, the process may be viewed as learning a linear function of the Boolean match scores contributed by the various zones.The expensive component of this methodology is the labor-intensive assembly of user-generated relevance judgments from which to learn the weights,especially in a collection that changes frequently (such as the Web).

We nowdetail a simple example that illustrates how we can reduce the problem oflearning the weights gi to a simple optimization problem.We now consider a simple case of weighted zone scoring, where each document has a title zone and a body zone. Given a query q and a document d, weuse the given Boolean match function to compute Boolean variables s T (d, q)and s B (d, q), depending on whether the title (respectively, body) zone of dmatches query q. For instance, the algorithm in Figure 6.4 uses an AND ofthe query terms for this Boolean function. We will compute a score between0 and 1 for each (document, query) pair using s T (d, q) and s B (d, q) by usinga constant g ∈ [0, 1], as follows:(6.2)score(d, q) = g · s T (d, q) + (1 − g)s B (d, q).We now describe how to determine the constant g from a set of training examples, each of which is a triple of the form Φ j = (d j , q j , r (d j , q j )).

In eachtraining example, a given training document d j and a given training query q jare assessed by a human editor who delivers a relevance judgment r (d j , q j )that is either Relevant or Non-relevant. This is illustrated in Figure 6.5, whereseven training examples are shown.For each training example Φ j we have Boolean values s T (d j , q j ) and s B (d j , q j )that we use to compute a score from (6.2)(6.3)score(d j , q j ) = g · s T (d j , q j ) + (1 − g)s B (d j , q j ).We now compare this computed score to the human relevance judgment forthe same document-query pair (d j , q j ); to this end, we will quantize eachRelevant judgment as a 1 and each Non-relevant judgment as a 0.

Supposethat we define the error of the scoring function with weight g asε( g, Φ j ) = (r (d j , q j ) − score(d j , q j ))2 ,Online edition (c) 2009 Cambridge UP1156.1 Parametric and zone indexesExampleΦ1Φ2Φ3Φ4Φ5Φ6Φ7DocID3737238238174120943191QuerylinuxpenguinsystempenguinkerneldriverdriversT1000101sB1110110JudgmentRelevantNon-relevantRelevantNon-relevantRelevantRelevantNon-relevant◮ Figure 6.5 An illustration of training examples.sT0011sB0101Score01−gg1◮ Figure 6.6 The four possible combinations of s T and s B .where we have quantized the editorial relevance judgment r (d j , q j ) to 0 or 1.Then, the total error of a set of training examples is given by(6.4)∑ ε( g, Φ j ).jThe problem of learning the constant g from the given training examplesthen reduces to picking the value of g that minimizes the total error in (6.4).Picking the best value of g in (6.4) in the formulation of Section 6.1.3 reduces to the problem of minimizing a quadratic function of g over the interval [0, 1].

This reduction is detailed in Section 6.1.3.✄6.1.3The optimal weight gWe begin by noting that for any training example Φ j for which s T (d j , q j ) = 0and s B (d j , q j ) = 1, the score computed by Equation (6.2) is 1 − g. In similarfashion, we may write down the score computed by Equation (6.2) for thethree other possible combinations of s T (d j , q j ) and s B (d j , q j ); this is summarized in Figure 6.6.Let n01r (respectively, n01n ) denote the number of training examples forwhich s T (d j , q j ) = 0 and s B (d j , q j ) = 1 and the editorial judgment is Relevant(respectively, Non-relevant).

Then the contribution to the total error in Equation (6.4) from training examples for which s T (d j , q j ) = 0 and s B (d j , q j ) = 1Online edition (c) 2009 Cambridge UP1166 Scoring, term weighting and the vector space modelis[1 − (1 − g)]2 n01r + [0 − (1 − g)]2 n01n .By writing in similar fashion the error contributions from training examplesof the other three combinations of values for s T (d j , q j ) and s B (d j , q j ) (andextending the notation in the obvious manner), the total error correspondingto Equation (6.4) is(n01r + n10n ) g2 + (n10r + n01n )(1 − g)2 + n00r + n11n .(6.5)By differentiating Equation (6.5) with respect to g and setting the result tozero, it follows that the optimal value of g isn10r + n01n.n10r + n10n + n01r + n01n(6.6)?Exercise 6.1When using weighted zone scoring, is it necessary for all zones to use the same Boolean match function?Exercise 6.2In Example 6.1 above with weights g1 = 0.2, g2 = 0.31 and g3 = 0.49, what are all thedistinct score values a document may get?Exercise 6.3Rewrite the algorithm in Figure 6.4 to the case of more than two query terms.Exercise 6.4Write pseudocode for the function WeightedZone for the case of two postings lists inFigure 6.4.Exercise 6.5Apply Equation 6.6 to the sample training set in Figure 6.5 to estimate the best valueof g for this sample.Exercise 6.6For the value of g estimated in Exercise 6.5, compute the weighted zone score for each(query, document) example.

How do these scores relate to the relevance judgmentsin Figure 6.5 (quantized to 0/1)?Exercise 6.7Why does the expression for g in (6.6) not involve training examples in which s T (dt , q t )and s B (dt , q t ) have the same value?Online edition (c) 2009 Cambridge UP6.2 Term frequency and weighting6.2TERM FREQUENCYBAG OF WORDS6.2.1117Term frequency and weightingThus far, scoring has hinged on whether or not a query term is present ina zone within a document. We take the next logical step: a document orzone that mentions a query term more often has more to do with that queryand therefore should receive a higher score.

To motivate this, we recall thenotion of a free text query introduced in Section 1.4: a query in which theterms of the query are typed freeform into the search interface, without anyconnecting search operators (such as Boolean operators). This query style,which is extremely popular on the web, views the query as simply a set ofwords. A plausible scoring mechanism then is to compute a score that is thesum, over the query terms, of the match scores between each query term andthe document.Towards this end, we assign to each term in a document a weight for thatterm, that depends on the number of occurrences of the term in the document.

We would like to compute a score between a query term t and adocument d, based on the weight of t in d. The simplest approach is to assignthe weight to be equal to the number of occurrences of term t in document d.This weighting scheme is referred to as term frequency and is denoted tft,d ,with the subscripts denoting the term and the document in order.For a document d, the set of weights determined by the tf weights above(or indeed any weighting function that maps the number of occurrences of tin d to a positive real value) may be viewed as a quantitative digest of thatdocument.

In this view of a document, known in the literature as the bagof words model, the exact ordering of the terms in a document is ignored butthe number of occurrences of each term is material (in contrast to Booleanretrieval). We only retain information on the number of occurrences of eachterm. Thus, the document “Mary is quicker than John” is, in this view, identical to the document “John is quicker than Mary”.

Nevertheless, it seemsintuitive that two documents with similar bag of words representations aresimilar in content. We will develop this intuition further in Section 6.3.Before doing so we first study the question: are all words in a documentequally important? Clearly not; in Section 2.2.2 (page 27) we looked at theidea of stop words – words that we decide not to index at all, and therefore donot contribute in any way to retrieval and scoring.Inverse document frequencyRaw term frequency as above suffers from a critical problem: all terms areconsidered equally important when it comes to assessing relevancy on aquery.

In fact certain terms have little or no discriminating power in determining relevance. For instance, a collection of documents on the autoindustry is likely to have the term auto in almost every document. To thisOnline edition (c) 2009 Cambridge UP1186 Scoring, term weighting and the vector space modelWordtryinsurancecf1042210440df87603997◮ Figure 6.7 Collection frequency (cf) and document frequency (df) behave differently, as in this example from the Reuters collection.DOCUMENTFREQUENCYINVERSE DOCUMENTFREQUENCYend, we introduce a mechanism for attenuating the effect of terms that occurtoo often in the collection to be meaningful for relevance determination. Animmediate idea is to scale down the term weights of terms with high collection frequency, defined to be the total number of occurrences of a term in thecollection.

The idea would be to reduce the tf weight of a term by a factorthat grows with its collection frequency.Instead, it is more commonplace to use for this purpose the document frequency dft , defined to be the number of documents in the collection that contain a term t. This is because in trying to discriminate between documentsfor the purpose of scoring it is better to use a document-level statistic (suchas the number of documents containing a term) than to use a collection-widestatistic for the term. The reason to prefer df to cf is illustrated in Figure 6.7,where a simple example shows that collection frequency (cf) and documentfrequency (df) can behave rather differently. In particular, the cf values forboth try and insurance are roughly equal, but their df values differ significantly.

Характеристики

Тип файла

PDF-файл

Размер

6,58 Mb

Материал

An introduction to information retrieval. Manning_ Raghavan (2009).pdf

Тип материала

Книга

Предмет

Анализ текстовых данных и информационный поиск

Высшее учебное заведение

МГУ им. Ломоносова

Список файлов книги

an-introduction-to-information-retrieval.-manning_-raghavan-2009.pdf.rar

An introduction to information retrieval. Manning_ Raghavan (2009).pdf

Поделитесь ссылкой:

Ставлю 10/10
Все нравится, очень удобный сайт, помогает в учебе. Кроме этого, можно заработать самому, выставляя готовые учебные материалы на продажу здесь. Рейтинги и отзывы на преподавателей очень помогают сориентироваться в начале нового семестра. Спасибо за такую функцию. Ставлю максимальную оценку.

Лучшая платформа для успешной сдачи сессии
Познакомился со СтудИзбой благодаря своему другу, очень нравится интерфейс, количество доступных файлов, цена, в общем, все прекрасно. Даже сам продаю какие-то свои работы.

Студизба ван лав ❤
Очень офигенный сайт для студентов. Много полезных учебных материалов. Пользуюсь студизбой с октября 2021 года. Серьёзных нареканий нет. Хотелось бы, что бы ввели подписочную модель и сделали материалы дешевле 300 рублей в рамках подписки бесплатными.

Отличный сайт
Лично меня всё устраивает - и покупка, и продажа; и цены, и возможность предпросмотра куска файла, и обилие бесплатных файлов (в подборках по авторам, читай, ВУЗам и факультетам). Есть определённые баги, но всё решаемо, да и администраторы реагируют в течение суток.

Маленький отзыв о большом помощнике!
Студизба спасает в те моменты, когда сроки горят, а работ накопилось достаточно. Довольно удобный сайт с простой навигацией и огромным количеством материалов.

Студ. Изба как крупнейший сборник работ для студентов
Тут дофига бывает всего полезного. Печально, что бывают предметы по которым даже одного бесплатного решения нет, но это скорее вопрос к студентам. В остальном всё здорово.

Спасательный островок
Если уже не успеваешь разобраться или застрял на каком-то задание поможет тебе быстро и недорого решить твою проблему.

Всё и так отлично
Всё очень удобно. Особенно круто, что есть система бонусов и можно выводить остатки денег. Очень много качественных бесплатных файлов.

Отзыв о системе "Студизба"
Отличная платформа для распространения работ, востребованных студентами. Хорошо налаженная и качественная работа сайта, огромная база заданий и аудитория.

Отличный помощник
Отличный сайт с кучей полезных файлов, позволяющий найти много методичек / учебников / отзывов о вузах и преподователях.

Отлично помогает студентам в любой момент для решения трудных и незамедлительных задач
Хотелось бы больше конкретной информации о преподавателях. А так в принципе хороший сайт, всегда им пользуюсь и ни разу не было желания прекратить. Хороший сайт для помощи студентам, удобный и приятный интерфейс. Из недостатков можно выделить только отсутствия небольшого количества файлов.

Спасибо за шикарный сайт
Великолепный сайт на котором студент за не большие деньги может найти помощь с дз, проектами курсовыми, лабораторными, а также узнать отзывы на преподавателей и бесплатно скачать пособия.