An Introduction to Information Retrieval — Manning, Raghavan, Schütze (2009)
Compute the Euclidean normalized document vectors for each of the documents, where each vector has four components, one for each of the four terms.

Exercise 6.16
Verify that the sum of the squares of the components of each of the document vectors in Exercise 6.15 is 1 (to within rounding error). Why is this the case?

Exercise 6.17
With term weights as computed in Exercise 6.15, rank the three documents by computed score for the query car insurance, for each of the following cases of term weighting in the query:

1. The weight of a term is 1 if present in the query, 0 otherwise.
2. Euclidean normalized idf.
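As an illustration of the computation in Exercises 6.15 and 6.16, here is a minimal Python sketch of Euclidean normalization; the weights below are hypothetical placeholders, not the tf-idf values the exercises refer to.

    import math

    def euclidean_normalize(weights):
        # Divide every component by the vector's Euclidean (L2) length,
        # so that the resulting vector has length 1.
        length = math.sqrt(sum(w * w for w in weights))
        return [w / length for w in weights] if length > 0 else list(weights)

    # Hypothetical tf-idf weights of one document over four terms.
    doc = [1.3, 0.0, 2.0, 0.5]
    unit_doc = euclidean_normalize(doc)
    print(unit_doc)
    # Exercise 6.16: the squares of the components sum to 1, because each
    # component was divided by the square root of that very sum of squares.
    print(sum(w * w for w in unit_doc))  # ~1.0, up to rounding error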
6.4 Variant tf-idf functions

For assigning a weight for each term in each document, a number of alternatives to tf and tf-idf have been considered. We discuss some of the principal ones here; a more complete development is deferred to Chapter 11. We will summarize these alternatives in Section 6.4.3 (page 128).

6.4.1 Sublinear tf scaling

It seems unlikely that twenty occurrences of a term in a document truly carry twenty times the significance of a single occurrence. Accordingly, there has been considerable research into variants of term frequency that go beyond counting the number of occurrences of a term. A common modification is to use instead the logarithm of the term frequency, which assigns a weight given by

(6.13)
\[
\mathrm{wf}_{t,d} =
\begin{cases}
1 + \log \mathrm{tf}_{t,d} & \text{if } \mathrm{tf}_{t,d} > 0, \\
0 & \text{otherwise.}
\end{cases}
\]

In this form, we may replace tf by some other function wf as in (6.13), to obtain:

(6.14)
\[
\text{wf-idf}_{t,d} = \mathrm{wf}_{t,d} \times \mathrm{idf}_{t}.
\]

Equation (6.9) can then be modified by replacing tf-idf by wf-idf as defined in (6.14).
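A minimal Python sketch of sublinear tf scaling as in Equations (6.13) and (6.14); the function names and example counts are ours, and the base of the logarithm (natural log here) is not fixed by the text.

    import math

    def wf(tf):
        # Equation (6.13): 1 + log(tf) if tf > 0, and 0 otherwise.
        return 1.0 + math.log(tf) if tf > 0 else 0.0

    def wf_idf(tf, df, N):
        # Equation (6.14): wf-idf = wf(tf) * idf, with idf = log(N / df).
        return wf(tf) * math.log(N / df)

    # Twenty occurrences now carry roughly 4x the weight of one, not 20x.
    print(wf(1), wf(20))                       # 1.0 vs. ~4.0
    print(wf_idf(tf=20, df=100, N=1_000_000))  # sublinear tf times idf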
6.4.2 Maximum tf normalization

One well-studied technique is to normalize the tf weights of all terms occurring in a document by the maximum tf in that document. For each document d, let tf_max(d) = max_{τ∈d} tf_{τ,d}, where τ ranges over all terms in d. Then, we compute a normalized term frequency for each term t in document d by

(6.15)
\[
\mathrm{ntf}_{t,d} = a + (1 - a)\,\frac{\mathrm{tf}_{t,d}}{\mathrm{tf}_{\max}(d)},
\]

where a is a value between 0 and 1 and is generally set to 0.4, although some early work used the value 0.5. The term a in (6.15) is a smoothing term whose role is to damp the contribution of the second term, which may be viewed as a scaling down of tf by the largest tf value in d.
We will encounter smoothing further in Chapter 13 when discussing classification; the basic idea is to avoid a large swing in ntf_{t,d} from modest changes in tf_{t,d} (say from 1 to 2). The main idea of maximum tf normalization is to mitigate the following anomaly: we observe higher term frequencies in longer documents, merely because longer documents tend to repeat the same words over and over again. To appreciate this, consider the following extreme example: suppose we were to take a document d and create a new document d′ by simply appending a copy of d to itself. While d′ should be no more relevant to any query than d is, the use of (6.9) would assign it twice as high a score as d. Replacing tf-idf_{t,d} in (6.9) by ntf-idf_{t,d} eliminates the anomaly in this example.
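A minimal Python sketch of maximum tf normalization, Equation (6.15); the smoothing value a = 0.4 follows the text, while the toy counts and dictionary representation are ours.

    def ntf(tf_counts, a=0.4):
        # Equation (6.15): ntf = a + (1 - a) * tf / tf_max(d).
        tf_max = max(tf_counts.values())
        return {t: a + (1 - a) * tf / tf_max for t, tf in tf_counts.items()}

    # Appending a copy of d to itself doubles every tf, but leaves ntf
    # unchanged, because tf_max(d) doubles as well.
    d  = {"car": 3, "insurance": 1, "best": 2}
    d2 = {t: 2 * tf for t, tf in d.items()}
    print(ntf(d))
    print(ntf(d) == ntf(d2))  # True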
Maximum tf normalization does suffer from the following issues:

1. The method is unstable in the following sense: a change in the stop word list can dramatically alter term weightings (and therefore ranking). Thus, it is hard to tune.

2. A document may contain an outlier term with an unusually large number of occurrences of that term, not representative of the content of that document.

3. More generally, a document in which the most frequent term appears roughly as often as many other terms should be treated differently from one with a more skewed distribution.

Term frequency
  n (natural)        tf_{t,d}
  l (logarithm)      1 + log(tf_{t,d})
  a (augmented)      0.5 + (0.5 × tf_{t,d}) / max_t(tf_{t,d})
  b (boolean)        1 if tf_{t,d} > 0, 0 otherwise
  L (log ave)        (1 + log(tf_{t,d})) / (1 + log(ave_{t∈d}(tf_{t,d})))

Document frequency
  n (no)             1
  t (idf)            log(N / df_t)
  p (prob idf)       max{0, log((N − df_t) / df_t)}

Normalization
  n (none)           1
  c (cosine)         1 / sqrt(w_1^2 + w_2^2 + ... + w_M^2)
  u (pivoted unique) 1/u (Section 6.4.4)
  b (byte size)      1/CharLength^α, α < 1

◮ Figure 6.15 SMART notation for tf-idf variants. Here CharLength is the number of characters in the document.

6.4.3 Document and query weighting schemes

Equation (6.12) is fundamental to information retrieval systems that use any form of vector space scoring. Variations from one vector space scoring method to another hinge on the specific choices of weights in the vectors V(d) and V(q). Figure 6.15 lists some of the principal weighting schemes in use for each of V(d) and V(q), together with a mnemonic for representing a specific combination of weights; this system of mnemonics is sometimes called SMART notation, following the authors of an early text retrieval system.
The mnemonic for representing a combination of weights takes the form ddd.qqq, where the first triplet gives the term weighting of the document vector, while the second triplet gives the weighting in the query vector. The first letter in each triplet specifies the term frequency component of the weighting, the second the document frequency component, and the third the form of normalization used. It is quite common to apply different normalization functions to V(d) and V(q). For example, a very standard weighting scheme is lnc.ltc, where the document vector has log-weighted term frequency, no idf (for both effectiveness and efficiency reasons), and cosine normalization, while the query vector uses log-weighted term frequency, idf weighting, and cosine normalization.
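To make the ddd.qqq mnemonics concrete, here is a minimal Python sketch of lnc.ltc scoring over a toy in-memory collection; the data structures and helper names are ours, not part of the SMART system, and natural logarithms are used throughout.

    import math
    from collections import Counter

    def cosine_normalize(weights):
        # The 'c' component: divide all weights by the vector's Euclidean length.
        length = math.sqrt(sum(w * w for w in weights.values()))
        return {t: w / length for t, w in weights.items()} if length else weights

    def lnc(doc_tokens):
        # Document weighting 'lnc': log tf, no idf, cosine normalization.
        tf = Counter(doc_tokens)
        return cosine_normalize({t: 1 + math.log(c) for t, c in tf.items()})

    def ltc(query_tokens, df, N):
        # Query weighting 'ltc': log tf, idf, cosine normalization.
        tf = Counter(query_tokens)
        w = {t: (1 + math.log(c)) * math.log(N / df[t])
             for t, c in tf.items() if t in df}
        return cosine_normalize(w)

    def score(doc_vec, query_vec):
        # Dot product of the two normalized vectors (cosine similarity).
        return sum(w * doc_vec.get(t, 0.0) for t, w in query_vec.items())

    # Toy collection.
    docs = ["car insurance best rates".split(),
            "auto insurance quotes".split(),
            "best car deals".split()]
    N = len(docs)
    df = Counter(t for d in docs for t in set(d))

    doc_vecs = [lnc(d) for d in docs]
    q = ltc("car insurance".split(), df, N)
    for i, dv in enumerate(doc_vecs):
        print(i, round(score(dv, q), 3))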
6.4.4 Pivoted normalized document length

In Section 6.3.1 we normalized each document vector by the Euclidean length of the vector, so that all document vectors turned into unit vectors. In doing so, we eliminated all information on the length of the original document; this masks some subtleties about longer documents. First, longer documents will – as a result of containing more terms – have higher tf values. Second, longer documents contain more distinct terms. These factors can conspire to raise the scores of longer documents, which (at least for some information needs) is unnatural. Longer documents can broadly be lumped into two categories: (1) verbose documents that essentially repeat the same content – in these, the length of the document does not alter the relative weights of different terms; (2) documents covering multiple different topics, in which the search terms probably match small segments of the document but not all of it – in this case, the relative weights of terms are quite different from a single short document that matches the query terms.
Compensating for this phenomenon is a form of document length normalization that is independent of term and document frequencies. To this end, we introduce a form of normalizing the vector representations of documents in the collection, so that the resulting “normalized” documents are not necessarily of unit length. Then, when we compute the dot product score between a (unit) query vector and such a normalized document, the score is skewed to account for the effect of document length on relevance.
This form of compensation for document length is known as pivoted document length normalization.

Consider a document collection together with an ensemble of queries for that collection. Suppose that we were given, for each query q and for each document d, a Boolean judgment of whether or not d is relevant to the query q; in Chapter 8 we will see how to procure such a set of relevance judgments for a query ensemble and a document collection. Given this set of relevance judgments, we may compute a probability of relevance as a function of document length, averaged over all queries in the ensemble.
The resulting plot may look like the curve drawn in thick lines in Figure 6.16. To compute this curve, we bucket documents by length and compute the fraction of relevant documents in each bucket, then plot this fraction against the median document length of each bucket. (Thus even though the “curve” in Figure 6.16 appears to be continuous, it is in fact a histogram of discrete buckets of document length.)

On the other hand, the curve in thin lines shows what might happen with the same documents and query ensemble if we were to use relevance as prescribed by the cosine normalization of Equation (6.12) – thus, cosine normalization has a tendency to distort the computed relevance vis-à-vis the true relevance, at the expense of longer documents. The thin and thick curves cross over at a point p corresponding to document length ℓ_p, which we refer to as the pivot length; dashed lines mark this point on the x- and y-axes.

◮ Figure 6.16 Pivoted document length normalization. [Plot of relevance against document length; the thick and thin curves cross at the pivot p, at document length ℓ_p.]
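A minimal Python sketch of how the thick curve could be computed from pooled relevance judgments; the bucketing scheme and the toy data are hypothetical.

    import statistics

    def relevance_vs_length(judgments, num_buckets=10):
        # judgments: (doc_length, is_relevant) pairs pooled over all queries.
        # Returns one (median_length, fraction_relevant) point per bucket of
        # documents, ordered by document length.
        ordered = sorted(judgments)
        size = max(1, len(ordered) // num_buckets)
        points = []
        for i in range(0, len(ordered), size):
            bucket = ordered[i:i + size]
            median_len = statistics.median(length for length, _ in bucket)
            frac_rel = sum(rel for _, rel in bucket) / len(bucket)
            points.append((median_len, frac_rel))
        return points

    # Hypothetical pooled judgments: (document length in tokens, 1 if relevant).
    judgments = [(120, 0), (150, 1), (300, 0), (450, 1), (800, 1),
                 (900, 0), (1200, 1), (2500, 1), (3000, 0), (4000, 1)]
    print(relevance_vs_length(judgments, num_buckets=5))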
The idea of pivoted document length normalization would then be to “rotate” the cosine normalization curve counter-clockwise about p so that it more closely matches the thick line representing the relevance vs. document length curve. As mentioned at the beginning of this section, we do so by using in Equation (6.12) a normalization factor for each document vector V(d) that is not the Euclidean length of that vector, but instead one that is larger than the Euclidean length for documents of length less than ℓ_p, and smaller for longer documents.

To this end, we first note that the normalizing term for V(d) in the denominator of Equation (6.12) is its Euclidean length, denoted |V(d)|. In the simplest implementation of pivoted document length normalization, we use a normalization factor in the denominator that is linear in |V(d)|, but one of slope < 1 as in Figure 6.17.
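As a hedged illustration of such a factor (the precise parameterization, developed around Figure 6.17 in the remainder of the section, is not reproduced here), one linear form with the stated properties is a·|V(d)| + (1 − a)·piv with slope a < 1: it coincides with the Euclidean length at the pivot value piv, exceeds it for shorter documents, and falls below it for longer ones. A minimal Python sketch under that assumption:

    import math

    def pivoted_norm_factor(doc_vec, piv, a=0.75):
        # A linear factor with slope a < 1 that equals the Euclidean length
        # exactly at the pivot value piv; larger than |V(d)| below the pivot,
        # smaller above it. The form and a = 0.75 are illustrative only.
        euclidean = math.sqrt(sum(w * w for w in doc_vec.values()))
        return a * euclidean + (1 - a) * piv

    def pivoted_score(query_vec, doc_vec, piv, a=0.75):
        # Dot product of the unnormalized document vector with the (unit) query
        # vector, divided by the pivoted factor instead of the Euclidean length.
        dot = sum(w * doc_vec.get(t, 0.0) for t, w in query_vec.items())
        return dot / pivoted_norm_factor(doc_vec, piv, a)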