An introduction to information retrieval. Manning, Raghavan (2009), page 59
Of course, in other cases, they do not. The answer to this within the language modeling approach is translation language models, as briefly discussed in Section 12.4.

The probability of producing the query given the LM Md of document d using maximum likelihood estimation (MLE) and the unigram assumption is:

(12.9)   $\hat{P}(q|M_d) = \prod_{t \in q} \hat{P}_{\mathrm{mle}}(t|M_d) = \prod_{t \in q} \frac{\mathrm{tf}_{t,d}}{L_d}$

where Md is the language model of document d, tf_{t,d} is the (raw) term frequency of term t in document d, and Ld is the number of tokens in document d.

That is, we just count up how often each word occurred, and divide through by the total number of words in the document d. This is the same method of calculating an MLE as we saw in Section 11.3.2 (page 226), but now using a multinomial over word counts.

The classic problem with using language models is one of estimation (the ˆ symbol on the P's is used above to stress that the model is estimated): terms appear very sparsely in documents. In particular, some words will not have appeared in the document at all, but are possible words for the information need, which the user may have used in the query. If we estimate P̂(t|Md) = 0 for a term missing from a document d, then we get a strict conjunctive semantics: documents will only give a query non-zero probability if all of the query terms appear in the document.
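To make the estimate concrete, here is a minimal Python sketch of Equation (12.9); the tokenization, the sample document, and the function name are illustrative assumptions rather than anything prescribed by the text.

```python
from collections import Counter

def mle_query_likelihood(query_terms, doc_tokens):
    """Equation (12.9): P(q|Md) = product over query terms of tf_{t,d} / L_d,
    estimated from the document alone (no smoothing)."""
    tf = Counter(doc_tokens)      # raw term frequencies tf_{t,d}
    L_d = len(doc_tokens)         # number of tokens in the document
    p = 1.0
    for t in query_terms:
        p *= tf[t] / L_d          # zero whenever t does not occur in d
    return p

doc = "xyzzy reports a profit but revenue is down".split()
print(mle_query_likelihood(["revenue", "down"], doc))  # (1/8) * (1/8) = 0.015625
print(mle_query_likelihood(["revenue", "loss"], doc))  # 0.0: strict conjunctive semantics
```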

Zero probabilities are clearly a problem in other uses of language models, such as when predicting the next word in a speech recognition application, because many words will be sparsely represented in the training data. It may seem rather less clear whether this is problematic in an IR application. This could be thought of as a human-computer interface issue: vector space systems have generally preferred more lenient matching, though recent web search developments have tended more in the direction of doing searches with such conjunctive semantics.

Regardless of the approach here, there is a more general problem of estimation: occurring words are also badly estimated; in particular, the probability of words occurring once in the document is normally overestimated, since their one occurrence was partly by chance. The answer to this (as we saw in Section 11.3.2, page 226) is smoothing. But as people have come to understand the LM approach better, it has become apparent that the role of smoothing in this model is not only to avoid zero probabilities.

The smoothing of terms actually implements major parts of the term weighting component (Exercise 12.8). It is not just that an unsmoothed model has conjunctive semantics; an unsmoothed model works badly because it lacks parts of the term weighting component.

Thus, we need to smooth probabilities in our document language models: to discount non-zero probabilities and to give some probability mass to unseen words. There's a wide space of approaches to smoothing probability distributions to deal with this problem. In Section 11.3.2 (page 226), we already discussed adding a number (1, 1/2, or a small α) to the observed counts and renormalizing to give a probability distribution.⁴
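As a reminder of that simplest approach, the following sketch applies an add-α estimate of the kind discussed in Section 11.3.2 to a document's counts; the vocabulary, the sample document, and the value of α are assumptions made only for illustration.

```python
from collections import Counter

def add_alpha_estimate(term, doc_tokens, vocab, alpha=0.5):
    """Add alpha to every count and renormalize, so the estimates over the
    vocabulary again sum to 1 and unseen terms get a small non-zero mass."""
    tf = Counter(doc_tokens)
    return (tf[term] + alpha) / (len(doc_tokens) + alpha * len(vocab))

doc = "xyzzy reports a profit but revenue is down".split()
vocab = set(doc) | {"loss", "quarter"}            # assume a 10-word vocabulary
print(add_alpha_estimate("revenue", doc, vocab))  # discounted below the MLE of 1/8
print(add_alpha_estimate("loss", doc, vocab))     # small but non-zero
```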

In this section we will mention a couple of other smoothing methods, which involve combining observed counts with a more general reference probability distribution. The general approach is that a non-occurring term should be possible in a query, but its probability should be somewhat close to but no more likely than would be expected by chance from the whole collection. That is, if tf_{t,d} = 0 then

$\hat{P}(t|M_d) \le \mathrm{cf}_t / T$

where cf_t is the raw count of the term in the collection, and T is the raw size (number of tokens) of the entire collection. A simple idea that works well in practice is to use a mixture between a document-specific multinomial distribution and a multinomial distribution estimated from the entire collection:

(12.10)   $\hat{P}(t|d) = \lambda \hat{P}_{\mathrm{mle}}(t|M_d) + (1 - \lambda)\, \hat{P}_{\mathrm{mle}}(t|M_c)$

where 0 < λ < 1 and Mc is a language model built from the entire document collection.

This mixes the probability from the document with the general collection frequency of the word. Such a model is referred to as a linear interpolation language model.⁵ Correctly setting λ is important to the good performance of this model.

An alternative (Bayesian smoothing) is to use a language model built from the whole collection as a prior distribution in a Bayesian updating process (rather than a uniform distribution, as we saw in Section 11.3.2). We then get the following equation:

(12.11)   $\hat{P}(t|d) = \frac{\mathrm{tf}_{t,d} + \alpha \hat{P}(t|M_c)}{L_d + \alpha}$

Both of these smoothing methods have been shown to perform well in IR experiments; we will stick with the linear interpolation smoothing method for the rest of this section. While different in detail, they are both conceptually similar: in both cases the probability estimate for a word present in the document combines a discounted MLE and a fraction of the estimate of its prevalence in the whole collection, while for words not present in a document, the estimate is just a fraction of the estimate of the prevalence of the word in the whole collection.
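The two estimators can be written down directly. This is a minimal sketch of Equations (12.10) and (12.11); the helper names and the default parameter values are assumptions chosen only for illustration.

```python
from collections import Counter

def p_mle(term, tokens):
    """Unsmoothed maximum likelihood estimate: term count / number of tokens."""
    return Counter(tokens)[term] / len(tokens)

def p_linear_interpolation(term, doc_tokens, collection_tokens, lam=0.5):
    """Equation (12.10): mix the document MLE with the collection MLE."""
    return lam * p_mle(term, doc_tokens) + (1 - lam) * p_mle(term, collection_tokens)

def p_bayesian(term, doc_tokens, collection_tokens, alpha=100.0):
    """Equation (12.11): the collection model acts as a prior with
    pseudo-count mass alpha (often called Dirichlet smoothing)."""
    tf = Counter(doc_tokens)[term]
    return (tf + alpha * p_mle(term, collection_tokens)) / (len(doc_tokens) + alpha)
```

With either estimator, a term absent from the document still receives a non-zero probability proportional to its collection frequency, which is exactly the behaviour the inequality above asks for.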

The role of smoothing in LMs for IR is not simply or principally to avoid estimation problems. This was not clear when the models were first proposed, but it is now understood that smoothing is essential to the good properties of the models. The reason for this is explored in Exercise 12.8.

4. In the context of probability theory, (re)normalization refers to summing numbers that cover an event space and dividing them through by their sum, so that the result is a probability distribution which sums to 1. This is distinct from both the concept of term normalization in Chapter 2 and the concept of length normalization in Chapter 6, which is done with an L2 norm.
5. It is also referred to as Jelinek-Mercer smoothing.

The extent of smoothing in these two models is controlled by the λ and α parameters: a small value of λ or a large value of α means more smoothing. This parameter can be tuned to optimize performance using a line search (or, for the linear interpolation model, by other methods, such as the expectation maximization algorithm; see Section 16.5, page 368). The value need not be a constant. One approach is to make the value a function of the query size. This is useful because a small amount of smoothing (a "conjunctive-like" search) is more suitable for short queries, while a lot of smoothing is more suitable for long queries.

To summarize, the retrieval ranking for a query q under the basic LM for IR we have been considering is given by:

(12.12)   $P(d|q) \propto P(d) \prod_{t \in q} \bigl((1 - \lambda) P(t|M_c) + \lambda P(t|M_d)\bigr)$

This equation captures the probability that the document that the user had in mind was in fact d.

Example 12.3: Suppose the document collection contains two documents:

• d1: Xyzzy reports a profit but revenue is down
• d2: Quorus narrows quarter loss but revenue decreases further

The model will be MLE unigram models from the documents and collection, mixed with λ = 1/2.

Suppose the query is revenue down.

Then:

(12.13)   $P(q|d_1) = [(1/8 + 2/16)/2] \times [(1/8 + 1/16)/2] = 1/8 \times 3/32 = 3/256$
          $P(q|d_2) = [(1/8 + 2/16)/2] \times [(0/8 + 1/16)/2] = 1/8 \times 1/32 = 1/256$

So, the ranking is d1 > d2.
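The arithmetic in Example 12.3 can be checked with a few lines of Python; the helper below is a sketch that assumes the λ = 1/2 mixture of document and collection MLE models described above.

```python
from collections import Counter
from fractions import Fraction

def p_mix(term, doc, collection, lam=Fraction(1, 2)):
    """Equation (12.10) with exact fractions: lam*P_mle(t|Md) + (1-lam)*P_mle(t|Mc)."""
    p_doc = Fraction(Counter(doc)[term], len(doc))
    p_coll = Fraction(Counter(collection)[term], len(collection))
    return lam * p_doc + (1 - lam) * p_coll

d1 = "Xyzzy reports a profit but revenue is down".lower().split()
d2 = "Quorus narrows quarter loss but revenue decreases further".lower().split()
collection = d1 + d2
query = ["revenue", "down"]

for name, d in [("d1", d1), ("d2", d2)]:
    score = Fraction(1)
    for t in query:
        score *= p_mix(t, d, collection)
    print(name, score)   # d1 3/256, d2 1/256, so d1 ranks above d2
```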

12.2.3 Ponte and Croft's Experiments

Ponte and Croft (1998) present the first experiments on the language modeling approach to information retrieval. Their basic approach is the model that we have presented until now. However, we have presented an approach where the language model is a mixture of two multinomials, much as in (Miller et al. 1999, Hiemstra 2000) rather than Ponte and Croft's multivariate Bernoulli model. The use of multinomials has been standard in most subsequent work in the LM approach and experimental results in IR, as well as evidence from text classification which we consider in Section 13.3 (page 263), suggests that it is superior. Ponte and Croft argued strongly for the effectiveness of the term weights that come from the language modeling approach over traditional tf-idf weights. We present a subset of their results in Figure 12.4 where they compare tf-idf to language modeling by evaluating TREC topics 202–250 over TREC disks 2 and 3.

Figure 12.4: Results of a comparison of tf-idf with language modeling (LM) term weighting by Ponte and Croft (1998). The version of tf-idf from the INQUERY IR system includes length normalization of tf. The table gives an evaluation according to 11-point average precision with significance marked with a * according to a Wilcoxon signed rank test. The language modeling approach always does better in these experiments, but note that where the approach shows significant gains is at higher levels of recall.

  Rec.   Precision (tf-idf)   Precision (LM)   %chg
  0.0    0.7439               0.7590             +2.0
  0.1    0.4521               0.4910             +8.6
  0.2    0.3514               0.4045            +15.1
  0.3    0.2761               0.3342            +21.0
  0.4    0.2093               0.2572            +22.9
  0.5    0.1558               0.2061            +32.3
  0.6    0.1024               0.1405            +37.1
  0.7    0.0451               0.0760            +68.7
  0.8    0.0160               0.0432           +169.6
  0.9    0.0033               0.0063            +89.3
  1.0    0.0028               0.0050            +76.9
  Ave    0.1868               0.2233            +19.55

The queries are sentence-length natural language queries. The language modeling approach yields significantly better results than their baseline tf-idf based term weighting approach. And indeed the gains shown here have been extended in subsequent work.

Exercise 12.6  [⋆]
Consider making a language model from the following training text:

  the martian has landed on the latin pop sensation ricky martin

a. Under a MLE-estimated unigram probability model, what are P(the) and P(martian)?
b. Under a MLE-estimated bigram model, what are P(sensation|pop) and P(pop|the)?

Exercise 12.7  [⋆⋆]
Suppose we have a collection that consists of the 4 documents given in the below table.

  docID   Document text
  1       click go the shears boys click click click
  2       click click
  3       metal here
  4       metal shears click here

Build a query likelihood language model for this document collection.

Assume a mixture model between the documents and the collection, with both weighted at 0.5. Maximum likelihood estimation (mle) is used to estimate both as unigram models. Work out the model probabilities of the queries click, shears, and hence click shears for each document, and use those probabilities to rank the documents returned by each query. Fill in these probabilities in the below table:

  Query          Doc 1   Doc 2   Doc 3   Doc 4
  click
  shears
  click shears

What is the final ranking of the documents for the query click shears?

Exercise 12.8  [⋆⋆]
Using the calculations in Exercise 12.7 as inspiration or as examples where appropriate, write one sentence each describing the treatment that the model in Equation (12.10) gives to each of the following quantities.
