An introduction to information retrieval. Manning_ Raghavan (2009) (811397), page 58


Online edition (c) 2009 Cambridge UP
12 Language models for information retrieval

Footnote 2: In the IR context that we are leading up to, taking the stop probability to be fixed across models seems reasonable. This is because we are generating queries, and the length distribution of queries is fixed and independent of the document from which we are generating the language model.

Each gives a probability estimate to a sequence of terms, as already illustrated in Example 12.1. The language model that gives the higher probability to the sequence of terms is more likely to have generated the term sequence. This time, we will omit STOP probabilities from our calculations.

For the sequence shown, we get:

(12.3)
s        M1       M2
frog     0.01     0.0002
said     0.03     0.03
that     0.04     0.04
toad     0.01     0.0001
likes    0.02     0.04
that     0.04     0.04
dog      0.005    0.01

$P(s \mid M_1) = 0.00000000000048$
$P(s \mid M_2) = 0.000000000000000384$

and we see that $P(s \mid M_1) > P(s \mid M_2)$. We present the formulas here in terms of products of probabilities, but, as is common in probabilistic applications, in practice it is usually best to work with sums of log probabilities (cf. page 258).

12.1.2 Types of language models

How do we build probabilities over sequences of terms? We can always use the chain rule from Equation (11.1) to decompose the probability of a sequence of events into the probability of each successive event conditioned on earlier events:

$$P(t_1 t_2 t_3 t_4) = P(t_1) P(t_2 \mid t_1) P(t_3 \mid t_1 t_2) P(t_4 \mid t_1 t_2 t_3) \qquad (12.4)$$

The simplest form of language model simply throws away all conditioning context, and estimates each term independently. Such a model is called a unigram language model:

$$P_{\mathrm{uni}}(t_1 t_2 t_3 t_4) = P(t_1) P(t_2) P(t_3) P(t_4) \qquad (12.5)$$

There are many more complex kinds of language models, such as bigram language models, which condition on the previous term,

$$P_{\mathrm{bi}}(t_1 t_2 t_3 t_4) = P(t_1) P(t_2 \mid t_1) P(t_3 \mid t_2) P(t_4 \mid t_3) \qquad (12.6)$$

and even more complex grammar-based language models such as probabilistic context-free grammars.
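As a minimal sketch of the unigram factorization and of the advice to work with sums of log probabilities, the following Python snippet recomputes the two sequence probabilities of Example 12.2 from the table values. The function names `unigram_prob` and `unigram_log_prob` are illustrative choices, not names from the book, and the dictionaries hold only the partial models shown in the table.

```python
import math

# Term probabilities copied from the table for Example 12.2 (partial models).
M1 = {"frog": 0.01, "said": 0.03, "that": 0.04, "toad": 0.01,
      "likes": 0.02, "dog": 0.005}
M2 = {"frog": 0.0002, "said": 0.03, "that": 0.04, "toad": 0.0001,
      "likes": 0.04, "dog": 0.01}

s = ["frog", "said", "that", "toad", "likes", "that", "dog"]


def unigram_prob(terms, model):
    """Product of independent term probabilities (the unigram factorization)."""
    p = 1.0
    for t in terms:
        p *= model[t]
    return p


def unigram_log_prob(terms, model):
    """Sum of natural-log term probabilities; avoids underflow on long sequences."""
    return sum(math.log(model[t]) for t in terms)


p1, p2 = unigram_prob(s, M1), unigram_prob(s, M2)
lp1, lp2 = unigram_log_prob(s, M1), unigram_log_prob(s, M2)

print(p1, p2)     # raw products, already very small for seven terms
print(lp1 > lp2)  # the comparison between models is unchanged in log space
```

Because the logarithm is monotonic, comparing `lp1` and `lp2` gives the same ordering of the two models as comparing the raw products.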

Such richer models are vital for tasks like speech recognition, spelling correction, and machine translation, where you need the probability of a term conditioned on surrounding context. However, most language-modeling work in IR has used unigram language models. IR is not the place where you most immediately need complex language models, since IR does not directly depend on the structure of sentences to the extent that other tasks like speech recognition do. Unigram models are often sufficient to judge the topic of a text. Moreover, as we shall see, IR language models are frequently estimated from a single document and so it is questionable whether there is enough training data to do more. Losses from data sparseness (see the discussion on page 260) tend to outweigh any gains from richer models. This is an example of the bias-variance tradeoff (cf. Section 14.6, page 308): With limited training data, a more constrained model tends to perform better. In addition, unigram models are more efficient to estimate and apply than higher-order models. Nevertheless, the importance of phrase and proximity queries in IR in general suggests that future work should make use of more sophisticated language models, and some has begun to (see Section 12.5, page 252). Indeed, making this move parallels the model of van Rijsbergen in Chapter 11 (page 231).

12.1.3 Multinomial distributions over words

Under the unigram language model the order of words is irrelevant, and so such models are often called "bag of words" models, as discussed in Chapter 6 (page 117). Even though there is no conditioning on preceding context, this model nevertheless still gives the probability of a particular ordering of terms. However, any other ordering of this bag of terms will have the same probability.

So, really, we have a multinomial distribution over words. So long as we stick to unigram models, the language model name and motivation could be viewed as historical rather than necessary. We could instead just refer to the model as a multinomial model. From this perspective, the equations presented above do not present the multinomial probability of a bag of words, since they do not sum over all possible orderings of those words, as is done by the multinomial coefficient (the first term on the right-hand side) in the standard presentation of a multinomial model:

$$P(d) = \frac{L_d!}{\mathrm{tf}_{t_1,d}!\,\mathrm{tf}_{t_2,d}! \cdots \mathrm{tf}_{t_M,d}!}\, P(t_1)^{\mathrm{tf}_{t_1,d}} P(t_2)^{\mathrm{tf}_{t_2,d}} \cdots P(t_M)^{\mathrm{tf}_{t_M,d}} \qquad (12.7)$$

Here, $L_d = \sum_{1 \le i \le M} \mathrm{tf}_{t_i,d}$ is the length of document $d$, $M$ is the size of the term vocabulary, and the products are now over the terms in the vocabulary, not the positions in the document.
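To make the role of the multinomial coefficient concrete, here is a small Python sketch that computes Equation (12.7) directly and checks it against a brute-force sum over all distinct orderings of a bag of words. The toy `model` and `doc` are hypothetical values chosen only for illustration, and the function names are not from the book.

```python
import math
from collections import Counter
from itertools import permutations


def ordered_prob(terms, model):
    """Probability of one particular ordering under a unigram model."""
    p = 1.0
    for t in terms:
        p *= model[t]
    return p


def multinomial_coefficient(counts):
    """L_d! / (tf_1! tf_2! ... tf_M!): the number of distinct orderings of the bag."""
    coeff = math.factorial(sum(counts.values()))
    for tf in counts.values():
        coeff //= math.factorial(tf)
    return coeff


def multinomial_prob(terms, model):
    """Equation (12.7): coefficient times the product over the terms that occur."""
    counts = Counter(terms)
    p = float(multinomial_coefficient(counts))
    for t, tf in counts.items():
        p *= model[t] ** tf
    return p


# Hypothetical toy model and document, for illustration only.
model = {"a": 0.5, "b": 0.3, "c": 0.2}
doc = ["a", "b", "a", "c"]

# Enumerate every distinct ordering of the bag and add up their probabilities.
distinct_orderings = set(permutations(doc))
brute_force = sum(ordered_prob(o, model) for o in distinct_orderings)

print(len(distinct_orderings), multinomial_coefficient(Counter(doc)))
print(math.isclose(brute_force, multinomial_prob(doc, model)))
```

Every ordering of the bag has the same probability, so the brute-force sum is just that probability multiplied by the number of distinct orderings, which is exactly what the coefficient counts.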

However, just as with STOP probabilities, in practice we can also leave out the multinomial coefficient in our calculations, since, for a particular bag of words, it will be a constant, and so it has no effect on the likelihood ratio of two different models generating a particular bag of words. Multinomial distributions also appear in Section 13.2 (page 258).

The fundamental problem in designing language models is that we do not know what exactly we should use as the model $M_d$. However, we do generally have a sample of text that is representative of that model. This problem makes a lot of sense in the original, primary uses of language models. For example, in speech recognition, we have a training sample of (spoken) text. But we have to expect that, in the future, users will use different words and in different sequences, which we have never observed before, and so the model has to generalize beyond the observed data to allow unknown words and sequences. This interpretation is not so clear in the IR case, where a document is finite and usually fixed. The strategy we adopt in IR is as follows. We pretend that the document $d$ is only a representative sample of text drawn from a model distribution, treating it like a fine-grained topic. We then estimate a language model from this sample, and use that model to calculate the probability of observing any word sequence, and, finally, we rank documents according to their probability of generating the query.

Exercise 12.1 [⋆]
Including stop probabilities in the calculation, what will the sum of the probability estimates of all strings in the language of length 1 be? Assume that you generate a word and then decide whether to stop or not (i.e., the null string is not part of the language).

Exercise 12.2 [⋆]
If the stop probability is omitted from calculations, what will the sum of the scores assigned to strings in the language of length 1 be?

Exercise 12.3 [⋆]
What is the likelihood ratio of the document according to M1 and M2 in Example 12.2?

Exercise 12.4 [⋆]
No explicit STOP probability appeared in Example 12.2. Assuming that the STOP probability of each model is 0.1, does this change the likelihood ratio of a document according to the two models?

Exercise 12.5 [⋆⋆]
How might a language model be used in a spelling correction system? In particular, consider the case of context-sensitive spelling correction, and correcting incorrect usages of words, such as "their" in "Are you their?" (See Section 3.5 (page 65) for pointers to some literature on this topic.)

12.2 The query likelihood model

12.2.1 Using query likelihood language models in IR

Language modeling is a quite general formal approach to IR, with many variant realizations.

The original and basic method for using language models in IR is the query likelihood model. In it, we construct from each document $d$ in the collection a language model $M_d$. Our goal is to rank documents by $P(d \mid q)$, where the probability of a document is interpreted as the likelihood that it is relevant to the query. Using Bayes rule (as introduced in Section 11.1, page 220), we have:

$$P(d \mid q) = P(q \mid d) P(d) / P(q)$$

$P(q)$ is the same for all documents, and so can be ignored. The prior probability of a document $P(d)$ is often treated as uniform across all $d$ and so it can also be ignored, but we could implement a genuine prior which could include criteria like authority, length, genre, newness, and number of previous people who have read the document. But, given these simplifications, we return results ranked by simply $P(q \mid d)$, the probability of the query $q$ under the language model derived from $d$. The Language Modeling approach thus attempts to model the query generation process: Documents are ranked by the probability that a query would be observed as a random sample from the respective document model.

The most common way to do this is using the multinomial unigram language model, which is equivalent to a multinomial Naive Bayes model (page 263), where the documents are the classes, each treated in the estimation as a separate "language".

Under this model, we have that:

$$P(q \mid M_d) = K_q \prod_{t \in V} P(t \mid M_d)^{\mathrm{tf}_{t,d}} \qquad (12.8)$$

where, again, $K_q = L_d! / (\mathrm{tf}_{t_1,d}!\,\mathrm{tf}_{t_2,d}! \cdots \mathrm{tf}_{t_M,d}!)$ is the multinomial coefficient for the query $q$, which we will henceforth ignore, since it is a constant for a particular query.

For retrieval based on a language model (henceforth LM), we treat the generation of queries as a random process.
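The ranking by $P(q \mid M_d)$ in Equation (12.8) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each document model is taken to be the unsmoothed relative frequency of terms in that document, the exponent in (12.8) is read as the number of times a term occurs in the query, the constant $K_q$ is dropped, and the toy collection and all function names are hypothetical.

```python
import math
from collections import Counter


def document_model(doc_terms):
    """Assumed estimate of P(t | M_d): relative term frequency in the document."""
    counts = Counter(doc_terms)
    total = len(doc_terms)
    return {t: c / total for t, c in counts.items()}


def query_log_likelihood(query_terms, model):
    """log P(q | M_d) without the constant K_q; -inf if a query term is unseen."""
    score = 0.0
    for t in query_terms:  # repeated query terms contribute once per occurrence
        p = model.get(t, 0.0)
        if p == 0.0:
            return float("-inf")
        score += math.log(p)
    return score


def rank(query_terms, docs):
    """Return document ids sorted by decreasing log P(q | M_d), plus the scores."""
    scores = {doc_id: query_log_likelihood(query_terms, document_model(terms))
              for doc_id, terms in docs.items()}
    return sorted(scores, key=scores.get, reverse=True), scores


# Hypothetical toy collection, for illustration only.
docs = {
    "d1": "frog said that toad likes frog".split(),
    "d2": "dog said that cat likes dog".split(),
    "d3": "toad likes that frog".split(),
}

order, scores = rank(["frog", "likes"], docs)
print(order)
```

With unsmoothed estimates, a document that lacks any query term scores negative infinity, as `d2` does here for the term `frog`.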

The approach is to:
1. Infer a LM for each document.
2. Estimate $P(q \mid M_{d_i})$, the probability of generating the query according to each of these document models.
3. Rank the documents according to these probabilities.

The intuition of the basic model is that the user has a prototype document in mind, and generates a query based on words that appear in this document. Often, users have a reasonable idea of terms that are likely to occur in documents of interest and they will choose query terms that distinguish these documents from others in the collection. Collection statistics are an integral part of the language model, rather than being used heuristically as in many other approaches.

12.2.2 Estimating the query generation probability

In this section we describe how to estimate $P(q \mid M_d)$. The probability of producing the query given the LM $M_d$ of document $d$ using maximum likelihood
