An Introduction to Information Retrieval. Manning, Raghavan (2009), page 60


Include whether it is present in the model or not and whether the effect is raw or scaled.

a. Term frequency in a document
b. Collection frequency of a term
c. Document frequency of a term
d. Length normalization of a term

Exercise 12.9 [⋆⋆]
In the mixture model approach to the query likelihood model (Equation (12.12)), the probability estimate of a term is based on the term frequency of a word in a document, and the collection frequency of the word. Doing this certainly guarantees that each term of a query (in the vocabulary) has a non-zero chance of being generated by each document.

But it has a more subtle but important effect of implementing a form of term weighting, related to what we saw in Chapter 6. Explain how this works. In particular, include in your answer a concrete numeric example showing this term weighting at work.

Online edition (c) 2009 Cambridge UP

12.3 Language modeling versus other approaches in IR

The language modeling approach provides a novel way of looking at the problem of text retrieval, which links it with a lot of recent work in speech and language processing. As Ponte and Croft (1998) emphasize, the language modeling approach to IR provides a different approach to scoring matches between queries and documents, and the hope is that the probabilistic language modeling foundation improves the weights that are used, and hence the performance of the model.

The major issue is estimation of the document model, such as choices of how to smooth it effectively. The model has achieved very good retrieval results. Compared to other probabilistic approaches, such as the BIM from Chapter 11, the main difference initially appears to be that the LM approach does away with explicitly modeling relevance (whereas this is the central variable evaluated in the BIM approach). But this may not be the correct way to think about things, as some of the papers in Section 12.5 further discuss. The LM approach assumes that documents and expressions of information needs are objects of the same type, and assesses their match by importing the tools and methods of language modeling from speech and natural language processing. The resulting model is mathematically precise, conceptually simple, computationally tractable, and intuitively appealing.

This seems similar to the situation with XML retrieval (Chapter 10): there the approaches that assume queries and documents are objects of the same type are also among the most successful.

On the other hand, like all IR models, you can also raise objections to the model. The assumption of equivalence between document and information need representation is unrealistic. Current LM approaches use very simple models of language, usually unigram models. Without an explicit notion of relevance, relevance feedback is difficult to integrate into the model, as are user preferences.

It also seems necessary to move beyond a unigram model to accommodate notions of phrase or passage matching or Boolean retrieval operators. Subsequent work in the LM approach has looked at addressing some of these concerns, including putting relevance back into the model and allowing a language mismatch between the query language and the document language.

The model has significant relations to traditional tf-idf models.

Term frequency is directly represented in tf-idf models, and much recent work has recognized the importance of document length normalization. The effect of doing a mixture of document generation probability with collection generation probability is a little like idf: terms rare in the general collection but common in some documents will have a greater influence on the ranking of documents. In most concrete realizations, the models share treating terms as if they were independent. On the other hand, the intuitions are probabilistic rather than geometric, the mathematical models are more principled rather than heuristic, and the details of how statistics like term frequency and document length are used differ. If you are concerned mainly with performance numbers, recent work has shown the LM approach to be very effective in retrieval experiments, beating tf-idf and BM25 weights. Nevertheless, there is perhaps still insufficient evidence that its performance so greatly exceeds that of a well-tuned traditional vector space retrieval system as to justify changing an existing implementation.

Figure 12.5 Three ways of developing the language modeling approach: (a) query likelihood, (b) document likelihood, and (c) model comparison. [Diagram labels: Query, Query model P(t|Query), Document, Doc. model P(t|Document).]

12.4 Extended language modeling approaches

In this section we briefly mention some of the work that extends the basic language modeling approach.

There are other ways to think of using the language modeling idea in IR settings, and many of them have been tried in subsequent work.
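As a baseline for the extensions that follow, the basic query likelihood score with a linear mixture of document and collection estimates can be sketched as below. This is only an illustration under stated assumptions: the function names, the mixing weight `lam = 0.5`, and all counts are made up for the example. It also shows the idf-like effect mentioned in Section 12.3: matching a term that is rare in the collection raises a document's score far more than matching a common one.

```python
import math

def mixture_prob(tf, doc_len, cf, coll_len, lam=0.5):
    """lam * P_mle(t|M_d) + (1 - lam) * P_mle(t|M_c) for a single term."""
    return lam * tf / doc_len + (1 - lam) * cf / coll_len

def query_log_likelihood(query, doc_tf, doc_len, coll_cf, coll_len, lam=0.5):
    """Log of the product of mixture probabilities over the query terms."""
    return sum(
        math.log(mixture_prob(doc_tf.get(t, 0), doc_len, coll_cf[t], coll_len, lam))
        for t in query
    )

def match_boost(tf, doc_len, cf, coll_len, lam=0.5):
    """Factor by which a term's probability grows when the document contains it."""
    present = mixture_prob(tf, doc_len, cf, coll_len, lam)
    absent = mixture_prob(0, doc_len, cf, coll_len, lam)
    return present / absent

# Hypothetical statistics: a 10,000-token collection and 10-token documents.
COLL_LEN = 10_000
COLL_CF = {"the": 1000, "retrieval": 10}  # assumed collection frequencies
DOC_LEN = 10

# One occurrence of a common term versus one occurrence of a rare term.
boost_common = match_boost(1, DOC_LEN, COLL_CF["the"], COLL_LEN)
boost_rare = match_boost(1, DOC_LEN, COLL_CF["retrieval"], COLL_LEN)

# Two documents, each matching exactly one of the two query terms.
query = ["the", "retrieval"]
score_has_rare = query_log_likelihood(query, {"retrieval": 1}, DOC_LEN, COLL_CF, COLL_LEN)
score_has_common = query_log_likelihood(query, {"the": 1}, DOC_LEN, COLL_CF, COLL_LEN)

print(boost_common, boost_rare, score_has_rare > score_has_common)
```

With these made-up numbers, one occurrence of "the" roughly doubles the term's probability, while one occurrence of "retrieval" multiplies it by roughly 101, so the document matching the rare term is ranked higher.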

Rather than looking at the probability of a document language model $M_d$ generating the query, you can look at the probability of a query language model $M_q$ generating the document. The main reason that doing things in this direction and creating a document likelihood model is less appealing is that there is much less text available to estimate a language model based on the query text, and so the model will be worse estimated, and will have to depend more on being smoothed with some other language model. On the other hand, it is easy to see how to incorporate relevance feedback into such a model: you can expand the query with terms taken from relevant documents in the usual way and hence update the language model $M_q$ (Zhai and Lafferty 2001a).

Indeed, with appropriate modeling choices, this approach leads to the BIM model of Chapter 11. The relevance model of Lavrenko and Croft (2001) is an instance of a document likelihood model, which incorporates pseudo-relevance feedback into a language modeling approach. It achieves very strong empirical results.

Rather than directly generating in either direction, we can make a language model from both the document and query, and then ask how different these two language models are from each other. Lafferty and Zhai (2001) lay out these three ways of thinking about the problem, which we show in Figure 12.5, and develop a general risk minimization approach for document retrieval.

For instance, one way to model the risk of returning a document $d$ as relevant to a query $q$ is to use the Kullback-Leibler (KL) divergence between their respective language models:

$$R(d; q) = KL(M_d \,\|\, M_q) = \sum_{t \in V} P(t|M_q) \log \frac{P(t|M_q)}{P(t|M_d)} \tag{12.14}$$

KL divergence is an asymmetric divergence measure originating in information theory, which measures how bad the probability distribution $M_q$ is at modeling $M_d$ (Cover and Thomas 1991, Manning and Schütze 1999). Lafferty and Zhai (2001) present results suggesting that a model comparison approach outperforms both query-likelihood and document-likelihood approaches.
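The sum on the right-hand side of (12.14) can be sketched in a few lines. This is a minimal illustration, assuming both language models are given as dictionaries over the same small vocabulary and that the document models are already smoothed so that no probability is zero. The function name and the toy distributions are hypothetical.

```python
import math

def kl_divergence(p, q):
    """Sum over t of p(t) * log(p(t) / q(t)); terms with p(t) == 0 contribute 0.

    Assumes q(t) > 0 wherever p(t) > 0 (for example, q is a smoothed model).
    """
    return sum(p_t * math.log(p_t / q[t]) for t, p_t in p.items() if p_t > 0)

# Toy models over the vocabulary {a, b, c}; all numbers are made up.
query_model = {"a": 0.5, "b": 0.4, "c": 0.1}
doc_models = {
    "d1": {"a": 0.4, "b": 0.4, "c": 0.2},
    "d2": {"a": 0.1, "b": 0.1, "c": 0.8},
}

# Risk of each document: the summation in (12.14), weighted by P(t|M_q).
risks = {name: kl_divergence(query_model, m) for name, m in doc_models.items()}

# Lower risk means a closer match, so rank by ascending risk.
ranking = sorted(risks, key=risks.get)
print(ranking)

# Swapping the arguments generally gives a different value (asymmetry).
forward = kl_divergence(query_model, doc_models["d1"])
backward = kl_divergence(doc_models["d1"], query_model)
print(forward, backward)
```

Here `d1`, whose distribution is close to the query model, gets a much lower risk than `d2`, and the two argument orders give different values, which is the asymmetry the text refers to.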

One disadvantage of using KL divergence as a ranking function is that scores are not comparable across queries. This does not matter for ad hoc retrieval, but is important in other applications such as topic tracking. Kraaij and Spitters (2003) suggest an alternative proposal which models similarity as a normalized log-likelihood ratio (or, equivalently, as a difference between cross-entropies).

Basic LMs do not address issues of alternate expression, that is, synonymy, or any deviation in use of language between queries and documents.

Berger and Lafferty (1999) introduce translation models to bridge this query-document gap. A translation model lets you generate query words not in a document by translation to alternate terms with similar meaning. This also provides a basis for performing cross-language IR. We assume that the translation model can be represented by a conditional probability distribution $T(\cdot|\cdot)$ between vocabulary terms. The form of the translation query generation model is then:

$$P(q|M_d) = \prod_{t \in q} \sum_{v \in V} P(v|M_d)\, T(t|v) \tag{12.15}$$

The term $P(v|M_d)$ is the basic document language model, and the term $T(t|v)$ performs translation.
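Equation (12.15) can be illustrated with a small sketch. The nested-dictionary layout `translation[v][t]` for $T(t|v)$, the function name, and all probabilities below are assumptions made for the example, and the document model is left unsmoothed for brevity.

```python
import math

def translation_query_likelihood(query, doc_model, translation):
    """P(q|M_d) = product over t in q of sum over v of P(v|M_d) * T(t|v)."""
    likelihood = 1.0
    for t in query:
        # Every document term v may "translate" into the query term t.
        likelihood *= sum(
            p_v * translation.get(v, {}).get(t, 0.0)
            for v, p_v in doc_model.items()
        )
    return likelihood

# Toy document model P(v|M_d); the document only mentions "car" and "engine".
doc_model = {"car": 0.6, "engine": 0.4}

# Hypothetical translation table: translation[v][t] = T(t|v).
translation = {
    "car": {"car": 0.7, "automobile": 0.3},
    "engine": {"engine": 0.9, "motor": 0.1},
}

# Identity translation: each term only "translates" to itself.
identity = {v: {v: 1.0} for v in doc_model}

p_with = translation_query_likelihood(["automobile", "motor"], doc_model, translation)
p_without = translation_query_likelihood(["automobile", "motor"], doc_model, identity)
print(p_with, p_without)
```

With the identity table the query gets probability zero because neither query word occurs in the document, whereas the translation table gives it a small non-zero probability. The inner sum runs over the document vocabulary for every query term, which is where the extra computational cost comes from.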

This model is clearly more computationally intensive and we need to build a translation model. The translation model is usually built using separate resources (such as a traditional thesaurus or bilingual dictionary or a statistical machine translation system's translation dictionary), but can be built using the document collection if there are pieces of text that naturally paraphrase or summarize other pieces of text. Candidate examples are documents and their titles or abstracts, or documents and anchor-text pointing to them in a hypertext environment.

Building extended LM approaches remains an active area of research.
