An introduction to information retrieval. Manning_ Raghavan (2009) (811397), страница 54

Файл №811397 An introduction to information retrieval. Manning_ Raghavan (2009) (An introduction to information retrieval. Manning_ Raghavan (2009).pdf) 54 страницаAn introduction to information retrieval. Manning_ Raghavan (2009) (811397) страница 542020-08-252020-08-25СтудИзба

An introduction to information retrieval. Manning_ Raghavan (2009).pdf

Просмтор этого файла доступен только зарегистрированным пользователям. Но у нас супер быстрая регистрация: достаточно только электронной почты!

Регистрация/авторизация

Текст из файла (страница 54)

You lose a pointfor either returning a nonrelevant document or failing to return a relevantdocument (such a binary situation where you are evaluated on your accuracyis called 1/0 loss). The goal is to return the best possible results as the top kdocuments, for any value of k the user chooses to examine. The PRP thensays to simply rank all documents in decreasing order of P( R = 1|d, q). Ifa set of retrieval results is to be returned, rather than an ordering, the Bayes1. The term likelihood is just a synonym of probability.

It is the probability of an event or dataaccording to a model. The term is usually used when people are thinking of holding the datafixed, while varying the model.Online edition (c) 2009 Cambridge UP22211 Probabilistic information retrievalOptimal Decision Rule, the decision which minimizes the risk of loss, is tosimply return documents that are more likely relevant than nonrelevant:(11.6)B AYES RISKd is relevant iff P( R = 1|d, q) > P( R = 0|d, q)Theorem 11.1.

The PRP is optimal, in the sense that it minimizes the expected loss(also known as the Bayes risk) under 1/0 loss.The proof can be found in Ripley (1996). However, it requires that all probabilities are known correctly. This is never the case in practice. Nevertheless,the PRP still provides a very useful foundation for developing models of IR.11.2.2The PRP with retrieval costsSuppose, instead, that we assume a model of retrieval costs.

Let C1 be thecost of not retrieving a relevant document and C0 the cost of retrieval of anonrelevant document. Then the Probability Ranking Principle says that iffor a specific document d and for all documents d′ not yet retrieved(11.7)C0 · P( R = 0|d) − C1 · P( R = 1|d) ≤ C0 · P( R = 0|d′ ) − C1 · P( R = 1|d′ )then d is the next document to be retrieved. Such a model gives a formalframework where we can model differential costs of false positives and falsenegatives and even system performance issues at the modeling stage, ratherthan simply at the evaluation stage, as we did in Section 8.6 (page 168). However, we will not further consider loss/utility models in this chapter.11.3B INARYI NDEPENDENCEM ODELThe Binary Independence ModelThe Binary Independence Model (BIM) we present in this section is the modelthat has traditionally been used with the PRP.

It introduces some simple assumptions, which make estimating the probability function P( R|d, q) practical. Here, “binary” is equivalent to Boolean: documents and queries areboth represented as binary term incidence vectors. That is, a document dis represented by the vector ~x = ( x1 , . . . , x M ) where xt = 1 if term t ispresent in document d and xt = 0 if t is not present in d.

With this representation, many possible documents have the same vector representation.Similarly, we represent q by the incidence vector ~q (the distinction betweenq and ~q is less central since commonly q is in the form of a set of words).“Independence” means that terms are modeled as occurring in documentsindependently. The model recognizes no association between terms. Thisassumption is far from correct, but it nevertheless often gives satisfactoryresults in practice; it is the “naive” assumption of Naive Bayes models, discussed further in Section 13.4 (page 265). Indeed, the Binary IndependenceOnline edition (c) 2009 Cambridge UP22311.3 The Binary Independence ModelModel is exactly the same as the multivariate Bernoulli Naive Bayes modelpresented in Section 13.3 (page 263). In a sense this assumption is equivalentto an assumption of the vector space model, where each term is a dimensionthat is orthogonal to all other terms.We will first present a model which assumes that the user has a singlestep information need.

As discussed in Chapter 9, seeing a range of resultsmight let the user refine their information need. Fortunately, as mentionedthere, it is straightforward to extend the Binary Independence Model so as toprovide a framework for relevance feedback, and we present this model inSection 11.3.4.To make a probabilistic retrieval strategy precise, we need to estimate howterms in documents contribute to relevance, specifically, we wish to knowhow term frequency, document frequency, document length, and other statistics that we can compute influence judgments about document relevance,and how they can be reasonably combined to estimate the probability of document relevance. We then order documents by decreasing estimated probability of relevance.We assume here that the relevance of each document is independent of therelevance of other documents.

As we noted in Section 8.5.1 (page 166), thisis incorrect: the assumption is especially harmful in practice if it allows asystem to return duplicate or near duplicate documents. Under the BIM, wemodel the probability P( R|d, q) that a document is relevant via the probability in terms of term incidence vectors P( R|~x, ~q). Then, using Bayes rule, wehave:(11.8)P( R = 1|~x, ~q)=P( R = 0|~x, ~q)=P(~x| R = 1, ~q) P( R = 1|~q)P(~x|~q )P(~x| R = 0, ~q) P( R = 0|~q)P(~x|~q )Here, P(~x| R = 1, ~q) and P(~x| R = 0, ~q) are the probability that if a relevant ornonrelevant, respectively, document is retrieved, then that document’s representation is ~x.

You should think of this quantity as defined with respect toa space of possible documents in a domain. How do we compute all theseprobabilities? We never know the exact probabilities, and so we have to useestimates: Statistics about the actual document collection are used to estimatethese probabilities. P( R = 1|~q) and P( R = 0|~q) indicate the prior probabilityof retrieving a relevant or nonrelevant document respectively for a query ~q.Again, if we knew the percentage of relevant documents in the collection,then we could use this number to estimate P( R = 1|~q ) and P( R = 0|~q). Sincea document is either relevant or nonrelevant to a query, we must have that:(11.9)P( R = 1|~x, ~q ) + P( R = 0|~x, ~q) = 1Online edition (c) 2009 Cambridge UP22411 Probabilistic information retrieval11.3.1Deriving a ranking function for query termsGiven a query q, we wish to order returned documents by descending P( R =1|d, q).

Under the BIM, this is modeled as ordering by P( R = 1|~x, ~q). Ratherthan estimating this probability directly, because we are interested only in theranking of documents, we work with some other quantities which are easierto compute and which give the same ordering of documents. In particular,we can rank documents by their odds of relevance (as the odds of relevanceis monotonic with the probability of relevance). This makes things easier,because we can ignore the common denominator in (11.8), giving:(11.10)N AIVE B AYESASSUMPTIONP( R = 1|~x, ~q)O( R|~x, ~q) ==P( R = 0|~x, ~q)P ( R =1|~q) P (~x| R =1,~q)P (~x|~q)P ( R =0|~q) P (~x| R =0,~q)P (~x|~q)=P( R = 1|~q) P(~x| R = 1, ~q)·P( R = 0|~q) P(~x| R = 0, ~q)The left term in the rightmost expression of Equation (11.10) is a constant fora given query. Since we are only ranking documents, there is thus no needfor us to estimate it. The right-hand term does, however, require estimation,and this initially appears to be difficult: How can we accurately estimate theprobability of an entire term incidence vector occurring? It is at this point thatwe make the Naive Bayes conditional independence assumption that the presenceor absence of a word in a document is independent of the presence or absenceof any other word (given the query):P(~x| R = 1, ~q)=P(~x| R = 0, ~q)(11.11)So:(11.12)MP( xt | R = 1, ~q)∏ P(xt | R = 0, ~q)t =1MO( R|~x, ~q) = O( R|~q) · ∏t =1P( xt | R = 1, ~q)P( xt | R = 0, ~q)Since each xt is either 0 or 1, we can separate the terms to give:(11.13)O( R|~x, ~q) = O( R|~q) ·∏t:x t =1P( xt = 1| R = 1, ~q)P( xt = 0| R = 1, ~q)· ∏P( xt = 1| R = 0, ~q) t:xt =0 P( xt = 0| R = 0, ~q)Henceforth, let pt = P( xt = 1| R = 1, ~q) be the probability of a term appearing in a document relevant to the query, and ut = P( xt = 1| R = 0, ~q) be theprobability of a term appearing in a nonrelevant document.

These quantitiescan be visualized in the following contingency table where the columns addto 1:(11.14)Term presentTerm absentdocumentxt = 1xt = 0relevant (R = 1)pt1 − ptnonrelevant (R = 0)ut1 − utOnline edition (c) 2009 Cambridge UP22511.3 The Binary Independence ModelLet us make an additional simplifying assumption that terms not occurring in the query are equally likely to occur in relevant and nonrelevant documents: that is, if qt = 0 then pt = ut . (This assumption can be changed,as when doing relevance feedback in Section 11.3.4.) Then we need onlyconsider terms in the products that appear in the query, and so,(11.15)O( R|~q, ~x) = O( R|~q) ·1 − ptpt·∏u1 − utt:x =q =1 t t:x =0,q =1∏ttttThe left product is over query terms found in the document and the rightproduct is over query terms not found in the document.We can manipulate this expression by including the query terms found inthe document into the right product, but simultaneously dividing throughby them in the left product, so the value is unchanged.

Then we have:(11.16)R ETRIEVAL S TATUSVALUE(11.17)O( R|~q, ~x ) = O( R|~q) ·p t (1 − u t )1 − pt· ∏u (1 − pt ) t:q =1 1 − utt:x =q =1 t∏tttThe left product is still over query terms found in the document, but the rightproduct is now over all query terms. That means that this right product is aconstant for a particular query, just like the odds O( R|~q). So the only quantitythat needs to be estimated to rank documents for relevance to a query is theleft product. We can equally rank documents by the logarithm of this term,since log is a monotonic function.

Характеристики

Тип файла

PDF-файл

Размер

6,58 Mb

Материал

An introduction to information retrieval. Manning_ Raghavan (2009).pdf

Тип материала

Книга

Предмет

Анализ текстовых данных и информационный поиск

Высшее учебное заведение

МГУ им. Ломоносова

Список файлов книги

an-introduction-to-information-retrieval.-manning_-raghavan-2009.pdf.rar

An introduction to information retrieval. Manning_ Raghavan (2009).pdf

Поделитесь ссылкой:

Ставлю 10/10
Все нравится, очень удобный сайт, помогает в учебе. Кроме этого, можно заработать самому, выставляя готовые учебные материалы на продажу здесь. Рейтинги и отзывы на преподавателей очень помогают сориентироваться в начале нового семестра. Спасибо за такую функцию. Ставлю максимальную оценку.

Лучшая платформа для успешной сдачи сессии
Познакомился со СтудИзбой благодаря своему другу, очень нравится интерфейс, количество доступных файлов, цена, в общем, все прекрасно. Даже сам продаю какие-то свои работы.

Студизба ван лав ❤
Очень офигенный сайт для студентов. Много полезных учебных материалов. Пользуюсь студизбой с октября 2021 года. Серьёзных нареканий нет. Хотелось бы, что бы ввели подписочную модель и сделали материалы дешевле 300 рублей в рамках подписки бесплатными.

Отличный сайт
Лично меня всё устраивает - и покупка, и продажа; и цены, и возможность предпросмотра куска файла, и обилие бесплатных файлов (в подборках по авторам, читай, ВУЗам и факультетам). Есть определённые баги, но всё решаемо, да и администраторы реагируют в течение суток.

Маленький отзыв о большом помощнике!
Студизба спасает в те моменты, когда сроки горят, а работ накопилось достаточно. Довольно удобный сайт с простой навигацией и огромным количеством материалов.

Студ. Изба как крупнейший сборник работ для студентов
Тут дофига бывает всего полезного. Печально, что бывают предметы по которым даже одного бесплатного решения нет, но это скорее вопрос к студентам. В остальном всё здорово.

Спасательный островок
Если уже не успеваешь разобраться или застрял на каком-то задание поможет тебе быстро и недорого решить твою проблему.

Всё и так отлично
Всё очень удобно. Особенно круто, что есть система бонусов и можно выводить остатки денег. Очень много качественных бесплатных файлов.

Отзыв о системе "Студизба"
Отличная платформа для распространения работ, востребованных студентами. Хорошо налаженная и качественная работа сайта, огромная база заданий и аудитория.

Отличный помощник
Отличный сайт с кучей полезных файлов, позволяющий найти много методичек / учебников / отзывов о вузах и преподователях.

Отлично помогает студентам в любой момент для решения трудных и незамедлительных задач
Хотелось бы больше конкретной информации о преподавателях. А так в принципе хороший сайт, всегда им пользуюсь и ни разу не было желания прекратить. Хороший сайт для помощи студентам, удобный и приятный интерфейс. Из недостатков можно выделить только отсутствия небольшого количества файлов.

Спасибо за шикарный сайт
Великолепный сайт на котором студент за не большие деньги может найти помощь с дз, проектами курсовыми, лабораторными, а также узнать отзывы на преподавателей и бесплатно скачать пособия.