An Introduction to Information Retrieval, Manning, Raghavan (2009), page 45

One way of doing this is with a Naive Bayes probabilistic model. If R is a Boolean indicator variable expressing the relevance of a document, then we can estimate P(x_t = 1 | R), the probability of a term t appearing in a document, depending on whether it is relevant or not, as:

(9.4)
$$\hat{P}(x_t = 1 \mid R = 1) = |VR_t| / |VR|$$
$$\hat{P}(x_t = 1 \mid R = 0) = (\mathrm{df}_t - |VR_t|) / (N - |VR|)$$

where N is the total number of documents, df_t is the number that contain t, VR is the set of known relevant documents, and VR_t is the subset of this set containing t. Even though the set of known relevant documents is a perhaps small subset of the true set of relevant documents, if we assume that the set of relevant documents is a small subset of the set of all documents then the estimates given above will be reasonable.
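As a concrete illustration, the estimates in Equation (9.4) are just ratios of set sizes. The function below is a minimal sketch; the collection numbers are made up for the example, not taken from the book:

```python
def estimate_term_probabilities(N, df_t, n_VR, n_VRt):
    """Equation (9.4): estimate P(x_t = 1 | R) from judged documents.

    N      -- total number of documents in the collection
    df_t   -- number of documents containing term t
    n_VR   -- |VR|, number of known relevant documents
    n_VRt  -- |VR_t|, known relevant documents containing t
    """
    p_rel = n_VRt / n_VR                    # P(x_t = 1 | R = 1)
    p_nonrel = (df_t - n_VRt) / (N - n_VR)  # P(x_t = 1 | R = 0)
    return p_rel, p_nonrel

# Hypothetical collection: 1000 documents, the term occurs in 50 of
# them, 10 documents are judged relevant, 8 of those contain the term.
p_rel, p_nonrel = estimate_term_probabilities(N=1000, df_t=50, n_VR=10, n_VRt=8)
# p_rel = 8/10 = 0.8; p_nonrel = 42/990, roughly 0.042
```

Note how the nonrelevant estimate relies on the assumption in the text: since |VR| is small relative to N, treating every document outside VR as nonrelevant barely distorts the denominator.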

This gives a basis for another way of changing the query term weights. We will discuss such probabilistic approaches more in Chapters 11 and 13, and in particular outline the application to relevance feedback in Section 11.3.4 (page 228). For the moment, observe that using just Equation (9.4) as a basis for term weighting is likely insufficient. The equations use only collection statistics and information about the term distribution within the documents judged relevant. They preserve no memory of the original query.

Online edition (c) 2009 Cambridge UP

9.1.3 When does relevance feedback work?

The success of relevance feedback depends on certain assumptions. Firstly, the user has to have sufficient knowledge to be able to make an initial query which is at least somewhere close to the documents they desire.

This is needed anyhow for successful information retrieval in the basic case, but it is important to see the kinds of problems that relevance feedback cannot solve alone. Cases where relevance feedback alone is not sufficient include:

• Misspellings. If the user spells a term in a different way to the way it is spelled in any document in the collection, then relevance feedback is unlikely to be effective.

This can be addressed by the spelling correction techniques of Chapter 3.

• Cross-language information retrieval. Documents in another language are not nearby in a vector space based on term distribution. Rather, documents in the same language cluster more closely together.

• Mismatch of searcher's vocabulary versus collection vocabulary. If the user searches for laptop but all the documents use the term notebook computer, then the query will fail, and relevance feedback is again most likely ineffective.

Secondly, the relevance feedback approach requires relevant documents to be similar to each other.

That is, they should cluster. Ideally, the term distribution in all relevant documents will be similar to that in the documents marked by the users, while the term distribution in all nonrelevant documents will be different from those in relevant documents. Things will work well if all relevant documents are tightly clustered around a single prototype, or, at least, if there are different prototypes, if the relevant documents have significant vocabulary overlap, while similarities between relevant and nonrelevant documents are small. Implicitly, the Rocchio relevance feedback model treats relevant documents as a single cluster, which it models via the centroid of the cluster.
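The centroid view can be made concrete in a few lines. The tf-idf vectors below are invented for illustration; the full Rocchio update with its α, β, γ weights is given earlier in this chapter:

```python
import numpy as np

# Hypothetical tf-idf vectors for three documents judged relevant.
relevant = np.array([
    [0.9, 0.1, 0.0],
    [0.7, 0.3, 0.0],
    [0.8, 0.0, 0.2],
])

# Rocchio models the relevant class by a single prototype:
# the componentwise mean (centroid) of the relevant vectors.
centroid = relevant.mean(axis=0)
# centroid is approximately [0.8, 0.133, 0.067]
```

If the relevant documents instead formed two distant clusters (say, Burma vocabulary and Myanmar vocabulary), this single centroid would sit between them and might be close to neither, which is exactly the multimodal failure mode discussed below.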

This approach does not work as well if the relevant documents are a multimodal class, that is, they consist of several clusters of documents within the vector space. This can happen with:

• Subsets of the documents using different vocabulary, such as Burma vs. Myanmar

• A query for which the answer set is inherently disjunctive, such as Pop stars who once worked at Burger King.

• Instances of a general concept, which often appear as a disjunction of more specific concepts, for example, felines.

Good editorial content in the collection can often provide a solution to this problem. For example, an article on the attitudes of different groups to the situation in Burma could introduce the terminology used by different parties, thus linking the document clusters.

Relevance feedback is not necessarily popular with users.

Users are often reluctant to provide explicit feedback, or in general do not wish to prolong the search interaction. Furthermore, it is often harder to understand why a particular document was retrieved after relevance feedback is applied.

Relevance feedback can also have practical problems. The long queries that are generated by straightforward application of relevance feedback techniques are inefficient for a typical IR system. This results in a high computing cost for the retrieval and potentially long response times for the user.

A partial solution to this is to only reweight certain prominent terms in the relevant documents, such as perhaps the top 20 terms by term frequency. Some experimental results have also suggested that using a limited number of terms like this may give better results (Harman 1992) though other work has suggested that using more terms is better in terms of retrieved document quality (Buckley et al. 1994b).

9.1.4 Relevance feedback on the web

Some web search engines offer a similar/related pages feature: the user indicates a document in the results set as exemplary from the standpoint of meeting his information need and requests more documents like it.

This can be viewed as a particularly simple form of relevance feedback. However, in general relevance feedback has been little used in web search. One exception was the Excite web search engine, which initially provided full relevance feedback. However, the feature was in time dropped, due to lack of use. On the web, few people use advanced search interfaces and most would like to complete their search in a single interaction. But the lack of uptake also probably reflects two other factors: relevance feedback is hard to explain to the average user, and relevance feedback is mainly a recall-enhancing strategy, and web search users are only rarely concerned with getting sufficient recall. Spink et al.

(2000) present results from the use of relevance feedback in the Excite search engine. Only about 4% of user query sessions used the relevance feedback option, and these were usually exploiting the "More like this" link next to each result. About 70% of users only looked at the first page of results and did not pursue things any further. For people who used relevance feedback, results were improved about two thirds of the time.

An important more recent thread of work is the use of clickstream data (what links a user clicks on) to provide indirect relevance feedback. Use of this data is studied in detail in (Joachims 2002b, Joachims et al.

2005). The very successful use of web link structure (see Chapter 21) can also be viewed as implicit feedback, but provided by page authors rather than readers (though in practice most authors are also readers).

Exercise 9.1
In Rocchio's algorithm, what weight setting for α/β/γ does a "Find pages like this one" search correspond to?

Exercise 9.2 [⋆]
Give three reasons why relevance feedback has been little used in web search.

9.1.5 Evaluation of relevance feedback strategies

Interactive relevance feedback can give very substantial gains in retrieval performance.

Empirically, one round of relevance feedback is often very useful. Two rounds is sometimes marginally more useful. Successful use of relevance feedback requires enough judged documents, otherwise the process is unstable in that it may drift away from the user's information need. Accordingly, having at least five judged documents is recommended.

There is some subtlety to evaluating the effectiveness of relevance feedback in a sound and enlightening way.

The obvious first strategy is to start with an initial query q0 and to compute a precision-recall graph. Following one round of feedback from the user, we compute the modified query qm and again compute a precision-recall graph. Here, in both rounds we assess performance over all documents in the collection, which makes comparisons straightforward. If we do this, we find spectacular gains from relevance feedback: gains on the order of 50% in mean average precision. But unfortunately it is cheating. The gains are partly due to the fact that known relevant documents (judged by the user) are now ranked higher. Fairness demands that we should only evaluate with respect to documents not seen by the user.

A second idea is to use documents in the residual collection (the set of documents minus those assessed relevant) for the second round of evaluation. This seems like a more realistic evaluation.
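A minimal sketch of the residual-collection idea, with an invented function name and toy data (document ids are just integers here):

```python
def residual_precision_at_k(ranking, relevant, judged, k):
    """Precision at k evaluated over the residual collection:
    documents the user already judged in round one are removed
    from the ranking before measuring."""
    residual = [d for d in ranking if d not in judged]
    top_k = residual[:k]
    return sum(1 for d in top_k if d in relevant) / k

# Toy example: documents ranked 0..19; docs 1 and 2 were judged
# relevant in the first round, so they are excluded from evaluation.
p = residual_precision_at_k(
    ranking=list(range(20)),
    relevant={1, 2, 5, 7},
    judged={1, 2},
    k=5,
)
# Top 5 of the residual ranking are [0, 3, 4, 5, 6]; only doc 5 is
# relevant, so p == 0.2
```

The sketch also makes the drawback visible: the judged relevant documents are subtracted from the pool of relevant documents that can still be found, which is why measured performance after feedback is often lower than for the original query.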

Unfortunately, the measured performance can then often be lower than for the original query. This is particularly the case if there are few relevant documents, and so a fair proportion of them have been judged by the user in the first round. The relative performance of variant relevance feedback methods can be validly compared, but it is difficult to validly compare performance with and without relevance feedback because the collection size and the number of relevant documents change from before the feedback to after it.

Thus neither of these methods is fully satisfactory. A third method is to have two collections, one which is used for the initial query and relevance judgments, and the second that is then used for comparative evaluation.

The performance of both q0 and qm can be validly compared on the second collection.

Perhaps the best evaluation of the utility of relevance feedback is to do user studies of its effectiveness, in particular by doing a time-based comparison:

Term weighting    Precision at k = 50
                  no RF     pseudo RF
lnc.ltc           64.2%     72.7%
Lnu.ltu           74.2%     87.0%

◮ Figure 9.5 Results showing pseudo relevance feedback greatly improving performance. These results are taken from the Cornell SMART system at TREC 4 (Buckley et al. 1995), and also contrast the use of two different length normalization schemes (L vs.
