An introduction to information retrieval. Manning_ Raghavan (2009) (811397), страница 34

Файл №811397 An introduction to information retrieval. Manning_ Raghavan (2009) (An introduction to information retrieval. Manning_ Raghavan (2009).pdf) 34 страницаAn introduction to information retrieval. Manning_ Raghavan (2009) (811397) страница 342020-08-252020-08-25СтудИзба

An introduction to information retrieval. Manning_ Raghavan (2009).pdf

Просмтор этого файла доступен только зарегистрированным пользователям. Но у нас супер быстрая регистрация: достаточно только электронной почты!

Регистрация/авторизация

Текст из файла (страница 34)

In this figure, the x − axis represents |Vwhile the y−axis represents possible normalization factors we can use. Thethin line y = x depicts the use of cosine normalization. Notice the followingaspects of the thick line representing pivoted length normalization:1. It is linear in the document length and has the form(6.16)~ (d)| + (1 − a)piv,a |VOnline edition (c) 2009 Cambridge UP1316.4 Variant tf-idf functionsPivoted normalization6y = x; Cosine Pivoted-piv~ (d)||V◮ Figure 6.17 Implementing pivoted document length normalization by linear scaling.where piv is the cosine normalization value at which the two curves intersect.2. Its slope is a < 1 and (3) it crosses the y = x line at piv.It has been argued that in practice, Equation (6.16) is well approximated byaud + (1 − a)piv,where ud is the number of unique terms in document d.Of course, pivoted document length normalization is not appropriate forall applications.

For instance, in a collection of answers to frequently askedquestions (say, at a customer service website), relevance may have little todo with document length. In other cases the dependency may be more complex than can be accounted for by a simple linear pivoted normalization.

Insuch cases, document length can be used as a feature in the machine learningbased scoring approach of Section 6.1.2.?E UCLIDEAN DISTANCEExercise 6.18One measure of the similarity of two vectors is the Euclidean distance (or L2 distance)between them:vuMu|~x − ~y | = t ∑ ( xi − yi )2i =1Online edition (c) 2009 Cambridge UP1326 Scoring, term weighting and the vector space modelworddigitalvideocamerastfwfquerydfidf10,000100,00050,000qi = wf-idftfwfdocumentd i = normalized wfqi · d i◮ Table 6.1 Cosine computation for Exercise 6.19.Given a query q and documents d1 , d2 , . .

., we may rank the documents di in orderof increasing Euclidean distance from q. Show that if q and the di are all normalizedto unit vectors, then the rank ordering produced by Euclidean distance is identical tothat produced by cosine similarities.Exercise 6.19Compute the vector space similarity between the query “digital cameras” and thedocument “digital cameras and video cameras” by filling out the empty columns inTable 6.1. Assume N = 10,000,000, logarithmic term weighting (wf columns) forquery and document, idf weighting for the query only and cosine normalization forthe document only. Treat and as a stop word. Enter term counts in the tf columns.What is the final similarity score?Exercise 6.20Show that for the query affection, the relative ordering of the scores of the three documents in Figure 6.13 is the reverse of the ordering of the scores for the query jealousgossip.Exercise 6.21In turning a query into a unit vector in Figure 6.13, we assigned equal weights to eachof the query terms.

What other principled approaches are plausible?Exercise 6.22Consider the case of a query term that is not in the set of M indexed terms; thus our~ (q ) not being in the vector spacestandard construction of the query vector results in Vcreated from the collection. How would one adapt the vector space representation tohandle this case?Exercise 6.23Refer to the tf and idf values for four terms and three documents in Exercise 6.10.Compute the two top scoring documents on the query best car insurance for each ofthe following weighing schemes: (i) nnn.atc; (ii) ntc.atc.Exercise 6.24Suppose that the word coyote does not occur in the collection used in Exercises 6.10and 6.23. How would one compute ntc.atc scores for the query coyote insurance?Online edition (c) 2009 Cambridge UP6.5 References and further reading6.5133References and further readingChapter 7 develops the computational aspects of vector space scoring.

Luhn(1957; 1958) describes some of the earliest reported applications of term weighting. His paper dwells on the importance of medium frequency terms (termsthat are neither too commonplace nor too rare) and may be thought of as anticipating tf-idf and related weighting schemes. Spärck Jones (1972) buildson this intuition through detailed experiments showing the use of inversedocument frequency in term weighting.

A series of extensions and theoretical justifications of idf are due to Salton and Buckley (1987) Robertson andJones (1976), Croft and Harper (1979) and Papineni (2001). Robertson maintains a web page (http://www.soi.city.ac.uk/˜ser/idf.html) containing the historyof idf, including soft copies of early papers that predated electronic versionsof journal article. Singhal et al. (1996a) develop pivoted document lengthnormalization. Probabilistic language models (Chapter 11) develop weighting techniques that are more nuanced than tf-idf; the reader will find thisdevelopment in Section 11.4.3.We observed that by assigning a weight for each term in a document, adocument may be viewed as a vector of term weights, one for each term inthe collection.

The SMART information retrieval system at Cornell (Salton1971b) due to Salton and colleagues was perhaps the first to view a document as a vector of weights. The basic computation of cosine scores asdescribed in Section 6.3.3 is due to Zobel and Moffat (2006). The two queryevaluation strategies term-at-a-time and document-at-a-time are discussedby Turtle and Flood (1995).The SMART notation for tf-idf term weighting schemes in Figure 6.15 ispresented in (Salton and Buckley 1988, Singhal et al. 1995; 1996b). Not allversions of the notation are consistent; we most closely follow (Singhal et al.1996b). A more detailed and exhaustive notation was developed in Moffatand Zobel (1998), considering a larger palette of schemes for term and document frequency weighting.

Beyond the notation, Moffat and Zobel (1998)sought to set up a space of feasible weighting functions through which hillclimbing approaches could be used to begin with weighting schemes thatperformed well, then make local improvements to identify the best combinations. However, they report that such hill-climbing methods failed to leadto any conclusions on the best weighting schemes.Online edition (c) 2009 Cambridge UPOnline edition (c) 2009 Cambridge UPDRAFT! © April 1, 2009 Cambridge University Press.

Feedback welcome.7135Computing scores in a completesearch systemChapter 6 developed the theory underlying term weighting in documentsfor the purposes of scoring, leading up to vector space models and the basiccosine scoring algorithm of Section 6.3.3 (page 124). In this chapter we begin in Section 7.1 with heuristics for speeding up this computation; many ofthese heuristics achieve their speed at the risk of not finding quite the top Kdocuments matching the query.

Some of these heuristics generalize beyondcosine scoring. With Section 7.1 in place, we have essentially all the components needed for a complete search engine. We therefore take a step backfrom cosine scoring, to the more general problem of computing scores in asearch engine. In Section 7.2 we outline a complete search engine, including indexes and structures to support not only cosine scoring but also moregeneral ranking factors such as query term proximity. We describe how allof the various pieces fit together in Section 7.2.4. We conclude this chapterwith Section 7.3, where we discuss how the vector space model for free textqueries interacts with common query operators.7.1Efficient scoring and rankingWe begin by recapping the algorithm of Figure 6.14.

For a query such as q =jealous gossip, two observations are immediate:1. The unit vector ~v(q) has only two non-zero components.2. In the absence of any weighting for query terms, these non-zero components are equal – in this case, both equal 0.707.For the purpose of ranking the documents matching this query, we arereally interested in the relative (rather than absolute) scores of the documentsin the collection. To this end, it suffices to compute the cosine similarity from~ (q) (in which all non-zero componentseach document unit vector ~v(d) to Vof the query vector are set to 1), rather than to the unit vector ~v(q).

For anyOnline edition (c) 2009 Cambridge UP1367 Computing scores in a complete search systemFAST C OSINE S CORE (q)1 float Scores[ N ] = 02 for each d3 do Initialize Length[d] to the length of doc d4 for each query term t5 do calculate wt,q and fetch postings list for t6for each pair(d, tft,d ) in postings list7do add wft,d to Scores[d]8 Read the array Length[d]9 for each d10 do Divide Scores[d] by Length[d]11 return Top K components of Scores[]◮ Figure 7.1 A faster algorithm for vector space scores.two documents d1 , d2(7.1)~ (q) · ~v(d1 ) > V~ (q) · ~v(d2 ) ⇔ ~v(q) · ~v(d1 ) > ~v(q) · ~v(d2 ).V~ (q) · ~v(d) is the weighted sum,For any document d, the cosine similarity Vover all terms in the query q, of the weights of those terms in d.

This in turncan be computed by a postings intersection exactly as in the algorithm ofFigure 6.14, with line 8 altered since we take wt,q to be 1 so that the multiplyadd in that step becomes just an addition; the result is shown in Figure 7.1.We walk through the postings in the inverted index for the terms in q, accumulating the total score for each document – very much as in processing aBoolean query, except we assign a positive score to each document that appears in any of the postings being traversed. As mentioned in Section 6.3.3we maintain an idf value for each dictionary term and a tf value for eachpostings entry.

This scheme computes a score for every document in thepostings of any of the query terms; the total number of such documents maybe considerably smaller than N.Given these scores, the final step before presenting results to a user is topick out the K highest-scoring documents. While one could sort the completeset of scores, a better approach is to use a heap to retrieve only the top Kdocuments in order. Where J is the number of documents with non-zerocosine scores, constructing such a heap can be performed in 2J comparisonsteps, following which each of the K highest scoring documents can be “readoff” the heap with log J comparison steps.Online edition (c) 2009 Cambridge UP7.1 Efficient scoring and ranking7.1.1137Inexact top K document retrievalThus far, we have focused on retrieving precisely the K highest-scoring documents for a query.

Характеристики

Тип файла

PDF-файл

Размер

6,58 Mb

Материал

An introduction to information retrieval. Manning_ Raghavan (2009).pdf

Тип материала

Книга

Предмет

Анализ текстовых данных и информационный поиск

Высшее учебное заведение

МГУ им. Ломоносова

Список файлов книги

an-introduction-to-information-retrieval.-manning_-raghavan-2009.pdf.rar

An introduction to information retrieval. Manning_ Raghavan (2009).pdf

Поделитесь ссылкой:

Ставлю 10/10
Все нравится, очень удобный сайт, помогает в учебе. Кроме этого, можно заработать самому, выставляя готовые учебные материалы на продажу здесь. Рейтинги и отзывы на преподавателей очень помогают сориентироваться в начале нового семестра. Спасибо за такую функцию. Ставлю максимальную оценку.

Лучшая платформа для успешной сдачи сессии
Познакомился со СтудИзбой благодаря своему другу, очень нравится интерфейс, количество доступных файлов, цена, в общем, все прекрасно. Даже сам продаю какие-то свои работы.

Студизба ван лав ❤
Очень офигенный сайт для студентов. Много полезных учебных материалов. Пользуюсь студизбой с октября 2021 года. Серьёзных нареканий нет. Хотелось бы, что бы ввели подписочную модель и сделали материалы дешевле 300 рублей в рамках подписки бесплатными.

Отличный сайт
Лично меня всё устраивает - и покупка, и продажа; и цены, и возможность предпросмотра куска файла, и обилие бесплатных файлов (в подборках по авторам, читай, ВУЗам и факультетам). Есть определённые баги, но всё решаемо, да и администраторы реагируют в течение суток.

Маленький отзыв о большом помощнике!
Студизба спасает в те моменты, когда сроки горят, а работ накопилось достаточно. Довольно удобный сайт с простой навигацией и огромным количеством материалов.

Студ. Изба как крупнейший сборник работ для студентов
Тут дофига бывает всего полезного. Печально, что бывают предметы по которым даже одного бесплатного решения нет, но это скорее вопрос к студентам. В остальном всё здорово.

Спасательный островок
Если уже не успеваешь разобраться или застрял на каком-то задание поможет тебе быстро и недорого решить твою проблему.

Всё и так отлично
Всё очень удобно. Особенно круто, что есть система бонусов и можно выводить остатки денег. Очень много качественных бесплатных файлов.

Отзыв о системе "Студизба"
Отличная платформа для распространения работ, востребованных студентами. Хорошо налаженная и качественная работа сайта, огромная база заданий и аудитория.

Отличный помощник
Отличный сайт с кучей полезных файлов, позволяющий найти много методичек / учебников / отзывов о вузах и преподователях.

Отлично помогает студентам в любой момент для решения трудных и незамедлительных задач
Хотелось бы больше конкретной информации о преподавателях. А так в принципе хороший сайт, всегда им пользуюсь и ни разу не было желания прекратить. Хороший сайт для помощи студентам, удобный и приятный интерфейс. Из недостатков можно выделить только отсутствия небольшого количества файлов.

Спасибо за шикарный сайт
Великолепный сайт на котором студент за не большие деньги может найти помощь с дз, проектами курсовыми, лабораторными, а также узнать отзывы на преподавателей и бесплатно скачать пособия.