An introduction to information retrieval. Manning, Raghavan (2009) (811397), page 61


Online edition (c) 2009 Cambridge UP
12 Language models for information retrieval

In general, translation models, relevance feedback models, and model comparison approaches have all been demonstrated to improve performance over the basic query likelihood LM.

12.5 References and further reading

For more details on the basic concepts of probabilistic language models and techniques for smoothing, see either Manning and Schütze (1999, Chapter 6) or Jurafsky and Martin (2008, Chapter 4).

The important initial papers that originated the language modeling approach to IR are: (Ponte and Croft 1998, Hiemstra 1998, Berger and Lafferty 1999, Miller et al. 1999). Other relevant papers can be found in the next several years of SIGIR proceedings. (Croft and Lafferty 2003) contains a collection of papers from a workshop on language modeling approaches and Hiemstra and Kraaij (2005) review one prominent thread of work on using language modeling approaches for TREC tasks. Zhai and Lafferty (2001b) clarify the role of smoothing in LMs for IR and present detailed empirical comparisons of different smoothing methods.
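For orientation, the two smoothing families that recur in this literature, linear interpolation and Bayesian smoothing, can be sketched for a unigram query likelihood model as below. This is a minimal illustration, not the procedure of any cited paper: the function name, the parameter names lam and alpha, their default values, and the toy documents are all assumptions, and the sketch assumes that lam weights the document model (the opposite convention also appears in the literature).

```python
import math
from collections import Counter


def query_log_likelihood(query, doc, collection, smoothing="linear",
                         lam=0.5, alpha=1000.0):
    """Return log P(query | smoothed unigram model of doc).

    query and doc are lists of tokens; collection is a list of token lists.
    All names and defaults here are illustrative assumptions.
    """
    doc_tf = Counter(doc)
    coll_tf = Counter(t for d in collection for t in d)
    coll_len = sum(coll_tf.values())
    score = 0.0
    for t in query:
        p_coll = coll_tf[t] / coll_len  # collection-wide estimate
        if smoothing == "linear":
            # Linear interpolation of document and collection estimates.
            p_doc = doc_tf[t] / len(doc) if doc else 0.0
            p = lam * p_doc + (1.0 - lam) * p_coll
        elif smoothing == "bayesian":
            # Bayesian smoothing: collection model acts as a prior of weight alpha.
            p = (doc_tf[t] + alpha * p_coll) / (len(doc) + alpha)
        else:
            raise ValueError(f"unknown smoothing method: {smoothing}")
        if p == 0.0:
            return float("-inf")  # term unseen in the whole collection
        score += math.log(p)
    return score


# Toy data, made up for illustration.
d1 = "xyzzy reports a profit but revenue is down".split()
d2 = "quorus narrows quarter loss but revenue decreases further".split()
collection = [d1, d2]
query = "revenue down".split()

for name, doc in (("d1", d1), ("d2", d2)):
    linear = query_log_likelihood(query, doc, collection, "linear", lam=0.5)
    bayes = query_log_likelihood(query, doc, collection, "bayesian", alpha=2.0)
    print(name, linear, bayes)
```

Under both methods the document that actually contains the term down scores higher, while the other document still receives a nonzero probability from the collection model.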

Zaragoza et al. (2003) advocate using full Bayesian predictive distributions rather than MAP point estimates, but while they outperform Bayesian smoothing, they fail to outperform a linear interpolation. Zhai and Lafferty (2002) argue that a two-stage smoothing model with first Bayesian smoothing followed by linear interpolation gives a good model of the task, and performs better and more stably than a single form of smoothing. A nice feature of the LM approach is that it provides a convenient and principled way to put various kinds of prior information into the model; Kraaij et al. (2002) demonstrate this by showing the value of link information as a prior in improving web entry page retrieval performance. As briefly discussed in Chapter 16 (page 353), Liu and Croft (2004) show some gains by smoothing a document LM with estimates from a cluster of similar documents; Tao et al. (2006) report larger gains by doing document-similarity based smoothing.

Hiemstra and Kraaij (2005) present TREC results showing a LM approach beating use of BM25 weights. Recent work has achieved some gains by going beyond the unigram model, providing the higher order models are smoothed with lower order models (Gao et al. 2004, Cao et al. 2005), though the gains to date remain modest. Spärck Jones (2004) presents a critical viewpoint on the rationale for the language modeling approach, but Lafferty and Zhai (2003) argue that a unified account can be given of the probabilistic semantics underlying both the language modeling approach presented in this chapter and the classical probabilistic information retrieval approach of Chapter 11. The Lemur Toolkit (http://www.lemurproject.org/) provides a flexible open source framework for investigating language modeling approaches to IR.

DRAFT! © April 1, 2009 Cambridge University Press. Feedback welcome.

13 Text classification and Naive Bayes

Thus far, this book has mainly discussed the process of ad hoc retrieval, where users have transient information needs that they try to address by posing one or more queries to a search engine.

However, many users have ongoing information needs. For example, you might need to track developments in multicore computer chips. One way of doing this is to issue the query multicore AND computer AND chip against an index of recent newswire articles each morning. In this and the following two chapters we examine the question: How can this repetitive task be automated? To this end, many systems support standing queries. A standing query is like any other query except that it is periodically executed on a collection to which new documents are incrementally added over time.

If your standing query is just multicore AND computer AND chip, you will tend to miss many relevant new articles which use other terms such as multicore processors.

To achieve good recall, standing queries thus have to be refined over time and can gradually become quite complex. In this example, using a Boolean search engine with stemming, you might end up with a query like (multicore OR multi-core) AND (chip OR processor OR microprocessor).

To capture the generality and scope of the problem space to which standing queries belong, we now introduce the general notion of a classification problem. Given a set of classes, we seek to determine which class(es) a given object belongs to. In the example, the standing query serves to divide new newswire articles into the two classes: documents about multicore computer chips and documents not about multicore computer chips.
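The way the refined standing query splits newly arrived articles into these two classes can be sketched as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, the tokenizer is a simple regular expression, and the "stemmer" merely strips a trailing s as a crude stand-in for the real stemming a Boolean search engine would apply.

```python
import re

# Term groups of the refined query from the text:
# (multicore OR multi-core) AND (chip OR processor OR microprocessor)
CORE_TERMS = {"multicore", "multi-core"}
CHIP_TERMS = {"chip", "processor", "microprocessor"}


def crude_stem(token):
    """Strip a plural 's' (an assumed, very rough substitute for stemming)."""
    return token[:-1] if len(token) > 3 and token.endswith("s") else token


def terms(text):
    """Lowercase, tokenize (keeping internal hyphens), and stem."""
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    return {crude_stem(t) for t in tokens}


def matches_standing_query(text):
    """True if the document satisfies both OR-groups of the query."""
    doc_terms = terms(text)
    return bool(doc_terms & CORE_TERMS) and bool(doc_terms & CHIP_TERMS)


def run_standing_query(new_documents):
    """Divide newly added documents into the two classes from the text."""
    about, not_about = [], []
    for doc in new_documents:
        (about if matches_standing_query(doc) else not_about).append(doc)
    return about, not_about


# Toy batch of new articles, made up for illustration.
todays_articles = [
    "Intel ships new multi-core processors",
    "Multicore computer chip prices fall",
    "Coffee prices rise in China",
]
about, not_about = run_standing_query(todays_articles)
print(about)
print(not_about)
```

Calling run_standing_query on each new batch of articles mimics the periodic execution of a standing query; the first article is caught only because the refined query also accepts multi-core and processor.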

We refer to this as two-class classification. Classification using standing queries is also called routing or filtering and will be discussed further in Section 15.3.1 (page 335).

A class need not be as narrowly focused as the standing query multicore computer chips. Often, a class is a more general subject area like China or coffee. Such more general classes are usually referred to as topics, and the classification task is then called text classification, text categorization, topic classification, or topic spotting.

An example for China appears in Figure 13.1. Standing queries and topics differ in their degree of specificity, but the methods for solving routing, filtering, and text classification are essentially the same. We therefore include routing and filtering under the rubric of text classification in this and the following chapters.

The notion of classification is very general and has many applications within and beyond information retrieval (IR).

For instance, in computer vision, a classifier may be used to divide images into classes such as landscape, portrait, and neither. We focus here on examples from information retrieval such as:

• Several of the preprocessing steps necessary for indexing as discussed in Chapter 2: detecting a document's encoding (ASCII, Unicode UTF-8, etc.; page 20); word segmentation (Is the white space between two letters a word boundary or not? page 24); truecasing (page 30); and identifying the language of a document (page 46).
• The automatic detection of spam pages (which then are not included in the search engine index).
• The automatic detection of sexually explicit content (which is included in search results only if the user turns an option such as SafeSearch off).
• Sentiment detection or the automatic classification of a movie or product review as positive or negative. An example application is a user searching for negative reviews before buying a camera to make sure it has no undesirable features or quality problems.
• Personal email sorting. A user may have folders like talk announcements, electronic bills, email from family and friends, and so on, and may want a classifier to classify each incoming email and automatically move it to the appropriate folder. It is easier to find messages in sorted folders than in a very large inbox. The most common case of this application is a spam folder that holds all suspected spam messages.
• Topic-specific or vertical search. Vertical search engines restrict searches to a particular topic. For example, the query computer science on a vertical search engine for the topic China will return a list of Chinese computer science departments with higher precision and recall than the query computer science China on a general purpose search engine. This is because the vertical search engine does not include web pages in its index that contain the term china in a different sense (e.g., referring to a hard white ceramic), but does include relevant pages even if they do not explicitly mention the term China.
• Finally, the ranking function in ad hoc information retrieval can also be based on a document classifier as we will explain in Section 15.4 (page 341).

This list shows the general importance of classification in IR.

