An introduction to information retrieval. Manning, Raghavan (2009), page 37

. . , tk. Let ω be the width of the smallest window in a document d that contains all the query terms, measured in the number of words in the window. For instance, if the document were to simply consist of the sentence “The quality of mercy is not strained”, the smallest window for the query “strained mercy” would be 4. Intuitively, the smaller that ω is, the better that d matches the query. In cases where the document does not contain all of the query terms, we can set ω to be some enormous number.
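As an illustration of the definition above, the following is a minimal sketch of computing ω over a tokenized document with a sliding window. The function name `smallest_window` and the `missing` parameter are assumptions made for this example; the text does not prescribe an algorithm, and "some enormous number" is modeled here as infinity.

```python
from collections import Counter


def smallest_window(doc_tokens, query_terms, missing=float("inf")):
    """Width, in words, of the smallest window of doc_tokens that
    contains every query term; `missing` if some term never occurs."""
    needed = set(query_terms)
    if not needed:
        return 0
    counts = Counter()   # occurrences of each needed term inside the window
    covered = 0          # number of distinct needed terms inside the window
    best = missing
    left = 0
    for right, tok in enumerate(doc_tokens):
        if tok in needed:
            counts[tok] += 1
            if counts[tok] == 1:
                covered += 1
        # Shrink from the left while the window still covers all terms.
        while covered == len(needed):
            best = min(best, right - left + 1)
            left_tok = doc_tokens[left]
            if left_tok in needed:
                counts[left_tok] -= 1
                if counts[left_tok] == 0:
                    covered -= 1
            left += 1
    return best


doc = "the quality of mercy is not strained".split()
print(smallest_window(doc, ["strained", "mercy"]))  # 4
```

A document that lacks one of the query terms yields the `missing` value, which matches the convention of setting ω to an enormous number.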

We could also consider variants in which only words that are not stop words are considered in computing ω. Such proximity-weighted scoring functions are a departure from pure cosine similarity and closer to the “soft conjunctive” semantics that Google and other web search engines evidently use.

How can we design such a proximity-weighted scoring function to depend on ω? The simplest answer relies on a “hand coding” technique we introduce below in Section 7.2.3. A more scalable approach goes back to Section 6.1.2: we treat the integer ω as yet another feature in the scoring function, whose importance is assigned by machine learning, as will be developed further in Section 15.4.1.

7.2.3 Designing parsing and scoring functions

Common search interfaces, particularly for consumer-facing search applications on the web, tend to mask query operators from the end user.

The intent is to hide the complexity of these operators from the largely non-technical audience for such applications, inviting free text queries. Given such interfaces, how should a search system equipped with indexes for various retrieval operators treat a query such as “rising interest rates”? More generally, given the various factors we have studied that could affect the score of a document, how should we combine these features?

The answer of course depends on the user population, the query distribution and the collection of documents.

Typically, a query parser is used to translate the user-specified keywords into a query with various operators that is executed against the underlying indexes. Sometimes, this execution can entail multiple queries against the underlying indexes; for example, the query parser may issue a stream of queries:

1. Run the user-generated query string as a phrase query. Rank the matching documents by vector space scoring, using as the query the vector consisting of the 3 terms rising interest rates.
2. If fewer than ten documents contain the phrase rising interest rates, run the two 2-term phrase queries rising interest and interest rates; rank these using vector space scoring, as well.
3. If we still have fewer than ten results, run the vector space query consisting of the three individual query terms.

(Online edition (c) 2009 Cambridge UP; Chapter 7, Computing scores in a complete search system.)

Each of these steps (if invoked) may yield a list of scored documents, for each of which we compute a score.
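The three-step stream of queries above can be sketched as follows. This is a toy illustration under stated assumptions, not a definitive parser: documents are held in memory as token lists, `toy_score` is a hypothetical stand-in for real vector space scoring, and step 2 is generalized from "the two 2-term phrases" to all adjacent word pairs of the query.

```python
def contains_phrase(tokens, phrase):
    """True if the token list `phrase` occurs contiguously in `tokens`."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))


def toy_score(tokens, terms):
    """Hypothetical placeholder for vector space scoring:
    the total number of query-term occurrences in the document."""
    return sum(tokens.count(t) for t in terms)


def run_cascade(query, docs, min_results=10):
    """Issue the stream of queries; return doc_id -> [(step, score), ...]."""
    terms = query.split()
    results = {}

    def run_phrases(step, phrases):
        for doc_id, tokens in docs.items():
            if any(contains_phrase(tokens, p) for p in phrases):
                score = toy_score(tokens, terms)
                results.setdefault(doc_id, []).append((step, score))

    # Step 1: the whole query as a single phrase.
    run_phrases(1, [terms])

    # Step 2: too few phrase matches, so fall back to 2-term phrases.
    if len(results) < min_results and len(terms) > 2:
        pairs = [terms[i:i + 2] for i in range(len(terms) - 1)]
        run_phrases(2, pairs)

    # Step 3: still too few results, so run the plain free text query.
    if len(results) < min_results:
        for doc_id, tokens in docs.items():
            score = toy_score(tokens, terms)
            if score > 0:
                results.setdefault(doc_id, []).append((3, score))
    return results


docs = {
    1: "rising interest rates hurt".split(),
    2: "interest rates fall".split(),
    3: "rates of change".split(),
    4: "nothing here".split(),
}
hits = run_cascade("rising interest rates", docs, min_results=2)
```

Note that a document can collect entries from more than one step (document 1 here matches in steps 1 and 2), which is why a later aggregation stage is needed.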

This score must combine contributions from vector space scoring, static quality, proximity weighting and potentially other factors, particularly since a document may appear in the lists from multiple steps. This demands an aggregate scoring function that accumulates evidence of a document’s relevance from multiple sources. How do we devise a query parser and how do we devise the aggregate scoring function?

The answer depends on the setting. In many enterprise settings we have application builders who make use of a toolkit of available scoring operators, along with a query parsing layer, with which to manually configure the scoring function as well as the query parser.
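One way to picture a manually configured aggregate scoring function is a weighted sum of the evidence sources named above. The sketch below is an assumption-laden illustration: the function name, the weight keys, the choice of taking the best vector space score across steps, and the use of 1/ω as the proximity contribution are all choices made for this example rather than anything the text specifies.

```python
def aggregate_scores(step_lists, static_quality, proximity, weights):
    """Combine evidence for each document into one score.

    step_lists     : list of dicts, doc_id -> vector space score (one per step)
    static_quality : dict, doc_id -> query-independent quality score
    proximity      : dict, doc_id -> smallest window width omega (>= 1)
    weights        : dict with hand-set keys "vs", "quality", "proximity"
    """
    # Assumption: a document seen in several steps keeps its best score.
    best_vs = {}
    for scores in step_lists:
        for doc_id, s in scores.items():
            best_vs[doc_id] = max(best_vs.get(doc_id, 0.0), s)

    final = {}
    for doc_id, vs in best_vs.items():
        omega = proximity.get(doc_id, float("inf"))
        prox = 1.0 / omega  # 0.0 when omega is infinite (terms missing)
        final[doc_id] = (
            weights["vs"] * vs
            + weights["quality"] * static_quality.get(doc_id, 0.0)
            + weights["proximity"] * prox
        )
    # Highest aggregate score first.
    return sorted(final.items(), key=lambda kv: kv[1], reverse=True)


ranked = aggregate_scores(
    step_lists=[{"a": 0.5}, {"a": 0.25, "b": 0.75}],
    static_quality={"a": 1.0, "b": 0.0},
    proximity={"a": 4},
    weights={"vs": 1.0, "quality": 0.5, "proximity": 2.0},
)
print(ranked)  # [('a', 1.5), ('b', 0.75)]
```

In a hand-tuned setting the entries of `weights` are what an application builder would adjust; in machine-learned scoring they would instead be fitted from training data.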

Such application builders make use of the available zones, metadata and knowledge of typical documents and queries to tune the parsing and scoring. This manual approach works in collections whose characteristics change infrequently (in an enterprise application, significant changes in collection and query characteristics typically happen with infrequent events such as the introduction of new document formats or document management systems, or a merger with another company). Web search, on the other hand, is faced with a constantly changing document collection with new characteristics being introduced all the time.

It is also a setting in which the number of scoring factors can run into the hundreds, making hand-tuned scoring a difficult exercise. To address this, it is becoming increasingly common to use machine-learned scoring, extending the ideas we introduced in Section 6.1.2, as will be discussed further in Section 15.4.1.

7.2.4 Putting it all together

We have now studied all the components necessary for a basic search system that supports free text queries as well as Boolean, zone and field queries.

We briefly review how the various pieces fit together into an overall system; this is depicted in Figure 7.5.

In this figure, documents stream in from the left for parsing and linguistic processing (language and format detection, tokenization and stemming). The resulting stream of tokens feeds into two modules. First, we retain a copy of each parsed document in a document cache. This will enable us to generate results snippets: snippets of text accompanying each document in the results list for a query.

Figure 7.5: A complete search system. Data paths are shown primarily for a free text query.

A snippet tries to give a succinct explanation to the user of why the document matches the query. The automatic generation of such snippets is the subject of Section 8.7. A second copy of the tokens is fed to a bank of indexers that create a bank of indexes including zone and field indexes that store the metadata for each document, (tiered) positional indexes, indexes for spelling correction and other tolerant retrieval, and structures for accelerating inexact top-K retrieval. A free text user query (top center) is sent down to the indexes both directly and through a module for generating spelling-correction candidates.

As noted in Chapter 3, the latter may optionally be invoked only when the original query fails to retrieve enough results. Retrieved documents (dark arrow) are passed to a scoring module that computes scores based on machine-learned ranking (MLR), a technique that builds on Section 6.1.2 (to be further developed in Section 15.4.1) for scoring and ranking documents. Finally, these ranked documents are rendered as a results page.

Exercise 7.9
Explain how the postings intersection algorithm first introduced in Section 1.3 can be adapted to find the width ω of the smallest window that contains all query terms.

Exercise 7.10
Adapt this procedure to work when not all query terms are present in a document.

7.3 Vector space scoring and query operator interaction

We introduced the vector space model as a paradigm for free text queries. We conclude this chapter by discussing how the vector space scoring model relates to the query operators we have studied in earlier chapters.

The relationship should be viewed at two levels: in terms of the expressiveness of queries that a sophisticated user may pose, and in terms of the index that supports the evaluation of the various retrieval methods. In building a search engine, we may opt to support multiple query operators for an end user. In doing so we need to understand what components of the index can be shared for executing various query operators, as well as how to handle user queries that mix various query operators.

Vector space scoring supports so-called free text retrieval, in which a query is specified as a set of words without any query operators connecting them. It allows documents matching the query to be scored and thus ranked, unlike the Boolean, wildcard and phrase queries studied earlier.

Classically, the interpretation of such free text queries was that at least one of the query terms be present in any retrieved document. However, more recently, web search engines such as Google have popularized the notion that a set of terms typed into their query boxes (thus on the face of it, a free text query) carries the semantics of a conjunctive query that only retrieves documents containing all or most query terms.

Boolean retrieval

Clearly a vector space index can be used to answer Boolean queries, as long as the weight of a term t in the document vector for d is non-zero whenever t occurs in d.

The reverse is not true, since a Boolean index does not by default maintain term weight information. There is no easy way of combining vector space and Boolean queries from a user’s standpoint: vector space queries are fundamentally a form of evidence accumulation, where the presence of more query terms in a document adds to the score of a document. Boolean retrieval, on the other hand, requires a user to specify a formula for selecting documents through the presence (or absence) of specific combinations of keywords, without inducing any relative ordering among them. Mathematically, it is in fact possible to invoke so-called p-norms to combine Boolean and vector space queries, but we know of no system that makes use of this fact.

Wildcard queries

Wildcard and vector space queries require different indexes, except at the basic level that both can be implemented using postings and a dictionary (e.g., a dictionary of trigrams for wildcard queries).
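To make the "dictionary of trigrams" concrete, here is a minimal sketch of a trigram index used to answer a wildcard query. It assumes the common construction in which each vocabulary term is padded with a `$` boundary marker, candidates are found by intersecting the postings of the query's trigrams, and a final filter discards false positives. The function names are hypothetical, and only the `*` wildcard is assumed to appear in patterns.

```python
from collections import defaultdict
from fnmatch import fnmatchcase


def kgrams(s, k=3):
    """Set of all k-character substrings of s (empty if s is shorter than k)."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def build_kgram_index(vocabulary, k=3):
    """Map each k-gram to the set of vocabulary terms that contain it."""
    index = defaultdict(set)
    for term in vocabulary:
        for gram in kgrams("$" + term + "$", k):
            index[gram].add(term)
    return index


def wildcard_lookup(pattern, index, vocabulary, k=3):
    """Vocabulary terms matching a pattern that uses '*' as its only wildcard."""
    grams = set()
    for piece in ("$" + pattern + "$").split("*"):
        grams |= kgrams(piece, k)
    if grams:
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    else:
        # Pattern too short to produce any k-gram: check every term.
        candidates = set(vocabulary)
    # Post-filter: sharing all k-grams does not guarantee a real match.
    return sorted(t for t in candidates if fnmatchcase(t, pattern))


vocab = ["red", "redo", "retired", "bread"]
index = build_kgram_index(vocab)
print(wildcard_lookup("red*", index, vocab))  # ['red', 'redo']
```

The term retired contains both trigrams of the query (`$re` and `red`) yet does not match `red*`, which is why the post-filtering step is needed.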
