An introduction to information retrieval. Manning_ Raghavan (2009) (811397), страница 8

Файл №811397 An introduction to information retrieval. Manning_ Raghavan (2009) (An introduction to information retrieval. Manning_ Raghavan (2009).pdf) 8 страницаAn introduction to information retrieval. Manning_ Raghavan (2009) (811397) страница 82020-08-252020-08-25СтудИзба

An introduction to information retrieval. Manning_ Raghavan (2009).pdf

Просмтор этого файла доступен только зарегистрированным пользователям. Но у нас супер быстрая регистрация: достаточно только электронной почты!

Регистрация/авторизация

Текст из файла (страница 8)

However,in many circumstances, either because of the nature of the query language,or just because this is the most common type of query that users submit, aquery is purely conjunctive. In this case, rather than viewing merging postings lists as a function with two inputs and a distinct output, it is more efficient to intersect each retrieved postings list with the current intermediateresult in memory, where we initialize the intermediate result by loading thepostings list of the least frequent term. This algorithm is shown in Figure 1.7.The intersection operation is then asymmetric: the intermediate results listis in memory while the list it is being intersected with is being read fromdisk.

Moreover the intermediate results list is always at least as short as theother list, and in many cases it is orders of magnitude shorter. The postingsOnline edition (c) 2009 Cambridge UP131.3 Processing Boolean queriesintersection can still be done by the algorithm in Figure 1.6, but when thedifference between the list lengths is very large, opportunities to use alternative techniques open up. The intersection can be calculated in place bydestructively modifying or marking invalid items in the intermediate resultslist.

Or the intersection can be done as a sequence of binary searches in thelong postings lists for each posting in the intermediate results list. Anotherpossibility is to store the long postings list as a hashtable, so that membershipof an intermediate result item can be calculated in constant rather than linearor log time. However, such alternative techniques are difficult to combinewith postings list compression of the sort discussed in Chapter 5. Moreover,standard postings list intersection operations remain necessary when bothterms of a query are very common.?[ ⋆]Exercise 1.4For the queries below, can we still run through the intersection in time O( x + y),where x and y are the lengths of the postings lists for Brutus and Caesar? If not, whatcan we achieve?a. Brutus AND NOT Caesarb. Brutus OR NOT Caesar[ ⋆]Exercise 1.5Extend the postings merge algorithm to arbitrary Boolean query formulas. What isits time complexity? For instance, consider:c.

(Brutus OR Caesar) AND NOT (Antony OR Cleopatra)Can we always merge in linear time? Linear in what? Can we do better than this?Exercise 1.6[⋆⋆]We can use distributive laws for AND and OR to rewrite queries.a. Show how to rewrite the query in Exercise 1.5 into disjunctive normal form usingthe distributive laws.b. Would the resulting query be more or less efficiently evaluated than the originalform of this query?c.

Is this result true in general or does it depend on the words and the contents ofthe document collection?Exercise 1.7Recommend a query processing order ford. (tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes)given the following postings list sizes:Online edition (c) 2009 Cambridge UP[ ⋆]141 Boolean retrievalTermeyeskaleidoscopemarmaladeskiestangerinetreesPostings size2133128700910791327165846653316812Exercise 1.8[⋆]If the query is:e.

friends AND romans AND (NOT countrymen)how could we use the frequency of countrymen in evaluating the best query evaluationorder? In particular, propose a way of handling negation in determining the order ofquery processing.Exercise 1.9[⋆⋆]For a conjunctive query, is processing postings lists in order of size guaranteed to beoptimal? Explain why it is, or give an example where it isn’t.Exercise 1.10Write out a postings merge algorithm, in the style of Figure 1.6 (page 11), for an xquery.[⋆⋆]ORyExercise 1.11[⋆⋆]How should the Boolean query x AND NOT y be handled? Why is naive evaluationof this query normally very expensive? Write out a postings merge algorithm thatevaluates this query efficiently.1.4RANKED RETRIEVALMODELFREE TEXT QUERIESPROXIMITY OPERATORThe extended Boolean model versus ranked retrievalThe Boolean retrieval model contrasts with ranked retrieval models such as thevector space model (Section 6.3), in which users largely use free text queries,that is, just typing one or more words rather than using a precise languagewith operators for building up query expressions, and the system decideswhich documents best satisfy the query.

Despite decades of academic research on the advantages of ranked retrieval, systems implementing the Boolean retrieval model were the main or only search option provided by largecommercial information providers for three decades until the early 1990s (approximately the date of arrival of the World Wide Web).

However, thesesystems did not have just the basic Boolean operations (AND, OR, and NOT)which we have presented so far. A strict Boolean expression over terms withan unordered results set is too limited for many of the information needsthat people have, and these systems implemented extended Boolean retrievalmodels by incorporating additional operators such as term proximity operators.

A proximity operator is a way of specifying that two terms in a queryOnline edition (c) 2009 Cambridge UP1.4 The extended Boolean model versus ranked retrieval15must occur close to each other in a document, where closeness may be measured by limiting the allowed number of intervening words or by referenceto a structural unit such as a sentence or paragraph.✎Example 1.1: Commercial Boolean searching: Westlaw. Westlaw (http://www.westlaw.com/)is the largest commercial legal search service (in terms of the number of paying subscribers), with over half a million subscribers performing millions of searches a dayover tens of terabytes of text data. The service was started in 1975.

In 2005, Booleansearch (called “Terms and Connectors” by Westlaw) was still the default, and usedby a large percentage of users, although ranked free text querying (called “NaturalLanguage” by Westlaw) was added in 1992. Here are some example Boolean querieson Westlaw:Information need: Information on the legal theories involved in preventing thedisclosure of trade secrets by employees formerly employed by a competingcompany. Query: "trade secret" /s disclos! /s prevent /s employe!Information need: Requirements for disabled people to be able to access a workplace.Query: disab! /p access! /s work-site work-place (employment /3 place)Information need: Cases about a host’s responsibility for drunk guests.Query: host! /p (responsib! liab!) /p (intoxicat! drunk!) /p guestNote the long, precise queries and the use of proximity operators, both uncommonin web search.

Submitted queries average about ten words in length. Unlike websearch conventions, a space between words represents disjunction (the tightest binding operator), & is AND and /s, /p, and /k ask for matches in the same sentence,same paragraph or within k words respectively. Double quotes give a phrase search(consecutive words); see Section 2.4 (page 39). The exclamation mark (!) gives a trailing wildcard query (see Section 3.2, page 51); thus liab! matches all words startingwith liab. Additionally work-site matches any of worksite, work-site or work site; seeSection 2.2.1 (page 22). Typical expert queries are usually carefully defined and incrementally developed until they obtain what look to be good results to the user.Many users, particularly professionals, prefer Boolean query models.

Booleanqueries are precise: a document either matches the query or it does not. This offers the user greater control and transparency over what is retrieved. And some domains, such as legal materials, allow an effective means of document ranking within aBoolean model: Westlaw returns documents in reverse chronological order, which isin practice quite effective. In 2007, the majority of law librarians still seem to recommend terms and connectors for high recall searches, and the majority of legalusers think they are getting greater control by using them.

However, this does notmean that Boolean queries are more effective for professional searchers. Indeed, experimenting on a Westlaw subcollection, Turtle (1994) found that free text queriesproduced better results than Boolean queries prepared by Westlaw’s own referencelibrarians for the majority of the information needs in his experiments. A generalproblem with Boolean search is that using AND operators tends to produce high precision but low recall searches, while using OR operators gives low precision but highrecall searches, and it is difficult or impossible to find a satisfactory middle ground.In this chapter, we have looked at the structure and construction of a basicOnline edition (c) 2009 Cambridge UP161 Boolean retrievalinverted index, comprising a dictionary and postings lists.

We introducedthe Boolean retrieval model, and examined how to do efficient retrieval vialinear time merges and simple query optimization. In Chapters 2–7 we willconsider in detail richer query models and the sort of augmented index structures that are needed to handle them efficiently. Here we just mention a fewof the main additional things we would like to be able to do:1.

We would like to better determine the set of terms in the dictionary andto provide retrieval that is tolerant to spelling mistakes and inconsistentchoice of words.2. It is often useful to search for compounds or phrases that denote a conceptsuch as “operating system”.

Характеристики

Тип файла

PDF-файл

Размер

6,58 Mb

Материал

An introduction to information retrieval. Manning_ Raghavan (2009).pdf

Тип материала

Книга

Предмет

Анализ текстовых данных и информационный поиск

Высшее учебное заведение

МГУ им. Ломоносова

Список файлов книги

an-introduction-to-information-retrieval.-manning_-raghavan-2009.pdf.rar

An introduction to information retrieval. Manning_ Raghavan (2009).pdf

Поделитесь ссылкой:

Ставлю 10/10
Все нравится, очень удобный сайт, помогает в учебе. Кроме этого, можно заработать самому, выставляя готовые учебные материалы на продажу здесь. Рейтинги и отзывы на преподавателей очень помогают сориентироваться в начале нового семестра. Спасибо за такую функцию. Ставлю максимальную оценку.

Лучшая платформа для успешной сдачи сессии
Познакомился со СтудИзбой благодаря своему другу, очень нравится интерфейс, количество доступных файлов, цена, в общем, все прекрасно. Даже сам продаю какие-то свои работы.

Студизба ван лав ❤
Очень офигенный сайт для студентов. Много полезных учебных материалов. Пользуюсь студизбой с октября 2021 года. Серьёзных нареканий нет. Хотелось бы, что бы ввели подписочную модель и сделали материалы дешевле 300 рублей в рамках подписки бесплатными.

Отличный сайт
Лично меня всё устраивает - и покупка, и продажа; и цены, и возможность предпросмотра куска файла, и обилие бесплатных файлов (в подборках по авторам, читай, ВУЗам и факультетам). Есть определённые баги, но всё решаемо, да и администраторы реагируют в течение суток.

Маленький отзыв о большом помощнике!
Студизба спасает в те моменты, когда сроки горят, а работ накопилось достаточно. Довольно удобный сайт с простой навигацией и огромным количеством материалов.

Студ. Изба как крупнейший сборник работ для студентов
Тут дофига бывает всего полезного. Печально, что бывают предметы по которым даже одного бесплатного решения нет, но это скорее вопрос к студентам. В остальном всё здорово.

Спасательный островок
Если уже не успеваешь разобраться или застрял на каком-то задание поможет тебе быстро и недорого решить твою проблему.

Всё и так отлично
Всё очень удобно. Особенно круто, что есть система бонусов и можно выводить остатки денег. Очень много качественных бесплатных файлов.

Отзыв о системе "Студизба"
Отличная платформа для распространения работ, востребованных студентами. Хорошо налаженная и качественная работа сайта, огромная база заданий и аудитория.

Отличный помощник
Отличный сайт с кучей полезных файлов, позволяющий найти много методичек / учебников / отзывов о вузах и преподователях.

Отлично помогает студентам в любой момент для решения трудных и незамедлительных задач
Хотелось бы больше конкретной информации о преподавателях. А так в принципе хороший сайт, всегда им пользуюсь и ни разу не было желания прекратить. Хороший сайт для помощи студентам, удобный и приятный интерфейс. Из недостатков можно выделить только отсутствия небольшого количества файлов.

Спасибо за шикарный сайт
Великолепный сайт на котором студент за не большие деньги может найти помощь с дз, проектами курсовыми, лабораторными, а также узнать отзывы на преподавателей и бесплатно скачать пособия.