An introduction to information retrieval. Manning_ Raghavan (2009) (811397), page 9


As the Westlaw examples show, we might also wish to do proximity queries such as Gates NEAR Microsoft. To answer such queries, the index has to be augmented to capture the proximities of terms in documents.

3. A Boolean model only records term presence or absence, but often we would like to accumulate evidence, giving more weight to documents that have a term several times as opposed to ones that contain it only once. To be able to do this we need term frequency information (the number of times a term occurs in a document) in postings lists.

4. Boolean queries just retrieve a set of matching documents, but commonly we wish to have an effective method to order (or “rank”) the returned results. This requires having a mechanism for determining a document score which encapsulates how good a match a document is for a query.

With these additional ideas, we will have seen most of the basic technology that supports ad hoc searching over unstructured information.
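The extensions listed above can be made concrete with a small sketch. It is only an illustration under stated assumptions: the tokenizer, the function names (`build_positional_index`, `near`, `rank`), the toy documents, and the choice to score a document by summing raw term frequencies are all hypothetical and are not the weighting schemes developed later in the book. Storing token positions in each posting gives both proximity information and term frequency (the number of stored positions).

```python
from collections import defaultdict

PUNCT = ".,;:!?\"'()"

def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation (a simplification)."""
    tokens = (word.strip(PUNCT).lower() for word in text.split())
    return [token for token in tokens if token]

def build_positional_index(docs):
    """Map term -> {doc_id: [positions]}; the term frequency is len(positions)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def near(index, t1, t2, k):
    """Return the doc IDs in which t1 and t2 occur within k tokens of each other."""
    p1, p2 = index.get(t1, {}), index.get(t2, {})
    hits = set()
    for doc_id in p1.keys() & p2.keys():
        if any(abs(a - b) <= k for a in p1[doc_id] for b in p2[doc_id]):
            hits.add(doc_id)
    return hits

def rank(index, query_terms):
    """Order documents by the summed raw term frequency of the query terms (assumed scoring)."""
    scores = defaultdict(int)
    for term in query_terms:
        for doc_id, positions in index.get(term, {}).items():
            scores[doc_id] += len(positions)
    # Highest score first; ties broken by doc ID for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical toy collection.
docs = {
    1: "Gates founded Microsoft. Gates led Microsoft for years.",
    2: "Microsoft released a new product.",
    3: "The garden gates were closed while Microsoft staff waited far away inside.",
}
index = build_positional_index(docs)
print(near(index, "gates", "microsoft", 2))
print(rank(index, ["gates", "microsoft"]))
```

A plain Boolean AND of the two terms would return documents 1 and 3 as an unordered set; the proximity test and the frequency-based score are what separate them.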

Ad hoc searching over documents has recently conquered the world, powering not only web search engines but the kind of unstructured search that lies behind the large eCommerce websites. Although the main web search engines differ by emphasizing free text querying, most of the basic issues and technologies of indexing and querying remain the same, as we will see in later chapters. Moreover, over time, web search engines have added at least partial implementations of some of the most popular operators from extended Boolean models: phrase search is especially popular and most have a very partial implementation of Boolean operators. Nevertheless, while these options are liked by expert searchers, they are little used by most people and are not the main focus in work on trying to improve web search engine performance.

Exercise 1.12 [⋆]
Write a query using Westlaw syntax which would find any of the words professor, teacher, or lecturer in the same sentence as a form of the verb explain.

Online edition (c) 2009 Cambridge UP

Exercise 1.13 [⋆]
Try using the Boolean search features on a couple of major web search engines.

For instance, choose a word, such as burglar, and submit the queries (i) burglar, (ii) burglar AND burglar, and (iii) burglar OR burglar. Look at the estimated number of results and top hits. Do they make sense in terms of Boolean logic? Often they haven’t for major search engines. Can you make sense of what is going on? What about if you try different words? For example, query for (i) knight, (ii) conquer, and then (iii) knight OR conquer. What bound should the number of results from the first two queries place on the third query? Is this bound observed?

1.5 References and further reading

The practical pursuit of computerized information retrieval began in the late 1940s (Cleverdon 1991, Liddy 2005). A great increase in the production of scientific literature, much in the form of less formal technical reports rather than traditional journal articles, coupled with the availability of computers, led to interest in automatic document retrieval. However, in those days, document retrieval was always based on author, title, and keywords; full-text search came much later.

The article of Bush (1945) provided lasting inspiration for the new field:

“Consider a future device for individual use, which is a sort of mechanized private file and library.

It needs a name, and, to coin one at random, ‘memex’ will do. A memex is a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility. It is an enlarged intimate supplement to his memory.”

The term Information Retrieval was coined by Calvin Mooers in 1948/1950 (Mooers 1950).

In 1958, much newspaper attention was paid to demonstrations at a conference (see Taube and Wooster 1958) of IBM “auto-indexing” machines, based primarily on the work of H. P. Luhn. Commercial interest quickly gravitated towards Boolean retrieval systems, but the early years saw a heady debate over various disparate technologies for retrieval systems. For example, Mooers (1961) dissented:

“It is a common fallacy, underwritten at this date by the investment of several million dollars in a variety of retrieval hardware, that the algebra of George Boole (1847) is the appropriate formalism for retrieval system design. This view is as widely and uncritically accepted as it is wrong.”

The observation of AND vs. OR giving you opposite extremes in a precision/recall tradeoff, but not the middle ground comes from (Lee and Fox 1988).

The book (Witten et al. 1999) is the standard reference for an in-depth comparison of the space and time efficiency of the inverted index versus other possible data structures; a more succinct and up-to-date presentation appears in Zobel and Moffat (2006).

We further discuss several approaches in Chapter 5.

Friedl (2006) covers the practical usage of regular expressions for searching. The underlying computer science appears in (Hopcroft et al. 2000).

DRAFT! © April 1, 2009 Cambridge University Press. Feedback welcome.

2 The term vocabulary and postings lists

Recall the major steps in inverted index construction:

1. Collect the documents to be indexed.
2. Tokenize the text.
3. Do linguistic preprocessing of tokens.
4. Index the documents that each term occurs in.

In this chapter we first briefly mention how the basic unit of a document can be defined and how the character sequence that it comprises is determined (Section 2.1). We then examine in detail some of the substantive linguistic issues of tokenization and linguistic preprocessing, which determine the vocabulary of terms which a system uses (Section 2.2). Tokenization is the process of chopping character streams into tokens, while linguistic preprocessing then deals with building equivalence classes of tokens which are the set of terms that are indexed. Indexing itself is covered in Chapters 1 and 4. Then we return to the implementation of postings lists.
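The four construction steps recalled above can be sketched in a few lines. This is a minimal illustration, not the book's implementation: the regular-expression tokenizer, the use of case folding as the only form of linguistic preprocessing, the function names, and the in-memory collection are all assumptions made for the example.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Step 2: chop the character stream into tokens (here, runs of ASCII letters or digits)."""
    return re.findall(r"[A-Za-z0-9]+", text)

def normalize(token):
    """Step 3: map a token to its term; case folding stands in for fuller preprocessing."""
    return token.lower()

def build_index(docs):
    """Step 4: map each term to the sorted list of doc IDs it occurs in."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            postings[normalize(token)].add(doc_id)
    return {term: sorted(doc_ids) for term, doc_ids in postings.items()}

# Step 1: collect the documents (a hypothetical in-memory collection).
docs = {
    1: "Friends, Romans, countrymen.",
    2: "So let it be with Caesar",
    3: "Romans and friends",
}
index = build_index(docs)
print(index["romans"])
```

Here the tokens Friends and friends fall into one equivalence class, so a single term, friends, is indexed for both documents 1 and 3.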

In Section 2.3, we examine an extended postings list data structure that supports faster querying, while Section 2.4 covers building postings data structures suitable for handling phrase and proximity queries, of the sort that commonly appear in both extended Boolean models and on the web.

2.1 Document delineation and character sequence decoding

2.1.1 Obtaining the character sequence in a document

Digital documents that are the input to an indexing process are typically bytes in a file or on a web server. The first step of processing is to convert this byte sequence into a linear sequence of characters. For the case of plain English text in ASCII encoding, this is trivial. But often things get much more complex. The sequence of characters may be encoded by one of various single byte or multibyte encoding schemes, such as Unicode UTF-8, or various national or vendor-specific standards. We need to determine the correct encoding. This can be regarded as a machine learning classification problem, as discussed in Chapter 13,[1] but is often handled by heuristic methods, user selection, or by using provided document metadata. Once the encoding is determined, we decode the byte sequence to a character sequence. We might save the choice of encoding because it gives some evidence about what language the document is written in.

The characters may have to be decoded out of some binary representation like Microsoft Word DOC files and/or a compressed format such as zip files. Again, we must determine the document format, and then an appropriate decoder has to be used.
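As a rough illustration of the heuristic route, the sketch below decodes a byte sequence by trusting provided metadata first, then checking for a byte order mark, then trying a short list of candidate encodings. The function name `decode_bytes`, the order of the checks, and the candidate list are assumptions chosen for the example; a production system would more likely rely on a dedicated detection library or a trained classifier.

```python
import codecs

# Candidate encodings tried in order; this list is an assumption for illustration.
FALLBACK_ENCODINGS = ("utf-8", "cp1252")

def decode_bytes(data, declared=None):
    """Return (text, encoding), using metadata, then a BOM check, then trial decoding."""
    if declared:
        try:
            return data.decode(declared), declared
        except (LookupError, UnicodeDecodeError):
            pass  # metadata was wrong or named an unknown codec; fall through
    if data.startswith(codecs.BOM_UTF8):
        return data.decode("utf-8-sig"), "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return data.decode("utf-16"), "utf-16"
    for encoding in FALLBACK_ENCODINGS:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: latin-1 maps every byte to a character, so it never fails.
    return data.decode("latin-1"), "latin-1"

text, encoding = decode_bytes("naïve".encode("cp1252"))
print(text, encoding)
```

Returning the chosen encoding alongside the text keeps it available as evidence about the document's language, as suggested above.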

Even for plain text documents, additional decoding may need to be done. In XML documents (Section 10.1, page 197), character entities, such as &amp;amp;, need to be decoded to give the correct character, namely & for &amp;amp;. Finally, the textual part of the document may need to be extracted out of other material that will not be processed.
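Both steps can be sketched with Python's standard library. The sample strings and variable names are hypothetical; `html.unescape` decodes a character entity in an isolated string, while parsing with `xml.etree.ElementTree` decodes entities and lets us keep only the text between the tags.

```python
import html
import xml.etree.ElementTree as ET

# Decode a character entity in an isolated string.
decoded = html.unescape("AT&amp;T")

# Parsing XML decodes entities; itertext() then yields only the textual content.
xml_source = "<doc><title>R&amp;D report</title><body>Profit &lt; cost</body></doc>"
root = ET.fromstring(xml_source)
text = " ".join(part.strip() for part in root.itertext() if part.strip())

print(decoded)
print(text)
```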

This might be the desired handling for XML files, if the markup is going to be ignored; we would almost certainly want to do this with postscript or PDF files. We will not deal further with these issues in this book, and will assume henceforth that our documents are a list of characters. Commercial products usually need to support a broad range of document types and encodings, since users want things to just work with their data as is. Often, they just think of documents as text inside applications and are not even aware of how it is encoded on disk. This problem is usually solved by licensing a software library that handles decoding document formats and character encodings.

The idea that text is a linear sequence of characters is also called into question by some writing systems, such as Arabic, where text takes on some two dimensional and mixed order characteristics, as shown in Figures 2.1 and 2.2. But, despite some complicated writing system conventions, there is an underlying sequence of sounds being represented and hence an essentially linear structure remains, and this is what is represented in the digital representation of Arabic, as shown in Figure 2.1.

2.1.2 Choosing a document unit

The next phase is to determine what the document unit for indexing is.

Thus far we have assumed that documents are fixed units for the purposes of indexing. For example, we take each file in a folder as a document. But there

[1] A classifier is a function that takes objects of some sort and assigns them to one of a number of distinct classes (see Chapter 13). Usually classification is done by machine learning methods such as probabilistic models, but it can also be done by hand-written rules.

[Figure: the letters ك ِ ت ا ب ٌ (read right to left as k, i, t, ā, b, un) combine into the written word /kitābun/ ‘a book’.]

Figure 2.1 An example of a vocalized Modern Standard Arabic word. The writing is from right to left and letters undergo complex mutations as they are combined. The representation of short vowels (here, /i/ and /u/) and the final /n/ (nunation) departs from strict linearity by being represented as diacritics above and below letters. Nevertheless, the represented text is still clearly a linear ordering of characters representing sounds. Full vocalization, as here, normally appears only in the Koran and children’s books.

