An introduction to information retrieval. Manning_ Raghavan (2009) (811397), page 11


For instance, people might want to search in a bug database for the line number where an error occurs. Items such as the date of an email, which have a clear semantic type, are often indexed separately as document metadata (see Section 6.1, page 110).

In English, hyphenation is used for various purposes ranging from splitting up vowels in words (co-education) to joining nouns as names (Hewlett-Packard) to a copyediting device to show word grouping (the hold-him-back-and-drag-him-away maneuver). It is easy to feel that the first example should be regarded as one token (and is indeed more commonly written as just coeducation), the last should be separated into words, and that the middle case is unclear.

Handling hyphens automatically can thus be complex: it can either be done as a classification problem, or more commonly by some heuristic rules, such as allowing short hyphenated prefixes on words, but not longer hyphenated forms.

3. For the free text case, this is straightforward. The Boolean case is more complex: this tokenization may produce multiple terms from one query word. This can be handled by combining the terms with an AND or as a phrase query (see Section 2.4, page 39). It is harder for a system to handle the opposite case where the user entered as two terms something that was tokenized together in the document processing.

Online edition (c) 2009 Cambridge UP
2.2 Determining the vocabulary of terms

Conceptually, splitting on white space can also split what should be regarded as a single token. This occurs most commonly with names (San Francisco, Los Angeles) but also with borrowed foreign phrases (au fait) and compounds that are sometimes written as a single word and sometimes space separated (such as white space vs. whitespace). Other cases with internal spaces that we might wish to regard as a single token include phone numbers ((800) 234-2333) and dates (Mar 11, 1983). Splitting tokens on spaces can cause bad retrieval results, for example, if a search for York University mainly returns documents containing New York University. The problems of hyphens and non-separating whitespace can even interact. Advertisements for air fares frequently contain items like San Francisco-Los Angeles, where simply doing whitespace splitting would give unfortunate results.
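The hyphen heuristic mentioned above (keep short hyphenated prefixes, break up longer hyphenated forms) can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `tokenize` and the threshold `MAX_PREFIX_LEN = 4` are hypothetical choices, not values given in the text, and punctuation handling is ignored.

```python
# Hypothetical threshold: a hyphenated prefix of at most this many
# characters stays attached to the word that follows it.
MAX_PREFIX_LEN = 4


def tokenize(text, max_prefix_len=MAX_PREFIX_LEN):
    """Split on whitespace, then apply a simple hyphen heuristic."""
    tokens = []
    for chunk in text.split():
        parts = chunk.split("-")
        is_short_prefix = (
            len(parts) == 2
            and 0 < len(parts[0]) <= max_prefix_len
            and parts[1] != ""
        )
        if is_short_prefix:
            # Short prefix such as "co-": keep the hyphenated word whole.
            tokens.append(chunk)
        else:
            # Longer hyphenated forms are broken into their pieces.
            tokens.extend(p for p in parts if p)
    return tokens


print(tokenize("co-education"))
print(tokenize("the hold-him-back-and-drag-him-away maneuver"))
# Plain whitespace splitting leaves the unfortunate token "Francisco-Los":
print("San Francisco-Los Angeles".split())
print(tokenize("San Francisco-Los Angeles"))
```

Under this sketch co-education survives as one token while the copyediting example is broken into words. The air-fare example shows why neither rule is enough on its own: splitting the hyphen avoids the token Francisco-Los, but the two city names are still not recognized as units.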

In such cases, issues of tokenization interact with handling phrase queries (which we discuss in Section 2.4 (page 39)), particularly if we would like queries for all of lowercase, lower-case and lower case to return the same results. The last two can be handled by splitting on hyphens and using a phrase index. Getting the first case right would depend on knowing that it is sometimes written as two words and also indexing it in this way. One effective strategy in practice, which is used by some Boolean retrieval systems such as Westlaw and Lexis-Nexis (Example 1.1), is to encourage users to enter hyphens wherever they may be possible, and whenever there is a hyphenated form, the system will generalize the query to cover all three of the one word, hyphenated, and two word forms, so that a query for over-eager will search for over-eager OR “over eager” OR overeager.
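The generalization of a hyphenated query term into its three written forms might be sketched as below. The function name `generalize_hyphenated` and the plain-string query syntax are assumptions made for illustration; they are not the actual query language of Westlaw or Lexis-Nexis.

```python
def generalize_hyphenated(term):
    """Expand a hyphenated query term into hyphenated, phrase and one-word forms.

    Terms entered without a hyphen are returned unchanged, since the
    hyphen is what triggers the generalization.
    """
    if "-" not in term:
        return term
    parts = [p for p in term.split("-") if p]
    if len(parts) < 2:
        # Degenerate input such as "-" or "over-": nothing to generalize.
        return term
    hyphenated = "-".join(parts)
    phrase = '"' + " ".join(parts) + '"'
    one_word = "".join(parts)
    return f"{hyphenated} OR {phrase} OR {one_word}"


print(generalize_hyphenated("over-eager"))  # over-eager OR "over eager" OR overeager
print(generalize_hyphenated("overeager"))   # overeager
```

The second call shows the limitation of the approach: a user who types the one-word form gets no expansion at all.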

However, this strategy depends on user training, since if you query using either of the other two forms, you get no generalization.

Each new language presents some new issues. For instance, French has a variant use of the apostrophe for a reduced definite article ‘the’ before a word beginning with a vowel (e.g., l’ensemble) and has some uses of the hyphen with postposed clitic pronouns in imperatives and questions (e.g., donne-moi ‘give me’). Getting the first case correct will affect the correct indexing of a fair percentage of nouns and adjectives: you would want documents mentioning both l’ensemble and un ensemble to be indexed under ensemble.

Other languages make the problem harder in new ways.

German writes compound nouns without spaces (e.g., Computerlinguistik ‘computational linguistics’; Lebensversicherungsgesellschaftsangestellter ‘life insurance company employee’). Retrieval systems for German greatly benefit from the use of a compound-splitter module, which is usually implemented by seeing if a word can be subdivided into multiple words that appear in a vocabulary. This phenomenon reaches its limit case with major East Asian languages (e.g., Chinese, Japanese, Korean, and Thai), where text is written without any spaces between words. An example is shown in Figure 2.3.
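A vocabulary-based compound splitter of the kind described above for German could look roughly like this. It is a minimal sketch under stated assumptions: the function name `split_compound`, the `min_part_len` parameter, and the toy vocabulary are hypothetical, and linking elements between parts (such as the -s- in Lebensversicherung) are not handled.

```python
def split_compound(word, vocabulary, min_part_len=3):
    """Return vocabulary words whose concatenation is `word`, or None.

    Longer parts are tried first, so a word that is itself in the
    vocabulary comes back as a single-element list.
    """
    word = word.lower()
    memo = {}  # start position -> best split of word[start:], or None

    def helper(start):
        if start == len(word):
            return []
        if start in memo:
            return memo[start]
        result = None
        # Candidate end positions, from the longest part to the shortest.
        for end in range(len(word), start + min_part_len - 1, -1):
            part = word[start:end]
            if part in vocabulary:
                rest = helper(end)
                if rest is not None:
                    result = [part] + rest
                    break
        memo[start] = result
        return result

    return helper(0)


toy_vocabulary = {"computer", "linguistik"}  # illustrative only
print(split_compound("Computerlinguistik", toy_vocabulary))
```

With the toy vocabulary the call yields the two parts computer and linguistik; a word that cannot be covered by vocabulary entries yields None and would be indexed whole.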

2 The term vocabulary and postings lists

◮ Figure 2.3 The standard unsegmented form of Chinese text using the simplified characters of mainland China. There is no whitespace between words, not even between sentences – the apparent space after the Chinese period (◦) is just a typographical illusion caused by placing the character on the left side of its square box. The first sentence is just words in Chinese characters with no spaces between them. The second and third sentences include Arabic numerals and punctuation breaking up the Chinese characters.

◮ Figure 2.4 Ambiguities in Chinese word segmentation. The two characters can be treated as one word meaning ‘monk’ or as a sequence of two words meaning ‘and’ and ‘still’.

◮ Figure 2.5 A stop list of 25 semantically non-selective words which are common in Reuters-RCV1.
a an and are as at be by for from has he in is it its of on that the to was were will with

One approach here is to perform word segmentation as prior linguistic processing. Methods of word segmentation vary from having a large vocabulary and taking the longest vocabulary match with some heuristics for unknown words to the use of machine learning sequence models, such as hidden Markov models or conditional random fields, trained over hand-segmented words (see the references in Section 2.5). Since there are multiple possible segmentations of character sequences (see Figure 2.4), all such methods make mistakes sometimes, and so you are never guaranteed a consistent unique tokenization.
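The simplest of the segmentation methods mentioned above, greedy longest vocabulary match, can be sketched as follows. Several things here are assumptions for illustration: the function name `segment_longest_match`, the `max_word_len` limit, the fallback of emitting a single character for unknown material, and the example characters 和尚 (‘monk’; separately 和 ‘and’ and 尚 ‘still’), which were chosen to mirror the kind of ambiguity Figure 2.4 describes.

```python
def segment_longest_match(text, vocabulary, max_word_len=4):
    """Greedy left-to-right segmentation taking the longest vocabulary match."""
    tokens = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, then progressively shorter ones.
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in vocabulary:
                match = candidate
                break
        if match is None:
            # Unknown-word heuristic (an assumption): emit one character.
            match = text[i]
        tokens.append(match)
        i += len(match)
    return tokens


print(segment_longest_match("和尚", {"和尚", "和", "尚"}))  # one word
print(segment_longest_match("和尚", {"和", "尚"}))          # two words
```

The two calls give different tokenizations of the same string depending only on the vocabulary, which is one concrete way in which such methods fail to guarantee a consistent unique tokenization.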

The other approach is to abandon word-based indexing and to do all indexing via just short subsequences of characters (character k-grams), regardless of whether particular sequences cross word boundaries or not. Three reasons why this approach is appealing are that an individual Chinese character is more like a syllable than a letter and usually has some semantic content, that most words are short (the commonest length is 2 characters), and that, given the lack of standardization of word breaking in the writing system, it is not always clear where word boundaries should be placed anyway.
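Indexing by character k-grams needs no segmentation step at all, as the following sketch suggests. The names `char_kgrams` and `build_kgram_index` and the choice k = 2 are assumptions for illustration; the index is a plain in-memory dictionary rather than a real postings structure.

```python
from collections import defaultdict


def char_kgrams(text, k=2):
    """Return all overlapping character k-grams of `text`, in order."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]


def build_kgram_index(docs, k=2):
    """Map each character k-gram to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in char_kgrams(text, k):
            index[gram].add(doc_id)
    return index


print(char_kgrams("abcd", 2))  # ['ab', 'bc', 'cd']
index = build_kgram_index({1: "abcd", 2: "bcde"})
print(sorted(index["bc"]))     # [1, 2]
```

Every k-gram is indexed whether or not it straddles a word boundary, which is exactly the property that makes the approach independent of where words are taken to begin and end.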

Even in English, some cases of where to put word boundaries are just orthographic conventions – think of notwithstanding vs. not to mention or into vs. on to – but people are educated to write the words with consistent use of spaces.

2.2.2 Dropping common terms: stop words

Sometimes, some extremely common words which would appear to be of little value in helping select documents matching a user need are excluded from the vocabulary entirely. These words are called stop words. The general strategy for determining a stop list is to sort the terms by collection frequency (the total number of times each term appears in the document collection), and then to take the most frequent terms, often hand-filtered for their semantic content relative to the domain of the documents being indexed, as a stop list, the members of which are then discarded during indexing.
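The strategy just described, ranking terms by collection frequency and taking the top of the list after hand-filtering, might be sketched like this. The function name `build_stop_list`, the `keep` parameter standing in for the manual filtering step, and the toy documents are all assumptions for illustration.

```python
from collections import Counter


def build_stop_list(tokenized_docs, size, keep=()):
    """Return the `size` most frequent terms by collection frequency.

    `keep` lists frequent terms that a human editor has decided are too
    meaningful in this domain to discard.
    """
    collection_frequency = Counter()
    for tokens in tokenized_docs:
        # Count every occurrence, not just one per document.
        collection_frequency.update(tokens)
    keep = set(keep)
    # Most frequent first; ties broken alphabetically for determinism.
    ranked = sorted(collection_frequency.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked if term not in keep][:size]


docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat"],
]
print(build_stop_list(docs, 2))                # ['the', 'sat']
print(build_stop_list(docs, 2, keep={"sat"}))  # ['the', 'cat']
```

The second call shows the effect of hand-filtering: a frequent but meaningful term is retained and the next candidate takes its place on the list.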

An example of a stop list is shown in Figure 2.5. Using a stop list significantly reduces the number of postings that a system has to store; we will present some statistics on this in Chapter 5 (see Table 5.1, page 87). And a lot of the time not indexing stop words does little harm: keyword searches with terms like the and by don’t seem very useful. However, this is not true for phrase searches. The phrase query “President of the United States”, which contains two stop words, is more precise than President AND “United States”. The meaning of flights to London is likely to be lost if the word to is stopped out. A search for Vannevar Bush’s article As we may think will be difficult if the first three words are stopped out, and the system searches simply for documents containing the word think. Some special query types are disproportionately affected. Some song titles and well known pieces of verse consist entirely of words that are commonly on stop lists (To be or not to be, Let It Be, I don’t want to be, . . . ).

The general trend in IR systems over time has been from standard use of quite large stop lists (200–300 terms) to very small stop lists (7–12 terms) to no stop list whatsoever. Web search engines generally do not use stop lists.

Some of the design of modern IR systems has focused precisely on how we can exploit the statistics of language so as to be able to cope with common words in better ways. We will show in Section 5.3 (page 95) how good compression techniques greatly reduce the cost of storing the postings for common words. Section 6.2.1 (page 117) then discusses how standard term weighting leads to very common words having little impact on document rankings. Finally, Section 7.1.5 (page 140) shows how an IR system with impact-sorted indexes can terminate scanning a postings list early when weights get small, and hence common words do not cause a large additional processing cost for the average query, even though postings lists for stop words are very long. So for most modern IR systems, the additional cost of including stop words is not that big – neither in terms of index size nor in terms of query processing time.

Query term    Terms in documents that should be matched
Windows       Windows
windows       Windows, windows, window
window        window, windows

◮ Figure 2.6 An example of how asymmetric expansion of query terms can usefully model users’ expectations.

2.2.3 Normalization (equivalence classing of terms)

Having broken up our documents (and also our query) into tokens, the easy case is if tokens in the query just match tokens in the token list of the document.

However, there are many cases when two character sequences are not quite the same but you would like a match to occur. For instance, if you search for USA, you might hope to also match documents containing U.S.A. Token normalization is the process of canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens. The most standard way to normalize is to implicitly create equivalence classes, which are normally named after one member of the set.
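Implicit equivalence classing can be illustrated with a small sketch in which the same mapping is applied to document tokens at indexing time and to query tokens at search time. The particular rule used here, deleting periods, is an assumption chosen to fit the USA example; the names `normalize`, `build_index` and `search` are hypothetical.

```python
from collections import defaultdict


def normalize(token):
    """Map a token to the name of its equivalence class (here: drop periods)."""
    return token.replace(".", "")


def build_index(docs):
    """Index each document under the normalized form of its tokens."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[normalize(token)].add(doc_id)
    return index


def search(index, query_token):
    """Look up a query token after applying the same normalization."""
    return sorted(index.get(normalize(query_token), set()))


docs = {1: ["made", "in", "U.S.A."], 2: ["USA", "today"]}
index = build_index(docs)
print(search(index, "USA"))     # [1, 2]
print(search(index, "U.S.A."))  # [1, 2]
```

Both spellings fall into the class named USA, so either form of the query retrieves both documents without the classes ever being listed explicitly.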
