An introduction to information retrieval. Manning_ Raghavan (2009) (811397), page 13


Online edition (c) 2009 Cambridge UP
2 The term vocabulary and postings lists

For instance, a French email might quote clauses from a contract document written in English. Most commonly, the language is detected and language-particular tokenization and normalization rules are applied at a predetermined granularity, such as whole documents or individual paragraphs, but this still will not correctly deal with cases where language changes occur for brief quotations. When document collections contain multiple languages, a single index may have to contain terms of several languages.

One option is to run a language identification classifier on documents and then to tag terms in the vocabulary for their language. Or this tagging can simply be omitted, since it is relatively rare for the exact same character sequence to be a word in different languages.

When dealing with foreign or complex words, particularly foreign names, the spelling may be unclear or there may be variant transliteration standards giving different spellings (for example, Chebyshev and Tchebycheff or Beijing and Peking).

One way of dealing with this is to use heuristics to equivalence class or expand terms with phonetic equivalents. The traditional and best known such algorithm is the Soundex algorithm, which we cover in Section 3.4 (page 63).

2.2.4 Stemming and lemmatization

For grammatical reasons, documents are going to use different forms of a word, such as organize, organizes, and organizing.

Additionally, there are families of derivationally related words with similar meanings, such as democracy, democratic, and democratization. In many situations, it seems as if it would be useful for a search for one of these words to return documents that contain another word in the set.

The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.

For instance:

    am, are, is ⇒ be
    car, cars, car’s, cars’ ⇒ car

The result of this mapping of text will be something like:

    the boy’s cars are different colors ⇒ the boy car be differ color

However, the two words differ in their flavor. Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma. If confronted with the token saw, stemming might return just s, whereas lemmatization would attempt to return either see or saw depending on whether the use of the token was as a verb or a noun.
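The contrast between the two approaches can be sketched in a few lines of Python. This is only a toy illustration: the suffix list, the lemma table, and the names crude_stem and lemmatize are hypothetical and do not correspond to any real stemmer or lemmatizer. The part-of-speech tag is assumed to be supplied by the caller.

```python
# Toy contrast between suffix chopping and dictionary lookup.
# SUFFIXES and LEMMAS are hypothetical illustrations, not real resources.

SUFFIXES = ("ing", "es", "s", "ed")  # tried in this order


def crude_stem(token: str) -> str:
    """Chop the first matching suffix, keeping at least two characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 2:
            return token[: -len(suffix)]
    return token


# (token, part of speech) -> lemma. A real lemmatizer would derive this
# from a full vocabulary and morphological analysis.
LEMMAS = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("am", "VERB"): "be",
    ("are", "VERB"): "be",
    ("is", "VERB"): "be",
    ("cars", "NOUN"): "car",
}


def lemmatize(token: str, pos: str) -> str:
    """Look up the dictionary form; fall back to the token itself."""
    return LEMMAS.get((token, pos), token)
```

Here crude_stem needs no knowledge beyond its suffix list, while lemmatize can only separate the verb and noun readings of saw because it is given a part of speech and a vocabulary entry for each.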

The two may also differ in that stemming most commonly collapses derivationally related words, whereas lemmatization commonly only collapses the different inflectional forms of a lemma. Linguistic processing for stemming or lemmatization is often done by an additional plug-in component to the indexing process, and a number of such components exist, both commercial and open-source.

The most common algorithm for stemming English, and one that has repeatedly been shown to be empirically very effective, is Porter’s algorithm (Porter 1980).

The entire algorithm is too long and intricate to present here, but we will indicate its general nature. Porter’s algorithm consists of 5 phases of word reductions, applied sequentially. Within each phase there are various conventions to select rules, such as selecting the rule from each rule group that applies to the longest suffix. In the first phase, this convention is used with the following rule group:

(2.1)   Rule            Example
        SSES → SS       caresses → caress
        IES  → I        ponies   → poni
        SS   → SS       caress   → caress
        S    →          cats     → cat

Many of the later rules use a concept of the measure of a word, which loosely checks the number of syllables to see whether a word is long enough that it is reasonable to regard the matching portion of a rule as a suffix rather than as part of the stem of a word.
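The longest-suffix convention for the first-phase rule group can be sketched as follows. This covers only the four rules listed above, assumes lowercase input, and is not the full Porter algorithm; the names STEP_1A_RULES and apply_longest_suffix_rule are our own.

```python
# The four rules of the first-phase rule group, as (suffix, replacement).
STEP_1A_RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]


def apply_longest_suffix_rule(word: str, rules=STEP_1A_RULES) -> str:
    """Apply the single rule in the group whose suffix is the longest match."""
    matching = [(suffix, repl) for suffix, repl in rules if word.endswith(suffix)]
    if not matching:
        return word  # no rule in this group applies
    suffix, repl = max(matching, key=lambda rule: len(rule[0]))
    return word[: len(word) - len(suffix)] + repl
```

The rule SS → SS looks like a no-op, but it matters: because it is a longer match than S →, it prevents caress from being reduced to cares.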

For example, the rule:

    (m > 1) EMENT →

would map replacement to replac, but not cement to c. The official site for the Porter Stemmer is:

    http://www.tartarus.org/~martin/PorterStemmer/

Other stemmers exist, including the older, one-pass Lovins stemmer (Lovins 1968), and newer entrants like the Paice/Husk stemmer (Paice 1990); see:

    http://www.cs.waikato.ac.nz/~eibe/stemmers/
    http://www.comp.lancs.ac.uk/computing/research/stemming/

Figure 2.8 presents an informal comparison of the different behaviors of these stemmers.
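To make the measure condition on the EMENT rule concrete, here is a minimal sketch. It assumes Porter's usual definition, which the text above does not spell out: a stem is viewed as [C](VC)^m[V], where C and V are runs of consonants and vowels, and y counts as a vowel when it follows a consonant. The function names are our own, and input is assumed to be lowercase.

```python
VOWELS = set("aeiou")


def is_consonant(word: str, i: int) -> bool:
    """Return True if word[i] is a consonant under the assumed definition."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        # y is a consonant at the start of a word or after a vowel,
        # and a vowel after a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem: str) -> int:
    """Count the vowel-to-consonant transitions, i.e. m in [C](VC)^m[V]."""
    m = 0
    prev_is_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_is_vowel:
            m += 1
        prev_is_vowel = not cons
    return m


def strip_ement(word: str) -> str:
    """Apply the rule (m > 1) EMENT -> (empty) to a lowercase word."""
    if word.endswith("ement"):
        stem = word[: -len("ement")]
        if measure(stem) > 1:
            return stem
    return word
```

With this definition replac has measure 2, so the suffix is removed, while c has measure 0, so cement is left alone.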

Stemmers use language-specific rules, but they require less knowledge than a lemmatizer, which needs a complete vocabulary and morphological analysis to correctly lemmatize words. Particular domains may also require special stemming rules. However, the exact stemmed form does not matter, only the equivalence classes it forms.

Rather than using a stemmer, you can use a lemmatizer, a tool from Natural Language Processing which does full morphological analysis to accurately identify the lemma for each word. Doing full morphological analysis produces at most very modest benefits for retrieval. It is hard to say more, because either form of normalization tends not to improve English information retrieval performance in aggregate, at least not by very much.

Sample text: Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation

Lovins stemmer: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres

Porter stemmer: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to a pictur of express that is more biolog transpar and access to interpret

Paice stemmer: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret

Figure 2.8 A comparison of three stemming algorithms on a sample text.

While it helps a lot for some queries, it equally hurts performance a lot for others. Stemming increases recall while harming precision. As an example of what can go wrong, note that the Porter stemmer stems all of the following words:

    operate operating operates operation operative operatives operational

to oper. However, since operate in its various forms is a common verb, we would expect to lose considerable precision on queries such as the following with Porter stemming:

    operational AND research
    operating AND system
    operative AND dentistry

For a case like this, moving to using a lemmatizer would not completely fix the problem because particular inflectional forms are used in particular collocations: a sentence with the words operate and system is not a good match for the query operating AND system.
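The loss of precision can be seen in a small Boolean retrieval sketch. The two-document collection is hypothetical, and the stem table simply hard-codes the conflation to oper listed above rather than running a real Porter stemmer; the function names are our own.

```python
# All seven forms map to "oper", as listed in the text.
STEM = {
    word: "oper"
    for word in (
        "operate operating operates operation "
        "operative operatives operational"
    ).split()
}

# Hypothetical two-document collection.
DOCS = {
    1: "the operating system schedules processes",
    2: "surgeons operate within a hospital system",
}


def normalize(token: str, use_stemming: bool) -> str:
    """Lowercase the token and optionally map it to its stem."""
    token = token.lower()
    return STEM.get(token, token) if use_stemming else token


def build_index(docs: dict, use_stemming: bool) -> dict:
    """Build a term -> set of document IDs inverted index."""
    index = {}
    for doc_id, text in docs.items():
        for token in text.split():
            index.setdefault(normalize(token, use_stemming), set()).add(doc_id)
    return index


def boolean_and(index: dict, terms: list, use_stemming: bool) -> set:
    """Intersect the postings of all query terms."""
    postings = [index.get(normalize(t, use_stemming), set()) for t in terms]
    return set.intersection(*postings) if postings else set()
```

Without stemming, the query operating AND system matches only document 1. With stemming, document 2 is also returned, because operate and operating now share the term oper: recall can only grow, but the extra match is not relevant to the query.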

Getting better value from term normalization depends more on pragmatic issues of word use than on formal issues of linguistic morphology.

The situation is different for languages with much more morphology (such as Spanish, German, and Finnish). Results in the European CLEF evaluations have repeatedly shown quite large gains from the use of stemmers (and compound splitting for languages like German); see the references in Section 2.5.

Exercise 2.1 [⋆]
Are the following statements true or false?
a. In a Boolean retrieval system, stemming never lowers precision.
b. In a Boolean retrieval system, stemming never lowers recall.
c. Stemming increases the size of the vocabulary.
d. Stemming should be invoked at indexing time but not while processing a query.

Exercise 2.2 [⋆]
Suggest what normalized form should be used for these words (including the word itself as a possibility):
a. ’Cos
b. Shi’ite
c. cont’d
d. Hawai’i
e.
