An Introduction to Information Retrieval. Manning, Raghavan (2009), page 81


We provide references to some work on hierarchical classification in Section 15.5.6. Using the small hierarchy in Figure 13.1 (page 257) as an example, the leaf classes are ones like poultry and coffee, as opposed to higher-up classes like industries.

Online edition (c) 2009 Cambridge UP
15 Support vector machines and machine learning on documents

A general result in machine learning is that you can always get a small boost in classification accuracy by combining multiple classifiers, provided only that the mistakes that they make are at least somewhat independent. There is now a large literature on techniques such as voting, bagging, and boosting multiple classifiers. Again, there are some pointers in the references.
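As a rough illustration of combining classifiers by voting, the following minimal sketch takes a majority vote over the labels assigned by several classifiers. The function names, the tie-breaking rule, and the toy predictions are assumptions made for this example and are not taken from the text. The toy data is constructed so that each classifier errs on a different document, which is the favourable case of independent mistakes.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label chosen by the most classifiers.

    `predictions` holds one label per classifier. Ties go to the label
    that appears first in the list (an arbitrary choice for this sketch).
    """
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


def vote_all(per_classifier_predictions):
    """Combine the outputs of several classifiers, document by document."""
    return [majority_vote(list(votes))
            for votes in zip(*per_classifier_predictions)]


# Hypothetical outputs of three classifiers on four documents.
# Each classifier makes exactly one mistake, on a different document.
gold = ["spam", "ham", "spam", "ham"]
clf_a = ["ham", "ham", "spam", "ham"]
clf_b = ["spam", "spam", "spam", "ham"]
clf_c = ["spam", "ham", "ham", "ham"]

combined = vote_all([clf_a, clf_b, clf_c])
```

Here every individual mistake is outvoted by the other two classifiers. If the classifiers tended to fail on the same documents, voting would help far less.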

Nevertheless, ultimately a hybrid automatic/manual solution may be needed to achieve sufficient classification accuracy. A common approach in such situations is to run a classifier first, and to accept all its high confidence decisions, but to put low confidence decisions in a queue for manual review. Such a process also automatically leads to the production of new training data which can be used in future versions of the machine learning classifier. However, note that this is a case in point where the resulting training data is clearly not randomly sampled from the space of documents.

Features for text

The default in both ad hoc retrieval and text classification is to use terms as features.

However, for text classification, a great deal of mileage can be achieved by designing additional features which are suited to a specific problem. Unlike the case of IR query languages, since these features are internal to the classifier, there is no problem of communicating these features to an end user.

This process is generally referred to as feature engineering. At present, feature engineering remains a human craft, rather than something done by machine learning. Good feature engineering can often markedly improve the performance of a text classifier. It is especially beneficial in some of the most important applications of text classification, like spam and porn filtering.

Classification problems will often contain large numbers of terms which can be conveniently grouped, and which have a similar vote in text classification problems. Typical examples might be year mentions or strings of exclamation marks. Or they may be more specialized tokens like ISBNs or chemical formulas. Often, using them directly in a classifier would greatly increase the vocabulary without providing classificatory power beyond knowing that, say, a chemical formula is present.

In such cases, the number of features and feature sparseness can be reduced by matching such items with regular expressions and converting them into distinguished tokens. Consequently, effectiveness and classifier speed are normally enhanced. Sometimes all numbers are converted into a single feature, but often some value can be had by distinguishing different kinds of numbers, such as four digit numbers (which are usually years) versus other cardinal numbers versus real numbers with a decimal point.
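The mapping of such items to distinguished tokens might be sketched as follows. This is only an illustration under stated assumptions: the placeholder names (#YEAR#, #REAL#, #NUM#, #EXCL#), the particular patterns, and the assumption that the text is already tokenized are all hypothetical choices and not prescribed by the text.

```python
import re

# Patterns are tried in order, so the more specific ones come first.
# The placeholder token names are hypothetical.
PATTERNS = [
    (re.compile(r"\d{4}"), "#YEAR#"),      # four digit numbers, usually years
    (re.compile(r"\d+\.\d+"), "#REAL#"),   # real numbers with a decimal point
    (re.compile(r"\d+"), "#NUM#"),         # other cardinal numbers
    (re.compile(r"!{2,}"), "#EXCL#"),      # strings of exclamation marks
]


def normalize_token(token):
    """Replace a token by a distinguished placeholder if a pattern matches it entirely."""
    for pattern, placeholder in PATTERNS:
        if pattern.fullmatch(token):
            return placeholder
    return token


def normalize(tokens):
    """Apply normalize_token to every token of an already tokenized text."""
    return [normalize_token(t) for t in tokens]


tokens = ["In", "1999", "sales", "rose", "3.5", "percent",
          "to", "120", "units", "!!!"]
normalized = normalize(tokens)
```

Every year mention now contributes to one shared feature rather than to a separate, sparsely observed feature per year, which is the reduction in vocabulary size and sparseness described above.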

Similar techniques can be applied to dates, ISBN numbers, sports game scores, and so on.

15.3 Issues in the classification of text documents

Going in the other direction, it is often useful to increase the number of features by matching parts of words, and by matching selected multiword patterns that are particularly discriminative. Parts of words are often matched by character k-gram features. Such features can be particularly good at providing classification clues for otherwise unknown words when the classifier is deployed. For instance, an unknown word ending in -rase is likely to be an enzyme, even if it wasn’t seen in the training data.
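A minimal sketch of character k-gram extraction is given below. The function name, the lowercasing step, and the use of "$" as a word boundary marker are assumptions made for this example. The boundary marker lets a k-gram record that it occurs at the end of a word, which is what makes a suffix such as -rase usable as a feature.

```python
def char_kgrams(word, k=4, boundary="$"):
    """Return the character k-grams of a word, padded with a boundary marker.

    Words shorter than k after padding yield the padded word itself.
    """
    padded = f"{boundary}{word.lower()}{boundary}"
    if len(padded) < k:
        return [padded]
    return [padded[i:i + k] for i in range(len(padded) - k + 1)]


# Two enzyme names share the word-final 5-gram "rase$", so a classifier
# that has seen one of them in training has a feature for the other.
seen = set(char_kgrams("transferase", k=5))
unseen = set(char_kgrams("polymerase", k=5))
shared = seen & unseen
```

With these settings, the set `shared` contains the word-final k-gram `rase$`, along with `erase`, which the two words also have in common.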

Good multiword patterns are often found by looking for distinctively common word pairs (perhaps using a mutual information criterion between words, in a similar way to its use in Section 13.5.1 (page 272) for feature selection) and then using feature selection methods evaluated against classes. They are useful when the components of a compound would themselves be misleading as classification cues. For instance, this would be the case if the keyword ethnic was most indicative of the categories food and arts, the keyword cleansing was most indicative of the category home, but the collocation ethnic cleansing instead indicates the category world news. Some text classifiers also make use of features from named entity recognizers (cf. page 195).

Do techniques like stemming and lowercasing (Section 2.2, page 22) help for text classification? As always, the ultimate test is empirical evaluations conducted on an appropriate test collection. But it is nevertheless useful to note that such techniques have a more restricted chance of being useful for classification. For IR, you often need to collapse forms of a word like oxygenate and oxygenation, because the appearance of either in a document is a good clue that the document will be relevant to a query about oxygenation. Given copious training data, stemming necessarily delivers no value for text classification. If several forms that stem together have a similar signal, the parameters estimated for all of them will have similar weights.

Techniques like stemming help only in compensating for data sparseness. This can be a useful role (as noted at the start of this section), but often different forms of a word can convey significantly different cues about the correct document classification. Overly aggressive stemming can easily degrade classification performance.

Document zones in text classification

As already discussed in Section 6.1, documents usually have zones, such as mail message headers like the subject and author, or the title and keywords of a research article. Text classifiers can usually gain from making use of these zones during training and classification.

Upweighting document zones.

In text classification problems, you can frequently get a nice boost to effectiveness by differentially weighting contributions from different document zones. Often, upweighting title words is particularly effective (Cohen and Singer 1999, p. 163). As a rule of thumb, it is often effective to double the weight of title words in text classification problems.
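One simple way to realize such upweighting is to scale each term occurrence by a weight that depends on the zone it appears in. The sketch below assumes a document represented as a dictionary from zone names to token lists. That representation, the function name, and the zone names are hypothetical. The weight of 2.0 for the title follows the rule of thumb above.

```python
from collections import Counter


def zone_weighted_counts(zones, weights, default_weight=1.0):
    """Sum term occurrences over all zones, scaling each by its zone's weight.

    `zones` maps a zone name to its list of tokens.
    `weights` maps a zone name to a multiplier. Unlisted zones get
    `default_weight`.
    """
    features = Counter()
    for zone, tokens in zones.items():
        weight = weights.get(zone, default_weight)
        for token in tokens:
            features[token] += weight
    return features


# A hypothetical two-zone document.
doc = {
    "title": ["coffee", "prices"],
    "body": ["coffee", "exports", "rose"],
}
features = zone_weighted_counts(doc, {"title": 2.0})
```

The term coffee receives 2.0 from its title occurrence plus 1.0 from its body occurrence. The feature set itself is unchanged, and only the strength of each term's contribution differs by zone.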

You can also get value from upweighting words from pieces of text that are not so much clearly defined zones, but where nevertheless evidence from document structure or content suggests that they are important. Murata et al. (2000) suggest that you can also get value (in an ad hoc retrieval context) from upweighting the first sentence of a (newswire) document.

Separate feature spaces for document zones. There are two strategies that can be used for document zones. Above we upweighted words that appear in certain zones.

This means that we are using the same features (that is, parameters are “tied” across different zones), but we pay more attention to the occurrence of terms in particular zones. An alternative strategy is to have a completely separate set of features and corresponding parameters for words occurring in different zones.

This is in principle more powerful: a word could usually indicate the topic Middle East when in the title but Commodities when in the body of a document. But, in practice, tying parameters is usually more successful. Having separate feature sets means having two or more times as many parameters, many of which will be much more sparsely seen in the training data, and hence with worse estimates, whereas upweighting has no bad effects of this sort.
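The difference between the two strategies can be made concrete with a small sketch. The dictionary representation of a document, the function names, and the "zone:term" naming scheme for zone-specific features are assumptions made for this illustration.

```python
from collections import Counter


def tied_features(zones):
    """One feature per term, shared across zones (tied parameters)."""
    return Counter(token for tokens in zones.values() for token in tokens)


def separate_features(zones):
    """One feature per (zone, term) pair, so each zone gets its own parameters."""
    return Counter(f"{zone}:{token}"
                   for zone, tokens in zones.items()
                   for token in tokens)


# A hypothetical two-zone document.
doc = {
    "title": ["coffee", "prices"],
    "body": ["coffee", "exports", "rose"],
}
tied = tied_features(doc)
separate = separate_features(doc)
```

In this toy document the tied representation has 4 distinct features while the separate representation has 5, and each zone-specific feature is observed no more often than its tied counterpart. Over a whole collection this is the growth in parameters and sparseness that the text describes.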

Moreover, it is quite uncommon for words to have different preferences when appearing in different zones; it is mainly the strength of their vote that should be adjusted. Nevertheless, ultimately this is a contingent result, depending on the nature and quantity of the training data.

Connections to text summarization. In Section 8.7, we mentioned the field of text summarization, and how most work in that field has adopted the limited goal of extracting and assembling pieces of the original text that are judged to be central based on features of sentences that consider the sentence’s position and content.

