An Introduction to Information Retrieval. Manning, Raghavan (2009). Excerpt from Section 15.3.

Jackson and Moulinier (2002) write: “There is no question concerning the commercial value of being able to classify documents automatically by content. There are myriad potential applications of such a capability for corporate Intranets, government departments, and Internet publishers.”

Footnote 5: These results are in terms of the break-even F1 (see Section 8.4). Many researchers disprefer this measure for text classification evaluation, since its calculation may involve interpolation rather than an actual parameter setting of the system, and it is not clear why this value should be reported rather than maximal F1 or another point on the precision/recall curve motivated by the task at hand. While earlier results in (Joachims 1998) suggested notable gains on this task from the use of higher order polynomial or RBF kernels, this was with hard-margin SVMs. With soft-margin SVMs, a simple linear SVM with the default C = 1 performs best.
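
To make the footnote's recommendation concrete, here is a minimal sketch of training a soft-margin linear SVM with the default C = 1 on bag-of-words features. The use of scikit-learn and the toy documents are our assumptions; the book does not prescribe a library or data set.

    # Soft-margin linear SVM with the default C = 1, as the footnote recommends.
    # scikit-learn and the toy corpus are assumptions for illustration only.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    train_docs = ["wheat exports rose sharply", "whole grain bread recipes",
                  "grain futures fell", "bakery bread prices climbed"]
    train_labels = ["grain", "not-grain", "grain", "not-grain"]

    # LinearSVC solves the soft-margin primal problem; C defaults to 1.0.
    clf = make_pipeline(TfidfVectorizer(), LinearSVC(C=1.0))
    clf.fit(train_docs, train_labels)
    print(clf.predict(["wheat and grain markets"]))  # expect ['grain'] on this toy data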

Most of our discussion of classification has focused on introducing various machine learning methods rather than discussing particular features of text documents relevant to classification. This bias is appropriate for a textbook, but is misplaced for an application developer. It is frequently the case that greater performance gains can be achieved from exploiting domain-specific text features than from changing from one machine learning method to another. Jackson and Moulinier (2002) suggest that “Understanding the data is one of the keys to successful categorization, yet this is an area in which most categorization tool vendors are extremely weak. Many of the ‘one size fits all’ tools on the market have not been tested on a wide range of content types.” In this section we wish to step back a little and consider the applications of text classification, the space of possible solutions, and the utility of application-specific heuristics.

15.3.1 Choosing what kind of classifier to use

When confronted with a need to build a text classifier, the first question to ask is how much training data is there currently available? None? Very little? Quite a lot? Or a huge amount, growing every day? Often one of the biggest practical challenges in fielding a machine learning classifier in real applications is creating or obtaining enough training data.

For many problems and algorithms, hundreds or thousands of examples from each class are required to produce a high performance classifier, and many real world contexts involve large sets of categories. We will initially assume that the classifier is needed as soon as possible; if a lot of time is available for implementation, much of it might be spent on assembling data resources.

If you have no labeled training data, and especially if there are existing staff knowledgeable about the domain of the data, then you should never forget the solution of using hand-written rules. That is, you write standing queries, as we touched on at the beginning of Chapter 13. For example:

IF (wheat OR grain) AND NOT (whole OR bread) THEN c = grain

In practice, rules get a lot bigger than this, and can be phrased using more sophisticated query languages than just Boolean expressions, including the use of numeric scores. With careful crafting (that is, by humans tuning the rules on development data), the accuracy of such rules can become very high. Jacobs and Rau (1990) report identifying articles about takeovers with 92% precision and 88.5% recall, and Hayes and Weinstein (1990) report 94% recall and 84% precision over 675 categories on Reuters newswire documents. Nevertheless the amount of work to create such well-tuned rules is very large. A reasonable estimate is 2 days per class, and extra time has to go into maintenance of rules, as the content of documents in classes drifts over time (cf. page 269).
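
Operationalizing such a rule is simple; maintaining hundreds of them is the expensive part. A minimal sketch of the grain standing query above (the helper name is our own; the book gives only the Boolean rule):

    # A minimal executable form of the standing query
    # IF (wheat OR grain) AND NOT (whole OR bread) THEN c = grain.
    def matches_grain_rule(text: str) -> bool:
        tokens = set(text.lower().split())
        return (("wheat" in tokens or "grain" in tokens)
                and not ("whole" in tokens or "bread" in tokens))

    doc = "Wheat and grain futures rallied on export news"
    if matches_grain_rule(doc):
        print("c = grain")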

If you have fairly little data and you are going to train a supervised classifier, then machine learning theory says you should stick to a classifier with high bias, as we discussed in Section 14.6 (page 308). For example, there are theoretical and empirical results that Naive Bayes does well in such circumstances (Ng and Jordan 2001, Forman and Cohen 2004), although this effect is not necessarily observed in practice with regularized models over textual data (Klein and Manning 2002). At any rate, a very low bias model like a nearest neighbor model is probably counterindicated. Regardless, the quality of the model will be adversely affected by the limited training data.

Here, the theoretically interesting answer is to try to apply semi-supervised training methods. This includes methods such as bootstrapping or the EM algorithm, which we will introduce in Section 16.5 (page 368). In these methods, the system gets some labeled documents, and a further large supply of unlabeled documents over which it can attempt to learn. One of the big advantages of Naive Bayes is that it can be straightforwardly extended to be a semi-supervised learning algorithm, but for SVMs, there is also semi-supervised learning work which goes under the title of transductive SVMs. See the references for pointers.
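
As a sketch of the semi-supervised idea, the fragment below runs a hard-assignment variant of EM around Naive Bayes: train on the labeled documents, then alternately label the unlabeled pool and retrain on everything. Section 16.5 presents the soft (probabilistic) E-step; the hard version here, and the use of scikit-learn, are simplifications of ours.

    # Hard-EM semi-supervised Naive Bayes: a simplified sketch, not the book's
    # exact algorithm (Section 16.5 uses soft posterior weights in the E-step).
    import numpy as np
    from scipy.sparse import vstack
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    labeled_docs = ["wheat harvest up", "bakery opens downtown"]
    labels = np.array([0, 1])
    unlabeled_docs = ["grain exports grow", "whole wheat bread sales rise"]

    vec = CountVectorizer()
    X = vec.fit_transform(labeled_docs + unlabeled_docs)
    X_lab, X_unl = X[:len(labeled_docs)], X[len(labeled_docs):]

    nb = MultinomialNB().fit(X_lab, labels)        # initialize on labeled data
    for _ in range(5):                             # a few EM rounds
        pseudo = nb.predict(X_unl)                 # E-step (hard): guess labels
        nb = MultinomialNB().fit(vstack([X_lab, X_unl]),
                                 np.concatenate([labels, pseudo]))  # M-step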

Often, the practical answer is to work out how to get more labeled data as quickly as you can. The best way to do this is to insert yourself into a process where humans will be willing to label data for you as part of their natural tasks. For example, in many cases humans will sort or route email for their own purposes, and these actions give information about classes. The alternative of getting human labelers expressly for the task of training classifiers is often difficult to organize, and the labeling is often of lower quality, because the labels are not embedded in a realistic task context. Rather than getting people to label all or a random sample of documents, there has also been considerable research on active learning, where a system is built which decides which documents a human should label. Usually these are the ones on which a classifier is uncertain of the correct classification. This can be effective in reducing annotation costs by a factor of 2–4, but has the problem that the good documents to label to train one type of classifier often are not the good documents to label to train a different type of classifier.
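
A minimal sketch of that selection step, commonly called uncertainty sampling. We use logistic regression rather than an SVM because it exposes class probabilities directly; the function name and the scikit-learn dependency are our assumptions.

    # Uncertainty sampling: send the pool documents the current classifier is
    # least sure about to a human annotator. A sketch with assumed names.
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    def most_uncertain(clf, X_pool, k=2):
        probs = clf.predict_proba(X_pool)
        margin = np.abs(probs[:, 0] - probs[:, 1])   # small margin = uncertain
        return np.argsort(margin)[:k]                # pool indices to label next

    labeled = ["wheat futures up", "bakery prices down", "grain exports", "bread sales"]
    y = [0, 1, 0, 1]
    pool = ["whole grain bread", "wheat harvest report", "oven repair tips"]
    vec = TfidfVectorizer().fit(labeled + pool)
    clf = LogisticRegression().fit(vec.transform(labeled), y)
    print(most_uncertain(clf, vec.transform(pool)))  # indices of docs to annotate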

If there is a reasonable amount of labeled data, then you are in the perfect position to use everything that we have presented about text classification. For instance, you may wish to use an SVM. However, if you are deploying a linear classifier such as an SVM, you should probably design an application that overlays a Boolean rule-based classifier over the machine learning classifier. Users frequently like to adjust things that do not come out quite right, and if management gets on the phone and wants the classification of a particular document fixed right now, then this is much easier to do by hand-writing a rule than by working out how to adjust the weights of an SVM without destroying the overall classification accuracy. This is one reason why machine learning models like decision trees, which produce user-interpretable Boolean-like models, retain considerable popularity.
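
One possible shape for such an overlay, entirely our sketch: consult an ordered list of hand-written override rules first, and fall back to the learned classifier only when none fires. The point of the design is that an urgent fix becomes a one-line rule rather than a retraining job.

    # Rule overrides layered over a machine-learned classifier (assumed names).
    def classify(doc, rules, ml_classifier):
        for predicate, label in rules:     # hand-written overrides, in order
            if predicate(doc):
                return label               # a rule fired: skip the learned model
        return ml_classifier(doc)          # otherwise defer to machine learning

    rules = [(lambda d: "takeover" in d.lower(), "acquisitions")]
    print(classify("Takeover bid announced", rules, lambda d: "other"))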

Bankoand Brill 2001). It may be best to choose a classifier based on the scalabilityof training or even runtime efficiency. To get to this point, you need to havehuge amounts of data. The general rule of thumb is that each doubling ofthe training data size produces a linear increase in classifier performance,but with very large amounts of data, the improvement becomes sub-linear.15.3.2Improving classifier performanceFor any particular application, there is usually significant room for improving classifier effectiveness through exploiting features specific to the domainor document collection. Often documents will contain zones which are especially useful for classification.

Often there will be particular subvocabularies which demand special treatment for optimal classification effectiveness.

Large and difficult category taxonomies

If a text classification problem consists of a small number of well-separated categories, then many classification algorithms are likely to work well. But many real classification problems consist of a very large number of often very similar categories. The reader might think of examples like web directories (the Yahoo! Directory or the Open Directory Project), library classification schemes (Dewey Decimal or Library of Congress) or the classification schemes used in legal or medical applications. For instance, the Yahoo! Directory consists of over 200,000 categories in a deep hierarchy. Accurate classification over large sets of closely related classes is inherently difficult.

Most large sets of categories have a hierarchical structure, and attempting to exploit the hierarchy by doing hierarchical classification is a promising approach. However, at present the effectiveness gains from doing this rather than just working with the classes that are the leaves of the hierarchy remain modest.⁶ But the technique can be very useful simply to improve the scalability of building classifiers over large hierarchies. Another simple way to improve the scalability of classifiers over large hierarchies is the use of aggressive feature selection.
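
A minimal sketch of the top-down version of hierarchical classification: one small classifier per internal node routes each document toward a leaf, so only a few classifiers run per document and each is trained on a narrow subproblem. The tree encoding and names are our own; the book names only the general technique.

    # Top-down hierarchical classification over an assumed tree encoding:
    # each internal node is (classifier, {child_label: child_node or None}).
    def classify_hierarchical(doc, node):
        clf, children = node
        label = clf(doc)                       # pick a branch at this level
        child = children.get(label)
        if child is None:                      # reached a leaf category
            return label
        return label + "/" + classify_hierarchical(doc, child)

    toy_tree = (lambda d: "sports" if "game" in d else "business",
                {"sports": None,
                 "business": (lambda d: "markets" if "stocks" in d else "companies",
                              {"markets": None, "companies": None})})
    print(classify_hierarchical("stocks rallied today", toy_tree))  # business/markets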
