An introduction to information retrieval. Manning_ Raghavan (2009) (811397), page 73

Here, the index $i$, $1 \le i \le M$, refers to terms of the vocabulary (not to positions in $d$ as $k$ does; cf. Section 13.4.1, page 270) and $\vec{x}$ and $\vec{w}$ are $M$-dimensional vectors. So in log space, Naive Bayes is a linear classifier.

Example 14.3: Table 14.4 defines a linear classifier for the category interest in Reuters-21578 (see Section 13.6, page 279). We assign document $\vec{d}_1$ “rate discount dlrs world” to interest since $\vec{w}^T \vec{d}_1 = 0.67 \cdot 1 + 0.46 \cdot 1 + (-0.71) \cdot 1 + (-0.35) \cdot 1 = 0.07 > 0 = b$. We assign $\vec{d}_2$ “prime dlrs” to the complement class (not in interest) since $\vec{w}^T \vec{d}_2 = -0.01 \le b$. For simplicity, we assume a simple binary vector representation in this example: 1 for occurring terms, 0 for non-occurring terms.

Figure 14.10 is a graphical example of a linear problem, which we define to mean that the underlying distributions $P(d|c)$ and $P(d|\bar{c})$ of the two classes are separated by a line. We call this separating line the class boundary. It is the “true” boundary of the two classes and we distinguish it from the decision boundary that the learning method computes to approximate the class boundary.

As is typical in text classification, there are some noise documents in Figure 14.10 (marked with arrows) that do not fit well into the overall distribution of the classes.
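The decision rule of Example 14.3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the four weights for $\vec{d}_1$ are the ones quoted in the example, the weight for "prime" (0.70) is inferred from the stated sum of -0.01 and the dlrs weight, and terms missing from the dictionary are treated as having weight 0, which the full Table 14.4 may not support. The names `WEIGHTS`, `score`, and `in_interest` are hypothetical.

```python
# Term weights quoted in Example 14.3. The "prime" weight is an inference:
# w^T d2 = -0.01 and dlrs = -0.71 imply prime = 0.70.
WEIGHTS = {
    "rate": 0.67,
    "discount": 0.46,
    "dlrs": -0.71,
    "world": -0.35,
    "prime": 0.70,
}
THRESHOLD = 0.0  # b in the text


def score(document: str) -> float:
    """Compute w^T d for a binary term vector d (each term counts once)."""
    terms = sorted(set(document.split()))
    # Assumption: terms without a listed weight contribute 0.
    return sum(WEIGHTS.get(term, 0.0) for term in terms)


def in_interest(document: str) -> bool:
    """Assign to the class if w^T d > b, otherwise to the complement."""
    return score(document) > THRESHOLD


for doc in ("rate discount dlrs world", "prime dlrs"):
    print(f"{doc!r}: score={score(doc):.2f}, interest={in_interest(doc)}")
```

Using a set of terms mirrors the binary representation in the example: a term that occurs several times still contributes its weight once.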

In Section 13.5 (page 271), we defined a noise feature as a misleading feature that, when included in the document representation, on average increases the classification error. Analogously, a noise document is a document that, when included in the training set, misleads the learning method and increases classification error. Intuitively, the underlying distribution partitions the representation space into areas with mostly homogeneous class assignments. A document that does not conform with the dominant class in its area is a noise document.

Online edition (c) 2009 Cambridge UP. 14 Vector space classification.

Figure 14.10: A linear problem with noise. In this hypothetical web page classification scenario, Chinese-only web pages are solid circles and mixed Chinese-English web pages are squares. The two classes are separated by a linear class boundary (dashed line, short dashes), except for three noise documents (marked with arrows).

Noise documents are one reason why training a linear classifier is hard. If we pay too much attention to noise documents when choosing the decision hyperplane of the classifier, then it will be inaccurate on new data. More fundamentally, it is usually difficult to determine which documents are noise documents and therefore potentially misleading.

If there exists a hyperplane that perfectly separates the two classes, then we call the two classes linearly separable.
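To make the definition of linear separability concrete, here is a minimal sketch that tests whether a given hyperplane separates a small training set, using the same rule as the text ($\vec{w}^T \vec{x} > b$ means class $c$). The 2D points, the candidate hyperplanes, and the names `dot` and `separates` are all hypothetical and chosen only for illustration. The last lines add one noise document at the midpoint of two class-$c$ points, which makes the set non-separable for every hyperplane, not just the candidates tried here.

```python
def dot(w, x):
    """Inner product of two equal-length sequences."""
    return sum(wi * xi for wi, xi in zip(w, x))


def separates(w, b, points, labels):
    """True if the rule 'w.x > b means class c' labels every point correctly."""
    return all((dot(w, x) > b) == label for x, label in zip(points, labels))


# Hypothetical 2D training set: True = class c, False = complement class.
POINTS = [(2, 2), (3, 1), (3, 3), (0, 0), (1, 0), (0, 1)]
LABELS = [True, True, True, False, False, False]

# Several candidate hyperplanes (w, b); the first three all separate the data.
CANDIDATES = [((1, 1), 2.5), ((1, 1), 3.0), ((1, 0), 1.5), ((0, 1), 0.5)]
for w, b in CANDIDATES:
    print(w, b, separates(w, b, POINTS, LABELS))

# One noise document: a complement-class point at the midpoint of the
# class-c points (3, 1) and (3, 3). If both of those satisfy w.x > b,
# so does their midpoint, so no hyperplane can separate this set.
NOISY_POINTS = POINTS + [(3, 2)]
NOISY_LABELS = LABELS + [False]
print(any(separates(w, b, NOISY_POINTS, NOISY_LABELS) for w, b in CANDIDATES))
```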

In fact, if linear separability holds, then there is an infinite number of linear separators (Exercise 14.4) as illustrated by Figure 14.8, where the number of possible separating hyperplanes is infinite.

Figure 14.8 illustrates another challenge in training a linear classifier. If we are dealing with a linearly separable problem, then we need a criterion for selecting among all decision hyperplanes that perfectly separate the training data. In general, some of these hyperplanes will do well on new data, some will not.

Figure 14.11: A nonlinear problem. (Both axes run from 0.0 to 1.0.)

An example of a nonlinear classifier is kNN. The nonlinearity of kNN is intuitively clear when looking at examples like Figure 14.6. The decision boundaries of kNN (the double lines in Figure 14.6) are locally linear segments, but in general have a complex shape that is not equivalent to a line in 2D or a hyperplane in higher dimensions.

Figure 14.11 is another example of a nonlinear problem: there is no good linear separator between the distributions $P(d|c)$ and $P(d|\bar{c})$ because of the circular “enclave” in the upper left part of the graph. Linear classifiers misclassify the enclave, whereas a nonlinear classifier like kNN will be highly accurate for this type of problem if the training set is large enough.

If a problem is nonlinear and its class boundaries cannot be approximated well with linear hyperplanes, then nonlinear classifiers are often more accurate than linear classifiers.
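The enclave argument can be illustrated with a small synthetic experiment. Everything in this sketch is an assumption made for illustration: the unit square, the diagonal class boundary, the enclave centred at (0.2, 0.8) with radius 0.15, the 10 x 10 training grid, and the function names. It is not the data behind Figure 14.11. The linear rule ignores the enclave by construction, while a plain 3-nearest-neighbour vote recovers it from the training points.

```python
import math
from collections import Counter


def true_class(x, y):
    """Hypothetical nonlinear problem: class c lies below the diagonal,
    plus a circular enclave in the upper left of the unit square."""
    in_enclave = math.dist((x, y), (0.2, 0.8)) < 0.15
    return x > y or in_enclave


def linear_predict(x, y):
    """Linear classifier that matches the diagonal boundary only."""
    return x > y


def knn_predict(train, query, k=3):
    """Majority vote among the k training points closest to the query."""
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Training set: a regular 10 x 10 grid over the unit square.
GRID = [(i + 0.5) / 10 for i in range(10)]
TRAIN = [((x, y), true_class(x, y)) for x in GRID for y in GRID]

for query in [(0.2, 0.8), (0.9, 0.1), (0.6, 0.9)]:
    print(query, true_class(*query), linear_predict(*query), knn_predict(TRAIN, query))
```

With a coarser grid fewer training points would fall inside the enclave and kNN could miss it, which matches the caveat in the text that the training set must be large enough.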

If a problem is linear, it is best to use a simpler linear classifier.

Exercise 14.4: Prove that the number of linear separators of two classes is either infinite or zero.

14.5 Classification with more than two classes

We can extend two-class linear classifiers to $J > 2$ classes. The method to use depends on whether the classes are mutually exclusive or not.

Classification for classes that are not mutually exclusive is called any-of, multilabel, or multivalue classification. In this case, a document can belong to several classes simultaneously, or to a single class, or to none of the classes. A decision on one class leaves all options open for the others.

It is sometimes said that the classes are independent of each other, but this is misleading since the classes are rarely statistically independent in the sense defined on page 275. In terms of the formal definition of the classification problem in Equation (13.1) (page 256), we learn $J$ different classifiers $\gamma_j$ in any-of classification, each returning either $c_j$ or $\bar{c}_j$: $\gamma_j(d) \in \{c_j, \bar{c}_j\}$.

Solving an any-of classification task with linear classifiers is straightforward:

1. Build a classifier for each class, where the training set consists of the set of documents in the class (positive labels) and its complement (negative labels).
2. Given the test document, apply each classifier separately. The decision of one classifier has no influence on the decisions of the other classifiers.

The second type of classification with more than two classes is one-of classification. Here, the classes are mutually exclusive. Each document must belong to exactly one of the classes. One-of classification is also called multinomial, polytomous[4], multiclass, or single-label classification.
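The two-step any-of procedure described above (one independent binary decision per class) can be sketched as follows. The class names are borrowed from the text's later example, but the term weights, thresholds, and the names `CLASSIFIERS` and `any_of` are hypothetical placeholders rather than trained values.

```python
# Hypothetical per-class linear classifiers: class -> (term weights, threshold b).
CLASSIFIERS = {
    "UK": ({"uk": 1.0, "london": 0.8}, 0.5),
    "China": ({"china": 1.0, "beijing": 0.8}, 0.5),
    "coffee": ({"coffee": 1.0, "beans": 0.4}, 0.5),
    "poultry": ({"poultry": 1.0, "chicken": 0.9}, 0.5),
}


def any_of(document: str) -> set:
    """Apply every binary classifier separately and collect the accepted classes."""
    terms = sorted(set(document.lower().split()))  # binary term vector
    assigned = set()
    for cls, (weights, b) in CLASSIFIERS.items():
        score = sum(weights.get(term, 0.0) for term in terms)
        if score > b:  # this decision does not look at any other class
            assigned.add(cls)
    return assigned


print(any_of("UK prime minister visits China"))  # two classes
print(any_of("weather report"))                  # no class at all
print(any_of("coffee beans"))                    # exactly one class
```

Because each decision is taken in isolation, the result can contain several classes, one class, or none, which is exactly the behaviour a one-of constraint rules out.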

Formally, there is a single classification function $\gamma$ in one-of classification whose range is $C$, i.e., $\gamma(d) \in \{c_1, \ldots, c_J\}$. kNN is a (nonlinear) one-of classifier.

True one-of problems are less common in text classification than any-of problems. With classes like UK, China, poultry, or coffee, a document can be relevant to many topics simultaneously, as when the prime minister of the UK visits China to talk about the coffee and poultry trade.

Nevertheless, we will often make a one-of assumption, as we did in Figure 14.1, even if classes are not really mutually exclusive. For the classification problem of identifying the language of a document, the one-of assumption is a good approximation as most text is written in only one language. In such cases, imposing a one-of constraint can increase the classifier’s effectiveness because errors that are due to the fact that the any-of classifiers assigned a document to either no class or more than one class are eliminated.

$J$ hyperplanes do not divide $\mathbb{R}^{|V|}$ into $J$ distinct regions as illustrated in Figure 14.12.

Figure 14.12: $J$ hyperplanes do not divide space into $J$ disjoint regions.

[4] A synonym of polytomous is polychotomous.

Thus, we must use a combination method when using two-class linear classifiers for one-of classification. The simplest method is to rank classes and then select the top-ranked class.

Geometrically, the ranking can be with respect to the distances from the $J$ linear separators. Documents close to a class’s separator are more likely to be misclassified, so the greater the distance from the separator, the more plausible it is that a positive classification decision is correct. Alternatively, we can use a direct measure of confidence to rank classes, e.g., probability of class membership. We can state this algorithm for one-of classification with linear classifiers as follows:

1. Build a classifier for each class, where the training set consists of the set of documents in the class (positive labels) and its complement (negative labels).
2. Given the test document, apply each classifier separately.
3. Assign the document to the class with
   • the maximum score,
   • the maximum confidence value,
   • or the maximum probability.

An important tool for analyzing the performance of a classifier for $J > 2$ classes is the confusion matrix. The confusion matrix shows for each pair of classes $\langle c_1, c_2 \rangle$, how many documents from $c_1$ were incorrectly assigned to $c_2$. In Table 14.5, the classifier manages to distinguish the three financial classes money-fx, trade, and interest from the three agricultural classes wheat, corn, and grain, but makes many errors within these two groups. The confusion matrix can help pinpoint opportunities for improving the accuracy of the …

Table 14.5: A confusion matrix for Reuters-21578. Rows are true classes, columns are assigned classes.

                       assigned class
true class   money-fx  trade  interest  wheat  corn  grain
money-fx           95      0        10      0     0      0
trade               1      1        90      0     1      0
interest           13      0         0      0     0      0
wheat               0      0         1     34     3      7
corn                1      0         2     13    26      5
grain               0      0         2     14     5     10
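The one-of algorithm and the confusion matrix can be combined in one short sketch. The ranking criterion used here is the signed distance from each class's hyperplane, $(\vec{w}^T \vec{x} - b) / |\vec{w}|$, which is one of the options the text mentions. The class names come from Table 14.5, but the weights, thresholds, labelled documents, and function names are hypothetical and chosen only so that one confusion appears.

```python
import math
from collections import Counter

# Hypothetical per-class linear classifiers: class -> (term weights, threshold b).
CLASSIFIERS = {
    "wheat": ({"wheat": 1.0, "grain": 0.3}, 0.2),
    "corn": ({"corn": 1.0, "grain": 0.4}, 0.2),
    "trade": ({"trade": 1.0, "tariff": 0.7}, 0.2),
}


def signed_distance(cls: str, terms) -> float:
    """Signed distance of a binary term vector from the class's hyperplane."""
    weights, b = CLASSIFIERS[cls]
    score = sum(weights.get(term, 0.0) for term in terms)
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return (score - b) / norm


def one_of(document: str) -> str:
    """Apply every classifier, then assign the single top-ranked class."""
    terms = sorted(set(document.lower().split()))
    return max(sorted(CLASSIFIERS), key=lambda cls: signed_distance(cls, terms))


def confusion_matrix(labelled_docs) -> Counter:
    """Count (true class, assigned class) pairs."""
    return Counter((true, one_of(doc)) for true, doc in labelled_docs)


# Hypothetical labelled test documents: (true class, text).
TEST_DOCS = [
    ("wheat", "wheat harvest grain"),
    ("wheat", "grain exports"),
    ("corn", "corn grain prices"),
    ("corn", "corn futures"),
    ("trade", "trade tariff talks"),
]

MATRIX = confusion_matrix(TEST_DOCS)
for true in sorted(CLASSIFIERS):
    print(true, [MATRIX[(true, assigned)] for assigned in sorted(CLASSIFIERS)])
```

Unlike the any-of case, `one_of` always returns exactly one class, even for a document whose distances are all negative. In this toy run the document "grain exports" is a wheat document assigned to corn, which shows up as an off-diagonal count.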
