An introduction to information retrieval. Manning_ Raghavan (2009) (811397), страница 75

Файл №811397 An introduction to information retrieval. Manning_ Raghavan (2009) (An introduction to information retrieval. Manning_ Raghavan (2009).pdf) 75 страницаAn introduction to information retrieval. Manning_ Raghavan (2009) (811397) страница 752020-08-252020-08-25СтудИзба

An introduction to information retrieval. Manning_ Raghavan (2009).pdf

Просмтор этого файла доступен только зарегистрированным пользователям. Но у нас супер быстрая регистрация: достаточно только электронной почты!

Регистрация/авторизация

Текст из файла (страница 75)

The decision lines pro-Online edition (c) 2009 Cambridge UP31214 Vector space classificationOVERFITTINGMEMORY CAPACITYBIAS - VARIANCETRADEOFFduced by linear learning methods in Figures 14.10 and 14.11 will deviateslightly from the main class boundaries, depending on the training set, butthe class assignment for the vast majority of documents (with the exceptionof those close to the main boundary) will not be affected. The circular enclavein Figure 14.11 will be consistently misclassified.Nonlinear methods like kNN have high variance. It is apparent from Figure 14.6 that kNN can model very complex boundaries between two classes.It is therefore sensitive to noise documents of the sort depicted in Figure 14.10.As a result the variance term in Equation (14.11) is large for kNN: Test documents are sometimes misclassified – if they happen to be close to a noisedocument in the training set – and sometimes correctly classified – if thereare no noise documents in the training set near them.

This results in highvariation from training set to training set.High-variance learning methods are prone to overfitting the training data.The goal in classification is to fit the training data to the extent that we capture true properties of the underlying distribution P(hd, ci). In overfitting,the learning method also learns from noise. Overfitting increases MSE andfrequently is a problem for high-variance learning methods.We can also think of variance as the model complexity or, equivalently, memory capacity of the learning method – how detailed a characterization of thetraining set it can remember and then apply to new data.

This capacity corresponds to the number of independent parameters available to fit the trainingset. Each kNN neighborhood Sk makes an independent classification decision. The parameter in this case is the estimate P̂(c|Sk ) from Figure 14.7.Thus, kNN’s capacity is only limited by the size of the training set. It canmemorize arbitrarily large training sets.

In contrast, the number of parameters of Rocchio is fixed – J parameters per dimension, one for each centroid– and independent of the size of the training set. The Rocchio classifier (inform of the centroids defining it) cannot “remember” fine-grained details ofthe distribution of the documents in the training set.According to Equation (14.7), our goal in selecting a learning method is tominimize learning error.

The fundamental insight captured by Equation (14.11),which we can succinctly state as: learning-error = bias + variance, is that thelearning error has two components, bias and variance, which in general cannot be minimized simultaneously. When comparing two learning methodsΓ1 and Γ2 , in most cases the comparison comes down to one method havinghigher bias and lower variance and the other lower bias and higher variance.The decision for one learning method vs. another is then not simply a matter of selecting the one that reliably produces good classifiers across trainingsets (small variance) or the one that can learn classification problems withvery difficult decision boundaries (small bias).

Instead, we have to weighthe respective merits of bias and variance in our application and choose accordingly. This tradeoff is called the bias-variance tradeoff .Online edition (c) 2009 Cambridge UP14.6 The bias-variance tradeoff313Figure 14.10 provides an illustration, which is somewhat contrived, butwill be useful as an example for the tradeoff. Some Chinese text containsEnglish words written in the Roman alphabet like CPU, ONLINE, and GPS.Consider the task of distinguishing Chinese-only web pages from mixedChinese-English web pages.

A search engine might offer Chinese users without knowledge of English (but who understand loanwords like CPU) the option of filtering out mixed pages. We use two features for this classificationtask: number of Roman alphabet characters and number of Chinese characters on the web page. As stated earlier, the distribution P(hd, ci) of thegenerative model generates most mixed (respectively, Chinese) documentsabove (respectively, below) the short-dashed line, but there are a few noisedocuments.In Figure 14.10, we see three classifiers:• One-feature classifier.

Shown as a dotted horizontal line. This classifieruses only one feature, the number of Roman alphabet characters. Assuming a learning method that minimizes the number of misclassificationsin the training set, the position of the horizontal decision boundary isnot greatly affected by differences in the training set (e.g., noise documents).

So a learning method producing this type of classifier has lowvariance. But its bias is high since it will consistently misclassify squaresin the lower left corner and “solid circle” documents with more than 50Roman characters.• Linear classifier. Shown as a dashed line with long dashes. Learning linear classifiers has less bias since only noise documents and possibly a fewdocuments close to the boundary between the two classes are misclassified. The variance is higher than for the one-feature classifiers, but stillsmall: The dashed line with long dashes deviates only slightly from thetrue boundary between the two classes, and so will almost all linear decision boundaries learned from training sets.

Thus, very few documents(documents close to the class boundary) will be inconsistently classified.• “Fit-training-set-perfectly” classifier. Shown as a solid line. Here, thelearning method constructs a decision boundary that perfectly separatesthe classes in the training set. This method has the lowest bias becausethere is no document that is consistently misclassified – the classifierssometimes even get noise documents in the test set right. But the varianceof this learning method is high.

Because noise documents can move thedecision boundary arbitrarily, test documents close to noise documentsin the training set will be misclassified – something that a linear learningmethod is unlikely to do.It is perhaps surprising that so many of the best-known text classificationalgorithms are linear. Some of these methods, in particular linear SVMs, reg-Online edition (c) 2009 Cambridge UP31414 Vector space classificationularized logistic regression and regularized linear regression, are among themost effective known methods. The bias-variance tradeoff provides insightinto their success. Typical classes in text classification are complex and seemunlikely to be modeled well linearly.

However, this intuition is misleadingfor the high-dimensional spaces that we typically encounter in text applications. With increased dimensionality, the likelihood of linear separabilityincreases rapidly (Exercise 14.17). Thus, linear models in high-dimensionalspaces are quite powerful despite their linearity. Even more powerful nonlinear learning methods can model decision boundaries that are more complexthan a hyperplane, but they are also more sensitive to noise in the trainingdata. Nonlinear learning methods sometimes perform better if the trainingset is large, but by no means in all cases.14.7ROUTINGFILTERINGPUSH MODELPULL MODELCENTROID - BASEDCLASSIFICATIONReferences and further readingAs discussed in Chapter 9, Rocchio relevance feedback is due to Rocchio(1971).

Joachims (1997) presents a probabilistic analysis of the method. Rocchio classification was widely used as a classification method in TREC in the1990s (Buckley et al. 1994a;b, Voorhees and Harman 2005). Initially, it wasused as a form of routing. Routing merely ranks documents according to relevance to a class without assigning them. Early work on filtering, a true classification approach that makes an assignment decision on each document,was published by Ittner et al. (1995) and Schapire et al. (1998).

The definitionof routing we use here should not be confused with another sense. Routingcan also refer to the electronic distribution of documents to subscribers, theso-called push model of document distribution. In a pull model, each transferof a document to the user is initiated by the user – for example, by meansof search or by selecting it from a list of documents on a news aggregationwebsite.Some authors restrict the name Roccchio classification to two-class problemsand use the terms cluster-based (Iwayama and Tokunaga 1995) and centroidbased classification (Han and Karypis 2000, Tan and Cheng 2007) for Rocchioclassification with J > 2.A more detailed treatment of kNN can be found in (Hastie et al.

2001), including methods for tuning the parameter k. An example of an approximatefast kNN algorithm is locality-based hashing (Andoni et al. 2006). Kleinberg (1997) presents an approximate Θ(( M log2 M )( M + log N )) kNN algorithm (where M is the dimensionality of the space and N the number of datapoints), but at the cost of exponential storage requirements: Θ(( N log M )2M ).Indyk (2004) surveys nearest neighbor methods in high-dimensional spaces.Early work on kNN in text classification was motivated by the availabilityof massively parallel hardware architectures (Creecy et al. 1992). Yang (1994)Online edition (c) 2009 Cambridge UP14.8 Exercises315uses an inverted index to speed up kNN classification.

Характеристики

Тип файла

PDF-файл

Размер

6,58 Mb

Материал

An introduction to information retrieval. Manning_ Raghavan (2009).pdf

Тип материала

Книга

Предмет

Анализ текстовых данных и информационный поиск

Высшее учебное заведение

МГУ им. Ломоносова

Список файлов книги

an-introduction-to-information-retrieval.-manning_-raghavan-2009.pdf.rar

An introduction to information retrieval. Manning_ Raghavan (2009).pdf

Поделитесь ссылкой:

Ставлю 10/10
Все нравится, очень удобный сайт, помогает в учебе. Кроме этого, можно заработать самому, выставляя готовые учебные материалы на продажу здесь. Рейтинги и отзывы на преподавателей очень помогают сориентироваться в начале нового семестра. Спасибо за такую функцию. Ставлю максимальную оценку.

Лучшая платформа для успешной сдачи сессии
Познакомился со СтудИзбой благодаря своему другу, очень нравится интерфейс, количество доступных файлов, цена, в общем, все прекрасно. Даже сам продаю какие-то свои работы.

Студизба ван лав ❤
Очень офигенный сайт для студентов. Много полезных учебных материалов. Пользуюсь студизбой с октября 2021 года. Серьёзных нареканий нет. Хотелось бы, что бы ввели подписочную модель и сделали материалы дешевле 300 рублей в рамках подписки бесплатными.

Отличный сайт
Лично меня всё устраивает - и покупка, и продажа; и цены, и возможность предпросмотра куска файла, и обилие бесплатных файлов (в подборках по авторам, читай, ВУЗам и факультетам). Есть определённые баги, но всё решаемо, да и администраторы реагируют в течение суток.

Маленький отзыв о большом помощнике!
Студизба спасает в те моменты, когда сроки горят, а работ накопилось достаточно. Довольно удобный сайт с простой навигацией и огромным количеством материалов.

Студ. Изба как крупнейший сборник работ для студентов
Тут дофига бывает всего полезного. Печально, что бывают предметы по которым даже одного бесплатного решения нет, но это скорее вопрос к студентам. В остальном всё здорово.

Спасательный островок
Если уже не успеваешь разобраться или застрял на каком-то задание поможет тебе быстро и недорого решить твою проблему.

Всё и так отлично
Всё очень удобно. Особенно круто, что есть система бонусов и можно выводить остатки денег. Очень много качественных бесплатных файлов.

Отзыв о системе "Студизба"
Отличная платформа для распространения работ, востребованных студентами. Хорошо налаженная и качественная работа сайта, огромная база заданий и аудитория.

Отличный помощник
Отличный сайт с кучей полезных файлов, позволяющий найти много методичек / учебников / отзывов о вузах и преподователях.

Отлично помогает студентам в любой момент для решения трудных и незамедлительных задач
Хотелось бы больше конкретной информации о преподавателях. А так в принципе хороший сайт, всегда им пользуюсь и ни разу не было желания прекратить. Хороший сайт для помощи студентам, удобный и приятный интерфейс. Из недостатков можно выделить только отсутствия небольшого количества файлов.

Спасибо за шикарный сайт
Великолепный сайт на котором студент за не большие деньги может найти помощь с дз, проектами курсовыми, лабораторными, а также узнать отзывы на преподавателей и бесплатно скачать пособия.