An introduction to information retrieval. Manning_ Raghavan (2009) (811397), страница 72

Файл №811397 An introduction to information retrieval. Manning_ Raghavan (2009) (An introduction to information retrieval. Manning_ Raghavan (2009).pdf) 72 страницаAn introduction to information retrieval. Manning_ Raghavan (2009) (811397) страница 722020-08-252020-08-25СтудИзба

An introduction to information retrieval. Manning_ Raghavan (2009).pdf

Просмтор этого файла доступен только зарегистрированным пользователям. Но у нас супер быстрая регистрация: достаточно только электронной почты!

Регистрация/авторизация

Текст из файла (страница 72)

kNN has properties that arequite different from most other classification algorithms. Training a kNNclassifier simply consists of determining k and preprocessing documents. Infact, if we preselect a value for k and do not preprocess, then kNN requiresno training at all. In practice, we have to perform preprocessing steps liketokenization. It makes more sense to preprocess training documents onceas part of the training phase rather than repeatedly every time we classify anew test document.Test time is Θ(|D | Mave Ma ) for kNN.

It is linear in the size of the trainingset as we need to compute the distance of each training document from thetest document. Test time is independent of the number of classes J. kNNtherefore has a potential advantage for problems with large J.In kNN classification, we do not perform any estimation of parameters aswe do in Rocchio classification (centroids) or in Naive Bayes (priors and conditional probabilities). kNN simply memorizes all examples in the trainingOnline edition (c) 2009 Cambridge UP30014 Vector space classificationMEMORY- BASEDLEARNINGB AYES ERROR RATEset and then compares the test document to them.

For this reason, kNN isalso called memory-based learning or instance-based learning. It is usually desirable to have as much training data as possible in machine learning. But inkNN large training sets come with a severe efficiency penalty in classification.Can kNN testing be made more efficient than Θ(|D | Mave Ma ) or, ignoringthe length of documents, more efficient than Θ(|D |)? There are fast kNNalgorithms for small dimensionality M (Exercise 14.12). There are also approximations for large M that give error bounds for specific efficiency gains(see Section 14.7). These approximations have not been extensively testedfor text classification applications, so it is not clear whether they can achievemuch better efficiency than Θ(|D |) without a significant loss of accuracy.The reader may have noticed the similarity between the problem of findingnearest neighbors of a test document and ad hoc retrieval, where we searchfor the documents with the highest similarity to the query (Section 6.3.2,page 123).

In fact, the two problems are both k nearest neighbor problemsand only differ in the relative density of (the vector of) the test documentin kNN (10s or 100s of non-zero entries) versus the sparseness of (the vector of) the query in ad hoc retrieval (usually fewer than 10 non-zero entries).We introduced the inverted index for efficient ad hoc retrieval in Section 1.1(page 6).

Is the inverted index also the solution for efficient kNN?An inverted index restricts a search to those documents that have at leastone term in common with the query. Thus in the context of kNN, the inverted index will be efficient if the test document has no term overlap with alarge number of training documents. Whether this is the case depends on theclassification problem.

If documents are long and no stop list is used, thenless time will be saved. But with short documents and a large stop list, aninverted index may well cut the average test time by a factor of 10 or more.The search time in an inverted index is a function of the length of the postings lists of the terms in the query. Postings lists grow sublinearly with thelength of the collection since the vocabulary increases according to Heaps’law – if the probability of occurrence of some terms increases, then the probability of occurrence of others must decrease. However, most new terms areinfrequent. We therefore take the complexity of inverted index search to beΘ( T ) (as discussed in Section 2.4.2, page 41) and, assuming average document length does not change over time, Θ( T ) = Θ(|D |).As we will see in the next chapter, kNN’s effectiveness is close to that of themost accurate learning methods in text classification (Table 15.2, page 334).

Ameasure of the quality of a learning method is its Bayes error rate, the averageerror rate of classifiers learned by it for a particular problem. kNN is notoptimal for problems with a non-zero Bayes error rate – that is, for problemswhere even the best possible classifier has a non-zero classification error. Theerror of 1NN is asymptotically (as the training set increases) bounded byOnline edition (c) 2009 Cambridge UP14.4 Linear versus nonlinear classifiers301◮ Figure 14.8 There are an infinite number of hyperplanes that separate two linearlyseparable classes.twice the Bayes error rate.

That is, if the optimal classifier has an error rateof x, then 1NN has an asymptotic error rate of less than 2x. This is due to theeffect of noise – we already saw one example of noise in the form of noisyfeatures in Section 13.5 (page 271), but noise can also take other forms as wewill discuss in the next section. Noise affects two components of kNN: thetest document and the closest training document.

The two sources of noiseare additive, so the overall error of 1NN is twice the optimal error rate. Forproblems with Bayes error rate 0, the error rate of 1NN will approach 0 asthe size of the training set increases.?14.4LINEAR CLASSIFIERExercise 14.3Explain why kNN handles multimodal classes better than Rocchio.Linear versus nonlinear classifiersIn this section, we show that the two learning methods Naive Bayes andRocchio are instances of linear classifiers, the perhaps most important groupof text classifiers, and contrast them with nonlinear classifiers.

To simplifythe discussion, we will only consider two-class classifiers in this section anddefine a linear classifier as a two-class classifier that decides class membershipby comparing a linear combination of the features to a threshold.In two dimensions, a linear classifier is a line. Five examples are shownin Figure 14.8.

These lines have the functional form w1 x1 + w2 x2 = b. Theclassification rule of a linear classifier is to assign a document to c if w1 x1 +w2 x2 > b and to c if w1 x1 + w2 x2 ≤ b. Here, ( x1 , x2 ) T is the two-dimensionalvector representation of the document and (w1 , w2 ) T is the parameter vectorOnline edition (c) 2009 Cambridge UP30214 Vector space classificationA PPLY L INEAR C LASSIFIER (~w, b, ~x )1 score ← ∑iM=1 w i x i2 if score > b3then return 14else return 0◮ Figure 14.9 Linear classification algorithm.that defines (together with b) the decision boundary.

An alternative geometric interpretation of a linear classifier is provided in Figure 15.7 (page 343).We can generalize this 2D linear classifier to higher dimensions by defininga hyperplane as we did in Equation (14.2), repeated here as Equation (14.3):w~ T~x = b(14.3)DECISION HYPERPLANEThe assignment criterion then is: assign to c if w~ T~x > b and to c if w~ T~x ≤ b.We call a hyperplane that we use as a linear classifier a decision hyperplane.The corresponding algorithm for linear classification in M dimensions isshown in Figure 14.9. Linear classification at first seems trivial given thesimplicity of this algorithm.

However, the difficulty is in training the linear classifier, that is, in determining the parameters w~ and b based on thetraining set. In general, some learning methods compute much better parameters than others where our criterion for evaluating the quality of a learningmethod is the effectiveness of the learned linear classifier on new data.We now show that Rocchio and Naive Bayes are linear classifiers. To seethis for Rocchio, observe that a vector ~x is on the decision boundary if it hasequal distance to the two class centroids:(14.4)|~µ(c1 ) − ~x| = |~µ (c2 ) − ~x |Some basic arithmetic shows that this corresponds to a linear classifier withnormal vector w~ = ~µ(c1 ) − ~µ(c2 ) and b = 0.5 ∗ (|~µ(c1 )|2 − |~µ (c2 )|2 ) (Exercise 14.15).We can derive the linearity of Naive Bayes from its decision rule, whichchooses the category c with the largest P̂(c|d) (Figure 13.2, page 260) where:P̂(c|d) ∝ P̂(c)∏1≤k ≤n dP̂(tk |c)and nd is the number of tokens in the document that are part of the vocabulary.

Denoting the complement category as c̄, we obtain for the log odds:(14.5)logP̂(c)P̂(tk |c)P̂(c|d)= log+ ∑ logP̂(c̄|d)P̂(c̄) 1≤k≤ndP̂(tk |c̄)Online edition (c) 2009 Cambridge UP30314.4 Linear versus nonlinear classifierstiprimerateinterestratesdiscountbundesbankwi0.700.670.630.600.460.43d1i010010d2i100000tidlrsworldseesyeargroupdlrwi-0.71-0.35-0.33-0.25-0.24-0.24d1i110000d2i100000◮ Table 14.4 A linear classifier.

The dimensions ti and parameters wi of a linearclassifier for the class interest (as in interest rate) in Reuters-21578. The threshold isb = 0. Terms like dlr and world have negative weights because they are indicators forthe competing class currency.We choose class c if the odds are greater than 1 or, equivalently, if the logodds are greater than 0. It is easy to see that Equation (14.5) is an instanceof Equation (14.3) for wi = log[ P̂(ti |c)/ P̂(ti |c̄)], x i = number of occurrencesof ti in d, and b = − log[ P̂(c)/ P̂(c̄)].

Характеристики

Тип файла

PDF-файл

Размер

6,58 Mb

Материал

An introduction to information retrieval. Manning_ Raghavan (2009).pdf

Тип материала

Книга

Предмет

Анализ текстовых данных и информационный поиск

Высшее учебное заведение

МГУ им. Ломоносова

Список файлов книги

an-introduction-to-information-retrieval.-manning_-raghavan-2009.pdf.rar

An introduction to information retrieval. Manning_ Raghavan (2009).pdf

Поделитесь ссылкой:

Ставлю 10/10
Все нравится, очень удобный сайт, помогает в учебе. Кроме этого, можно заработать самому, выставляя готовые учебные материалы на продажу здесь. Рейтинги и отзывы на преподавателей очень помогают сориентироваться в начале нового семестра. Спасибо за такую функцию. Ставлю максимальную оценку.

Лучшая платформа для успешной сдачи сессии
Познакомился со СтудИзбой благодаря своему другу, очень нравится интерфейс, количество доступных файлов, цена, в общем, все прекрасно. Даже сам продаю какие-то свои работы.

Студизба ван лав ❤
Очень офигенный сайт для студентов. Много полезных учебных материалов. Пользуюсь студизбой с октября 2021 года. Серьёзных нареканий нет. Хотелось бы, что бы ввели подписочную модель и сделали материалы дешевле 300 рублей в рамках подписки бесплатными.

Отличный сайт
Лично меня всё устраивает - и покупка, и продажа; и цены, и возможность предпросмотра куска файла, и обилие бесплатных файлов (в подборках по авторам, читай, ВУЗам и факультетам). Есть определённые баги, но всё решаемо, да и администраторы реагируют в течение суток.

Маленький отзыв о большом помощнике!
Студизба спасает в те моменты, когда сроки горят, а работ накопилось достаточно. Довольно удобный сайт с простой навигацией и огромным количеством материалов.

Студ. Изба как крупнейший сборник работ для студентов
Тут дофига бывает всего полезного. Печально, что бывают предметы по которым даже одного бесплатного решения нет, но это скорее вопрос к студентам. В остальном всё здорово.

Спасательный островок
Если уже не успеваешь разобраться или застрял на каком-то задание поможет тебе быстро и недорого решить твою проблему.

Всё и так отлично
Всё очень удобно. Особенно круто, что есть система бонусов и можно выводить остатки денег. Очень много качественных бесплатных файлов.

Отзыв о системе "Студизба"
Отличная платформа для распространения работ, востребованных студентами. Хорошо налаженная и качественная работа сайта, огромная база заданий и аудитория.

Отличный помощник
Отличный сайт с кучей полезных файлов, позволяющий найти много методичек / учебников / отзывов о вузах и преподователях.

Отлично помогает студентам в любой момент для решения трудных и незамедлительных задач
Хотелось бы больше конкретной информации о преподавателях. А так в принципе хороший сайт, всегда им пользуюсь и ни разу не было желания прекратить. Хороший сайт для помощи студентам, удобный и приятный интерфейс. Из недостатков можно выделить только отсутствия небольшого количества файлов.

Спасибо за шикарный сайт
Великолепный сайт на котором студент за не большие деньги может найти помощь с дз, проектами курсовыми, лабораторными, а также узнать отзывы на преподавателей и бесплатно скачать пособия.