An introduction to information retrieval. Manning_ Raghavan (2009) (811397), страница 76

Файл №811397 An introduction to information retrieval. Manning_ Raghavan (2009) (An introduction to information retrieval. Manning_ Raghavan (2009).pdf) 76 страницаAn introduction to information retrieval. Manning_ Raghavan (2009) (811397) страница 762020-08-252020-08-25СтудИзба

An introduction to information retrieval. Manning_ Raghavan (2009).pdf

Просмтор этого файла доступен только зарегистрированным пользователям. Но у нас супер быстрая регистрация: достаточно только электронной почты!

Регистрация/авторизация

Текст из файла (страница 76)

The optimality resultfor 1NN (twice the Bayes error rate asymptotically) is due to Cover and Hart(1967).The effectiveness of Rocchio classification and kNN is highly dependenton careful parameter tuning (in particular, the parameters b′ for Rocchio onpage 296 and k for kNN), feature engineering (Section 15.3, page 334) andfeature selection (Section 13.5, page 271). Buckley and Salton (1995), Schapireet al. (1998), Yang and Kisiel (2003) and Moschitti (2003) address these issuesfor Rocchio and Yang (2001) and Ault and Yang (2002) for kNN. Zavrel et al.(2000) compare feature selection methods for kNN.The bias-variance tradeoff was introduced by Geman et al.

(1992). Thederivation in Section 14.6 is for MSE(γ), but the tradeoff applies to manyloss functions (cf. Friedman (1997), Domingos (2000)). Schütze et al. (1995)and Lewis et al. (1996) discuss linear classifiers for text and Hastie et al. (2001)linear classifiers in general. Readers interested in the algorithms mentioned,but not described in this chapter may wish to consult Bishop (2006) for neural networks, Hastie et al.

(2001) for linear and logistic regression, and Minsky and Papert (1988) for the perceptron algorithm. Anagnostopoulos et al.(2006) show that an inverted index can be used for highly efficient documentclassification with any linear classifier, provided that the classifier is still effective when trained on a modest number of features via feature selection.We have only presented the simplest method for combining two-class classifiers into a one-of classifier. Another important method is the use of errorcorrecting codes, where a vector of decisions of different two-class classifiersis constructed for each document. A test document’s decision vector is then“corrected” based on the distribution of decision vectors in the training set,a procedure that incorporates information from all two-class classifiers andtheir correlations into the final classification decision (Dietterich and Bakiri1995). Ghamrawi and McCallum (2005) also exploit dependencies betweenclasses in any-of classification.

Allwein et al. (2000) propose a general framework for combining two-class classifiers.14.8?ExercisesExercise 14.6In Figure 14.14, which of the three vectors ~a, ~b, and ~c is (i) most similar to ~x accordingto dot product similarity, (ii) most similar to ~x according to cosine similarity, (iii)closest to ~x according to Euclidean distance?Exercise 14.7Download Reuters-21578 and train and test Rocchio and kNN classifiers for the classesacquisitions, corn, crude, earn, grain, interest, money-fx, ship, trade, and wheat. Use theModApte split.

You may want to use one of a number of software packages that im-Online edition (c) 2009 Cambridge UP31614 Vector space classification876543210cbax0 1 2 3 4 5 6 7 8◮ Figure 14.14 Example for differences between Euclidean distance, dot productsimilarity and cosine similarity. The vectors are ~a = (0.5 1.5) T , ~x = (2 2) T , ~b =(4 4) T , and ~c = (8 6) T .plement Rocchio classification and kNN classification, for example, the Bow toolkit(McCallum 1996).Exercise 14.8Download 20 Newgroups (page 154) and train and test Rocchio and kNN classifiersfor its 20 classes.Exercise 14.9Show that the decision boundaries in Rocchio classification are, as in kNN, given bythe Voronoi tessellation.Exercise 14.10[⋆]Computing the distance between a dense centroid and a sparse vector is Θ ( M ) fora naive implementation that iterates over all M dimensions.

Based on the equality∑( xi − µ i )2 = 1.0 + ∑ µ2i − 2 ∑ xi µ i and assuming that ∑ µ2i has been precomputed,write down an algorithm that is Θ ( Ma ) instead, where Ma is the number of distinctterms in the test document.Exercise 14.11[⋆ ⋆ ⋆]Prove that the region of the plane consisting of all points with the same k nearestneighbors is a convex polygon.Exercise 14.12Design an algorithm that performs an efficient 1NN search in 1 dimension (whereefficiency is with respect to the number of documents N). What is the time complexityof the algorithm?Exercise 14.13[⋆ ⋆ ⋆]Design an algorithm that performs an efficient 1NN search in 2 dimensions with atmost polynomial (in N) preprocessing time.Online edition (c) 2009 Cambridge UP31714.8 Exercisesbb◮ Figure 14.15 A simple non-separable set of points.Exercise 14.14[ ⋆ ⋆ ⋆]Can one design an exact efficient algorithm for 1NN for very large M along the ideasyou used to solve the last exercise?Exercise 14.15Show that Equation (14.4) defines a hyperplane with w~ = ~µ(c1 ) − ~µ (c2 ) and b =0.5 ∗ (|~µ(c1 )|2 − |~µ(c2 )|2 ).Exercise 14.16We can easily construct non-separable data sets in high dimensions by embeddinga non-separable set like the one shown in Figure 14.15.

Consider embedding Figure 14.15 in 3D and then perturbing the 4 points slightly (i.e., moving them a smalldistance in a random direction). Why would you expect the resulting configurationto be linearly separable? How likely is then a non-separable set of m ≪ M points inM-dimensional space?Exercise 14.17Assuming two classes, show that the percentage of non-separable assignments of thevertices of a hypercube decreases with dimensionality M for M > 1. For example,for M = 1 the proportion of non-separable assignments is 0, for M = 2, it is 2/16.One of the two non-separable cases for M = 2 is shown in Figure 14.15, the other isits mirror image.

Solve the exercise either analytically or by simulation.Exercise 14.18Although we point out the similarities of Naive Bayes with linear vector space classifiers, it does not make sense to represent count vectors (the document representationsin NB) in a continuous vector space. There is however a formalization of NB that isanalogous to Rocchio. Show that NB assigns a document to the class (represented asa parameter vector) whose Kullback-Leibler (KL) divergence (Section 12.4, page 251)to the document (represented as a count vector as in Section 13.4.1 (page 270), normalized to sum to 1) is smallest.Online edition (c) 2009 Cambridge UPOnline edition (c) 2009 Cambridge UPDRAFT! © April 1, 2009 Cambridge University Press.

Feedback welcome.15319Support vector machines andmachine learning on documentsImproving classifier effectiveness has been an area of intensive machinelearning research over the last two decades, and this work has led to a newgeneration of state-of-the-art classifiers, such as support vector machines,boosted decision trees, regularized logistic regression, neural networks, andrandom forests. Many of these methods, including support vector machines(SVMs), the main topic of this chapter, have been applied with success toinformation retrieval problems, particularly text classification. An SVM is akind of large-margin classifier: it is a vector space based machine learningmethod where the goal is to find a decision boundary between two classesthat is maximally far from any point in the training data (possibly discounting some points as outliers or noise).We will initially motivate and develop SVMs for the case of two-class datasets that are separable by a linear classifier (Section 15.1), and then extend themodel in Section 15.2 to non-separable data, multi-class problems, and nonlinear models, and also present some additional discussion of SVM performance.

The chapter then moves to consider the practical deployment of textclassifiers in Section 15.3: what sorts of classifiers are appropriate when, andhow can you exploit domain-specific text features in classification? Finally,we will consider how the machine learning technology that we have beenbuilding for text classification can be applied back to the problem of learninghow to rank documents in ad hoc retrieval (Section 15.4).

While several machine learning methods have been applied to this task, use of SVMs has beenprominent. Support vector machines are not necessarily better than othermachine learning methods (except perhaps in situations with little trainingdata), but they perform at the state-of-the-art level and have much currenttheoretical and empirical appeal.Online edition (c) 2009 Cambridge UP32015 Support vector machines and machine learning on documentsMaximummargindecisionhyperplaneSupport vectorstututubbbbb btutututubbbMargin ismaximized◮ Figure 15.1 The support vectors are the 5 points right up against the margin ofthe classifier.15.1MARGINSUPPORT VECTORSupport vector machines: The linearly separable caseFor two-class, separable training data sets, such as the one in Figure 14.8(page 301), there are lots of possible linear separators.

Intuitively, a decisionboundary drawn in the middle of the void between data items of the twoclasses seems better than one which approaches very close to examples ofone or both classes. While some learning methods such as the perceptronalgorithm (see references in Section 14.7, page 314) find just any linear separator, others, like Naive Bayes, search for the best linear separator accordingto some criterion. The SVM in particular defines the criterion to be lookingfor a decision surface that is maximally far away from any data point. Thisdistance from the decision surface to the closest data point determines themargin of the classifier.

This method of construction necessarily means thatthe decision function for an SVM is fully specified by a (usually small) subset of the data which defines the position of the separator. These points arereferred to as the support vectors (in a vector space, a point can be thought ofas a vector between the origin and that point). Figure 15.1 shows the marginand support vectors for a sample problem. Other data points play no part indetermining the decision surface that is chosen.Online edition (c) 2009 Cambridge UP15.1 Support vector machines: The linearly separable case321◮ Figure 15.2 An intuition for large-margin classification.

Характеристики

Тип файла

PDF-файл

Размер

6,58 Mb

Материал

An introduction to information retrieval. Manning_ Raghavan (2009).pdf

Тип материала

Книга

Предмет

Анализ текстовых данных и информационный поиск

Высшее учебное заведение

МГУ им. Ломоносова

Список файлов книги

an-introduction-to-information-retrieval.-manning_-raghavan-2009.pdf.rar

An introduction to information retrieval. Manning_ Raghavan (2009).pdf

Поделитесь ссылкой:

Ставлю 10/10
Все нравится, очень удобный сайт, помогает в учебе. Кроме этого, можно заработать самому, выставляя готовые учебные материалы на продажу здесь. Рейтинги и отзывы на преподавателей очень помогают сориентироваться в начале нового семестра. Спасибо за такую функцию. Ставлю максимальную оценку.

Лучшая платформа для успешной сдачи сессии
Познакомился со СтудИзбой благодаря своему другу, очень нравится интерфейс, количество доступных файлов, цена, в общем, все прекрасно. Даже сам продаю какие-то свои работы.

Студизба ван лав ❤
Очень офигенный сайт для студентов. Много полезных учебных материалов. Пользуюсь студизбой с октября 2021 года. Серьёзных нареканий нет. Хотелось бы, что бы ввели подписочную модель и сделали материалы дешевле 300 рублей в рамках подписки бесплатными.

Отличный сайт
Лично меня всё устраивает - и покупка, и продажа; и цены, и возможность предпросмотра куска файла, и обилие бесплатных файлов (в подборках по авторам, читай, ВУЗам и факультетам). Есть определённые баги, но всё решаемо, да и администраторы реагируют в течение суток.

Маленький отзыв о большом помощнике!
Студизба спасает в те моменты, когда сроки горят, а работ накопилось достаточно. Довольно удобный сайт с простой навигацией и огромным количеством материалов.

Студ. Изба как крупнейший сборник работ для студентов
Тут дофига бывает всего полезного. Печально, что бывают предметы по которым даже одного бесплатного решения нет, но это скорее вопрос к студентам. В остальном всё здорово.

Спасательный островок
Если уже не успеваешь разобраться или застрял на каком-то задание поможет тебе быстро и недорого решить твою проблему.

Всё и так отлично
Всё очень удобно. Особенно круто, что есть система бонусов и можно выводить остатки денег. Очень много качественных бесплатных файлов.

Отзыв о системе "Студизба"
Отличная платформа для распространения работ, востребованных студентами. Хорошо налаженная и качественная работа сайта, огромная база заданий и аудитория.

Отличный помощник
Отличный сайт с кучей полезных файлов, позволяющий найти много методичек / учебников / отзывов о вузах и преподователях.

Отлично помогает студентам в любой момент для решения трудных и незамедлительных задач
Хотелось бы больше конкретной информации о преподавателях. А так в принципе хороший сайт, всегда им пользуюсь и ни разу не было желания прекратить. Хороший сайт для помощи студентам, удобный и приятный интерфейс. Из недостатков можно выделить только отсутствия небольшого количества файлов.

Спасибо за шикарный сайт
Великолепный сайт на котором студент за не большие деньги может найти помощь с дз, проектами курсовыми, лабораторными, а также узнать отзывы на преподавателей и бесплатно скачать пособия.