An introduction to information retrieval. Manning_ Raghavan (2009) (811397), страница 71

Файл №811397 An introduction to information retrieval. Manning_ Raghavan (2009) (An introduction to information retrieval. Manning_ Raghavan (2009).pdf) 71 страницаAn introduction to information retrieval. Manning_ Raghavan (2009) (811397) страница 712020-08-252020-08-25СтудИзба

An introduction to information retrieval. Manning_ Raghavan (2009).pdf

Просмтор этого файла доступен только зарегистрированным пользователям. Но у нас супер быстрая регистрация: достаточно только электронной почты!

Регистрация/авторизация

Текст из файла (страница 71)

Rocchio classification can beOnline edition (c) 2009 Cambridge UP29514.2 Rocchio classificationT RAIN R OCCHIO (C, D )1 for each c j ∈ C2 do D j ← {d : hd, c j i ∈ D }3~µ j ← | D1 | ∑d∈ D j ~v(d)4jreturn {~µ1 , . . . , ~µ J }A PPLY R OCCHIO ({~µ1 , . . . , ~µ J }, d)1 return arg min j |~µ j − ~v(d)|◮ Figure 14.4 Rocchio classification: Training and testing.aaaaaaaaaaaaaaXaa aaaaaAaaaaaaaaaoabbb bbaX aa aaaaaabBb b b bb b bbbbbb◮ Figure 14.5 The multimodal class “a” consists of two different clusters (smallupper circles centered on X’s).

Rocchio classification will misclassify “o” as “a”because it is closer to the centroid A of the “a” class than to the centroid B of the “b”class.applied to J > 2 classes whereas Rocchio relevance feedback is designed todistinguish only two classes, relevant and nonrelevant.In addition to respecting contiguity, the classes in Rocchio classificationmust be approximate spheres with similar radii. In Figure 14.3, the solidsquare just below the boundary between UK and Kenya is a better fit for theclass UK since UK is more scattered than Kenya. But Rocchio assigns it toKenya because it ignores details of the distribution of points in a class andonly uses distance from the centroid for classification.The assumption of sphericity also does not hold in Figure 14.5.

We can-Online edition (c) 2009 Cambridge UP29614 Vector space classificationmodetrainingtestingtime complexityΘ(|D | Lave + |C ||V |)Θ( La + |C | Ma ) = Θ(|C | Ma )◮ Table 14.2 Training and test times for Rocchio classification. Lave is the averagenumber of tokens per document. La and Ma are the numbers of tokens and types,respectively, in the test document. Computing Euclidean distance between the classcentroids and a document is Θ (|C | Ma ).MULTIMODAL CLASSnot represent the “a” class well with a single prototype because it has twoclusters.

Rocchio often misclassifies this type of multimodal class. A text classification example for multimodality is a country like Burma, which changedits name to Myanmar in 1989. The two clusters before and after the namechange need not be close to each other in space. We also encountered theproblem of multimodality in relevance feedback (Section 9.1.2, page 184).Two-class classification is another case where classes are rarely distributedlike spheres with similar radii.

Most two-class classifiers distinguish betweena class like China that occupies a small region of the space and its widelyscattered complement. Assuming equal radii will result in a large numberof false positives. Most two-class classification problems therefore require amodified decision rule of the form:Assign d to class c iff |~µ (c) − ~v(d)| < |~µ (c) − ~v(d)| − bfor a positive constant b.

As in Rocchio relevance feedback, the centroid ofthe negative documents is often not used at all, so that the decision criterionsimplifies to |~µ (c) − ~v(d)| < b′ for a positive constant b′ .Table 14.2 gives the time complexity of Rocchio classification.2 Adding alldocuments to their respective (unnormalized) centroid is Θ(|D | Lave ) (as opposed to Θ(|D ||V |)) since we need only consider non-zero entries.

Dividingeach vector sum by the size of its class to compute the centroid is Θ(|V |).Overall, training time is linear in the size of the collection (cf. Exercise 13.1).Thus, Rocchio classification and Naive Bayes have the same linear trainingtime complexity.In the next section, we will introduce another vector space classificationmethod, kNN, that deals better with classes that have non-spherical, disconnected or other irregular shapes.?[⋆]Exercise 14.2Show that Rocchio classification can assign a label to a document that is different fromits training set label.2. We write Θ(|D | Lave ) for Θ( T ) and assume that the length of test documents is bounded aswe did on page 262.Online edition (c) 2009 Cambridge UP14.3297k nearest neighborxxxxxxxxxx⋄⋄⋄⋄⋆x⋄⋄⋄⋄⋄⋄⋄◮ Figure 14.6 Voronoi tessellation and decision boundaries (double lines) in 1NNclassification.

The three classes are: X, circle and diamond.14.3k NEAREST NEIGHBORCLASSIFICATIONV ORONOITESSELLATIONk nearest neighborUnlike Rocchio, k nearest neighbor or kNN classification determines the decision boundary locally. For 1NN we assign each document to the class of itsclosest neighbor. For kNN we assign each document to the majority class ofits k closest neighbors where k is a parameter. The rationale of kNN classification is that, based on the contiguity hypothesis, we expect a test documentd to have the same label as the training documents located in the local regionsurrounding d.Decision boundaries in 1NN are concatenated segments of the Voronoi tessellation as shown in Figure 14.6.

The Voronoi tessellation of a set of objectsdecomposes space into Voronoi cells, where each object’s cell consists of allpoints that are closer to the object than to other objects. In our case, the objects are documents. The Voronoi tessellation then partitions the plane into|D | convex polygons, each containing its corresponding document (and noother) as shown in Figure 14.6, where a convex polygon is a convex region in2-dimensional space bounded by lines.For general k ∈ N in kNN, consider the region in the space for which theset of k nearest neighbors is the same.

This again is a convex polygon and thespace is partitioned into convex polygons, within each of which the set of kOnline edition (c) 2009 Cambridge UP29814 Vector space classificationT RAIN - K NN (C, D )1 D ′ ← P REPROCESS (D )2 k ← S ELECT- K(C, D ′ )3 return D ′ , kA PPLY- K NN (C, D ′ , k, d)1 Sk ← C OMPUTE N EAREST N EIGHBORS (D ′ , k, d)2 for each c j ∈ C3 do p j ← |Sk ∩ c j |/k4 return arg maxj p j◮ Figure 14.7 kNN training (with preprocessing) and testing.

p j is an estimate forP (c j | Sk ) = P (c j | d). c j denotes the set of all documents in the class c j .nearest neighbors is invariant (Exercise 14.11).31NN is not very robust. The classification decision of each test documentrelies on the class of a single training document, which may be incorrectlylabeled or atypical. kNN for k > 1 is more robust. It assigns documents tothe majority class of their k closest neighbors, with ties broken randomly.There is a probabilistic version of this kNN classification algorithm.

Wecan estimate the probability of membership in class c as the proportion of thek nearest neighbors in c. Figure 14.6 gives an example for k = 3. Probability estimates for class membership of the star are P̂(circle class|star) = 1/3,P̂(X class|star) = 2/3, and P̂(diamond class|star) = 0. The 3nn estimate( P̂1 (circle class|star) = 1/3) and the 1nn estimate ( P̂1 (circle class|star) = 1)differ with 3nn preferring the X class and 1nn preferring the circle class .The parameter k in kNN is often chosen based on experience or knowledgeabout the classification problem at hand.

It is desirable for k to be odd tomake ties less likely. k = 3 and k = 5 are common choices, but much largervalues between 50 and 100 are also used. An alternative way of setting theparameter is to select the k that gives best results on a held-out portion of thetraining set.We can also weight the “votes” of the k nearest neighbors by their cosine3. The generalization of a polygon to higher dimensions is a polytope. A polytope is a regionin M-dimensional space bounded by ( M − 1)-dimensional hyperplanes. In M dimensions, thedecision boundaries for kNN consist of segments of ( M − 1)-dimensional hyperplanes that formthe Voronoi tessellation into convex polytopes for the training set of documents. The decisioncriterion of assigning a document to the majority class of its k nearest neighbors applies equallyto M = 2 (tessellation into polygons) and M > 2 (tessellation into polytopes).Online edition (c) 2009 Cambridge UP14.3299k nearest neighborkNN with preprocessing of training settraining Θ(|D | Lave )testingΘ( La + |D | Mave Ma ) = Θ(|D | Mave Ma )kNN without preprocessing of training settraining Θ(1)testingΘ( La + |D | Lave Ma ) = Θ(|D | Lave Ma )◮ Table 14.3 Training and test times for kNN classification.

Mave is the average sizeof the vocabulary of documents in the collection.similarity. In this scheme, a class’s score is computed as:score(c, d) =∑Ic (d′ ) cos(~v(d′ ), ~v(d))d ′ ∈ Sk ( d )where Sk (d) is the set of d’s k nearest neighbors and Ic (d′ ) = 1 iff d′ is in classc and 0 otherwise. We then assign the document to the class with the highestscore. Weighting by similarities is often more accurate than simple voting.For example, if two classes have the same number of neighbors in the top k,the class with the more similar neighbors wins.Figure 14.7 summarizes the kNN algorithm.✄✎Example 14.2:14.3.1Time complexity and optimality of kNNThe distances of the test document from the four training documents in Table 14.1 are | d~1 − d~5 | = | d~2 − d~5 | = | d~3 − d~5 | ≈ 1.41 and | d~4 − d~5 | = 0.0.d5 ’s nearest neighbor is therefore d4 and 1NN assigns d5 to d4 ’s class, c.Table 14.3 gives the time complexity of kNN.

Характеристики

Тип файла

PDF-файл

Размер

6,58 Mb

Материал

An introduction to information retrieval. Manning_ Raghavan (2009).pdf

Тип материала

Книга

Предмет

Анализ текстовых данных и информационный поиск

Высшее учебное заведение

МГУ им. Ломоносова

Список файлов книги

an-introduction-to-information-retrieval.-manning_-raghavan-2009.pdf.rar

An introduction to information retrieval. Manning_ Raghavan (2009).pdf

Поделитесь ссылкой:

Ставлю 10/10
Все нравится, очень удобный сайт, помогает в учебе. Кроме этого, можно заработать самому, выставляя готовые учебные материалы на продажу здесь. Рейтинги и отзывы на преподавателей очень помогают сориентироваться в начале нового семестра. Спасибо за такую функцию. Ставлю максимальную оценку.

Лучшая платформа для успешной сдачи сессии
Познакомился со СтудИзбой благодаря своему другу, очень нравится интерфейс, количество доступных файлов, цена, в общем, все прекрасно. Даже сам продаю какие-то свои работы.

Студизба ван лав ❤
Очень офигенный сайт для студентов. Много полезных учебных материалов. Пользуюсь студизбой с октября 2021 года. Серьёзных нареканий нет. Хотелось бы, что бы ввели подписочную модель и сделали материалы дешевле 300 рублей в рамках подписки бесплатными.

Отличный сайт
Лично меня всё устраивает - и покупка, и продажа; и цены, и возможность предпросмотра куска файла, и обилие бесплатных файлов (в подборках по авторам, читай, ВУЗам и факультетам). Есть определённые баги, но всё решаемо, да и администраторы реагируют в течение суток.

Маленький отзыв о большом помощнике!
Студизба спасает в те моменты, когда сроки горят, а работ накопилось достаточно. Довольно удобный сайт с простой навигацией и огромным количеством материалов.

Студ. Изба как крупнейший сборник работ для студентов
Тут дофига бывает всего полезного. Печально, что бывают предметы по которым даже одного бесплатного решения нет, но это скорее вопрос к студентам. В остальном всё здорово.

Спасательный островок
Если уже не успеваешь разобраться или застрял на каком-то задание поможет тебе быстро и недорого решить твою проблему.

Всё и так отлично
Всё очень удобно. Особенно круто, что есть система бонусов и можно выводить остатки денег. Очень много качественных бесплатных файлов.

Отзыв о системе "Студизба"
Отличная платформа для распространения работ, востребованных студентами. Хорошо налаженная и качественная работа сайта, огромная база заданий и аудитория.

Отличный помощник
Отличный сайт с кучей полезных файлов, позволяющий найти много методичек / учебников / отзывов о вузах и преподователях.

Отлично помогает студентам в любой момент для решения трудных и незамедлительных задач
Хотелось бы больше конкретной информации о преподавателях. А так в принципе хороший сайт, всегда им пользуюсь и ни разу не было желания прекратить. Хороший сайт для помощи студентам, удобный и приятный интерфейс. Из недостатков можно выделить только отсутствия небольшого количества файлов.

Спасибо за шикарный сайт
Великолепный сайт на котором студент за не большие деньги может найти помощь с дз, проектами курсовыми, лабораторными, а также узнать отзывы на преподавателей и бесплатно скачать пособия.