An introduction to information retrieval. Manning_ Raghavan (2009) (811397), страница 86

Файл №811397 An introduction to information retrieval. Manning_ Raghavan (2009) (An introduction to information retrieval. Manning_ Raghavan (2009).pdf) 86 страницаAn introduction to information retrieval. Manning_ Raghavan (2009) (811397) страница 862020-08-252020-08-25СтудИзба

An introduction to information retrieval. Manning_ Raghavan (2009).pdf

Просмтор этого файла доступен только зарегистрированным пользователям. Но у нас супер быстрая регистрация: достаточно только электронной почты!

Регистрация/авторизация

Текст из файла (страница 86)

The F measurein addition supports differential weighting of these two types of errors.To compute purity, each cluster is assigned to the class which is most frequent in the cluster, and then the accuracy of this assignment is measuredby counting the number of correctly assigned documents and dividing by N.1. An upper bound on the number of clusterings is K N /K!. The exact number of differentpartitions of N documents into K clusters is the Stirling number of the second kind.

Seehttp://mathworld.wolfram.com/StirlingNumberoftheSecondKind.html or Comtet (1974).Online edition (c) 2009 Cambridge UP35716.3 Evaluation of clusteringcluster 1xoxxcluster 2xox xcluster 3oo ⋄x⋄⋄ ⋄ox◮ Figure 16.4 Purity as an external evaluation criterion for cluster quality. Majorityclass and number of members of the majority class for the three clusters are: x, 5(cluster 1); o, 4 (cluster 2); and ⋄, 3 (cluster 3).

Purity is (1/17) × (5 + 4 + 3) ≈ 0.71.lower boundmaximumvalue for Figure 16.4purity0.010.71NMI0.010.36RI0.010.68F50.010.46◮ Table 16.2 The four external evaluation measures applied to the clustering inFigure 16.4.Formally:(16.1)NORMALIZED MUTUALINFORMATIONpurity(Ω, C ) =1N|ωk ∩ c j |∑ maxjkwhere Ω = {ω1 , ω2 , . . . , ωK } is the set of clusters and C = {c1 , c2 , . . . , c J } isthe set of classes.

We interpret ωk as the set of documents in ωk and c j as theset of documents in c j in Equation (16.1).We present an example of how to compute purity in Figure 16.4.2 Badclusterings have purity values close to 0, a perfect clustering has a purity of1. Purity is compared with the other three measures discussed in this chapterin Table 16.2.High purity is easy to achieve when the number of clusters is large – inparticular, purity is 1 if each document gets its own cluster. Thus, we cannotuse purity to trade off the quality of the clustering against the number ofclusters.A measure that allows us to make this tradeoff is normalized mutual infor2.

Recall our note of caution from Figure 14.2 (page 291) when looking at this and other 2Dfigures in this and the following chapter: these illustrations can be misleading because 2D projections of length-normalized vectors distort similarities and distances between points.Online edition (c) 2009 Cambridge UP35816Flat clusteringmation or NMI:(16.2)(16.3)I (Ω; C )[ H (Ω) + H (C )]/2I is mutual information (cf. Chapter 13, page 272):NMI(Ω, C ) =I (Ω; C )=P(ωk ∩ c j )∑ ∑ P(ωk ∩ c j ) log P(ωk ) P(c j )k(16.4)=j∑∑kj|ωk ∩ c j |N |ωk ∩ c j |logN|ωk ||c j |where P(ωk ), P(c j ), and P(ωk ∩ c j ) are the probabilities of a document beingin cluster ωk , class c j , and in the intersection of ωk and c j , respectively.

Equation (16.4) is equivalent to Equation (16.3) for maximum likelihood estimatesof the probabilities (i.e., the estimate of each probability is the correspondingrelative frequency).H is entropy as defined in Chapter 5 (page 99):(16.5)H (Ω)= − ∑ P(ωk ) log P(ωk )k(16.6)= −∑k|ωk ||ω |log kNNwhere, again, the second equation is based on maximum likelihood estimatesof the probabilities.I (Ω; C ) in Equation (16.3) measures the amount of information by whichour knowledge about the classes increases when we are told what the clustersare.

The minimum of I (Ω; C ) is 0 if the clustering is random with respect toclass membership. In that case, knowing that a document is in a particularcluster does not give us any new information about what its class might be.Maximum mutual information is reached for a clustering Ωexact that perfectlyrecreates the classes – but also if clusters in Ωexact are further subdivided intosmaller clusters (Exercise 16.7). In particular, a clustering with K = N onedocument clusters has maximum MI. So MI has the same problem as purity:it does not penalize large cardinalities and thus does not formalize our biasthat, other things being equal, fewer clusters are better.The normalization by the denominator [ H (Ω) + H (C )]/2 in Equation (16.2)fixes this problem since entropy tends to increase with the number of clusters.

For example, H (Ω) reaches its maximum log N for K = N, which ensures that NMI is low for K = N. Because NMI is normalized, we can useit to compare clusterings with different numbers of clusters. The particularform of the denominator is chosen because [ H (Ω) + H (C )]/2 is a tight upperbound on I (Ω; C ) (Exercise 16.8). Thus, NMI is always a number between 0and 1.Online edition (c) 2009 Cambridge UP35916.3 Evaluation of clusteringR AND INDEXRIAn alternative to this information-theoretic interpretation of clustering isto view it as a series of decisions, one for each of the N ( N − 1)/2 pairs ofdocuments in the collection. We want to assign two documents to the samecluster if and only if they are similar. A true positive (TP) decision assignstwo similar documents to the same cluster, a true negative (TN) decision assigns two dissimilar documents to different clusters.

There are two typesof errors we can commit. A false positive (FP) decision assigns two dissimilar documents to the same cluster. A false negative (FN) decision assignstwo similar documents to different clusters. The Rand index (RI) measuresthe percentage of decisions that are correct. That is, it is simply accuracy(Section 8.3, page 155).RI =TP + TNTP + FP + FN + TNAs an example, we compute RI for Figure 16.4.

We first compute TP + FP.The three clusters contain 6, 6, and 5 points, respectively, so the total numberof “positives” or pairs of documents that are in the same cluster is: 665TP + FP =++= 40222Of these, the x pairs in cluster 1, the o pairs in cluster 2, the ⋄ pairs in cluster 3,and the x pair in cluster 3 are true positives: 2345= 20+++TP =2222Thus, FP = 40 − 20 = 20.FN and TN are computed similarly, resulting in the following contingencytable:Same classDifferent classesFMEASURESame clusterTP = 20FP = 20Different clustersFN = 24TN = 72RI is then (20 + 72)/(20 + 20 + 24 + 72) ≈ 0.68.The Rand index gives equal weight to false positives and false negatives.Separating similar documents is sometimes worse than putting pairs of dissimilar documents in the same cluster.

We can use the F measure (Section 8.3,page 154) to penalize false negatives more strongly than false positives byselecting a value β > 1, thus giving more weight to recall.P=TPTP + FPR=TPTP + FNFβ =( β2 + 1) PRβ2 P + ROnline edition (c) 2009 Cambridge UP36016Flat clusteringBased on the numbers in the contingency table, P = 20/40 = 0.5 and R =20/44 ≈ 0.455. This gives us F1 ≈ 0.48 for β = 1 and F5 ≈ 0.456 for β = 5.In information retrieval, evaluating clustering with F has the advantage thatthe measure is already familiar to the research community.?16.4CENTROIDExercise 16.3Replace every point d in Figure 16.4 with two identical copies of d in the same class.(i) Is it less difficult, equally difficult or more difficult to cluster this set of 34 pointsas opposed to the 17 points in Figure 16.4? (ii) Compute purity, NMI, RI, and F5 forthe clustering with 34 points.

Which measures increase and which stay the same afterdoubling the number of points? (iii) Given your assessment in (i) and the results in(ii), which measures are best suited to compare the quality of the two clusterings?K-meansK-means is the most important flat clustering algorithm. Its objective is tominimize the average squared Euclidean distance (Chapter 6, page 131) ofdocuments from their cluster centers where a cluster center is defined as themean or centroid ~µ of the documents in a cluster ω:~µ (ω ) =RESIDUAL SUM OFSQUARESThe definition assumes that documents are represented as length-normalizedvectors in a real-valued space in the familiar way. We used centroids for Rocchio classification in Chapter 14 (page 292).

They play a similar role here.The ideal cluster in K-means is a sphere with the centroid as its center ofgravity. Ideally, the clusters should not overlap. Our desiderata for classesin Rocchio classification were the same. The difference is that we have no labeled training set in clustering for which we know which documents shouldbe in the same cluster.A measure of how well the centroids represent the members of their clusters is the residual sum of squares or RSS, the squared distance of each vectorfrom its centroid summed over all vectors:RSSk =∑~x ∈ ω k(16.7)1~x|ω | ~x∑∈ωRSS =|~x − ~µ(ωk )|2K∑ RSSkk =1RSS is the objective function in K-means and our goal is to minimize it.

SinceN is fixed, minimizing RSS is equivalent to minimizing the average squareddistance, a measure of how well centroids represent their documents.Online edition (c) 2009 Cambridge UP16.4K-means361K- MEANS ({~x1 , . . . , ~x N }, K )1 (~s1 ,~s2 , . . . ,~sK ) ← S ELECT R ANDOM S EEDS({~x1 , .

. . , ~x N }, K )2 for k ← 1 to K3 do ~µk ← ~sk4 while stopping criterion has not been met5 do for k ← 1 to K6do ωk ← {}7for n ← 1 to N8do j ← arg minj′ |~µ j′ − ~xn |9ω j ← ω j ∪ {~xn } (reassignment of vectors)10for k ← 1 to K11do ~µk ← |ω1 | ∑~x∈ωk ~x (recomputation of centroids)k12 return {~µ1 , .

. . , ~µK }◮ Figure 16.5 The K-means algorithm.For most IR applications, the vectors~xn ∈ R M should be length-normalized. Alternative methods of seed selection andinitialization are discussed on page 364.SEEDThe first step of K-means is to select as initial cluster centers K randomlyselected documents, the seeds. The algorithm then moves the cluster centersaround in space in order to minimize RSS. As shown in Figure 16.5, this isdone iteratively by repeating two steps until a stopping criterion is met: reassigning documents to the cluster with the closest centroid; and recomputingeach centroid based on the current members of its cluster. Figure 16.6 showssnapshots from nine iterations of the K-means algorithm for a set of points.The “centroid” column of Table 17.2 (page 397) shows examples of centroids.We can apply one of the following termination conditions.• A fixed number of iterations I has been completed.

Характеристики

Тип файла

PDF-файл

Размер

6,58 Mb

Материал

An introduction to information retrieval. Manning_ Raghavan (2009).pdf

Тип материала

Книга

Предмет

Анализ текстовых данных и информационный поиск

Высшее учебное заведение

МГУ им. Ломоносова

Список файлов книги

an-introduction-to-information-retrieval.-manning_-raghavan-2009.pdf.rar

An introduction to information retrieval. Manning_ Raghavan (2009).pdf

Поделитесь ссылкой:

Ставлю 10/10
Все нравится, очень удобный сайт, помогает в учебе. Кроме этого, можно заработать самому, выставляя готовые учебные материалы на продажу здесь. Рейтинги и отзывы на преподавателей очень помогают сориентироваться в начале нового семестра. Спасибо за такую функцию. Ставлю максимальную оценку.

Лучшая платформа для успешной сдачи сессии
Познакомился со СтудИзбой благодаря своему другу, очень нравится интерфейс, количество доступных файлов, цена, в общем, все прекрасно. Даже сам продаю какие-то свои работы.

Студизба ван лав ❤
Очень офигенный сайт для студентов. Много полезных учебных материалов. Пользуюсь студизбой с октября 2021 года. Серьёзных нареканий нет. Хотелось бы, что бы ввели подписочную модель и сделали материалы дешевле 300 рублей в рамках подписки бесплатными.

Отличный сайт
Лично меня всё устраивает - и покупка, и продажа; и цены, и возможность предпросмотра куска файла, и обилие бесплатных файлов (в подборках по авторам, читай, ВУЗам и факультетам). Есть определённые баги, но всё решаемо, да и администраторы реагируют в течение суток.

Маленький отзыв о большом помощнике!
Студизба спасает в те моменты, когда сроки горят, а работ накопилось достаточно. Довольно удобный сайт с простой навигацией и огромным количеством материалов.

Студ. Изба как крупнейший сборник работ для студентов
Тут дофига бывает всего полезного. Печально, что бывают предметы по которым даже одного бесплатного решения нет, но это скорее вопрос к студентам. В остальном всё здорово.

Спасательный островок
Если уже не успеваешь разобраться или застрял на каком-то задание поможет тебе быстро и недорого решить твою проблему.

Всё и так отлично
Всё очень удобно. Особенно круто, что есть система бонусов и можно выводить остатки денег. Очень много качественных бесплатных файлов.

Отзыв о системе "Студизба"
Отличная платформа для распространения работ, востребованных студентами. Хорошо налаженная и качественная работа сайта, огромная база заданий и аудитория.

Отличный помощник
Отличный сайт с кучей полезных файлов, позволяющий найти много методичек / учебников / отзывов о вузах и преподователях.

Отлично помогает студентам в любой момент для решения трудных и незамедлительных задач
Хотелось бы больше конкретной информации о преподавателях. А так в принципе хороший сайт, всегда им пользуюсь и ни разу не было желания прекратить. Хороший сайт для помощи студентам, удобный и приятный интерфейс. Из недостатков можно выделить только отсутствия небольшого количества файлов.

Спасибо за шикарный сайт
Великолепный сайт на котором студент за не большие деньги может найти помощь с дз, проектами курсовыми, лабораторными, а также узнать отзывы на преподавателей и бесплатно скачать пособия.