An introduction to information retrieval. Manning_ Raghavan (2009) (811397), страница 93

Файл №811397 An introduction to information retrieval. Manning_ Raghavan (2009) (An introduction to information retrieval. Manning_ Raghavan (2009).pdf) 93 страницаAn introduction to information retrieval. Manning_ Raghavan (2009) (811397) страница 932020-08-252020-08-25СтудИзба

An introduction to information retrieval. Manning_ Raghavan (2009).pdf

Просмтор этого файла доступен только зарегистрированным пользователям. Но у нас супер быстрая регистрация: достаточно только электронной почты!

Регистрация/авторизация

Текст из файла (страница 93)

All four HAC algorithms inthis chapter are Θ( N 2 ) with respect to similarity computations. So the difference in complexity is rarely a concern in practice when choosing one of thealgorithms.Exercise 17.1Show that complete-link clustering creates the two-cluster clustering depicted in Figure 17.7.Group-average agglomerative clusteringGroup-average agglomerative clustering or GAAC (see Figure 17.3, (d)) evaluatescluster quality based on all similarities between documents, thus avoidingthe pitfalls of the single-link and complete-link criteria, which equate clusterOnline edition (c) 2009 Cambridge UP38917.3 Group-average agglomerative clusteringsimilarity with the similarity of a single pair of documents.

GAAC is alsocalled group-average clustering and average-link clustering. GAAC computesthe average similarity SIM - GA of all pairs of documents, including pairs fromthe same cluster. But self-similarities are not included in the average:(17.1)SIM - GA ( ω i , ω j )=1( Ni + Nj )( Ni + Nj − 1) dm ∈∑ω ∪ωi∑d n ∈ ω i ∪ ω j ,d n 6 =d mjd~m · d~nwhere d~ is the length-normalized vector of document d, · denotes the dotproduct, and Ni and Nj are the number of documents in ωi and ω j , respectively.The motivation for GAAC is that our goal in selecting two clusters ωiand ω j as the next merge in HAC is that the resulting merge cluster ωk =ωi ∪ ω j should be coherent. To judge the coherence of ωk , we need to lookat all document-document similarities within ωk , including those that occurwithin ωi and those that occur within ω j .We can compute the measure SIM - GA efficiently because the sum of individual vector similarities is equal to the similarities of their sums:(17.2)∑ ∑dm ∈ωi dn ∈ω j(d~m · d~n ) = (∑dm ∈ωid~m ) · (d~n )∑dn ∈ω jWith (17.2), we have:(17.3)SIM - GA ( ω i , ω j )=1[(d~m )2 − ( Ni + Nj )]( Ni + Nj )( Ni + Nj − 1) dm ∈∑ω ∪ωijThe term ( Ni + Nj ) on the right is the sum of Ni + Nj self-similarities of value1.0.

With this trick we can compute cluster similarity in constant time (assuming we have available the two vector sums ∑dm ∈ωi d~m and ∑dm ∈ω j d~m )instead of in Θ( Ni Nj ). This is important because we need to be able to compute the function SIM on lines 18 and 20 in E FFICIENT HAC (Figure 17.8)in constant time for efficient implementations of GAAC.

Note that for twosingleton clusters, Equation (17.3) is equivalent to the dot product.Equation (17.2) relies on the distributivity of the dot product with respectto vector addition. Since this is crucial for the efficient computation of aGAAC clustering, the method cannot be easily applied to representations ofdocuments that are not real-valued vectors. Also, Equation (17.2) only holdsfor the dot product. While many algorithms introduced in this book havenear-equivalent descriptions in terms of dot product, cosine similarity andEuclidean distance (cf. Section 14.1, page 291), Equation (17.2) can only beexpressed using the dot product. This is a fundamental difference betweensingle-link/complete-link clustering and GAAC.

The first two only require aOnline edition (c) 2009 Cambridge UP39017 Hierarchical clusteringsquare matrix of similarities as input and do not care how these similaritieswere computed.To summarize, GAAC requires (i) documents represented as vectors, (ii)length normalization of vectors, so that self-similarities are 1.0, and (iii) thedot product as the measure of similarity between vectors and sums of vectors.The merge algorithms for GAAC and complete-link clustering are the sameexcept that we use Equation (17.3) as similarity function in Figure 17.8. Therefore, the overall time complexity of GAAC is the same as for complete-linkclustering: Θ( N 2 log N ).

Like complete-link clustering, GAAC is not bestmerge persistent (Exercise 17.6). This means that there is no Θ( N 2 ) algorithmfor GAAC that would be analogous to the Θ( N 2 ) algorithm for single-link inFigure 17.9.We can also define group-average similarity as including self-similarities:(17.4)SIM - GA′ (ωi , ω j ) =1(( Ni+Nj )2 d∑m ∈ωi ∪ω jd~m )2 =1Ni+Nj∑[d~m · ~µ(ωi ∪ ω j )]dm ∈ωi ∪ω jwhere the centroid ~µ (ω ) is defined as in Equation (14.1) (page 292). Thisdefinition is equivalent to the intuitive definition of cluster quality as averagesimilarity of documents d~m to the cluster’s centroid ~µ.Self-similarities are always equal to 1.0, the maximum possible value forlength-normalized vectors. The proportion of self-similarities in Equation (17.4)is i/i2 = 1/i for a cluster of size i.

This gives an unfair advantage to smallclusters since they will have proportionally more self-similarities. For twodocuments d1 , d2 with a similarity s, we have SIM - GA ′ (d1 , d2 ) = (1 + s)/2.In contrast, SIM - GA(d1 , d2 ) = s ≤ (1 + s)/2. This similarity SIM - GA (d1 , d2 )of two documents is the same as in single-link, complete-link and centroidclustering. We prefer the definition in Equation (17.3), which excludes selfsimilarities from the average, because we do not want to penalize large clusters for their smaller proportion of self-similarities and because we want aconsistent similarity value s for document pairs in all four HAC algorithms.?Exercise 17.2Apply group-average clustering to the points in Figures 17.6 and 17.7.

Map them ontothe surface of the unit sphere in a three-dimensional space to get length-normalizedvectors. Is the group-average clustering different from the single-link and completelink clusterings?Online edition (c) 2009 Cambridge UP39117.4 Centroid clustering×5×d1d3bc µ24×3×d2bccb××2d4µ31d5µ100123456d67◮ Figure 17.11 Three iterations of centroid clustering.

Each iteration merges thetwo clusters whose centroids are closest.17.4Centroid clusteringIn centroid clustering, the similarity of two clusters is defined as the similarity of their centroids:(17.5)SIM - CENT ( ω i , ω j )= ~µ (ωi ) · ~µ (ω j )11d~m ) · (= (Ni d ∑Nj∈ωm(17.6)INVERSION=1Ni Nji∑ ∑dm ∈ωi dn ∈ω j∑d~n )dn ∈ω jd~m · d~nEquation (17.5) is centroid similarity. Equation (17.6) shows that centroidsimilarity is equivalent to average similarity of all pairs of documents fromdifferent clusters. Thus, the difference between GAAC and centroid clusteringis that GAAC considers all pairs of documents in computing average pairwise similarity (Figure 17.3, (d)) whereas centroid clustering excludes pairsfrom the same cluster (Figure 17.3, (c)).Figure 17.11 shows the first three steps of a centroid clustering.

The firsttwo iterations form the clusters {d5 , d6 } with centroid µ1 and {d1 , d2 } withcentroid µ2 because the pairs hd5 , d6 i and hd1 , d2 i have the highest centroidsimilarities. In the third iteration, the highest centroid similarity is betweenµ1 and d4 producing the cluster {d4 , d5 , d6 } with centroid µ3 .Like GAAC, centroid clustering is not best-merge persistent and thereforeΘ( N 2 log N ) (Exercise 17.6).In contrast to the other three HAC algorithms, centroid clustering is notmonotonic. So-called inversions can occur: Similarity can increase duringOnline edition (c) 2009 Cambridge UP39217 Hierarchical clustering543210d3×d1×cbd2×0 1 2 3 4 5−4−3−2−10d1d2d3◮ Figure 17.12 Centroid clustering is not monotonic.

The documents d1 at (1 + ǫ, 1),√d2 at (5, 1), and d3 at (3, 1 + 2 3) are almost equidistant, with d1 and d2 closer toeach other than to d3 . The non-monotonic inversion in the hierarchical clusteringof the three points appears as an intersecting merge line in the dendrogram. Theintersection is circled.clustering as in the example in Figure 17.12, where we define similarity asnegative distance. In the first merge, the similarity of d1 and d2 is −(4 − ǫ). Inthe second merge, the similarityof the centroid of d1 and d2 (the circle) and d3√is ≈ − cos(π/6) × 4 = − 3/2 × 4 ≈ −3.46 > −(4 − ǫ). This is an exampleof an inversion: similarity increases in this sequence of two clustering steps.In a monotonic HAC algorithm, similarity is monotonically decreasing fromiteration to iteration.Increasing similarity in a series of HAC clustering steps contradicts thefundamental assumption that small clusters are more coherent than largeclusters.

An inversion in a dendrogram shows up as a horizontal merge linethat is lower than the previous merge line. All merge lines in Figures 17.1and 17.5 are higher than their predecessors because single-link and completelink clustering are monotonic clustering algorithms.Despite its non-monotonicity, centroid clustering is often used because itssimilarity measure – the similarity of two centroids – is conceptually simplerthan the average of all pairwise similarities in GAAC. Figure 17.11 is all oneneeds to understand centroid clustering.

There is no equally simple graphthat would explain how GAAC works.?Exercise 17.3For a fixed set of N documents there are up to N 2 distinct similarities between clustersin single-link and complete-link clustering. How many distinct cluster similarities arethere in GAAC and centroid clustering?Online edition (c) 2009 Cambridge UP39317.5 Optimality of HAC✄17.5Optimality of HACTo state the optimality conditions of hierarchical clustering precisely, we firstdefine the combination similarity COMB - SIM of a clustering Ω = {ω1 , . . . , ωK }as the smallest combination similarity of any of its K clusters:COMB - SIM ({ω1 , . . .

, ω K })OPTIMAL CLUSTERING= min COMB - SIM (ωk )kRecall that the combination similarity of a cluster ω that was created as themerge of ω1 and ω2 is the similarity of ω1 and ω2 (page 378).We then define Ω = {ω1 , . . . , ωK } to be optimal if all clusterings Ω′ with kclusters, k ≤ K, have lower combination similarities:|Ω′ | ≤ |Ω| ⇒ COMB - SIM (Ω′ ) ≤ COMB - SIM (Ω)COMBINATIONSIMILARITYFigure 17.12 shows that centroid clustering is not optimal. The clustering {{d1 , d2 }, {d3 }} (for K = 2) has combination similarity −(4 − ǫ) and{{d1 , d2 , d3 }} (for K = 1) has combination similarity -3.46.

So the clustering {{d1 , d2 }, {d3 }} produced in the first merge is not optimal since there isa clustering with fewer clusters ({{d1 , d2 , d3 }}) that has higher combinationsimilarity. Centroid clustering is not optimal because inversions can occur.The above definition of optimality would be of limited use if it was onlyapplicable to a clustering together with its merge history. However, we canshow (Exercise 17.4) that combination similarity for the three non-inversionalgorithms can be read off from the cluster without knowing its history. Thesedirect definitions of combination similarity are as follows.single-link The combination similarity of a cluster ω is the smallest similarity of any bipartition of the cluster, where the similarity of a bipartition isthe largest similarity between any two documents from the two parts:COMB - SIM ( ω )=minmax max{ ω ′ :ω ′ ⊂ω } d i ∈ ω ′ d j ∈ ω −ω ′SIM ( d i , d j )where each hω ′ , ω − ω ′ i is a bipartition of ω.complete-link The combination similarity of a cluster ω is the smallest similarity of any two points in ω: mindi ∈ω mind j ∈ω SIM (di , d j ).GAAC The combination similarity of a cluster ω is the average of all pairwise similarities in ω (where self-similarities are not included in the average): Equation (17.3).If we use these definitions of combination similarity, then optimality is aproperty of a set of clusters and not of a process that produces a set of clusters.Online edition (c) 2009 Cambridge UP39417 Hierarchical clusteringWe can now prove the optimality of single-link clustering by inductionover the number of clusters K.

We will give a proof for the case where no twopairs of documents have the same similarity, but it can easily be extended tothe case with ties.The inductive basis of the proof is that a clustering with K = N clusters hascombination similarity 1.0, which is the largest value possible. The induction hypothesis is that a single-link clustering ΩK with K clusters is optimal:COMB - SIM ( Ω K ) ≥ COMB - SIM ( Ω ′K ) for all Ω ′K .

Assume for contradiction thatthe clustering ΩK −1 we obtain by merging the two most similar clusters inΩK is not optimal and that instead a different sequence of merges Ω′K , Ω′K −1leads to the optimal clustering with K − 1 clusters. We can write the assumption that Ω′K −1 is optimal and that ΩK −1 is not as COMB - SIM (Ω′K −1 ) >COMB - SIM ( Ω K −1 ).Case 1: The two documents linked by s = COMB - SIM (Ω′K −1 ) are in thesame cluster in ΩK .

Характеристики

Тип файла

PDF-файл

Размер

6,58 Mb

Материал

An introduction to information retrieval. Manning_ Raghavan (2009).pdf

Тип материала

Книга

Предмет

Анализ текстовых данных и информационный поиск

Высшее учебное заведение

МГУ им. Ломоносова

Список файлов книги

an-introduction-to-information-retrieval.-manning_-raghavan-2009.pdf.rar

An introduction to information retrieval. Manning_ Raghavan (2009).pdf

Поделитесь ссылкой:

Ставлю 10/10
Все нравится, очень удобный сайт, помогает в учебе. Кроме этого, можно заработать самому, выставляя готовые учебные материалы на продажу здесь. Рейтинги и отзывы на преподавателей очень помогают сориентироваться в начале нового семестра. Спасибо за такую функцию. Ставлю максимальную оценку.

Лучшая платформа для успешной сдачи сессии
Познакомился со СтудИзбой благодаря своему другу, очень нравится интерфейс, количество доступных файлов, цена, в общем, все прекрасно. Даже сам продаю какие-то свои работы.

Студизба ван лав ❤
Очень офигенный сайт для студентов. Много полезных учебных материалов. Пользуюсь студизбой с октября 2021 года. Серьёзных нареканий нет. Хотелось бы, что бы ввели подписочную модель и сделали материалы дешевле 300 рублей в рамках подписки бесплатными.

Отличный сайт
Лично меня всё устраивает - и покупка, и продажа; и цены, и возможность предпросмотра куска файла, и обилие бесплатных файлов (в подборках по авторам, читай, ВУЗам и факультетам). Есть определённые баги, но всё решаемо, да и администраторы реагируют в течение суток.

Маленький отзыв о большом помощнике!
Студизба спасает в те моменты, когда сроки горят, а работ накопилось достаточно. Довольно удобный сайт с простой навигацией и огромным количеством материалов.

Студ. Изба как крупнейший сборник работ для студентов
Тут дофига бывает всего полезного. Печально, что бывают предметы по которым даже одного бесплатного решения нет, но это скорее вопрос к студентам. В остальном всё здорово.

Спасательный островок
Если уже не успеваешь разобраться или застрял на каком-то задание поможет тебе быстро и недорого решить твою проблему.

Всё и так отлично
Всё очень удобно. Особенно круто, что есть система бонусов и можно выводить остатки денег. Очень много качественных бесплатных файлов.

Отзыв о системе "Студизба"
Отличная платформа для распространения работ, востребованных студентами. Хорошо налаженная и качественная работа сайта, огромная база заданий и аудитория.

Отличный помощник
Отличный сайт с кучей полезных файлов, позволяющий найти много методичек / учебников / отзывов о вузах и преподователях.

Отлично помогает студентам в любой момент для решения трудных и незамедлительных задач
Хотелось бы больше конкретной информации о преподавателях. А так в принципе хороший сайт, всегда им пользуюсь и ни разу не было желания прекратить. Хороший сайт для помощи студентам, удобный и приятный интерфейс. Из недостатков можно выделить только отсутствия небольшого количества файлов.

Спасибо за шикарный сайт
Великолепный сайт на котором студент за не большие деньги может найти помощь с дз, проектами курсовыми, лабораторными, а также узнать отзывы на преподавателей и бесплатно скачать пособия.