A clique is a set of points that are completely linked with each other. These graph-theoretic interpretations motivate the terms single-link and complete-link clustering. Single-link clusters at step k are maximal sets of points that are linked via at least one link (a single link) of similarity s ≥ sk; complete-link clusters at step k are maximal sets of points that are completely linked with each other via links of similarity s ≥ sk.

Single-link and complete-link clustering reduce the assessment of cluster quality to a single similarity between a pair of documents: the two most similar documents in single-link clustering and the two most dissimilar documents in complete-link clustering.
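This reduction to a single pair is easy to state in code. Here is a minimal sketch, assuming only some pairwise similarity function sim; the function names are ours, not the book's:

    from itertools import product

    def single_link_sim(cluster_a, cluster_b, sim):
        # Single-link: similarity of the two MOST similar members.
        return max(sim(a, b) for a, b in product(cluster_a, cluster_b))

    def complete_link_sim(cluster_a, cluster_b, sim):
        # Complete-link: similarity of the two LEAST similar members.
        return min(sim(a, b) for a, b in product(cluster_a, cluster_b))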
A measurement based on one pair cannot fully reflect the distribution of documents in a cluster. It is therefore not surprising that both algorithms often produce undesirable clusters. Single-link clustering can produce straggling clusters as shown in Figure 17.6. Since the merge criterion is strictly local, a chain of points can be extended for long distances without regard to the overall shape of the emerging cluster. This effect is called chaining.

4. If you are bothered by the possibility of ties, assume that d1 has coordinates (1 + ε, 3 − ε) and that all other points have integer coordinates.

The chaining effect is also apparent in Figure 17.1. The last eleven merges of the single-link clustering (those above the 0.1 line) add on single documents or pairs of documents, corresponding to a chain. The complete-link clustering in Figure 17.5 avoids this problem. Documents are split into two groups of roughly equal size when we cut the dendrogram at the last merge. In general, this is a more useful organization of the data than a clustering with chains.

◮ Figure 17.7 Outliers in complete-link clustering. The five documents have the x-coordinates 1 + 2ε, 4, 5 + 2ε, 6 and 7 − ε. Complete-link clustering creates the two clusters shown as ellipses. The most intuitive two-cluster clustering is {{d1}, {d2, d3, d4, d5}}, but in complete-link clustering, the outlier d1 splits {d2, d3, d4, d5} as shown.
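The Figure 17.7 configuration is small enough to check directly. The sketch below uses SciPy's hierarchical clustering as a stand-in for the book's algorithms (SciPy works with distances rather than the similarities used in this chapter, so complete-link here merges by smallest maximum distance); ε is given a small concrete value:

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage

    eps = 0.01
    # x-coordinates of d1, ..., d5 from Figure 17.7
    x = np.array([1 + 2 * eps, 4.0, 5 + 2 * eps, 6.0, 7 - eps]).reshape(-1, 1)

    Z = linkage(x, method='complete')              # complete-link HAC
    labels = fcluster(Z, t=2, criterion='maxclust')
    print(labels)  # d1 and d2 form one cluster; d3, d4 and d5 the other

Cutting at two clusters groups d2 with the outlier d1, exactly the split the caption describes.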
However, complete-link clustering suffers from a different problem. It pays too much attention to outliers, points that do not fit well into the global structure of the cluster. In the example in Figure 17.7 the four documents d2, d3, d4, d5 are split because of the outlier d1 at the left edge (Exercise 17.1). Complete-link clustering does not find the most intuitive cluster structure in this example.

17.2.1 Time complexity of HAC

The complexity of the naive HAC algorithm in Figure 17.2 is Θ(N³) because we exhaustively scan the N × N matrix C for the largest similarity in each of N − 1 iterations.
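To make the Θ(N³) bound concrete, here is a minimal sketch of the exhaustive-scan loop (our own illustration, not the book's Figure 17.2; a single-link merge update is chosen for concreteness):

    import numpy as np

    def naive_hac(C):
        # Naive HAC: N - 1 merges, each preceded by an exhaustive Theta(N^2)
        # scan of the similarity matrix, giving Theta(N^3) overall.
        # The update shown is single-link; other criteria change only that line.
        C = np.array(C, dtype=float)
        active = list(range(C.shape[0]))
        merges = []
        for _ in range(len(active) - 1):
            # Exhaustive scan for the most similar pair of active clusters.
            i, j = max(((a, b) for a in active for b in active if a < b),
                       key=lambda p: C[p[0], p[1]])
            merges.append((i, j))
            for k in active:
                if k != i and k != j:
                    C[i, k] = C[k, i] = max(C[i, k], C[j, k])
            active.remove(j)
        return merges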
For the four HAC methods discussed in this chapter, a more efficient algorithm is the priority-queue algorithm shown in Figure 17.8. Its time complexity is Θ(N² log N). The rows C[k] of the N × N similarity matrix C are sorted in decreasing order of similarity in the priority queues P. P[k].MAX() then returns the cluster in P[k] that currently has the highest similarity with ωk, where we use ωk to denote the kth cluster as in Chapter 16. After creating the merged cluster of ωk1 and ωk2, ωk1 is used as its representative. The function SIM computes the similarity of potential merge pairs: largest similarity for single-link, smallest similarity for complete-link, average similarity for GAAC (Section 17.3), and centroid similarity for centroid clustering (Section 17.4). We give an example of how a row of C is processed (Figure 17.8, bottom panel). The loop in lines 1–7 is Θ(N²) and the loop in lines 9–21 is Θ(N² log N) for an implementation of priority queues that supports deletion and insertion in Θ(log N). The overall complexity of the algorithm is therefore Θ(N² log N).
EFFICIENTHAC(~d1, . . . , ~dN)
 1 for n ← 1 to N
 2 do for i ← 1 to N
 3    do C[n][i].sim ← ~dn · ~di
 4       C[n][i].index ← i
 5    I[n] ← 1
 6    P[n] ← priority queue for C[n] sorted on sim
 7    P[n].DELETE(C[n][n])  (don't want self-similarities)
 8 A ← []
 9 for k ← 1 to N − 1
10 do k1 ← arg max{k: I[k]=1} P[k].MAX().sim
11    k2 ← P[k1].MAX().index
12    A.APPEND(⟨k1, k2⟩)
13    I[k2] ← 0
14    P[k1] ← []
15    for each i with I[i] = 1 ∧ i ≠ k1
16    do P[i].DELETE(C[i][k1])
17       P[i].DELETE(C[i][k2])
18       C[i][k1].sim ← SIM(i, k1, k2)
19       P[i].INSERT(C[i][k1])
20       C[k1][i].sim ← SIM(i, k1, k2)
21       P[k1].INSERT(C[k1][i])
22 return A

clustering algorithm    SIM(i, k1, k2)
single-link             max(SIM(i, k1), SIM(i, k2))
complete-link           min(SIM(i, k1), SIM(i, k2))
centroid                ((1/Nm)~vm) · ((1/Ni)~vi)
group-average           [(~vm + ~vi)² − (Nm + Ni)] / ((Nm + Ni)(Nm + Ni − 1))

compute C[5]:                                 1: 0.2   2: 0.8   3: 0.6   4: 0.4   5: 1.0
create P[5] (by sorting):                     2: 0.8   3: 0.6   4: 0.4   1: 0.2
merge 2 and 3, update similarity of 2,
delete 3:                                     2: 0.3   4: 0.4   1: 0.2
delete and reinsert 2:                        4: 0.4   2: 0.3   1: 0.2

◮ Figure 17.8 The priority-queue algorithm for HAC. Top: The algorithm. Center: Four different similarity measures. Bottom: An example for processing steps 6 and 16–19. This is a made-up example showing P[5] for a 5 × 5 matrix C.
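A Python transcription of Figure 17.8 is sketched below; the code is ours, not the book's. Python's heapq module has no DELETE operation, so instead of deleting entries (lines 14 and 16–17) this sketch pushes fresh entries and lazily discards stale ones, which preserves the asymptotics as long as each queue operation stays logarithmic:

    import heapq
    import numpy as np

    def efficient_hac(vectors, combine_sim):
        # Priority-queue HAC after Figure 17.8. combine_sim plays the role of
        # SIM: pass max for single-link or min for complete-link (GAAC and
        # centroid similarity need the cluster vector sums as extra state,
        # omitted here).
        V = np.array(vectors, dtype=float)
        C = V @ V.T                                  # dot-product similarities
        N = len(V)
        I = [True] * N                               # is cluster k still active?
        P = []
        for n in range(N):
            heap = [(-C[n, i], i) for i in range(N) if i != n]  # max-heap
            heapq.heapify(heap)
            P.append(heap)
        A = []

        def freshen(k):
            # Lazy deletion: drop entries whose cluster is dead or whose
            # similarity has since been updated, then report the best sim.
            while P[k] and (not I[P[k][0][1]] or -P[k][0][0] != C[k, P[k][0][1]]):
                heapq.heappop(P[k])
            return -P[k][0][0] if P[k] else -np.inf

        for _ in range(N - 1):
            k1 = max((k for k in range(N) if I[k]), key=freshen)  # line 10
            freshen(k1)                              # make sure the top is fresh
            k2 = P[k1][0][1]                         # line 11
            A.append((k1, k2))
            I[k2] = False
            for i in range(N):                       # lines 15-21
                if I[i] and i != k1:
                    C[i, k1] = C[k1, i] = combine_sim(C[i, k1], C[i, k2])
                    heapq.heappush(P[i], (-C[i, k1], k1))
                    heapq.heappush(P[k1], (-C[k1, i], i))
        return A

    # merges = efficient_hac(doc_vectors, max)   # single-link
    # merges = efficient_hac(doc_vectors, min)   # complete-link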
SINGLELINKCLUSTERING(d1, . . . , dN)
 1 for n ← 1 to N
 2 do for i ← 1 to N
 3    do C[n][i].sim ← SIM(dn, di)
 4       C[n][i].index ← i
 5    I[n] ← n
 6    NBM[n] ← arg maxX∈{C[n][i]: n≠i} X.sim
 7 A ← []
 8 for n ← 1 to N − 1
 9 do i1 ← arg max{i: I[i]=i} NBM[i].sim
10    i2 ← I[NBM[i1].index]
11    A.APPEND(⟨i1, i2⟩)
12    for i ← 1 to N
13    do if I[i] = i ∧ i ≠ i1 ∧ i ≠ i2
14       then C[i1][i].sim ← C[i][i1].sim ← max(C[i1][i].sim, C[i2][i].sim)
15       if I[i] = i2
16       then I[i] ← i1
17    NBM[i1] ← arg maxX∈{C[i1][i]: I[i]=i ∧ i≠i1} X.sim
18 return A

◮ Figure 17.9 Single-link clustering algorithm using an NBM array. After merging two clusters i1 and i2, the first one (i1) represents the merged cluster. If I[i] = i, then i is the representative of its current cluster. If I[i] ≠ i, then i has been merged into the cluster represented by I[i] and will therefore be ignored when updating NBM[i1].
In the definition of the function SIM, ~vm and ~vi are the vector sums of ωk1 ∪ ωk2 and ωi, respectively, and Nm and Ni are the numbers of documents in ωk1 ∪ ωk2 and ωi, respectively.

The argument of EFFICIENTHAC in Figure 17.8 is a set of vectors (as opposed to a set of generic documents) because GAAC and centroid clustering (Sections 17.3 and 17.4) require vectors as input. The complete-link version of EFFICIENTHAC can also be applied to documents that are not represented as vectors.
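The two vector-based rows of the SIM table are short enough to state directly. This sketch is ours; v_m is the vector sum of ωk1 ∪ ωk2 and v_i that of ωi, as in the text, and unit-length document vectors are assumed (the book's setting):

    import numpy as np

    def sim_gaac(v_m, N_m, v_i, N_i):
        # Group-average similarity from the Figure 17.8 table: the average
        # pairwise similarity in the merged cluster, excluding self-similarities.
        # With unit-length vectors, the N self-similarities removed by the
        # "- (N_m + N_i)" term are each exactly 1.
        v, N = v_m + v_i, N_m + N_i
        return (np.dot(v, v) - N) / (N * (N - 1))

    def sim_centroid(v_m, N_m, v_i, N_i):
        # Centroid similarity: dot product of the two cluster centroids.
        return np.dot(v_m / N_m, v_i / N_i)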
For single-link, we can introduce a next-best-merge array (NBM) as a further optimization, as shown in Figure 17.9. NBM keeps track of what the best merge is for each cluster. Each of the two top-level for-loops in Figure 17.9 is Θ(N²), so the overall complexity of single-link clustering is Θ(N²).
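A compact Python rendering of Figure 17.9 (ours; array indices replace the book's struct references) makes the Θ(N²) structure visible, since each of the N − 1 iterations does only linear work:

    import numpy as np

    def single_link_nbm(C):
        # Single-link HAC with a next-best-merge array, after Figure 17.9.
        # C is an N x N similarity matrix; the whole run is Theta(N^2).
        C = np.array(C, dtype=float)
        N = C.shape[0]
        np.fill_diagonal(C, -np.inf)            # exclude self-similarities
        I = list(range(N))                      # I[i] == i: i represents its cluster
        NBM = [int(np.argmax(C[n])) for n in range(N)]
        A = []
        for _ in range(N - 1):
            # Theta(N): best merge among current representatives (line 9).
            i1 = max((i for i in range(N) if I[i] == i),
                     key=lambda i: C[i, NBM[i]])
            i2 = I[NBM[i1]]
            A.append((i1, i2))
            for i in range(N):
                if I[i] == i and i != i1 and i != i2:
                    # Line 14: similarity to the merge is the max of the parts.
                    C[i1, i] = C[i, i1] = max(C[i1, i], C[i2, i])
                if I[i] == i2:
                    I[i] = i1               # members of i2 now point to i1
            cand = [i for i in range(N) if I[i] == i and i != i1]
            if cand:                        # line 17: recompute NBM only for i1
                NBM[i1] = max(cand, key=lambda i: C[i1, i])
        return A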
◮ Figure 17.10 Complete-link clustering is not best-merge persistent. At first, d2 is the best-merge cluster for d3. But after merging d1 and d2, d4 becomes d3's best-merge candidate. In a best-merge persistent algorithm like single-link, d3's best-merge cluster would be {d1, d2}.

Can we also speed up the other three HAC algorithms with an NBM array? We cannot, because only single-link clustering is best-merge persistent. Suppose that the best merge cluster for ωk is ωj in single-link clustering. Then after merging ωj with a third cluster ωi ≠ ωk, the merged cluster ωi ∪ ωj will be ωk's best merge cluster (Exercise 17.6). In other words, the best-merge candidate for the merged cluster is one of the two best-merge candidates of its components in single-link clustering. This means that C can be updated in Θ(N) in each iteration, by taking a simple max of two values on line 14 in Figure 17.9 for each of the remaining ≤ N clusters.

Figure 17.10 demonstrates that best-merge persistence does not hold for complete-link clustering, which means that we cannot use an NBM array to speed up clustering. After merging d3's best merge candidate d2 with cluster d1, an unrelated cluster d4 becomes the best merge candidate for d3. This is because the complete-link merge criterion is non-local and can be affected by points at a great distance from the area where two merge candidates meet.

In practice, the efficiency penalty of the Θ(N² log N) algorithm is small compared with the Θ(N²) single-link algorithm, since computing the similarity between two documents (e.g., as a dot product) is an order of magnitude slower than comparing two scalars in sorting.
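To see the failure concretely, the following check recomputes d3's best-merge candidate before and after d1 and d2 merge. The coordinates are our own hypothetical values chosen to reproduce the behaviour described in Figure 17.10; the book does not give exact numbers:

    # Hypothetical 1-D coordinates; d1 and d2 are the closest pair, so under
    # complete-link they merge first.
    points = {'d1': 0.0, 'd2': 1.0, 'd3': 2.5, 'd4': 4.5}

    def complete_link_dist(A, B):
        # Complete-link compares the two FARTHEST members of the two clusters.
        return max(abs(points[a] - points[b]) for a in A for b in B)

    def best_merge(target, clusters):
        others = [c for c in clusters if c != target]
        return min(others, key=lambda c: complete_link_dist(target, c))

    print(best_merge(('d3',), [('d1',), ('d2',), ('d3',), ('d4',)]))  # ('d2',)
    print(best_merge(('d3',), [('d1', 'd2'), ('d3',), ('d4',)]))      # ('d4',)

d3's cached best merge (d2) becomes stale the moment d2 is absorbed into {d1, d2}, which is exactly why an NBM array cannot be maintained for complete-link clustering.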