A clique is a set of points that are completely linked with each other. These graph-theoretic interpretations motivate the terms single-link and complete-link clustering. Single-link clusters at step k are maximal sets of points that are linked via at least one link (a single link) of similarity s ≥ sk; complete-link clusters at step k are maximal sets of points that are completely linked with each other via links of similarity s ≥ sk.

Single-link and complete-link clustering reduce the assessment of cluster quality to a single similarity between a pair of documents: the two most similar documents in single-link clustering and the two most dissimilar documents in complete-link clustering.
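This reduction to a single pair is easy to state in code. Here is a minimal sketch, assuming only some pairwise similarity function sim; the function names are ours, not the book's:

    from itertools import product

    def single_link_sim(cluster_a, cluster_b, sim):
        # Single-link: similarity of the two MOST similar members.
        return max(sim(a, b) for a, b in product(cluster_a, cluster_b))

    def complete_link_sim(cluster_a, cluster_b, sim):
        # Complete-link: similarity of the two LEAST similar members.
        return min(sim(a, b) for a, b in product(cluster_a, cluster_b))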
A measurement based on one pair cannot fully reflect the distribution of documents in a cluster. It is therefore not surprising that both algorithms often produce undesirable clusters. Single-link clustering can produce straggling clusters as shown in Figure 17.6. Since the merge criterion is strictly local, a chain of points can be extended for long distances without regard to the overall shape of the emerging cluster. This effect is called chaining.

4. If you are bothered by the possibility of ties, assume that d1 has coordinates (1 + ε, 3 − ε) and that all other points have integer coordinates.

The chaining effect is also apparent in Figure 17.1. The last eleven merges of the single-link clustering (those above the 0.1 line) add on single documents or pairs of documents, corresponding to a chain. The complete-link clustering in Figure 17.5 avoids this problem. Documents are split into two groups of roughly equal size when we cut the dendrogram at the last merge. In general, this is a more useful organization of the data than a clustering with chains.

◮ Figure 17.7 Outliers in complete-link clustering. The five documents have the x-coordinates 1 + 2ε, 4, 5 + 2ε, 6 and 7 − ε. Complete-link clustering creates the two clusters shown as ellipses. The most intuitive two-cluster clustering is {{d1}, {d2, d3, d4, d5}}, but in complete-link clustering, the outlier d1 splits {d2, d3, d4, d5} as shown.
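The Figure 17.7 configuration is small enough to check directly. The sketch below uses SciPy's hierarchical clustering as a stand-in for the book's algorithms (SciPy works with distances rather than the similarities used in this chapter, so complete-link here merges by smallest maximum distance); ε is given a small concrete value:

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage

    eps = 0.01
    # x-coordinates of d1, ..., d5 from Figure 17.7
    x = np.array([1 + 2 * eps, 4.0, 5 + 2 * eps, 6.0, 7 - eps]).reshape(-1, 1)

    Z = linkage(x, method='complete')              # complete-link HAC
    labels = fcluster(Z, t=2, criterion='maxclust')
    print(labels)  # d1 and d2 form one cluster; d3, d4 and d5 the other

Cutting at two clusters groups d2 with the outlier d1, exactly the split the caption describes.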
However, complete-link clustering suffers from a different problem. It pays too much attention to outliers, points that do not fit well into the global structure of the cluster. In the example in Figure 17.7 the four documents d2, d3, d4, d5 are split because of the outlier d1 at the left edge (Exercise 17.1). Complete-link clustering does not find the most intuitive cluster structure in this example.

17.2.1 Time complexity of HAC

The complexity of the naive HAC algorithm in Figure 17.2 is Θ(N³) because we exhaustively scan the N × N matrix C for the largest similarity in each of N − 1 iterations.
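To make the Θ(N³) bound concrete, here is a minimal sketch of the exhaustive-scan loop (our own illustration, not the book's Figure 17.2; a single-link merge update is chosen for concreteness):

    import numpy as np

    def naive_hac(C):
        # Naive HAC: N - 1 merges, each preceded by an exhaustive Theta(N^2)
        # scan of the similarity matrix, giving Theta(N^3) overall.
        # The update shown is single-link; other criteria change only that line.
        C = np.array(C, dtype=float)
        active = list(range(C.shape[0]))
        merges = []
        for _ in range(len(active) - 1):
            # Exhaustive scan for the most similar pair of active clusters.
            i, j = max(((a, b) for a in active for b in active if a < b),
                       key=lambda p: C[p[0], p[1]])
            merges.append((i, j))
            for k in active:
                if k != i and k != j:
                    C[i, k] = C[k, i] = max(C[i, k], C[j, k])
            active.remove(j)
        return merges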
For the four HAC methods discussed in this chapter, a more efficient algorithm is the priority-queue algorithm shown in Figure 17.8. Its time complexity is Θ(N² log N). The rows C[k] of the N × N similarity matrix C are sorted in decreasing order of similarity in the priority queues P. P[k].MAX() then returns the cluster in P[k] that currently has the highest similarity with ωk, where we use ωk to denote the kth cluster as in Chapter 16. After creating the merged cluster of ωk1 and ωk2, ωk1 is used as its representative. The function SIM computes the similarity of potential merge pairs: largest similarity for single-link, smallest similarity for complete-link, average similarity for GAAC (Section 17.3), and centroid similarity for centroid clustering (Section 17.4). We give an example of how a row of C is processed (Figure 17.8, bottom panel). The loop in lines 1–7 is Θ(N²) and the loop in lines 9–21 is Θ(N² log N) for an implementation of priority queues that supports deletion and insertion in Θ(log N). The overall complexity of the algorithm is therefore Θ(N² log N).
EFFICIENTHAC(~d1, . . . , ~dN)
 1 for n ← 1 to N
 2 do for i ← 1 to N
 3    do C[n][i].sim ← ~dn · ~di
 4       C[n][i].index ← i
 5    I[n] ← 1
 6    P[n] ← priority queue for C[n] sorted on sim
 7    P[n].DELETE(C[n][n])  (don't want self-similarities)
 8 A ← []
 9 for k ← 1 to N − 1
10 do k1 ← arg max{k: I[k]=1} P[k].MAX().sim
11    k2 ← P[k1].MAX().index
12    A.APPEND(⟨k1, k2⟩)
13    I[k2] ← 0
14    P[k1] ← []
15    for each i with I[i] = 1 ∧ i ≠ k1
16    do P[i].DELETE(C[i][k1])
17       P[i].DELETE(C[i][k2])
18       C[i][k1].sim ← SIM(i, k1, k2)
19       P[i].INSERT(C[i][k1])
20       C[k1][i].sim ← SIM(i, k1, k2)
21       P[k1].INSERT(C[k1][i])
22 return A

clustering algorithm    SIM(i, k1, k2)
single-link             max(SIM(i, k1), SIM(i, k2))
complete-link           min(SIM(i, k1), SIM(i, k2))
centroid                ((1/Nm)~vm) · ((1/Ni)~vi)
group-average           [(~vm + ~vi)² − (Nm + Ni)] / ((Nm + Ni)(Nm + Ni − 1))

compute C[5]:                                 1: 0.2   2: 0.8   3: 0.6   4: 0.4   5: 1.0
create P[5] (by sorting):                     2: 0.8   3: 0.6   4: 0.4   1: 0.2
merge 2 and 3, update similarity of 2,
delete 3:                                     2: 0.3   4: 0.4   1: 0.2
delete and reinsert 2:                        4: 0.4   2: 0.3   1: 0.2

◮ Figure 17.8 The priority-queue algorithm for HAC. Top: The algorithm. Center: Four different similarity measures. Bottom: An example for processing steps 6 and 16–19. This is a made-up example showing P[5] for a 5 × 5 matrix C.
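A Python transcription of Figure 17.8 is sketched below; the code is ours, not the book's. Python's heapq module has no DELETE operation, so instead of deleting entries (lines 14 and 16–17) this sketch pushes fresh entries and lazily discards stale ones, which preserves the asymptotics as long as each queue operation stays logarithmic:

    import heapq
    import numpy as np

    def efficient_hac(vectors, combine_sim):
        # Priority-queue HAC after Figure 17.8. combine_sim plays the role of
        # SIM: pass max for single-link or min for complete-link (GAAC and
        # centroid similarity need the cluster vector sums as extra state,
        # omitted here).
        V = np.array(vectors, dtype=float)
        C = V @ V.T                                  # dot-product similarities
        N = len(V)
        I = [True] * N                               # is cluster k still active?
        P = []
        for n in range(N):
            heap = [(-C[n, i], i) for i in range(N) if i != n]  # max-heap
            heapq.heapify(heap)
            P.append(heap)
        A = []

        def freshen(k):
            # Lazy deletion: drop entries whose cluster is dead or whose
            # similarity has since been updated, then report the best sim.
            while P[k] and (not I[P[k][0][1]] or -P[k][0][0] != C[k, P[k][0][1]]):
                heapq.heappop(P[k])
            return -P[k][0][0] if P[k] else -np.inf

        for _ in range(N - 1):
            k1 = max((k for k in range(N) if I[k]), key=freshen)  # line 10
            freshen(k1)                              # make sure the top is fresh
            k2 = P[k1][0][1]                         # line 11
            A.append((k1, k2))
            I[k2] = False
            for i in range(N):                       # lines 15-21
                if I[i] and i != k1:
                    C[i, k1] = C[k1, i] = combine_sim(C[i, k1], C[i, k2])
                    heapq.heappush(P[i], (-C[i, k1], k1))
                    heapq.heappush(P[k1], (-C[k1, i], i))
        return A

    # merges = efficient_hac(doc_vectors, max)   # single-link
    # merges = efficient_hac(doc_vectors, min)   # complete-link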
SINGLELINKCLUSTERING(d1, . . . , dN)
 1 for n ← 1 to N
 2 do for i ← 1 to N
 3    do C[n][i].sim ← SIM(dn, di)
 4       C[n][i].index ← i
 5    I[n] ← n
 6    NBM[n] ← arg maxX∈{C[n][i]: n≠i} X.sim
 7 A ← []
 8 for n ← 1 to N − 1
 9 do i1 ← arg max{i: I[i]=i} NBM[i].sim
10    i2 ← I[NBM[i1].index]
11    A.APPEND(⟨i1, i2⟩)
12    for i ← 1 to N
13    do if I[i] = i ∧ i ≠ i1 ∧ i ≠ i2
14       then C[i1][i].sim ← C[i][i1].sim ← max(C[i1][i].sim, C[i2][i].sim)
15       if I[i] = i2
16       then I[i] ← i1
17    NBM[i1] ← arg maxX∈{C[i1][i]: I[i]=i ∧ i≠i1} X.sim
18 return A

◮ Figure 17.9 Single-link clustering algorithm using an NBM array. After merging two clusters i1 and i2, the first one (i1) represents the merged cluster. If I[i] = i, then i is the representative of its current cluster. If I[i] ≠ i, then i has been merged into the cluster represented by I[i] and will therefore be ignored when updating NBM[i1].
In the definition of the function SIM, ~vm and ~vi are the vector sums of ωk1 ∪ ωk2 and ωi, respectively, and Nm and Ni are the numbers of documents in ωk1 ∪ ωk2 and ωi, respectively.

The argument of EFFICIENTHAC in Figure 17.8 is a set of vectors (as opposed to a set of generic documents) because GAAC and centroid clustering (Sections 17.3 and 17.4) require vectors as input. The complete-link version of EFFICIENTHAC can also be applied to documents that are not represented as vectors.
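The two vector-based rows of the SIM table are short enough to state directly. This sketch is ours; v_m is the vector sum of ωk1 ∪ ωk2 and v_i that of ωi, as in the text, and unit-length document vectors are assumed (the book's setting):

    import numpy as np

    def sim_gaac(v_m, N_m, v_i, N_i):
        # Group-average similarity from the Figure 17.8 table: the average
        # pairwise similarity in the merged cluster, excluding self-similarities.
        # With unit-length vectors, the N self-similarities removed by the
        # "- (N_m + N_i)" term are each exactly 1.
        v, N = v_m + v_i, N_m + N_i
        return (np.dot(v, v) - N) / (N * (N - 1))

    def sim_centroid(v_m, N_m, v_i, N_i):
        # Centroid similarity: dot product of the two cluster centroids.
        return np.dot(v_m / N_m, v_i / N_i)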
For single-link, we can introduce a next-best-merge array (NBM) as a further optimization, as shown in Figure 17.9. NBM keeps track of what the best merge is for each cluster. Each of the two top-level for-loops in Figure 17.9 is Θ(N²), so the overall complexity of single-link clustering is Θ(N²).
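A compact Python rendering of Figure 17.9 (ours; array indices replace the book's struct references) makes the Θ(N²) structure visible, since each of the N − 1 iterations does only linear work:

    import numpy as np

    def single_link_nbm(C):
        # Single-link HAC with a next-best-merge array, after Figure 17.9.
        # C is an N x N similarity matrix; the whole run is Theta(N^2).
        C = np.array(C, dtype=float)
        N = C.shape[0]
        np.fill_diagonal(C, -np.inf)            # exclude self-similarities
        I = list(range(N))                      # I[i] == i: i represents its cluster
        NBM = [int(np.argmax(C[n])) for n in range(N)]
        A = []
        for _ in range(N - 1):
            # Theta(N): best merge among current representatives (line 9).
            i1 = max((i for i in range(N) if I[i] == i),
                     key=lambda i: C[i, NBM[i]])
            i2 = I[NBM[i1]]
            A.append((i1, i2))
            for i in range(N):
                if I[i] == i and i != i1 and i != i2:
                    # Line 14: similarity to the merge is the max of the parts.
                    C[i1, i] = C[i, i1] = max(C[i1, i], C[i2, i])
                if I[i] == i2:
                    I[i] = i1               # members of i2 now point to i1
            cand = [i for i in range(N) if I[i] == i and i != i1]
            if cand:                        # line 17: recompute NBM only for i1
                NBM[i1] = max(cand, key=lambda i: C[i1, i])
        return A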
◮ Figure 17.10 Complete-link clustering is not best-merge persistent. At first, d2 is the best-merge cluster for d3. But after merging d1 and d2, d4 becomes d3's best-merge candidate. In a best-merge persistent algorithm like single-link, d3's best-merge cluster would be {d1, d2}.

Can we also speed up the other three HAC algorithms with an NBM array? We cannot, because only single-link clustering is best-merge persistent. Suppose that the best merge cluster for ωk is ωj in single-link clustering. Then after merging ωj with a third cluster ωi ≠ ωk, the merged cluster ωi ∪ ωj will be ωk's best merge cluster (Exercise 17.6). In other words, the best-merge candidate for the merged cluster is one of the two best-merge candidates of its components in single-link clustering. This means that C can be updated in Θ(N) in each iteration, by taking a simple max of two values on line 14 in Figure 17.9 for each of the remaining ≤ N clusters.

Figure 17.10 demonstrates that best-merge persistence does not hold for complete-link clustering, which means that we cannot use an NBM array to speed up clustering. After merging d3's best merge candidate d2 with cluster d1, an unrelated cluster d4 becomes the best merge candidate for d3. This is because the complete-link merge criterion is non-local and can be affected by points at a great distance from the area where two merge candidates meet.

In practice, the efficiency penalty of the Θ(N² log N) algorithm is small compared with the Θ(N²) single-link algorithm, since computing the similarity between two documents (e.g., as a dot product) is an order of magnitude slower than comparing two scalars in sorting.
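To see the failure concretely, the following check recomputes d3's best-merge candidate before and after d1 and d2 merge. The coordinates are our own hypothetical values chosen to reproduce the behaviour described in Figure 17.10; the book does not give exact numbers:

    # Hypothetical 1-D coordinates; d1 and d2 are the closest pair, so under
    # complete-link they merge first.
    points = {'d1': 0.0, 'd2': 1.0, 'd3': 2.5, 'd4': 4.5}

    def complete_link_dist(A, B):
        # Complete-link compares the two FARTHEST members of the two clusters.
        return max(abs(points[a] - points[b]) for a in A for b in B)

    def best_merge(target, clusters):
        others = [c for c in clusters if c != target]
        return min(others, key=lambda c: complete_link_dist(target, c))

    print(best_merge(('d3',), [('d1',), ('d2',), ('d3',), ('d4',)]))  # ('d2',)
    print(best_merge(('d3',), [('d1', 'd2'), ('d3',), ('d4',)]))      # ('d4',)

d3's cached best merge (d2) becomes stale the moment d2 is absorbed into {d1, d2}, which is exactly why an NBM array cannot be maintained for complete-link clustering.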