An introduction to information retrieval. Manning_ Raghavan (2009) (811397), страница 95
Текст из файла (страница 95)
We get a good sense of thedocuments in a cluster from scanning the selected terms.For hierarchical clustering, additional complications arise in cluster labeling. Not only do we need to distinguish an internal node in the tree fromits siblings, but also from its parent and its children. Documents in childnodes are by definition also members of their parent node, so we cannot usea naive differential method to find labels that distinguish the parent fromits children.
However, more complex criteria, based on a combination ofoverall collection frequency and prevalence in a given cluster, can determinewhether a term is a more informative label for a child node or a parent node(see Section 17.9).17.8Implementation notesMost problems that require the computation of a large number of dot products benefit from an inverted index. This is also the case for HAC clustering.Computational savings due to the inverted index are large if there are manyzero similarities – either because many documents do not share any terms orbecause an aggressive stop list is used.In low dimensions, more aggressive optimizations are possible that makethe computation of most pairwise similarities unnecessary (Exercise 17.10).However, no such algorithms are known in higher dimensions.
We encountered the same problem in kNN classification (see Section 14.7, page 314).When using GAAC on a large document set in high dimensions, we haveto take care to avoid dense centroids. For dense centroids, clustering cantake time Θ( MN 2 log N ) where M is the size of the vocabulary, whereascomplete-link clustering is Θ( Mave N 2 log N ) where Mave is the average sizeof the vocabulary of a document. So for large vocabularies complete-linkclustering can be more efficient than an unoptimized implementation of GAAC.We discussed this problem in the context of K-means clustering in Chapter 16 (page 365) and suggested two solutions: truncating centroids (keepingonly highly weighted terms) and representing clusters by means of sparsemedoids instead of dense centroids.
These optimizations can also be appliedto GAAC and centroid clustering.Even with these optimizations, HAC algorithms are all Θ( N 2 ) or Θ( N 2 log N )and therefore infeasible for large sets of 1,000,000 or more documents. Forsuch large sets, HAC can only be used in combination with a flat clusteringalgorithm like K-means. Recall that K-means requires a set of seeds as initialization (Figure 16.5, page 361). If these seeds are badly chosen, then the resulting clustering will be of poor quality. We can employ an HAC algorithmto compute seeds of√high quality. If the HAC algorithm is applied to a document subset of size N, then the overall runtime of K-means cum HAC seedOnline edition (c) 2009 Cambridge UP17.9 References and further readingB UCKSHOTALGORITHM17.9K RUSKAL’ SALGORITHMWARD ’ S METHOD399generation is Θ( N ).√This is because the application of a quadratic algorithmto a sample of size N has an overall complexity of Θ( N ).
An appropriateadjustment can be made for an Θ( N 2 log N ) algorithm to guarantee linearity. This algorithm is referred to as the Buckshot algorithm. It combines thedeterminism and higher reliability of HAC with the efficiency of K-means.References and further readingAn excellent general review of clustering is (Jain et al. 1999). Early referencesfor specific HAC algorithms are (King 1967) (single-link), (Sneath and Sokal1973) (complete-link, GAAC) and (Lance and Williams 1967) (discussing alarge variety of hierarchical clustering algorithms).
The single-link algorithmin Figure 17.9 is similar to Kruskal’s algorithm for constructing a minimumspanning tree. A graph-theoretical proof of the correctness of Kruskal’s algorithm (which is analogous to the proof in Section 17.5) is provided by Cormen et al. (1990, Theorem 23.1). See Exercise 17.5 for the connection betweenminimum spanning trees and single-link clusterings.It is often claimed that hierarchical clustering algorithms produce betterclusterings than flat algorithms (Jain and Dubes (1988, p.
140), Cutting et al.(1992), Larsen and Aone (1999)) although more recently there have been experimental results suggesting the opposite (Zhao and Karypis 2002). Evenwithout a consensus on average behavior, there is no doubt that results ofEM and K-means are highly variable since they will often converge to a localoptimum of poor quality. The HAC algorithms we have presented here aredeterministic and thus more predictable.The complexity of complete-link, group-average and centroid clusteringis sometimes given as Θ( N 2 ) (Day and Edelsbrunner 1984, Voorhees 1985b,Murtagh 1983) because a document similarity computation is an order ofmagnitude more expensive than a simple comparison, the main operationexecuted in the merging steps after the N × N similarity matrix has beencomputed.The centroid algorithm described here is due to Voorhees (1985b).
Voorheesrecommends complete-link and centroid clustering over single-link for a retrieval application. The Buckshot algorithm was originally published by Cutting et al. (1993). Allan et al. (1998) apply single-link clustering to first storydetection.An important HAC technique not discussed here is Ward’s method (WardJr. 1963, El-Hamdouchi and Willett 1986), also called minimum variance clustering. In each step, it selects the merge with the smallest RSS (Chapter 16,page 360). The merge criterion in Ward’s method (a function of all individualdistances from the centroid) is closely related to the merge criterion in GAAC(a function of all individual similarities to the centroid).Online edition (c) 2009 Cambridge UP400SPECTRAL CLUSTERING17 Hierarchical clusteringDespite its importance for making the results of clustering useful, comparatively little work has been done on labeling clusters.
Popescul and Ungar(2000) obtain good results with a combination of χ2 and collection frequencyof a term. Glover et al. (2002b) use information gain for labeling clusters ofweb pages. Stein and zu Eissen’s approach is ontology-based (2004). Themore complex problem of labeling nodes in a hierarchy (which requires distinguishing more general labels for parents from more specific labels for children) is tackled by Glover et al. (2002a) and Treeratpituk and Callan (2006).Some clustering algorithms attempt to find a set of labels first and then build(often overlapping) clusters around the labels, thereby avoiding the problemof labeling altogether (Zamir and Etzioni 1999, Käki 2005, Osiński and Weiss2005).
We know of no comprehensive study that compares the quality ofsuch “label-based” clustering to the clustering algorithms discussed in thischapter and in Chapter 16. In principle, work on multi-document summarization (McKeown and Radev 1995) is also applicable to cluster labeling, butmulti-document summaries are usually longer than the short text fragmentsneeded when labeling clusters (cf. Section 8.7, page 170). Presenting clustersin a way that users can understand is a UI problem. We recommend reading (Baeza-Yates and Ribeiro-Neto 1999, ch. 10) for an introduction to userinterfaces in IR.An example of an efficient divisive algorithm is bisecting K-means (Steinbach et al. 2000).
Spectral clustering algorithms (Kannan et al. 2000, Dhillon2001, Zha et al. 2001, Ng et al. 2001a), including principal direction divisivepartitioning (PDDP) (whose bisecting decisions are based on SVD, see Chapter 18) (Boley 1998, Savaresi and Boley 2004), are computationally more expensive than bisecting K-means, but have the advantage of being deterministic.Unlike K-means and EM, most hierarchical clustering algorithms do nothave a probabilistic interpretation. Model-based hierarchical clustering (Vaithyanathanand Dom 2000, Kamvar et al. 2002, Castro et al. 2004) is an exception.The evaluation methodology described in Section 16.3 (page 356) is alsoapplicable to hierarchical clustering.
Specialized evaluation measures for hierarchies are discussed by Fowlkes and Mallows (1983), Larsen and Aone(1999) and Sahoo et al. (2006).The R environment (R Development Core Team 2005) offers good supportfor hierarchical clustering. The R function hclust implements single-link,complete-link, group-average, and centroid clustering; and Ward’s method.Another option provided is median clustering which represents each clusterby its medoid (cf.
k-medoids in Chapter 16, page 365). Support for clustering vectors in high-dimensional spaces is provided by the software packageCLUTO (http://glaros.dtc.umn.edu/gkhome/views/cluto).Online edition (c) 2009 Cambridge UP17.1017.10?MINIMUM SPANNINGTREE401ExercisesExercisesExercise 17.5A single-link clustering can also be computed from the minimum spanning tree of agraph. The minimum spanning tree connects the vertices of a graph at the smallestpossible cost, where cost is defined as the sum over all edges of the graph. In ourcase the cost of an edge is the distance between two documents. Show that if ∆ k−1 >∆ k > . .
. > ∆1 are the costs of the edges of a minimum spanning tree, then theseedges correspond to the k − 1 merges in constructing a single-link clustering.Exercise 17.6Show that single-link clustering is best-merge persistent and that GAAC and centroidclustering are not best-merge persistent.Exercise 17.7a. Consider running 2-means clustering on a collection with documents from twodifferent languages. What result would you expect?b.
Would you expect the same result when running an HAC algorithm?Exercise 17.8Download Reuters-21578. Keep only documents that are in the classes crude, interest, and grain. Discard documents that are members of more than one of these threeclasses. Compute a (i) single-link, (ii) complete-link, (iii) GAAC, (iv) centroid clustering of the documents.
(v) Cut each dendrogram at the second branch from the top toobtain K = 3 clusters. Compute the Rand index for each of the 4 clusterings. Whichclustering method performs best?Exercise 17.9Suppose a run of HAC finds the clustering with K = 7 to have the highest value onsome prechosen goodness measure of clustering. Have we found the highest-valueclustering among all clusterings with K = 7?Exercise 17.10Consider the task of producing a single-link clustering of N points on a line:×× ××××××××Show that we only need to compute a total of about N similarities.