An Introduction to Information Retrieval. Manning, Raghavan, Schütze (2009)
Fayyad et al. (1998) propose an initialization for EM clustering. For algorithms that can cluster very large data sets in one scan through the data see Bradley et al. (1998). The applications in Table 16.1 all cluster documents. Other information retrieval applications cluster words (e.g., Crouch 1988), contexts of words (e.g., Schütze and Pedersen 1995) or words and documents simultaneously (e.g., Tishby and Slonim 2000, Dhillon 2001, Zha et al. 2001). Simultaneous clustering of words and documents is an example of co-clustering or biclustering.
16.7 Exercises

Exercise 16.7
Let Ω be a clustering that exactly reproduces a class structure C and Ω′ a clustering that further subdivides some clusters in Ω. Show that I(Ω; C) = I(Ω′; C).

Exercise 16.8
Show that I(Ω; C) ≤ [H(Ω) + H(C)]/2.

Exercise 16.9
Mutual information is symmetric in the sense that its value does not change if the roles of clusters and classes are switched: I(Ω; C) = I(C; Ω). Which of the other three evaluation measures are symmetric in this sense?
Exercise 16.10
Compute RSS for the two clusterings in Figure 16.7.

Exercise 16.11
(i) Give an example of a set of points and three initial centroids (which need not be members of the set of points) for which 3-means converges to a clustering with an empty cluster. (ii) Can a clustering with an empty cluster be the global optimum with respect to RSS?

Exercise 16.12
Download Reuters-21578. Discard documents that do not occur in one of the 10 classes acquisitions, corn, crude, earn, grain, interest, money-fx, ship, trade, and wheat. Discard documents that occur in two of these 10 classes. (i) Compute a K-means clustering of this subset into 10 clusters. There are a number of software packages that implement K-means, such as WEKA (Witten and Frank 2005) and R (R Development Core Team 2005). (ii) Compute purity, normalized mutual information, F1 and RI for the clustering with respect to the 10 classes. (iii) Compile a confusion matrix (Table 14.5, page 308) for the 10 classes and 10 clusters. Identify classes that give rise to false positives and false negatives.
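For steps (i) and (ii), the sketch below shows one possible workflow in Python with scikit-learn (an alternative to the WEKA and R packages mentioned in the exercise). It is only a sketch: load_reuters_subset is a hypothetical loader for the filtered Reuters-21578 documents and their gold classes, and the exact scores depend on preprocessing and initialization.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans
    from sklearn.metrics import normalized_mutual_info_score
    from sklearn.metrics.cluster import contingency_matrix

    docs, labels = load_reuters_subset()            # hypothetical loader (see above)
    X = TfidfVectorizer().fit_transform(docs)       # tf-idf document vectors
    clusters = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X)

    # Class-by-cluster counts; this table also serves as the confusion matrix in (iii).
    cont = contingency_matrix(labels, clusters)
    purity = cont.max(axis=0).sum() / cont.sum()    # majority class count per cluster
    nmi = normalized_mutual_info_score(labels, clusters)
    print(purity, nmi)
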
Exercise 16.13
Prove that RSS_min(K) is monotonically decreasing in K.

Exercise 16.14
There is a soft version of K-means that computes the fractional membership of a document in a cluster as a monotonically decreasing function of the distance ∆ from its centroid, e.g., as e^(−∆). Modify the reassignment and recomputation steps of hard K-means for this soft version.
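One possible shape of such a soft version, as a minimal NumPy sketch (not the book's solution; it assumes Euclidean distance, an N×M matrix X of document vectors and a K×M matrix of centroids):

    import numpy as np

    def soft_kmeans_step(X, centroids):
        # Soft reassignment: membership of document i in cluster k is
        # proportional to exp(-distance(x_i, centroid_k)).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        memberships = np.exp(-dists)
        memberships /= memberships.sum(axis=1, keepdims=True)
        # Soft recomputation: each centroid becomes the membership-weighted
        # average of all documents, not just of its "own" documents.
        centroids = (memberships.T @ X) / memberships.sum(axis=0)[:, None]
        return memberships, centroids

Iterating this step until the centroids stop moving parallels the two alternating steps of hard K-means.
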
Exercise 16.15
In the last iteration in Table 16.3, document 6 is in cluster 2 even though it was the initial seed for cluster 1. Why does the document change membership?

Exercise 16.16
The values of the parameters q_mk in iteration 25 in Table 16.3 are rounded. What are the exact values that EM will converge to?

Exercise 16.17
Perform a K-means clustering for the documents in Table 16.3. After how many iterations does K-means converge? Compare the result with the EM clustering in Table 16.3 and discuss the differences.

Exercise 16.18 [⋆⋆⋆]
Modify the expectation and maximization steps of EM for a Gaussian mixture. The maximization step computes the maximum likelihood parameter estimates α_k, μ_k, and Σ_k for each of the clusters. The expectation step computes for each vector a soft assignment to clusters (Gaussians) based on their current parameters. Write down the equations for Gaussian mixtures corresponding to Equations (16.16) and (16.17).
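The sketch below (not the book's derivation) shows the shape these two steps take in code for a Gaussian mixture, using SciPy's multivariate normal density; the two commented blocks correspond to the standard maximum likelihood updates for such a mixture.

    import numpy as np
    from scipy.stats import multivariate_normal

    def em_step_gmm(X, alphas, means, covs):
        N, K = X.shape[0], len(alphas)
        # Expectation step: soft assignment of each vector to each Gaussian,
        # proportional to the prior alpha_k times the density N(x; mu_k, Sigma_k).
        resp = np.column_stack([
            alphas[k] * multivariate_normal.pdf(X, means[k], covs[k])
            for k in range(K)])
        resp /= resp.sum(axis=1, keepdims=True)
        # Maximization step: maximum likelihood re-estimates of alpha_k, mu_k, Sigma_k
        # from the soft assignments.
        Nk = resp.sum(axis=0)
        alphas = Nk / N
        means = (resp.T @ X) / Nk[:, None]
        covs = [(resp[:, k, None] * (X - means[k])).T @ (X - means[k]) / Nk[k]
                for k in range(K)]
        return alphas, means, covs
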
Exercise 16.19 [⋆⋆⋆]
Show that K-means can be viewed as the limiting case of EM for Gaussian mixtures if variance is very small and all covariances are 0.

Exercise 16.20 [⋆⋆⋆]
The within-point scatter of a clustering is defined as Σ_k (1/2) Σ_{x_i ∈ ω_k} Σ_{x_j ∈ ω_k} |x_i − x_j|². Show that minimizing RSS and minimizing within-point scatter are equivalent.

Exercise 16.21 [⋆⋆⋆]
Derive an AIC criterion for the multivariate Bernoulli mixture model from Equation (16.12).


17 Hierarchical clustering
Flat clustering is efficient and conceptually simple, but as we saw in Chapter 16 it has a number of drawbacks. The algorithms introduced in Chapter 16 return a flat unstructured set of clusters, require a prespecified number of clusters as input and are nondeterministic. Hierarchical clustering (or hierarchic clustering) outputs a hierarchy, a structure that is more informative than the unstructured set of clusters returned by flat clustering.¹ Hierarchical clustering does not require us to prespecify the number of clusters and most hierarchical algorithms that have been used in IR are deterministic. These advantages of hierarchical clustering come at the cost of lower efficiency. The most common hierarchical clustering algorithms have a complexity that is at least quadratic in the number of documents compared to the linear complexity of K-means and EM (cf. Section 16.4, page 364).
This chapter first introduces agglomerative hierarchical clustering (Section 17.1) and presents four different agglomerative algorithms, in Sections 17.2–17.4, which differ in the similarity measures they employ: single-link, complete-link, group-average, and centroid similarity. We then discuss the optimality conditions of hierarchical clustering in Section 17.5. Section 17.6 introduces top-down (or divisive) hierarchical clustering. Section 17.7 looks at labeling clusters automatically, a problem that must be solved whenever humans interact with the output of clustering. We discuss implementation issues in Section 17.8. Section 17.9 provides pointers to further reading, including references to soft hierarchical clustering, which we do not cover in this book.
There are few differences between the applications of flat and hierarchical clustering in information retrieval. In particular, hierarchical clustering is appropriate for any of the applications shown in Table 16.1 (page 351; see also Section 16.6, page 372). In fact, the example we gave for collection clustering is hierarchical. In general, we select flat clustering when efficiency is important and hierarchical clustering when one of the potential problems of flat clustering (not enough structure, predetermined number of clusters, non-determinism) is a concern.
In addition, many researchers believe that hierarchical clustering produces better clusters than flat clustering. However, there is no consensus on this issue (see references in Section 17.9).

¹ In this chapter, we only consider hierarchies that are binary trees like the one shown in Figure 17.1 – but hierarchical clustering can be easily extended to other types of trees.
17.1 Hierarchical agglomerative clustering

Hierarchical clustering algorithms are either top-down or bottom-up. Bottom-up algorithms treat each document as a singleton cluster at the outset and then successively merge (or agglomerate) pairs of clusters until all clusters have been merged into a single cluster that contains all documents. Bottom-up hierarchical clustering is therefore called hierarchical agglomerative clustering or HAC. Top-down clustering requires a method for splitting a cluster. It proceeds by splitting clusters recursively until individual documents are reached. See Section 17.6. HAC is more frequently used in IR than top-down clustering and is the main subject of this chapter.
Before looking at specific similarity measures used in HAC in Sections 17.2–17.4, we first introduce a method for depicting hierarchical clusterings graphically, discuss a few key properties of HACs and present a simple algorithm for computing an HAC.

An HAC clustering is typically visualized as a dendrogram as shown in Figure 17.1. Each merge is represented by a horizontal line. The y-coordinate of the horizontal line is the similarity of the two clusters that were merged, where documents are viewed as singleton clusters. We call this similarity the combination similarity of the merged cluster. For example, the combination similarity of the cluster consisting of Lloyd’s CEO questioned and Lloyd’s chief / U.S. grilling in Figure 17.1 is ≈ 0.56. We define the combination similarity of a singleton cluster as its document’s self-similarity (which is 1.0 for cosine similarity).
By moving up from the bottom layer to the top node, a dendrogram allows us to reconstruct the history of merges that resulted in the depicted clustering. For example, we see that the two documents entitled War hero Colin Powell were merged first in Figure 17.1 and that the last merge added Ag trade reform to a cluster consisting of the other 29 documents.
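The following Python sketch (not from the book; the data and parameter choices are illustrative) computes a single-link HAC of 30 random "document" vectors with SciPy and draws the resulting dendrogram. Note that SciPy plots merge heights as cosine distances, i.e. 1 − similarity, rather than as similarities:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, dendrogram
    import matplotlib.pyplot as plt

    X = np.random.rand(30, 100)                       # 30 toy "document" vectors
    Z = linkage(X, method='single', metric='cosine')  # single-link HAC, one merge per row of Z
    dendrogram(Z, orientation='right')                # plot the merge history
    plt.show()

Replacing 'single' with 'complete' or 'average' gives the complete-link and group-average variants introduced later in this chapter.
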
A fundamental assumption in HAC is that the merge operation is monotonic. Monotonic means that if s_1, s_2, ...