• A fixed number of iterations I has been completed. This condition limits the runtime of the clustering algorithm, but in some cases the quality of the clustering will be poor because of an insufficient number of iterations.
• Assignment of documents to clusters (the partitioning function γ) does not change between iterations. Except for cases with a bad local minimum, this produces a good clustering, but runtimes may be unacceptably long.
• Centroids ~µk do not change between iterations. This is equivalent to γ not changing (Exercise 16.5).
• Terminate when RSS falls below a threshold. This criterion ensures that the clustering is of a desired quality after termination. In practice, we need to combine it with a bound on the number of iterations to guarantee termination.
• Terminate when the decrease in RSS falls below a threshold θ. For small θ, this indicates that we are close to convergence. Again, we need to combine it with a bound on the number of iterations to prevent very long runtimes.
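These termination conditions are easy to combine in a single control loop. The following is a minimal NumPy sketch, not the book's reference implementation; the function name kmeans and the parameters max_iter and theta are illustrative. It stops when the iteration budget is exhausted, when assignments stop changing, or when the decrease in RSS falls below θ.

```python
import numpy as np

def kmeans(X, K, max_iter=100, theta=1e-4, seed=0):
    """Minimal K-means sketch; X is an (N, M) array of document vectors."""
    rng = np.random.default_rng(seed)
    # Select K documents as the initial seeds (centroids).
    centroids = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    prev_assign, prev_rss = None, np.inf
    for _ in range(max_iter):                       # fixed iteration budget I
        # Reassignment step: each vector goes to its closest centroid
        # (argmin breaks ties by taking the lowest cluster index).
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        # Recomputation step: each centroid becomes the mean of its cluster.
        for k in range(K):
            members = X[assign == k]
            if len(members) > 0:                    # guard against empty clusters
                centroids[k] = members.mean(axis=0)
        rss = ((X - centroids[assign]) ** 2).sum()  # RSS with the updated centroids
        if prev_assign is not None and np.array_equal(assign, prev_assign):
            break                                   # assignments (and centroids) unchanged
        if prev_rss - rss < theta:
            break                                   # RSS decrease below threshold theta
        prev_assign, prev_rss = assign, rss
    return assign, centroids, rss
```

Note that argmin picks the lowest index among equidistant centroids, which is exactly the kind of consistent tie-breaking the convergence argument below relies on.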
◮ Figure 16.6  A K-means example for K = 2 in R^2. The position of the two centroids (~µ's shown as ×'s in the top four panels) converges after nine iterations. [Panels: selection of seeds; assignment of documents (iter. 1); recomputation/movement of ~µ's (iter. 1); ~µ's after convergence (iter. 9); movement of ~µ's in 9 iterations.]

We now show that K-means converges by proving that RSS monotonically decreases in each iteration.
In this section, we use "decrease" to mean "decrease or remain unchanged." First, RSS decreases in the reassignment step since each vector is assigned to the closest centroid, so the distance it contributes to RSS decreases. Second, it decreases in the recomputation step because the new centroid is the vector ~v for which RSSk reaches its minimum:

(16.8)    $RSS_k(\vec{v}) = \sum_{\vec{x} \in \omega_k} |\vec{v} - \vec{x}|^2 = \sum_{\vec{x} \in \omega_k} \sum_{m=1}^{M} (v_m - x_m)^2$

(16.9)    $\frac{\partial RSS_k(\vec{v})}{\partial v_m} = \sum_{\vec{x} \in \omega_k} 2(v_m - x_m)$

where $x_m$ and $v_m$ are the mth components of their respective vectors.
Setting the partial derivative to zero, we get:

(16.10)    $v_m = \frac{1}{|\omega_k|} \sum_{\vec{x} \in \omega_k} x_m$

which is the componentwise definition of the centroid. Thus, we minimize RSSk when the old centroid is replaced with the new centroid. RSS, the sum of the RSSk, must then also decrease during recomputation.

Since there is only a finite set of possible clusterings, a monotonically decreasing algorithm will eventually arrive at a (local) minimum. Take care, however, to break ties consistently, e.g., by assigning a document to the cluster with the lowest index if there are several equidistant centroids. Otherwise, the algorithm can cycle forever in a loop of clusterings that have the same cost.
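Equation (16.10) says that the componentwise mean minimizes RSSk. This is easy to check numerically; the snippet below is an illustrative sanity check (not part of the original text) that compares the cluster mean against randomly perturbed candidate centers, none of which should achieve a lower RSSk.

```python
import numpy as np

def rss_k(v, cluster):
    """RSS contribution of one cluster for a candidate center v (Equation 16.8)."""
    return ((cluster - v) ** 2).sum()

rng = np.random.default_rng(1)
cluster = rng.normal(size=(50, 20))   # 50 vectors in a 20-dimensional space
centroid = cluster.mean(axis=0)       # componentwise mean, Equation (16.10)

# Any perturbed center should give an RSS_k at least as large as the centroid's.
assert all(
    rss_k(centroid, cluster)
    <= rss_k(centroid + rng.normal(scale=0.1, size=20), cluster)
    for _ in range(1000)
)
```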
While this proves the convergence of K-means, there is unfortunately no guarantee that a global minimum in the objective function will be reached. This is a particular problem if a document set contains many outliers, documents that are far from any other documents and therefore do not fit well into any cluster. Frequently, if an outlier is chosen as an initial seed, then no other vector is assigned to it during subsequent iterations. Thus, we end up with a singleton cluster (a cluster with only one document) even though there is probably a clustering with lower RSS. Figure 16.7 shows an example of a suboptimal clustering resulting from a bad choice of initial seeds.

◮ Figure 16.7  The outcome of clustering in K-means depends on the initial seeds. For seeds d2 and d5, K-means converges to {{d1, d2, d3}, {d4, d5, d6}}, a suboptimal clustering. For seeds d2 and d3, it converges to {{d1, d2, d4, d5}, {d3, d6}}, the global optimum for K = 2.

Another type of suboptimal clustering that frequently occurs is one with empty clusters (Exercise 16.11).

Effective heuristics for seed selection include (i) excluding outliers from the seed set; (ii) trying out multiple starting points and choosing the clustering with lowest cost (a sketch of this follows below); and (iii) obtaining seeds from another method such as hierarchical clustering.
Since deterministic hierarchical clustering methods are more predictable than K-means, a hierarchical clustering of a small random sample of size iK (e.g., for i = 5 or i = 10) often provides good seeds (see the description of the Buckshot algorithm, Chapter 17, page 399). Other initialization methods compute seeds that are not selected from the vectors to be clustered. A robust method that works well for a large variety of document distributions is to select i (e.g., i = 10) random vectors for each cluster and use their centroid as the seed for this cluster. See Section 16.6 for more sophisticated initializations.
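Heuristic (ii), trying several starting points and keeping the clustering with the lowest cost, is a thin wrapper around any K-means routine. A minimal sketch, reusing the hypothetical kmeans function from the earlier snippet:

```python
def best_of_restarts(X, K, n_restarts=10):
    """Run K-means from several random seeds and keep the lowest-RSS clustering.
    Assumes the kmeans() sketch defined earlier in this section."""
    best = None
    for seed in range(n_restarts):
        assign, centroids, rss = kmeans(X, K, seed=seed)
        if best is None or rss < best[2]:
            best = (assign, centroids, rss)
    return best
```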
What is the time complexity of K-means? Most of the time is spent on computing vector distances. One such operation costs Θ(M). The reassignment step computes KN distances, so its overall complexity is Θ(KNM). In the recomputation step, each vector gets added to a centroid once, so the complexity of this step is Θ(NM). For a fixed number of iterations I, the overall complexity is therefore Θ(IKNM). Thus, K-means is linear in all relevant factors: iterations, number of clusters, number of vectors, and dimensionality of the space.
This means that K-means is more efficient than the hierarchical algorithms in Chapter 17. We had to fix the number of iterations I, which can be tricky in practice. But in most cases, K-means quickly reaches either complete convergence or a clustering that is close to convergence. In the latter case, a few documents would switch membership if further iterations were computed, but this has a small effect on the overall quality of the clustering.

There is one subtlety in the preceding argument. Even a linear algorithm can be quite slow if one of the arguments of Θ(...) is large, and M usually is large. High dimensionality is not a problem for computing the distance between two documents. Their vectors are sparse, so that only a small fraction of the theoretically possible M componentwise differences need to be computed. Centroids, however, are dense since they pool all terms that occur in any of the documents of their clusters.
As a result, distance computations are time consuming in a naive implementation of K-means. However, there are simple and effective heuristics for making centroid-document similarities as fast to compute as document-document similarities. Truncating centroids to the most significant k terms (e.g., k = 1000) hardly decreases cluster quality while achieving a significant speedup of the reassignment step (see references in Section 16.6).

The same efficiency problem is addressed by K-medoids, a variant of K-means that computes medoids instead of centroids as cluster centers. We define the medoid of a cluster as the document vector that is closest to the centroid. Since medoids are sparse document vectors, distance computations are fast.
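Both of these speedups are straightforward to express. The sketch below is illustrative (the helper names are made up): it truncates a dense centroid to its k highest-weighted terms and, for K-medoids, selects as medoid the cluster document closest to the centroid.

```python
import numpy as np

def truncate_centroid(centroid, k=1000):
    """Keep only the k highest-weighted terms of a dense centroid; zero out the rest."""
    truncated = np.zeros_like(centroid)
    top = np.argsort(centroid)[-k:]          # indices of the k largest components
    truncated[top] = centroid[top]
    return truncated

def medoid(cluster):
    """Return the document of the cluster that is closest to the cluster centroid."""
    centroid = cluster.mean(axis=0)
    dists = ((cluster - centroid) ** 2).sum(axis=1)
    return cluster[dists.argmin()]
```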
16.4.1  Cluster cardinality in K-means

We stated in Section 16.2 that the number of clusters K is an input to most flat clustering algorithms. What do we do if we cannot come up with a plausible guess for K?

A naive approach would be to select the optimal value of K according to the objective function, namely the value of K that minimizes RSS. Defining RSS_min(K) as the minimal RSS of all clusterings with K clusters, we observe that RSS_min(K) is a monotonically decreasing function of K (Exercise 16.13), which reaches its minimum 0 for K = N, where N is the number of documents. We would end up with each document being in its own cluster. Clearly, this is not an optimal clustering.

A heuristic method that gets around this problem is to estimate RSS_min(K) as follows. We first perform i (e.g., i = 10) clusterings with K clusters (each with a different initialization) and compute the RSS of each.
Then we take the minimum of the i RSS values. We denote this minimum by $\widehat{RSS}_{\min}(K)$. Now we can inspect the values $\widehat{RSS}_{\min}(K)$ as K increases and find the "knee" in the curve – the point where successive decreases in $\widehat{RSS}_{\min}$ become noticeably smaller. There are two such points in Figure 16.8, one at K = 4, where the gradient flattens slightly, and a clearer flattening at K = 9. This is typical: there is seldom a single best number of clusters. We still need to employ an external constraint to choose from a number of possible values of K (4 and 9 in this case).
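The estimate $\widehat{RSS}_{\min}(K)$ can be computed by sweeping over K with the restart wrapper sketched earlier; the knee is then read off the resulting curve, as in Figure 16.8. Again an illustrative sketch with assumed helper names:

```python
def rss_min_estimate(X, k_values, n_restarts=10):
    """Estimate RSS_min(K) as the best RSS over n_restarts clusterings for each K.
    Assumes the best_of_restarts() sketch defined earlier in this section."""
    return {K: best_of_restarts(X, K, n_restarts)[2] for K in k_values}

# Inspect successive decreases to look for a "knee" in the curve, e.g.:
# curve = rss_min_estimate(X, range(1, 11))
# for K in range(2, 11):
#     print(K, curve[K - 1] - curve[K])
```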
◮ Figure 16.8  Estimated minimal residual sum of squares as a function of the number of clusters in K-means. In this clustering of 1203 Reuters-RCV1 documents, there are two points where the $\widehat{RSS}_{\min}$ curve flattens: at 4 clusters and at 9 clusters. The documents were selected from the categories China, Germany, Russia and Sports, so the K = 4 clustering is closest to the Reuters classification.

A second type of criterion for cluster cardinality imposes a penalty for each new cluster – where conceptually we start with a single cluster containing all documents and then search for the optimal number of clusters K by successively incrementing K by one.
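One common way to make such a penalized criterion concrete, not spelled out in this excerpt, is to minimize the estimated distortion plus a per-cluster penalty, $\widehat{RSS}_{\min}(K) + \lambda K$, over K. A sketch under that assumption, with the weight λ supplied by the user:

```python
def select_k_penalized(X, k_values, lam, n_restarts=10):
    """Pick K minimizing estimated distortion plus a per-cluster penalty lam * K.
    The objective RSS_min(K) + lam * K is an assumed, common penalized form;
    reuses the rss_min_estimate() sketch above."""
    curve = rss_min_estimate(X, k_values, n_restarts)
    return min(k_values, key=lambda K: curve[K] + lam * K)
```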