Somewhat atypically, the final assignment is a hard assignment here. EM usually converges to a soft assignment. In iteration 25, the prior α1 for cluster 1 is 5/11 ≈ 0.45 because 5 of the 11 documents are in cluster 1. Some terms are quickly associated with one cluster because the initial assignment can "spread" to them unambiguously. For example, membership in cluster 2 spreads from document 7 to document 8 in the first iteration because they share sugar (r8,1 = 0 in iteration 1).
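To make the procedure behind Table 16.3 (below) concrete, the following is a minimal sketch of EM for a two-cluster multivariate Bernoulli mixture over binary term-document vectors. The function and variable names are illustrative and not from the book; the smoothing follows the rnk + ε recipe given in the table caption.

import numpy as np

def em_bernoulli_two_clusters(X, r1_init, eps=1e-4, iterations=25):
    # X: binary document-term matrix, shape (N, M); X[n, m] = 1 iff term m occurs in document n
    # r1_init: initial soft assignments of the N documents to cluster 1
    N, M = X.shape
    r = np.column_stack([r1_init, 1.0 - r1_init])   # responsibilities r_nk, shape (N, 2)
    for _ in range(iterations):
        # M-step: cluster priors alpha_k and term probabilities q_km,
        # estimating q from r_nk + eps (the smoothing described in the caption of Table 16.3)
        alpha = np.clip(r.sum(axis=0) / N, 1e-9, None)
        rs = r + eps
        q = (rs.T @ X) / rs.sum(axis=0)[:, None]     # q[k, m] ~ P(term m present | cluster k)
        q = np.clip(q, 1e-9, 1.0 - 1e-9)             # guard against log(0)
        # E-step: r_nk proportional to alpha_k * prod_m q_km^x_nm * (1 - q_km)^(1 - x_nm)
        log_post = X @ np.log(q.T) + (1 - X) @ np.log(1.0 - q.T) + np.log(alpha)
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        r = post / post.sum(axis=1, keepdims=True)
    return alpha, r, q

With the 11 documents of Table 16.3 encoded as binary vectors, one initialization consistent with iteration 0 of the table is r1_init set to 1 for document 6, 0 for document 7, and 0.5 for all other documents (the value 0.5 for the unseeded documents is an assumption). In practice, r1_init could also be taken from a hard K-means assignment, which EM then "softens up", as discussed below.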
◮ Table 16.3 The EM clustering algorithm. The table shows a set of documents (a) and parameter values for selected iterations during EM clustering (b). Parameters shown are the prior α1, the soft assignment scores rn,1 (both omitted for cluster 2), and the lexical parameters qm,k for a few terms. The authors initially assigned document 6 to cluster 1 and document 7 to cluster 2 (iteration 0). EM converges after 25 iterations. For smoothing, the rnk in Equation (16.16) were replaced with rnk + ε where ε = 0.0001.

(a)
docID  document text
1      hot chocolate cocoa beans
2      cocoa ghana africa
3      beans harvest ghana
4      cocoa butter
5      butter truffles
6      sweet chocolate
7      sweet sugar
8      sugar cane brazil
9      sweet sugar beet
10     sweet cake icing
11     cake black forest

(b)
Values of α1, rn,1 (n = 1, ..., 11), and qm,k for the terms africa, brazil, cocoa, sugar, and sweet at iterations 0, 1, 2, 3, 4, 5, 15, and 25 of EM clustering (numeric columns omitted).

For parameters of terms occurring in ambiguous contexts, convergence takes longer. Seed documents 6 and 7 both contain sweet.
As a result, it takes 25 iterations for the term to be unambiguously associated with cluster 2 (qsweet,1 = 0 in iteration 25).

Finding good seeds is even more critical for EM than for K-means. EM is prone to get stuck in local optima if the seeds are not chosen well. This is a general problem that also occurs in other applications of EM; for example, it is common when EM is used to estimate parameters of hidden Markov models, probabilistic grammars, and machine translation models in natural language processing (Manning and Schütze 1999). Therefore, as with K-means, the initial assignment of documents to clusters is often computed by a different algorithm. For example, a hard K-means clustering may provide the initial assignment, which EM can then "soften up."

Exercise 16.6
We saw above that the time complexity of K-means is Θ(IKNM). What is the time complexity of EM?

16.6 References and further reading

Berkhin (2006b) gives a general up-to-date survey of clustering methods with special attention to scalability.
The classic reference for clustering in pattern recognition, covering both K-means and EM, is (Duda et al. 2000). Rasmussen (1992) introduces clustering from an information retrieval perspective. Anderberg (1973) provides a general introduction to clustering for applications. In addition to Euclidean distance and cosine similarity, Kullback-Leibler divergence is often used in clustering as a measure of how (dis)similar documents and clusters are (Xu and Croft 1999, Muresan and Harper 2004, Kurland and Lee 2004).

The cluster hypothesis is due to Jardine and van Rijsbergen (1971) who state it as follows: Associations between documents convey information about the relevance of documents to requests.
Salton (1971a; 1975), Croft (1978), Voorhees (1985a), Can and Ozkarahan (1990), Cacheda et al. (2003), Can et al. (2004), Singitham et al. (2004) and Altingövde et al. (2008) investigate the efficiency and effectiveness of cluster-based retrieval. While some of these studies show improvements in effectiveness, efficiency, or both, there is no consensus that cluster-based retrieval works well consistently across scenarios. Cluster-based language modeling was pioneered by Liu and Croft (2004).

There is good evidence that clustering of search results improves user experience and search result quality (Hearst and Pedersen 1996, Zamir and Etzioni 1999, Tombros et al. 2002, Käki 2005, Toda and Kataoka 2005), although not as much as search result structuring based on carefully edited category hierarchies (Hearst 2006). The Scatter-Gather interface for browsing collections was presented by Cutting et al. (1992). A theoretical framework for analyzing the properties of Scatter/Gather and other information seeking user interfaces is presented by Pirolli (2007). Schütze and Silverstein (1997) evaluate LSI (Chapter 18) and truncated representations of centroids for efficient K-means clustering.

The Columbia NewsBlaster system (McKeown et al. 2002), a forerunner to the now much more famous and refined Google News (http://news.google.com), used hierarchical clustering (Chapter 17) to give two levels of news topic granularity.
See Hatzivassiloglou et al. (2000) for details, and Chen and Lin (2000) and Radev et al. (2001) for related systems. Other applications of clustering in information retrieval are duplicate detection (Yang and Callan (2006), Section 19.6, page 438), novelty detection (see references in Section 17.9, page 399) and metadata discovery on the semantic web (Alonso et al. 2006).

The discussion of external evaluation measures is partially based on Strehl (2002).
Dom (2002) proposes a measure Q0 that is better motivated theoretically than NMI. Q0 is the number of bits needed to transmit class memberships assuming cluster memberships are known. The Rand index is due to Rand (1971). Hubert and Arabie (1985) propose an adjusted Rand index that ranges between −1 and 1 and is 0 if there is only chance agreement between clusters and classes (similar to κ in Chapter 8, page 165).
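As a concrete illustration of the adjusted Rand index just mentioned, here is a minimal sketch of the standard Hubert and Arabie pair-counting formulation; the function and argument names are illustrative.

from math import comb
from collections import Counter

def adjusted_rand_index(classes, clusters):
    # classes: gold-standard class label per document; clusters: cluster label per document
    n = len(classes)
    nij = Counter(zip(classes, clusters))            # contingency table counts
    a = Counter(classes)                             # class sizes
    b = Counter(clusters)                            # cluster sizes
    index = sum(comb(c, 2) for c in nij.values())    # pairs placed together in both partitions
    sum_a = sum(comb(c, 2) for c in a.values())
    sum_b = sum(comb(c, 2) for c in b.values())
    expected = sum_a * sum_b / comb(n, 2)            # expected value of index under chance
    max_index = (sum_a + sum_b) / 2.0
    return (index - expected) / (max_index - expected)

The value is 1 for identical partitions, close to 0 for a chance-level clustering, and can be negative, matching the range stated above.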
Basu et al. (2004) argue that the three evaluation measures NMI, Rand index and F measure give very similar results. Stein et al. (2003) propose expected edge density as an internal measure and give evidence that it is a good predictor of the quality of a clustering. Kleinberg (2002) and Meilă (2005) present axiomatic frameworks for comparing clusterings.

Authors that are often credited with the invention of the K-means algorithm include Lloyd (1982) (first distributed in 1957), Ball (1965), MacQueen (1967), and Hartigan and Wong (1979).
Arthur and Vassilvitskii (2006) investigate the worst-case complexity of K-means. Bradley and Fayyad (1998), Pelleg and Moore (1999) and Davidson and Satyanarayana (2003) investigate the convergence properties of K-means empirically and how it depends on initial seed selection. Dhillon and Modha (2001) compare K-means clusters with SVD-based clusters (Chapter 18). The K-medoid algorithm was presented by Kaufman and Rousseeuw (1990). The EM algorithm was originally introduced by Dempster et al. (1977). An in-depth treatment of EM is (McLachlan and Krishnan 1996).
See Section 18.5 (page 417) for publications on latent analysis, which can also be viewed as soft clustering.

AIC is due to Akaike (1974) (see also Burnham and Anderson (2002)). An alternative to AIC is BIC, which can be motivated as a Bayesian model selection procedure (Schwarz 1978). Fraley and Raftery (1998) show how to choose an optimal number of clusters based on BIC. An application of BIC to K-means is (Pelleg and Moore 2000).
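The following is a minimal sketch of choosing the number of clusters by BIC in the spirit of the model-based approach cited above, using scikit-learn's Gaussian mixture implementation on dense document vectors; this particular implementation is an illustrative assumption, not the procedure of any of the cited papers.

import numpy as np
from sklearn.mixture import GaussianMixture

def choose_k_by_bic(X, k_max=10, random_state=0):
    # X: dense document vectors, shape (N, D); returns (best K, list of BIC scores)
    bics = []
    for k in range(1, k_max + 1):
        gm = GaussianMixture(n_components=k, random_state=random_state).fit(X)
        bics.append(gm.bic(X))   # lower is better: -2 log-likelihood plus a (#parameters) * log(N) penalty
    return int(np.argmin(bics)) + 1, bics

AIC could be substituted via gm.aic(X); the alternative criterion of Hamerly and Elkan, mentioned next, is not implemented here.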
Hamerly and Elkan (2003) propose an alternative to BIC that performs better in their experiments. Another influential Bayesian approach for determining the number of clusters (simultaneously with cluster assignment) is described by Cheeseman and Stutz (1996). Two methods for determining cardinality without external criteria are presented by Tibshirani et al. (2001).

We only have space here for classical completely unsupervised clustering. An important current topic of research is how to use prior knowledge to guide clustering (e.g., Ji and Xu (2006)) and how to incorporate interactive feedback during clustering (e.g., Huang and Mitchell (2006)).