They can only be in the same cluster if a merge with similarity smaller than s has occurred in the merge sequence producing Ω_K. This implies s > COMB-SIM(Ω_K). Thus, COMB-SIM(Ω′_{K−1}) = s > COMB-SIM(Ω_K) > COMB-SIM(Ω′_K) > COMB-SIM(Ω′_{K−1}). Contradiction.

Case 2: The two documents linked by s = COMB-SIM(Ω′_{K−1}) are not in the same cluster in Ω_K.
But s = COMB-SIM(Ω′_{K−1}) > COMB-SIM(Ω_{K−1}), so the single-link merging rule should have merged these two clusters when processing Ω_K. Contradiction.

Thus, Ω_{K−1} is optimal.

In contrast to single-link clustering, complete-link clustering and GAAC are not optimal, as this example shows:

    d1 <--3--> d2 <-1-> d3 <--3--> d4    (four documents on a line; the numbers are distances)

Both algorithms merge the two points with distance 1 (d2 and d3) first and thus cannot find the two-cluster clustering {{d1, d2}, {d3, d4}}. But {{d1, d2}, {d3, d4}} is optimal on the optimality criteria of complete-link clustering and GAAC.
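To make this counterexample concrete, here is a short sketch in Python (our own illustration, not code from the book) that runs greedy complete-link HAC on the four points and compares its result, under complete-link's own criterion (the largest cluster diameter), with the clustering {{d1, d2}, {d3, d4}}:

```python
from itertools import combinations

# The example from the text: four documents on a line with gaps 3, 1 and 3.
points = {"d1": 0.0, "d2": 3.0, "d3": 4.0, "d4": 7.0}

def dist(a, b):
    return abs(points[a] - points[b])

def diameter(cluster):
    # Complete-link's combination similarity is (minus) the largest pairwise distance.
    return max((dist(a, b) for a, b in combinations(cluster, 2)), default=0.0)

def complete_link_hac(num_clusters=2):
    clusters = [{d} for d in points]
    while len(clusters) > num_clusters:
        # Greedily merge the pair of clusters whose union has the smallest diameter.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: diameter(clusters[p[0]] | clusters[p[1]]))
        clusters[i] |= clusters.pop(j)
    return clusters

greedy = complete_link_hac()
optimal = [{"d1", "d2"}, {"d3", "d4"}]
print([sorted(c) for c in greedy])       # d2 and d3 are merged in the first step
print(max(diameter(c) for c in greedy))  # 4.0
print(max(diameter(c) for c in optimal)) # 3.0, better under complete-link's own criterion
```

GAAC fails in the same way on this example, since its merge criterion also selects d2 and d3 first.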
However, the merge criteria of complete-link clustering and GAAC approximate the desideratum of approximate sphericity better than the merge criterion of single-link clustering. In many applications, we want spherical clusters. Thus, even though single-link clustering may seem preferable at first because of its optimality, it is optimal with respect to the wrong criterion in many document clustering applications.

method | combination similarity | time complexity | optimal? | comment
single-link | max inter-similarity of any 2 docs | Θ(N²) | yes | chaining effect
complete-link | min inter-similarity of any 2 docs | Θ(N² log N) | no | sensitive to outliers
group-average | average of all sims | Θ(N² log N) | no | best choice for most applications
centroid | average inter-similarity | Θ(N² log N) | no | inversions can occur

◮ Table 17.1 Comparison of HAC algorithms.

Table 17.1 summarizes the properties of the four HAC algorithms introduced in this chapter. We recommend GAAC for document clustering because it is generally the method that produces the clustering with the best properties for applications. It does not suffer from chaining, from sensitivity to outliers and from inversions.
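The combination similarities in the second column of Table 17.1 can be computed directly from pairwise document similarities. The sketch below is our own paraphrase of the table in Python (not code from the book), assuming documents are length-normalized vectors and similarity is the dot product:

```python
import numpy as np

def combination_similarities(A, B):
    """A, B: arrays of length-normalized document vectors, one row per document.
    Returns the four merge criteria of Table 17.1 for merging clusters A and B."""
    cross = A @ B.T                      # similarities of all cross-cluster pairs
    merged = np.vstack([A, B])
    all_sims = merged @ merged.T
    n = len(merged)
    return {
        "single-link":   cross.max(),    # similarity of the two most similar docs
        "complete-link": cross.min(),    # similarity of the two most dissimilar docs
        # group-average: mean over all pairs in the merged cluster, self-pairs excluded
        "group-average": (all_sims.sum() - np.trace(all_sims)) / (n * (n - 1)),
        # centroid: dot product of the centroids = average inter-similarity
        "centroid":      A.mean(axis=0) @ B.mean(axis=0),
    }

# Toy clusters of unit-length 2-d vectors (values chosen for illustration only).
A = np.array([[1.0, 0.0], [0.8, 0.6]])
B = np.array([[0.0, 1.0], [0.6, 0.8]])
print(combination_similarities(A, B))
```

In an actual HAC implementation these quantities are maintained incrementally with priority queues, which is where the Θ(N² log N) bounds in the table come from; the definitions themselves are just the expressions above.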
There are two exceptions to this recommendation. First, for non-vector representations, GAAC is not applicable and clustering should typically be performed with the complete-link method.

Second, in some applications the purpose of clustering is not to create a complete hierarchy or exhaustive partition of the entire document set. For instance, first story detection or novelty detection is the task of detecting the first occurrence of an event in a stream of news stories.
One approach to this task is to find a tight cluster within the documents that were sent across the wire in a short period of time and are dissimilar from all previous documents. For example, the documents sent over the wire in the minutes after the World Trade Center attack on September 11, 2001 form such a cluster. Variations of single-link clustering can do well on this task since it is the structure of small parts of the vector space, and not global structure, that is important in this case.

Similarly, we will describe an approach to duplicate detection on the web in Section 19.6 (page 440) where single-link clustering is used in the guise of the union-find algorithm. Again, the decision whether a group of documents are duplicates of each other is not influenced by documents that are located far away, and single-link clustering is a good choice for duplicate detection.
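To make the union-find connection concrete, here is a minimal sketch in Python (our own illustration, not code from the book): treat every document pair whose similarity reaches a threshold as a link and compute connected components with union-find; these components are exactly the single-link clusters at that similarity level. The similarity values and the threshold are made up for illustration.

```python
def find(parent, x):
    # Follow parent pointers to the root, compressing the path as we go.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(parent, x, y):
    rx, ry = find(parent, x), find(parent, y)
    if rx != ry:
        parent[ry] = rx

def single_link_components(sims, n, threshold):
    """sims maps document pairs (i, j) to a similarity; n is the number of documents."""
    parent = list(range(n))
    for (i, j), s in sims.items():
        if s >= threshold:
            union(parent, i, j)
    clusters = {}
    for d in range(n):
        clusters.setdefault(find(parent, d), []).append(d)
    return list(clusters.values())

# Toy example: 4 documents, hand-picked similarities (illustrative only).
sims = {(0, 1): 0.9, (1, 2): 0.2, (2, 3): 0.8, (0, 3): 0.1}
print(single_link_components(sims, 4, threshold=0.5))  # [[0, 1], [2, 3]]
```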
Exercise 17.4
Show the equivalence of the two definitions of combination similarity: the process definition on page 378 and the static definition on page 393.

17.6 Divisive clustering

So far we have only looked at agglomerative clustering, but a cluster hierarchy can also be generated top-down. This variant of hierarchical clustering is called top-down clustering or divisive clustering.
We start at the top with all documents in one cluster. The cluster is split using a flat clustering algorithm. This procedure is applied recursively until each document is in its own singleton cluster.

Top-down clustering is conceptually more complex than bottom-up clustering since we need a second, flat clustering algorithm as a “subroutine”.
It has the advantage of being more efficient if we do not generate a complete hierarchy all the way down to individual document leaves. For a fixed number of top levels, using an efficient flat algorithm like K-means, top-down algorithms are linear in the number of documents and clusters. So they run much faster than HAC algorithms, which are at least quadratic.

There is evidence that divisive algorithms produce more accurate hierarchies than bottom-up algorithms in some circumstances. See the references on bisecting K-means in Section 17.9. Bottom-up methods make clustering decisions based on local patterns without initially taking into account the global distribution. These early decisions cannot be undone. Top-down clustering benefits from complete information about the global distribution when making top-level partitioning decisions.
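To illustrate the top-down procedure, here is a short sketch in Python of repeated bisection, assuming scikit-learn's KMeans as the flat-clustering “subroutine” (this is our own illustration of the general recipe, not a particular published bisecting K-means variant): split one cluster into two at each step, here always the largest remaining one, until the desired number of leaf clusters is reached.

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, num_leaves, seed=0):
    """Top-down clustering: split clusters until num_leaves remain.

    X is an (n_docs, n_terms) array of document vectors; the result is a
    list of index arrays, one per leaf cluster."""
    clusters = [np.arange(X.shape[0])]        # start with all docs in one cluster
    while len(clusters) < num_leaves:
        # Choose the largest remaining cluster to split (one common heuristic).
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        if len(clusters[idx]) < 2:            # nothing left to split
            break
        members = clusters.pop(idx)
        labels = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(X[members])
        clusters.append(members[labels == 0])
        clusters.append(members[labels == 1])
    return clusters

# Toy usage on random vectors (illustrative only).
rng = np.random.default_rng(0)
X = rng.random((20, 5))
for cluster in bisecting_kmeans(X, num_leaves=4):
    print(sorted(cluster.tolist()))
```

Because only a fixed number of splits is performed and each K-means call is linear in the number of documents it sees, stopping after the top levels keeps the whole procedure linear, as noted above.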
17.7 Cluster labeling

In many applications of flat clustering and hierarchical clustering, particularly in analysis tasks and in user interfaces (see applications in Table 16.1, page 351), human users interact with clusters. In such settings, we must label clusters, so that users can see what a cluster is about.

Differential cluster labeling selects cluster labels by comparing the distribution of terms in one cluster with that of other clusters.
The feature selection methods we introduced in Section 13.5 (page 271) can all be used for differential cluster labeling.⁵ In particular, mutual information (MI) (Section 13.5.1, page 272) or, equivalently, information gain and the χ²-test (Section 13.5.2, page 275) will identify cluster labels that characterize one cluster in contrast to other clusters. A combination of a differential test with a penalty for rare terms often gives the best labeling results because rare terms are not necessarily representative of the cluster as a whole.

5. Selecting the most frequent terms is a non-differential feature selection technique we discussed in Section 13.5. It can also be used for labeling clusters.

We apply three labeling methods to a K-means clustering in Table 17.2. In this example, there is almost no difference between MI and χ². We therefore omit the latter.
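As a concrete illustration of differential labeling with MI, the sketch below (Python, our own illustration with made-up term sets, not code from the book) builds, for every candidate term, the two-by-two contingency table of document counts (in the cluster or not, containing the term or not), scores it with the mutual information formula of Section 13.5.1, and keeps the top-scoring terms as the label:

```python
import math

def mutual_information(n11, n10, n01, n00):
    """MI of a term/cluster contingency table of document counts.
    n11: in cluster, contains term; n10: outside cluster, contains term;
    n01: in cluster, no term; n00: outside cluster, no term."""
    n = n11 + n10 + n01 + n00
    def part(n_ab, n_a, n_b):
        return 0.0 if n_ab == 0 else (n_ab / n) * math.log2(n * n_ab / (n_a * n_b))
    n1_, n0_ = n11 + n10, n01 + n00          # docs with / without the term
    n_1, n_0 = n11 + n01, n10 + n00          # docs inside / outside the cluster
    return (part(n11, n1_, n_1) + part(n10, n1_, n_0) +
            part(n01, n0_, n_1) + part(n00, n0_, n_0))

def mi_labels(docs, in_cluster, k=5):
    """docs: list of term sets; in_cluster: parallel list of booleans."""
    # Restrict candidates to terms that occur in the cluster, so labels describe it.
    candidates = set().union(*(d for d, c in zip(docs, in_cluster) if c))
    n_docs, n_in = len(docs), sum(in_cluster)
    scores = {}
    for t in candidates:
        n11 = sum(1 for d, c in zip(docs, in_cluster) if c and t in d)
        n10 = sum(1 for d, c in zip(docs, in_cluster) if not c and t in d)
        scores[t] = mutual_information(n11, n10, n_in - n11, (n_docs - n_in) - n10)
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy collection: the cluster is the first three documents (term sets are made up).
docs = [{"oil", "crude", "mexico"}, {"oil", "refinery"}, {"crude", "bpd", "oil"},
        {"wheat", "futures"}, {"police", "grozny"}, {"wheat", "prices"}]
print(mi_labels(docs, [True, True, True, False, False, False], k=3))  # 'oil' ranks first
```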
Cluster-internal labeling computes a label that solely depends on the cluster itself, not on other clusters. Labeling a cluster with the title of the document closest to the centroid is one cluster-internal method. Titles are easier to read than a list of terms. A full title can also contain important context that didn't make it into the top 10 terms selected by MI. On the web, anchor text can play a role similar to a title since the anchor text pointing to a page can serve as a concise summary of its contents.

cluster | # docs | centroid | mutual information | title
4 | 622 | oil plant mexico production crude power 000 refinery gas bpd | plant oil production barrels crude bpd mexico dolly capacity petroleum | MEXICO: Hurricane Dolly heads for Mexico coast
9 | 1017 | police security russian people military peace killed told grozny court | police killed military security peace told troops forces rebels people | RUSSIA: Russia's Lebed meets rebel chief in Chechnya
10 | 1259 | 00 000 tonnes traders futures wheat prices cents september tonne | delivery traders futures tonne tonnes desk wheat prices 000 00 | USA: Export Business - Grain/oilseeds complex

◮ Table 17.2 Automatically computed cluster labels. This is for three of ten clusters (4, 9, and 10) in a K-means clustering of the first 10,000 documents in Reuters-RCV1. The last three columns show cluster summaries computed by three labeling methods: most highly weighted terms in centroid (centroid), mutual information, and the title of the document closest to the centroid of the cluster (title). Terms selected by only one of the first two methods are in bold.

In Table 17.2, the title for cluster 9 suggests that many of its documents are about the Chechnya conflict, a fact the MI terms do not reveal. However, a single document is unlikely to be representative of all documents in a cluster. An example is cluster 4, whose selected title is misleading. The main topic of the cluster is oil. Articles about hurricane Dolly only ended up in this cluster because of its effect on oil prices.

We can also use a list of terms with high weights in the centroid of the cluster as a label. Such highly weighted terms (or, even better, phrases, especially noun phrases) are often more representative of the cluster than a few titles can be, even if they are not filtered for distinctiveness as in the differential methods.
However, a list of phrases takes more time to digest for users than a well crafted title.

Cluster-internal methods are efficient, but they fail to distinguish terms that are frequent in the collection as a whole from those that are frequent only in the cluster. Terms like year or Tuesday may be among the most frequent in a cluster, but they are not helpful in understanding the contents of a cluster with a specific topic like oil.

In Table 17.2, the centroid method selects a few more uninformative terms (000, court, cents, september) than MI (forces, desk), but most of the terms selected by either method are good descriptors.
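Both cluster-internal labels just discussed are easy to compute directly. The sketch below (Python, our own illustration with made-up toy vectors and titles, not code from the book) labels a single cluster in the two ways described above: with the most highly weighted terms in the cluster centroid, and with the title of the document closest to that centroid.

```python
import numpy as np

def cluster_internal_labels(doc_vectors, titles, vocabulary, k=5):
    """doc_vectors: (n_docs, n_terms) tf-idf style matrix for one cluster.
    Returns the top-k centroid terms and the title of the most central document."""
    centroid = doc_vectors.mean(axis=0)
    # Label 1: the k most highly weighted terms in the centroid.
    top_terms = [vocabulary[i] for i in np.argsort(centroid)[::-1][:k]]
    # Label 2: the title of the document with highest cosine similarity to the centroid.
    norms = np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(centroid)
    cosines = doc_vectors @ centroid / np.where(norms == 0.0, 1.0, norms)
    best_title = titles[int(np.argmax(cosines))]
    return top_terms, best_title

# Toy cluster of three documents over a five-term vocabulary (weights made up).
vocabulary = ["oil", "crude", "dolly", "wheat", "police"]
doc_vectors = np.array([[0.9, 0.7, 0.3, 0.0, 0.0],
                        [0.8, 0.5, 0.0, 0.1, 0.0],
                        [0.4, 0.2, 0.9, 0.0, 0.0]])
titles = ["Oil output rises at Gulf refineries",
          "Crude prices climb on supply fears",
          "Hurricane disrupts oil platforms"]
print(cluster_internal_labels(doc_vectors, titles, vocabulary, k=3))
```

As the section notes, the centroid-term label is cheap but not filtered for distinctiveness, while the title label adds context at the risk of being unrepresentative of the cluster as a whole.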