For example, if the query is car and several car documents are taken from a cluster of automobile documents, then we can add documents from this cluster that use terms other than car (automobile, vehicle etc.). This can increase recall since a group of documents with high mutual similarity is often relevant as a whole.

More recently this idea has been used for language modeling. Equation (12.10), page 245, showed that to avoid sparse data problems in the language modeling approach to IR, the model of document d can be interpolated with a collection model. But the collection contains many documents with terms untypical of d. By replacing the collection model with a model derived from d's cluster, we get more accurate estimates of the occurrence probabilities of terms in d.
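Concretely, the change is to the second component of the linear interpolation. The sketch below follows the form of Equation (12.10), with P(t|M_d) the model estimated from d alone, P(t|M_c) the collection model, and P(t|M_cl(d)) a model estimated from the documents in d's cluster; the cluster-model notation is introduced here only for illustration:

    collection smoothing:  P(t|d) = λ P(t|M_d) + (1 − λ) P(t|M_c)
    cluster smoothing:     P(t|d) = λ P(t|M_d) + (1 − λ) P(t|M_cl(d))

Because d's cluster consists of topically related documents, terms like automobile and vehicle receive higher smoothed probabilities for a car document than they would under collection smoothing.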
Clustering can also speed up search. As we saw in Section 6.3.2 (page 123) search in the vector space model amounts to finding the nearest neighbors to the query. The inverted index supports fast nearest-neighbor search for the standard IR setting.
However, sometimes we may not be able to use an inverted index efficiently, e.g., in latent semantic indexing (Chapter 18). In such cases, we could compute the similarity of the query to every document, but this is slow. The cluster hypothesis offers an alternative: Find the clusters that are closest to the query and only consider documents from these clusters. Within this much smaller set, we can compute similarities exhaustively and rank documents in the usual way. Since there are many fewer clusters than documents, finding the closest cluster is fast; and since the documents matching a query are all similar to each other, they tend to be in the same clusters. While this algorithm is inexact, the expected decrease in search quality is small. This is essentially the application of clustering that was covered in Section 7.1.6 (page 141).
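A minimal Python sketch of this two-step search, assuming document and query vectors are already length-normalized, that a flat clustering (centroids and per-document cluster assignments as NumPy arrays) has been computed in advance, and with all names chosen for illustration only:

    import numpy as np

    def cluster_based_search(query, doc_vectors, centroids, assignments, top_k=10):
        # Inexact nearest-neighbor search restricted to the closest cluster.
        # Step 1: find the cluster whose centroid is most similar to the query.
        # Centroid rows are normalized so that dot products are cosine similarities.
        norms = np.linalg.norm(centroids, axis=1, keepdims=True)
        closest = int(np.argmax((centroids / norms) @ query))
        # Step 2: rank only the documents assigned to that cluster.
        candidates = np.flatnonzero(assignments == closest)
        scores = doc_vectors[candidates] @ query
        order = np.argsort(-scores)[:top_k]
        return [(int(candidates[i]), float(scores[i])) for i in order]

If the loss in search quality turns out to be too large, a natural variant is to rank the documents of the few closest clusters rather than of a single one.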
Exercise 16.1
Define two documents as similar if they have at least two proper names like Clinton or Sarkozy in common. Give an example of an information need and two documents, for which the cluster hypothesis does not hold for this notion of similarity.

Exercise 16.2
Make up a simple one-dimensional example (i.e., points on a line) with two clusters where the inexactness of cluster-based retrieval shows up. In your example, retrieving clusters close to the query should do worse than direct nearest neighbor search.
16.2 Problem statement

We can define the goal in hard flat clustering as follows. Given (i) a set of documents D = {d1, . . . , dN}, (ii) a desired number of clusters K, and (iii) an objective function that evaluates the quality of a clustering, we want to compute an assignment γ : D → {1, . . . , K} that minimizes (or, in other cases, maximizes) the objective function. In most cases, we also demand that γ is surjective, i.e., that none of the K clusters is empty.
The objective function is often defined in terms of similarity or distance between documents. Below, we will see that the objective in K-means clustering is to minimize the average distance between documents and their centroids or, equivalently, to maximize the similarity between documents and their centroids. The discussion of similarity measures and distance metrics in Chapter 14 (page 291) also applies to this chapter. As in Chapter 14, we use both similarity and distance to talk about relatedness between documents.
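As a preview of the K-means objective developed later in the chapter, one standard way to write it is as a residual sum of squares (RSS) over clusters ω_1, . . . , ω_K with centroids μ(ω_k); this is a sketch, not the formal development:

    RSS = Σ_{k=1}^{K} Σ_{x ∈ ω_k} |x − μ(ω_k)|²,   where   μ(ω_k) = (1/|ω_k|) Σ_{x ∈ ω_k} x

Minimizing RSS amounts to minimizing the average squared distance between each document vector and the centroid of its cluster.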
For documents, the type of similarity we want is usually topic similarity or high values on the same dimensions in the vector space model. For example, documents about China have high values on dimensions like Chinese, Beijing, and Mao whereas documents about the UK tend to have high values for London, Britain and Queen. We approximate topic similarity with cosine similarity or Euclidean distance in vector space (Chapter 6).
If we intend to capture similarity of a type other than topic, for example, similarity of language, then a different representation may be appropriate. When computing topic similarity, stop words can be safely ignored, but they are important cues for separating clusters of English (in which the occurs frequently and la infrequently) and French documents (in which the occurs infrequently and la frequently).
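A toy illustration of cosine similarity as a proxy for topic similarity, with made-up term counts on the dimensions (Chinese, Beijing, London, Britain):

    import numpy as np

    def cosine(u, v):
        # Cosine similarity of two term vectors.
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    china1 = np.array([5.0, 3.0, 0.0, 1.0])
    china2 = np.array([4.0, 2.0, 1.0, 0.0])
    uk1    = np.array([0.0, 1.0, 6.0, 4.0])

    print(cosine(china1, china2))  # about 0.96: same topic
    print(cosine(china1, uk1))     # about 0.16: different topics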
A note on terminology. An alternative definition of hard clustering is that a document can be a full member of more than one cluster. Partitional clustering always refers to a clustering where each document belongs to exactly one cluster. (But in a partitional hierarchical clustering (Chapter 17) all members of a cluster are of course also members of its parent.) On the definition of hard clustering that permits multiple membership, the difference between soft clustering and hard clustering is that membership values in hard clustering are either 0 or 1, whereas they can take on any non-negative value in soft clustering.

Some researchers distinguish between exhaustive clusterings that assign each document to a cluster and non-exhaustive clusterings, in which some documents will be assigned to no cluster.
Non-exhaustive clusterings in which each document is a member of either no cluster or one cluster are called exclusive. We define clustering to be exhaustive in this book.
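A small made-up illustration of these membership values: with K = 3 clusters, a hard assignment of a document is a vector like (0, 1, 0); under the alternative definition that permits multiple membership it could be (0, 1, 1); and a soft assignment may be any non-negative vector such as (0.1, 0.7, 0.2).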
16.2.1 Cardinality – the number of clusters

A difficult issue in clustering is determining the number of clusters or cardinality of a clustering, which we denote by K. Often K is nothing more than a good guess based on experience or domain knowledge. But for K-means, we will also introduce a heuristic method for choosing K and an attempt to incorporate the selection of K into the objective function. Sometimes the application puts constraints on the range of K.
For example, the Scatter-Gather interface in Figure 16.3 could not display more than about K = 10 clusters per layer because of the size and resolution of computer monitors in the early 1990s.

Since our goal is to optimize an objective function, clustering is essentially a search problem. The brute force solution would be to enumerate all possible clusterings and pick the best. However, there are exponentially many partitions, so this approach is not feasible.1 For this reason, most flat clustering algorithms refine an initial partitioning iteratively. If the search starts at an unfavorable initial point, we may miss the global optimum. Finding a good starting point is therefore another important problem we have to solve in flat clustering.
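To put a number on "exponentially many partitions": the number of ways to partition N documents into K non-empty clusters is the Stirling number of the second kind (a standard combinatorial fact, not derived in this excerpt),

    S(N, K) = (1/K!) Σ_{j=0}^{K} (−1)^j (K choose j) (K − j)^N  ≈  K^N / K!

so even N = 20 documents and K = 3 clusters already give S(20, 3) = 580,606,446 possible clusterings.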
16.3 Evaluation of clustering

Typical objective functions in clustering formalize the goal of attaining high intra-cluster similarity (documents within a cluster are similar) and low inter-cluster similarity (documents from different clusters are dissimilar). This is an internal criterion for the quality of a clustering.
But good scores on an internal criterion do not necessarily translate into good effectiveness in an application. An alternative to internal criteria is direct evaluation in the application of interest. For search result clustering, we may want to measure the time it takes users to find an answer with different clustering algorithms. This is the most direct evaluation, but it is expensive, especially if large user studies are necessary.

As a surrogate for user judgments, we can use a set of classes in an evaluation benchmark or gold standard (see Section 8.5, page 164, and Section 13.6, page 279). The gold standard is ideally produced by human judges with a good level of inter-judge agreement (see Chapter 8, page 152). We can then compute an external criterion that evaluates how well the clustering matches the gold standard classes.
For example, we may want to say that the optimal clustering of the search results for jaguar in Figure 16.2 consists of three classes corresponding to the three senses car, animal, and operating system. In this type of evaluation, we only use the partition provided by the gold standard, not the class labels.

This section introduces four external criteria of clustering quality. Purity is a simple and transparent evaluation measure. Normalized mutual information can be information-theoretically interpreted. The Rand index penalizes both false positive and false negative decisions during clustering.
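As a preview of the first of these measures: purity assigns each cluster the gold standard class that is most frequent in it and then counts the fraction of documents that end up in a cluster labeled with their own class. A minimal Python sketch of this computation, with an illustrative function name and input format:

    from collections import Counter

    def purity(clusters, classes):
        # clusters: cluster id per document; classes: gold standard label per document.
        by_cluster = {}
        for cluster_id, label in zip(clusters, classes):
            by_cluster.setdefault(cluster_id, []).append(label)
        # Each cluster contributes the size of its majority class.
        majority_total = sum(max(Counter(labels).values()) for labels in by_cluster.values())
        return majority_total / len(clusters)

    # Made-up example: three clusters over six documents.
    print(purity([1, 1, 1, 2, 2, 3], ["x", "x", "o", "o", "o", "d"]))  # 5/6 ≈ 0.83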