An Introduction to Information Retrieval. Manning, Raghavan (2009)
Cao et al. (2006) study how to make this approach effective in IR, and Qin et al. (2007) suggest an extension involving multiple hyperplanes. Yue et al. (2007) study how to do ranking with a structural SVM approach, and in particular show how this construction can be effectively used to directly optimize for MAP (Section 8.4, page 158), rather than using surrogate measures like accuracy or area under the ROC curve. Geng et al. (2007) study feature selection for the ranking problem. Other approaches to learning to rank have also been shown to be effective for web search, such as (Burges et al. 2005, Richardson et al. 2006).

Online edition (c) 2009 Cambridge UP. DRAFT! © April 1, 2009 Cambridge University Press.
Feedback welcome.

16 Flat clustering

Clustering algorithms group a set of documents into subsets or clusters. The algorithms' goal is to create clusters that are coherent internally, but clearly different from each other. In other words, documents within a cluster should be as similar as possible, and documents in one cluster should be as dissimilar as possible from documents in other clusters.

◮ Figure 16.1 An example of a data set with a clear cluster structure. [Scatter plot of points in the 2D plane forming three well-separated groups; axis ticks omitted.]

Clustering is the most common form of unsupervised learning. No supervision means that there is no human expert who has assigned documents to classes.
In clustering, it is the distribution and makeup of the data that determine cluster membership. A simple example is Figure 16.1. It is visually clear that there are three distinct clusters of points. This chapter and Chapter 17 introduce algorithms that find such clusters in an unsupervised fashion.

The difference between clustering and classification may not seem great at first. After all, in both cases we have a partition of a set of documents into groups. But as we will see, the two problems are fundamentally different.
Classification is a form of supervised learning (Chapter 13, page 256): our goal is to replicate a categorical distinction that a human supervisor imposes on the data. In unsupervised learning, of which clustering is the most important example, we have no such teacher to guide us.

The key input to a clustering algorithm is the distance measure.
In Figure 16.1, the distance measure is distance in the 2D plane. This measure suggests three different clusters in the figure. In document clustering, the distance measure is often also Euclidean distance. Different distance measures give rise to different clusterings. Thus, the distance measure is an important means by which we can influence the outcome of clustering.

Flat clustering creates a flat set of clusters without any explicit structure that would relate clusters to each other. Hierarchical clustering creates a hierarchy of clusters and will be covered in Chapter 17. Chapter 17 also addresses the difficult problem of labeling clusters automatically.

A second important distinction can be made between hard and soft clustering algorithms. Hard clustering computes a hard assignment: each document is a member of exactly one cluster.
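As a concrete illustration of a distance measure, the following minimal sketch computes Euclidean distance over 2D points; the coordinates are invented to loosely resemble the three groups in Figure 16.1, not taken from the figure:

```python
import math

def euclidean(x, y):
    """Euclidean distance between two points given as coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Three visually separated groups of 2D points (coordinates made up
# for illustration, in the spirit of Figure 16.1).
points = [(0.1, 0.2), (0.2, 0.1),   # lower-left group
          (2.0, 2.1), (2.1, 2.0),   # upper-right group
          (0.1, 2.0), (0.2, 2.1)]   # upper-left group

# Under this measure, within-group distances are much smaller than
# between-group distances -- which is what suggests three clusters.
print(round(euclidean(points[0], points[1]), 3))  # within-group: 0.141
print(round(euclidean(points[0], points[2]), 3))  # between-group: 2.687
```

Swapping in a different measure (e.g., cosine distance over term vectors) would change which documents count as "close" and hence change the resulting clustering.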
The assignment of soft clustering algorithms is soft: a document's assignment is a distribution over all clusters. In a soft assignment, a document has fractional membership in several clusters. Latent semantic indexing, a form of dimensionality reduction, is a soft clustering algorithm (Chapter 18, page 417).

This chapter motivates the use of clustering in information retrieval by introducing a number of applications (Section 16.1), defines the problem we are trying to solve in clustering (Section 16.2) and discusses measures for evaluating cluster quality (Section 16.3).
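To make the hard/soft distinction concrete, a small sketch contrasting the two kinds of assignment (document names and membership numbers are invented for illustration):

```python
# Hard assignment: each document belongs to exactly one cluster,
# represented here as a single cluster index per document.
hard = {"doc1": 0, "doc2": 2, "doc3": 0}

# Soft assignment: each document gets a distribution over all clusters;
# its fractional memberships sum to 1.
soft = {
    "doc1": [0.8, 0.1, 0.1],
    "doc2": [0.05, 0.15, 0.8],
    "doc3": [0.6, 0.3, 0.1],
}

# A soft assignment can always be hardened by picking the most
# probable cluster for each document.
hardened = {d: dist.index(max(dist)) for d, dist in soft.items()}
print(hardened)  # {'doc1': 0, 'doc2': 2, 'doc3': 0}
```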
It then describes two flat clustering algorithms: K-means (Section 16.4), a hard clustering algorithm, and the Expectation-Maximization (or EM) algorithm (Section 16.5), a soft clustering algorithm. K-means is perhaps the most widely used flat clustering algorithm due to its simplicity and efficiency. The EM algorithm is a generalization of K-means and can be applied to a large variety of document representations and distributions.

16.1 Clustering in information retrieval

The cluster hypothesis states the fundamental assumption we make when using clustering in information retrieval.

Cluster hypothesis.
Documents in the same cluster behave similarly with respect to relevance to information needs.

The hypothesis states that if there is a document from a cluster that is relevant to a search request, then it is likely that other documents from the same cluster are also relevant.
This is because clustering puts together documents that share many terms. The cluster hypothesis essentially is the contiguity hypothesis in Chapter 14 (page 289). In both cases, we posit that similar documents behave similarly with respect to relevance.

Application              | What is clustered?      | Benefit                                                     | Example
Search result clustering | search results          | more effective information presentation to user             | Figure 16.2
Scatter-Gather           | (subsets of) collection | alternative user interface: “search without typing”         | Figure 16.3
Collection clustering    | collection              | effective information presentation for exploratory browsing | McKeown et al. (2002), http://news.google.com
Language modeling        | collection              | increased precision and/or recall                           | Liu and Croft (2004)
Cluster-based retrieval  | collection              | higher efficiency: faster search                            | Salton (1971a)

◮ Table 16.1 Some applications of clustering in information retrieval.

Table 16.1 shows some of the main applications of clustering in information retrieval. They differ in the set of documents that they cluster – search results, collection or subsets of the collection – and the aspect of an information retrieval system they try to improve – user experience, user interface, effectiveness or efficiency of the search system. But they are all based on the basic assumption stated by the cluster hypothesis.

The first application mentioned in Table 16.1 is search result clustering, where by search results we mean the documents that were returned in response to a query.
The default presentation of search results in information retrieval is a simple list. Users scan the list from top to bottom until they have found the information they are looking for. Instead, search result clustering clusters the search results, so that similar documents appear together. It is often easier to scan a few coherent groups than many individual documents. This is particularly useful if a search term has different word senses.
The example in Figure 16.2 is jaguar. Three frequent senses on the web refer to the car, the animal and an Apple operating system. The Clustered Results panel returned by the Vivísimo search engine (http://vivisimo.com) can be a more effective user interface for understanding what is in the search results than a simple list of documents.

A better user interface is also the goal of Scatter-Gather, the second application in Table 16.1. Scatter-Gather clusters the whole collection to get groups of documents that the user can select or gather. The selected groups are merged and the resulting set is again clustered.
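The gather-and-recluster loop just described can be sketched as follows. The `cluster` function here is a crude round-robin stand-in for a real clustering algorithm such as K-means, and `pick` stands in for the user's selection of clusters; both names and the toy data are assumptions for illustration only:

```python
def cluster(docs, k):
    """Hypothetical stand-in for a real clustering algorithm: partition
    docs into k groups round-robin. A real system would group documents
    by content similarity instead."""
    groups = [[] for _ in range(k)]
    for i, doc in enumerate(docs):
        groups[i % k].append(doc)
    return groups

def scatter_gather(docs, pick, k=8, rounds=2):
    """One user session: cluster the documents ("scatter"), let the user
    select some clusters via the pick callback, merge the selected
    clusters ("gather"), and repeat on the smaller merged set."""
    for _ in range(rounds):
        groups = cluster(docs, k)                        # scatter
        selected = pick(groups)                          # user selects clusters
        docs = [d for j in selected for d in groups[j]]  # gather and merge
    return docs

# Example: start with 32 documents and always keep the first two of four
# clusters; each round halves the working set (32 -> 16 -> 8).
remaining = scatter_gather(list(range(32)), pick=lambda gs: [0, 1], k=4)
print(len(remaining))  # 8
```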
This process is repeated until a cluster of interest is found. An example is shown in Figure 16.3.

◮ Figure 16.2 Clustering of search results to improve recall. None of the top hits cover the animal sense of jaguar, but users can easily access it by clicking on the cat cluster in the Clustered Results panel on the left (third arrow from the top).

Automatically generated clusters like those in Figure 16.3 are not as neatly organized as a manually constructed hierarchical tree like the Open Directory at http://dmoz.org. Also, finding descriptive labels for clusters automatically is a difficult problem (Section 17.7, page 396).
But cluster-based navigation is an interesting alternative to keyword searching, the standard information retrieval paradigm. This is especially true in scenarios where users prefer browsing over searching because they are unsure about which search terms to use.

As an alternative to the user-mediated iterative clustering in Scatter-Gather, we can also compute a static hierarchical clustering of a collection that is not influenced by user interactions (“Collection clustering” in Table 16.1). Google News and its precursor, the Columbia NewsBlaster system, are examples of this approach. In the case of news, we need to frequently recompute the clustering to make sure that users can access the latest breaking stories.
Clustering is well suited for access to a collection of news stories since news reading is not really search, but rather a process of selecting a subset of stories about recent events.

◮ Figure 16.3 An example of a user session in Scatter-Gather. A collection of New York Times news stories is clustered (“scattered”) into eight clusters (top row).
The user manually gathers three of these into a smaller collection International Stories and performs another scattering operation. This process repeats until a small cluster with relevant documents is found (e.g., Trinidad).

The fourth application of clustering exploits the cluster hypothesis directly for improving search results, based on a clustering of the entire collection. We use a standard inverted index to identify an initial set of documents that match the query, but we then add other documents from the same clusters even if they have low similarity to the query.
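A minimal sketch of this fourth application, cluster-based retrieval; the tiny inverted index, the document names, and the precomputed clustering below are invented for illustration:

```python
# Hypothetical toy postings: term -> set of matching documents.
inverted_index = {
    "jaguar": {"d1", "d2"},
    "car": {"d2", "d3"},
}

# Precomputed hard clustering of the whole collection: doc -> cluster id.
cluster_of = {"d1": 0, "d2": 1, "d3": 1, "d4": 1, "d5": 2}

def cluster_based_retrieval(term):
    """Find the initial matches with the inverted index, then add every
    other document from the matches' clusters -- exploiting the cluster
    hypothesis: cluster mates are likely relevant even if they have low
    similarity to the query itself."""
    matches = inverted_index.get(term, set())
    hit_clusters = {cluster_of[d] for d in matches}
    return matches | {d for d, c in cluster_of.items() if c in hit_clusters}

print(sorted(cluster_based_retrieval("car")))  # ['d2', 'd3', 'd4']
```

Here d4 never matches the query term, but it is pulled in because it shares a cluster with the matching documents d2 and d3.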