An introduction to information retrieval. Manning_ Raghavan (2009) (811397), страница 91

Файл №811397 An introduction to information retrieval. Manning_ Raghavan (2009) (An introduction to information retrieval. Manning_ Raghavan (2009).pdf) 91 страницаAn introduction to information retrieval. Manning_ Raghavan (2009) (811397) страница 912020-08-252020-08-25СтудИзба

An introduction to information retrieval. Manning_ Raghavan (2009).pdf

Просмтор этого файла доступен только зарегистрированным пользователям. Но у нас супер быстрая регистрация: достаточно только электронной почты!

Регистрация/авторизация

Текст из файла (страница 91)

. . , sK −1 are the combination similaritiesof the successive merges of an HAC, then s1 ≥ s2 ≥ . . . ≥ sK −1 holds. A nonmonotonic hierarchical clustering contains at least one inversion si < si+1and contradicts the fundamental assumption that we chose the best mergeavailable at each step. We will see an example of an inversion in Figure 17.12.Hierarchical clustering does not require a prespecified number of clusters.However, in some applications we want a partition of disjoint clusters just asOnline edition (c) 2009 Cambridge UPAg trade reform.Back−to−school spending is upLloyd’s CEO questionedLloyd’s chief / U.S. grillingViag stays positiveChrysler / Latin AmericaOhio Blue CrossJapanese prime minister / MexicoCompuServe reports lossSprint / Internet access servicePlanet HollywoodTrocadero: tripling of revenuesGerman unions splitWar hero Colin PowellWar hero Colin PowellOil prices slipChains may raise pricesClinton signs lawLawsuit against tobacco companiessuits against tobacco firmsIndiana tobacco lawsuitMost active stocksMexican marketsHog prices tumbleNYSE closing averagesBritish FTSE indexFed holds interest rates steadyFed to keep interest rates steadyFed keeps interest rates steadyFed keeps interest rates steady0.80.60.40.20.017.1 Hierarchical agglomerative clustering379◮ Figure 17.1 A dendrogram of a single-link clustering of 30 documents fromReuters-RCV1.

Two possible cuts of the dendrogram are shown: at 0.4 into 24 clustersand at 0.1 into 12 clusters.Online edition (c) 2009 Cambridge UP1.038017 Hierarchical clusteringin flat clustering. In those cases, the hierarchy needs to be cut at some point.A number of criteria can be used to determine the cutting point:• Cut at a prespecified level of similarity. For example, we cut the dendrogram at 0.4 if we want clusters with a minimum combination similarityof 0.4. In Figure 17.1, cutting the diagram at y = 0.4 yields 24 clusters(grouping only documents with high similarity together) and cutting it aty = 0.1 yields 12 clusters (one large financial news cluster and 11 smallerclusters).• Cut the dendrogram where the gap between two successive combinationsimilarities is largest.

Such large gaps arguably indicate “natural” clusterings. Adding one more cluster decreases the quality of the clusteringsignificantly, so cutting before this steep decrease occurs is desirable. Thisstrategy is analogous to looking for the knee in the K-means graph in Figure 16.8 (page 366).• Apply Equation (16.11) (page 366):K = arg min[RSS(K ′ ) + λK ′ ]K′where K ′ refers to the cut of the hierarchy that results in K ′ clusters, RSS isthe residual sum of squares and λ is a penalty for each additional cluster.Instead of RSS, another measure of distortion can be used.• As in flat clustering, we can also prespecify the number of clusters K andselect the cutting point that produces K clusters.A simple, naive HAC algorithm is shown in Figure 17.2.

We first computethe N × N similarity matrix C. The algorithm then executes N − 1 stepsof merging the currently most similar clusters. In each iteration, the twomost similar clusters are merged and the rows and columns of the mergedcluster i in C are updated.2 The clustering is stored as a list of merges inA. I indicates which clusters are still available to be merged.

The functionSIM (i, m, j) computes the similarity of cluster j with the merge of clusters iand m. For some HAC algorithms, SIM (i, m, j) is simply a function of C [ j][i ]and C [ j][m], for example, the maximum of these two values for single-link.We will now refine this algorithm for the different similarity measuresof single-link and complete-link clustering (Section 17.2) and group-averageand centroid clustering (Sections 17.3 and 17.4). The merge criteria of thesefour variants of HAC are shown in Figure 17.3.2.

We assume that we use a deterministic method for breaking ties, such as always choose themerge that is the first cluster with respect to a total ordering of the subsets of the document setD.Online edition (c) 2009 Cambridge UP38117.1 Hierarchical agglomerative clusteringS IMPLE HAC(d1 , . . . , d N )1 for n ← 1 to N2 do for i ← 1 to N3do C [n][i ] ← S IM (dn , di )4I [n] ← 1 (keeps track of active clusters)5 A ← [] (assembles clustering as a sequence of merges)6 for k ← 1 to N − 17 do hi, mi ← arg max{hi,mi:i6=m∧ I [i]=1∧ I [m]=1} C [i ][m]8A.A PPEND (hi, mi) (store merge)9for j ← 1 to N10do C [i ][ j] ← S IM (i, m, j)11C [ j][i ] ← S IM (i, m, j)12I [m] ← 0 (deactivate cluster)13 return A◮ Figure 17.2 A simple, but inefficient HAC algorithm.bbbbbbb(a) single-link: maximum similarityb(b) complete-link: minimum similaritybbbbbbbb(c) centroid: average inter-similarity (d) group-average: average of all similarities◮ Figure 17.3 The different notions of cluster similarity used by the four HAC algorithms.

An inter-similarity is a similarity between two documents from differentclusters.Online edition (c) 2009 Cambridge UP38217 Hierarchical clustering32d1d2d3d4××××d7d8××d51d6××120321d1d2d3d4××××d5d6d7d8××××120034034◮ Figure 17.4 A single-link (left) and complete-link (right) clustering of eight documents. The ellipses correspond to successive clustering stages. Left: The single-linksimilarity of the two upper two-point clusters is the similarity of d2 and d3 (solidline), which is greater than the single-link similarity of the two left two-point clusters(dashed line). Right: The complete-link similarity of the two upper two-point clustersis the similarity of d1 and d4 (dashed line), which is smaller than the complete-linksimilarity of the two left two-point clusters (solid line).17.2SINGLE - LINKCLUSTERINGCOMPLETE - LINKCLUSTERINGSingle-link and complete-link clusteringIn single-link clustering or single-linkage clustering, the similarity of two clusters is the similarity of their most similar members (see Figure 17.3, (a))3.

Thissingle-link merge criterion is local. We pay attention solely to the area wherethe two clusters come closest to each other. Other, more distant parts of thecluster and the clusters’ overall structure are not taken into account.In complete-link clustering or complete-linkage clustering, the similarity of twoclusters is the similarity of their most dissimilar members (see Figure 17.3, (b)).This is equivalent to choosing the cluster pair whose merge has the smallestdiameter. This complete-link merge criterion is non-local; the entire structureof the clustering can influence merge decisions.

This results in a preferencefor compact clusters with small diameters over long, straggly clusters, butalso causes sensitivity to outliers. A single document far from the center canincrease diameters of candidate merge clusters dramatically and completelychange the final clustering.Figure 17.4 depicts a single-link and a complete-link clustering of eightdocuments. The first four steps, each producing a cluster consisting of a pairof two documents, are identical. Then single-link clustering joins the upper two pairs (and after that the lower two pairs) because on the maximumsimilarity definition of cluster similarity, those two clusters are closest. Complete3. Throughout this chapter, we equate similarity with proximity in 2D depictions of clustering.Online edition (c) 2009 Cambridge UP0.60.40.20.0383NYSE closing averagesHog prices tumbleOil prices slipAg trade reform.Chrysler / Latin AmericaJapanese prime minister / MexicoFed holds interest rates steadyFed to keep interest rates steadyFed keeps interest rates steadyFed keeps interest rates steadyMexican marketsBritish FTSE indexWar hero Colin PowellWar hero Colin PowellLloyd’s CEO questionedLloyd’s chief / U.S.

grillingOhio Blue CrossLawsuit against tobacco companiessuits against tobacco firmsIndiana tobacco lawsuitViag stays positiveMost active stocksCompuServe reports lossSprint / Internet access servicePlanet HollywoodTrocadero: tripling of revenuesBack−to−school spending is upGerman unions splitChains may raise pricesClinton signs law0.817.2 Single-link and complete-link clustering◮ Figure 17.5 A dendrogram of a complete-link clustering. The same 30 documentswere clustered with single-link clustering in Figure 17.1.Online edition (c) 2009 Cambridge UP1.038417 Hierarchical clustering××××××××××××◮ Figure 17.6 Chaining in single-link clustering.

The local criterion in single-linkclustering can cause undesirable elongated clusters.CONNECTEDCOMPONENTCLIQUElink clustering joins the left two pairs (and then the right two pairs) becausethose are the closest pairs according to the minimum-similarity definition ofcluster similarity.4Figure 17.1 is an example of a single-link clustering of a set of documentsand Figure 17.5 is the complete-link clustering of the same set.

When cuttingthe last merge in Figure 17.5, we obtain two clusters of similar size (documents 1–16, from NYSE closing averages to Lloyd’s chief / U.S. grilling, anddocuments 17–30, from Ohio Blue Cross to Clinton signs law). There is no cutof the dendrogram in Figure 17.1 that would give us an equally balancedclustering.Both single-link and complete-link clustering have graph-theoretic interpretations. Define sk to be the combination similarity of the two clustersmerged in step k, and G (sk ) the graph that links all data points with a similarity of at least sk . Then the clusters after step k in single-link clustering are theconnected components of G (sk ) and the clusters after step k in complete-linkclustering are maximal cliques of G (sk ). A connected component is a maximalset of connected points such that there is a path connecting each pair.

Характеристики

Тип файла

PDF-файл

Размер

6,58 Mb

Материал

An introduction to information retrieval. Manning_ Raghavan (2009).pdf

Тип материала

Книга

Предмет

Анализ текстовых данных и информационный поиск

Высшее учебное заведение

МГУ им. Ломоносова

Список файлов книги

an-introduction-to-information-retrieval.-manning_-raghavan-2009.pdf.rar

An introduction to information retrieval. Manning_ Raghavan (2009).pdf

Поделитесь ссылкой:

Ставлю 10/10
Все нравится, очень удобный сайт, помогает в учебе. Кроме этого, можно заработать самому, выставляя готовые учебные материалы на продажу здесь. Рейтинги и отзывы на преподавателей очень помогают сориентироваться в начале нового семестра. Спасибо за такую функцию. Ставлю максимальную оценку.

Лучшая платформа для успешной сдачи сессии
Познакомился со СтудИзбой благодаря своему другу, очень нравится интерфейс, количество доступных файлов, цена, в общем, все прекрасно. Даже сам продаю какие-то свои работы.

Студизба ван лав ❤
Очень офигенный сайт для студентов. Много полезных учебных материалов. Пользуюсь студизбой с октября 2021 года. Серьёзных нареканий нет. Хотелось бы, что бы ввели подписочную модель и сделали материалы дешевле 300 рублей в рамках подписки бесплатными.

Отличный сайт
Лично меня всё устраивает - и покупка, и продажа; и цены, и возможность предпросмотра куска файла, и обилие бесплатных файлов (в подборках по авторам, читай, ВУЗам и факультетам). Есть определённые баги, но всё решаемо, да и администраторы реагируют в течение суток.

Маленький отзыв о большом помощнике!
Студизба спасает в те моменты, когда сроки горят, а работ накопилось достаточно. Довольно удобный сайт с простой навигацией и огромным количеством материалов.

Студ. Изба как крупнейший сборник работ для студентов
Тут дофига бывает всего полезного. Печально, что бывают предметы по которым даже одного бесплатного решения нет, но это скорее вопрос к студентам. В остальном всё здорово.

Спасательный островок
Если уже не успеваешь разобраться или застрял на каком-то задание поможет тебе быстро и недорого решить твою проблему.

Всё и так отлично
Всё очень удобно. Особенно круто, что есть система бонусов и можно выводить остатки денег. Очень много качественных бесплатных файлов.

Отзыв о системе "Студизба"
Отличная платформа для распространения работ, востребованных студентами. Хорошо налаженная и качественная работа сайта, огромная база заданий и аудитория.

Отличный помощник
Отличный сайт с кучей полезных файлов, позволяющий найти много методичек / учебников / отзывов о вузах и преподователях.

Отлично помогает студентам в любой момент для решения трудных и незамедлительных задач
Хотелось бы больше конкретной информации о преподавателях. А так в принципе хороший сайт, всегда им пользуюсь и ни разу не было желания прекратить. Хороший сайт для помощи студентам, удобный и приятный интерфейс. Из недостатков можно выделить только отсутствия небольшого количества файлов.

Спасибо за шикарный сайт
Великолепный сайт на котором студент за не большие деньги может найти помощь с дз, проектами курсовыми, лабораторными, а также узнать отзывы на преподавателей и бесплатно скачать пособия.