An introduction to information retrieval. Manning_ Raghavan (2009) (811397), страница 88

Файл №811397 An introduction to information retrieval. Manning_ Raghavan (2009) (An introduction to information retrieval. Manning_ Raghavan (2009).pdf) 88 страницаAn introduction to information retrieval. Manning_ Raghavan (2009) (811397) страница 882020-08-252020-08-25СтудИзба

An introduction to information retrieval. Manning_ Raghavan (2009).pdf

Просмтор этого файла доступен только зарегистрированным пользователям. Но у нас супер быстрая регистрация: достаточно только электронной почты!

Регистрация/авторизация

Текст из файла (страница 88)

To determine the cluster cardinality in thisway, we create a generalized objective function that combines two elements:distortion, a measure of how much documents deviate from the prototype oftheir clusters (e.g., RSS for K-means); and a measure of model complexity.

Weinterpret a clustering here as a model of the data. Model complexity in clustering is usually the number of clusters or a function thereof. For K-means,we then get this selection criterion for K:K = arg min[RSSmin (K ) + λK ]Kwhere λ is a weighting factor. A large value of λ favors solutions with fewclusters. For λ = 0, there is no penalty for more clusters and K = N is thebest solution.Online edition (c) 2009 Cambridge UP16.4A KAIKE I NFORMATIONC RITERION(16.12)367K-meansThe obvious difficulty with Equation (16.11) is that we need to determineλ. Unless this is easier than determining K directly, then we are back tosquare one. In some cases, we can choose values of λ that have worked wellfor similar data sets in the past. For example, if we periodically cluster newsstories from a newswire, there is likely to be a fixed value of λ that gives usthe right K in each successive clustering.

In this application, we would notbe able to determine K based on past experience since K changes.A theoretical justification for Equation (16.11) is the Akaike Information Criterion or AIC, an information-theoretic measure that trades off distortionagainst model complexity. The general form of AIC is:AIC:K = arg min[−2L(K ) + 2q(K )]Kwhere − L(K ), the negative maximum log-likelihood of the data for K clusters, is a measure of distortion and q(K ), the number of parameters of amodel with K clusters, is a measure of model complexity. We will not attempt to derive the AIC here, but it is easy to understand intuitively. Thefirst property of a good model of the data is that each data point is modeledwell by the model. This is the goal of low distortion.

But models shouldalso be small (i.e., have low model complexity) since a model that merelydescribes the data (and therefore has zero distortion) is worthless. AIC provides a theoretical justification for one particular way of weighting these twofactors, distortion and model complexity, when selecting a model.For K-means, the AIC can be stated as follows:(16.13)AIC:K = arg min[RSSmin (K ) + 2MK ]KEquation (16.13) is a special case of Equation (16.11) for λ = 2M.To derive Equation (16.13) from Equation (16.12) observe that q(K ) = KMin K-means since each element of the K centroids is a parameter that can bevaried independently; and that L(K ) = −(1/2)RSSmin (K ) (modulo a constant) if we view the model underlying K-means as a Gaussian mixture withhard assignment, uniform cluster priors and identical spherical covariancematrices (see Exercise 16.19).The derivation of AIC is based on a number of assumptions, e.g., that thedata are independent and identically distributed.

These assumptions areonly approximately true for data sets in information retrieval. As a consequence, the AIC can rarely be applied without modification in text clustering.In Figure 16.8, the dimensionality of the vector space is M ≈ 50,000. Thus,d min (1) < 5000,2MK > 50,000 dominates the smaller RSS-based term (RSSnot shown in the figure) and the minimum of the expression is reached forK = 1. But as we know, K = 4 (corresponding to the four classes China,Online edition (c) 2009 Cambridge UP36816Flat clusteringGermany, Russia and Sports) is a better choice than K = 1. In practice, Equation (16.11) is often more useful than Equation (16.13) – with the caveat thatwe need to come up with an estimate for λ.?Exercise 16.4Why are documents that do not use the same term for the concept car likely to endup in the same cluster in K-means clustering?Exercise 16.5Two of the possible termination conditions for K-means were (1) assignment does notchange, (2) centroids do not change (page 361).

Do these two conditions imply eachother?✄16.5MODEL - BASEDCLUSTERINGModel-based clusteringIn this section, we describe a generalization of K-means, the EM algorithm.It can be applied to a larger variety of document representations and distributions than K-means.In K-means, we attempt to find centroids that are good representatives. Wecan view the set of K centroids as a model that generates the data. Generatinga document in this model consists of first picking a centroid at random andthen adding some noise.

If the noise is normally distributed, this procedurewill result in clusters of spherical shape. Model-based clustering assumes thatthe data were generated by a model and tries to recover the original modelfrom the data. The model that we recover from the data then defines clustersand an assignment of documents to clusters.A commonly used criterion for estimating the model parameters is maximum likelihood. In K-means, the quantity exp(−RSS) is proportional to thelikelihood that a particular model (i.e., a set of centroids) generated the data.For K-means, maximum likelihood and minimal RSS are equivalent criteria.We denote the model parameters by Θ.

In K-means, Θ = {~µ1 , . . . , ~µK }.More generally, the maximum likelihood criterion is to select the parameters Θ that maximize the log-likelihood of generating the data D:NΘ = arg max L( D |Θ) = arg max log ∏ P(dn |Θ) = arg maxΘΘn =1ΘN∑ log P(dn |Θ)n =1L( D |Θ) is the objective function that measures the goodness of the clustering.

Given two clusterings with the same number of clusters, we prefer theone with higher L( D |Θ).This is the same approach we took in Chapter 12 (page 237) for languagemodeling and in Section 13.1 (page 265) for text classification. In text classification, we chose the class that maximizes the likelihood of generating aparticular document. Here, we choose the clustering Θ that maximizes theOnline edition (c) 2009 Cambridge UP36916.5 Model-based clusteringE XPECTATION M AXIMIZATIONALGORITHM(16.14)(16.15)likelihood of generating a given set of documents. Once we have Θ, we cancompute an assignment probability P(d|ωk ; Θ) for each document-clusterpair. This set of assignment probabilities defines a soft clustering.An example of a soft assignment is that a document about Chinese carsmay have a fractional membership of 0.5 in each of the two clusters Chinaand automobiles, reflecting the fact that both topics are pertinent.

A hard clustering like K-means cannot model this simultaneous relevance to two topics.Model-based clustering provides a framework for incorporating our knowledge about a domain. K-means and the hierarchical algorithms in Chapter 17 make fairly rigid assumptions about the data. For example, clustersin K-means are assumed to be spheres. Model-based clustering offers moreflexibility.

The clustering model can be adapted to what we know aboutthe underlying distribution of the data, be it Bernoulli (as in the examplein Table 16.3), Gaussian with non-spherical variance (another model that isimportant in document clustering) or a member of a different family.A commonly used algorithm for model-based clustering is the ExpectationMaximization algorithm or EM algorithm. EM clustering is an iterative algorithm that maximizes L( D |Θ). EM can be applied to many different types ofprobabilistic modeling. We will work with a mixture of multivariate Bernoullidistributions here, the distribution we know from Section 11.3 (page 222) andSection 13.3 (page 263):!!P ( d |ωk ; Θ) =∏qmktm ∈d∏ (1 − qmk )tm ∈/dwhere Θ = {Θ1 , .

. . , ΘK }, Θk = (αk , q1k , . . . , q Mk ), and qmk = P(Um = 1|ωk )are the parameters of the model.3 P(Um = 1|ωk ) is the probability that adocument from cluster ωk contains term tm . The probability αk is the prior ofcluster ωk : the probability that a document d is in ωk if we have no information about d.The mixture model then is:!!KP ( d |Θ) =∑ αk ∏k =1tm ∈dqmk∏ (1 − qmk )tm ∈/dIn this model, we generate a document by first picking a cluster k with probability αk and then generating the terms of the document according to theparameters qmk . Recall that the document representation of the multivariateBernoulli is a vector of M Boolean values (and not a real-valued vector).3.

Um is the random variable we defined in Section 13.3 (page 266) for the Bernoulli Naive Bayesmodel. It takes the values 1 (term tm is present in the document) and 0 (term tm is absent in thedocument).Online edition (c) 2009 Cambridge UP37016EXPECTATION STEPMAXIMIZATION STEP(16.16)Flat clusteringHow do we use EM to infer the parameters of the clustering from the data?That is, how do we choose parameters Θ that maximize L( D |Θ)? EM is similar to K-means in that it alternates between an expectation step, correspondingto reassignment, and a maximization step, corresponding to recomputation ofthe parameters of the model.

The parameters of K-means are the centroids,the parameters of the instance of EM in this section are the αk and qmk .The maximization step recomputes the conditional parameters qmk and thepriors αk as follows:Maximization step: qmk =∑nN=1 rnk I (tm ∈ dn )∑nN=1 rnkαk =∑nN=1 rnkNwhere I (tm ∈ dn ) = 1 if tm ∈ dn and 0 otherwise and rnk is the soft assignment of document d n to cluster k as computed in the preceding iteration.(We’ll address the issue of initialization in a moment.) These are the maximum likelihood estimates for the parameters of the multivariate Bernoullifrom Table 13.3 (page 268) except that documents are assigned fractionally toclusters here.

These maximum likelihood estimates maximize the likelihoodof the data given the model.The expectation step computes the soft assignment of documents to clusters given the current parameters qmk and αk :(16.17)Expectation step :rnk =αk (∏tm ∈dn qmk )(∏tm ∈/ dn (1 − qmk ))K∑k=1 αk (∏tm ∈dn qmk )(∏tm ∈/ dn (1 − qmk ))This expectation step applies Equations (16.14) and (16.15) to computing thelikelihood that ωk generated document dn .

It is the classification procedurefor the multivariate Bernoulli in Table 13.3. Thus, the expectation step isnothing else but Bernoulli Naive Bayes classification (including normalization, i.e. dividing by the denominator, to get a probability distribution overclusters).We clustered a set of 11 documents into two clusters using EM in Table 16.3. After convergence in iteration 25, the first 5 documents are assignedto cluster 1 (ri,1 = 1.00) and the last 6 to cluster 2 (ri,1 = 0.00).

Характеристики

Тип файла

PDF-файл

Размер

6,58 Mb

Материал

An introduction to information retrieval. Manning_ Raghavan (2009).pdf

Тип материала

Книга

Предмет

Анализ текстовых данных и информационный поиск

Высшее учебное заведение

МГУ им. Ломоносова

Список файлов книги

an-introduction-to-information-retrieval.-manning_-raghavan-2009.pdf.rar

An introduction to information retrieval. Manning_ Raghavan (2009).pdf

Поделитесь ссылкой:

Ставлю 10/10
Все нравится, очень удобный сайт, помогает в учебе. Кроме этого, можно заработать самому, выставляя готовые учебные материалы на продажу здесь. Рейтинги и отзывы на преподавателей очень помогают сориентироваться в начале нового семестра. Спасибо за такую функцию. Ставлю максимальную оценку.

Лучшая платформа для успешной сдачи сессии
Познакомился со СтудИзбой благодаря своему другу, очень нравится интерфейс, количество доступных файлов, цена, в общем, все прекрасно. Даже сам продаю какие-то свои работы.

Студизба ван лав ❤
Очень офигенный сайт для студентов. Много полезных учебных материалов. Пользуюсь студизбой с октября 2021 года. Серьёзных нареканий нет. Хотелось бы, что бы ввели подписочную модель и сделали материалы дешевле 300 рублей в рамках подписки бесплатными.

Отличный сайт
Лично меня всё устраивает - и покупка, и продажа; и цены, и возможность предпросмотра куска файла, и обилие бесплатных файлов (в подборках по авторам, читай, ВУЗам и факультетам). Есть определённые баги, но всё решаемо, да и администраторы реагируют в течение суток.

Маленький отзыв о большом помощнике!
Студизба спасает в те моменты, когда сроки горят, а работ накопилось достаточно. Довольно удобный сайт с простой навигацией и огромным количеством материалов.

Студ. Изба как крупнейший сборник работ для студентов
Тут дофига бывает всего полезного. Печально, что бывают предметы по которым даже одного бесплатного решения нет, но это скорее вопрос к студентам. В остальном всё здорово.

Спасательный островок
Если уже не успеваешь разобраться или застрял на каком-то задание поможет тебе быстро и недорого решить твою проблему.

Всё и так отлично
Всё очень удобно. Особенно круто, что есть система бонусов и можно выводить остатки денег. Очень много качественных бесплатных файлов.

Отзыв о системе "Студизба"
Отличная платформа для распространения работ, востребованных студентами. Хорошо налаженная и качественная работа сайта, огромная база заданий и аудитория.

Отличный помощник
Отличный сайт с кучей полезных файлов, позволяющий найти много методичек / учебников / отзывов о вузах и преподователях.

Отлично помогает студентам в любой момент для решения трудных и незамедлительных задач
Хотелось бы больше конкретной информации о преподавателях. А так в принципе хороший сайт, всегда им пользуюсь и ни разу не было желания прекратить. Хороший сайт для помощи студентам, удобный и приятный интерфейс. Из недостатков можно выделить только отсутствия небольшого количества файлов.

Спасибо за шикарный сайт
Великолепный сайт на котором студент за не большие деньги может найти помощь с дз, проектами курсовыми, лабораторными, а также узнать отзывы на преподавателей и бесплатно скачать пособия.