Bishop, C. M., Pattern Recognition and Machine Learning (2006), Section 2.4: The Exponential Family (continued)
In some circumstances, it will be convenient to remove this constraint by expressing the distribution in terms of only M − 1 parameters. This can be achieved by using the relationship (2.209) to eliminate µ_M, expressing it in terms of the remaining {µ_k} where k = 1, . . . , M − 1, thereby leaving M − 1 parameters. Note that these remaining parameters are still subject to the constraints

$$0 \leqslant \mu_k \leqslant 1, \qquad \sum_{k=1}^{M-1} \mu_k \leqslant 1. \tag{2.210}$$

Making use of the constraint (2.209), the multinomial distribution in this representation then becomes

$$
\begin{aligned}
\exp\left\{\sum_{k=1}^{M} x_k \ln \mu_k\right\}
&= \exp\left\{\sum_{k=1}^{M-1} x_k \ln \mu_k + \left(1 - \sum_{k=1}^{M-1} x_k\right)\ln\left(1 - \sum_{k=1}^{M-1}\mu_k\right)\right\} \\
&= \exp\left\{\sum_{k=1}^{M-1} x_k \ln\left(\frac{\mu_k}{1 - \sum_j \mu_j}\right) + \ln\left(1 - \sum_{k=1}^{M-1}\mu_k\right)\right\}.
\end{aligned}
\tag{2.211}
$$

We now identify

$$\ln\left(\frac{\mu_k}{1 - \sum_j \mu_j}\right) = \eta_k \tag{2.212}$$

which we can solve for µ_k by first summing both sides over k and then rearranging and back-substituting to give

$$\mu_k = \frac{\exp(\eta_k)}{1 + \sum_j \exp(\eta_j)}. \tag{2.213}$$

This is called the softmax function, or the normalized exponential.
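As an illustrative aside (not part of the original text), the softmax map (2.213) and its inverse (2.212) can be checked numerically; the following minimal NumPy sketch uses function names and a three-state example of our own choosing.

```python
import numpy as np

def mu_from_eta(eta):
    """Softmax map (2.213): mu_k = exp(eta_k) / (1 + sum_j exp(eta_j)).
    (Naive exponentiation; very large eta would need the usual max-subtraction trick.)"""
    e = np.exp(eta)
    return e / (1.0 + e.sum())

def eta_from_mu(mu):
    """Inverse map (2.212): eta_k = ln( mu_k / (1 - sum_j mu_j) )."""
    mu = np.asarray(mu, dtype=float)
    return np.log(mu / (1.0 - mu.sum()))

# Round trip on the first M-1 = 2 components of a hypothetical 3-state distribution;
# the remaining probability, mu_3 = 0.3, is fixed by the constraint (2.209).
mu = np.array([0.2, 0.5])
print(np.allclose(mu_from_eta(eta_from_mu(mu)), mu))   # True
```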
In this representation, the multinomial distribution therefore takes the form

$$p(\mathbf{x}|\boldsymbol{\eta}) = \left(1 + \sum_{k=1}^{M-1} \exp(\eta_k)\right)^{-1} \exp\left(\boldsymbol{\eta}^{\mathrm{T}}\mathbf{x}\right). \tag{2.214}$$

This is the standard form of the exponential family, with parameter vector η = (η_1, . . . , η_{M−1})^T in which

$$\mathbf{u}(\mathbf{x}) = \mathbf{x} \tag{2.215}$$
$$h(\mathbf{x}) = 1 \tag{2.216}$$
$$g(\boldsymbol{\eta}) = \left(1 + \sum_{k=1}^{M-1} \exp(\eta_k)\right)^{-1}. \tag{2.217}$$

Finally, let us consider the Gaussian distribution. For the univariate Gaussian, we have

$$
\begin{aligned}
p(x|\mu, \sigma^2) &= \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{-\frac{1}{2\sigma^2}(x - \mu)^2\right\} \\
&= \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{-\frac{1}{2\sigma^2}x^2 + \frac{\mu}{\sigma^2}x - \frac{1}{2\sigma^2}\mu^2\right\}
\end{aligned}
\tag{2.218, 2.219}
$$

which, after some simple rearrangement, can be cast in the standard exponential family form (2.194) (Exercise 2.57) with

$$\boldsymbol{\eta} = \begin{pmatrix} \mu/\sigma^2 \\ -1/2\sigma^2 \end{pmatrix} \tag{2.220}$$
$$\mathbf{u}(x) = \begin{pmatrix} x \\ x^2 \end{pmatrix} \tag{2.221}$$
$$h(x) = (2\pi)^{-1/2} \tag{2.222}$$
$$g(\boldsymbol{\eta}) = (-2\eta_2)^{1/2} \exp\left(\frac{\eta_1^2}{4\eta_2}\right). \tag{2.223}$$
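To make the identifications (2.220)-(2.223) concrete, here is a small numerical sketch (ours, with NumPy assumed and arbitrary test values) that evaluates the Gaussian both directly via (2.218) and through the decomposition h(x) g(η) exp{η^T u(x)}, confirming that the two agree.

```python
import numpy as np

def gaussian_natural_params(mu, sigma2):
    """Natural parameters (2.220): eta = (mu / sigma^2, -1 / (2 sigma^2))."""
    return np.array([mu / sigma2, -1.0 / (2.0 * sigma2)])

def gaussian_exp_family_pdf(x, eta):
    """Evaluate h(x) g(eta) exp(eta^T u(x)) with u(x) = (x, x^2), per (2.221)-(2.223)."""
    eta1, eta2 = eta
    h = (2.0 * np.pi) ** -0.5                                    # (2.222)
    g = np.sqrt(-2.0 * eta2) * np.exp(eta1 ** 2 / (4.0 * eta2))  # (2.223)
    return h * g * np.exp(eta1 * x + eta2 * x ** 2)

mu, sigma2, x = 1.5, 0.8, 0.3
direct = (2.0 * np.pi * sigma2) ** -0.5 * np.exp(-((x - mu) ** 2) / (2.0 * sigma2))  # (2.218)
print(np.isclose(gaussian_exp_family_pdf(x, gaussian_natural_params(mu, sigma2)), direct))  # True
```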
2.4.1 Maximum likelihood and sufficient statistics

Let us now consider the problem of estimating the parameter vector η in the general exponential family distribution (2.194) using the technique of maximum likelihood. Taking the gradient of both sides of (2.195) with respect to η, we have

$$\nabla g(\boldsymbol{\eta}) \int h(\mathbf{x}) \exp\left\{\boldsymbol{\eta}^{\mathrm{T}}\mathbf{u}(\mathbf{x})\right\}\mathrm{d}\mathbf{x} + g(\boldsymbol{\eta}) \int h(\mathbf{x}) \exp\left\{\boldsymbol{\eta}^{\mathrm{T}}\mathbf{u}(\mathbf{x})\right\}\mathbf{u}(\mathbf{x})\,\mathrm{d}\mathbf{x} = 0. \tag{2.224}$$

Rearranging, and making use again of (2.195), then gives

$$-\frac{1}{g(\boldsymbol{\eta})}\nabla g(\boldsymbol{\eta}) = g(\boldsymbol{\eta}) \int h(\mathbf{x}) \exp\left\{\boldsymbol{\eta}^{\mathrm{T}}\mathbf{u}(\mathbf{x})\right\}\mathbf{u}(\mathbf{x})\,\mathrm{d}\mathbf{x} = \mathbb{E}[\mathbf{u}(\mathbf{x})] \tag{2.225}$$

where we have used (2.194). We therefore obtain the result

$$-\nabla \ln g(\boldsymbol{\eta}) = \mathbb{E}[\mathbf{u}(\mathbf{x})]. \tag{2.226}$$

Note that the covariance of u(x) can be expressed in terms of the second derivatives of g(η), and similarly for higher order moments (Exercise 2.58). Thus, provided we can normalize a distribution from the exponential family, we can always find its moments by simple differentiation.
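As a simple check of (2.226), consider the Bernoulli distribution, for which (as shown earlier in Section 2.4) g(η) = σ(−η) and u(x) = x, so that −d/dη ln g(η) should equal the mean µ = σ(η). The sketch below (ours; NumPy assumed) verifies this with a finite difference at an arbitrary η.

```python
import numpy as np

def ln_g_bernoulli(eta):
    """ln g(eta) for the Bernoulli, where g(eta) = sigma(-eta) = 1 / (1 + exp(eta))."""
    return -np.log1p(np.exp(eta))

eta, eps = 0.7, 1e-6
# Central finite difference for -d/d_eta ln g(eta) ...
lhs = -(ln_g_bernoulli(eta + eps) - ln_g_bernoulli(eta - eps)) / (2.0 * eps)
# ... which by (2.226) should equal E[u(x)] = E[x] = mu = sigma(eta).
rhs = 1.0 / (1.0 + np.exp(-eta))
print(np.isclose(lhs, rhs))   # True
```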
Now consider a set of independent identically distributed data denoted by X = {x_1, . . . , x_N}, for which the likelihood function is given by

$$p(\mathbf{X}|\boldsymbol{\eta}) = \left(\prod_{n=1}^{N} h(\mathbf{x}_n)\right) g(\boldsymbol{\eta})^{N} \exp\left\{\boldsymbol{\eta}^{\mathrm{T}} \sum_{n=1}^{N} \mathbf{u}(\mathbf{x}_n)\right\}. \tag{2.227}$$

Setting the gradient of ln p(X|η) with respect to η to zero, we get the following condition to be satisfied by the maximum likelihood estimator η_ML

$$-\nabla \ln g(\boldsymbol{\eta}_{\mathrm{ML}}) = \frac{1}{N} \sum_{n=1}^{N} \mathbf{u}(\mathbf{x}_n) \tag{2.228}$$

which can in principle be solved to obtain η_ML. We see that the solution for the maximum likelihood estimator depends on the data only through Σ_n u(x_n), which is therefore called the sufficient statistic of the distribution (2.194). We do not need to store the entire data set itself but only the value of the sufficient statistic. For the Bernoulli distribution, for example, the function u(x) is given just by x and so we need only keep the sum of the data points {x_n}, whereas for the Gaussian u(x) = (x, x²)^T, and so we should keep both the sum of {x_n} and the sum of {x_n²}.
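The following sketch (ours; NumPy assumed, with synthetic data) makes this concrete for the univariate Gaussian: only the running sums of x_n and x_n² are accumulated, and solving (2.228) then yields the familiar estimators for the mean and variance.

```python
import numpy as np

rng = np.random.default_rng(0)
stream = rng.normal(loc=2.0, scale=1.5, size=10_000)   # synthetic data stream

# Accumulate only the sufficient statistics; the raw data need not be stored.
N, sum_x, sum_x2 = 0, 0.0, 0.0
for x in stream:
    N += 1
    sum_x += x
    sum_x2 += x ** 2

# Solving (2.228) for the Gaussian: equating (1/N) sum_n u(x_n) with
# E[u(x)] = (mu, mu^2 + sigma^2) gives the usual ML estimators.
mu_ml = sum_x / N
sigma2_ml = sum_x2 / N - mu_ml ** 2
print(mu_ml, sigma2_ml)   # close to 2.0 and 1.5**2 = 2.25
```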
If we consider the limit N → ∞, then the right-hand side of (2.228) becomes E[u(x)], and so by comparing with (2.226) we see that in this limit η_ML will equal the true value η.

In fact, this sufficiency property holds also for Bayesian inference, although we shall defer discussion of this until Chapter 8 when we have equipped ourselves with the tools of graphical models and can thereby gain a deeper insight into these important concepts.

2.4.2 Conjugate priors

We have already encountered the concept of a conjugate prior several times, for example in the context of the Bernoulli distribution (for which the conjugate prior is the beta distribution) or the Gaussian (where the conjugate prior for the mean is a Gaussian, and the conjugate prior for the precision is the Wishart distribution).
In general, for a given probability distribution p(x|η), we can seek a prior p(η) that is conjugate to the likelihood function, so that the posterior distribution has the same functional form as the prior. For any member of the exponential family (2.194), there exists a conjugate prior that can be written in the form

$$p(\boldsymbol{\eta}|\boldsymbol{\chi}, \nu) = f(\boldsymbol{\chi}, \nu)\, g(\boldsymbol{\eta})^{\nu} \exp\left\{\nu \boldsymbol{\eta}^{\mathrm{T}} \boldsymbol{\chi}\right\} \tag{2.229}$$

where f(χ, ν) is a normalization coefficient, and g(η) is the same function as appears in (2.194).
To see that this is indeed conjugate, let us multiply the prior (2.229) by the likelihood function (2.227) to obtain the posterior distribution, up to a normalization coefficient, in the form

$$p(\boldsymbol{\eta}|\mathbf{X}, \boldsymbol{\chi}, \nu) \propto g(\boldsymbol{\eta})^{\nu + N} \exp\left\{\boldsymbol{\eta}^{\mathrm{T}}\left(\sum_{n=1}^{N} \mathbf{u}(\mathbf{x}_n) + \nu \boldsymbol{\chi}\right)\right\}. \tag{2.230}$$

This again takes the same functional form as the prior (2.229), confirming conjugacy. Furthermore, we see that the parameter ν can be interpreted as an effective number of pseudo-observations in the prior, each of which has a value for the sufficient statistic u(x) given by χ.
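As a sketch of how (2.230) translates into an update rule (ours; NumPy assumed, with hypothetical Bernoulli data for which u(x) = x), the posterior is obtained simply by incrementing ν by N and shifting νχ by the summed sufficient statistics; ν grows by the number of observations and χ moves from its prior value toward the empirical mean of u(x), matching the pseudo-observation interpretation above.

```python
import numpy as np

def conjugate_update(nu, chi, u_values):
    """Generic exponential-family conjugate update, following (2.230):
    nu -> nu + N  and  nu * chi -> nu * chi + sum_n u(x_n)."""
    u_values = np.atleast_2d(u_values)        # shape (N, dimension of u)
    N = u_values.shape[0]
    nu_post = nu + N
    chi_post = (nu * np.asarray(chi, dtype=float) + u_values.sum(axis=0)) / nu_post
    return nu_post, chi_post

# Hypothetical coin-flip data: u(x) = x, chi is the prior mean of x,
# and nu counts the pseudo-observations carried by the prior.
rng = np.random.default_rng(1)
x = rng.binomial(1, 0.7, size=50).reshape(-1, 1)
nu_post, chi_post = conjugate_update(nu=2.0, chi=[0.5], u_values=x)
print(nu_post, chi_post)   # 52 pseudo-observations; updated statistic close to 0.7
```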
2.4.3 Noninformative priors

In some applications of probabilistic inference, we may have prior knowledge that can be conveniently expressed through the prior distribution. For example, if the prior assigns zero probability to some value of a variable, then the posterior distribution will necessarily also assign zero probability to that value, irrespective of any subsequent observations of data. In many cases, however, we may have little idea of what form the distribution should take. We may then seek a form of prior distribution, called a noninformative prior, which is intended to have as little influence on the posterior distribution as possible (Jeffreys, 1946; Box and Tiao, 1973; Bernardo and Smith, 1994).
This is sometimes referred to as ‘letting the data speak for themselves’.

If we have a distribution p(x|λ) governed by a parameter λ, we might be tempted to propose a prior distribution p(λ) = const as a suitable prior. If λ is a discrete variable with K states, this simply amounts to setting the prior probability of each state to 1/K. In the case of continuous parameters, however, there are two potential difficulties with this approach. The first is that, if the domain of λ is unbounded, this prior distribution cannot be correctly normalized because the integral over λ diverges. Such priors are called improper. In practice, improper priors can often be used provided the corresponding posterior distribution is proper, i.e., that it can be correctly normalized.
For instance, if we put a uniform prior distribution over the mean of a Gaussian, then the posterior distribution for the mean, once we have observed at least one data point, will be proper.

A second difficulty arises from the transformation behaviour of a probability density under a nonlinear change of variables, given by (1.27).
If a function h(λ) is constant, and we change variables to λ = η², then ĥ(η) = h(η²) will also be constant. However, if we choose the density p_λ(λ) to be constant, then the density of η will be given, from (1.27), by

$$p_{\eta}(\eta) = p_{\lambda}(\lambda)\left|\frac{\mathrm{d}\lambda}{\mathrm{d}\eta}\right| = p_{\lambda}(\eta^2)\, 2\eta \propto \eta \tag{2.231}$$

and so the density over η will not be constant. This issue does not arise when we use maximum likelihood, because the likelihood function p(x|λ) is a simple function of λ and so we are free to use any convenient parameterization. If, however, we are to choose a prior distribution that is constant, we must take care to use an appropriate representation for the parameters.
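A quick Monte Carlo sanity check of (2.231) (ours, with arbitrary sample sizes): drawing λ from a flat density on (0, 1) and transforming to η = λ^{1/2} gives an empirical density over η that grows linearly in η rather than staying constant.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = rng.uniform(0.0, 1.0, size=200_000)   # flat density over lambda on (0, 1)
eta = np.sqrt(lam)                          # change of variables lambda = eta^2

# Empirical density of eta on five equal bins: grows linearly, matching (2.231).
hist, _ = np.histogram(eta, bins=5, range=(0.0, 1.0), density=True)
print(hist.round(2))   # roughly [0.2, 0.6, 1.0, 1.4, 1.8], i.e. proportional to eta
```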
Here we consider two simple examples of noninformative priors (Berger, 1985). First of all, if a density takes the form

$$p(x|\mu) = f(x - \mu) \tag{2.232}$$

then the parameter µ is known as a location parameter. This family of densities exhibits translation invariance, because if we shift x by a constant to give x̂ = x + c, then

$$p(\widehat{x}|\widehat{\mu}) = f(\widehat{x} - \widehat{\mu}) \tag{2.233}$$

where we have defined µ̂ = µ + c. Thus the density takes the same form in the new variable as in the original one, and so the density is independent of the choice of origin. We would like to choose a prior distribution that reflects this translation invariance property, and so we choose a prior that assigns equal probability mass to an interval A ⩽ µ ⩽ B as to the shifted interval A − c ⩽ µ ⩽ B − c. This implies

$$\int_{A}^{B} p(\mu)\,\mathrm{d}\mu = \int_{A-c}^{B-c} p(\mu)\,\mathrm{d}\mu = \int_{A}^{B} p(\mu - c)\,\mathrm{d}\mu \tag{2.234}$$

and because this must hold for all choices of A and B, we have

$$p(\mu - c) = p(\mu) \tag{2.235}$$

which implies that p(µ) is constant. An example of a location parameter would be the mean µ of a Gaussian distribution. As we have seen, the conjugate prior distribution for µ in this case is a Gaussian p(µ|µ₀, σ₀²) = N(µ|µ₀, σ₀²), and we obtain a noninformative prior by taking the limit σ₀² → ∞. Indeed, from (2.141) and (2.142) we see that this gives a posterior distribution over µ in which the contributions from the prior vanish.
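The vanishing influence of the prior can be seen numerically using the standard conjugate update for the mean of a Gaussian with known variance (the update the text cites as (2.141) and (2.142)); the sketch below (ours, with synthetic data and a deliberately poor prior mean) shows the posterior mean approaching the sample mean and the posterior variance approaching σ²/N as σ₀² → ∞.

```python
import numpy as np

def gaussian_mean_posterior(x, sigma2, mu0, sigma02):
    """Posterior over the Gaussian mean with known variance sigma2 and prior
    N(mu | mu0, sigma02): the standard conjugate update (cf. (2.141), (2.142))."""
    N = x.size
    sigma2_N = 1.0 / (1.0 / sigma02 + N / sigma2)
    mu_N = sigma2_N * (mu0 / sigma02 + x.sum() / sigma2)
    return mu_N, sigma2_N

rng = np.random.default_rng(3)
x = rng.normal(loc=4.0, scale=1.0, size=20)

# As sigma0^2 grows, the prior's contribution vanishes: the posterior mean
# tends to the sample mean and the posterior variance to sigma^2 / N.
for sigma02 in (1.0, 1e3, 1e9):
    print(sigma02, gaussian_mean_posterior(x, sigma2=1.0, mu0=-10.0, sigma02=sigma02))
print(x.mean(), 1.0 / x.size)   # the limiting values
```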
As a second example, consider a density of the form

$$p(x|\sigma) = \frac{1}{\sigma} f\left(\frac{x}{\sigma}\right) \tag{2.236}$$

where σ > 0. Note that this will be a normalized density provided f(x) is correctly normalized (Exercise 2.59). The parameter σ is known as a scale parameter, and the density exhibits scale invariance, because if we scale x by a constant to give x̂ = cx, then

$$p(\widehat{x}|\widehat{\sigma}) = \frac{1}{\widehat{\sigma}} f\left(\frac{\widehat{x}}{\widehat{\sigma}}\right) \tag{2.237}$$

where we have defined σ̂ = cσ. This transformation corresponds to a change of scale, for example from meters to kilometers if x is a length, and we would like to choose a prior distribution that reflects this scale invariance.
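As a final numerical aside (ours; f is taken to be a standard Gaussian purely for illustration), rescaling samples drawn under σ is statistically equivalent to sampling afresh under σ̂ = cσ, which is the scale invariance expressed by (2.237).

```python
import numpy as np

rng = np.random.default_rng(4)
sigma, c = 2.0, 3.0
qs = [0.1, 0.25, 0.5, 0.75, 0.9]

# Samples from p(x|sigma) = (1/sigma) f(x/sigma) with f a standard Gaussian.
x = sigma * rng.standard_normal(200_000)

# Rescaling the data by c, versus sampling directly with sigma_hat = c * sigma:
print(np.round(np.quantile(c * x, qs), 2))
print(np.round(np.quantile((c * sigma) * rng.standard_normal(200_000), qs), 2))
# The two rows of quantiles agree up to Monte Carlo noise.
```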