Bishop C.M. Pattern Recognition and Machine Learning (2006) (811375), page 51


Thus $y$ and $\eta$ must be related, and we denote this relation through $\eta = \psi(y)$.

Following Nelder and Wedderburn (1972), we define a generalized linear model to be one for which $y$ is a nonlinear function of a linear combination of the input (or feature) variables so that

$$y = f(\mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}) \tag{4.120}$$

where $f(\cdot)$ is known as the activation function in the machine learning literature, and $f^{-1}(\cdot)$ is known as the link function in statistics.

Now consider the log likelihood function for this model, which, as a function of $\eta$, is given by

$$\ln p(\mathbf{t}|\eta, s) = \sum_{n=1}^{N} \ln p(t_n|\eta, s) = \sum_{n=1}^{N} \left\{ \ln g(\eta_n) + \frac{\eta_n t_n}{s} \right\} + \text{const} \tag{4.121}$$

where we are assuming that all observations share a common scale parameter (which corresponds to the noise variance for a Gaussian distribution, for instance) and so $s$ is independent of $n$. The derivative of the log likelihood with respect to the model parameters $\mathbf{w}$ is then given by

$$\nabla_{\mathbf{w}} \ln p(\mathbf{t}|\eta, s) = \sum_{n=1}^{N} \left\{ \frac{d}{d\eta_n} \ln g(\eta_n) + \frac{t_n}{s} \right\} \frac{d\eta_n}{dy_n} \frac{dy_n}{da_n} \nabla a_n = \sum_{n=1}^{N} \frac{1}{s} \{t_n - y_n\}\, \psi'(y_n) f'(a_n) \boldsymbol{\phi}_n \tag{4.122}$$

where $a_n = \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}_n$, and we have used $y_n = f(a_n)$ together with the result (4.119) for $\mathbb{E}[t|\eta]$.
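The gradient (4.122) can be checked numerically. The sketch below is only an illustration under stated assumptions: it takes the logistic model (Bernoulli targets, $s = 1$, $f$ the logistic sigmoid, so that $\psi(y) = \ln\{y/(1-y)\}$), and the data values and function names are hypothetical, not taken from the text. It compares the analytic gradient with central finite differences of the log likelihood.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def log_likelihood(w, Phi, t):
    """Bernoulli log likelihood with y_n = sigmoid(w^T phi_n); here s = 1."""
    total = 0.0
    for phi, tn in zip(Phi, t):
        y = sigmoid(sum(wi * pi for wi, pi in zip(w, phi)))
        total += tn * math.log(y) + (1 - tn) * math.log(1 - y)
    return total

def gradient_4_122(w, Phi, t, s=1.0):
    """Sum over n of (1/s) (t_n - y_n) psi'(y_n) f'(a_n) phi_n."""
    grad = [0.0] * len(w)
    for phi, tn in zip(Phi, t):
        a = sum(wi * pi for wi, pi in zip(w, phi))
        y = sigmoid(a)
        f_prime = y * (1 - y)            # f'(a) for the logistic sigmoid
        psi_prime = 1.0 / (y * (1 - y))  # psi(y) = ln(y / (1 - y))
        coeff = (tn - y) * psi_prime * f_prime / s
        for j, pj in enumerate(phi):
            grad[j] += coeff * pj
    return grad

def numerical_gradient(w, Phi, t, eps=1e-6):
    """Central finite differences of the log likelihood."""
    grad = []
    for j in range(len(w)):
        wp = list(w)
        wm = list(w)
        wp[j] += eps
        wm[j] -= eps
        diff = log_likelihood(wp, Phi, t) - log_likelihood(wm, Phi, t)
        grad.append(diff / (2 * eps))
    return grad

# Hypothetical toy data: four feature vectors (with a bias term) and targets.
Phi = [[1.0, 0.5], [1.0, -1.2], [1.0, 2.0], [1.0, 0.1]]
t = [1, 0, 1, 0]
w = [0.3, -0.7]

analytic = gradient_4_122(w, Phi, t)
numeric = numerical_gradient(w, Phi, t)
print(analytic, numeric)
```

The two printed vectors should agree to within the finite-difference error. For a different activation function, `f_prime` and `psi_prime` would need to be replaced accordingly.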

We now see that there is a considerable simplification if we choose a particular form for the link function $f^{-1}(y)$ given by

$$f^{-1}(y) = \psi(y) \tag{4.123}$$

which gives $f(\psi(y)) = y$ and hence $f'(\psi)\psi'(y) = 1$. Also, because $a = f^{-1}(y)$, we have $a = \psi$ and hence $f'(a)\psi'(y) = 1$. In this case, the gradient of the error function, defined as the negative log likelihood, reduces to

$$\nabla E(\mathbf{w}) = \frac{1}{s} \sum_{n=1}^{N} \{y_n - t_n\} \boldsymbol{\phi}_n. \tag{4.124}$$

For the Gaussian $s = \beta^{-1}$, whereas for the logistic model $s = 1$.

4.4. The Laplace Approximation

In Section 4.5 we shall discuss the Bayesian treatment of logistic regression. As we shall see, this is more complex than the Bayesian treatment of linear regression models, discussed in Sections 3.3 and 3.5. In particular, we cannot integrate exactly over the parameter vector $\mathbf{w}$ since the posterior distribution is no longer Gaussian. It is therefore necessary to introduce some form of approximation. Later in the book we shall consider a range of techniques based on analytical approximations (Chapter 10) and numerical sampling (Chapter 11).

Here we introduce a simple, but widely used, framework called the Laplace approximation, which aims to find a Gaussian approximation to a probability density defined over a set of continuous variables. Consider first the case of a single continuous variable $z$, and suppose the distribution $p(z)$ is defined by

$$p(z) = \frac{1}{Z} f(z) \tag{4.125}$$

where $Z = \int f(z)\,dz$ is the normalization coefficient.

We shall suppose that the value of $Z$ is unknown. In the Laplace method the goal is to find a Gaussian approximation $q(z)$ which is centred on a mode of the distribution $p(z)$. The first step is to find a mode of $p(z)$, in other words a point $z_0$ such that $p'(z_0) = 0$, or equivalently

$$\left. \frac{df(z)}{dz} \right|_{z=z_0} = 0. \tag{4.126}$$

A Gaussian distribution has the property that its logarithm is a quadratic function of the variables.

We therefore consider a Taylor expansion of $\ln f(z)$ centred on the mode $z_0$ so that

$$\ln f(z) \simeq \ln f(z_0) - \frac{1}{2} A (z - z_0)^2 \tag{4.127}$$

where

$$A = - \left. \frac{d^2}{dz^2} \ln f(z) \right|_{z=z_0}. \tag{4.128}$$

Note that the first-order term in the Taylor expansion does not appear since $z_0$ is a local maximum of the distribution. Taking the exponential we obtain

$$f(z) \simeq f(z_0) \exp\left\{ -\frac{A}{2} (z - z_0)^2 \right\}. \tag{4.129}$$

We can then obtain a normalized distribution $q(z)$ by making use of the standard result for the normalization of a Gaussian, so that

$$q(z) = \left( \frac{A}{2\pi} \right)^{1/2} \exp\left\{ -\frac{A}{2} (z - z_0)^2 \right\}. \tag{4.130}$$

The Laplace approximation is illustrated in Figure 4.14.
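The steps (4.126) to (4.130) can be sketched in a few lines of code. The example below is a minimal sketch, assuming the unnormalized density $f(z) = \exp(-z^2/2)\,\sigma(20z + 4)$ used for Figure 4.14. The use of Newton's method to locate the mode, the starting point $z = 0$, and all function names are choices made for this illustration and are not prescribed by the text.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def grad_log_f(z):
    """d/dz ln f(z) for f(z) = exp(-z^2 / 2) * sigmoid(20 z + 4)."""
    return -z + 20.0 * (1.0 - sigmoid(20.0 * z + 4.0))

def neg_second_deriv_log_f(z):
    """-d^2/dz^2 ln f(z); evaluated at the mode this is A in (4.128)."""
    s = sigmoid(20.0 * z + 4.0)
    return 1.0 + 400.0 * s * (1.0 - s)

def find_mode(z=0.0, tol=1e-12, max_iter=100):
    """Newton iterations on the stationarity condition (4.126)."""
    for _ in range(max_iter):
        step = grad_log_f(z) / neg_second_deriv_log_f(z)
        z += step
        if abs(step) < tol:
            break
    return z

z0 = find_mode()
A = neg_second_deriv_log_f(z0)

def q(z):
    """Gaussian approximation (4.130) centred on the mode z0."""
    return math.sqrt(A / (2.0 * math.pi)) * math.exp(-0.5 * A * (z - z0) ** 2)

print("mode z0 =", z0, "precision A =", A)
```

Because $\ln f(z)$ is concave for this particular density, the precision `A` is positive wherever it is evaluated; for a general $f$ the sign of `A` at the stationary point would have to be checked.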

Note that the Gaussian approximation will only be well defined if its precision $A > 0$, in other words the stationary point $z_0$ must be a local maximum, so that the second derivative of $f(z)$ at the point $z_0$ is negative.

Figure 4.14: Illustration of the Laplace approximation applied to the distribution $p(z) \propto \exp(-z^2/2)\,\sigma(20z + 4)$ where $\sigma(z)$ is the logistic sigmoid function defined by $\sigma(z) = (1 + e^{-z})^{-1}$. The left plot shows the normalized distribution $p(z)$ in yellow, together with the Laplace approximation centred on the mode $z_0$ of $p(z)$ in red. The right plot shows the negative logarithms of the corresponding curves.

We can extend the Laplace method to approximate a distribution $p(\mathbf{z}) = f(\mathbf{z})/Z$ defined over an $M$-dimensional space $\mathbf{z}$.

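At a stationary point $\mathbf{z}_0$ the gradient $\nabla f(\mathbf{z})$ will vanish. Expanding around this stationary point we have

$$\ln f(\mathbf{z}) \simeq \ln f(\mathbf{z}_0) - \frac{1}{2} (\mathbf{z} - \mathbf{z}_0)^{\mathrm{T}} \mathbf{A} (\mathbf{z} - \mathbf{z}_0) \tag{4.131}$$

where the $M \times M$ Hessian matrix $\mathbf{A}$ is defined by

$$\mathbf{A} = - \left. \nabla\nabla \ln f(\mathbf{z}) \right|_{\mathbf{z}=\mathbf{z}_0} \tag{4.132}$$

and $\nabla$ is the gradient operator. Taking the exponential of both sides we obtain

$$f(\mathbf{z}) \simeq f(\mathbf{z}_0) \exp\left\{ -\frac{1}{2} (\mathbf{z} - \mathbf{z}_0)^{\mathrm{T}} \mathbf{A} (\mathbf{z} - \mathbf{z}_0) \right\}. \tag{4.133}$$

The distribution $q(\mathbf{z})$ is proportional to $f(\mathbf{z})$ and the appropriate normalization coefficient can be found by inspection, using the standard result (2.43) for a normalized multivariate Gaussian, giving

$$q(\mathbf{z}) = \frac{|\mathbf{A}|^{1/2}}{(2\pi)^{M/2}} \exp\left\{ -\frac{1}{2} (\mathbf{z} - \mathbf{z}_0)^{\mathrm{T}} \mathbf{A} (\mathbf{z} - \mathbf{z}_0) \right\} = \mathcal{N}(\mathbf{z}|\mathbf{z}_0, \mathbf{A}^{-1}) \tag{4.134}$$

where $|\mathbf{A}|$ denotes the determinant of $\mathbf{A}$. This Gaussian distribution will be well defined provided its precision matrix, given by $\mathbf{A}$, is positive definite, which implies that the stationary point $\mathbf{z}_0$ must be a local maximum, not a minimum or a saddle point.

In order to apply the Laplace approximation we first need to find the mode $\mathbf{z}_0$, and then evaluate the Hessian matrix at that mode.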
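The two steps just described (find the mode, then evaluate the Hessian there) can be illustrated for $M = 2$. This is a minimal sketch under assumptions that are not in the text: the density $f(\mathbf{z}) = \exp(-\mathbf{z}^{\mathrm{T}}\mathbf{z}/2)\,\sigma(\mathbf{w}^{\mathrm{T}}\mathbf{z} + b)$ with hypothetical values `W` and `B` is chosen as a two-dimensional analogue of the Figure 4.14 example, the mode is found by Newton's method, and the $2 \times 2$ linear algebra is written out by hand to keep the example dependency-free.

```python
import math

W = (2.0, 1.0)  # hypothetical weight vector, chosen for illustration
B = 0.5         # hypothetical offset

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gradient(z):
    """Gradient of ln f(z) = -z^T z / 2 + ln sigmoid(w^T z + b)."""
    s = sigmoid(W[0] * z[0] + W[1] * z[1] + B)
    return [-z[0] + (1.0 - s) * W[0], -z[1] + (1.0 - s) * W[1]]

def precision(z):
    """A = -(Hessian of ln f) = I + s (1 - s) w w^T, as in (4.132)."""
    s = sigmoid(W[0] * z[0] + W[1] * z[1] + B)
    c = s * (1.0 - s)
    return [[1.0 + c * W[0] * W[0], c * W[0] * W[1]],
            [c * W[1] * W[0], 1.0 + c * W[1] * W[1]]]

def det2(A):
    return A[0][0] * A[1][1] - A[0][1] * A[1][0]

def solve2(A, b):
    """Solve the 2x2 system A x = b by Cramer's rule."""
    d = det2(A)
    return [(A[1][1] * b[0] - A[0][1] * b[1]) / d,
            (A[0][0] * b[1] - A[1][0] * b[0]) / d]

def find_mode(z=(0.0, 0.0), tol=1e-12, max_iter=100):
    """Newton iterations: z <- z + A(z)^{-1} grad ln f(z)."""
    z = list(z)
    for _ in range(max_iter):
        step = solve2(precision(z), gradient(z))
        z = [z[0] + step[0], z[1] + step[1]]
        if max(abs(step[0]), abs(step[1])) < tol:
            break
    return z

def log_q(z, z0, A):
    """Log of the Gaussian approximation (4.134) with M = 2."""
    d0, d1 = z[0] - z0[0], z[1] - z0[1]
    quad = d0 * (A[0][0] * d0 + A[0][1] * d1) + d1 * (A[1][0] * d0 + A[1][1] * d1)
    return 0.5 * math.log(det2(A)) - math.log(2.0 * math.pi) - 0.5 * quad

z0 = find_mode()
A = precision(z0)
det_A = det2(A)
print("mode:", z0, "precision matrix:", A)
```

For this density `A` is the identity plus a positive multiple of $\mathbf{w}\mathbf{w}^{\mathrm{T}}$, so it is positive definite and the resulting Gaussian is well defined. In higher dimensions a linear algebra library would normally replace the hand-written `solve2` and `det2`.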

In practice a mode will typically be found by running some form of numerical optimization algorithm (Bishop and Nabney, 2008). Many of the distributions encountered in practice will be multimodal and so there will be different Laplace approximations according to which mode is being considered. Note that the normalization constant $Z$ of the true distribution does not need to be known in order to apply the Laplace method. As a result of the central limit theorem, the posterior distribution for a model is expected to become increasingly better approximated by a Gaussian as the number of observed data points is increased, and so we would expect the Laplace approximation to be most useful in situations where the number of data points is relatively large.

One major weakness of the Laplace approximation is that, since it is based on a Gaussian distribution, it is only directly applicable to real variables.

In other cases it may be possible to apply the Laplace approximation to a transformation of the variable. For instance if $0 \leqslant \tau < \infty$ then we can consider a Laplace approximation of $\ln \tau$. The most serious limitation of the Laplace framework, however, is that it is based purely on the aspects of the true distribution at a specific value of the variable, and so can fail to capture important global properties.

In Chapter 10 we shall consider alternative approaches which adopt a more global perspective.

4.4.1 Model comparison and BIC

As well as approximating the distribution $p(\mathbf{z})$ we can also obtain an approximation to the normalization constant $Z$. Using the approximation (4.133) we have

$$\begin{aligned} Z &= \int f(\mathbf{z})\, d\mathbf{z} \\ &\simeq f(\mathbf{z}_0) \int \exp\left\{ -\frac{1}{2} (\mathbf{z} - \mathbf{z}_0)^{\mathrm{T}} \mathbf{A} (\mathbf{z} - \mathbf{z}_0) \right\} d\mathbf{z} \\ &= f(\mathbf{z}_0) \frac{(2\pi)^{M/2}}{|\mathbf{A}|^{1/2}} \end{aligned} \tag{4.135}$$

where we have noted that the integrand is Gaussian and made use of the standard result (2.43) for a normalized Gaussian distribution.
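As a rough check of (4.135) in one dimension ($M = 1$), the Laplace estimate of $Z$ can be compared with direct numerical integration. The sketch below assumes the density of Figure 4.14, $f(z) = \exp(-z^2/2)\,\sigma(20z + 4)$. The use of Simpson's rule on the interval $[-10, 10]$, the Newton mode search, and all function names are assumptions made for this illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def f(z):
    """Unnormalized density from Figure 4.14."""
    return math.exp(-0.5 * z * z) * sigmoid(20.0 * z + 4.0)

def grad_log_f(z):
    return -z + 20.0 * (1.0 - sigmoid(20.0 * z + 4.0))

def neg_second_deriv_log_f(z):
    s = sigmoid(20.0 * z + 4.0)
    return 1.0 + 400.0 * s * (1.0 - s)

def find_mode(z=0.0, tol=1e-12, max_iter=100):
    """Newton iterations for the stationary point of ln f."""
    for _ in range(max_iter):
        step = grad_log_f(z) / neg_second_deriv_log_f(z)
        z += step
        if abs(step) < tol:
            break
    return z

def simpson(func, lo, hi, n=20000):
    """Composite Simpson's rule with an even number n of subintervals."""
    h = (hi - lo) / n
    total = func(lo) + func(hi)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * func(lo + i * h)
    return total * h / 3.0

z0 = find_mode()
A = neg_second_deriv_log_f(z0)

# Laplace estimate (4.135) with M = 1: Z ~ f(z0) * sqrt(2 pi / A).
Z_laplace = f(z0) * math.sqrt(2.0 * math.pi / A)
Z_numeric = simpson(f, -10.0, 10.0)
print("Laplace:", Z_laplace, "numerical:", Z_numeric)
```

The two values are not expected to coincide, because this density is noticeably skewed and the Laplace estimate uses only local information at the mode.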

We can use the result (4.135) to obtain an approximation to the model evidence which, as discussed in Section 3.4, plays a central role in Bayesian model comparison.

Consider a data set $\mathcal{D}$ and a set of models $\{\mathcal{M}_i\}$ having parameters $\{\boldsymbol{\theta}_i\}$. For each model we define a likelihood function $p(\mathcal{D}|\boldsymbol{\theta}_i, \mathcal{M}_i)$. If we introduce a prior $p(\boldsymbol{\theta}_i|\mathcal{M}_i)$ over the parameters, then we are interested in computing the model evidence $p(\mathcal{D}|\mathcal{M}_i)$ for the various models.

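From now on we omit the conditioning on $\mathcal{M}_i$ to keep the notation uncluttered. From Bayes' theorem the model evidence is given by

$$p(\mathcal{D}) = \int p(\mathcal{D}|\boldsymbol{\theta})\, p(\boldsymbol{\theta})\, d\boldsymbol{\theta}. \tag{4.136}$$

Identifying $f(\boldsymbol{\theta}) = p(\mathcal{D}|\boldsymbol{\theta})p(\boldsymbol{\theta})$ and $Z = p(\mathcal{D})$, and applying the result (4.135), we obtain (Exercise 4.22)

$$\ln p(\mathcal{D}) \simeq \ln p(\mathcal{D}|\boldsymbol{\theta}_{\mathrm{MAP}}) + \underbrace{\ln p(\boldsymbol{\theta}_{\mathrm{MAP}}) + \frac{M}{2} \ln(2\pi) - \frac{1}{2} \ln |\mathbf{A}|}_{\text{Occam factor}} \tag{4.137}$$

where $\boldsymbol{\theta}_{\mathrm{MAP}}$ is the value of $\boldsymbol{\theta}$ at the mode of the posterior distribution, and $\mathbf{A}$ is the Hessian matrix of second derivatives of the negative log posterior

$$\mathbf{A} = -\nabla\nabla \ln p(\mathcal{D}|\boldsymbol{\theta}_{\mathrm{MAP}})\, p(\boldsymbol{\theta}_{\mathrm{MAP}}) = -\nabla\nabla \ln p(\boldsymbol{\theta}_{\mathrm{MAP}}|\mathcal{D}). \tag{4.138}$$

The first term on the right hand side of (4.137) represents the log likelihood evaluated using the optimized parameters, while the remaining three terms comprise the 'Occam factor' which penalizes model complexity.

If we assume that the Gaussian prior distribution over parameters is broad, and that the Hessian has full rank, then we can approximate (4.137) very roughly using (Exercise 4.23)

$$\ln p(\mathcal{D}) \simeq \ln p(\mathcal{D}|\boldsymbol{\theta}_{\mathrm{MAP}}) - \frac{1}{2} M \ln N \tag{4.139}$$

where $N$ is the number of data points, $M$ is the number of parameters in $\boldsymbol{\theta}$ and we have omitted additive constants.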
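To make (4.137) and (4.139) concrete, the following sketch evaluates both for a deliberately simple model. The model is an assumption chosen for illustration and does not come from the text: observations $t_n \sim \mathcal{N}(\theta, 1)$ with prior $\theta \sim \mathcal{N}(0, \alpha^{-1})$, so that $M = 1$, the posterior mode is $\theta_{\mathrm{MAP}} = \sum_n t_n / (N + \alpha)$ and the Hessian is $A = N + \alpha$. The data values, the value of `alpha`, and the function names are hypothetical.

```python
import math

def log_likelihood(theta, data):
    """ln p(D | theta) for unit-variance Gaussian observations."""
    n = len(data)
    sq = sum((t - theta) ** 2 for t in data)
    return -0.5 * n * math.log(2.0 * math.pi) - 0.5 * sq

def log_prior(theta, alpha):
    """ln p(theta) for a zero-mean Gaussian prior with precision alpha."""
    return 0.5 * math.log(alpha / (2.0 * math.pi)) - 0.5 * alpha * theta ** 2

def laplace_log_evidence(data, alpha):
    """Right hand side of (4.137) with M = 1 and A = N + alpha."""
    n = len(data)
    M = 1
    theta_map = sum(data) / (n + alpha)
    A = n + alpha
    occam = (log_prior(theta_map, alpha)
             + 0.5 * M * math.log(2.0 * math.pi)
             - 0.5 * math.log(A))
    return log_likelihood(theta_map, data) + occam

def bic_log_evidence(data, alpha):
    """Right hand side of (4.139): ln p(D | theta_MAP) - (M / 2) ln N."""
    n = len(data)
    M = 1
    theta_map = sum(data) / (n + alpha)
    return log_likelihood(theta_map, data) - 0.5 * M * math.log(n)

# Hypothetical data and a fairly broad prior (small precision alpha).
data = [0.8, 1.3, 0.4, 1.1, 0.9, 1.6, 0.7, 1.2]
alpha = 0.01

theta_map = sum(data) / (len(data) + alpha)
laplace = laplace_log_evidence(data, alpha)
bic = bic_log_evidence(data, alpha)
print("Laplace (4.137):", laplace, "BIC (4.139):", bic)
```

For this particular model the log joint is exactly quadratic in $\theta$, so (4.137) reproduces the exact log evidence, whereas (4.139) differs from it by the additive terms that were dropped. With a non-Gaussian posterior neither expression would be exact.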

The approximation (4.139) is known as the Bayesian Information Criterion (BIC) or the Schwarz criterion (Schwarz, 1978). Note that, compared to AIC given by (1.73), this penalizes model complexity more heavily.

Complexity measures such as AIC and BIC have the virtue of being easy to evaluate, but can also give misleading results. In particular, the assumption that the Hessian matrix has full rank is often not valid since many of the parameters are not 'well-determined' (Section 3.5.3). We can use the result (4.137) to obtain a more accurate estimate of the model evidence starting from the Laplace approximation, as we illustrate in the context of neural networks in Section 5.7.

4.5. Bayesian Logistic Regression

We now turn to a Bayesian treatment of logistic regression. Exact Bayesian inference for logistic regression is intractable.
