Bishop, C.M., Pattern Recognition and Machine Learning (2006), page 66 of file
From (5.165), this is given by

$$
\mathbf{A} = -\nabla\nabla \ln p(\mathbf{w}\mid\mathcal{D},\alpha,\beta) = \alpha\mathbf{I} + \beta\mathbf{H} \tag{5.166}
$$

where $\mathbf{H}$ is the Hessian matrix comprising the second derivatives of the sum-of-squares error function with respect to the components of $\mathbf{w}$. Algorithms for computing and approximating the Hessian were discussed in Section 5.4. The corresponding Gaussian approximation to the posterior is then given from (4.134) by

$$
q(\mathbf{w}\mid\mathcal{D}) = \mathcal{N}(\mathbf{w}\mid\mathbf{w}_{\mathrm{MAP}}, \mathbf{A}^{-1}). \tag{5.167}
$$

Similarly, the predictive distribution is obtained by marginalizing with respect to this posterior distribution

$$
p(t\mid\mathbf{x},\mathcal{D}) = \int p(t\mid\mathbf{x},\mathbf{w})\, q(\mathbf{w}\mid\mathcal{D})\, \mathrm{d}\mathbf{w}. \tag{5.168}
$$

However, even with the Gaussian approximation to the posterior, this integration is still analytically intractable due to the nonlinearity of the network function $y(\mathbf{x},\mathbf{w})$ as a function of $\mathbf{w}$. To make progress, we now assume that the posterior distribution has small variance compared with the characteristic scales of $\mathbf{w}$ over which $y(\mathbf{x},\mathbf{w})$ is varying.
This allows us to make a Taylor series expansion of the network function around $\mathbf{w}_{\mathrm{MAP}}$ and retain only the linear terms

$$
y(\mathbf{x},\mathbf{w}) \simeq y(\mathbf{x},\mathbf{w}_{\mathrm{MAP}}) + \mathbf{g}^{\mathrm{T}}(\mathbf{w}-\mathbf{w}_{\mathrm{MAP}}) \tag{5.169}
$$

where we have defined

$$
\mathbf{g} = \left.\nabla_{\mathbf{w}}\, y(\mathbf{x},\mathbf{w})\right|_{\mathbf{w}=\mathbf{w}_{\mathrm{MAP}}}. \tag{5.170}
$$

With this approximation, we now have a linear-Gaussian model with a Gaussian distribution for $p(\mathbf{w})$ and a Gaussian for $p(t\mid\mathbf{w})$ whose mean is a linear function of $\mathbf{w}$ of the form

$$
p(t\mid\mathbf{x},\mathbf{w},\beta) \simeq \mathcal{N}\bigl(t\mid y(\mathbf{x},\mathbf{w}_{\mathrm{MAP}}) + \mathbf{g}^{\mathrm{T}}(\mathbf{w}-\mathbf{w}_{\mathrm{MAP}}),\, \beta^{-1}\bigr). \tag{5.171}
$$

We can therefore make use of the general result (2.115) for the marginal $p(t)$ (Exercise 5.38) to give

$$
p(t\mid\mathbf{x},\mathcal{D},\alpha,\beta) = \mathcal{N}\bigl(t\mid y(\mathbf{x},\mathbf{w}_{\mathrm{MAP}}),\, \sigma^2(\mathbf{x})\bigr) \tag{5.172}
$$

where the input-dependent variance is given by

$$
\sigma^2(\mathbf{x}) = \beta^{-1} + \mathbf{g}^{\mathrm{T}}\mathbf{A}^{-1}\mathbf{g}. \tag{5.173}
$$

We see that the predictive distribution $p(t\mid\mathbf{x},\mathcal{D})$ is a Gaussian whose mean is given by the network function $y(\mathbf{x},\mathbf{w}_{\mathrm{MAP}})$ with the parameters set to their MAP value.
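The predictive variance (5.173) is straightforward to compute once the Hessian is available. The following is a minimal sketch in Python/NumPy, using random stand-ins for the Hessian $\mathbf{H}$ and the gradient $\mathbf{g}$ (in practice $\mathbf{H}$ would come from the methods of Section 5.4 and $\mathbf{g}$ from backpropagation); all sizes and values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes and hyperparameters (stand-ins, not from the text).
W = 5                 # number of network weights
alpha, beta = 2.0, 10.0

# Stand-in for the Hessian H of the sum-of-squares error at w_MAP.
M = rng.normal(size=(W, W))
H = M @ M.T           # symmetric positive semi-definite

# Stand-in for g = grad_w y(x, w) at w_MAP, eq. (5.170).
g = rng.normal(size=W)

# A = alpha*I + beta*H, eq. (5.166).
A = alpha * np.eye(W) + beta * H

# Input-dependent predictive variance, eq. (5.173):
# sigma^2(x) = 1/beta + g^T A^{-1} g.
sigma2 = 1.0 / beta + g @ np.linalg.solve(A, g)
print(sigma2)
```

Note that because $\mathbf{A}$ is positive definite, the second term is non-negative, so the predictive variance can never fall below the intrinsic noise level $\beta^{-1}$.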
The variance has two terms, the first of which arises from the intrinsic noise on the target variable, whereas the second is an $\mathbf{x}$-dependent term that expresses the uncertainty in the interpolant due to the uncertainty in the model parameters $\mathbf{w}$. This should be compared with the corresponding predictive distribution for the linear regression model, given by (3.58) and (3.59).

5.7.2 Hyperparameter optimization

So far, we have assumed that the hyperparameters $\alpha$ and $\beta$ are fixed and known. We can make use of the evidence framework, discussed in Section 3.5, together with the Gaussian approximation to the posterior obtained using the Laplace approximation, to obtain a practical procedure for choosing the values of such hyperparameters.

The marginal likelihood, or evidence, for the hyperparameters is obtained by integrating over the network weights

$$
p(\mathcal{D}\mid\alpha,\beta) = \int p(\mathcal{D}\mid\mathbf{w},\beta)\, p(\mathbf{w}\mid\alpha)\, \mathrm{d}\mathbf{w}. \tag{5.174}
$$

This is easily evaluated by making use of the Laplace approximation result (4.135) (Exercise 5.39). Taking logarithms then gives

$$
\ln p(\mathcal{D}\mid\alpha,\beta) \simeq -E(\mathbf{w}_{\mathrm{MAP}}) - \frac{1}{2}\ln|\mathbf{A}| + \frac{W}{2}\ln\alpha + \frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi) \tag{5.175}
$$

where $W$ is the total number of parameters in $\mathbf{w}$, and the regularized error function is defined by

$$
E(\mathbf{w}_{\mathrm{MAP}}) = \frac{\beta}{2}\sum_{n=1}^{N}\{y(\mathbf{x}_n,\mathbf{w}_{\mathrm{MAP}}) - t_n\}^2 + \frac{\alpha}{2}\mathbf{w}_{\mathrm{MAP}}^{\mathrm{T}}\mathbf{w}_{\mathrm{MAP}}. \tag{5.176}
$$

We see that this takes the same form as the corresponding result (3.86) for the linear regression model.

In the evidence framework, we make point estimates for $\alpha$ and $\beta$ by maximizing $\ln p(\mathcal{D}\mid\alpha,\beta)$.
Consider first the maximization with respect to $\alpha$, which can be done by analogy with the linear regression case discussed in Section 3.5.2. We first define the eigenvalue equation

$$
\beta\mathbf{H}\mathbf{u}_i = \lambda_i\mathbf{u}_i \tag{5.177}
$$

where $\mathbf{H}$ is the Hessian matrix comprising the second derivatives of the sum-of-squares error function, evaluated at $\mathbf{w}=\mathbf{w}_{\mathrm{MAP}}$. By analogy with (3.92), we obtain

$$
\alpha = \frac{\gamma}{\mathbf{w}_{\mathrm{MAP}}^{\mathrm{T}}\mathbf{w}_{\mathrm{MAP}}} \tag{5.178}
$$

where $\gamma$ represents the effective number of parameters (Section 3.5.3) and is defined by

$$
\gamma = \sum_{i=1}^{W}\frac{\lambda_i}{\alpha + \lambda_i}. \tag{5.179}
$$

Note that this result was exact for the linear regression case.
For the nonlinear neural network, however, it ignores the fact that changes in $\alpha$ will cause changes in the Hessian $\mathbf{H}$, which in turn will change the eigenvalues. We have therefore implicitly ignored terms involving the derivatives of $\lambda_i$ with respect to $\alpha$.

Similarly, from (3.95) we see that maximizing the evidence with respect to $\beta$ gives the re-estimation formula

$$
\frac{1}{\beta} = \frac{1}{N-\gamma}\sum_{n=1}^{N}\{y(\mathbf{x}_n,\mathbf{w}_{\mathrm{MAP}}) - t_n\}^2. \tag{5.180}
$$

As with the linear model, we need to alternate between re-estimation of the hyperparameters $\alpha$ and $\beta$ and updating of the posterior distribution.
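The re-estimation equations (5.177)–(5.180) can be iterated to a fixed point. Below is a minimal sketch, assuming NumPy and random stand-ins for $\mathbf{w}_{\mathrm{MAP}}$, the residuals, and the Hessian; in the full procedure one would re-find $\mathbf{w}_{\mathrm{MAP}}$ (and hence $\mathbf{H}$ and the residuals) between hyperparameter updates, which is omitted here purely to keep the illustration short.

```python
import numpy as np

rng = np.random.default_rng(1)

W, N = 5, 50
w_map = rng.normal(size=W)                 # stand-in MAP weights
resid = rng.normal(scale=0.3, size=N)      # stand-in y(x_n, w_MAP) - t_n
M = rng.normal(size=(W, W))
H = M @ M.T                                # stand-in error-function Hessian

alpha, beta = 1.0, 1.0
for _ in range(100):
    lam = beta * np.linalg.eigvalsh(H)     # eigenvalues of beta*H, eq. (5.177)
    gamma = np.sum(lam / (alpha + lam))    # effective parameters, eq. (5.179)
    alpha = gamma / (w_map @ w_map)        # re-estimation of alpha, eq. (5.178)
    beta = (N - gamma) / np.sum(resid**2)  # re-estimation of beta, eq. (5.180)

print(alpha, beta, gamma)
```

Since each eigenvalue $\lambda_i$ is positive here, every term of $\gamma$ lies in $(0,1)$, so $0 < \gamma < W$ and both hyperparameters remain positive throughout the iteration.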
The situation with a neural network model is more complex, however, due to the multimodality of the posterior distribution. As a consequence, the solution for $\mathbf{w}_{\mathrm{MAP}}$ found by maximizing the log posterior will depend on the initialization of $\mathbf{w}$. Solutions that differ only as a consequence of the interchange and sign-reversal symmetries in the hidden units are identical so far as predictions are concerned, and it is irrelevant which of the equivalent solutions is found. However, there may be inequivalent solutions as well, and these will generally yield different values for the optimized hyperparameters.

In order to compare different models, for example neural networks having different numbers of hidden units, we need to evaluate the model evidence $p(\mathcal{D})$. This can be approximated by taking (5.175) and substituting the values of $\alpha$ and $\beta$ obtained from the iterative optimization of these hyperparameters.
A more careful evaluation is obtained by marginalizing over $\alpha$ and $\beta$, again by making a Gaussian approximation (MacKay, 1992c; Bishop, 1995a). In either case, it is necessary to evaluate the determinant $|\mathbf{A}|$ of the Hessian matrix. This can be problematic in practice because the determinant, unlike the trace, is sensitive to the small eigenvalues that are often difficult to determine accurately.

The Laplace approximation is based on a local quadratic expansion around a mode of the posterior distribution over weights.
We have seen in Section 5.1.1 that any given mode in a two-layer network is a member of a set of $M!\,2^M$ equivalent modes that differ by interchange and sign-change symmetries, where $M$ is the number of hidden units. When comparing networks having different numbers of hidden units, this can be taken into account by multiplying the evidence by a factor of $M!\,2^M$.

5.7.3 Bayesian neural networks for classification

So far, we have used the Laplace approximation to develop a Bayesian treatment of neural network regression models.
We now discuss the modifications to this framework that arise when it is applied to classification. Here we shall consider a network having a single logistic-sigmoid output corresponding to a two-class classification problem. The extension to networks with multiclass softmax outputs is straightforward (Exercise 5.40). We shall build extensively on the analogous results for linear classification models discussed in Section 4.5, and so we encourage the reader to familiarize themselves with that material before studying this section.

The log likelihood function for this model is given by

$$
\ln p(\mathcal{D}\mid\mathbf{w}) = \sum_{n=1}^{N}\{t_n\ln y_n + (1-t_n)\ln(1-y_n)\} \tag{5.181}
$$

where $t_n\in\{0,1\}$ are the target values, and $y_n\equiv y(\mathbf{x}_n,\mathbf{w})$ (Exercise 5.41). Note that there is no hyperparameter $\beta$, because the data points are assumed to be correctly labelled.
As before, the prior is taken to be an isotropic Gaussian of the form (5.162).

The first stage in applying the Laplace framework to this model is to initialize the hyperparameter $\alpha$, and then to determine the parameter vector $\mathbf{w}$ by maximizing the log posterior distribution. This is equivalent to minimizing the regularized error function

$$
E(\mathbf{w}) = -\ln p(\mathcal{D}\mid\mathbf{w}) + \frac{\alpha}{2}\mathbf{w}^{\mathrm{T}}\mathbf{w} \tag{5.182}
$$

and can be achieved using error backpropagation combined with standard optimization algorithms, as discussed in Section 5.3.

Having found a solution $\mathbf{w}_{\mathrm{MAP}}$ for the weight vector, the next step is to evaluate the Hessian matrix $\mathbf{H}$ comprising the second derivatives of the negative log likelihood function.
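As an illustration of minimizing (5.182), the sketch below uses plain gradient descent and, purely for brevity, a linear model with a logistic-sigmoid output in place of a two-layer network; the regularized objective and its gradient have the same form. All data, sizes, and names are stand-ins, not from the text.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy two-class data (stand-ins).
N, D = 100, 2
X = rng.normal(size=(N, D))
t = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)

alpha = 0.1  # prior precision, assumed fixed at this stage

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def reg_error(w):
    """Regularized error E(w) = -ln p(D|w) + (alpha/2) w^T w, eq. (5.182)."""
    y = sigmoid(X @ w)
    eps = 1e-12  # guards log(0)
    nll = -np.sum(t * np.log(y + eps) + (1 - t) * np.log(1 - y + eps))
    return nll + 0.5 * alpha * (w @ w)

def grad(w):
    """Gradient of E(w); for this model backprop reduces to X^T (y - t)."""
    return X.T @ (sigmoid(X @ w) - t) + alpha * w

# Plain gradient descent stands in for the optimizers of Section 5.3.
w = np.zeros(D)
for _ in range(500):
    w -= 0.01 * grad(w)

print(reg_error(w))
```

For a genuine two-layer network the objective is non-convex, which is exactly the source of the multimodality discussed above; here the stand-in model is convex, so the minimizer is unique.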
This can be done, for instance, using the exact method of Section 5.4.5, or using the outer-product approximation given by (5.85). The second derivatives of the negative log posterior can again be written in the form (5.166), and the Gaussian approximation to the posterior is then given by (5.167).

To optimize the hyperparameter $\alpha$, we again maximize the marginal likelihood, which is easily shown to take the form

$$
\ln p(\mathcal{D}\mid\alpha) \simeq -E(\mathbf{w}_{\mathrm{MAP}}) - \frac{1}{2}\ln|\mathbf{A}| + \frac{W}{2}\ln\alpha + \text{const} \tag{5.183}
$$

where the regularized error function is defined by

$$
E(\mathbf{w}_{\mathrm{MAP}}) = -\sum_{n=1}^{N}\{t_n\ln y_n + (1-t_n)\ln(1-y_n)\} + \frac{\alpha}{2}\mathbf{w}_{\mathrm{MAP}}^{\mathrm{T}}\mathbf{w}_{\mathrm{MAP}} \tag{5.184}
$$

in which $y_n\equiv y(\mathbf{x}_n,\mathbf{w}_{\mathrm{MAP}})$.
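A sketch of the $\alpha$-dependent terms of (5.183), assuming NumPy and an outer-product-style Hessian approximation for a logistic output, $\mathbf{H} \simeq \sum_n y_n(1-y_n)\mathbf{b}_n\mathbf{b}_n^{\mathrm{T}}$ with $\mathbf{b}_n = \nabla_{\mathbf{w}} a_n$. For brevity a linear activation $a_n = \mathbf{w}^{\mathrm{T}}\mathbf{x}_n$ stands in for the network, so $\mathbf{b}_n = \mathbf{x}_n$; all data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)

# Stand-in: linear "activations" a_n = w^T x_n, so grad_w a_n = x_n.
N, W = 50, 3
X = rng.normal(size=(N, W))
w_map = rng.normal(size=W)
alpha = 0.5

y = 1.0 / (1.0 + np.exp(-(X @ w_map)))   # y_n = sigma(a_n)

# Outer-product approximation to the Hessian of the negative log
# likelihood: H ~= sum_n y_n (1 - y_n) b_n b_n^T, with b_n = x_n here.
H = (X * (y * (1 - y))[:, None]).T @ X

# A takes the form (5.166) with the likelihood Hessian (no beta
# for classification): A = alpha*I + H.
A = alpha * np.eye(W) + H

# alpha-dependent evidence terms from (5.183):
# -(1/2) ln|A| + (W/2) ln(alpha); slogdet avoids overflow in |A|.
sign, logdetA = np.linalg.slogdet(A)
evidence_terms = -0.5 * logdetA + 0.5 * W * np.log(alpha)
print(evidence_terms)
```

Using `slogdet` rather than the raw determinant reflects the caution expressed earlier: $|\mathbf{A}|$ is sensitive to small eigenvalues, and working in log space is numerically safer.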
Maximizing this evidence function with respect to $\alpha$ again leads to the re-estimation equation given by (5.178).

The use of the evidence procedure to determine $\alpha$ is illustrated in Figure 5.22 for the synthetic two-dimensional data discussed in Appendix A.

Figure 5.22: Illustration of the evidence framework applied to a synthetic two-class data set. The green curve shows the optimal decision boundary, the black curve shows the result of fitting a two-layer network with 8 hidden units by maximum likelihood, and the red curve shows the result of including a regularizer in which $\alpha$ is optimized using the evidence procedure, starting from the initial value $\alpha = 0$. Note that the evidence procedure greatly reduces the over-fitting of the network.

Finally, we need the predictive distribution, which is defined by (5.168). Again, this integration is intractable due to the nonlinearity of the network function. The simplest approximation is to assume that the posterior distribution is very narrow and hence make the approximation

$$
p(t\mid\mathbf{x},\mathcal{D}) \simeq p(t\mid\mathbf{x},\mathbf{w}_{\mathrm{MAP}}). \tag{5.185}
$$

We can improve on this, however, by taking account of the variance of the posterior distribution.
In this case, a linear approximation for the network outputs, as was used in the case of regression, would be inappropriate due to the logistic-sigmoid output-unit activation function that constrains the output to lie in the range $(0,1)$. Instead, we make a linear approximation for the output-unit activation in the form

$$
a(\mathbf{x},\mathbf{w}) \simeq a_{\mathrm{MAP}}(\mathbf{x}) + \mathbf{b}^{\mathrm{T}}(\mathbf{w}-\mathbf{w}_{\mathrm{MAP}}) \tag{5.186}
$$

where $a_{\mathrm{MAP}}(\mathbf{x}) = a(\mathbf{x},\mathbf{w}_{\mathrm{MAP}})$, and the vector $\mathbf{b}\equiv\nabla a(\mathbf{x},\mathbf{w}_{\mathrm{MAP}})$ can be found by backpropagation.

Because we now have a Gaussian approximation for the posterior distribution over $\mathbf{w}$, and a model for $a$ that is a linear function of $\mathbf{w}$, we can now appeal to the results of Section 4.5.2. The distribution of output-unit activation values, induced by the distribution over network weights, is given by

$$
p(a\mid\mathbf{x},\mathcal{D}) = \int \delta\bigl(a - a_{\mathrm{MAP}}(\mathbf{x}) - \mathbf{b}^{\mathrm{T}}(\mathbf{x})(\mathbf{w}-\mathbf{w}_{\mathrm{MAP}})\bigr)\, q(\mathbf{w}\mid\mathcal{D})\, \mathrm{d}\mathbf{w} \tag{5.187}
$$

where $q(\mathbf{w}\mid\mathcal{D})$ is the Gaussian approximation to the posterior distribution given by (5.167).
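Because the argument of the delta function in (5.187) is linear in $\mathbf{w}$ and $q(\mathbf{w}\mid\mathcal{D})$ is Gaussian, the induced distribution over $a$ is itself Gaussian, with mean $a_{\mathrm{MAP}}(\mathbf{x})$ and variance $\mathbf{b}^{\mathrm{T}}\mathbf{A}^{-1}\mathbf{b}$. A Monte Carlo sketch, assuming NumPy and using random stand-ins for every quantity (none of the values come from the text), illustrates this:

```python
import numpy as np

rng = np.random.default_rng(3)

W = 4
alpha, beta = 1.0, 5.0
M = rng.normal(size=(W, W))
H = M @ M.T                       # stand-in Hessian
A = alpha * np.eye(W) + beta * H  # form of eq. (5.166)
A_inv = np.linalg.inv(A)

w_map = rng.normal(size=W)        # stand-in MAP weights
b = rng.normal(size=W)            # stand-in b = grad_w a(x, w_MAP)
a_map = 0.7                       # stand-in a_MAP(x)

# Sample weights from q(w|D) = N(w | w_MAP, A^{-1}), eq. (5.167),
# and push them through the linearized activation, eq. (5.186).
ws = rng.multivariate_normal(w_map, A_inv, size=200_000)
a_samples = a_map + (ws - w_map) @ b

# The induced distribution (5.187) is Gaussian with mean a_MAP(x)
# and variance b^T A^{-1} b; compare sample moments against it.
var_exact = b @ A_inv @ b
print(a_samples.mean(), a_samples.var(), var_exact)
```

The sample mean and variance should match $a_{\mathrm{MAP}}(\mathbf{x})$ and $\mathbf{b}^{\mathrm{T}}\mathbf{A}^{-1}\mathbf{b}$ up to Monte Carlo error, mirroring the analytic marginalization used in Section 4.5.2.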