Bishop C.M. Pattern Recognition and Machine Learning (2006) (811375), page 65

Text from the file (page 65)

NEURAL NETWORKS

Figure 5.21 (a) Plot of the mixing coefficients $\pi_k(x)$ as a function of $x$ for the three kernel functions in a mixture density network trained on the data shown in Figure 5.19. The model has three Gaussian components, and uses a two-layer multilayer perceptron with five 'tanh' sigmoidal units in the hidden layer, and nine outputs (corresponding to the 3 means and 3 variances of the Gaussian components and the 3 mixing coefficients). At both small and large values of $x$, where the conditional probability density of the target data is unimodal, only one of the kernels has a high value for its prior probability, while at intermediate values of $x$, where the conditional density is trimodal, the three mixing coefficients have comparable values. (b) Plots of the means $\mu_k(x)$ using the same colour coding as for the mixing coefficients. (c) Plot of the contours of the corresponding conditional probability density of the target data for the same mixture density network. (d) Plot of the approximate conditional mode, shown by the red points, of the conditional density.

We illustrate the use of a mixture density network by returning to the toy example of an inverse problem shown in Figure 5.19. Plots of the mixing coefficients $\pi_k(x)$, the means $\mu_k(x)$, and the conditional density contours corresponding to $p(t|x)$, are shown in Figure 5.21.

The outputs of the neural network, and hence the parameters in the mixture model, are necessarily continuous single-valued functions of the input variables. However, we see from Figure 5.21(c) that the model is able to produce a conditional density that is unimodal for some values of $x$ and trimodal for other values by modulating the amplitudes of the mixing components $\pi_k(x)$.

Once a mixture density network has been trained, it can predict the conditional density function of the target data for any given value of the input vector.
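The switch between unimodal and trimodal behaviour can be reproduced with a small sketch that keeps the component means and widths fixed and changes only the mixing coefficients. It evaluates a one-dimensional Gaussian mixture on a grid and counts its significant local maxima. The means, widths, mixing coefficients, grid, and the 10% significance threshold are hypothetical values chosen for illustration; they are not the parameters of the network in Figure 5.21.

```python
import math

def mixture_density(t, pi, mu, sigma):
    """Density of a 1-D Gaussian mixture evaluated at the point t."""
    return sum(
        p * math.exp(-0.5 * ((t - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
        for p, m, s in zip(pi, mu, sigma)
    )

def count_modes(pi, mu, sigma, n_grid=1001, rel_threshold=0.1):
    """Count grid-local maxima at least rel_threshold times the highest peak.

    The threshold is an illustrative assumption: it ignores the tiny bumps
    left by components whose mixing coefficient is close to zero.
    """
    ts = [i / (n_grid - 1) for i in range(n_grid)]  # grid on [0, 1]
    dens = [mixture_density(t, pi, mu, sigma) for t in ts]
    peak = max(dens)
    return sum(
        1
        for i in range(1, n_grid - 1)
        if dens[i] > dens[i - 1]
        and dens[i] > dens[i + 1]
        and dens[i] >= rel_threshold * peak
    )

mu = [0.2, 0.5, 0.8]        # hypothetical component means, held fixed
sigma = [0.05, 0.05, 0.05]  # hypothetical component widths, held fixed

modes_one_dominant = count_modes([0.98, 0.01, 0.01], mu, sigma)
modes_comparable = count_modes([1 / 3, 1 / 3, 1 / 3], mu, sigma)
print(modes_one_dominant, modes_comparable)
```

With one dominant mixing coefficient the sketch reports a single significant mode, and with three comparable coefficients it reports three, even though the means and widths never change.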

This conditional density represents a complete description of the generator of the data, so far as the problem of predicting the value of the output vector is concerned. From this density function we can calculate more specific quantities that may be of interest in different applications. One of the simplest of these is the mean, corresponding to the conditional average of the target data, and is given by

$$\mathbb{E}[\mathbf{t}|\mathbf{x}] = \int \mathbf{t}\, p(\mathbf{t}|\mathbf{x})\, \mathrm{d}\mathbf{t} = \sum_{k=1}^{K} \pi_k(\mathbf{x})\, \boldsymbol{\mu}_k(\mathbf{x}) \tag{5.158}$$

where we have used (5.148) (Exercise 5.37).

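Because a standard network trained by least squares is approximating the conditional mean, we see that a mixture density network can reproduce the conventional least-squares result as a special case. Of course, as we have already noted, for a multimodal distribution the conditional mean is of limited value.

We can similarly evaluate the variance of the density function about the conditional average, to give

\begin{align}
s^2(\mathbf{x}) &= \mathbb{E}\left[ \left\| \mathbf{t} - \mathbb{E}[\mathbf{t}|\mathbf{x}] \right\|^2 \,\middle|\, \mathbf{x} \right] \tag{5.159} \\
&= \sum_{k=1}^{K} \pi_k(\mathbf{x}) \left\{ \sigma_k^2(\mathbf{x}) + \left\| \boldsymbol{\mu}_k(\mathbf{x}) - \sum_{l=1}^{K} \pi_l(\mathbf{x})\, \boldsymbol{\mu}_l(\mathbf{x}) \right\|^2 \right\} \tag{5.160}
\end{align}

where we have used (5.148) and (5.158).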
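As a small numerical sketch of (5.158) and (5.160), assuming a scalar target and the mixture parameters already evaluated at a single input value, the two quantities reduce to weighted sums. The function names and the parameter values below are hypothetical.

```python
def mdn_mean(pi, mu):
    """Conditional mean (5.158): sum over k of pi_k * mu_k."""
    return sum(p * m for p, m in zip(pi, mu))

def mdn_variance(pi, mu, sigma):
    """Variance about the conditional mean (5.160), for a scalar target."""
    mean = mdn_mean(pi, mu)
    return sum(p * (s ** 2 + (m - mean) ** 2) for p, m, s in zip(pi, mu, sigma))

# Hypothetical mixture parameters at one input value x.
pi = [0.5, 0.5]     # mixing coefficients
mu = [0.0, 1.0]     # component means
sigma = [0.1, 0.1]  # component standard deviations

mean = mdn_mean(pi, mu)
variance = mdn_variance(pi, mu, sigma)
print(mean, variance)
```

Here the mean is 0.5, which lies between the two components where the density is low, and the variance is approximately 0.26: each component contributes 0.01 from its own width and 0.25 from the squared distance of its mean to the overall mean.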

The variance (5.160) is more general than the corresponding least-squares result because it is a function of $\mathbf{x}$.

We have seen that for multimodal distributions, the conditional mean can give a poor representation of the data. For instance, in controlling the simple robot arm shown in Figure 5.18, we need to pick one of the two possible joint angle settings in order to achieve the desired end-effector location, whereas the average of the two solutions is not itself a solution. In such cases, the conditional mode may be of more value. Because the conditional mode for the mixture density network does not have a simple analytical solution, this would require numerical iteration. A simple alternative is to take the mean of the most probable component (i.e., the one with the largest mixing coefficient) at each value of $\mathbf{x}$.
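The difference between the conditional mean and the mean of the most probable component can be sketched as follows. The two-component setting loosely mimics the two joint-angle solutions of the robot arm; the angles and mixing coefficients are hypothetical, as is the function name.

```python
def most_probable_component_mean(pi, mu):
    """Mean of the component with the largest mixing coefficient."""
    k = max(range(len(pi)), key=lambda i: pi[i])
    return mu[k]

# Hypothetical case with two valid solutions (joint angles in radians).
pi = [0.55, 0.45]
mu = [-1.0, 1.0]

conditional_mean = sum(p * m for p, m in zip(pi, mu))
approx_mode = most_probable_component_mean(pi, mu)
print(conditional_mean, approx_mode)
```

The conditional mean is close to -0.1, which corresponds to neither solution, whereas the most probable component returns -1.0, one of the two valid settings.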

The mean of the most probable component is shown for the toy data set in Figure 5.21(d).

5.7. Bayesian Neural Networks

So far, our discussion of neural networks has focussed on the use of maximum likelihood to determine the network parameters (weights and biases). Regularized maximum likelihood can be interpreted as a MAP (maximum posterior) approach in which the regularizer can be viewed as the logarithm of a prior parameter distribution. However, in a Bayesian treatment we need to marginalize over the distribution of parameters in order to make predictions.

In Section 3.3, we developed a Bayesian solution for a simple linear regression model under the assumption of Gaussian noise.

We saw that the posterior distribution, which is Gaussian, could be evaluated exactly and that the predictive distribution could also be found in closed form. In the case of a multilayered network, the highly nonlinear dependence of the network function on the parameter values means that an exact Bayesian treatment can no longer be found. In fact, the log of the posterior distribution will be nonconvex, corresponding to the multiple local minima in the error function.

The technique of variational inference, to be discussed in Chapter 10, has been applied to Bayesian neural networks using a factorized Gaussian approximation to the posterior distribution (Hinton and van Camp, 1993) and also using a full covariance Gaussian (Barber and Bishop, 1998a; Barber and Bishop, 1998b). The most complete treatment, however, has been based on the Laplace approximation (MacKay, 1992c; MacKay, 1992b) and forms the basis for the discussion given here. We will approximate the posterior distribution by a Gaussian, centred at a mode of the true posterior. Furthermore, we shall assume that the covariance of this Gaussian is small so that the network function is approximately linear with respect to the parameters over the region of parameter space for which the posterior probability is significantly nonzero. With these two approximations, we will obtain models that are analogous to the linear regression and classification models discussed in earlier chapters and so we can exploit the results obtained there. We can then make use of the evidence framework to provide point estimates for the hyperparameters and to compare alternative models (for example, networks having different numbers of hidden units).

To start with, we shall discuss the regression case and then later consider the modifications needed for solving classification tasks.

5.7.1 Posterior parameter distribution

Consider the problem of predicting a single continuous target variable $t$ from a vector $\mathbf{x}$ of inputs (the extension to multiple targets is straightforward). We shall suppose that the conditional distribution $p(t|\mathbf{x})$ is Gaussian, with an $\mathbf{x}$-dependent mean given by the output of a neural network model $y(\mathbf{x}, \mathbf{w})$, and with precision (inverse variance) $\beta$

$$p(t|\mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}(t \,|\, y(\mathbf{x}, \mathbf{w}), \beta^{-1}). \tag{5.161}$$

Similarly, we shall choose a prior distribution over the weights $\mathbf{w}$ that is Gaussian of the form

$$p(\mathbf{w}|\alpha) = \mathcal{N}(\mathbf{w} \,|\, \mathbf{0}, \alpha^{-1}\mathbf{I}). \tag{5.162}$$

For an i.i.d. data set of $N$ observations $\mathbf{x}_1, \ldots, \mathbf{x}_N$, with a corresponding set of target values $\mathcal{D} = \{t_1, \ldots, t_N\}$, the likelihood function is given by

$$p(\mathcal{D}|\mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n \,|\, y(\mathbf{x}_n, \mathbf{w}), \beta^{-1}) \tag{5.163}$$

and so the resulting posterior distribution is then

$$p(\mathbf{w}|\mathcal{D}, \alpha, \beta) \propto p(\mathbf{w}|\alpha)\, p(\mathcal{D}|\mathbf{w}, \beta) \tag{5.164}$$

which, as a consequence of the nonlinear dependence of $y(\mathbf{x}, \mathbf{w})$ on $\mathbf{w}$, will be non-Gaussian.

We can find a Gaussian approximation to the posterior distribution by using the Laplace approximation. To do this, we must first find a (local) maximum of the posterior, and this must be done using iterative numerical optimization.

As usual, it is convenient to maximize the logarithm of the posterior, which can be written in the form

$$\ln p(\mathbf{w}|\mathcal{D}) = -\frac{\alpha}{2}\mathbf{w}^{\mathrm{T}}\mathbf{w} - \frac{\beta}{2}\sum_{n=1}^{N} \{y(\mathbf{x}_n, \mathbf{w}) - t_n\}^2 + \text{const} \tag{5.165}$$

which corresponds to a regularized sum-of-squares error function. Assuming for the moment that $\alpha$ and $\beta$ are fixed, we can find a maximum of the posterior, which we denote $\mathbf{w}_{\text{MAP}}$, by standard nonlinear optimization algorithms such as conjugate gradients, using error backpropagation to evaluate the required derivatives.

Having found a mode $\mathbf{w}_{\text{MAP}}$, we can then build a local Gaussian approximation by evaluating the matrix of second derivatives of the negative log posterior distribution.
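The two steps just described, finding a mode and then evaluating the second derivatives there, can be sketched on a toy problem. Everything specific below is an assumption made for illustration: the network has a scalar input and a single tanh hidden unit, the data and the values of α and β are made up, step-halving gradient descent stands in for conjugate gradients, and the second derivatives are obtained by finite differences of the backpropagated gradient.

```python
import math

def network(x, w):
    """y(x, w) = w[2] * tanh(w[0] * x + w[1]) + w[3]: one tanh hidden unit."""
    return w[2] * math.tanh(w[0] * x + w[1]) + w[3]

def neg_log_posterior(w, xs, ts, alpha, beta):
    """E(w) = alpha/2 * w.w + beta/2 * sum_n (y_n - t_n)^2, minus (5.165) up to a constant."""
    prior = 0.5 * alpha * sum(wi * wi for wi in w)
    data = 0.5 * beta * sum((network(x, w) - t) ** 2 for x, t in zip(xs, ts))
    return prior + data

def gradient(w, xs, ts, alpha, beta):
    """Gradient of E(w), with the data term obtained by backpropagation."""
    g = [alpha * wi for wi in w]
    for x, t in zip(xs, ts):
        z = math.tanh(w[0] * x + w[1])
        delta_out = beta * (w[2] * z + w[3] - t)      # error at the output
        delta_hid = delta_out * w[2] * (1.0 - z * z)  # error at the hidden unit
        g[0] += delta_hid * x
        g[1] += delta_hid
        g[2] += delta_out * z
        g[3] += delta_out
    return g

def find_map(w, xs, ts, alpha, beta, max_iters=2000):
    """Minimise E(w) by gradient descent with step halving."""
    for _ in range(max_iters):
        e = neg_log_posterior(w, xs, ts, alpha, beta)
        g = gradient(w, xs, ts, alpha, beta)
        step, improved = 1.0, False
        while step > 1e-12:
            cand = [wi - step * gi for wi, gi in zip(w, g)]
            if neg_log_posterior(cand, xs, ts, alpha, beta) < e:
                w, improved = cand, True
                break
            step *= 0.5
        if not improved:  # no step lowers E: treat w as a local minimum
            break
    return w

def hessian(w, xs, ts, alpha, beta, h=1e-5):
    """Second derivatives of E(w) by central differences of the gradient."""
    n = len(w)
    cols = []
    for j in range(n):
        w_plus = list(w)
        w_plus[j] += h
        w_minus = list(w)
        w_minus[j] -= h
        g_plus = gradient(w_plus, xs, ts, alpha, beta)
        g_minus = gradient(w_minus, xs, ts, alpha, beta)
        cols.append([(a - b) / (2.0 * h) for a, b in zip(g_plus, g_minus)])
    # cols[j][i] approximates d^2 E / (dw_i dw_j); average to symmetrise.
    return [[0.5 * (cols[j][i] + cols[i][j]) for j in range(n)] for i in range(n)]

# Hypothetical toy data and fixed hyperparameters.
xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
ts = [-0.7, -0.5, 0.1, 0.4, 0.8]
alpha, beta = 0.1, 25.0

w_init = [0.5, 0.0, 0.5, 0.0]
w_map = find_map(w_init, xs, ts, alpha, beta)
A = hessian(w_map, xs, ts, alpha, beta)
print(w_map)
print(A)
```

The matrix `A` plays the role of the matrix of second derivatives of the negative log posterior evaluated at the mode found from this particular starting point; a different initialisation may lead to a different local mode.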
