We cannot simply find the mode by setting this gradient to zero, because σ_N depends nonlinearly on a_N, and so we resort to an iterative scheme based on the Newton-Raphson method, which gives rise to an iterative reweighted least squares (IRLS) algorithm. This requires the second derivatives of Ψ(a_N), which we also require for the Laplace approximation anyway, and which are given by

∇∇Ψ(a_N) = −W_N − C_N^{-1}    (6.82)

where W_N is a diagonal matrix with elements σ(a_n)(1 − σ(a_n)), and we have used the result (4.88) for the derivative of the logistic sigmoid function.
Note that these diagonal elements lie in the range (0, 1/4), and hence W_N is a positive definite matrix. Because C_N (and hence its inverse) is positive definite by construction, and because the sum of two positive definite matrices is also positive definite (Exercise 6.24), we see that the Hessian matrix A = −∇∇Ψ(a_N) is positive definite, and so the posterior distribution p(a_N|t_N) is log concave and therefore has a single mode that is the global maximum.
The posterior distribution is not Gaussian, however, because the Hessian is a function of a_N.

Using the Newton-Raphson formula (4.92) (Exercise 6.25), the iterative update equation for a_N is given by

a_N^new = C_N (I + W_N C_N)^{-1} {t_N − σ_N + W_N a_N}.    (6.83)

These equations are iterated until they converge to the mode, which we denote by a_N^⋆. At the mode, the gradient ∇Ψ(a_N) will vanish, and hence a_N^⋆ will satisfy

a_N^⋆ = C_N (t_N − σ_N).    (6.84)

Once we have found the mode a_N^⋆ of the posterior, we can evaluate the Hessian matrix given by

H = −∇∇Ψ(a_N) = W_N + C_N^{-1}    (6.85)

where the elements of W_N are evaluated using a_N^⋆.
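As a concrete illustration (not part of the original text), the following minimal NumPy sketch iterates the update (6.83) to locate the mode and evaluates the Hessian (6.85). The names find_mode, C (the Gram matrix C_N) and t (the 0/1 target vector t_N) are illustrative assumptions; a production implementation would use Cholesky factorizations rather than explicit inverses for numerical stability.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def find_mode(C, t, max_iter=100, tol=1e-8):
    """Locate the mode a_N* of p(a_N | t_N) by iterating the Newton-Raphson
    update (6.83); C is the N x N Gram matrix C_N, t the 0/1 targets t_N."""
    t = np.asarray(t, dtype=float)
    N = len(t)
    a = np.zeros(N)                                   # start from a_N = 0
    for _ in range(max_iter):
        sigma = sigmoid(a)
        W = np.diag(sigma * (1.0 - sigma))            # W_N
        # a_new = C_N (I + W_N C_N)^{-1} {t_N - sigma_N + W_N a_N}
        a_new = C @ np.linalg.solve(np.eye(N) + W @ C, t - sigma + W @ a)
        converged = np.max(np.abs(a_new - a)) < tol
        a = a_new
        if converged:
            break
    sigma = sigmoid(a)
    W = np.diag(sigma * (1.0 - sigma))
    H = W + np.linalg.inv(C)                          # Hessian (6.85) at the mode
    return a, H
```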
This defines our Gaussian approximation to the posterior distribution p(a_N|t_N), given by

q(a_N) = N(a_N | a_N^⋆, H^{-1}).    (6.86)

We can now combine this with (6.78) and hence evaluate the integral (6.77). Because this corresponds to a linear-Gaussian model, we can use the general result (2.115) (Exercise 6.26) to give

E[a_{N+1}|t_N] = k^T (t_N − σ_N)    (6.87)
var[a_{N+1}|t_N] = c − k^T (W_N^{-1} + C_N)^{-1} k.    (6.88)

Now that we have a Gaussian distribution for p(a_{N+1}|t_N), we can approximate the integral (6.76) using the result (4.153).
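Continuing the illustrative sketch above (again an added example under assumed names, not text from the book), the predictive mean (6.87), variance (6.88) and the class probability obtained from the probit-based approximation (4.153) can be computed as follows. Here k holds the covariances k(x_n, x_{N+1}) between the training inputs and the test input, c is the prior variance of a_{N+1}, and a_star is the mode returned by find_mode.

```python
def predict_proba(C, t, a_star, k, c):
    """Approximate p(t_{N+1} = 1 | t_N) for a single test point, using the
    predictive mean (6.87), variance (6.88) and the approximation (4.153)
    to the convolution of a sigmoid with a Gaussian."""
    sigma = sigmoid(a_star)
    W_inv = np.diag(1.0 / (sigma * (1.0 - sigma)))        # W_N^{-1}
    mean = k @ (t - sigma)                                 # (6.87)
    var = c - k @ np.linalg.solve(W_inv + C, k)            # (6.88)
    kappa = 1.0 / np.sqrt(1.0 + np.pi * var / 8.0)         # (4.153)
    return sigmoid(kappa * mean)
```

Note that sigmoid(kappa * mean) equals 0.5 exactly when the mean is zero, which is why only the mean matters for the decision boundary discussed next.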
As with the Bayesian logistic regression model of Section 4.5, if we are only interested in the decision boundary corresponding to p(t_{N+1}|t_N) = 0.5, then we need only consider the mean and we can ignore the effect of the variance.

We also need to determine the parameters θ of the covariance function. One approach is to maximize the likelihood function given by p(t_N|θ), for which we need expressions for the log likelihood and its gradient.
If desired, suitable regularization terms can also be added, leading to a penalized maximum likelihood solution. The likelihood function is defined by

p(t_N|θ) = ∫ p(t_N|a_N) p(a_N|θ) da_N.    (6.89)

This integral is analytically intractable, so again we make use of the Laplace approximation. Using the result (4.135), we obtain the following approximation for the log of the likelihood function

ln p(t_N|θ) = Ψ(a_N^⋆) − (1/2) ln|W_N + C_N^{-1}| + (N/2) ln(2π)    (6.90)

where Ψ(a_N^⋆) = ln p(a_N^⋆|θ) + ln p(t_N|a_N^⋆). We also need to evaluate the gradient of ln p(t_N|θ) with respect to the parameter vector θ.
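A minimal sketch of evaluating the approximation (6.90) at the mode, reusing find_mode from above. The helper kernel(X, X, theta) is an assumption: it is taken to return the Gram matrix C_N for covariance parameters θ (including any noise-like term ν on the diagonal), and Ψ is expanded using the definitions of p(a_N|θ) and p(t_N|a_N).

```python
def approx_log_likelihood(theta, X, t, kernel):
    """Laplace approximation (6.90) to ln p(t_N | theta).  `kernel` is an
    assumed callable returning the Gram matrix C_N for parameters theta."""
    C = kernel(X, X, theta)
    t = np.asarray(t, dtype=float)
    N = len(t)
    a_star, _ = find_mode(C, t)
    sigma = sigmoid(a_star)
    W = np.diag(sigma * (1.0 - sigma))
    # Psi(a*) = ln p(a*|theta) + ln p(t|a*)
    psi = (-0.5 * a_star @ np.linalg.solve(C, a_star)
           - 0.5 * np.linalg.slogdet(C)[1]
           - 0.5 * N * np.log(2.0 * np.pi)
           + t @ a_star
           - np.sum(np.logaddexp(0.0, a_star)))        # sum_n ln(1 + e^{a_n})
    logdet = np.linalg.slogdet(W + np.linalg.inv(C))[1]
    return psi - 0.5 * logdet + 0.5 * N * np.log(2.0 * np.pi)
```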
Note that changes in θ will cause changes in a_N^⋆, leading to additional terms in the gradient. Thus, when we differentiate (6.90) with respect to θ, we obtain two sets of terms, the first arising from the dependence of the covariance matrix C_N on θ, and the rest arising from the dependence of a_N^⋆ on θ.

The terms arising from the explicit dependence on θ can be found by using (6.80) together with the results (C.21) and (C.22), and are given by

∂ ln p(t_N|θ)/∂θ_j = (1/2) a_N^{⋆T} C_N^{-1} (∂C_N/∂θ_j) C_N^{-1} a_N^⋆ − (1/2) Tr[(I + C_N W_N)^{-1} W_N (∂C_N/∂θ_j)].    (6.91)

To compute the terms arising from the dependence of a_N^⋆ on θ, we note that the Laplace approximation has been constructed such that Ψ(a_N) has zero gradient at a_N = a_N^⋆, and so Ψ(a_N) gives no contribution to the gradient as a result of its dependence on a_N.
This leaves the following contribution to the derivative with respect to a component θ_j of θ

− (1/2) Σ_{n=1}^{N} (∂ ln|W_N + C_N^{-1}| / ∂a_n^⋆) (∂a_n^⋆/∂θ_j)
    = − (1/2) Σ_{n=1}^{N} [(I + C_N W_N)^{-1} C_N]_{nn} σ_n^⋆ (1 − σ_n^⋆)(1 − 2σ_n^⋆) (∂a_n^⋆/∂θ_j)    (6.92)

where σ_n^⋆ = σ(a_n^⋆), and again we have used the result (C.22) together with the definition of W_N. We can evaluate the derivative of a_N^⋆ with respect to θ_j by differentiating the relation (6.84) with respect to θ_j to give

∂a_N^⋆/∂θ_j = (∂C_N/∂θ_j)(t_N − σ_N) − C_N W_N (∂a_N^⋆/∂θ_j).    (6.93)

Rearranging then gives

∂a_N^⋆/∂θ_j = (I + C_N W_N)^{-1} (∂C_N/∂θ_j)(t_N − σ_N).    (6.94)

Combining (6.91), (6.92), and (6.94), we can evaluate the gradient of the log likelihood function, which can be used with standard nonlinear optimization algorithms (Appendix A) in order to determine a value for θ.
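Completing the illustrative sketch, the gradient with respect to a single covariance parameter θ_j can be assembled from (6.91), (6.92) and (6.94). The argument dC_dtheta_j, standing for ∂C_N/∂θ_j, is assumed to be supplied by the kernel implementation; as before, explicit inverses and dense solves are used only for brevity.

```python
def log_likelihood_gradient_j(C, dC_dtheta_j, t, a_star):
    """Derivative of the approximate ln p(t_N | theta) with respect to one
    covariance parameter theta_j, combining (6.91), (6.92) and (6.94)."""
    t = np.asarray(t, dtype=float)
    N = len(t)
    sigma = sigmoid(a_star)
    W = np.diag(sigma * (1.0 - sigma))
    B = np.eye(N) + C @ W                              # I + C_N W_N
    C_inv_a = np.linalg.solve(C, a_star)

    # explicit dependence of C_N on theta_j, equation (6.91)
    explicit = (0.5 * C_inv_a @ dC_dtheta_j @ C_inv_a
                - 0.5 * np.trace(np.linalg.solve(B, W @ dC_dtheta_j)))

    # implicit dependence through the mode a_N*, equations (6.92) and (6.94)
    da_dtheta = np.linalg.solve(B, dC_dtheta_j @ (t - sigma))      # (6.94)
    d_logdet_da = (np.diag(np.linalg.solve(B, C))
                   * sigma * (1.0 - sigma) * (1.0 - 2.0 * sigma))
    implicit = -0.5 * d_logdet_da @ da_dtheta                      # (6.92)

    return explicit + implicit
```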
We can illustrate the application of the Laplace approximation for Gaussian processes using the synthetic two-class data set shown in Figure 6.12. Extension of the Laplace approximation to Gaussian processes involving K > 2 classes, using the softmax activation function, is straightforward (Williams and Barber, 1998).

Figure 6.12 Illustration of the use of a Gaussian process for classification, showing the data on the left together with the optimal decision boundary from the true distribution in green, and the decision boundary from the Gaussian process classifier in black. On the right is the predicted posterior probability for the blue and red classes together with the Gaussian process decision boundary.

6.4.7 Connection to neural networks

We have seen that the range of functions which can be represented by a neural network is governed by the number M of hidden units, and that, for sufficiently large M, a two-layer network can approximate any given function with arbitrary accuracy.
In the framework of maximum likelihood, the number of hidden units needs to be limited (to a level dependent on the size of the training set) in order to avoid over-fitting. However, from a Bayesian perspective it makes little sense to limit the number of parameters in the network according to the size of the training set.

In a Bayesian neural network, the prior distribution over the parameter vector w, in conjunction with the network function f(x, w), produces a prior distribution over functions y(x), where y is the vector of network outputs. Neal (1996) has shown that, for a broad class of prior distributions over w, the distribution of functions generated by a neural network will tend to a Gaussian process in the limit M → ∞.
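Neal's result can be checked empirically. The sketch below is an illustration added here, not taken from the text: it samples many two-layer networks with zero-mean Gaussian priors whose output-weight variance scales as 1/M, and shows that the output at a fixed input settles to a stable, approximately Gaussian distribution as M grows. The tanh activation and the particular prior scales are arbitrary illustrative choices.

```python
def sample_network_outputs(x, M, n_samples=20000, seed=0):
    """Draw outputs y(x) = sum_j v_j tanh(u_j^T x + b_j) for many networks
    whose weights are sampled from zero-mean Gaussian priors, with the
    output-weight variance scaled as 1/M so that var[y] stays finite."""
    rng = np.random.default_rng(seed)
    u = rng.normal(0.0, 1.0, size=(n_samples, M, len(x)))        # input-to-hidden weights
    b = rng.normal(0.0, 1.0, size=(n_samples, M))                # hidden-unit biases
    v = rng.normal(0.0, 1.0 / np.sqrt(M), size=(n_samples, M))   # output weights
    return np.sum(v * np.tanh(u @ x + b), axis=1)

x = np.array([0.5, -1.0])
for M in (1, 10, 1000):
    y = sample_network_outputs(x, M)
    print(M, y.var())    # variance stabilizes; the histogram of y tends to a Gaussian
```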
It should be noted, however, that in this limit the output variables of the neural network become independent. One of the great merits of neural networks is that the outputs share the hidden units and so they can 'borrow statistical strength' from each other, that is, the weights associated with each hidden unit are influenced by all of the output variables, not just by one of them. This property is therefore lost in the Gaussian process limit.

We have seen that a Gaussian process is determined by its covariance (kernel) function. Williams (1998) has given explicit forms for the covariance in the case of two specific choices for the hidden unit activation function (probit and Gaussian). These kernel functions k(x, x′) are nonstationary, i.e. they cannot be expressed as a function of the difference x − x′, as a consequence of the Gaussian weight prior being centred on zero, which breaks translation invariance in weight space.
By working directly with the covariance function we have implicitly marginalized over the distribution of weights. If the weight prior is governed by hyperparameters, then their values will determine the length scales of the distribution over functions, as can be understood by studying the examples in Figure 5.11 for the case of a finite number of hidden units.
Note that we cannot marginalize out the hyperparameters analytically, and must instead resort to techniques of the kind discussed in Section 6.4.

Exercises

6.1 www Consider the dual formulation of the least squares linear regression problem given in Section 6.1. Show that the solution for the components a_n of the vector a can be expressed as a linear combination of the elements of the vector φ(x_n).
Denoting these coefficients by the vector w, show that the dual of the dual formulation is given by the original representation in terms of the parameter vector w.

6.2 In this exercise, we develop a dual formulation of the perceptron learning algorithm. Using the perceptron learning rule (4.55), show that the learned weight vector w can be written as a linear combination of the vectors t_n φ(x_n), where t_n ∈ {−1, +1}.
Denote the coefficients of this linear combination by α_n and derive a formulation of the perceptron learning algorithm, and the predictive function for the perceptron, in terms of the α_n. Show that the feature vector φ(x) enters only in the form of the kernel function k(x, x′) = φ(x)^T φ(x′).

6.3 The nearest-neighbour classifier (Section 2.5.2) assigns a new input vector x to the same class as that of the nearest input vector x_n from the training set, where, in the simplest case, the distance is defined by the Euclidean metric ‖x − x_n‖². By expressing this rule in terms of scalar products and then making use of kernel substitution, formulate the nearest-neighbour classifier for a general nonlinear kernel.

6.4 In Appendix C, we give an example of a matrix that has positive elements but that has a negative eigenvalue and hence that is not positive definite.
Find an example of the converse property, namely a 2 × 2 matrix with positive eigenvalues yet that has at least one negative element.

6.5 www Verify the results (6.13) and (6.14) for constructing valid kernels.

6.6 Verify the results (6.15) and (6.16) for constructing valid kernels.

6.7 www Verify the results (6.17) and (6.18) for constructing valid kernels.

6.8 Verify the results (6.19) and (6.20) for constructing valid kernels.

6.9 Verify the results (6.21) and (6.22) for constructing valid kernels.

6.10 Show that an excellent choice of kernel for learning a function f(x) is given by k(x, x′) = f(x)f(x′), by showing that a linear learning machine based on this kernel will always find a solution proportional to f(x).

6.11 By making use of the expansion (6.25), and then expanding the middle factor as a power series, show that the Gaussian kernel (6.23) can be expressed as the inner product of an infinite-dimensional feature vector.

6.12 www Consider the space of all possible subsets A of a given fixed set D. Show that the kernel function (6.27) corresponds to an inner product in a feature space of dimensionality 2^|D| defined by the mapping φ(A), where A is a subset of D and the element φ_U(A), indexed by the subset U, is given by

φ_U(A) = 1 if U ⊆ A, and 0 otherwise.    (6.95)

Here U ⊆ A denotes that U is either a subset of A or is equal to A.

6.13 Show that the Fisher kernel, defined by (6.33), remains invariant if we make a nonlinear transformation of the parameter vector θ → ψ(θ), where the function ψ(·) is invertible and differentiable.

6.14 www Write down the form of the Fisher kernel, defined by (6.33), for the case of a distribution p(x|µ) = N(x|µ, S) that is Gaussian with mean µ and fixed covariance S.

6.15 By considering the determinant of a 2 × 2 Gram matrix, show that a positive definite kernel function k(x, x′) satisfies the Cauchy-Schwarz inequality

k(x_1, x_2)² ≤ k(x_1, x_1) k(x_2, x_2).    (6.96)

6.16 Consider a parametric model governed by the parameter vector w together with a data set of input values x_1, . . .