Bishop C.M., Pattern Recognition and Machine Learning (2006)
From Section 4.5.2, we see that this distribution is Gaussian with mean a_MAP ≡ a(x, w_MAP), and variance

σ_a²(x) = bᵀ(x) A⁻¹ b(x).        (5.188)

Finally, to obtain the predictive distribution, we must marginalize over a using

p(t = 1|x, D) = ∫ σ(a) p(a|x, D) da.        (5.189)

Figure 5.23  An illustration of the Laplace approximation for a Bayesian neural network having 8 hidden units with 'tanh' activation functions and a single logistic-sigmoid output unit. The weight parameters were found using scaled conjugate gradients, and the hyperparameter α was optimized using the evidence framework.
On the left is the result of using the simple approximation (5.185) based on a point estimate w_MAP of the parameters, in which the green curve shows the y = 0.5 decision boundary, and the other contours correspond to output probabilities of y = 0.1, 0.3, 0.7, and 0.9. On the right is the corresponding result obtained using (5.190). Note that the effect of marginalization is to spread out the contours and to make the predictions less confident, so that at each input point x, the posterior probabilities are shifted towards 0.5, while the y = 0.5 contour itself is unaffected.

The convolution of a Gaussian with a logistic sigmoid is intractable. We therefore apply the approximation (4.153) to (5.189), giving

p(t = 1|x, D) = σ(κ(σ_a²) bᵀ w_MAP)        (5.190)

where κ(·) is defined by (4.154). Recall that both σ_a² and b are functions of x. Figure 5.23 shows an example of this framework applied to the synthetic classification data set described in Appendix A.
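To make the pipeline (5.188)–(5.190) concrete, here is a minimal numerical sketch (not from the book; the function and variable names are illustrative). It takes the MAP output activation, the gradient vector b, and the posterior precision matrix A as inputs, computes the variance (5.188), and applies the probit-based correction κ(σ²) = (1 + πσ²/8)^(−1/2) from (4.154).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def kappa(sigma2):
    # Correction factor of (4.154): kappa(sigma^2) = (1 + pi*sigma^2/8)^(-1/2)
    return 1.0 / np.sqrt(1.0 + np.pi * sigma2 / 8.0)

def predictive_prob(a_map, b, A):
    """Approximate p(t=1|x,D) for a Laplace-approximated Bayesian classifier.

    a_map : output activation at the MAP weights (mean of p(a|x,D))
    b     : gradient of a(x, w) with respect to w at w_MAP, shape (W,)
    A     : posterior precision matrix, shape (W, W)
    """
    sigma2_a = b @ np.linalg.solve(A, b)       # variance from (5.188)
    return sigmoid(kappa(sigma2_a) * a_map)    # moderated output, in the spirit of (5.190)

# Toy usage with made-up numbers
rng = np.random.default_rng(0)
W = 5
b = rng.normal(size=W)
A = np.eye(W) * 10.0                           # a well-conditioned posterior precision
print(predictive_prob(a_map=2.0, b=b, A=A))    # pulled towards 0.5 relative to sigmoid(2.0)
```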
Exercises

5.1 Consider a two-layer network function of the form (5.7) in which the hidden unit nonlinear activation functions g(·) are given by logistic sigmoid functions of the form

σ(a) = {1 + exp(−a)}⁻¹.        (5.191)

Show that there exists an equivalent network, which computes exactly the same function, but with hidden unit activation functions given by tanh(a), where the tanh function is defined by (5.59). Hint: first find the relation between σ(a) and tanh(a), and then show that the parameters of the two networks differ by linear transformations.
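A useful starting point for Exercise 5.1 is the identity tanh(a) = 2σ(2a) − 1, equivalently σ(a) = ½ tanh(a/2) + ½. The sketch below (my own check, not part of the book) verifies numerically that a two-layer network with logistic-sigmoid hidden units is reproduced exactly by a tanh network whose first-layer weights and biases are halved, whose second-layer weights are halved, and whose output bias absorbs the remaining constant.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(1)
D, M = 3, 4                                            # input and hidden dimensions (arbitrary)
W1 = rng.normal(size=(M, D)); b1 = rng.normal(size=M)  # first-layer parameters
w2 = rng.normal(size=M);      b2 = rng.normal()        # second-layer parameters

x = rng.normal(size=D)

# Network with logistic-sigmoid hidden units
y_sigmoid = b2 + w2 @ sigmoid(W1 @ x + b1)

# Equivalent network with tanh hidden units:
#   sigma(a) = 0.5 * tanh(a/2) + 0.5, so halve the first-layer parameters,
#   halve the second-layer weights, and absorb the constant into the output bias.
y_tanh = (b2 + 0.5 * w2.sum()) + (0.5 * w2) @ np.tanh(0.5 * (W1 @ x + b1))

print(np.isclose(y_sigmoid, y_tanh))   # True
```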
5.2 www Show that maximizing the likelihood function under the conditional distribution (5.16) for a multioutput neural network is equivalent to minimizing the sum-of-squares error function (5.11).
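A brief note on why Exercise 5.2 works out (my own summary, assuming (5.16) has the usual isotropic form p(t|x, w) = N(t|y(x, w), β⁻¹I)): for N i.i.d. observations the negative log likelihood is

(β/2) Σ_n ‖y(x_n, w) − t_n‖² − (NK/2) ln β + (NK/2) ln(2π),

so for fixed β its minimization with respect to w is the same problem as minimizing the sum-of-squares error (5.11).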
5.3 Consider a regression problem involving multiple target variables in which it is assumed that the distribution of the targets, conditioned on the input vector x, is a Gaussian of the form

p(t|x, w) = N(t|y(x, w), Σ)        (5.192)

where y(x, w) is the output of a neural network with input vector x and weight vector w, and Σ is the covariance of the assumed Gaussian noise on the targets. Given a set of independent observations of x and t, write down the error function that must be minimized in order to find the maximum likelihood solution for w, if we assume that Σ is fixed and known. Now assume that Σ is also to be determined from the data, and write down an expression for the maximum likelihood solution for Σ. Note that the optimizations of w and Σ are now coupled, in contrast to the case of independent target variables discussed in Section 5.2.

5.4 Consider a binary classification problem in which the target values are t ∈ {0, 1}, with a network output y(x, w) that represents p(t = 1|x), and suppose that there is a probability ε that the class label on a training data point has been incorrectly set. Assuming independent and identically distributed data, write down the error function corresponding to the negative log likelihood. Verify that the error function (5.21) is obtained when ε = 0. Note that this error function makes the model robust to incorrectly labelled data, in contrast to the usual error function.

5.5 www Show that maximizing likelihood for a multiclass neural network model in which the network outputs have the interpretation y_k(x, w) = p(t_k = 1|x) is equivalent to the minimization of the cross-entropy error function (5.24).

5.6 www Show that the derivative of the error function (5.21) with respect to the activation a_k for an output unit having a logistic sigmoid activation function satisfies (5.18).

5.7 Show that the derivative of the error function (5.24) with respect to the activation a_k for output units having a softmax activation function satisfies (5.18).
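As a pointer to the manipulation that Exercises 5.6 and 5.7 call for, here is the standard calculation for the binary case (my own working, assuming the usual forms y = σ(a) for the output and E = −Σ_n {t_n ln y_n + (1 − t_n) ln(1 − y_n)} for (5.21)). Using dσ/da = σ(1 − σ),

∂E/∂a_n = −( t_n/y_n − (1 − t_n)/(1 − y_n) ) y_n (1 − y_n) = y_n − t_n,

which is exactly the 'error' form of (5.18). The softmax case of Exercise 5.7 proceeds in the same way using ∂y_k/∂a_j = y_k(I_kj − y_j), where I_kj denotes the identity.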
5.8 We saw in (4.88) that the derivative of the logistic sigmoid activation function can be expressed in terms of the function value itself. Derive the corresponding result for the 'tanh' activation function defined by (5.59).
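For Exercise 5.8, the analogue of (4.88) is the identity (a standard result, stated here for reference)

d/da tanh(a) = 1 − tanh²(a),

which follows directly from writing tanh(a) = (eᵃ − e⁻ᵃ)/(eᵃ + e⁻ᵃ) and differentiating, so the derivative is again expressible through the function value itself.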
5.9 www The error function (5.21) for binary classification problems was derived for a network having a logistic-sigmoid output activation function, so that 0 ≤ y(x, w) ≤ 1, and data having target values t ∈ {0, 1}. Derive the corresponding error function if we consider a network having an output −1 ≤ y(x, w) ≤ 1 and target values t = 1 for class C₁ and t = −1 for class C₂. What would be the appropriate choice of output unit activation function?

5.10 www Consider a Hessian matrix H with eigenvector equation (5.33). By setting the vector v in (5.39) equal to each of the eigenvectors u_i in turn, show that H is positive definite if, and only if, all of its eigenvalues are positive.

5.11 www Consider a quadratic error function defined by (5.32), in which the Hessian matrix H has an eigenvalue equation given by (5.33). Show that the contours of constant error are ellipses whose axes are aligned with the eigenvectors u_i, with lengths that are inversely proportional to the square root of the corresponding eigenvalues λ_i.

5.12 www By considering the local Taylor expansion (5.32) of an error function about a stationary point w*, show that the necessary and sufficient condition for the stationary point to be a local minimum of the error function is that the Hessian matrix H, defined by (5.30) with ŵ = w*, be positive definite.

5.13 Show that as a consequence of the symmetry of the Hessian matrix H, the number of independent elements in the quadratic error function (5.28) is given by W(W + 3)/2.

5.14 By making a Taylor expansion, verify that the terms that are O(ε) cancel on the right-hand side of (5.69).
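Exercises 5.10–5.12 all revolve around the eigenvalue criterion for positive definiteness. A small numerical illustration of the criterion (my own sketch, not from the book):

```python
import numpy as np

def is_positive_definite(H, tol=1e-12):
    """A symmetric matrix is positive definite iff all its eigenvalues are positive."""
    return bool(np.all(np.linalg.eigvalsh(H) > tol))

H_pd = np.array([[2.0, 0.5],
                 [0.5, 1.0]])          # both eigenvalues are positive
H_indef = np.array([[1.0, 2.0],
                    [2.0, 1.0]])       # eigenvalues are 3 and -1

print(is_positive_definite(H_pd))      # True
print(is_positive_definite(H_indef))   # False: v^T H v < 0 along the eigenvector for -1
```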
5.15 In Section 5.3.4, we derived a procedure for evaluating the Jacobian matrix of a neural network using a backpropagation procedure. Derive an alternative formalism for finding the Jacobian based on forward propagation equations.

5.16 The outer product approximation to the Hessian matrix for a neural network using a sum-of-squares error function is given by (5.84). Extend this result to the case of multiple outputs.

5.17 Consider a squared loss function of the form

E = ½ ∫∫ {y(x, w) − t}² p(x, t) dx dt        (5.193)

where y(x, w) is a parametric function such as a neural network. The result (1.89) shows that the function y(x, w) that minimizes this error is given by the conditional expectation of t given x. Use this result to show that the second derivative of E with respect to two elements w_r and w_s of the vector w, is given by

∂²E/∂w_r ∂w_s = ∫ (∂y/∂w_r)(∂y/∂w_s) p(x) dx.        (5.194)

Note that, for a finite sample from p(x), we obtain (5.84).
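Exercises 5.16, 5.19, and 5.20 build on the outer product approximation of (5.84), H ≈ Σ_n b_n b_nᵀ with b_n = ∇_w y(x_n, w). The sketch below (my own illustration; the toy model and the finite-difference gradient are assumptions made for brevity) assembles this approximation for a single-output model.

```python
import numpy as np

def model(w, x):
    # Toy single-output 'network': y = w2 . tanh(W1 x), with the weights packed in a flat vector w.
    W1 = w[:6].reshape(2, 3)      # 2 hidden units, 3 inputs
    w2 = w[6:]
    return w2 @ np.tanh(W1 @ x)

def grad_y(w, x, eps=1e-6):
    # Finite-difference gradient of the output with respect to the weights (the vector b_n in (5.84)).
    g = np.zeros_like(w)
    for i in range(len(w)):
        dw = np.zeros_like(w)
        dw[i] = eps
        g[i] = (model(w + dw, x) - model(w - dw, x)) / (2 * eps)
    return g

rng = np.random.default_rng(3)
w = rng.normal(size=8)
X = rng.normal(size=(20, 3))      # 20 input patterns

# Outer product approximation: H ~ sum_n b_n b_n^T
H_approx = sum(np.outer(b, b) for b in (grad_y(w, x) for x in X))
print(H_approx.shape)             # (8, 8); symmetric and positive semidefinite
```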
5.18 Consider a two-layer network of the form shown in Figure 5.1 with the addition of extra parameters corresponding to skip-layer connections that go directly from the inputs to the outputs. By extending the discussion of Section 5.3.2, write down the equations for the derivatives of the error function with respect to these additional parameters.

5.19 www Derive the expression (5.85) for the outer product approximation to the Hessian matrix for a network having a single output with a logistic sigmoid output-unit activation function and a cross-entropy error function, corresponding to the result (5.84) for the sum-of-squares error function.

5.20 Derive an expression for the outer product approximation to the Hessian matrix for a network having K outputs with a softmax output-unit activation function and a cross-entropy error function, corresponding to the result (5.84) for the sum-of-squares error function.

5.21 Extend the expression (5.86) for the outer product approximation of the Hessian matrix to the case of K > 1 output units. Hence, derive a recursive expression analogous to (5.87) for incrementing the number N of patterns and a similar expression for incrementing the number K of outputs. Use these results, together with the identity (5.88), to find sequential update expressions analogous to (5.89) for finding the inverse of the Hessian by incrementally including both extra patterns and extra outputs.
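The sequential updates asked for in Exercise 5.21 rest on a rank-one matrix inversion identity of the Sherman–Morrison type, which I believe is what the identity (5.88) refers to: (M + vvᵀ)⁻¹ = M⁻¹ − (M⁻¹v)(vᵀM⁻¹)/(1 + vᵀM⁻¹v). A quick numerical check of the identity (my own sketch):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5
B = rng.normal(size=(n, n))
M = B @ B.T + np.eye(n)                # an invertible (here positive definite) matrix
v = rng.normal(size=n)

M_inv = np.linalg.inv(M)
# Sherman-Morrison: rank-one update of the inverse
updated_inv = M_inv - np.outer(M_inv @ v, v @ M_inv) / (1.0 + v @ M_inv @ v)

print(np.allclose(updated_inv, np.linalg.inv(M + np.outer(v, v))))   # True
```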
5.22 Derive the results (5.93), (5.94), and (5.95) for the elements of the Hessian matrix of a two-layer feed-forward network by application of the chain rule of calculus.

5.23 Extend the results of Section 5.4.5 for the exact Hessian of a two-layer network to include skip-layer connections that go directly from inputs to outputs.

5.24 Verify that the network function defined by (5.113) and (5.114) is invariant under the transformation (5.115) applied to the inputs, provided the weights and biases are simultaneously transformed using (5.116) and (5.117).
Similarly, show that the network outputs can be transformed according to (5.118) by applying the transformation (5.119) and (5.120) to the second-layer weights and biases.

5.25 www Consider a quadratic error function of the form

E = E₀ + ½ (w − w*)ᵀ H (w − w*)        (5.195)

where w* represents the minimum, and the Hessian matrix H is positive definite and constant. Suppose the initial weight vector w^(0) is chosen to be at the origin and is updated using simple gradient descent

w^(τ) = w^(τ−1) − ρ∇E        (5.196)

where τ denotes the step number, and ρ is the learning rate (which is assumed to be small).
Show that, after τ steps, the components of the weight vector parallel to the eigenvectors of H can be written

w_j^(τ) = {1 − (1 − ρη_j)^τ} w_j*        (5.197)

where w_j = wᵀ u_j, and u_j and η_j are the eigenvectors and eigenvalues, respectively, of H so that

H u_j = η_j u_j.        (5.198)

Show that as τ → ∞, this gives w^(τ) → w* as expected, provided |1 − ρη_j| < 1. Now suppose that training is halted after a finite number τ of steps. Show that the components of the weight vector parallel to the eigenvectors of the Hessian satisfy

w_j^(τ) ≃ w_j*          when η_j ≫ (ρτ)⁻¹        (5.199)
|w_j^(τ)| ≪ |w_j*|      when η_j ≪ (ρτ)⁻¹.        (5.200)

Compare this result with the discussion in Section 3.5.3 of regularization with simple weight decay, and hence show that (ρτ)⁻¹ is analogous to the regularization parameter λ. The above results also show that the effective number of parameters in the network, as defined by (3.91), grows as the training progresses.
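The closed form (5.197) can also be checked numerically. The sketch below (my own illustration, not from the book) runs gradient descent from the origin on a quadratic error with a randomly generated positive definite H and compares the iterate with {1 − (1 − ρη_j)^τ} w_j* expressed in the eigenbasis of H.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4
B = rng.normal(size=(n, n))
H = B @ B.T + np.eye(n)                  # positive definite Hessian
w_star = rng.normal(size=n)              # location of the minimum
rho = 0.05                               # small learning rate
tau = 50

# Gradient descent on E(w) = E0 + 0.5 (w - w*)^T H (w - w*), starting at the origin
w = np.zeros(n)
for _ in range(tau):
    w = w - rho * H @ (w - w_star)       # gradient of the quadratic error

# Closed form (5.197) in the eigenbasis of H
eta, U = np.linalg.eigh(H)               # columns of U are the eigenvectors u_j
w_closed = U @ ((1.0 - (1.0 - rho * eta) ** tau) * (U.T @ w_star))

print(np.allclose(w, w_closed))          # True
```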
5.26 Consider a multilayer perceptron with arbitrary feed-forward topology, which is to be trained by minimizing the tangent propagation error function (5.127) in which the regularizing function is given by (5.128). Show that the regularization term Ω can be written as a sum over patterns of terms of the form

Ω_n = ½ Σ_k (G y_k)²        (5.201)

where G is a differential operator defined by

G ≡ Σ_i τ_i ∂/∂x_i.        (5.202)

By acting on the forward propagation equations

z_j = h(a_j),        a_j = Σ_i w_ji z_i        (5.203)

with the operator G, show that Ω_n can be evaluated by forward propagation using the following equations:

α_j = h′(a_j) β_j,        β_j = Σ_i w_ji α_i        (5.204)

where we have defined the new variables

α_j ≡ G z_j,        β_j ≡ G a_j.        (5.205)

Now show that the derivatives of Ω_n with respect to a weight w_rs in the network can be written in the form

∂Ω_n/∂w_rs = Σ_k α_k {φ_kr z_s + δ_kr α_s}        (5.206)

where we have defined

δ_kr ≡ ∂y_k/∂a_r,        φ_kr ≡ G δ_kr.        (5.207)

Write down the backpropagation equations for δ_kr, and hence derive a set of backpropagation equations for the evaluation of the φ_kr.

5.27 www Consider the framework for training with transformed data in the special case in which the transformation consists simply of the addition of random noise x → x + ξ where ξ has a Gaussian distribution with zero mean and unit covariance.