Bishop C.M., Pattern Recognition and Machine Learning (2006)
This gives

$$
\frac{\partial^2 E}{\partial w_{ji}\,\partial w_{lk}}
= \frac{1}{2\epsilon}\left\{
\frac{\partial E}{\partial w_{ji}}(w_{lk}+\epsilon)
- \frac{\partial E}{\partial w_{ji}}(w_{lk}-\epsilon)
\right\} + O(\epsilon^{2}).
\tag{5.91}
$$

Because there are now only W weights to be perturbed, and because the gradients can be evaluated in O(W) steps, we see that this method gives the Hessian in O(W^2) operations.
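As an illustration of this scheme, the following sketch (not from the book; the quadratic test function is purely hypothetical) applies central differences of the form (5.91) to a user-supplied gradient routine, perturbing one weight at a time:

```python
import numpy as np

def hessian_central_diff(grad, w, eps=1e-5):
    """Approximate the Hessian via central differences of the gradient,
    following (5.91): each of the W weights is perturbed by +/- eps and the
    analytically computed gradient is re-evaluated, giving O(W^2) cost."""
    W = w.size
    H = np.zeros((W, W))
    for k in range(W):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[k] += eps
        w_minus[k] -= eps
        H[:, k] = (grad(w_plus) - grad(w_minus)) / (2.0 * eps)
    return 0.5 * (H + H.T)   # symmetrize, since the exact Hessian is symmetric

# Hypothetical test: a quadratic error E(w) = 0.5 w^T A w has Hessian A.
A = np.array([[2.0, 0.3], [0.3, 1.0]])
print(hessian_central_diff(lambda w: A @ w, np.zeros(2)))
```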
5.4.5 Exact evaluation of the Hessian

So far, we have considered various approximation schemes for evaluating the Hessian matrix or its inverse. The Hessian can also be evaluated exactly, for a network of arbitrary feed-forward topology, using an extension of the technique of backpropagation used to evaluate first derivatives, which shares many of its desirable features, including computational efficiency (Bishop, 1991; Bishop, 1992). It can be applied to any differentiable error function that can be expressed as a function of the network outputs and to networks having arbitrary differentiable activation functions.
The number of computational steps needed to evaluate the Hessian scales like O(W^2). Similar algorithms have also been considered by Buntine and Weigend (1993).

Here we consider the specific case of a network having two layers of weights, for which the required equations are easily derived (Exercise 5.22). We shall use indices i and i' to denote inputs, indices j and j' to denote hidden units, and indices k and k' to denote outputs. We first define

$$
\delta_k = \frac{\partial E_n}{\partial a_k}, \qquad
M_{kk'} \equiv \frac{\partial^2 E_n}{\partial a_k\,\partial a_{k'}}
\tag{5.92}
$$

where E_n is the contribution to the error from data point n. The Hessian matrix for this network can then be considered in three separate blocks, as follows.
1. Both weights in the second layer:

$$
\frac{\partial^2 E_n}{\partial w^{(2)}_{kj}\,\partial w^{(2)}_{k'j'}} = z_j z_{j'} M_{kk'}.
\tag{5.93}
$$

2. Both weights in the first layer:

$$
\frac{\partial^2 E_n}{\partial w^{(1)}_{ji}\,\partial w^{(1)}_{j'i'}}
= x_i x_{i'}\, h''(a_{j'})\, I_{jj'} \sum_k w^{(2)}_{kj'}\,\delta_k
+ x_i x_{i'}\, h'(a_{j'})\, h'(a_j) \sum_k \sum_{k'} w^{(2)}_{k'j'} w^{(2)}_{kj}\, M_{kk'}.
\tag{5.94}
$$

3. One weight in each layer:

$$
\frac{\partial^2 E_n}{\partial w^{(1)}_{ji}\,\partial w^{(2)}_{kj'}}
= x_i\, h'(a_j)\left\{ \delta_k I_{jj'} + z_{j'} \sum_{k'} w^{(2)}_{k'j}\, M_{kk'} \right\}.
\tag{5.95}
$$

Here I_{jj'} is the j, j' element of the identity matrix. If one or both of the weights is a bias term, then the corresponding expressions are obtained simply by setting the appropriate activation(s) to 1.
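A minimal NumPy sketch of these three blocks is given below, assuming a tanh hidden layer, linear output units and a sum-of-squares error, so that δ_k = y_k − t_k and M_{kk'} reduces to the identity matrix; the function name and array layout are illustrative rather than taken from the book.

```python
import numpy as np

def exact_hessian_blocks(x, t, W1, W2):
    """Per-pattern Hessian blocks (5.93)-(5.95) for a two-layer tanh network
    with linear outputs and a sum-of-squares error, so M_{kk'} is the identity."""
    a = W1 @ x                     # first-layer pre-activations a_j
    z = np.tanh(a)                 # hidden-unit outputs z_j = h(a_j)
    y = W2 @ z                     # linear outputs, a_k = y_k
    delta = y - t                  # delta_k = dE_n/da_k
    K, J = W2.shape
    M = np.eye(K)                  # M_{kk'} = d^2 E_n / (da_k da_k')
    hp = 1.0 - z ** 2              # h'(a_j) for tanh
    hpp = -2.0 * z * hp            # h''(a_j) for tanh

    # (5.93): both weights in the second layer; indices (k, j, k', j')
    H22 = np.einsum('j,J,kK->kjKJ', z, z, M)

    # (5.94): both weights in the first layer; indices (j, i, j', i')
    b = W2.T @ delta               # sum_k w2_{kj} delta_k
    C = W2.T @ M @ W2              # sum_{k,k'} w2_{kj} M_{kk'} w2_{k'j'}
    inner = np.diag(hpp * b) + np.outer(hp, hp) * C
    H11 = np.einsum('jJ,i,I->jiJI', inner, x, x)

    # (5.95): one weight in each layer; indices (j, i, k, j')
    MW2 = M @ W2                   # sum_{k'} M_{kk'} w2_{k'j}
    mixed = (np.einsum('k,jJ->jkJ', delta, np.eye(J)) +
             np.einsum('J,kj->jkJ', z, MW2))
    H12 = np.einsum('i,j,jkJ->jikJ', x, hp, mixed)
    return H22, H11, H12
```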
Inclusion of skip-layer connections is straightforward (Exercise 5.23).

5.4.6 Fast multiplication by the Hessian

For many applications of the Hessian, the quantity of interest is not the Hessian matrix H itself but the product of H with some vector v. We have seen that the evaluation of the Hessian takes O(W^2) operations, and it also requires storage that is O(W^2). The vector v^T H that we wish to calculate, however, has only W elements, so instead of computing the Hessian as an intermediate step, we can instead try to find an efficient approach to evaluating v^T H directly in a way that requires only O(W) operations.

To do this, we first note that

$$
v^{\mathrm T} H = v^{\mathrm T} \nabla(\nabla E)
\tag{5.96}
$$

where ∇ denotes the gradient operator in weight space. We can then write down the standard forward-propagation and backpropagation equations for the evaluation of ∇E and apply (5.96) to these equations to give a set of forward-propagation and backpropagation equations for the evaluation of v^T H (Møller, 1993; Pearlmutter, 1994).
This corresponds to acting on the original forward-propagation and backpropagation equations with a differential operator v^T ∇. Pearlmutter (1994) used the notation R{·} to denote the operator v^T ∇, and we shall follow this convention. The analysis is straightforward and makes use of the usual rules of differential calculus, together with the result

$$
R\{w\} = v.
\tag{5.97}
$$

The technique is best illustrated with a simple example, and again we choose a two-layer network of the form shown in Figure 5.1, with linear output units and a sum-of-squares error function. As before, we consider the contribution to the error function from one pattern in the data set. The required vector is then obtained as usual by summing over the contributions from each of the patterns separately.
For the two-layer network, the forward-propagation equations are given by

$$
a_j = \sum_i w_{ji} x_i
\tag{5.98}
$$

$$
z_j = h(a_j)
\tag{5.99}
$$

$$
y_k = \sum_j w_{kj} z_j.
\tag{5.100}
$$

We now act on these equations using the R{·} operator to obtain a set of forward-propagation equations in the form

$$
R\{a_j\} = \sum_i v_{ji} x_i
\tag{5.101}
$$

$$
R\{z_j\} = h'(a_j)\, R\{a_j\}
\tag{5.102}
$$

$$
R\{y_k\} = \sum_j w_{kj} R\{z_j\} + \sum_j v_{kj} z_j
\tag{5.103}
$$

where v_{ji} is the element of the vector v that corresponds to the weight w_{ji}. Quantities of the form R{z_j}, R{a_j} and R{y_k} are to be regarded as new variables whose values are found using the above equations.

Because we are considering a sum-of-squares error function, we have the following standard backpropagation expressions:

$$
\delta_k = y_k - t_k
\tag{5.104}
$$

$$
\delta_j = h'(a_j) \sum_k w_{kj}\delta_k.
\tag{5.105}
$$

Again, we act on these equations with the R{·} operator to obtain a set of backpropagation equations in the form

$$
R\{\delta_k\} = R\{y_k\}
\tag{5.106}
$$

$$
R\{\delta_j\} = h''(a_j)\, R\{a_j\} \sum_k w_{kj}\delta_k
+ h'(a_j) \sum_k v_{kj}\delta_k
+ h'(a_j) \sum_k w_{kj} R\{\delta_k\}.
\tag{5.107}
$$

Finally, we have the usual equations for the first derivatives of the error

$$
\frac{\partial E}{\partial w_{kj}} = \delta_k z_j
\tag{5.108}
$$

$$
\frac{\partial E}{\partial w_{ji}} = \delta_j x_i
\tag{5.109}
$$
and acting on these with the R{·} operator, we obtain expressions for the elements of the vector v^T H

$$
R\!\left\{\frac{\partial E}{\partial w_{kj}}\right\} = R\{\delta_k\}\, z_j + \delta_k\, R\{z_j\}
\tag{5.110}
$$

$$
R\!\left\{\frac{\partial E}{\partial w_{ji}}\right\} = x_i\, R\{\delta_j\}.
\tag{5.111}
$$

The implementation of this algorithm involves the introduction of additional variables R{a_j}, R{z_j} and R{δ_j} for the hidden units and R{δ_k} and R{y_k} for the output units. For each input pattern, the values of these quantities can be found using the above results, and the elements of v^T H are then given by (5.110) and (5.111). An elegant aspect of this technique is that the equations for evaluating v^T H mirror closely those for standard forward and backward propagation, and so the extension of existing software to compute this product is typically straightforward.
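A compact sketch of equations (5.98)–(5.111) for a single pattern is shown below, again assuming a tanh hidden layer, linear outputs and a sum-of-squares error (so that h''(a) = −2 z h'(a)); the names V1 and V2 simply hold the elements of v arranged like the corresponding weight matrices.

```python
import numpy as np

def hessian_vector_product(x, t, W1, W2, V1, V2):
    """R{.}-operator evaluation of v^T H for one pattern, for a two-layer
    tanh network with linear outputs and a sum-of-squares error; V1, V2
    hold the elements of v arranged like W1 and W2."""
    # Forward propagation, (5.98)-(5.100)
    a = W1 @ x
    z = np.tanh(a)
    y = W2 @ z

    # R-forward propagation, (5.101)-(5.103)
    hp = 1.0 - z ** 2              # h'(a_j)
    Ra = V1 @ x
    Rz = hp * Ra
    Ry = W2 @ Rz + V2 @ z

    # Standard backpropagation, (5.104)-(5.105)
    dk = y - t
    dj = hp * (W2.T @ dk)          # delta_j, (5.105); not needed for v^T H below

    # R-backpropagation, (5.106)-(5.107); h''(a_j) = -2 z_j h'(a_j) for tanh
    hpp = -2.0 * z * hp
    Rdk = Ry
    Rdj = hpp * Ra * (W2.T @ dk) + hp * (V2.T @ dk) + hp * (W2.T @ Rdk)

    # Elements of v^T H, (5.110)-(5.111), laid out like W2 and W1
    vH_W2 = np.outer(Rdk, z) + np.outer(dk, Rz)
    vH_W1 = np.outer(Rdj, x)
    return vH_W1, vH_W2
```

Setting (V1, V2) to a unit vector extracts a single column of H, which is the usage described in the next paragraph.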
. . , 1, . . . , 0) each of which picks out one column of the Hessian. Thisleads to a formalism that is analytically equivalent to the backpropagation procedureof Bishop (1992), as described in Section 5.4.5, though with some loss of efficiencydue to redundant calculations.5.5. Regularization in Neural NetworksThe number of input and outputs units in a neural network is generally determinedby the dimensionality of the data set, whereas the number M of hidden units is a freeparameter that can be adjusted to give the best predictive performance.
Note that M controls the number of parameters (weights and biases) in the network, and so we might expect that in a maximum likelihood setting there will be an optimum value of M that gives the best generalization performance, corresponding to the optimum balance between under-fitting and over-fitting. Figure 5.9 shows an example of the effect of different values of M for the sinusoidal regression problem.

The generalization error, however, is not a simple function of M due to the presence of local minima in the error function, as illustrated in Figure 5.10. Here we see the effect of choosing multiple random initializations for the weight vector for a range of values of M.
The overall best validation set performance in this case occurred for a particular solution having M = 8. In practice, one approach to choosing M is in fact to plot a graph of the kind shown in Figure 5.10 and then to choose the specific solution having the smallest validation set error.
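The procedure can be sketched as follows. This toy example is not the book's experiment: it uses synthetic sinusoidal data, plain batch gradient descent in place of the scaled conjugate-gradient optimizer mentioned in the figure captions, and only a few restarts per network size; all names and settings are chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic sinusoidal data (a stand-in for the book's data set)
X = rng.uniform(0.0, 1.0, size=60)
T = np.sin(2 * np.pi * X) + rng.normal(scale=0.2, size=60)
Xtr, Ttr, Xva, Tva = X[:30], T[:30], X[30:], T[30:]

def init(M):
    # isotropic Gaussian initialization, as in the caption of Figure 5.10
    return [rng.normal(0.0, np.sqrt(10.0), size=s) for s in ((M,), (M,), (M,), (1,))]

def forward(params, x):
    w1, b1, w2, b2 = params
    z = np.tanh(np.outer(x, w1) + b1)       # hidden activations, one row per input
    return z @ w2 + b2, z

def train(params, x, t, lr=0.01, steps=5000):
    w1, b1, w2, b2 = params                  # plain batch gradient descent
    for _ in range(steps):
        y, z = forward((w1, b1, w2, b2), x)
        d = (y - t) / len(x)                 # output-unit errors
        dz = np.outer(d, w2) * (1 - z ** 2)  # hidden-unit errors
        w2 -= lr * (z.T @ d);  b2 -= lr * d.sum()
        w1 -= lr * (dz.T @ x); b1 -= lr * dz.sum(axis=0)
    return w1, b1, w2, b2

# Several random restarts for each M; keep the solution with the
# smallest validation-set error, as described in the text.
candidates = [(M, train(init(M), Xtr, Ttr)) for M in (1, 3, 10) for _ in range(5)]
best_M, best_params = min(
    candidates, key=lambda c: np.mean((forward(c[1], Xva)[0] - Tva) ** 2))
print("selected M:", best_M)
```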
Figure 5.9  Examples of two-layer networks trained on 10 data points drawn from the sinusoidal data set. The graphs show the result of fitting networks having M = 1, 3 and 10 hidden units, respectively, by minimizing a sum-of-squares error function using a scaled conjugate-gradient algorithm.

There are, however, other ways to control the complexity of a neural network model in order to avoid over-fitting. From our discussion of polynomial curve fitting in Chapter 1, we see that an alternative approach is to choose a relatively large value for M and then to control complexity by the addition of a regularization term to the error function. The simplest regularizer is the quadratic, giving a regularized error of the form

$$
\widetilde{E}(\mathbf{w}) = E(\mathbf{w}) + \frac{\lambda}{2}\mathbf{w}^{\mathrm T}\mathbf{w}.
\tag{5.112}
$$

This regularizer is also known as weight decay and has been discussed at length in Chapter 3. The effective model complexity is then determined by the choice of the regularization coefficient λ. As we have seen previously, this regularizer can be interpreted as the negative logarithm of a zero-mean Gaussian prior distribution over the weight vector w.
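In code, the change is a one-line wrapper around whatever routine already returns the unregularized error and gradient (the name error_and_grad below is a placeholder); the extra gradient term λw is what gives weight decay its name.

```python
import numpy as np

def regularized_error_and_grad(error_and_grad, w, lam):
    """Wrap a routine returning E(w) and grad E(w) (e.g. from backpropagation)
    with the quadratic penalty of (5.112); the added gradient term lam * w
    shrinks every weight a little at each update."""
    E, g = error_and_grad(w)
    return E + 0.5 * lam * (w @ w), g + lam * w
```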
To illustrate this,consider a multilayer perceptron network having two layers of weights and linearoutput units, which performs a mapping from a set of input variables {xi } to a setof output variables {yk }. The activations of the hidden units in the first hidden layerFigure 5.10Plot of the sum-of-squares test-seterror for the polynomial data set versus the number of hidden units in thenetwork, with 30 random starts foreach network size, showing the effect of local minima. For each newstart, the weight vector was initialized by sampling from an isotropicGaussian distribution having a meanof zero and a variance of 10.160140120100806002468102585.
take the form

$$
z_j = h\!\left(\sum_i w_{ji} x_i + w_{j0}\right)
\tag{5.113}
$$

while the activations of the output units are given by

$$
y_k = \sum_j w_{kj} z_j + w_{k0}.
\tag{5.114}
$$

Suppose we perform a linear transformation of the input data of the form

$$
x_i \;\rightarrow\; \widetilde{x}_i = a x_i + b.
\tag{5.115}
$$

Then we can arrange for the mapping performed by the network to be unchanged by making a corresponding linear transformation of the weights and biases from the inputs to the units in the hidden layer of the form

$$
w_{ji} \;\rightarrow\; \widetilde{w}_{ji} = \frac{1}{a}\, w_{ji}
\tag{5.116}
$$

$$
w_{j0} \;\rightarrow\; \widetilde{w}_{j0} = w_{j0} - \frac{b}{a}\sum_i w_{ji}.
\tag{5.117}
$$

Similarly, a linear transformation of the output variables of the network of the form

$$
y_k \;\rightarrow\; \widetilde{y}_k = c y_k + d
\tag{5.118}
$$

can be achieved by making a transformation of the second-layer weights and biases using

$$
w_{kj} \;\rightarrow\; \widetilde{w}_{kj} = c\, w_{kj}
\tag{5.119}
$$

$$
w_{k0} \;\rightarrow\; \widetilde{w}_{k0} = c\, w_{k0} + d.
\tag{5.120}
$$

If we train one network using the original data and one network using data for which the input and/or target variables are transformed by one of the above linear transformations, then consistency requires that we should obtain equivalent networks that differ only by the linear transformation of the weights as given (Exercise 5.24).
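This equivalence is easy to verify numerically. The sketch below (illustrative names, arbitrary transformation constants, and tanh chosen for h) applies (5.115)–(5.120) to a small network and checks that the outputs transform exactly according to (5.118).

```python
import numpy as np

rng = np.random.default_rng(0)
D, J, K = 3, 5, 2                                  # illustrative sizes
W1, b1 = rng.normal(size=(J, D)), rng.normal(size=J)
W2, b2 = rng.normal(size=(K, J)), rng.normal(size=K)

def net(x, W1, b1, W2, b2):
    return W2 @ np.tanh(W1 @ x + b1) + b2          # eqs (5.113)-(5.114)

a, b, c, d = 2.0, -0.5, 3.0, 1.0                   # transformation constants
x = rng.normal(size=D)
x_tilde = a * x + b                                # transformed inputs, (5.115)

W1t = W1 / a                                       # (5.116)
b1t = b1 - (b / a) * W1.sum(axis=1)                # (5.117)
W2t = c * W2                                       # (5.119)
b2t = c * b2 + d                                   # (5.120)

y = net(x, W1, b1, W2, b2)
y_tilde = net(x_tilde, W1t, b1t, W2t, b2t)
print(np.allclose(y_tilde, c * y + d))             # True: outputs obey (5.118)
```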
Any regularizer should be consistent with this property, otherwise it arbitrarily favours one solution over another, equivalent one. Clearly, simple weight decay (5.112), which treats all weights and biases on an equal footing, does not satisfy this property.

We therefore look for a regularizer which is invariant under the linear transformations (5.116), (5.117), (5.119) and (5.120). These require that the regularizer should be invariant to re-scaling of the weights and to shifts of the biases. Such a regularizer is given by

$$
\frac{\lambda_1}{2}\sum_{w\in\mathcal{W}_1} w^2
+ \frac{\lambda_2}{2}\sum_{w\in\mathcal{W}_2} w^2
\tag{5.121}
$$

where W1 denotes the set of weights in the first layer, W2 denotes the set of weights in the second layer, and biases are excluded from the summations. This regularizer