Bishop C.M., Pattern Recognition and Machine Learning (2006)
Figure 5.5: Geometrical view of the error function E(w) as a surface sitting over weight space. Point w_A is a local minimum and w_B is the global minimum. At any point w_C, the local gradient of the error surface is given by the vector ∇E.

Following the discussion of Section 4.3.4, we see that the output unit activation function, which corresponds to the canonical link, is given by the softmax function

\[ y_k(\mathbf{x}, \mathbf{w}) = \frac{\exp(a_k(\mathbf{x}, \mathbf{w}))}{\sum_j \exp(a_j(\mathbf{x}, \mathbf{w}))} \tag{5.25} \]

which satisfies 0 ⩽ y_k ⩽ 1 and Σ_k y_k = 1 (Exercise 5.7).
Note that the y_k(x, w) are unchanged if a constant is added to all of the a_k(x, w), causing the error function to be constant for some directions in weight space. This degeneracy is removed if an appropriate regularization term (Section 5.5) is added to the error function. Once again, the derivative of the error function with respect to the activation for a particular output unit takes the familiar form (5.18).

In summary, there is a natural choice of both output unit activation function and matching error function, according to the type of problem being solved. For regression we use linear outputs and a sum-of-squares error, for (multiple independent) binary classifications we use logistic sigmoid outputs and a cross-entropy error function, and for multiclass classification we use softmax outputs with the corresponding multiclass cross-entropy error function.
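As an illustration, here is a minimal NumPy sketch (not from the book) of the softmax output of (5.25). It uses the standard max-subtraction trick for numerical stability, and checks that the outputs sum to one and are invariant to adding a constant to all activations, as just noted.

```python
import numpy as np

def softmax(a):
    """Softmax outputs y_k = exp(a_k) / sum_j exp(a_j), as in (5.25).
    Subtracting the maximum activation leaves the result unchanged
    (shift invariance) while avoiding overflow in exp."""
    a = np.asarray(a, dtype=float)
    exp_a = np.exp(a - np.max(a))
    return exp_a / np.sum(exp_a)

a = np.array([2.0, 1.0, 0.1])
y = softmax(a)
print(y)                                  # e.g. [0.659 0.242 0.099]
print(np.isclose(np.sum(y), 1.0))         # outputs sum to one
print(np.allclose(y, softmax(a + 5.0)))   # unchanged by a constant shift
```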
For classification problems involving two classes, we can use a single logistic sigmoid output, or alternatively we can use a network with two outputs having a softmax output activation function.

5.2.1 Parameter optimization

We turn next to the task of finding a weight vector w which minimizes the chosen function E(w). At this point, it is useful to have a geometrical picture of the error function, which we can view as a surface sitting over weight space, as shown in Figure 5.5.
First note that if we make a small step in weight space from w to w + δw then the change in the error function is δE ≃ δw^T ∇E(w), where the vector ∇E(w) points in the direction of greatest rate of increase of the error function. Because the error E(w) is a smooth continuous function of w, its smallest value will occur at a point in weight space such that the gradient of the error function vanishes, so that

\[ \nabla E(\mathbf{w}) = \mathbf{0} \tag{5.26} \]

as otherwise we could make a small step in the direction of −∇E(w) and thereby further reduce the error. Points at which the gradient vanishes are called stationary points, and may be further classified into minima, maxima, and saddle points.

Our goal is to find a vector w such that E(w) takes its smallest value.
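To make the distinction between these kinds of stationary point concrete, the following small sketch (my own illustration, not from the book, anticipating the Hessian analysis of Section 5.2.2) examines two toy error surfaces: the gradient vanishes at the origin for both, but the eigenvalues of the second-derivative (Hessian) matrix show that one point is a minimum while the other is a saddle point.

```python
import numpy as np

# Two toy error surfaces with a stationary point at the origin:
#   E1(w) = w1**2 + 2*w2**2   -> minimum (both Hessian eigenvalues positive)
#   E2(w) = w1**2 - w2**2     -> saddle point (eigenvalues of mixed sign)
H1 = np.array([[2.0, 0.0], [0.0, 4.0]])   # Hessian of E1
H2 = np.array([[2.0, 0.0], [0.0, -2.0]])  # Hessian of E2

for name, H in [("E1", H1), ("E2", H2)]:
    eigvals = np.linalg.eigvalsh(H)
    kind = ("minimum" if np.all(eigvals > 0)
            else "maximum" if np.all(eigvals < 0)
            else "saddle point")
    print(name, eigvals, "->", kind)
```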
However, the error function typically has a highly nonlinear dependence on the weights and bias parameters, and so there will be many points in weight space at which the gradient vanishes (or is numerically very small). Indeed, from the discussion in Section 5.1.1 we see that for any point w that is a local minimum, there will be other points in weight space that are equivalent minima.
For instance, in a two-layer network of the kind shown in Figure 5.1, with M hidden units, each point in weight space is a member of a family of M!·2^M equivalent points.

Furthermore, there will typically be multiple inequivalent stationary points and in particular multiple inequivalent minima. A minimum that corresponds to the smallest value of the error function for any weight vector is said to be a global minimum.
Any other minima corresponding to higher values of the error function are said to be local minima. For a successful application of neural networks, it may not be necessary to find the global minimum (and in general it will not be known whether the global minimum has been found) but it may be necessary to compare several local minima in order to find a sufficiently good solution.

Because there is clearly no hope of finding an analytical solution to the equation ∇E(w) = 0 we resort to iterative numerical procedures. The optimization of continuous nonlinear functions is a widely studied problem and there exists an extensive literature on how to solve it efficiently.
Most techniques involve choosing some initial value w^(0) for the weight vector and then moving through weight space in a succession of steps of the form

\[ \mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} + \Delta\mathbf{w}^{(\tau)} \tag{5.27} \]

where τ labels the iteration step. Different algorithms involve different choices for the weight vector update Δw^(τ). Many algorithms make use of gradient information and therefore require that, after each update, the value of ∇E(w) is evaluated at the new weight vector w^(τ+1). In order to understand the importance of gradient information, it is useful to consider a local approximation to the error function based on a Taylor expansion.

5.2.2 Local quadratic approximation

Insight into the optimization problem, and into the various techniques for solving it, can be obtained by considering a local quadratic approximation to the error function.
Consider the Taylor expansion of E(w) around some point ŵ in weight space

\[ E(\mathbf{w}) \simeq E(\widehat{\mathbf{w}}) + (\mathbf{w} - \widehat{\mathbf{w}})^{\mathrm{T}} \mathbf{b} + \frac{1}{2} (\mathbf{w} - \widehat{\mathbf{w}})^{\mathrm{T}} \mathbf{H} (\mathbf{w} - \widehat{\mathbf{w}}) \tag{5.28} \]

where cubic and higher terms have been omitted. Here b is defined to be the gradient of E evaluated at ŵ

\[ \mathbf{b} \equiv \left. \nabla E \right|_{\mathbf{w} = \widehat{\mathbf{w}}} \tag{5.29} \]

and the Hessian matrix H = ∇∇E has elements

\[ (\mathbf{H})_{ij} \equiv \left. \frac{\partial^2 E}{\partial w_i \, \partial w_j} \right|_{\mathbf{w} = \widehat{\mathbf{w}}}. \tag{5.30} \]

From (5.28), the corresponding local approximation to the gradient is given by

\[ \nabla E \simeq \mathbf{b} + \mathbf{H}(\mathbf{w} - \widehat{\mathbf{w}}). \tag{5.31} \]

For points w that are sufficiently close to ŵ, these expressions will give reasonable approximations for the error and its gradient.

Consider the particular case of a local quadratic approximation around a point w⋆ that is a minimum of the error function. In this case there is no linear term, because ∇E = 0 at w⋆, and (5.28) becomes

\[ E(\mathbf{w}) = E(\mathbf{w}^{\star}) + \frac{1}{2} (\mathbf{w} - \mathbf{w}^{\star})^{\mathrm{T}} \mathbf{H} (\mathbf{w} - \mathbf{w}^{\star}) \tag{5.32} \]

where the Hessian H is evaluated at w⋆. In order to interpret this geometrically, consider the eigenvalue equation for the Hessian matrix

\[ \mathbf{H} \mathbf{u}_i = \lambda_i \mathbf{u}_i \tag{5.33} \]

where the eigenvectors u_i form a complete orthonormal set (Appendix C) so that

\[ \mathbf{u}_i^{\mathrm{T}} \mathbf{u}_j = \delta_{ij}. \tag{5.34} \]

We now expand (w − w⋆) as a linear combination of the eigenvectors in the form

\[ \mathbf{w} - \mathbf{w}^{\star} = \sum_i \alpha_i \mathbf{u}_i. \tag{5.35} \]

This can be regarded as a transformation of the coordinate system in which the origin is translated to the point w⋆, and the axes are rotated to align with the eigenvectors (through the orthogonal matrix whose columns are the u_i), and is discussed in more detail in Appendix C.
Substituting (5.35) into (5.32), and using (5.33) and (5.34), allows the error function to be written in the form

\[ E(\mathbf{w}) = E(\mathbf{w}^{\star}) + \frac{1}{2} \sum_i \lambda_i \alpha_i^2. \tag{5.36} \]

A matrix H is said to be positive definite if, and only if,

\[ \mathbf{v}^{\mathrm{T}} \mathbf{H} \mathbf{v} > 0 \quad \text{for all } \mathbf{v}. \tag{5.37} \]

Figure 5.6: In the neighbourhood of a minimum w⋆, the error function can be approximated by a quadratic. Contours of constant error are then ellipses whose axes are aligned with the eigenvectors u_i of the Hessian matrix, with lengths that are inversely proportional to the square roots of the corresponding eigenvalues λ_i.

Because the eigenvectors {u_i} form a complete set, an arbitrary vector v can be written in the form

\[ \mathbf{v} = \sum_i c_i \mathbf{u}_i. \tag{5.38} \]

From (5.33) and (5.34), we then have

\[ \mathbf{v}^{\mathrm{T}} \mathbf{H} \mathbf{v} = \sum_i c_i^2 \lambda_i \tag{5.39} \]

and so H will be positive definite if, and only if, all of its eigenvalues are positive (Exercises 5.10–5.12). In the new coordinate system, whose basis vectors are given by the eigenvectors {u_i}, the contours of constant E are ellipses centred on the origin, as illustrated in Figure 5.6.
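The following NumPy sketch (my own illustration, not from the text) checks these properties numerically for a small Hessian: it verifies positive definiteness via the eigenvalues, and confirms that expressing w − w⋆ in the eigenvector basis reproduces the quadratic error of (5.36).

```python
import numpy as np

# A symmetric Hessian evaluated at a minimum w*, plus an arbitrary displacement.
H = np.array([[3.0, 1.0],
              [1.0, 2.0]])
w_star = np.array([1.0, -1.0])
w = np.array([1.5, -0.2])

eigvals, U = np.linalg.eigh(H)           # columns of U are the eigenvectors u_i
print("positive definite:", np.all(eigvals > 0))

# Quadratic error relative to E(w*), directly from (5.32) ...
dw = w - w_star
E_quad = 0.5 * dw @ H @ dw

# ... and via the eigenvector expansion (5.35)-(5.36): alpha_i = u_i^T (w - w*)
alpha = U.T @ dw
E_eig = 0.5 * np.sum(eigvals * alpha**2)
print(np.isclose(E_quad, E_eig))         # the two forms agree

# Contour semi-axes along u_i are proportional to 1/sqrt(lambda_i) (Figure 5.6).
print("axis lengths proportional to", 1.0 / np.sqrt(eigvals))
```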
For a one-dimensional weight space, a stationary point w⋆ will be a minimum if

\[ \left. \frac{\partial^2 E}{\partial w^2} \right|_{w^{\star}} > 0. \tag{5.40} \]

The corresponding result in D dimensions is that the Hessian matrix, evaluated at w⋆, should be positive definite.

5.2.3 Use of gradient information

As we shall see in Section 5.3, it is possible to evaluate the gradient of an error function efficiently by means of the backpropagation procedure (Exercise 5.13). The use of this gradient information can lead to significant improvements in the speed with which the minima of the error function can be located.
We can see why this is so, as follows. In the quadratic approximation to the error function, given in (5.28), the error surface is specified by the quantities b and H, which contain a total of W(W + 3)/2 independent elements (because the matrix H is symmetric), where W is the dimensionality of w (i.e., the total number of adaptive parameters in the network). The location of the minimum of this quadratic approximation therefore depends on O(W²) parameters, and we should not expect to be able to locate the minimum until we have gathered O(W²) independent pieces of information. If we do not make use of gradient information, we would expect to have to perform O(W²) function evaluations, each of which would require O(W) steps. Thus, the computational effort needed to find the minimum using such an approach would be O(W³).

Now compare this with an algorithm that makes use of the gradient information. Because each evaluation of ∇E brings W items of information, we might hope to find the minimum of the function in O(W) gradient evaluations. As we shall see, by using error backpropagation, each such evaluation takes only O(W) steps and so the minimum can now be found in O(W²) steps. For this reason, the use of gradient information forms the basis of practical algorithms for training neural networks.

5.2.4 Gradient descent optimization

The simplest approach to using gradient information is to choose the weight update in (5.27) to comprise a small step in the direction of the negative gradient, so that

\[ \mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta \nabla E(\mathbf{w}^{(\tau)}) \tag{5.41} \]

where the parameter η > 0 is known as the learning rate.
After each such update, the gradient is re-evaluated for the new weight vector and the process repeated. Note that the error function is defined with respect to a training set, and so each step requires that the entire training set be processed in order to evaluate ∇E. Techniques that use the whole data set at once are called batch methods. At each step the weight vector is moved in the direction of the greatest rate of decrease of the error function, and so this approach is known as gradient descent or steepest descent.
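As a concrete, deliberately simple sketch of batch gradient descent, the code below applies the update (5.41) to a sum-of-squares error for a linear model; the model, data, learning rate, and iteration count are illustrative choices of my own, not anything prescribed by the text.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # inputs (N = 100, 3 features)
w_true = np.array([0.5, -1.0, 2.0])
t = X @ w_true + 0.1 * rng.normal(size=100)   # noisy targets

def grad_E(w):
    """Gradient of the sum-of-squares error E(w) = 0.5 * sum_n (y_n - t_n)^2,
    evaluated over the whole training set (a batch method)."""
    return X.T @ (X @ w - t)

eta = 0.01                     # learning rate
w = np.zeros(3)                # initial weight vector w^(0)
for tau in range(200):         # iterate the update (5.41)
    w = w - eta * grad_E(w)

print(w)                       # close to w_true
```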
Although such an approach might intuitively seem reasonable, in fact it turns out to be a poor algorithm, for reasons discussed in Bishop and Nabney (2008).

For batch optimization, there are more efficient methods, such as conjugate gradients and quasi-Newton methods, which are much more robust and much faster than simple gradient descent (Gill et al., 1981; Fletcher, 1987; Nocedal and Wright, 1999). Unlike gradient descent, these algorithms have the property that the error function always decreases at each iteration unless the weight vector has arrived at a local or global minimum.

In order to find a sufficiently good minimum, it may be necessary to run a gradient-based algorithm multiple times, each time using a different randomly chosen starting point, and comparing the resulting performance on an independent validation set.

There is, however, an on-line version of gradient descent that has proved useful in practice for training neural networks on large data sets (Le Cun et al., 1989). Error functions based on maximum likelihood for a set of independent observations comprise a sum of terms, one for each data point

\[ E(\mathbf{w}) = \sum_{n=1}^{N} E_n(\mathbf{w}). \tag{5.42} \]

On-line gradient descent, also known as sequential gradient descent or stochastic gradient descent, makes an update to the weight vector based on one data point at a time, so that

\[ \mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta \nabla E_n(\mathbf{w}^{(\tau)}). \tag{5.43} \]
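For comparison with the batch sketch above, here is an equally minimal on-line (stochastic) gradient descent loop for the same illustrative linear-model setup, applying the update (5.43) with one randomly chosen data point per step; the data, learning rate, and iteration count are again arbitrary choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))                 # a larger data set
w_true = np.array([0.5, -1.0, 2.0])
t = X @ w_true + 0.1 * rng.normal(size=1000)   # noisy targets

def grad_En(w, n):
    """Gradient of the single-point error E_n(w) = 0.5 * (y_n - t_n)^2."""
    return (X[n] @ w - t[n]) * X[n]

eta = 0.01
w = np.zeros(3)
for tau in range(20000):                       # update (5.43), one point at a time
    n = rng.integers(len(X))                   # pick a data point at random
    w = w - eta * grad_En(w, n)

print(w)                                       # approaches w_true (up to noise)
```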