Bishop C.M., Pattern Recognition and Machine Learning (2006)
Figure 5.5: Geometrical view of the error function E(w) as a surface sitting over weight space. Point w_A is a local minimum and w_B is the global minimum. At any point w_C, the local gradient of the error surface is given by the vector ∇E.

Following the discussion of Section 4.3.4, we see that the output unit activation function, which corresponds to the canonical link, is given by the softmax function

\[ y_k(\mathbf{x}, \mathbf{w}) = \frac{\exp(a_k(\mathbf{x}, \mathbf{w}))}{\sum_j \exp(a_j(\mathbf{x}, \mathbf{w}))} \tag{5.25} \]

which satisfies 0 ⩽ y_k ⩽ 1 and Σ_k y_k = 1 (Exercise 5.7).
Note that the y_k(x, w) are unchanged if a constant is added to all of the a_k(x, w), causing the error function to be constant for some directions in weight space. This degeneracy is removed if an appropriate regularization term (Section 5.5) is added to the error function. Once again, the derivative of the error function with respect to the activation for a particular output unit takes the familiar form (5.18).

In summary, there is a natural choice of both output unit activation function and matching error function, according to the type of problem being solved. For regression we use linear outputs and a sum-of-squares error, for (multiple independent) binary classifications we use logistic sigmoid outputs and a cross-entropy error function, and for multiclass classification we use softmax outputs with the corresponding multiclass cross-entropy error function.
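As an illustration, here is a minimal NumPy sketch (not from the book) of the softmax output of (5.25). It uses the standard max-subtraction trick for numerical stability, and checks that the outputs sum to one and are invariant to adding a constant to all activations, as just noted.

```python
import numpy as np

def softmax(a):
    """Softmax outputs y_k = exp(a_k) / sum_j exp(a_j), as in (5.25).
    Subtracting the maximum activation leaves the result unchanged
    (shift invariance) while avoiding overflow in exp."""
    a = np.asarray(a, dtype=float)
    exp_a = np.exp(a - np.max(a))
    return exp_a / np.sum(exp_a)

a = np.array([2.0, 1.0, 0.1])
y = softmax(a)
print(y)                                  # e.g. [0.659 0.242 0.099]
print(np.isclose(np.sum(y), 1.0))         # outputs sum to one
print(np.allclose(y, softmax(a + 5.0)))   # unchanged by a constant shift
```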
For classification problems involving two classes, we can use a single logistic sigmoid output, or alternatively we can use a network with two outputs having a softmax output activation function.

5.2.1 Parameter optimization

We turn next to the task of finding a weight vector w which minimizes the chosen function E(w). At this point, it is useful to have a geometrical picture of the error function, which we can view as a surface sitting over weight space, as shown in Figure 5.5.
First note that if we make a small step in weight space from w to w + δw then the change in the error function is δE ≃ δw^T ∇E(w), where the vector ∇E(w) points in the direction of greatest rate of increase of the error function. Because the error E(w) is a smooth continuous function of w, its smallest value will occur at a point in weight space such that the gradient of the error function vanishes, so that

\[ \nabla E(\mathbf{w}) = \mathbf{0} \tag{5.26} \]

as otherwise we could make a small step in the direction of −∇E(w) and thereby further reduce the error. Points at which the gradient vanishes are called stationary points, and may be further classified into minima, maxima, and saddle points.

Our goal is to find a vector w such that E(w) takes its smallest value.
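To make the distinction between these kinds of stationary point concrete, the following small sketch (my own illustration, not from the book, anticipating the Hessian analysis of Section 5.2.2) examines two toy error surfaces: the gradient vanishes at the origin for both, but the eigenvalues of the second-derivative (Hessian) matrix show that one point is a minimum while the other is a saddle point.

```python
import numpy as np

# Two toy error surfaces with a stationary point at the origin:
#   E1(w) = w1**2 + 2*w2**2   -> minimum (both Hessian eigenvalues positive)
#   E2(w) = w1**2 - w2**2     -> saddle point (eigenvalues of mixed sign)
H1 = np.array([[2.0, 0.0], [0.0, 4.0]])   # Hessian of E1
H2 = np.array([[2.0, 0.0], [0.0, -2.0]])  # Hessian of E2

for name, H in [("E1", H1), ("E2", H2)]:
    eigvals = np.linalg.eigvalsh(H)
    kind = ("minimum" if np.all(eigvals > 0)
            else "maximum" if np.all(eigvals < 0)
            else "saddle point")
    print(name, eigvals, "->", kind)
```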
However, the error function typically has a highly nonlinear dependence on the weights and bias parameters, and so there will be many points in weight space at which the gradient vanishes (or is numerically very small). Indeed, from the discussion in Section 5.1.1 we see that for any point w that is a local minimum, there will be other points in weight space that are equivalent minima.
For instance, in a two-layer network of the kind shown in Figure 5.1, with M hidden units, each point in weight space is a member of a family of M!·2^M equivalent points.

Furthermore, there will typically be multiple inequivalent stationary points and in particular multiple inequivalent minima. A minimum that corresponds to the smallest value of the error function for any weight vector is said to be a global minimum.
Any other minima corresponding to higher values of the error function are said to be local minima. For a successful application of neural networks, it may not be necessary to find the global minimum (and in general it will not be known whether the global minimum has been found) but it may be necessary to compare several local minima in order to find a sufficiently good solution.

Because there is clearly no hope of finding an analytical solution to the equation ∇E(w) = 0 we resort to iterative numerical procedures. The optimization of continuous nonlinear functions is a widely studied problem and there exists an extensive literature on how to solve it efficiently.
Most techniques involve choosing some initial value w^(0) for the weight vector and then moving through weight space in a succession of steps of the form

\[ \mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} + \Delta\mathbf{w}^{(\tau)} \tag{5.27} \]

where τ labels the iteration step. Different algorithms involve different choices for the weight vector update Δw^(τ). Many algorithms make use of gradient information and therefore require that, after each update, the value of ∇E(w) is evaluated at the new weight vector w^(τ+1). In order to understand the importance of gradient information, it is useful to consider a local approximation to the error function based on a Taylor expansion.

5.2.2 Local quadratic approximation

Insight into the optimization problem, and into the various techniques for solving it, can be obtained by considering a local quadratic approximation to the error function.
Consider the Taylor expansion of E(w) around some point ŵ in weight space

\[ E(\mathbf{w}) \simeq E(\widehat{\mathbf{w}}) + (\mathbf{w} - \widehat{\mathbf{w}})^{\mathrm{T}} \mathbf{b} + \frac{1}{2} (\mathbf{w} - \widehat{\mathbf{w}})^{\mathrm{T}} \mathbf{H} (\mathbf{w} - \widehat{\mathbf{w}}) \tag{5.28} \]

where cubic and higher terms have been omitted. Here b is defined to be the gradient of E evaluated at ŵ

\[ \mathbf{b} \equiv \left. \nabla E \right|_{\mathbf{w} = \widehat{\mathbf{w}}} \tag{5.29} \]

and the Hessian matrix H = ∇∇E has elements

\[ (\mathbf{H})_{ij} \equiv \left. \frac{\partial^2 E}{\partial w_i \, \partial w_j} \right|_{\mathbf{w} = \widehat{\mathbf{w}}}. \tag{5.30} \]

From (5.28), the corresponding local approximation to the gradient is given by

\[ \nabla E \simeq \mathbf{b} + \mathbf{H}(\mathbf{w} - \widehat{\mathbf{w}}). \tag{5.31} \]

For points w that are sufficiently close to ŵ, these expressions will give reasonable approximations for the error and its gradient.

Consider the particular case of a local quadratic approximation around a point w⋆ that is a minimum of the error function. In this case there is no linear term, because ∇E = 0 at w⋆, and (5.28) becomes

\[ E(\mathbf{w}) = E(\mathbf{w}^{\star}) + \frac{1}{2} (\mathbf{w} - \mathbf{w}^{\star})^{\mathrm{T}} \mathbf{H} (\mathbf{w} - \mathbf{w}^{\star}) \tag{5.32} \]

where the Hessian H is evaluated at w⋆. In order to interpret this geometrically, consider the eigenvalue equation for the Hessian matrix

\[ \mathbf{H} \mathbf{u}_i = \lambda_i \mathbf{u}_i \tag{5.33} \]

where the eigenvectors u_i form a complete orthonormal set (Appendix C) so that

\[ \mathbf{u}_i^{\mathrm{T}} \mathbf{u}_j = \delta_{ij}. \tag{5.34} \]

We now expand (w − w⋆) as a linear combination of the eigenvectors in the form

\[ \mathbf{w} - \mathbf{w}^{\star} = \sum_i \alpha_i \mathbf{u}_i. \tag{5.35} \]

This can be regarded as a transformation of the coordinate system in which the origin is translated to the point w⋆, and the axes are rotated to align with the eigenvectors (through the orthogonal matrix whose columns are the u_i), and is discussed in more detail in Appendix C.
Substituting (5.35) into (5.32), and using (5.33) and (5.34), allows the error function to be written in the form

\[ E(\mathbf{w}) = E(\mathbf{w}^{\star}) + \frac{1}{2} \sum_i \lambda_i \alpha_i^2. \tag{5.36} \]

A matrix H is said to be positive definite if, and only if,

\[ \mathbf{v}^{\mathrm{T}} \mathbf{H} \mathbf{v} > 0 \quad \text{for all } \mathbf{v}. \tag{5.37} \]

Figure 5.6: In the neighbourhood of a minimum w⋆, the error function can be approximated by a quadratic. Contours of constant error are then ellipses whose axes are aligned with the eigenvectors u_i of the Hessian matrix, with lengths that are inversely proportional to the square roots of the corresponding eigenvalues λ_i.

Because the eigenvectors {u_i} form a complete set, an arbitrary vector v can be written in the form

\[ \mathbf{v} = \sum_i c_i \mathbf{u}_i. \tag{5.38} \]

From (5.33) and (5.34), we then have

\[ \mathbf{v}^{\mathrm{T}} \mathbf{H} \mathbf{v} = \sum_i c_i^2 \lambda_i \tag{5.39} \]

and so H will be positive definite if, and only if, all of its eigenvalues are positive (Exercises 5.10–5.12). In the new coordinate system, whose basis vectors are given by the eigenvectors {u_i}, the contours of constant E are ellipses centred on the origin, as illustrated in Figure 5.6.
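The following NumPy sketch (my own illustration, not from the text) checks these properties numerically for a small Hessian: it verifies positive definiteness via the eigenvalues, and confirms that expressing w − w⋆ in the eigenvector basis reproduces the quadratic error of (5.36).

```python
import numpy as np

# A symmetric Hessian evaluated at a minimum w*, plus an arbitrary displacement.
H = np.array([[3.0, 1.0],
              [1.0, 2.0]])
w_star = np.array([1.0, -1.0])
w = np.array([1.5, -0.2])

eigvals, U = np.linalg.eigh(H)           # columns of U are the eigenvectors u_i
print("positive definite:", np.all(eigvals > 0))

# Quadratic error relative to E(w*), directly from (5.32) ...
dw = w - w_star
E_quad = 0.5 * dw @ H @ dw

# ... and via the eigenvector expansion (5.35)-(5.36): alpha_i = u_i^T (w - w*)
alpha = U.T @ dw
E_eig = 0.5 * np.sum(eigvals * alpha**2)
print(np.isclose(E_quad, E_eig))         # the two forms agree

# Contour semi-axes along u_i are proportional to 1/sqrt(lambda_i) (Figure 5.6).
print("axis lengths proportional to", 1.0 / np.sqrt(eigvals))
```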
For a one-dimensional weight space, a stationary point w⋆ will be a minimum if

\[ \left. \frac{\partial^2 E}{\partial w^2} \right|_{w^{\star}} > 0. \tag{5.40} \]

The corresponding result in D dimensions is that the Hessian matrix, evaluated at w⋆, should be positive definite.

5.2.3 Use of gradient information

As we shall see in Section 5.3, it is possible to evaluate the gradient of an error function efficiently by means of the backpropagation procedure (Exercise 5.13). The use of this gradient information can lead to significant improvements in the speed with which the minima of the error function can be located.
We can see why this is so, as follows. In the quadratic approximation to the error function, given in (5.28), the error surface is specified by the quantities b and H, which contain a total of W(W + 3)/2 independent elements (because the matrix H is symmetric), where W is the dimensionality of w (i.e., the total number of adaptive parameters in the network). The location of the minimum of this quadratic approximation therefore depends on O(W²) parameters, and we should not expect to be able to locate the minimum until we have gathered O(W²) independent pieces of information. If we do not make use of gradient information, we would expect to have to perform O(W²) function evaluations, each of which would require O(W) steps. Thus, the computational effort needed to find the minimum using such an approach would be O(W³).

Now compare this with an algorithm that makes use of the gradient information. Because each evaluation of ∇E brings W items of information, we might hope to find the minimum of the function in O(W) gradient evaluations. As we shall see, by using error backpropagation, each such evaluation takes only O(W) steps and so the minimum can now be found in O(W²) steps. For this reason, the use of gradient information forms the basis of practical algorithms for training neural networks.

5.2.4 Gradient descent optimization

The simplest approach to using gradient information is to choose the weight update in (5.27) to comprise a small step in the direction of the negative gradient, so that

\[ \mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta \nabla E(\mathbf{w}^{(\tau)}) \tag{5.41} \]

where the parameter η > 0 is known as the learning rate.
After each such update, the gradient is re-evaluated for the new weight vector and the process repeated. Note that the error function is defined with respect to a training set, and so each step requires that the entire training set be processed in order to evaluate ∇E. Techniques that use the whole data set at once are called batch methods. At each step the weight vector is moved in the direction of the greatest rate of decrease of the error function, and so this approach is known as gradient descent or steepest descent.
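As a concrete, deliberately simple sketch of batch gradient descent, the code below applies the update (5.41) to a sum-of-squares error for a linear model; the model, data, learning rate, and iteration count are illustrative choices of my own, not anything prescribed by the text.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # inputs (N = 100, 3 features)
w_true = np.array([0.5, -1.0, 2.0])
t = X @ w_true + 0.1 * rng.normal(size=100)   # noisy targets

def grad_E(w):
    """Gradient of the sum-of-squares error E(w) = 0.5 * sum_n (y_n - t_n)^2,
    evaluated over the whole training set (a batch method)."""
    return X.T @ (X @ w - t)

eta = 0.01                     # learning rate
w = np.zeros(3)                # initial weight vector w^(0)
for tau in range(200):         # iterate the update (5.41)
    w = w - eta * grad_E(w)

print(w)                       # close to w_true
```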
Although such an approach might intuitively seem reasonable, in fact it turns out to be a poor algorithm, for reasons discussed in Bishop and Nabney (2008).

For batch optimization, there are more efficient methods, such as conjugate gradients and quasi-Newton methods, which are much more robust and much faster than simple gradient descent (Gill et al., 1981; Fletcher, 1987; Nocedal and Wright, 1999). Unlike gradient descent, these algorithms have the property that the error function always decreases at each iteration unless the weight vector has arrived at a local or global minimum.

In order to find a sufficiently good minimum, it may be necessary to run a gradient-based algorithm multiple times, each time using a different randomly chosen starting point, and comparing the resulting performance on an independent validation set.

There is, however, an on-line version of gradient descent that has proved useful in practice for training neural networks on large data sets (Le Cun et al., 1989). Error functions based on maximum likelihood for a set of independent observations comprise a sum of terms, one for each data point

\[ E(\mathbf{w}) = \sum_{n=1}^{N} E_n(\mathbf{w}). \tag{5.42} \]

On-line gradient descent, also known as sequential gradient descent or stochastic gradient descent, makes an update to the weight vector based on one data point at a time, so that

\[ \mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta \nabla E_n(\mathbf{w}^{(\tau)}). \tag{5.43} \]
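For comparison with the batch sketch above, here is an equally minimal on-line (stochastic) gradient descent loop for the same illustrative linear-model setup, applying the update (5.43) with one randomly chosen data point per step; the data, learning rate, and iteration count are again arbitrary choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))                 # a larger data set
w_true = np.array([0.5, -1.0, 2.0])
t = X @ w_true + 0.1 * rng.normal(size=1000)   # noisy targets

def grad_En(w, n):
    """Gradient of the single-point error E_n(w) = 0.5 * (y_n - t_n)^2."""
    return (X[n] @ w - t[n]) * X[n]

eta = 0.01
w = np.zeros(3)
for tau in range(20000):                       # update (5.43), one point at a time
    n = rng.integers(len(X))                   # pick a data point at random
    w = w - eta * grad_En(w, n)

print(w)                                       # approaches w_true (up to noise)
```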