Bishop C.M., Pattern Recognition and Machine Learning (2006), page 57
This update is repeated by cycling through the data either in sequence or by selecting points at random with replacement. There are of course intermediate scenarios in which the updates are based on batches of data points.

One advantage of on-line methods compared to batch methods is that the former handle redundancy in the data much more efficiently. To see this, consider an extreme example in which we take a data set and double its size by duplicating every data point.
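The contrast between batch and on-line updates, and the effect of duplicating every data point, can be sketched numerically. This is a minimal illustration, not taken from the text: it assumes a linear model with a sum-of-squares error, and the data, sizes, and learning rate eta are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                      # input patterns
T = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

def batch_grad(w, X, T):
    # Gradient of the sum-of-squares error over the whole data set.
    return X.T @ (X @ w - T)

w = np.zeros(3)

# Double the data set by duplicating every data point.
X2, T2 = np.vstack([X, X]), np.concatenate([T, T])
g, g2 = batch_grad(w, X, T), batch_grad(w, X2, T2)
assert np.allclose(g2, 2 * g)   # batch gradient (and batch cost) doubles

# On-line updates touch one point at a time, so each update costs the same
# whether or not the data set has been duplicated.
eta = 0.01                       # learning rate (arbitrary for this sketch)
for n in rng.permutation(len(X))[:10]:
    w -= eta * (X[n] @ w - T[n]) * X[n]
```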
Note that this simply multiplies the error function by a factor of 2 and so is equivalent to using the original error function. Batch methods will require double the computational effort to evaluate the batch error function gradient, whereas on-line methods will be unaffected. Another property of on-line gradient descent is the possibility of escaping from local minima, since a stationary point with respect to the error function for the whole data set will generally not be a stationary point for each data point individually.

Nonlinear optimization algorithms, and their practical application to neural network training, are discussed in detail in Bishop and Nabney (2008).

5.3. Error Backpropagation

Our goal in this section is to find an efficient technique for evaluating the gradient of an error function E(w) for a feed-forward neural network.
We shall see that this can be achieved using a local message passing scheme in which information is sent alternately forwards and backwards through the network and is known as error backpropagation, or sometimes simply as backprop.

It should be noted that the term backpropagation is used in the neural computing literature to mean a variety of different things. For instance, the multilayer perceptron architecture is sometimes called a backpropagation network.
The term backpropagation is also used to describe the training of a multilayer perceptron using gradient descent applied to a sum-of-squares error function. In order to clarify the terminology, it is useful to consider the nature of the training process more carefully. Most training algorithms involve an iterative procedure for minimization of an error function, with adjustments to the weights being made in a sequence of steps. At each such step, we can distinguish between two distinct stages. In the first stage, the derivatives of the error function with respect to the weights must be evaluated. As we shall see, the important contribution of the backpropagation technique is in providing a computationally efficient method for evaluating such derivatives.
Because it is at this stage that errors are propagated backwards through the network, we shall use the term backpropagation specifically to describe the evaluation of derivatives. In the second stage, the derivatives are then used to compute the adjustments to be made to the weights. The simplest such technique, and the one originally considered by Rumelhart et al. (1986), involves gradient descent.
It is important to recognize that the two stages are distinct. Thus, the first stage, namely the propagation of errors backwards through the network in order to evaluate derivatives, can be applied to many other kinds of network and not just the multilayer perceptron. It can also be applied to error functions other than just the simple sum-of-squares, and to the evaluation of other derivatives such as the Jacobian and Hessian matrices, as we shall see later in this chapter.
Similarly, the second stage of weight adjustment using the calculated derivatives can be tackled using a variety of optimization schemes, many of which are substantially more powerful than simple gradient descent.

5.3.1 Evaluation of error-function derivatives

We now derive the backpropagation algorithm for a general network having arbitrary feed-forward topology, arbitrary differentiable nonlinear activation functions, and a broad class of error function. The resulting formulae will then be illustrated using a simple layered network structure having a single layer of sigmoidal hidden units together with a sum-of-squares error.

Many error functions of practical interest, for instance those defined by maximum likelihood for a set of i.i.d. data, comprise a sum of terms, one for each data point in the training set, so that

    E(w) = \sum_{n=1}^{N} E_n(w).    (5.44)

Here we shall consider the problem of evaluating ∇En(w) for one such term in the error function.
This may be used directly for sequential optimization, or the results can be accumulated over the training set in the case of batch methods.

Consider first a simple linear model in which the outputs yk are linear combinations of the input variables xi so that

    y_k = \sum_i w_{ki} x_i    (5.45)

together with an error function that, for a particular input pattern n, takes the form

    E_n = \frac{1}{2} \sum_k (y_{nk} - t_{nk})^2    (5.46)

where ynk = yk(xn, w). The gradient of this error function with respect to a weight wji is given by

    \frac{\partial E_n}{\partial w_{ji}} = (y_{nj} - t_{nj}) x_{ni}    (5.47)

which can be interpreted as a 'local' computation involving the product of an 'error signal' ynj − tnj associated with the output end of the link wji and the variable xni associated with the input end of the link.
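Equation (5.47) can be checked directly against a finite-difference estimate of the error. A small sketch, assuming illustrative sizes and random data (the names W, x, t here are hypothetical, not the book's code):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(2, 4))   # weights w_ki of the linear model y_k = sum_i w_ki x_i
x = rng.normal(size=4)        # one input pattern
t = rng.normal(size=2)        # its target

y = W @ x
grad = np.outer(y - t, x)     # (5.47): error signal at the output end times
                              # the input-end variable, for every weight at once

def En(W):
    # Per-pattern sum-of-squares error (5.46).
    return 0.5 * np.sum((W @ x - t) ** 2)

# Finite-difference check of a single weight derivative.
eps = 1e-6
Wp = W.copy()
Wp[1, 2] += eps
assert abs((En(Wp) - En(W)) / eps - grad[1, 2]) < 1e-4
```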
In Section 4.3.2, we saw how a similar formula arises with the logistic sigmoid activation function together with the cross-entropy error function, and similarly for the softmax activation function together with its matching cross-entropy error function. We shall now see how this simple result extends to the more complex setting of multilayer feed-forward networks.

In a general feed-forward network, each unit computes a weighted sum of its inputs of the form

    a_j = \sum_i w_{ji} z_i    (5.48)
where zi is the activation of a unit, or input, that sends a connection to unit j, and wji is the weight associated with that connection. In Section 5.1, we saw that biases can be included in this sum by introducing an extra unit, or input, with activation fixed at +1. We therefore do not need to deal with biases explicitly. The sum in (5.48) is transformed by a nonlinear activation function h(·) to give the activation zj of unit j in the form

    z_j = h(a_j).    (5.49)

Note that one or more of the variables zi in the sum in (5.48) could be an input, and similarly, the unit j in (5.49) could be an output.

For each pattern in the training set, we shall suppose that we have supplied the corresponding input vector to the network and calculated the activations of all of the hidden and output units in the network by successive application of (5.48) and (5.49).
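The successive application of (5.48) and (5.49) can be sketched as follows. This is an illustrative fragment, not the book's code: tanh stands in for h(·), the layer sizes are arbitrary, and biases are absorbed by appending a fixed +1 activation as described above.

```python
import numpy as np

def forward(x, weights, h=np.tanh):
    """Apply a_j = sum_i w_ji z_i and z_j = h(a_j) layer by layer.

    `weights` is a list of weight matrices, one per layer; appending a fixed
    activation of +1 at each layer absorbs the biases, so they need no
    special treatment.
    """
    z = x
    for W in weights:
        a = W @ np.append(z, 1.0)   # weighted sum (5.48), with the +1 bias input
        z = h(a)                    # nonlinear activation (5.49)
    return z

# Example: 3 inputs -> 4 hidden units -> 2 outputs (sizes are arbitrary).
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3 + 1)), rng.normal(size=(2, 4 + 1))]
out = forward(rng.normal(size=3), weights)
```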
This process is often called forward propagation because it can be regarded as a forward flow of information through the network.

Now consider the evaluation of the derivative of En with respect to a weight wji. The outputs of the various units will depend on the particular input pattern n. However, in order to keep the notation uncluttered, we shall omit the subscript n from the network variables. First we note that En depends on the weight wji only via the summed input aj to unit j. We can therefore apply the chain rule for partial derivatives to give

    \frac{\partial E_n}{\partial w_{ji}} = \frac{\partial E_n}{\partial a_j} \frac{\partial a_j}{\partial w_{ji}}.    (5.50)

We now introduce a useful notation

    \delta_j \equiv \frac{\partial E_n}{\partial a_j}    (5.51)

where the δ's are often referred to as errors for reasons we shall see shortly.
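The meaning of δj as ∂En/∂aj can be illustrated with a one-unit finite-difference check. A hypothetical sketch, with made-up values: unit j is taken to be a linear output unit, so that En depends on the weights only through aj.

```python
import numpy as np

z = np.array([0.5, -1.2, 2.0])   # activations feeding unit j (made up)
w = np.array([0.3, 0.8, -0.5])   # the weights w_ji into unit j (made up)
t = 1.0                          # target

def En_of_a(a):
    # Unit j is taken to be a linear output unit, so y = a_j and
    # En = 0.5 * (y - t)^2 depends on the weights only through a_j.
    return 0.5 * (a - t) ** 2

a = w @ z                        # summed input a_j, as in (5.48)
eps = 1e-6
delta_j = (En_of_a(a + eps) - En_of_a(a)) / eps   # numerical dEn/da_j
assert abs(delta_j - (a - t)) < 1e-4              # analytically, y - t
```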
Using (5.48), we can write

    \frac{\partial a_j}{\partial w_{ji}} = z_i.    (5.52)

Substituting (5.51) and (5.52) into (5.50), we then obtain

    \frac{\partial E_n}{\partial w_{ji}} = \delta_j z_i.    (5.53)

Equation (5.53) tells us that the required derivative is obtained simply by multiplying the value of δ for the unit at the output end of the weight by the value of z for the unit at the input end of the weight (where z = 1 in the case of a bias). Note that this takes the same form as for the simple linear model considered at the start of this section. Thus, in order to evaluate the derivatives, we need only to calculate the value of δj for each hidden and output unit in the network, and then apply (5.53).

As we have seen already, for the output units, we have

    \delta_k = y_k - t_k    (5.54)

provided we are using the canonical link as the output-unit activation function. To evaluate the δ's for hidden units, we again make use of the chain rule for partial derivatives,

    \delta_j \equiv \frac{\partial E_n}{\partial a_j} = \sum_k \frac{\partial E_n}{\partial a_k} \frac{\partial a_k}{\partial a_j}    (5.55)

where the sum runs over all units k to which unit j sends connections.

Figure 5.7  Illustration of the calculation of δj for hidden unit j by backpropagation of the δ's from those units k to which unit j sends connections. The blue arrow denotes the direction of information flow during forward propagation, and the red arrows indicate the backward propagation of error information.
The arrangement of units and weights is illustrated in Figure 5.7. Note that the units labelled k could include other hidden units and/or output units. In writing down (5.55), we are making use of the fact that variations in aj give rise to variations in the error function only through variations in the variables ak. If we now substitute the definition of δ given by (5.51) into (5.55), and make use of (5.48) and (5.49), we obtain the following backpropagation formula

    \delta_j = h'(a_j) \sum_k w_{kj} \delta_k    (5.56)

which tells us that the value of δ for a particular hidden unit can be obtained by propagating the δ's backwards from units higher up in the network, as illustrated in Figure 5.7.
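Equations (5.53), (5.54), and (5.56) together give the complete procedure for a two-layer network. The following sketch assumes tanh hidden units, linear (canonical-link) outputs, and a sum-of-squares error; the sizes and data are arbitrary, and the result is verified against finite differences.

```python
import numpy as np

rng = np.random.default_rng(2)

# Two-layer network: D inputs, M tanh hidden units, K linear outputs,
# sum-of-squares error. Sizes and data are arbitrary; biases are handled
# by an extra input clamped at +1, as in the text.
D, M, K = 3, 5, 2
W1 = rng.normal(size=(M, D + 1))
W2 = rng.normal(size=(K, M + 1))
x = rng.normal(size=D)
t = rng.normal(size=K)

def forward(W1, W2, x):
    a = W1 @ np.append(x, 1.0)           # (5.48) for the hidden layer
    z = np.tanh(a)                       # (5.49)
    y = W2 @ np.append(z, 1.0)           # linear (canonical-link) outputs
    return a, z, y

a, z, y = forward(W1, W2, x)

delta_k = y - t                          # (5.54): output-unit errors
# (5.56): propagate the deltas backwards; h'(a) = 1 - tanh(a)^2 = 1 - z^2.
# The bias column of W2 sends nothing back, hence the [:, :M] slice.
delta_j = (1 - z ** 2) * (W2[:, :M].T @ delta_k)

dE_dW2 = np.outer(delta_k, np.append(z, 1.0))   # (5.53): delta times z
dE_dW1 = np.outer(delta_j, np.append(x, 1.0))

def En(W1, W2):
    return 0.5 * np.sum((forward(W1, W2, x)[2] - t) ** 2)

# Finite-difference check of one weight in each layer.
eps = 1e-6
W1p = W1.copy(); W1p[0, 1] += eps
W2p = W2.copy(); W2p[1, 3] += eps
assert abs((En(W1p, W2) - En(W1, W2)) / eps - dE_dW1[0, 1]) < 1e-3
assert abs((En(W1, W2p) - En(W1, W2)) / eps - dE_dW2[1, 3]) < 1e-3
```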