Bishop C.M., Pattern Recognition and Machine Learning (2006), Section 5.3 Error Backpropagation
Note that the summation in (5.56) is taken over the first index on w_kj (corresponding to backward propagation of information through the network), whereas in the forward propagation equation (5.10) it is taken over the second index. Because we already know the values of the δ's for the output units, it follows that by recursively applying (5.56) we can evaluate the δ's for all of the hidden units in a feed-forward network, regardless of its topology.

The backpropagation procedure can therefore be summarized as follows.

Error Backpropagation

1. Apply an input vector x_n to the network and forward propagate through the network using (5.48) and (5.49) to find the activations of all the hidden and output units.
2. Evaluate the δ_k for all the output units using (5.54).

3. Backpropagate the δ's using (5.56) to obtain δ_j for each hidden unit in the network.

4. Use (5.53) to evaluate the required derivatives.

For batch methods, the derivative of the total error E can then be obtained by repeating the above steps for each pattern in the training set and then summing over all patterns:

$$\frac{\partial E}{\partial w_{ji}} = \sum_n \frac{\partial E_n}{\partial w_{ji}}. \qquad (5.57)$$
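As an illustration of the batch accumulation in (5.57), the following Python sketch sums the per-pattern derivatives over a training set. It assumes a per-pattern routine forward_backward (given with the sketch for the example of Section 5.3.2 below) returning E_n together with its weight derivatives; the names X, T, W1, W2 are illustrative conventions, not notation from the text.

```python
# Minimal sketch of the batch gradient (5.57): sum per-pattern derivatives.
import numpy as np

def batch_gradient(X, T, W1, W2):
    """X: (N, D) inputs, T: (N, K) targets; returns total error and summed gradients."""
    E, grad_W1, grad_W2 = 0.0, np.zeros_like(W1), np.zeros_like(W2)
    for x_n, t_n in zip(X, T):
        E_n, g1, g2 = forward_backward(x_n, t_n, W1, W2)  # per-pattern backprop
        E += E_n
        grad_W1 += g1          # dE/dw = sum_n dE_n/dw, equation (5.57)
        grad_W2 += g2
    return E, grad_W1, grad_W2
```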
In the above derivation we have implicitly assumed that each hidden or output unit in the network has the same activation function h(·). The derivation is easily generalized, however, to allow different units to have individual activation functions, simply by keeping track of which form of h(·) goes with which unit.

5.3.2 A simple example

The above derivation of the backpropagation procedure allowed for general forms for the error function, the activation functions, and the network topology.
In order to illustrate the application of this algorithm, we shall consider a particular example. This is chosen both for its simplicity and for its practical importance, because many applications of neural networks reported in the literature make use of this type of network. Specifically, we shall consider a two-layer network of the form illustrated in Figure 5.1, together with a sum-of-squares error, in which the output units have linear activation functions, so that y_k = a_k, while the hidden units have logistic sigmoid activation functions given by

$$h(a) \equiv \tanh(a) \qquad (5.58)$$

where

$$\tanh(a) = \frac{e^{a} - e^{-a}}{e^{a} + e^{-a}}. \qquad (5.59)$$

A useful feature of this function is that its derivative can be expressed in a particularly simple form:

$$h'(a) = 1 - h(a)^2. \qquad (5.60)$$

We also consider a standard sum-of-squares error function, so that for pattern n the error is given by

$$E_n = \frac{1}{2} \sum_{k=1}^{K} (y_k - t_k)^2 \qquad (5.61)$$

where y_k is the activation of output unit k, and t_k is the corresponding target, for a particular input pattern x_n.

For each pattern in the training set in turn, we first perform a forward propagation using

$$a_j = \sum_{i=0}^{D} w_{ji}^{(1)} x_i \qquad (5.62)$$

$$z_j = \tanh(a_j) \qquad (5.63)$$

$$y_k = \sum_{j=0}^{M} w_{kj}^{(2)} z_j. \qquad (5.64)$$
Next we compute the δ's for each output unit using

$$\delta_k = y_k - t_k. \qquad (5.65)$$

Then we backpropagate these to obtain δ's for the hidden units using

$$\delta_j = (1 - z_j^2) \sum_{k=1}^{K} w_{kj} \delta_k. \qquad (5.66)$$

Finally, the derivatives with respect to the first-layer and second-layer weights are given by

$$\frac{\partial E_n}{\partial w_{ji}^{(1)}} = \delta_j x_i, \qquad \frac{\partial E_n}{\partial w_{kj}^{(2)}} = \delta_k z_j. \qquad (5.67)$$
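The following NumPy sketch implements the forward pass (5.62)-(5.64) and the backward pass (5.65)-(5.67) for this two-layer example. The weight matrices W1 and W2, and the convention of absorbing the biases by appending fixed units x_0 = 1 and z_0 = 1, are illustrative choices rather than anything prescribed by the text.

```python
# Forward and backward propagation for a two-layer network with tanh hidden units,
# linear outputs, and a sum-of-squares error, following equations (5.62)-(5.67).
import numpy as np

def forward_backward(x, t, W1, W2):
    """x: input of shape (D,), t: target of shape (K,),
    W1: (M, D+1) first-layer weights, W2: (K, M+1) second-layer weights."""
    # Forward propagation, equations (5.62)-(5.64).
    x_tilde = np.concatenate(([1.0], x))          # prepend bias unit x_0 = 1
    a = W1 @ x_tilde                              # a_j = sum_i w_ji^(1) x_i
    z = np.tanh(a)                                # z_j = tanh(a_j)
    z_tilde = np.concatenate(([1.0], z))          # prepend bias unit z_0 = 1
    y = W2 @ z_tilde                              # y_k = sum_j w_kj^(2) z_j (linear outputs)

    # Backward propagation, equations (5.65)-(5.67).
    delta_k = y - t                                     # output-unit deltas (5.65)
    delta_j = (1.0 - z**2) * (W2[:, 1:].T @ delta_k)    # hidden-unit deltas (5.66)
    grad_W1 = np.outer(delta_j, x_tilde)                # dE_n/dw_ji^(1) = delta_j x_i (5.67)
    grad_W2 = np.outer(delta_k, z_tilde)                # dE_n/dw_kj^(2) = delta_k z_j (5.67)
    E_n = 0.5 * np.sum((y - t)**2)                      # sum-of-squares error (5.61)
    return E_n, grad_W1, grad_W2
```

Like the forward pass, the backward pass is dominated by operations whose count is proportional to the number of weights, consistent with the O(W) scaling discussed in the next subsection.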
5.3.3 Efficiency of backpropagation

One of the most important aspects of backpropagation is its computational efficiency. To understand this, let us examine how the number of computer operations required to evaluate the derivatives of the error function scales with the total number W of weights and biases in the network. A single evaluation of the error function (for a given input pattern) would require O(W) operations, for sufficiently large W. This follows from the fact that, except for a network with very sparse connections, the number of weights is typically much greater than the number of units, and so the bulk of the computational effort in forward propagation is concerned with evaluating the sums in (5.48), with the evaluation of the activation functions representing a small overhead.
Each term in the sum in (5.48) requires one multiplication and one addition, leading to an overall computational cost that is O(W).

An alternative approach to backpropagation for computing the derivatives of the error function is to use finite differences. This can be done by perturbing each weight in turn, and approximating the derivatives by the expression

$$\frac{\partial E_n}{\partial w_{ji}} = \frac{E_n(w_{ji} + \epsilon) - E_n(w_{ji})}{\epsilon} + O(\epsilon) \qquad (5.68)$$

where ε ≪ 1. In a software simulation, the accuracy of the approximation to the derivatives can be improved by making ε smaller, until numerical roundoff problems arise.
The accuracy of the finite differences method can be improved significantly by using symmetrical central differences of the form

$$\frac{\partial E_n}{\partial w_{ji}} = \frac{E_n(w_{ji} + \epsilon) - E_n(w_{ji} - \epsilon)}{2\epsilon} + O(\epsilon^2). \qquad (5.69)$$

In this case, the O(ε) corrections cancel, as can be verified by Taylor expansion on the right-hand side of (5.69) (Exercise 5.14), and so the residual corrections are O(ε²). The number of computational steps is, however, roughly doubled compared with (5.68).

The main problem with numerical differentiation is that the highly desirable O(W) scaling has been lost. Each forward propagation requires O(W) steps, and
there are W weights in the network, each of which must be perturbed individually, so that the overall scaling is O(W²).

Figure 5.8 Illustration of a modular pattern recognition system in which the Jacobian matrix can be used to backpropagate error signals from the outputs through to earlier modules in the system.

However, numerical differentiation plays an important role in practice, because a comparison of the derivatives calculated by backpropagation with those obtained using central differences provides a powerful check on the correctness of any software implementation of the backpropagation algorithm. When training networks in practice, derivatives should be evaluated using backpropagation, because this gives the greatest accuracy and numerical efficiency. However, the results should be compared with numerical differentiation using (5.69) for some test cases in order to check the correctness of the implementation.
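The following sketch illustrates such a check, assuming the forward_backward routine from the earlier example: every weight is perturbed in turn and the central-difference estimate (5.69) is compared with the backpropagated derivative. The value of ε and the way the comparison is reported are illustrative choices.

```python
# Gradient check: compare backpropagated derivatives with central differences (5.69).
import numpy as np

def check_gradients(x, t, W1, W2, eps=1e-6):
    """Return the largest discrepancy between backprop and numerical derivatives."""
    _, grad_W1, grad_W2 = forward_backward(x, t, W1, W2)
    worst = 0.0
    for W, grad in ((W1, grad_W1), (W2, grad_W2)):
        for idx in np.ndindex(W.shape):
            orig = W[idx]
            W[idx] = orig + eps
            E_plus, _, _ = forward_backward(x, t, W1, W2)
            W[idx] = orig - eps
            E_minus, _, _ = forward_backward(x, t, W1, W2)
            W[idx] = orig                               # restore the weight
            numeric = (E_plus - E_minus) / (2.0 * eps)  # central difference (5.69)
            worst = max(worst, abs(numeric - grad[idx]))
    return worst                                        # should be very small
```

Because each of the W weights requires two O(W) forward propagations, this check costs O(W²), which is why it is applied only to a few small test cases rather than during training.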
5.3.4 The Jacobian matrix

We have seen how the derivatives of an error function with respect to the weights can be obtained by the propagation of errors backwards through the network. The technique of backpropagation can also be applied to the calculation of other derivatives. Here we consider the evaluation of the Jacobian matrix, whose elements are given by the derivatives of the network outputs with respect to the inputs

$$J_{ki} \equiv \frac{\partial y_k}{\partial x_i} \qquad (5.70)$$

where each such derivative is evaluated with all other inputs held fixed.
Jacobian matrices play a useful role in systems built from a number of distinct modules, as illustrated in Figure 5.8. Each module can comprise a fixed or adaptive function, which can be linear or nonlinear, so long as it is differentiable. Suppose we wish to minimize an error function E with respect to the parameter w in Figure 5.8. The derivative of the error function is given by

$$\frac{\partial E}{\partial w} = \sum_{k,j} \frac{\partial E}{\partial y_k} \frac{\partial y_k}{\partial z_j} \frac{\partial z_j}{\partial w} \qquad (5.71)$$

in which the Jacobian matrix for the red module in Figure 5.8 appears in the middle term.

Because the Jacobian matrix provides a measure of the local sensitivity of the outputs to changes in each of the input variables, it also allows any known errors Δx_i
associated with the inputs to be propagated through the trained network in order to estimate their contribution Δy_k to the errors at the outputs, through the relation

$$\Delta y_k \simeq \sum_i \frac{\partial y_k}{\partial x_i} \Delta x_i \qquad (5.72)$$

which is valid provided the |Δx_i| are small. In general, the network mapping represented by a trained neural network will be nonlinear, and so the elements of the Jacobian matrix will not be constants but will depend on the particular input vector used. Thus (5.72) is valid only for small perturbations of the inputs, and the Jacobian itself must be re-evaluated for each new input vector.

The Jacobian matrix can be evaluated using a backpropagation procedure that is similar to the one derived earlier for evaluating the derivatives of an error function with respect to the weights.
We start by writing the element J_ki in the form

$$J_{ki} = \frac{\partial y_k}{\partial x_i} = \sum_j \frac{\partial y_k}{\partial a_j} \frac{\partial a_j}{\partial x_i} = \sum_j w_{ji} \frac{\partial y_k}{\partial a_j} \qquad (5.73)$$

where we have made use of (5.48). The sum in (5.73) runs over all units j to which the input unit i sends connections (for example, over all units in the first hidden layer in the layered topology considered earlier). We now write down a recursive backpropagation formula to determine the derivatives ∂y_k/∂a_j

$$\frac{\partial y_k}{\partial a_j} = \sum_l \frac{\partial y_k}{\partial a_l} \frac{\partial a_l}{\partial a_j} = h'(a_j) \sum_l w_{lj} \frac{\partial y_k}{\partial a_l} \qquad (5.74)$$

where the sum runs over all units l to which unit j sends connections (corresponding to the first index of w_lj). Again, we have made use of (5.48) and (5.49). This backpropagation starts at the output units, for which the required derivatives can be found directly from the functional form of the output-unit activation function.
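As a concrete instance, the following sketch specializes the recursion (5.73)-(5.74) to the two-layer network of Section 5.3.2, where the linear output units make the starting derivatives of y_k with respect to the output-unit activations simply the identity. The names W1 and W2 follow the earlier sketches and are illustrative assumptions.

```python
# Jacobian J_ki = dy_k/dx_i for the two-layer network (tanh hidden units, linear outputs),
# evaluated by the backpropagation recursion (5.73)-(5.74).
import numpy as np

def jacobian(x, W1, W2):
    """Return J of shape (K, D) with J[k, i] = dy_k / dx_i at the input x."""
    x_tilde = np.concatenate(([1.0], x))     # bias unit x_0 = 1
    a = W1 @ x_tilde                         # hidden pre-activations
    h_prime = 1.0 - np.tanh(a)**2            # h'(a_j) = 1 - tanh(a_j)^2, cf. (5.60)
    # (5.74): dy_k/da_j = h'(a_j) * sum_l w_lj dy_k/da_l, where dy_k/da_l is the
    # identity for the linear output units, giving dy_k/da_j = h'(a_j) * w_kj^(2).
    dy_da = W2[:, 1:] * h_prime              # shape (K, M); bias column excluded
    # (5.73): J_ki = sum_j w_ji^(1) dy_k/da_j  (bias terms carry no input dependence).
    return dy_da @ W1[:, 1:]                 # shape (K, D)

# The relation (5.72) then gives the first-order effect of input errors:
# delta_y ≈ jacobian(x, W1, W2) @ delta_x.
```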