Bishop C.M., Pattern Recognition and Machine Learning (2006)
For instance, if we have individual sigmoidal activation functions at each output unit, then

$$
\frac{\partial y_k}{\partial a_j} = \delta_{kj}\,\sigma'(a_j)
\qquad (5.75)
$$

whereas for softmax outputs we have

$$
\frac{\partial y_k}{\partial a_j} = \delta_{kj}\,y_k - y_k y_j.
\qquad (5.76)
$$

We can summarize the procedure for evaluating the Jacobian matrix as follows. Apply the input vector corresponding to the point in input space at which the Jacobian matrix is to be found, and forward propagate in the usual way to obtain the activations of all of the hidden and output units in the network.
Next, for each row $k$ of the Jacobian matrix, corresponding to the output unit $k$, backpropagate using the recursive relation (5.74), starting with (5.75) or (5.76), for all of the hidden units in the network. Finally, use (5.73) to do the backpropagation to the inputs. The Jacobian can also be evaluated using an alternative forward propagation formalism, which can be derived in an analogous way to the backpropagation approach given here (Exercise 5.15).

Again, the implementation of such algorithms can be checked by using numerical differentiation in the form

$$
\frac{\partial y_k}{\partial x_i} = \frac{y_k(x_i + \epsilon) - y_k(x_i - \epsilon)}{2\epsilon} + O(\epsilon^2)
\qquad (5.77)
$$

which involves $2D$ forward propagations for a network having $D$ inputs.
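As an illustration, the check in (5.77) can be coded directly. The sketch below is not from the text; it assumes a small two-layer network with tanh hidden units and linear outputs, evaluates the Jacobian column by column with central differences, and compares it against the analytic result $J_{ki} = \sum_j w_{kj} h'(a_j) w_{ji}$ that backpropagation gives for this particular architecture.

```python
import numpy as np

# Hypothetical two-layer network: tanh hidden units, linear outputs.
# (Biases omitted to keep the sketch short.)
rng = np.random.default_rng(0)
D, M, K = 3, 5, 2                      # inputs, hidden units, outputs
W1 = rng.standard_normal((M, D))       # first-layer weights w_ji
W2 = rng.standard_normal((K, M))       # second-layer weights w_kj

def forward(x):
    a = W1 @ x                         # hidden activations a_j
    z = np.tanh(a)                     # hidden unit outputs z_j = h(a_j)
    return W2 @ z                      # linear outputs y_k

def jacobian_backprop(x):
    # Analytic Jacobian J_ki = sum_j w_kj h'(a_j) w_ji for this architecture
    a = W1 @ x
    h_prime = 1.0 - np.tanh(a) ** 2    # h'(a_j) for tanh
    return W2 @ (h_prime[:, None] * W1)

def jacobian_central_diff(x, eps=1e-6):
    # Numerical check based on (5.77): 2D forward propagations
    J = np.zeros((K, D))
    for i in range(D):
        x_plus, x_minus = x.copy(), x.copy()
        x_plus[i] += eps
        x_minus[i] -= eps
        J[:, i] = (forward(x_plus) - forward(x_minus)) / (2.0 * eps)
    return J

x = rng.standard_normal(D)
print(np.max(np.abs(jacobian_backprop(x) - jacobian_central_diff(x))))  # close to zero
```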
5.4. The Hessian Matrix

We have shown how the technique of backpropagation can be used to obtain the first derivatives of an error function with respect to the weights in the network. Backpropagation can also be used to evaluate the second derivatives of the error, given by

$$
\frac{\partial^2 E}{\partial w_{ji}\,\partial w_{lk}}.
\qquad (5.78)
$$

Note that it is sometimes convenient to consider all of the weight and bias parameters as elements $w_i$ of a single vector, denoted $\mathbf{w}$, in which case the second derivatives form the elements $H_{ij}$ of the Hessian matrix $\mathbf{H}$, where $i, j \in \{1, \ldots, W\}$ and $W$ is the total number of weights and biases.
The Hessian plays an important role in many aspects of neural computing, including the following:

1. Several nonlinear optimization algorithms used for training neural networks are based on considerations of the second-order properties of the error surface, which are controlled by the Hessian matrix (Bishop and Nabney, 2008).

2. The Hessian forms the basis of a fast procedure for re-training a feed-forward network following a small change in the training data (Bishop, 1991).

3. The inverse of the Hessian has been used to identify the least significant weights in a network as part of network 'pruning' algorithms (Le Cun et al., 1990).
4. The Hessian plays a central role in the Laplace approximation for a Bayesian neural network (see Section 5.7). Its inverse is used to determine the predictive distribution for a trained network, its eigenvalues determine the values of hyperparameters, and its determinant is used to evaluate the model evidence.

Various approximation schemes have been used to evaluate the Hessian matrix for a neural network. However, the Hessian can also be calculated exactly using an extension of the backpropagation technique.
An important consideration for many applications of the Hessian is the efficiency with which it can be evaluated. If there are $W$ parameters (weights and biases) in the network, then the Hessian matrix has dimensions $W \times W$ and so the computational effort needed to evaluate the Hessian will scale like $O(W^2)$ for each pattern in the data set. As we shall see, there are efficient methods for evaluating the Hessian whose scaling is indeed $O(W^2)$.

5.4.1 Diagonal approximation

Some of the applications for the Hessian matrix discussed above require the inverse of the Hessian, rather than the Hessian itself. For this reason, there has been some interest in using a diagonal approximation to the Hessian, in other words one that simply replaces the off-diagonal elements with zeros, because its inverse is trivial to evaluate. Again, we shall consider an error function that consists of a sum of terms, one for each pattern in the data set, so that $E = \sum_n E_n$.
The Hessian can then be obtained by considering one pattern at a time, and then summing the results over all patterns. From (5.48), the diagonal elements of the Hessian, for pattern $n$, can be written

$$
\frac{\partial^2 E_n}{\partial w_{ji}^2} = \frac{\partial^2 E_n}{\partial a_j^2}\, z_i^2.
\qquad (5.79)
$$

Using (5.48) and (5.49), the second derivatives on the right-hand side of (5.79) can be found recursively using the chain rule of differential calculus to give a backpropagation equation of the form

$$
\frac{\partial^2 E_n}{\partial a_j^2} = h'(a_j)^2 \sum_k \sum_{k'} w_{kj} w_{k'j} \frac{\partial^2 E_n}{\partial a_k\,\partial a_{k'}} + h''(a_j) \sum_k w_{kj} \frac{\partial E_n}{\partial a_k}.
\qquad (5.80)
$$

If we now neglect off-diagonal elements in the second-derivative terms, we obtain (Becker and Le Cun, 1989; Le Cun et al., 1990)

$$
\frac{\partial^2 E_n}{\partial a_j^2} = h'(a_j)^2 \sum_k w_{kj}^2 \frac{\partial^2 E_n}{\partial a_k^2} + h''(a_j) \sum_k w_{kj} \frac{\partial E_n}{\partial a_k}.
\qquad (5.81)
$$

Note that the number of computational steps required to evaluate this approximation is $O(W)$, where $W$ is the total number of weight and bias parameters in the network, compared with $O(W^2)$ for the full Hessian.
Ricotti et al. (1988) also used the diagonal approximation to the Hessian, but they retained all terms in the evaluation of $\partial^2 E_n / \partial a_j^2$ and so obtained exact expressions for the diagonal terms. Note that this no longer has $O(W)$ scaling. The major problem with diagonal approximations, however, is that in practice the Hessian is typically found to be strongly nondiagonal, and so these approximations, which are driven mainly by computational convenience, must be treated with care.
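To make the recursion concrete, here is an illustrative sketch (not from the text) that evaluates the diagonal approximation (5.79) and (5.81) for a hypothetical two-layer network with tanh hidden units, linear outputs and a sum-of-squares error, for which $\partial E_n/\partial a_k = y_k - t_k$ and $\partial^2 E_n/\partial a_k^2 = 1$ at the output units.

```python
import numpy as np

# Hypothetical two-layer network: tanh hidden units, linear outputs,
# sum-of-squares error E_n = 0.5 * sum_k (y_k - t_k)^2.  Biases omitted.
rng = np.random.default_rng(1)
D, M, K = 4, 6, 2
W1 = rng.standard_normal((M, D))        # first-layer weights w_ji
W2 = rng.standard_normal((K, M))        # second-layer weights w_kj

def diagonal_hessian(x, t):
    a = W1 @ x                          # hidden activations a_j
    z = np.tanh(a)                      # hidden outputs z_j
    y = W2 @ z                          # linear outputs y_k

    # Output units: dE/da_k = y_k - t_k and d2E/da_k^2 = 1 for this error
    dE_da_out = y - t
    d2E_da2_out = np.ones(K)

    # Hidden units: diagonal backpropagation equation (5.81)
    h1 = 1.0 - z ** 2                   # h'(a_j)  for tanh
    h2 = -2.0 * z * h1                  # h''(a_j) for tanh
    d2E_da2_hid = (h1 ** 2) * ((W2 ** 2).T @ d2E_da2_out) + h2 * (W2.T @ dE_da_out)

    # Diagonal Hessian elements from (5.79): d2E/dw_ji^2 = d2E/da_j^2 * (input to w_ji)^2
    diag_W1 = np.outer(d2E_da2_hid, x ** 2)     # first-layer weights
    diag_W2 = np.outer(d2E_da2_out, z ** 2)     # second-layer weights
    return np.concatenate([diag_W1.ravel(), diag_W2.ravel()])

x, t = rng.standard_normal(D), rng.standard_normal(K)
print(diagonal_hessian(x, t).shape)     # (M*D + K*M,) diagonal entries
```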
5.4.2 Outer product approximation

When neural networks are applied to regression problems, it is common to use a sum-of-squares error function of the form

$$
E = \frac{1}{2} \sum_{n=1}^{N} (y_n - t_n)^2
\qquad (5.82)
$$

where we have considered the case of a single output in order to keep the notation simple (the extension to several outputs is straightforward; Exercise 5.16). We can then write the Hessian matrix in the form

$$
\mathbf{H} = \nabla\nabla E = \sum_{n=1}^{N} \nabla y_n (\nabla y_n)^{\rm T} + \sum_{n=1}^{N} (y_n - t_n) \nabla\nabla y_n.
\qquad (5.83)
$$

If the network has been trained on the data set, and its outputs $y_n$ happen to be very close to the target values $t_n$, then the second term in (5.83) will be small and can be neglected. More generally, however, it may be appropriate to neglect this term by the following argument.
Recall from Section 1.5.5 that the optimal function that minimizes a sum-of-squares loss is the conditional average of the target data. The quantity $(y_n - t_n)$ is then a random variable with zero mean. If we assume that its value is uncorrelated with the value of the second derivative term on the right-hand side of (5.83), then the whole term will average to zero in the summation over $n$ (Exercise 5.17).

By neglecting the second term in (5.83), we arrive at the Levenberg–Marquardt approximation or outer product approximation (because the Hessian matrix is built up from a sum of outer products of vectors), given by

$$
\mathbf{H} \simeq \sum_{n=1}^{N} \mathbf{b}_n \mathbf{b}_n^{\rm T}
\qquad (5.84)
$$

where $\mathbf{b}_n = \nabla y_n = \nabla a_n$ because the activation function for the output units is simply the identity.
Evaluation of the outer product approximation for the Hessian is straightforward as it only involves first derivatives of the error function, which can be evaluated efficiently in $O(W)$ steps using standard backpropagation. The elements of the matrix can then be found in $O(W^2)$ steps by simple multiplication. It is important to emphasize that this approximation is only likely to be valid for a network that has been trained appropriately, and that for a general network mapping the second derivative terms on the right-hand side of (5.83) will typically not be negligible.

In the case of the cross-entropy error function for a network with logistic sigmoid output-unit activation functions, the corresponding approximation (Exercise 5.19) is given by

$$
\mathbf{H} \simeq \sum_{n=1}^{N} y_n (1 - y_n)\, \mathbf{b}_n \mathbf{b}_n^{\rm T}.
\qquad (5.85)
$$

An analogous result can be obtained for multiclass networks having softmax output-unit activation functions (Exercise 5.20).
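The following sketch (not from the text) illustrates the outer product approximation (5.84) for a hypothetical two-layer regression network with one linear output: each vector $\mathbf{b}_n = \nabla_{\mathbf{w}} y_n$ is obtained by ordinary backpropagation, and the approximate Hessian is accumulated as a sum of outer products.

```python
import numpy as np

# Hypothetical two-layer regression network: tanh hidden units, one linear
# output y = w2 . tanh(W1 x).  Biases omitted for brevity.
rng = np.random.default_rng(2)
D, M, N = 3, 5, 100
W1 = rng.standard_normal((M, D))       # first-layer weights
w2 = rng.standard_normal(M)            # second-layer weights
X = rng.standard_normal((N, D))        # data set of N input vectors

def gradient_of_output(x):
    """b_n = gradient of the network output y with respect to all weights."""
    a = W1 @ x
    z = np.tanh(a)
    dW1 = np.outer(w2 * (1.0 - z ** 2), x)   # dy/dW1_ji = w2_j h'(a_j) x_i
    dw2 = z                                   # dy/dw2_j  = z_j
    return np.concatenate([dW1.ravel(), dw2])

# Outer product (Levenberg-Marquardt) approximation, equation (5.84)
W_total = M * D + M
H_approx = np.zeros((W_total, W_total))
for x in X:
    b = gradient_of_output(x)
    H_approx += np.outer(b, b)

print(H_approx.shape)                  # (W_total, W_total), symmetric by construction
```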
5.4.3 Inverse Hessian

We can use the outer-product approximation to develop a computationally efficient procedure for approximating the inverse of the Hessian (Hassibi and Stork, 1993). First we write the outer-product approximation in matrix notation as

$$
\mathbf{H}_N = \sum_{n=1}^{N} \mathbf{b}_n \mathbf{b}_n^{\rm T}
\qquad (5.86)
$$

where $\mathbf{b}_n \equiv \nabla_{\mathbf{w}} a_n$ is the contribution to the gradient of the output unit activation arising from data point $n$.
We now derive a sequential procedure for building up the Hessian by including data points one at a time. Suppose we have already obtained the inverse Hessian using the first $L$ data points. By separating off the contribution from data point $L+1$, we obtain

$$
\mathbf{H}_{L+1} = \mathbf{H}_L + \mathbf{b}_{L+1} \mathbf{b}_{L+1}^{\rm T}.
\qquad (5.87)
$$

In order to evaluate the inverse of the Hessian, we now consider the matrix identity

$$
\left(\mathbf{M} + \mathbf{v}\mathbf{v}^{\rm T}\right)^{-1} = \mathbf{M}^{-1} - \frac{\left(\mathbf{M}^{-1}\mathbf{v}\right)\left(\mathbf{v}^{\rm T}\mathbf{M}^{-1}\right)}{1 + \mathbf{v}^{\rm T}\mathbf{M}^{-1}\mathbf{v}}
\qquad (5.88)
$$

which is simply a special case of the Woodbury identity (C.7). If we now identify $\mathbf{H}_L$ with $\mathbf{M}$ and $\mathbf{b}_{L+1}$ with $\mathbf{v}$, we obtain

$$
\mathbf{H}_{L+1}^{-1} = \mathbf{H}_L^{-1} - \frac{\mathbf{H}_L^{-1} \mathbf{b}_{L+1} \mathbf{b}_{L+1}^{\rm T} \mathbf{H}_L^{-1}}{1 + \mathbf{b}_{L+1}^{\rm T} \mathbf{H}_L^{-1} \mathbf{b}_{L+1}}.
\qquad (5.89)
$$

In this way, data points are sequentially absorbed until $L+1 = N$ and the whole data set has been processed. This result therefore represents a procedure for evaluating the inverse of the Hessian using a single pass through the data set. The initial matrix $\mathbf{H}_0$ is chosen to be $\alpha\mathbf{I}$, where $\alpha$ is a small quantity, so that the algorithm actually finds the inverse of $\mathbf{H} + \alpha\mathbf{I}$. The results are not particularly sensitive to the precise value of $\alpha$. Extension of this algorithm to networks having more than one output is straightforward (Exercise 5.21).
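As an illustrative sketch (not from the text), the update (5.89) with the initialization $\mathbf{H}_0 = \alpha\mathbf{I}$ can be implemented in a few lines; here the vectors $\mathbf{b}_n$ are simply taken as given (for instance, produced by backpropagation as in the previous sketch).

```python
import numpy as np

def sequential_inverse_hessian(b_vectors, alpha=1e-3):
    """Single-pass approximation to the inverse Hessian via (5.89).

    b_vectors: array of shape (N, W) whose rows are the gradient vectors b_n.
    alpha:     small constant; the result approximates (H + alpha * I)^(-1).
    """
    N, W = b_vectors.shape
    H_inv = np.eye(W) / alpha              # inverse of H_0 = alpha * I
    for b in b_vectors:                    # absorb data points one at a time
        Hb = H_inv @ b
        H_inv -= np.outer(Hb, Hb) / (1.0 + b @ Hb)
    return H_inv

# Usage: compare against direct inversion of the outer product approximation
rng = np.random.default_rng(3)
B = rng.standard_normal((200, 10))         # stand-in for the vectors b_n
H = B.T @ B + 1e-3 * np.eye(10)            # H + alpha * I
print(np.max(np.abs(sequential_inverse_hessian(B) - np.linalg.inv(H))))  # small
```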
We note here that the Hessian matrix can sometimes be calculated indirectly as part of the network training algorithm. In particular, quasi-Newton nonlinear optimization algorithms gradually build up an approximation to the inverse of the Hessian during training. Such algorithms are discussed in detail in Bishop and Nabney (2008).

5.4.4 Finite differences

As in the case of the first derivatives of the error function, we can find the second derivatives by using finite differences, with accuracy limited by numerical precision. If we perturb each possible pair of weights in turn, we obtain

$$
\frac{\partial^2 E}{\partial w_{ji}\,\partial w_{lk}} = \frac{1}{4\epsilon^2} \left\{ E(w_{ji}+\epsilon,\, w_{lk}+\epsilon) - E(w_{ji}+\epsilon,\, w_{lk}-\epsilon) - E(w_{ji}-\epsilon,\, w_{lk}+\epsilon) + E(w_{ji}-\epsilon,\, w_{lk}-\epsilon) \right\} + O(\epsilon^2).
\qquad (5.90)
$$
Again, by using a symmetrical central differences formulation, we ensure that the residual errors are $O(\epsilon^2)$ rather than $O(\epsilon)$. Because there are $W^2$ elements in the Hessian matrix, and because the evaluation of each element requires four forward propagations each needing $O(W)$ operations (per pattern), we see that this approach will require $O(W^3)$ operations to evaluate the complete Hessian. It therefore has poor scaling properties, although in practice it is very useful as a check on the software implementation of backpropagation methods.

A more efficient version of numerical differentiation can be found by applying central differences to the first derivatives of the error function, which are themselves calculated using backpropagation.
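As a sketch of the check in (5.90) (not from the text), the following code evaluates the full Hessian of an assumed error function by central differences over pairs of parameters; any differentiable error function of a flat weight vector could be substituted.

```python
import numpy as np

def finite_difference_hessian(error_fn, w, eps=1e-4):
    """Central-difference Hessian of error_fn at w, following (5.90).

    error_fn: callable mapping a flat weight vector to a scalar error.
    Costs O(W^2) error evaluations, i.e. O(W^3) operations overall,
    so it is useful only as a check on analytic or backprop results.
    """
    W = w.size
    H = np.zeros((W, W))
    for i in range(W):
        for j in range(W):
            w_pp, w_pm = w.copy(), w.copy()
            w_mp, w_mm = w.copy(), w.copy()
            w_pp[i] += eps; w_pp[j] += eps
            w_pm[i] += eps; w_pm[j] -= eps
            w_mp[i] -= eps; w_mp[j] += eps
            w_mm[i] -= eps; w_mm[j] -= eps
            H[i, j] = (error_fn(w_pp) - error_fn(w_pm)
                       - error_fn(w_mp) + error_fn(w_mm)) / (4.0 * eps ** 2)
    return H

# Usage with a simple quadratic error whose exact Hessian is known
rng = np.random.default_rng(4)
A = rng.standard_normal((6, 6))
A = A @ A.T                                    # exact Hessian of the quadratic below
error = lambda w: 0.5 * w @ A @ w
w0 = rng.standard_normal(6)
print(np.max(np.abs(finite_difference_hessian(error, w0) - A)))   # small difference
```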