Bishop C.M., Pattern Recognition and Machine Learning (2006)
This gives

$$
\frac{\partial^2 E}{\partial w_{ji}\,\partial w_{lk}}
= \frac{1}{2\epsilon}\left\{
\frac{\partial E}{\partial w_{ji}}(w_{lk}+\epsilon)
- \frac{\partial E}{\partial w_{ji}}(w_{lk}-\epsilon)
\right\} + O(\epsilon^{2}).
\tag{5.91}
$$

Because there are now only W weights to be perturbed, and because the gradients can be evaluated in O(W) steps, we see that this method gives the Hessian in O(W^2) operations.
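As an illustration of this scheme, the following sketch (not from the book; the quadratic test function is purely hypothetical) applies central differences of the form (5.91) to a user-supplied gradient routine, perturbing one weight at a time:

```python
import numpy as np

def hessian_central_diff(grad, w, eps=1e-5):
    """Approximate the Hessian via central differences of the gradient,
    following (5.91): each of the W weights is perturbed by +/- eps and the
    analytically computed gradient is re-evaluated, giving O(W^2) cost."""
    W = w.size
    H = np.zeros((W, W))
    for k in range(W):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[k] += eps
        w_minus[k] -= eps
        H[:, k] = (grad(w_plus) - grad(w_minus)) / (2.0 * eps)
    return 0.5 * (H + H.T)   # symmetrize, since the exact Hessian is symmetric

# Hypothetical test: a quadratic error E(w) = 0.5 w^T A w has Hessian A.
A = np.array([[2.0, 0.3], [0.3, 1.0]])
print(hessian_central_diff(lambda w: A @ w, np.zeros(2)))
```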
5.4.5 Exact evaluation of the Hessian

So far, we have considered various approximation schemes for evaluating the Hessian matrix or its inverse. The Hessian can also be evaluated exactly, for a network of arbitrary feed-forward topology, using an extension of the technique of backpropagation used to evaluate first derivatives, which shares many of its desirable features, including computational efficiency (Bishop, 1991; Bishop, 1992). It can be applied to any differentiable error function that can be expressed as a function of the network outputs and to networks having arbitrary differentiable activation functions.
The number of computational steps needed to evaluate the Hessian scales like O(W^2). Similar algorithms have also been considered by Buntine and Weigend (1993).

Here we consider the specific case of a network having two layers of weights, for which the required equations are easily derived (Exercise 5.22). We shall use indices i and i' to denote inputs, indices j and j' to denote hidden units, and indices k and k' to denote outputs. We first define

$$
\delta_k = \frac{\partial E_n}{\partial a_k}, \qquad
M_{kk'} \equiv \frac{\partial^2 E_n}{\partial a_k\,\partial a_{k'}}
\tag{5.92}
$$

where E_n is the contribution to the error from data point n. The Hessian matrix for this network can then be considered in three separate blocks, as follows.
1. Both weights in the second layer:

$$
\frac{\partial^2 E_n}{\partial w^{(2)}_{kj}\,\partial w^{(2)}_{k'j'}} = z_j z_{j'} M_{kk'}.
\tag{5.93}
$$

2. Both weights in the first layer:

$$
\frac{\partial^2 E_n}{\partial w^{(1)}_{ji}\,\partial w^{(1)}_{j'i'}}
= x_i x_{i'}\, h''(a_{j'})\, I_{jj'} \sum_k w^{(2)}_{kj'}\,\delta_k
+ x_i x_{i'}\, h'(a_{j'})\, h'(a_j) \sum_k \sum_{k'} w^{(2)}_{k'j'} w^{(2)}_{kj}\, M_{kk'}.
\tag{5.94}
$$

3. One weight in each layer:

$$
\frac{\partial^2 E_n}{\partial w^{(1)}_{ji}\,\partial w^{(2)}_{kj'}}
= x_i\, h'(a_j)\left\{ \delta_k I_{jj'} + z_{j'} \sum_{k'} w^{(2)}_{k'j}\, M_{kk'} \right\}.
\tag{5.95}
$$

Here I_{jj'} is the j, j' element of the identity matrix. If one or both of the weights is a bias term, then the corresponding expressions are obtained simply by setting the appropriate activation(s) to 1.
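A minimal NumPy sketch of these three blocks is given below, assuming a tanh hidden layer, linear output units and a sum-of-squares error, so that δ_k = y_k − t_k and M_{kk'} reduces to the identity matrix; the function name and array layout are illustrative rather than taken from the book.

```python
import numpy as np

def exact_hessian_blocks(x, t, W1, W2):
    """Per-pattern Hessian blocks (5.93)-(5.95) for a two-layer tanh network
    with linear outputs and a sum-of-squares error, so M_{kk'} is the identity."""
    a = W1 @ x                     # first-layer pre-activations a_j
    z = np.tanh(a)                 # hidden-unit outputs z_j = h(a_j)
    y = W2 @ z                     # linear outputs, a_k = y_k
    delta = y - t                  # delta_k = dE_n/da_k
    K, J = W2.shape
    M = np.eye(K)                  # M_{kk'} = d^2 E_n / (da_k da_k')
    hp = 1.0 - z ** 2              # h'(a_j) for tanh
    hpp = -2.0 * z * hp            # h''(a_j) for tanh

    # (5.93): both weights in the second layer; indices (k, j, k', j')
    H22 = np.einsum('j,J,kK->kjKJ', z, z, M)

    # (5.94): both weights in the first layer; indices (j, i, j', i')
    b = W2.T @ delta               # sum_k w2_{kj} delta_k
    C = W2.T @ M @ W2              # sum_{k,k'} w2_{kj} M_{kk'} w2_{k'j'}
    inner = np.diag(hpp * b) + np.outer(hp, hp) * C
    H11 = np.einsum('jJ,i,I->jiJI', inner, x, x)

    # (5.95): one weight in each layer; indices (j, i, k, j')
    MW2 = M @ W2                   # sum_{k'} M_{kk'} w2_{k'j}
    mixed = (np.einsum('k,jJ->jkJ', delta, np.eye(J)) +
             np.einsum('J,kj->jkJ', z, MW2))
    H12 = np.einsum('i,j,jkJ->jikJ', x, hp, mixed)
    return H22, H11, H12
```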
Inclusion of skip-layer connections is straightforward (Exercise 5.23).

5.4.6 Fast multiplication by the Hessian

For many applications of the Hessian, the quantity of interest is not the Hessian matrix H itself but the product of H with some vector v. We have seen that the evaluation of the Hessian takes O(W^2) operations, and it also requires storage that is O(W^2). The vector v^T H that we wish to calculate, however, has only W elements, so instead of computing the Hessian as an intermediate step, we can instead try to find an efficient approach to evaluating v^T H directly in a way that requires only O(W) operations.

To do this, we first note that

$$
v^{\mathrm T} H = v^{\mathrm T} \nabla(\nabla E)
\tag{5.96}
$$

where ∇ denotes the gradient operator in weight space. We can then write down the standard forward-propagation and backpropagation equations for the evaluation of ∇E and apply (5.96) to these equations to give a set of forward-propagation and backpropagation equations for the evaluation of v^T H (Møller, 1993; Pearlmutter, 1994).
This corresponds to acting on the original forward-propagation and backpropagation equations with a differential operator v^T ∇. Pearlmutter (1994) used the notation R{·} to denote the operator v^T ∇, and we shall follow this convention. The analysis is straightforward and makes use of the usual rules of differential calculus, together with the result

$$
R\{w\} = v.
\tag{5.97}
$$

The technique is best illustrated with a simple example, and again we choose a two-layer network of the form shown in Figure 5.1, with linear output units and a sum-of-squares error function. As before, we consider the contribution to the error function from one pattern in the data set. The required vector is then obtained as usual by summing over the contributions from each of the patterns separately.
For the two-layer network, the forward-propagation equations are given by

$$
a_j = \sum_i w_{ji} x_i
\tag{5.98}
$$

$$
z_j = h(a_j)
\tag{5.99}
$$

$$
y_k = \sum_j w_{kj} z_j.
\tag{5.100}
$$

We now act on these equations using the R{·} operator to obtain a set of forward-propagation equations in the form

$$
R\{a_j\} = \sum_i v_{ji} x_i
\tag{5.101}
$$

$$
R\{z_j\} = h'(a_j)\, R\{a_j\}
\tag{5.102}
$$

$$
R\{y_k\} = \sum_j w_{kj} R\{z_j\} + \sum_j v_{kj} z_j
\tag{5.103}
$$

where v_{ji} is the element of the vector v that corresponds to the weight w_{ji}. Quantities of the form R{z_j}, R{a_j} and R{y_k} are to be regarded as new variables whose values are found using the above equations.

Because we are considering a sum-of-squares error function, we have the following standard backpropagation expressions:

$$
\delta_k = y_k - t_k
\tag{5.104}
$$

$$
\delta_j = h'(a_j) \sum_k w_{kj}\delta_k.
\tag{5.105}
$$

Again, we act on these equations with the R{·} operator to obtain a set of backpropagation equations in the form

$$
R\{\delta_k\} = R\{y_k\}
\tag{5.106}
$$

$$
R\{\delta_j\} = h''(a_j)\, R\{a_j\} \sum_k w_{kj}\delta_k
+ h'(a_j) \sum_k v_{kj}\delta_k
+ h'(a_j) \sum_k w_{kj} R\{\delta_k\}.
\tag{5.107}
$$

Finally, we have the usual equations for the first derivatives of the error

$$
\frac{\partial E}{\partial w_{kj}} = \delta_k z_j
\tag{5.108}
$$

$$
\frac{\partial E}{\partial w_{ji}} = \delta_j x_i
\tag{5.109}
$$
and acting on these with the R{·} operator, we obtain expressions for the elements of the vector v^T H

$$
R\!\left\{\frac{\partial E}{\partial w_{kj}}\right\} = R\{\delta_k\}\, z_j + \delta_k\, R\{z_j\}
\tag{5.110}
$$

$$
R\!\left\{\frac{\partial E}{\partial w_{ji}}\right\} = x_i\, R\{\delta_j\}.
\tag{5.111}
$$

The implementation of this algorithm involves the introduction of additional variables R{a_j}, R{z_j} and R{δ_j} for the hidden units and R{δ_k} and R{y_k} for the output units. For each input pattern, the values of these quantities can be found using the above results, and the elements of v^T H are then given by (5.110) and (5.111). An elegant aspect of this technique is that the equations for evaluating v^T H mirror closely those for standard forward and backward propagation, and so the extension of existing software to compute this product is typically straightforward.
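A compact sketch of equations (5.98)–(5.111) for a single pattern is shown below, again assuming a tanh hidden layer, linear outputs and a sum-of-squares error (so that h''(a) = −2 z h'(a)); the names V1 and V2 simply hold the elements of v arranged like the corresponding weight matrices.

```python
import numpy as np

def hessian_vector_product(x, t, W1, W2, V1, V2):
    """R{.}-operator evaluation of v^T H for one pattern, for a two-layer
    tanh network with linear outputs and a sum-of-squares error; V1, V2
    hold the elements of v arranged like W1 and W2."""
    # Forward propagation, (5.98)-(5.100)
    a = W1 @ x
    z = np.tanh(a)
    y = W2 @ z

    # R-forward propagation, (5.101)-(5.103)
    hp = 1.0 - z ** 2              # h'(a_j)
    Ra = V1 @ x
    Rz = hp * Ra
    Ry = W2 @ Rz + V2 @ z

    # Standard backpropagation, (5.104)-(5.105)
    dk = y - t
    dj = hp * (W2.T @ dk)          # delta_j, (5.105); not needed for v^T H below

    # R-backpropagation, (5.106)-(5.107); h''(a_j) = -2 z_j h'(a_j) for tanh
    hpp = -2.0 * z * hp
    Rdk = Ry
    Rdj = hpp * Ra * (W2.T @ dk) + hp * (V2.T @ dk) + hp * (W2.T @ Rdk)

    # Elements of v^T H, (5.110)-(5.111), laid out like W2 and W1
    vH_W2 = np.outer(Rdk, z) + np.outer(dk, Rz)
    vH_W1 = np.outer(Rdj, x)
    return vH_W1, vH_W2
```

Setting (V1, V2) to a unit vector extracts a single column of H, which is the usage described in the next paragraph.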
. . , 1, . . . , 0) each of which picks out one column of the Hessian. Thisleads to a formalism that is analytically equivalent to the backpropagation procedureof Bishop (1992), as described in Section 5.4.5, though with some loss of efficiencydue to redundant calculations.5.5. Regularization in Neural NetworksThe number of input and outputs units in a neural network is generally determinedby the dimensionality of the data set, whereas the number M of hidden units is a freeparameter that can be adjusted to give the best predictive performance.
Note that M controls the number of parameters (weights and biases) in the network, and so we might expect that in a maximum likelihood setting there will be an optimum value of M that gives the best generalization performance, corresponding to the optimum balance between under-fitting and over-fitting. Figure 5.9 shows an example of the effect of different values of M for the sinusoidal regression problem.

The generalization error, however, is not a simple function of M due to the presence of local minima in the error function, as illustrated in Figure 5.10. Here we see the effect of choosing multiple random initializations for the weight vector for a range of values of M.
The overall best validation set performance in this case occurred for a particular solution having M = 8. In practice, one approach to choosing M is in fact to plot a graph of the kind shown in Figure 5.10 and then to choose the specific solution having the smallest validation set error.
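The procedure can be sketched as follows. This toy example is not the book's experiment: it uses synthetic sinusoidal data, plain batch gradient descent in place of the scaled conjugate-gradient optimizer mentioned in the figure captions, and only a few restarts per network size; all names and settings are chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic sinusoidal data (a stand-in for the book's data set)
X = rng.uniform(0.0, 1.0, size=60)
T = np.sin(2 * np.pi * X) + rng.normal(scale=0.2, size=60)
Xtr, Ttr, Xva, Tva = X[:30], T[:30], X[30:], T[30:]

def init(M):
    # isotropic Gaussian initialization, as in the caption of Figure 5.10
    return [rng.normal(0.0, np.sqrt(10.0), size=s) for s in ((M,), (M,), (M,), (1,))]

def forward(params, x):
    w1, b1, w2, b2 = params
    z = np.tanh(np.outer(x, w1) + b1)       # hidden activations, one row per input
    return z @ w2 + b2, z

def train(params, x, t, lr=0.01, steps=5000):
    w1, b1, w2, b2 = params                  # plain batch gradient descent
    for _ in range(steps):
        y, z = forward((w1, b1, w2, b2), x)
        d = (y - t) / len(x)                 # output-unit errors
        dz = np.outer(d, w2) * (1 - z ** 2)  # hidden-unit errors
        w2 -= lr * (z.T @ d);  b2 -= lr * d.sum()
        w1 -= lr * (dz.T @ x); b1 -= lr * dz.sum(axis=0)
    return w1, b1, w2, b2

# Several random restarts for each M; keep the solution with the
# smallest validation-set error, as described in the text.
candidates = [(M, train(init(M), Xtr, Ttr)) for M in (1, 3, 10) for _ in range(5)]
best_M, best_params = min(
    candidates, key=lambda c: np.mean((forward(c[1], Xva)[0] - Tva) ** 2))
print("selected M:", best_M)
```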
Figure 5.9  Examples of two-layer networks trained on 10 data points drawn from the sinusoidal data set. The graphs show the result of fitting networks having M = 1, 3 and 10 hidden units, respectively, by minimizing a sum-of-squares error function using a scaled conjugate-gradient algorithm.

There are, however, other ways to control the complexity of a neural network model in order to avoid over-fitting. From our discussion of polynomial curve fitting in Chapter 1, we see that an alternative approach is to choose a relatively large value for M and then to control complexity by the addition of a regularization term to the error function. The simplest regularizer is the quadratic, giving a regularized error of the form

$$
\widetilde{E}(\mathbf{w}) = E(\mathbf{w}) + \frac{\lambda}{2}\mathbf{w}^{\mathrm T}\mathbf{w}.
\tag{5.112}
$$

This regularizer is also known as weight decay and has been discussed at length in Chapter 3. The effective model complexity is then determined by the choice of the regularization coefficient λ. As we have seen previously, this regularizer can be interpreted as the negative logarithm of a zero-mean Gaussian prior distribution over the weight vector w.
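In code, the change is a one-line wrapper around whatever routine already returns the unregularized error and gradient (the name error_and_grad below is a placeholder); the extra gradient term λw is what gives weight decay its name.

```python
import numpy as np

def regularized_error_and_grad(error_and_grad, w, lam):
    """Wrap a routine returning E(w) and grad E(w) (e.g. from backpropagation)
    with the quadratic penalty of (5.112); the added gradient term lam * w
    shrinks every weight a little at each update."""
    E, g = error_and_grad(w)
    return E + 0.5 * lam * (w @ w), g + lam * w
```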
To illustrate this,consider a multilayer perceptron network having two layers of weights and linearoutput units, which performs a mapping from a set of input variables {xi } to a setof output variables {yk }. The activations of the hidden units in the first hidden layerFigure 5.10Plot of the sum-of-squares test-seterror for the polynomial data set versus the number of hidden units in thenetwork, with 30 random starts foreach network size, showing the effect of local minima. For each newstart, the weight vector was initialized by sampling from an isotropicGaussian distribution having a meanof zero and a variance of 10.160140120100806002468102585.
take the form

$$
z_j = h\!\left(\sum_i w_{ji} x_i + w_{j0}\right)
\tag{5.113}
$$

while the activations of the output units are given by

$$
y_k = \sum_j w_{kj} z_j + w_{k0}.
\tag{5.114}
$$

Suppose we perform a linear transformation of the input data of the form

$$
x_i \;\rightarrow\; \widetilde{x}_i = a x_i + b.
\tag{5.115}
$$

Then we can arrange for the mapping performed by the network to be unchanged by making a corresponding linear transformation of the weights and biases from the inputs to the units in the hidden layer of the form

$$
w_{ji} \;\rightarrow\; \widetilde{w}_{ji} = \frac{1}{a}\, w_{ji}
\tag{5.116}
$$

$$
w_{j0} \;\rightarrow\; \widetilde{w}_{j0} = w_{j0} - \frac{b}{a}\sum_i w_{ji}.
\tag{5.117}
$$

Similarly, a linear transformation of the output variables of the network of the form

$$
y_k \;\rightarrow\; \widetilde{y}_k = c y_k + d
\tag{5.118}
$$

can be achieved by making a transformation of the second-layer weights and biases using

$$
w_{kj} \;\rightarrow\; \widetilde{w}_{kj} = c\, w_{kj}
\tag{5.119}
$$

$$
w_{k0} \;\rightarrow\; \widetilde{w}_{k0} = c\, w_{k0} + d.
\tag{5.120}
$$

If we train one network using the original data and one network using data for which the input and/or target variables are transformed by one of the above linear transformations, then consistency requires that we should obtain equivalent networks that differ only by the linear transformation of the weights as given (Exercise 5.24).
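This equivalence is easy to verify numerically. The sketch below (illustrative names, arbitrary transformation constants, and tanh chosen for h) applies (5.115)–(5.120) to a small network and checks that the outputs transform exactly according to (5.118).

```python
import numpy as np

rng = np.random.default_rng(0)
D, J, K = 3, 5, 2                                  # illustrative sizes
W1, b1 = rng.normal(size=(J, D)), rng.normal(size=J)
W2, b2 = rng.normal(size=(K, J)), rng.normal(size=K)

def net(x, W1, b1, W2, b2):
    return W2 @ np.tanh(W1 @ x + b1) + b2          # eqs (5.113)-(5.114)

a, b, c, d = 2.0, -0.5, 3.0, 1.0                   # transformation constants
x = rng.normal(size=D)
x_tilde = a * x + b                                # transformed inputs, (5.115)

W1t = W1 / a                                       # (5.116)
b1t = b1 - (b / a) * W1.sum(axis=1)                # (5.117)
W2t = c * W2                                       # (5.119)
b2t = c * b2 + d                                   # (5.120)

y = net(x, W1, b1, W2, b2)
y_tilde = net(x_tilde, W1t, b1t, W2t, b2t)
print(np.allclose(y_tilde, c * y + d))             # True: outputs obey (5.118)
```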
Any regularizer should be consistent with this property, otherwise it arbitrarily favours one solution over another, equivalent one. Clearly, simple weight decay (5.112), which treats all weights and biases on an equal footing, does not satisfy this property.

We therefore look for a regularizer which is invariant under the linear transformations (5.116), (5.117), (5.119) and (5.120). These require that the regularizer should be invariant to re-scaling of the weights and to shifts of the biases. Such a regularizer is given by

$$
\frac{\lambda_1}{2}\sum_{w\in\mathcal{W}_1} w^2
+ \frac{\lambda_2}{2}\sum_{w\in\mathcal{W}_2} w^2
\tag{5.121}
$$

where W1 denotes the set of weights in the first layer, W2 denotes the set of weights in the second layer, and biases are excluded from the summations. This regularizer