Bishop C.M., Pattern Recognition and Machine Learning (2006)
From Section 4.5.2, we see that this distribution is Gaussian with mean a_MAP ≡ a(x, w_MAP), and variance

σ_a²(x) = bᵀ(x) A⁻¹ b(x).        (5.188)

Finally, to obtain the predictive distribution, we must marginalize over a using

p(t = 1|x, D) = ∫ σ(a) p(a|x, D) da.        (5.189)

Figure 5.23  An illustration of the Laplace approximation for a Bayesian neural network having 8 hidden units with 'tanh' activation functions and a single logistic-sigmoid output unit. The weight parameters were found using scaled conjugate gradients, and the hyperparameter α was optimized using the evidence framework.
On the left is the result of using the simple approximation (5.185) based on a point estimate w_MAP of the parameters, in which the green curve shows the y = 0.5 decision boundary, and the other contours correspond to output probabilities of y = 0.1, 0.3, 0.7, and 0.9. On the right is the corresponding result obtained using (5.190). Note that the effect of marginalization is to spread out the contours and to make the predictions less confident, so that at each input point x, the posterior probabilities are shifted towards 0.5, while the y = 0.5 contour itself is unaffected.

The convolution of a Gaussian with a logistic sigmoid is intractable. We therefore apply the approximation (4.153) to (5.189), giving

p(t = 1|x, D) = σ(κ(σ_a²) bᵀ w_MAP)        (5.190)

where κ(·) is defined by (4.154). Recall that both σ_a² and b are functions of x. Figure 5.23 shows an example of this framework applied to the synthetic classification data set described in Appendix A.
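To make the pipeline (5.188)–(5.190) concrete, here is a minimal numerical sketch (not from the book; the function and variable names are illustrative). It takes the MAP output activation, the gradient vector b, and the posterior precision matrix A as inputs, computes the variance (5.188), and applies the probit-based correction κ(σ²) = (1 + πσ²/8)^(−1/2) from (4.154).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def kappa(sigma2):
    # Correction factor of (4.154): kappa(sigma^2) = (1 + pi*sigma^2/8)^(-1/2)
    return 1.0 / np.sqrt(1.0 + np.pi * sigma2 / 8.0)

def predictive_prob(a_map, b, A):
    """Approximate p(t=1|x,D) for a Laplace-approximated Bayesian classifier.

    a_map : output activation at the MAP weights (mean of p(a|x,D))
    b     : gradient of a(x, w) with respect to w at w_MAP, shape (W,)
    A     : posterior precision matrix, shape (W, W)
    """
    sigma2_a = b @ np.linalg.solve(A, b)       # variance from (5.188)
    return sigmoid(kappa(sigma2_a) * a_map)    # moderated output, in the spirit of (5.190)

# Toy usage with made-up numbers
rng = np.random.default_rng(0)
W = 5
b = rng.normal(size=W)
A = np.eye(W) * 10.0                           # a well-conditioned posterior precision
print(predictive_prob(a_map=2.0, b=b, A=A))    # pulled towards 0.5 relative to sigmoid(2.0)
```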
Exercises

5.1 Consider a two-layer network function of the form (5.7) in which the hidden unit nonlinear activation functions g(·) are given by logistic sigmoid functions of the form

σ(a) = {1 + exp(−a)}⁻¹.        (5.191)

Show that there exists an equivalent network, which computes exactly the same function, but with hidden unit activation functions given by tanh(a), where the tanh function is defined by (5.59). Hint: first find the relation between σ(a) and tanh(a), and then show that the parameters of the two networks differ by linear transformations.
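A useful starting point for Exercise 5.1 is the identity tanh(a) = 2σ(2a) − 1, equivalently σ(a) = ½ tanh(a/2) + ½. The sketch below (my own check, not part of the book) verifies numerically that a two-layer network with logistic-sigmoid hidden units is reproduced exactly by a tanh network whose first-layer weights and biases are halved, whose second-layer weights are halved, and whose output bias absorbs the remaining constant.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(1)
D, M = 3, 4                                            # input and hidden dimensions (arbitrary)
W1 = rng.normal(size=(M, D)); b1 = rng.normal(size=M)  # first-layer parameters
w2 = rng.normal(size=M);      b2 = rng.normal()        # second-layer parameters

x = rng.normal(size=D)

# Network with logistic-sigmoid hidden units
y_sigmoid = b2 + w2 @ sigmoid(W1 @ x + b1)

# Equivalent network with tanh hidden units:
#   sigma(a) = 0.5 * tanh(a/2) + 0.5, so halve the first-layer parameters,
#   halve the second-layer weights, and absorb the constant into the output bias.
y_tanh = (b2 + 0.5 * w2.sum()) + (0.5 * w2) @ np.tanh(0.5 * (W1 @ x + b1))

print(np.isclose(y_sigmoid, y_tanh))   # True
```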
5.2 www Show that maximizing the likelihood function under the conditional distribution (5.16) for a multioutput neural network is equivalent to minimizing the sum-of-squares error function (5.11).
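A brief note on why Exercise 5.2 works out (my own summary, assuming (5.16) has the usual isotropic form p(t|x, w) = N(t|y(x, w), β⁻¹I)): for N i.i.d. observations the negative log likelihood is

(β/2) Σ_n ‖y(x_n, w) − t_n‖² − (NK/2) ln β + (NK/2) ln(2π),

so for fixed β its minimization with respect to w is the same problem as minimizing the sum-of-squares error (5.11).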
5.3 Consider a regression problem involving multiple target variables in which it is assumed that the distribution of the targets, conditioned on the input vector x, is a Gaussian of the form

p(t|x, w) = N(t|y(x, w), Σ)        (5.192)

where y(x, w) is the output of a neural network with input vector x and weight vector w, and Σ is the covariance of the assumed Gaussian noise on the targets. Given a set of independent observations of x and t, write down the error function that must be minimized in order to find the maximum likelihood solution for w, if we assume that Σ is fixed and known. Now assume that Σ is also to be determined from the data, and write down an expression for the maximum likelihood solution for Σ. Note that the optimizations of w and Σ are now coupled, in contrast to the case of independent target variables discussed in Section 5.2.

5.4 Consider a binary classification problem in which the target values are t ∈ {0, 1}, with a network output y(x, w) that represents p(t = 1|x), and suppose that there is a probability ε that the class label on a training data point has been incorrectly set. Assuming independent and identically distributed data, write down the error function corresponding to the negative log likelihood. Verify that the error function (5.21) is obtained when ε = 0. Note that this error function makes the model robust to incorrectly labelled data, in contrast to the usual error function.

5.5 www Show that maximizing likelihood for a multiclass neural network model in which the network outputs have the interpretation y_k(x, w) = p(t_k = 1|x) is equivalent to the minimization of the cross-entropy error function (5.24).

5.6 www Show that the derivative of the error function (5.21) with respect to the activation a_k for an output unit having a logistic sigmoid activation function satisfies (5.18).

5.7 Show that the derivative of the error function (5.24) with respect to the activation a_k for output units having a softmax activation function satisfies (5.18).
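As a pointer to the manipulation that Exercises 5.6 and 5.7 call for, here is the standard calculation for the binary case (my own working, assuming the usual forms y = σ(a) for the output and E = −Σ_n {t_n ln y_n + (1 − t_n) ln(1 − y_n)} for (5.21)). Using dσ/da = σ(1 − σ),

∂E/∂a_n = −( t_n/y_n − (1 − t_n)/(1 − y_n) ) y_n (1 − y_n) = y_n − t_n,

which is exactly the 'error' form of (5.18). The softmax case of Exercise 5.7 proceeds in the same way using ∂y_k/∂a_j = y_k(I_kj − y_j), where I_kj denotes the identity.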
5.8 We saw in (4.88) that the derivative of the logistic sigmoid activation function can be expressed in terms of the function value itself. Derive the corresponding result for the 'tanh' activation function defined by (5.59).
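For Exercise 5.8, the analogue of (4.88) is the identity (a standard result, stated here for reference)

d/da tanh(a) = 1 − tanh²(a),

which follows directly from writing tanh(a) = (eᵃ − e⁻ᵃ)/(eᵃ + e⁻ᵃ) and differentiating, so the derivative is again expressible through the function value itself.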
5.9 www The error function (5.21) for binary classification problems was derived for a network having a logistic-sigmoid output activation function, so that 0 ≤ y(x, w) ≤ 1, and data having target values t ∈ {0, 1}. Derive the corresponding error function if we consider a network having an output −1 ≤ y(x, w) ≤ 1 and target values t = 1 for class C₁ and t = −1 for class C₂. What would be the appropriate choice of output unit activation function?

5.10 www Consider a Hessian matrix H with eigenvector equation (5.33). By setting the vector v in (5.39) equal to each of the eigenvectors u_i in turn, show that H is positive definite if, and only if, all of its eigenvalues are positive.

5.11 www Consider a quadratic error function defined by (5.32), in which the Hessian matrix H has an eigenvalue equation given by (5.33). Show that the contours of constant error are ellipses whose axes are aligned with the eigenvectors u_i, with lengths that are inversely proportional to the square root of the corresponding eigenvalues λ_i.

5.12 www By considering the local Taylor expansion (5.32) of an error function about a stationary point w*, show that the necessary and sufficient condition for the stationary point to be a local minimum of the error function is that the Hessian matrix H, defined by (5.30) with ŵ = w*, be positive definite.

5.13 Show that as a consequence of the symmetry of the Hessian matrix H, the number of independent elements in the quadratic error function (5.28) is given by W(W + 3)/2.

5.14 By making a Taylor expansion, verify that the terms that are O(ε) cancel on the right-hand side of (5.69).
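Exercises 5.10–5.12 all revolve around the eigenvalue criterion for positive definiteness. A small numerical illustration of the criterion (my own sketch, not from the book):

```python
import numpy as np

def is_positive_definite(H, tol=1e-12):
    """A symmetric matrix is positive definite iff all its eigenvalues are positive."""
    return bool(np.all(np.linalg.eigvalsh(H) > tol))

H_pd = np.array([[2.0, 0.5],
                 [0.5, 1.0]])          # both eigenvalues are positive
H_indef = np.array([[1.0, 2.0],
                    [2.0, 1.0]])       # eigenvalues are 3 and -1

print(is_positive_definite(H_pd))      # True
print(is_positive_definite(H_indef))   # False: v^T H v < 0 along the eigenvector for -1
```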
5.15 In Section 5.3.4, we derived a procedure for evaluating the Jacobian matrix of a neural network using a backpropagation procedure. Derive an alternative formalism for finding the Jacobian based on forward propagation equations.

5.16 The outer product approximation to the Hessian matrix for a neural network using a sum-of-squares error function is given by (5.84). Extend this result to the case of multiple outputs.

5.17 Consider a squared loss function of the form

E = ½ ∫∫ {y(x, w) − t}² p(x, t) dx dt        (5.193)

where y(x, w) is a parametric function such as a neural network. The result (1.89) shows that the function y(x, w) that minimizes this error is given by the conditional expectation of t given x. Use this result to show that the second derivative of E with respect to two elements w_r and w_s of the vector w, is given by

∂²E/∂w_r ∂w_s = ∫ (∂y/∂w_r)(∂y/∂w_s) p(x) dx.        (5.194)

Note that, for a finite sample from p(x), we obtain (5.84).
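Exercises 5.16, 5.19, and 5.20 build on the outer product approximation of (5.84), H ≈ Σ_n b_n b_nᵀ with b_n = ∇_w y(x_n, w). The sketch below (my own illustration; the toy model and the finite-difference gradient are assumptions made for brevity) assembles this approximation for a single-output model.

```python
import numpy as np

def model(w, x):
    # Toy single-output 'network': y = w2 . tanh(W1 x), with the weights packed in a flat vector w.
    W1 = w[:6].reshape(2, 3)      # 2 hidden units, 3 inputs
    w2 = w[6:]
    return w2 @ np.tanh(W1 @ x)

def grad_y(w, x, eps=1e-6):
    # Finite-difference gradient of the output with respect to the weights (the vector b_n in (5.84)).
    g = np.zeros_like(w)
    for i in range(len(w)):
        dw = np.zeros_like(w)
        dw[i] = eps
        g[i] = (model(w + dw, x) - model(w - dw, x)) / (2 * eps)
    return g

rng = np.random.default_rng(3)
w = rng.normal(size=8)
X = rng.normal(size=(20, 3))      # 20 input patterns

# Outer product approximation: H ~ sum_n b_n b_n^T
H_approx = sum(np.outer(b, b) for b in (grad_y(w, x) for x in X))
print(H_approx.shape)             # (8, 8); symmetric and positive semidefinite
```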
5.18 Consider a two-layer network of the form shown in Figure 5.1 with the addition of extra parameters corresponding to skip-layer connections that go directly from the inputs to the outputs. By extending the discussion of Section 5.3.2, write down the equations for the derivatives of the error function with respect to these additional parameters.

5.19 www Derive the expression (5.85) for the outer product approximation to the Hessian matrix for a network having a single output with a logistic sigmoid output-unit activation function and a cross-entropy error function, corresponding to the result (5.84) for the sum-of-squares error function.

5.20 Derive an expression for the outer product approximation to the Hessian matrix for a network having K outputs with a softmax output-unit activation function and a cross-entropy error function, corresponding to the result (5.84) for the sum-of-squares error function.

5.21 Extend the expression (5.86) for the outer product approximation of the Hessian matrix to the case of K > 1 output units. Hence, derive a recursive expression analogous to (5.87) for incrementing the number N of patterns and a similar expression for incrementing the number K of outputs. Use these results, together with the identity (5.88), to find sequential update expressions analogous to (5.89) for finding the inverse of the Hessian by incrementally including both extra patterns and extra outputs.
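The sequential updates asked for in Exercise 5.21 rest on a rank-one matrix inversion identity of the Sherman–Morrison type, which I believe is what the identity (5.88) refers to: (M + vvᵀ)⁻¹ = M⁻¹ − (M⁻¹v)(vᵀM⁻¹)/(1 + vᵀM⁻¹v). A quick numerical check of the identity (my own sketch):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5
B = rng.normal(size=(n, n))
M = B @ B.T + np.eye(n)                # an invertible (here positive definite) matrix
v = rng.normal(size=n)

M_inv = np.linalg.inv(M)
# Sherman-Morrison: rank-one update of the inverse
updated_inv = M_inv - np.outer(M_inv @ v, v @ M_inv) / (1.0 + v @ M_inv @ v)

print(np.allclose(updated_inv, np.linalg.inv(M + np.outer(v, v))))   # True
```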
5.22 Derive the results (5.93), (5.94), and (5.95) for the elements of the Hessian matrix of a two-layer feed-forward network by application of the chain rule of calculus.

5.23 Extend the results of Section 5.4.5 for the exact Hessian of a two-layer network to include skip-layer connections that go directly from inputs to outputs.

5.24 Verify that the network function defined by (5.113) and (5.114) is invariant under the transformation (5.115) applied to the inputs, provided the weights and biases are simultaneously transformed using (5.116) and (5.117).
Similarly, show that the network outputs can be transformed according to (5.118) by applying the transformation (5.119) and (5.120) to the second-layer weights and biases.

5.25 www Consider a quadratic error function of the form

E = E₀ + ½ (w − w*)ᵀ H (w − w*)        (5.195)

where w* represents the minimum, and the Hessian matrix H is positive definite and constant. Suppose the initial weight vector w^(0) is chosen to be at the origin and is updated using simple gradient descent

w^(τ) = w^(τ−1) − ρ∇E        (5.196)

where τ denotes the step number, and ρ is the learning rate (which is assumed to be small).
Show that, after τ steps, the components of the weight vector parallel to the eigenvectors of H can be written

w_j^(τ) = {1 − (1 − ρη_j)^τ} w_j*        (5.197)

where w_j = wᵀ u_j, and u_j and η_j are the eigenvectors and eigenvalues, respectively, of H so that

H u_j = η_j u_j.        (5.198)

Show that as τ → ∞, this gives w^(τ) → w* as expected, provided |1 − ρη_j| < 1. Now suppose that training is halted after a finite number τ of steps. Show that the components of the weight vector parallel to the eigenvectors of the Hessian satisfy

w_j^(τ) ≃ w_j*          when η_j ≫ (ρτ)⁻¹        (5.199)
|w_j^(τ)| ≪ |w_j*|      when η_j ≪ (ρτ)⁻¹.        (5.200)

Compare this result with the discussion in Section 3.5.3 of regularization with simple weight decay, and hence show that (ρτ)⁻¹ is analogous to the regularization parameter λ. The above results also show that the effective number of parameters in the network, as defined by (3.91), grows as the training progresses.
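The closed form (5.197) can also be checked numerically. The sketch below (my own illustration, not from the book) runs gradient descent from the origin on a quadratic error with a randomly generated positive definite H and compares the iterate with {1 − (1 − ρη_j)^τ} w_j* expressed in the eigenbasis of H.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4
B = rng.normal(size=(n, n))
H = B @ B.T + np.eye(n)                  # positive definite Hessian
w_star = rng.normal(size=n)              # location of the minimum
rho = 0.05                               # small learning rate
tau = 50

# Gradient descent on E(w) = E0 + 0.5 (w - w*)^T H (w - w*), starting at the origin
w = np.zeros(n)
for _ in range(tau):
    w = w - rho * H @ (w - w_star)       # gradient of the quadratic error

# Closed form (5.197) in the eigenbasis of H
eta, U = np.linalg.eigh(H)               # columns of U are the eigenvectors u_j
w_closed = U @ ((1.0 - (1.0 - rho * eta) ** tau) * (U.T @ w_star))

print(np.allclose(w, w_closed))          # True
```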
5.26 Consider a multilayer perceptron with arbitrary feed-forward topology, which is to be trained by minimizing the tangent propagation error function (5.127) in which the regularizing function is given by (5.128). Show that the regularization term Ω can be written as a sum over patterns of terms of the form

Ω_n = ½ Σ_k (G y_k)²        (5.201)

where G is a differential operator defined by

G ≡ Σ_i τ_i ∂/∂x_i.        (5.202)

By acting on the forward propagation equations

z_j = h(a_j),        a_j = Σ_i w_ji z_i        (5.203)

with the operator G, show that Ω_n can be evaluated by forward propagation using the following equations:

α_j = h′(a_j) β_j,        β_j = Σ_i w_ji α_i        (5.204)

where we have defined the new variables

α_j ≡ G z_j,        β_j ≡ G a_j.        (5.205)

Now show that the derivatives of Ω_n with respect to a weight w_rs in the network can be written in the form

∂Ω_n/∂w_rs = Σ_k α_k {φ_kr z_s + δ_kr α_s}        (5.206)

where we have defined

δ_kr ≡ ∂y_k/∂a_r,        φ_kr ≡ G δ_kr.        (5.207)

Write down the backpropagation equations for δ_kr, and hence derive a set of backpropagation equations for the evaluation of the φ_kr.

5.27 www Consider the framework for training with transformed data in the special case in which the transformation consists simply of the addition of random noise x → x + ξ where ξ has a Gaussian distribution with zero mean and unit covariance.