Suppose we are given a training data set {φ_n, t_n} where n = 1, ..., N, and t_n is a binary target vector of length K that uses the 1-of-K coding scheme, so that it has components t_{nj} = I_{jk} if pattern n is from class C_k. Assuming that the data points are drawn independently from this model, show that the maximum-likelihood solution for the prior probabilities is given by

\[
\pi_k = \frac{N_k}{N}
\tag{4.159}
\]

where N_k is the number of data points assigned to class C_k.

4.10 ( ) Consider the classification model of Exercise 4.9 and now suppose that the class-conditional densities are given by Gaussian distributions with a shared covariance matrix, so that

\[
p(\boldsymbol{\phi}|\mathcal{C}_k) = \mathcal{N}(\boldsymbol{\phi}|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}).
\tag{4.160}
\]

Show that the maximum likelihood solution for the mean of the Gaussian distribution for class C_k is given by

\[
\boldsymbol{\mu}_k = \frac{1}{N_k} \sum_{n=1}^{N} t_{nk} \boldsymbol{\phi}_n
\tag{4.161}
\]

which represents the mean of those feature vectors assigned to class C_k.
Similarly, show that the maximum likelihood solution for the shared covariance matrix is given by

\[
\boldsymbol{\Sigma} = \sum_{k=1}^{K} \frac{N_k}{N} \mathbf{S}_k
\tag{4.162}
\]

where

\[
\mathbf{S}_k = \frac{1}{N_k} \sum_{n=1}^{N} t_{nk} (\boldsymbol{\phi}_n - \boldsymbol{\mu}_k)(\boldsymbol{\phi}_n - \boldsymbol{\mu}_k)^{\mathrm{T}}.
\tag{4.163}
\]

Thus Σ is given by a weighted average of the covariances of the data associated with each class, in which the weighting coefficients are given by the prior probabilities of the classes.
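The estimators in Exercises 4.9 and 4.10 are straightforward to compute in practice. The following is a minimal NumPy sketch, using synthetic data and illustrative variable names (none of them taken from the text), that evaluates (4.159), (4.161), and (4.162)-(4.163) from a 1-of-K target matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: N points, D features, K classes with 1-of-K targets.
N, D, K = 300, 2, 3
labels = rng.integers(0, K, size=N)
T = np.eye(K)[labels]                            # N x K one-hot target matrix
Phi = rng.normal(size=(N, D)) + labels[:, None]  # crude class-dependent shift

Nk = T.sum(axis=0)                   # N_k, counts per class
pi = Nk / N                          # (4.159) ML class priors
mu = (T.T @ Phi) / Nk[:, None]       # (4.161) ML class means, K x D

# (4.162)-(4.163): shared covariance as a prior-weighted average of the S_k
Sigma = np.zeros((D, D))
for k in range(K):
    diff = Phi - mu[k]               # deviations from the class-k mean
    Sk = (T[:, k, None] * diff).T @ diff / Nk[k]
    Sigma += (Nk[k] / N) * Sk

print(pi, mu, Sigma, sep="\n")
```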
4.11 ( ) Consider a classification problem with K classes for which the feature vector φ has M components, each of which can take L discrete states. Let the values of the components be represented by a 1-of-L binary coding scheme. Further suppose that, conditioned on the class C_k, the M components of φ are independent, so that the class-conditional density factorizes with respect to the feature vector components. Show that the quantities a_k given by (4.63), which appear in the argument to the softmax function describing the posterior class probabilities, are linear functions of the components of φ. Note that this represents an example of the naive Bayes model, which is discussed in Section 8.2.2.

4.12 () www Verify the relation (4.88) for the derivative of the logistic sigmoid function defined by (4.59).

4.13 () www By making use of the result (4.88) for the derivative of the logistic sigmoid, show that the derivative of the error function (4.90) for the logistic regression model is given by (4.91).

4.14 () Show that for a linearly separable data set, the maximum likelihood solution for the logistic regression model is obtained by finding a vector w whose decision boundary w^T φ(x) = 0 separates the classes and then taking the magnitude of w to infinity.
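Exercises 4.12 and 4.13 can be sanity-checked numerically. The sketch below, again a minimal NumPy illustration with made-up data, compares σ'(a) = σ(a)(1 − σ(a)) with a finite-difference estimate, and checks that the gradient of the logistic regression cross-entropy error matches the form Σ_n (y_n − t_n)φ_n referred to as (4.91).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Derivative check: sigma'(a) = sigma(a) * (1 - sigma(a))
a, eps = 0.7, 1e-6
numeric = (sigmoid(a + eps) - sigmoid(a - eps)) / (2 * eps)
analytic = sigmoid(a) * (1 - sigmoid(a))
print(abs(numeric - analytic))       # should be close to zero

# Gradient check for logistic regression: grad E = sum_n (y_n - t_n) phi_n
rng = np.random.default_rng(1)
N, M = 50, 3
Phi = rng.normal(size=(N, M))
t = rng.integers(0, 2, size=N).astype(float)
w = rng.normal(size=M)

def error(w):
    y = sigmoid(Phi @ w)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

y = sigmoid(Phi @ w)
analytic_grad = Phi.T @ (y - t)      # the claimed form of (4.91)
numeric_grad = np.array([
    (error(w + eps * e) - error(w - eps * e)) / (2 * eps)
    for e in np.eye(M)
])
print(np.max(np.abs(analytic_grad - numeric_grad)))   # should be close to zero
```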
4.15 ( ) Show that the Hessian matrix H for the logistic regression model, given by (4.97), is positive definite. Here R is a diagonal matrix with elements y_n(1 − y_n), and y_n is the output of the logistic regression model for input vector x_n. Hence show that the error function is a convex function of w and that it has a unique minimum.

4.16 () Consider a binary classification problem in which each observation x_n is known to belong to one of two classes, corresponding to t = 0 and t = 1, and suppose that the procedure for collecting training data is imperfect, so that training points are sometimes mislabelled. For every data point x_n, instead of having a value t_n for the class label, we have instead a value π_n representing the probability that t_n = 1. Given a probabilistic model p(t = 1|φ), write down the log likelihood function appropriate to such a data set.

4.17 () www Show that the derivatives of the softmax activation function (4.104), where the a_k are defined by (4.105), are given by (4.106).

4.18 () Using the result (4.106) for the derivatives of the softmax activation function, show that the gradients of the cross-entropy error (4.108) are given by (4.109).
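The softmax derivative of Exercise 4.17 admits the same kind of numerical check. A short sketch, assuming NumPy and using the identity ∂y_k/∂a_j = y_k(I_kj − y_j) as the analytic reference, is given below.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())           # subtract the max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(2)
a = rng.normal(size=4)
y = softmax(a)

# Analytic Jacobian: dy_k / da_j = y_k (I_kj - y_j)
analytic = np.diag(y) - np.outer(y, y)

# Finite-difference Jacobian, one column per input a_j
eps = 1e-6
numeric = np.column_stack([
    (softmax(a + eps * e) - softmax(a - eps * e)) / (2 * eps)
    for e in np.eye(4)
])
print(np.max(np.abs(analytic - numeric)))   # should be close to zero
```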
4.19 () www Write down expressions for the gradient of the log likelihood, as well as the corresponding Hessian matrix, for the probit regression model defined in Section 4.3.5. These are the quantities that would be required to train such a model using IRLS.
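For Exercise 4.19, a candidate gradient expression can be validated against finite differences before attempting the Hessian. The sketch below assumes SciPy for the standard normal CDF and PDF; the gradient formula used here is obtained by a routine chain-rule calculation and is offered only as a numerical check, not quoted from the text.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
N, M = 40, 3
Phi = rng.normal(size=(N, M))
t = rng.integers(0, 2, size=N).astype(float)
w = rng.normal(size=M) * 0.1

def log_lik(w):
    y = norm.cdf(Phi @ w)            # y_n = Phi(a_n), probit outputs
    return np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

a = Phi @ w
y = norm.cdf(a)
# Candidate gradient from the chain rule: sum_n (t_n - y_n) N(a_n|0,1) / {y_n (1 - y_n)} phi_n
grad = Phi.T @ ((t - y) * norm.pdf(a) / (y * (1 - y)))

eps = 1e-6
numeric = np.array([
    (log_lik(w + eps * e) - log_lik(w - eps * e)) / (2 * eps)
    for e in np.eye(M)
])
print(np.max(np.abs(grad - numeric)))   # should be close to zero
```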
4.20 ( ) Show that the Hessian matrix for the multiclass logistic regression problem, defined by (4.110), is positive semidefinite. Note that the full Hessian matrix for this problem is of size MK × MK, where M is the number of parameters and K is the number of classes. To prove the positive semidefinite property, consider the product u^T H u where u is an arbitrary vector of length MK, and then apply Jensen's inequality.

4.21 () Show that the probit function (4.114) and the erf function (4.115) are related by (4.116).

4.22 () Using the result (4.135), derive the expression (4.137) for the log model evidence under the Laplace approximation.

4.23 ( ) www In this exercise, we derive the BIC result (4.139) starting from the Laplace approximation to the model evidence given by (4.137). Show that if the prior over parameters is Gaussian of the form p(θ) = N(θ|m, V_0), the log model evidence under the Laplace approximation takes the form

\[
\ln p(\mathcal{D}) \simeq \ln p(\mathcal{D}|\boldsymbol{\theta}_{\mathrm{MAP}}) - \frac{1}{2}(\boldsymbol{\theta}_{\mathrm{MAP}} - \mathbf{m})^{\mathrm{T}} \mathbf{V}_0^{-1} (\boldsymbol{\theta}_{\mathrm{MAP}} - \mathbf{m}) - \frac{1}{2} \ln |\mathbf{H}| + \mathrm{const}
\]

where H is the matrix of second derivatives of the log likelihood ln p(D|θ) evaluated at θ_MAP. Now assume that the prior is broad so that V_0^{-1} is small and the second term on the right-hand side above can be neglected. Furthermore, consider the case of independent, identically distributed data so that H is the sum of terms, one for each data point. Show that the log model evidence can then be written approximately in the form of the BIC expression (4.139).
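Since the BIC expression (4.139) itself lies outside this excerpt, it is worth recalling the form the derivation is aiming at; in the usual convention (constants aside) it reads

\[
\ln p(\mathcal{D}) \simeq \ln p(\mathcal{D}|\boldsymbol{\theta}_{\mathrm{MAP}}) - \frac{1}{2} M \ln N
\]

where M is the number of adaptive parameters and N is the number of data points.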
4.24 ( ) Use the results from Section 2.3.2 to derive the result (4.151) for the marginalization of the logistic regression model with respect to a Gaussian posterior distribution over the parameters w.

4.25 ( ) Suppose we wish to approximate the logistic sigmoid σ(a) defined by (4.59) by a scaled probit function Φ(λa), where Φ(a) is defined by (4.114). Show that if λ is chosen so that the derivatives of the two functions are equal at a = 0, then λ² = π/8.
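The accuracy of the approximation in Exercise 4.25 is easy to quantify numerically. A small sketch, assuming NumPy and SciPy, compares σ(a) with Φ(λa) for λ = (π/8)^{1/2} and confirms that the two slopes agree at a = 0.

```python
import numpy as np
from scipy.stats import norm

lam = np.sqrt(np.pi / 8.0)           # lambda chosen so the slopes match at a = 0
a = np.linspace(-6, 6, 201)
sigma = 1.0 / (1.0 + np.exp(-a))     # logistic sigmoid
probit = norm.cdf(lam * a)           # scaled probit Phi(lambda * a)

print(np.max(np.abs(sigma - probit)))   # worst-case gap, on the order of 0.02
print(0.25, lam * norm.pdf(0.0))        # both derivatives at a = 0 equal 1/4
```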
4.26 ( ) In this exercise, we prove the relation (4.152) for the convolution of a probit function with a Gaussian distribution. To do this, show that the derivative of the left-hand side with respect to µ is equal to the derivative of the right-hand side, then integrate both sides with respect to µ and show that the constant of integration vanishes. Note that before differentiating the left-hand side, it is convenient first to introduce a change of variable given by a = µ + σz so that the integral over a is replaced by an integral over z. When we differentiate the left-hand side of the relation (4.152), we will then obtain a Gaussian integral over z that can be evaluated analytically.

5. Neural Networks

In Chapters 3 and 4 we considered models for regression and classification that comprised linear combinations of fixed basis functions. We saw that such models have useful analytical and computational properties but that their practical applicability was limited by the curse of dimensionality.
In order to apply such models to large-scale problems, it is necessary to adapt the basis functions to the data.

Support vector machines (SVMs), discussed in Chapter 7, address this by first defining basis functions that are centred on the training data points and then selecting a subset of these during training. One advantage of SVMs is that, although the training involves nonlinear optimization, the objective function is convex, and so the solution of the optimization problem is relatively straightforward.
The number of basis functions in the resulting models is generally much smaller than the number of training points, although it is often still relatively large and typically increases with the size of the training set. The relevance vector machine, discussed in Section 7.2, also chooses a subset from a fixed set of basis functions and typically results in much sparser models. Unlike the SVM, it also produces probabilistic outputs, although this is at the expense of a nonconvex optimization during training.

An alternative approach is to fix the number of basis functions in advance but allow them to be adaptive, in other words to use parametric forms for the basis functions in which the parameter values are adapted during training.
The most successful model of this type in the context of pattern recognition is the feed-forward neural network, also known as the multilayer perceptron, discussed in this chapter. In fact, ‘multilayer perceptron’ is really a misnomer, because the model comprises multiple layers of logistic regression models (with continuous nonlinearities) rather than multiple perceptrons (with discontinuous nonlinearities). For many applications, the resulting model can be significantly more compact, and hence faster to evaluate, than a support vector machine having the same generalization performance.
The price to be paid for this compactness, as with the relevance vector machine, is that the likelihood function, which forms the basis for network training, is no longer a convex function of the model parameters. In practice, however, it is often worth investing substantial computational resources during the training phase in order to obtain a compact model that is fast at processing new data.

The term ‘neural network’ has its origins in attempts to find mathematical representations of information processing in biological systems (McCulloch and Pitts, 1943; Widrow and Hoff, 1960; Rosenblatt, 1962; Rumelhart et al., 1986). Indeed, it has been used very broadly to cover a wide range of different models, many of which have been the subject of exaggerated claims regarding their biological plausibility. From the perspective of practical applications of pattern recognition, however, biological realism would impose entirely unnecessary constraints.
Our focus in this chapter is therefore on neural networks as efficient models for statistical pattern recognition. In particular, we shall restrict our attention to the specific class of neural networks that have proven to be of greatest practical value, namely the multilayer perceptron.

We begin by considering the functional form of the network model, including the specific parameterization of the basis functions, and we then discuss the problem of determining the network parameters within a maximum likelihood framework, which involves the solution of a nonlinear optimization problem. This requires the evaluation of derivatives of the log likelihood function with respect to the network parameters, and we shall see how these can be obtained efficiently using the technique of error backpropagation.
We shall also show how the backpropagation framework can be extended to allow other derivatives to be evaluated, such as the Jacobian and Hessian matrices. Next we discuss various approaches to regularization of neural network training and the relationships between them. We also consider some extensions to the neural network model, and in particular we describe a general framework for modelling conditional probability distributions known as mixture density networks. Finally, we discuss the use of Bayesian treatments of neural networks. Additional background on neural network models can be found in Bishop (1995a).
5.1. Feed-forward Network Functions

The linear models for regression and classification discussed in Chapters 3 and 4, respectively, are based on linear combinations of fixed nonlinear basis functions φ_j(x) and take the form

\[
y(\mathbf{x}, \mathbf{w}) = f\!\left( \sum_{j=1}^{M} w_j \phi_j(\mathbf{x}) \right)
\tag{5.1}
\]

where f(·) is a nonlinear activation function in the case of classification and is the identity in the case of regression.
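As a concrete, if simplified, illustration of (5.1), the following sketch evaluates such a model with fixed Gaussian basis functions; the centres, width, and weights are arbitrary values chosen for the example, not taken from the text.

```python
import numpy as np

def gaussian_basis(x, centres, s=1.0):
    """Fixed nonlinear basis functions phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2))."""
    return np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2 * s ** 2))

def y(x, w, centres, f=lambda z: z):
    """y(x, w) = f( sum_j w_j phi_j(x) ); f is the identity for regression."""
    return f(gaussian_basis(x, centres) @ w)

centres = np.linspace(-2, 2, 5)              # M = 5 fixed basis centres
w = np.array([0.2, -0.5, 1.0, 0.3, -0.1])    # one weight per basis function
x = np.linspace(-3, 3, 7)
print(y(x, w, centres))                      # regression case
```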