Bishop C.M., Pattern Recognition and Machine Learning (2006)
By following an argument analogous to that of Section 5.5.5, show that the resulting regularizer reduces to the Tikhonov form (5.135).

5.28 (⋆) www Consider a neural network, such as the convolutional network discussed in Section 5.5.6, in which multiple weights are constrained to have the same value. Discuss how the standard backpropagation algorithm must be modified in order to ensure that such constraints are satisfied when evaluating the derivatives of an error function with respect to the adjustable parameters in the network.

5.29 (⋆) www Verify the result (5.141).

5.30 (⋆) Verify the result (5.142).

5.31 (⋆) Verify the result (5.143).

5.32 (⋆⋆) Show that the derivatives of the mixing coefficients {πk}, defined by (5.146), with respect to the auxiliary parameters {ηj} are given by

\frac{\partial \pi_k}{\partial \eta_j} = \delta_{jk} \pi_j - \pi_j \pi_k    (5.208)

Hence, by making use of the constraint \sum_k \pi_k = 1, derive the result (5.147).

5.33 (⋆) Write down a pair of equations that express the Cartesian coordinates (x1, x2) for the robot arm shown in Figure 5.18 in terms of the joint angles θ1 and θ2 and the lengths L1 and L2 of the links. Assume the origin of the coordinate system is given by the attachment point of the lower arm. These equations define the ‘forward kinematics’ of the robot arm.

5.34 (⋆) www Derive the result (5.155) for the derivative of the error function with respect to the network output activations controlling the mixing coefficients in the mixture density network.

5.35 (⋆) Derive the result (5.156) for the derivative of the error function with respect to the network output activations controlling the component means in the mixture density network.

5.36 (⋆) Derive the result (5.157) for the derivative of the error function with respect to the network output activations controlling the component variances in the mixture density network.

5.37 (⋆) Verify the results (5.158) and (5.160) for the conditional mean and variance of the mixture density network model.

5.38 (⋆) Using the general result (2.115), derive the predictive distribution (5.172) for the Laplace approximation to the Bayesian neural network model.
5.39 (⋆) www Make use of the Laplace approximation result (4.135) to show that the evidence function for the hyperparameters α and β in the Bayesian neural network model can be approximated by (5.175).

5.40 (⋆) www Outline the modifications needed to the framework for Bayesian neural networks, discussed in Section 5.7.3, to handle multiclass problems using networks having softmax output-unit activation functions.

5.41 (⋆⋆) By following analogous steps to those given in Section 5.7.1 for regression networks, derive the result (5.183) for the marginal likelihood in the case of a network having a cross-entropy error function and logistic-sigmoid output-unit activation function.

6. Kernel Methods

In Chapters 3 and 4, we considered linear parametric models for regression and classification in which the form of the mapping y(x, w) from input x to output y is governed by a vector w of adaptive parameters.
During the learning phase, a set of training data is used either to obtain a point estimate of the parameter vector or to determine a posterior distribution over this vector. The training data is then discarded, and predictions for new inputs are based purely on the learned parameter vector w. This approach is also used in nonlinear parametric models such as neural networks.

However, there is a class of pattern recognition techniques, in which the training data points, or a subset of them, are kept and used also during the prediction phase. For instance, the Parzen probability density model comprised a linear combination of ‘kernel’ functions each one centred on one of the training data points. Similarly, in Section 2.5.2 we introduced a simple technique for classification called nearest neighbours, which involved assigning to each new test vector the same label as the
closest example from the training set. These are examples of memory-based methods that involve storing the entire training set in order to make predictions for future data points. They typically require a metric to be defined that measures the similarity of any two vectors in input space, and are generally fast to ‘train’ but slow at making predictions for test data points.

Many linear parametric models can be re-cast into an equivalent ‘dual representation’ in which the predictions are also based on linear combinations of a kernel function evaluated at the training data points. As we shall see, for models which are based on a fixed nonlinear feature space mapping φ(x), the kernel function is given by the relation

k(x, x') = \phi(x)^T \phi(x').    (6.1)

From this definition, we see that the kernel is a symmetric function of its arguments so that k(x, x') = k(x', x).
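As a minimal numerical sketch of relation (6.1), consider a hand-picked feature mapping (the particular φ below is purely illustrative and not one used in the text); defining the kernel as the inner product of the mapped inputs makes its symmetry immediate.

```python
import numpy as np

# Illustrative sketch of (6.1); the feature mapping is a hypothetical choice.
def phi(x):
    return np.array([x[0], x[1], x[0] * x[1]])

def k(x, xp):
    # kernel defined as the inner product of the mapped inputs
    return phi(x) @ phi(xp)

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])

# symmetry k(x, z) = k(z, x) follows immediately from the definition
assert np.isclose(k(x, z), k(z, x))
```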
The kernel concept was introduced into the field of pattern recognition by Aizerman et al. (1964) in the context of the method of potential functions, so-called because of an analogy with electrostatics. Although neglected for many years, it was re-introduced into machine learning in the context of large margin classifiers by Boser et al. (1992), giving rise to the technique of support vector machines. Since then, there has been considerable interest in this topic, both in terms of theory and applications.
One of the most significant developments has been the extension of kernels to handle symbolic objects, thereby greatly expanding the range of problems that can be addressed.

The simplest example of a kernel function is obtained by considering the identity mapping for the feature space in (6.1) so that φ(x) = x, in which case k(x, x') = x^T x'. We shall refer to this as the linear kernel.

The concept of a kernel formulated as an inner product in a feature space allows us to build interesting extensions of many well-known algorithms by making use of the kernel trick, also known as kernel substitution. The general idea is that, if we have an algorithm formulated in such a way that the input vector x enters only in the form of scalar products, then we can replace that scalar product with some other choice of kernel. For instance, the technique of kernel substitution can be applied to principal component analysis in order to develop a nonlinear variant of PCA (Schölkopf et al., 1998).
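A minimal sketch of this idea, using an arbitrary illustrative feature mapping: the squared Euclidean distance between two points in feature space expands as ‖φ(x) − φ(z)‖² = k(x, x) − 2k(x, z) + k(z, z), so a distance-based algorithm can be kernelized using kernel evaluations alone.

```python
import numpy as np

# Kernel substitution sketch: a quantity expressible through scalar products,
# here the squared distance in feature space,
#   ||phi(x) - phi(z)||^2 = k(x, x) - 2 k(x, z) + k(z, z),
# is evaluated on the right-hand side using only kernel values.
# The feature mapping is an arbitrary illustrative choice.
def phi(x):
    return np.array([x[0], x[1], np.sin(x[0]), x[0] * x[1]])

def k(x, z):
    return phi(x) @ phi(z)

x = np.array([0.3, -1.2])
z = np.array([1.1, 0.4])

explicit = np.sum((phi(x) - phi(z)) ** 2)       # works in feature space directly
via_kernel = k(x, x) - 2.0 * k(x, z) + k(z, z)  # uses only kernel evaluations
assert np.isclose(explicit, via_kernel)
```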
Other examples of kernel substitution include nearest-neighbour classifiers and the kernel Fisher discriminant (Mika et al., 1999; Roth and Steinhage, 2000; Baudat and Anouar, 2000).

There are numerous forms of kernel functions in common use, and we shall encounter several examples in this chapter. Many have the property of being a function only of the difference between the arguments, so that k(x, x') = k(x − x'), which are known as stationary kernels because they are invariant to translations in input space. A further specialization involves homogeneous kernels, also known as radial basis functions, which depend only on the magnitude of the distance (typically Euclidean) between the arguments, so that k(x, x') = k(‖x − x'‖).

For recent textbooks on kernel methods, see Schölkopf and Smola (2002), Herbrich (2002), and Shawe-Taylor and Cristianini (2004).
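Returning to the stationary and radial basis function kernels just described, the short check below assumes a Gaussian kernel k(x, x') = exp(−‖x − x'‖²/2σ²), a standard radial basis function; the length-scale value and the random inputs are arbitrary illustrative choices.

```python
import numpy as np

# Gaussian kernel, a standard radial basis function kernel; the length-scale
# sigma = 1.0 is an arbitrary illustrative value.
def gauss_kernel(x, xp, sigma=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
x, xp, shift = rng.normal(size=(3, 4))

# stationary: translating both arguments by the same vector leaves k unchanged
assert np.isclose(gauss_kernel(x, xp), gauss_kernel(x + shift, xp + shift))

# radial: the value depends on the arguments only through ||x - x'||
r = np.linalg.norm(x - xp)
assert np.isclose(gauss_kernel(x, xp), np.exp(-r ** 2 / 2.0))
```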
6.1. Dual Representations

Many linear models for regression and classification can be reformulated in terms of a dual representation in which the kernel function arises naturally. This concept will play an important role when we consider support vector machines in the next chapter. Here we consider a linear regression model whose parameters are determined by minimizing a regularized sum-of-squares error function given by

J(w) = \frac{1}{2} \sum_{n=1}^{N} \left\{ w^T \phi(x_n) - t_n \right\}^2 + \frac{\lambda}{2} w^T w    (6.2)

where λ ≥ 0. If we set the gradient of J(w) with respect to w equal to zero, we see that the solution for w takes the form of a linear combination of the vectors φ(xn), with coefficients that are functions of w, of the form

w = -\frac{1}{\lambda} \sum_{n=1}^{N} \left\{ w^T \phi(x_n) - t_n \right\} \phi(x_n) = \sum_{n=1}^{N} a_n \phi(x_n) = \Phi^T a    (6.3)

where Φ is the design matrix, whose nth row is given by φ(xn)^T.
Here the vector a = (a1, ..., aN)^T, and we have defined

a_n = -\frac{1}{\lambda} \left\{ w^T \phi(x_n) - t_n \right\}.    (6.4)

Instead of working with the parameter vector w, we can now reformulate the least-squares algorithm in terms of the parameter vector a, giving rise to a dual representation. If we substitute w = Φ^T a into J(w), we obtain

J(a) = \frac{1}{2} a^T \Phi\Phi^T \Phi\Phi^T a - a^T \Phi\Phi^T t + \frac{1}{2} t^T t + \frac{\lambda}{2} a^T \Phi\Phi^T a    (6.5)
where t = (t1, ..., tN)^T. We now define the Gram matrix K = ΦΦ^T, which is an N × N symmetric matrix with elements

K_{nm} = \phi(x_n)^T \phi(x_m) = k(x_n, x_m)    (6.6)

where we have introduced the kernel function k(x, x') defined by (6.1). In terms of the Gram matrix, the sum-of-squares error function can be written as

J(a) = \frac{1}{2} a^T K K a - a^T K t + \frac{1}{2} t^T t + \frac{\lambda}{2} a^T K a.    (6.7)

Setting the gradient of J(a) with respect to a to zero, we obtain the following solution

a = (K + \lambda I_N)^{-1} t.    (6.8)

If we substitute this back into the linear regression model, we obtain the following prediction for a new input x

y(x) = w^T \phi(x) = a^T \Phi \phi(x) = k(x)^T (K + \lambda I_N)^{-1} t    (6.9)

where we have defined the vector k(x) with elements kn(x) = k(xn, x). Thus we see that the dual formulation allows the solution to the least-squares problem to be expressed entirely in terms of the kernel function k(x, x').
This is known as a dual formulation because, by noting that the solution for a can be expressed as a linear combination of the elements of φ(x), we recover the original formulation in terms of the parameter vector w. Note that the prediction at x is given by a linear combination of the target values from the training set. In fact, we have already obtained this result, using a slightly different notation, in Section 3.3.3.

In the dual formulation, we determine the parameter vector a by inverting an N × N matrix, whereas in the original parameter space formulation we had to invert an M × M matrix in order to determine w. Because N is typically much larger than M, the dual formulation does not seem to be particularly useful. However, the advantage of the dual formulation, as we shall see, is that it is expressed entirely in terms of the kernel function k(x, x'). We can therefore work directly in terms of kernels and avoid the explicit introduction of the feature vector φ(x), which allows us implicitly to use feature spaces of high, even infinite, dimensionality. The existence of a dual representation based on the Gram matrix is a property of many linear models, including the perceptron.
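As a small numerical sketch (with arbitrary synthetic data and an illustrative value of the regularization coefficient), the dual solution (6.8) and (6.9) can be checked against the primal solution of the regularized least-squares problem (6.2); the two give identical predictions.

```python
import numpy as np

# Numerical sketch comparing the dual solution (6.8)-(6.9) with the primal
# solution of regularized least squares; data and sizes are illustrative only.
rng = np.random.default_rng(1)
N, M, lam = 20, 5, 0.1

Phi = rng.normal(size=(N, M))          # design matrix; nth row is phi(x_n)^T
t = rng.normal(size=N)                 # target values
phi_new = rng.normal(size=M)           # feature vector phi(x) of a new input x

# primal: w = (lambda I_M + Phi^T Phi)^{-1} Phi^T t, prediction w^T phi(x)
w = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
y_primal = w @ phi_new

# dual: a = (K + lambda I_N)^{-1} t with Gram matrix K = Phi Phi^T,
# prediction y(x) = k(x)^T a where k(x)_n = phi(x_n)^T phi(x)
K = Phi @ Phi.T
a = np.linalg.solve(K + lam * np.eye(N), t)
y_dual = (Phi @ phi_new) @ a

assert np.isclose(y_primal, y_dual)    # both formulations give the same prediction
```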
In Section 6.4, we will develop a duality between probabilistic linear models for regression and the technique of Gaussian processes. Duality will also play an important role when we discuss support vector machines in Chapter 7.

6.2. Constructing Kernels

In order to exploit kernel substitution, we need to be able to construct valid kernel functions. One approach is to choose a feature space mapping φ(x) and then use this to find the corresponding kernel, as is illustrated in Figure 6.1. Here the kernel function is defined for a one-dimensional input space by

k(x, x') = \phi(x)^T \phi(x') = \sum_{i=1}^{M} \phi_i(x) \phi_i(x')    (6.10)

where φi(x) are the basis functions.

An alternative approach is to construct kernel functions directly. In this case, we must ensure that the function we choose is a valid kernel, in other words that it corresponds to a scalar product in some (perhaps infinite dimensional) feature space. As a simple example, consider a kernel function given by

k(x, z) = \left( x^T z \right)^2.    (6.11)
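For two-dimensional inputs, (6.11) can indeed be written as an inner product of explicit feature vectors, since (x^T z)^2 = x1²z1² + 2x1x2z1z2 + x2²z2². The mapping used in the check below, φ(x) = (x1², √2 x1x2, x2²)^T, is one valid choice introduced here for illustration.

```python
import numpy as np

# One valid explicit feature mapping for the kernel k(x, z) = (x^T z)^2
# with two-dimensional inputs: phi(x) = (x1^2, sqrt(2) x1 x2, x2^2)^T.
def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

def k(x, z):
    return (x @ z) ** 2

rng = np.random.default_rng(2)
x, z = rng.normal(size=(2, 2))

# the kernel value equals the feature-space inner product
assert np.isclose(k(x, z), phi(x) @ phi(z))
```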