5.6. Mixture Density Networks

The goal of supervised learning is to model a conditional distribution p(t|x), which for many simple regression problems is chosen to be Gaussian. However, practical machine learning problems can often have significantly non-Gaussian distributions. These can arise, for example, with inverse problems in which the distribution can be multimodal, in which case the Gaussian assumption can lead to very poor predictions.

As a simple example of an inverse problem, consider the kinematics of a robot arm, as illustrated in Figure 5.18 (Exercise 5.33). The forward problem involves finding the end effector position given the joint angles and has a unique solution. However, in practice we wish to move the end effector of the robot to a specific position, and to do this we must set appropriate joint angles. We therefore need to solve the inverse problem, which has two solutions, as seen in Figure 5.18.

Forward problems often correspond to causality in a physical system and generally have a unique solution. For instance, a specific pattern of symptoms in the human body may be caused by the presence of a particular disease. In pattern recognition, however, we typically have to solve an inverse problem, such as trying to predict the presence of a disease given a set of symptoms. If the forward problem involves a many-to-one mapping, then the inverse problem will have multiple solutions. For instance, several different diseases may result in the same symptoms.
In the robotics example, the kinematics is defined by geometrical equations, and the multimodality is readily apparent. However, in many machine learning problems the presence of multimodality, particularly in problems involving spaces of high dimensionality, can be less obvious. For tutorial purposes, however, we shall consider a simple toy problem for which we can easily visualize the multimodality. Data for this problem is generated by sampling a variable x uniformly over the interval (0, 1), to give a set of values {x_n}, and the corresponding target values t_n are obtained by computing the function x_n + 0.3 sin(2πx_n) and then adding uniform noise over the interval (−0.1, 0.1). The inverse problem is then obtained by keeping the same data points but exchanging the roles of x and t. Figure 5.19 shows the data sets for the forward and inverse problems, along with the results of fitting two-layer neural networks having 6 hidden units and a single linear output unit by minimizing a sum-of-squares error function. Least squares corresponds to maximum likelihood under a Gaussian assumption. We see that this leads to a very poor model for the highly non-Gaussian inverse problem.

Figure 5.19 On the left is the data set for a simple 'forward problem', in which the red curve shows the result of fitting a two-layer neural network by minimizing the sum-of-squares error function. The corresponding inverse problem, shown on the right, is obtained by exchanging the roles of x and t. Here the same network, trained again by minimizing the sum-of-squares error function, gives a very poor fit to the data due to the multimodality of the data set.
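For concreteness, generating this toy data set can be sketched as follows. This is an illustrative sketch rather than code from the book; the sample size and random seed are arbitrary choices, and a NumPy environment is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed for reproducibility (arbitrary choice)
N = 300                          # number of data points (the book does not specify N)

# Forward problem: t = x + 0.3 sin(2*pi*x) plus uniform noise on (-0.1, 0.1)
x = rng.uniform(0.0, 1.0, size=N)
t = x + 0.3 * np.sin(2.0 * np.pi * x) + rng.uniform(-0.1, 0.1, size=N)

# Inverse problem: keep the same points but exchange the roles of x and t
x_inv, t_inv = t.copy(), x.copy()
```

Fitting a single-output network by least squares to (x, t) gives the good fit shown on the left of Figure 5.19, while fitting the same model to (x_inv, t_inv) gives the poor fit shown on the right.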
We therefore seek a general framework for modelling conditional probability distributions. This can be achieved by using a mixture model for p(t|x) in which both the mixing coefficients and the component densities are flexible functions of the input vector x, giving rise to the mixture density network. For any given value of x, the mixture model provides a general formalism for modelling an arbitrary conditional density function p(t|x). Provided we consider a sufficiently flexible network, we then have a framework for approximating arbitrary conditional distributions. Here we shall develop the model explicitly for Gaussian components, so that

$$p(\mathbf{t}|\mathbf{x}) = \sum_{k=1}^{K} \pi_k(\mathbf{x})\, \mathcal{N}\bigl(\mathbf{t} \mid \boldsymbol{\mu}_k(\mathbf{x}), \sigma_k^2(\mathbf{x})\bigr). \qquad (5.148)$$

This is an example of a heteroscedastic model since the noise variance on the data is a function of the input vector x.
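To see why such a mixture is useful, the short sketch below (purely illustrative; the parameter values are hand-picked stand-ins for network outputs, not taken from the book) evaluates (5.148) at a single x with two components, producing a clearly bimodal conditional density that no single Gaussian could represent.

```python
import numpy as np

def gaussian(t, mu, sigma):
    """Univariate normal density N(t | mu, sigma^2)."""
    return np.exp(-0.5 * ((t - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# Hand-picked values standing in for pi_k(x), mu_k(x), sigma_k(x) at one value of x
pi = np.array([0.5, 0.5])
mu = np.array([0.2, 0.8])
sigma = np.array([0.05, 0.05])

t_grid = np.linspace(0.0, 1.0, 201)
# Eq. (5.148): mixture of Gaussians, here clearly bimodal in t
p_t_given_x = sum(pi[k] * gaussian(t_grid, mu[k], sigma[k]) for k in range(2))
```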
Instead of Gaussians, we can use other distributions for the components, such as Bernoulli distributions if the target variables are binary rather than continuous. We have also specialized to the case of isotropic covariances for the components, although the mixture density network can readily be extended to allow for general covariance matrices by representing the covariances using a Cholesky factorization (Williams, 1996). Even with isotropic components, the conditional distribution p(t|x) does not assume factorization with respect to the components of t (in contrast to the standard sum-of-squares regression model) as a consequence of the mixture distribution.

We now take the various parameters of the mixture model, namely the mixing coefficients π_k(x), the means µ_k(x), and the variances σ_k^2(x), to be governed by the outputs of a conventional neural network that takes x as its input. The structure of this mixture density network is illustrated in Figure 5.20.
Figure 5.20 The mixture density network can represent general conditional probability densities p(t|x) by considering a parametric mixture model for the distribution of t whose parameters are determined by the outputs of a neural network that takes x as its input vector.

The mixture density network is closely related to the mixture of experts discussed in Section 14.5.3.
The principal difference is that in the mixture density network the same function is used to predict the parameters of all of the component densities as well as the mixing coefficients, and so the nonlinear hidden units are shared amongst the input-dependent functions.

The neural network in Figure 5.20 can, for example, be a two-layer network having sigmoidal ('tanh') hidden units. If there are L components in the mixture model (5.148), and if t has K components, then the network will have L output-unit activations denoted by a_k^π that determine the mixing coefficients π_k(x), L outputs denoted by a_k^σ that determine the kernel widths σ_k(x), and L × K outputs denoted by a_kj^µ that determine the components µ_kj(x) of the kernel centres µ_k(x).
The total number of network outputs is given by (K + 2)L, as compared with the usual K outputs for a network, which simply predicts the conditional means of the target variables.

The mixing coefficients must satisfy the constraints

$$\sum_{k=1}^{K} \pi_k(\mathbf{x}) = 1, \qquad 0 \leq \pi_k(\mathbf{x}) \leq 1, \qquad (5.149)$$

which can be achieved using a set of softmax outputs

$$\pi_k(\mathbf{x}) = \frac{\exp(a_k^{\pi})}{\sum_{l=1}^{K} \exp(a_l^{\pi})}. \qquad (5.150)$$

Similarly, the variances must satisfy σ_k^2(x) ≥ 0 and so can be represented in terms of the exponentials of the corresponding network activations using

$$\sigma_k(\mathbf{x}) = \exp(a_k^{\sigma}). \qquad (5.151)$$

Finally, because the means µ_k(x) have real components, they can be represented directly by the network output activations

$$\mu_{kj}(\mathbf{x}) = a_{kj}^{\mu}. \qquad (5.152)$$
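The parameterization (5.150)-(5.152) amounts to splitting the (K + 2)L output activations into three groups and applying a softmax, an exponential, and the identity. The following is a minimal sketch (not code from the book), assuming the activations arrive as a single NumPy vector laid out as [mixing, widths, means]; the ordering is an implementation choice, not prescribed by the text.

```python
import numpy as np

def mdn_parameters(activations, L, K):
    """Split a vector of (K + 2) * L output activations into mixture parameters.

    activations : shape ((K + 2) * L,), assumed ordering
                  [a^pi (L values), a^sigma (L values), a^mu (L * K values)].
    Returns (pi, sigma, mu) with shapes (L,), (L,), (L, K).
    """
    a_pi = activations[:L]
    a_sigma = activations[L:2 * L]
    a_mu = activations[2 * L:].reshape(L, K)

    # Mixing coefficients via softmax, eq. (5.150); subtract the max for numerical stability
    e = np.exp(a_pi - np.max(a_pi))
    pi = e / np.sum(e)

    # Kernel widths via exponentials, eq. (5.151)
    sigma = np.exp(a_sigma)

    # Kernel centres taken directly from the activations, eq. (5.152)
    mu = a_mu
    return pi, sigma, mu
```

Subtracting the maximum activation before exponentiating leaves the softmax unchanged and is a standard guard against overflow.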
The adaptive parameters of the mixture density network comprise the vector w of weights and biases in the neural network, which can be set by maximum likelihood, or equivalently by minimizing an error function defined to be the negative logarithm of the likelihood. For independent data, this error function takes the form

$$E(\mathbf{w}) = -\sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k(\mathbf{x}_n, \mathbf{w})\, \mathcal{N}\bigl(\mathbf{t}_n \mid \boldsymbol{\mu}_k(\mathbf{x}_n, \mathbf{w}), \sigma_k^2(\mathbf{x}_n, \mathbf{w})\bigr) \right\} \qquad (5.153)$$

where we have made the dependencies on w explicit.
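As a concrete reading of (5.153), the following sketch (illustrative only) evaluates the negative log likelihood for a batch of inputs, given the mixture parameters produced by the network for each x_n; an isotropic Gaussian over a K-dimensional target is assumed.

```python
import numpy as np

def mdn_error(pi, sigma, mu, t):
    """Negative log likelihood of eq. (5.153).

    pi    : (N, L)    mixing coefficients pi_k(x_n)
    sigma : (N, L)    widths sigma_k(x_n)
    mu    : (N, L, K) centres mu_k(x_n)
    t     : (N, K)    targets t_n
    """
    N, L, K = mu.shape
    # Squared distances ||t_n - mu_k(x_n)||^2, shape (N, L)
    sq = np.sum((t[:, None, :] - mu) ** 2, axis=2)
    # Isotropic Gaussian densities N(t_n | mu_k, sigma_k^2 I), shape (N, L)
    norm = (2.0 * np.pi * sigma ** 2) ** (-K / 2.0)
    gauss = norm * np.exp(-0.5 * sq / sigma ** 2)
    # Sum over components, take the log, sum over data points, negate
    return -np.sum(np.log(np.sum(pi * gauss, axis=1)))
```

In practice the inner sum is usually evaluated with a log-sum-exp to avoid underflow when individual component densities are very small.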
In order to minimize the error function, we need to calculate the derivatives of the error E(w) with respect to the components of w. These can be evaluated by using the standard backpropagation procedure, provided we obtain suitable expressions for the derivatives of the error with respect to the output-unit activations. These represent error signals δ for each pattern and for each output unit, and can be backpropagated to the hidden units, where the error function derivatives are evaluated in the usual way. Because the error function (5.153) is composed of a sum of terms, one for each training data point, we can consider the derivatives for a particular pattern n and then find the derivatives of E by summing over all patterns.

Because we are dealing with mixture distributions, it is convenient to view the mixing coefficients π_k(x) as x-dependent prior probabilities and to introduce the corresponding posterior probabilities given by

$$\gamma_k(\mathbf{t}|\mathbf{x}) = \frac{\pi_k \mathcal{N}_{nk}}{\sum_{l=1}^{K} \pi_l \mathcal{N}_{nl}} \qquad (5.154)$$

where N_nk denotes N(t_n | µ_k(x_n), σ_k^2(x_n)).

The derivatives with respect to the network output activations governing the mixing coefficients are given by (Exercise 5.34)

$$\frac{\partial E_n}{\partial a_k^{\pi}} = \pi_k - \gamma_k. \qquad (5.155)$$

Similarly, the derivatives with respect to the output activations controlling the component means are given by (Exercise 5.35)

$$\frac{\partial E_n}{\partial a_{kl}^{\mu}} = \gamma_k \left\{ \frac{\mu_{kl} - t_l}{\sigma_k^2} \right\}. \qquad (5.156)$$

Finally, the derivatives with respect to the output activations controlling the component variances are given by (Exercise 5.36)

$$\frac{\partial E_n}{\partial a_k^{\sigma}} = -\gamma_k \left\{ \frac{\|\mathbf{t} - \boldsymbol{\mu}_k\|^2}{\sigma_k^3} - \frac{1}{\sigma_k} \right\}. \qquad (5.157)$$
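The responsibilities (5.154) and the gradients (5.155)-(5.157) translate directly into the output-layer error signals used by backpropagation. The sketch below (illustrative; it simply transcribes the expressions above for a single pattern n) computes them from the mixture parameters and the target.

```python
import numpy as np

def mdn_output_deltas(pi, sigma, mu, t):
    """Error signals dE_n/da for one pattern, following eqs. (5.154)-(5.157).

    pi    : (L,)   mixing coefficients pi_k(x_n)
    sigma : (L,)   widths sigma_k(x_n)
    mu    : (L, K) centres mu_k(x_n)
    t     : (K,)   target vector t_n
    """
    L, K = mu.shape
    sq = np.sum((t[None, :] - mu) ** 2, axis=1)          # ||t - mu_k||^2
    gauss = (2.0 * np.pi * sigma ** 2) ** (-K / 2.0) * np.exp(-0.5 * sq / sigma ** 2)

    # Posterior responsibilities gamma_k, eq. (5.154)
    gamma = pi * gauss / np.sum(pi * gauss)

    # Gradients with respect to the output activations
    d_pi = pi - gamma                                                  # eq. (5.155)
    d_mu = gamma[:, None] * (mu - t[None, :]) / sigma[:, None] ** 2    # eq. (5.156)
    d_sigma = -gamma * (sq / sigma ** 3 - 1.0 / sigma)                 # eq. (5.157), as given in the text
    return d_pi, d_mu, d_sigma
```

These δ values are then backpropagated through the hidden units in the usual way to obtain the derivatives of the error with respect to the network weights.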