Bishop C.M., Pattern Recognition and Machine Learning (2006)
To see this, consider the contours of the likelihood function and the prior as illustrated in Figure 3.15. Here we have implicitly transformed to a rotated set of axes in parameter space aligned with the eigenvectors ui defined in (3.87). Contours of the likelihood function are then axis-aligned ellipses. The eigenvalues λi measure the curvature of the likelihood function, and so in Figure 3.15 the eigenvalue λ1 is small compared with λ2 (because a smaller curvature corresponds to a greater elongation of the contours of the likelihood function).
Because βΦᵀΦ is a positive definite matrix, it will have positive eigenvalues, and so the ratio λi/(λi + α) will lie between 0 and 1. Consequently, the quantity γ defined by (3.91) will lie in the range 0 ≤ γ ≤ M. For directions in which λi ≫ α, the corresponding parameter wi will be close to its maximum likelihood value, and the ratio λi/(λi + α) will be close to 1. Such parameters are called well determined because their values are tightly constrained by the data. Conversely, for directions in which λi ≪ α, the corresponding parameters wi will be close to zero, as will the ratios λi/(λi + α). These are directions in which the likelihood function is relatively insensitive to the parameter value and so the parameter has been set to a small value by the prior.
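As a concrete illustration (a minimal sketch, not taken from the text; the design matrix and the values of α and β are made up), the following Python fragment computes the eigenvalues λi of βΦᵀΦ and the ratios λi/(λi + α), whose sum gives γ:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: the data, alpha and beta below are illustrative values only.
N, M = 25, 10
Phi = rng.normal(size=(N, M))       # design matrix, rows are phi(x_n)^T
alpha, beta = 2.0, 11.1             # prior precision and noise precision

# Eigenvalues lambda_i of beta * Phi^T Phi (cf. (3.87))
lam = np.linalg.eigvalsh(beta * Phi.T @ Phi)

ratios = lam / (lam + alpha)        # each ratio lies between 0 and 1
gamma = ratios.sum()                # effective number of well-determined parameters, (3.91)

print(np.round(ratios, 3))          # near 1 where lambda_i >> alpha, near 0 where lambda_i << alpha
print(gamma)                        # lies between 0 and M
```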
The quantity γ defined by (3.91) therefore measures the effective total number of well determined parameters.

We can obtain some insight into the result (3.95) for re-estimating β by comparing it with the corresponding maximum likelihood result given by (3.21). Both of these formulae express the variance (the inverse precision) as an average of the squared differences between the targets and the model predictions.
However, they differ in that the number of data points N in the denominator of the maximum likelihood result is replaced by N − γ in the Bayesian result. We recall from (1.56) that the maximum likelihood estimate of the variance for a Gaussian distribution over a single variable x is given by

$$
\sigma_{\mathrm{ML}}^{2} = \frac{1}{N}\sum_{n=1}^{N}(x_n - \mu_{\mathrm{ML}})^{2}
\tag{3.96}
$$

and that this estimate is biased because the maximum likelihood solution µML for the mean has fitted some of the noise on the data. In effect, this has used up one degree of freedom in the model. The corresponding unbiased estimate is given by (1.59) and takes the form

$$
\sigma_{\mathrm{MAP}}^{2} = \frac{1}{N-1}\sum_{n=1}^{N}(x_n - \mu_{\mathrm{ML}})^{2}.
\tag{3.97}
$$

We shall see in Section 10.1.3 that this result can be obtained from a Bayesian treatment in which we marginalize over the unknown mean. The factor of N − 1 in the denominator of the Bayesian result takes account of the fact that one degree of freedom has been used in fitting the mean and removes the bias of maximum likelihood.
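This bias is easy to check numerically. The sketch below (illustrative only; the variance, sample size, and number of trials are arbitrary choices) draws many small samples from a Gaussian and compares the averages of the estimates (3.96) and (3.97):

```python
import numpy as np

rng = np.random.default_rng(1)

true_var = 4.0
N = 5                      # a small sample makes the bias easy to see
trials = 200_000

x = rng.normal(0.0, np.sqrt(true_var), size=(trials, N))
mu_ml = x.mean(axis=1, keepdims=True)
sq = ((x - mu_ml) ** 2).sum(axis=1)

var_ml = sq / N            # maximum likelihood estimate (3.96), biased
var_unb = sq / (N - 1)     # unbiased estimate (3.97)

print(var_ml.mean())       # roughly (N - 1)/N * true_var = 3.2
print(var_unb.mean())      # roughly true_var = 4.0
```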
Now consider the corresponding results for the linear regression model. The mean of the target distribution is now given by the function wᵀφ(x), which contains M parameters. However, not all of these parameters are tuned to the data. The effective number of parameters that are determined by the data is γ, with the remaining M − γ parameters set to small values by the prior. This is reflected in the Bayesian result for the variance that has a factor N − γ in the denominator, thereby correcting for the bias of the maximum likelihood result.

We can illustrate the evidence framework for setting hyperparameters using the sinusoidal synthetic data set from Section 1.1, together with the Gaussian basis function model comprising 9 basis functions, so that the total number of parameters in the model is given by M = 10 including the bias. Here, for simplicity of illustration, we have set β to its true value of 11.1 and then used the evidence framework to determine α, as shown in Figure 3.16.
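A sketch of this experimental setup might look as follows; the basis-function centres, their width, and the sample size are assumptions made for illustration, since the text does not specify them here:

```python
import numpy as np

rng = np.random.default_rng(2)

# Sinusoidal data in the spirit of Section 1.1; the sample size is an assumption.
N = 25
beta_true = 11.1                           # noise precision quoted in the text
x = rng.uniform(0.0, 1.0, size=N)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 1.0 / np.sqrt(beta_true), size=N)

# Gaussian basis function model: 9 basis functions plus a bias, so M = 10.
centres = np.linspace(0.0, 1.0, 9)         # assumed centres
s = 0.1                                    # assumed width

def design_matrix(x):
    """Rows are [1, phi_1(x_n), ..., phi_9(x_n)]."""
    gauss = np.exp(-0.5 * ((x[:, None] - centres[None, :]) / s) ** 2)
    return np.hstack([np.ones((x.shape[0], 1)), gauss])

Phi = design_matrix(x)
print(Phi.shape)                           # (25, 10)
```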
We can also see how the parameter α controls the magnitude of the parameters {wi}, by plotting the individual parameters versus the effective number γ of parameters, as shown in Figure 3.17.

If we consider the limit N ≫ M in which the number of data points is large in relation to the number of parameters, then from (3.87) all of the parameters will be well determined by the data because ΦᵀΦ involves an implicit sum over data points, and so the eigenvalues λi increase with the size of the data set. In this case, γ = M, and the re-estimation equations for α and β become

$$
\alpha = \frac{M}{2 E_W(\mathbf{m}_N)}
\tag{3.98}
$$

$$
\beta = \frac{N}{2 E_D(\mathbf{m}_N)}
\tag{3.99}
$$

where EW and ED are defined by (3.25) and (3.26), respectively.
These results can be used as an easy-to-compute approximation to the full evidence re-estimation formulae, because they do not require evaluation of the eigenvalue spectrum of the Hessian.

Figure 3.16 The left plot shows γ (red curve) and 2αEW(mN) (blue curve) versus ln α for the sinusoidal synthetic data set. It is the intersection of these two curves that defines the optimum value for α given by the evidence procedure. The right plot shows the corresponding graph of log evidence ln p(t|α, β) versus ln α (red curve) showing that the peak coincides with the crossing point of the curves in the left plot. Also shown is the test set error (blue curve) showing that the evidence maximum occurs close to the point of best generalization.

Figure 3.17 Plot of the 10 parameters wi from the Gaussian basis function model versus the effective number of parameters γ, in which the hyperparameter α is varied in the range 0 ≤ α ≤ ∞ causing γ to vary in the range 0 ≤ γ ≤ M.
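To make the re-estimation procedure concrete, the following sketch (not code from the book; the data, starting values, and iteration count are arbitrary) alternates between computing the posterior mean mN, the quantity γ from (3.91), and the hyperparameter updates, using the full re-estimate of α as γ/(mNᵀmN) together with (3.95) for β, and then evaluates the large-N approximations (3.98) and (3.99) for comparison:

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative data and design matrix; all numerical choices here are assumptions.
N, M = 25, 10
Phi = rng.normal(size=(N, M))
t = rng.normal(size=N)

alpha, beta = 1.0, 1.0                                   # arbitrary starting values

for _ in range(100):
    # Posterior mean for the current hyperparameters: A = alpha I + beta Phi^T Phi
    A = alpha * np.eye(M) + beta * Phi.T @ Phi
    m_N = beta * np.linalg.solve(A, Phi.T @ t)

    # Effective number of well-determined parameters, (3.91)
    lam = np.linalg.eigvalsh(beta * Phi.T @ Phi)
    gamma = np.sum(lam / (lam + alpha))

    # Evidence re-estimates: alpha from gamma and m_N, beta from (3.95)
    alpha = gamma / (m_N @ m_N)
    beta = (N - gamma) / np.sum((t - Phi @ m_N) ** 2)

# Large-N approximations (3.98), (3.99): no eigenvalue computation needed
E_W = 0.5 * m_N @ m_N
E_D = 0.5 * np.sum((t - Phi @ m_N) ** 2)
print(alpha, beta, M / (2 * E_W), N / (2 * E_D))
```

When N is not large compared with M, the two sets of values can differ noticeably, which is exactly the regime in which the eigenvalue-based updates matter.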
3.6. Limitations of Fixed Basis Functions

Throughout this chapter, we have focussed on models comprising a linear combination of fixed, nonlinear basis functions. We have seen that the assumption of linearity in the parameters led to a range of useful properties including closed-form solutions to the least-squares problem, as well as a tractable Bayesian treatment. Furthermore, for a suitable choice of basis functions, we can model arbitrary nonlinearities in the mapping from input variables to targets. In the next chapter, we shall study an analogous class of models for classification.

It might appear, therefore, that such linear models constitute a general purpose framework for solving problems in pattern recognition.
Unfortunately, there are some significant shortcomings with linear models, which will cause us to turn in later chapters to more complex models such as support vector machines and neural networks.

The difficulty stems from the assumption that the basis functions φj(x) are fixed before the training data set is observed and is a manifestation of the curse of dimensionality discussed in Section 1.4. As a consequence, the number of basis functions needs to grow rapidly, often exponentially, with the dimensionality D of the input space.

Fortunately, there are two properties of real data sets that we can exploit to help alleviate this problem. First of all, the data vectors {xn} typically lie close to a nonlinear manifold whose intrinsic dimensionality is smaller than that of the input space as a result of strong correlations between the input variables. We will see an example of this when we consider images of handwritten digits in Chapter 12.
If we are using localized basis functions, we can arrange that they are scattered in input space only in regions containing data. This approach is used in radial basis function networks and also in support vector and relevance vector machines. Neural network models, which use adaptive basis functions having sigmoidal nonlinearities, can adapt the parameters so that the regions of input space over which the basis functions vary correspond to the data manifold.
The second property is that target variables may have significant dependence on only a small number of possible directions within the data manifold. Neural networks can exploit this property by choosing the directions in input space to which the basis functions respond.

Exercises

3.1 www Show that the 'tanh' function and the logistic sigmoid function (3.6) are related by

$$
\tanh(a) = 2\sigma(2a) - 1.
\tag{3.100}
$$

Hence show that a general linear combination of logistic sigmoid functions of the form

$$
y(x, \mathbf{w}) = w_0 + \sum_{j=1}^{M} w_j\, \sigma\!\left(\frac{x - \mu_j}{s}\right)
\tag{3.101}
$$

is equivalent to a linear combination of 'tanh' functions of the form

$$
y(x, \mathbf{u}) = u_0 + \sum_{j=1}^{M} u_j \tanh\!\left(\frac{x - \mu_j}{2s}\right)
\tag{3.102}
$$

and find expressions to relate the new parameters {u1, . . . , uM} to the original parameters {w1, . . . , wM}.
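A quick numerical check of the identity (3.100), offered only as an illustration (the grid of test points is arbitrary), could be:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

a = np.linspace(-5.0, 5.0, 101)
# tanh(a) = 2*sigma(2a) - 1, equation (3.100)
assert np.allclose(np.tanh(a), 2.0 * sigmoid(2.0 * a) - 1.0)
print("identity (3.100) holds on the test grid")
```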
3.2 Show that the matrix

$$
\Phi(\Phi^{\mathrm{T}}\Phi)^{-1}\Phi^{\mathrm{T}}
\tag{3.103}
$$

takes any vector v and projects it onto the space spanned by the columns of Φ. Use this result to show that the least-squares solution (3.15) corresponds to an orthogonal projection of the vector t onto the manifold S as shown in Figure 3.2.

3.3 Consider a data set in which each data point tn is associated with a weighting factor rn > 0, so that the sum-of-squares error function becomes

$$
E_D(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N} r_n \left\{ t_n - \mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}_n) \right\}^{2}.
\tag{3.104}
$$

Find an expression for the solution w⋆ that minimizes this error function. Give two alternative interpretations of the weighted sum-of-squares error function in terms of (i) data dependent noise variance and (ii) replicated data points.

3.4 www Consider a linear model of the form

$$
y(\mathbf{x}, \mathbf{w}) = w_0 + \sum_{i=1}^{D} w_i x_i
\tag{3.105}
$$

together with a sum-of-squares error function of the form

$$
E_D(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N} \left\{ y(\mathbf{x}_n, \mathbf{w}) - t_n \right\}^{2}.
\tag{3.106}
$$

Now suppose that Gaussian noise εi with zero mean and variance σ² is added independently to each of the input variables xi.
By making use of E[εi] = 0 and E[εi εj] = δij σ², show that minimizing ED averaged over the noise distribution is equivalent to minimizing the sum-of-squares error for noise-free input variables with the addition of a weight-decay regularization term, in which the bias parameter w0 is omitted from the regularizer.

3.5 www Using the technique of Lagrange multipliers, discussed in Appendix E, show that minimization of the regularized error function (3.29) is equivalent to minimizing the unregularized sum-of-squares error (3.12) subject to the constraint (3.30). Discuss the relationship between the parameters η and λ.

3.6 www Consider a linear basis function regression model for a multivariate target variable t having a Gaussian distribution of the form

$$
p(\mathbf{t}\mid\mathbf{W}, \boldsymbol{\Sigma}) = \mathcal{N}\big(\mathbf{t}\mid\mathbf{y}(\mathbf{x}, \mathbf{W}), \boldsymbol{\Sigma}\big)
\tag{3.107}
$$

where

$$
\mathbf{y}(\mathbf{x}, \mathbf{W}) = \mathbf{W}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x})
\tag{3.108}
$$

together with a training data set comprising input basis vectors φ(xn) and corresponding target vectors tn, with n = 1, . . . , N.