Bishop C.M., Pattern Recognition and Machine Learning (2006)
Figure 3.4 Plot of the contours of the unregularized error function (blue) along with the constraint region (3.30) for the quadratic regularizer q = 2 on the left and the lasso regularizer q = 1 on the right, in which the optimum value for the parameter vector w is denoted by $\mathbf{w}^{\star}$. The lasso gives a sparse solution in which $w_1^{\star} = 0$. (Both panels plot w2 against w1.)

For the remainder of this chapter we shall focus on the quadratic regularizer (3.27), both for its practical importance and its analytical tractability.
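As a brief, hypothetical illustration of the contrast shown in Figure 3.4 (the quadratic regularizer merely shrinks the weights, whereas the lasso drives some of them exactly to zero), the following sketch fits both penalties to synthetic data. It assumes scikit-learn is available; note that scikit-learn parameterizes the regularization strength differently from the λ used in this chapter, and that the data set and penalty strengths are invented purely for illustration.

```python
# Sketch: quadratic (q = 2) versus lasso (q = 1) regularization on synthetic data.
# Assumes scikit-learn is installed; the data-generating process is invented for illustration.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
N, M = 50, 10
X = rng.normal(size=(N, M))
w_true = np.zeros(M)
w_true[:3] = [1.5, -2.0, 0.7]           # only the first three features matter
t = X @ w_true + 0.1 * rng.normal(size=N)

ridge = Ridge(alpha=1.0).fit(X, t)      # quadratic penalty: shrinks all weights
lasso = Lasso(alpha=0.1).fit(X, t)      # q = 1 penalty: drives some weights exactly to zero

print("ridge coefficients:", np.round(ridge.coef_, 3))
print("lasso coefficients:", np.round(lasso.coef_, 3))
print("exact zeros (ridge, lasso):",
      np.sum(ridge.coef_ == 0), np.sum(lasso.coef_ == 0))
```

Typically the lasso sets the coefficients of the irrelevant features exactly to zero, mirroring the sparse solution in the figure, whereas the quadratic regularizer only shrinks them.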
3.1.5 Multiple outputs

So far, we have considered the case of a single target variable t. In some applications, we may wish to predict K > 1 target variables, which we denote collectively by the target vector t. This could be done by introducing a different set of basis functions for each component of t, leading to multiple, independent regression problems. However, a more interesting, and more common, approach is to use the same set of basis functions to model all of the components of the target vector, so that

$$ \mathbf{y}(\mathbf{x}, \mathbf{w}) = \mathbf{W}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}) \tag{3.31} $$

where y is a K-dimensional column vector, W is an M × K matrix of parameters, and $\boldsymbol{\phi}(\mathbf{x})$ is an M-dimensional column vector with elements $\phi_j(\mathbf{x})$, with $\phi_0(\mathbf{x}) = 1$ as before.
Suppose we take the conditional distribution of the target vector to be an isotropic Gaussian of the form

$$ p(\mathbf{t}|\mathbf{x}, \mathbf{W}, \beta) = \mathcal{N}\bigl(\mathbf{t}|\mathbf{W}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}), \beta^{-1}\mathbf{I}\bigr). \tag{3.32} $$

If we have a set of observations $\mathbf{t}_1, \ldots, \mathbf{t}_N$, we can combine these into a matrix T of size N × K such that the nth row is given by $\mathbf{t}_n^{\mathrm{T}}$. Similarly, we can combine the input vectors $\mathbf{x}_1, \ldots, \mathbf{x}_N$ into a matrix X. The log likelihood function is then given by

$$ \ln p(\mathbf{T}|\mathbf{X}, \mathbf{W}, \beta) = \sum_{n=1}^{N} \ln \mathcal{N}\bigl(\mathbf{t}_n|\mathbf{W}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}_n), \beta^{-1}\mathbf{I}\bigr) = \frac{NK}{2} \ln\!\left(\frac{\beta}{2\pi}\right) - \frac{\beta}{2} \sum_{n=1}^{N} \bigl\| \mathbf{t}_n - \mathbf{W}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}_n) \bigr\|^2. \tag{3.33} $$

As before, we can maximize this function with respect to W (Exercise 3.6), giving

$$ \mathbf{W}_{\mathrm{ML}} = \bigl(\boldsymbol{\Phi}^{\mathrm{T}} \boldsymbol{\Phi}\bigr)^{-1} \boldsymbol{\Phi}^{\mathrm{T}} \mathbf{T}. \tag{3.34} $$

If we examine this result for each target variable $t_k$, we have

$$ \mathbf{w}_k = \bigl(\boldsymbol{\Phi}^{\mathrm{T}} \boldsymbol{\Phi}\bigr)^{-1} \boldsymbol{\Phi}^{\mathrm{T}} \mathbf{t}_k = \boldsymbol{\Phi}^{\dagger} \mathbf{t}_k \tag{3.35} $$

where $\mathbf{t}_k$ is an N-dimensional column vector with components $t_{nk}$ for n = 1, . . . , N. Thus the solution to the regression problem decouples between the different target variables, and we need only compute a single pseudo-inverse matrix $\boldsymbol{\Phi}^{\dagger}$, which is shared by all of the vectors $\mathbf{w}_k$.

The extension to general Gaussian noise distributions having arbitrary covariance matrices is straightforward. Again, this leads to a decoupling into K independent regression problems. This result is unsurprising, because the parameters W define only the mean of the Gaussian noise distribution, and we know from Section 2.3.4 that the maximum likelihood solution for the mean of a multivariate Gaussian is independent of the covariance.
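The following sketch illustrates (3.34) and (3.35) under an assumed setup (a polynomial basis and a synthetic two-output data set, neither taken from the book): it builds the design matrix Φ, computes W_ML with the pseudo-inverse, and checks that each column w_k can equally well be obtained by regressing the corresponding target column on the same shared Φ†.

```python
# Sketch of the multiple-output least-squares solution (3.34)-(3.35).
# The basis functions and data-generating process are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(1)
N, K = 100, 2                               # N observations, K target variables
x = rng.uniform(-1.0, 1.0, size=N)

def phi(x, M=4):
    """Polynomial basis with phi_0(x) = 1, giving an M-dimensional feature vector."""
    return np.vander(x, M, increasing=True)  # columns: 1, x, x^2, x^3

Phi = phi(x)                                # design matrix, shape (N, M)
W_true = rng.normal(size=(Phi.shape[1], K))
T = Phi @ W_true + 0.05 * rng.normal(size=(N, K))   # target matrix, shape (N, K)

Phi_pinv = np.linalg.pinv(Phi)              # the single shared pseudo-inverse Phi^dagger
W_ml = Phi_pinv @ T                         # equation (3.34)

# Decoupling (3.35): solving each target column on its own gives the same columns w_k.
for k in range(K):
    assert np.allclose(Phi_pinv @ T[:, k], W_ml[:, k])

print("W_ML:\n", np.round(W_ml, 3))
```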
From now on, we shall therefore consider a single target variable t for simplicity.

3.2. The Bias-Variance Decomposition

So far in our discussion of linear models for regression, we have assumed that the form and number of basis functions are both fixed. As we have seen in Chapter 1, the use of maximum likelihood, or equivalently least squares, can lead to severe over-fitting if complex models are trained using data sets of limited size. However, limiting the number of basis functions in order to avoid over-fitting has the side effect of limiting the flexibility of the model to capture interesting and important trends in the data. Although the introduction of regularization terms can control over-fitting for models with many parameters, this raises the question of how to determine a suitable value for the regularization coefficient λ.
Seeking the solution that minimizes the regularized error function with respect to both the weight vector w and the regularization coefficient λ is clearly not the right approach, since this leads to the unregularized solution with λ = 0.

As we have seen in earlier chapters, the phenomenon of over-fitting is really an unfortunate property of maximum likelihood and does not arise when we marginalize over parameters in a Bayesian setting.
In this chapter, we shall consider the Bayesian view of model complexity in some depth. Before doing so, however, it is instructive to consider a frequentist viewpoint of the model complexity issue, known as the bias-variance trade-off. Although we shall introduce this concept in the context of linear basis function models, where it is easy to illustrate the ideas using simple examples, the discussion has more general applicability.

In Section 1.5.5, when we discussed decision theory for regression problems, we considered various loss functions, each of which leads to a corresponding optimal prediction once we are given the conditional distribution p(t|x).
A popular choice is the squared loss function, for which the optimal prediction is given by the conditional expectation, which we denote by h(x) and which is given by

$$ h(\mathbf{x}) = \mathbb{E}[t|\mathbf{x}] = \int t\, p(t|\mathbf{x})\, \mathrm{d}t. \tag{3.36} $$

At this point, it is worth distinguishing between the squared loss function arising from decision theory and the sum-of-squares error function that arose in the maximum likelihood estimation of model parameters.
We might use more sophisticated techniques than least squares, for example regularization or a fully Bayesian approach, to determine the conditional distribution p(t|x). These can all be combined with the squared loss function for the purpose of making predictions.

We showed in Section 1.5.5 that the expected squared loss can be written in the form

$$ \mathbb{E}[L] = \int \{y(\mathbf{x}) - h(\mathbf{x})\}^2 p(\mathbf{x})\, \mathrm{d}\mathbf{x} + \int\!\!\int \{h(\mathbf{x}) - t\}^2 p(\mathbf{x}, t)\, \mathrm{d}\mathbf{x}\, \mathrm{d}t. \tag{3.37} $$

Recall that the second term, which is independent of y(x), arises from the intrinsic noise on the data and represents the minimum achievable value of the expected loss. The first term depends on our choice for the function y(x), and we will seek a solution for y(x) which makes this term a minimum.
Because it is nonnegative, the smallest that we can hope to make this term is zero. If we had an unlimited supply of data (and unlimited computational resources), we could in principle find the regression function h(x) to any desired degree of accuracy, and this would represent the optimal choice for y(x). However, in practice we have a data set D containing only a finite number N of data points, and consequently we do not know the regression function h(x) exactly.
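As a small numerical check of the claim that the conditional mean minimizes the expected squared loss, the following sketch assumes a simple data-generating process, t = sin(2πx) + Gaussian noise, so that h(x) = sin(2πx) is known exactly, and compares a Monte Carlo estimate of E[L] for h(x) against a deliberately offset predictor; the excess loss of the offset predictor matches the first term of (3.37). The setup is invented purely for illustration.

```python
# Sketch: the conditional mean h(x) = E[t|x] minimizes the expected squared loss.
# Assumed data-generating process: t = sin(2*pi*x) + Gaussian noise.
import numpy as np

rng = np.random.default_rng(2)
S = 200_000                                    # Monte Carlo samples of (x, t)
x = rng.uniform(0.0, 1.0, size=S)
noise_std = 0.3
t = np.sin(2 * np.pi * x) + noise_std * rng.normal(size=S)

h = lambda x: np.sin(2 * np.pi * x)            # the true regression function
y_off = lambda x: np.sin(2 * np.pi * x) + 0.2  # a deliberately offset predictor

loss_h = np.mean((h(x) - t) ** 2)              # approximately the irreducible noise term
loss_off = np.mean((y_off(x) - t) ** 2)        # noise term plus the squared offset

print(f"E[L] using h(x):       {loss_h:.4f}  (noise variance = {noise_std**2:.4f})")
print(f"E[L] using h(x) + 0.2: {loss_off:.4f} (noise + 0.2^2 = {noise_std**2 + 0.04:.4f})")
```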
If we model h(x) using a parametric function y(x, w) governed by a parameter vector w, then from a Bayesian perspective the uncertainty in our model is expressed through a posterior distribution over w. A frequentist treatment, however, involves making a point estimate of w based on the data set D, and tries instead to interpret the uncertainty of this estimate through the following thought experiment.
Suppose we had a large number of data sets each of size N and each drawn independently from the distribution p(t, x). For any given data set D, we can run our learning algorithm and obtain a prediction function y(x; D). Different data sets from the ensemble will give different functions and consequently different values of the squared loss. The performance of a particular learning algorithm is then assessed by taking the average over this ensemble of data sets.

Consider the integrand of the first term in (3.37), which for a particular data set D takes the form

$$ \{y(\mathbf{x}; D) - h(\mathbf{x})\}^2. \tag{3.38} $$

Because this quantity will be dependent on the particular data set D, we take its average over the ensemble of data sets.
If we add and subtract the quantity $\mathbb{E}_D[y(\mathbf{x}; D)]$ inside the braces, and then expand, we obtain

$$ \{y(\mathbf{x}; D) - \mathbb{E}_D[y(\mathbf{x}; D)] + \mathbb{E}_D[y(\mathbf{x}; D)] - h(\mathbf{x})\}^2 = \{y(\mathbf{x}; D) - \mathbb{E}_D[y(\mathbf{x}; D)]\}^2 + \{\mathbb{E}_D[y(\mathbf{x}; D)] - h(\mathbf{x})\}^2 + 2\{y(\mathbf{x}; D) - \mathbb{E}_D[y(\mathbf{x}; D)]\}\{\mathbb{E}_D[y(\mathbf{x}; D)] - h(\mathbf{x})\}. \tag{3.39} $$

We now take the expectation of this expression with respect to D and note that the final term will vanish, giving

$$ \mathbb{E}_D\bigl[\{y(\mathbf{x}; D) - h(\mathbf{x})\}^2\bigr] = \underbrace{\{\mathbb{E}_D[y(\mathbf{x}; D)] - h(\mathbf{x})\}^2}_{(\text{bias})^2} + \underbrace{\mathbb{E}_D\bigl[\{y(\mathbf{x}; D) - \mathbb{E}_D[y(\mathbf{x}; D)]\}^2\bigr]}_{\text{variance}}. \tag{3.40} $$
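To make the step from (3.39) to (3.40) explicit, note that the factor $\mathbb{E}_D[y(\mathbf{x}; D)] - h(\mathbf{x})$ does not depend on D, so under the expectation the cross term factorizes and vanishes:

$$ \mathbb{E}_D\bigl[2\{y(\mathbf{x}; D) - \mathbb{E}_D[y(\mathbf{x}; D)]\}\{\mathbb{E}_D[y(\mathbf{x}; D)] - h(\mathbf{x})\}\bigr] = 2\{\mathbb{E}_D[y(\mathbf{x}; D)] - h(\mathbf{x})\}\bigl(\mathbb{E}_D[y(\mathbf{x}; D)] - \mathbb{E}_D[y(\mathbf{x}; D)]\bigr) = 0. $$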
We see that the expected squared difference between y(x; D) and the regression function h(x) can be expressed as the sum of two terms. The first term, called the squared bias, represents the extent to which the average prediction over all data sets differs from the desired regression function. The second term, called the variance, measures the extent to which the solutions for individual data sets vary around their average, and hence this measures the extent to which the function y(x; D) is sensitive to the particular choice of data set. We shall provide some intuition to support these definitions shortly when we consider a simple example.

So far, we have considered a single input value x. If we substitute this expansion back into (3.37), we obtain the following decomposition of the expected squared loss

$$ \text{expected loss} = (\text{bias})^2 + \text{variance} + \text{noise} \tag{3.41} $$

where

$$ (\text{bias})^2 = \int \{\mathbb{E}_D[y(\mathbf{x}; D)] - h(\mathbf{x})\}^2 p(\mathbf{x})\, \mathrm{d}\mathbf{x} \tag{3.42} $$

$$ \text{variance} = \int \mathbb{E}_D\bigl[\{y(\mathbf{x}; D) - \mathbb{E}_D[y(\mathbf{x}; D)]\}^2\bigr] p(\mathbf{x})\, \mathrm{d}\mathbf{x} \tag{3.43} $$

$$ \text{noise} = \int\!\!\int \{h(\mathbf{x}) - t\}^2 p(\mathbf{x}, t)\, \mathrm{d}\mathbf{x}\, \mathrm{d}t \tag{3.44} $$

and the bias and variance terms now refer to integrated quantities.

Our goal is to minimize the expected loss, which we have decomposed into the sum of a (squared) bias, a variance, and a constant noise term.
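The decomposition (3.41)-(3.44) can be checked numerically. The sketch below is an illustrative setup, not the book's own experiment: it assumes a sinusoidal data-generating process, Gaussian basis functions, and a fixed regularization coefficient, draws an ensemble of data sets, fits a regularized least-squares model to each, and estimates the integrated squared bias, variance, and noise terms on a test grid.

```python
# Sketch: empirical check of the decomposition (3.41)-(3.44).
# The sinusoidal data-generating process, Gaussian basis functions, and the value of
# the regularization coefficient are assumptions made purely for illustration.
import numpy as np

rng = np.random.default_rng(3)
h = lambda x: np.sin(2 * np.pi * x)           # true regression function h(x)
noise_std = 0.3                               # so the noise term (3.44) equals noise_std**2
N, L, lam = 25, 500, 1e-3                     # points per data set, number of data sets, lambda

centres = np.linspace(0, 1, 24)
def design(x, s=0.1):
    """Gaussian basis functions plus a bias term phi_0(x) = 1."""
    Phi = np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2 * s ** 2))
    return np.hstack([np.ones((len(x), 1)), Phi])

x_test = np.linspace(0, 1, 200)               # grid approximating the integral over x
predictions = np.empty((L, len(x_test)))
expected_loss = 0.0

for l in range(L):                            # the ensemble of data sets D
    x = rng.uniform(0, 1, size=N)
    t = h(x) + noise_std * rng.normal(size=N)
    Phi = design(x)
    # Regularized least squares: w = (lam*I + Phi^T Phi)^{-1} Phi^T t
    w = np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T @ t)
    predictions[l] = design(x_test) @ w       # y(x; D) evaluated on the test grid
    t_test = h(x_test) + noise_std * rng.normal(size=len(x_test))
    expected_loss += np.mean((predictions[l] - t_test) ** 2) / L

y_bar = predictions.mean(axis=0)              # E_D[y(x; D)]
bias2 = np.mean((y_bar - h(x_test)) ** 2)                 # (3.42)
variance = np.mean(predictions.var(axis=0))               # (3.43)
noise = noise_std ** 2                                    # (3.44), known here by construction

print(f"(bias)^2 + variance + noise = {bias2 + variance + noise:.4f}")
print(f"directly estimated expected loss = {expected_loss:.4f}")
```

Increasing λ in this sketch reduces the variance term at the cost of a larger squared bias, which is precisely the trade-off between model flexibility and sensitivity to the data set discussed above.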