Bishop C.M., Pattern Recognition and Machine Learning (2006)
In practice, for anything other than small N, this bias will not prove to be a serious problem. However, throughout this book we shall be interested in more complex models with many parameters, for which the bias problems associated with maximum likelihood will be much more severe. In fact, as we shall see, the issue of bias in maximum likelihood lies at the root of the over-fitting problem that we encountered earlier in the context of polynomial curve fitting.

1.2.5 Curve fitting re-visited

We have seen (Section 1.1) how the problem of polynomial curve fitting can be expressed in terms of error minimization.
Here we return to the curve fitting example and view it from a probabilistic perspective, thereby gaining some insights into error functions and regularization, as well as taking us towards a full Bayesian treatment.

The goal in the curve fitting problem is to be able to make predictions for the target variable t given some new value of the input variable x, on the basis of a set of training data comprising N input values x = (x_1, ..., x_N)^T and their corresponding target values t = (t_1, ..., t_N)^T. We can express our uncertainty over the value of the target variable using a probability distribution. For this purpose, we shall assume that, given the value of x, the corresponding value of t has a Gaussian distribution with a mean equal to the value y(x, w) of the polynomial curve given by (1.1).
Thus we have

\[ p(t \mid x, \mathbf{w}, \beta) = \mathcal{N}\bigl(t \mid y(x, \mathbf{w}),\, \beta^{-1}\bigr) \qquad (1.60) \]

where, for consistency with the notation in later chapters, we have defined a precision parameter β corresponding to the inverse variance of the distribution. This is illustrated schematically in Figure 1.16.

[Figure 1.16: Schematic illustration of the Gaussian conditional distribution for t given x defined by (1.60), in which the mean is given by the polynomial function y(x, w), and the precision is given by the parameter β, which is related to the variance by β^{-1} = σ².]
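To make the noise model (1.60) concrete, the following minimal sketch draws a synthetic data set in the spirit of the sinusoidal example from Section 1.1. The code is in Python/NumPy, which the book itself does not use; the underlying curve sin(2πx), the sample size N, and the noise precision are illustrative choices rather than values prescribed by the text.

import numpy as np

# Sample targets from (1.60): each t_n is Gaussian with mean y(x_n, w) and variance 1/beta.
# Here the "true" curve is sin(2*pi*x); all numerical values are illustrative.
rng = np.random.default_rng(0)

N = 10                # number of training points (illustrative)
beta_true = 11.1      # noise precision, i.e. noise variance 1/beta ~ 0.09

x = np.linspace(0.0, 1.0, N)
t = np.sin(2.0 * np.pi * x) + rng.normal(scale=np.sqrt(1.0 / beta_true), size=N)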
We now use the training data {x, t} to determine the values of the unknown parameters w and β by maximum likelihood. If the data are assumed to be drawn independently from the distribution (1.60), then the likelihood function is given by

\[ p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}\bigl(t_n \mid y(x_n, \mathbf{w}),\, \beta^{-1}\bigr). \qquad (1.61) \]

As we did in the case of the simple Gaussian distribution earlier, it is convenient to maximize the logarithm of the likelihood function. Substituting for the form of the Gaussian distribution, given by (1.46), we obtain the log likelihood function in the form

\[ \ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = -\frac{\beta}{2} \sum_{n=1}^{N} \{y(x_n, \mathbf{w}) - t_n\}^2 + \frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi). \qquad (1.62) \]

Consider first the determination of the maximum likelihood solution for the polynomial coefficients, which will be denoted by w_ML.
These are determined by maximizing (1.62) with respect to w. For this purpose, we can omit the last two terms on the right-hand side of (1.62) because they do not depend on w. Also, we note that scaling the log likelihood by a positive constant coefficient does not alter the location of the maximum with respect to w, and so we can replace the coefficient β/2 with 1/2. Finally, instead of maximizing the log likelihood, we can equivalently minimize the negative log likelihood.
We therefore see that maximizing likelihood is equivalent, so far as determining w is concerned, to minimizing the sum-of-squares error function defined by (1.2). Thus the sum-of-squares error function has arisen as a consequence of maximizing likelihood under the assumption of a Gaussian noise distribution.

We can also use maximum likelihood to determine the precision parameter β of the Gaussian conditional distribution. Maximizing (1.62) with respect to β gives

\[ \frac{1}{\beta_{ML}} = \frac{1}{N} \sum_{n=1}^{N} \{y(x_n, \mathbf{w}_{ML}) - t_n\}^2. \qquad (1.63) \]

Again we can first determine the parameter vector w_ML governing the mean and subsequently use this to find the precision β_ML, as was the case for the simple Gaussian distribution (Section 1.2.4).
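This two-stage fit is short to write down in code. The sketch below is illustrative rather than taken from the book: it reuses the synthetic arrays x and t generated earlier, obtains w_ML by ordinary least squares for a polynomial of order M (the choice M = 3 is arbitrary), and then evaluates β_ML from (1.63).

# Maximum likelihood fit of the polynomial coefficients and the noise precision.
M = 3  # polynomial order (illustrative choice)

# Design matrix with elements Phi[n, i] = x_n ** i, for i = 0, ..., M.
Phi = np.vander(x, M + 1, increasing=True)

# w_ML minimizes the sum-of-squares error, which is equivalent to maximizing (1.62) in w.
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# beta_ML from (1.63): the inverse of the mean squared residual.
residuals = Phi @ w_ml - t
beta_ml = 1.0 / np.mean(residuals ** 2)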
Having determined the parameters w and β, we can now make predictions for new values of x. Because we now have a probabilistic model, these are expressed in terms of the predictive distribution that gives the probability distribution over t, rather than simply a point estimate, and is obtained by substituting the maximum likelihood parameters into (1.60) to give

\[ p(t \mid x, \mathbf{w}_{ML}, \beta_{ML}) = \mathcal{N}\bigl(t \mid y(x, \mathbf{w}_{ML}),\, \beta_{ML}^{-1}\bigr). \qquad (1.64) \]
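Continuing the illustrative sketch above (again not code from the book, and reusing M, w_ml, and beta_ml), the maximum likelihood predictive distribution (1.64) at new inputs is simply a Gaussian centred on the fitted polynomial with constant variance 1/β_ML:

# Predictive distribution (1.64) at new inputs: mean y(x, w_ML), variance 1/beta_ML.
x_new = np.linspace(0.0, 1.0, 101)
Phi_new = np.vander(x_new, M + 1, increasing=True)

pred_mean = Phi_new @ w_ml
pred_std = np.full_like(pred_mean, np.sqrt(1.0 / beta_ml))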
Now let us take a step towards a more Bayesian approach and introduce a prior distribution over the polynomial coefficients w. For simplicity, let us consider a Gaussian distribution of the form

\[ p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1}\mathbf{I}) = \left(\frac{\alpha}{2\pi}\right)^{(M+1)/2} \exp\left\{-\frac{\alpha}{2}\mathbf{w}^{\mathrm{T}}\mathbf{w}\right\} \qquad (1.65) \]

where α is the precision of the distribution, and M + 1 is the total number of elements in the vector w for an Mth order polynomial. Variables such as α, which control the distribution of model parameters, are called hyperparameters. Using Bayes' theorem, the posterior distribution for w is proportional to the product of the prior distribution and the likelihood function

\[ p(\mathbf{w} \mid \mathbf{x}, \mathbf{t}, \alpha, \beta) \propto p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta)\, p(\mathbf{w} \mid \alpha). \qquad (1.66) \]

We can now determine w by finding the most probable value of w given the data, in other words by maximizing the posterior distribution.
This technique is called maximum posterior, or simply MAP. Taking the negative logarithm of (1.66) and combining with (1.62) and (1.65), we find that the maximum of the posterior is given by the minimum of

\[ \frac{\beta}{2} \sum_{n=1}^{N} \{y(x_n, \mathbf{w}) - t_n\}^2 + \frac{\alpha}{2}\mathbf{w}^{\mathrm{T}}\mathbf{w}. \qquad (1.67) \]

Thus we see that maximizing the posterior distribution is equivalent to minimizing the regularized sum-of-squares error function encountered earlier in the form (1.4), with a regularization parameter given by λ = α/β.
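Because (1.67) is quadratic in w, the MAP estimate has a closed form, namely regularized (ridge) least squares with λ = α/β. The sketch below is illustrative code, not from the book; it reuses the design matrix Phi and the noise precision beta_true from the earlier sketches, and the value of α is borrowed from the Figure 1.17 caption purely for illustration.

# MAP estimate of w: minimize (1.67), i.e. regularized least squares with lambda = alpha/beta.
alpha = 5e-3                 # prior precision (the value quoted in Figure 1.17; illustrative here)
lam = alpha / beta_true      # regularization parameter lambda = alpha / beta

# Setting the gradient of (1.67) to zero gives (Phi^T Phi + lam*I) w = Phi^T t.
A = Phi.T @ Phi + lam * np.eye(M + 1)
w_map = np.linalg.solve(A, Phi.T @ t)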
1.2.6 Bayesian curve fitting

Although we have included a prior distribution p(w|α), we are so far still making a point estimate of w and so this does not yet amount to a Bayesian treatment. In a fully Bayesian approach, we should consistently apply the sum and product rules of probability, which requires, as we shall see shortly, that we integrate over all values of w. Such marginalizations lie at the heart of Bayesian methods for pattern recognition.

In the curve fitting problem, we are given the training data x and t, along with a new test point x, and our goal is to predict the value of t. We therefore wish to evaluate the predictive distribution p(t | x, x, t). Here we shall assume that the parameters α and β are fixed and known in advance (in later chapters we shall discuss how such parameters can be inferred from data in a Bayesian setting).

A Bayesian treatment simply corresponds to a consistent application of the sum and product rules of probability, which allow the predictive distribution to be written in the form

\[ p(t \mid x, \mathbf{x}, \mathbf{t}) = \int p(t \mid x, \mathbf{w})\, p(\mathbf{w} \mid \mathbf{x}, \mathbf{t})\, \mathrm{d}\mathbf{w}. \qquad (1.68) \]

Here p(t | x, w) is given by (1.60), and we have omitted the dependence on α and β to simplify the notation.
Here p(w | x, t) is the posterior distribution over parameters, and can be found by normalizing the right-hand side of (1.66). We shall see in Section 3.3 that, for problems such as the curve-fitting example, this posterior distribution is a Gaussian and can be evaluated analytically. Similarly, the integration in (1.68) can also be performed analytically, with the result that the predictive distribution is given by a Gaussian of the form

\[ p(t \mid x, \mathbf{x}, \mathbf{t}) = \mathcal{N}\bigl(t \mid m(x),\, s^2(x)\bigr) \qquad (1.69) \]

where the mean and variance are given by

\[ m(x) = \beta\, \boldsymbol{\phi}(x)^{\mathrm{T}} \mathbf{S} \sum_{n=1}^{N} \boldsymbol{\phi}(x_n)\, t_n \qquad (1.70) \]

\[ s^2(x) = \beta^{-1} + \boldsymbol{\phi}(x)^{\mathrm{T}} \mathbf{S}\, \boldsymbol{\phi}(x). \qquad (1.71) \]

Here the matrix S is given by

\[ \mathbf{S}^{-1} = \alpha\mathbf{I} + \beta \sum_{n=1}^{N} \boldsymbol{\phi}(x_n)\, \boldsymbol{\phi}(x_n)^{\mathrm{T}} \qquad (1.72) \]

where I is the unit matrix, and we have defined the vector φ(x) with elements φ_i(x) = x^i for i = 0, ..., M.
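These formulas translate directly into code. Continuing the earlier illustrative sketches (not code from the book, and reusing Phi, Phi_new, alpha, and beta_true defined above, so that the polynomial order is the illustrative M = 3 rather than the M = 9 of Figure 1.17), the Bayesian predictive mean and variance of (1.70)-(1.72) can be computed as follows.

# Bayesian predictive distribution (1.69)-(1.72).
# S^{-1} = alpha*I + beta * sum_n phi(x_n) phi(x_n)^T = alpha*I + beta * Phi^T Phi
S_inv = alpha * np.eye(M + 1) + beta_true * (Phi.T @ Phi)
S = np.linalg.inv(S_inv)

# Mean (1.70): m(x) = beta * phi(x)^T S sum_n phi(x_n) t_n, evaluated at every new input.
m_x = beta_true * Phi_new @ S @ (Phi.T @ t)

# Variance (1.71): s^2(x) = 1/beta + phi(x)^T S phi(x), evaluated row-wise over Phi_new.
s2_x = 1.0 / beta_true + np.einsum('ij,jk,ik->i', Phi_new, S, Phi_new)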
We see that the variance, as well as the mean, of the predictive distribution in (1.69) is dependent on x. The first term in (1.71) represents the uncertainty in the predicted value of t due to the noise on the target variables and was expressed already in the maximum likelihood predictive distribution (1.64) through β_ML^{-1}. However, the second term arises from the uncertainty in the parameters w and is a consequence of the Bayesian treatment. The predictive distribution for the synthetic sinusoidal regression problem is illustrated in Figure 1.17.
[Figure 1.17: The predictive distribution resulting from a Bayesian treatment of polynomial curve fitting using an M = 9 polynomial, with the fixed parameters α = 5 × 10^{-3} and β = 11.1 (corresponding to the known noise variance), in which the red curve denotes the mean of the predictive distribution and the red region corresponds to ±1 standard deviation around the mean.]

1.3. Model Selection

In our example of polynomial curve fitting using least squares, we saw that there was an optimal order of polynomial that gave the best generalization.
The order of the polynomial controls the number of free parameters in the model and thereby governs the model complexity. With regularized least squares, the regularization coefficient λ also controls the effective complexity of the model, whereas for more complex models, such as mixture distributions or neural networks, there may be multiple parameters governing complexity. In a practical application, we need to determine the values of such parameters, and the principal objective in doing so is usually to achieve the best predictive performance on new data. Furthermore, as well as finding the appropriate values for complexity parameters within a given model, we may wish to consider a range of different types of model in order to find the best one for our particular application.

We have already seen that, in the maximum likelihood approach, the performance on the training set is not a good indicator of predictive performance on unseen data due to the problem of over-fitting.
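One simple way to compare candidate complexities, sketched below in illustrative code that is not from the book, is to fit models of different polynomial order on part of the synthetic data generated earlier and compare their errors on the points held out from training. The even split and the range of candidate orders are arbitrary choices for illustration, assuming the arrays x, t, N, and the generator rng defined in the first sketch.

# Compare polynomial orders by their error on held-out data (illustrative sketch).
perm = rng.permutation(N)
x_tr, t_tr = x[perm[:N // 2]], t[perm[:N // 2]]
x_val, t_val = x[perm[N // 2:]], t[perm[N // 2:]]

def rms_error(w, x_pts, t_pts):
    # Root-mean-square error of the polynomial with coefficients w at the given points.
    pred = np.vander(x_pts, len(w), increasing=True) @ w
    return np.sqrt(np.mean((pred - t_pts) ** 2))

# Fit each candidate order on the training half and evaluate on the held-out half.
for order in range(0, 6):
    Phi_tr = np.vander(x_tr, order + 1, increasing=True)
    w_fit, *_ = np.linalg.lstsq(Phi_tr, t_tr, rcond=None)
    print(f"M={order}: train RMS={rms_error(w_fit, x_tr, t_tr):.3f}, "
          f"held-out RMS={rms_error(w_fit, x_val, t_val):.3f}")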