Bishop C.M., Pattern Recognition and Machine Learning (2006)
If we assume that the posterior distribution is sharply peaked around the most probable value $w_{\mathrm{MAP}}$, with width $\Delta w_{\mathrm{posterior}}$, then we can approximate the integral by the value of the integrand at its maximum times the width of the peak. If we further assume that the prior is flat with width $\Delta w_{\mathrm{prior}}$, so that $p(w) = 1/\Delta w_{\mathrm{prior}}$, then we have

$$
p(\mathcal{D}) = \int p(\mathcal{D} \mid w)\, p(w)\, \mathrm{d}w \;\simeq\; p(\mathcal{D} \mid w_{\mathrm{MAP}})\, \frac{\Delta w_{\mathrm{posterior}}}{\Delta w_{\mathrm{prior}}}.
\tag{3.70}
$$
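Spelling the approximation step out symbolically (this simply restates the argument above; no additional assumptions are introduced): the integrand is replaced by its peak value spread over the width of the peak, and the flat prior contributes $p(w_{\mathrm{MAP}}) = 1/\Delta w_{\mathrm{prior}}$, so

$$
\int p(\mathcal{D} \mid w)\, p(w)\, \mathrm{d}w
\;\approx\; p(\mathcal{D} \mid w_{\mathrm{MAP}})\, p(w_{\mathrm{MAP}})\, \Delta w_{\mathrm{posterior}}
\;=\; p(\mathcal{D} \mid w_{\mathrm{MAP}})\, \frac{\Delta w_{\mathrm{posterior}}}{\Delta w_{\mathrm{prior}}}.
$$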
Figure 3.12: We can obtain a rough approximation to the model evidence if we assume that the posterior distribution over parameters is sharply peaked around its mode $w_{\mathrm{MAP}}$. (The plot shows, as a function of $w$, the posterior of width $\Delta w_{\mathrm{posterior}}$ centred on $w_{\mathrm{MAP}}$ together with a flat prior of width $\Delta w_{\mathrm{prior}}$.)

Taking logs of (3.70), we obtain

$$
\ln p(\mathcal{D}) \;\simeq\; \ln p(\mathcal{D} \mid w_{\mathrm{MAP}}) + \ln\!\left(\frac{\Delta w_{\mathrm{posterior}}}{\Delta w_{\mathrm{prior}}}\right).
\tag{3.71}
$$

This approximation is illustrated in Figure 3.12. The first term represents the fit to the data given by the most probable parameter values, and for a flat prior this would correspond to the log likelihood.
The second term penalizes the model according to its complexity. Because $\Delta w_{\mathrm{posterior}} < \Delta w_{\mathrm{prior}}$, this term is negative, and it increases in magnitude as the ratio $\Delta w_{\mathrm{posterior}}/\Delta w_{\mathrm{prior}}$ gets smaller. Thus, if parameters are finely tuned to the data in the posterior distribution, then the penalty term is large.

For a model having a set of $M$ parameters, we can make a similar approximation for each parameter in turn. Assuming that all parameters have the same ratio $\Delta w_{\mathrm{posterior}}/\Delta w_{\mathrm{prior}}$, we obtain

$$
\ln p(\mathcal{D}) \;\simeq\; \ln p(\mathcal{D} \mid \mathbf{w}_{\mathrm{MAP}}) + M \ln\!\left(\frac{\Delta w_{\mathrm{posterior}}}{\Delta w_{\mathrm{prior}}}\right).
\tag{3.72}
$$

Thus, in this very simple approximation, the size of the complexity penalty increases linearly with the number $M$ of adaptive parameters in the model.
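To see the trade-off in (3.72) numerically, here is a minimal sketch with made-up numbers: the best-fit log likelihoods and the width ratio below are hypothetical values chosen purely for illustration, not quantities from the text.

```python
import numpy as np

# Minimal sketch of the trade-off in (3.72), using made-up numbers.
# Three hypothetical models of increasing complexity: their best-fit log
# likelihoods improve with M, but each parameter pays the same (negative)
# Occam penalty ln(dw_posterior / dw_prior).

log_lik_map = {1: -120.0, 3: -100.0, 9: -97.0}   # hypothetical ln p(D | w_MAP)
ratio = 0.05                                      # hypothetical dw_posterior / dw_prior

for M, ll in log_lik_map.items():
    log_evidence = ll + M * np.log(ratio)         # equation (3.72)
    print(f"M = {M}: approx ln p(D) = {log_evidence:.1f}")

# With these numbers the intermediate model (M = 3) attains the highest
# approximate evidence: the M = 9 model fits only slightly better but pays
# a much larger complexity penalty.
```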
As we increase the complexity of the model, the first term will typically decrease, because a more complex model is better able to fit the data, whereas the second term will increase due to the dependence on $M$. The optimal model complexity, as determined by the maximum evidence, will be given by a trade-off between these two competing terms. We shall later develop a more refined version of this approximation, based on a Gaussian approximation to the posterior distribution (Section 4.4.1).

We can gain further insight into Bayesian model comparison and understand how the marginal likelihood can favour models of intermediate complexity by considering Figure 3.13.
Here the horizontal axis is a one-dimensional representation of the space of possible data sets, so that each point on this axis corresponds to a specific data set. We now consider three models $\mathcal{M}_1$, $\mathcal{M}_2$ and $\mathcal{M}_3$ of successively increasing complexity. Imagine running these models generatively to produce example data sets, and then looking at the distribution of data sets that result.

Figure 3.13: Schematic illustration of the distribution of data sets for three models of different complexity, in which $\mathcal{M}_1$ is the simplest and $\mathcal{M}_3$ is the most complex. Note that the distributions are normalized. In this example, for the particular observed data set $\mathcal{D}_0$, the model $\mathcal{M}_2$ with intermediate complexity has the largest evidence. (Vertical axis: $p(\mathcal{D})$; horizontal axis: $\mathcal{D}$, with $\mathcal{D}_0$ marked and curves labelled $\mathcal{M}_1$, $\mathcal{M}_2$, $\mathcal{M}_3$.)

Any given model can generate a variety of different data sets since the parameters are governed by a prior probability distribution, and for any choice of the parameters there may be random noise on the target variables. To generate a particular data set from a specific model, we first choose the values of the parameters from their prior distribution $p(\mathbf{w})$, and then for these parameter values we sample the data from $p(\mathcal{D} \mid \mathbf{w})$. A simple model (for example, based on a first order polynomial) has little variability and so will generate data sets that are fairly similar to each other. Its distribution $p(\mathcal{D})$ is therefore confined to a relatively small region of the horizontal axis.
By contrast, a complex model (such as a ninth order polynomial) can generate a great variety of different data sets, and so its distribution $p(\mathcal{D})$ is spread over a large region of the space of data sets. Because the distributions $p(\mathcal{D} \mid \mathcal{M}_i)$ are normalized, we see that the particular data set $\mathcal{D}_0$ can have the highest value of the evidence for the model of intermediate complexity. Essentially, the simpler model cannot fit the data well, whereas the more complex model spreads its predictive probability over too broad a range of data sets and so assigns relatively small probability to any one of them.

Implicit in the Bayesian model comparison framework is the assumption that the true distribution from which the data are generated is contained within the set of models under consideration. Provided this is so, we can show that Bayesian model comparison will on average favour the correct model.
To see this, consider two models $\mathcal{M}_1$ and $\mathcal{M}_2$ in which the truth corresponds to $\mathcal{M}_1$. For a given finite data set, it is possible for the Bayes factor to be larger for the incorrect model. However, if we average the Bayes factor over the distribution of data sets, we obtain the expected Bayes factor in the form

$$
\int p(\mathcal{D} \mid \mathcal{M}_1) \ln \frac{p(\mathcal{D} \mid \mathcal{M}_1)}{p(\mathcal{D} \mid \mathcal{M}_2)}\, \mathrm{d}\mathcal{D}
\tag{3.73}
$$

where the average has been taken with respect to the true distribution of the data. This quantity is an example of the Kullback-Leibler divergence (Section 1.6.1) and satisfies the property of always being positive unless the two distributions are equal, in which case it is zero. Thus, on average, the Bayes factor will always favour the correct model.

We have seen that the Bayesian framework avoids the problem of over-fitting and allows models to be compared on the basis of the training data alone.
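The intermediate-complexity effect of Figure 3.13 can be checked numerically. The sketch below is my own construction, not the book's example: for a polynomial model that is linear in its parameters $\mathbf{w}$, with prior $\mathbf{w} \sim \mathcal{N}(0, \alpha^{-1} I)$ and Gaussian noise of precision $\beta$, marginalizing over $\mathbf{w}$ gives the closed-form evidence $\mathbf{t} \sim \mathcal{N}(0, \beta^{-1} I + \alpha^{-1} \Phi \Phi^{\top})$, so the log evidence of polynomials of different order can be compared on one synthetic data set. The sinusoidal target, the values of `alpha` and `beta`, and the orders tried are illustrative choices (the first- and ninth-order polynomials echo the examples mentioned in the text).

```python
import numpy as np

# Closed-form log evidence of polynomial models of different order,
# illustrating how a model of intermediate complexity can maximize
# p(D | M_i) for a given data set (cf. Figure 3.13).

rng = np.random.default_rng(0)
N, alpha, beta = 10, 5e-3, 11.1                       # illustrative settings
x = np.linspace(0.0, 1.0, N)
t = np.sin(2.0 * np.pi * x) + rng.normal(scale=beta ** -0.5, size=N)

def log_evidence(order):
    """ln p(t | M) for a polynomial model of the given order."""
    Phi = np.vander(x, order + 1, increasing=True)    # design matrix
    C = np.eye(N) / beta + Phi @ Phi.T / alpha        # marginal covariance of t
    sign, logdet = np.linalg.slogdet(C)
    return -0.5 * (N * np.log(2.0 * np.pi) + logdet + t @ np.linalg.solve(C, t))

for m in [1, 3, 9]:                                   # first, intermediate, ninth order
    print(f"order {m}: ln p(t) = {log_evidence(m):.2f}")
# A first-order polynomial underfits, while a ninth-order model spreads its
# probability over too broad a range of data sets; an intermediate order
# typically attains the largest evidence for data of this kind.
```

The closed-form Gaussian used here is the standard marginal of a linear-Gaussian model and should agree with the evidence function that the book goes on to derive in Section 3.5.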
However, a Bayesian approach, like any approach to pattern recognition, needs to make assumptions about the form of the model, and if these are invalid then the results can be misleading. In particular, we see from Figure 3.12 that the model evidence can be sensitive to many aspects of the prior, such as the behaviour in the tails. Indeed, the evidence is not defined if the prior is improper, as can be seen by noting that an improper prior has an arbitrary scaling factor (in other words, the normalization coefficient is not defined because the distribution cannot be normalized). If we consider a proper prior and then take a suitable limit in order to obtain an improper prior (for example, a Gaussian prior in which we take the limit of infinite variance), then the evidence will go to zero, as can be seen from (3.70) and Figure 3.12.
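This limit can be checked with a small self-contained example (my own construction, not the book's): a one-parameter conjugate model $y_i = w + \varepsilon_i$ with known noise variance and a zero-mean Gaussian prior on $w$ whose variance we let grow. The marginal likelihood is Gaussian and can be evaluated exactly, and it shrinks toward zero as the prior broadens, consistent with (3.70) since $\Delta w_{\mathrm{prior}}$ grows while $\Delta w_{\mathrm{posterior}}$ stays bounded.

```python
import numpy as np

# Model: y_i = w + eps_i, eps_i ~ N(0, sigma2), prior w ~ N(0, tau2).
# Marginally, y ~ N(0, sigma2*I + tau2*11^T), evaluated directly below.

rng = np.random.default_rng(1)
sigma2, n = 0.5 ** 2, 20
y = 1.0 + rng.normal(scale=np.sqrt(sigma2), size=n)   # true w = 1.0 (illustrative)

def log_evidence(tau2):
    C = sigma2 * np.eye(n) + tau2 * np.ones((n, n))    # marginal covariance
    sign, logdet = np.linalg.slogdet(C)
    return -0.5 * (n * np.log(2.0 * np.pi) + logdet + y @ np.linalg.solve(C, y))

for tau2 in [1e0, 1e2, 1e4, 1e6]:
    print(f"prior variance {tau2:9.0e}: ln p(y) = {log_evidence(tau2):.2f}")
# For large tau2 the log evidence falls by roughly 0.5*ln(100) ~ 2.3 per
# hundred-fold increase in the prior variance, so the evidence itself tends
# to zero as the prior becomes improper, as stated in the text.
```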
It may, however, be possible to consider the evidence ratio between two models first and then take a limit to obtain a meaningful answer.

In a practical application, therefore, it will be wise to keep aside an independent test set of data on which to evaluate the overall performance of the final system.
3.5. The Evidence Approximation

In a fully Bayesian treatment of the linear basis function model, we would introduce prior distributions over the hyperparameters $\alpha$ and $\beta$ and make predictions by marginalizing with respect to these hyperparameters as well as with respect to the parameters $\mathbf{w}$. However, although we can integrate analytically over either $\mathbf{w}$ or over the hyperparameters, the complete marginalization over all of these variables is analytically intractable.
Here we discuss an approximation in which we set the hyperparameters to specific values determined by maximizing the marginal likelihood function obtained by first integrating over the parameters $\mathbf{w}$. This framework is known in the statistics literature as empirical Bayes (Bernardo and Smith, 1994; Gelman et al., 2004), or type 2 maximum likelihood (Berger, 1985), or generalized maximum likelihood (Wahba, 1975), and in the machine learning literature is also called the evidence approximation (Gull, 1989; MacKay, 1992a).

If we introduce hyperpriors over $\alpha$ and $\beta$, the predictive distribution is obtained by marginalizing over $\mathbf{w}$, $\alpha$ and $\beta$ so that

$$
p(t \mid \mathbf{t}) = \iiint p(t \mid \mathbf{w}, \beta)\, p(\mathbf{w} \mid \mathbf{t}, \alpha, \beta)\, p(\alpha, \beta \mid \mathbf{t})\, \mathrm{d}\mathbf{w}\, \mathrm{d}\alpha\, \mathrm{d}\beta
\tag{3.74}
$$

where $p(t \mid \mathbf{w}, \beta)$ is given by (3.8) and $p(\mathbf{w} \mid \mathbf{t}, \alpha, \beta)$ is given by (3.49) with $\mathbf{m}_N$ and $\mathbf{S}_N$ defined by (3.53) and (3.54) respectively.
Here we have omitted the dependence on the input variable $x$ to keep the notation uncluttered. If the posterior distribution $p(\alpha, \beta \mid \mathbf{t})$ is sharply peaked around values $\widehat{\alpha}$ and $\widehat{\beta}$, then the predictive distribution is obtained simply by marginalizing over $\mathbf{w}$ in which $\alpha$ and $\beta$ are fixed to the values $\widehat{\alpha}$ and $\widehat{\beta}$, so that

$$
p(t \mid \mathbf{t}) \;\simeq\; p(t \mid \mathbf{t}, \widehat{\alpha}, \widehat{\beta}) = \int p(t \mid \mathbf{w}, \widehat{\beta})\, p(\mathbf{w} \mid \mathbf{t}, \widehat{\alpha}, \widehat{\beta})\, \mathrm{d}\mathbf{w}.
\tag{3.75}
$$

From Bayes' theorem, the posterior distribution for $\alpha$ and $\beta$ is given by

$$
p(\alpha, \beta \mid \mathbf{t}) \propto p(\mathbf{t} \mid \alpha, \beta)\, p(\alpha, \beta).
\tag{3.76}
$$

If the prior is relatively flat, then in the evidence framework the values of $\widehat{\alpha}$ and $\widehat{\beta}$ are obtained by maximizing the marginal likelihood function $p(\mathbf{t} \mid \alpha, \beta)$.
We shall proceed by evaluating the marginal likelihood for the linear basis function model and then finding its maxima. This will allow us to determine values for these hyperparameters from the training data alone, without recourse to cross-validation. Recall that the ratio $\alpha/\beta$ is analogous to a regularization parameter.

As an aside, it is worth noting that, if we define conjugate (Gamma) prior distributions over $\alpha$ and $\beta$, then the marginalization over these hyperparameters in (3.74) can be performed analytically to give a Student's t-distribution over $\mathbf{w}$ (see Section 2.3.7).
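As a minimal sketch of the evidence framework just described (my own construction; the analytic maximization that the book develops next is not reproduced here), one can choose $\widehat{\alpha}$ and $\widehat{\beta}$ by directly maximizing the closed-form log marginal likelihood of a linear basis function model, since marginalizing over $\mathbf{w}$ gives $\mathbf{t} \sim \mathcal{N}(0, \beta^{-1} I + \alpha^{-1} \Phi \Phi^{\top})$. The Gaussian basis functions, the synthetic data, and the search grid below are illustrative choices only.

```python
import numpy as np

# Evidence approximation sketch: pick alpha, beta by maximizing the log
# marginal likelihood ln p(t | alpha, beta) of a linear basis function model,
# where marginalizing over w in t = Phi w + noise (w ~ N(0, alpha^{-1} I),
# noise precision beta) gives t ~ N(0, beta^{-1} I + alpha^{-1} Phi Phi^T).

rng = np.random.default_rng(2)
N = 30
x = np.linspace(0.0, 1.0, N)
t = np.sin(2.0 * np.pi * x) + rng.normal(scale=0.2, size=N)

centres = np.linspace(0.0, 1.0, 9)
Phi = np.exp(-0.5 * ((x[:, None] - centres[None, :]) / 0.1) ** 2)  # Gaussian basis

def log_marginal(alpha, beta):
    C = np.eye(N) / beta + Phi @ Phi.T / alpha
    sign, logdet = np.linalg.slogdet(C)
    return -0.5 * (N * np.log(2.0 * np.pi) + logdet + t @ np.linalg.solve(C, t))

# Simple grid search over (alpha, beta); a gradient-based optimizer or the
# analytic re-estimation developed in the text could replace this.
grid = np.logspace(-3, 2, 26)
alpha_hat, beta_hat = max(
    ((a, b) for a in grid for b in grid), key=lambda ab: log_marginal(*ab)
)
print(f"alpha_hat = {alpha_hat:.3g}, beta_hat = {beta_hat:.3g}")
# These values can then be fixed in the posterior over w, (3.49), and used to
# make predictions via (3.75), with no recourse to cross-validation.
```

The grid search is only a stand-in: the point of the evidence framework is that $p(\mathbf{t} \mid \alpha, \beta)$ can be evaluated and maximized using the training data alone, which is exactly what the following subsections of the book go on to do analytically.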