If we consider an interval A ≤ σ ≤ B, and a scaled interval A/c ≤ σ ≤ B/c, then the prior should assign equal probability mass to these two intervals. Thus we have

$$\int_A^B p(\sigma)\,\mathrm{d}\sigma = \int_{A/c}^{B/c} p(\sigma)\,\mathrm{d}\sigma = \int_A^B p\!\left(\frac{\sigma}{c}\right)\frac{1}{c}\,\mathrm{d}\sigma \tag{2.238}$$

and because this must hold for all choices of A and B, we have

$$p(\sigma) = p\!\left(\frac{\sigma}{c}\right)\frac{1}{c} \tag{2.239}$$

and hence p(σ) ∝ 1/σ. Note that again this is an improper prior because the integral of the distribution over 0 ≤ σ ≤ ∞ is divergent. It is sometimes also convenient to think of the prior distribution for a scale parameter in terms of the density of the log of the parameter.
Using the transformation rule (1.27) for densities we see that p(ln σ) = const. Thus, for this prior there is the same probability mass in the range 1 ≤ σ ≤ 10 as in the range 10 ≤ σ ≤ 100 and in 100 ≤ σ ≤ 1000.

An example of a scale parameter would be the standard deviation σ of a Gaussian distribution, after we have taken account of the location parameter µ, because

$$\mathcal{N}(x\mid\mu,\sigma^{2}) \propto \sigma^{-1}\exp\left\{-(\widetilde{x}/\sigma)^{2}/2\right\} \tag{2.240}$$

where x̃ = x − µ. As discussed in Section 2.3, it is often more convenient to work in terms of the precision λ = 1/σ² rather than σ itself. Using the transformation rule for densities, we see that a distribution p(σ) ∝ 1/σ corresponds to a distribution over λ of the form p(λ) ∝ 1/λ. We have seen that the conjugate prior for λ was the gamma distribution Gam(λ | a₀, b₀) given by (2.146). The noninformative prior is obtained as the special case a₀ = b₀ = 0.
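As a quick check of the σ → λ transformation quoted above (this intermediate step is not spelled out in the text, but follows from the standard change-of-variables rule for densities): writing λ = 1/σ², so that σ = λ^(−1/2) and |dσ/dλ| = λ^(−3/2)/2, we have

$$p(\lambda) = p(\sigma)\left|\frac{\mathrm{d}\sigma}{\mathrm{d}\lambda}\right| \propto \frac{1}{\sigma}\cdot\frac{1}{2}\lambda^{-3/2} = \lambda^{1/2}\cdot\frac{1}{2}\lambda^{-3/2} \propto \frac{1}{\lambda}$$

which is consistent with the claim that p(σ) ∝ 1/σ corresponds to p(λ) ∝ 1/λ.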
Again, if we examine the results (2.150) and (2.151) for the posterior distribution of λ, we see that for a₀ = b₀ = 0, the posterior depends only on terms arising from the data and not from the prior.

2.5. Nonparametric Methods

Throughout this chapter, we have focussed on the use of probability distributions having specific functional forms governed by a small number of parameters whose values are to be determined from a data set. This is called the parametric approach to density modelling.
An important limitation of this approach is that the chosen density might be a poor model of the distribution that generates the data, which can result in poor predictive performance. For instance, if the process that generates the data is multimodal, then this aspect of the distribution can never be captured by a Gaussian, which is necessarily unimodal.

In this final section, we consider some nonparametric approaches to density estimation that make few assumptions about the form of the distribution. Here we shall focus mainly on simple frequentist methods. The reader should be aware, however, that nonparametric Bayesian methods are attracting increasing interest (Walker et al., 1999; Neal, 2000; Müller and Quintana, 2004; Teh et al., 2006).

Let us start with a discussion of histogram methods for density estimation, which we have already encountered in the context of marginal and conditional distributions in Figure 1.11 and in the context of the central limit theorem in Figure 2.6.
Here we explore the properties of histogram density models in more detail, focussing on the case of a single continuous variable x. Standard histograms simply partition x into distinct bins of width ∆ᵢ and then count the number nᵢ of observations of x falling in bin i. In order to turn this count into a normalized probability density, we simply divide by the total number N of observations and by the width ∆ᵢ of the bins to obtain probability values for each bin given by

$$p_i = \frac{n_i}{N\Delta_i} \tag{2.241}$$

for which it is easily seen that ∫ p(x) dx = 1.
This gives a model for the density p(x) that is constant over the width of each bin, and often the bins are chosen to have the same width ∆ᵢ = ∆.

Figure 2.24: An illustration of the histogram approach to density estimation, in which a data set of 50 data points is generated from the distribution shown by the green curve. Histogram density estimates, based on (2.241), with a common bin width ∆ are shown for ∆ = 0.04, ∆ = 0.08 and ∆ = 0.25.

In Figure 2.24, we show an example of histogram density estimation.
Here the data is drawn from the distribution, corresponding to the green curve, which is formed from a mixture of two Gaussians. Also shown are three examples of histogram density estimates corresponding to three different choices for the bin width ∆. We see that when ∆ is very small (top figure), the resulting density model is very spiky, with a lot of structure that is not present in the underlying distribution that generated the data set. Conversely, if ∆ is too large (bottom figure) then the result is a model that is too smooth and that consequently fails to capture the bimodal property of the green curve.
The best results are obtained for some intermediate value of ∆ (middle figure). In principle, a histogram density model is also dependent on the choice of edge location for the bins, though this is typically much less significant than the value of ∆.

Note that the histogram method has the property (unlike the methods to be discussed shortly) that, once the histogram has been computed, the data set itself can be discarded, which can be advantageous if the data set is large. Also, the histogram approach is easily applied if the data points are arriving sequentially.

In practice, the histogram technique can be useful for obtaining a quick visualization of data in one or two dimensions but is unsuited to most density estimation applications.
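Before turning to the limitations of the approach, the computation behind an estimate such as that of Figure 2.24 is easy to sketch in code. The following is an illustrative sketch only: the two-Gaussian mixture parameters, the random seed, and the helper name histogram_density are assumptions made for this example, not details taken from the book.

    import numpy as np

    rng = np.random.default_rng(0)

    # Draw 50 points from a bimodal mixture of two Gaussians, loosely in the
    # spirit of the green curve in Figure 2.24 (mixture parameters made up).
    N = 50
    pick = rng.random(N) < 0.5
    x = np.where(pick, rng.normal(0.3, 0.07, N), rng.normal(0.7, 0.10, N))

    def histogram_density(x, delta, lo=0.0, hi=1.0):
        """Histogram density estimate p_i = n_i / (N * Delta), as in (2.241)."""
        n_bins = max(1, int(round((hi - lo) / delta)))   # nearest whole number of bins
        edges = np.linspace(lo, hi, n_bins + 1)
        width = edges[1] - edges[0]                      # actual common bin width Delta
        counts, _ = np.histogram(x, bins=edges)
        return edges, counts / (len(x) * width)

    for delta in (0.04, 0.08, 0.25):
        edges, p = histogram_density(x, delta)
        # The estimate integrates to one when all points fall inside [lo, hi].
        print(delta, np.sum(p) * (edges[1] - edges[0]))

Plotting p as a piecewise-constant function over the bins for the three values of ∆ reproduces the qualitative behaviour described above: spiky for small ∆, over-smoothed for large ∆.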
One obvious problem is that the estimated density has discontinuities that are due to the bin edges rather than any property of the underlying distribution that generated the data. Another major limitation of the histogram approach is its scaling with dimensionality. If we divide each variable in a D-dimensional space into M bins, then the total number of bins will be M^D. This exponential scaling with D is an example of the curse of dimensionality (Section 1.4). In a space of high dimensionality, the quantity of data needed to provide meaningful estimates of local probability density would be prohibitive.

The histogram approach to density estimation does, however, teach us two important lessons.
First, to estimate the probability density at a particular location, we should consider the data points that lie within some local neighbourhood of that point. Note that the concept of locality requires that we assume some form of distance measure, and here we have been assuming Euclidean distance.
For histograms, this neighbourhood property was defined by the bins, and there is a natural ‘smoothing’ parameter describing the spatial extent of the local region, in this case the bin width. Second, the value of the smoothing parameter should be neither too large nor too small in order to obtain good results. This is reminiscent of the choice of model complexity in polynomial curve fitting discussed in Chapter 1 where the degree M of the polynomial, or alternatively the value α of the regularization parameter, was optimal for some intermediate value, neither too large nor too small.
Armed with these insights, we turn now to a discussion of two widely used nonparametric techniques for density estimation, kernel estimators and nearest neighbours, which have better scaling with dimensionality than the simple histogram model.

2.5.1 Kernel density estimators

Let us suppose that observations are being drawn from some unknown probability density p(x) in some D-dimensional space, which we shall take to be Euclidean, and we wish to estimate the value of p(x). From our earlier discussion of locality, let us consider some small region R containing x. The probability mass associated with this region is given by

$$P = \int_{\mathcal{R}} p(\mathbf{x})\,\mathrm{d}\mathbf{x}. \tag{2.242}$$

Now suppose that we have collected a data set comprising N observations drawn from p(x).
Because each data point has a probability P of falling within R, the total number K of points that lie inside R will be distributed according to the binomial distribution (Section 2.1)

$$\mathrm{Bin}(K\mid N,P) = \frac{N!}{K!\,(N-K)!}\,P^{K}(1-P)^{N-K}. \tag{2.243}$$

Using (2.11), we see that the mean fraction of points falling inside the region is E[K/N] = P, and similarly using (2.12) we see that the variance around this mean is var[K/N] = P(1 − P)/N.
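Spelling out this intermediate step (it follows directly from the binomial mean and variance, E[K] = NP and var[K] = NP(1 − P), given by (2.11) and (2.12)):

$$\mathbb{E}[K/N] = \frac{\mathbb{E}[K]}{N} = P, \qquad \mathrm{var}[K/N] = \frac{\mathrm{var}[K]}{N^{2}} = \frac{P(1-P)}{N}.$$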
For large N, this distribution will be sharply peaked around the mean and so

$$K \simeq NP. \tag{2.244}$$

If, however, we also assume that the region R is sufficiently small that the probability density p(x) is roughly constant over the region, then we have

$$P \simeq p(\mathbf{x})V \tag{2.245}$$

where V is the volume of R. Combining (2.244) and (2.245), we obtain our density estimate in the form

$$p(\mathbf{x}) = \frac{K}{NV}. \tag{2.246}$$

Note that the validity of (2.246) depends on two contradictory assumptions, namely that the region R be sufficiently small that the density is approximately constant over the region and yet sufficiently large (in relation to the value of that density) that the number K of points falling inside the region is sufficient for the binomial distribution to be sharply peaked.

We can exploit the result (2.246) in two different ways. Either we can fix K and determine the value of V from the data, which gives rise to the K-nearest-neighbour technique discussed shortly, or we can fix V and determine K from the data, giving rise to the kernel approach.
It can be shown that both the K-nearest-neighbour density estimator and the kernel density estimator converge to the true probability density in the limit N → ∞ provided V shrinks suitably with N, and K grows with N (Duda and Hart, 1973).

We begin by discussing the kernel method in detail, and to start with we take the region R to be a small hypercube centred on the point x at which we wish to determine the probability density. In order to count the number K of points falling within this region, it is convenient to define the following function

$$k(\mathbf{u}) = \begin{cases} 1, & |u_i| \leq 1/2, \quad i = 1,\ldots,D, \\ 0, & \text{otherwise} \end{cases} \tag{2.247}$$

which represents a unit cube centred on the origin. The function k(u) is an example of a kernel function, and in this context is also called a Parzen window. From (2.247), the quantity k((x − xₙ)/h) will be one if the data point xₙ lies inside a cube of side h centred on x, and zero otherwise. The total number of data points lying inside this cube will therefore be

$$K = \sum_{n=1}^{N} k\!\left(\frac{\mathbf{x}-\mathbf{x}_n}{h}\right). \tag{2.248}$$

Substituting this expression into (2.246) then gives the following result for the estimated density at x

$$p(\mathbf{x}) = \frac{1}{N}\sum_{n=1}^{N} \frac{1}{h^{D}}\, k\!\left(\frac{\mathbf{x}-\mathbf{x}_n}{h}\right) \tag{2.249}$$

where we have used V = h^D for the volume of a hypercube of side h in D dimensions.
Using the symmetry of the function k(u), we can now re-interpret this equation, not as a single cube centred on x but as the sum over N cubes centred on the N data points xₙ.

As it stands, the kernel density estimator (2.249) will suffer from one of the same problems that the histogram method suffered from, namely the presence of artificial discontinuities, in this case at the boundaries of the cubes.
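A minimal sketch of the Parzen-window estimator (2.249) with the cube kernel (2.247) is given below; the array layout, the toy two-dimensional data, and the function names are assumptions made for illustration rather than anything specified in the book.

    import numpy as np

    def parzen_window(u):
        """Unit-cube kernel k(u) of (2.247): 1 if every |u_i| <= 1/2, else 0."""
        return np.all(np.abs(u) <= 0.5, axis=-1).astype(float)

    def kernel_density(x_query, data, h):
        """Evaluate the estimator (2.249) at the query points.

        x_query : (M, D) points at which to estimate the density
        data    : (N, D) observations drawn from the unknown density
        h       : side length of the hypercube
        """
        N, D = data.shape
        # Shape (M, N, D): one scaled difference (x - x_n) / h per pair.
        u = (x_query[:, None, :] - data[None, :, :]) / h
        # Count the points inside each cube and divide by N * h^D.
        return parzen_window(u).sum(axis=1) / (N * h**D)

    # Toy usage with made-up two-dimensional data.
    rng = np.random.default_rng(1)
    data = rng.normal(size=(200, 2))
    print(kernel_density(np.array([[0.0, 0.0], [2.0, 2.0]]), data, h=0.5))

Replacing k(u) with a smoother function, such as a Gaussian, removes the discontinuities at the cube boundaries while keeping the same sum-over-data-points structure.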