Bishop C.M., Pattern Recognition and Machine Learning (2006)
We can obtain a smoother density model if we choose a smoother kernel function, and a common choice is the Gaussian, which gives rise to the following kernel density model

p(x) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{(2\pi h^2)^{D/2}} \exp\left\{ -\frac{\|x - x_n\|^2}{2h^2} \right\}    (2.250)

where h represents the standard deviation of the Gaussian components.
Thus our density model is obtained by placing a Gaussian over each data point and then adding up the contributions over the whole data set, and then dividing by N so that the density is correctly normalized. In Figure 2.25, we apply the model (2.250) to the data set used earlier to demonstrate the histogram technique. We see that, as expected, the parameter h plays the role of a smoothing parameter, and there is a trade-off between sensitivity to noise at small h and over-smoothing at large h. Again, the optimization of h is a problem in model complexity, analogous to the choice of bin width in histogram density estimation, or the degree of the polynomial used in curve fitting.

Figure 2.25  Illustration of the kernel density model (2.250) applied to the same data set used to demonstrate the histogram approach in Figure 2.24, for h = 0.005 (top panel), h = 0.07 (middle panel), and h = 0.2 (bottom panel). We see that h acts as a smoothing parameter and that if it is set too small (top panel), the result is a very noisy density model, whereas if it is set too large (bottom panel), then the bimodal nature of the underlying distribution from which the data is generated (shown by the green curve) is washed out. The best density model is obtained for some intermediate value of h (middle panel).

We can choose any other kernel function k(u) in (2.249) subject to the conditions

k(u) \geq 0,    (2.251)

\int k(u) \, du = 1    (2.252)

which ensure that the resulting probability distribution is nonnegative everywhere and integrates to one. The class of density model given by (2.249) is called a kernel density estimator, or Parzen estimator.
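To make the estimator concrete, here is a minimal sketch of the Gaussian kernel density model (2.250) in Python, assuming NumPy is available; the data set and bandwidth below are illustrative placeholders, not the data of Figure 2.25.

```python
import numpy as np

def gaussian_kde(x, data, h):
    """Evaluate the kernel density model (2.250) at the points x.

    x    : (M, D) array of evaluation points
    data : (N, D) array of training points x_1, ..., x_N
    h    : standard deviation of the Gaussian components (smoothing parameter)
    """
    N, D = data.shape
    # Squared Euclidean distances ||x - x_n||^2, shape (M, N)
    sq_dist = ((x[:, None, :] - data[None, :, :]) ** 2).sum(axis=-1)
    norm = (2.0 * np.pi * h ** 2) ** (D / 2)
    # Place a Gaussian on each data point, sum over n, and divide by N
    return np.exp(-sq_dist / (2.0 * h ** 2)).sum(axis=1) / (N * norm)

# Illustrative use: bimodal 1-D data evaluated on a grid
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0.25, 0.05, 30),
                       rng.normal(0.75, 0.05, 30)])[:, None]
grid = np.linspace(0, 1, 200)[:, None]
density = gaussian_kde(grid, data, h=0.07)
```

Note that each evaluation of the density touches every stored data point, which is the linear cost discussed next.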
The kernel density estimator has the great merit that there is no computation involved in the ‘training’ phase, because this simply requires storage of the training set. However, this is also one of its great weaknesses, because the computational cost of evaluating the density grows linearly with the size of the data set.

2.5.2 Nearest-neighbour methods

One of the difficulties with the kernel approach to density estimation is that the parameter h governing the kernel width is fixed for all kernels.
In regions of high data density, a large value of h may lead to over-smoothing and a washing out of structure that might otherwise be extracted from the data. However, reducing h may lead to noisy estimates elsewhere in data space where the density is smaller. Thus the optimal choice for h may be dependent on location within the data space. This issue is addressed by nearest-neighbour methods for density estimation.

We therefore return to our general result (2.246) for local density estimation, and instead of fixing V and determining the value of K from the data, we consider a fixed value of K and use the data to find an appropriate value for V. To do this, we consider a small sphere centred on the point x at which we wish to estimate the density p(x), and we allow the radius of the sphere to grow until it contains precisely K data points. The estimate of the density p(x) is then given by (2.246) with V set to the volume of the resulting sphere.

Figure 2.26  Illustration of K-nearest-neighbour density estimation using the same data set as in Figures 2.25 and 2.24, for K = 1 (top panel), K = 5 (middle panel), and K = 30 (bottom panel). We see that the parameter K governs the degree of smoothing, so that a small value of K leads to a very noisy density model (top panel), whereas a large value (bottom panel) smoothes out the bimodal nature of the true distribution (shown by the green curve) from which the data set was generated.
This technique is known as K nearest neighbours and is illustrated in Figure 2.26, for various choices of the parameter K, using the same data set as in Figures 2.24 and 2.25. We see that the value of K now governs the degree of smoothing and that again there is an optimum choice for K that is neither too large nor too small. Note that the model produced by K nearest neighbours is not a true density model because the integral over all space diverges (Exercise 2.61).
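A minimal sketch of this K-nearest-neighbour density estimate, assuming NumPy and SciPy with illustrative function and variable names, sets V in (2.246) to the volume of the D-dimensional ball whose radius is the distance from x to its K-th nearest data point.

```python
import numpy as np
from scipy.special import gammaln

def knn_density(x, data, K):
    """K-nearest-neighbour density estimate p(x) = K / (N V), following (2.246).

    x    : (M, D) array of query points
    data : (N, D) array of training points
    K    : number of data points the sphere must contain
    """
    N, D = data.shape
    # Distance from each query point to every data point, shape (M, N)
    dist = np.linalg.norm(x[:, None, :] - data[None, :, :], axis=-1)
    # Radius of the smallest sphere around x containing exactly K points
    r_k = np.sort(dist, axis=1)[:, K - 1]
    # Volume of a D-dimensional ball of radius r_k
    log_unit_ball = (D / 2) * np.log(np.pi) - gammaln(D / 2 + 1)
    V = np.exp(log_unit_ball) * r_k ** D
    return K / (N * V)
```

Far from the data the radius r_k grows like the distance to the data set, so the estimate decays only as that distance to the power −D, and its integral over all of R^D diverges, in line with the remark above.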
We close this chapter by showing how the K-nearest-neighbour technique for density estimation can be extended to the problem of classification. To do this, we apply the K-nearest-neighbour density estimation technique to each class separately and then make use of Bayes’ theorem. Let us suppose that we have a data set comprising N_k points in class C_k with N points in total, so that \sum_k N_k = N. If we wish to classify a new point x, we draw a sphere centred on x containing precisely K points irrespective of their class.
Suppose this sphere has volume V and contains K_k points from class C_k. Then (2.246) provides an estimate of the density associated with each class

p(x|C_k) = \frac{K_k}{N_k V}.    (2.253)

Similarly, the unconditional density is given by

p(x) = \frac{K}{N V}    (2.254)

while the class priors are given by

p(C_k) = \frac{N_k}{N}.    (2.255)

We can now combine (2.253), (2.254), and (2.255) using Bayes’ theorem to obtain the posterior probability of class membership

p(C_k|x) = \frac{p(x|C_k)\, p(C_k)}{p(x)} = \frac{K_k}{K}.    (2.256)

Figure 2.27  (a) In the K-nearest-neighbour classifier, a new point, shown by the black diamond, is classified according to the majority class membership of the K closest training data points, in this case K = 3. (b) In the nearest-neighbour (K = 1) approach to classification, the resulting decision boundary is composed of hyperplanes that form perpendicular bisectors of pairs of points from different classes.

If we wish to minimize the probability of misclassification, this is done by assigning the test point x to the class having the largest posterior probability, corresponding to the largest value of K_k/K.
Thus to classify a new point, we identify the K nearest points from the training data set and then assign the new point to the class having the largest number of representatives amongst this set. Ties can be broken at random. The particular case of K = 1 is called the nearest-neighbour rule, because a test point is simply assigned to the same class as the nearest point from the training set. These concepts are illustrated in Figure 2.27.
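A minimal sketch of this decision rule, assuming NumPy, brute-force distance search, and illustrative names, assigns each query point to the class with the most representatives among its K nearest training points, i.e. the largest K_k/K in (2.256).

```python
import numpy as np

def knn_classify(x, data, labels, K):
    """Assign each query point to the class with the most representatives
    among its K nearest training points (the largest K_k / K in (2.256)).

    x      : (M, D) array of query points
    data   : (N, D) array of training points
    labels : (N,)  array of integer class labels for the training points
    K      : number of neighbours to consult
    """
    dist = np.linalg.norm(x[:, None, :] - data[None, :, :], axis=-1)  # (M, N)
    nearest = np.argsort(dist, axis=1)[:, :K]                         # K nearest indices
    predictions = []
    for row in labels[nearest]:                                       # (K,) labels per query
        counts = np.bincount(row)
        winners = np.flatnonzero(counts == counts.max())
        predictions.append(np.random.choice(winners))                 # break ties at random
    return np.array(predictions)
```

For large training sets the brute-force search is the costly step; tree-based structures such as k-d trees (for instance scipy.spatial.cKDTree) can return (approximate) near neighbours far more cheaply, as discussed at the end of this section.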
In Figure 2.28, we show the results of applying the K-nearest-neighbour algorithm to the oil flow data, introduced in Chapter 1, for various values of K. As expected, we see that K controls the degree of smoothing, so that small K produces many small regions of each class, whereas large K leads to fewer larger regions.

Figure 2.28  Plot of 200 data points from the oil data set showing values of x6 plotted against x7, where the red, green, and blue points correspond to the ‘laminar’, ‘annular’, and ‘homogeneous’ classes, respectively. Also shown are the classifications of the input space given by the K-nearest-neighbour algorithm for various values of K (K = 1, 3, and 31 in the three panels).

An interesting property of the nearest-neighbour (K = 1) classifier is that, in the limit N → ∞, the error rate is never more than twice the minimum achievable error rate of an optimal classifier, i.e., one that uses the true class distributions (Cover and Hart, 1967).

As discussed so far, both the K-nearest-neighbour method and the kernel density estimator require the entire training data set to be stored, leading to expensive computation if the data set is large.
This effect can be offset, at the expense of some additional one-off computation, by constructing tree-based search structures to allow (approximate) near neighbours to be found efficiently without doing an exhaustive search of the data set. Nevertheless, these nonparametric methods are still severely limited. On the other hand, we have seen that simple parametric models are very restricted in terms of the forms of distribution that they can represent. We therefore need to find density models that are very flexible and yet for which the complexity of the models can be controlled independently of the size of the training set, and we shall see in subsequent chapters how to achieve this.

Exercises

2.1 ( ) www  Verify that the Bernoulli distribution (2.2) satisfies the following properties

\sum_{x=0}^{1} p(x|\mu) = 1    (2.257)

E[x] = \mu    (2.258)

var[x] = \mu(1 - \mu).    (2.259)

Show that the entropy H[x] of a Bernoulli distributed random binary variable x is given by

H[x] = -\mu \ln \mu - (1 - \mu) \ln(1 - \mu).    (2.260)
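A quick numerical check of the quantities in (2.258)–(2.260), assuming NumPy and an illustrative value µ = 0.3, can be run as follows; the Monte Carlo estimates should agree with the analytic expressions to within sampling error.

```python
import numpy as np

mu = 0.3                                   # illustrative parameter value
rng = np.random.default_rng(0)
x = rng.binomial(1, mu, size=1_000_000)    # samples of a Bernoulli(mu) variable

# Analytic results (2.258)-(2.260), entropy in nats
mean = mu
var = mu * (1 - mu)
entropy = -mu * np.log(mu) - (1 - mu) * np.log(1 - mu)

print(x.mean(), mean)                      # both close to 0.3
print(x.var(), var)                        # both close to 0.21
p_hat = np.array([1 - x.mean(), x.mean()]) # empirical class frequencies
print(-(p_hat * np.log(p_hat)).sum(), entropy)
```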
2.2 ( )  The form of the Bernoulli distribution given by (2.2) is not symmetric between the two values of x. In some situations, it will be more convenient to use an equivalent formulation for which x ∈ {−1, 1}, in which case the distribution can be written

p(x|\mu) = \left(\frac{1-\mu}{2}\right)^{(1-x)/2} \left(\frac{1+\mu}{2}\right)^{(1+x)/2}    (2.261)

where µ ∈ [−1, 1]. Show that the distribution (2.261) is normalized, and evaluate its mean, variance, and entropy.

2.3 ( ) www  In this exercise, we prove that the binomial distribution (2.9) is normalized.
First use the definition (2.10) of the number of combinations of m identical objects chosen from a total of N to show that

\binom{N}{m} + \binom{N}{m-1} = \binom{N+1}{m}.    (2.262)

Use this result to prove by induction the following result

(1+x)^N = \sum_{m=0}^{N} \binom{N}{m} x^m    (2.263)

which is known as the binomial theorem, and which is valid for all real values of x. Finally, show that the binomial distribution is normalized, so that

\sum_{m=0}^{N} \binom{N}{m} \mu^m (1-\mu)^{N-m} = 1    (2.264)

which can be done by first pulling a factor (1 − µ)^N out of the summation and then making use of the binomial theorem.

2.4 ( )  Show that the mean of the binomial distribution is given by (2.11).
To do this, differentiate both sides of the normalization condition (2.264) with respect to µ and then rearrange to obtain an expression for the mean of n. Similarly, by differentiating (2.264) twice with respect to µ and making use of the result (2.11) for the mean of the binomial distribution, prove the result (2.12) for the variance of the binomial.

2.5 ( ) www  In this exercise, we prove that the beta distribution, given by (2.13), is correctly normalized, so that (2.14) holds. This is equivalent to showing that

\int_0^1 \mu^{a-1} (1-\mu)^{b-1} \, d\mu = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}.    (2.265)

From the definition (1.141) of the gamma function, we have

\Gamma(a)\Gamma(b) = \int_0^\infty \exp(-x)\, x^{a-1} \, dx \int_0^\infty \exp(-y)\, y^{b-1} \, dy.    (2.266)

Use this expression to prove (2.265) as follows.