Bishop C.M., Pattern Recognition and Machine Learning (2006)
Then make use of (1.135), together with (1.136), to show that, if the result holds at order M − 1, then it will also hold at order M.

1.16 ( ) In Exercise 1.15, we proved the result (1.135) for the number of independent parameters in the Mth order term of a D-dimensional polynomial. We now find an expression for the total number N(D, M) of independent parameters in all of the terms up to and including the Mth order. First show that N(D, M) satisfies

N(D, M) = \sum_{m=0}^{M} n(D, m)    (1.138)

where n(D, m) is the number of independent parameters in the term of order m. Now make use of the result (1.137), together with proof by induction, to show that

N(D, M) = \frac{(D + M)!}{D!\, M!}    (1.139)

This can be done by first proving that the result holds for M = 0 and arbitrary D ≥ 1, then assuming that it holds at order M, and hence showing that it holds at order M + 1.
Finally, make use of Stirling’s approximation in the form

n! \simeq n^n e^{-n}    (1.140)

for large n to show that, for D ≫ M, the quantity N(D, M) grows like D^M, and for M ≫ D it grows like M^D. Consider a cubic (M = 3) polynomial in D dimensions, and evaluate numerically the total number of independent parameters for (i) D = 10 and (ii) D = 100, which correspond to typical small-scale and medium-scale machine learning applications.
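The closed form (1.139) is just the binomial coefficient C(D + M, M), so the two requested values can be checked with a few lines of Python. This is only an illustrative sketch, not part of the original exercise, and the helper name is ours:

```python
from math import comb

def n_params(D, M):
    # N(D, M) = (D + M)! / (D! M!) = C(D + M, M), equation (1.139)
    return comb(D + M, M)

for D in (10, 100):
    print(D, n_params(D, 3))   # 286 parameters for D = 10, 176851 for D = 100
```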
1.17 ( ) www The gamma function is defined by

\Gamma(x) \equiv \int_0^\infty u^{x-1} e^{-u} \, du.    (1.141)

Using integration by parts, prove the relation Γ(x + 1) = xΓ(x). Show also that Γ(1) = 1 and hence that Γ(x + 1) = x! when x is an integer.

1.18 ( ) www We can use the result (1.126) to derive an expression for the surface area S_D, and the volume V_D, of a sphere of unit radius in D dimensions. To do this, consider the following result, which is obtained by transforming from Cartesian to polar coordinates

\prod_{i=1}^{D} \int_{-\infty}^{\infty} e^{-x_i^2} \, dx_i = S_D \int_0^\infty e^{-r^2} r^{D-1} \, dr.    (1.142)

Using the definition (1.141) of the Gamma function, together with (1.126), evaluate both sides of this equation, and hence show that

S_D = \frac{2\pi^{D/2}}{\Gamma(D/2)}.    (1.143)

Next, by integrating with respect to radius from 0 to 1, show that the volume of the unit sphere in D dimensions is given by

V_D = \frac{S_D}{D}.    (1.144)

Finally, use the results Γ(1) = 1 and Γ(3/2) = √π/2 to show that (1.143) and (1.144) reduce to the usual expressions for D = 2 and D = 3.
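As a sanity check on (1.143) and (1.144), a short sketch (ours, not the book's) evaluates both formulas with the standard-library gamma function and compares them with the familiar D = 2 and D = 3 results:

```python
from math import pi, gamma

def surface_area(D):
    # S_D = 2 * pi**(D/2) / Gamma(D/2), equation (1.143)
    return 2 * pi ** (D / 2) / gamma(D / 2)

def unit_sphere_volume(D):
    # V_D = S_D / D, equation (1.144)
    return surface_area(D) / D

print(surface_area(2), 2 * pi)              # circumference of the unit circle
print(unit_sphere_volume(2), pi)            # area of the unit disc
print(surface_area(3), 4 * pi)              # surface area of the unit sphere
print(unit_sphere_volume(3), 4 * pi / 3)    # volume of the unit ball
```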
1.19 ( ) Consider a sphere of radius a in D dimensions together with the concentric hypercube of side 2a, so that the sphere touches the hypercube at the centres of each of its sides. By using the results of Exercise 1.18, show that the ratio of the volume of the sphere to the volume of the cube is given by

\frac{\text{volume of sphere}}{\text{volume of cube}} = \frac{\pi^{D/2}}{D \, 2^{D-1} \, \Gamma(D/2)}.    (1.145)

Now make use of Stirling’s formula in the form

\Gamma(x + 1) \simeq (2\pi)^{1/2} e^{-x} x^{x+1/2}    (1.146)

which is valid for x ≫ 1, to show that, as D → ∞, the ratio (1.145) goes to zero. Show also that the ratio of the distance from the centre of the hypercube to one of the corners, divided by the perpendicular distance to one of the sides, is √D, which therefore goes to ∞ as D → ∞. From these results we see that, in a space of high dimensionality, most of the volume of a cube is concentrated in the large number of corners, which themselves become very long ‘spikes’!
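To see how quickly the ratio (1.145) collapses, a small sketch (ours) evaluates it for a few dimensionalities, together with the corner-to-side distance ratio √D:

```python
from math import pi, gamma, sqrt

def sphere_to_cube_ratio(D):
    # Equation (1.145): pi**(D/2) / (D * 2**(D-1) * Gamma(D/2))
    return pi ** (D / 2) / (D * 2 ** (D - 1) * gamma(D / 2))

for D in (1, 2, 3, 5, 10, 20):
    print(D, sphere_to_cube_ratio(D), sqrt(D))   # the ratio shrinks towards 0 while sqrt(D) grows
```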
1.20 ( ) www In this exercise, we explore the behaviour of the Gaussian distribution in high-dimensional spaces. Consider a Gaussian distribution in D dimensions given by

p(x) = \frac{1}{(2\pi\sigma^2)^{D/2}} \exp\left(-\frac{\|x\|^2}{2\sigma^2}\right).    (1.147)

We wish to find the density with respect to radius in polar coordinates in which the direction variables have been integrated out. To do this, show that the integral of the probability density over a thin shell of radius r and thickness ε, where ε ≪ 1, is given by p(r)ε where

p(r) = \frac{S_D r^{D-1}}{(2\pi\sigma^2)^{D/2}} \exp\left(-\frac{r^2}{2\sigma^2}\right)    (1.148)

where S_D is the surface area of a unit sphere in D dimensions. Show that the function p(r) has a single stationary point located, for large D, at r̂ ≃ √D σ. By considering p(r̂ + ε) where ε ≪ r̂, show that for large D,

p(\hat{r} + \epsilon) = p(\hat{r}) \exp\left(-\frac{3\epsilon^2}{2\sigma^2}\right)    (1.149)

which shows that r̂ is a maximum of the radial probability density and also that p(r) decays exponentially away from its maximum at r̂ with length scale σ. We have already seen that σ ≪ r̂ for large D, and so we see that most of the probability mass is concentrated in a thin shell at large radius.
Finally, show that the probability density p(x) is larger at the origin than at the radius r̂ by a factor of exp(D/2). We therefore see that most of the probability mass in a high-dimensional Gaussian distribution is located at a different radius from the region of high probability density. This property of distributions in spaces of high dimensionality will have important consequences when we consider Bayesian inference of model parameters in later chapters.
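These claims are easy to see empirically. The following sketch (ours, with arbitrary choices of D and sample size) draws from an isotropic D-dimensional Gaussian and inspects the distribution of radii and the density ratio between the origin and the shell:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, D, N = 1.0, 100, 100_000

x = rng.normal(scale=sigma, size=(N, D))   # samples from an isotropic D-dimensional Gaussian
r = np.linalg.norm(x, axis=1)              # their radii

r_hat = sigma * np.sqrt(D)
print(r.mean(), r_hat)                     # radii concentrate near r_hat = sigma * sqrt(D)
print(r.std())                             # with a spread of order sigma

# Density ratio between the origin and the shell radius:
# p(0) / p(r_hat) = exp(r_hat**2 / (2 * sigma**2)) = exp(D / 2)
print(np.exp(r_hat ** 2 / (2 * sigma ** 2)), np.exp(D / 2))
```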
1.21 ( ) Consider two nonnegative numbers a and b, and show that, if a ≤ b, then a ≤ (ab)^{1/2}. Use this result to show that, if the decision regions of a two-class classification problem are chosen to minimize the probability of misclassification, this probability will satisfy

p(\text{mistake}) \leq \int \{p(x, \mathcal{C}_1)\, p(x, \mathcal{C}_2)\}^{1/2} \, dx.    (1.150)

1.22 () www Given a loss matrix with elements L_{kj}, the expected risk is minimized if, for each x, we choose the class that minimizes (1.81). Verify that, when the loss matrix is given by L_{kj} = 1 − I_{kj}, where I_{kj} are the elements of the identity matrix, this reduces to the criterion of choosing the class having the largest posterior probability. What is the interpretation of this form of loss matrix?

1.23 () Derive the criterion for minimizing the expected loss when there is a general loss matrix and general prior probabilities for the classes.
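A compact way to see the content of Exercise 1.22 is to minimize the expected loss \sum_k L_{kj} p(C_k|x) over the choice of class j and compare with the maximum-posterior rule. The sketch below is ours; the posterior values are made up purely for illustration:

```python
import numpy as np

def decide(posteriors, L):
    # Choose the class j that minimizes the expected loss sum_k L[k, j] * p(C_k | x)
    return int(np.argmin(posteriors @ L))

posteriors = np.array([0.2, 0.5, 0.3])   # p(C_k | x) for three classes (illustrative values)
L = 1.0 - np.eye(3)                      # 0-1 loss matrix, L_kj = 1 - I_kj

print(decide(posteriors, L))             # class 1 ...
print(int(np.argmax(posteriors)))        # ... the same as the class with the largest posterior
```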
1.24 ( ) www Consider a classification problem in which the loss incurred when an input vector from class C_k is classified as belonging to class C_j is given by the loss matrix L_{kj}, and for which the loss incurred in selecting the reject option is λ. Find the decision criterion that will give the minimum expected loss. Verify that this reduces to the reject criterion discussed in Section 1.5.3 when the loss matrix is given by L_{kj} = 1 − I_{kj}. What is the relationship between λ and the rejection threshold θ?
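As a numerical illustration (ours, not the book's; the posterior values are arbitrary), one can compare the expected loss of the best class against the rejection loss λ. With the 0-1 loss matrix this amounts to classifying only when the largest posterior exceeds 1 − λ, which suggests the relationship between θ and λ asked about in the exercise:

```python
import numpy as np

def decide_with_reject(posteriors, L, reject_loss):
    # Classify as the minimum-expected-loss class, or reject if rejection is cheaper.
    expected_losses = posteriors @ L
    j = int(np.argmin(expected_losses))
    return j if expected_losses[j] < reject_loss else "reject"

L = 1.0 - np.eye(2)     # 0-1 loss
lam = 0.3               # loss for choosing the reject option

# With 0-1 loss, classification happens iff max_k p(C_k|x) > 1 - lam,
# i.e. the rejection threshold behaves as theta = 1 - lam.
for p1 in (0.55, 0.65, 0.75):
    posteriors = np.array([p1, 1.0 - p1])
    print(p1, decide_with_reject(posteriors, L, lam))
```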
1.25 () www Consider the generalization of the squared loss function (1.87) for a single target variable t to the case of multiple target variables described by the vector t given by

E[L(t, y(x))] = \int\!\!\int \|y(x) - t\|^2 \, p(x, t) \, dx \, dt.    (1.151)

Using the calculus of variations, show that the function y(x) for which this expected loss is minimized is given by y(x) = E_t[t|x]. Show that this result reduces to (1.89) for the case of a single target variable t.

1.26 () By expansion of the square in (1.151), derive a result analogous to (1.90) and hence show that the function y(x) that minimizes the expected squared loss for the case of a vector t of target variables is again given by the conditional expectation of t.
1.27 ( ) www Consider the expected loss for regression problems under the L_q loss function given by (1.91). Write down the condition that y(x) must satisfy in order to minimize E[L_q]. Show that, for q = 1, this solution represents the conditional median, i.e., the function y(x) such that the probability mass for t < y(x) is the same as for t ≥ y(x). Also show that the minimum expected L_q loss for q → 0 is given by the conditional mode, i.e., by the function y(x) equal to the value of t that maximizes p(t|x) for each x.
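A small empirical sketch (ours; the lognormal target distribution and the search grid are arbitrary choices) makes the q = 2 and q = 1 cases of Exercises 1.25 to 1.27 concrete: minimizing the expected squared loss over a constant prediction recovers the mean, while the absolute loss recovers the median:

```python
import numpy as np

rng = np.random.default_rng(1)
t = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)   # a skewed stand-in for p(t|x)

def expected_lq_loss(y, q):
    return np.mean(np.abs(y - t) ** q)

ys = np.linspace(0.1, 4.0, 400)
for q in (2, 1):
    best = ys[np.argmin([expected_lq_loss(y, q) for y in ys])]
    print(q, best)

print(t.mean(), np.median(t))   # q = 2 tracks the mean, q = 1 tracks the median
```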
1.28 () In Section 1.6, we introduced the idea of entropy h(x) as the information gained on observing the value of a random variable x having distribution p(x). We saw that, for independent variables x and y for which p(x, y) = p(x)p(y), the entropy functions are additive, so that h(x, y) = h(x) + h(y). In this exercise, we derive the relation between h and p in the form of a function h(p). First show that h(p^2) = 2h(p), and hence by induction that h(p^n) = nh(p) where n is a positive integer. Hence show that h(p^{n/m}) = (n/m)h(p) where m is also a positive integer. This implies that h(p^x) = xh(p) where x is a positive rational number, and hence by continuity when it is a positive real number. Finally, show that this implies h(p) must take the form h(p) ∝ ln p.

1.29 () www Consider an M-state discrete random variable x, and use Jensen’s inequality in the form (1.115) to show that the entropy of its distribution p(x) satisfies H[x] ≤ ln M.

1.30 ( ) Evaluate the Kullback-Leibler divergence (1.113) between two Gaussians p(x) = N(x|µ, σ²) and q(x) = N(x|m, s²).
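For Exercise 1.30, one way to check a candidate answer is to compare a closed-form expression against Monte Carlo estimation of KL(p‖q) = E_p[ln p(x) − ln q(x)]. The sketch below is ours (the parameter values are arbitrary) and uses the standard Gaussian-to-Gaussian expression for the comparison:

```python
import numpy as np

def kl_gaussians(mu, sigma, m, s):
    # Standard closed form for KL( N(mu, sigma^2) || N(m, s^2) )
    return np.log(s / sigma) + (sigma**2 + (mu - m)**2) / (2 * s**2) - 0.5

def log_gauss(x, mean, sd):
    return -0.5 * np.log(2 * np.pi * sd**2) - (x - mean)**2 / (2 * sd**2)

mu, sigma, m, s = 1.0, 0.5, 0.0, 2.0
rng = np.random.default_rng(2)
x = rng.normal(mu, sigma, size=500_000)   # samples from p(x)

print(kl_gaussians(mu, sigma, m, s))                          # closed form
print(np.mean(log_gauss(x, mu, sigma) - log_gauss(x, m, s)))  # Monte Carlo estimate
```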
Table 1.3  The joint distribution p(x, y) for two binary variables x and y used in Exercise 1.39.

            y = 0    y = 1
  x = 0      1/3       0
  x = 1      1/3      1/3

1.31 ( ) www Consider two variables x and y having joint distribution p(x, y). Show that the differential entropy of this pair of variables satisfies

H[x, y] \leq H[x] + H[y]    (1.152)

with equality if, and only if, x and y are statistically independent.

1.32 () Consider a vector x of continuous variables with distribution p(x) and corresponding entropy H[x]. Suppose that we make a nonsingular linear transformation of x to obtain a new variable y = Ax. Show that the corresponding entropy is given by H[y] = H[x] + ln |A| where |A| denotes the determinant of A.
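Exercise 1.32 can be sanity-checked with Gaussians, since their differential entropy has a closed form. The sketch below is ours (the random matrix is just a generic nonsingular choice) and compares H[y] for y = Ax against H[x] + ln|det A|:

```python
import numpy as np

def gaussian_entropy(cov):
    # Differential entropy of N(0, cov): 0.5 * ln( (2*pi*e)**D * det(cov) )
    D = cov.shape[0]
    return 0.5 * np.log((2 * np.pi * np.e) ** D * np.linalg.det(cov))

rng = np.random.default_rng(3)
D = 3
A = rng.normal(size=(D, D))      # generically nonsingular
cov_x = np.eye(D)                # take x ~ N(0, I) as a convenient test case
cov_y = A @ cov_x @ A.T          # covariance of y = A x

print(gaussian_entropy(cov_y))
print(gaussian_entropy(cov_x) + np.log(abs(np.linalg.det(A))))   # H[x] + ln|det A|
```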
1.33 ( ) Suppose that the conditional entropy H[y|x] between two discrete random variables x and y is zero. Show that, for all values of x such that p(x) > 0, the variable y must be a function of x, in other words for each x there is only one value of y such that p(y|x) ≠ 0.

1.34 ( ) www Use the calculus of variations to show that the stationary point of the constrained entropy functional is given by (1.108). Then use the constraints (1.105), (1.106), and (1.107) to eliminate the Lagrange multipliers and hence show that the maximum entropy solution is given by the Gaussian (1.109).

1.35 () www Use the results (1.106) and (1.107) to show that the entropy of the univariate Gaussian (1.109) is given by (1.110).

1.36 () A strictly convex function is defined as one for which every chord lies above the function. Show that this is equivalent to the condition that the second derivative of the function be positive.

1.37 () Using the definition (1.111) together with the product rule of probability, prove the result (1.112).

1.38 ( ) www Using proof by induction, show that the inequality (1.114) for convex functions implies the result (1.115).

1.39 ( ) Consider two binary variables x and y having the joint distribution given in Table 1.3. Evaluate the following quantities:
(a) H[x]   (b) H[y]   (c) H[y|x]   (d) H[x|y]   (e) H[x, y]   (f) I[x, y].
Draw a diagram to show the relationship between these various quantities.
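For Exercise 1.39, the required quantities can be evaluated directly from the reconstructed Table 1.3. This sketch is ours and uses natural logarithms, so all entropies come out in nats:

```python
import numpy as np

# Joint distribution from Table 1.3 (rows: x = 0, 1; columns: y = 0, 1)
p_xy = np.array([[1/3, 0.0],
                 [1/3, 1/3]])

def H(p):
    p = p[p > 0]                      # 0 * ln 0 is taken as 0
    return -np.sum(p * np.log(p))

p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

H_x, H_y, H_xy = H(p_x), H(p_y), H(p_xy.ravel())
print(H_x, H_y)                       # (a), (b) marginal entropies
print(H_xy - H_x, H_xy - H_y)         # (c) H[y|x] and (d) H[x|y]
print(H_xy)                           # (e) joint entropy, here ln 3
print(H_x + H_y - H_xy)               # (f) mutual information I[x, y]
```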
1.40 () By applying Jensen’s inequality (1.115) with f(x) = ln x, show that the arithmetic mean of a set of real numbers is never less than their geometrical mean.

1.41 () www Using the sum and product rules of probability, show that the mutual information I(x, y) satisfies the relation (1.121).
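A one-line numerical illustration of Exercise 1.40 (ours; any positive numbers will do): since ln is concave, the log of the mean is at least the mean of the logs, so the arithmetic mean dominates the geometric mean.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0.1, 10.0, size=1000)       # arbitrary positive numbers

arithmetic = x.mean()
geometric = np.exp(np.mean(np.log(x)))      # exp of the average log

print(arithmetic, geometric, arithmetic >= geometric)   # the comparison prints True
```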
2. Probability Distributions

In Chapter 1, we emphasized the central role played by probability theory in the solution of pattern recognition problems. We turn now to an exploration of some particular examples of probability distributions and their properties. As well as being of great interest in their own right, these distributions can form building blocks for more complex models and will be used extensively throughout the book. The distributions introduced in this chapter will also serve another important purpose, namely to provide us with the opportunity to discuss some key statistical concepts, such as Bayesian inference, in the context of simple models before we encounter them in more complex situations in later chapters.

One role for the distributions discussed in this chapter is to model the probability distribution p(x) of a random variable x, given a finite set x1, …