Bishop C.M., Pattern Recognition and Machine Learning (2006), excerpt
We can try to approximate this distribution using some parametric distribution q(x|θ), governed by a set of adjustable parameters θ, for example a multivariate Gaussian. One way to determine θ is to minimize the Kullback-Leibler divergence between p(x) and q(x|θ) with respect to θ. We cannot do this directly because we don't know p(x). Suppose, however, that we have observed a finite set of training points $\mathbf{x}_n$, for n = 1, ..., N, drawn from p(x).
Then the expectation with respect to p(x) can be approximated by a finite sum over these points, using (1.35), so that

$$
\mathrm{KL}(p\,\|\,q) \simeq \frac{1}{N}\sum_{n=1}^{N}\bigl\{-\ln q(\mathbf{x}_n|\boldsymbol{\theta}) + \ln p(\mathbf{x}_n)\bigr\}. \tag{1.119}
$$

The second term on the right-hand side of (1.119) is independent of θ, and the first term is the negative log likelihood function for θ under the distribution q(x|θ) evaluated using the training set. Thus we see that minimizing this Kullback-Leibler divergence is equivalent to maximizing the likelihood function.
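As an illustration of this equivalence, the following sketch (not part of the book) fits a univariate Gaussian q(x|µ, σ²) to samples from an unknown p(x) by minimizing the sampled negative log likelihood, which differs from the Monte Carlo estimate of KL(p‖q) in (1.119) only by a term that does not depend on the parameters. The data-generating distribution and the helper name `neg_log_likelihood` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend p(x) is unknown; we only observe samples drawn from it.
samples = rng.normal(loc=1.5, scale=0.8, size=5000)

def neg_log_likelihood(mu, sigma2, x):
    """Negative log likelihood of N(x | mu, sigma2); this is the
    theta-dependent part of the sampled KL divergence in (1.119)."""
    return 0.5 * np.sum(np.log(2 * np.pi * sigma2) + (x - mu) ** 2 / sigma2)

# Closed-form maximum likelihood estimates for a univariate Gaussian.
mu_ml = samples.mean()
sigma2_ml = samples.var()

# Perturbing the ML estimates increases the negative log likelihood,
# i.e. increases the sampled KL divergence.
nll_ml = neg_log_likelihood(mu_ml, sigma2_ml, samples)
for mu, sigma2 in [(mu_ml + 0.3, sigma2_ml), (mu_ml, 2.0 * sigma2_ml)]:
    assert neg_log_likelihood(mu, sigma2, samples) > nll_ml

print(mu_ml, sigma2_ml)
```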
Now consider the joint distribution between two sets of variables x and y given by p(x, y). If the sets of variables are independent, then their joint distribution will factorize into the product of their marginals p(x, y) = p(x)p(y). If the variables are not independent, we can gain some idea of whether they are 'close' to being independent by considering the Kullback-Leibler divergence between the joint distribution and the product of the marginals, given by

$$
I[\mathbf{x}, \mathbf{y}] \equiv \mathrm{KL}\bigl(p(\mathbf{x}, \mathbf{y})\,\|\,p(\mathbf{x})p(\mathbf{y})\bigr)
= -\iint p(\mathbf{x}, \mathbf{y}) \ln\!\left(\frac{p(\mathbf{x})\,p(\mathbf{y})}{p(\mathbf{x}, \mathbf{y})}\right) \mathrm{d}\mathbf{x}\,\mathrm{d}\mathbf{y} \tag{1.120}
$$

which is called the mutual information between the variables x and y. From the properties of the Kullback-Leibler divergence, we see that $I[\mathbf{x}, \mathbf{y}] \geq 0$, with equality if, and only if, x and y are independent. Using the sum and product rules of probability, we see that the mutual information is related to the conditional entropy through (Exercise 1.41)

$$
I[\mathbf{x}, \mathbf{y}] = H[\mathbf{x}] - H[\mathbf{x}|\mathbf{y}] = H[\mathbf{y}] - H[\mathbf{y}|\mathbf{x}]. \tag{1.121}
$$

Thus we can view the mutual information as the reduction in the uncertainty about x by virtue of being told the value of y (or vice versa). From a Bayesian perspective, we can view p(x) as the prior distribution for x and p(x|y) as the posterior distribution after we have observed new data y. The mutual information therefore represents the reduction in uncertainty about x as a consequence of the new observation y.
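To make the definition concrete, here is a small numerical check (an illustrative addition, not from the book) that evaluates (1.120) and (1.121) for a toy discrete joint distribution; the particular 2x2 table of probabilities is an arbitrary assumption.

```python
import numpy as np

# Toy joint distribution p(x, y) over two binary variables (rows: x, columns: y).
p_xy = np.array([[0.30, 0.10],
                 [0.20, 0.40]])

p_x = p_xy.sum(axis=1)   # marginal p(x)
p_y = p_xy.sum(axis=0)   # marginal p(y)

# Mutual information as KL(p(x,y) || p(x)p(y)), eq. (1.120).
mi_kl = np.sum(p_xy * np.log(p_xy / np.outer(p_x, p_y)))

# Entropies H[x] and H[x|y] for eq. (1.121).
h_x = -np.sum(p_x * np.log(p_x))
p_x_given_y = p_xy / p_y                      # column j holds p(x | y = j)
h_x_given_y = -np.sum(p_xy * np.log(p_x_given_y))

assert np.isclose(mi_kl, h_x - h_x_given_y)   # I[x,y] = H[x] - H[x|y]
print(mi_kl)
```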
Exercises

1.1 (www) Consider the sum-of-squares error function given by (1.2) in which the function y(x, w) is given by the polynomial (1.1). Show that the coefficients w = {w_i} that minimize this error function are given by the solution to the following set of linear equations

$$
\sum_{j=0}^{M} A_{ij} w_j = T_i \tag{1.122}
$$

where

$$
A_{ij} = \sum_{n=1}^{N} (x_n)^{i+j}, \qquad T_i = \sum_{n=1}^{N} (x_n)^{i}\, t_n. \tag{1.123}
$$

Here a suffix i or j denotes the index of a component, whereas $(x)^i$ denotes x raised to the power of i.

1.2 Write down the set of coupled linear equations, analogous to (1.122), satisfied by the coefficients w_i which minimize the regularized sum-of-squares error function given by (1.4).

1.3 Suppose that we have three coloured boxes r (red), b (blue), and g (green). Box r contains 3 apples, 4 oranges, and 3 limes, box b contains 1 apple, 1 orange, and 0 limes, and box g contains 3 apples, 3 oranges, and 4 limes. If a box is chosen at random with probabilities p(r) = 0.2, p(b) = 0.2, p(g) = 0.6, and a piece of fruit is removed from the box (with equal probability of selecting any of the items in the box), then what is the probability of selecting an apple? If we observe that the selected fruit is in fact an orange, what is the probability that it came from the green box?
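For readers who want to sanity-check this sort of sum-rule and Bayes' theorem calculation numerically, the following sketch (not part of the book) encodes the box contents from Exercise 1.3 and evaluates the two requested probabilities with exact arithmetic.

```python
from fractions import Fraction

# Box contents from Exercise 1.3: counts of each fruit per box.
boxes = {
    "r": {"apple": 3, "orange": 4, "lime": 3},
    "b": {"apple": 1, "orange": 1, "lime": 0},
    "g": {"apple": 3, "orange": 3, "lime": 4},
}
p_box = {"r": Fraction(2, 10), "b": Fraction(2, 10), "g": Fraction(6, 10)}

def p_fruit_given_box(fruit, box):
    counts = boxes[box]
    return Fraction(counts[fruit], sum(counts.values()))

# Sum rule: p(apple) = sum over boxes of p(apple | box) p(box).
p_apple = sum(p_fruit_given_box("apple", b) * p_box[b] for b in boxes)

# Bayes' theorem: p(g | orange) = p(orange | g) p(g) / p(orange).
p_orange = sum(p_fruit_given_box("orange", b) * p_box[b] for b in boxes)
p_g_given_orange = p_fruit_given_box("orange", "g") * p_box["g"] / p_orange

print(p_apple, p_g_given_orange)
```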
1.4 (www) Consider a probability density $p_x(x)$ defined over a continuous variable x, and suppose that we make a nonlinear change of variable using x = g(y), so that the density transforms according to (1.27). By differentiating (1.27), show that the location $\hat{y}$ of the maximum of the density in y is not in general related to the location $\hat{x}$ of the maximum of the density over x by the simple functional relation $\hat{x} = g(\hat{y})$ as a consequence of the Jacobian factor. This shows that the maximum of a probability density (in contrast to a simple function) is dependent on the choice of variable. Verify that, in the case of a linear transformation, the location of the maximum transforms in the same way as the variable itself.

1.5 Using the definition (1.38) show that var[f(x)] satisfies (1.39).

1.6 Show that if two variables x and y are independent, then their covariance is zero.

1.7 (www) In this exercise, we prove the normalization condition (1.48) for the univariate Gaussian. To do this, consider the integral

$$
I = \int_{-\infty}^{\infty} \exp\!\left(-\frac{1}{2\sigma^2} x^2\right) \mathrm{d}x \tag{1.124}
$$

which we can evaluate by first writing its square in the form

$$
I^2 = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \exp\!\left(-\frac{1}{2\sigma^2} x^2 - \frac{1}{2\sigma^2} y^2\right) \mathrm{d}x\,\mathrm{d}y. \tag{1.125}
$$

Now make the transformation from Cartesian coordinates (x, y) to polar coordinates (r, θ) and then substitute u = r². Show that, by performing the integrals over θ and u, and then taking the square root of both sides, we obtain

$$
I = \left(2\pi\sigma^2\right)^{1/2}. \tag{1.126}
$$

Finally, use this result to show that the Gaussian distribution $\mathcal{N}(x|\mu, \sigma^2)$ is normalized.
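As a quick numerical companion to Exercise 1.7 (not part of the book), the sketch below approximates the integral in (1.124) on a wide grid and compares it with $(2\pi\sigma^2)^{1/2}$; the grid limits and spacing are arbitrary choices.

```python
import numpy as np

sigma2 = 1.7                                # arbitrary variance for the check
x = np.linspace(-40.0, 40.0, 400001)
integrand = np.exp(-x ** 2 / (2.0 * sigma2))

# Trapezoidal approximation of the integral in (1.124).
dx = x[1] - x[0]
integral = dx * (integrand.sum() - 0.5 * (integrand[0] + integrand[-1]))

assert np.isclose(integral, np.sqrt(2.0 * np.pi * sigma2))
print(integral, np.sqrt(2.0 * np.pi * sigma2))
```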
1.8 (www) By using a change of variables, verify that the univariate Gaussian distribution given by (1.46) satisfies (1.49). Next, by differentiating both sides of the normalization condition

$$
\int_{-\infty}^{\infty} \mathcal{N}\!\left(x|\mu, \sigma^2\right) \mathrm{d}x = 1 \tag{1.127}
$$

with respect to σ², verify that the Gaussian satisfies (1.50). Finally, show that (1.51) holds.
1.9 (www) Show that the mode (i.e. the maximum) of the Gaussian distribution (1.46) is given by µ. Similarly, show that the mode of the multivariate Gaussian (1.52) is given by µ.

1.10 (www) Suppose that the two variables x and z are statistically independent. Show that the mean and variance of their sum satisfy

$$
\mathbb{E}[x + z] = \mathbb{E}[x] + \mathbb{E}[z] \tag{1.128}
$$

$$
\mathrm{var}[x + z] = \mathrm{var}[x] + \mathrm{var}[z]. \tag{1.129}
$$

1.11 By setting the derivatives of the log likelihood function (1.54) with respect to µ and σ² equal to zero, verify the results (1.55) and (1.56).

1.12 (www) Using the results (1.49) and (1.50), show that

$$
\mathbb{E}[x_n x_m] = \mu^2 + I_{nm}\,\sigma^2 \tag{1.130}
$$

where $x_n$ and $x_m$ denote data points sampled from a Gaussian distribution with mean µ and variance σ², and $I_{nm}$ satisfies $I_{nm} = 1$ if n = m and $I_{nm} = 0$ otherwise. Hence prove the results (1.57) and (1.58).
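The identity in (1.130) is easy to check by simulation; the following sketch (an illustrative addition, not from the book) estimates $\mathbb{E}[x_n x_m]$ for equal and unequal indices from many independently drawn datasets, with arbitrary choices of µ and σ².

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma2 = 2.0, 1.5
num_datasets, N = 200000, 5

# Each row is one dataset of N i.i.d. Gaussian samples.
x = rng.normal(mu, np.sqrt(sigma2), size=(num_datasets, N))

e_same = np.mean(x[:, 0] * x[:, 0])   # n = m: expect mu^2 + sigma^2
e_diff = np.mean(x[:, 0] * x[:, 1])   # n != m: expect mu^2

print(e_same, mu ** 2 + sigma2)   # approximately 5.5 vs 5.5
print(e_diff, mu ** 2)            # approximately 4.0 vs 4.0
```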
1.13 Suppose that the variance of a Gaussian is estimated using the result (1.56) but with the maximum likelihood estimate $\mu_{\mathrm{ML}}$ replaced with the true value µ of the mean. Show that this estimator has the property that its expectation is given by the true variance σ².

1.14 Show that an arbitrary square matrix with elements $w_{ij}$ can be written in the form $w_{ij} = w_{ij}^{S} + w_{ij}^{A}$ where $w_{ij}^{S}$ and $w_{ij}^{A}$ are symmetric and anti-symmetric matrices, respectively, satisfying $w_{ij}^{S} = w_{ji}^{S}$ and $w_{ij}^{A} = -w_{ji}^{A}$ for all i and j. Now consider the second order term in a higher order polynomial in D dimensions, given by

$$
\sum_{i=1}^{D}\sum_{j=1}^{D} w_{ij} x_i x_j. \tag{1.131}
$$

Show that

$$
\sum_{i=1}^{D}\sum_{j=1}^{D} w_{ij} x_i x_j = \sum_{i=1}^{D}\sum_{j=1}^{D} w_{ij}^{S} x_i x_j \tag{1.132}
$$

so that the contribution from the anti-symmetric matrix vanishes. We therefore see that, without loss of generality, the matrix of coefficients $w_{ij}$ can be chosen to be symmetric, and so not all of the D² elements of this matrix can be chosen independently. Show that the number of independent parameters in the matrix $w_{ij}^{S}$ is given by D(D + 1)/2.
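The decomposition in Exercise 1.14 is easy to verify numerically. The sketch below (an illustrative addition, not from the book) splits a random matrix into its symmetric and anti-symmetric parts and confirms that only the symmetric part contributes to the quadratic form; the dimension and random seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
D = 4
W = rng.normal(size=(D, D))
x = rng.normal(size=D)

W_sym = 0.5 * (W + W.T)        # symmetric part
W_anti = 0.5 * (W - W.T)       # anti-symmetric part

assert np.allclose(W, W_sym + W_anti)
# The anti-symmetric part contributes nothing to the quadratic form (1.132).
assert np.isclose(x @ W @ x, x @ W_sym @ x)
assert np.isclose(x @ W_anti @ x, 0.0)

# Number of independent parameters in the symmetric part: D(D + 1)/2.
print(D * (D + 1) // 2)
```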
1.15 (www) In this exercise and the next, we explore how the number of independent parameters in a polynomial grows with the order M of the polynomial and with the dimensionality D of the input space. We start by writing down the Mth order term for a polynomial in D dimensions in the form

$$
\sum_{i_1=1}^{D}\sum_{i_2=1}^{D}\cdots\sum_{i_M=1}^{D} w_{i_1 i_2 \cdots i_M}\, x_{i_1} x_{i_2} \cdots x_{i_M}. \tag{1.133}
$$

The coefficients $w_{i_1 i_2 \cdots i_M}$ comprise $D^M$ elements, but the number of independent parameters is significantly fewer due to the many interchange symmetries of the factor $x_{i_1} x_{i_2} \cdots x_{i_M}$.
Begin by showing that the redundancy in the coefficients can be removed by rewriting this Mth order term in the form

$$
\sum_{i_1=1}^{D}\sum_{i_2=1}^{i_1}\cdots\sum_{i_M=1}^{i_{M-1}} \widetilde{w}_{i_1 i_2 \cdots i_M}\, x_{i_1} x_{i_2} \cdots x_{i_M}. \tag{1.134}
$$

Note that the precise relationship between the $\widetilde{w}$ coefficients and the $w$ coefficients need not be made explicit. Use this result to show that the number of independent parameters n(D, M), which appear at order M, satisfies the following recursion relation

$$
n(D, M) = \sum_{i=1}^{D} n(i, M - 1). \tag{1.135}
$$

Next use proof by induction to show that the following result holds

$$
\sum_{i=1}^{D} \frac{(i + M - 2)!}{(i - 1)!\,(M - 1)!} = \frac{(D + M - 1)!}{(D - 1)!\,M!} \tag{1.136}
$$

which can be done by first proving the result for D = 1 and arbitrary M by making use of the result 0! = 1, then assuming it is correct for dimension D and verifying that it is correct for dimension D + 1. Finally, use the two previous results, together with proof by induction, to show

$$
n(D, M) = \frac{(D + M - 1)!}{(D - 1)!\,M!}. \tag{1.137}
$$

To do this, first show that the result is true for M = 2, and any value of D ≥ 1, by comparison with the result of Exercise 1.14.
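As a closing illustration (not part of the book), the recursion (1.135) and the closed form (1.137) can be cross-checked in a few lines of code; the base case n(D, 1) = D used below is an assumption corresponding to the D independent first-order terms.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def n_params(D, M):
    """Number of independent coefficients at order M in D dimensions,
    computed via the recursion (1.135) with base case n(D, 1) = D."""
    if M == 1:
        return D
    return sum(n_params(i, M - 1) for i in range(1, D + 1))

# Closed form (1.137): (D + M - 1)! / ((D - 1)! M!) = C(D + M - 1, M).
for D in range(1, 7):
    for M in range(1, 7):
        assert n_params(D, M) == comb(D + M - 1, M)

print(n_params(3, 3))   # 10 independent third-order terms in 3 dimensions
```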