We shall see the motivation for these terms shortly. Figure 1.13 shows a plot of the Gaussian distribution.

[Figure 1.13: Plot of the univariate Gaussian N(x | µ, σ²), showing the mean µ and the standard deviation σ (an interval of width 2σ is marked).]

From the form of (1.46) we see that the Gaussian distribution satisfies

$$
\mathcal{N}(x \mid \mu, \sigma^2) > 0. \tag{1.47}
$$

Also it is straightforward to show (Exercise 1.7) that the Gaussian is normalized, so that

$$
\int_{-\infty}^{\infty} \mathcal{N}(x \mid \mu, \sigma^2) \, \mathrm{d}x = 1. \tag{1.48}
$$

Thus (1.46) satisfies the two requirements for a valid probability density.

[Sidebar: Pierre-Simon Laplace, 1749–1827. It is said that Laplace was seriously lacking in modesty and at one point declared himself to be the best mathematician in France at the time, a claim that was arguably true. As well as being prolific in mathematics, he also made numerous contributions to astronomy, including the nebular hypothesis, by which the earth is thought to have formed from the condensation and cooling of a large rotating disk of gas and dust. In 1812 he published the first edition of Théorie Analytique des Probabilités, in which Laplace states that "probability theory is nothing but common sense reduced to calculation". This work included a discussion of the inverse probability calculation (later termed Bayes' theorem by Poincaré), which he used to solve problems in life expectancy, jurisprudence, planetary masses, triangulation, and error estimation.]

We can readily find expectations of functions of x under the Gaussian distribution. In particular, the average value of x is given by (Exercise 1.8)

$$
\mathbb{E}[x] = \int_{-\infty}^{\infty} \mathcal{N}(x \mid \mu, \sigma^2) \, x \, \mathrm{d}x = \mu. \tag{1.49}
$$

Because the parameter µ represents the average value of x under the distribution, it is referred to as the mean.
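Both results are easy to confirm numerically. The following is a minimal sketch, not part of the text, assuming NumPy and SciPy are available (the name gauss_pdf is ours); it evaluates the integrals (1.48) and (1.49) by quadrature:

```python
import numpy as np
from scipy.integrate import quad

def gauss_pdf(x, mu, sigma2):
    """Univariate Gaussian density of equation (1.46)."""
    return np.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)

mu, sigma2 = 1.5, 0.7

# (1.48): the density integrates to one over the real line.
norm, _ = quad(gauss_pdf, -np.inf, np.inf, args=(mu, sigma2))

# (1.49): the first moment recovers the mean parameter.
mean, _ = quad(lambda t: t * gauss_pdf(t, mu, sigma2), -np.inf, np.inf)

print(norm)  # ~1.0
print(mean)  # ~1.5, i.e. mu
```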
Similarly, for the second-order moment

$$
\mathbb{E}[x^2] = \int_{-\infty}^{\infty} \mathcal{N}(x \mid \mu, \sigma^2) \, x^2 \, \mathrm{d}x = \mu^2 + \sigma^2. \tag{1.50}
$$

From (1.49) and (1.50), it follows that the variance of x is given by

$$
\operatorname{var}[x] = \mathbb{E}[x^2] - \mathbb{E}[x]^2 = \sigma^2 \tag{1.51}
$$

and hence σ² is referred to as the variance parameter. The maximum of a distribution is known as its mode. For a Gaussian, the mode coincides with the mean (Exercise 1.9).

We are also interested in the Gaussian distribution defined over a D-dimensional vector x of continuous variables, which is given by

$$
\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{D/2}} \frac{1}{|\boldsymbol{\Sigma}|^{1/2}} \exp\left\{ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\mathrm{T}} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right\} \tag{1.52}
$$

where the D-dimensional vector µ is called the mean, the D × D matrix Σ is called the covariance, and |Σ| denotes the determinant of Σ.
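Equation (1.52) maps almost line for line onto code. The sketch below (assuming NumPy; gauss_pdf_mv is an illustrative name, not an established API) uses a linear solve rather than an explicit matrix inverse for the quadratic form, which is the standard numerically stable choice:

```python
import numpy as np

def gauss_pdf_mv(x, mu, Sigma):
    """Multivariate Gaussian density, equation (1.52)."""
    D = mu.shape[0]
    diff = x - mu
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu), via a linear solve.
    quad_form = diff @ np.linalg.solve(Sigma, diff)
    norm_const = (2.0 * np.pi) ** (D / 2.0) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad_form) / norm_const

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
x = np.array([0.5, 0.5])

print(gauss_pdf_mv(x, mu, Sigma))
# Cross-check against SciPy, if available:
# from scipy.stats import multivariate_normal
# print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))
```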
We shall make use of the multivariate Gaussian distribution briefly in this chapter, although its properties will be studied in detail in Section 2.3.

[Figure 1.14: Illustration of the likelihood function for a Gaussian distribution p(x), shown by the red curve. The black points denote a data set of values {xn}, and the likelihood function given by (1.53) corresponds to the product of the blue values N(xn | µ, σ²). Maximizing the likelihood involves adjusting the mean and variance of the Gaussian so as to maximize this product.]

Now suppose that we have a data set of observations x = (x1, . . . , xN)^T, representing N observations of the scalar variable x. Note that we are using the typeface x to distinguish this from a single observation of the vector-valued variable (x1, . . . , xD)^T, which we denote by x.
We shall suppose that the observations are drawn independently from a Gaussian distribution whose mean µ and variance σ² are unknown, and we would like to determine these parameters from the data set. Data points that are drawn independently from the same distribution are said to be independent and identically distributed, which is often abbreviated to i.i.d. We have seen that the joint probability of two independent events is given by the product of the marginal probabilities for each event separately. Because our data set x is i.i.d., we can therefore write the probability of the data set, given µ and σ², in the form

$$
p(\mathbf{x} \mid \mu, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(x_n \mid \mu, \sigma^2). \tag{1.53}
$$

When viewed as a function of µ and σ², this is the likelihood function for the Gaussian (Section 1.2.5), and it is interpreted diagrammatically in Figure 1.14. One common criterion for determining the parameters in a probability distribution using an observed data set is to find the parameter values that maximize the likelihood function.
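To make (1.53) concrete, the following sketch (assuming NumPy; the helper name likelihood is ours) evaluates the likelihood of a small synthetic data set across a grid of candidate means, with the variance held fixed; the grid maximizer lands on the sample mean, anticipating the result (1.55) derived below:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=10)  # i.i.d. draws, true mu = 2

def likelihood(x, mu, sigma2):
    """Likelihood function (1.53): product of univariate Gaussian densities."""
    dens = np.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)
    return np.prod(dens)

# Scan candidate means, holding sigma^2 fixed at its true value of 1.
grid = np.linspace(0.0, 4.0, 401)
best_mu = grid[np.argmax([likelihood(x, m, 1.0) for m in grid])]

print(best_mu, x.mean())  # agree to within the grid spacing
```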
This might seem like a strange criterion because, from our foregoing discussion of probability theory, it would seem more natural to maximize the probability of the parameters given the data, not the probability of the data given the parameters. In fact, these two criteria are related, as we shall discuss in the context of curve fitting.

For the moment, however, we shall determine values for the unknown parameters µ and σ² in the Gaussian by maximizing the likelihood function (1.53). In practice, it is more convenient to maximize the log of the likelihood function. Because the logarithm is a monotonically increasing function of its argument, maximization of the log of a function is equivalent to maximization of the function itself. Taking the log not only simplifies the subsequent mathematical analysis, but it also helps numerically, because the product of a large number of small probabilities can easily underflow the numerical precision of the computer; this is resolved by computing instead the sum of the log probabilities.
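This failure mode is easy to reproduce. In the sketch below (assuming NumPy), the product of two thousand Gaussian density values underflows to exactly zero in double precision, while the sum of their logs is unproblematic:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=2000)  # data from a standard Gaussian

dens = np.exp(-x ** 2 / 2.0) / np.sqrt(2.0 * np.pi)

print(np.prod(dens))         # 0.0: the product of 2000 small values underflows
print(np.sum(np.log(dens)))  # a finite log likelihood (on the order of -2800)
```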
From (1.46) and (1.53), the log likelihood function can be written in the form

$$
\ln p(\mathbf{x} \mid \mu, \sigma^2) = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2 - \frac{N}{2} \ln \sigma^2 - \frac{N}{2} \ln(2\pi). \tag{1.54}
$$

Maximizing (1.54) with respect to µ, we obtain the maximum likelihood solution given by (Exercise 1.11)

$$
\mu_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} x_n \tag{1.55}
$$

which is the sample mean, i.e., the mean of the observed values {xn}. Similarly, maximizing (1.54) with respect to σ², we obtain the maximum likelihood solution for the variance in the form

$$
\sigma^2_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{\mathrm{ML}})^2 \tag{1.56}
$$

which is the sample variance measured with respect to the sample mean µ_ML. Note that we are performing a joint maximization of (1.54) with respect to µ and σ², but in the case of the Gaussian distribution the solution for µ decouples from that for σ², so that we can first evaluate (1.55) and then subsequently use this result to evaluate (1.56).
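In code, this decoupling is simply the familiar two-step recipe: compute the sample mean first, then the average squared deviation about it. A minimal sketch (assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=2.0, scale=1.5, size=1000)  # true mu = 2, sigma^2 = 2.25

mu_ml = x.mean()                       # equation (1.55)
sigma2_ml = np.mean((x - mu_ml) ** 2)  # equation (1.56), using mu_ml from above

print(mu_ml, sigma2_ml)  # close to 2.0 and 2.25
# Note: np.var(x) computes the same quantity, since its default is ddof=0.
```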
Later in this chapter, and also in subsequent chapters, we shall highlight the significant limitations of the maximum likelihood approach. Here we give an indication of the problem in the context of our solutions for the maximum likelihood parameter settings for the univariate Gaussian distribution. In particular, we shall show that the maximum likelihood approach systematically underestimates the variance of the distribution. This is an example of a phenomenon called bias and is related to the problem of over-fitting encountered in the context of polynomial curve fitting (Section 1.1). We first note that the maximum likelihood solutions µ_ML and σ²_ML are functions of the data set values x1, . . . , xN.
Consider the expectations of these quantities with respect to the data set values, which themselves come from a Gaussian distribution with parameters µ and σ². It is straightforward to show (Exercise 1.12) that

$$
\mathbb{E}[\mu_{\mathrm{ML}}] = \mu \tag{1.57}
$$

$$
\mathbb{E}[\sigma^2_{\mathrm{ML}}] = \left( \frac{N-1}{N} \right) \sigma^2 \tag{1.58}
$$

so that on average the maximum likelihood estimate will obtain the correct mean but will underestimate the true variance by a factor (N − 1)/N. The intuition behind this result is given by Figure 1.15.

From (1.58) it follows that the following estimate for the variance parameter is unbiased:

$$
\widetilde{\sigma}^2 = \frac{N}{N-1} \, \sigma^2_{\mathrm{ML}} = \frac{1}{N-1} \sum_{n=1}^{N} (x_n - \mu_{\mathrm{ML}})^2. \tag{1.59}
$$
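Both expectations, and the effect of the correction factor in (1.59), can be verified by simulation over many data sets. A sketch (assuming NumPy), using data sets of size N = 2 as in Figure 1.15:

```python
import numpy as np

rng = np.random.default_rng(3)
sigma2_true, N = 1.0, 2

# 100000 data sets, each of N = 2 points from a zero-mean unit-variance Gaussian.
datasets = rng.normal(0.0, np.sqrt(sigma2_true), size=(100_000, N))

mu_ml = datasets.mean(axis=1, keepdims=True)          # (1.55) per data set
sigma2_ml = np.mean((datasets - mu_ml) ** 2, axis=1)  # (1.56) per data set

print(mu_ml.mean())                    # ~0.0, consistent with (1.57)
print(sigma2_ml.mean())                # ~0.5 = (N-1)/N * sigma^2, as in (1.58)
print(N / (N - 1) * sigma2_ml.mean())  # ~1.0, the unbiased estimate (1.59)
```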
[Figure 1.15: Illustration of how bias arises in using maximum likelihood to determine the variance of a Gaussian. The green curve shows the true Gaussian distribution from which data is generated, and the three red curves, in panels (a), (b), and (c), show the Gaussian distributions obtained by fitting to three data sets, each consisting of two data points shown in blue, using the maximum likelihood results (1.55) and (1.56). Averaged across the three data sets, the mean is correct, but the variance is systematically under-estimated because it is measured relative to the sample mean and not relative to the true mean.]

In Section 10.1.3, we shall see how this result arises automatically when we adopt a Bayesian approach.

Note that the bias of the maximum likelihood solution becomes less significant as the number N of data points increases, and in the limit N → ∞ the maximum likelihood solution for the variance equals the true variance of the distribution that generated the data.