Bishop C.M., Pattern Recognition and Machine Learning (2006)

2. PROBABILITY DISTRIBUTIONS
Suppose we wish to model the probability distribution p(x) of a random variable x, given a finite set x1, . . . , xN of observations. This problem is known as density estimation. For the purposes of this chapter, we shall assume that the data points are independent and identically distributed. It should be emphasized that the problem of density estimation is fundamentally ill-posed, because there are infinitely many probability distributions that could have given rise to the observed finite data set.
Indeed, any distribution p(x) that is nonzero at each of the data points x1, . . . , xN is a potential candidate. The issue of choosing an appropriate distribution relates to the problem of model selection, which was already encountered in the context of polynomial curve fitting in Chapter 1 and which is a central issue in pattern recognition.

We begin by considering the binomial and multinomial distributions for discrete random variables and the Gaussian distribution for continuous random variables. These are specific examples of parametric distributions, so-called because they are governed by a small number of adaptive parameters, such as the mean and variance in the case of a Gaussian. To apply such models to the problem of density estimation, we need a procedure for determining suitable values for the parameters, given an observed data set.
In a frequentist treatment, we choose specific values for the parameters by optimizing some criterion, such as the likelihood function. By contrast, in a Bayesian treatment we introduce prior distributions over the parameters and then use Bayes' theorem to compute the corresponding posterior distribution given the observed data.

We shall see that an important role is played by conjugate priors, which lead to posterior distributions having the same functional form as the prior, and which therefore greatly simplify the Bayesian analysis.
For example, the conjugate prior for the parameters of the multinomial distribution is called the Dirichlet distribution, while the conjugate prior for the mean of a Gaussian is another Gaussian. All of these distributions are examples of the exponential family of distributions, which possess a number of important properties and which will be discussed in some detail.

One limitation of the parametric approach is that it assumes a specific functional form for the distribution, which may turn out to be inappropriate for a particular application. An alternative approach is given by nonparametric density estimation methods, in which the form of the distribution typically depends on the size of the data set.
Such models still contain parameters, but these control the model complexity rather than the form of the distribution. We end this chapter by considering three nonparametric methods based respectively on histograms, nearest neighbours, and kernels.

2.1. Binary Variables

We begin by considering a single binary random variable x ∈ {0, 1}. For example, x might describe the outcome of flipping a coin, with x = 1 representing 'heads' and x = 0 representing 'tails'.
We can imagine that this is a damaged coin, so that the probability of landing heads is not necessarily the same as that of landing tails. The probability of x = 1 will be denoted by the parameter µ, so that

    p(x = 1|µ) = µ                                            (2.1)

where 0 ≤ µ ≤ 1, from which it follows that p(x = 0|µ) = 1 − µ. The probability distribution over x can therefore be written in the form

    Bern(x|µ) = µ^x (1 − µ)^{1−x}                             (2.2)

which is known as the Bernoulli distribution. It is easily verified (Exercise 2.1) that this distribution is normalized and that it has mean and variance given by

    E[x] = µ                                                  (2.3)
    var[x] = µ(1 − µ).                                        (2.4)

Now suppose we have a data set D = {x1, . . . , xN} of observed values of x.
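As a quick numerical check of (2.2)–(2.4), the short Python sketch below (an illustrative addition, not part of the text; the function name is ours) enumerates the two outcomes x ∈ {0, 1} and confirms the normalization, mean, and variance of the Bernoulli distribution.

```python
# Minimal sketch: numerically verify (2.2)-(2.4) for Bern(x|mu) = mu^x (1 - mu)^(1 - x).

def bernoulli_pmf(x, mu):
    """Probability of x in {0, 1} under Bern(x|mu), equation (2.2)."""
    return mu**x * (1 - mu)**(1 - x)

mu = 0.25
probs = {x: bernoulli_pmf(x, mu) for x in (0, 1)}

total = sum(probs.values())                              # should be 1 (normalization)
mean = sum(x * p for x, p in probs.items())              # should equal mu, eq. (2.3)
var = sum((x - mean)**2 * p for x, p in probs.items())   # should equal mu*(1 - mu), eq. (2.4)

print(total, mean, var)   # -> 1.0 0.25 0.1875
```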
We can construct the likelihood function, which is a function of µ, on the assumption that the observations are drawn independently from p(x|µ), so that

    p(D|µ) = ∏_{n=1}^{N} p(x_n|µ) = ∏_{n=1}^{N} µ^{x_n} (1 − µ)^{1−x_n}.        (2.5)

In a frequentist setting, we can estimate a value for µ by maximizing the likelihood function, or equivalently by maximizing the logarithm of the likelihood. In the case of the Bernoulli distribution, the log likelihood function is given by

    ln p(D|µ) = ∑_{n=1}^{N} ln p(x_n|µ) = ∑_{n=1}^{N} {x_n ln µ + (1 − x_n) ln(1 − µ)}.   (2.6)

At this point, it is worth noting that the log likelihood function depends on the N observations x_n only through their sum ∑_n x_n. This sum provides an example of a sufficient statistic for the data under this distribution, and we shall study the important role of sufficient statistics in some detail in Section 2.4. If we set the derivative of ln p(D|µ) with respect to µ equal to zero, we obtain the maximum likelihood estimator

    µ_ML = (1/N) ∑_{n=1}^{N} x_n                              (2.7)

which is also known as the sample mean. If we denote the number of observations of x = 1 (heads) within this data set by m, then we can write (2.7) in the form

    µ_ML = m/N                                                (2.8)

so that the probability of landing heads is given, in this maximum likelihood framework, by the fraction of observations of heads in the data set.

[Biographical note: Jacob Bernoulli (1654–1705), also known as Jacques or James Bernoulli, was a Swiss mathematician and was the first of many in the Bernoulli family to pursue a career in science and mathematics. Although compelled to study philosophy and theology against his will by his parents, he travelled extensively after graduating in order to meet with many of the leading scientists of his time, including Boyle and Hooke in England. When he returned to Switzerland, he taught mechanics and became Professor of Mathematics at Basel in 1687. Unfortunately, rivalry between Jacob and his younger brother Johann turned an initially productive collaboration into a bitter and public dispute. Jacob's most significant contributions to mathematics appeared in The Art of Conjecture, published in 1713, eight years after his death, which deals with topics in probability theory including what has become known as the Bernoulli distribution.]
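Equations (2.5)–(2.8) translate directly into a few lines of code. The sketch below is an illustrative addition (not from the book; the data set is hypothetical): it computes µ_ML as the fraction of heads, using only the sufficient statistic ∑_n x_n.

```python
# Minimal sketch: maximum likelihood estimate of mu for a Bernoulli data set,
# equations (2.7)-(2.8). The log likelihood (2.6) depends on the data only
# through the sufficient statistic sum(x_n).

def mu_ml(data):
    """Return the maximum likelihood estimate mu_ML = m / N, equation (2.8)."""
    m = sum(data)            # number of observations with x = 1 ('heads')
    N = len(data)
    return m / N

data = [1, 0, 0, 1, 1, 0, 1, 0, 0, 0]    # hypothetical sequence of coin flips
print(mu_ml(data))                        # -> 0.4, the sample mean (2.7)
```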
Now suppose we flip a coin, say, 3 times and happen to observe 3 heads. Then N = m = 3 and µ_ML = 1. In this case, the maximum likelihood result would predict that all future observations should give heads. Common sense tells us that this is unreasonable, and in fact this is an extreme example of the over-fitting associated with maximum likelihood. We shall see shortly how to arrive at more sensible conclusions through the introduction of a prior distribution over µ.

We can also work out the distribution of the number m of observations of x = 1, given that the data set has size N. This is called the binomial distribution, and from (2.5) we see that it is proportional to µ^m (1 − µ)^{N−m}. In order to obtain the normalization coefficient we note that, out of N coin flips, we have to add up all of the possible ways of obtaining m heads, so that the binomial distribution can be written

    Bin(m|N, µ) = \binom{N}{m} µ^m (1 − µ)^{N−m}              (2.9)

where

    \binom{N}{m} ≡ N! / ((N − m)! m!)                         (2.10)

is the number of ways of choosing m objects out of a total of N identical objects (Exercise 2.3).

[Figure 2.1: histogram plot of the binomial distribution (2.9) as a function of m, for N = 10 and µ = 0.25.]

Figure 2.1 shows a plot of the binomial distribution for N = 10 and µ = 0.25. The mean and variance of the binomial distribution can be found by using the result of Exercise 1.10, which shows that for independent events the mean of the sum is the sum of the means, and the variance of the sum is the sum of the variances. Because m = x_1 + . . . + x_N, and for each observation the mean and variance are given by (2.3) and (2.4), respectively, we have

    E[m] ≡ ∑_{m=0}^{N} m Bin(m|N, µ) = Nµ                     (2.11)

    var[m] ≡ ∑_{m=0}^{N} (m − E[m])^2 Bin(m|N, µ) = Nµ(1 − µ).   (2.12)

These results can also be proved directly using calculus (Exercise 2.4).
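The settings N = 10 and µ = 0.25 from Figure 2.1 can be used to check (2.9)–(2.12) numerically. The sketch below is our own illustration (not from the book) and uses only the Python standard library.

```python
# Minimal sketch: the binomial distribution (2.9)-(2.10) and a numerical check
# of its mean and variance, equations (2.11)-(2.12).
from math import comb

def binomial_pmf(m, N, mu):
    """Bin(m|N, mu) = C(N, m) * mu^m * (1 - mu)^(N - m), equation (2.9)."""
    return comb(N, m) * mu**m * (1 - mu)**(N - m)

N, mu = 10, 0.25
pmf = [binomial_pmf(m, N, mu) for m in range(N + 1)]

mean = sum(m * p for m, p in enumerate(pmf))               # N * mu = 2.5, eq. (2.11)
var = sum((m - mean)**2 * p for m, p in enumerate(pmf))    # N * mu * (1 - mu) = 1.875, eq. (2.12)
print(round(mean, 6), round(var, 6))
```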
2.1.1 The beta distribution

We have seen in (2.8) that the maximum likelihood setting for the parameter µ in the Bernoulli distribution, and hence in the binomial distribution, is given by the fraction of the observations in the data set having x = 1. As we have already noted, this can give severely over-fitted results for small data sets. In order to develop a Bayesian treatment for this problem, we need to introduce a prior distribution p(µ) over the parameter µ. Here we consider a form of prior distribution that has a simple interpretation as well as some useful analytical properties. To motivate this prior, we note that the likelihood function takes the form of a product of factors of the form µ^x (1 − µ)^{1−x}. If we choose a prior proportional to powers of µ and (1 − µ), then the posterior distribution, which is proportional to the product of the prior and the likelihood function, will have the same functional form as the prior. This property is called conjugacy, and we will see several examples of it later in this chapter. We therefore choose a prior, called the beta distribution, given by

    Beta(µ|a, b) = (Γ(a + b) / (Γ(a)Γ(b))) µ^{a−1} (1 − µ)^{b−1}    (2.13)

where Γ(x) is the gamma function defined by (1.141), and the coefficient in (2.13) ensures that the beta distribution is normalized (Exercise 2.5), so that

    ∫_0^1 Beta(µ|a, b) dµ = 1.                                (2.14)

The mean and variance of the beta distribution are given by (Exercise 2.6)

    E[µ] = a / (a + b)                                        (2.15)

    var[µ] = ab / ((a + b)^2 (a + b + 1)).                    (2.16)

The parameters a and b are often called hyperparameters because they control the distribution of the parameter µ.
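A small sketch of our own (not part of the text), using the standard-library gamma function, implements (2.13) and checks the normalization (2.14) and the mean (2.15) numerically for one hyperparameter setting.

```python
# Minimal sketch: the beta density (2.13) via the gamma function, with a crude
# midpoint-rule check of the normalization (2.14) and the mean (2.15).
from math import gamma

def beta_pdf(mu, a, b):
    """Beta(mu|a, b) from equation (2.13)."""
    coef = gamma(a + b) / (gamma(a) * gamma(b))
    return coef * mu ** (a - 1) * (1 - mu) ** (b - 1)

a, b = 2.0, 3.0
n = 100_000                                    # midpoint grid on (0, 1)
grid = [(i + 0.5) / n for i in range(n)]

total = sum(beta_pdf(mu, a, b) for mu in grid) / n        # ~1.0, eq. (2.14)
mean = sum(mu * beta_pdf(mu, a, b) for mu in grid) / n    # ~a/(a+b) = 0.4, eq. (2.15)
print(round(total, 4), round(mean, 4))
```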
Figure 2.2 shows plots of the beta distribution for various values of the hyperparameters.

[Figure 2.2: plots of the beta distribution Beta(µ|a, b) given by (2.13) as a function of µ, for the hyperparameter settings (a = 0.1, b = 0.1), (a = 1, b = 1), (a = 2, b = 3), and (a = 8, b = 4).]

The posterior distribution of µ is now obtained by multiplying the beta prior (2.13) by the binomial likelihood function (2.9) and normalizing. Keeping only the factors that depend on µ, we see that this posterior distribution has the form

    p(µ|m, l, a, b) ∝ µ^{m+a−1} (1 − µ)^{l+b−1}               (2.17)

where l = N − m, and therefore corresponds to the number of 'tails' in the coin example. We see that (2.17) has the same functional dependence on µ as the prior distribution, reflecting the conjugacy properties of the prior with respect to the likelihood function.
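Comparing (2.17) with (2.13) shows that the posterior is again a beta distribution, with parameters a + m and b + l. The sketch below is an illustrative addition (not from the book; the prior hyperparameters are hypothetical): it performs this conjugate update and reports the posterior mean using (2.15).

```python
# Minimal sketch: conjugate update of a beta prior by the binomial likelihood.
# By comparison of (2.17) with (2.13), the posterior is Beta(mu | a + m, b + l),
# where m is the number of heads and l = N - m the number of tails.

def beta_binomial_update(a, b, data):
    """Return posterior hyperparameters (a + m, b + l) for binary data."""
    m = sum(data)            # number of heads (x = 1)
    l = len(data) - m        # number of tails, l = N - m
    return a + m, b + l

a, b = 2.0, 2.0              # hypothetical prior hyperparameters
data = [1, 1, 1]             # the three-heads data set from the text
a_post, b_post = beta_binomial_update(a, b, data)

# Posterior mean from (2.15): pulled back from the maximum likelihood extreme mu_ML = 1.
print(a_post, b_post, a_post / (a_post + b_post))   # -> 5.0 2.0 ~0.714
```

For the three-heads data set discussed earlier, the posterior mean is less extreme than µ_ML = 1, illustrating how the introduction of a prior over µ leads to more sensible conclusions for small data sets.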