Bishop C.M., Pattern Recognition and Machine Learning (2006)
The simplest approach is to use a histogram of observations in which the angular coordinate is divided into fixed bins. This has the virtue of simplicity and flexibility but also suffers from significant limitations, as we shall see when we discuss histogram methods in more detail in Section 2.5. Another approach starts, like the von Mises distribution, from a Gaussian distribution over a Euclidean space but now marginalizes onto the unit circle rather than conditioning (Mardia and Jupp, 2000). However, this leads to more complex forms of distribution and will not be discussed further.
Finally, any valid distribution over the real axis (such as a Gaussian) can be turned into a periodic distribution by mapping successive intervals of width 2π onto the periodic variable (0, 2π), which corresponds to ‘wrapping’ the real axis around the unit circle. Again, the resulting distribution is more complex to handle than the von Mises distribution.

One limitation of the von Mises distribution is that it is unimodal. By forming mixtures of von Mises distributions, we obtain a flexible framework for modelling periodic variables that can handle multimodality. For an example of a machine learning application that makes use of von Mises distributions, see Lawrence et al. (2002), and for extensions to modelling conditional densities for regression problems, see Bishop and Nabney (1996).
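As an editorial illustration of this multimodal flexibility (not part of the original text), the following Python sketch evaluates a mixture of von Mises densities on (0, 2π); the component parameters and helper names are chosen purely for the example.

```python
import numpy as np
from scipy.special import i0  # modified Bessel function of the first kind, order 0

def von_mises_pdf(theta, mu, kappa):
    # von Mises density: exp(kappa * cos(theta - mu)) / (2 * pi * I0(kappa))
    return np.exp(kappa * np.cos(theta - mu)) / (2.0 * np.pi * i0(kappa))

def von_mises_mixture_pdf(theta, weights, mus, kappas):
    # Linear superposition of von Mises components; the weights sum to one.
    return sum(w * von_mises_pdf(theta, m, k)
               for w, m, k in zip(weights, mus, kappas))

# A bimodal density on the circle built from two components (illustrative values).
theta = np.linspace(0.0, 2.0 * np.pi, 400, endpoint=False)
p = von_mises_mixture_pdf(theta, weights=[0.4, 0.6], mus=[1.0, 4.0], kappas=[5.0, 2.0])
print(np.sum(p) * (theta[1] - theta[0]))  # approximately 1: the mixture is normalized
```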
2.3.9 Mixtures of Gaussians

While the Gaussian distribution has some important analytical properties, it suffers from significant limitations when it comes to modelling real data sets.
Consider the example shown in Figure 2.21. This is known as the ‘Old Faithful’ data set, and comprises 272 measurements of the eruption of the Old Faithful geyser at Yellowstone National Park in the USA. Each measurement comprises the duration of the eruption in minutes (horizontal axis) and the time in minutes to the next eruption (vertical axis).

[Figure 2.22: Example of a Gaussian mixture distribution p(x) in one dimension showing three Gaussians (each scaled by a coefficient) in blue and their sum in red.]
We see that the data set forms two dominant clumps, and that a simple Gaussian distribution is unable to capture this structure, whereas a linear superposition of two Gaussians gives a better characterization of the data set.

Such superpositions, formed by taking linear combinations of more basic distributions such as Gaussians, can be formulated as probabilistic models known as mixture distributions (McLachlan and Basford, 1988; McLachlan and Peel, 2000). In Figure 2.22 we see that a linear combination of Gaussians can give rise to very complex densities. By using a sufficient number of Gaussians, and by adjusting their means and covariances as well as the coefficients in the linear combination, almost any continuous density can be approximated to arbitrary accuracy.

We therefore consider a superposition of K Gaussian densities of the form

    p(x) = \sum_{k=1}^{K} \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)        (2.188)

which is called a mixture of Gaussians. Each Gaussian density \mathcal{N}(x \mid \mu_k, \Sigma_k) is called a component of the mixture and has its own mean µk and covariance Σk. Contour and surface plots for a Gaussian mixture having 3 components are shown in Figure 2.23.

In this section we shall consider Gaussian components to illustrate the framework of mixture models.
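As a minimal numerical sketch (an editorial addition, with illustrative parameter values), the mixture density (2.188) can be evaluated in one dimension as follows.

```python
import numpy as np
from scipy.stats import norm

def gaussian_mixture_pdf(x, weights, means, stds):
    # p(x) = sum_k pi_k * N(x | mu_k, sigma_k^2), the one-dimensional case of (2.188)
    return sum(w * norm.pdf(x, loc=m, scale=s)
               for w, m, s in zip(weights, means, stds))

x = np.linspace(-8.0, 8.0, 1000)
p = gaussian_mixture_pdf(x,
                         weights=[0.5, 0.3, 0.2],  # mixing coefficients, summing to 1
                         means=[-2.0, 0.0, 3.0],
                         stds=[0.5, 1.0, 0.8])
print(np.sum(p) * (x[1] - x[0]))  # approximately 1: the mixture integrates to unity
```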
More generally, mixture models can comprise linear combinations of other distributions. For instance, in Section 9.3.3 we shall consider mixtures of Bernoulli distributions as an example of a mixture model for discrete variables.

The parameters πk in (2.188) are called mixing coefficients. If we integrate both sides of (2.188) with respect to x, and note that both p(x) and the individual Gaussian components are normalized, we obtain

    \sum_{k=1}^{K} \pi_k = 1.        (2.189)

Also, the requirement that p(x) ≥ 0, together with \mathcal{N}(x \mid \mu_k, \Sigma_k) ≥ 0, implies πk ≥ 0 for all k. Combining this with the condition (2.189) we obtain

    0 \leq \pi_k \leq 1.        (2.190)

[Figure 2.23: Illustration of a mixture of 3 Gaussians in a two-dimensional space. (a) Contours of constant density for each of the mixture components, in which the 3 components are denoted red, blue and green, and the values of the mixing coefficients are shown below each component. (b) Contours of the marginal probability density p(x) of the mixture distribution. (c) A surface plot of the distribution p(x).]

We therefore see that the mixing coefficients satisfy the requirements to be probabilities. From the sum and product rules, the marginal density is given by

    p(x) = \sum_{k=1}^{K} p(k)\, p(x \mid k)        (2.191)

which is equivalent to (2.188) in which we can view πk = p(k) as the prior probability of picking the kth component, and the density \mathcal{N}(x \mid \mu_k, \Sigma_k) = p(x \mid k) as the probability of x conditioned on k. As we shall see in later chapters, an important role is played by the posterior probabilities p(k|x), which are also known as responsibilities.
From Bayes’ theorem these are given by

    \gamma_k(x) \equiv p(k \mid x) = \frac{p(k)\, p(x \mid k)}{\sum_l p(l)\, p(x \mid l)} = \frac{\pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_l \pi_l \mathcal{N}(x \mid \mu_l, \Sigma_l)}.        (2.192)

We shall discuss the probabilistic interpretation of the mixture distribution in greater detail in Chapter 9.
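To make the responsibilities concrete, the following sketch (an editorial addition with arbitrary parameter values) evaluates (2.192) for a one-dimensional two-component mixture.

```python
import numpy as np
from scipy.stats import norm

def responsibilities(x, weights, means, stds):
    # gamma_k(x) = pi_k N(x|mu_k, sigma_k^2) / sum_l pi_l N(x|mu_l, sigma_l^2), cf. (2.192)
    weighted = np.array([w * norm.pdf(x, loc=m, scale=s)
                         for w, m, s in zip(weights, means, stds)])
    return weighted / weighted.sum(axis=0)

gamma = responsibilities(x=1.5, weights=[0.5, 0.5], means=[0.0, 3.0], stds=[1.0, 1.0])
print(gamma, gamma.sum())  # posterior probabilities over components; they sum to 1
```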
The form of the Gaussian mixture distribution is governed by the parameters π, µ and Σ, where we have used the notation π ≡ {π1, . . . , πK}, µ ≡ {µ1, . . . , µK} and Σ ≡ {Σ1, . . . , ΣK}. One way to set the values of these parameters is to use maximum likelihood. From (2.188) the log of the likelihood function is given by

    \ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\}        (2.193)

where X = {x1, . . . , xN}. We immediately see that the situation is now much more complex than with a single Gaussian, due to the presence of the summation over k inside the logarithm. As a result, the maximum likelihood solution for the parameters no longer has a closed-form analytical solution. One approach to maximizing the likelihood function is to use iterative numerical optimization techniques (Fletcher, 1987; Nocedal and Wright, 1999; Bishop and Nabney, 2008).
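The log likelihood (2.193) itself is straightforward to evaluate; the sketch below (an editorial addition, one-dimensional, with illustrative data and parameters) uses the log-sum-exp trick for numerical stability and leaves the iterative optimization aside.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

def gmm_log_likelihood(X, weights, means, stds):
    # ln p(X | pi, mu, Sigma) = sum_n ln { sum_k pi_k N(x_n | mu_k, sigma_k^2) }, cf. (2.193)
    log_terms = np.log(weights) + norm.logpdf(np.asarray(X)[:, None],
                                              loc=means, scale=stds)
    return logsumexp(log_terms, axis=1).sum()

X = [-2.1, -1.9, 0.2, 3.0, 3.2]  # toy one-dimensional data
print(gmm_log_likelihood(X, weights=[0.4, 0.6], means=[-2.0, 3.0], stds=[0.5, 0.5]))
```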
Alternatively we can employ a powerful framework called expectation maximization, which will be discussed at length in Chapter 9.

2.4. The Exponential Family

The probability distributions that we have studied so far in this chapter (with the exception of the Gaussian mixture) are specific examples of a broad class of distributions called the exponential family (Duda and Hart, 1973; Bernardo and Smith, 1994). Members of the exponential family have many important properties in common, and it is illuminating to discuss these properties in some generality.

The exponential family of distributions over x, given parameters η, is defined to be the set of distributions of the form

    p(x \mid \eta) = h(x)\, g(\eta) \exp\{\eta^{\mathrm{T}} u(x)\}        (2.194)

where x may be scalar or vector, and may be discrete or continuous. Here η are called the natural parameters of the distribution, and u(x) is some function of x. The function g(η) can be interpreted as the coefficient that ensures that the distribution is normalized and therefore satisfies

    g(\eta) \int h(x) \exp\{\eta^{\mathrm{T}} u(x)\}\, \mathrm{d}x = 1        (2.195)

where the integration is replaced by summation if x is a discrete variable.
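A quick numerical check of (2.195), added editorially: take the exponential distribution as a member of the family (2.194), with h(x) = 1 for x ≥ 0, u(x) = x and natural parameter η < 0, so that g(η) = −η.

```python
import numpy as np
from scipy.integrate import quad

eta = -1.5       # natural parameter (must be negative here)
g = -eta         # normalization coefficient g(eta) for this family member

# Numerically evaluate g(eta) * integral of h(x) exp(eta * x) over x >= 0, cf. (2.195).
integral, _ = quad(lambda x: np.exp(eta * x), 0.0, np.inf)
print(g * integral)  # approximately 1
```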
We begin by taking some examples of the distributions introduced earlier in the chapter and showing that they are indeed members of the exponential family. Consider first the Bernoulli distribution

    p(x \mid \mu) = \mathrm{Bern}(x \mid \mu) = \mu^x (1 - \mu)^{1-x}.        (2.196)

Expressing the right-hand side as the exponential of the logarithm, we have

    p(x \mid \mu) = \exp\{x \ln \mu + (1 - x)\ln(1 - \mu)\}
                  = (1 - \mu) \exp\left\{ \ln\left(\frac{\mu}{1 - \mu}\right) x \right\}.        (2.197)

Comparison with (2.194) allows us to identify

    \eta = \ln\left(\frac{\mu}{1 - \mu}\right)        (2.198)

which we can solve for µ to give µ = σ(η), where

    \sigma(\eta) = \frac{1}{1 + \exp(-\eta)}        (2.199)

is called the logistic sigmoid function. Thus we can write the Bernoulli distribution using the standard representation (2.194) in the form

    p(x \mid \eta) = \sigma(-\eta) \exp(\eta x)        (2.200)

where we have used 1 − σ(η) = σ(−η), which is easily proved from (2.199). Comparison with (2.194) shows that

    u(x) = x        (2.201)
    h(x) = 1        (2.202)
    g(\eta) = \sigma(-\eta).        (2.203)
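As a short editorial check, the exponential-family form (2.200) can be verified numerically against the standard Bernoulli form (2.196) once η is set according to (2.198); the value of µ below is arbitrary.

```python
import numpy as np

def sigmoid(eta):
    # logistic sigmoid, (2.199)
    return 1.0 / (1.0 + np.exp(-eta))

mu = 0.3
eta = np.log(mu / (1.0 - mu))  # natural parameter, (2.198)

for x in (0, 1):
    bern = mu**x * (1.0 - mu)**(1 - x)        # standard form, (2.196)
    expfam = sigmoid(-eta) * np.exp(eta * x)  # exponential-family form, (2.200)
    print(x, bern, expfam)                    # the two representations agree
```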
Next consider the multinomial distribution that, for a single observation x, takes the form

    p(x \mid \mu) = \prod_{k=1}^{M} \mu_k^{x_k} = \exp\left\{ \sum_{k=1}^{M} x_k \ln \mu_k \right\}        (2.204)

where x = (x1, . . . , xM)^T. Again, we can write this in the standard representation (2.194) so that

    p(x \mid \eta) = \exp(\eta^{\mathrm{T}} x)        (2.205)

where ηk = ln µk, and we have defined η = (η1, . . . , ηM)^T. Again, comparing with (2.194) we have

    u(x) = x        (2.206)
    h(x) = 1        (2.207)
    g(\eta) = 1.        (2.208)

Note that the parameters ηk are not independent because the parameters µk are subject to the constraint

    \sum_{k=1}^{M} \mu_k = 1        (2.209)

so that, given any M − 1 of the parameters µk, the value of the remaining parameter is fixed.
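Similarly, an editorial numerical check of (2.205): with ηk = ln µk the exponential-family form reproduces the multinomial probability of a one-hot observation; the parameter values are illustrative.

```python
import numpy as np

mu = np.array([0.2, 0.5, 0.3])  # parameters satisfying the constraint (2.209)
eta = np.log(mu)                # natural parameters, eta_k = ln mu_k

x = np.array([0, 1, 0])         # a single observation in one-hot form
print(np.prod(mu**x))           # standard form (2.204): gives 0.5
print(np.exp(eta @ x))          # exponential-family form (2.205): also 0.5
```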