. . . , N. Show that the maximum likelihood solution W_ML for the parameter matrix W has the property that each column is given by an expression of the form (3.15), which was the solution for an isotropic noise distribution. Note that this is independent of the covariance matrix Σ. Show that the maximum likelihood solution for Σ is given by

$$\boldsymbol{\Sigma} = \frac{1}{N}\sum_{n=1}^{N}\bigl(\mathbf{t}_n - \mathbf{W}_{\mathrm{ML}}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}_n)\bigr)\bigl(\mathbf{t}_n - \mathbf{W}_{\mathrm{ML}}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}_n)\bigr)^{\mathrm{T}}.\tag{3.109}$$
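As a quick numerical illustration of this result, the sketch below fits a multi-output linear model by least squares and evaluates (3.109). It assumes that (3.15) amounts to applying the Moore–Penrose pseudo-inverse of the design matrix to each target column; the data and basis values are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 50, 4, 3                     # data points, basis functions, target dimensions

Phi = rng.normal(size=(N, M))          # design matrix with rows phi(x_n)^T (placeholder basis)
T = rng.normal(size=(N, K))            # target matrix with rows t_n^T

# Maximum likelihood weights: each column solves an independent least-squares problem,
# i.e. the pseudo-inverse applied column by column (the form assumed for 3.15).
W_ml = np.linalg.pinv(Phi) @ T
col_by_col = np.column_stack([np.linalg.pinv(Phi) @ T[:, k] for k in range(K)])
assert np.allclose(W_ml, col_by_col)   # the columns do not interact

# Maximum likelihood covariance, eq. (3.109).
resid = T - Phi @ W_ml                 # rows are t_n - W_ML^T phi(x_n)
Sigma_ml = resid.T @ resid / N
print(Sigma_ml)
```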
3.7 ( ) By using the technique of completing the square, verify the result (3.49) for the posterior distribution of the parameters w in the linear basis function model in which m_N and S_N are defined by (3.50) and (3.51) respectively.

3.8 ( ) www Consider the linear basis function model in Section 3.1, and suppose that we have already observed N data points, so that the posterior distribution over w is given by (3.49). This posterior can be regarded as the prior for the next observation. By considering an additional data point (x_{N+1}, t_{N+1}), and by completing the square in the exponential, show that the resulting posterior distribution is again given by (3.49) but with S_N replaced by S_{N+1} and m_N replaced by m_{N+1}.
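The sequential view taken in Exercise 3.8 can be illustrated numerically: updating the Gaussian posterior one point at a time gives the same (m_N, S_N) as processing all points in a single batch. The sketch assumes the updates m_N = S_N(S_0^{-1}m_0 + βΦ^T t) and S_N^{-1} = S_0^{-1} + βΦ^TΦ corresponding to (3.50) and (3.51); the basis functions and data are placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
beta = 25.0                                   # assumed known noise precision
M = 3                                         # number of basis functions

def phi(x):
    """Placeholder polynomial basis phi(x) = (1, x, x^2)^T."""
    return np.array([1.0, x, x**2])

x = rng.uniform(-1, 1, size=20)
t = np.sin(np.pi * x) + rng.normal(scale=beta**-0.5, size=x.size)
Phi = np.stack([phi(xn) for xn in x])

m0, S0 = np.zeros(M), np.eye(M)               # prior N(w | m0, S0)

def posterior(m_prior, S_prior, Phi_new, t_new):
    """Gaussian posterior update (assumed forms of 3.50-3.51), previous posterior acting as prior."""
    S_inv = np.linalg.inv(S_prior) + beta * Phi_new.T @ Phi_new
    S = np.linalg.inv(S_inv)
    m = S @ (np.linalg.inv(S_prior) @ m_prior + beta * Phi_new.T @ t_new)
    return m, S

# Batch update using all N points at once.
m_batch, S_batch = posterior(m0, S0, Phi, t)

# Sequential update, one point at a time, each posterior becoming the next prior.
m_seq, S_seq = m0, S0
for phi_n, t_n in zip(Phi, t):
    m_seq, S_seq = posterior(m_seq, S_seq, phi_n[None, :], np.array([t_n]))

assert np.allclose(m_batch, m_seq) and np.allclose(S_batch, S_seq)
```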
3.9 ( ) Repeat the previous exercise but instead of completing the square by hand, make use of the general result for linear-Gaussian models given by (2.116).

3.10 ( ) www By making use of the result (2.115) to evaluate the integral in (3.57), verify that the predictive distribution for the Bayesian linear regression model is given by (3.58) in which the input-dependent variance is given by (3.59).

3.11 ( ) We have seen that, as the size of a data set increases, the uncertainty associated with the posterior distribution over model parameters decreases. Make use of the matrix identity (Appendix C)

$$\bigl(\mathbf{M} + \mathbf{v}\mathbf{v}^{\mathrm{T}}\bigr)^{-1} = \mathbf{M}^{-1} - \frac{\bigl(\mathbf{M}^{-1}\mathbf{v}\bigr)\bigl(\mathbf{v}^{\mathrm{T}}\mathbf{M}^{-1}\bigr)}{1 + \mathbf{v}^{\mathrm{T}}\mathbf{M}^{-1}\mathbf{v}}\tag{3.110}$$

to show that the uncertainty σ_N^2(x) associated with the linear regression function given by (3.59) satisfies

$$\sigma_{N+1}^2(\mathbf{x}) \leqslant \sigma_N^2(\mathbf{x}).\tag{3.111}$$
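A minimal numerical check of (3.111), assuming the predictive variance takes the form σ_N^2(x) = 1/β + φ(x)^T S_N φ(x) from (3.59) with the zero-mean isotropic prior, so that S_N^{-1} = αI + βΦ^TΦ as in (3.54); the basis and data below are placeholders. Adding one more observation never increases the predictive variance at any test point.

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, beta = 2.0, 25.0                      # prior and noise precisions
M = 3

def phi(x):
    """Placeholder polynomial basis phi(x) = (1, x, x^2)^T."""
    return np.array([1.0, x, x**2])

def pred_var(Phi, x_star):
    """Predictive variance sigma_N^2(x), assumed form of (3.59), with S_N^{-1} = alpha I + beta Phi^T Phi."""
    S_inv = alpha * np.eye(M) + beta * Phi.T @ Phi
    S_N = np.linalg.inv(S_inv)
    p = phi(x_star)
    return 1.0 / beta + p @ S_N @ p

x = rng.uniform(-1, 1, size=30)
Phi_N = np.stack([phi(xn) for xn in x[:-1]])     # first N points
Phi_N1 = np.stack([phi(xn) for xn in x])         # all N+1 points

for x_star in np.linspace(-1, 1, 7):
    assert pred_var(Phi_N1, x_star) <= pred_var(Phi_N, x_star) + 1e-12
```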
3.12 ( ) We saw in Section 2.3.6 that the conjugate prior for a Gaussian distribution with unknown mean and unknown precision (inverse variance) is a normal-gamma distribution. This property also holds for the case of the conditional Gaussian distribution p(t|x, w, β) of the linear regression model. If we consider the likelihood function (3.10), then the conjugate prior for w and β is given by

$$p(\mathbf{w}, \beta) = \mathcal{N}(\mathbf{w}|\mathbf{m}_0, \beta^{-1}\mathbf{S}_0)\,\mathrm{Gam}(\beta|a_0, b_0).\tag{3.112}$$

Show that the corresponding posterior distribution takes the same functional form, so that

$$p(\mathbf{w}, \beta|\mathbf{t}) = \mathcal{N}(\mathbf{w}|\mathbf{m}_N, \beta^{-1}\mathbf{S}_N)\,\mathrm{Gam}(\beta|a_N, b_N)\tag{3.113}$$

and find expressions for the posterior parameters m_N, S_N, a_N, and b_N.
3.13 ( ) Show that the predictive distribution p(t|x, t) for the model discussed in Exercise 3.12 is given by a Student's t-distribution of the form

$$p(t|\mathbf{x}, \mathbf{t}) = \mathrm{St}(t|\mu, \lambda, \nu)\tag{3.114}$$

and obtain expressions for µ, λ and ν.

3.14 ( ) In this exercise, we explore in more detail the properties of the equivalent kernel defined by (3.62), where S_N is defined by (3.54). Suppose that the basis functions φ_j(x) are linearly independent and that the number N of data points is greater than the number M of basis functions. Furthermore, let one of the basis functions be constant, say φ_0(x) = 1. By taking suitable linear combinations of these basis functions, we can construct a new basis set ψ_j(x) spanning the same space but that are orthonormal, so that

$$\sum_{n=1}^{N}\psi_j(\mathbf{x}_n)\psi_k(\mathbf{x}_n) = I_{jk}\tag{3.115}$$

where I_jk is defined to be 1 if j = k and 0 otherwise, and we take ψ_0(x) = 1. Show that for α = 0, the equivalent kernel can be written as k(x, x') = ψ(x)^T ψ(x') where ψ = (ψ_1, . . . , ψ_M)^T. Use this result to show that the kernel satisfies the summation constraint

$$\sum_{n=1}^{N}k(\mathbf{x}, \mathbf{x}_n) = 1.\tag{3.116}$$
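A small numerical illustration of the summation constraint (3.116), assuming the equivalent kernel has the form k(x, x') = β φ(x)^T S_N φ(x') from (3.62) with S_N^{-1} = αI + βΦ^TΦ from (3.54); the basis and inputs are placeholders. With α close to zero, more data points than basis functions, and one constant basis function, the kernel values at the training inputs sum to one at any test point.

```python
import numpy as np

rng = np.random.default_rng(3)
beta = 10.0
alpha = 1e-8                                  # alpha -> 0, as in the exercise

def phi(x):
    """Placeholder basis with a constant component: phi(x) = (1, x, x^2, x^3)^T."""
    return np.array([1.0, x, x**2, x**3])

x_train = rng.uniform(-1, 1, size=25)         # N > M, linearly independent basis functions
Phi = np.stack([phi(xn) for xn in x_train])
S_N = np.linalg.inv(alpha * np.eye(4) + beta * Phi.T @ Phi)   # assumed form of (3.54)

def k(x, x_prime):
    """Equivalent kernel, assumed form of (3.62): k(x, x') = beta * phi(x)^T S_N phi(x')."""
    return beta * phi(x) @ S_N @ phi(x_prime)

for x_star in np.linspace(-1, 1, 5):
    total = sum(k(x_star, xn) for xn in x_train)
    assert np.isclose(total, 1.0, atol=1e-6)  # summation constraint (3.116)
```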
3.15 ( ) www Consider a linear basis function model for regression in which the parameters α and β are set using the evidence framework. Show that the function E(m_N) defined by (3.82) satisfies the relation 2E(m_N) = N.

3.16 ( ) Derive the result (3.86) for the log evidence function p(t|α, β) of the linear regression model by making use of (2.115) to evaluate the integral (3.77) directly.

3.17 ( ) Show that the evidence function for the Bayesian linear regression model can be written in the form (3.78) in which E(w) is defined by (3.79).

3.18 ( ) www By completing the square over w, show that the error function (3.79) in Bayesian linear regression can be written in the form (3.80).

3.19 ( ) Show that the integration over w in the Bayesian linear regression model gives the result (3.85). Hence show that the log marginal likelihood is given by (3.86).
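The identity asked for in Exercises 3.16–3.19 can also be checked numerically: under the prior w ~ N(0, α^{-1}I), the marginal distribution of t is Gaussian with zero mean and covariance β^{-1}I + α^{-1}ΦΦ^T, and its log density should agree with the closed form. The sketch below assumes that (3.86) reads ln p(t|α, β) = (M/2) ln α + (N/2) ln β − E(m_N) − (1/2) ln|A| − (N/2) ln 2π with A = αI + βΦ^TΦ, m_N = βA^{-1}Φ^T t, and E(m_N) = (β/2)‖t − Φm_N‖² + (α/2) m_N^T m_N; the data and basis are placeholders.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(4)
alpha, beta = 2.0, 25.0
N, M = 30, 4

Phi = rng.normal(size=(N, M))                      # placeholder design matrix
t = rng.normal(size=N)                             # placeholder targets

# Closed-form log evidence (assumed form of 3.86).
A = alpha * np.eye(M) + beta * Phi.T @ Phi
m_N = beta * np.linalg.solve(A, Phi.T @ t)
E_mN = 0.5 * beta * np.sum((t - Phi @ m_N) ** 2) + 0.5 * alpha * m_N @ m_N
log_ev = (0.5 * M * np.log(alpha) + 0.5 * N * np.log(beta) - E_mN
          - 0.5 * np.linalg.slogdet(A)[1] - 0.5 * N * np.log(2 * np.pi))

# Direct evaluation: t is marginally Gaussian with zero mean and
# covariance beta^{-1} I + alpha^{-1} Phi Phi^T.
cov = np.eye(N) / beta + Phi @ Phi.T / alpha
log_ev_direct = multivariate_normal(mean=np.zeros(N), cov=cov).logpdf(t)

assert np.isclose(log_ev, log_ev_direct)
```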
3.20 ( ) www Starting from (3.86) verify all of the steps needed to show that maximization of the log marginal likelihood function (3.86) with respect to α leads to the re-estimation equation (3.92).

3.21 ( ) An alternative way to derive the result (3.92) for the optimal value of α in the evidence framework is to make use of the identity

$$\frac{d}{d\alpha}\ln|\mathbf{A}| = \mathrm{Tr}\!\left(\mathbf{A}^{-1}\frac{d}{d\alpha}\mathbf{A}\right).\tag{3.117}$$

Prove this identity by considering the eigenvalue expansion of a real, symmetric matrix A, and making use of the standard results for the determinant and trace of A expressed in terms of its eigenvalues (Appendix C). Then make use of (3.117) to derive (3.92) starting from (3.86).
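As an illustration of the re-estimation equation derived in Exercises 3.20 and 3.21, the sketch below iterates what is assumed to be the update (3.92), namely α ← γ / (m_N^T m_N) with γ = Σ_i λ_i / (α + λ_i), where λ_i are the eigenvalues of βΦ^TΦ, and checks that a self-consistent fixed point is reached. The data are placeholders and β is held fixed for simplicity.

```python
import numpy as np

rng = np.random.default_rng(5)
beta = 25.0                                     # beta held fixed for simplicity
N, M = 40, 5

Phi = rng.normal(size=(N, M))                   # placeholder design matrix
w_true = rng.normal(size=M)
t = Phi @ w_true + rng.normal(scale=beta**-0.5, size=N)

lam = np.linalg.eigvalsh(beta * Phi.T @ Phi)    # eigenvalues of beta * Phi^T Phi

def m_and_gamma(alpha):
    A = alpha * np.eye(M) + beta * Phi.T @ Phi
    m_N = beta * np.linalg.solve(A, Phi.T @ t)  # posterior mean (assumed form of 3.84)
    gamma = np.sum(lam / (alpha + lam))         # effective number of parameters (assumed form of 3.91)
    return m_N, gamma

alpha = 1.0                                     # initial guess
for _ in range(100):
    m_N, gamma = m_and_gamma(alpha)
    alpha = gamma / (m_N @ m_N)                 # re-estimation step (assumed form of 3.92)

# At convergence the fixed-point relation alpha = gamma / (m_N^T m_N) is self-consistent.
m_N, gamma = m_and_gamma(alpha)
assert np.isclose(alpha, gamma / (m_N @ m_N), rtol=1e-3)
print("converged alpha:", alpha)
```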
3.22 ( ) Starting from (3.86) verify all of the steps needed to show that maximization of the log marginal likelihood function (3.86) with respect to β leads to the re-estimation equation (3.95).

3.23 ( ) www Show that the marginal probability of the data, in other words the model evidence, for the model described in Exercise 3.12 is given by

$$p(\mathbf{t}) = \frac{1}{(2\pi)^{N/2}}\,\frac{b_0^{a_0}}{b_N^{a_N}}\,\frac{\Gamma(a_N)}{\Gamma(a_0)}\,\frac{|\mathbf{S}_N|^{1/2}}{|\mathbf{S}_0|^{1/2}}\tag{3.118}$$

by first marginalizing with respect to w and then with respect to β.

3.24 ( ) Repeat the previous exercise but now use Bayes' theorem in the form

$$p(\mathbf{t}) = \frac{p(\mathbf{t}|\mathbf{w}, \beta)\,p(\mathbf{w}, \beta)}{p(\mathbf{w}, \beta|\mathbf{t})}\tag{3.119}$$

and then substitute for the prior and posterior distributions and the likelihood function in order to derive the result (3.118).

4. Linear Models for Classification

In the previous chapter, we explored a class of regression models having particularly simple analytical and computational properties. We now discuss an analogous class of models for solving classification problems.
The goal in classification is to take an input vector x and to assign it to one of K discrete classes C_k where k = 1, . . . , K. In the most common scenario, the classes are taken to be disjoint, so that each input is assigned to one and only one class. The input space is thereby divided into decision regions whose boundaries are called decision boundaries or decision surfaces. In this chapter, we consider linear models for classification, by which we mean that the decision surfaces are linear functions of the input vector x and hence are defined by (D − 1)-dimensional hyperplanes within the D-dimensional input space. Data sets whose classes can be separated exactly by linear decision surfaces are said to be linearly separable.

For regression problems, the target variable t was simply the vector of real numbers whose values we wish to predict.
In the case of classification, there are various ways of using target values to represent class labels. For probabilistic models, the most convenient, in the case of two-class problems, is the binary representation in which there is a single target variable t ∈ {0, 1} such that t = 1 represents class C_1 and t = 0 represents class C_2. We can interpret the value of t as the probability that the class is C_1, with the values of probability taking only the extreme values of 0 and 1. For K > 2 classes, it is convenient to use a 1-of-K coding scheme in which t is a vector of length K such that if the class is C_j, then all elements t_k of t are zero except element t_j, which takes the value 1.
For instance, if we have K = 5 classes, then a pattern from class 2 would be given the target vector

$$\mathbf{t} = (0, 1, 0, 0, 0)^{\mathrm{T}}.\tag{4.1}$$

Again, we can interpret the value of t_k as the probability that the class is C_k. For nonprobabilistic models, alternative choices of target variable representation will sometimes prove convenient.
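A minimal sketch of the 1-of-K coding scheme, using NumPy purely for illustration (the class labels and the value of K are placeholders):

```python
import numpy as np

K = 5
labels = np.array([1, 0, 3, 1, 4])        # class indices C_1..C_5 stored as 0..4

# 1-of-K (one-hot) targets: row n has a single 1 in the column of the class of pattern n.
T = np.eye(K)[labels]
print(T[0])                                # pattern from class 2 -> [0. 1. 0. 0. 0.], cf. (4.1)
```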
In Chapter 1, we identified three distinct approaches to the classification problem. The simplest involves constructing a discriminant function that directly assigns each vector x to a specific class. A more powerful approach, however, models the conditional probability distribution p(C_k|x) in an inference stage, and then subsequently uses this distribution to make optimal decisions. By separating inference and decision, we gain numerous benefits, as discussed in Section 1.5.4. There are two different approaches to determining the conditional probabilities p(C_k|x).
One technique is to model them directly, for example by representing them as parametric models and then optimizing the parameters using a training set. Alternatively, we can adopt a generative approach in which we model the class-conditional densities given by p(x|C_k), together with the prior probabilities p(C_k) for the classes, and then we compute the required posterior probabilities using Bayes' theorem

$$p(C_k|\mathbf{x}) = \frac{p(\mathbf{x}|C_k)\,p(C_k)}{p(\mathbf{x})}.\tag{4.2}$$

We shall discuss examples of all three approaches in this chapter.
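The generative route in (4.2) can be sketched concretely as follows. The class-conditional densities are taken, purely as an illustrative assumption, to be one-dimensional Gaussians with placeholder parameters, and the posterior class probabilities are obtained by normalizing the products of densities and priors.

```python
import numpy as np
from scipy.stats import norm

# Illustrative class-conditional densities p(x|C_k) and priors p(C_k) for K = 2 classes.
class_conditionals = [norm(loc=-1.0, scale=1.0), norm(loc=2.0, scale=1.5)]
priors = np.array([0.6, 0.4])

def posterior(x):
    """Posterior class probabilities p(C_k|x) via Bayes' theorem, eq. (4.2)."""
    joint = np.array([p.pdf(x) for p in class_conditionals]) * priors   # p(x|C_k) p(C_k)
    return joint / joint.sum()                                          # divide by p(x)

print(posterior(0.0))    # probabilities sum to one; the larger entry gives the predicted class
```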
In the linear regression models considered in Chapter 3, the model prediction y(x, w) was given by a linear function of the parameters w. In the simplest case, the model is also linear in the input variables and therefore takes the form y(x) = w^T x + w_0, so that y is a real number. For classification problems, however, we wish to predict discrete class labels, or more generally posterior probabilities that lie in the range (0, 1). To achieve this, we consider a generalization of this model in which we transform the linear function of w using a nonlinear function f(·) so that

$$y(\mathbf{x}) = f\!\left(\mathbf{w}^{\mathrm{T}}\mathbf{x} + w_0\right).\tag{4.3}$$

In the machine learning literature f(·) is known as an activation function, whereas its inverse is called a link function in the statistics literature. The decision surfaces correspond to y(x) = constant, so that w^T x + w_0 = constant and hence the decision surfaces are linear functions of x, even if the function f(·) is nonlinear. For this reason, the class of models described by (4.3) are called generalized linear models
(McCullagh and Nelder, 1989). Note, however, that in contrast to the models used for regression, they are no longer linear in the parameters due to the presence of the nonlinear function f(·). This will lead to more complex analytical and computational properties than for linear regression models. Nevertheless, these models are still relatively simple compared to the more general nonlinear models that will be studied in subsequent chapters.

The algorithms discussed in this chapter will be equally applicable if we first make a fixed nonlinear transformation of the input variables using a vector of basis functions φ(x) as we did for regression models in Chapter 3. We begin by considering classification directly in the original input space x, while in Section 4.3 we shall find it convenient to switch to a notation involving basis functions for consistency with later chapters.

4.1. Discriminant Functions

A discriminant is a function that takes an input vector x and assigns it to one of K classes, denoted C_k.
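To make the connection between (4.3) and a discriminant concrete, the following sketch uses the logistic sigmoid as an illustrative choice of activation function f(·) for a two-class problem; the weights are placeholders, and the class assignment simply thresholds y(x) at 1/2, which corresponds to the linear decision surface w^T x + w_0 = 0.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid, an illustrative activation function f(.)."""
    return 1.0 / (1.0 + np.exp(-a))

# Placeholder parameters of the generalized linear model y(x) = f(w^T x + w0), eq. (4.3).
w = np.array([2.0, -1.0])
w0 = 0.5

def y(x):
    return sigmoid(w @ x + w0)

def classify(x):
    """Assign x to class C_1 if y(x) > 1/2, i.e. if w^T x + w0 > 0, otherwise to class C_2."""
    return "C1" if y(x) > 0.5 else "C2"

print(classify(np.array([1.0, 0.0])), classify(np.array([-1.0, 0.0])))
```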