In fact, we can interpret IRLS as the solution to a linearized problem in the space of the variable a = w^T φ. The quantity z_n, which corresponds to the nth element of z, can then be given a simple interpretation as an effective target value in this space, obtained by making a local linear approximation to the logistic sigmoid function around the current operating point w^(old):

a_n(w) \simeq a_n(w^{(\text{old})}) + \left.\frac{da_n}{dy_n}\right|_{w^{(\text{old})}} (t_n - y_n) = \phi_n^T w^{(\text{old})} - \frac{y_n - t_n}{y_n(1 - y_n)} = z_n.    (4.103)
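As a concrete illustration (a minimal NumPy sketch, not from the text; names such as irls_step and Phi are illustrative), one IRLS update can be written directly in terms of the effective targets z_n of (4.103), using the diagonal weighting matrix R and the weighted least-squares form of the update introduced earlier in this section:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls_step(Phi, t, w_old):
    """One IRLS update for two-class logistic regression (sketch).

    Phi   : (N, M) design matrix whose rows are the basis vectors phi_n
    t     : (N,)  binary targets t_n in {0, 1}
    w_old : (M,)  current weight vector w^(old)
    """
    y = sigmoid(Phi @ w_old)          # y_n = sigma(w^T phi_n)
    r = y * (1.0 - y)                 # diagonal of the weighting matrix R
    z = Phi @ w_old - (y - t) / r     # effective targets z_n, eq. (4.103)
    # weighted least-squares solution of the linearized problem
    A = Phi.T @ (Phi * r[:, None])    # Phi^T R Phi
    b = Phi.T @ (r * z)               # Phi^T R z
    return np.linalg.solve(A, b)      # w^(new)
```

Iterating this update to convergence gives the IRLS solution; the division by y_n(1 − y_n) assumes that no y_n has saturated exactly at 0 or 1.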
4.3.4 Multiclass logistic regression

In our discussion of generative models for multiclass classification (Section 4.2), we have seen that for a large class of distributions the posterior probabilities are given by a softmax transformation of linear functions of the feature variables, so that

p(C_k|\phi) = y_k(\phi) = \frac{\exp(a_k)}{\sum_j \exp(a_j)}    (4.104)

where the 'activations' a_k are given by

a_k = w_k^T \phi.    (4.105)

There we used maximum likelihood to determine separately the class-conditional densities and the class priors and then found the corresponding posterior probabilities using Bayes' theorem, thereby implicitly determining the parameters {w_k}.
Here we consider the use of maximum likelihood to determine the parameters {w_k} of this model directly. To do this, we will require the derivatives of y_k with respect to all of the activations a_j. These are given (Exercise 4.17) by

\frac{\partial y_k}{\partial a_j} = y_k (I_{kj} - y_j)    (4.106)

where I_{kj} are the elements of the identity matrix.
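The softmax and its derivatives (4.106) are easy to check numerically. The following sketch (NumPy; names are illustrative) compares the analytic Jacobian y_k(I_kj − y_j) with a central finite-difference estimate:

```python
import numpy as np

def softmax(a):
    a = a - np.max(a)                 # subtract the maximum for numerical stability
    e = np.exp(a)
    return e / e.sum()

a = np.array([0.5, -1.2, 2.0])        # activations a_k = w_k^T phi
y = softmax(a)

# analytic Jacobian, eq. (4.106): dy_k/da_j = y_k (I_kj - y_j)
J_analytic = np.diag(y) - np.outer(y, y)

# central finite-difference estimate of the same Jacobian
eps = 1e-6
J_numeric = np.empty((3, 3))
for j in range(3):
    da = np.zeros(3); da[j] = eps
    J_numeric[:, j] = (softmax(a + da) - softmax(a - da)) / (2 * eps)

print(np.allclose(J_analytic, J_numeric, atol=1e-8))   # True
```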
Next we write down the likelihood function. This is most easily done using the 1-of-K coding scheme, in which the target vector t_n for a feature vector φ_n belonging to class C_k is a binary vector with all elements zero except for element k, which equals one. The likelihood function is then given by

p(T|w_1, \ldots, w_K) = \prod_{n=1}^{N} \prod_{k=1}^{K} p(C_k|\phi_n)^{t_{nk}} = \prod_{n=1}^{N} \prod_{k=1}^{K} y_{nk}^{t_{nk}}    (4.107)

where y_{nk} = y_k(φ_n), and T is an N × K matrix of target variables with elements t_{nk}. Taking the negative logarithm then gives

E(w_1, \ldots, w_K) = -\ln p(T|w_1, \ldots, w_K) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} \ln y_{nk}    (4.108)

which is known as the cross-entropy error function for the multiclass classification problem.
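A direct transcription of (4.107) and (4.108) might look as follows (a sketch with illustrative names; a small constant guards the logarithm against y_nk underflowing to zero):

```python
import numpy as np

def softmax_rows(A):
    A = A - A.max(axis=1, keepdims=True)   # stabilise each row
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

def cross_entropy(W, Phi, T, eps=1e-12):
    """Multiclass cross-entropy error, eq. (4.108).

    W   : (M, K) weight vectors w_k stored as columns
    Phi : (N, M) design matrix of basis-function values
    T   : (N, K) 1-of-K target matrix with elements t_nk
    """
    Y = softmax_rows(Phi @ W)              # y_nk = y_k(phi_n)
    return -np.sum(T * np.log(Y + eps))
```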
We now take the gradient of the error function with respect to one of the parameter vectors w_j. Making use of the result (4.106) for the derivatives of the softmax function, we obtain (Exercise 4.18)

\nabla_{w_j} E(w_1, \ldots, w_K) = \sum_{n=1}^{N} (y_{nj} - t_{nj}) \phi_n    (4.109)

where we have made use of \sum_k t_{nk} = 1. Once again, we see the same form arising for the gradient as was found for the sum-of-squares error function with the linear model and the cross-entropy error for the logistic regression model, namely the product of the error (y_{nj} − t_{nj}) times the basis function φ_n. Again, we could use this to formulate a sequential algorithm in which patterns are presented one at a time, and each of the weight vectors is updated using (3.22).
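In code, the gradient (4.109) and the corresponding sequential update for a single pattern could be sketched as follows (illustrative names; a fixed learning rate eta is assumed):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

def sgd_update(W, phi_n, t_n, eta=0.1):
    """Sequential update of all weight vectors for one pattern.

    W     : (M, K) weight vectors w_k stored as columns
    phi_n : (M,)  basis-function vector for pattern n
    t_n   : (K,)  1-of-K target vector
    """
    y_n = softmax(W.T @ phi_n)             # posterior probabilities y_nk
    grad = np.outer(phi_n, y_n - t_n)      # column j is (y_nj - t_nj) phi_n, eq. (4.109)
    return W - eta * grad                  # gradient-descent step, as in (3.22)
```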
We have seen that the derivative of the log likelihood function for a linear regression model with respect to the parameter vector w for a data point n took the form of the 'error' y_n − t_n times the feature vector φ_n. Similarly, for the combination of logistic sigmoid activation function and cross-entropy error function (4.90), and for the softmax activation function with the multiclass cross-entropy error function (4.108), we again obtain this same simple form. This is an example of a more general result, as we shall see in Section 4.3.6.

To find a batch algorithm, we again appeal to the Newton-Raphson update to obtain the corresponding IRLS algorithm for the multiclass problem. This requires evaluation of the Hessian matrix, which comprises blocks of size M × M in which block j, k is given by
\nabla_{w_k} \nabla_{w_j} E(w_1, \ldots, w_K) = \sum_{n=1}^{N} y_{nk} (I_{kj} - y_{nj}) \phi_n \phi_n^T.    (4.110)

As with the two-class problem, the Hessian matrix for the multiclass logistic regression model is positive definite (Exercise 4.20), and so the error function again has a unique minimum. Practical details of IRLS for the multiclass case can be found in Bishop and Nabney (2008).
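A sketch of one such Newton-Raphson step, assembling the full block Hessian explicitly (only practical for small M and K; names are illustrative, and a tiny ridge is added because the exact Hessian is singular along the direction that adds the same vector to every w_k):

```python
import numpy as np

def softmax_rows(A):
    E = np.exp(A - A.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def newton_step(W, Phi, T):
    """One Newton-Raphson (IRLS) step for multiclass logistic regression (sketch).

    W : (M, K) weight vectors w_k as columns; Phi : (N, M); T : (N, K) 1-of-K targets.
    """
    N, M = Phi.shape
    K = W.shape[1]
    Y = softmax_rows(Phi @ W)                      # y_nk = y_k(phi_n)
    grad = (Phi.T @ (Y - T)).T.reshape(K * M)      # stacked gradients, eq. (4.109)
    H = np.zeros((K * M, K * M))
    for j in range(K):
        for k in range(K):
            # block (j, k) of the Hessian, eq. (4.110): sum_n y_nk (I_kj - y_nj) phi_n phi_n^T
            c = Y[:, k] * ((j == k) - Y[:, j])
            H[j*M:(j+1)*M, k*M:(k+1)*M] = Phi.T @ (Phi * c[:, None])
    # small ridge guards against the exact singularity mentioned above
    step = np.linalg.solve(H + 1e-8 * np.eye(K * M), grad)
    w_stacked = W.T.reshape(K * M) - step
    return w_stacked.reshape(K, M).T               # back to (M, K)
```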
4.3.5 Probit regression

We have seen that, for a broad range of class-conditional distributions described by the exponential family, the resulting posterior class probabilities are given by a logistic (or softmax) transformation acting on a linear function of the feature variables. However, not all choices of class-conditional density give rise to such a simple form for the posterior probabilities (for instance, if the class-conditional densities are modelled using Gaussian mixtures). This suggests that it might be worth exploring other types of discriminative probabilistic model. For the purposes of this chapter, however, we shall return to the two-class case, and again remain within the framework of generalized linear models so that

p(t = 1|a) = f(a)    (4.111)

where a = w^T φ, and f(·) is the activation function.

One way to motivate an alternative choice for the link function is to consider a noisy threshold model, as follows. For each input φ_n, we evaluate a_n = w^T φ_n and then set the target value according to

t_n = 1 if a_n \geq \theta, and t_n = 0 otherwise.    (4.112)
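The following sketch (illustrative, assuming SciPy; the mixture parameters are made up) simulates this noisy threshold model with θ drawn from a mixture of two Gaussians, as in Figure 4.13, and checks that the empirical P(t = 1|a) matches the cumulative distribution function of p(θ):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def sample_theta(n):
    """Draw thresholds from an (assumed) mixture of two Gaussians."""
    comp = rng.random(n) < 0.6
    return np.where(comp, rng.normal(1.0, 0.4, n), rng.normal(2.5, 0.3, n))

def mixture_cdf(a):
    return 0.6 * norm.cdf(a, 1.0, 0.4) + 0.4 * norm.cdf(a, 2.5, 0.3)

a = 1.8                                 # a fixed activation a = w^T phi
theta = sample_theta(200000)
t = (a >= theta).astype(float)          # noisy threshold rule, eq. (4.112)

print(t.mean(), mixture_cdf(a))         # empirical P(t=1|a) vs. the CDF of p(theta)
```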
Figure 4.13  Schematic example of a probability density p(θ), shown by the blue curve, given in this example by a mixture of two Gaussians, along with its cumulative distribution function f(a), shown by the red curve. Note that the value of the blue curve at any point, such as that indicated by the vertical green line, corresponds to the slope of the red curve at the same point. Conversely, the value of the red curve at this point corresponds to the area under the blue curve indicated by the shaded green region. In the stochastic threshold model, the class label takes the value t = 1 if the value of a = w^T φ exceeds a threshold, otherwise it takes the value t = 0. This is equivalent to an activation function given by the cumulative distribution function f(a).
If the value of θ is drawn from a probability density p(θ), then the corresponding activation function will be given by the cumulative distribution function

f(a) = \int_{-\infty}^{a} p(\theta)\, d\theta    (4.113)

as illustrated in Figure 4.13.

As a specific example, suppose that the density p(θ) is given by a zero-mean, unit-variance Gaussian. The corresponding cumulative distribution function is given by

\Phi(a) = \int_{-\infty}^{a} \mathcal{N}(\theta|0, 1)\, d\theta    (4.114)

which is known as the probit function. It has a sigmoidal shape and is compared with the logistic sigmoid function in Figure 4.9. Note that the use of a more general Gaussian distribution does not change the model, because this is equivalent to a re-scaling of the linear coefficients w.
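The probit function (4.114) is simply the standard normal cumulative distribution function, which most numerical libraries provide directly; for example (a sketch assuming SciPy):

```python
import numpy as np
from scipy.stats import norm

def probit(a):
    """Probit activation, eq. (4.114): CDF of a zero-mean, unit-variance Gaussian."""
    return norm.cdf(a)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

a = np.linspace(-3, 3, 7)
print(probit(a))     # sigmoidal, saturating at 0 and 1
print(sigmoid(a))    # logistic sigmoid for comparison (cf. Figure 4.9)
```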
Many numerical packages provide for the evaluation of a closely related function, defined by

\mathrm{erf}(a) = \frac{2}{\sqrt{\pi}} \int_{0}^{a} \exp(-\theta^2/2)\, d\theta    (4.115)

and known as the erf function or error function (not to be confused with the error function of a machine learning model).
It is related to the probit function (Exercise 4.21) by

\Phi(a) = \frac{1}{2} \left\{ 1 + \frac{1}{\sqrt{2}}\, \mathrm{erf}(a) \right\}.    (4.116)

The generalized linear model based on a probit activation function is known as probit regression.
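Some care is needed when checking (4.116) numerically, because the definition (4.115) uses exp(−θ²/2), whereas the erf implemented in most libraries (for example scipy.special.erf) integrates exp(−θ²). A small sketch relating the two conventions:

```python
import numpy as np
from scipy.stats import norm
from scipy.special import erf        # library convention: (2/sqrt(pi)) int_0^a exp(-u^2) du

def erf_text(a):
    """The erf function as defined in eq. (4.115), with exp(-theta^2 / 2)."""
    return np.sqrt(2.0) * erf(a / np.sqrt(2.0))

a = np.linspace(-3, 3, 13)
lhs = norm.cdf(a)                                  # probit function Phi(a), eq. (4.114)
rhs = 0.5 * (1.0 + erf_text(a) / np.sqrt(2.0))     # right-hand side of eq. (4.116)
print(np.allclose(lhs, rhs))                       # True
```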
We can determine the parameters of this model using maximum likelihood, by a straightforward extension of the ideas discussed earlier. In practice, the results found using probit regression tend to be similar to those of logistic regression. We shall, however, find another use for the probit model when we discuss Bayesian treatments of logistic regression in Section 4.5.

One issue that can occur in practical applications is that of outliers, which can arise for instance through errors in measuring the input vector x or through mislabelling of the target value t. Because such points can lie a long way to the wrong side of the ideal decision boundary, they can seriously distort the classifier. Note that the logistic and probit regression models behave differently in this respect, because the tails of the logistic sigmoid decay asymptotically like exp(−x) for x → ∞, whereas for the probit activation function they decay like exp(−x²), and so the probit model can be significantly more sensitive to outliers.
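The difference in tail behaviour is easy to see numerically: for a point far on the wrong side of the boundary, the probit model assigns a far smaller probability and hence a much larger negative log-likelihood penalty (a quick check, assuming SciPy):

```python
import numpy as np
from scipy.stats import norm

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

for a in [2.0, 4.0, 6.0]:
    p_logistic = sigmoid(-a)        # logistic tail, decays roughly like exp(-a)
    p_probit = norm.cdf(-a)         # probit tail, decays like a Gaussian (much faster)
    print(a, p_logistic, p_probit, -np.log(p_logistic), -np.log(p_probit))
```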
However, both the logistic and the probit models assume the data is correctly labelled. The effect of mislabelling is easily incorporated into a probabilistic model by introducing a probability ε that the target value t has been flipped to the wrong value (Opper and Winther, 2000a), leading to a target value distribution for data point x of the form

p(t|x) = (1 - \epsilon)\,\sigma(x) + \epsilon\,(1 - \sigma(x)) = \epsilon + (1 - 2\epsilon)\,\sigma(x)    (4.117)

where σ(x) is the activation function with input vector x. Here ε may be set in advance, or it may be treated as a hyperparameter whose value is inferred from the data.
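A sketch of the corresponding robust likelihood (illustrative names; here σ is taken to be the logistic sigmoid applied to a = w^T φ, and ε is fixed in advance):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def p_target(t, a, eps=0.05):
    """Label-noise model of eq. (4.117): p(t=1) = eps + (1 - 2*eps) * sigma(a)."""
    p1 = eps + (1.0 - 2.0 * eps) * sigmoid(a)
    return np.where(t == 1, p1, 1.0 - p1)

def neg_log_lik(w, Phi, t, eps=0.05):
    """Negative log likelihood under the mislabelling model (sketch)."""
    a = Phi @ w
    return -np.sum(np.log(p_target(t, a, eps)))
```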
4.3.6 Canonical link functions

For the linear regression model with a Gaussian noise distribution, the error function, corresponding to the negative log likelihood, is given by (3.12). If we take the derivative with respect to the parameter vector w of the contribution to the error function from a data point n, this takes the form of the 'error' y_n − t_n times the feature vector φ_n, where y_n = w^T φ_n. Similarly, for the combination of the logistic sigmoid activation function and the cross-entropy error function (4.90), and for the softmax activation function with the multiclass cross-entropy error function (4.108), we again obtain this same simple form.
We now show that this is a general result of assuming a conditional distribution for the target variable from the exponential family, along with a corresponding choice for the activation function known as the canonical link function.

We again make use of the restricted form (4.84) of exponential family distributions. Note that here we are applying the assumption of exponential family distribution to the target variable t, in contrast to Section 4.2.4 where we applied it to the input vector x. We therefore consider conditional distributions of the target variable of the form

p(t|\eta, s) = \frac{1}{s}\, h\!\left(\frac{t}{s}\right) g(\eta) \exp\!\left\{\frac{\eta t}{s}\right\}.    (4.118)

Using the same line of argument as led to the derivation of the result (2.226), we see that the conditional mean of t, which we denote by y, is given by

y \equiv \mathbb{E}[t|\eta] = -s \frac{d}{d\eta} \ln g(\eta).    (4.119)
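As an illustrative check of the form (4.118) (a worked example, not part of the text), the Bernoulli distribution used for two-class targets fits this form with s = 1. Writing

p(t|\mu) = \mu^{t} (1-\mu)^{1-t} = (1-\mu) \exp\!\left\{ t \ln\frac{\mu}{1-\mu} \right\},

we can identify h(t) = 1, \eta = \ln\{\mu/(1-\mu)\} and g(\eta) = 1-\mu = 1/(1+e^{\eta}). Applying (4.119) with s = 1 then gives

y = -\frac{d}{d\eta} \ln g(\eta) = \frac{d}{d\eta} \ln(1 + e^{\eta}) = \frac{e^{\eta}}{1 + e^{\eta}} = \sigma(\eta) = \mu,

so the conditional mean is recovered correctly, and the activation function associated with this canonical choice is the logistic sigmoid, consistent with the earlier results for logistic regression.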