Bishop C.M., Pattern Recognition and Machine Learning (2006)
However, this summation constraint alone is not sufficient to allow the model outputs to be interpreted as probabilities because they are not constrained to lie within the interval (0, 1).

The least-squares approach gives an exact closed-form solution for the discriminant function parameters.
However, even as a discriminant function (where we use it to make decisions directly and dispense with any probabilistic interpretation) it suffers from some severe problems. We have already seen that least-squares solutions lack robustness to outliers, and this applies equally to the classification application, as illustrated in Figure 4.4. Here we see that the additional data points in the right-hand figure produce a significant change in the location of the decision boundary, even though these points would be correctly classified by the original decision boundary in the left-hand figure.
Figure 4.4 The left plot shows data from two classes, denoted by red crosses and blue circles, together with the decision boundary found by least squares (magenta curve) and also by the logistic regression model (green curve), which is discussed later in Section 4.3.2. The right-hand plot shows the corresponding results obtained when extra data points are added at the bottom left of the diagram, showing that least squares is highly sensitive to outliers, unlike logistic regression.

The sum-of-squares error function penalizes predictions that are ‘too correct’ in that they lie a long way on the correct side of the decision boundary.
In Section 7.1.2, we shall consider several alternative error functions for classification and we shall see that they do not suffer from this difficulty.

However, problems with least squares can be more severe than simply lack of robustness, as illustrated in Figure 4.5. This shows a synthetic data set drawn from three classes in a two-dimensional input space $(x_1, x_2)$, having the property that linear decision boundaries can give excellent separation between the classes. Indeed, the technique of logistic regression, described later in this chapter, gives a satisfactory solution as seen in the right-hand plot.
However, the least-squares solution gives poor results, with only a small region of the input space assigned to the green class.

The failure of least squares should not surprise us when we recall that it corresponds to maximum likelihood under the assumption of a Gaussian conditional distribution, whereas binary target vectors clearly have a distribution that is far from Gaussian. By adopting more appropriate probabilistic models, we shall obtain classification techniques with much better properties than least squares.

Figure 4.5 Example of a synthetic data set comprising three classes, with training data points denoted in red (×), green (+), and blue (◦). Lines denote the decision boundaries, and the background colours denote the respective classes of the decision regions. On the left is the result of using a least-squares discriminant. We see that the region of input space assigned to the green class is too small and so most of the points from this class are misclassified. On the right is the result of using logistic regression as described in Section 4.3.2, showing correct classification of the training data.
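To see this sensitivity numerically, here is a small sketch (a hypothetical Python/NumPy illustration, not taken from the book): it fits a least-squares discriminant with 1-of-K targets to two synthetic Gaussian blobs, then appends a cluster of extra points that already lie well on the correct side of the boundary and refits, printing the slope and intercept of the decision boundary in both cases so the shift is visible.

```python
import numpy as np

rng = np.random.default_rng(0)

def lstsq_discriminant(X, labels):
    """Least-squares fit of y(x) = W^T x_tilde with 1-of-K target vectors.
    Returns the augmented weight matrix W (bias row first)."""
    X_tilde = np.hstack([np.ones((len(X), 1)), X])      # prepend the dummy input x0 = 1
    T = np.eye(2)[labels]                               # 1-of-K coding of the targets
    W, *_ = np.linalg.lstsq(X_tilde, T, rcond=None)     # exact closed-form solution
    return W

def boundary(W):
    """Decision boundary y1(x) = y2(x), returned as (slope, intercept) in x2 = a*x1 + b."""
    d = W[:, 0] - W[:, 1]                               # difference of the two discriminants
    return -d[1] / d[2], -d[0] / d[2]

# two well-separated Gaussian blobs
X = np.vstack([rng.normal([-1.0, 1.0], 0.6, (40, 2)),
               rng.normal([1.0, -1.0], 0.6, (40, 2))])
labels = np.array([0] * 40 + [1] * 40)
print("boundary without extra points:", boundary(lstsq_discriminant(X, labels)))

# extra points that lie far on the *correct* side of the boundary for class 1
X_extra = np.vstack([X, rng.normal([7.0, -6.0], 0.5, (15, 2))])
labels_extra = np.concatenate([labels, np.full(15, 1)])
print("boundary with extra points:   ", boundary(lstsq_discriminant(X_extra, labels_extra)))
```

Even though the extra points are classified correctly by the original boundary, the printed slope and intercept change noticeably after they are added, mirroring the behaviour in Figure 4.4.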
For the moment, however, we continue to explore alternative nonprobabilistic methods for setting the parameters in the linear classification models.

4.1.4 Fisher’s linear discriminant

One way to view a linear classification model is in terms of dimensionality reduction. Consider first the case of two classes, and suppose we take the D-dimensional input vector x and project it down to one dimension using

$$y = \mathbf{w}^{\mathrm{T}}\mathbf{x}. \tag{4.20}$$

If we place a threshold on y and classify $y \geqslant -w_0$ as class C1, and otherwise class C2, then we obtain our standard linear classifier discussed in the previous section. In general, the projection onto one dimension leads to a considerable loss of information, and classes that are well separated in the original D-dimensional space may become strongly overlapping in one dimension.
However, by adjusting the components of the weight vector w, we can select a projection that maximizes the class separation. To begin with, consider a two-class problem in which there are $N_1$ points of class C1 and $N_2$ points of class C2, so that the mean vectors of the two classes are given by

$$\mathbf{m}_1 = \frac{1}{N_1}\sum_{n \in \mathcal{C}_1} \mathbf{x}_n, \qquad \mathbf{m}_2 = \frac{1}{N_2}\sum_{n \in \mathcal{C}_2} \mathbf{x}_n. \tag{4.21}$$

The simplest measure of the separation of the classes, when projected onto w, is the separation of the projected class means. This suggests that we might choose w so as to maximize

$$m_2 - m_1 = \mathbf{w}^{\mathrm{T}}(\mathbf{m}_2 - \mathbf{m}_1) \tag{4.22}$$

where

$$m_k = \mathbf{w}^{\mathrm{T}}\mathbf{m}_k \tag{4.23}$$

is the mean of the projected data from class Ck. However, this expression can be made arbitrarily large simply by increasing the magnitude of w. To solve this problem, we could constrain w to have unit length, so that $\sum_i w_i^2 = 1$.
Using a Lagrange multiplier to perform the constrained maximization (Appendix E), we then find that w ∝ (m2 − m1) (Exercise 4.4). There is still a problem with this approach, however, as illustrated in Figure 4.6. This shows two classes that are well separated in the original two-dimensional space $(x_1, x_2)$ but that have considerable overlap when projected onto the line joining their means. This difficulty arises from the strongly nondiagonal covariances of the class distributions.

Figure 4.6 The left plot shows samples from two classes (depicted in red and blue) along with the histograms resulting from projection onto the line joining the class means. Note that there is considerable class overlap in the projected space. The right plot shows the corresponding projection based on the Fisher linear discriminant, showing the greatly improved class separation.
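The effect of such correlated class covariances is easy to reproduce numerically. The sketch below (my own synthetic NumPy example, not the data of Figure 4.6) projects two strongly correlated classes onto the unit vector along m2 − m1 and prints the projected class means and within-class standard deviations; the separation of the projected means should come out comparable to the within-class spread, which is the overlap described above.

```python
import numpy as np

rng = np.random.default_rng(1)

# two classes that are well separated in 2D but have strongly non-diagonal covariance
cov = np.array([[1.0, 0.9],
                [0.9, 1.0]])
X1 = rng.multivariate_normal([0.0, 0.0], cov, 200)
X2 = rng.multivariate_normal([2.0, 0.5], cov, 200)

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)            # class means, as in (4.21)

# unit-length projection onto the line joining the class means
w = (m2 - m1) / np.linalg.norm(m2 - m1)
y1, y2 = X1 @ w, X2 @ w                              # projected samples y_n = w^T x_n

print("separation of projected means:", y2.mean() - y1.mean())   # compare with (4.22)
print("within-class standard deviations:", y1.std(), y2.std())   # comparable in size
```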
The idea proposed by Fisher is to maximize a function that will give a large separation between the projected class means while also giving a small variance within each class, thereby minimizing the class overlap. The projection formula (4.20) transforms the set of labelled data points in x into a labelled set in the one-dimensional space y. The within-class variance of the transformed data from class Ck is therefore given by

$$s_k^2 = \sum_{n \in \mathcal{C}_k} (y_n - m_k)^2 \tag{4.24}$$

where $y_n = \mathbf{w}^{\mathrm{T}}\mathbf{x}_n$. We can define the total within-class variance for the whole data set to be simply $s_1^2 + s_2^2$.
The Fisher criterion is defined to be the ratio of the between-class variance to the within-class variance and is given by

$$J(\mathbf{w}) = \frac{(m_2 - m_1)^2}{s_1^2 + s_2^2}. \tag{4.25}$$

We can make the dependence on w explicit by using (4.20), (4.23), and (4.24) to rewrite the Fisher criterion in the form (Exercise 4.5)

$$J(\mathbf{w}) = \frac{\mathbf{w}^{\mathrm{T}}\mathbf{S}_{\mathrm{B}}\mathbf{w}}{\mathbf{w}^{\mathrm{T}}\mathbf{S}_{\mathrm{W}}\mathbf{w}} \tag{4.26}$$

where $\mathbf{S}_{\mathrm{B}}$ is the between-class covariance matrix and is given by

$$\mathbf{S}_{\mathrm{B}} = (\mathbf{m}_2 - \mathbf{m}_1)(\mathbf{m}_2 - \mathbf{m}_1)^{\mathrm{T}} \tag{4.27}$$

and $\mathbf{S}_{\mathrm{W}}$ is the total within-class covariance matrix, given by

$$\mathbf{S}_{\mathrm{W}} = \sum_{n \in \mathcal{C}_1} (\mathbf{x}_n - \mathbf{m}_1)(\mathbf{x}_n - \mathbf{m}_1)^{\mathrm{T}} + \sum_{n \in \mathcal{C}_2} (\mathbf{x}_n - \mathbf{m}_2)(\mathbf{x}_n - \mathbf{m}_2)^{\mathrm{T}}. \tag{4.28}$$

Differentiating (4.26) with respect to w, we find that $J(\mathbf{w})$ is maximized when

$$(\mathbf{w}^{\mathrm{T}}\mathbf{S}_{\mathrm{B}}\mathbf{w})\,\mathbf{S}_{\mathrm{W}}\mathbf{w} = (\mathbf{w}^{\mathrm{T}}\mathbf{S}_{\mathrm{W}}\mathbf{w})\,\mathbf{S}_{\mathrm{B}}\mathbf{w}. \tag{4.29}$$

From (4.27), we see that $\mathbf{S}_{\mathrm{B}}\mathbf{w}$ is always in the direction of (m2 − m1).
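As a quick sanity check (my own NumPy sketch with synthetic data, not part of the text), the two forms (4.25) and (4.26) of the Fisher criterion can be evaluated for an arbitrary direction w; they agree to numerical precision because (4.26) is simply (4.25) rewritten using (4.20), (4.23), and (4.24).

```python
import numpy as np

rng = np.random.default_rng(2)
X1 = rng.normal([0.0, 0.0], [1.0, 0.5], (100, 2))    # samples from class C1
X2 = rng.normal([2.0, 1.0], [1.0, 0.5], (100, 2))    # samples from class C2
m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
w = rng.normal(size=2)                               # an arbitrary projection direction

# form (4.25): projected means (4.23) and within-class variances (4.24)
y1, y2 = X1 @ w, X2 @ w
s1_sq = ((y1 - y1.mean()) ** 2).sum()
s2_sq = ((y2 - y2.mean()) ** 2).sum()
J_projected = (y2.mean() - y1.mean()) ** 2 / (s1_sq + s2_sq)

# form (4.26): quadratic forms with S_B from (4.27) and S_W from (4.28)
S_B = np.outer(m2 - m1, m2 - m1)
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
J_quadratic = (w @ S_B @ w) / (w @ S_W @ w)

print(J_projected, J_quadratic)                      # agree up to floating-point rounding
```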
Furthermore, we do not care about the magnitude of w, only its direction, and so we can drop the scalar factors $(\mathbf{w}^{\mathrm{T}}\mathbf{S}_{\mathrm{B}}\mathbf{w})$ and $(\mathbf{w}^{\mathrm{T}}\mathbf{S}_{\mathrm{W}}\mathbf{w})$. Multiplying both sides of (4.29) by $\mathbf{S}_{\mathrm{W}}^{-1}$ we then obtain

$$\mathbf{w} \propto \mathbf{S}_{\mathrm{W}}^{-1}(\mathbf{m}_2 - \mathbf{m}_1). \tag{4.30}$$

Note that if the within-class covariance is isotropic, so that $\mathbf{S}_{\mathrm{W}}$ is proportional to the unit matrix, we find that w is proportional to the difference of the class means, as discussed above.

The result (4.30) is known as Fisher’s linear discriminant, although strictly it is not a discriminant but rather a specific choice of direction for projection of the data down to one dimension.
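To make (4.30) concrete, the following sketch (again my own synthetic NumPy example) computes the Fisher direction by solving $\mathbf{S}_{\mathrm{W}}\mathbf{w} = \mathbf{m}_2 - \mathbf{m}_1$ and compares the resulting class separation, measured in units of the projected within-class spread, with the separation obtained from the plain mean-difference direction.

```python
import numpy as np

rng = np.random.default_rng(3)
cov = np.array([[1.0, 0.9],
                [0.9, 1.0]])
X1 = rng.multivariate_normal([0.0, 0.0], cov, 300)
X2 = rng.multivariate_normal([2.0, 0.5], cov, 300)
m1, m2 = X1.mean(axis=0), X2.mean(axis=0)

# total within-class covariance (4.28) and the Fisher direction (4.30)
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
w_fisher = np.linalg.solve(S_W, m2 - m1)             # proportional to S_W^{-1} (m2 - m1)

def separation(w):
    """Distance between the projected class means in units of the pooled projected std."""
    y1, y2 = X1 @ w, X2 @ w
    return abs(y2.mean() - y1.mean()) / np.sqrt(0.5 * (y1.var() + y2.var()))

print("mean-difference direction:", separation(m2 - m1))
print("Fisher direction:         ", separation(w_fisher))   # noticeably larger
```

Because the criterion (4.26) is unchanged when w is rescaled, w_fisher needs no normalization; any positive multiple of $\mathbf{S}_{\mathrm{W}}^{-1}(\mathbf{m}_2 - \mathbf{m}_1)$ gives the same projection up to scale.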
However, the projected data can subsequently be used to construct a discriminant, by choosing a threshold $y_0$ so that we classify a new point as belonging to C1 if $y(\mathbf{x}) \geqslant y_0$ and classify it as belonging to C2 otherwise. For example, we can model the class-conditional densities $p(y|\mathcal{C}_k)$ using Gaussian distributions and then use the techniques of Section 1.2.4 to find the parameters of the Gaussian distributions by maximum likelihood. Having found Gaussian approximations to the projected classes, the formalism of Section 1.5.1 then gives an expression for the optimal threshold. Some justification for the Gaussian assumption comes from the central limit theorem by noting that $y = \mathbf{w}^{\mathrm{T}}\mathbf{x}$ is the sum of a set of random variables.
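A minimal sketch of that construction follows (my own illustration; the pooled-variance simplification and all variable names are assumptions, not the book's recipe verbatim). It projects synthetic data onto the Fisher direction, fits a Gaussian to each projected class by maximum likelihood with a shared variance, and solves for the threshold $y_0$ at which the two prior-weighted class densities are equal.

```python
import numpy as np

rng = np.random.default_rng(4)
cov = np.array([[1.0, 0.9],
                [0.9, 1.0]])
X1 = rng.multivariate_normal([0.0, 0.0], cov, 300)   # class C1
X2 = rng.multivariate_normal([2.0, 0.5], cov, 200)   # class C2
N1, N2 = len(X1), len(X2)

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
w = np.linalg.solve(S_W, m2 - m1)                    # Fisher direction (4.30)
y1, y2 = X1 @ w, X2 @ w                              # projected data

# maximum-likelihood Gaussian fit to each projected class, pooling the variance
mu1, mu2 = y1.mean(), y2.mean()
var = (((y1 - mu1) ** 2).sum() + ((y2 - mu2) ** 2).sum()) / (N1 + N2)
pi1, pi2 = N1 / (N1 + N2), N2 / (N1 + N2)            # class priors from the sample sizes

# threshold where pi1 N(y | mu1, var) = pi2 N(y | mu2, var); with a shared variance
# the quadratic terms cancel and a single crossing point remains
y0 = 0.5 * (mu1 + mu2) + var * np.log(pi1 / pi2) / (mu2 - mu1)

# classify a projected point as C2 if it exceeds y0 (here mu2 > mu1), else C1
train_error = ((y1 > y0).sum() + (y2 <= y0).sum()) / (N1 + N2)
print("threshold y0 =", y0, " training error =", train_error)
```

With equal priors the threshold sits midway between the projected class means; the unequal sample sizes used here (300 versus 200) shift it toward the mean of the smaller class.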
4.1.5 Relation to least squares

The least-squares approach to the determination of a linear discriminant was based on the goal of making the model predictions as close as possible to a set of target values. By contrast, the Fisher criterion was derived by requiring maximum class separation in the output space. It is interesting to see the relationship between these two approaches. In particular, we shall show that, for the two-class problem, the Fisher criterion can be obtained as a special case of least squares.

So far we have considered 1-of-K coding for the target values. If, however, we adopt a slightly different target coding scheme, then the least-squares solution for …