Bishop C.M. Pattern Recognition and Machine Learning (2006)
In this chapter, we shall restrict attention to linear discriminants, namely those for which the decision surfaces are hyperplanes. To simplify the discussion, we consider first the case of two classes and then investigate the extension to $K > 2$ classes.

4.1.1 Two classes

The simplest representation of a linear discriminant function is obtained by taking a linear function of the input vector so that

$$y(\mathbf{x}) = \mathbf{w}^{\mathrm{T}} \mathbf{x} + w_0 \tag{4.4}$$

where $\mathbf{w}$ is called a weight vector, and $w_0$ is a bias (not to be confused with bias in the statistical sense). The negative of the bias is sometimes called a threshold. An input vector $\mathbf{x}$ is assigned to class $\mathcal{C}_1$ if $y(\mathbf{x}) \geq 0$ and to class $\mathcal{C}_2$ otherwise.
The corresponding decision boundary is therefore defined by the relation $y(\mathbf{x}) = 0$, which corresponds to a $(D - 1)$-dimensional hyperplane within the $D$-dimensional input space. Consider two points $\mathbf{x}_A$ and $\mathbf{x}_B$ both of which lie on the decision surface. Because $y(\mathbf{x}_A) = y(\mathbf{x}_B) = 0$, we have $\mathbf{w}^{\mathrm{T}}(\mathbf{x}_A - \mathbf{x}_B) = 0$ and hence the vector $\mathbf{w}$ is orthogonal to every vector lying within the decision surface, and so $\mathbf{w}$ determines the orientation of the decision surface. Similarly, if $\mathbf{x}$ is a point on the decision surface, then $y(\mathbf{x}) = 0$, and so the normal distance from the origin to the decision surface is given by

$$\frac{\mathbf{w}^{\mathrm{T}} \mathbf{x}}{\|\mathbf{w}\|} = -\frac{w_0}{\|\mathbf{w}\|}. \tag{4.5}$$

We therefore see that the bias parameter $w_0$ determines the location of the decision surface.
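The classification rule attached to (4.4) and the geometric statements around (4.5) can be checked numerically. The following is a minimal sketch in plain Python; the weight vector, bias, and test points are illustrative values chosen for this example and do not come from the text, and the helper names `dot`, `y`, and `classify` are hypothetical.

```python
import math

def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def y(x, w, w0):
    """Linear discriminant y(x) = w^T x + w0, as in (4.4)."""
    return dot(w, x) + w0

def classify(x, w, w0):
    """Return 1 for class C1 (y(x) >= 0) and 2 for class C2 otherwise."""
    return 1 if y(x, w, w0) >= 0 else 2

# Illustrative parameters (assumed values, not taken from the text).
w, w0 = [3.0, 4.0], -10.0
norm_w = math.sqrt(dot(w, w))                 # ||w|| = 5

# Two points chosen to lie on the decision surface 3*x1 + 4*x2 - 10 = 0.
x_a, x_b = [2.0, 1.0], [-2.0, 4.0]
diff = [a - b for a, b in zip(x_a, x_b)]      # a vector lying within the surface

on_surface = (y(x_a, w, w0), y(x_b, w, w0))   # both values are zero
orthogonality = dot(w, diff)                  # w^T (x_A - x_B) = 0
origin_distance = -w0 / norm_w                # right-hand side of (4.5)
labels = (classify([4.0, 4.0], w, w0), classify([0.0, 0.0], w, w0))
```

With these assumed values the normal distance from the origin is 2, and the left-hand side of (4.5), evaluated at either surface point, gives the same number.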
These properties are illustrated for the case of $D = 2$ in Figure 4.1. Furthermore, we note that the value of $y(\mathbf{x})$ gives a signed measure of the perpendicular distance $r$ of the point $\mathbf{x}$ from the decision surface. To see this, consider an arbitrary point $\mathbf{x}$ and let $\mathbf{x}_\perp$ be its orthogonal projection onto the decision surface, so that

$$\mathbf{x} = \mathbf{x}_\perp + r \frac{\mathbf{w}}{\|\mathbf{w}\|}. \tag{4.6}$$

Multiplying both sides of this result by $\mathbf{w}^{\mathrm{T}}$ and adding $w_0$, and making use of $y(\mathbf{x}) = \mathbf{w}^{\mathrm{T}} \mathbf{x} + w_0$ and $y(\mathbf{x}_\perp) = \mathbf{w}^{\mathrm{T}} \mathbf{x}_\perp + w_0 = 0$, we have

$$r = \frac{y(\mathbf{x})}{\|\mathbf{w}\|}. \tag{4.7}$$

This result is illustrated in Figure 4.1.

4. LINEAR MODELS FOR CLASSIFICATION

Figure 4.1: Illustration of the geometry of a linear discriminant function in two dimensions (axes $x_1$ and $x_2$; region $\mathcal{R}_1$ where $y > 0$, region $\mathcal{R}_2$ where $y < 0$). The decision surface $y = 0$, shown in red, is perpendicular to $\mathbf{w}$, and its displacement from the origin is controlled by the bias parameter $w_0$. Also, the signed orthogonal distance of a general point $\mathbf{x}$ from the decision surface is given by $y(\mathbf{x})/\|\mathbf{w}\|$.

As with the linear regression models in Chapter 3, it is sometimes convenient to use a more compact notation in which we introduce an additional dummy 'input' value $x_0 = 1$ and then define $\widetilde{\mathbf{w}} = (w_0, \mathbf{w})$ and $\widetilde{\mathbf{x}} = (x_0, \mathbf{x})$ so that

$$y(\mathbf{x}) = \widetilde{\mathbf{w}}^{\mathrm{T}} \widetilde{\mathbf{x}}. \tag{4.8}$$

In this case, the decision surfaces are $D$-dimensional hyperplanes passing through the origin of the $(D + 1)$-dimensional expanded input space.

4.1.2 Multiple classes

Now consider the extension of linear discriminants to $K > 2$ classes.
We might be tempted to build a $K$-class discriminant by combining a number of two-class discriminant functions. However, this leads to some serious difficulties (Duda and Hart, 1973) as we now show.

Consider the use of $K - 1$ classifiers each of which solves a two-class problem of separating points in a particular class $\mathcal{C}_k$ from points not in that class. This is known as a one-versus-the-rest classifier. The left-hand diagram in Figure 4.2 shows an example involving three classes where this approach leads to regions of input space that are ambiguously classified.

4.1. Discriminant Functions

Figure 4.2: Attempting to construct a $K$-class discriminant from a set of two-class discriminants leads to ambiguous regions, shown in green. On the left is an example involving the use of two discriminants designed to distinguish points in class $\mathcal{C}_k$ from points not in class $\mathcal{C}_k$. On the right is an example involving three discriminant functions each of which is used to separate a pair of classes $\mathcal{C}_k$ and $\mathcal{C}_j$.

An alternative is to introduce $K(K - 1)/2$ binary discriminant functions, one for every possible pair of classes.
This is known as a one-versus-one classifier. Each point is then classified according to a majority vote amongst the discriminant functions. However, this too runs into the problem of ambiguous regions, as illustrated in the right-hand diagram of Figure 4.2.

We can avoid these difficulties by considering a single $K$-class discriminant comprising $K$ linear functions of the form

$$y_k(\mathbf{x}) = \mathbf{w}_k^{\mathrm{T}} \mathbf{x} + w_{k0} \tag{4.9}$$

and then assigning a point $\mathbf{x}$ to class $\mathcal{C}_k$ if $y_k(\mathbf{x}) > y_j(\mathbf{x})$ for all $j \neq k$.
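The assignment rule attached to (4.9) amounts to evaluating all $K$ linear functions and picking the largest. A minimal sketch in plain Python follows; the weights, biases, and test points are assumed values for illustration, and the function names are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def discriminants(x, weights, biases):
    """Evaluate y_k(x) = w_k^T x + w_k0 for k = 1..K, as in (4.9)."""
    return [dot(w_k, x) + w_k0 for w_k, w_k0 in zip(weights, biases)]

def assign_class(x, weights, biases):
    """Return the 1-based index k of the largest y_k(x).

    A tie means x lies exactly on a decision boundary; max() then returns
    the lowest tied index, an arbitrary convention assumed for this sketch.
    """
    ys = discriminants(x, weights, biases)
    return max(range(len(ys)), key=ys.__getitem__) + 1

# Illustrative parameters for K = 3 classes in D = 2 (assumed values).
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
biases = [0.0, 0.0, 0.0]

points = [[2.0, 0.5], [0.5, 2.0], [-1.0, -1.0]]
assigned = [assign_class(x, weights, biases) for x in points]
```

Because every point receives exactly one label from the largest $y_k$, no region of input space is left ambiguous, apart from the boundaries themselves where two or more discriminants tie.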
The decisionboundary between class Ck and class Cj is therefore given by yk (x) = yj (x) andhence corresponds to a (D − 1)-dimensional hyperplane defined by(wk − wj )T x + (wk0 − wj 0 ) = 0.(4.10)This has the same form as the decision boundary for the two-class case discussed inSection 4.1.1, and so analogous geometrical properties apply.The decision regions of such a discriminant are always singly connected andconvex. To see this, consider two points xA and xB both of which lie inside decision that lies on the line connectingregion Rk , as illustrated in Figure 4.3. Any point xxA and xB can be expressed in the formx = λxA + (1 − λ)xB(4.11)1844.
LINEAR MODELS FOR CLASSIFICATIONFigure 4.3Illustration of the decision regions for a multiclass linear discriminant, with the decisionboundaries shown in red. If two points xAand xB both lie inside the same decision reb that lies on the linegion Rk , then any point xconnecting these two points must also lie inRk , and hence the decision region must besingly connected and convex.RjRiRkxAxBx̂where 0 λ 1. From the linearity of the discriminant functions, it follows thatyk (x) = λyk (xA ) + (1 − λ)yk (xB ).(4.12)Because both xA and xB lie inside Rk , it follows that yk (xA ) > yj (xA ), and) > yj (x), and so x also liesyk (xB ) > yj (xB ), for all j = k, and hence yk (xinside Rk .
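The argument built on (4.11) and (4.12) can be spot-checked numerically. The sketch below, in plain Python, walks along the segment between two points of the same region and confirms that the discriminants interpolate linearly and that the assigned region never changes. The parameters and points are assumed values for illustration, and a check at finitely many values of $\lambda$ is only a sanity test, not a proof.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def discriminants(x, weights, biases):
    """Evaluate y_k(x) = w_k^T x + w_k0 for every class k."""
    return [dot(w_k, x) + w_k0 for w_k, w_k0 in zip(weights, biases)]

def region(x, weights, biases):
    """Return the 1-based index of the decision region containing x."""
    ys = discriminants(x, weights, biases)
    return max(range(len(ys)), key=ys.__getitem__) + 1

# Illustrative K = 3 discriminant in D = 2 (assumed values).
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
biases = [0.5, 0.0, -0.5]

# Two points chosen to lie inside region R_1.
x_a, x_b = [3.0, 1.0], [1.0, 0.5]
y_a = discriminants(x_a, weights, biases)
y_b = discriminants(x_b, weights, biases)

violations = 0   # number of segment points that leave R_1
max_gap = 0.0    # largest deviation from the interpolation in (4.12)
for i in range(101):
    lam = i / 100.0
    # Point on the segment, as in (4.11).
    x_hat = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y_hat = discriminants(x_hat, weights, biases)
    for k in range(len(y_hat)):
        expected = lam * y_a[k] + (1.0 - lam) * y_b[k]
        max_gap = max(max_gap, abs(y_hat[k] - expected))
    if region(x_hat, weights, biases) != 1:
        violations += 1
```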
Thus $\mathcal{R}_k$ is singly connected and convex.

Note that for two classes, we can either employ the formalism discussed here, based on two discriminant functions $y_1(\mathbf{x})$ and $y_2(\mathbf{x})$, or else use the simpler but equivalent formulation described in Section 4.1.1 based on a single discriminant function $y(\mathbf{x})$.

We now explore three approaches to learning the parameters of linear discriminant functions, based on least squares, Fisher's linear discriminant, and the perceptron algorithm.

4.1.3 Least squares for classification

In Chapter 3, we considered models that were linear functions of the parameters, and we saw that the minimization of a sum-of-squares error function led to a simple closed-form solution for the parameter values. It is therefore tempting to see if we can apply the same formalism to classification problems.
Consider a general classification problem with $K$ classes, with a 1-of-$K$ binary coding scheme for the target vector $\mathbf{t}$. One justification for using least squares in such a context is that it approximates the conditional expectation $\mathbb{E}[\mathbf{t}|\mathbf{x}]$ of the target values given the input vector. For the binary coding scheme, this conditional expectation is given by the vector of posterior class probabilities.
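Unfortunately, however, these probabilities are typically approximated rather poorly; indeed, the approximations can have values outside the range $(0, 1)$, due to the limited flexibility of a linear model, as we shall see shortly.

Each class $\mathcal{C}_k$ is described by its own linear model so that

$$y_k(\mathbf{x}) = \mathbf{w}_k^{\mathrm{T}} \mathbf{x} + w_{k0} \tag{4.13}$$

where $k = 1, \ldots, K$. We can conveniently group these together using vector notation so that

$$\mathbf{y}(\mathbf{x}) = \widetilde{\mathbf{W}}^{\mathrm{T}} \widetilde{\mathbf{x}} \tag{4.14}$$

where $\widetilde{\mathbf{W}}$ is a matrix whose $k$th column comprises the $(D + 1)$-dimensional vector $\widetilde{\mathbf{w}}_k = (w_{k0}, \mathbf{w}_k^{\mathrm{T}})^{\mathrm{T}}$ and $\widetilde{\mathbf{x}}$ is the corresponding augmented input vector $(1, \mathbf{x}^{\mathrm{T}})^{\mathrm{T}}$ with a dummy input $x_0 = 1$. This representation was discussed in detail in Section 3.1. A new input $\mathbf{x}$ is then assigned to the class for which the output $y_k = \widetilde{\mathbf{w}}_k^{\mathrm{T}} \widetilde{\mathbf{x}}$ is largest.

We now determine the parameter matrix $\widetilde{\mathbf{W}}$ by minimizing a sum-of-squares error function, as we did for regression in Chapter 3. Consider a training data set $\{\mathbf{x}_n, \mathbf{t}_n\}$ where $n = 1, \ldots, N$, and define a matrix $\mathbf{T}$ whose $n$th row is the vector $\mathbf{t}_n^{\mathrm{T}}$, together with a matrix $\widetilde{\mathbf{X}}$ whose $n$th row is $\widetilde{\mathbf{x}}_n^{\mathrm{T}}$. The sum-of-squares error function can then be written as

$$E_D(\widetilde{\mathbf{W}}) = \frac{1}{2} \mathrm{Tr}\left\{ (\widetilde{\mathbf{X}} \widetilde{\mathbf{W}} - \mathbf{T})^{\mathrm{T}} (\widetilde{\mathbf{X}} \widetilde{\mathbf{W}} - \mathbf{T}) \right\}. \tag{4.15}$$

Setting the derivative with respect to $\widetilde{\mathbf{W}}$ to zero, and rearranging, we then obtain the solution for $\widetilde{\mathbf{W}}$ in the form

$$\widetilde{\mathbf{W}} = (\widetilde{\mathbf{X}}^{\mathrm{T}} \widetilde{\mathbf{X}})^{-1} \widetilde{\mathbf{X}}^{\mathrm{T}} \mathbf{T} = \widetilde{\mathbf{X}}^{\dagger} \mathbf{T} \tag{4.16}$$

where $\widetilde{\mathbf{X}}^{\dagger}$ is the pseudo-inverse of the matrix $\widetilde{\mathbf{X}}$, as discussed in Section 3.1.1. We then obtain the discriminant function in the form

$$\mathbf{y}(\mathbf{x}) = \widetilde{\mathbf{W}}^{\mathrm{T}} \widetilde{\mathbf{x}} = \mathbf{T}^{\mathrm{T}} \left( \widetilde{\mathbf{X}}^{\dagger} \right)^{\mathrm{T}} \widetilde{\mathbf{x}}. \tag{4.17}$$

An interesting property of least-squares solutions with multiple target variables is that if every target vector in the training set satisfies some linear constraint

$$\mathbf{a}^{\mathrm{T}} \mathbf{t}_n + b = 0 \tag{4.18}$$

for some constants $\mathbf{a}$ and $b$, then the model prediction for any value of $\mathbf{x}$ will satisfy the same constraint so that

$$\mathbf{a}^{\mathrm{T}} \mathbf{y}(\mathbf{x}) + b = 0. \tag{4.19}$$

(Margin references: Exercise 4.2, Section 2.3.7.)

Thus if we use a 1-of-$K$ coding scheme for $K$ classes, then the predictions made by the model will have the property that the elements of $\mathbf{y}(\mathbf{x})$ will sum to 1 for any value of $\mathbf{x}$.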
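The closed-form solution (4.16) and the sum-to-one property can be illustrated with a small numerical sketch. The example below assumes NumPy is available and uses a toy data set invented for this illustration; the variable names and the `predict` helper are hypothetical. It relies on the augmented design matrix having full column rank, which holds for the points chosen here.

```python
import numpy as np

# Toy training set (assumed values): N = 6 points in D = 2, K = 3 classes.
X = np.array([[0.0, 0.0], [1.0, 0.0],
              [4.0, 4.0], [5.0, 4.0],
              [0.0, 5.0], [1.0, 6.0]])
labels = np.array([0, 0, 1, 1, 2, 2])   # zero-based class indices
K = 3

# N x K target matrix whose n-th row is the 1-of-K vector t_n^T.
T = np.eye(K)[labels]

# N x (D+1) augmented design matrix whose n-th row is (1, x_n^T).
X_tilde = np.hstack([np.ones((len(X), 1)), X])

# Least-squares solution via the pseudo-inverse, as in (4.16).
W_tilde = np.linalg.pinv(X_tilde) @ T

def predict(x):
    """Return y(x) = W_tilde^T x_tilde, as in (4.17)."""
    x_tilde = np.concatenate(([1.0], x))
    return W_tilde.T @ x_tilde

# A query point far from the training data (assumed value).
y_new = predict(np.array([10.0, -3.0]))
assigned = int(np.argmax(y_new))        # zero-based index of the largest output
total = float(y_new.sum())              # equals 1 up to rounding error
```

The outputs sum to 1 for any input because every 1-of-$K$ target satisfies (4.18) with $\mathbf{a}$ a vector of ones and $b = -1$. Nothing in the construction forces the individual elements into the range $(0, 1)$, which is why they cannot be read directly as probabilities.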