Bishop C. M., Pattern Recognition and Machine Learning (2006)
If we adopt a slightly different target coding scheme, then the least-squares solution for the weights becomes equivalent to the Fisher solution (Duda and Hart, 1973). In particular, we shall take the targets for class C1 to be N/N1, where N1 is the number of patterns in class C1, and N is the total number of patterns. This target value approximates the reciprocal of the prior probability for class C1.
For class C2, we shall take the targets to be −N/N2, where N2 is the number of patterns in class C2. The sum-of-squares error function can be written

E = \frac{1}{2} \sum_{n=1}^{N} \left( \mathbf{w}^{\mathrm{T}} \mathbf{x}_n + w_0 - t_n \right)^2.   (4.31)

Setting the derivatives of E with respect to w_0 and w to zero, we obtain respectively

\sum_{n=1}^{N} \left( \mathbf{w}^{\mathrm{T}} \mathbf{x}_n + w_0 - t_n \right) = 0   (4.32)

\sum_{n=1}^{N} \left( \mathbf{w}^{\mathrm{T}} \mathbf{x}_n + w_0 - t_n \right) \mathbf{x}_n = 0.   (4.33)

From (4.32), and making use of our choice of target coding scheme for the t_n, we obtain an expression for the bias in the form

w_0 = -\mathbf{w}^{\mathrm{T}} \mathbf{m}   (4.34)

where we have used

\sum_{n=1}^{N} t_n = N_1 \frac{N}{N_1} - N_2 \frac{N}{N_2} = 0   (4.35)

and where m is the mean of the total data set and is given by

\mathbf{m} = \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n = \frac{1}{N} \left( N_1 \mathbf{m}_1 + N_2 \mathbf{m}_2 \right).   (4.36)

After some straightforward algebra (Exercise 4.6), and again making use of the choice of t_n, the second equation (4.33) becomes

\left( \mathbf{S}_{\mathrm{W}} + \frac{N_1 N_2}{N} \mathbf{S}_{\mathrm{B}} \right) \mathbf{w} = N \left( \mathbf{m}_1 - \mathbf{m}_2 \right)   (4.37)

where S_W is defined by (4.28), S_B is defined by (4.27), and we have substituted for the bias using (4.34).
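One route through the algebra of Exercise 4.6 is the following sketch. Substituting the bias (4.34) into (4.33), and noting that the target coding gives \sum_{n=1}^{N} t_n \mathbf{x}_n = N(\mathbf{m}_1 - \mathbf{m}_2), equation (4.33) becomes

\left( \sum_{n=1}^{N} \mathbf{x}_n \mathbf{x}_n^{\mathrm{T}} - N \mathbf{m}\mathbf{m}^{\mathrm{T}} \right) \mathbf{w} = N \left( \mathbf{m}_1 - \mathbf{m}_2 \right).

Expanding (4.28) gives \sum_{n} \mathbf{x}_n \mathbf{x}_n^{\mathrm{T}} = \mathbf{S}_{\mathrm{W}} + N_1 \mathbf{m}_1 \mathbf{m}_1^{\mathrm{T}} + N_2 \mathbf{m}_2 \mathbf{m}_2^{\mathrm{T}}, and using (4.36) together with N = N_1 + N_2 gives

N_1 \mathbf{m}_1 \mathbf{m}_1^{\mathrm{T}} + N_2 \mathbf{m}_2 \mathbf{m}_2^{\mathrm{T}} - N \mathbf{m}\mathbf{m}^{\mathrm{T}} = \frac{N_1 N_2}{N} \left( \mathbf{m}_2 - \mathbf{m}_1 \right)\left( \mathbf{m}_2 - \mathbf{m}_1 \right)^{\mathrm{T}} = \frac{N_1 N_2}{N} \mathbf{S}_{\mathrm{B}}

by (4.27), from which (4.37) follows.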
Using (4.27), we note that S_B w is always in the direction of (m_2 − m_1). Thus we can write

\mathbf{w} \propto \mathbf{S}_{\mathrm{W}}^{-1} \left( \mathbf{m}_2 - \mathbf{m}_1 \right)   (4.38)

where we have ignored irrelevant scale factors. Thus the weight vector coincides with that found from the Fisher criterion. In addition, we have also found an expression for the bias value w_0 given by (4.34). This tells us that a new vector x should be classified as belonging to class C1 if y(x) = w^T(x − m) > 0 and class C2 otherwise.
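To see this equivalence numerically, here is a minimal sketch (assuming NumPy and a small synthetic Gaussian data set; an illustration, not code from the text) that fits the least-squares solution with the ±N/N_k target coding and checks that the resulting weight vector is parallel, up to sign and scale, to S_W^{-1}(m_2 − m_1):

import numpy as np

rng = np.random.default_rng(0)

# Synthetic two-class data (assumed Gaussian clusters, for illustration only).
X1 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(60, 2))   # class C1
X2 = rng.normal(loc=[3.0, 1.5], scale=1.0, size=(40, 2))   # class C2
X = np.vstack([X1, X2])
N1, N2 = len(X1), len(X2)
N = N1 + N2

# Target coding from the text: +N/N1 for C1, -N/N2 for C2.
t = np.concatenate([np.full(N1, N / N1), np.full(N2, -N / N2)])

# Least-squares solution for (w0, w), with a column of ones for the bias.
Xb = np.hstack([np.ones((N, 1)), X])
w_ls = np.linalg.lstsq(Xb, t, rcond=None)[0][1:]   # drop the bias component

# Fisher direction S_W^{-1} (m2 - m1), up to scale.
m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
SW = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
w_fisher = np.linalg.solve(SW, m2 - m1)

# The two directions should be (anti)parallel: |cosine| close to 1.
cos = w_ls @ w_fisher / (np.linalg.norm(w_ls) * np.linalg.norm(w_fisher))
print(abs(cos))   # approximately 1.0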
4.1.6 Fisher's discriminant for multiple classes

We now consider the generalization of the Fisher discriminant to K > 2 classes, and we shall assume that the dimensionality D of the input space is greater than the number K of classes. Next, we introduce D′ > 1 linear 'features' y_k = w_k^T x, where k = 1, ..., D′. These feature values can conveniently be grouped together to form a vector y. Similarly, the weight vectors {w_k} can be considered to be the columns of a matrix W, so that

\mathbf{y} = \mathbf{W}^{\mathrm{T}} \mathbf{x}.   (4.39)

Note that again we are not including any bias parameters in the definition of y. The generalization of the within-class covariance matrix to the case of K classes follows from (4.28) to give

\mathbf{S}_{\mathrm{W}} = \sum_{k=1}^{K} \mathbf{S}_k   (4.40)

where

\mathbf{S}_k = \sum_{n \in \mathcal{C}_k} \left( \mathbf{x}_n - \mathbf{m}_k \right)\left( \mathbf{x}_n - \mathbf{m}_k \right)^{\mathrm{T}}   (4.41)

\mathbf{m}_k = \frac{1}{N_k} \sum_{n \in \mathcal{C}_k} \mathbf{x}_n   (4.42)

and N_k is the number of patterns in class C_k. In order to find a generalization of the between-class covariance matrix, we follow Duda and Hart (1973) and consider first the total covariance matrix

\mathbf{S}_{\mathrm{T}} = \sum_{n=1}^{N} \left( \mathbf{x}_n - \mathbf{m} \right)\left( \mathbf{x}_n - \mathbf{m} \right)^{\mathrm{T}}   (4.43)

where m is the mean of the total data set

\mathbf{m} = \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n = \frac{1}{N} \sum_{k=1}^{K} N_k \mathbf{m}_k   (4.44)

and N = \sum_k N_k is the total number of data points.
The total covariance matrix can be decomposed into the sum of the within-class covariance matrix, given by (4.40) and (4.41), plus an additional matrix S_B, which we identify as a measure of the between-class covariance

\mathbf{S}_{\mathrm{T}} = \mathbf{S}_{\mathrm{W}} + \mathbf{S}_{\mathrm{B}}   (4.45)

where

\mathbf{S}_{\mathrm{B}} = \sum_{k=1}^{K} N_k \left( \mathbf{m}_k - \mathbf{m} \right)\left( \mathbf{m}_k - \mathbf{m} \right)^{\mathrm{T}}.   (4.46)

These covariance matrices have been defined in the original x-space. We can now define similar matrices in the projected D′-dimensional y-space

\mathbf{s}_{\mathrm{W}} = \sum_{k=1}^{K} \sum_{n \in \mathcal{C}_k} \left( \mathbf{y}_n - \boldsymbol{\mu}_k \right)\left( \mathbf{y}_n - \boldsymbol{\mu}_k \right)^{\mathrm{T}}   (4.47)

and

\mathbf{s}_{\mathrm{B}} = \sum_{k=1}^{K} N_k \left( \boldsymbol{\mu}_k - \boldsymbol{\mu} \right)\left( \boldsymbol{\mu}_k - \boldsymbol{\mu} \right)^{\mathrm{T}}   (4.48)

where

\boldsymbol{\mu}_k = \frac{1}{N_k} \sum_{n \in \mathcal{C}_k} \mathbf{y}_n, \qquad \boldsymbol{\mu} = \frac{1}{N} \sum_{k=1}^{K} N_k \boldsymbol{\mu}_k.   (4.49)
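The decomposition (4.45) can be checked directly from the definitions. The following sketch (assuming NumPy and synthetic data with K = 3 classes; an illustration rather than code from the text) builds S_W, S_B, and S_T and verifies that they add up:

import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: K = 3 classes in a D = 4 dimensional input space (assumed).
classes = [rng.normal(loc=c, scale=1.0, size=(n, 4))
           for c, n in [(0.0, 30), (2.0, 50), (-3.0, 20)]]

X = np.vstack(classes)
N = len(X)
m = X.mean(axis=0)                      # total mean, (4.44)

SW = np.zeros((4, 4))
SB = np.zeros((4, 4))
for Xk in classes:
    Nk = len(Xk)
    mk = Xk.mean(axis=0)                # class mean, (4.42)
    SW += (Xk - mk).T @ (Xk - mk)       # within-class term, (4.41)
    SB += Nk * np.outer(mk - m, mk - m) # between-class term, (4.46)

ST = (X - m).T @ (X - m)                # total covariance matrix, (4.43)

print(np.allclose(ST, SW + SB))         # True: S_T = S_W + S_B, as in (4.45)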
Again we wish to construct a scalar that is large when the between-class covariance is large and when the within-class covariance is small. There are now many possible choices of criterion (Fukunaga, 1990). One example is given by

J(\mathbf{W}) = \mathrm{Tr}\left\{ \mathbf{s}_{\mathrm{W}}^{-1} \mathbf{s}_{\mathrm{B}} \right\}.   (4.50)

This criterion can then be rewritten as an explicit function of the projection matrix W in the form

J(\mathbf{W}) = \mathrm{Tr}\left\{ \left( \mathbf{W}^{\mathrm{T}} \mathbf{S}_{\mathrm{W}} \mathbf{W} \right)^{-1} \left( \mathbf{W}^{\mathrm{T}} \mathbf{S}_{\mathrm{B}} \mathbf{W} \right) \right\}.   (4.51)

Maximization of such criteria is straightforward, though somewhat involved, and is discussed at length in Fukunaga (1990).
The weight values are determined by those eigenvectors of S_W^{-1} S_B that correspond to the D′ largest eigenvalues.

There is one important result that is common to all such criteria, which is worth emphasizing. We first note from (4.46) that S_B is composed of the sum of K matrices, each of which is an outer product of two vectors and therefore of rank 1. In addition, only (K − 1) of these matrices are independent as a result of the constraint (4.44). Thus, S_B has rank at most equal to (K − 1) and so there are at most (K − 1) nonzero eigenvalues. This shows that the projection onto the (K − 1)-dimensional subspace spanned by the eigenvectors of S_B does not alter the value of J(W), and so we are therefore unable to find more than (K − 1) linear 'features' by this means (Fukunaga, 1990).
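A possible implementation of this projection (a sketch assuming NumPy; the function name fisher_projection and the synthetic data are illustrative choices, not from the text) forms S_W^{-1} S_B, keeps the eigenvectors with the D′ largest eigenvalues as the columns of W, and, because rank(S_B) is at most K − 1, finds at most K − 1 nonzero eigenvalues:

import numpy as np

def fisher_projection(class_data, d_prime):
    """Return a D x d_prime projection matrix W whose columns are the
    eigenvectors of S_W^{-1} S_B with the largest eigenvalues."""
    X = np.vstack(class_data)
    D = X.shape[1]
    m = X.mean(axis=0)
    SW = np.zeros((D, D))
    SB = np.zeros((D, D))
    for Xk in class_data:
        mk = Xk.mean(axis=0)
        SW += (Xk - mk).T @ (Xk - mk)
        SB += len(Xk) * np.outer(mk - m, mk - m)
    # Eigen-decomposition of S_W^{-1} S_B (not symmetric in general, so the
    # eigenvalues may carry tiny imaginary parts numerically).
    evals, evecs = np.linalg.eig(np.linalg.solve(SW, SB))
    order = np.argsort(-evals.real)
    # At most K - 1 eigenvalues are nonzero, so d_prime <= K - 1 is sensible.
    return evecs[:, order[:d_prime]].real

# Example with K = 3 classes in D = 4 dimensions: at most 2 useful features.
rng = np.random.default_rng(2)
class_data = [rng.normal(loc=c, scale=1.0, size=(n, 4))
              for c, n in [(0.0, 30), (2.0, 50), (-3.0, 20)]]
W = fisher_projection(class_data, d_prime=2)
Y = np.vstack(class_data) @ W        # projected features y = W^T x, row-wise
print(W.shape, Y.shape)              # (4, 2) (100, 2)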
4.1.7 The perceptron algorithm

Another example of a linear discriminant model is the perceptron of Rosenblatt (1962), which occupies an important place in the history of pattern recognition algorithms. It corresponds to a two-class model in which the input vector x is first transformed using a fixed nonlinear transformation to give a feature vector φ(x), and this is then used to construct a generalized linear model of the form

y(\mathbf{x}) = f\left( \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}) \right)   (4.52)

where the nonlinear activation function f(·) is given by a step function of the form

f(a) = \begin{cases} +1, & a \geqslant 0 \\ -1, & a < 0. \end{cases}   (4.53)

The vector φ(x) will typically include a bias component φ_0(x) = 1.
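As a minimal concrete rendering of (4.52) and (4.53), the following sketch (assuming NumPy and, purely for illustration, the feature vector φ(x) = (1, x) so that φ_0(x) = 1 supplies the bias) evaluates the perceptron output for a single input:

import numpy as np

def perceptron_predict(x, w):
    """Evaluate y(x) = f(w^T phi(x)) with the step activation (4.53),
    using the illustrative feature map phi(x) = (1, x)."""
    phi = np.concatenate(([1.0], np.asarray(x, dtype=float)))
    a = w @ phi
    return 1 if a >= 0 else -1

# Example: w = (w0, w1, w2), with w0 the bias weight (values are arbitrary).
w = np.array([-1.0, 2.0, 0.5])
print(perceptron_predict([1.0, -0.5], w))   # +1, since a = -1 + 2 - 0.25 >= 0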
In earlier discussions of two-class classification problems, we have focussed on a target coding scheme in which t ∈ {0, 1}, which is appropriate in the context of probabilistic models. For the perceptron, however, it is more convenient to use target values t = +1 for class C1 and t = −1 for class C2, which matches the choice of activation function.

The algorithm used to determine the parameters w of the perceptron can most easily be motivated by error function minimization. A natural choice of error function would be the total number of misclassified patterns. However, this does not lead to a simple learning algorithm because the error is a piecewise constant function of w, with discontinuities wherever a change in w causes the decision boundary to move across one of the data points.
Methods based on changing w using the gradient of the error function cannot then be applied, because the gradient is zero almost everywhere.

We therefore consider an alternative error function known as the perceptron criterion. To derive this, we note that we are seeking a weight vector w such that patterns x_n in class C1 will have w^T φ(x_n) > 0, whereas patterns x_n in class C2 have w^T φ(x_n) < 0. Using the t ∈ {−1, +1} target coding scheme it follows that we would like all patterns to satisfy w^T φ(x_n) t_n > 0. The perceptron criterion associates zero error with any pattern that is correctly classified, whereas for a misclassified pattern x_n it tries to minimize the quantity −w^T φ(x_n) t_n. The perceptron criterion is therefore given by

E_{\mathrm{P}}(\mathbf{w}) = - \sum_{n \in \mathcal{M}} \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}_n t_n   (4.54)

where M denotes the set of all misclassified patterns. The contribution to the error associated with a particular misclassified pattern is a linear function of w in regions of w space where the pattern is misclassified and zero in regions where it is correctly classified. The total error function is therefore piecewise linear.

Frank Rosenblatt (1928–1969). Rosenblatt's perceptron played an important role in the history of machine learning. Initially, Rosenblatt simulated the perceptron on an IBM 704 computer at Cornell in 1957, but by the early 1960s he had built special-purpose hardware that provided a direct, parallel implementation of perceptron learning. Many of his ideas were encapsulated in "Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms", published in 1962. Rosenblatt's work was criticized by Marvin Minsky, whose objections were published in the book "Perceptrons", co-authored with Seymour Papert. This book was widely misinterpreted at the time as showing that neural networks were fatally flawed and could only learn solutions for linearly separable problems. In fact, it only proved such limitations in the case of single-layer networks such as the perceptron and merely conjectured (incorrectly) that they applied to more general network models. Unfortunately, however, this book contributed to the substantial decline in research funding for neural computing, a situation that was not reversed until the mid-1980s. Today, there are many hundreds, if not thousands, of applications of neural networks in widespread use, with examples in areas such as handwriting recognition and information retrieval being used routinely by millions of people.

We now apply the stochastic gradient descent algorithm (Section 3.1.3) to this error function. The change in the weight vector w is then given by

\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta \nabla E_{\mathrm{P}}(\mathbf{w}) = \mathbf{w}^{(\tau)} + \eta \boldsymbol{\phi}_n t_n   (4.55)

where η is the learning rate parameter and τ is an integer that indexes the steps of the algorithm. Because the perceptron function y(x, w) is unchanged if we multiply w by a constant, we can set the learning rate parameter η equal to 1 without loss of generality. Note that, as the weight vector evolves during training, the set of patterns that are misclassified will change.

The perceptron learning algorithm has a simple interpretation, as follows. We cycle through the training patterns in turn, and for each pattern x_n we evaluate the perceptron function (4.52). If the pattern is correctly classified, then the weight vector remains unchanged, whereas if it is incorrectly classified, then for class C1 we add the vector φ(x_n) onto the current estimate of the weight vector w while for class C2 we subtract the vector φ(x_n) from w.
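The cycling procedure just described can be sketched in a few lines (assuming NumPy, a linearly separable synthetic data set, and the illustrative feature map φ(x) = (1, x); this is an illustration, not code from the text), applying the update (4.55) with η = 1 to each misclassified pattern:

import numpy as np

def perceptron_train(Phi, t, max_epochs=100):
    """Perceptron learning: cycle through the patterns and, for each
    misclassified one, apply w <- w + phi_n * t_n, i.e. (4.55) with eta = 1."""
    w = np.zeros(Phi.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for phi_n, t_n in zip(Phi, t):
            if np.sign(w @ phi_n) * t_n <= 0:   # misclassified (or on boundary)
                w += phi_n * t_n
                errors += 1
        if errors == 0:                          # all patterns correctly classified
            break
    return w

# Linearly separable synthetic data; phi(x) = (1, x) adds the bias feature.
rng = np.random.default_rng(3)
X1 = rng.normal(loc=[-2.0, -2.0], size=(50, 2))   # class C1, t = +1
X2 = rng.normal(loc=[+2.0, +2.0], size=(50, 2))   # class C2, t = -1
X = np.vstack([X1, X2])
t = np.concatenate([np.ones(50), -np.ones(50)])
Phi = np.hstack([np.ones((100, 1)), X])

w = perceptron_train(Phi, t)
print(np.all(np.sign(Phi @ w) == t))   # True once a separating w has been found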