This two-stage approach is equivalent to assuming that the output y(x) of the support vector machine represents the log-odds of x belonging to class t = 1. Because the SVM training procedure is not specifically intended to encourage this, the SVM can give a poor approximation to the posterior probabilities (Tipping, 2001).

7.1.2 Relation to logistic regression

As with the separable case, we can re-cast the SVM for nonseparable distributions in terms of the minimization of a regularized error function.
This will also allow us to highlight similarities, and differences, compared to the logistic regression model of Section 4.3.2.

We have seen that for data points that are on the correct side of the margin boundary, and which therefore satisfy $y_n t_n \geqslant 1$, we have $\xi_n = 0$, and for the remaining points we have $\xi_n = 1 - y_n t_n$. Thus the objective function (7.21) can be written (up to an overall multiplicative constant) in the form

$$\sum_{n=1}^{N} E_{SV}(y_n t_n) + \lambda \|\mathbf{w}\|^2 \qquad (7.44)$$

where $\lambda = (2C)^{-1}$, and $E_{SV}(\cdot)$ is the hinge error function defined by

$$E_{SV}(y_n t_n) = [1 - y_n t_n]_{+} \qquad (7.45)$$

where $[\,\cdot\,]_{+}$ denotes the positive part.

[Figure 7.5: Plot of the 'hinge' error function used in support vector machines, shown in blue, along with the error function for logistic regression, rescaled by a factor of 1/ln(2) so that it passes through the point (0, 1), shown in red. Also shown are the misclassification error in black and the squared error in green.]
The hinge error function, so-called because of its shape, is plotted in Figure 7.5. It can be viewed as an approximation to the misclassification error, i.e., the error function that ideally we would like to minimize, which is also shown in Figure 7.5.

When we considered the logistic regression model in Section 4.3.2, we found it convenient to work with target variable t ∈ {0, 1}. For comparison with the support vector machine, we first reformulate maximum likelihood logistic regression using the target variable t ∈ {−1, 1}. To do this, we note that p(t = 1|y) = σ(y) where y(x) is given by (7.1), and σ(y) is the logistic sigmoid function defined by (4.59). It follows that p(t = −1|y) = 1 − σ(y) = σ(−y), where we have used the properties of the logistic sigmoid function, and so we can write

$$p(t|y) = \sigma(yt). \qquad (7.46)$$

From this we can construct an error function (Exercise 7.6) by taking the negative logarithm of the likelihood function that, with a quadratic regularizer, takes the form

$$\sum_{n=1}^{N} E_{LR}(y_n t_n) + \lambda \|\mathbf{w}\|^2 \qquad (7.47)$$

where

$$E_{LR}(yt) = \ln\left(1 + \exp(-yt)\right). \qquad (7.48)$$
For comparison with other error functions, we can divide by ln(2) so that the error function passes through the point (0, 1). This rescaled error function is also plotted in Figure 7.5 and we see that it has a similar form to the support vector error function. The key difference is that the flat region in $E_{SV}(yt)$ leads to sparse solutions.
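To make the comparison in Figure 7.5 concrete, the following short sketch (ours, in Python with NumPy; the book itself contains no code) evaluates the hinge error (7.45), the logistic error (7.48) rescaled by 1/ln 2, the misclassification error, and the squared error as functions of z = yt.

```python
import numpy as np

def hinge_error(z):
    """E_SV(z) = [1 - z]_+ , the hinge error of equation (7.45)."""
    return np.maximum(0.0, 1.0 - z)

def logistic_error_rescaled(z):
    """E_LR(z) = ln(1 + exp(-z)) of (7.48), divided by ln 2 so it passes through (0, 1)."""
    return np.log1p(np.exp(-z)) / np.log(2.0)

def misclassification_error(z):
    """Ideal 0-1 error: 1 if z = y*t is negative (wrong side of the boundary), else 0."""
    return (z < 0).astype(float)

def squared_error(z):
    """Squared error (y - t)^2 written as a function of z = y*t, i.e. (z - 1)^2 for t in {-1, +1}."""
    return (z - 1.0) ** 2

z = np.linspace(-2.0, 2.0, 9)
for name, f in [("hinge", hinge_error),
                ("logistic/ln2", logistic_error_rescaled),
                ("0-1", misclassification_error),
                ("squared", squared_error)]:
    print(name, np.round(f(z), 3))
```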
Both the logistic error and the hinge loss can be viewed as continuous approximations to the misclassification error. Another continuous error function that has sometimes been used to solve classification problems is the squared error, which is again plotted in Figure 7.5. It has the property, however, of placing increasing emphasis on data points that are correctly classified but that are a long way from the decision boundary on the correct side. Such points will be strongly weighted at the expense of misclassified points, and so if the objective is to minimize the misclassification rate, then a monotonically decreasing error function would be a better choice.

7.1.3 Multiclass SVMs

The support vector machine is fundamentally a two-class classifier. In practice, however, we often have to tackle problems involving K > 2 classes. Various methods have therefore been proposed for combining multiple two-class SVMs in order to build a multiclass classifier.

One commonly used approach (Vapnik, 1998) is to construct K separate SVMs, in which the $k$th model $y_k(\mathbf{x})$ is trained using the data from class $\mathcal{C}_k$ as the positive examples and the data from the remaining K − 1 classes as the negative examples. This is known as the one-versus-the-rest approach. However, in Figure 4.2 we saw that using the decisions of the individual classifiers can lead to inconsistent results in which an input is assigned to multiple classes simultaneously.
This problem is sometimes addressed by making predictions for new inputs $\mathbf{x}$ using

$$y(\mathbf{x}) = \max_{k} y_k(\mathbf{x}). \qquad (7.49)$$

Unfortunately, this heuristic approach suffers from the problem that the different classifiers were trained on different tasks, and there is no guarantee that the real-valued quantities $y_k(\mathbf{x})$ for different classifiers will have appropriate scales.
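As an illustration of the one-versus-the-rest construction and the heuristic (7.49), here is a minimal sketch using scikit-learn's SVC; the library choice, the helper names, and the toy data are ours rather than anything prescribed in the text.

```python
import numpy as np
from sklearn.svm import SVC

def train_one_vs_rest(X, t, n_classes, C=1.0):
    """Train K binary SVMs; classifier k treats class k as positive, the rest as negative."""
    models = []
    for k in range(n_classes):
        binary_targets = np.where(t == k, 1, -1)
        models.append(SVC(kernel='rbf', C=C).fit(X, binary_targets))
    return models

def predict_one_vs_rest(models, X):
    """Heuristic (7.49): assign x to the class whose classifier gives the largest y_k(x).
    The decision values of the K classifiers are not guaranteed to be on comparable
    scales, which is exactly the weakness discussed in the text."""
    scores = np.column_stack([m.decision_function(X) for m in models])
    return np.argmax(scores, axis=1)

# toy usage with random data (purely illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
t = rng.integers(0, 3, size=60)
models = train_one_vs_rest(X, t, n_classes=3)
print(predict_one_vs_rest(models, X[:5]))
```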
Another problem with the one-versus-the-rest approach is that the training sets are imbalanced. For instance, if we have ten classes each with equal numbers of training data points, then the individual classifiers are trained on data sets comprising 90% negative examples and only 10% positive examples, and the symmetry of the original problem is lost. A variant of the one-versus-the-rest scheme was proposed by Lee et al. (2001) who modify the target values so that the positive class has target +1 and the negative class has target −1/(K − 1).

Weston and Watkins (1999) define a single objective function for training all K SVMs simultaneously, based on maximizing the margin from each class to the remaining classes. However, this can result in much slower training because, instead of solving K separate optimization problems each over N data points with an overall cost of $O(KN^2)$, a single optimization problem of size (K − 1)N must be solved, giving an overall cost of $O(K^2 N^2)$.
Another approach is to train K(K − 1)/2 different two-class SVMs on all possible pairs of classes, and then to classify test points according to which class has the highest number of 'votes', an approach that is sometimes called one-versus-one. Again, we saw in Figure 4.2 that this can lead to ambiguities in the resulting classification. Also, for large K this approach requires significantly more training time than the one-versus-the-rest approach. Similarly, to evaluate test points, significantly more computation is required.
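A corresponding sketch of the one-versus-one voting scheme is given below; again the helper names and the use of scikit-learn's SVC are our own illustrative choices.

```python
from itertools import combinations
import numpy as np
from sklearn.svm import SVC

def train_one_vs_one(X, t, n_classes, C=1.0):
    """Train K(K-1)/2 binary SVMs, one for each unordered pair of classes (i, j)."""
    models = {}
    for i, j in combinations(range(n_classes), 2):
        mask = (t == i) | (t == j)
        # label class j as +1 and class i as -1 for this pairwise problem
        models[(i, j)] = SVC(kernel='rbf', C=C).fit(X[mask], np.where(t[mask] == j, 1, -1))
    return models

def predict_one_vs_one(models, X, n_classes):
    """Each pairwise classifier casts one vote per test point; the class with most votes wins."""
    votes = np.zeros((X.shape[0], n_classes), dtype=int)
    for (i, j), m in models.items():
        pred = m.predict(X)          # +1 means class j, -1 means class i
        votes[pred == 1, j] += 1
        votes[pred == -1, i] += 1
    return np.argmax(votes, axis=1)  # ties are broken arbitrarily (lowest index wins)
```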
The latter problem can be alleviated by organizing the pairwise classifiers into a directed acyclic graph (not to be confused with a probabilistic graphical model), leading to the DAGSVM (Platt et al., 2000). For K classes, the DAGSVM has a total of K(K − 1)/2 classifiers, and to classify a new test point only K − 1 pairwise classifiers need to be evaluated, with the particular classifiers used depending on which path through the graph is traversed.

A different approach to multiclass classification, based on error-correcting output codes, was developed by Dietterich and Bakiri (1995) and applied to support vector machines by Allwein et al. (2000).
This can be viewed as a generalization of the voting scheme of the one-versus-one approach in which more general partitions of the classes are used to train the individual classifiers. The K classes themselves are represented as particular sets of responses from the two-class classifiers chosen, and together with a suitable decoding scheme, this gives robustness to errors and to ambiguity in the outputs of the individual classifiers.
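The decoding step can be illustrated with a small sketch that assigns a test point to the class whose code word is nearest, in Hamming distance, to the observed vector of two-class responses; the particular code matrix below is an arbitrary example of ours, not one taken from Dietterich and Bakiri (1995).

```python
import numpy as np

# Example code matrix for K = 4 classes and 6 binary classifiers:
# row k is the pattern of +1/-1 responses that ideally identifies class k.
CODE = np.array([[+1, +1, +1, -1, -1, -1],
                 [+1, -1, -1, +1, +1, -1],
                 [-1, +1, -1, +1, -1, +1],
                 [-1, -1, +1, -1, +1, +1]])

def ecoc_decode(responses, code=CODE):
    """Assign each input to the class whose code word is closest in Hamming distance
    to the observed vector of binary classifier outputs (responses has shape
    (n_samples, n_classifiers) with entries in {-1, +1})."""
    distances = np.array([[(r != c).sum() for c in code] for r in responses])
    return np.argmin(distances, axis=1)

# A response with one flipped bit is still decoded to the intended class:
print(ecoc_decode(np.array([[+1, +1, +1, -1, -1, +1]])))  # -> [0]
```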
Although the application of SVMs to multiclass classification problems remains an open issue, in practice the one-versus-the-rest approach is the most widely used in spite of its ad hoc formulation and its practical limitations.

There are also single-class support vector machines, which solve an unsupervised learning problem related to probability density estimation. Instead of modelling the density of data, however, these methods aim to find a smooth boundary enclosing a region of high density. The boundary is chosen to represent a quantile of the density, that is, the probability that a data point drawn from the distribution will land inside that region is given by a fixed number between 0 and 1 that is specified in advance. This is a more restricted problem than estimating the full density but may be sufficient in specific applications. Two approaches to this problem using support vector machines have been proposed. The algorithm of Schölkopf et al. (2001) tries to find a hyperplane that separates all but a fixed fraction ν of the training data from the origin while at the same time maximizing the distance (margin) of the hyperplane from the origin, while Tax and Duin (1999) look for the smallest sphere in feature space that contains all but a fraction ν of the data points. For kernels $k(\mathbf{x}, \mathbf{x}')$ that are functions only of $\mathbf{x} - \mathbf{x}'$, the two algorithms are equivalent.
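As a practical illustration of the first of these formulations, scikit-learn's OneClassSVM exposes a nu parameter that plays the role of the fraction ν above; the data and parameter settings in this sketch are purely illustrative.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2))            # samples from the density of interest

# nu upper-bounds the fraction of training points allowed to fall outside the region
model = OneClassSVM(kernel='rbf', nu=0.05, gamma='scale').fit(X_train)

X_test = np.array([[0.0, 0.0],                 # near the bulk of the density
                   [5.0, 5.0]])                # far outside it
print(model.predict(X_test))                   # +1 = inside the region, -1 = outside
print(model.decision_function(X_test))         # signed distance from the boundary
```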
7.1.4 SVMs for regression

We now extend support vector machines to regression problems while at the same time preserving the property of sparseness. In simple linear regression (Section 3.1.4), we minimize a regularized error function given by

$$\frac{1}{2}\sum_{n=1}^{N} \{y_n - t_n\}^2 + \frac{\lambda}{2}\|\mathbf{w}\|^2. \qquad (7.50)$$

To obtain sparse solutions, the quadratic error function is replaced by an ε-insensitive error function (Vapnik, 1995), which gives zero error if the absolute difference between the prediction y(x) and the target t is less than ε, where ε > 0.

[Figure 7.6: Plot of an ε-insensitive error function (in red) in which the error increases linearly with distance beyond the insensitive region. Also shown for comparison is the quadratic error function (in green).]
A simple example of an ε-insensitive error function, having a linear cost associated with errors outside the insensitive region, is given by

$$E_\epsilon(y(\mathbf{x}) - t) = \begin{cases} 0, & \text{if } |y(\mathbf{x}) - t| < \epsilon; \\ |y(\mathbf{x}) - t| - \epsilon, & \text{otherwise} \end{cases} \qquad (7.51)$$

and is illustrated in Figure 7.6.

We therefore minimize a regularized error function given by

$$C \sum_{n=1}^{N} E_\epsilon(y(\mathbf{x}_n) - t_n) + \frac{1}{2}\|\mathbf{w}\|^2 \qquad (7.52)$$

where y(x) is given by (7.1). By convention the (inverse) regularization parameter, denoted C, appears in front of the error term.
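The definitions (7.51) and (7.52) translate directly into code; the following NumPy sketch (with function names of our own choosing) makes the tube structure explicit.

```python
import numpy as np

def eps_insensitive_error(residual, eps):
    """E_epsilon of (7.51): zero inside the tube of half-width eps, linear outside it."""
    return np.maximum(0.0, np.abs(residual) - eps)

def svr_objective(w, y_pred, t, C, eps):
    """Regularized error (7.52): C * sum of eps-insensitive errors + ||w||^2 / 2."""
    return C * eps_insensitive_error(y_pred - t, eps).sum() + 0.5 * np.dot(w, w)

# small check: residuals inside the tube contribute nothing to the error term
print(eps_insensitive_error(np.array([-0.3, 0.05, 0.8]), eps=0.1))  # approx. [0.2, 0.0, 0.7]
```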
As before, we can re-express the optimization problem by introducing slack variables. For each data point $\mathbf{x}_n$, we now need two slack variables $\xi_n \geqslant 0$ and $\widehat{\xi}_n \geqslant 0$, where $\xi_n > 0$ corresponds to a point for which $t_n > y(\mathbf{x}_n) + \epsilon$, and $\widehat{\xi}_n > 0$ corresponds to a point for which $t_n < y(\mathbf{x}_n) - \epsilon$, as illustrated in Figure 7.7.

The condition for a target point to lie inside the ε-tube is that $y_n - \epsilon \leqslant t_n \leqslant y_n + \epsilon$, where $y_n = y(\mathbf{x}_n)$. Introducing the slack variables allows points to lie outside the tube provided the slack variables are nonzero, and the corresponding conditions are

$$t_n \leqslant y(\mathbf{x}_n) + \epsilon + \xi_n \qquad (7.53)$$

$$t_n \geqslant y(\mathbf{x}_n) - \epsilon - \widehat{\xi}_n. \qquad (7.54)$$
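For completeness, an ε-insensitive support vector regression of this kind can be fitted with scikit-learn's SVR, whose C and epsilon arguments correspond to the parameters C and ε above; the synthetic data and settings are illustrative only.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3.0, 3.0, size=(100, 1)), axis=0)
t = np.sin(X).ravel() + 0.1 * rng.normal(size=100)      # noisy targets

# C plays the role of the (inverse) regularization parameter in (7.52),
# epsilon is the half-width of the insensitive tube in (7.51)
model = SVR(kernel='rbf', C=1.0, epsilon=0.1).fit(X, t)

print("number of support vectors:", len(model.support_))
print("predictions at a few points:", model.predict(np.array([[-2.0], [0.0], [2.0]])))
```

Only the data points lying on or outside the ε-tube become support vectors, which is the source of the sparseness referred to above.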