An Introduction to Information Retrieval, Manning and Raghavan (2009) — excerpt
Another strategy is to build a set of one-versus-one classifiers, and to choose the class that is selected by the most classifiers. While this involves building |C|(|C| − 1)/2 classifiers, the time for training classifiers may actually decrease, since the training data set for each classifier is much smaller. However, these are not very elegant approaches to solving multiclass problems.
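As a concrete illustration (a minimal sketch, not from the book, assuming a generic binary trainer whose models expose a `.predict` method returning +1/−1), pairwise training and majority voting could look like this:

```python
import itertools
from collections import Counter

import numpy as np

def train_one_vs_one(X, y, train_binary):
    """Train |C|(|C|-1)/2 pairwise classifiers.

    train_binary(X_pair, y_pair) is any procedure returning a model with a
    .predict(X) method producing +1 / -1 labels (hypothetical interface).
    """
    classes = np.unique(y)
    models = {}
    for c1, c2 in itertools.combinations(classes, 2):
        mask = (y == c1) | (y == c2)              # keep only the two classes' examples
        y_pair = np.where(y[mask] == c1, 1, -1)
        models[(c1, c2)] = train_binary(X[mask], y_pair)
    return models

def predict_one_vs_one(models, x):
    """Each pairwise classifier casts a vote; the most-voted class wins."""
    votes = Counter()
    for (c1, c2), model in models.items():
        votes[c1 if model.predict(x[None, :])[0] > 0 else c2] += 1
    return votes.most_common(1)[0][0]
```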
A better alternative is provided by the construction of multiclass SVMs, where we build a two-class classifier over a feature vector $\Phi(\vec{x}, y)$ derived from the pair consisting of the input features and the class of the datum. At test time, the classifier chooses the class $y = \arg\max_{y'} \vec{w}^{T}\Phi(\vec{x}, y')$. The margin during training is the gap between this value for the correct class and for the nearest other class, and so the quadratic program formulation will require that $\forall i\ \forall y \neq y_i:\ \vec{w}^{T}\Phi(\vec{x}_i, y_i) - \vec{w}^{T}\Phi(\vec{x}_i, y) \geq 1 - \xi_i$.
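One common choice of $\Phi(\vec{x}, y)$, not spelled out in the text, stacks the input features into the block of a long weight vector corresponding to class $y$; the sketch below (Python, illustrative only) shows that choice and the resulting arg max prediction:

```python
import numpy as np

def joint_feature_map(x, y, num_classes):
    """Phi(x, y): copy x into the block for class y, zeros elsewhere."""
    d = x.shape[0]
    phi = np.zeros(num_classes * d)
    phi[y * d:(y + 1) * d] = x
    return phi

def predict_multiclass(w, x, num_classes):
    """Choose the class y maximizing w^T Phi(x, y)."""
    scores = [w @ joint_feature_map(x, y, num_classes) for y in range(num_classes)]
    return int(np.argmax(scores))
```

With this choice of $\Phi$, the score $\vec{w}^{T}\Phi(\vec{x}, y)$ is simply the score of class $y$'s own weight block, recovering the familiar picture of one weight vector per class.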
This general method can be extended to give a multiclass formulation of various kinds of linear classifiers. It is also a simple instance of a generalization of classification where the classes are not just a set of independent, categorical labels, but may be arbitrary structured objects with relationships defined between them. In the SVM world, such work comes under the label of structural SVMs. We mention them again in Section 15.4.2.

15.2.3 Nonlinear SVMs

With what we have presented so far, data sets that are linearly separable (perhaps with a few exceptions or some noise) are well-handled.
But what are we going to do if the data set just doesn't allow classification by a linear classifier? Let us look at a one-dimensional case. The top data set in Figure 15.6 is straightforwardly classified by a linear classifier but the middle data set is not. We instead need to be able to pick out an interval. One way to solve this problem is to map the data on to a higher dimensional space and then to use a linear classifier in the higher dimensional space.
For example, the bottom part of the figure shows that a linear separator can easily classify the data if we use a quadratic function to map the data into two dimensions (a polar coordinates projection would be another possibility). The general idea is to map the original feature space to some higher-dimensional feature space where the training set is separable.

◮ Figure 15.6 Projecting data that is not linearly separable into a higher dimensional space can make it linearly separable.

(Footnote 4: Materializing the features refers to directly calculating higher order and interaction terms and then putting them into a linear model.)
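Concretely, in the one-dimensional case just described, a sketch (not the book's code) that maps each point x to (x, x²) shows how the middle interval becomes separable by a line in the new space:

```python
import numpy as np

x = np.array([-4.0, -3.0, -1.0, 0.0, 1.5, 3.0, 4.5])
labels = np.where(np.abs(x) < 2, 1, -1)   # the middle interval is the positive class

# quadratic map: x -> (x, x^2); in this 2-D space the classes are linearly separable
phi = np.column_stack([x, x ** 2])

# e.g. the horizontal line x2 = 4 separates them: positive iff x^2 - 4 < 0
predictions = np.where(phi[:, 1] - 4 < 0, 1, -1)
assert np.array_equal(predictions, labels)
```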
Of course, we would want to do so in ways that preserve relevant dimensions of relatedness between data points, so that the resultant classifier should still generalize well.

SVMs, and also a number of other linear classifiers, provide an easy and efficient way of doing this mapping to a higher dimensional space, which is referred to as "the kernel trick". It's not really a trick: it just exploits the math that we have seen.
The SVM linear classifier relies on a dot product between data point vectors. Let $K(\vec{x}_i, \vec{x}_j) = \vec{x}_i^{T}\vec{x}_j$. Then the classifier we have seen so far is:

(15.13)   $f(\vec{x}) = \mathrm{sign}\big(\textstyle\sum_i \alpha_i y_i K(\vec{x}_i, \vec{x}) + b\big)$

Now suppose we decide to map every data point into a higher dimensional space via some transformation $\Phi: \vec{x} \mapsto \phi(\vec{x})$. Then the dot product becomes $\phi(\vec{x}_i)^{T}\phi(\vec{x}_j)$. If it turned out that this dot product (which is just a real number) could be computed simply and efficiently in terms of the original data points, then we wouldn't have to actually map from $\vec{x} \mapsto \phi(\vec{x})$. Rather, we could simply compute the quantity $K(\vec{x}_i, \vec{x}_j) = \phi(\vec{x}_i)^{T}\phi(\vec{x}_j)$, and then use the function's value in Equation (15.13). A kernel function $K$ is such a function that corresponds to a dot product in some expanded feature space.

Example 15.2: The quadratic kernel in two dimensions. For 2-dimensional vectors $\vec{u} = (u_1\ u_2)$, $\vec{v} = (v_1\ v_2)$, consider $K(\vec{u}, \vec{v}) = (1 + \vec{u}^{T}\vec{v})^2$. We wish to show that this is a kernel, i.e., that $K(\vec{u}, \vec{v}) = \phi(\vec{u})^{T}\phi(\vec{v})$ for some $\phi$. Consider $\phi(\vec{u}) = (1\ \ u_1^2\ \ \sqrt{2}u_1u_2\ \ u_2^2\ \ \sqrt{2}u_1\ \ \sqrt{2}u_2)$. Then:

(15.14)
$K(\vec{u}, \vec{v}) = (1 + \vec{u}^{T}\vec{v})^2$
$\qquad = 1 + u_1^2v_1^2 + 2u_1v_1u_2v_2 + u_2^2v_2^2 + 2u_1v_1 + 2u_2v_2$
$\qquad = (1\ \ u_1^2\ \ \sqrt{2}u_1u_2\ \ u_2^2\ \ \sqrt{2}u_1\ \ \sqrt{2}u_2)^{T}\,(1\ \ v_1^2\ \ \sqrt{2}v_1v_2\ \ v_2^2\ \ \sqrt{2}v_1\ \ \sqrt{2}v_2)$
$\qquad = \phi(\vec{u})^{T}\phi(\vec{v})$

In the language of functional analysis, what kinds of functions are valid kernel functions? Kernel functions are sometimes more precisely referred to as Mercer kernels, because they must satisfy Mercer's condition: for any $g(\vec{x})$ such that $\int g(\vec{x})^2\,d\vec{x}$ is finite, we must have that:

(15.15)   $\int K(\vec{x}, \vec{z})\,g(\vec{x})\,g(\vec{z})\,d\vec{x}\,d\vec{z} \geq 0\,.$

A kernel function $K$ must be continuous, symmetric, and have a positive definite Gram matrix.
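A quick numerical check of Example 15.2 (an illustrative sketch using numpy; the two test vectors are arbitrary):

```python
import numpy as np

def quad_kernel(u, v):
    """K(u, v) = (1 + u^T v)^2, computed directly in the original space."""
    return (1.0 + u @ v) ** 2

def phi(u):
    """Explicit feature map from Example 15.2 for 2-dimensional u."""
    u1, u2 = u
    return np.array([1.0, u1**2, np.sqrt(2) * u1 * u2, u2**2,
                     np.sqrt(2) * u1, np.sqrt(2) * u2])

u, v = np.array([0.5, -1.0]), np.array([2.0, 3.0])
# same value, without ever materializing the 6-dimensional vectors at test time
assert np.isclose(quad_kernel(u, v), phi(u) @ phi(v))
```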
Such a $K$ means that there exists a mapping to a reproducing kernel Hilbert space (a Hilbert space is a vector space closed under dot products) such that the dot product there gives the same value as the function $K$. If a kernel does not satisfy Mercer's condition, then the corresponding QP may have no solution. If you would like to better understand these issues, you should consult the books on SVMs mentioned in Section 15.5.
Otherwise, you can content yourself with knowing that 90% of work with kernels uses one of two straightforward families of functions of two vectors, which we define below, and which define valid kernels.

The two commonly used families of kernels are polynomial kernels and radial basis functions. Polynomial kernels are of the form $K(\vec{x}, \vec{z}) = (1 + \vec{x}^{T}\vec{z})^d$. The case of $d = 1$ is a linear kernel, which is what we had before the start of this section (the constant 1 just changing the threshold). The case of $d = 2$ gives a quadratic kernel, and is very commonly used. We illustrated the quadratic kernel in Example 15.2.

The most common form of radial basis function is a Gaussian distribution, calculated as:

(15.16)   $K(\vec{x}, \vec{z}) = e^{-(\vec{x}-\vec{z})^2/(2\sigma^2)}$

A radial basis function (rbf) is equivalent to mapping the data into an infinite dimensional Hilbert space, and so we cannot illustrate the radial basis function concretely, as we did a quadratic kernel. Beyond these two families, there has been interesting work developing other kernels, some of which is promising for text applications. In particular, there has been investigation of string kernels (see Section 15.5).

The world of SVMs comes with its own language, which is rather different from the language otherwise used in machine learning.
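Both families are easy to write down directly. The sketch below (numpy, with an arbitrary choice of σ and random points) defines them and, echoing Mercer's condition, checks that the Gaussian kernel's Gram matrix has no negative eigenvalues beyond round-off:

```python
import numpy as np

def polynomial_kernel(x, z, d=2):
    """K(x, z) = (1 + x^T z)^d; d = 1 is the linear kernel, d = 2 the quadratic."""
    return (1.0 + x @ z) ** d

def rbf_kernel(x, z, sigma=1.0):
    """Gaussian radial basis function K(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    diff = x - z
    return np.exp(-(diff @ diff) / (2.0 * sigma ** 2))

# Mercer's condition implies the Gram matrix is positive (semi-)definite.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))                       # 20 random 5-dimensional points
K = np.array([[rbf_kernel(xi, xj) for xj in X] for xi in X])
assert np.linalg.eigvalsh(K).min() > -1e-8         # non-negative up to round-off
```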
The terminology does have deep roots in mathematics, but it's important not to be too awed by that terminology. Really, we are talking about some quite simple things. A polynomial kernel allows us to model feature conjunctions (up to the order of the polynomial). That is, if we want to be able to model occurrences of pairs of words, which give distinctive information about topic classification, not given by the individual words alone, like perhaps operating AND system or ethnic AND cleansing, then we need to use a quadratic kernel. If occurrences of triples of words give distinctive information, then we need to use a cubic kernel.
Simultaneously you also get the powers of the basic features – for most text applications, that probably isn't useful, but just comes along with the math and hopefully doesn't do harm. A radial basis function allows you to have features that pick out circles (hyperspheres) – although the decision boundaries become much more complex as multiple such features interact.
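To see the conjunction point concretely, here is an illustrative sketch (toy binary term vectors, not from the book) of the explicit feature space behind the quadratic kernel, which contains one coordinate per pair of terms:

```python
import numpy as np
from itertools import combinations

def phi_quadratic(x):
    """Explicit feature space of (1 + x^T z)^2: constant, single-term, squared,
    and pairwise-conjunction coordinates (generalizing Example 15.2)."""
    n = len(x)
    features = [1.0]
    features += [np.sqrt(2) * xi for xi in x]                                      # single terms
    features += [xi ** 2 for xi in x]                                              # squares
    features += [np.sqrt(2) * x[i] * x[j] for i, j in combinations(range(n), 2)]   # term pairs
    return np.array(features)

# toy vocabulary: [operating, system, ethnic, cleansing]
doc = np.array([1.0, 1.0, 0.0, 0.0])      # mentions "operating" and "system"

# the coordinate for the pair (operating, system) is switched on only when both
# terms occur, so a linear weight in this space can model the conjunction directly
print(phi_quadratic(doc))

# and the kernel still equals the dot product in that expanded space
other = np.array([1.0, 0.0, 1.0, 1.0])
assert np.isclose((1.0 + doc @ other) ** 2, phi_quadratic(doc) @ phi_quadratic(other))
```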
A string kernel lets you have features that are character subsequences of terms. All of these are straightforward notions which have also been used in many other places under different names.

15.2.4 Experimental results

We presented results in Section 13.6 showing that an SVM is a very effective text classifier. The results of Dumais et al. (1998) given in Table 13.9 show SVMs clearly performing the best. This was one of several pieces of work from this time that established the strong reputation of SVMs for text classification. Another pioneering work on scaling and evaluating SVMs for text classification was (Joachims 1998).
We present some of his results from (Joachims 2002a) in Table 15.2. Joachims used a large number of term features in contrast to Dumais et al. (1998), who used MI feature selection (Section 13.5.1, page 272) to build classifiers with a much more limited number of features. The success of the linear SVM mirrors the results discussed in Section 14.6 (page 308) on other linear approaches like Naive Bayes.

                                              linear SVM          rbf-SVM
             NB    Rocchio  Dec.Trees   kNN   C=0.5   C=1.0       σ ≈ 7
earn        96.0    96.1      96.1     97.8    98.0    98.2        98.1
acq         90.7    92.1      85.3     91.8    95.5    95.6        94.7
money-fx    59.6    67.6      69.4     75.4    78.8    78.5        74.3
grain       69.8    79.5      89.1     82.6    91.9    93.1        93.4
crude       81.2    81.5      75.5     85.8    89.4    89.4        88.7
trade       52.2    77.4      59.2     77.9    79.2    79.2        76.6
interest    57.6    72.5      49.1     76.7    75.6    74.8        69.1
ship        80.9    83.1      80.9     79.8    87.4    86.5        85.8
wheat       63.4    79.4      85.5     72.9    86.6    86.8        82.4
corn        45.2    62.2      87.7     71.4    87.5    87.8        84.6
microavg.   72.3    79.9      79.4     82.6    86.7    87.5        86.4

◮ Table 15.2 SVM classifier break-even F1 from (Joachims 2002a, p. 114). Results are shown for the 10 largest categories and for microaveraged performance over all 90 categories on the Reuters-21578 data set.
It seems that working with simple term features can get one a long way. It is again noticeable the extent to which different papers' results for the same machine learning methods differ. In particular, based on replications by other researchers, the Naive Bayes results of (Joachims 1998) appear too weak, and the results in Table 13.9 should be taken as representative.

15.3 Issues in the classification of text documents

There are lots of applications of text classification in the commercial world; email spam filtering is perhaps now the most ubiquitous.