[Caption fragment from a figure on the preceding page: Above each plot is the training error rate. The Bayes error rate is 0.025 for this problem, as is the LDA error rate.]

…p-dimensional input space, one would need general polynomial terms and cross-products of total degree K − 1, O(p^{K−1}) terms in all, to resolve such worst-case scenarios.

The example is extreme, but for large K and small p such maskings naturally occur. As a more realistic illustration, Figure 4.4 is a projection of the training data for a vowel recognition problem onto an informative two-dimensional subspace. There are K = 11 classes in p = 10 dimensions. This is a difficult classification problem, and the best methods achieve around 40% errors on the test data. The main point here is summarized in Table 4.1; linear regression has an error rate of 67%, while a close relative, linear discriminant analysis, has an error rate of 56%.
It seems that masking has hurt in this case. While all the other methods in this chapter are based on linear functions of x as well, they use them in such a way that avoids this masking problem.

4.3 Linear Discriminant Analysis

Decision theory for classification (Section 2.4) tells us that we need to know the class posteriors Pr(G|X) for optimal classification.
Suppose f_k(x) is the class-conditional density of X in class G = k, and let π_k be the prior probability of class k, with ∑_{k=1}^{K} π_k = 1.

[Figure 4.4 appears here: a two-dimensional scatter plot of the vowel training data, Coordinate 1 for Training Data versus Coordinate 2 for Training Data, with the projected class means marked by heavy circles.]

FIGURE 4.4.
A two-dimensional plot of the vowel training data. There are eleven classes with X ∈ IR^10, and this is the best view in terms of a LDA model (Section 4.3.3). The heavy circles are the projected mean vectors for each class. The class overlap is considerable.

TABLE 4.1. Training and test error rates using a variety of linear techniques on the vowel data.
There are eleven classes in ten dimensions, of which three account for 90% of the variance (via a principal components analysis). We see that linear regression is hurt by masking, increasing the test and training error by over 10%.

                                          Error Rates
    Technique                           Training    Test
    Linear regression                     0.48      0.67
    Linear discriminant analysis          0.32      0.56
    Quadratic discriminant analysis       0.01      0.53
    Logistic regression                   0.22      0.51
A simple application of Bayes theorem gives us

    \Pr(G = k \mid X = x) = \frac{f_k(x)\,\pi_k}{\sum_{\ell=1}^{K} f_\ell(x)\,\pi_\ell}.    (4.7)

We see that in terms of ability to classify, having the f_k(x) is almost equivalent to having the quantity Pr(G = k|X = x).

Many techniques are based on models for the class densities:

• linear and quadratic discriminant analysis use Gaussian densities;
• more flexible mixtures of Gaussians allow for nonlinear decision boundaries (Section 6.8);
• general nonparametric density estimates for each class density allow the most flexibility (Section 6.6.2);
• Naive Bayes models are a variant of the previous case, and assume that each of the class densities are products of marginal densities; that is, they assume that the inputs are conditionally independent in each class (Section 6.6.3).

Suppose that we model each class density as multivariate Gaussian

    f_k(x) = \frac{1}{(2\pi)^{p/2} |\Sigma_k|^{1/2}} \exp\Bigl\{ -\tfrac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) \Bigr\}.    (4.8)

Linear discriminant analysis (LDA) arises in the special case when we assume that the classes have a common covariance matrix Σ_k = Σ ∀k.
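As a concrete illustration of (4.7) and (4.8), here is a minimal sketch, assuming NumPy and toy parameter values of our own choosing (not the book's): it evaluates the Gaussian class densities with a shared covariance matrix and converts them into posterior probabilities via Bayes' theorem.

import numpy as np

def gaussian_density(x, mu, Sigma):
    """Multivariate Gaussian density f_k(x), as in eq. (4.8)."""
    p = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)       # (x - mu)^T Sigma^{-1} (x - mu)
    norm = (2 * np.pi) ** (p / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

def posteriors(x, mus, Sigma, priors):
    """Class posteriors Pr(G = k | X = x) via Bayes' theorem, eq. (4.7)."""
    f = np.array([gaussian_density(x, mu, Sigma) for mu in mus])
    unnorm = f * priors                              # f_k(x) * pi_k
    return unnorm / unnorm.sum()                     # normalize by sum_l f_l(x) pi_l

# Toy example (illustrative values, not from the book): K = 3 classes in p = 2 dimensions,
# a common covariance matrix and priors summing to one.
mus = [np.array([0.0, 0.0]), np.array([2.0, 0.0]), np.array([0.0, 2.0])]
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
priors = np.array([0.5, 0.3, 0.2])
print(posteriors(np.array([1.0, 1.0]), mus, Sigma, priors))   # probabilities summing to 1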
In comparing two classes k and ℓ, it is sufficient to look at the log-ratio, and we see that

    \log\frac{\Pr(G = k \mid X = x)}{\Pr(G = \ell \mid X = x)}
      = \log\frac{f_k(x)}{f_\ell(x)} + \log\frac{\pi_k}{\pi_\ell}
      = \log\frac{\pi_k}{\pi_\ell} - \tfrac{1}{2}(\mu_k + \mu_\ell)^T \Sigma^{-1}(\mu_k - \mu_\ell) + x^T \Sigma^{-1}(\mu_k - \mu_\ell),    (4.9)

an equation linear in x. The equal covariance matrices cause the normalization factors to cancel, as well as the quadratic part in the exponents. This linear log-odds function implies that the decision boundary between classes k and ℓ, the set where Pr(G = k|X = x) = Pr(G = ℓ|X = x), is linear in x; in p dimensions a hyperplane. This is of course true for any pair of classes, so all the decision boundaries are linear.
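For the record, the cancellation behind (4.9) is a short expansion of the quadratic forms; a sketch of the algebra, using only the definitions above:

\begin{align*}
\log\frac{\Pr(G=k\mid X=x)}{\Pr(G=\ell\mid X=x)}
  &= \log\frac{\pi_k}{\pi_\ell}
     - \tfrac{1}{2}(x-\mu_k)^T\Sigma^{-1}(x-\mu_k)
     + \tfrac{1}{2}(x-\mu_\ell)^T\Sigma^{-1}(x-\mu_\ell) \\
  &= \log\frac{\pi_k}{\pi_\ell}
     + x^T\Sigma^{-1}(\mu_k-\mu_\ell)
     - \tfrac{1}{2}\bigl(\mu_k^T\Sigma^{-1}\mu_k - \mu_\ell^T\Sigma^{-1}\mu_\ell\bigr) \\
  &= \log\frac{\pi_k}{\pi_\ell}
     - \tfrac{1}{2}(\mu_k+\mu_\ell)^T\Sigma^{-1}(\mu_k-\mu_\ell)
     + x^T\Sigma^{-1}(\mu_k-\mu_\ell).
\end{align*}

The common factor (2π)^{p/2}|Σ|^{1/2} cancels in the ratio, and the x^T Σ^{-1} x terms cancel in the difference of quadratic forms, which is what makes the result linear in x.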
If we divide IR^p into regions that are classified as class 1, class 2, etc., these regions will be separated by hyperplanes. Figure 4.5 (left panel) shows an idealized example with three classes and p = 2. Here the data do arise from three Gaussian distributions with a common covariance matrix.

[Figure 4.5 appears here: two panels with three classes labeled 1, 2 and 3.]

FIGURE 4.5. The left panel shows three Gaussian distributions, with the same covariance and different means. Included are the contours of constant density enclosing 95% of the probability in each case. The Bayes decision boundaries between each pair of classes are shown (broken straight lines), and the Bayes decision boundaries separating all three classes are the thicker solid lines (a subset of the former).
On the right we see a sample of 30 drawn from each Gaussian distribution, and the fitted LDA decision boundaries.

We have included in the figure the contours corresponding to 95% highest probability density, as well as the class centroids. Notice that the decision boundaries are not the perpendicular bisectors of the line segments joining the centroids. This would be the case if the covariance Σ were spherical σ²I, and the class priors were equal. From (4.9) we see that the linear discriminant functions

    \delta_k(x) = x^T \Sigma^{-1}\mu_k - \tfrac{1}{2}\mu_k^T \Sigma^{-1}\mu_k + \log \pi_k    (4.10)

are an equivalent description of the decision rule, with G(x) = argmax_k δ_k(x).

In practice we do not know the parameters of the Gaussian distributions, and will need to estimate them using our training data:

• π̂_k = N_k/N, where N_k is the number of class-k observations;
• μ̂_k = ∑_{g_i = k} x_i / N_k;
• Σ̂ = ∑_{k=1}^{K} ∑_{g_i = k} (x_i − μ̂_k)(x_i − μ̂_k)^T / (N − K).

Figure 4.5 (right panel) shows the estimated decision boundaries based on a sample of size 30 each from three Gaussian distributions.
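To make the estimation recipe above concrete, here is a minimal sketch assuming NumPy (the data and helper names are ours, not the book's); it plugs the estimates π̂_k, μ̂_k and Σ̂ into the discriminant functions (4.10) and classifies by the largest δ_k(x).

import numpy as np

def fit_lda(X, g, K):
    """Estimate LDA parameters from data X (N x p) and labels g in {0,...,K-1}."""
    N, p = X.shape
    priors = np.array([np.mean(g == k) for k in range(K)])         # pi_hat_k = N_k / N
    means = np.array([X[g == k].mean(axis=0) for k in range(K)])   # mu_hat_k
    Sigma = np.zeros((p, p))                                       # pooled covariance Sigma_hat
    for k in range(K):
        diff = X[g == k] - means[k]
        Sigma += diff.T @ diff
    Sigma /= (N - K)
    return priors, means, Sigma

def lda_predict(X, priors, means, Sigma):
    """Classify by G(x) = argmax_k delta_k(x), with delta_k from eq. (4.10)."""
    Sigma_inv = np.linalg.inv(Sigma)
    linear = X @ Sigma_inv @ means.T                               # x^T Sigma^{-1} mu_k, N x K
    const = -0.5 * np.sum(means @ Sigma_inv * means, axis=1) + np.log(priors)
    return np.argmax(linear + const, axis=1)

# Toy usage (illustrative data, not the book's): three Gaussian classes in two dimensions.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
X = np.vstack([rng.multivariate_normal(m, np.eye(2), size=30) for m in centers])
g = np.repeat(np.arange(3), 30)
priors, means, Sigma = fit_lda(X, g, K=3)
print(np.mean(lda_predict(X, priors, means, Sigma) == g))          # training accuracy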
Figure 4.1 on page 103 is another example, but here the classes are not Gaussian.

With two classes there is a simple correspondence between linear discriminant analysis and classification by linear regression, as in (4.5). The LDA rule classifies to class 2 if

    x^T \hat{\Sigma}^{-1}(\hat{\mu}_2 - \hat{\mu}_1) > \tfrac{1}{2}(\hat{\mu}_2 + \hat{\mu}_1)^T \hat{\Sigma}^{-1}(\hat{\mu}_2 - \hat{\mu}_1) - \log(N_2/N_1),    (4.11)
and class 1 otherwise. Suppose we code the targets in the two classes as +1 and −1, respectively. It is easy to show that the coefficient vector from least squares is proportional to the LDA direction given in (4.11) (Exercise 4.2). [In fact, this correspondence occurs for any (distinct) coding of the targets; see Exercise 4.2.]
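A quick numerical check of this proportionality, as a sketch with simulated two-class data and our own ±1 coding (assuming NumPy; this is not the book's code):

import numpy as np

rng = np.random.default_rng(1)
# Two simulated Gaussian classes with a common covariance (illustrative data).
X1 = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.3], [0.3, 1.0]], size=40)
X2 = rng.multivariate_normal([2.0, 1.0], [[1.0, 0.3], [0.3, 1.0]], size=60)
X = np.vstack([X1, X2])
y = np.concatenate([-np.ones(len(X1)), np.ones(len(X2))])   # targets coded -1 / +1

# Least-squares fit with an intercept; keep only the coefficient vector.
Xa = np.column_stack([np.ones(len(X)), X])
beta = np.linalg.lstsq(Xa, y, rcond=None)[0][1:]

# LDA direction Sigma_hat^{-1} (mu_hat_2 - mu_hat_1), as in (4.11).
mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
S = ((X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)) / (len(X) - 2)
lda_dir = np.linalg.solve(S, mu2 - mu1)

# Proportionality: the ratio is (up to rounding) the same in every coordinate.
print(beta / lda_dir)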
However, unless N_1 = N_2 the intercepts are different and hence the resulting decision rules are different.

Since this derivation of the LDA direction via least squares does not use a Gaussian assumption for the features, its applicability extends beyond the realm of Gaussian data.
However, the derivation of the particular intercept or cut-point given in (4.11) does require Gaussian data. Thus it makes sense to instead choose the cut-point that empirically minimizes training error for a given dataset. This is something we have found to work well in practice, but have not seen it mentioned in the literature.

With more than two classes, LDA is not the same as linear regression of the class indicator matrix, and it avoids the masking problems associated with that approach (Hastie et al., 1994).
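Before moving on, here is a small sketch (our own construction, assuming NumPy) of the empirically chosen cut-point mentioned above: project each observation onto the estimated LDA direction and scan the midpoints between consecutive sorted scores, keeping the threshold with the smallest training error.

import numpy as np

def best_cut_point(scores, y):
    """Choose the threshold on discriminant scores that minimizes training error.

    scores: 1-D projections x^T Sigma_hat^{-1} (mu_hat_2 - mu_hat_1); y: labels 1 or 2.
    """
    s = np.sort(scores)
    candidates = (s[:-1] + s[1:]) / 2.0                     # midpoints between sorted scores
    errors = [np.mean((scores > c).astype(int) + 1 != y) for c in candidates]
    return candidates[int(np.argmin(errors))]

# Toy usage (illustrative data): class-2 scores tend to be larger than class-1 scores.
rng = np.random.default_rng(2)
scores = np.concatenate([rng.normal(0.0, 1.0, 50), rng.normal(2.0, 1.0, 50)])
y = np.concatenate([np.full(50, 1), np.full(50, 2)])
print(best_cut_point(scores, y))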
A correspondence between regression and LDA can be established through the notion of optimal scoring, discussed in Section 12.5.

Getting back to the general discriminant problem (4.8), if the Σ_k are not assumed to be equal, then the convenient cancellations in (4.9) do not occur; in particular the pieces quadratic in x remain. We then get quadratic discriminant functions (QDA),

    \delta_k(x) = -\tfrac{1}{2}\log|\Sigma_k| - \tfrac{1}{2}(x - \mu_k)^T \Sigma_k^{-1}(x - \mu_k) + \log \pi_k.    (4.12)

The decision boundary between each pair of classes k and ℓ is described by a quadratic equation {x : δ_k(x) = δ_ℓ(x)}.

Figure 4.6 shows an example (from Figure 4.1 on page 103) where the three classes are Gaussian mixtures (Section 6.8) and the decision boundaries are approximated by quadratic equations in x.
Here we illustrate two popular ways of fitting these quadratic boundaries. The right plot uses QDA as described here, while the left plot uses LDA in the enlarged five-dimensional quadratic polynomial space. The differences are generally small; QDA is the preferred approach, with the LDA method a convenient substitute.²

The estimates for QDA are similar to those for LDA, except that separate covariance matrices must be estimated for each class. When p is large this can mean a dramatic increase in parameters.
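The two fits can be sketched as follows, assuming scikit-learn is available and using toy data of our own (not the book's): the first applies QDA directly; the second applies LDA in the enlarged five-dimensional space X_1, X_2, X_1X_2, X_1^2, X_2^2.

import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)

def quadratic_features(X):
    """Map (X1, X2) to the enlarged space (X1, X2, X1*X2, X1^2, X2^2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1 * x2, x1 ** 2, x2 ** 2])

# Toy data (illustrative): three classes in two dimensions with unequal covariances.
rng = np.random.default_rng(3)
covs = [np.eye(2), [[2.0, 0.5], [0.5, 1.0]], [[1.0, -0.4], [-0.4, 2.0]]]
X = np.vstack([rng.multivariate_normal(m, c, size=50)
               for m, c in zip(([0, 0], [3, 0], [0, 3]), covs)])
g = np.repeat(np.arange(3), 50)

qda = QuadraticDiscriminantAnalysis().fit(X, g)                        # quadratic boundaries via QDA
lda_quad = LinearDiscriminantAnalysis().fit(quadratic_features(X), g)  # LDA in the quadratic space
print(np.mean(qda.predict(X) == g),
      np.mean(lda_quad.predict(quadratic_features(X)) == g))           # training accuracies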
Since the decision boundaries are functions of the parameters of the densities, counting the number of parameters must be done with care.

² For this figure and many similar figures in the book we compute the decision boundaries by an exhaustive contouring method. We compute the decision rule on a fine lattice of points, and then use contouring algorithms to compute the boundaries.

[Figure 4.6 appears here: two scatter-plot panels with three classes labeled 1, 2 and 3, comparing the fitted quadratic boundaries.]

FIGURE 4.6.
Two methods for fitting quadratic boundaries. The left plot shows the quadratic decision boundaries for the data in Figure 4.1 (obtained using LDA in the five-dimensional space X_1, X_2, X_1X_2, X_1^2, X_2^2). The right plot shows the quadratic decision boundaries found by QDA. The differences are small, as is usually the case.

For LDA, it seems there are (K − 1) × (p + 1) parameters, since we only need the differences δ_k(x) − δ_K(x) between the discriminant functions, where K is some pre-chosen class (here we have chosen the last), and each difference requires p + 1 parameters.³ Likewise for QDA there will be (K − 1) × {p(p + 3)/2 + 1} parameters. Both LDA and QDA perform well on an amazingly large and diverse set of classification tasks.
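These two counts are easy to evaluate; for the vowel data of Table 4.1 (K = 11 classes, p = 10 inputs) a small sketch with our own helper names gives 110 parameters for LDA and 660 for QDA.

def lda_param_count(K, p):
    # (K - 1) differences delta_k(x) - delta_K(x), each needing p + 1 parameters.
    return (K - 1) * (p + 1)

def qda_param_count(K, p):
    # Each difference also carries the quadratic terms: p(p + 3)/2 + 1 parameters.
    return (K - 1) * (p * (p + 3) // 2 + 1)

print(lda_param_count(11, 10), qda_param_count(11, 10))   # 110 and 660 for the vowel data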