For example, in the STATLOG project (Michie et al., 1994) LDA was among the top three classifiers for 7 of the 22 datasets, QDA among the top three for four datasets, and one of the pair were in the top three for 10 datasets. Both techniques are widely used, and entire books are devoted to LDA. It seems that whatever exotic tools are the rage of the day, we should always have available these two simple tools. The question arises why LDA and QDA have such a good track record. The reason is not likely to be that the data are approximately Gaussian, and in addition for LDA that the covariances are approximately equal. More likely a reason is that the data can only support simple decision boundaries such as linear or quadratic, and the estimates provided via the Gaussian models are stable. This is a bias–variance tradeoff: we can put up with the bias of a linear decision boundary because it can be estimated with much lower variance than more exotic alternatives. This argument is less believable for QDA, since it can have many parameters itself, although perhaps fewer than the non-parametric alternatives.³

³ Although we fit the covariance matrix Σ̂ to compute the LDA discriminant functions, a much reduced function of it is all that is required to estimate the O(p) parameters needed to compute the decision boundaries.
FIGURE 4.7. Test and training errors for the vowel data, using regularized discriminant analysis with a series of values of α ∈ [0, 1]. The optimum for the test data occurs around α = 0.9, close to quadratic discriminant analysis.

4.3.1 Regularized Discriminant Analysis

Friedman (1989) proposed a compromise between LDA and QDA, which allows one to shrink the separate covariances of QDA toward a common covariance as in LDA.
These methods are very similar in flavor to ridge regression. The regularized covariance matrices have the form

$$\hat{\Sigma}_k(\alpha) = \alpha\hat{\Sigma}_k + (1 - \alpha)\hat{\Sigma}, \qquad (4.13)$$

where Σ̂ is the pooled covariance matrix as used in LDA. Here α ∈ [0, 1] allows a continuum of models between LDA and QDA, and needs to be specified. In practice α can be chosen based on the performance of the model on validation data, or by cross-validation.

Figure 4.7 shows the results of RDA applied to the vowel data. Both the training and test error improve with increasing α, although the test error increases sharply after α = 0.9.
The large discrepancy between the training and test error is partly due to the fact that there are many repeat measurements on a small number of individuals, different in the training and test set.

Similar modifications allow Σ̂ itself to be shrunk toward the scalar covariance,

$$\hat{\Sigma}(\gamma) = \gamma\hat{\Sigma} + (1 - \gamma)\hat{\sigma}^2 I \qquad (4.14)$$

for γ ∈ [0, 1]. Replacing Σ̂ in (4.13) by Σ̂(γ) leads to a more general family of covariances Σ̂(α, γ) indexed by a pair of parameters.
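As a concrete illustration of (4.13) and (4.14), the short numpy sketch below forms the regularized covariances from per-class and pooled estimates. The function name and the choice of σ̂² as the average diagonal entry of the pooled covariance are our own assumptions; the text does not prescribe them.

```python
import numpy as np

def rda_covariances(X, y, alpha, gamma=1.0):
    """Sketch of the RDA covariance estimates of (4.13)-(4.14):
        Sigma_k(alpha) = alpha * Sigma_k + (1 - alpha) * Sigma(gamma)
        Sigma(gamma)   = gamma * Sigma  + (1 - gamma) * sigma2 * I
    where Sigma is the pooled (LDA) covariance and sigma2 is a scalar
    estimate (here the mean diagonal entry of Sigma -- an assumption)."""
    y = np.asarray(y)
    classes = np.unique(y)
    n, p = X.shape
    Sigma_k, pooled = {}, np.zeros((p, p))
    for k in classes:
        Xk = X[y == k]
        R = Xk - Xk.mean(axis=0)
        Sigma_k[k] = R.T @ R / (len(Xk) - 1)       # per-class covariance (QDA)
        pooled += R.T @ R
    pooled /= n - len(classes)                     # pooled covariance (LDA)
    sigma2 = np.trace(pooled) / p
    pooled_g = gamma * pooled + (1 - gamma) * sigma2 * np.eye(p)   # (4.14)
    return {k: alpha * Sigma_k[k] + (1 - alpha) * pooled_g         # (4.13)
            for k in classes}
```

In practice the pair (α, γ) would then be chosen by validation-set performance or cross-validation, as described above.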
In Chapter 12, we discuss other regularized versions of LDA, which are more suitable when the data arise from digitized analog signals and images. In these situations the features are high-dimensional and correlated, and the LDA coefficients can be regularized to be smooth or sparse in the original domain of the signal. This leads to better generalization and allows for easier interpretation of the coefficients. In Chapter 18 we also deal with very high-dimensional problems, where for example the features are gene expression measurements in microarray studies. There the methods focus on the case γ = 0 in (4.14), and other severely regularized versions of LDA.

4.3.2 Computations for LDA

As a lead-in to the next topic, we briefly digress on the computations required for LDA and especially QDA.
Their computations are simplified by diagonalizing Σ̂ or Σ̂_k. For the latter, suppose we compute the eigen-decomposition for each $\hat{\Sigma}_k = \mathbf{U}_k\mathbf{D}_k\mathbf{U}_k^T$, where $\mathbf{U}_k$ is p × p orthonormal, and $\mathbf{D}_k$ a diagonal matrix of positive eigenvalues $d_{k\ell}$. Then the ingredients for $\delta_k(x)$ (4.12) are

• $(x - \hat{\mu}_k)^T\hat{\Sigma}_k^{-1}(x - \hat{\mu}_k) = [\mathbf{U}_k^T(x - \hat{\mu}_k)]^T\mathbf{D}_k^{-1}[\mathbf{U}_k^T(x - \hat{\mu}_k)]$;

• $\log|\hat{\Sigma}_k| = \sum_{\ell}\log d_{k\ell}$.

In light of the computational steps outlined above, the LDA classifier can be implemented by the following pair of steps:

• Sphere the data with respect to the common covariance estimate Σ̂: $X^* \leftarrow \mathbf{D}^{-\frac{1}{2}}\mathbf{U}^T X$, where $\hat{\Sigma} = \mathbf{U}\mathbf{D}\mathbf{U}^T$. The common covariance estimate of X* will now be the identity.

• Classify to the closest class centroid in the transformed space, modulo the effect of the class prior probabilities $\pi_k$.
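This pair of steps can be sketched in a few lines of numpy. The code below is our own illustration, not the book's (the function names and data layout are assumptions): it whitens with the pooled covariance via its eigen-decomposition Σ̂ = UDUᵀ and then assigns each point to the nearest sphered centroid, with a −2 log π̂_k correction for the class priors.

```python
import numpy as np

def lda_fit(X, y):
    """Sphere the data with the pooled covariance Sigma = U D U^T,
    then store the class centroids in the sphered coordinates."""
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    n, p = X.shape
    priors = counts / n
    means = np.array([X[y == k].mean(axis=0) for k in classes])
    Sigma = sum((X[y == k] - means[i]).T @ (X[y == k] - means[i])
                for i, k in enumerate(classes)) / (n - len(classes))
    d, U = np.linalg.eigh(Sigma)        # Sigma = U diag(d) U^T (assumed nonsingular)
    sphere = (U / np.sqrt(d)).T         # D^{-1/2} U^T, so X* = X @ sphere.T
    return classes, priors, means @ sphere.T, sphere

def lda_predict(X, classes, priors, sphered_means, sphere):
    """Closest sphered centroid, modulo the class priors pi_k."""
    Xs = X @ sphere.T
    # squared distances to each sphered centroid, corrected by the log priors
    d2 = ((Xs[:, None, :] - sphered_means[None, :, :]) ** 2).sum(axis=-1)
    return classes[np.argmin(d2 - 2.0 * np.log(priors), axis=1)]
```

The −2 log π̂_k term comes from the Gaussian discriminant in the sphered coordinates; with equal class priors it drops out and the rule is literally nearest-centroid.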
4.3.3 Reduced-Rank Linear Discriminant Analysis

So far we have discussed LDA as a restricted Gaussian classifier. Part of its popularity is due to an additional restriction that allows us to view informative low-dimensional projections of the data.

The K centroids in p-dimensional input space lie in an affine subspace of dimension ≤ K − 1, and if p is much larger than K, this will be a considerable drop in dimension. Moreover, in locating the closest centroid, we can ignore distances orthogonal to this subspace, since they will contribute equally to each class. Thus we might just as well project the X* onto this centroid-spanning subspace H_{K−1}, and make distance comparisons there. Thus there is a fundamental dimension reduction in LDA, namely, that we need only consider the data in a subspace of dimension at most K − 1.
If K = 3, for instance, this could allow us to view the data in a two-dimensional plot, color-coding the classes. In doing so we would not have relinquished any of the information needed for LDA classification.

What if K > 3? We might then ask for an L < K − 1 dimensional subspace H_L ⊆ H_{K−1} optimal for LDA in some sense. Fisher defined optimal to mean that the projected centroids were spread out as much as possible in terms of variance. This amounts to finding principal component subspaces of the centroids themselves (principal components are described briefly in Section 3.5.1, and in more detail in Section 14.5.1). Figure 4.4 shows such an optimal two-dimensional subspace for the vowel data. Here there are eleven classes, each a different vowel sound, in a ten-dimensional input space. The centroids require the full space in this case, since K − 1 = p, but we have shown an optimal two-dimensional subspace. The dimensions are ordered, so we can compute additional dimensions in sequence.
Figure 4.8 shows four additional pairs of coordinates, also known as canonical or discriminant variables. In summary then, finding the sequences of optimal subspaces for LDA involves the following steps:

• compute the K × p matrix of class centroids M and the common covariance matrix W (for within-class covariance);

• compute $\mathbf{M}^* = \mathbf{M}\mathbf{W}^{-\frac{1}{2}}$ using the eigen-decomposition of W;

• compute B*, the covariance matrix of M* (B for between-class covariance), and its eigen-decomposition $\mathbf{B}^* = \mathbf{V}^*\mathbf{D}_B\mathbf{V}^{*T}$. The columns $v_\ell^*$ of $\mathbf{V}^*$ in sequence from first to last define the coordinates of the optimal subspaces.

Combining all these operations the ℓth discriminant variable is given by $Z_\ell = v_\ell^T X$ with $v_\ell = \mathbf{W}^{-\frac{1}{2}} v_\ell^*$.
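These three steps translate directly into a few lines of numpy. The sketch below is our own illustration rather than the book's code; the function name and the unweighted treatment of the centroids when forming B* are assumptions.

```python
import numpy as np

def discriminant_coordinates(X, y):
    """Whitening + centroid-PCA recipe for reduced-rank LDA:
    returns Z = X V and the directions V with v_l = W^{-1/2} v_l*."""
    y = np.asarray(y)
    classes = np.unique(y)
    n, p = X.shape
    M = np.array([X[y == k].mean(axis=0) for k in classes])        # K x p centroids
    W = sum((X[y == k] - M[i]).T @ (X[y == k] - M[i])
            for i, k in enumerate(classes)) / (n - len(classes))   # within-class cov
    d, U = np.linalg.eigh(W)
    W_inv_half = (U / np.sqrt(d)) @ U.T                            # W^{-1/2}
    Mstar = M @ W_inv_half                                         # M* = M W^{-1/2}
    Mc = Mstar - Mstar.mean(axis=0)
    Bstar = Mc.T @ Mc / (len(classes) - 1)                         # B*, covariance of M*
    evals, Vstar = np.linalg.eigh(Bstar)
    order = np.argsort(evals)[::-1]                                # most spread-out first
    V = W_inv_half @ Vstar[:, order]                               # v_l = W^{-1/2} v_l*
    return X @ V, V     # only the first K-1 columns carry between-class variance
```

The first two columns of Z give the kind of two-dimensional view shown in Figure 4.4 for the vowel data; subsequent pairs correspond to the panels of Figure 4.8.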
Fisher arrived at this decomposition via a different route, without referring to Gaussian distributions at all. He posed the problem:

    Find the linear combination $Z = a^T X$ such that the between-class variance is maximized relative to the within-class variance.

Again, the between-class variance is the variance of the class means of Z, and the within-class variance is the pooled variance about the means. Figure 4.9 shows why this criterion makes sense. Although the direction joining the centroids separates the means as much as possible (i.e., maximizes the between-class variance), there is considerable overlap between the projected classes due to the nature of the covariances. By taking the covariance into account as well, a direction with minimum overlap can be found.

The between-class variance of Z is $a^T\mathbf{B}a$ and the within-class variance $a^T\mathbf{W}a$, where W is defined earlier, and B is the covariance matrix of the class centroid matrix M. Note that B + W = T, where T is the total covariance matrix of X, ignoring class information.
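Numerically, Fisher's criterion is a generalized eigenproblem: maximizing $a^T\mathbf{B}a / a^T\mathbf{W}a$ amounts to solving $\mathbf{B}a = \lambda\mathbf{W}a$ for the largest λ. A minimal sketch (our own, using scipy; the function name is an assumption), assuming B and W have been formed as described in the text and that W is nonsingular:

```python
import numpy as np
from scipy.linalg import eigh

def fisher_directions(B, W):
    """Directions a maximizing a^T B a / a^T W a, in decreasing order,
    from the generalized symmetric eigenproblem B a = lambda W a."""
    evals, A = eigh(B, W)                 # eigenvalues returned in ascending order
    order = np.argsort(evals)[::-1]
    return A[:, order], evals[order]
```

Up to scaling, the leading column coincides with the first discriminant direction $v_1$ from the whitening recipe above, which is the sense in which Fisher's route and the Gaussian derivation agree.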
FIGURE 4.8. Four additional pairs of canonical (discriminant) variables for the vowel data; panel axes include Coordinates 1, 2, 3, 7, 9 and 10.