Here each of the response categories is coded via an indicator variable: if G has K classes, there are K indicators Y_k, k = 1, . . . , K, with Y_k = 1 if G = k and 0 otherwise. These are collected together in a vector Y = (Y_1, . . . , Y_K), and the N training instances of these form an N × K indicator response matrix Y. Y is a matrix of 0's and 1's, with each row having a single 1. We fit a linear regression model to each of the columns of Y simultaneously, and the fit is given by

$$\hat{\mathbf{Y}} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}. \tag{4.3}$$

Chapter 3 has more details on linear regression.
Note that we have a coefficient vector for each response column y_k, and hence a (p + 1) × K coefficient matrix B̂ = (XᵀX)⁻¹XᵀY. Here X is the model matrix with p + 1 columns corresponding to the p inputs, and a leading column of 1's for the intercept. A new observation with input x is classified as follows:

• compute the fitted output f̂(x)ᵀ = (1, xᵀ)B̂, a K-vector;

• identify the largest component and classify accordingly:

$$\hat{G}(x) = \arg\max_{k \in \mathcal{G}} \hat{f}_k(x). \tag{4.4}$$
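As a concrete sketch of (4.3) and (4.4) (not from the book; the simulated data, variable names, and NumPy usage are our own assumptions), the indicator response matrix can be built, fit, and used for classification as follows:

```python
import numpy as np

# Sketch of (4.3)-(4.4): regress an indicator response matrix on the inputs
# and classify a new point to the largest fitted component.
rng = np.random.default_rng(0)
N, p, K = 300, 2, 3
g = rng.integers(0, K, size=N)                       # class labels g_i in {0, ..., K-1}
centroids = np.array([[-2.0, -1.0], [2.0, -1.0], [0.0, 2.0]])
X_raw = centroids[g] + rng.normal(size=(N, p))       # N x p inputs, one Gaussian blob per class

Y = np.eye(K)[g]                                     # N x K indicator response matrix (one 1 per row)
X = np.hstack([np.ones((N, 1)), X_raw])              # model matrix: leading column of 1's, then the p inputs

# B_hat = (X^T X)^{-1} X^T Y, computed with a least-squares solver for numerical stability
B_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)        # (p+1) x K coefficient matrix

def classify(x):
    """Rule (4.4): f_hat(x)^T = (1, x^T) B_hat, then take the largest component."""
    f_hat = np.concatenate(([1.0], x)) @ B_hat       # K-vector of fitted values
    return int(np.argmax(f_hat))

print(classify(np.array([0.0, 2.0])))                # usually the class whose centroid is at (0, 2)
```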
What is the rationale for this approach? One rather formal justification is to view the regression as an estimate of conditional expectation. For the random variable Y_k, E(Y_k | X = x) = Pr(G = k | X = x), so conditional expectation of each of the Y_k seems a sensible goal. The real issue is: how good an approximation to conditional expectation is the rather rigid linear regression model? Alternatively, are the f̂_k(x) reasonable estimates of the posterior probabilities Pr(G = k | X = x), and more importantly, does this matter?

It is quite straightforward to verify that Σ_{k∈G} f̂_k(x) = 1 for any x, as long as there is an intercept in the model (column of 1's in X). However, the f̂_k(x) can be negative or greater than 1, and typically some are.
This is a consequence of the rigid nature of linear regression, especially if we make predictions outside the hull of the training data. These violations in themselves do not guarantee that this approach will not work, and in fact on many problems it gives similar results to more standard linear methods for classification.
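As a quick, self-contained check of these two facts (our own simulation, not the book's): with an intercept the K fitted values sum to one at any x, yet individual components can fall outside [0, 1], particularly far from the training data.

```python
import numpy as np

# With an intercept column, the fitted indicator components sum to 1 at any x;
# individually they can still be negative or exceed 1, especially outside the hull.
rng = np.random.default_rng(0)
N, K = 300, 3
g = rng.integers(0, K, size=N)
centroids = np.array([[-2.0, -1.0], [2.0, -1.0], [0.0, 2.0]])
X = np.hstack([np.ones((N, 1)), centroids[g] + rng.normal(size=(N, 2))])
Y = np.eye(K)[g]
B_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

for x in (np.array([0.0, 0.0]), np.array([10.0, -10.0])):   # inside vs. far outside the training hull
    f_hat = np.concatenate(([1.0], x)) @ B_hat
    print(np.round(f_hat, 3), "sum =", round(float(f_hat.sum()), 6))
# Both sums print as 1.0 (up to rounding error); at the far-away point some
# components are negative and some exceed 1.
```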
If we allow linear regression onto basis expansions h(X) of the inputs, this approach can lead to consistent estimates of the probabilities. As the size of the training set N grows bigger, we adaptively include more basis elements so that linear regression onto these basis functions approaches conditional expectation. We discuss such approaches in Chapter 5.

A more simplistic viewpoint is to construct targets t_k for each class, where t_k is the kth column of the K × K identity matrix.
Our prediction problem is to try and reproduce the appropriate target for an observation. With the same coding as before, the response vector y_i (ith row of Y) for observation i has the value y_i = t_k if g_i = k. We might then fit the linear model by least squares:

$$\min_{\mathbf{B}} \sum_{i=1}^{N} \| y_i - [(1,\, x_i^T)\mathbf{B}]^T \|^2. \tag{4.5}$$

The criterion is a sum-of-squared Euclidean distances of the fitted vectors from their targets. A new observation is classified by computing its fitted vector f̂(x) and classifying to the closest target:

$$\hat{G}(x) = \arg\min_{k} \| \hat{f}(x) - t_k \|^2. \tag{4.6}$$

This is exactly the same as the previous approach:
• The sum-of-squared-norm criterion is exactly the criterion for multiple response linear regression, just viewed slightly differently. Since a squared norm is itself a sum of squares, the components decouple and can be rearranged as a separate linear model for each element. Note that this is only possible because there is nothing in the model that binds the different responses together.

• The closest target classification rule (4.6) is easily seen to be exactly the same as the maximum fitted component criterion (4.4); a small numerical check of this equivalence follows the list.
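Here is that numerical check of the equivalence (our own illustration; the simulated data and variable names are made up): classifying to the closest target and classifying to the largest fitted component produce identical labels.

```python
import numpy as np

# Check that rule (4.6), classify to the closest target t_k, agrees with
# rule (4.4), classify to the largest fitted component.
rng = np.random.default_rng(1)
N, K = 200, 3
g = rng.integers(0, K, size=N)
centroids = np.array([[-2.0, -1.0], [2.0, -1.0], [0.0, 2.0]])
X = np.hstack([np.ones((N, 1)), centroids[g] + rng.normal(size=(N, 2))])
Y = np.eye(K)[g]                                     # row i is the target t_k for class g_i
B_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
F_hat = X @ B_hat                                    # fitted K-vectors for every observation

by_argmax = F_hat.argmax(axis=1)                                      # rule (4.4)
sq_dists = ((F_hat[:, None, :] - np.eye(K)[None, :, :]) ** 2).sum(axis=2)
by_closest_target = sq_dists.argmin(axis=1)                           # rule (4.6)
print(np.array_equal(by_argmax, by_closest_target))                   # True: the two rules coincide
```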
FIGURE 4.2. [Scatter plots of the three-class data in the (X1, X2) plane; left panel: Linear Regression, right panel: Linear Discriminant Analysis.] The data come from three classes in ℝ² and are easily separated by linear decision boundaries. The right plot shows the boundaries found by linear discriminant analysis. The left plot shows the boundaries found by linear regression of the indicator response variables. The middle class is completely masked (never dominates).

There is a serious problem with the regression approach when the number of classes K ≥ 3, especially prevalent when K is large. Because of the rigid nature of the regression model, classes can be masked by others. Figure 4.2 illustrates an extreme situation when K = 3.
The three classes are perfectly separated by linear decision boundaries, yet linear regression misses the middle class completely.

In Figure 4.3 we have projected the data onto the line joining the three centroids (there is no information in the orthogonal direction in this case), and we have included and coded the three response variables Y_1, Y_2 and Y_3.
The three regression lines (left panel) are included, and we see that the line corresponding to the middle class is horizontal and its fitted values are never dominant! Thus, observations from class 2 are classified either as class 1 or class 3. The right panel uses quadratic regression rather than linear regression. For this simple example a quadratic rather than linear fit (for the middle class at least) would solve the problem.
However, it can be seen that if there were four rather than three classes lined up like this, a quadratic would not come down fast enough, and a cubic would be needed as well. A loose but general rule is that if K ≥ 3 classes are lined up, polynomial terms up to degree K − 1 might be needed to resolve them. Note also that these are polynomials along the derived direction passing through the centroids, which can have arbitrary orientation. So in p-dimensional input space, one would need general polynomial terms and cross-products of total degree K − 1, O(p^{K−1}) terms in all, to resolve such worst-case scenarios.

FIGURE 4.3. [Fitted indicator regressions plotted against the projected input; left panel: Degree = 1; Error = 0.33, right panel: Degree = 2; Error = 0.04.]
The effects of masking on linear regression in ℝ for a three-class problem. The rug plot at the base indicates the positions and class membership of each observation. The three curves in each panel are the fitted regressions to the three-class indicator variables; for example, for the blue class, y_blue is 1 for the blue observations, and 0 for the green and orange. The fits are linear and quadratic polynomials.
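The masking effect of Figures 4.2 and 4.3 is easy to reproduce in a small simulation (ours, not the book's; the sample sizes and exact error rates are arbitrary): with three classes strung out along one direction, the linear fit essentially never predicts the middle class, while a quadratic basis recovers it.

```python
import numpy as np

# Reproduce the masking effect: three classes spread along a line.  With a
# degree-1 (linear) basis the middle class is essentially never the largest
# fitted component; a degree-2 (quadratic) basis resolves it.
rng = np.random.default_rng(2)
n_per_class, K = 100, 3
x = np.concatenate([rng.normal(loc=m, scale=0.4, size=n_per_class)
                    for m in (-2.0, 0.0, 2.0)])      # class centroids lined up on one axis
g = np.repeat(np.arange(K), n_per_class)             # classes 0, 1 (middle), 2
Y = np.eye(K)[g]

for degree in (1, 2):
    X = np.vander(x, N=degree + 1, increasing=True)  # polynomial basis 1, x, ..., x^degree
    B_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    pred = (X @ B_hat).argmax(axis=1)
    print(f"degree {degree}: training error {np.mean(pred != g):.2f}, "
          f"middle-class predictions: {np.sum(pred == 1)}")
# Typically the degree-1 fit predicts the middle class for (almost) no
# observations, giving roughly 1/3 training error, while the degree-2 fit
# predicts all three classes and the error drops sharply, qualitatively
# mirroring Figure 4.3.
```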