The Elements of Statistical Learning: Data Mining, Inference, and Prediction
[Figure 2.2: scatter plot of the two-class training data, fit by the 15-nearest-neighbor rule.]

FIGURE 2.2. The same classification example in two dimensions as in Figure 2.1. The classes are coded as a binary variable (BLUE = 0, ORANGE = 1) and then fit by 15-nearest-neighbor averaging as in (2.8). The predicted class is hence chosen by majority vote amongst the 15-nearest neighbors.

In Figure 2.2 we see that far fewer training observations are misclassified than in Figure 2.1. This should not give us too much comfort, though, since in Figure 2.3 none of the training data are misclassified.
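The 15-nearest-neighbor fit described in the caption can be sketched in a few lines of Python. This is only an illustration of the averaging-and-majority-vote rule referred to as (2.8), not code from the book; the function and variable names are my own.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=15):
    """k-nearest-neighbor classification of each row of X_query.

    y_train holds the binary coding of the classes (BLUE = 0, ORANGE = 1).
    The fitted value at a query point is the average of the k nearest
    responses, and the predicted class is 1 when that average exceeds 1/2,
    i.e. the majority vote among the k neighbors.
    """
    preds = np.empty(len(X_query), dtype=int)
    for i, x in enumerate(X_query):
        dists = np.sum((X_train - x) ** 2, axis=1)   # squared Euclidean distances
        nearest = np.argsort(dists)[:k]              # indices of the k closest points
        preds[i] = int(y_train[nearest].mean() > 0.5)
    return preds
```

With k = 15 this gives a fit of the kind shown in Figure 2.2; setting k = 1 gives the 1-nearest-neighbor fit of Figure 2.3.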
A little thought suggests that for k-nearest-neighbor fits, the error on the training data should be approximately an increasing function of k, and will always be 0 for k = 1. An independent test set would give us a more satisfactory means for comparing the different methods.

It appears that k-nearest-neighbor fits have a single parameter, the number of neighbors k, compared to the p parameters in least-squares fits.
Although this is the case, we will see that the effective number of parameters of k-nearest neighbors is N/k and is generally bigger than p, and decreases with increasing k. To get an idea of why, note that if the neighborhoods were nonoverlapping, there would be N/k neighborhoods and we would fit one parameter (a mean) in each neighborhood.

It is also clear that we cannot use sum-of-squared errors on the training set as a criterion for picking k, since we would always pick k = 1! It would seem that k-nearest-neighbor methods would be more appropriate for the mixture Scenario 2 described above, while for Gaussian data the decision boundaries of k-nearest neighbors would be unnecessarily noisy.
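The two quantitative claims above, zero training error at k = 1 and roughly N/k effective parameters, are easy to check numerically. The sketch below is my own illustration under the stated assumptions (0/1 class coding, Euclidean distance), not code from the text.

```python
import numpy as np

def knn_training_error(X, y, k):
    """Misclassification rate of a k-NN classifier evaluated on its own
    training set; each point is among its own k nearest neighbors."""
    n_wrong = 0
    for i, x in enumerate(X):
        dists = np.sum((X - x) ** 2, axis=1)
        nearest = np.argsort(dists)[:k]              # includes point i itself
        pred = int(y[nearest].mean() > 0.5)
        n_wrong += int(pred != y[i])
    return n_wrong / len(X)

# For k = 1 the nearest neighbor of each training point is the point itself,
# so knn_training_error(X, y, 1) is exactly 0 (barring duplicate inputs).

# Effective number of parameters N/k for a training sample of size N = 200:
N = 200
for k in (1, 5, 15, 50):
    print(k, N / k)   # 200.0, 40.0, 13.33..., 4.0 -- decreases as k increases
```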
[Figure 2.3: same data, panel title "1-Nearest Neighbor Classifier".]

FIGURE 2.3. The same classification example in two dimensions as in Figure 2.1. The classes are coded as a binary variable (BLUE = 0, ORANGE = 1), and then predicted by 1-nearest-neighbor classification.

2.3.3 From Least Squares to Nearest Neighbors

The linear decision boundary from least squares is very smooth, and apparently stable to fit. It does appear to rely heavily on the assumption that a linear decision boundary is appropriate.
In language we will develop shortly, it has low variance and potentially high bias.

On the other hand, the k-nearest-neighbor procedures do not appear to rely on any stringent assumptions about the underlying data, and can adapt to any situation. However, any particular subregion of the decision boundary depends on a handful of input points and their particular positions, and is thus wiggly and unstable—high variance and low bias.

Each method has its own situations for which it works best; in particular linear regression is more appropriate for Scenario 1 above, while nearest neighbors are more suitable for Scenario 2. The time has come to expose the oracle! The data in fact were simulated from a model somewhere between the two, but closer to Scenario 2. First we generated 10 means m_k from a bivariate Gaussian distribution N((1, 0)^T, I) and labeled this class BLUE.
Similarly, 10 more were drawn from N((0, 1)^T, I) and labeled class ORANGE. Then for each class we generated 100 observations as follows: for each observation, we picked an m_k at random with probability 1/10, and then generated a N(m_k, I/5), thus leading to a mixture of Gaussian clusters for each class. Figure 2.4 shows the results of classifying 10,000 new observations generated from the model.

[Figure 2.4: Test Error versus k (Number of Nearest Neighbors), with a second axis showing Degrees of Freedom N/k; legend: Linear, Train, Test, Bayes.]

FIGURE 2.4. Misclassification curves for the simulation example used in Figures 2.1, 2.2 and 2.3. A single training sample of size 200 was used, and a test sample of size 10,000. The orange curves are test and the blue are training error for k-nearest-neighbor classification. The results for linear regression are the bigger orange and blue squares at three degrees of freedom. The purple line is the optimal Bayes error rate.
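For concreteness, the data-generating mechanism just described could be simulated along the following lines. This is a hedged sketch, not the authors' code; the random seed, function names, and array layout are my own choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

# 10 BLUE means from N((1, 0)^T, I) and 10 ORANGE means from N((0, 1)^T, I)
means_blue = rng.multivariate_normal([1.0, 0.0], np.eye(2), size=10)
means_orange = rng.multivariate_normal([0.0, 1.0], np.eye(2), size=10)

def draw_class(means, n=100):
    """Draw n observations: pick one of the class means uniformly at random
    (probability 1/10 each), then sample from N(m_k, I/5) around it."""
    picks = rng.integers(0, len(means), size=n)
    noise = rng.multivariate_normal([0.0, 0.0], np.eye(2) / 5.0, size=n)
    return means[picks] + noise

X = np.vstack([draw_class(means_blue), draw_class(means_orange)])
y = np.concatenate([np.zeros(100, dtype=int), np.ones(100, dtype=int)])  # BLUE = 0, ORANGE = 1
```

Applying least squares and the k-nearest-neighbor rule from the earlier sketch to (X, y) over a grid of k values gives a comparison of the kind summarized in Figure 2.4.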
We compare the results for least squares and those for k-nearest neighbors for a range of values of k.

A large subset of the most popular techniques in use today are variants of these two simple procedures. In fact 1-nearest-neighbor, the simplest of all, captures a large percentage of the market for low-dimensional problems. The following list describes some ways in which these simple procedures have been enhanced:

• Kernel methods use weights that decrease smoothly to zero with distance from the target point, rather than the effective 0/1 weights used by k-nearest neighbors.

• In high-dimensional spaces the distance kernels are modified to emphasize some variable more than others.

• Local regression fits linear models by locally weighted least squares, rather than fitting constants locally.

• Linear models fit to a basis expansion of the original inputs allow arbitrarily complex models.

• Projection pursuit and neural network models consist of sums of nonlinearly transformed linear models.

2.4 Statistical Decision Theory

In this section we develop a small amount of theory that provides a framework for developing models such as those discussed informally so far.