The Elements of Statistical Learning: Data Mining, Inference, and Prediction (page 8)
[FIGURE 2.1. A classification example in two dimensions. The classes are coded as a binary variable (BLUE = 0, ORANGE = 1), and then fit by linear regression. The line is the decision boundary defined by x^T β̂ = 0.5. The orange shaded region denotes that part of input space classified as ORANGE, while the blue region is classified as BLUE.]

The set of points in IR² classified as ORANGE corresponds to {x : x^T β̂ > 0.5}, indicated in Figure 2.1, and the two predicted classes are separated by the decision boundary {x : x^T β̂ = 0.5}, which is linear in this case.
We see that for these data there are several misclassifications on both sides of the decision boundary. Perhaps our linear model is too rigid, or are such errors unavoidable? Remember that these are errors on the training data itself, and we have not said where the constructed data came from.
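The linear-regression classifier described above can be sketched in a few lines of numpy. This is a minimal illustration, not the book's code; the synthetic two-class sample (means and sizes) is an assumption standing in for the BLUE/ORANGE training data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the training data: two classes with
# different means (BLUE coded as 0, ORANGE coded as 1).
X_blue = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2))
X_orange = rng.normal(loc=[2.0, 2.0], scale=1.0, size=(100, 2))
X = np.vstack([X_blue, X_orange])
y = np.concatenate([np.zeros(100), np.ones(100)])

# Augment with a column of ones so beta_hat includes an intercept,
# then fit the binary-coded response by least squares.
Xa = np.column_stack([np.ones(len(X)), X])
beta_hat, *_ = np.linalg.lstsq(Xa, y, rcond=None)

# Decision rule: classify as ORANGE where x^T beta_hat > 0.5.
y_hat = (Xa @ beta_hat > 0.5).astype(int)
train_error = np.mean(y_hat != y)
```

Evaluating the rule `Xa @ beta_hat > 0.5` over a grid of x values traces out exactly the linear decision boundary shown in Figure 2.1.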
Consider the two possible scenarios:

Scenario 1: The training data in each class were generated from bivariate Gaussian distributions with uncorrelated components and different means.

Scenario 2: The training data in each class came from a mixture of 10 low-variance Gaussian distributions, with individual means themselves distributed as Gaussian.

A mixture of Gaussians is best described in terms of the generative model. One first generates a discrete variable that determines which of the component Gaussians to use, and then generates an observation from the chosen density. In the case of one Gaussian per class, we will see in Chapter 4 that a linear decision boundary is the best one can do, and that our estimate is almost optimal.
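The two-stage generative model of Scenario 2 can be simulated directly. In this sketch the component-mean distribution, the low variance (0.2), and the sample size are all illustrative assumptions, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(1)
n_components, n_obs = 10, 100

# Stage 0: the 10 component means are themselves Gaussian-distributed.
means = rng.normal(loc=[1.0, 0.0], scale=1.0, size=(n_components, 2))

# Stage 1: a discrete variable picks which component Gaussian to use
# for each observation.
which = rng.integers(0, n_components, size=n_obs)

# Stage 2: draw each observation from the chosen low-variance density.
X = rng.normal(loc=means[which], scale=0.2)
```

Repeating this per class (with its own means) produces the tightly clustered, non-Gaussian class densities for which a linear boundary is a poor fit.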
The region of overlap is inevitable, and future data to be predicted will be plagued by this overlap as well.

In the case of mixtures of tightly clustered Gaussians the story is different. A linear decision boundary is unlikely to be optimal, and in fact is not. The optimal decision boundary is nonlinear and disjoint, and as such will be much more difficult to obtain.

We now look at another classification and regression procedure that is in some sense at the opposite end of the spectrum to the linear model, and far better suited to the second scenario.

2.3.2 Nearest-Neighbor Methods

Nearest-neighbor methods use those observations in the training set T closest in input space to x to form Ŷ. Specifically, the k-nearest-neighbor fit for Ŷ is defined as follows:

    Ŷ(x) = (1/k) Σ_{x_i ∈ N_k(x)} y_i,        (2.8)

where N_k(x) is the neighborhood of x defined by the k closest points x_i in the training sample.
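Equation (2.8) translates almost line for line into code. A minimal sketch, with Euclidean distance assumed as the metric:

```python
import numpy as np

def knn_fit(x, X_train, y_train, k=15):
    """k-nearest-neighbor fit of equation (2.8): average the responses
    y_i over the k training points closest to x."""
    # Euclidean distances from x to every training point.
    dists = np.linalg.norm(X_train - x, axis=1)
    # Indices of the neighborhood N_k(x): the k closest points.
    nearest = np.argsort(dists)[:k]
    # Y_hat(x) = (1/k) * sum of y_i over N_k(x).
    return y_train[nearest].mean()
```

With a binary-coded response, `knn_fit` returns the proportion of class-1 points in the neighborhood, so thresholding it at 0.5 gives the majority-vote classifier discussed next.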
Closeness implies a metric, which for the moment we assume is Euclidean distance. So, in words, we find the k observations with x_i closest to x in input space, and average their responses.

In Figure 2.2 we use the same training data as in Figure 2.1, and use 15-nearest-neighbor averaging of the binary coded response as the method of fitting. Thus Ŷ is the proportion of ORANGEs in the neighborhood, and so assigning class ORANGE to Ĝ if Ŷ > 0.5 amounts to a majority vote in the neighborhood. The colored regions indicate all those points in input space classified as BLUE or ORANGE by such a rule, in this case found by evaluating the procedure on a fine grid in input space. We see that the decision boundaries that separate the BLUE from the ORANGE regions are far more irregular, and respond to local clusters where one class dominates.

Figure 2.3 shows the results for 1-nearest-neighbor classification: Ŷ is assigned the value y_ℓ of the closest point x_ℓ to x in the training data.
In this case the regions of classification can be computed relatively easily, and correspond to a Voronoi tessellation of the training data. Each point x_i has an associated tile bounding the region for which it is the closest input point. For all points x in the tile, Ĝ(x) = g_i. The decision boundary is even more irregular than before.

The method of k-nearest-neighbor averaging is defined in exactly the same way for regression of a quantitative output Y, although k = 1 would be an unlikely choice.
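The 1-nearest-neighbor rule is the k = 1 special case, and makes the Voronoi structure explicit: the set of x for which a given x_i wins the argmin below is exactly that point's Voronoi tile. A minimal sketch:

```python
import numpy as np

def one_nn_classify(x, X_train, g_train):
    """1-nearest-neighbor classification: assign to x the class g_i of
    the closest training point x_i (Euclidean distance)."""
    i = np.argmin(np.linalg.norm(X_train - x, axis=1))
    return g_train[i]
```

Evaluating `one_nn_classify` on a fine grid, as in Figure 2.3, colors each Voronoi tile by the class of the training point it contains.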
[FIGURE 2.2. 15-Nearest Neighbor Classifier applied to the same two-dimensional classification example as Figure 2.1.]