The Elements of Statistical Learning: Data Mining, Inference, and Prediction
2. Overview of Supervised Learning
2.4 Statistical Decision Theory

We first consider the case of a quantitative output, and place ourselves in the world of random variables and probability spaces. Let $X \in \mathbb{R}^p$ denote a real-valued random input vector, and $Y \in \mathbb{R}$ a real-valued random output variable, with joint distribution $\Pr(X, Y)$. We seek a function $f(X)$ for predicting $Y$ given values of the input $X$. This theory requires a loss function $L(Y, f(X))$ for penalizing errors in prediction, and by far the most common and convenient is squared error loss: $L(Y, f(X)) = (Y - f(X))^2$. This leads us to a criterion for choosing $f$,

$$\mathrm{EPE}(f) = \mathrm{E}(Y - f(X))^2 \tag{2.9}$$
$$\phantom{\mathrm{EPE}(f)} = \int [y - f(x)]^2 \, \Pr(dx, dy), \tag{2.10}$$

the expected (squared) prediction error. By conditioning¹ on $X$, we can write EPE as

$$\mathrm{EPE}(f) = \mathrm{E}_X \mathrm{E}_{Y|X}\left([Y - f(X)]^2 \mid X\right) \tag{2.11}$$

and we see that it suffices to minimize EPE pointwise:

$$f(x) = \operatorname{argmin}_c \mathrm{E}_{Y|X}\left([Y - c]^2 \mid X = x\right). \tag{2.12}$$

The solution is

$$f(x) = \mathrm{E}(Y \mid X = x), \tag{2.13}$$

the conditional expectation, also known as the regression function.
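As a quick numerical check of (2.12) and (2.13), the sketch below fixes a single point x and treats a simulated sample as draws from the conditional distribution of Y given X = x. The exponential distribution, the sample size, and the grid of candidate constants c are illustrative assumptions, not part of the text. The constant with the smallest average squared error should sit next to the sample mean.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumption: draws standing in for Y given X = x (a skewed distribution).
y = rng.exponential(scale=2.0, size=100_000)

# Candidate constant predictions c on a grid with spacing 0.01.
candidates = np.linspace(0.0, 5.0, 501)

# Average squared error of each constant prediction.
mse = np.array([np.mean((y - c) ** 2) for c in candidates])

best_c = candidates[np.argmin(mse)]
print("grid minimizer:", best_c)
print("sample mean:   ", y.mean())
```

The average squared error is a parabola in c with its vertex at the sample mean, so the grid minimizer can differ from the mean only by rounding to the grid.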
Thus the best prediction of $Y$ at any point $X = x$ is the conditional mean, when best is measured by average squared error.

¹ Conditioning here amounts to factoring the joint density $\Pr(X, Y) = \Pr(Y|X)\Pr(X)$, where $\Pr(Y|X) = \Pr(Y, X)/\Pr(X)$, and splitting up the bivariate integral accordingly.

The nearest-neighbor methods attempt to directly implement this recipe using the training data. At each point $x$, we might ask for the average of all those $y_i$'s with input $x_i = x$.
Since there is typically at most one observation at any point $x$, we settle for

$$\hat{f}(x) = \mathrm{Ave}(y_i \mid x_i \in N_k(x)), \tag{2.14}$$

where "Ave" denotes average, and $N_k(x)$ is the neighborhood containing the $k$ points in the training set $\mathcal{T}$ closest to $x$. Two approximations are happening here:

• expectation is approximated by averaging over sample data;
• conditioning at a point is relaxed to conditioning on some region "close" to the target point.

For large training sample size $N$, the points in the neighborhood are likely to be close to $x$, and as $k$ gets large the average will get more stable. In fact, under mild regularity conditions on the joint probability distribution $\Pr(X, Y)$, one can show that as $N, k \to \infty$ such that $k/N \to 0$, $\hat{f}(x) \to \mathrm{E}(Y \mid X = x)$.
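The estimate (2.14) can be sketched in a few lines. The function name `knn_regress`, the use of Euclidean distance, and the simulated data with true regression function sin(3x) are assumptions made for illustration only.

```python
import numpy as np

def knn_regress(x0, X, y, k):
    """Average the responses of the k training points closest to x0.

    Closeness is measured by Euclidean distance (an assumption here).
    """
    dists = np.linalg.norm(X - x0, axis=1)
    nearest = np.argsort(dists)[:k]   # indices of the neighborhood N_k(x0)
    return y[nearest].mean()

rng = np.random.default_rng(0)
N = 2000
X = rng.uniform(-1.0, 1.0, size=(N, 1))
# Hypothetical truth: f(x) = sin(3x), observed with Gaussian noise.
y = np.sin(3.0 * X[:, 0]) + rng.normal(scale=0.3, size=N)

x0 = np.array([0.5])
estimate = knn_regress(x0, X, y, k=50)
print("k-NN estimate:", estimate)
print("true f(0.5):  ", np.sin(1.5))
```

With N = 2000 and k = 50 the neighborhood is narrow and the average is fairly stable; shrinking N or k makes the estimate noisier, in line with the two approximations listed above.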
In light of this, why look further, since it seems we have a universal approximator? We often do not have very large samples. If the linear or some more structured model is appropriate, then we can usually get a more stable estimate than $k$-nearest neighbors, although such knowledge has to be learned from the data as well. There are other problems though, sometimes disastrous. In Section 2.5 we see that as the dimension $p$ gets large, so does the metric size of the $k$-nearest neighborhood. So settling for nearest neighborhood as a surrogate for conditioning will fail us miserably.
The convergence above still holds, but the rate of convergence decreases as the dimension increases.

How does linear regression fit into this framework? The simplest explanation is that one assumes that the regression function $f(x)$ is approximately linear in its arguments:

$$f(x) \approx x^T \beta. \tag{2.15}$$

This is a model-based approach: we specify a model for the regression function. Plugging this linear model for $f(x)$ into EPE (2.9) and differentiating, we can solve for $\beta$ theoretically:

$$\beta = [\mathrm{E}(X X^T)]^{-1} \mathrm{E}(X Y). \tag{2.16}$$

Note we have not conditioned on $X$; rather we have used our knowledge of the functional relationship to pool over values of $X$. The least squares solution (2.6) amounts to replacing the expectation in (2.16) by averages over the training data.

So both $k$-nearest neighbors and least squares end up approximating conditional expectations by averages.
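The claim that least squares replaces the expectations in (2.16) by training averages can be checked numerically. In the sketch below the simulated inputs, the coefficient vector `beta_true`, and the noise level are hypothetical, and no intercept is included (the inputs are assumed to be centered).

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 500, 3
X = rng.normal(size=(N, p))
beta_true = np.array([1.0, -2.0, 0.5])   # hypothetical coefficients
y = X @ beta_true + rng.normal(scale=0.1, size=N)

# Sample analogues of E(X X^T) and E(X Y): averages over the training data.
exx = X.T @ X / N
exy = X.T @ y / N

# Plug the averages into (2.16).
beta_hat = np.linalg.solve(exx, exy)

# Ordinary least squares fit for comparison.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print("plug-in (2.16):", beta_hat)
print("least squares: ", beta_lstsq)
```

The factor 1/N appears in both averages and cancels, which is why the two estimates agree up to floating-point error.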
But they differ dramatically in terms of model assumptions:

• Least squares assumes $f(x)$ is well approximated by a globally linear function.
• $k$-nearest neighbors assumes $f(x)$ is well approximated by a locally constant function.

Although the latter seems more palatable, we have already seen that we may pay a price for this flexibility.

Many of the more modern techniques described in this book are model based, although far more flexible than the rigid linear model. For example, additive models assume that

$$f(X) = \sum_{j=1}^{p} f_j(X_j). \tag{2.17}$$

This retains the additivity of the linear model, but each coordinate function $f_j$ is arbitrary.
It turns out that the optimal estimate for the additive model uses techniques such as $k$-nearest neighbors to approximate univariate conditional expectations simultaneously for each of the coordinate functions. Thus the problems of estimating a conditional expectation in high dimensions are swept away in this case by imposing some (often unrealistic) model assumptions, in this case additivity.

Are we happy with the criterion (2.11)? What happens if we replace the $L_2$ loss function with the $L_1$: $\mathrm{E}|Y - f(X)|$? The solution in this case is the conditional median,

$$\hat{f}(x) = \mathrm{median}(Y \mid X = x), \tag{2.18}$$

which is a different measure of location, and its estimates are more robust than those for the conditional mean.
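The role of the median under $L_1$ loss, and its robustness, can be illustrated with a small simulation. As before, the exponential sample standing in for the conditional distribution of Y given X = x, the grid of candidate constants, and the single added outlier are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumption: draws standing in for Y given X = x (a skewed distribution).
y = rng.exponential(scale=2.0, size=100_001)

# Average absolute error of each constant prediction c (grid spacing 0.01).
candidates = np.linspace(0.0, 5.0, 501)
mae = np.array([np.mean(np.abs(y - c)) for c in candidates])
best_c = candidates[np.argmin(mae)]

print("grid minimizer of absolute error:", best_c)
print("sample median:", np.median(y), " sample mean:", y.mean())

# Robustness: one gross outlier moves the mean a lot, the median hardly at all.
y_out = np.append(y, 1e6)
mean_shift = y_out.mean() - y.mean()
median_shift = np.median(y_out) - np.median(y)
print("mean shift:", mean_shift, " median shift:", median_shift)
```

The minimizer of the absolute error tracks the sample median rather than the sample mean, and the two differ visibly here because the simulated distribution is skewed.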
$L_1$ criteria have discontinuities in their derivatives, which have hindered their widespread use. Other more resistant loss functions will be mentioned in later chapters, but squared error is analytically convenient and the most popular.

What do we do when the output is a categorical variable $G$? The same paradigm works here, except we need a different loss function for penalizing prediction errors. An estimate $\hat{G}$ will assume values in $\mathcal{G}$, the set of possible classes. Our loss function can be represented by a $K \times K$ matrix $\mathbf{L}$, where $K = \mathrm{card}(\mathcal{G})$. $\mathbf{L}$ will be zero on the diagonal and nonnegative elsewhere, where $L(k, \ell)$ is the price paid for classifying an observation belonging to class $\mathcal{G}_k$ as $\mathcal{G}_\ell$.
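A loss matrix of this kind is easy to represent as an array indexed by (true class, predicted class). The sketch below is only an illustration: the entries of `L`, the class coding 0, 1, 2, and the label vectors `g_true` and `g_pred` are all hypothetical.

```python
import numpy as np

# Hypothetical loss matrix for K = 3 classes: L[k, l] is the price paid for
# classifying an observation from class k as class l.
L = np.array([[0.0, 1.0, 5.0],
              [1.0, 0.0, 1.0],
              [2.0, 1.0, 0.0]])

# Structure described in the text: zero diagonal, nonnegative elsewhere.
assert np.all(np.diag(L) == 0.0) and np.all(L >= 0.0)

# Hypothetical true classes and predicted classes (coded 0, 1, 2).
g_true = np.array([0, 0, 1, 2, 2, 1])
g_pred = np.array([0, 2, 1, 0, 2, 0])

# Price paid for each prediction, then the average over the sample.
prices = L[g_true, g_pred]
print("prices:", prices, " average:", prices.mean())

# Charging one unit for every misclassification turns the average price
# into the misclassification rate.
L01 = 1.0 - np.eye(3)
error_rate = L01[g_true, g_pred].mean()
print("error rate:", error_rate)
```

An asymmetric matrix such as this one lets some mistakes cost more than others; here confusing class 0 with class 2 is charged five units.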
Most often we use the zero–one loss function, where all misclassifications are charged a single unit. The expected prediction error is

$$\mathrm{EPE} = \mathrm{E}[L(G, \hat{G}(X))], \tag{2.19}$$

where again the expectation is taken with respect to the joint distribution $\Pr(G, X)$. Again we condition, and can write EPE as

$$\mathrm{EPE} = \mathrm{E}_X \sum_{k=1}^{K} L[\mathcal{G}_k, \hat{G}(X)] \Pr(\mathcal{G}_k \mid X) \tag{2.20}$$

[Figure: Bayes Optimal Classifier]