2.5 Local Methods in High Dimensions

FIGURE 2.6. The curse of dimensionality is well illustrated by a subcubical neighborhood for uniform data in a unit cube. The figure on the right shows the side-length of the subcube needed to capture a fraction r of the volume of the data, for different dimensions p. In ten dimensions we need to cover 80% of the range of each coordinate to capture 10% of the data.

The median distance from the origin to the closest data point is given by the expression

    d(p, N) = (1 − (1/2)^{1/N})^{1/p}                                (2.24)

(Exercise 2.3).
A more complicated expression exists for the mean distance to the closest point. For N = 500, p = 10, d(p, N) ≈ 0.52, more than halfway to the boundary. Hence most data points are closer to the boundary of the sample space than to any other data point. The reason that this presents a problem is that prediction is much more difficult near the edges of the training sample. One must extrapolate from neighboring sample points rather than interpolate between them.
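As a quick numerical check of (2.24), the short Python sketch below (an illustration added to this excerpt, not part of the original text; the helper name median_nn_distance and the particular N and p values are ours) evaluates the median nearest-neighbor distance for a few dimensions:

    def median_nn_distance(p, N):
        # Median distance from the origin to the nearest of N points
        # uniformly distributed in the unit p-ball, equation (2.24).
        return (1.0 - 0.5 ** (1.0 / N)) ** (1.0 / p)

    for p in (1, 2, 3, 10):
        print(p, round(median_nn_distance(p, N=500), 3))
    # p = 10 gives about 0.52: the nearest point is typically more than
    # halfway from the origin to the boundary of the unit ball.

For N = 500 and p = 10 this returns roughly 0.52, matching the figure quoted above.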
Another manifestation of the curse is that the sampling density is proportional to N^{1/p}, where p is the dimension of the input space and N is the sample size. Thus, if N_1 = 100 represents a dense sample for a single input problem, then N_{10} = 100^{10} is the sample size required for the same sampling density with 10 inputs, since matching the one-dimensional density requires N_{10}^{1/10} = N_1. Thus in high dimensions all feasible training samples sparsely populate the input space.

Let us construct another uniform example.
Suppose we have 1000 training examples x_i generated uniformly on [−1, 1]^p. Assume that the true relationship between X and Y is

    Y = f(X) = e^{−8||X||^2},

without any measurement error. We use the 1-nearest-neighbor rule to predict y_0 at the test point x_0 = 0. Denote the training set by T. We can compute the expected prediction error at x_0 for our procedure, averaging over all such samples of size 1000. Since the problem is deterministic, this is the mean squared error (MSE) for estimating f(0):

    MSE(x_0) = E_T[f(x_0) − ŷ_0]^2
             = E_T[ŷ_0 − E_T(ŷ_0)]^2 + [E_T(ŷ_0) − f(x_0)]^2
             = Var_T(ŷ_0) + Bias^2(ŷ_0).                            (2.25)
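To see why the decomposition in (2.25) holds (a brief aside added to this excerpt), add and subtract E_T(ŷ_0) inside the square:

    E_T[f(x_0) − ŷ_0]^2 = E_T[ŷ_0 − E_T(ŷ_0)]^2 + [E_T(ŷ_0) − f(x_0)]^2
                          + 2 [E_T(ŷ_0) − f(x_0)] E_T[ŷ_0 − E_T(ŷ_0)],

and the last expectation E_T[ŷ_0 − E_T(ŷ_0)] is zero, leaving the variance plus the squared bias.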
Figure 2.7 illustrates the setup. We have broken down the MSE into two components that will become familiar as we proceed: variance and squared bias. Such a decomposition is always possible and often useful, and is known as the bias–variance decomposition. Unless the nearest neighbor is at 0, ŷ_0 will be smaller than f(0) in this example, and so the average estimate will be biased downward. The variance is due to the sampling variance of the 1-nearest neighbor. In low dimensions and with N = 1000, the nearest neighbor is very close to 0, and so both the bias and variance are small.
As the dimension increases, the nearest neighbor tends to stray further from the target point, and both bias and variance are incurred. By p = 10, for more than 99% of the samples the nearest neighbor is a distance greater than 0.5 from the origin. Thus as p increases, the estimate tends to be 0 more often than not, and hence the MSE levels off at 1.0, as does the bias, while the variance starts dropping (an artifact of this example).
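A minimal simulation sketch of this experiment (added here for illustration; the function name one_nn_at_origin, the 200 replications, and the reduced grid of dimensions are our choices, not the book's) estimates the squared bias and variance of the 1-nearest-neighbor estimate of f(0):

    import numpy as np

    rng = np.random.default_rng(0)

    def f(x):
        # Target function f(X) = exp(-8 ||X||^2); note f(0) = 1.
        return np.exp(-8.0 * np.sum(x ** 2, axis=-1))

    def one_nn_at_origin(p, n_train=1000, n_reps=200):
        # Monte Carlo estimate of squared bias and variance of the 1-NN
        # estimate of f(0); training data uniform on [-1, 1]^p.
        preds = np.empty(n_reps)
        for r in range(n_reps):
            X = rng.uniform(-1.0, 1.0, size=(n_train, p))
            nearest = X[np.argmin(np.sum(X ** 2, axis=1))]   # training point closest to 0
            preds[r] = f(nearest)                            # 1-NN prediction of f(0)
        bias2 = (preds.mean() - 1.0) ** 2
        var = preds.var()
        return bias2, var, bias2 + var                       # MSE = bias^2 + variance

    for p in (1, 2, 5, 10):
        print(p, one_nn_at_origin(p))

For p = 10 the squared bias should be close to 1 while the variance is small, consistent with the behavior just described.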
In thisexample, the function is a complex interaction of all p variables involved.The dependence of the bias term on distance depends on the truth, andit need not always dominate with 1-nearest neighbor. For example, if thefunction always involves only a few dimensions as in Figure 2.8, then thevariance can dominate instead.Suppose, on the other hand, that we know that the relationship betweenY and X is linear,Y = X T β + ε,(2.26)where ε ∼ N (0, σ 2 ) and we fit the model by least squares to the training data. For an arbitrary test point x0 , we have ŷ0 = xT0 β̂, which canPNbe written as ŷ0 = xT0 β + i=1 ℓi (x0 )εi , where ℓi (x0 ) is the ith elementof X(XT X)−1 x0 . Since under this model the least squares estimates are2.5 Local Methods in High Dimensions1-NN in One vs. Two Dimensions••X2-1.00.20.0-0.50.00.51.0-1.00.00.5Distance to 1-NN vs.
FIGURE 2.7. A simulation example, demonstrating the curse of dimensionality and its effect on MSE, bias and variance. The input features are uniformly distributed in [−1, 1]^p for p = 1, ..., 10. The top left panel shows the target function (no noise) in IR: f(X) = e^{−8||X||^2}, and demonstrates the error that 1-nearest neighbor makes in estimating f(0). The training point is indicated by the blue tick mark.
The top right panel illustrates why the radius of the 1-nearest neighborhood increases with dimension p. The lower left panel shows the average radius of the 1-nearest neighborhoods. The lower right panel shows the MSE, squared bias and variance curves as a function of dimension p.

FIGURE 2.8. A simulation example with the same setup as in Figure 2.7.
Here the function is constant in all but one dimension: F(X) = (1/2)(X_1 + 1)^3. The variance dominates.

Suppose, on the other hand, that we know that the relationship between Y and X is linear,

    Y = X^T β + ε,                                                   (2.26)

where ε ∼ N(0, σ^2) and we fit the model by least squares to the training data. For an arbitrary test point x_0, we have ŷ_0 = x_0^T β̂, which can be written as ŷ_0 = x_0^T β + Σ_{i=1}^N ℓ_i(x_0) ε_i, where ℓ_i(x_0) is the ith element of X(X^T X)^{−1} x_0. Since under this model the least squares estimates are unbiased, we find that

    EPE(x_0) = E_{y_0|x_0} E_T (y_0 − ŷ_0)^2
             = Var(y_0|x_0) + E_T[ŷ_0 − E_T ŷ_0]^2 + [E_T ŷ_0 − x_0^T β]^2
             = Var(y_0|x_0) + Var_T(ŷ_0) + Bias^2(ŷ_0)
             = σ^2 + E_T x_0^T (X^T X)^{−1} x_0 σ^2 + 0^2.           (2.27)

Here we have incurred an additional variance σ^2 in the prediction error, since our target is not deterministic. There is no bias, and the variance depends on x_0. If N is large and T were selected at random, and assuming E(X) = 0, then X^T X → N Cov(X) and

    E_{x_0} EPE(x_0) ≈ E_{x_0} x_0^T Cov(X)^{−1} x_0 σ^2/N + σ^2
                     = trace[Cov(X)^{−1} Cov(x_0)] σ^2/N + σ^2
                     = σ^2 (p/N) + σ^2.                              (2.28)

Here we see that the expected EPE increases linearly as a function of p, with slope σ^2/N. If N is large and/or σ^2 is small, this growth in variance is negligible (0 in the deterministic case). By imposing some heavy restrictions on the class of models being fitted, we have avoided the curse of dimensionality. Some of the technical details in (2.27) and (2.28) are derived in Exercise 2.5.
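The rate in (2.28) can be checked with a short simulation (an illustration added to this excerpt; we take the true β = 0 without loss of generality, since the extra variance term does not involve β, use a standard normal design so that Cov(X) = I, and the name extra_epe is ours):

    import numpy as np

    rng = np.random.default_rng(1)

    def extra_epe(p, N=100, sigma=1.0, n_reps=500):
        # Monte Carlo estimate of E_x0 EPE(x0) - sigma^2 for least squares,
        # i.e. the variance term in (2.28); true beta = 0, X standard normal.
        total = 0.0
        for _ in range(n_reps):
            X = rng.standard_normal((N, p))
            y = sigma * rng.standard_normal(N)              # y = X beta + eps with beta = 0
            beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
            x0 = rng.standard_normal(p)                     # random test point, Cov(x0) = I
            total += (x0 @ beta_hat) ** 2                   # (yhat0 - f(x0))^2, since f(x0) = 0
        return total / n_reps

    for p in (2, 5, 10):
        print(p, round(extra_epe(p), 3), "vs sigma^2 * p/N =", p / 100)

The two printed columns should agree to within simulation error, and both grow linearly in p with slope σ^2/N.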
Figure 2.9 compares 1-nearest neighbor vs. least squares in two situations, both of which have the form Y = f(X) + ε, X uniform as before, and ε ∼ N(0, 1). The sample size is N = 500.
FIGURE 2.9. The curves show the expected prediction error (at x_0 = 0) for 1-nearest neighbor relative to least squares for the model Y = f(X) + ε. For the orange curve, f(x) = x_1, while for the blue curve f(x) = (1/2)(x_1 + 1)^3.

For the orange curve, f(x) is linear in the first coordinate; for the blue curve it is cubic, as in Figure 2.8. Shown is the relative EPE of 1-nearest neighbor to least squares, which appears to start at around 2 for the linear case. Least squares is unbiased in this case, and as discussed above its EPE is slightly above σ^2 = 1. The EPE for 1-nearest neighbor is always above 2, since the variance of f̂(x_0) in this case is at least σ^2, and the ratio increases with dimension as the nearest neighbor strays from the target point. For the cubic case, least squares is biased, which moderates the ratio. Clearly we could manufacture examples where the bias of least squares would dominate the variance, and the 1-nearest neighbor would come out the winner.
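The linear case in Figure 2.9 is easy to reproduce approximately (a sketch added to this excerpt; we fit least squares with an intercept, evaluate both methods at x_0 = 0 by Monte Carlo, and the replication count is our choice):

    import numpy as np

    rng = np.random.default_rng(2)

    def epe_ratio(p, N=500, n_reps=300):
        # EPE at x0 = 0 of 1-NN relative to least squares for
        # Y = X_1 + eps, X uniform on [-1, 1]^p, eps ~ N(0, 1).
        err_nn, err_ls = 0.0, 0.0
        for _ in range(n_reps):
            X = rng.uniform(-1.0, 1.0, size=(N, p))
            y = X[:, 0] + rng.standard_normal(N)
            y0 = rng.standard_normal()                      # new response at x0 = 0, since f(0) = 0
            # 1-nearest-neighbor prediction at the origin
            nn = np.argmin(np.sum(X ** 2, axis=1))
            err_nn += (y0 - y[nn]) ** 2
            # least squares with an intercept; prediction at x0 = 0 is the intercept
            Xd = np.column_stack([np.ones(N), X])
            beta = np.linalg.lstsq(Xd, y, rcond=None)[0]
            err_ls += (y0 - beta[0]) ** 2
        return err_nn / err_ls

    for p in (1, 2, 5, 10):
        print(p, round(epe_ratio(p), 2))

The printed ratios should start near 2 for p = 1 and increase with p, as in the orange curve.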
By relying on rigid assumptions, the linear model has no bias at all and negligible variance, while the error in 1-nearest neighbor is substantially larger. However, if the assumptions are wrong, all bets are off and the 1-nearest neighbor may dominate. We will see that there is a whole spectrum of models between the rigid linear models and the extremely flexible 1-nearest-neighbor models, each with their own assumptions and biases, which have been proposed specifically to avoid the exponential growth in complexity of functions in high dimensions by drawing heavily on these assumptions.

Before we delve more deeply, let us elaborate a bit on the concept of statistical models and see how they fit into the prediction framework.
2.6 Statistical Models, Supervised Learning and Function Approximation

Our goal is to find a useful approximation f̂(x) to the function f(x) that underlies the predictive relationship between the inputs and outputs. In the theoretical setting of Section 2.4, we saw that squared error loss leads us to the regression function f(x) = E(Y|X = x) for a quantitative response. The class of nearest-neighbor methods can be viewed as direct estimates of this conditional expectation, but we have seen that they can fail in at least two ways:

• if the dimension of the input space is high, the nearest neighbors need not be close to the target point, and can result in large errors;

• if special structure is known to exist, this can be used to reduce both the bias and the variance of the estimates.

We anticipate using other classes of models for f(x), in many cases specifically designed to overcome the dimensionality problems, and here we discuss a framework for incorporating them into the prediction problem.

2.6.1 A Statistical Model for the Joint Distribution Pr(X, Y)

Suppose in fact that our data arose from a statistical model

    Y = f(X) + ε,                                                    (2.29)

where the random error ε has E(ε) = 0 and is independent of X. Note that for this model, f(x) = E(Y|X = x), and in fact the conditional distribution Pr(Y|X) depends on X only through the conditional mean f(x).
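A tiny sketch (added to this excerpt; the particular f, noise level, and window width are arbitrary choices of ours) illustrates this point empirically: under (2.29), a local average of Y near x recovers f(x).

    import numpy as np

    rng = np.random.default_rng(3)

    def f(x):
        # An arbitrary smooth f, chosen only for illustration.
        return np.sin(2 * np.pi * x)

    # Data from the additive error model Y = f(X) + eps, with eps independent
    # of X and E(eps) = 0.
    X = rng.uniform(0.0, 1.0, size=200_000)
    Y = f(X) + 0.5 * rng.standard_normal(X.size)

    x = 0.3
    window = np.abs(X - x) < 0.01      # crude local average approximating E(Y | X = x)
    print(Y[window].mean(), f(x))      # the two numbers should nearly agree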
The additive error model is a useful approximation to the truth. For most systems the input–output pairs (X, Y) will not have a deterministic relationship Y = f(X). Generally there will be other unmeasured variables that also contribute to Y, including measurement error. The additive model assumes that we can capture all these departures from a deterministic relationship via the error ε.

For some problems a deterministic relationship does hold. Many of the classification problems studied in machine learning are of this form, where the response surface can be thought of as a colored map defined in IR^p. The training data consist of colored examples from the map {x_i, g_i}, and the goal is to be able to color any point.
Here the function is deterministic, and the randomness enters through the x location of the training points. For the moment we will not pursue such problems, but will see that they can be handled by techniques appropriate for the error-based models.

The assumption in (2.29) that the errors are independent and identically distributed is not strictly necessary, but seems to be at the back of our mind when we average squared errors uniformly in our EPE criterion.