Then
\[
\begin{aligned}
\hat f(x_0) &= b(x_0)^T \bigl(\mathbf{B}^T \mathbf{W}(x_0)\mathbf{B}\bigr)^{-1} \mathbf{B}^T \mathbf{W}(x_0)\mathbf{y} \qquad (6.8)\\
            &= \sum_{i=1}^{N} l_i(x_0)\, y_i. \qquad (6.9)
\end{aligned}
\]
Equation (6.8) gives an explicit expression for the local linear regression estimate, and (6.9) highlights the fact that the estimate is linear in the y_i (the l_i(x_0) do not involve y). These weights l_i(x_0) combine the weighting kernel K_λ(x_0, ·) and the least squares operations, and are sometimes referred to as the equivalent kernel.

FIGURE 6.4. (Panels: "Local Linear Equivalent Kernel at Boundary", "Local Linear Equivalent Kernel in Interior".) The green points show the equivalent kernel l_i(x_0) for local regression. These are the weights in f̂(x_0) = Σ_{i=1}^N l_i(x_0) y_i, plotted against their corresponding x_i. For display purposes, these have been rescaled, since in fact they sum to 1. Since the yellow shaded region is the (rescaled) equivalent kernel for the Nadaraya–Watson local average, we see how local regression automatically modifies the weighting kernel to correct for biases due to asymmetry in the smoothing window.
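As a concrete illustration of (6.8)–(6.9), the following sketch forms the weighted least squares solution at a single target point and extracts the equivalent-kernel weights l_i(x_0); a tri-cube kernel with metric bandwidth λ is assumed, and all names are illustrative rather than taken from any particular package.

```python
# A minimal sketch of (6.8)-(6.9), assuming a tri-cube kernel with metric
# bandwidth lambda; function and variable names are illustrative.
import numpy as np

def tricube(t):
    """Tri-cube weight D(t) = (1 - |t|^3)^3 for |t| < 1, and 0 otherwise."""
    t = np.abs(t)
    return np.where(t < 1, (1 - t**3)**3, 0.0)

def local_linear(x0, x, y, lam):
    """Return (fhat(x0), equivalent-kernel weights l(x0)) as in (6.8)-(6.9)."""
    w = tricube((x - x0) / lam)                  # K_lambda(x0, x_i)
    B = np.column_stack([np.ones_like(x), x])    # rows b(x_i)^T = (1, x_i)
    W = np.diag(w)                               # W(x0)
    # l(x0)^T = b(x0)^T (B^T W B)^{-1} B^T W  -- computed from the x_i only
    l = np.array([1.0, x0]) @ np.linalg.solve(B.T @ W @ B, B.T @ W)
    return l @ y, l

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 100))
y = np.sin(4 * x) + rng.normal(scale=0.3, size=100)
fhat, l = local_linear(0.05, x, y, lam=0.2)      # target near the left boundary
print(fhat, l.sum())                             # the weights sum to 1
```

The printed weights sum to 1, the first of the moment conditions used in the bias expansion below.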
Figure 6.4 illustrates the effect of local linear regression on the equivalent kernel. Historically, the bias in the Nadaraya–Watson and other local average kernel methods was corrected by modifying the kernel. These modifications were based on theoretical asymptotic mean-square-error considerations, and besides being tedious to implement, are only approximate for finite sample sizes.
Local linear regression automatically modifies the kernel to correct the bias exactly to first order, a phenomenon dubbed automatic kernel carpentry. Consider the following expansion for Ef̂(x_0), using the linearity of local regression and a series expansion of the true function f around x_0:
\[
\begin{aligned}
\mathrm{E}\hat f(x_0) &= \sum_{i=1}^{N} l_i(x_0)\, f(x_i)\\
&= f(x_0)\sum_{i=1}^{N} l_i(x_0) + f'(x_0)\sum_{i=1}^{N} (x_i - x_0)\, l_i(x_0)\\
&\qquad + \frac{f''(x_0)}{2}\sum_{i=1}^{N} (x_i - x_0)^2\, l_i(x_0) + R, \qquad (6.10)
\end{aligned}
\]
where the remainder term R involves third- and higher-order derivatives of f, and is typically small under suitable smoothness assumptions. It can be shown (Exercise 6.2) that for local linear regression, Σ_{i=1}^N l_i(x_0) = 1 and Σ_{i=1}^N (x_i − x_0) l_i(x_0) = 0. Hence the middle term equals f(x_0), and since the bias is Ef̂(x_0) − f(x_0), we see that it depends only on quadratic and higher-order terms in the expansion of f.

FIGURE 6.5. (Panels: "Local Linear in Interior", "Local Quadratic in Interior".) Local linear fits exhibit bias in regions of curvature of the true function. Local quadratic fits tend to eliminate this bias.

6.1.2 Local Polynomial Regression

Why stop at local linear fits? We can fit local polynomials of any degree d,
\[
\min_{\alpha(x_0),\,\beta_j(x_0),\; j=1,\dots,d}\; \sum_{i=1}^{N} K_\lambda(x_0, x_i)\Bigl[\, y_i - \alpha(x_0) - \sum_{j=1}^{d}\beta_j(x_0)\, x_i^j \Bigr]^2, \qquad (6.11)
\]
with solution f̂(x_0) = α̂(x_0) + Σ_{j=1}^d β̂_j(x_0) x_0^j.
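The weight identities quoted above, and the degree-d generalization behind (6.11), are easy to verify numerically. A sketch under the same assumptions (tri-cube kernel; illustrative names):

```python
# A sketch of a degree-d local polynomial fit (6.11) via its equivalent
# kernel, plus a numerical check of the identities cited above:
# sum_i l_i(x0) = 1 and sum_i (x_i - x0) l_i(x0) = 0 for local linear fits
# (and the analogous vanishing of the quadratic moment for d = 2).
import numpy as np

def tricube(t):
    t = np.abs(t)
    return np.where(t < 1, (1 - t**3)**3, 0.0)

def equiv_kernel(x0, x, lam, d):
    """Equivalent-kernel weights l(x0) for a degree-d local polynomial fit."""
    w = tricube((x - x0) / lam)
    B = np.vander(x, N=d + 1, increasing=True)            # rows (1, x_i, ..., x_i^d)
    b0 = np.vander(np.array([x0]), N=d + 1, increasing=True)[0]
    WB = B * w[:, None]                                    # W(x0) B without forming W
    return b0 @ np.linalg.solve(B.T @ WB, WB.T)

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, 100))
x0, lam = 0.5, 0.25
l1 = equiv_kernel(x0, x, lam, d=1)                         # local linear
print(l1.sum(), ((x - x0) * l1).sum())                     # ~1 and ~0
l2 = equiv_kernel(x0, x, lam, d=2)                         # local quadratic
print(((x - x0)**2 * l2).sum())                            # ~0: kills the f'' term too
```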
In fact, an expansion such as (6.10) will tell us that the bias will only have components of degree d+1 and higher (Exercise 6.2). Figure 6.5 illustrates local quadratic regression. Local linear fits tend to be biased in regions of curvature of the true function, a phenomenon referred to as trimming the hills and filling the valleys. Local quadratic regression is generally able to correct this bias.

There is of course a price to be paid for this bias reduction, and that is increased variance. The fit in the right panel of Figure 6.5 is slightly more wiggly, especially in the tails.
Assuming the model y_i = f(x_i) + ε_i, with the ε_i independent and identically distributed with mean zero and variance σ^2, Var(f̂(x_0)) = σ^2 ||l(x_0)||^2, where l(x_0) is the vector of equivalent kernel weights at x_0. It can be shown (Exercise 6.3) that ||l(x_0)|| increases with d, and so there is a bias–variance tradeoff in selecting the polynomial degree. Figure 6.6 illustrates these variance curves for degree zero, one and two local polynomials.

FIGURE 6.6. The variance functions ||l(x)||^2 for local constant, linear and quadratic regression, for a metric bandwidth (λ = 0.2) tri-cube kernel.
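The variance comparison in Figure 6.6 can be reproduced approximately from the equivalent kernels, since Var(f̂(x_0)) = σ^2 ||l(x_0)||^2. A sketch, again assuming a tri-cube kernel with metric bandwidth λ = 0.2 and illustrative names:

```python
# A rough numerical analogue of Figure 6.6: the factor ||l(x0)||^2 in
# Var(fhat(x0)) = sigma^2 ||l(x0)||^2, for local constant, linear and
# quadratic fits with a metric-width (lambda = 0.2) tri-cube kernel.
import numpy as np

def tricube(t):
    t = np.abs(t)
    return np.where(t < 1, (1 - t**3)**3, 0.0)

def variance_factor(x0, x, lam, d):
    """||l(x0)||^2 for a degree-d local polynomial fit at x0."""
    w = tricube((x - x0) / lam)
    B = np.vander(x, N=d + 1, increasing=True)
    b0 = np.vander(np.array([x0]), N=d + 1, increasing=True)[0]
    WB = B * w[:, None]
    l = b0 @ np.linalg.solve(B.T @ WB, WB.T)
    return float(l @ l)

x = np.linspace(0, 1, 100)
for d, name in enumerate(["constant", "linear", "quadratic"]):
    at_boundary = variance_factor(0.0, x, lam=0.2, d=d)
    in_interior = variance_factor(0.5, x, lam=0.2, d=d)
    print(f"{name}: boundary {at_boundary:.3f}, interior {in_interior:.3f}")
# Expect the quadratic fit to be much more variable at the boundary, with the
# three curves much closer together in the interior.
```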
To summarize some collected wisdom on this issue:

• Local linear fits can reduce bias dramatically at the boundaries at a modest cost in variance. Local quadratic fits do little at the boundaries for bias, but increase the variance a lot.

• Local quadratic fits tend to be most helpful in reducing bias due to curvature in the interior of the domain.

• Asymptotic analysis suggests that local polynomials of odd degree dominate those of even degree. This is largely due to the fact that asymptotically the MSE is dominated by boundary effects.

While it may be helpful to tinker, and move from local linear fits at the boundary to local quadratic fits in the interior, we do not recommend such strategies.
Usually the application will dictate the degree of the fit. For example, if we are interested in extrapolation, then the boundary is of more interest, and local linear fits are probably more reliable.

6.2 Selecting the Width of the Kernel

In each of the kernels K_λ, λ is a parameter that controls its width:

• For the Epanechnikov or tri-cube kernel with metric width, λ is the radius of the support region.

• For the Gaussian kernel, λ is the standard deviation.

• λ is the number k of nearest neighbors in k-nearest neighborhoods, often expressed as a fraction or span k/N of the total training sample (the metric and nearest-neighbor conventions are contrasted in the sketch below).
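To make the contrast between the metric and nearest-neighbor conventions concrete, here is a brief sketch; knn_radius is a hypothetical helper, not a library function:

```python
# A small sketch contrasting a fixed metric width with the adaptive width
# implied by a nearest-neighbor span.
import numpy as np

def knn_radius(x0, x, span):
    """Adaptive width h(x0): distance to the k-th nearest x_i, k = span * N."""
    k = max(1, int(np.ceil(span * len(x))))
    return np.sort(np.abs(x - x0))[k - 1]

rng = np.random.default_rng(2)
x = np.sort(rng.beta(2, 5, size=200))      # irregularly spaced inputs
for x0 in (0.1, 0.8):
    print(f"x0 = {x0}: metric width 0.20, 30% span width {knn_radius(x0, x, 0.3):.3f}")
# A metric window keeps a constant radius (so the effective number of points
# varies); a nearest-neighbor window widens in sparse regions and narrows in
# dense ones.
```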
FIGURE 6.7. Equivalent kernels for a local linear regression smoother (tri-cube kernel; orange) and a smoothing spline (blue), with matching degrees of freedom. The vertical spikes indicate the target points.

There is a natural bias–variance tradeoff as we change the width of the averaging window, which is most explicit for local averages:

• If the window is narrow, f̂(x_0) is an average of a small number of y_i close to x_0, and its variance will be relatively large, close to that of an individual y_i. The bias will tend to be small, again because each of the E(y_i) = f(x_i) should be close to f(x_0).

• If the window is wide, the variance of f̂(x_0) will be small relative to the variance of any y_i, because of the effects of averaging. The bias will be higher, because we are now using observations x_i further from x_0, and there is no guarantee that f(x_i) will be close to f(x_0).

Similar arguments apply to local regression estimates, say local linear: as the width goes to zero, the estimates approach a piecewise-linear function that interpolates the training data¹; as the width gets infinitely large, the fit approaches the global linear least-squares fit to the data; a numerical check of both limits is sketched below.

The discussion in Chapter 5 on selecting the regularization parameter for smoothing splines applies here, and will not be repeated.
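Both limits are easy to check numerically for a local linear smoother; the sketch below (illustrative names, uniformly spaced x_i) compares a very wide window against the global least-squares line, and a narrow window against the observed y_i:

```python
# A quick check of the two limits, assuming uniformly spaced x_i and a
# tri-cube kernel; names are illustrative.
import numpy as np

def tricube(t):
    t = np.abs(t)
    return np.where(t < 1, (1 - t**3)**3, 0.0)

def local_linear_fit(x0, x, y, lam):
    w = tricube((x - x0) / lam)
    B = np.column_stack([np.ones_like(x), x])
    WB = B * w[:, None]
    beta = np.linalg.solve(B.T @ WB, WB.T @ y)   # weighted least squares at x0
    return beta[0] + beta[1] * x0

rng = np.random.default_rng(3)
x = np.linspace(0, 1, 101)
y = np.sin(4 * x) + rng.normal(scale=0.05, size=101)

ols = np.polyfit(x, y, deg=1)                    # global least-squares line
print(local_linear_fit(0.3, x, y, lam=1e6), np.polyval(ols, 0.3))   # nearly equal
print(local_linear_fit(x[30], x, y, lam=0.03), y[30])   # narrow window: close to y_i
```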
Local regression smoothers are linear estimators; the smoother matrix in f̂ = S_λ y is built up from the equivalent kernels (6.8), and has ijth entry {S_λ}_ij = l_j(x_i). Leave-one-out cross-validation is particularly simple (Exercise 6.7), as are generalized cross-validation, C_p (Exercise 6.10), and k-fold cross-validation. The effective degrees of freedom is again defined as trace(S_λ), and can be used to calibrate the amount of smoothing.
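The following sketch pulls these pieces together for a local linear smoother with a tri-cube kernel; all names are illustrative, and the leave-one-out residuals use the standard identity for linear smoothers fit by (weighted) least squares:

```python
# A sketch of the smoother-matrix view: build S_lambda row by row from the
# equivalent kernels, read off the effective degrees of freedom
# trace(S_lambda), and form leave-one-out residuals as
# (y_i - fhat(x_i)) / (1 - {S_lambda}_ii).
import numpy as np

def tricube(t):
    t = np.abs(t)
    return np.where(t < 1, (1 - t**3)**3, 0.0)

def smoother_matrix(x, lam):
    """Row j holds the equivalent kernel at target x_j: S[j, i] = l_i(x_j)."""
    B = np.column_stack([np.ones_like(x), x])
    S = np.empty((len(x), len(x)))
    for j, x0 in enumerate(x):
        w = tricube((x - x0) / lam)
        WB = B * w[:, None]
        S[j] = np.array([1.0, x0]) @ np.linalg.solve(B.T @ WB, WB.T)
    return S

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0, 1, 100))
y = np.sin(4 * x) + rng.normal(scale=0.3, size=100)
S = smoother_matrix(x, lam=0.3)
fhat = S @ y
print("effective df:", np.trace(S))
loo_resid = (y - fhat) / (1 - np.diag(S))        # leave-one-out residuals
print("LOOCV estimate of squared prediction error:", np.mean(loo_resid**2))
```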
Figure 6.7 compares the equivalent kernels for a smoothing spline and local linear regression. The local regression smoother has a span of 40%, which results in df = trace(S_λ) = 5.86. The smoothing spline was calibrated to have the same df, and their equivalent kernels are qualitatively quite similar.

¹ With uniformly spaced x_i; with irregularly spaced x_i, the behavior can deteriorate.

6.3 Local Regression in IR^p

Kernel smoothing and local regression generalize very naturally to two or more dimensions. The Nadaraya–Watson kernel smoother fits a constant locally with weights supplied by a p-dimensional kernel. Local linear regression will fit a hyperplane locally in X, by weighted least squares, with weights supplied by a p-dimensional kernel.
It is simple to implement and is generally preferred to the local constant fit for its superior performance on the boundaries.

Let b(X) be a vector of polynomial terms in X of maximum degree d. For example, with d = 1 and p = 2 we get b(X) = (1, X_1, X_2); with d = 2 we get b(X) = (1, X_1, X_2, X_1^2, X_2^2, X_1 X_2); and trivially with d = 0 we get b(X) = 1.
At each x_0 ∈ IR^p solve
\[
\min_{\beta(x_0)} \sum_{i=1}^{N} K_\lambda(x_0, x_i)\bigl(y_i - b(x_i)^T \beta(x_0)\bigr)^2 \qquad (6.12)
\]
to produce the fit f̂(x_0) = b(x_0)^T β̂(x_0). Typically the kernel will be a radial function, such as the radial Epanechnikov or tri-cube kernel
\[
K_\lambda(x_0, x) = D\!\left(\frac{\|x - x_0\|}{\lambda}\right), \qquad (6.13)
\]
where ||·|| is the Euclidean norm. Since the Euclidean norm depends on the units in each coordinate, it makes most sense to standardize each predictor, for example, to unit standard deviation, prior to smoothing.

While boundary effects are a problem in one-dimensional smoothing, they are a much bigger problem in two or higher dimensions, since the fraction of points on the boundary is larger.
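A minimal sketch of this procedure for d = 1, assuming a radial tri-cube kernel and predictors standardized to unit standard deviation (names are illustrative):

```python
# A minimal sketch of (6.12)-(6.13) with d = 1 (a local hyperplane), a radial
# tri-cube kernel, and predictors standardized before distances are computed.
import numpy as np

def tricube(t):
    t = np.abs(t)
    return np.where(t < 1, (1 - t**3)**3, 0.0)

def local_linear_p(x0, X, y, lam):
    """Local linear fit at x0 in R^p, with radial kernel K_lambda(x0, x)."""
    mu, sd = X.mean(axis=0), X.std(axis=0)
    Z, z0 = (X - mu) / sd, (x0 - mu) / sd              # standardized predictors
    w = tricube(np.linalg.norm(Z - z0, axis=1) / lam)  # radial kernel (6.13)
    B = np.column_stack([np.ones(len(Z)), Z])          # rows b(x_i)^T = (1, x_i1, ..., x_ip)
    WB = B * w[:, None]
    beta = np.linalg.solve(B.T @ WB, WB.T @ y)         # weighted least squares (6.12)
    return np.concatenate(([1.0], z0)) @ beta          # fhat(x0) = b(x0)^T beta-hat

rng = np.random.default_rng(5)
X = rng.uniform(0, 1, size=(300, 2))
y = np.sin(3 * X[:, 0]) + X[:, 1]**2 + rng.normal(scale=0.1, size=300)
print(local_linear_p(np.array([0.5, 0.5]), X, y, lam=0.7))
```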