The Elements of Statistical Learning: Data Mining, Inference, and Prediction

6. Kernel Smoothing Methods
This localization is achieved via a weighting function or kernel Kλ(x0, xi), which assigns a weight to xi based on its distance from x0. The kernels Kλ are typically indexed by a parameter λ that dictates the width of the neighborhood. These memory-based methods require in principle little or no training; all the work gets done at evaluation time. The only parameter that needs to be determined from the training data is λ. The model, however, is the entire training data set.

We also discuss more general classes of kernel-based techniques, which tie in with structured methods in other chapters, and are useful for density estimation and classification.

The techniques in this chapter should not be confused with those associated with the more recent usage of the phrase “kernel methods”.
In this chapter kernels are mostly used as a device for localization. We discuss kernel methods in Sections 5.8, 14.5.4, 18.5 and Chapter 12; in those contexts the kernel computes an inner product in a high-dimensional (implicit) feature space, and is used for regularized nonlinear modeling. We make some connections to the methodology in this chapter at the end of Section 6.7.

[Figure 6.1 appears here: scatterplots with smoother fits; left panel "Nearest-Neighbor Kernel", right panel "Epanechnikov Kernel", each marking the fitted value f̂(x0) at the target point x0.]
FIGURE 6.1. In each panel 100 pairs (xi, yi) are generated at random from the blue curve with Gaussian errors: Y = sin(4X) + ε, X ∼ U[0, 1], ε ∼ N(0, 1/3). In the left panel the green curve is the result of a 30-nearest-neighbor running-mean smoother. The red point is the fitted constant f̂(x0), and the red circles indicate those observations contributing to the fit at x0. The solid yellow region indicates the weights assigned to observations. In the right panel, the green curve is the kernel-weighted average, using an Epanechnikov kernel with (half) window width λ = 0.2.

6.1 One-Dimensional Kernel Smoothers

In Chapter 2, we motivated the k-nearest-neighbor average

\[ \hat{f}(x) = \mathrm{Ave}\,(y_i \mid x_i \in N_k(x)) \tag{6.1} \]

as an estimate of the regression function E(Y | X = x). Here Nk(x) is the set of k points nearest to x in squared distance, and Ave denotes the average (mean).
The idea is to relax the definition of conditional expectation, as illustrated in the left panel of Figure 6.1, and compute an average in a neighborhood of the target point. In this case we have used the 30-nearest neighborhood: the fit at x0 is the average of the 30 pairs whose xi values are closest to x0. The green curve is traced out as we apply this definition at different values x0. The green curve is bumpy, since f̂(x) is discontinuous in x. As we move x0 from left to right, the k-nearest neighborhood remains constant, until a point xi to the right of x0 becomes closer than the furthest point xi′ in the neighborhood to the left of x0, at which time xi replaces xi′. The average in (6.1) changes in a discrete way, leading to a discontinuous f̂(x).
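As a concrete, minimal sketch of the running-mean estimate (6.1), the following Python/NumPy fragment (not from the book; the function name knn_smooth, the seed, and the error scale are illustrative assumptions) computes the 30-nearest-neighbor average at a target point, in the spirit of the left panel of Figure 6.1.

import numpy as np

def knn_smooth(x_train, y_train, x0, k=30):
    # Estimate f(x0) as in (6.1): average the y's of the k x's nearest to x0.
    dist = np.abs(x_train - x0)        # squared distance gives the same ordering in 1-D
    nearest = np.argsort(dist)[:k]     # indices of the neighborhood N_k(x0)
    return y_train[nearest].mean()     # Ave(y_i | x_i in N_k(x0))

# Simulated data in the spirit of Figure 6.1 (assuming N(0, 1/3) refers to the variance).
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 100)
y = np.sin(4.0 * x) + rng.normal(0.0, np.sqrt(1.0 / 3.0), 100)

# Sweeping x0 over a grid traces out a piecewise-constant, hence bumpy, fitted curve.
grid = np.linspace(0.0, 1.0, 201)
fhat = np.array([knn_smooth(x, y, x0) for x0 in grid])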
This discontinuity is ugly and unnecessary. Rather than give all the points in the neighborhood equal weight, we can assign weights that die off smoothly with distance from the target point. The right panel shows an example of this, using the so-called Nadaraya–Watson kernel-weighted average

\[ \hat{f}(x_0) = \frac{\sum_{i=1}^{N} K_\lambda(x_0, x_i)\, y_i}{\sum_{i=1}^{N} K_\lambda(x_0, x_i)}, \tag{6.2} \]

with the Epanechnikov quadratic kernel

\[ K_\lambda(x_0, x) = D\!\left( \frac{|x - x_0|}{\lambda} \right), \tag{6.3} \]

with

\[ D(t) = \begin{cases} \tfrac{3}{4}(1 - t^2) & \text{if } |t| \le 1; \\ 0 & \text{otherwise.} \end{cases} \tag{6.4} \]

The fitted function is now continuous, and quite smooth in the right panel of Figure 6.1. As we move the target from left to right, points enter the neighborhood initially with weight zero, and then their contribution slowly increases (see Exercise 6.1).
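A correspondingly small sketch of the Nadaraya–Watson estimate (6.2) with the Epanechnikov kernel (6.3)–(6.4); again this is an illustrative NumPy fragment, not the book's code, and the function names are my own.

import numpy as np

def epanechnikov(t):
    # D(t) of Eq. (6.4): 3/4 (1 - t^2) on [-1, 1], zero outside.
    return np.where(np.abs(t) <= 1.0, 0.75 * (1.0 - t**2), 0.0)

def nadaraya_watson(x_train, y_train, x0, lam=0.2):
    # Kernel-weighted average (6.2) with the metric-width kernel (6.3).
    w = epanechnikov(np.abs(x_train - x0) / lam)   # K_lambda(x0, x_i)
    if w.sum() == 0.0:
        return np.nan                              # no observations inside the window
    return np.sum(w * y_train) / np.sum(w)

Sweeping x0 over a grid, as with the running mean above, now gives a continuous curve, since observations enter and leave the window with weight zero.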
In the right panel we used a metric window size λ = 0.2 for the kernel fit, which does not change as we move the target point x0, while the size of the 30-nearest-neighbor smoothing window adapts to the local density of the xi. One can, however, also use such adaptive neighborhoods with kernels, but we need to use a more general notation. Let hλ(x0) be a width function (indexed by λ) that determines the width of the neighborhood at x0. Then more generally we have

\[ K_\lambda(x_0, x) = D\!\left( \frac{|x - x_0|}{h_\lambda(x_0)} \right). \tag{6.5} \]

In (6.3), hλ(x0) = λ is constant. For k-nearest neighborhoods, the neighborhood size k replaces λ, and we have hk(x0) = |x0 − x[k]|, where x[k] is the kth closest xi to x0.

There are a number of details that one has to attend to in practice:

• The smoothing parameter λ, which determines the width of the local neighborhood, has to be determined.
Large λ implies lower variance (averages over more observations) but higher bias (we essentially assume the true function is constant within the window).

• Metric window widths (constant hλ(x)) tend to keep the bias of the estimate constant, but the variance is inversely proportional to the local density. Nearest-neighbor window widths exhibit the opposite behavior; the variance stays constant and the absolute bias varies inversely with local density.

• Issues arise with nearest-neighbors when there are ties in the xi.
With most smoothing techniques one can simply reduce the data set by averaging the yi at tied values of X, and supplementing these new observations at the unique values of xi with an additional weight wi (which multiplies the kernel weight).

• This leaves a more general problem to deal with: observation weights wi. Operationally we simply multiply them by the kernel weights before computing the weighted average. With nearest neighborhoods, it is now natural to insist on neighborhoods with a total weight content k (relative to Σ wi). In the event of overflow (the last observation needed in a neighborhood has a weight wj which causes the sum of weights to exceed the budget k), then fractional parts can be used.

• Boundary issues arise. The metric neighborhoods tend to contain fewer points on the boundaries, while the nearest-neighborhoods get wider.

• The Epanechnikov kernel has compact support (needed when used with nearest-neighbor window size). Another popular compact kernel is based on the tri-cube function

\[ D(t) = \begin{cases} (1 - |t|^3)^3 & \text{if } |t| \le 1; \\ 0 & \text{otherwise.} \end{cases} \tag{6.6} \]

This is flatter on the top (like the nearest-neighbor box) and is differentiable at the boundary of its support. The Gaussian density function D(t) = φ(t) is a popular noncompact kernel, with the standard deviation playing the role of the window size. Figure 6.2 compares the three; a short code sketch of the three kernel functions follows the figure caption.

[Figure 6.2 appears here: the Epanechnikov, tri-cube and Gaussian kernels Kλ(x0, x) plotted on a common scale.]

FIGURE 6.2. A comparison of three popular kernels for local smoothing. Each has been calibrated to integrate to 1. The tri-cube kernel is compact and has two continuous derivatives at the boundary of its support, while the Epanechnikov kernel has none. The Gaussian kernel is continuously differentiable, but has infinite support.
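The three kernels in Figure 6.2 are one-liners; the sketch below (NumPy, my own illustrative code, and without the calibration to integrate to 1 used in the figure) simply encodes the three D(t) functions.

import numpy as np

def epanechnikov(t):
    return np.where(np.abs(t) <= 1.0, 0.75 * (1.0 - t**2), 0.0)      # Eq. (6.4), compact support

def tricube(t):
    return np.where(np.abs(t) <= 1.0, (1.0 - np.abs(t)**3)**3, 0.0)  # Eq. (6.6), compact support

def gaussian(t):
    return np.exp(-0.5 * t**2) / np.sqrt(2.0 * np.pi)                # D(t) = phi(t), infinite support

# Tabulate the three kernels on a small grid (cf. Figure 6.2).
t = np.linspace(-1.5, 1.5, 7)
print(np.column_stack([epanechnikov(t), tricube(t), gaussian(t)]))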
6.1.1 Local Linear Regression

We have progressed from the raw moving average to a smoothly varying locally weighted average by using kernel weighting. The smooth kernel fit still has problems, however, as exhibited in Figure 6.3 (left panel). Locally weighted averages can be badly biased on the boundaries of the domain, because of the asymmetry of the kernel in that region. By fitting straight lines rather than constants locally, we can remove this bias exactly to first order; see Figure 6.3 (right panel). Actually, this bias can be present in the interior of the domain as well, if the X values are not equally spaced (for the same reasons, but usually less severe). Again locally weighted linear regression will make a first-order correction.

[Figure 6.3 appears here: left panel "N-W Kernel at Boundary", right panel "Local Linear Regression at Boundary"; each shows the fit f̂(x0) at a boundary target point x0.]

FIGURE 6.3. The locally weighted average has bias problems at or near the boundaries of the domain. The true function is approximately linear here, but most of the observations in the neighborhood have a higher mean than the target point, so despite weighting, their mean will be biased upwards. By fitting a locally weighted linear regression (right panel), this bias is removed to first order.
Locally weighted regression solves a separate weighted least squares problem at each target point x0:

\[ \min_{\alpha(x_0),\, \beta(x_0)} \sum_{i=1}^{N} K_\lambda(x_0, x_i)\, \big[y_i - \alpha(x_0) - \beta(x_0) x_i\big]^2 . \tag{6.7} \]

The estimate is then f̂(x0) = α̂(x0) + β̂(x0) x0. Notice that although we fit an entire linear model to the data in the region, we only use it to evaluate the fit at the single point x0.

Define the vector-valued function b(x)^T = (1, x). Let B be the N × 2 regression matrix with ith row b(xi)^T, and W(x0) the N × N diagonal matrix with ith diagonal element Kλ(x0, xi).
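A minimal sketch of the local linear fit (6.7) at a single target point, solving the weighted least squares problem with NumPy; the function name and the use of square-root weights with an ordinary least squares solver are my own illustrative choices, not the book's.

import numpy as np

def local_linear(x_train, y_train, x0, lam=0.2):
    # Epanechnikov weights K_lambda(x0, x_i), as in (6.3)-(6.4).
    t = np.abs(x_train - x0) / lam
    w = np.where(t <= 1.0, 0.75 * (1.0 - t**2), 0.0)
    if w.sum() == 0.0:
        return np.nan                                  # target falls outside every kernel window
    # Regression matrix B with i-th row b(x_i)^T = (1, x_i).
    B = np.column_stack([np.ones_like(x_train), x_train])
    # Minimize sum_i w_i (y_i - alpha - beta x_i)^2 by rescaling with sqrt(w_i)
    # and solving an ordinary least squares problem.
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(sw[:, None] * B, sw * y_train, rcond=None)
    alpha, beta = coef
    # Although a whole line is fitted locally, it is evaluated only at x0.
    return alpha + beta * x0

Equivalently, with B and W(x0) as defined above, the fitted value can be written in closed form as b(x0)^T (B^T W(x0) B)^{-1} B^T W(x0) y.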