In fact, one of the manifestations of the curse of dimensionality is that the fraction of points close to the boundary increases to one as the dimension grows. Directly modifying the kernel to accommodate two-dimensional boundaries becomes very messy, especially for irregular boundaries. Local polynomial regression seamlessly performs boundary correction to the desired order in any dimensions. Figure 6.8 illustrates local linear regression on some measurements from an astronomical study with an unusual predictor design (star-shaped). Here the boundary is extremely irregular, and the fitted surface must also interpolate over regions of increasing data sparsity as we approach the boundary.
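To make the computation explicit, here is a minimal sketch (not code from the book; the function name, the synthetic data, and the 15% nearest-neighbor tricube window are illustrative choices) of a local linear fit at a single target point x0 in IR^2:

```python
# Hedged sketch: local linear regression at one target point x0 in IR^2,
# using a nearest-neighbor tricube kernel (span = fraction of data in window).
import numpy as np

def local_linear_fit(X, y, x0, span=0.15):
    """Local linear regression fit at the single point x0; X is (N, 2), y is (N,)."""
    N = len(y)
    dists = np.linalg.norm(X - x0, axis=1)
    k = max(int(np.ceil(span * N)), 2)         # nearest-neighbor window size
    lam = np.sort(dists)[k - 1]                # adaptive bandwidth at x0
    u = np.clip(dists / lam, 0.0, 1.0)
    w = (1.0 - u ** 3) ** 3                    # tricube weights, zero outside window
    B = np.column_stack([np.ones(N), X - x0])  # linear basis centered at x0
    WB = B * w[:, None]
    beta = np.linalg.solve(WB.T @ B, WB.T @ y) # weighted least squares
    return beta[0]                             # intercept = fitted value at x0

# Toy data, purely for illustration:
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(500, 2))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.standard_normal(500)
print(local_linear_fit(X, y, x0=np.array([0.2, -0.3])))
```

Because the linear basis is centered at x0, the intercept of the weighted least squares fit is the fitted value $\hat f(x_0)$, and the linear terms supply the first-order boundary correction discussed above.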
Local regression becomes less useful in dimensions much higher than two or three. We have discussed in some detail the problems of dimensionality, for example, in Chapter 2. It is impossible to simultaneously maintain localness (⇒ low bias) and a sizable sample in the neighborhood (⇒ low variance) as the dimension increases, without the total sample size increasing exponentially in p. Visualization of $\hat f(X)$ also becomes difficult in higher dimensions, and this is often one of the primary goals of smoothing.
FIGURE 6.8. The left panel shows three-dimensional data, where the response is the velocity measurements on a galaxy, and the two predictors record positions on the celestial sphere. The unusual "star"-shaped design indicates the way the measurements were made, and results in an extremely irregular boundary. The right panel shows the results of local linear regression smoothing in IR^2, using a nearest-neighbor window with 15% of the data.

Although the scatter-cloud and wire-frame pictures in Figure 6.8 look attractive, it is quite difficult to interpret the results except at a gross level. From a data analysis perspective, conditional plots are far more useful. Figure 6.9 shows an analysis of some environmental data with three predictors.
The trellis display here shows ozone as a function of radiation, conditioned on the other two variables, temperature and wind speed. However, conditioning on the value of a variable really implies local to that value (as in local regression). Above each of the panels in Figure 6.9 is an indication of the range of values present in that panel for each of the conditioning values. In the panel itself the data subsets are displayed (response versus remaining variable), and a one-dimensional local linear regression is fit to the data. Although this is not quite the same as looking at slices of a fitted three-dimensional surface, it is probably more useful in terms of understanding the joint behavior of the data.

6.4 Structured Local Regression Models in IR^p

When the dimension to sample-size ratio is unfavorable, local regression does not help us much, unless we are willing to make some structural assumptions about the model.
Much of this book is about structured regression and classification models. Here we focus on some approaches directly related to kernel methods.
FIGURE 6.9. Three-dimensional smoothing example. The response is (cube-root of) ozone concentration, and the three predictors are temperature, wind speed and radiation. The trellis display shows ozone as a function of radiation, conditioned on intervals of temperature and wind speed (indicated by darker green or orange shaded bars).
Each panel contains about 40% of the range of each of the conditioned variables. The curve in each panel is a univariate local linear regression, fit to the data in the panel.

6.4.1 Structured Kernels

One line of approach is to modify the kernel. The default spherical kernel (6.13) gives equal weight to each coordinate, and so a natural default strategy is to standardize each variable to unit standard deviation. A more general approach is to use a positive semidefinite matrix A to weigh the different coordinates:

$$K_{\lambda,A}(x_0, x) = D\left(\frac{(x - x_0)^T A (x - x_0)}{\lambda}\right). \qquad (6.14)$$

Entire coordinates or directions can be downgraded or omitted by imposing appropriate restrictions on A. For example, if A is diagonal, then we can increase or decrease the influence of individual predictors X_j by increasing or decreasing A_jj.
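As a concrete reading of (6.14), the sketch below (my own illustration, not the book's code; the tricube profile for D, the toy inputs, and the value of λ are arbitrary choices) evaluates the structured kernel through the quadratic form in A:

```python
# Hedged sketch of the structured kernel (6.14): a quadratic form through a
# positive semidefinite A, passed through a smooth profile D (tricube here).
import numpy as np

def tricube(t):
    t = np.clip(np.abs(t), 0.0, 1.0)
    return (1.0 - t ** 3) ** 3

def structured_kernel(x0, X, A, lam):
    """K_{lam,A}(x0, x_i) = D((x_i - x0)^T A (x_i - x0) / lam), with D = tricube."""
    diff = X - x0                                   # (N, p) displacements from x0
    quad = np.einsum('ij,jk,ik->i', diff, A, diff)  # quadratic form for each row
    return tricube(quad / lam)

# A diagonal A rescales the influence of each coordinate; here the second
# coordinate carries only one tenth of the weight of the first.
X = np.random.default_rng(1).normal(size=(6, 2))
A = np.diag([1.0, 0.1])
print(structured_kernel(np.zeros(2), X, A, lam=2.0))
```

With the diagonal A used here, displacements along the second coordinate contribute only a tenth as much to the distance, so that predictor is correspondingly downweighted in any local fit built on these weights.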
Often the predictors are many and highly correlated, such as those arising from digitized analog signals or images. The covariance function of the predictors can be used to tailor a metric A that focuses less, say, on high-frequency contrasts (Exercise 6.4). Proposals have been made for learning the parameters for multidimensional kernels. For example, the projection-pursuit regression model discussed in Chapter 11 is of this flavor, where low-rank versions of A imply ridge functions for $\hat f(X)$. More general models for A are cumbersome, and we favor instead the structured forms for the regression function discussed next.

6.4.2 Structured Regression Functions

We are trying to fit a regression function $E(Y|X) = f(X_1, X_2, \ldots, X_p)$ in IR^p, in which every level of interaction is potentially present.
It is natural to consider analysis-of-variance (ANOVA) decompositions of the form

$$f(X_1, X_2, \ldots, X_p) = \alpha + \sum_j g_j(X_j) + \sum_{k < \ell} g_{k\ell}(X_k, X_\ell) + \cdots \qquad (6.15)$$

and then introduce structure by eliminating some of the higher-order terms. Additive models assume only main effect terms: $f(X) = \alpha + \sum_{j=1}^p g_j(X_j)$; second-order models will have terms with interactions of order at most two, and so on. In Chapter 9, we describe iterative backfitting algorithms for fitting such low-order interaction models. In the additive model, for example, if all but the kth term is assumed known, then we can estimate $g_k$ by local regression of $Y - \sum_{j \ne k} g_j(X_j)$ on $X_k$. This is done for each function in turn, repeatedly, until convergence.
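A minimal sketch of that backfitting loop follows (my own illustration, assuming a Gaussian-kernel local linear smoother, a fixed bandwidth, and a fixed number of passes instead of a convergence check):

```python
# Hedged sketch: backfitting an additive model f(X) = alpha + sum_j g_j(X_j)
# by repeatedly smoothing partial residuals Y - sum_{j != k} g_j(X_j) on X_k.
import numpy as np

def smooth1d(x, r, lam=0.3):
    """One-dimensional local linear regression of r on x (Gaussian kernel)."""
    fit = np.empty_like(r)
    for i, x0 in enumerate(x):
        w = np.exp(-0.5 * ((x - x0) / lam) ** 2)
        B = np.column_stack([np.ones_like(x), x - x0])
        WB = B * w[:, None]
        beta = np.linalg.solve(WB.T @ B, WB.T @ r)
        fit[i] = beta[0]                       # fitted value at x0
    return fit

def backfit(X, y, n_iter=20, lam=0.3):
    N, p = X.shape
    alpha = y.mean()
    g = np.zeros((N, p))
    for _ in range(n_iter):
        for k in range(p):
            others = [j for j in range(p) if j != k]
            partial = y - alpha - g[:, others].sum(axis=1)   # partial residual
            g[:, k] = smooth1d(X[:, k], partial, lam)
            g[:, k] -= g[:, k].mean()          # center each g_k for identifiability
    return alpha, g

# Toy additive data, purely for illustration:
rng = np.random.default_rng(2)
X = rng.uniform(-2.0, 2.0, size=(200, 3))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.standard_normal(200)
alpha, g = backfit(X, y)
```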
The important detail is that at any stage, one-dimensional local regression is all that is needed. The same ideas can be used to fit low-dimensional ANOVA decompositions.

An important special case of these structured models is the class of varying coefficient models. Suppose, for example, that we divide the p predictors in X into a set $(X_1, X_2, \ldots, X_q)$ with $q < p$, and the remainder of the variables we collect in the vector Z.

FIGURE 6.10. In each panel the aorta diameter is modeled as a linear function of age. The coefficients of this model vary with gender and depth down the aorta (left is near the top, right is low down). There is a clear trend in the coefficients of the linear model.
We then assume the conditionally linear model

$$f(X) = \alpha(Z) + \beta_1(Z)X_1 + \cdots + \beta_q(Z)X_q. \qquad (6.16)$$

For given Z, this is a linear model, but each of the coefficients can vary with Z. It is natural to fit such a model by locally weighted least squares:

$$\min_{\alpha(z_0),\,\beta(z_0)} \sum_{i=1}^{N} K_\lambda(z_0, z_i)\,\bigl(y_i - \alpha(z_0) - x_{1i}\beta_1(z_0) - \cdots - x_{qi}\beta_q(z_0)\bigr)^2. \qquad (6.17)$$

Figure 6.10 illustrates the idea on measurements of the human aorta. A longstanding claim has been that the aorta thickens with age. Here we model the diameter of the aorta as a linear function of age, but allow the coefficients to vary with gender and depth down the aorta.
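The criterion (6.17) is just a weighted least squares problem at each target value z0. Here is a hedged sketch (my own illustration, not the authors' code, with a Gaussian kernel in z and synthetic data loosely mimicking the aorta example):

```python
# Hedged sketch of the varying coefficient fit (6.16)-(6.17): at each z0,
# regress y on (1, X_1, ..., X_q) with kernel weights K_lam(z0, z_i) in z.
import numpy as np

def varying_coef_fit(z, X, y, z0, lam=0.2):
    """Minimize sum_i K_lam(z0, z_i) (y_i - alpha - x_i^T beta)^2 at z0."""
    w = np.exp(-0.5 * ((z - z0) / lam) ** 2)    # Gaussian kernel in z
    B = np.column_stack([np.ones_like(y), X])   # columns: 1, X_1, ..., X_q
    WB = B * w[:, None]
    coef = np.linalg.solve(WB.T @ B, WB.T @ y)  # locally weighted least squares
    return coef[0], coef[1:]                    # alpha(z0), beta(z0)

# Synthetic data: one predictor (age) whose slope drifts with z (e.g. depth).
rng = np.random.default_rng(3)
N = 400
z = rng.uniform(0.0, 1.0, N)
age = rng.uniform(20.0, 60.0, N)
y = 10.0 + (0.10 - 0.08 * z) * age + 0.5 * rng.standard_normal(N)
alpha0, beta0 = varying_coef_fit(z, age[:, None], y, z0=0.5)
print(alpha0, beta0)                            # slope near 0.10 - 0.08 * 0.5
```

Sweeping z0 over a grid of depths traces out the estimated intercept and slope curves, in the spirit of Figure 6.11.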
We used a local regression model separately for males and females. While the aorta clearly does thicken with age at the higher regions of the aorta, the relationship fades with distance down the aorta. Figure 6.11 shows the intercept and slope as a function of depth.

FIGURE 6.11. The intercept and slope of age as a function of distance down the aorta, separately for males and females. The yellow bands indicate one standard error.

6.5 Local Likelihood and Other Models

The concept of local regression and varying coefficient models is extremely broad: any parametric model can be made local if the fitting method accommodates observation weights.
Here are some examples:

• Associated with each observation y_i is a parameter $\theta_i = \theta(x_i) = x_i^T\beta$ linear in the covariate(s) x_i, and inference for β is based on the log-likelihood $l(\beta) = \sum_{i=1}^N l(y_i, x_i^T\beta)$. We can model θ(X) more flexibly by using the likelihood local to x0 for inference of $\theta(x_0) = x_0^T\beta(x_0)$:

$$l(\beta(x_0)) = \sum_{i=1}^{N} K_\lambda(x_0, x_i)\, l(y_i, x_i^T\beta(x_0)).$$

Many likelihood models, in particular the family of generalized linear models including logistic and log-linear models, involve the covariates in a linear fashion.
Local likelihood allows a relaxation from a globally linear model to one that is locally linear; a small code sketch of this idea follows these examples.

• As above, except different variables are associated with θ from those used for defining the local likelihood:

$$l(\theta(z_0)) = \sum_{i=1}^{N} K_\lambda(z_0, z_i)\, l(y_i, \eta(x_i, \theta(z_0))).$$

For example, $\eta(x, \theta) = x^T\theta$ could be a linear model in x. This will fit a varying coefficient model θ(z) by maximizing the local likelihood.

• Autoregressive time series models of order k have the form $y_t = \beta_0 + \beta_1 y_{t-1} + \beta_2 y_{t-2} + \cdots + \beta_k y_{t-k} + \varepsilon_t$.
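Picking up the first example above, here is a minimal local-likelihood sketch (my own illustration, not the book's code): a local logistic regression at one target point, maximizing the kernel-weighted log-likelihood by Newton steps, i.e., locally weighted iteratively reweighted least squares. The Gaussian kernel, bandwidth, and data are arbitrary choices.

```python
# Hedged sketch: local logistic regression via Newton steps on the
# kernel-weighted log-likelihood l(beta(x0)) = sum_i K_lam(x0, x_i) l(y_i, x_i^T beta(x0)).
import numpy as np

def local_logistic(x, y, x0, lam=0.5, n_iter=25):
    """Estimate P(Y = 1 | x = x0) by a locally linear logistic fit."""
    w = np.exp(-0.5 * ((x - x0) / lam) ** 2)        # kernel weights K_lam(x0, x_i)
    B = np.column_stack([np.ones_like(x), x - x0])  # locally linear in x
    beta = np.zeros(2)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-B @ beta))         # current fitted probabilities
        grad = B.T @ (w * (y - p))                  # weighted score
        H = (B * (w * p * (1.0 - p))[:, None]).T @ B  # weighted information matrix
        beta += np.linalg.solve(H, grad)            # Newton / local IRLS step
    return 1.0 / (1.0 + np.exp(-beta[0]))           # fitted probability at x0

# Synthetic binary data, purely for illustration:
rng = np.random.default_rng(4)
x = rng.uniform(-3.0, 3.0, 300)
p_true = 1.0 / (1.0 + np.exp(-np.sin(2.0 * x)))
y = (rng.uniform(size=300) < p_true).astype(float)
print(local_logistic(x, y, x0=0.8))
```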