The Elements of Statistical Learning: Data Mining, Inference, and Prediction
FIGURE 5.6. The response is the relative change in bone mineral density measured at the spine in adolescents, as a function of age. A separate smoothing spline was fit to the males and females, with λ ≈ 0.00022. This choice corresponds to about 12 degrees of freedom.

…where the N_j(x) are an N-dimensional set of basis functions for representing this family of natural splines (Section 5.2.1 and Exercise 5.4). The criterion thus reduces to

    RSS(θ, λ) = (y − Nθ)^T (y − Nθ) + λ θ^T Ω_N θ,        (5.11)

where {N}_{ij} = N_j(x_i) and {Ω_N}_{jk} = ∫ N_j''(t) N_k''(t) dt. The solution is easily seen to be

    θ̂ = (N^T N + λ Ω_N)^{-1} N^T y,        (5.12)

a generalized ridge regression.
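The computation in (5.11)–(5.12) is easy to sketch numerically. In the following Python fragment, a cubic B-spline basis with a modest knot set stands in for the natural-spline basis N_j (an assumption made purely for brevity; the algebra is identical), and the penalty matrix Ω_N is approximated by quadrature on a fine grid:

    import numpy as np
    from scipy.interpolate import BSpline

    rng = np.random.default_rng(0)
    x = np.sort(rng.uniform(0.0, 1.0, 60))
    y = np.sin(4.0 * np.pi * x) + rng.normal(scale=0.3, size=x.size)

    k = 3                                             # cubic splines
    t = np.r_[[0.0] * k, np.linspace(0, 1, 12), [1.0] * k]   # clamped knot vector
    M = len(t) - k - 1                                # number of basis functions

    N = BSpline.design_matrix(x, t, k).toarray()      # {N}_ij = N_j(x_i)

    # {Omega}_jk = integral of N_j''(t) N_k''(t) dt, by quadrature on a grid
    grid = np.linspace(0.0, 1.0, 2001)
    d2 = BSpline(t, np.eye(M), k)(grid, nu=2)         # columns are the N_j''
    w = np.gradient(grid)                             # quadrature weights
    Omega = d2.T @ (d2 * w[:, None])

    lam = 1e-5
    theta_hat = np.linalg.solve(N.T @ N + lam * Omega, N.T @ y)   # eq. (5.12)
    f_hat = N @ theta_hat                             # fitted values at the x_i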
The fitted smoothing spline is given by

    f̂(x) = Σ_{j=1}^N N_j(x) θ̂_j.        (5.13)

Efficient computational techniques for smoothing splines are discussed in the Appendix to this chapter.

Figure 5.6 shows a smoothing spline fit to some data on bone mineral density (BMD) in adolescents. The response is relative change in spinal BMD over two consecutive visits, typically about one year apart. The data are color coded by gender, and two separate curves were fit.
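For routine use one need not form these matrices by hand; for example, SciPy (version 1.10 and later) provides scipy.interpolate.make_smoothing_spline, which minimizes the same penalized criterion. A minimal sketch on synthetic data (the BMD data themselves are not reproduced here, and the value of lam merely echoes the text; its scale depends on the units of the data):

    import numpy as np
    from scipy.interpolate import make_smoothing_spline

    rng = np.random.default_rng(1)
    age = np.sort(rng.uniform(9.0, 25.0, 200))        # synthetic "age" values
    signal = 0.1 * np.exp(-((age - 13.0) ** 2) / 8.0)  # a bump, for illustration
    bmd = signal + rng.normal(scale=0.03, size=age.size)

    spl = make_smoothing_spline(age, bmd, lam=0.00022)  # fixed smoothing parameter
    fitted = spl(age)                                    # evaluate f_hat at the data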
This simple summary reinforces the evidence in the data that the growth spurt for females precedes that for males by about two years. In both cases the smoothing parameter λ was approximately 0.00022; this choice is discussed in the next section.

5.4.1 Degrees of Freedom and Smoother Matrices

We have not yet indicated how λ is chosen for the smoothing spline. Later in this chapter we describe automatic methods using techniques such as cross-validation.
In this section we discuss intuitive ways of prespecifying the amount of smoothing.

A smoothing spline with prechosen λ is an example of a linear smoother (as in linear operator). This is because the estimated parameters in (5.12) are a linear combination of the y_i. Denote by f̂ the N-vector of fitted values f̂(x_i) at the training predictors x_i. Then

    f̂ = N(N^T N + λ Ω_N)^{-1} N^T y = S_λ y.        (5.14)

Again the fit is linear in y, and the finite linear operator S_λ is known as the smoother matrix. One consequence of this linearity is that the recipe for producing f̂ from y does not depend on y itself; S_λ depends only on the x_i and λ. Linear operators are familiar in more traditional least squares fitting as well.
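The linearity is easy to verify directly. A minimal sketch, in which a discrete second-difference penalty on an equally spaced grid stands in for the exact spline penalty (an assumption carried through the remaining sketches in this section):

    import numpy as np

    n, lam = 40, 1.0
    D = np.diff(np.eye(n), n=2, axis=0)              # second-difference operator
    S = np.linalg.inv(np.eye(n) + lam * (D.T @ D))   # built from the grid and lam only

    rng = np.random.default_rng(3)
    y1, y2 = rng.normal(size=n), rng.normal(size=n)
    # One fixed S_lambda maps every response vector linearly to its fit:
    print(np.allclose(S @ (y1 + 2.0 * y2), S @ y1 + 2.0 * (S @ y2)))  # True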
Suppose B_ξ is an N × M matrix of M cubic-spline basis functions evaluated at the N training points x_i, with knot sequence ξ, and M ≪ N. Then the vector of fitted spline values is given by

    f̂ = B_ξ (B_ξ^T B_ξ)^{-1} B_ξ^T y = H_ξ y.        (5.15)

Here the linear operator H_ξ is a projection operator, also known as the hat matrix in statistics. There are some important similarities and differences between H_ξ and S_λ:

• Both are symmetric, positive semidefinite matrices.

• H_ξ H_ξ = H_ξ (idempotent), while S_λ S_λ ⪯ S_λ, meaning that the right-hand side exceeds the left-hand side by a positive semidefinite matrix. This is a consequence of the shrinking nature of S_λ, which we discuss further below.

• H_ξ has rank M, while S_λ has rank N.

The expression M = trace(H_ξ) gives the dimension of the projection space, which is also the number of basis functions, and hence the number of parameters involved in the fit.
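The idempotent-versus-shrinking distinction can also be checked numerically; a sketch, where H is built from a small polynomial basis via QR (equivalent to the hat-matrix formula above, but numerically stabler) and S from the discrete stand-in penalty in the form (I + λK)^{-1} derived below:

    import numpy as np

    n = 50
    x = np.linspace(0.0, 1.0, n)
    B = np.vander(x, 6)                          # a small regression basis, M = 6
    Q, _ = np.linalg.qr(B)
    H = Q @ Q.T                                  # hat matrix: projection onto col(B)

    D = np.diff(np.eye(n), n=2, axis=0)
    S = np.linalg.inv(np.eye(n) + 0.01 * (D.T @ D))   # smoother, lambda = 0.01

    print(np.allclose(H @ H, H))                             # idempotent
    print(np.all(np.linalg.eigvalsh(S - S @ S) > -1e-10))    # S - SS is PSD: shrinking
    print(np.linalg.matrix_rank(H), np.linalg.matrix_rank(S))  # 6 versus n
    print(np.isclose(np.trace(H), 6.0))          # trace(H) = M, the dimension fitted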
By analogy we define the effective degrees of freedom of a smoothing spline to be

    df_λ = trace(S_λ),        (5.16)

the sum of the diagonal elements of S_λ. This very useful definition allows us a more intuitive way to parameterize the smoothing spline, and indeed many other smoothers as well, in a consistent fashion. For example, in Figure 5.6 we specified df_λ = 12 for each of the curves, and the corresponding λ ≈ 0.00022 was derived numerically by solving trace(S_λ) = 12.
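Deriving λ from a target df, as was done for Figure 5.6, amounts to solving trace(S_λ) = df numerically. A sketch, again with the discrete stand-in penalty, so the λ found here is illustrative only and not the book's 0.00022:

    import numpy as np
    from scipy.optimize import brentq

    n = 100
    D = np.diff(np.eye(n), n=2, axis=0)
    K = D.T @ D

    def df(lam):                     # effective degrees of freedom, eq. (5.16)
        return np.trace(np.linalg.inv(np.eye(n) + lam * K))

    # df(lam) decreases monotonically from n toward 2, so a root exists:
    lam12 = brentq(lambda lam: df(lam) - 12.0, 1e-8, 1e4)
    print(lam12, df(lam12))          # df(lam12) is 12 by construction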
There are many arguments supporting this definition of degrees of freedom, and we cover some of them here.

Since S_λ is symmetric (and positive semidefinite), it has a real eigen-decomposition. Before we proceed, it is convenient to rewrite S_λ in the Reinsch form

    S_λ = (I + λK)^{-1},        (5.17)

where K does not depend on λ (Exercise 5.9). Since f̂ = S_λ y solves

    min_f (y − f)^T (y − f) + λ f^T K f,        (5.18)

K is known as the penalty matrix, and indeed a quadratic form in K has a representation in terms of a weighted sum of squared (divided) second differences.
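The last remark becomes concrete on an equally spaced grid, where the divided second differences reduce to plain ones; a sketch:

    import numpy as np

    n = 30
    D = np.diff(np.eye(n), n=2, axis=0)   # maps f to its second differences
    K = D.T @ D

    f = np.random.default_rng(2).normal(size=n)
    lhs = f @ K @ f                       # quadratic form in the penalty matrix
    rhs = np.sum((f[2:] - 2.0 * f[1:-1] + f[:-2]) ** 2)
    print(np.isclose(lhs, rhs))           # True: sum of squared second differences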
The eigen-decomposition of S_λ is

    S_λ = Σ_{k=1}^N ρ_k(λ) u_k u_k^T        (5.19)

with

    ρ_k(λ) = 1/(1 + λ d_k),        (5.20)

and d_k the corresponding eigenvalue of K. Figure 5.7 (top) shows the results of applying a cubic smoothing spline to some air pollution data (128 observations). Two fits are given: a smoother fit corresponding to a larger penalty λ and a rougher fit for a smaller penalty. The lower panels represent the eigenvalues (lower left) and some eigenvectors (lower right) of the corresponding smoother matrices.

FIGURE 5.7. (Top:) Smoothing spline fit of ozone concentration versus Daggot pressure gradient. The two fits correspond to different values of the smoothing parameter, chosen to achieve five and eleven effective degrees of freedom, defined by df_λ = trace(S_λ). (Lower left:) First 25 eigenvalues for the two smoothing-spline matrices. The first two are exactly 1, and all are ≥ 0. (Lower right:) Third to sixth eigenvectors of the spline smoother matrices. In each case, u_k is plotted against x, and as such is viewed as a function of x. The rug at the base of the plots indicates the occurrence of data points. The damped functions represent the smoothed versions of these functions (using the 5 df smoother).

Some of the highlights of the eigenrepresentation are the following:

• The eigenvectors are not affected by changes in λ, and hence the whole family of smoothing splines (for a particular sequence x) indexed by λ have the same eigenvectors.

• S_λ y = Σ_{k=1}^N u_k ρ_k(λ) ⟨u_k, y⟩, and hence the smoothing spline operates by decomposing y with respect to the (complete) basis {u_k}, and differentially shrinking the contributions using ρ_k(λ). This is to be contrasted with a basis-regression method, where the components are either left alone, or shrunk to zero; that is, a projection matrix such as H_ξ above has M eigenvalues equal to 1, and the rest are 0. For this reason smoothing splines are referred to as shrinking smoothers, while regression splines are projection smoothers (see Figure 3.17 on page 80).

• The sequence of u_k, ordered by decreasing ρ_k(λ), appears to increase in complexity.
Indeed, they have the zero-crossing behavior of polynomials of increasing degree. Since S_λ u_k = ρ_k(λ) u_k, we see how each of the eigenvectors is itself shrunk by the smoothing spline: the higher the complexity, the more they are shrunk. If the domain of X is periodic, then the u_k are sines and cosines at different frequencies.

• The first two eigenvalues are always one, and they correspond to the two-dimensional eigenspace of functions linear in x (Exercise 5.11), which are never shrunk.

• The eigenvalues ρ_k(λ) = 1/(1 + λ d_k) are an inverse function of the eigenvalues d_k of the penalty matrix K, moderated by λ; λ controls the rate at which the ρ_k(λ) decrease to zero. d_1 = d_2 = 0 and again linear functions are not penalized.

• One can reparametrize the smoothing spline using the basis vectors u_k (the Demmler–Reinsch basis).
In this case the smoothing spline solves

    min_θ ‖y − Uθ‖² + λ θ^T D θ,        (5.21)

where U has columns u_k and D is a diagonal matrix with elements d_k.

• df_λ = trace(S_λ) = Σ_{k=1}^N ρ_k(λ). For projection smoothers, all the eigenvalues are 1, each one corresponding to a dimension of the projection subspace (see the numerical sketch below).

Figure 5.8 depicts a smoothing spline matrix, with the rows ordered with x. The banded nature of this representation suggests that a smoothing spline is a local fitting method, much like the locally weighted regression procedures in Chapter 6. The right panel shows in detail selected rows of S, which we call the equivalent kernels.
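These eigen-facts, and the equivalent-kernel view of Figure 5.8, can be checked in a few lines; a sketch with the discrete second-difference penalty standing in for the exact spline penalty K:

    import numpy as np

    n, lam = 100, 0.5
    x = np.linspace(0.0, 1.0, n)
    D = np.diff(np.eye(n), n=2, axis=0)
    K = D.T @ D                                    # discrete penalty matrix
    S = np.linalg.inv(np.eye(n) + lam * K)         # Reinsch form (5.17)

    d = np.linalg.eigvalsh(K)                      # eigenvalues d_k of K
    rho = 1.0 / (1.0 + lam * d)                    # eq. (5.20)
    print(np.allclose(np.sort(np.linalg.eigvalsh(S)), np.sort(rho)))  # eq. (5.19)
    print(np.round(np.sort(rho)[-2:], 6))          # the two largest are exactly 1
    print(np.isclose(np.trace(S), rho.sum()))      # df_lambda = sum_k rho_k

    f_lin = 1.0 + 2.0 * x                          # linear functions are never shrunk
    print(np.allclose(S @ f_lin, f_lin))

    row = S[n // 2]                                # an equivalent kernel (Figure 5.8)
    print(np.isclose(row.sum(), 1.0))              # the weights average the y_i
    print(row[n // 2 - 5 : n // 2 + 6].sum())      # and the bulk of the weight is local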
As λ → 0, df_λ → N, and S_λ → I, the N-dimensional identity matrix. As λ → ∞, df_λ → 2, and S_λ → H, the hat matrix for linear regression on x.

FIGURE 5.8. [Left panel: the smoother matrix. Right panel: the equivalent kernels given by rows 12, 25, 50, 75, 100, and 115 of S.]

5.5 Automatic Selection of the Smoothing Parameters

The smoothing parameters for regression splines encompass the degree of the splines, and the number and placement of the knots. For smoothing splines we have only λ to select, since the knots are at all the unique training X's, and cubic degree is almost always used in practice.
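Before turning to automatic selection, the two limits stated at the end of Section 5.4.1 can be verified numerically. In the following sketch the eigendecomposition of the stand-in penalty K (a Demmler–Reinsch-style basis, cf. (5.21)) makes (I + λK)^{-1} stable to evaluate even for extreme λ:

    import numpy as np

    n = 20
    x = np.linspace(0.0, 1.0, n)
    D2 = np.diff(np.eye(n), n=2, axis=0)
    K = D2.T @ D2
    d, U = np.linalg.eigh(K)                       # eigenvalues and eigenvectors of K

    def S(lam):                                    # stable form of (I + lam K)^{-1}
        return (U * (1.0 / (1.0 + lam * d))) @ U.T

    X = np.column_stack([np.ones(n), x])
    H = X @ np.linalg.solve(X.T @ X, X.T)          # hat matrix for linear regression

    print(np.allclose(S(1e-10), np.eye(n)))        # lambda -> 0: the identity
    print(np.allclose(S(1e8), H, atol=1e-4))       # lambda -> infinity: linear fit
    print(np.trace(S(1e-10)), np.trace(S(1e8)))    # df approx n, and df approx 2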