This actually implies three rather than two interior knots (chosen at uniform quantiles of sbp), plus two boundary knots at the extremes of the data, since we exclude the constant term from each of the h_j. Since famhist is a two-level factor, it is coded by a simple binary or dummy variable, and is associated with a single coefficient in the fit of the model.

More compactly we can combine all p vectors of basis functions (and the constant term) into one big vector h(X), and then the model is simply h(X)^T θ, with total number of parameters df = 1 + Σ_{j=1}^{p} df_j, the sum of the parameters in each component term. Each basis function is evaluated at each of the N samples, resulting in an N × df basis matrix H.
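In code, such a basis matrix might be assembled as follows. This is only a minimal Python/NumPy sketch, not the authors' implementation: `natural_spline_basis` and `quantile_knots` are helper names introduced here, and the variable names in the commented assembly (sbp, tobacco, ldl, obesity, age, famhist) stand in for length-N arrays from the data set. It uses the truncated-power representation of the natural cubic spline basis (N_1(X) = 1, N_2(X) = X, N_{k+2}(X) = d_k(X) − d_{K−1}(X)) described earlier in the chapter, dropping the constant column from each per-variable block and adding a single intercept column for the whole model.

```python
import numpy as np

def natural_spline_basis(x, knots):
    """Natural cubic spline basis with K knots, omitting the constant term.

    Returns an (N, K-1) matrix: the linear column plus the K-2 combinations
    d_k(x) - d_{K-1}(x), where
        d_k(x) = [ (x - xi_k)_+^3 - (x - xi_K)_+^3 ] / (xi_K - xi_k).
    """
    x = np.asarray(x, dtype=float)
    xi = np.sort(np.asarray(knots, dtype=float))
    K = len(xi)

    def d(k):
        num = np.maximum(x - xi[k], 0.0) ** 3 - np.maximum(x - xi[K - 1], 0.0) ** 3
        return num / (xi[K - 1] - xi[k])

    cols = [x] + [d(k) - d(K - 2) for k in range(K - 2)]
    return np.column_stack(cols)                 # constant term excluded

def quantile_knots(x, n_interior=3):
    """Interior knots at uniform quantiles, boundary knots at the extremes."""
    probs = np.linspace(0, 1, n_interior + 2)    # includes 0 and 1
    return np.quantile(x, probs)

# Assembling the big basis matrix H (illustrative variable names):
# H = np.column_stack([np.ones(N)]
#                     + [natural_spline_basis(v, quantile_knots(v))
#                        for v in (sbp, tobacco, ldl, obesity, age)]
#                     + [famhist[:, None]])      # single dummy column for famhist
```

With three interior plus two boundary knots, each spline block contributes four columns, matching the df of 4 per term in Table 5.1.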
At this point the model is like any other linear logistic model, and the algorithms described in Section 4.4.1 apply.

We carried out a backward stepwise deletion process, dropping terms from this model while preserving the group structure of each term, rather than dropping one coefficient at a time. The AIC statistic (Section 7.5) was used to drop terms, and all the terms remaining in the final model would cause AIC to increase if deleted from the model (see Table 5.1). Figure 5.4 shows a plot of the final model selected by the stepwise regression.
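The fitting and the grouped deletion just described might be sketched as below. This is an illustrative Python/NumPy sketch under stated assumptions, not the authors' code: `fit_logistic` is a plain Newton-Raphson (IRLS) fit in the spirit of Section 4.4.1, `backward_group_aic` drops whole blocks of basis columns (never single coefficients) using AIC = deviance + 2·df, and the `blocks` dictionary is assumed to map term names to their basis-column matrices.

```python
import numpy as np

def fit_logistic(H, y, n_iter=25):
    """Newton-Raphson / IRLS fit of a linear logistic model.

    Returns (theta_hat, w, deviance), where w holds the diagonal of the
    weight matrix W, w_i = p_i (1 - p_i), at convergence.
    """
    theta = np.zeros(H.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-H @ theta))
        w = p * (1.0 - p)
        # Newton step: solve (H^T W H) delta = H^T (y - p)
        delta = np.linalg.solve(H.T @ (H * w[:, None]), H.T @ (y - p))
        theta += delta
        if np.max(np.abs(delta)) < 1e-8:
            break
    p = np.clip(1.0 / (1.0 + np.exp(-H @ theta)), 1e-10, 1 - 1e-10)
    deviance = -2.0 * np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return theta, w, deviance

def backward_group_aic(blocks, y):
    """Backward stepwise deletion of whole terms (blocks of basis columns) by AIC.

    blocks : dict mapping term name -> (N, df_j) basis matrix for that term
    """
    def aic(terms):
        H = np.hstack([np.ones((len(y), 1))] + list(terms.values()))  # intercept + blocks
        _, _, dev = fit_logistic(H, y)
        return dev + 2 * H.shape[1]                                   # AIC = deviance + 2*df

    active = dict(blocks)
    current = aic(active)
    while len(active) > 1:
        # AIC after dropping each term in its entirety
        trials = {k: aic({j: b for j, b in active.items() if j != k}) for k in active}
        best = min(trials, key=trials.get)
        if trials[best] >= current:     # every remaining deletion would increase AIC
            break
        current = trials[best]
        del active[best]
    return active, current
```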
The functions displayed are f̂_j(X_j) = h_j(X_j)^T θ̂_j for each variable X_j. The covariance matrix Cov(θ̂) = Σ is estimated by Σ̂ = (H^T W H)^{-1}, where W is the diagonal weight matrix from the logistic regression. Hence v_j(X_j) = Var[f̂_j(X_j)] = h_j(X_j)^T Σ̂_{jj} h_j(X_j) is the pointwise variance function of f̂_j, where Cov(θ̂_j) = Σ̂_{jj} is the appropriate sub-matrix of Σ̂. The shaded region in each panel is defined by f̂_j(X_j) ± 2√(v_j(X_j)).
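Given a fit such as the one sketched above, these bands can be computed directly from Σ̂ = (H^T W H)^{-1}. The following is again only a sketch: `pointwise_bands` and `cols_j` (the column indices of term j within H) are illustrative names, and `theta_hat` and `w` are assumed to come from a logistic fit like `fit_logistic` above.

```python
import numpy as np

def pointwise_bands(h_block, theta_block, Sigma_block):
    """f_hat_j(x) = h_j(x)^T theta_hat_j with its +/- 2*sqrt(v_j(x)) band.

    h_block     : (m, df_j) rows of h_j evaluated at the plotting points
    theta_block : (df_j,) fitted coefficients for term j
    Sigma_block : (df_j, df_j) block Sigma_hat_jj of Sigma_hat = (H^T W H)^{-1}
    """
    f_hat = h_block @ theta_block
    # v_j(x) = h_j(x)^T Sigma_jj h_j(x), evaluated row-wise
    v = np.einsum("ij,jk,ik->i", h_block, Sigma_block, h_block)
    half = 2.0 * np.sqrt(v)
    return f_hat, f_hat - half, f_hat + half

# Usage sketch:
# Sigma_hat = np.linalg.inv(H.T @ (H * w[:, None]))            # (H^T W H)^{-1}
# f, lo, hi = pointwise_bands(H[:, cols_j], theta_hat[cols_j],
#                             Sigma_hat[np.ix_(cols_j, cols_j)])
```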
FIGURE 5.4. Fitted natural-spline functions for each of the terms in the final model selected by the stepwise procedure. Included are pointwise standard-error bands. The rug plot at the base of each figure indicates the location of each of the sample values for that variable (jittered to break ties).

TABLE 5.1. Final logistic regression model, after stepwise deletion of natural spline terms. The column labeled "LRT" is the likelihood-ratio test statistic when that term is deleted from the model, and is the change in deviance from the full model (labeled "none").

Terms       Df   Deviance      AIC      LRT   P-value
none             458.09     502.09
sbp          4   467.16     503.16    9.076     0.059
tobacco      4   470.48     506.48   12.387     0.015
ldl          4   472.39     508.39   14.307     0.006
famhist      1   479.44     521.44   21.356     0.000
obesity      4   466.24     502.24    8.147     0.086
age          4   481.86     517.86   23.768     0.000

The AIC statistic is slightly more generous than the likelihood-ratio test (deviance test). Both sbp and obesity are included in this model, while they were not in the linear model. The figure explains why, since their contributions are inherently nonlinear.
These effects at first may come as a surprise, but an explanation lies in the nature of the retrospective data. These measurements were made sometime after the patients suffered a heart attack, and in many cases they had already benefited from a healthier diet and lifestyle, hence the apparent increase in risk at low values for obesity and sbp. Table 5.1 shows a summary of the selected model.

5.2.3 Example: Phoneme Recognition

In this example we use splines to reduce flexibility rather than increase it; the application comes under the general heading of functional modeling.
In the top panel of Figure 5.5 are displayed a sample of 15 log-periodograms for each of the two phonemes "aa" and "ao" measured at 256 frequencies. The goal is to use such data to classify a spoken phoneme. These two phonemes were chosen because they are difficult to separate.

The input feature is a vector x of length 256, which we can think of as a vector of evaluations of a function X(f) over a grid of frequencies f.
In reality there is a continuous analog signal which is a function of frequency, and we have a sampled version of it.

FIGURE 5.5. The top panel ("Phoneme Examples") displays the log-periodogram as a function of frequency for 15 examples each of the phonemes "aa" and "ao" sampled from a total of 695 "aa"s and 1022 "ao"s. Each log-periodogram is measured at 256 uniformly spaced frequencies. The lower panel ("Phoneme Classification: Raw and Restricted Logistic Regression") shows the coefficients (as a function of frequency) of a logistic regression fit to the data by maximum likelihood, using the 256 log-periodogram values as inputs. The coefficients are restricted to be smooth in the red curve, and are unrestricted in the jagged gray curve.

The gray lines in the lower panel of Figure 5.5 show the coefficients of a linear logistic regression model fit by maximum likelihood to a training sample of 1000 drawn from the total of 695 "aa"s and 1022 "ao"s. The coefficients are also plotted as a function of frequency, and in fact we can think of the model in terms of its continuous counterpart

    log[ Pr(aa|X) / Pr(ao|X) ] = ∫ X(f) β(f) df,                              (5.7)
which we approximate by

    Σ_{j=1}^{256} X(f_j) β(f_j) = Σ_{j=1}^{256} x_j β_j.                      (5.8)

The coefficients compute a contrast functional, and will have appreciable values in regions of frequency where the log-periodograms differ between the two classes.

The gray curves are very rough. Since the input signals have fairly strong positive autocorrelation, this results in negative autocorrelation in the coefficients.
In addition the sample size effectively provides only four observations per coefficient.

Applications such as this permit a natural regularization. We force the coefficients to vary smoothly as a function of frequency. The red curve in the lower panel of Figure 5.5 shows such a smooth coefficient curve fit to these data.
We see that the lower frequencies offer the most discriminatory power. Not only does the smoothing allow easier interpretation of the contrast, it also produces a more accurate classifier:

                   Raw   Regularized
Training error   0.080         0.185
Test error       0.255         0.158

The smooth red curve was obtained through a very simple use of natural cubic splines. We can represent the coefficient function as an expansion of splines β(f) = Σ_{m=1}^{M} h_m(f) θ_m. In practice this means that β = Hθ, where H is a p × M basis matrix of natural cubic splines, defined on the set of frequencies. Here we used M = 12 basis functions, with knots uniformly placed over the integers 1, 2, ..., 256 representing the frequencies. Since x^T β = x^T Hθ, we can simply replace the input features x by their filtered versions x* = H^T x, and fit θ by linear logistic regression on the x*. The red curve is thus β̂(f) = h(f)^T θ̂.
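A minimal sketch of this filtering step is given below, reusing the `natural_spline_basis` helper and the `fit_logistic` routine sketched in Section 5.2.2. It is illustrative only: `filtered_features`, `X_train` and `y_train` are assumed names, the knot placement is a plausible reading of "uniformly placed over the integers 1, 2, ..., 256", and an intercept is added to the logistic fit in the usual way even though it is suppressed in (5.7)-(5.8).

```python
import numpy as np

def filtered_features(X, n_knots=12):
    """Project p-dimensional log-periodograms onto a smooth spline basis.

    X : (N, p) matrix of inputs measured on the frequency grid 1..p.
    Returns (X_star, H), where H is the p x M basis matrix of natural cubic
    splines evaluated at the frequencies (M = n_knots: a constant column plus
    the non-constant natural-spline columns), and X_star = X H, i.e. each row
    of X is replaced by its filtered version x* = H^T x.
    """
    p = X.shape[1]
    freqs = np.arange(1, p + 1, dtype=float)
    knots = np.linspace(1, p, n_knots)           # knots uniform over 1..p
    B = natural_spline_basis(freqs, knots)       # (p, n_knots - 1), no constant term
    H = np.column_stack([np.ones(p), B])         # (p, M)
    return X @ H, H

# Usage sketch:
# X_star, H = filtered_features(X_train)                        # (N, M) filtered inputs
# X1 = np.column_stack([np.ones(len(X_star)), X_star])          # intercept for the fit
# theta_hat, w, _ = fit_logistic(X1, y_train)                   # any logistic fitter works
# beta_hat = H @ theta_hat[1:]                                  # beta_hat(f) = h(f)^T theta_hat
```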
5.3 Filtering and Feature Extraction

In the previous example, we constructed a p × M basis matrix H, and then transformed our features x into new features x* = H^T x. These filtered versions of the features were then used as inputs into a learning procedure: in the previous example, this was linear logistic regression.

Preprocessing of high-dimensional features is a very general and powerful method for improving the performance of a learning algorithm. The preprocessing need not be linear as it was above, but can be a general (nonlinear) function of the form x* = g(x). The derived features x* can then be used as inputs into any (linear or nonlinear) learning procedure. For example, for signal or image recognition a popular approach is to first transform the raw features via a wavelet transform x* = H^T x (Section 5.9) and then use the features x* as inputs into a neural network (Chapter 11). Wavelets are effective in capturing discrete jumps or edges, and the neural network is a powerful tool for constructing nonlinear functions of these features for predicting the target variable.
By using domain knowledge to construct appropriate features, one can often improve upon a learning method that has only the raw features x at its disposal.

5.4 Smoothing Splines

Here we discuss a spline basis method that avoids the knot selection problem completely by using a maximal set of knots. The complexity of the fit is controlled by regularization. Consider the following problem: among all functions f(x) with two continuous derivatives, find one that minimizes the penalized residual sum of squares

    RSS(f, λ) = Σ_{i=1}^{N} {y_i − f(x_i)}² + λ ∫ {f''(t)}² dt,              (5.9)

where λ is a fixed smoothing parameter. The first term measures closeness to the data, while the second term penalizes curvature in the function, and λ establishes a tradeoff between the two. Two special cases are:

λ = 0 : f can be any function that interpolates the data.
λ = ∞ : the simple least squares line fit, since no second derivative can be tolerated.

These vary from very rough to very smooth, and the hope is that λ ∈ (0, ∞) indexes an interesting class of functions in between.

The criterion (5.9) is defined on an infinite-dimensional function space—in fact, a Sobolev space of functions for which the second term is defined. Remarkably, it can be shown that (5.9) has an explicit, finite-dimensional, unique minimizer which is a natural cubic spline with knots at the unique values of the x_i, i = 1, ..., N (Exercise 5.7).
At face value it seems that the family is still over-parametrized, since there are as many as N knots, which implies N degrees of freedom. However, the penalty term translates to a penalty on the spline coefficients, which are shrunk some of the way toward the linear fit. Since the solution is a natural spline, we can write it as

    f(x) = Σ_{j=1}^{N} N_j(x) θ_j,                                           (5.10)

FIGURE 5.6. [Relative change in spinal BMD versus age; males and females.]
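To make the finite-dimensional minimizer concrete: substituting (5.10) into (5.9) turns the criterion into a quadratic form in θ, with basis matrix N_{ij} = N_j(x_i) and penalty matrix Ω_{jk} = ∫ N_j''(t) N_k''(t) dt, so the minimizer solves (N^T N + λΩ)θ = N^T y, a generalized ridge problem. The sketch below is illustrative only: it reuses `natural_spline_basis` from Section 5.2.2, the helper names are assumptions, and the penalty matrix is approximated by quadrature on a fine grid rather than computed in closed form.

```python
import numpy as np

def natural_spline_second_deriv(x, knots):
    """Second derivatives of the natural-spline basis (columns: 1, x, combinations)."""
    x = np.asarray(x, dtype=float)
    xi = np.sort(np.asarray(knots, dtype=float))
    K = len(xi)

    def d2(k):  # second derivative of d_k(x)
        return 6.0 * (np.maximum(x - xi[k], 0.0)
                      - np.maximum(x - xi[K - 1], 0.0)) / (xi[K - 1] - xi[k])

    # constant and linear basis functions have zero second derivative
    cols = [np.zeros_like(x), np.zeros_like(x)] + [d2(k) - d2(K - 2) for k in range(K - 2)]
    return np.column_stack(cols)

def smoothing_spline_fit(x, y, lam, n_grid=2000):
    """Minimize (5.9) over the natural-spline representation (5.10)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    knots = np.unique(x)                                         # maximal knot set
    Nmat = np.column_stack([np.ones(len(x)),
                            natural_spline_basis(x, knots)])     # N_{ij} = N_j(x_i)
    # trapezoid-rule approximation to Omega_jk = int N_j''(t) N_k''(t) dt
    grid = np.linspace(knots[0], knots[-1], n_grid)
    D2 = natural_spline_second_deriv(grid, knots)
    w = np.full(n_grid, (grid[-1] - grid[0]) / (n_grid - 1))
    w[0] *= 0.5
    w[-1] *= 0.5
    Omega = D2.T @ (D2 * w[:, None])
    theta = np.linalg.solve(Nmat.T @ Nmat + lam * Omega, Nmat.T @ y)

    def f_hat(x_new):
        x_new = np.asarray(x_new, dtype=float)
        B = np.column_stack([np.ones(len(x_new)), natural_spline_basis(x_new, knots)])
        return B @ theta

    return f_hat
```

Note that only the nonlinear basis columns are penalized (the constant and linear columns have zero second derivative), so the sketch reproduces the two special cases above: λ = 0 interpolates the data, and as λ → ∞ the fit shrinks to the least squares line.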