Even if the x_i's were not drawn randomly, the criterion is still valid if the y_i's are conditionally independent given the inputs x_i. Figure 3.1 illustrates the geometry of least-squares fitting in the IR^{p+1}-dimensional space occupied by the pairs (X, Y).

FIGURE 3.1. Linear least squares fitting with X ∈ IR^2. We seek the linear function of X that minimizes the sum of squared residuals from Y.
Note that (3.2) makes no assumptions about the validity of model (3.1); it simply finds the best linear fit to the data. Least squares fitting is intuitively satisfying no matter how the data arise; the criterion measures the average lack of fit.

How do we minimize (3.2)? Denote by X the N × (p + 1) matrix with each row an input vector (with a 1 in the first position), and similarly let y be the N-vector of outputs in the training set.
Then we can write the residual sum-of-squares as

    RSS(β) = (y − Xβ)^T (y − Xβ).                                (3.3)

This is a quadratic function in the p + 1 parameters. Differentiating with respect to β we obtain

    ∂RSS/∂β = −2 X^T (y − Xβ),
    ∂²RSS/∂β∂β^T = 2 X^T X.                                      (3.4)

Assuming (for the moment) that X has full column rank, and hence X^T X is positive definite, we set the first derivative to zero,

    X^T (y − Xβ) = 0,                                            (3.5)

to obtain the unique solution

    β̂ = (X^T X)^{−1} X^T y.                                      (3.6)

FIGURE 3.2. The N-dimensional geometry of least squares regression with two predictors. The outcome vector y is orthogonally projected onto the hyperplane spanned by the input vectors x1 and x2. The projection ŷ represents the vector of the least squares predictions.

The predicted values at an input vector x0 are given by f̂(x0) = (1 : x0)^T β̂; the fitted values at the training inputs are

    ŷ = Xβ̂ = X(X^T X)^{−1} X^T y,                                (3.7)

where ŷ_i = f̂(x_i).
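To make (3.3)–(3.7) concrete, here is a minimal NumPy sketch on synthetic data (the data, sample sizes and variable names below are illustrative assumptions, not taken from the text): it builds the design matrix with an intercept column, solves the normal equations for β̂, and checks that the hat matrix is an orthogonal projection and that the residual is orthogonal to the column space of X, as in (3.5).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative synthetic data: N observations on p inputs (assumed, not from the text).
N, p = 100, 3
X_raw = rng.normal(size=(N, p))
beta_true = np.array([1.0, 2.0, -1.0, 0.5])            # intercept plus p slopes
X = np.column_stack([np.ones(N), X_raw])                # N x (p+1) design matrix
y = X @ beta_true + rng.normal(scale=0.5, size=N)

# Least squares solution (3.6): beta_hat = (X^T X)^{-1} X^T y.
# Solving the normal equations is preferable to forming the inverse explicitly;
# np.linalg.lstsq would also work, and copes with a rank-deficient X.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Fitted values (3.7) and the hat matrix H = X (X^T X)^{-1} X^T.
y_hat = X @ beta_hat
H = X @ np.linalg.solve(X.T @ X, X.T)

# H is an orthogonal projection (symmetric and idempotent), and the residual
# y - y_hat is orthogonal to the column space of X, which is exactly (3.5).
assert np.allclose(H, H.T) and np.allclose(H @ H, H)
assert np.allclose(X.T @ (y - y_hat), 0.0)
print("beta_hat:", np.round(beta_hat, 3))
```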
The matrix H = X(X^T X)^{−1} X^T appearing in equation (3.7) is sometimes called the “hat” matrix because it puts the hat on y. Figure 3.2 shows a different geometrical representation of the least squares estimate, this time in IR^N. We denote the column vectors of X by x_0, x_1, . . . , x_p, with x_0 ≡ 1. For much of what follows, this first column is treated like any other. These vectors span a subspace of IR^N, also referred to as the column space of X. We minimize RSS(β) = ‖y − Xβ‖² by choosing β̂ so that the residual vector y − ŷ is orthogonal to this subspace. This orthogonality is expressed in (3.5), and the resulting estimate ŷ is hence the orthogonal projection of y onto this subspace. The hat matrix H computes the orthogonal projection, and hence it is also known as a projection matrix.

It might happen that the columns of X are not linearly independent, so that X is not of full rank.
This would occur, for example, if two of the inputs were perfectly correlated (e.g., x_2 = 3x_1). Then X^T X is singular and the least squares coefficients β̂ are not uniquely defined. However, the fitted values ŷ = Xβ̂ are still the projection of y onto the column space of X; there is just more than one way to express that projection in terms of the column vectors of X. The non-full-rank case occurs most often when one or more qualitative inputs are coded in a redundant fashion. There is usually a natural way to resolve the non-unique representation, by recoding and/or dropping redundant columns in X.
Most regression software packages detect these redundancies and automatically implement some strategy for removing them. Rank deficiencies can also occur in signal and image analysis, where the number of inputs p can exceed the number of training cases N.
In this case, the features are typically reduced by filtering or else the fitting is controlled by regularization (Section 5.2.3 and Chapter 18).

Up to now we have made minimal assumptions about the true distribution of the data. In order to pin down the sampling properties of β̂, we now assume that the observations y_i are uncorrelated and have constant variance σ², and that the x_i are fixed (non random). The variance–covariance matrix of the least squares parameter estimates is easily derived from (3.6) and is given by

    Var(β̂) = (X^T X)^{−1} σ².                                    (3.8)

Typically one estimates the variance σ² by

    σ̂² = (1/(N − p − 1)) Σ_{i=1}^{N} (y_i − ŷ_i)².

The N − p − 1 rather than N in the denominator makes σ̂² an unbiased estimate of σ²: E(σ̂²) = σ².
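Continuing the same illustrative sketch (X, y, beta_hat and y_hat are the assumed quantities from the snippet after (3.7)), σ̂² and the standard errors implied by (3.8) can be computed as follows.

```python
# Unbiased estimate of sigma^2: divide the residual sum-of-squares by N - p - 1.
N, p_plus_1 = X.shape
resid = y - y_hat
sigma2_hat = resid @ resid / (N - p_plus_1)

# Var(beta_hat) = (X^T X)^{-1} sigma^2 (equation 3.8); the square roots of its
# diagonal entries are the standard errors of the individual coefficients.
XtX_inv = np.linalg.inv(X.T @ X)
se_beta = np.sqrt(sigma2_hat * np.diag(XtX_inv))
print("sigma_hat:", round(float(np.sqrt(sigma2_hat)), 3))
print("standard errors:", np.round(se_beta, 3))
```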
To draw inferences about the parameters and the model, additional assumptions are needed. We now assume that (3.1) is the correct model for the mean; that is, the conditional expectation of Y is linear in X_1, . . . , X_p. We also assume that the deviations of Y around its expectation are additive and Gaussian. Hence

    Y = E(Y | X_1, . . . , X_p) + ε
      = β_0 + Σ_{j=1}^{p} X_j β_j + ε,                           (3.9)

where the error ε is a Gaussian random variable with expectation zero and variance σ², written ε ∼ N(0, σ²).

Under (3.9), it is easy to show that

    β̂ ∼ N(β, (X^T X)^{−1} σ²).                                   (3.10)

This is a multivariate normal distribution with mean vector and variance–covariance matrix as shown. Also

    (N − p − 1) σ̂² ∼ σ² χ²_{N−p−1},                              (3.11)

a chi-squared distribution with N − p − 1 degrees of freedom. In addition β̂ and σ̂² are statistically independent. We use these distributional properties to form tests of hypothesis and confidence intervals for the parameters β_j.
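The claims (3.10) and (3.11) are easy to check by simulation. The sketch below keeps the illustrative X, beta_true and random generator from the earlier snippets fixed, draws fresh Gaussian errors on each replication, and compares the empirical covariance of β̂ with (X^T X)^{−1}σ² and the mean of (N − p − 1)σ̂²/σ² with its chi-squared expectation N − p − 1.

```python
sigma, n_sim = 0.5, 5000
betas = np.empty((n_sim, X.shape[1]))
scaled_rss = np.empty(n_sim)

for s in range(n_sim):
    # Fixed design X, fresh Gaussian errors around the linear mean X beta_true.
    y_s = X @ beta_true + rng.normal(scale=sigma, size=N)
    b = np.linalg.solve(X.T @ X, X.T @ y_s)
    betas[s] = b
    r = y_s - X @ b
    scaled_rss[s] = (r @ r) / sigma**2      # equals (N - p - 1) * sigma_hat^2 / sigma^2

# Empirical covariance of beta_hat versus (X^T X)^{-1} sigma^2 from (3.10).
emp_cov = np.cov(betas, rowvar=False)
theo_cov = sigma**2 * np.linalg.inv(X.T @ X)
print("largest covariance discrepancy:", float(np.abs(emp_cov - theo_cov).max()))

# By (3.11) the scaled residual sum-of-squares is chi-squared on N - p - 1 df,
# so its average should be close to N - p - 1.
print("mean scaled RSS:", round(float(scaled_rss.mean()), 1), "vs", N - X.shape[1])
```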
FIGURE 3.3. The tail probabilities Pr(|Z| > z) for three distributions, t_30, t_100 and standard normal. Shown are the appropriate quantiles for testing significance at the p = 0.05 and 0.01 levels. The difference between t and the standard normal becomes negligible for N bigger than about 100.

To test the hypothesis that a particular coefficient β_j = 0, we form the standardized coefficient or Z-score

    z_j = β̂_j / (σ̂ √v_j),                                        (3.12)

where v_j is the jth diagonal element of (X^T X)^{−1}. Under the null hypothesis that β_j = 0, z_j is distributed as t_{N−p−1} (a t distribution with N − p − 1 degrees of freedom), and hence a large (absolute) value of z_j will lead to rejection of this null hypothesis. If σ̂ is replaced by a known value σ, then z_j would have a standard normal distribution.
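A short sketch of the Z-score in (3.12), using scipy.stats for the t_{N−p−1} reference distribution (beta_hat, se_beta, X and N are the assumed quantities carried over from the snippets above):

```python
from scipy import stats

# Z-scores: each coefficient divided by its standard error, as in (3.12).
z = beta_hat / se_beta

# Two-sided p-values under the null hypothesis beta_j = 0, using the t
# distribution with N - p - 1 degrees of freedom (use the standard normal
# instead if sigma is known).
df = N - X.shape[1]
p_values = 2 * stats.t.sf(np.abs(z), df)
for j, (zj, pj) in enumerate(zip(z, p_values)):
    print(f"beta_{j}: z = {zj:7.2f}, p = {pj:.4f}")
```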
The difference between the tail quantiles of a t-distribution and a standard normal becomes negligible as the sample size increases, and so we typically use the normal quantiles (see Figure 3.3).

Often we need to test for the significance of groups of coefficients simultaneously. For example, to test if a categorical variable with k levels can be excluded from a model, we need to test whether the coefficients of the dummy variables used to represent the levels can all be set to zero. Here we use the F statistic,

    F = [(RSS_0 − RSS_1)/(p_1 − p_0)] / [RSS_1/(N − p_1 − 1)],   (3.13)

where RSS_1 is the residual sum-of-squares for the least squares fit of the bigger model with p_1 + 1 parameters, and RSS_0 the same for the nested smaller model with p_0 + 1 parameters, having p_1 − p_0 parameters constrained to be zero.
The F statistic measures the change in residual sum-of-squares per additional parameter in the bigger model, and it is normalized by an estimate of σ². Under the Gaussian assumptions, and the null hypothesis that the smaller model is correct, the F statistic will have a F_{p_1−p_0, N−p_1−1} distribution. It can be shown (Exercise 3.1) that the z_j in (3.12) are equivalent to the F statistic for dropping the single coefficient β_j from the model. For large N, the quantiles of F_{p_1−p_0, N−p_1−1} approach those of χ²_{p_1−p_0}/(p_1 − p_0).

Similarly, we can isolate β_j in (3.10) to obtain a 1 − 2α confidence interval for β_j:

    (β̂_j − z^{(1−α)} v_j^{1/2} σ̂,  β̂_j + z^{(1−α)} v_j^{1/2} σ̂).    (3.14)

Here z^{(1−α)} is the 1 − α percentile of the normal distribution:

    z^{(1−0.025)} = 1.96,
    z^{(1−0.05)}  = 1.645, etc.

Hence the standard practice of reporting β̂ ± 2 · se(β̂) amounts to an approximate 95% confidence interval.
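The F test in (3.13) and the interval in (3.14) might be computed along the following lines; this sketch again reuses the illustrative X, y, N, beta_hat and se_beta from above, and simply drops the last two columns of X to form a smaller nested model.

```python
def rss(Xm, y):
    """Residual sum-of-squares of the least squares fit of y on Xm."""
    b = np.linalg.solve(Xm.T @ Xm, Xm.T @ y)
    r = y - Xm @ b
    return r @ r

# Bigger model: all columns of X.  Smaller nested model: drop the last two columns.
X1, X0 = X, X[:, :-2]
p1, p0 = X1.shape[1] - 1, X0.shape[1] - 1        # numbers of non-intercept parameters
rss1, rss0 = rss(X1, y), rss(X0, y)

# F statistic (3.13) and its p-value under the null that the smaller model is correct.
F = ((rss0 - rss1) / (p1 - p0)) / (rss1 / (N - p1 - 1))
p_value = stats.f.sf(F, p1 - p0, N - p1 - 1)
print(f"F = {F:.2f}, p = {p_value:.4f}")

# 1 - 2*alpha confidence intervals (3.14); alpha = 0.025 gives approximate 95% coverage.
alpha = 0.025
z_crit = stats.norm.ppf(1 - alpha)               # 1.96
ci = np.column_stack([beta_hat - z_crit * se_beta, beta_hat + z_crit * se_beta])
print(np.round(ci, 3))
```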
Even if the Gaussian error assumption does not hold, this interval will be approximately correct, with its coverage approaching 1 − 2α as the sample size N → ∞.

In a similar fashion we can obtain an approximate confidence set for the entire parameter vector β, namely

    C_β = {β | (β̂ − β)^T X^T X (β̂ − β) ≤ σ̂² χ²_{p+1}^{(1−α)}},   (3.15)

where χ²_ℓ^{(1−α)} is the 1 − α percentile of the chi-squared distribution on ℓ degrees of freedom: for example, χ²_5^{(1−0.05)} = 11.1, χ²_5^{(1−0.1)} = 9.2.
This confidence set for β generates a corresponding confidence set for the true function f(x) = x^T β, namely {x^T β | β ∈ C_β} (Exercise 3.2; see also Figure 5.4 in Section 5.2.2 for examples of confidence bands for functions).

3.2.1 Example: Prostate Cancer

The data for this example come from a study by Stamey et al. (1989).
They examined the correlation between the level of prostate-specific antigen and a number of clinical measures in men who were about to receive a radical prostatectomy. The variables are log cancer volume (lcavol), log prostate weight (lweight), age, log of the amount of benign prostatic hyperplasia (lbph), seminal vesicle invasion (svi), log of capsular penetration (lcp), Gleason score (gleason), and percent of Gleason scores 4 or 5 (pgg45). The correlation matrix of the predictors given in Table 3.1 shows many strong correlations.
Figure 1.1 (page 3) of Chapter 1 is a scatterplot matrix showing every pairwise plot between the variables. We see that svi is a binary variable, and gleason is an ordered categorical variable. We see, for example, that both lcavol and lcp show a strong relationship with the response lpsa, and with each other. We need to fit the effects jointly to untangle the relationships between the predictors and the response.

TABLE 3.1. Correlations of predictors in the prostate cancer data.

             lcavol  lweight     age    lbph     svi     lcp  gleason
  lweight     0.300
  age         0.286    0.317
  lbph        0.063    0.437   0.287
  svi         0.593    0.181   0.129  −0.139
  lcp         0.692    0.157   0.173  −0.089   0.671
  gleason     0.426    0.024   0.366   0.033   0.307   0.476
  pgg45       0.483    0.074   0.276  −0.030   0.481   0.663    0.757

We fit a linear model to the log of prostate-specific antigen, lpsa, after first standardizing the predictors to have unit variance. We randomly split the dataset into a training set of size 67 and a test set of size 30. We applied least squares estimation to the training set, producing the estimates, standard errors and Z-scores shown in Table 3.2.

TABLE 3.2. Linear model fit to the prostate cancer data. The Z score is the coefficient divided by its standard error (3.12). Roughly a Z score larger than two in absolute value is significantly nonzero at the p = 0.05 level.

  Term        Coefficient   Std. Error   Z Score
  Intercept         2.46         0.09      27.60
  lcavol            0.68         0.13       5.37
  lweight           0.26         0.10       2.75
  age              −0.14         0.10      −1.40
  lbph              0.21         0.10       2.06
  svi               0.31         0.12       2.47
  lcp              −0.29         0.15      −1.87
  gleason          −0.02         0.15      −0.15
  pgg45             0.27         0.15       1.74
The Z-scores are defined in (3.12), and measure the effect of dropping that variable from the model. A Z-score greater than 2 in absolute value is approximately significant at the 5% level. (For our example, we have nine parameters, and the 0.025 tail quantiles of the t_{67−9} distribution are ±2.002!) The predictor lcavol shows the strongest effect, with lweight and svi also strong.
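For readers who want to reproduce this analysis, the sketch below outlines the workflow on a local copy of the prostate data. The file name prostate.data, the whitespace-separated layout and the 'T'/'F' coding of the train indicator are assumptions about how the dataset is stored, and the printed numbers will match Table 3.2 only up to details such as how the standardization and the train/test split are carried out.

```python
import numpy as np
import pandas as pd

# Assumed local copy of the prostate data: whitespace-separated columns
# lcavol ... lpsa plus a train/test flag (file name and layout are assumptions).
df = pd.read_csv("prostate.data", sep=r"\s+")
predictors = ["lcavol", "lweight", "age", "lbph", "svi", "lcp", "gleason", "pgg45"]

# Standardize the predictors to unit variance.
Z = (df[predictors] - df[predictors].mean()) / df[predictors].std()

# Training subset (67 observations in the book's split; flag assumed coded 'T'/'F').
train = df["train"] == "T"
X = np.column_stack([np.ones(int(train.sum())), Z[train].to_numpy()])
y = df.loc[train, "lpsa"].to_numpy()

# Least squares fit, standard errors and Z-scores: the quantities reported in Table 3.2.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (len(y) - X.shape[1])
se = np.sqrt(sigma2_hat * np.diag(np.linalg.inv(X.T @ X)))
for name, b, s in zip(["Intercept"] + predictors, beta_hat, se):
    print(f"{name:>9s}  coef = {b:6.2f}  se = {s:5.2f}  Z = {b / s:6.2f}")
```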