Notice that lcp is not significant, once lcavol is in the model (when used in a model without lcavol, lcp is strongly significant). We can also test for the exclusion of a number of terms at once, using the F-statistic (3.13). For example, we consider dropping all the non-significant terms in Table 3.2, namely age, lcp, gleason, and pgg45.
We get

    F = \frac{(32.81 - 29.43)/(9 - 5)}{29.43/(67 - 9)} = 1.67,        (3.16)

which has a p-value of 0.17 (Pr(F4,58 > 1.67) = 0.17), and hence is not significant.

The mean prediction error on the test data is 0.521. In contrast, prediction using the mean training value of lpsa has a test error of 1.057, which is called the "base error rate." Hence the linear model reduces the base error rate by about 50%. We will return to this example later to compare various selection and shrinkage methods.
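As a quick numerical check of (3.16), the F-statistic and its tail probability can be computed directly from the residual sums of squares quoted above. The following is a minimal sketch using scipy (the variable names are ours; the RSS values 32.81 and 29.43, the parameter counts 9 and 5, and the sample size 67 are taken from the text):

```python
# Minimal sketch: F-test for dropping age, lcp, gleason and pgg45 from the
# prostate-cancer model, using the RSS values quoted in the text.
from scipy import stats

rss_full, p_full = 29.43, 9        # full model: 8 inputs plus the intercept
rss_reduced, p_reduced = 32.81, 5  # reduced model after dropping four terms
n = 67                             # size of the training set

f_stat = ((rss_reduced - rss_full) / (p_full - p_reduced)) / (rss_full / (n - p_full))
p_value = stats.f.sf(f_stat, p_full - p_reduced, n - p_full)  # Pr(F_{4,58} > F)

print(f"F = {f_stat:.2f}, p-value = {p_value:.2f}")  # approximately 1.67 and 0.17
```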
3.2.2 The Gauss–Markov Theorem

One of the most famous results in statistics asserts that the least squares estimates of the parameters β have the smallest variance among all linear unbiased estimates. We will make this precise here, and also make clear that the restriction to unbiased estimates is not necessarily a wise one. This observation will lead us to consider biased estimates such as ridge regression later in the chapter. We focus on estimation of any linear combination of the parameters θ = aᵀβ; for example, predictions f(x0) = x0ᵀβ are of this form. The least squares estimate of aᵀβ is

    \hat{\theta} = a^T\hat{\beta} = a^T(X^TX)^{-1}X^Ty.        (3.17)

Considering X to be fixed, this is a linear function c0ᵀy of the response vector y. If we assume that the linear model is correct, aᵀβ̂ is unbiased since

    E(a^T\hat{\beta}) = E(a^T(X^TX)^{-1}X^Ty)
                      = a^T(X^TX)^{-1}X^TX\beta
                      = a^T\beta.        (3.18)

The Gauss–Markov theorem states that if we have any other linear estimator θ̃ = cᵀy that is unbiased for aᵀβ, that is, E(cᵀy) = aᵀβ, then

    Var(a^T\hat{\beta}) \le Var(c^T y).        (3.19)

The proof (Exercise 3.3) uses the triangle inequality.
For simplicity we have stated the result in terms of estimation of a single parameter aᵀβ, but with a few more definitions one can state it in terms of the entire parameter vector β (Exercise 3.3). Consider the mean squared error of an estimator θ̃ in estimating θ:

    MSE(\tilde{\theta}) = E(\tilde{\theta} - \theta)^2
                        = Var(\tilde{\theta}) + [E(\tilde{\theta}) - \theta]^2.        (3.20)

The first term is the variance, while the second term is the squared bias. The Gauss–Markov theorem implies that the least squares estimator has the smallest mean squared error of all linear estimators with no bias. However, there may well exist a biased estimator with smaller mean squared error. Such an estimator would trade a little bias for a larger reduction in variance. Biased estimates are commonly used.
Any method that shrinks or sets to zero some of the least squares coefficients may result in a biased estimate. We discuss many examples, including variable subset selection and ridge regression, later in this chapter. From a more pragmatic point of view, most models are distortions of the truth, and hence are biased; picking the right model amounts to creating the right balance between bias and variance. We go into these issues in more detail in Chapter 7.

Mean squared error is intimately related to prediction accuracy, as discussed in Chapter 2. Consider the prediction of the new response at input x0,

    Y_0 = f(x_0) + \varepsilon_0.        (3.21)

Then the expected prediction error of an estimate f̃(x0) = x0ᵀβ̃ is

    E(Y_0 - \tilde{f}(x_0))^2 = \sigma^2 + E(x_0^T\tilde{\beta} - f(x_0))^2
                              = \sigma^2 + MSE(\tilde{f}(x_0)).        (3.22)

Therefore, expected prediction error and mean squared error differ only by the constant σ², representing the variance of the new observation y0.
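To make the bias–variance trade-off in (3.20) concrete, here is a minimal simulation sketch (not from the book; the synthetic design, the noise level and the shrinkage factor c are our own choices). It compares the mean squared error of the unbiased least squares estimate of a single coefficient with that of a deliberately biased, shrunken estimate c·β̂ with 0 < c < 1, which has smaller variance:

```python
# Minimal sketch (our own synthetic setup): an unbiased least squares
# estimate versus a shrunken, biased estimate c * beta_hat, decomposing the
# empirical MSE into variance plus squared bias as in (3.20).
import numpy as np

rng = np.random.default_rng(0)
N, beta, sigma, c = 10, 1.0, 2.0, 0.7   # small N and large noise favor shrinkage
x = rng.normal(size=N)                  # fixed design, as in the text

ls_est, shrunk_est = [], []
for _ in range(20000):
    y = beta * x + sigma * rng.normal(size=N)
    b_hat = x @ y / (x @ x)             # univariate least squares, no intercept
    ls_est.append(b_hat)
    shrunk_est.append(c * b_hat)

for name, est in [("least squares", ls_est), ("shrunken", shrunk_est)]:
    est = np.asarray(est)
    var, bias2 = est.var(), (est.mean() - beta) ** 2
    print(f"{name:13s}: var={var:.3f}  bias^2={bias2:.3f}  MSE={var + bias2:.3f}")
```

With these settings the shrunken estimate typically shows the smaller total MSE, trading a small squared bias for a larger drop in variance.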
3.2.3 Multiple Regression from Simple Univariate Regression

The linear model (3.1) with p > 1 inputs is called the multiple linear regression model. The least squares estimates (3.6) for this model are best understood in terms of the estimates for the univariate (p = 1) linear model, as we indicate in this section.

Suppose first that we have a univariate model with no intercept, that is,

    Y = X\beta + \varepsilon.        (3.23)

The least squares estimate and residuals are

    \hat{\beta} = \frac{\sum_{i=1}^N x_i y_i}{\sum_{i=1}^N x_i^2}, \qquad
    r_i = y_i - x_i\hat{\beta}.        (3.24)

In convenient vector notation, we let y = (y1, . . . , yN)ᵀ, x = (x1, . . . , xN)ᵀ and define
    \langle x, y\rangle = \sum_{i=1}^N x_i y_i = x^T y,        (3.25)

the inner product between x and y.¹ Then we can write

    \hat{\beta} = \frac{\langle x, y\rangle}{\langle x, x\rangle}, \qquad
    r = y - x\hat{\beta}.        (3.26)

As we will see, this simple univariate regression provides the building block for multiple linear regression. Suppose next that the inputs x1, x2, . . . , xp (the columns of the data matrix X) are orthogonal; that is, ⟨xj, xk⟩ = 0 for all j ≠ k. Then it is easy to check that the multiple least squares estimates β̂j are equal to ⟨xj, y⟩/⟨xj, xj⟩—the univariate estimates. In other words, when the inputs are orthogonal, they have no effect on each other's parameter estimates in the model.
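This is easy to illustrate numerically. The sketch below (our own construction; the design is an arbitrary matrix with exactly orthogonal columns) checks that the joint least squares coefficients coincide with the univariate estimates ⟨xj, y⟩/⟨xj, xj⟩:

```python
# Sketch: with orthogonal inputs, the multiple regression coefficients equal
# the univariate estimates <x_j, y>/<x_j, x_j>. Design and names are ours.
import numpy as np

rng = np.random.default_rng(1)
N, p = 50, 3
X, _ = np.linalg.qr(rng.normal(size=(N, p)))   # columns are exactly orthogonal
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=N)

multiple = np.linalg.lstsq(X, y, rcond=None)[0]                      # joint fit
univariate = np.array([X[:, j] @ y / (X[:, j] @ X[:, j]) for j in range(p)])

print(np.allclose(multiple, univariate))       # True
```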
Orthogonal inputs occur most often with balanced, designed experiments (where orthogonality is enforced), but almost never with observational data. Hence we will have to orthogonalize them in order to carry this idea further. Suppose next that we have an intercept and a single input x. Then the least squares coefficient of x has the form

    \hat{\beta}_1 = \frac{\langle x - \bar{x}\mathbf{1}, y\rangle}{\langle x - \bar{x}\mathbf{1}, x - \bar{x}\mathbf{1}\rangle},        (3.27)

where x̄ = Σi xi/N, and 1 = x0, the vector of N ones.
We can view the estimate (3.27) as the result of two applications of the simple regression (3.26). The steps are:

1. regress x on 1 to produce the residual z = x − x̄1;

2. regress y on the residual z to give the coefficient β̂1.

In this procedure, "regress b on a" means a simple univariate regression of b on a with no intercept, producing coefficient γ̂ = ⟨a, b⟩/⟨a, a⟩ and residual vector b − γ̂a. We say that b is adjusted for a, or is "orthogonalized" with respect to a.

Step 1 orthogonalizes x with respect to x0 = 1. Step 2 is just a simple univariate regression, using the orthogonal predictors 1 and z.
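The two-step recipe is easy to verify numerically. The sketch below (our own data and names, not from the book) compares the coefficient obtained by regressing y on the residual z = x − x̄1 with the slope of an ordinary simple regression with intercept:

```python
# Sketch: the slope of a simple regression with intercept equals the
# coefficient from regressing y on the centered input z = x - xbar * 1.
import numpy as np

rng = np.random.default_rng(2)
N = 40
x = rng.uniform(0, 10, size=N)
y = 1.5 + 0.8 * x + rng.normal(size=N)

z = x - x.mean()                       # step 1: regress x on 1, keep the residual
beta1_two_step = z @ y / (z @ z)       # step 2: regress y on z, no intercept

beta1_direct = np.polyfit(x, y, 1)[0]  # standard simple regression slope

print(np.isclose(beta1_two_step, beta1_direct))   # True
```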
Figure 3.4 shows this process for two general inputs x1 and x2. The orthogonalization does not change the subspace spanned by x1 and x2, it simply produces an orthogonal basis for representing it.

This recipe generalizes to the case of p inputs, as shown in Algorithm 3.1. Note that the inputs z0, . . . , zj−1 in step 2 are orthogonal, hence the simple regression coefficients computed there are in fact also the multiple regression coefficients.

¹The inner-product notation is suggestive of generalizations of linear regression to different metric spaces, as well as to probability spaces.
FIGURE 3.4. Least squares regression by orthogonalization of the inputs. The vector x2 is regressed on the vector x1, leaving the residual vector z. The regression of y on z gives the multiple regression coefficient of x2. Adding together the projections of y on each of x1 and z gives the least squares fit ŷ.

Algorithm 3.1 Regression by Successive Orthogonalization.

1. Initialize z0 = x0 = 1.

2. For j = 1, 2, . . . , p
   Regress xj on z0, z1, . . . , zj−1 to produce coefficients γ̂ℓj = ⟨zℓ, xj⟩/⟨zℓ, zℓ⟩, ℓ = 0, . . . , j − 1, and residual vector zj = xj − Σ_{k=0}^{j−1} γ̂kj zk.

3. Regress y on the residual zp to give the estimate β̂p.

The result of this algorithm is

    \hat{\beta}_p = \frac{\langle z_p, y\rangle}{\langle z_p, z_p\rangle}.        (3.28)

Re-arranging the residual in step 2, we can see that each of the xj is a linear combination of the zk, k ≤ j. Since the zj are all orthogonal, they form a basis for the column space of X, and hence the least squares projection onto this subspace is ŷ. Since zp alone involves xp (with coefficient 1), we see that the coefficient (3.28) is indeed the multiple regression coefficient of y on xp.
This key result exposes the effect of correlated inputs in multiple regression. Note also that by rearranging the xj, any one of them could be in the last position, and a similar result holds. Hence stated more generally, we have shown that the jth multiple regression coefficient is the univariate regression coefficient of y on xj·012...(j−1)(j+1)...,p, the residual after regressing xj on x0, x1, . . . , xj−1, xj+1, . . . , xp:

    The multiple regression coefficient β̂j represents the additional contribution of xj on y, after xj has been adjusted for x0, x1, . . . , xj−1, xj+1, . . . , xp.
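This partial-regression interpretation can be checked directly. The sketch below (our own synthetic, correlated inputs; not from the book) computes the residual of one column after regressing it on all the others, regresses y on that residual, and compares the result with the corresponding coefficient from the full multiple regression:

```python
# Sketch: the jth multiple regression coefficient equals the univariate
# regression of y on the residual of x_j after x_j is regressed on all the
# other inputs (including the intercept column).
import numpy as np

rng = np.random.default_rng(3)
N = 100
X = np.column_stack([np.ones(N), rng.normal(size=(N, 3))])
X[:, 2] += 0.8 * X[:, 1]                          # make two inputs correlated
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=N)

beta_full = np.linalg.lstsq(X, y, rcond=None)[0]  # full multiple regression

j = 2                                             # any input will do
others = np.delete(X, j, axis=1)
xj_fit = others @ np.linalg.lstsq(others, X[:, j], rcond=None)[0]
z = X[:, j] - xj_fit                              # x_j adjusted for the rest
beta_j_partial = z @ y / (z @ z)

print(np.isclose(beta_full[j], beta_j_partial))   # True
```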
If xp is highly correlated with some of the other xk's, the residual vector zp will be close to zero, and from (3.28) the coefficient β̂p will be very unstable. This will be true for all the variables in the correlated set. In such situations, we might have all the Z-scores (as in Table 3.2) be small—any one of the set can be deleted—yet we cannot delete them all. From (3.28) we also obtain an alternate formula for the variance estimates (3.8),

    Var(\hat{\beta}_p) = \frac{\sigma^2}{\langle z_p, z_p\rangle} = \frac{\sigma^2}{\|z_p\|^2}.        (3.29)

In other words, the precision with which we can estimate β̂p depends on the length of the residual vector zp; this represents how much of xp is unexplained by the other xk's.
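Formula (3.29) agrees with the usual variance formula (3.8), since the pth diagonal element of (XᵀX)⁻¹ equals 1/‖zp‖². A short self-contained check (our own sketch and data):

```python
# Sketch: the last diagonal element of (X^T X)^{-1} equals 1/||z_p||^2,
# where z_p is the residual of the last column of X after regressing it on
# all of the other columns; compare (3.8) with (3.29).
import numpy as np

rng = np.random.default_rng(4)
N = 60
X = np.column_stack([np.ones(N), rng.normal(size=(N, 3))])
X[:, 3] += 0.9 * X[:, 2]                       # correlated last column

others = X[:, :-1]
fit = others @ np.linalg.lstsq(others, X[:, -1], rcond=None)[0]
z_p = X[:, -1] - fit                           # last column adjusted for the rest

lhs = np.linalg.inv(X.T @ X)[-1, -1]
rhs = 1.0 / (z_p @ z_p)
print(np.isclose(lhs, rhs))                    # True
```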
Algorithm 3.1 is known as the Gram–Schmidt procedure for multiple regression, and is also a useful numerical strategy for computing the estimates. We can obtain from it not just β̂p, but also the entire multiple least squares fit, as shown in Exercise 3.4.

We can represent step 2 of Algorithm 3.1 in matrix form:

    X = Z\Gamma,        (3.30)

where Z has as columns the zj (in order), and Γ is the upper triangular matrix with entries γ̂kj. Introducing the diagonal matrix D with jth diagonal entry Djj = ‖zj‖, we get

    X = ZD^{-1}D\Gamma = QR,        (3.31)

the so-called QR decomposition of X.
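To tie the pieces together, here is a minimal implementation sketch of Algorithm 3.1 (our own code and data, not from the book). It builds the orthogonal columns zj by successive orthogonalization, recovers β̂p via (3.28), and checks the result against a standard least squares solver; the Q and R factors in (3.31) could equally be obtained with np.linalg.qr:

```python
# Sketch of Algorithm 3.1 (Regression by Successive Orthogonalization).
# Builds z_0, ..., z_p by Gram-Schmidt on the columns of X and recovers the
# last multiple regression coefficient via (3.28). Names and data are ours.
import numpy as np

def successive_orthogonalization(X, y):
    """Return the orthogonalized columns Z and the last coefficient beta_p."""
    N, ncols = X.shape
    Z = np.zeros_like(X, dtype=float)
    Z[:, 0] = X[:, 0]                           # z0 = x0, the intercept column
    for j in range(1, ncols):
        zj = X[:, j].astype(float)
        for k in range(j):
            gamma = Z[:, k] @ X[:, j] / (Z[:, k] @ Z[:, k])   # gamma_hat_{kj}
            zj -= gamma * Z[:, k]               # remove the z_k component
        Z[:, j] = zj
    beta_p = Z[:, -1] @ y / (Z[:, -1] @ Z[:, -1])             # equation (3.28)
    return Z, beta_p

rng = np.random.default_rng(5)
N = 80
X = np.column_stack([np.ones(N), rng.normal(size=(N, 3))])
y = X @ np.array([0.5, 1.0, -2.0, 3.0]) + rng.normal(size=N)

Z, beta_p = successive_orthogonalization(X, y)
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.isclose(beta_p, beta_lstsq[-1]))       # True
```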