FIGURE 4.12. A scatterplot matrix of the South African heart disease data. Each plot shows a pair of risk factors, and the cases and controls are color coded (red is a case). The variable family history of heart disease (famhist) is binary (yes or no).

TABLE 4.3. Results from stepwise logistic regression fit to South African heart disease data.

                Coefficient   Std. Error   Z score
(Intercept)        -4.204        0.498      -8.45
tobacco             0.081        0.026       3.16
ldl                 0.168        0.054       3.09
famhist             0.924        0.223       4.14
age                 0.044        0.010       4.52

However, in the presence of many other correlated variables, they are no longer needed (and can even get a negative sign).

At this stage the analyst might do some model selection; find a subset of the variables that are sufficient for explaining their joint effect on the prevalence of chd. One way to proceed is to drop the least significant coefficient, and refit the model. This is done repeatedly until no further terms can be dropped from the model. This gave the model shown in Table 4.3. A better but more time-consuming strategy is to refit each of the models with one variable removed, and then perform an analysis of deviance to decide which variable to exclude.
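The simpler drop-the-least-significant-term search is easy to script. Below is a minimal sketch using Python's statsmodels (not the software used in the book); the DataFrame df, the response name "chd", and the significance cutoff are illustrative assumptions standing in for the actual data set.

```python
import statsmodels.api as sm

# Hypothetical setup: `df` is a pandas DataFrame holding the heart disease data,
# with a 0/1 response column "chd" and numeric predictors (famhist coded 0/1).
def backward_stepwise_logit(df, response="chd", alpha=0.05):
    predictors = [c for c in df.columns if c != response]
    while predictors:
        X = sm.add_constant(df[predictors])
        fit = sm.Logit(df[response], X).fit(disp=0)
        pvals = fit.pvalues.drop("const")      # never drop the intercept
        worst = pvals.idxmax()                 # least significant remaining term
        if pvals[worst] <= alpha:
            return fit                         # no further terms can be dropped
        predictors.remove(worst)               # drop it and refit
    return fit                                 # (if everything was dropped, return last fit)

# The residual deviance of the returned model is -2 * fit.llf; comparing it with
# the deviance of each drop-one refit gives the analysis-of-deviance variant.
```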
The residual deviance of a fitted model is minus twice its log-likelihood, and the deviance between two models is the difference of their individual residual deviances (in analogy to sums-of-squares). This strategy gave the same final model as above.

How does one interpret a coefficient of 0.081 (Std. Error = 0.026) for tobacco, for example? Tobacco is measured in total lifetime usage in kilograms, with a median of 1.0kg for the controls and 4.1kg for the cases. Thus an increase of 1kg in lifetime tobacco usage accounts for an increase in the odds of coronary heart disease of exp(0.081) = 1.084 or 8.4%. Incorporating the standard error we get an approximate 95% confidence interval of exp(0.081 ± 2 × 0.026) = (1.03, 1.14).

We return to these data in Chapter 5, where we see that some of the variables have nonlinear effects, and when modeled appropriately, are not excluded from the model.

4.4.3 Quadratic Approximations and Inference

The maximum-likelihood parameter estimates β̂ satisfy a self-consistency relationship: they are the coefficients of a weighted least squares fit, where the responses are

    z_i = x_i^T \hat{\beta} + \frac{y_i - \hat{p}_i}{\hat{p}_i (1 - \hat{p}_i)},    (4.29)

and the weights are wi = p̂i(1 − p̂i), both depending on β̂ itself.
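This self-consistency is easy to check numerically. The following NumPy sketch (simulated data; nothing here is specific to the heart disease example) fits the model by Newton-Raphson and then verifies that a single weighted least squares fit of the working response z on X, with weights wi = p̂i(1 − p̂i), reproduces β̂.

```python
import numpy as np

# Simulated data: N observations, an intercept column, and p predictors.
rng = np.random.default_rng(0)
N, p = 200, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
beta_true = np.array([-1.0, 0.8, -0.5, 0.3])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))

# Fit by Newton-Raphson (iteratively reweighted least squares).
beta = np.zeros(X.shape[1])
for _ in range(50):
    p_hat = 1 / (1 + np.exp(-X @ beta))
    W = p_hat * (1 - p_hat)
    z = X @ beta + (y - p_hat) / W                      # working response, eq. (4.29)
    beta_new = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))
    if np.max(np.abs(beta_new - beta)) < 1e-10:
        beta = beta_new
        break
    beta = beta_new

# Self-consistency check: one weighted least squares fit of z on X, using the
# weights evaluated at the solution, returns beta itself.
p_hat = 1 / (1 + np.exp(-X @ beta))
W = p_hat * (1 - p_hat)
z = X @ beta + (y - p_hat) / W
beta_wls = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))
print(np.allclose(beta, beta_wls))                      # True (up to numerical tolerance)
```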
Apart from providing a convenient algorithm, this connection with least squares has more to offer:

• The weighted residual sum-of-squares is the familiar Pearson chi-square statistic

    \sum_{i=1}^{N} \frac{(y_i - \hat{p}_i)^2}{\hat{p}_i (1 - \hat{p}_i)},    (4.30)

  a quadratic approximation to the deviance.

• Asymptotic likelihood theory says that if the model is correct, then β̂ is consistent (i.e., converges to the true β).

• A central limit theorem then shows that the distribution of β̂ converges to N(β, (X^T W X)^{-1}). This and other asymptotics can be derived directly from the weighted least squares fit by mimicking normal theory inference.

• Model building can be costly for logistic regression models, because each model fitted requires iteration. Popular shortcuts are the Rao score test, which tests for inclusion of a term, and the Wald test, which can be used to test for exclusion of a term. Neither of these requires iterative fitting; both are based on the maximum-likelihood fit of the current model. It turns out that both of these amount to adding or dropping a term from the weighted least squares fit, using the same weights. Such computations can be done efficiently, without recomputing the entire weighted least squares fit.

Software implementations can take advantage of these connections. For example, the generalized linear modeling software in R (which includes logistic regression as part of the binomial family of models) exploits them fully. GLM (generalized linear model) objects can be treated as linear model objects, and all the tools available for linear models can be applied automatically.

4.4.4 L1 Regularized Logistic Regression

The L1 penalty used in the lasso (Section 3.4.2) can be used for variable selection and shrinkage with any linear regression model.
For logistic regression, we would maximize a penalized version of (4.20):

    \max_{\beta_0,\,\beta} \left\{ \sum_{i=1}^{N} \left[ y_i(\beta_0 + \beta^T x_i) - \log\bigl(1 + e^{\beta_0 + \beta^T x_i}\bigr) \right] - \lambda \sum_{j=1}^{p} |\beta_j| \right\}.    (4.31)

As with the lasso, we typically do not penalize the intercept term, and standardize the predictors for the penalty to be meaningful. Criterion (4.31) is concave, and a solution can be found using nonlinear programming methods (Koh et al., 2007, for example). Alternatively, using the same quadratic approximations that were used in the Newton algorithm in Section 4.4.1, we can solve (4.31) by repeated application of a weighted lasso algorithm. Interestingly, the score equations [see (4.24)] for the variables with non-zero coefficients have the form

    x_j^T (y - p) = \lambda \cdot \operatorname{sign}(\beta_j),    (4.32)

which generalizes (3.58) in Section 3.4.4; the active variables are tied in their generalized correlation with the residuals.
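This stationarity condition can be probed numerically. The sketch below uses scikit-learn on simulated, roughly standardized predictors (all settings are illustrative assumptions); sklearn's C corresponds to roughly 1/λ in the parameterization of (4.31), and the check holds only up to solver tolerance.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Simulated, approximately standardized predictors and a 0/1 response.
rng = np.random.default_rng(3)
N, p = 300, 6
X = rng.normal(size=(N, p))
y = rng.binomial(1, 1 / (1 + np.exp(-(X[:, 0] - 0.8 * X[:, 1]))))

lam = 5.0
fit = LogisticRegression(penalty="l1", C=1.0 / lam, solver="saga",
                         max_iter=50000, tol=1e-8).fit(X, y)
beta = fit.coef_.ravel()
p_hat = fit.predict_proba(X)[:, 1]

# Generalized correlations x_j^T (y - p): roughly +/- lambda for the active
# variables, smaller in absolute value for the rest (up to solver tolerance).
corr = X.T @ (y - p_hat)
for j in range(p):
    print(f"beta_{j}: {beta[j]: .3f}   x_j^T(y - p): {corr[j]: .3f}")
```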
Path algorithms such as LAR for lasso are more difficult, because the coefficient profiles are piecewise smooth rather than linear. Nevertheless, progress can be made using quadratic approximations.

[Figure: coefficient profiles βj(λ) plotted against ‖β(λ)‖1, with curves labeled age, famhist, ldl, tobacco, sbp, alcohol, obesity.]

FIGURE 4.13. L1 regularized logistic regression coefficients for the South African heart disease data, plotted as a function of the L1 norm. The variables were all standardized to have unit variance. The profiles are computed exactly at each of the plotted points.

Figure 4.13 shows the L1 regularization path for the South African heart disease data of Section 4.4.2. This was produced using the R package glmpath (Park and Hastie, 2007), which uses predictor–corrector methods of convex optimization to identify the exact values of λ at which the active set of non-zero coefficients changes (vertical lines in the figure). Here the profiles look almost linear; in other examples the curvature will be more visible.

Coordinate descent methods (Section 3.8.6) are very efficient for computing the coefficient profiles on a grid of values for λ.
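A grid-based path is straightforward to compute with any L1-capable solver. The following sketch uses scikit-learn (an assumption; the book's figure was produced with the R packages named in the text), mapping each grid value λ to C = 1/λ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def l1_logistic_path(X, y, lambdas):
    """Coefficient profiles over a grid of lambda values (one row per lambda).

    X is assumed to hold standardized predictors and y a 0/1 response."""
    profiles = []
    for lam in lambdas:
        fit = LogisticRegression(penalty="l1", C=1.0 / lam,
                                 solver="saga", max_iter=20000).fit(X, y)
        profiles.append(fit.coef_.ravel())
    return np.array(profiles)

# Example grid, from heavy to light penalization:
# lambdas = np.logspace(1, -3, 50)
# profiles = l1_logistic_path(X, y, lambdas)   # plot columns against the L1 norm
```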
The R package glmnet (Friedman et al., 2010) can fit coefficient paths for very large logistic regression problems efficiently (large in N or p). Their algorithms can exploit sparsity in the predictor matrix X, which allows for even larger problems. See Section 18.4 for more details, and a discussion of L1-regularized multinomial models.

4.4.5 Logistic Regression or LDA?

In Section 4.3 we find that the log-posterior odds between class k and K are linear functions of x (4.9):

    \log \frac{\Pr(G = k \mid X = x)}{\Pr(G = K \mid X = x)}
      = \log \frac{\pi_k}{\pi_K} - \frac{1}{2} (\mu_k + \mu_K)^T \Sigma^{-1} (\mu_k - \mu_K) + x^T \Sigma^{-1} (\mu_k - \mu_K)
      = \alpha_{k0} + \alpha_k^T x.    (4.33)

This linearity is a consequence of the Gaussian assumption for the class densities, as well as the assumption of a common covariance matrix.
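As a concrete illustration of (4.33), the plug-in computation of the intercept and slope from estimated Gaussian parameters looks like this. The sketch below is a two-class NumPy example on simulated data (the means, covariance, and sample sizes are arbitrary assumptions, not tied to any data set in the text).

```python
import numpy as np

# Two classes (k = 1 and the reference class K = 2) with a common covariance.
rng = np.random.default_rng(1)
cov = [[1.0, 0.3], [0.3, 1.0]]
X1 = rng.multivariate_normal([1.0, 0.0], cov, size=100)
X2 = rng.multivariate_normal([0.0, 1.0], cov, size=150)

N1, N2 = len(X1), len(X2)
pi1, pi2 = N1 / (N1 + N2), N2 / (N1 + N2)
mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
# Pooled within-class covariance estimate.
S = ((X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)) / (N1 + N2 - 2)

Sinv = np.linalg.inv(S)
alpha = Sinv @ (mu1 - mu2)                                            # slope vector
alpha0 = np.log(pi1 / pi2) - 0.5 * (mu1 + mu2) @ Sinv @ (mu1 - mu2)   # intercept
print(alpha0, alpha)   # log Pr(G=1|x)/Pr(G=2|x) = alpha0 + alpha @ x, as in (4.33)
```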
The linear logistic model (4.17) by construction has linear logits:

    \log \frac{\Pr(G = k \mid X = x)}{\Pr(G = K \mid X = x)} = \beta_{k0} + \beta_k^T x.    (4.34)

It seems that the models are the same. Although they have exactly the same form, the difference lies in the way the linear coefficients are estimated. The logistic regression model is more general, in that it makes fewer assumptions. We can write the joint density of X and G as

    \Pr(X, G = k) = \Pr(X)\,\Pr(G = k \mid X),    (4.35)

where Pr(X) denotes the marginal density of the inputs X. For both LDA and logistic regression, the second term on the right has the logit-linear form

    \Pr(G = k \mid X = x) = \frac{e^{\beta_{k0} + \beta_k^T x}}{1 + \sum_{\ell=1}^{K-1} e^{\beta_{\ell 0} + \beta_\ell^T x}},    (4.36)

where we have again arbitrarily chosen the last class as the reference.

The logistic regression model leaves the marginal density of X as an arbitrary density function Pr(X), and fits the parameters of Pr(G|X) by maximizing the conditional likelihood, the multinomial likelihood with probabilities the Pr(G = k|X).
Although Pr(X) is totally ignored, we can think of this marginal density as being estimated in a fully nonparametric and unrestricted fashion, using the empirical distribution function which places mass 1/N at each observation.

With LDA we fit the parameters by maximizing the full log-likelihood, based on the joint density

    \Pr(X, G = k) = \phi(X; \mu_k, \Sigma)\,\pi_k,    (4.37)

where φ is the Gaussian density function.
Standard normal theory leads easily to the estimates µ̂k, Σ̂, and π̂k given in Section 4.3. Since the linear parameters of the logistic form (4.33) are functions of the Gaussian parameters, we get their maximum-likelihood estimates by plugging in the corresponding estimates. However, unlike in the conditional case, the marginal density Pr(X) does play a role here. It is a mixture density

    \Pr(X) = \sum_{k=1}^{K} \pi_k\,\phi(X; \mu_k, \Sigma),    (4.38)

which also involves the parameters.

What role can this additional component/restriction play? By relying on the additional model assumptions, we have more information about the parameters, and hence can estimate them more efficiently (lower variance). If in fact the true fk(x) are Gaussian, then in the worst case ignoring this marginal part of the likelihood constitutes a loss of efficiency of about 30% asymptotically in the error rate (Efron, 1975).
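The efficiency claim can be probed with a small simulation. The sketch below (all sample sizes, means, and replication counts are illustrative assumptions) generates two Gaussian classes with a common covariance, estimates the slope both by the LDA plug-in formula and by (essentially unpenalized) logistic regression over many replications, and compares the spread of the two estimators; under the correct Gaussian model the LDA estimates are expected to vary somewhat less.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n_per_class, n_reps = 50, 500
mu0, mu1 = np.array([0.0, 0.0]), np.array([1.0, 0.5])  # common identity covariance

lda_slopes, lr_slopes = [], []
for _ in range(n_reps):
    X0 = rng.normal(size=(n_per_class, 2)) + mu0
    X1 = rng.normal(size=(n_per_class, 2)) + mu1
    X = np.vstack([X0, X1])
    y = np.r_[np.zeros(n_per_class), np.ones(n_per_class)]

    # LDA slope from plug-in Gaussian estimates: Sigma^{-1}(mu1 - mu0).
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    S = ((X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)) / (2 * n_per_class - 2)
    lda_slopes.append(np.linalg.solve(S, m1 - m0))

    # Essentially unpenalized logistic regression MLE (very large C).
    lr = LogisticRegression(C=1e6, max_iter=1000).fit(X, y)
    lr_slopes.append(lr.coef_.ravel())

print("sd of first slope, LDA plug-in:", np.std([b[0] for b in lda_slopes]))
print("sd of first slope, logistic:   ", np.std([b[0] for b in lr_slopes]))
```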