These are the increases if we actually knew σ². Let's look at the variance inflation factors. These measure how much variance inflation the variable causes relative to the setting where it was orthogonal to the other regressors. This is nice because it has a well contained interpretation within a single model fit. Also, one doesn't have to do all of the model refitting we did above to explore variance inflation.
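To make that interpretation concrete, here is a small sketch (my own, not from the text) that computes a single VIF by hand from the swiss data used below: the VIF for a regressor is 1 / (1 - R²), where R² comes from regressing that variable on the remaining covariates.

```r
# Sketch: the VIF for Agriculture is 1 / (1 - R^2), where R^2 is from
# regressing Agriculture on the other covariates (swiss data, datasets package).
data(swiss)
r2 <- summary(lm(Agriculture ~ Examination + Education + Catholic + Infant.Mortality,
                 data = swiss))$r.squared
1 / (1 - r2)  # should agree with vif(fit)["Agriculture"] below
```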
So, in general, the VIFs are the most convenient entity to work with.

```r
> library(car)
> fit <- lm(Fertility ~ . , data = swiss)
> vif(fit)
     Agriculture      Examination        Education         Catholic Infant.Mortality
           2.284            3.675            2.775            1.937            1.108
> sqrt(vif(fit)) #If you prefer sd inflation
     Agriculture      Examination        Education         Catholic Infant.Mortality
           1.511            1.917            1.666            1.392            1.052
```

Impact of over- and under-fitting on residual variance estimation

Watch this video before beginning.¹⁰¹

¹⁰¹https://www.youtube.com/watch?v=Mg6WUKkRiS8&list=PLpl-gQkQivXjqHAJd2t-J_One_fYE55tC&index=32

Assuming that the model is linear with additive iid errors, we can mathematically describe the impact of omitting necessary variables or including unnecessary ones. These two rules follow:

* If we underfit the model, that is omit necessary covariates, the variance estimate is biased.
* If we correctly or overfit the model, including all necessary covariates and possibly some unnecessary ones, the variance estimate is unbiased.
However, the variance of the variance is larger if we include unnecessary variables.

These make sense. If we've omitted important variables, we're attributing residual variation that is really systematic variation explainable by those omitted covariates. Therefore, we would expect a variance estimate that is systematically off (biased). We would also expect absence of bias when we throw the kitchen sink at the model and include everything (necessary and unnecessary). However, then our variance estimate is unstable (the variance of the variance estimate is larger).
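These rules are easy to check by simulation. The following sketch is my own (the setup and variable names are invented for illustration, not from the text): it compares the residual variance estimate under an underfit, a correctly fit, and an overfit model when the true error variance is 1.

```r
# Simulation sketch: sigma^2 estimates when underfitting, fitting correctly,
# and overfitting. True model: y = x1 + x2 + e with Var(e) = 1.
set.seed(1)
sims <- sapply(1 : 1000, function(i) {
  n <- 100
  x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)      # x3 is an unnecessary covariate
  y  <- x1 + x2 + rnorm(n)
  c(under   = summary(lm(y ~ x1))$sigma^2,            # omits x2, biased upward
    correct = summary(lm(y ~ x1 + x2))$sigma^2,       # unbiased
    over    = summary(lm(y ~ x1 + x2 + x3))$sigma^2)  # unbiased, slightly more variable
})
round(rowMeans(sims), 2)     # the underfit mean sits near 2; the others near 1
round(apply(sims, 1, sd), 3) # the overfit estimate is a bit more variable
```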
Covariate model selection

Ideally, you include only the necessary variables in a regression model. However, it's impossible to know in practice which ones are necessary and which ones are not. Thus we have to discuss variable selection a little bit. Automated covariate selection is a difficult topic. It depends heavily on how rich of a covariate space one wants to explore. The space of models explodes quickly as you add interactions and polynomial terms.

In the Data Science Specialization prediction class, we'll cover many modern methods for traversing large model spaces for the purposes of prediction. In addition, principal components or factor analytic models on covariates are often useful for reducing complex covariate spaces.
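As a quick illustration (my own sketch, not from the text), principal components of the swiss covariates could stand in for the raw, correlated regressors:

```r
# Sketch: principal components of the swiss covariates as a reduced covariate space.
covariates <- swiss[, -1]                   # all columns except Fertility
pc <- prcomp(covariates, scale. = TRUE)     # scaled principal components
summary(pc)                                 # proportion of variance explained
fitPC <- lm(swiss$Fertility ~ pc$x[, 1:2])  # regress the outcome on the first two PCs
```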
It should also be noted that careful design can often eliminate the need for complex model searches at the analysis stage. For example, randomized designs, randomized block designs, crossover designs, clinical trials and A/B testing are all examples of designs where randomization, balance and stratification are used to create data sets that have more direct analyses. However, control over the design is often limited in data science.

I'll give my favorite approach for model selection when I'm trying to get a parsimonious explanatory model.
(I would use a different strategy for prediction.) Given a coefficient that I'm interested in, I like to use covariate adjustment and multiple models to probe that effect to evaluate it for robustness and to see what other covariates knock it out or amplify it. In other words, if I have an effect, or absence of an effect, that I'd like to report, I try to first come up with criticisms of that effect and then use models to try to answer those criticisms.

As an example, if I had a significant effect of lead exposure on brain size I would think about the following criticism.
Were the high exposure people smaller than the low exposure people? To address this, I would consider adding head size (intra-cranial volume). If the lead exposed were more obese than the non-exposed, I would fit a model with body mass index (BMI) included. This isn't a terribly systematic approach, but it tends to teach you a lot about the data as you get your hands dirty. Most importantly, it makes you think hard about the questions you're asking and what the potential criticisms of your results are. Heading those criticisms off at the pass early on is a good idea.
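There is no lead exposure data in this book, but the same kind of probing can be illustrated with the swiss data (my own sketch): track how the Agriculture coefficient moves as adjustment variables are added, to see which covariates knock the effect out or amplify it.

```r
# Sketch: probe the Agriculture effect under increasing levels of adjustment.
forms <- list(Fertility ~ Agriculture,
              Fertility ~ Agriculture + Examination + Education,
              Fertility ~ .)
sapply(forms, function(f) coef(lm(f, data = swiss))["Agriculture"])
# Watching how (and whether) the estimate changes is the point of the exercise.
```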
How to do nested model testing in R

One particular model selection technique is so useful I'll cover it since it likely wouldn't be covered in a machine learning or prediction class. If the models of interest are nested and without lots of parameters differentiating them, it's fairly uncontroversial to use nested likelihood ratio tests for model selection.
Consider the following example:

```r
> fit1 <- lm(Fertility ~ Agriculture, data = swiss)
> fit3 <- update(fit, Fertility ~ Agriculture + Examination + Education)
> fit5 <- update(fit, Fertility ~ Agriculture + Examination + Education + Catholic + Infant.Mortality)
> anova(fit1, fit3, fit5)
Analysis of Variance Table

Model 1: Fertility ~ Agriculture
Model 2: Fertility ~ Agriculture + Examination + Education
Model 3: Fertility ~ Agriculture + Examination + Education + Catholic +
    Infant.Mortality
  Res.Df  RSS Df Sum of Sq    F  Pr(>F)
1     45 6283
2     43 3181  2      3102 30.2 8.6e-09 ***
3     41 2105  2      1076 10.5 0.00021 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

Notice how the three models I'm interested in are nested. That is, Model 3 contains all of the Model 2 variables, which contains all of the Model 1 variables. The P-values are for a test of whether all of the new variables are zero or not (i.e. whether or not they're necessary).
So this analysis would conclude that all of the added Model 3 terms are necessary over Model 2 and all of the Model 2 terms are necessary over Model 1. So, unless there were some other compelling reasons, we'd pick Model 3. Again, you don't want to blindly follow a model selection procedure, but when the models are naturally nested, this is a reasonable approach.
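As a side check (my own, not in the text), the F statistics in the table can be reproduced from the residual sums of squares: each one divides the average drop in RSS by the residual mean square of the largest model. The fits are re-created here so the chunk stands alone.

```r
# Reproduce the anova() F statistics by hand from the residual sums of squares.
fit1 <- lm(Fertility ~ Agriculture, data = swiss)
fit3 <- update(fit1, . ~ . + Examination + Education)
fit5 <- update(fit3, . ~ . + Catholic + Infant.Mortality)
mse5 <- deviance(fit5) / fit5$df.residual        # denominator: largest model's MSE
(deviance(fit1) - deviance(fit3)) / 2 / mse5     # about 30.2 (Model 2 vs Model 1)
(deviance(fit3) - deviance(fit5)) / 2 / mse5     # about 10.5 (Model 3 vs Model 2)
```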
Exercises

1. Load the dataset Seatbelts as part of the datasets package via data(Seatbelts). Use as.data.frame to convert the object to a dataframe. Fit a linear model of driver deaths with kms, PetrolPrice and law as predictors.
2. Perform a model selection exercise to arrive at a final model.

Watch a video solution.¹⁰²

¹⁰²https://www.youtube.com/watch?v=ffu80TAq2zY&list=PLpl-gQkQivXji7JK1OP1qS7zalwUBPrX0&index=46
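A possible starting point for exercise 1 (a sketch only, not the worked solution; DriversKilled is taken here as the driver-death count) might look like this:

```r
# Starter sketch for exercise 1: driver deaths on kms, PetrolPrice and law.
data(Seatbelts)
seatbelts <- as.data.frame(Seatbelts)
fit <- lm(DriversKilled ~ kms + PetrolPrice + law, data = seatbelts)
summary(fit)$coef
```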
Generalized Linear Models

Watch this video before beginning.¹⁰³

¹⁰³https://youtu.be/xEwM1nzQckY

Generalized linear models (GLMs) were a great advance in statistical modeling. The original manuscript with the GLM framework was from Nelder and Wedderburn in 1972¹⁰⁴ in the Journal of the Royal Statistical Society.

¹⁰⁴http://www.jstor.org/stable/2344614
The McCullagh and Nelder book¹⁰⁵ is the famous standard treatise on the subject.

¹⁰⁵McCullagh, Peter, and John A. Nelder. Generalized Linear Models. Vol. 37. CRC Press, 1989.

Recall linear models. Linear models are the most useful applied statistical technique. However, they are not without their limitations. Additive response models don't make much sense if the response is discrete, or strictly positive. Additive error models often don't make sense, for example, if the outcome has to be positive. Transformations, such as taking a cube root of a count outcome, are often hard to interpret. In addition, there's value in modeling the data on the scale that it was collected. Particularly interpretable transformations, natural logarithms specifically, aren't applicable for negative or zero values.

The generalized linear model is a family of models that includes linear models.
By extending the family, it handles many of the issues with linear models, but at the expense of some complexity and loss of some of the mathematical tidiness. A GLM involves three components:

• An exponential family model for the response.
• A systematic component via a linear predictor.
• A link function that connects the means of the response to the linear predictor.

The three most famous cases of GLMs are: linear models, binomial and binary regression and Poisson regression. We'll go through the GLM model specification and likelihood for all three. For linear models, we've developed them throughout the book.
The next two chapters will be devoted to binomial and Poisson regression. We'll only focus on the most popular and useful link functions.

Example, linear models

Let's go through an example. Assume that our response is Yi ∼ N(µi, σ²). The Gaussian distribution is an exponential family distribution. Define the linear predictor to be

$$\eta_i = \sum_{k=1}^p X_{ik} \beta_k.$$

Define the link function as g so that g(µ) = η. For linear models g(µ) = µ, so that µi = ηi. This yields the same likelihood model as our additive error Gaussian linear model

$$Y_i = \sum_{k=1}^p X_{ik} \beta_k + \epsilon_i$$

where the ϵi are iid N(0, σ²). So, we've specified our model as a GLM above and with a more traditional linear model specification below.
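To see the equivalence in R (a small sketch of my own), a Gaussian-family GLM with the default identity link should reproduce the ordinary least squares fit:

```r
# Sketch: for the identity link and Gaussian family, glm() and lm() agree.
fitLM  <- lm(Fertility ~ Agriculture, data = swiss)
fitGLM <- glm(Fertility ~ Agriculture, family = gaussian, data = swiss)
cbind(lm = coef(fitLM), glm = coef(fitGLM))  # the coefficients should match
```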
Let's try an example where the GLM is more necessary.

Example, logistic regression

Assume that our outcome is a 0, 1 variable. Let's model Yi ∼ Bernoulli(µi) so that E[Yi] = µi where 0 ≤ µi ≤ 1.

• Linear predictor: $\eta_i = \sum_{k=1}^p X_{ik} \beta_k$
• Link function: $g(\mu) = \eta = \log\left(\frac{\mu}{1 - \mu}\right)$

In this case, g is the (natural) log odds, referred to as the logit. Note then we can invert the logit function as:

$$\mu_i = \frac{\exp(\eta_i)}{1 + \exp(\eta_i)} \quad \text{and} \quad 1 - \mu_i = \frac{1}{1 + \exp(\eta_i)}$$

Some people like to call this the expit function.
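As a quick numeric check (my own sketch), the logit and its inverse can be written directly in R:

```r
# Sketch: the logit and its inverse (the "expit") undo one another.
logit <- function(mu)  log(mu / (1 - mu))
expit <- function(eta) exp(eta) / (1 + exp(eta))
expit(logit(0.73))           # recovers 0.73
expit(2) + 1 / (1 + exp(2))  # the two pieces sum to 1
```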