                     Estimate Std. Error t value  Pr(>|t|)
(Intercept)           60.8322     4.1059  14.816 1.032e-18
Agriculture            0.1242     0.0811   1.531 1.329e-01
factor(CatholicBin)1   7.8843     3.7484   2.103 4.118e-02

Thus, 7.8843 is the estimated change in the intercept of the expected relationship between Agriculture and Fertility going from a non-Catholic majority province to a Catholic majority province.

Often, however, we want both a different intercept and a different slope. This is easily obtained with an interaction term:

$$Y_i = \beta_0 + X_{i1} \beta_1 + X_{i2} \beta_2 + X_{i1} X_{i2} \beta_3 + \epsilon_i.$$

When $X_{i2} = 0$, the model reduces to:

$$Y_i = \beta_0 + X_{i1} \beta_1 + \epsilon_i.$$

When $X_{i2} = 1$, the model is:

$$Y_i = (\beta_0 + \beta_2) + X_{i1} (\beta_1 + \beta_3) + \epsilon_i.$$

Thus, the coefficient in front of the main effect $X_{i2}$, labeled $\beta_2$ in our model, is the change in the intercept, while the coefficient in front of the interaction term $X_{i1} X_{i2}$, labeled $\beta_3$ in our model, is the change in the slope. Let's try it:

> summary(lm(Fertility ~ Agriculture * factor(CatholicBin), data = swiss))$coef
                                  Estimate Std. Error t value  Pr(>|t|)
(Intercept)                       62.04993    4.78916 12.9563 1.919e-16
Agriculture                        0.09612    0.09881  0.9727 3.361e-01
factor(CatholicBin)1               2.85770   10.62644  0.2689 7.893e-01
Agriculture:factor(CatholicBin)1   0.08914    0.17611  0.5061 6.153e-01

Thus, 2.85770 is the estimated change in the intercept of the linear relationship between Agriculture and Fertility going from non-Catholic majority to Catholic majority provinces. The interaction term, 0.08914, is the estimated change in the slope. The estimated intercept in non-Catholic majority provinces is 62.04993, while the estimated intercept in Catholic majority provinces is 62.04993 + 2.85770. The estimated slope in non-Catholic majority provinces is 0.09612, while it is 0.09612 + 0.08914 for Catholic majority provinces. If the factor has more than two levels, all of the main effects are changes in the intercepts from the reference level, while all of the interaction terms are changes in slope (again compared to the reference level).

Homework exercise: plot both lines on the data to see the fit! A sketch of one possible solution follows.
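This sketch is not from the original text; it assumes CatholicBin has been created as earlier in the chapter (a 0/1 indicator of a Catholic majority, consistent with the coefficient name factor(CatholicBin)1 above) and simply reuses the fitted coefficients:

# Recreate the binary Catholic-majority variable used earlier in the chapter.
swiss$CatholicBin <- 1 * (swiss$Catholic > 50)
fit <- lm(Fertility ~ Agriculture * factor(CatholicBin), data = swiss)
# Scatterplot of the data, colored by Catholic majority status.
plot(swiss$Agriculture, swiss$Fertility, pch = 21, col = "black",
     bg = ifelse(swiss$CatholicBin == 1, "salmon", "lightblue"), cex = 2,
     frame = FALSE, xlab = "Agriculture", ylab = "Fertility")
# Fitted line for non-Catholic majority provinces (the reference level).
abline(coef(fit)[1], coef(fit)[2], lwd = 3)
# Fitted line for Catholic majority provinces: intercept and slope both shift.
abline(coef(fit)[1] + coef(fit)[3], coef(fit)[2] + coef(fit)[4], lwd = 3)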
Exercises

1. Do exercise 1 of the previous chapter if you have not already. Load the dataset Seatbelts as part of the datasets package via data(Seatbelts). Use as.data.frame to convert the object to a dataframe. Fit a linear model of driver deaths with kms and PetrolPrice as predictors. Interpret your results.

2. Repeat question 1 for the outcome being the log of the count of driver deaths. Interpret your coefficients. Watch a video solution.⁸⁵

3. Refer to question 1. Add the dummy variable law and interpret the results. Repeat this question with a factor variable that you create called lawFactor that takes the levels No and Yes.
Change the reference level from No to Yes. Watch a video solution.⁸⁶

4. Discretize the PetrolPrice variable into four factor levels. Fit the linear model with this factor to see how R treats multiple level factor variables. Watch a video solution.⁸⁷

5. Perform the plot requested at the end of the last chapter.

⁸⁵https://www.youtube.com/watch?v=GfIjC4rM08A&index=40&list=PLpl-gQkQivXji7JK1OP1qS7zalwUBPrX0
⁸⁶https://www.youtube.com/watch?v=ikKQv98i-EQ&index=41&list=PLpl-gQkQivXji7JK1OP1qS7zalwUBPrX0
⁸⁷https://www.youtube.com/watch?v=4FB8O-Vt1I0&index=42&list=PLpl-gQkQivXji7JK1OP1qS7zalwUBPrX0

Adjustment

Watch this video before beginning.⁸⁸

Adjustment is the idea of putting regressors into a linear model to investigate the role of a third variable on the relationship between another two.
It is often the case that a third variable can distort, or confound if you will, the relationship between two others.

As an example, consider looking at lung cancer rates and breath mint usage. For the sake of completeness, imagine if you were looking at forced expiratory volume (a measure of lung function) and breath mint usage. If you found a statistically significant regression relationship, it wouldn't be wise to rush off to the newspapers with the headline "Breath mint usage causes shortness of breath!", for a variety of reasons.
First off, even if the association is sound, you don't know that it's causal. But, more importantly in this case, the likely culprit is smoking habits. Smoking rates are likely related to both breath mint usage rates and lung function. How would you defend your finding against the accusation that it's just variability in smoking habits?

If your finding held up among non-smokers and smokers analyzed separately, then you might have something. In other words, people wouldn't even begin to believe this finding unless it held up while holding smoking status constant.
That is the idea of adding a regression variable into a model as adjustment. The coefficient of interest is interpreted as the effect of the predictor on the response, holding the adjustment variable constant.

In this chapter, we'll use simulation to investigate how adding a regressor into a model addresses the idea of adjustment.

Experiment 1

Let's first generate some data. Consider the model

$$Y_i = \beta_0 + \beta_1 X_i + \tau T_i + \epsilon_i.$$

We're interested in the relationship between our binary treatment, T, and Y. However, we're concerned that the relationship may depend on the continuous variable, X.

Let's simulate some data.

⁸⁸https://youtu.be/SFPM9IuP2m8

n <- 100; t <- rep(c(0, 1), c(n/2, n/2)); x <- c(runif(n/2), runif(n/2));
beta0 <- 0; beta1 <- 2; tau <- 1; sigma <- .2
y <- beta0 + x * beta1 + t * tau + rnorm(n, sd = sigma)

Let's plot the data.
Below I give the code for the first plot; the rest of the code for the plots throughout this chapter is omitted. (However, you can see the course git repository for the rest of the code.)

Simulation 1

plot(x, y, type = "n", frame = FALSE)
abline(lm(y ~ x), lwd = 2)
abline(h = mean(y[1 : (n/2)]), lwd = 3)
abline(h = mean(y[(n/2 + 1) : n]), lwd = 3)
fit <- lm(y ~ x + t)
abline(coef(fit)[1], coef(fit)[2], lwd = 3)
abline(coef(fit)[1] + coef(fit)[3], coef(fit)[2], lwd = 3)
points(x[1 : (n/2)], y[1 : (n/2)], pch = 21, col = "black", bg = "lightblue", cex = 2)
points(x[(n/2 + 1) : n], y[(n/2 + 1) : n], pch = 21, col = "black", bg = "salmon", cex = 2)

Experiment 1.

Looking at this plot, notice that the X variable is unrelated to treatment/group status (color).
In addition, the X variable is clearly linearly related to Y, but the intercept of this relationship depends on group status. The treatment variable is also related to Y; especially look at the horizontal lines, which connect the group means onto the Y axis. The third line is what you would get if you just fit X and ignored group. Furthermore, notice that the relationship between group status and Y is constant across values of X. In other words, both the apparent relationship and our estimated model have parallel lines. (Remember, our model, by not including an interaction term, did not allow for estimated non-parallel lines.)

Finally, notice that the estimated relationship between the group variable and the outcome doesn't change much, regardless of whether X is accounted for or not.
You can see this by comparing the distance between the horizontal lines and the distance between the intercepts of the fitted lines. The horizontal lines are the group averages (disregarding X). That the relationship doesn't change much is ultimately a statement about balance. The nuisance variable (X) is well balanced between levels of the group variable.
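One can check this numerically with the simulated data; this quick comparison is not in the original text:

coef(lm(y ~ t))["t"]       # group effect ignoring x (the gap between horizontal lines)
coef(lm(y ~ x + t))["t"]   # group effect adjusting for x; both estimates are near tau = 1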
So, whether you account for X or not, you get about the same answer. Moreover, we have lots of data at every level of X to make a direct comparison of the group on Y. One way to try to achieve such balance with high probability is to randomize the group variable. This is especially useful, of course, when one doesn't get to observe the nuisance covariate. Though be careful: as the number of unobserved covariates grows, it becomes harder to believe that all of them are balanced between the groups.

Now let's consider less ideal settings.

Experiment 2

Experiment 2.

In this experiment, the X variable is highly related to group status. That is, if you know the X variable, you could very easily predict which group an observation belonged to.
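The text omits the code for this simulation, but data with this structure could be generated roughly as follows; the offset of 1.5 and the choice tau <- 0 are illustrative assumptions rather than the book's exact values:

n <- 100; t <- rep(c(0, 1), c(n/2, n/2))
x <- c(runif(n/2), 1.5 + runif(n/2))   # x almost perfectly separates the groups
beta0 <- 0; beta1 <- 2; tau <- 0; sigma <- .2
y <- beta0 + x * beta1 + t * tau + rnorm(n, sd = sigma)
summary(lm(y ~ t))$coef       # a strong apparent group effect
summary(lm(y ~ x + t))$coef   # the group effect essentially vanishes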
If we disregard X, there's an apparently strong relationship between the group variable and Y. However, if we account for X, there's basically none. In this case, the apparent effect of group on Y is entirely explained by X. Our regression model would likely show a strong significant effect if group was included by itself, and this effect would vanish if X was included.

Further, notice that there's no data to directly compare the groups at any particular value of X. (There's no vertical overlap between the blue and red points.) Thus the adjusted effect is entirely based on the model, specifically the assumption of linearity.
Try drawing curves on this plot, assuming non-linear relationships outside of their cloud of points for the blue and red groups. You will quickly conclude that many relationships are possible that would differ from this model's conclusions. Worse still, you have no data to check the assumptions.
Of course, R will churn forward without any complaints, fitting this model and reporting no significant difference between the groups.

It's worth noting at this point that our experiments just show how the data can arrive at different effects when X is included or not. In a real application, it may be the case that X should be included and it may be that it shouldn't.

For example, consider a study I was working on a few years ago. Imagine if group was whether or not the subject was taking blood pressure medication and X was systolic blood pressure (ostensibly, the two variables giving the same information).
It may not make sense to adjust for blood pressure when looking at the effect of blood pressure medication on the outcome.

On the other hand, consider another setting I ran into. A colleague was studying post-mortem chemical brain measurements of patients with a severe mental disorder versus controls. However, the time since death was highly related to the time the brain was stored since death, perhaps due to the differential patient sources of the two groups. The time since death was strongly related to the outcome we were studying. In this case, it is very hard to study the groups, as they were so contaminated by this nuisance covariate.

Thus we arrive at the conclusion that whether or not to include a covariate is a complex process relying on both the statistics and a careful investigation into the subject matter being studied.

Experiment 3

Experiment 3.

In this experiment, we simulated data where the marginal (ignoring X) and conditional (using X) associations differ. First note that if X is ignored, one would estimate a higher marginal mean for Y for the red group than for the blue group.
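The simulation code is again omitted in the text, but one hypothetical way to produce data whose marginal and conditional associations disagree is to give the treated (red) group larger X values on average while making the treatment effect itself negative; the specific numbers below are illustrative assumptions:

n <- 100; t <- rep(c(0, 1), c(n/2, n/2))
x <- c(runif(n/2), 0.9 + runif(n/2))   # treated group has larger x on average
beta0 <- 0; beta1 <- 2; tau <- -1; sigma <- .2
y <- beta0 + x * beta1 + t * tau + rnorm(n, sd = sigma)
summary(lm(y ~ t))$coef       # marginally, the treated group looks higher
summary(lm(y ~ x + t))$coef   # conditionally on x, the treated group is lower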