Under this way of thinking, the distinction over which of these two kinds of standardization is used is more academic than practical.

A common use for residuals is to diagnose normality of the errors. This is often done by plotting the residual quantiles versus normal quantiles. This is called a residual QQ plot. Your residuals should fall roughly on a line if plotted in a normal QQ plot. There is of course noise and a perfect fit would not be expected even if the model held.
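As a concrete sketch, assuming `fit` is a fitted `lm` model (the name is illustrative):

```r
# residual QQ plot: the points should fall near the reference line
# if the errors are approximately normal
qqnorm(resid(fit))
qqline(resid(fit))
```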
Leverage is largely measured by one quantity, so-called hat diagonals, which can be obtained in R by the function hatvalues. The hat values are necessarily between 0 and 1, with larger values indicating greater (potential for) leverage.

After leverage, there are quite a few ways to probe for influence. These are:

* dffits - change in the predicted response when the $i^{th}$ point is deleted in fitting the model.
* dfbetas - change in individual coefficients when the $i^{th}$ point is deleted in fitting the model.
* cooks.distance - overall change in the coefficients when the $i^{th}$ point is deleted.

In other words, the dffits check for influence in the fitted values, dfbetas check for influence in the coefficients individually and cooks.distance checks for influence in the coefficients as a collective.
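All three are functions in base R's stats package. A minimal sketch, again assuming a fitted model `fit`:

```r
head(dffits(fit))          # influence of each point on its own fitted value
head(dfbetas(fit))         # a matrix with one column per model coefficient
head(cooks.distance(fit))  # overall influence on the coefficient vector
```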
Finally, there's a residual measure that's also an influence measure. Particularly, consider resid(fit) / (1 - hatvalues(fit)), where fit is the linear model fit. These are the so-called PRESS residuals. They are the residual errors from leave-one-out cross validation. That is, the PRESS residual for data point i is the difference between the response and the predicted response at that point, where it was not included in the model fitting.
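This identity is easy to verify numerically. A minimal sketch on simulated data (the data and names here are illustrative, not from the text):

```r
set.seed(1)
n <- 50
d <- data.frame(x = rnorm(n))
d$y <- 1 + 2 * d$x + rnorm(n)
fit <- lm(y ~ x, data = d)

# PRESS residuals via the shortcut formula
press <- resid(fit) / (1 - hatvalues(fit))

# explicit leave-one-out residual for the first data point
fit1 <- lm(y ~ x, data = d[-1, ])
loo1 <- d$y[1] - predict(fit1, newdata = d[1, ])

c(press[1], loo1)  # the two agree
```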
## How do I use all of these things?

First of all, be wary of simplistic rules for diagnostic plots and measures. The use of these tools is context specific. It's better to understand what they are trying to accomplish and use them judiciously. Not all diagnostic measures have meaningful absolute scales. You can look at them relative to the values across the data. Even for the ones with known exact distributions to establish cutoffs, those distributions (like the externally studentized residual) have degrees of freedom that depend on the sample size, so a single threshold can't be used across all settings.

A better way to think about these tools is as diagnostics, like a physician diagnosing a health issue. These tools probe your data in different ways to diagnose different problems.
Some examples include:

* Patterns in your residual plots generally indicate some poor aspect of model fit. These can include:
  * Heteroskedasticity (non-constant variance).
  * Missing model terms.
  * Temporal patterns (plot residuals versus collection order).
* Residual QQ plots investigate normality of the errors.
* Leverage measures (hat values) can be useful for diagnosing data entry errors and points that have a high potential for influence.
* Influence measures get to the bottom line, 'how does deleting or including this point impact a particular aspect of the model?'

Let's do some experiments to see how these measures hold up.

## Simulation examples

### Case 1

In what follows, we're going to try several simulation settings and look at the values of some of the residuals, influence measures and leverage. In our first case, we simulate 100 points. The 101st point, c(10, 10), has created a strong regression relationship where there shouldn't be one.
Note we prepend this point at the beginning of the Y and X vectors.

```r
n <- 100; x <- c(10, rnorm(n)); y <- c(10, rnorm(n))
plot(x, y, frame = FALSE, cex = 2, pch = 21, bg = "lightblue", col = "black")
abline(lm(y ~ x))
```

*Image for first simulation.*

First, let's look at the dfbetas. Note the dfbetas are 101 by 2 dimensional, since there's a dfbeta for both the intercept and the slope. Let's just output the first 10 for the slope.

```r
> fit <- lm(y ~ x)
> round(dfbetas(fit)[1 : 10, 2], 3)
     1      2      3      4      5      6      7      8      9     10
 6.007 -0.019 -0.007  0.014 -0.002 -0.083 -0.034 -0.045 -0.112 -0.008
```

Clearly the first point has a much, much larger dfbeta for the slope than the other points.
That is, the slope changes dramatically depending on whether we include this point or not. Try it out with Cook's distance and the dffits. Let's look at leverage.

```r
round(hatvalues(fit)[1 : 10], 3)
    1     2     3     4     5     6     7     8     9    10
0.445 0.010 0.011 0.011 0.030 0.017 0.012 0.033 0.021 0.010
```

Again, this point has a much higher leverage value than that of the other points. By having a large leverage value and dfbeta, we're seeing that this point has a high potential for influence, and decided to exert it.
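Following the suggestion above, here's a quick sketch of the same check with the other two influence measures, using the same fit from Case 1; both should flag the first point far more strongly than the rest:

```r
round(cooks.distance(fit)[1 : 10], 3)  # overall influence on the coefficients
round(dffits(fit)[1 : 10], 3)          # influence on the fitted values
```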
### Case 2

Consider a second case where the point lies on a natural line defined by the data, but is well outside of the cloud of X values. Since the code is so similar, I don't show it; but, as always, it can be found in the github repo for the courses. A rough sketch of the idea follows below.
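The sketch below is hypothetical, not the repo's exact code; the point c(5, 5) and the noise level are my own choices, picked so that the added point falls on the line implied by the rest of the data:

```r
# a point at (5, 5) lies on the natural line y = x defined by the data,
# but sits far outside the cloud of X values
n <- 100
x <- c(5, rnorm(n))
y <- c(5, x[-1] + rnorm(n, sd = 0.3))
fit2 <- lm(y ~ x)
plot(x, y, frame = FALSE, cex = 2, pch = 21, bg = "lightblue", col = "black")
abline(fit2)
```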
*Second simulation example.*

Now let's consider the dfbetas and the leverage for the first 10 observations.

```r
> round(dfbetas(fit2)[1 : 10, 2], 3)
     1      2      3      4      5      6      7      8      9     10
-0.072 -0.041 -0.007  0.012  0.008 -0.187  0.017  0.100 -0.059  0.035
> round(hatvalues(fit2)[1 : 10], 3)
    1     2     3     4     5     6     7     8     9    10
0.164 0.011 0.014 0.012 0.010 0.030 0.017 0.017 0.013 0.021
```

As we would expect, the dfbeta value for the first point is well within the range of the other points, while the leverage is much larger than the others'. In this case, the point has high leverage, but chooses not to exert it as influence.

Play around with more simulation examples to get a feeling for what these measures do. This will help more than anything in understanding their value.

## Example described by Stefanski

Watch this video before beginning.⁹¹

⁹¹ http://youtu.be/oMW7jGEdZ48

We end with a really fun example from Stefanski in TAS 2007 volume 61⁹². This paper illustrates how a residual plot can unveil hidden treasures that would be nearly impossible to detect with other kinds of plots.
He has several examples on his website and we go through one. First, let's read in the data and do a scatterplot matrix.

```r
dat <- read.table('http://www4.stat.ncsu.edu/~stefanski/NSF_Supported/Hidden_Images/orly_owl_files/orly_owl_Lin_4p_5_flat.txt',
                  header = FALSE)
pairs(dat)
```

*Scatterplot matrix from the Stefanski data.*

It looks like a big mess of nothing. We can fit a model and find that all of the variables are highly significant.
```r
> summary(lm(V1 ~ . - 1, data = dat))$coef
   Estimate Std. Error t value  Pr(>|t|)
V2   0.9856    0.12798   7.701 1.989e-14
V3   0.9715    0.12664   7.671 2.500e-14
V4   0.8606    0.11958   7.197 8.301e-13
V5   0.9267    0.08328  11.127 4.778e-28
```

Can we call it a day? Let's check a residual plot.

⁹² http://amstat.tandfonline.com/doi/abs/10.1198/000313007X190079

```r
fit <- lm(V1 ~ . - 1, data = dat); plot(predict(fit), resid(fit), pch = '.')
```

*Residuals versus fitted values from the Stefanski data.*

There appears to be a pattern. The moral of the story here is that residual plots can really home in on systematic patterns in the data that are completely non-apparent from other plots.

## Back to the Swiss data

*Plot of the influence, leverage and residuals from the swiss dataset.*
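The figure itself is not reproduced here, but as a sketch, such diagnostics can be plotted with R's default plot method for lm fits (the book's exact plotting code may differ):

```r
data(swiss)
fit <- lm(Fertility ~ ., data = swiss)
# default panels: residuals vs. fitted, normal QQ, scale-location,
# and residuals vs. leverage (with Cook's distance contours)
par(mfrow = c(2, 2))
plot(fit)
```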
## Exercises

1. Load the dataset Seatbelts as part of the datasets package via data(Seatbelts). Use as.data.frame to convert the object to a dataframe. Fit a linear model of driver deaths with kms, PetrolPrice and law as predictors. (A starting-point sketch follows this list.)
2. Refer to question 1. Directly estimate the residual variation via the function resid. Compare with R's residual variance estimate. Watch a video solution.⁹³
3. Refer to question 1. Perform an analysis of diagnostic measures including dffits, dfbetas, influence and hat diagonals. Watch a video solution.⁹⁴

⁹³ https://www.youtube.com/watch?v=T8nPIeH1rwU&list=PLpl-gQkQivXji7JK1OP1qS7zalwUBPrX0&index=44
⁹⁴ https://www.youtube.com/watch?v=XEqlmqFTVOI&list=PLpl-gQkQivXji7JK1OP1qS7zalwUBPrX0&index=45
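A possible starting point for exercises 1 and 2, assuming DriversKilled is the driver-deaths column (a sketch, not the video solutions' code):

```r
data(Seatbelts)
sb <- as.data.frame(Seatbelts)
fit <- lm(DriversKilled ~ kms + PetrolPrice + law, data = sb)

# residual standard deviation by hand versus R's estimate
n <- nrow(sb)
sqrt(sum(resid(fit)^2) / (n - 4))  # 4 estimated coefficients
summary(fit)$sigma                 # should match
```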
# Multiple variables and model selection

Watch this video before beginning.⁹⁵

This chapter takes on a challenging question: "How do we choose which variables to include in a regression model?" Sadly, no single easy answer exists, and the most reasonable answer is "It depends." These concepts bleed into ideas of machine learning, which is largely focused on high dimensional variable selection and weighting. In this chapter we cover some of the basics and, most importantly, the consequences of over- and under-fitting a model.

## Multivariable regression

In our Coursera Data Science Specialization, we have an entire class on prediction and machine learning.
So, in this class, our focus will be on modeling. That is, our primary concern is winding up with an interpretable model, with interpretable coefficients. This is a very different process than if we only care about prediction or machine learning. Prediction has a different set of criteria, needs for interpretability and standards for generalizability. In modeling, our interest lies in parsimonious, interpretable representations of the data that enhance our understanding of the phenomena under study.

Like nearly all aspects of statistics, good modeling decisions are context dependent. Consider a good model for prediction, versus one for studying mechanisms, versus one for trying to establish causal effects. There are, however, some principles to help guide your way.

Parsimony is a core concept in model selection.
The idea of parsimony is to keep your models as simple as possible (but no simpler). This builds on the idea of Occam's razor⁹⁶, in that, all else being equal, simpler explanations are better than complex ones. Simpler models are easier to interpret and are less finicky. Complex models often have issues with fitting and, especially, overfitting. (To see a counterargument, consider Andrew Gelman's blog⁹⁷.)

Another principle that I find useful for looking at statistical models is to consider them as lenses through which to look at your data.
(I attribute this quote to the great statistician Scott Zeger.) Under this philosophy, what's the right model? Whatever one connects the data to a true, parsimonious statement about what you're studying. Unwin and co-authors have formalized these ideas more into something they call exploratory model analysis.⁹⁸ I like this, as it turns our focus away from trying to get a single, best, true model and instead focuses on.

⁹⁵ https://youtu.be/zfhNo8uNBho?list=PLpl-gQkQivXjqHAJd2t-J_One_fYE55tC
⁹⁶ https://en.wikipedia.org/wiki/Occam’s_razor
⁹⁷ http://andrewgelman.com/2004/12/10/against_parsimo/
⁹⁸ http://www.sciencedirect.com/science/article/pii/S016794730200292X