The intercept -259.63 is the expected price of a 0 carat diamond. We're not interested in 0 carat diamonds (it's hard to get a good price for them ;-). Let's fit the model with a more interpretable intercept by centering our X variable.

> fit2 <- lm(price ~ I(carat - mean(carat)), data = diamond)
> coef(fit2)
           (Intercept) I(carat - mean(carat))
                 500.1                 3721.0

Thus the new intercept, 500.1, is the expected price for the average sized diamond of the data (0.2042 carats). Notice the estimated slope didn't change at all.
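As a quick sketch of why this works (refitting the uncentered model so the snippet is self-contained), the centered intercept is just the original fitted line evaluated at the mean carat size:

## Verify that centering only shifts the intercept
library(UsingR); data(diamond)
fit <- lm(price ~ carat, data = diamond)
## Evaluate the original fitted line at the mean mass;
## this reproduces the 500.1 intercept of fit2 above
coef(fit)[1] + coef(fit)[2] * mean(diamond$carat)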
Now let's try changing the scale. This is useful since a one carat increase in a diamond is pretty big. What about changing units to 1/10th of a carat? We can do this by just dividing the coefficient by 10, no need to refit the model. Thus, we expect a 372.102 (SIN) dollar change in price for every 1/10th of a carat increase in the mass of the diamond. Let's show via R that this is the same as rescaling our X variable and refitting. To go from 1 carat to 1/10 of a carat units, we need to multiply our data by 10.

> fit3 <- lm(price ~ I(carat * 10), data = diamond)
> coef(fit3)
  (Intercept) I(carat * 10)
       -259.6         372.1

Now, let's predict the price of a diamond.
This should be as easy as just evaluating the fitted line at the carat values we want to predict at.

> newx <- c(0.16, 0.27, 0.34)
> coef(fit)[1] + coef(fit)[2] * newx
[1]  335.7  745.1 1005.5

Therefore, we predict the price to be 335.7, 745.1 and 1005.5 for 0.16, 0.27 and 0.34 carat diamonds. Of course, our prediction models will get more elaborate and R has a generic function, predict, to put our X values into the model for us. This is generally preferable and less error prone. The data has to go into the model as a data frame with the same named X variables.

> predict(fit, newdata = data.frame(carat = newx))
     1      2      3
 335.7  745.1 1005.5

Let's visualize our prediction. In the following plot, the predicted values at the observed Xs are the red points on the fitted line. The new X values are at the vertical lines, which are connected to the predicted values via the connected horizontal lines.

Illustrating prediction with regression.

Exercises
1. Fit a linear regression model to the father.son dataset with the father as the predictor and the son as the outcome. Give a p-value for the slope coefficient and perform the relevant hypothesis test. Watch a video solution.⁵¹
2. Refer to question 1. Interpret both parameters. Recenter for the intercept if necessary. Watch a video solution.⁵²
3. Refer to question 1. Predict the son's height if the father's height is 80 inches. Would you recommend this prediction? Why or why not? Watch a video solution.⁵³
4. Load the mtcars dataset. Fit a linear regression with miles per gallon as the outcome and horsepower as the predictor. Interpret your coefficients, recenter for the intercept if necessary. Watch a video solution.⁵⁴
5. Refer to question 4. Overlay the fit onto a scatterplot. Watch a video solution.⁵⁵
6. Refer to question 4. Test the hypothesis of no linear relationship between horsepower and miles per gallon. Watch a video solution.⁵⁶
7. Refer to question 4. Predict the miles per gallon for a horsepower of 111. Watch a video solution.⁵⁷

⁵¹https://www.youtube.com/watch?v=LxA2x2VvPWo&index=19&list=PLpl-gQkQivXji7JK1OP1qS7zalwUBPrX0
⁵²https://www.youtube.com/watch?v=YtXTK9ztE00&index=20&list=PLpl-gQkQivXji7JK1OP1qS7zalwUBPrX0
⁵³https://www.youtube.com/watch?v=kB95XqatMho&index=21&list=PLpl-gQkQivXji7JK1OP1qS7zalwUBPrX0
⁵⁴https://www.youtube.com/watch?v=4yc5ACmtYMw&index=22&list=PLpl-gQkQivXji7JK1OP1qS7zalwUBPrX0
⁵⁵https://www.youtube.com/watch?v=mhskQnUIVO4&index=23&list=PLpl-gQkQivXji7JK1OP1qS7zalwUBPrX0
⁵⁶https://www.youtube.com/watch?v=zjP82piLr1E&index=24&list=PLpl-gQkQivXji7JK1OP1qS7zalwUBPrX0
⁵⁷https://www.youtube.com/watch?v=UxSrHtY_klY&index=25&list=PLpl-gQkQivXji7JK1OP1qS7zalwUBPrX0

Residuals

Watch this video before beginning⁵⁸

Residual variation

Residuals represent variation left unexplained by our model. We emphasize the difference between residuals and errors.
The errors are the unobservable true errors from the known coefficients, while residuals are the observable errors from the estimated coefficients. In a sense, the residuals are estimates of the errors.

Consider again the diamond data set from UsingR. Recall that the data is diamond prices (Singapore dollars) and diamond weight in carats (a standard measure of diamond mass, 0.2 $g$). To get the data use library(UsingR); data(diamond). Recall the data and our linear regression fit looked like the following:

⁵⁸https://www.youtube.com/watch?v=5vu-rW_FI0E&list=PLpl-gQkQivXjqHAJd2t-J_One_fYE55tC&index=13

Diamond data plotted along with best fitting regression line.

Recall our linear model was

$$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$$

where we are assuming that $\epsilon_i \sim N(0, \sigma^2)$.
Our observed outcome is $Y_i$ with associated predictor value, $X_i$. Let's label the predicted outcome for index $i$ as $\hat{Y}_i$. Recall that we obtain our predictions by plugging our observed $X_i$ into the linear regression equation:

$$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$$

The residual is defined as the difference between the observed and predicted outcome

$$e_i = Y_i - \hat{Y}_i .$$

The residuals are exactly the vertical distance between the observed data point and the associated point on the regression line.
Positive residuals have associated Y values above the fitted line and negative residuals have values below.

Picture of the residuals for the diamond data. Residuals are the signed length of the red lines.

Least squares minimizes the sum of the squared residuals, $\sum_{i=1}^n e_i^2$. Note that the $e_i$ are observable, while the errors, $\epsilon_i$, are not. The residuals can be thought of as estimates of the errors.

Properties of the residuals

Let's consider some properties of the residuals. First, under our model, their expected value is 0, $E[e_i] = 0$. If an intercept is included, $\sum_{i=1}^n e_i = 0$. Note this tells us that the residuals are not independent. If we know $n - 1$ of them, we know the $n$th. In fact, we will only have $n - p$ free residuals, where $p$ is the number of coefficients in our regression model, so $p = 2$ for linear regression with an intercept and slope. If a regressor variable, $X_i$, is included in the model then $\sum_{i=1}^n e_i X_i = 0$.
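These zero-sum properties are easy to verify numerically. Here is a minimal sketch for the diamond fit (the model is refit so the snippet is self-contained):

## Numerical check of the residual properties for the diamond fit
library(UsingR); data(diamond)
fit <- lm(price ~ carat, data = diamond)
e <- resid(fit)
sum(e)                  ## essentially zero, since an intercept is included
sum(e * diamond$carat)  ## essentially zero, since carat is an included regressor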
What do we use residuals for? Most importantly, residuals are useful for investigating poor model fit. Residual plots highlight poor model fit.

Another use for residuals is to create covariate adjusted variables. Specifically, residuals can be thought of as the outcome (Y) with the linear association of the predictor (X) removed. So, for example, if you wanted to create a weight variable with the linear effect of height removed, you would fit a linear regression with weight as the outcome and height as the predictor and take the residuals. (Note this only works if the relationship is linear.)
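Here is a small simulated sketch of this idea; the data and parameters below are made up for illustration, not taken from the text:

## Covariate adjustment via residuals: weight with the linear effect of height removed
n <- 100
height <- rnorm(n, mean = 68, sd = 3)            ## simulated heights (inches)
weight <- -150 + 4 * height + rnorm(n, sd = 10)  ## simulated weights (pounds)
weight_adj <- resid(lm(weight ~ height))
cor(weight_adj, height)  ## essentially zero: the linear association with height is gone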
Finally, we should note the different sorts of variation one encounters in regression. There's the total variability in our response, usually called total variation. One then differentiates residual variation (variation after removing the predictor) from systematic variation (variation explained by the regression model). These two kinds of variation add up to the total variation, which we'll see later.

Example

Watch this video before beginning⁵⁹

The code below shows how to obtain the residuals.

> data(diamond)
> y <- diamond$price; x <- diamond$carat; n <- length(y)
> fit <- lm(y ~ x)
> ## The easiest way to get the residuals
> e <- resid(fit)
> ## Obtain the residuals manually, get the predicted Ys first
> yhat <- predict(fit)
> ## The residuals are y - yhat. Let's check by comparing this
> ## with R's built in resid function
> max(abs(e - (y - yhat)))
[1] 9.486e-13
> ## Let's do it again hard coding the calculation of yhat
> max(abs(e - (y - coef(fit)[1] - coef(fit)[2] * x)))
[1] 9.486e-13

Residuals versus X

A useful plot is the residuals versus the X values. This allows us to zoom in on instances of poor model fit.
Whenever we look at a residual plot, we are searching for systematic patterns of any sort. Here's the plot for the diamond data.

⁵⁹https://www.youtube.com/watch?v=DSsSwKJ9frg&list=PLpl-gQkQivXjqHAJd2t-J_One_fYE55tC&index=14

Plot of the mass (horizontal) versus residuals (vertical)

Let's go through some more contrived examples to highlight what poor fit can look like. Here's a plot of nonlinear data where we've fit a line.

Plot of simulated non-linear data.

Here's what happens when you focus in on the residuals.

Plot of residuals versus X

Another common feature where our model fails is when the variance around the regression line is not constant. Remember our errors are assumed to be Gaussian with a constant variance. Here's an example where heteroskedasticity is not at all apparent in the scatterplot.

Scatterplot demonstrating heteroskedasticity.

Now look at the consequences of focusing in on the residuals.

Residuals versus X.

If we look at the residual plot for the diamond data, things don't look so bad.

Residuals versus X.
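For readers who want to reproduce plots like these, here is a rough sketch; the simulation settings (sample sizes, error standard deviations) are my own choices for illustration, not values from the text:

## Residuals versus X for the diamond fit
library(UsingR); data(diamond)
fit <- lm(price ~ carat, data = diamond)
plot(diamond$carat, resid(fit), xlab = "Mass (carats)", ylab = "Residuals (SIN $)")
abline(h = 0, lwd = 2)

## Nonlinear data fit with a line: the missed curvature shows up in the residuals
x <- runif(100, -3, 3)
y <- x + x^2 + rnorm(100, sd = 0.5)
plot(x, resid(lm(y ~ x)), xlab = "x", ylab = "Residuals")
abline(h = 0)

## Heteroskedasticity: hard to see in the scatterplot, obvious in the residuals
x <- runif(100, 0, 6)
y <- x + rnorm(100, sd = 0.01 * x)
plot(x, resid(lm(y ~ x)), xlab = "x", ylab = "Residuals")
abline(h = 0)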
However, there’s another parameter in ourmodel, σ. Recall that our model is Yi = β0 + β1 Xi + ϵi , where ϵi ∼ N (0, σ 2 ).It seems natural to use our residual variationto estimate population error variation. In fact, the∑maximum likelihood estimate of σ 2 is n1 ni=1 e2i , the average squared residual. Since the residuals⁶⁰https://www.youtube.com/watch?v=ZE3a4OZFWPA&list=PLpl-gQkQivXjqHAJd2t-J_One_fYE55tC&index=1542Residualshave a zero mean (if an intercept is included), this is close to the the calculating the variance of theresiduals. However, to obtain unbiasedness, most people use1 ∑ 2σ̂ =e.n − 2 i=1 in2The n − 2 instead of n is so that E[σ̂ 2 ] = σ 2 . This is exactly analogous to dividing by n − 1 in theordinary variance calculation.