Regression models for data science, page 7
In fact, the ordinary variance (using var in R on a vector) is exactly the same as the residual variance estimate from a model that has an intercept and no slope. The n − 2 instead of n − 1 when we include a slope can be thought of as losing a degree of freedom from having to estimate an extra parameter (the slope). Most of this is typically opaque to the user, since we just grab the correct residual variance output from lm. But, to solidify the concepts, let's go through the diamond example to make sure that we can hard code the estimates on our own.
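As a quick numerical aside (not part of the original text), the equivalence between the ordinary variance and the intercept-only residual variance is easy to check. Here is a minimal sketch in Python with NumPy (the book's own code is R), using a made-up response vector:

```python
import numpy as np

# Made-up response vector (hypothetical, not the diamond data).
y = np.array([12.0, 15.0, 9.0, 20.0, 14.0, 11.0])
n = len(y)

# An intercept-only model fits the sample mean, so its residuals are the
# mean-centered data and its residual variance divides by n - 1.
resid = y - y.mean()
s2_intercept_only = np.sum(resid**2) / (n - 1)

# This equals the ordinary sample variance (var() in R, ddof=1 in NumPy).
print(s2_intercept_only)   # 14.7
print(np.var(y, ddof=1))   # 14.7
```

Adding a slope would change the divisor to n − 2, one degree of freedom per estimated parameter.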
(And from then on we'll just use lm.)

Diamond example

Finding residual variance estimates.

> y <- diamond$price; x <- diamond$carat; n <- length(y)
> fit <- lm(y ~ x)
## the estimate from lm
> summary(fit)$sigma
[1] 31.84
## directly calculating from the residuals
> sqrt(sum(resid(fit)^2) / (n - 2))
[1] 31.84

Summarizing variation

A way to think about regression is in the decomposition of variability of our response.
The total variability in our response is the variability around an intercept. This is also the variance estimate from a model with only an intercept:

Total variability = ∑_{i=1}^n (Y_i − Ȳ)²

The regression variability is the variability that is explained by adding the predictor. Mathematically, this is:

Regression variability = ∑_{i=1}^n (Ŷ_i − Ȳ)²

The residual variability is what's left over around the regression line:

Residual variability = ∑_{i=1}^n (Y_i − Ŷ_i)²

It's a nice fact that the error and regression variability add up to the total variability:

∑_{i=1}^n (Y_i − Ȳ)² = ∑_{i=1}^n (Y_i − Ŷ_i)² + ∑_{i=1}^n (Ŷ_i − Ȳ)²

Thus, we can think of regression as explaining away variability.
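This decomposition is easy to verify numerically. Below is a small sketch in Python (the chapter's own code is R) on made-up data: fit a least-squares line, then check that the total variability splits exactly into regression plus residual variability:

```python
import numpy as np

# Made-up predictor/response pairs (hypothetical, not the diamond data).
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
y = np.array([1.1, 2.0, 2.6, 3.9, 4.4, 5.2])

# Least-squares slope and intercept via the usual closed forms.
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
beta0 = y.mean() - beta1 * x.mean()
y_hat = beta0 + beta1 * x

total      = np.sum((y - y.mean())**2)      # variability around the mean
regression = np.sum((y_hat - y.mean())**2)  # explained by the predictor
residual   = np.sum((y - y_hat)**2)         # left over around the line

# The identity: total = regression + residual (up to floating point).
print(np.isclose(total, regression + residual))  # True
```

The identity holds for any least-squares fit with an intercept, because the residuals are orthogonal to the fitted values.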
The fact that all of the quantities are positive and that they add up this way allows us to define the proportion of the total variability explained by the model.

Consider our diamond example again. The plot below shows the variation explained by a model with an intercept only (representing total variation) and that when the mass is included as a linear predictor. Notice how much the variation decreases when including the diamond mass.

Here's the code:

e = c(resid(lm(price ~ 1, data = diamond)),
      resid(lm(price ~ carat, data = diamond)))
fit = factor(c(rep("Itc", nrow(diamond)),
               rep("Itc, slope", nrow(diamond))))
g = ggplot(data.frame(e = e, fit = fit), aes(y = e, x = fit, fill = fit))
g = g + geom_dotplot(binaxis = "y", size = 2, stackdir = "center", binwidth = 20)
g = g + xlab("Fitting approach")
g = g + ylab("Residual price")
g

Figure: Residuals for intercept only and linear regression for the diamond example.

R squared

R squared is the percentage of the total variability that is explained by the linear relationship with the predictor:

R² = ∑_{i=1}^n (Ŷ_i − Ȳ)² / ∑_{i=1}^n (Y_i − Ȳ)²

Here are some summary notes about R squared.

• R² is the percentage of variation explained by the regression model.
• 0 ≤ R² ≤ 1.
• R² is the sample correlation squared.
• R² can be a misleading summary of model fit.
  – Deleting data can inflate it.
  – (For later.) Adding terms to a regression model always increases R².

Anscombe's data set (named after its inventor) is a famous example of how R squared doesn't tell the whole story about model fit.
In this example, four data sets have equivalent R squared values and beta values, but dramatically different model fits. The result is to suggest that reducing data to a single number, be it R squared, a test statistic or a P-value, often masks important aspects of the data. The code is quite simple: data(anscombe); example(anscombe).

Figure: Plot of Anscombe's data set.

Exercises

1. Fit a linear regression model to the father.son dataset with the father as the predictor and the son as the outcome.
Plot the son's height (horizontal axis) versus the residuals (vertical axis). Watch a video solution.⁶¹
2. Refer to question 1. Directly estimate the residual variance and compare this estimate to the output of lm. Watch a video solution.⁶²
3. Refer to question 1. Give the R squared for this model. Watch a video solution.⁶³
4. Load the mtcars dataset.
Fit a linear regression with miles per gallon as the outcome and horsepower as the predictor. Plot horsepower versus the residuals. Watch a video solution.⁶⁴
5. Refer to question 4. Directly estimate the residual variance and compare this estimate to the output of lm. Watch a video solution.⁶⁵
6. Refer to question 4. Give the R squared for this model. Watch a video solution.⁶⁶

⁶¹https://www.youtube.com/watch?v=WnFuqlS3vvc&index=26&list=PLpl-gQkQivXji7JK1OP1qS7zalwUBPrX0
⁶²https://www.youtube.com/watch?v=M5scUi6JTCI&index=27&list=PLpl-gQkQivXji7JK1OP1qS7zalwUBPrX0
⁶³https://www.youtube.com/watch?v=A3IqBqjbVjo&index=28&list=PLpl-gQkQivXji7JK1OP1qS7zalwUBPrX0
⁶⁴https://www.youtube.com/watch?v=g0YPXDpQ15s&list=PLpl-gQkQivXji7JK1OP1qS7zalwUBPrX0&index=29
⁶⁵https://www.youtube.com/watch?v=R_RPGz4UpO4&list=PLpl-gQkQivXji7JK1OP1qS7zalwUBPrX0&index=30
⁶⁶https://www.youtube.com/watch?v=eavifxTZgfQ&list=PLpl-gQkQivXji7JK1OP1qS7zalwUBPrX0&index=31

Regression inference

Watch this before beginning.⁶⁷

In this chapter, we'll consider statistical inference for regression models.

Reminder of the model

Consider our regression model:

Y_i = β_0 + β_1 X_i + ϵ_i

where ϵ ∼ N(0, σ²).
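As a quick numerical aside (not part of the original text), the Gaussian linear model above can be simulated to see that least squares recovers the true parameters. Here is a sketch in Python (the book's code is R), with made-up parameter values chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up "true" parameters for this simulation only.
beta0_true, beta1_true, sigma = 1.0, 2.0, 0.5

n = 10_000
x = rng.uniform(0.0, 1.0, size=n)
y = beta0_true + beta1_true * x + rng.normal(0.0, sigma, size=n)

# The least-squares estimates, written via the correlation and the
# standard deviations, as in the text.
beta1_hat = np.corrcoef(y, x)[0, 1] * np.std(y, ddof=1) / np.std(x, ddof=1)
beta0_hat = y.mean() - beta1_hat * x.mean()

print(round(beta0_hat, 2), round(beta1_hat, 2))  # close to 1.0 and 2.0
```

With n this large the estimates land within a few hundredths of the true values; rerunning with a smaller n shows the sampling variability that the inference machinery below quantifies.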
Let's consider some ways for doing inference for our regression parameters. For this development, we assume that the true model is known. We also assume that you've seen confidence intervals and hypothesis tests before. If not, consider taking the Statistical Inference course and book before approaching this material.

Remember our estimates:

β̂_0 = Ȳ − β̂_1 X̄

and

β̂_1 = Cor(Y, X) Sd(Y) / Sd(X).

Review

Let's review some important components of statistical inference.
Consider statistics like the following:

(θ̂ − θ) / σ̂_θ̂

where θ̂ is an estimate of interest, θ is the estimand of interest and σ̂_θ̂ is the standard error of θ̂. We see that in many cases such statistics often have the following properties:

1. They are normally distributed and have a finite sample Student's T distribution under normality assumptions.
2. They can be used to test H_0 : θ = θ_0 versus H_a : θ >, <, ≠ θ_0.
3. They can be used to create a confidence interval for θ via θ̂ ± Q_{1−α/2} σ̂_θ̂, where Q_{1−α/2} is the relevant quantile from either a normal or T distribution.

In the case of regression with iid Gaussian sampling assumptions on the errors, our inferences will follow very similarly to what you saw in your inference class. We won't cover asymptotics for regression analysis, but suffice it to say that under assumptions on the ways in which the X values are collected, the iid sampling model, and mean model, the normal results hold to create intervals and confidence intervals.

⁶⁷https://www.youtube.com/watch?v=vSdws014e4k&list=PLpl-gQkQivXjqHAJd2t-J_One_fYE55tC&index=16

Results for the regression parameters

First, we need standard errors for our regression parameters.
These are given by:

σ²_β̂₁ = Var(β̂_1) = σ² / ∑_{i=1}^n (X_i − X̄)²

and

σ²_β̂₀ = Var(β̂_0) = (1/n + X̄² / ∑_{i=1}^n (X_i − X̄)²) σ²

In practice, σ is replaced by its residual variance estimate covered in the last chapter. Given how often this came up in inference, it's probably not surprising that under iid Gaussian errors

(β̂_j − β_j) / σ̂_β̂ⱼ

follows a t distribution with n − 2 degrees of freedom and a normal distribution for large n. This can be used to create confidence intervals and perform hypothesis tests.

Example diamond data set

Watch this before beginning⁶⁸

Let's go through a didactic example using our diamond pricing data. First, let's define our outcome, predictor and estimate all of the parameters.
(Note, again we're hard coding these results, but lm will give it to us automatically.)

⁶⁸https://www.youtube.com/watch?v=V4Y7MHbn3lw&list=PLpl-gQkQivXjqHAJd2t-J_One_fYE55tC&index=17

library(UsingR); data(diamond)
y <- diamond$price; x <- diamond$carat; n <- length(y)
beta1 <- cor(y, x) * sd(y) / sd(x)
beta0 <- mean(y) - beta1 * mean(x)
e <- y - beta0 - beta1 * x
sigma <- sqrt(sum(e^2) / (n-2))
ssx <- sum((x - mean(x))^2)

Now let's calculate the standard errors for our regression coefficients and the t statistic.
The natural null hypotheses are H_0 : β_j = 0. So our t statistics are just the estimates divided by their standard errors.

seBeta0 <- (1 / n + mean(x) ^ 2 / ssx) ^ .5 * sigma
seBeta1 <- sigma / sqrt(ssx)
tBeta0 <- beta0 / seBeta0
tBeta1 <- beta1 / seBeta1

Now let's consider getting P-values. Recall that P-values are the probability of getting a statistic as large as or larger than was actually obtained, where the probability is calculated under the null hypothesis. Below I also do some formatting to get it to look like the output from lm.

> pBeta0 <- 2 * pt(abs(tBeta0), df = n - 2, lower.tail = FALSE)
> pBeta1 <- 2 * pt(abs(tBeta1), df = n - 2, lower.tail = FALSE)
> coefTable <- rbind(c(beta0, seBeta0, tBeta0, pBeta0), c(beta1, seBeta1, tBeta1, pBeta1))
> colnames(coefTable) <- c("Estimate", "Std. Error", "t value", "P(>|t|)")
> rownames(coefTable) <- c("(Intercept)", "x")
> coefTable
            Estimate Std. Error t value   P(>|t|)
(Intercept)   -259.6      17.32  -14.99 2.523e-19
x             3721.0      81.79   45.50 6.751e-40

So the first column are the actual estimates. The second is the standard errors, the third is the t value (the first divided by the second) and the final is the t probability of getting an unsigned statistic that large under the null hypothesis (the P-value for the two sided test). Of course, we don't actually go through this exercise every time; lm does it for us.

> fit <- lm(y ~ x);
> summary(fit)$coefficients
Estimate Std.