Regression models for data science, page 7
In fact, the ordinary variance (using var in R on a vector) is exactly the same as the residual variance estimate from a model that has an intercept and no slope. The n − 2 instead of n − 1 when we include a slope can be thought of as losing a degree of freedom from having to estimate an extra parameter (the slope). Most of this is typically opaque to the user, since we just grab the correct residual variance output from lm. But, to solidify the concepts, let's go through the diamond example to make sure that we can hard code the estimates on our own.
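As a quick numerical aside (not part of the original text), the equivalence between the ordinary variance and the intercept-only residual variance is easy to check. Here is a minimal sketch in Python with NumPy (the book's own code is R), using a made-up response vector:

```python
import numpy as np

# Made-up response vector (hypothetical, not the diamond data).
y = np.array([12.0, 15.0, 9.0, 20.0, 14.0, 11.0])
n = len(y)

# An intercept-only model fits the sample mean, so its residuals are the
# mean-centered data and its residual variance divides by n - 1.
resid = y - y.mean()
s2_intercept_only = np.sum(resid**2) / (n - 1)

# This equals the ordinary sample variance (var() in R, ddof=1 in NumPy).
print(s2_intercept_only)   # 14.7
print(np.var(y, ddof=1))   # 14.7
```

Adding a slope would change the divisor to n − 2, one degree of freedom per estimated parameter.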
(And from then on we'll just use lm.)

Diamond example

Finding residual variance estimates.

> y <- diamond$price; x <- diamond$carat; n <- length(y)
> fit <- lm(y ~ x)
## the estimate from lm
> summary(fit)$sigma
[1] 31.84
## directly calculating from the residuals
> sqrt(sum(resid(fit)^2) / (n - 2))
[1] 31.84

Summarizing variation

A way to think about regression is in the decomposition of variability of our response.
The total variability in our response is the variability around an intercept. This is also the variance estimate from a model with only an intercept:

Total variability = ∑_{i=1}^n (Y_i − Ȳ)²

The regression variability is the variability that is explained by adding the predictor. Mathematically, this is:

Regression variability = ∑_{i=1}^n (Ŷ_i − Ȳ)²

The residual variability is what's left over around the regression line:

Residual variability = ∑_{i=1}^n (Y_i − Ŷ_i)²

It's a nice fact that the error and regression variability add up to the total variability:

∑_{i=1}^n (Y_i − Ȳ)² = ∑_{i=1}^n (Y_i − Ŷ_i)² + ∑_{i=1}^n (Ŷ_i − Ȳ)²

Thus, we can think of regression as explaining away variability.
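This decomposition is easy to verify numerically. Below is a small sketch in Python (the chapter's own code is R) on made-up data: fit a least-squares line, then check that the total variability splits exactly into regression plus residual variability:

```python
import numpy as np

# Made-up predictor/response pairs (hypothetical, not the diamond data).
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
y = np.array([1.1, 2.0, 2.6, 3.9, 4.4, 5.2])

# Least-squares slope and intercept via the usual closed forms.
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
beta0 = y.mean() - beta1 * x.mean()
y_hat = beta0 + beta1 * x

total      = np.sum((y - y.mean())**2)      # variability around the mean
regression = np.sum((y_hat - y.mean())**2)  # explained by the predictor
residual   = np.sum((y - y_hat)**2)         # left over around the line

# The identity: total = regression + residual (up to floating point).
print(np.isclose(total, regression + residual))  # True
```

The identity holds for any least-squares fit with an intercept, because the residuals are orthogonal to the fitted values.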
The fact that all of the quantities are positive and that they add up this way allows us to define the proportion of the total variability explained by the model.

Consider our diamond example again. The plot below shows the variation explained by a model with an intercept only (representing total variation) and that when the mass is included as a linear predictor. Notice how much the variation decreases when including the diamond mass.

Here's the code:

e = c(resid(lm(price ~ 1, data = diamond)),
      resid(lm(price ~ carat, data = diamond)))
fit = factor(c(rep("Itc", nrow(diamond)),
               rep("Itc, slope", nrow(diamond))))
g = ggplot(data.frame(e = e, fit = fit), aes(y = e, x = fit, fill = fit))
g = g + geom_dotplot(binaxis = "y", size = 2, stackdir = "center", binwidth = 20)
g = g + xlab("Fitting approach")
g = g + ylab("Residual price")
g

Figure: Residuals for intercept only and linear regression for the diamond example.

R squared

R squared is the percentage of the total variability that is explained by the linear relationship with the predictor:

R² = ∑_{i=1}^n (Ŷ_i − Ȳ)² / ∑_{i=1}^n (Y_i − Ȳ)²

Here are some summary notes about R squared.

• R² is the percentage of variation explained by the regression model.
• 0 ≤ R² ≤ 1.
• R² is the sample correlation squared.
• R² can be a misleading summary of model fit.
  – Deleting data can inflate it.
  – (For later.) Adding terms to a regression model always increases R².

Anscombe's data set (named after its inventor) is a famous example of how R squared doesn't tell the whole story about model fit.
In this example, four data sets have equivalent R squared values and beta values, but dramatically different model fits. The result is to suggest that reducing data to a single number, be it R squared, a test statistic or a P-value, often masks important aspects of the data. The code is quite simple: data(anscombe); example(anscombe).

Figure: Plot of Anscombe's data set.

Exercises

1. Fit a linear regression model to the father.son dataset with the father as the predictor and the son as the outcome.
Plot the son's height (horizontal axis) versus the residuals (vertical axis). Watch a video solution.⁶¹
2. Refer to question 1. Directly estimate the residual variance and compare this estimate to the output of lm. Watch a video solution.⁶²
3. Refer to question 1. Give the R squared for this model. Watch a video solution.⁶³
4. Load the mtcars dataset.
Fit a linear regression with miles per gallon as the outcome and horsepower as the predictor. Plot horsepower versus the residuals. Watch a video solution.⁶⁴
5. Refer to question 4. Directly estimate the residual variance and compare this estimate to the output of lm. Watch a video solution.⁶⁵
6. Refer to question 4. Give the R squared for this model. Watch a video solution.⁶⁶

⁶¹https://www.youtube.com/watch?v=WnFuqlS3vvc&index=26&list=PLpl-gQkQivXji7JK1OP1qS7zalwUBPrX0
⁶²https://www.youtube.com/watch?v=M5scUi6JTCI&index=27&list=PLpl-gQkQivXji7JK1OP1qS7zalwUBPrX0
⁶³https://www.youtube.com/watch?v=A3IqBqjbVjo&index=28&list=PLpl-gQkQivXji7JK1OP1qS7zalwUBPrX0
⁶⁴https://www.youtube.com/watch?v=g0YPXDpQ15s&list=PLpl-gQkQivXji7JK1OP1qS7zalwUBPrX0&index=29
⁶⁵https://www.youtube.com/watch?v=R_RPGz4UpO4&list=PLpl-gQkQivXji7JK1OP1qS7zalwUBPrX0&index=30
⁶⁶https://www.youtube.com/watch?v=eavifxTZgfQ&list=PLpl-gQkQivXji7JK1OP1qS7zalwUBPrX0&index=31

Regression inference

Watch this before beginning.⁶⁷

In this chapter, we'll consider statistical inference for regression models.

Reminder of the model

Consider our regression model:

Y_i = β_0 + β_1 X_i + ϵ_i

where ϵ ∼ N(0, σ²).
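As a quick numerical aside (not part of the original text), the Gaussian linear model above can be simulated to see that least squares recovers the true parameters. Here is a sketch in Python (the book's code is R), with made-up parameter values chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up "true" parameters for this simulation only.
beta0_true, beta1_true, sigma = 1.0, 2.0, 0.5

n = 10_000
x = rng.uniform(0.0, 1.0, size=n)
y = beta0_true + beta1_true * x + rng.normal(0.0, sigma, size=n)

# The least-squares estimates, written via the correlation and the
# standard deviations, as in the text.
beta1_hat = np.corrcoef(y, x)[0, 1] * np.std(y, ddof=1) / np.std(x, ddof=1)
beta0_hat = y.mean() - beta1_hat * x.mean()

print(round(beta0_hat, 2), round(beta1_hat, 2))  # close to 1.0 and 2.0
```

With n this large the estimates land within a few hundredths of the true values; rerunning with a smaller n shows the sampling variability that the inference machinery below quantifies.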
Let's consider some ways for doing inference for our regression parameters. For this development, we assume that the true model is known. We also assume that you've seen confidence intervals and hypothesis tests before. If not, consider taking the Statistical Inference course and book before approaching this material.

Remember our estimates:

β̂_0 = Ȳ − β̂_1 X̄

and

β̂_1 = Cor(Y, X) Sd(Y) / Sd(X).

Review

Let's review some important components of statistical inference.
Consider statistics like the following:

(θ̂ − θ) / σ̂_θ̂

where θ̂ is an estimate of interest, θ is the estimand of interest and σ̂_θ̂ is the standard error of θ̂. We see that in many cases such statistics often have the following properties:

1. They are normally distributed and have a finite sample Student's T distribution under normality assumptions.
2. They can be used to test H_0 : θ = θ_0 versus H_a : θ >, <, ≠ θ_0.
3. They can be used to create a confidence interval for θ via θ̂ ± Q_{1−α/2} σ̂_θ̂, where Q_{1−α/2} is the relevant quantile from either a normal or T distribution.

In the case of regression with iid Gaussian sampling assumptions on the errors, our inferences will follow very similarly to what you saw in your inference class. We won't cover asymptotics for regression analysis, but suffice it to say that under assumptions on the ways in which the X values are collected, the iid sampling model, and mean model, the normal results hold to create intervals and confidence intervals.

⁶⁷https://www.youtube.com/watch?v=vSdws014e4k&list=PLpl-gQkQivXjqHAJd2t-J_One_fYE55tC&index=16

Results for the regression parameters

First, we need standard errors for our regression parameters.
These are given by:

σ²_β̂₁ = Var(β̂_1) = σ² / ∑_{i=1}^n (X_i − X̄)²

and

σ²_β̂₀ = Var(β̂_0) = (1/n + X̄² / ∑_{i=1}^n (X_i − X̄)²) σ²

In practice, σ is replaced by its residual variance estimate covered in the last chapter. Given how often this came up in inference, it's probably not surprising that under iid Gaussian errors

(β̂_j − β_j) / σ̂_β̂ⱼ

follows a t distribution with n − 2 degrees of freedom and a normal distribution for large n. This can be used to create confidence intervals and perform hypothesis tests.

Example diamond data set

Watch this before beginning⁶⁸

Let's go through a didactic example using our diamond pricing data. First, let's define our outcome, predictor and estimate all of the parameters.
(Note, again we're hard coding these results, but lm will give it to us automatically.)

⁶⁸https://www.youtube.com/watch?v=V4Y7MHbn3lw&list=PLpl-gQkQivXjqHAJd2t-J_One_fYE55tC&index=17

library(UsingR); data(diamond)
y <- diamond$price; x <- diamond$carat; n <- length(y)
beta1 <- cor(y, x) * sd(y) / sd(x)
beta0 <- mean(y) - beta1 * mean(x)
e <- y - beta0 - beta1 * x
sigma <- sqrt(sum(e^2) / (n-2))
ssx <- sum((x - mean(x))^2)

Now let's calculate the standard errors for our regression coefficients and the t statistic.
The natural null hypotheses are H_0 : β_j = 0. So our t statistics are just the estimates divided by their standard errors.

seBeta0 <- (1 / n + mean(x) ^ 2 / ssx) ^ .5 * sigma
seBeta1 <- sigma / sqrt(ssx)
tBeta0 <- beta0 / seBeta0
tBeta1 <- beta1 / seBeta1

Now let's consider getting P-values. Recall that P-values are the probability of getting a statistic as large as or larger than was actually obtained, where the probability is calculated under the null hypothesis. Below I also do some formatting to get it to look like the output from lm.

> pBeta0 <- 2 * pt(abs(tBeta0), df = n - 2, lower.tail = FALSE)
> pBeta1 <- 2 * pt(abs(tBeta1), df = n - 2, lower.tail = FALSE)
> coefTable <- rbind(c(beta0, seBeta0, tBeta0, pBeta0), c(beta1, seBeta1, tBeta1, pBeta1))
> colnames(coefTable) <- c("Estimate", "Std. Error", "t value", "P(>|t|)")
> rownames(coefTable) <- c("(Intercept)", "x")
> coefTable
            Estimate Std. Error t value   P(>|t|)
(Intercept)   -259.6      17.32  -14.99 2.523e-19
x             3721.0      81.79   45.50 6.751e-40

So the first column are the actual estimates. The second is the standard errors, the third is the t value (the first divided by the second) and the final is the t probability of getting an unsigned statistic that large under the null hypothesis (the P-value for the two sided test). Of course, we don't actually go through this exercise every time; lm does it for us.

> fit <- lm(y ~ x);
> summary(fit)$coefficients
Estimate Std.