Regression models for data science (779323), page 15

File No. 779323: Regression models for data science, page 15 (2017-12-25, StudIzba)


This is useful, since there are countless ways that a model can be wrong. In this lecture, we’ll focus on variable inclusion and exclusion.

The Rumsfeldian triplet

Before we begin, I’d like to give a quote from Donald Rumsfeld, the controversial Secretary of Defense of the US during the start of the Afghanistan and second Iraq wars. He gave this quote regarding weapons of mass destruction (read more about it here⁹⁹):

“There are known knowns. These are things we know that we know. There are known unknowns. That is to say, there are things that we know we don’t know. But there are also unknown unknowns. There are things we don’t know we don’t know.” - Donald Rumsfeld

This quote, widely derided for its intended purpose, is quite insightful in the unintended context of regression model selection.

Specifically, in our context, “Known Knowns” are regressors that we know we should check for inclusion in the model and that we have. The “Known Unknowns” are regressors that we would like to include in the model, but don’t have. The “Unknown Unknowns” are regressors that we don’t even know about that we should have included in the model.

In this chapter, we’ll talk about Known Knowns: variables that are potentially of interest in our model that we have. Known Unknowns and (especially) Unknown Unknowns are more challenging to deal with.

A central method for dealing with Unknown Unknowns is randomization. If you’d like to compare a treatment to a control, or perform an A/B test of two advertising strategies, randomization will help ensure that your treatment is balanced across levels of the Unknown Unknowns with high probability.

(Of course, since they are unobserved, you can never know whether or not the randomization was effective.)

For Known Unknowns, those variables we wish we had collected and are aware of, there are several strategies. For example, a proxy variable might be of use. As an example, we had some brain volumetric measurements via MRIs and really wished we had done the processing to get intra-cranial volume (head size). We needed this variable because we didn’t want to compare brain volumetric measurements and conclude that bigger people with bigger heads have more brain mass. This would be a useless conclusion; for example, whales have bigger brains than dolphins, but that doesn’t tell you much about whales or dolphins.

More interesting would be if whales who were exposed to toxic chemicals had lower brain volume relative to their intra-cranial volume than whales who weren’t exposed. In our case (we were studying humans), we used height, gender and other anthropometric measurements to get a good guess of intra-cranial volume.

For the rest of the lecture, let’s discuss the known knowns and what their unnecessary inclusion and exclusion implies in our analysis.

General rules

Here we state a couple of general rules regarding model selection for our known knowns.

⁹⁹ https://en.wikipedia.org/wiki/There_are_known_knowns

Multiple variables and model selection

• Omitting variables results in bias in the coefficients of interest - unless the regressors are uncorrelated with the omitted ones.

I want to reiterate this point: if the omitted variable is uncorrelated with the included variables, its omission does not bias the estimated coefficients.
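The omitted-variable rule can be checked with a small simulation. The following is only a sketch under assumed settings, not code from the original text: the helper name mean_slope_omitting_z, the correlation value 0.7, the unit coefficient on z, and the seed are all hypothetical choices made for illustration. It fits y ~ x while leaving out a second regressor z, once with z correlated with x and once with z uncorrelated.

```r
# Sketch: omitted-variable bias under assumed (hypothetical) settings.
set.seed(1)
n <- 100; nosim <- 1000

# Average estimated x coefficient when z is left out of the model.
# rho is the assumed correlation between x and the omitted regressor z.
mean_slope_omitting_z <- function(rho) {
  slopes <- sapply(1 : nosim, function(i) {
    x <- rnorm(n)
    z <- rho * x + sqrt(1 - rho^2) * rnorm(n)
    y <- x + z + rnorm(n, sd = .3)   # true coefficient of x is 1
    coef(lm(y ~ x))[2]               # z is deliberately omitted
  })
  mean(slopes)
}

biased   <- mean_slope_omitting_z(rho = 0.7)  # z correlated with x
unbiased <- mean_slope_omitting_z(rho = 0)    # z uncorrelated with x
round(c(correlated = biased, uncorrelated = unbiased), 2)
```

With these assumed settings the first average should land near 1.7 (the true coefficient 1 plus rho times the coefficient of z), while the second should stay near the true value of 1.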

An omitted variable that is uncorrelated with the included regressors might still explain some residual variation, so it could have an impact on inference. As previously mentioned, this lack of impact of uncorrelated variables is why we randomize treatments; randomization attempts to disassociate our treatment indicator from variables that we don’t have to put in the model. Formal theories of inference can be designed around the use of randomization. However, in a practical sense, if there are too many unobserved confounding variables, even randomization won’t help you, since with high probability one will stay correlated with the treatment.

In most cases we won’t have randomization. So, to avoid bias, why don’t we throw everything into the regression model? The following rule prevents us from doing that:

• Including variables that we shouldn’t have increases standard errors of the regression variables.

Actually, including any new variables increases the actual (not estimated) standard errors of the other regressors, unless the new variables are exactly orthogonal to them.

So we don’t want to idly throw variables into the model. In addition, the model must tend toward perfect fit as the number of non-redundant regressors approaches the sample size. Our R^2 increases monotonically as more regressors are included, even unrelated white noise.

R squared goes up as you put regressors in the model

Let’s try a simulation. In this simulation, no regression relationship exists. We simulate data and p regressors as random normals.

The plot is of the R^2.

    n <- 100
    plot(c(1, n), 0 : 1, type = "n", frame = FALSE, xlab = "p", ylab = "R^2")
    y <- rnorm(n); x <- NULL; r <- NULL
    for (i in 1 : n){
      x <- cbind(x, rnorm(n))
      r <- c(r, summary(lm(y ~ x))$r.squared)
    }
    lines(1 : n, r, lwd = 3)
    abline(h = 1)

Figure: plot of R^2 by p as more regressors are included. No actual regression relationship exists.

Notice that the R^2 goes up, monotonically, as the number of regressors is increased. This reminds us of a couple of things.

First, irrelevant variables explain residual variation by chance. Second, when evaluating fit, we have to take into account the number of regressors included. The adjusted R^2 is better for these purposes than R^2 since it accounts for the number of variables included in the model. In R, you can get the adjusted R^2 very easily by grabbing summary(fitted_model)$adj.r.squared instead of summary(fitted_model)$r.squared.

Simulation demonstrating variance inflation

Watch this video before beginning.¹⁰⁰

Now let’s use simulation to demonstrate variance inflation.

In this case, we’re going to simulate three regressors, x1, x2 and x3. We then repeatedly generate data from a model where y only depends on x1. We fit three models, y ~ x1, y ~ x1 + x2, and y ~ x1 + x2 + x3. We do this over and over again and look at the standard deviation of the x1 coefficient.

¹⁰⁰ https://youtu.be/sP5JJlOCNNo?list=PLpl-gQkQivXjqHAJd2t-J_One_fYE55tC

    > n <- 100; nosim <- 1000
    > x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n);
    > betas <- sapply(1 : nosim, function(i){
        y <- x1 + rnorm(n, sd = .3)
        c(coef(lm(y ~ x1))[2],
          coef(lm(y ~ x1 + x2))[2],
          coef(lm(y ~ x1 + x2 + x3))[2])
      })
    > round(apply(betas, 1, sd), 5)
         x1      x1      x1
    0.02839 0.02872 0.02884

Notice that the standard error for the x1 coefficient goes up as more regressors are included (left to right in our vector output).

It’s important to note that these are the actual standard errors (obtained by repeatedly simulating the data). These aren’t obtainable in a single dataset since we only get one realization. The estimated standard errors, the ones we have access to in a data analysis, may not go up as you include more regressors.

Now let’s see if we can make the variance inflation worse. In this case, I’ve made x2 and x3 correlated with x1.

    > n <- 100; nosim <- 1000
    > x1 <- rnorm(n); x2 <- x1/sqrt(2) + rnorm(n) /sqrt(2)
    > x3 <- x1 * 0.95 + rnorm(n) * sqrt(1 - 0.95^2);
    > betas <- sapply(1 : nosim, function(i){
        y <- x1 + rnorm(n, sd = .3)
        c(coef(lm(y ~ x1))[2],
          coef(lm(y ~ x1 + x2))[2],
          coef(lm(y ~ x1 + x2 + x3))[2])
      })
    > round(apply(betas, 1, sd), 5)
         x1      x1      x1
    0.03131 0.04270 0.09653

Notice that the variance inflation goes up quite a bit more.
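Because the regressors are held fixed across the simulated datasets and the simulation uses a known error sd, the actual standard deviation of the x1 coefficient can also be computed without simulation: it is the error sd times the square root of the x1 diagonal entry of (X'X)^{-1}. The sketch below is an illustration rather than part of the original text; the helper name exact_sd and the seed are assumptions, and the setup otherwise mirrors the correlated-regressor simulation.

```r
# Sketch: exact vs simulated sd of the x1 coefficient (assumed seed and helper name).
set.seed(2)
n <- 100; nosim <- 1000; sigma <- 0.3
x1 <- rnorm(n)
x2 <- x1 / sqrt(2) + rnorm(n) / sqrt(2)
x3 <- x1 * 0.95 + rnorm(n) * sqrt(1 - 0.95^2)

# Exact sd of the x1 coefficient for a given set of regressors:
# sigma * sqrt of the x1 diagonal entry of (X'X)^{-1}.
exact_sd <- function(X) {
  X <- cbind(1, X)                       # add the intercept column
  sigma * sqrt(solve(t(X) %*% X)[2, 2])  # column 2 is x1
}
exact <- c(exact_sd(x1), exact_sd(cbind(x1, x2)), exact_sd(cbind(x1, x2, x3)))

# Simulated sd for comparison, with the regressors held fixed
betas <- sapply(1 : nosim, function(i){
  y <- x1 + rnorm(n, sd = sigma)
  c(coef(lm(y ~ x1))[2],
    coef(lm(y ~ x1 + x2))[2],
    coef(lm(y ~ x1 + x2 + x3))[2])
})
simulated <- apply(betas, 1, sd)
rbind(exact = exact, simulated = simulated)
```

The two rows should agree up to simulation error, and the exact row can never decrease from left to right, since adding a regressor cannot shrink that diagonal entry.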

Variance inflation is an issue when including variables that are highly correlated with the ones that we are interested in. In the first simulation, the regressors were simulated independently, and the variance inflation wasn’t bad. In the second, they were correlated and it was much worse.

Summary of variance inflation

• Notice variance inflation was much worse when we included a variable that was highly related to x1.
• We don’t know σ^2, the residual variance, so we can’t know the actual variance inflation amount.
• However, σ drops out of the ratio of the standard errors. Thus, if one sequentially adds variables, one can check the variance (or sd) inflation for including each one.
• When the other regressors are actually orthogonal (correlation 0) to the regressor of interest, then there is no variance inflation.
• The variance inflation factor (VIF) is the increase in the variance for the ith regressor compared to the ideal setting where it is orthogonal to the other regressors.
  – The square root of the VIF is the increase in the sd instead of variance.
• Remember, variance inflation is only part of the picture. We want to include certain variables, even if they dramatically inflate our variance.

Let’s revisit our previous simulation to show how one can estimate the relative increase in variance. Let’s simulate a single dataset, and I’ll show how to get the relative increase in variance for including x2 and x3. All you need to do is take the ratio of the variances for that coefficient. If you don’t exactly understand the code, don’t worry. The idea is that we can obtain these from an observed data set.

    > y <- x1 + rnorm(n, sd = .3)
    > a <- summary(lm(y ~ x1))$cov.unscaled[2,2]
    > c(summary(lm(y ~ x1 + x2))$cov.unscaled[2,2],
        summary(lm(y ~ x1 + x2 + x3))$cov.unscaled[2,2]) / a
    [1] 1.895 9.948

Now let’s check it by referring to our previous simulation and see what the relative variance for x1 is in the models that include x2, and x2 plus x3.

    > temp <- apply(betas, 1, var); temp[2 : 3] / temp[1]
       x1    x1
    1.860 9.506

Notice that it’s about the same.

In other words, from a single observed dataset we can obtain the relative variance inflation caused by adding a regressor exactly, since it depends only on the regressors and not on the outcome.

Swiss data revisited

    > data(swiss);
    > fit1 <- lm(Fertility ~ Agriculture, data = swiss)
    > a <- summary(fit1)$cov.unscaled[2,2]
    > fit2 <- update(fit1, Fertility ~ Agriculture + Examination)
    > fit3 <- update(fit1, Fertility ~ Agriculture + Examination + Education)
    > c(summary(fit2)$cov.unscaled[2,2],
        summary(fit3)$cov.unscaled[2,2]) / a
    [1] 1.892 2.089

Thus inclusion of Examination increases the variance of the Agriculture effect by 89.2%, while adding both Examination and Education causes a 108.9% increase. Again, the observed standard errors won’t follow these percentages, since the estimate of σ also changes as regressors are added.
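The same inflation factors can be reached another way, which may help connect the ratio to the usual definition of the VIF. Relative to the model with Agriculture alone, the inflation equals 1 / (1 - R^2), where R^2 comes from regressing Agriculture on the other included regressors. The sketch below is an illustration, not code from the original text, and the object names (vif_ratio, vif_aux, r2_exam, r2_both) are assumptions.

```r
# Sketch: two routes to the variance inflation of the Agriculture coefficient.
data(swiss)

# Route 1: ratio of unscaled variances, relative to Agriculture alone
fit1 <- lm(Fertility ~ Agriculture, data = swiss)
fit2 <- update(fit1, . ~ . + Examination)
fit3 <- update(fit2, . ~ . + Education)
a <- summary(fit1)$cov.unscaled[2, 2]
vif_ratio <- c(summary(fit2)$cov.unscaled[2, 2],
               summary(fit3)$cov.unscaled[2, 2]) / a

# Route 2: auxiliary regressions of Agriculture on the added regressors
r2_exam <- summary(lm(Agriculture ~ Examination, data = swiss))$r.squared
r2_both <- summary(lm(Agriculture ~ Examination + Education,
                      data = swiss))$r.squared
vif_aux <- 1 / (1 - c(r2_exam, r2_both))

rbind(vif_ratio, vif_aux)
```

Both rows should match, which also shows why the outcome Fertility plays no role in the inflation: only the relationships among the regressors matter.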

