                     Estimate Std. Error t value  Pr(>|t|)
(Intercept)           60.8322     4.1059  14.816 1.032e-18
Agriculture            0.1242     0.0811   1.531 1.329e-01
factor(CatholicBin)1   7.8843     3.7484   2.103 4.118e-02

Thus, 7.8843 is the estimated change in the intercept of the expected relationship between Agriculture and Fertility going from a non-Catholic majority province to a Catholic majority province.

Often, however, we want both a different intercept and a different slope. This is easily obtained with an interaction term:

$$Y_i = \beta_0 + X_{i1} \beta_1 + X_{i2} \beta_2 + X_{i1} X_{i2} \beta_3 + \epsilon_i.$$

When $X_{i2} = 0$, the model reduces to:

$$Y_i = \beta_0 + X_{i1} \beta_1 + \epsilon_i.$$

When $X_{i2} = 1$, the model is:

$$Y_i = (\beta_0 + \beta_2) + X_{i1} (\beta_1 + \beta_3) + \epsilon_i.$$

Thus, the coefficient in front of the main effect $X_{i2}$, labeled $\beta_2$ in our model, is the change in the intercept, while the coefficient in front of the interaction term $X_{i1} X_{i2}$, labeled $\beta_3$ in our model, is the change in the slope. Let's try it:

> summary(lm(Fertility ~ Agriculture * factor(CatholicBin), data = swiss))$coef
                                  Estimate Std. Error t value  Pr(>|t|)
(Intercept)                       62.04993    4.78916 12.9563 1.919e-16
Agriculture                        0.09612    0.09881  0.9727 3.361e-01
factor(CatholicBin)1               2.85770   10.62644  0.2689 7.893e-01
Agriculture:factor(CatholicBin)1   0.08914    0.17611  0.5061 6.153e-01

Thus, 2.85770 is the estimated change in the intercept of the linear relationship between Agriculture and Fertility going from non-Catholic majority to Catholic majority provinces. The interaction term, 0.08914, is the estimated change in the slope. The estimated intercept in non-Catholic majority provinces is 62.04993, while the estimated intercept in Catholic majority provinces is 62.04993 + 2.85770. The estimated slope in non-Catholic majority provinces is 0.09612, while it is 0.09612 + 0.08914 for Catholic majority provinces. If the factor has more than two levels, all of the main effects are changes in the intercepts from the reference level, while all of the interaction terms are changes in slope (again compared to the reference level).

Homework exercise: plot both lines on the data to see the fit! A sketch of one possible solution follows.
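This sketch is not from the original text; it assumes CatholicBin has been created as earlier in the chapter (a 0/1 indicator of a Catholic majority, consistent with the coefficient name factor(CatholicBin)1 above) and simply reuses the fitted coefficients:

# Recreate the binary Catholic-majority variable used earlier in the chapter.
swiss$CatholicBin <- 1 * (swiss$Catholic > 50)
fit <- lm(Fertility ~ Agriculture * factor(CatholicBin), data = swiss)
# Scatterplot of the data, colored by Catholic majority status.
plot(swiss$Agriculture, swiss$Fertility, pch = 21, col = "black",
     bg = ifelse(swiss$CatholicBin == 1, "salmon", "lightblue"), cex = 2,
     frame = FALSE, xlab = "Agriculture", ylab = "Fertility")
# Fitted line for non-Catholic majority provinces (the reference level).
abline(coef(fit)[1], coef(fit)[2], lwd = 3)
# Fitted line for Catholic majority provinces: intercept and slope both shift.
abline(coef(fit)[1] + coef(fit)[3], coef(fit)[2] + coef(fit)[4], lwd = 3)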
Exercises

1. Do exercise 1 of the previous chapter if you have not already. Load the dataset Seatbelts as part of the datasets package via data(Seatbelts). Use as.data.frame to convert the object to a dataframe. Fit a linear model of driver deaths with kms and PetrolPrice as predictors. Interpret your results.

2. Repeat question 1 for the outcome being the log of the count of driver deaths. Interpret your coefficients. Watch a video solution.⁸⁵

3. Refer to question 1. Add the dummy variable law and interpret the results. Repeat this question with a factor variable that you create called lawFactor that takes the levels No and Yes.
Change the reference level from No to Yes. Watch a video solution.⁸⁶

4. Discretize the PetrolPrice variable into four factor levels. Fit the linear model with this factor to see how R treats multiple level factor variables. Watch a video solution.⁸⁷

5. Perform the plot requested at the end of the last chapter.

⁸⁵https://www.youtube.com/watch?v=GfIjC4rM08A&index=40&list=PLpl-gQkQivXji7JK1OP1qS7zalwUBPrX0
⁸⁶https://www.youtube.com/watch?v=ikKQv98i-EQ&index=41&list=PLpl-gQkQivXji7JK1OP1qS7zalwUBPrX0
⁸⁷https://www.youtube.com/watch?v=4FB8O-Vt1I0&index=42&list=PLpl-gQkQivXji7JK1OP1qS7zalwUBPrX0

Adjustment

Watch this video before beginning.⁸⁸

Adjustment is the idea of putting regressors into a linear model to investigate the role of a third variable on the relationship between another two.
It is often the case that a third variable can distort, or confound if you will, the relationship between two others.

As an example, consider looking at lung cancer rates and breath mint usage. For the sake of completeness, imagine if you were looking at forced expiratory volume (a measure of lung function) and breath mint usage. If you found a statistically significant regression relationship, it wouldn't be wise to rush off to the newspapers with the headline "Breath mint usage causes shortness of breath!", for a variety of reasons.
First off, even if the association is sound, you don't know that it's causal. But, more importantly in this case, the likely culprit is smoking habits. Smoking rates are likely related to both breath mint usage rates and lung function. How would you defend your finding against the accusation that it's just variability in smoking habits?

If your finding held up among non-smokers and smokers analyzed separately, then you might have something. In other words, people wouldn't even begin to believe this finding unless it held up while holding smoking status constant.
That is the idea of adding a regression variable into a model as adjustment. The coefficient of interest is interpreted as the effect of the predictor on the response, holding the adjustment variable constant.

In this chapter, we'll use simulation to investigate how adding a regressor into a model addresses the idea of adjustment.

Experiment 1

Let's first generate some data. Consider the model

$$Y_i = \beta_0 + \beta_1 X_i + \tau T_i + \epsilon_i.$$

We're interested in the relationship between our binary treatment, T, and Y. However, we're concerned that the relationship may depend on the continuous variable, X.

Let's simulate some data.

⁸⁸https://youtu.be/SFPM9IuP2m8

n <- 100; t <- rep(c(0, 1), c(n/2, n/2)); x <- c(runif(n/2), runif(n/2));
beta0 <- 0; beta1 <- 2; tau <- 1; sigma <- .2
y <- beta0 + x * beta1 + t * tau + rnorm(n, sd = sigma)

Let's plot the data.
Below I give the code for the first plot; the rest of the code for the plots throughout this chapter is omitted. (However, you can see the course git repository for the rest of the code.)

Simulation 1

plot(x, y, type = "n", frame = FALSE)
abline(lm(y ~ x), lwd = 2)
abline(h = mean(y[1 : (n/2)]), lwd = 3)
abline(h = mean(y[(n/2 + 1) : n]), lwd = 3)
fit <- lm(y ~ x + t)
abline(coef(fit)[1], coef(fit)[2], lwd = 3)
abline(coef(fit)[1] + coef(fit)[3], coef(fit)[2], lwd = 3)
points(x[1 : (n/2)], y[1 : (n/2)], pch = 21, col = "black", bg = "lightblue", cex = 2)
points(x[(n/2 + 1) : n], y[(n/2 + 1) : n], pch = 21, col = "black", bg = "salmon", cex = 2)

Experiment 1.

Looking at this plot, notice that the X variable is unrelated to treatment/group status (color).
In addition, the X variable is clearly linearly related to Y, but the intercept of this relationship depends on group status. The treatment variable is also related to Y; especially look at the horizontal lines, which connect the group means onto the Y axis. The third line is what you would get if you just fit X and ignored group. Furthermore, notice that the relationship between group status and Y is constant across values of X. In other words, both the apparent relationship and our estimated model have parallel lines. (Remember, our model, by not including an interaction term, did not allow for estimated non-parallel lines.)

Finally, notice that the estimated relationship between the group variable and the outcome doesn't change much, regardless of whether X is accounted for or not.
You can see this by comparing the distance between the horizontal lines and the distance between the intercepts of the fitted lines. The horizontal lines are the group averages (disregarding X). That the relationship doesn't change much is ultimately a statement about balance. The nuisance variable (X) is well balanced between levels of the group variable.
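One can check this numerically with the simulated data; this quick comparison is not in the original text:

coef(lm(y ~ t))["t"]       # group effect ignoring x (the gap between horizontal lines)
coef(lm(y ~ x + t))["t"]   # group effect adjusting for x; both estimates are near tau = 1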
So, whether you account for X or not, you get about the same answer. Moreover, we have lots of data at every level of X to make a direct comparison of the group on Y. One way to try to achieve such balance with high probability is to randomize the group variable. This is especially useful, of course, when one doesn't get to observe the nuisance covariate. Though be careful: as the number of unobserved covariates grows, it becomes harder to believe that all of them are balanced between the groups.

Now let's consider less ideal settings.

Experiment 2

Experiment 2.

In this experiment, the X variable is highly related to group status. That is, if you know the X variable, you could very easily predict which group an observation belonged to.
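The text omits the code for this simulation, but data with this structure could be generated roughly as follows; the offset of 1.5 and the choice tau <- 0 are illustrative assumptions rather than the book's exact values:

n <- 100; t <- rep(c(0, 1), c(n/2, n/2))
x <- c(runif(n/2), 1.5 + runif(n/2))   # x almost perfectly separates the groups
beta0 <- 0; beta1 <- 2; tau <- 0; sigma <- .2
y <- beta0 + x * beta1 + t * tau + rnorm(n, sd = sigma)
summary(lm(y ~ t))$coef       # a strong apparent group effect
summary(lm(y ~ x + t))$coef   # the group effect essentially vanishes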
If we disregard X, there's an apparently strong relationship between the group variable and Y. However, if we account for X, there's basically none. In this case, the apparent effect of group on Y is entirely explained by X. Our regression model would likely show a strong significant effect if group was included by itself, and this effect would vanish if X was included.

Further, notice that there's no data to directly compare the groups at any particular value of X. (There's no vertical overlap between the blue and red points.) Thus the adjusted effect is entirely based on the model, specifically the assumption of linearity.
Try drawing curves on this plot, assuming non-linear relationships outside of their cloud of points for the blue and red groups. You will quickly conclude that many relationships are possible that would differ from this model's conclusions. Worse still, you have no data to check the assumptions.
Of course, R will churn forward without any complaints, fitting this model and reporting no significant difference between the groups.

It's worth noting at this point that our experiments just show how the data can arrive at different effects when X is included or not. In a real application, it may be the case that X should be included and it may be that it shouldn't.

For example, consider a study I was working on a few years ago. Imagine if group was whether or not the subject was taking blood pressure medication and X was systolic blood pressure (ostensibly, the two variables giving the same information).
It may not make sense to adjust for blood pressure when looking at the effect of blood pressure medication on the outcome.

On the other hand, consider another setting I ran into. A colleague was studying post-mortem chemical brain measurements of patients with a severe mental disorder versus controls. However, the time since death was highly related to the time the brain was stored since death, perhaps due to the differential patient sources of the two groups. The time since death was strongly related to the outcome we were studying. In this case, it is very hard to study the groups, as they were so contaminated by this nuisance covariate.

Thus we arrive at the conclusion that whether or not to include a covariate is a complex process relying on both the statistics and a careful investigation into the subject matter being studied.

Experiment 3

Experiment 3.

In this experiment, we simulated data where the marginal (ignoring X) and conditional (using X) associations differ. First note that if X is ignored, one would estimate a higher marginal mean for Y for the red group than for the blue group.
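The simulation code is again omitted in the text, but one hypothetical way to produce data whose marginal and conditional associations disagree is to give the treated (red) group larger X values on average while making the treatment effect itself negative; the specific numbers below are illustrative assumptions:

n <- 100; t <- rep(c(0, 1), c(n/2, n/2))
x <- c(runif(n/2), 0.9 + runif(n/2))   # treated group has larger x on average
beta0 <- 0; beta1 <- 2; tau <- -1; sigma <- .2
y <- beta0 + x * beta1 + t * tau + rnorm(n, sd = sigma)
summary(lm(y ~ t))$coef       # marginally, the treated group looks higher
summary(lm(y ~ x + t))$coef   # conditionally on x, the treated group is lower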