Regression models for data sciense (779323), page 13

File No. 779323, Regression models for data sciense, page 13. 2017-12-25. СтудИзба

However, if we look at the intercept in the fitted model, the blue group has a higher intercept. In other words, if you were to fit this linear model as lm(Y ~ Group) you would get one answer, and lm(Y ~ Group + X) would give you the exact opposite answer; in both cases the group effect would be highly statistically significant!

Also in this setting, there isn’t a lot of overlap between the groups for any given X. That means there isn’t a lot of direct evidence to compare the groups without relying heavily on the model. In other words, group status is related to X quite strongly (though not as strongly as in the previous example). The adjusted relationship suggests that the blue group is larger than the red group. However, the reversal of the effect arises because a bigger X means both more likely red and a higher Y.

Let’s concoct an example around a way this data could have occurred.

Suppose that you’re comparing two ad campaigns (labeled blue and red). Y is the sales from the ad (suppose you can measure this) and X is the time of day that the ad is shown. Ads shown later in the day do better than ads shown earlier. However, the blue ad campaign tended to get run in the morning while the red one tended to get run in the evening. So, ignoring time of day leads to the erroneous conclusion that the red ad did better.
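The ad campaign story can be mimicked with a small simulation. This is only a sketch under made-up assumptions: the sample size, the time windows, the slope of 1.5 and the true blue advantage of 2 are all hypothetical numbers chosen for illustration, not values from the text.

```r
# Hypothetical simulation of the ad campaign story; every number is made up.
set.seed(1)
n <- 200
group <- rep(c("blue", "red"), each = n / 2)
# X: time of day in hours; blue runs mostly in the morning, red in the evening
x <- c(runif(n / 2, 6, 14), runif(n / 2, 12, 20))
# Sales rise with time of day; at any fixed time, blue does better by 2 units
y <- 1.5 * x + ifelse(group == "blue", 2, 0) + rnorm(n, sd = 1)
dat <- data.frame(y = y, x = x, group = factor(group))

# Marginal comparison: ignores time of day
marginal <- unname(coef(lm(y ~ group, data = dat))["groupred"])
# Conditional comparison: holds time of day fixed
adjusted <- unname(coef(lm(y ~ group + x, data = dat))["groupred"])

# Red-minus-blue effect under each model
c(marginal = marginal, adjusted = adjusted)
```

With these settings the marginal coefficient for red should come out positive while the adjusted one comes out negative, which is the reversal described above.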

Again, randomization of the ads to time slots would likely have eliminated this problem.

Experiment 4

[Figure: Experiment 4]

Now that you’ve gotten the hang of it, you can see how marginal and conditional associations can differ. Experiment 4 is a case where the marginal association is minimal yet the conditional association is large. In this case, by adding X to the model, the group effect became more statistically significant.

Experiment 5

[Figure: Experiment 5]

Let’s look at a weird one. In this case, the best fitting model has both a group main effect and an interaction with X. The main point here is that there is no meaningful group effect; the effect of group depends on what level of X you’re at.

At a small value of X, the red group is higher, and at a large value of X, the blue group is higher; at intermediate values, they’re the same. Thus, it makes no sense to talk about a group effect in this example; group and X are intrinsically linked in their impact on Y.

As an example, imagine that Y is a health outcome, X is time and group is two medications. One makes you much better right away, then much worse as time goes on, and the other doesn’t do much at the start but steadily improves symptoms over time.
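A crossing pattern like this can be sketched with simulated data. The intercepts, slopes and noise level below are hypothetical, and the helper group_diff is introduced here only for illustration.

```r
# Hypothetical simulation: two medications whose effects cross over time.
set.seed(2)
n <- 200
group <- factor(rep(c("blue", "red"), each = n / 2))
x <- runif(n, 0, 10)  # time
# Red starts higher and declines; blue starts lower and improves
y <- ifelse(group == "red", 8 - 0.6 * x, 2 + 0.6 * x) + rnorm(n, sd = 1)
dat <- data.frame(y, x, group)

# Group main effect, X main effect, and their interaction
fit <- lm(y ~ group * x, data = dat)
b <- coef(fit)

# Estimated red-minus-blue difference at a given value of x
group_diff <- function(x0) unname(b["groupred"] + b["groupred:x"] * x0)
sapply(c(small = 1, middle = 5, large = 9), group_diff)
```

The estimated difference changes sign as x grows, so the groupred coefficient on its own only describes the comparison at x = 0.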

Of course, most examples seen in practice aren’t that extreme. Still, even with a slight departure from constant slopes, the meaning of a main group effect goes away.

Some final thoughts

Nothing we’ve discussed is intrinsic to having a discrete group and continuous X. One, the other, both or neither could be discrete. What this reinforces is that modeling multivariable relationships is hard. You should continue to play around with simulations to see how the inclusion or exclusion of another variable can change apparent relationships.

We should also caution that our discussion only dealt with associations.

Establishing causal or truly mechanistic relationships requires quite a bit more thinking.

Exercises

1. Load the dataset Seatbelts as part of the datasets package via data(Seatbelts). Use as.data.frame to convert the object to a dataframe. Fit a linear model of driver deaths with kms and PetrolPrice as predictors. Interpret your results.
2. Compare the kms coefficient with and without the inclusion of the PetrolPrice variable in the model. Watch a video solution.⁸⁹
3. Compare the PetrolPrice coefficient with and without the inclusion of the kms variable in the model.

⁸⁹ https://www.youtube.com/watch?v=LTTsm8FfgeI&index=43&list=PLpl-gQkQivXji7JK1OP1qS7zalwUBPrX0

Residuals, variation, diagnostics

Watch this video before beginning: http://youtu.be/VohfwSJuG4k

Residuals

Recall from Chapter 6 that the vertical distances between the observed data points and the fitted regression line are called residuals.

We can generalize this idea to the vertical distances between the observed data and the fitted surface in multivariable settings.

To be specific, recall our linear model, which was specified as $Y_i = \sum_{k=1}^p X_{ik} \beta_k + \epsilon_i$. Throughout this lecture, we’ll also assume that $\epsilon_i \stackrel{iid}{\sim} N(0, \sigma^2)$, even though this assumption isn’t necessary for the definition of the residuals. We define the residuals as:

$$e_i = Y_i - \hat Y_i = Y_i - \sum_{k=1}^p X_{ik} \hat \beta_k.$$

This definition is identical ($Y_i - \hat Y_i$) to our definition in the linear regression case.
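As a quick numerical check of this definition, the residuals can be rebuilt by hand from the design matrix and the fitted coefficients. The sketch below uses the swiss dataset that ships with R; the variable names (X, beta_hat, e_manual, sigma2_hat) are just illustrative. It also computes the usual residual variance estimate, the sum of squared residuals divided by n − p.

```r
data(swiss)
fit <- lm(Fertility ~ ., data = swiss)

X <- model.matrix(fit)   # n x p design matrix, intercept column included
beta_hat <- coef(fit)

# e_i = Y_i - sum_k X_ik * beta_hat_k
e_manual <- swiss$Fertility - as.vector(X %*% beta_hat)

# Agrees with resid() up to floating point tolerance
all.equal(unname(resid(fit)), e_manual)

# The model has an intercept, so the residuals sum to (numerically) zero
sum(e_manual)

# Residual variance estimate with the n - p denominator
n <- nrow(X)
p <- ncol(X)
sigma2_hat <- sum(e_manual^2) / (n - p)
all.equal(sigma2_hat, summary(fit)$sigma^2)
```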

The residuals are the vertical distances between the observed data points and the fitted regression surface. Just like in linear regression, our estimate of residual variation is basically the average of the squared residuals. Specifically, $\hat \sigma^2 = \frac{\sum_{i=1}^n e_i^2}{n-p}$. Just like before, we divide by $n - p$ rather than $n$ so that the estimate is unbiased, $E[\hat \sigma^2] = \sigma^2$.

Obtaining and plotting residuals in R is particularly easy. The function resid will return the residuals of a model fit with lm. Some useful plots, including a residual plot, can be produced with the plot function on the output of an lm fit. Consider the swiss dataset from previous chapters.

    data(swiss); par(mfrow = c(2, 2))
    fit <- lm(Fertility ~ . , data = swiss); plot(fit)

[Figure: The result of the method plot on the swiss dataset.]

Consider the upper left hand plot of the residuals ($e_i$) versus the fitted values ($\hat Y_i$). Often, a horizontal reference line at 0 is drawn since (whenever an intercept is included) the residuals must sum to 0 and so will lie above and below zero. Just like in our previous residual plots, one is looking for any systematic patterns or large outlying observations.

Note that this is one of many residual plots that one may be interested in performing. For example, one might want to look at plots of residuals by individual predictors or, as is done by plot, versus leverage (defined later in this chapter).

Influential, high leverage and outlying points

As previously mentioned, it is a good idea to check our data for outliers.

We may want to refer back to these points to see if we can ascertain how they became outliers, such as a misrecording. In addition, we may want to fit the models with and without those points included in order to ascertain their influence on the model fit and inferential goals.

Outliers can arise for a variety of reasons.

They can be real, but inconvenient, data. They could arise from spurious processes like processing or recording errors. They can have varying degrees of influence on our model. Thus, calling a point an outlier is vague and we need a more precise language to discuss points that fall outside of our cloud of data. The plot below is useful for understanding different sorts of outliers.

[Figure: Plot of simulated data with four different kinds of highlighted orange points.]

The lower left hand point is not an outlier, having neither leverage nor influence on our fitted model. The upper left hand point is an outlier in the Y direction, but not in the X.

It will have little impact on our fitted model, since there are lots of X points nearby to counteract its effect. This point is said to have low leverage and influence. The upper right hand point is outside of the range of X values and Y values, but conforms nicely to the regression relationship. It will also have little effect on the fitted model. It has high leverage, but chooses not to exert it, and thus has low influence. The lower right hand point is outside of the range of X values, but not the Y values.

However, it does not conform to the relationship of the remainder of the points at all. This outlier has high leverage and influence.

From this discussion you can maybe guess at the formal definition of two important terms: leverage and influence. Leverage describes how far outside of the norm a point’s X values are from the cloud of other X values. A point with high leverage has the opportunity to dramatically impact the regression model.

Whether or not it does so depends on how closely it conforms to the fit.

The other concept, influence, is a measure of how much impact a point has on the regression fit. The most direct way to measure influence is to fit the model with the point included and excluded.

Residuals, Leverage and Influence measures

Watch this video before beginning.⁹⁰

Now that we understand the three concepts of residuals, leverage and influence, we present a laundry list of probes. Do ?influence.measures to see the full suite of influence measures in stats.

First consider residuals. We already know that if fit is the output of lm (as in fit = lm(y ~ x1 + x2)), then resid(fit) returns the ordinary residuals.

A problem, though, is that these are defined as $Y_i - \hat Y_i$ and thus have the units of the outcome. This isn’t great for comparing residual values across different analyses with different experiments. So, some efforts to standardize residuals have been made. Two of the most common are:

• rstandard - residuals divided by their standard deviations
• rstudent - residuals divided by their standard deviations, where the ith data point was deleted in the calculation of the standard deviation for the residual to follow a t distribution

Both of these endeavor to create T-like (as in Student’s T distribution) statistics so that one can threshold residuals using T cutoffs. This is why these sorts of residuals are called studentized.

The rstudent residuals are exactly T distributed while the rstandard residuals are not. The rstandard residuals are sometimes called internally standardized while the rstudent residuals are called externally standardized. The distinction between the residuals is mostly for establishing probability based cutoffs. Instead, we recommend looking at the residuals as a collective and using the cutoffs loosely.
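To see what the two functions compute, both kinds of standardized residuals can be rebuilt from the ordinary residuals, the leverages returned by hatvalues and the residual standard deviation. This is a sketch using the swiss fit; the names r_int, r_ext and s_del are illustrative, and the 0.975 quantile is an arbitrary choice of loose cutoff.

```r
data(swiss)
fit <- lm(Fertility ~ ., data = swiss)

e <- resid(fit)
h <- hatvalues(fit)        # leverage of each observation
s <- summary(fit)$sigma    # residual standard deviation from the full fit
n <- nrow(swiss)
p <- length(coef(fit))

# Internally standardized: every residual uses the same s
r_int <- e / (s * sqrt(1 - h))
all.equal(r_int, rstandard(fit))

# Externally standardized: s_del[i] is the estimate with observation i deleted
s_del <- sqrt((s^2 * (n - p) - e^2 / (1 - h)) / (n - p - 1))
r_ext <- e / (s_del * sqrt(1 - h))
all.equal(r_ext, rstudent(fit))

# Loose screening against a t quantile with n - p - 1 degrees of freedom
cutoff <- qt(0.975, df = n - p - 1)
names(which(abs(rstudent(fit)) > cutoff))
```

In keeping with the advice above, the flagged names are best treated as points to look at more closely rather than as a formal test.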
