These are the increases if we actually knew σ². Let's look at the variance inflation factors. These measure how much variance inflation the variable causes relative to the setting where it was orthogonal to the other regressors. This is nice because it has a well contained interpretation within a single model fit. Also, one doesn't have to do all of the model refitting we did above to explore variance inflation.
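To make that interpretation concrete, here is a small sketch (my own, not from the text) that computes a single VIF by hand from the swiss data used below: the VIF for a regressor is 1 / (1 - R²), where R² comes from regressing that variable on the remaining covariates.

```r
# Sketch: the VIF for Agriculture is 1 / (1 - R^2), where R^2 is from
# regressing Agriculture on the other covariates (swiss data, datasets package).
data(swiss)
r2 <- summary(lm(Agriculture ~ Examination + Education + Catholic + Infant.Mortality,
                 data = swiss))$r.squared
1 / (1 - r2)  # should agree with vif(fit)["Agriculture"] below
```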
So, in general, the VIFs are the most convenient entity to work with.

```r
> library(car)
> fit <- lm(Fertility ~ . , data = swiss)
> vif(fit)
     Agriculture      Examination        Education         Catholic Infant.Mortality
           2.284            3.675            2.775            1.937            1.108
> sqrt(vif(fit)) #If you prefer sd inflation
     Agriculture      Examination        Education         Catholic Infant.Mortality
           1.511            1.917            1.666            1.392            1.052
```

Impact of over- and under-fitting on residual variance estimation

Watch this video before beginning.¹⁰¹

¹⁰¹https://www.youtube.com/watch?v=Mg6WUKkRiS8&list=PLpl-gQkQivXjqHAJd2t-J_One_fYE55tC&index=32

Assuming that the model is linear with additive iid errors, we can mathematically describe the impact of omitting necessary variables or including unnecessary ones. These two rules follow:

* If we underfit the model, that is omit necessary covariates, the variance estimate is biased.
* If we correctly or overfit the model, including all necessary covariates and possibly some unnecessary ones, the variance estimate is unbiased.
However, the variance of the variance is larger if we include unnecessary variables.

These make sense. If we've omitted important variables, we're attributing residual variation that is really systematic variation explainable by those omitted covariates. Therefore, we would expect a variance estimate that is systematically off (biased). We would also expect absence of bias when we throw the kitchen sink at the model and include everything (necessary and unnecessary). However, then our variance estimate is unstable (the variance of the variance estimate is larger).
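These rules are easy to check by simulation. The following sketch is my own (the setup and variable names are invented for illustration, not from the text): it compares the residual variance estimate under an underfit, a correctly fit, and an overfit model when the true error variance is 1.

```r
# Simulation sketch: sigma^2 estimates when underfitting, fitting correctly,
# and overfitting. True model: y = x1 + x2 + e with Var(e) = 1.
set.seed(1)
sims <- sapply(1 : 1000, function(i) {
  n <- 100
  x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)      # x3 is an unnecessary covariate
  y  <- x1 + x2 + rnorm(n)
  c(under   = summary(lm(y ~ x1))$sigma^2,            # omits x2, biased upward
    correct = summary(lm(y ~ x1 + x2))$sigma^2,       # unbiased
    over    = summary(lm(y ~ x1 + x2 + x3))$sigma^2)  # unbiased, slightly more variable
})
round(rowMeans(sims), 2)     # the underfit mean sits near 2; the others near 1
round(apply(sims, 1, sd), 3) # the overfit estimate is a bit more variable
```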
Covariate model selection

Ideally, you include only the necessary variables in a regression model. However, it's impossible to know in practice which ones are necessary and which ones are not. Thus we have to discuss variable selection a little bit. Automated covariate selection is a difficult topic. It depends heavily on how rich of a covariate space one wants to explore. The space of models explodes quickly as you add interactions and polynomial terms.

In the Data Science Specialization prediction class, we'll cover many modern methods for traversing large model spaces for the purposes of prediction. In addition, principal components or factor analytic models on covariates are often useful for reducing complex covariate spaces.
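As a quick illustration (my own sketch, not from the text), principal components of the swiss covariates could stand in for the raw, correlated regressors:

```r
# Sketch: principal components of the swiss covariates as a reduced covariate space.
covariates <- swiss[, -1]                   # all columns except Fertility
pc <- prcomp(covariates, scale. = TRUE)     # scaled principal components
summary(pc)                                 # proportion of variance explained
fitPC <- lm(swiss$Fertility ~ pc$x[, 1:2])  # regress the outcome on the first two PCs
```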
It should also be noted that careful design can often eliminate the need for complex model searches at the analysis stage. For example, randomized designs, randomized block designs, crossover designs, clinical trials and A/B testing are all examples of designs where randomization, balance and stratification are used to create data sets that have more direct analyses. However, control over the design is often limited in data science.

I'll give my favorite approach for model selection when I'm trying to get a parsimonious explanatory model.
(I would use a different strategy for prediction.) Given a coefficient that I'm interested in, I like to use covariate adjustment and multiple models to probe that effect to evaluate it for robustness and to see what other covariates knock it out or amplify it. In other words, if I have an effect, or absence of an effect, that I'd like to report, I try to first come up with criticisms of that effect and then use models to try to answer those criticisms.

As an example, if I had a significant effect of lead exposure on brain size I would think about the following criticism.
Were the high exposure people smaller than the low exposure people? To address this, I would consider adding head size (intra-cranial volume). If the lead exposed were more obese than the non-exposed, I would fit a model with body mass index (BMI) included. This isn't a terribly systematic approach, but it tends to teach you a lot about the data as you get your hands dirty. Most importantly, it makes you think hard about the questions you're asking and what the potential criticisms of your results are. Heading those criticisms off at the pass early on is a good idea.
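There is no lead exposure data in this book, but the same kind of probing can be illustrated with the swiss data (my own sketch): track how the Agriculture coefficient moves as adjustment variables are added, to see which covariates knock the effect out or amplify it.

```r
# Sketch: probe the Agriculture effect under increasing levels of adjustment.
forms <- list(Fertility ~ Agriculture,
              Fertility ~ Agriculture + Examination + Education,
              Fertility ~ .)
sapply(forms, function(f) coef(lm(f, data = swiss))["Agriculture"])
# Watching how (and whether) the estimate changes is the point of the exercise.
```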
How to do nested model testing in R

One particular model selection technique is so useful I'll cover it since it likely wouldn't be covered in a machine learning or prediction class. If the models of interest are nested and without lots of parameters differentiating them, it's fairly uncontroversial to use nested likelihood ratio tests for model selection.
Consider the following example:

```r
> fit1 <- lm(Fertility ~ Agriculture, data = swiss)
> fit3 <- update(fit, Fertility ~ Agriculture + Examination + Education)
> fit5 <- update(fit, Fertility ~ Agriculture + Examination + Education + Catholic + Infant.Mortality)
> anova(fit1, fit3, fit5)
Analysis of Variance Table

Model 1: Fertility ~ Agriculture
Model 2: Fertility ~ Agriculture + Examination + Education
Model 3: Fertility ~ Agriculture + Examination + Education + Catholic +
    Infant.Mortality
  Res.Df  RSS Df Sum of Sq    F  Pr(>F)
1     45 6283
2     43 3181  2      3102 30.2 8.6e-09 ***
3     41 2105  2      1076 10.5 0.00021 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

Notice how the three models I'm interested in are nested. That is, Model 3 contains all of the Model 2 variables, which contains all of the Model 1 variables. The P-values are for a test of whether all of the new variables are zero or not (i.e. whether or not they're necessary).
So this analysis would conclude that all of the added Model 3 terms are necessary over Model 2 and all of the Model 2 terms are necessary over Model 1. So, unless there were some other compelling reasons, we'd pick Model 3. Again, you don't want to blindly follow a model selection procedure, but when the models are naturally nested, this is a reasonable approach.
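As a side check (my own, not in the text), the F statistics in the table can be reproduced from the residual sums of squares: each one divides the average drop in RSS by the residual mean square of the largest model. The fits are re-created here so the chunk stands alone.

```r
# Reproduce the anova() F statistics by hand from the residual sums of squares.
fit1 <- lm(Fertility ~ Agriculture, data = swiss)
fit3 <- update(fit1, . ~ . + Examination + Education)
fit5 <- update(fit3, . ~ . + Catholic + Infant.Mortality)
mse5 <- deviance(fit5) / fit5$df.residual        # denominator: largest model's MSE
(deviance(fit1) - deviance(fit3)) / 2 / mse5     # about 30.2 (Model 2 vs Model 1)
(deviance(fit3) - deviance(fit5)) / 2 / mse5     # about 10.5 (Model 3 vs Model 2)
```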
Exercises

1. Load the dataset Seatbelts as part of the datasets package via data(Seatbelts). Use as.data.frame to convert the object to a dataframe. Fit a linear model of driver deaths with kms, PetrolPrice and law as predictors.
2. Perform a model selection exercise to arrive at a final model.

Watch a video solution.¹⁰²

¹⁰²https://www.youtube.com/watch?v=ffu80TAq2zY&list=PLpl-gQkQivXji7JK1OP1qS7zalwUBPrX0&index=46
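A possible starting point for exercise 1 (a sketch only, not the worked solution; DriversKilled is taken here as the driver-death count) might look like this:

```r
# Starter sketch for exercise 1: driver deaths on kms, PetrolPrice and law.
data(Seatbelts)
seatbelts <- as.data.frame(Seatbelts)
fit <- lm(DriversKilled ~ kms + PetrolPrice + law, data = seatbelts)
summary(fit)$coef
```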
Generalized Linear Models

Watch this video before beginning.¹⁰³

¹⁰³https://youtu.be/xEwM1nzQckY

Generalized linear models (GLMs) were a great advance in statistical modeling. The original manuscript with the GLM framework was from Nelder and Wedderburn in 1972¹⁰⁴ in the Journal of the Royal Statistical Society.

¹⁰⁴http://www.jstor.org/stable/2344614
The McCullagh and Nelder book¹⁰⁵ is the famous standard treatise on the subject.

¹⁰⁵McCullagh, Peter, and John A. Nelder. Generalized Linear Models. Vol. 37. CRC Press, 1989.

Recall linear models. Linear models are the most useful applied statistical technique. However, they are not without their limitations. Additive response models don't make much sense if the response is discrete, or strictly positive. Additive error models often don't make sense, for example, if the outcome has to be positive. Transformations, such as taking a cube root of a count outcome, are often hard to interpret. In addition, there's value in modeling the data on the scale that it was collected. Particularly interpretable transformations, natural logarithms specifically, aren't applicable for negative or zero values.

The generalized linear model is a family of models that includes linear models.
By extending the family, it handles many of the issues with linear models, but at the expense of some complexity and loss of some of the mathematical tidiness. A GLM involves three components:

• An exponential family model for the response.
• A systematic component via a linear predictor.
• A link function that connects the means of the response to the linear predictor.

The three most famous cases of GLMs are: linear models, binomial and binary regression and Poisson regression. We'll go through the GLM model specification and likelihood for all three. For linear models, we've developed them throughout the book.
The next two chapters will be devoted to binomial and Poisson regression. We'll only focus on the most popular and useful link functions.

Example, linear models

Let's go through an example. Assume that our response is Yi ∼ N(µi, σ²). The Gaussian distribution is an exponential family distribution. Define the linear predictor to be

$$\eta_i = \sum_{k=1}^p X_{ik} \beta_k.$$

Define the link function as g so that g(µ) = η. For linear models g(µ) = µ, so that µi = ηi. This yields the same likelihood model as our additive error Gaussian linear model

$$Y_i = \sum_{k=1}^p X_{ik} \beta_k + \epsilon_i$$

where the ϵi are iid N(0, σ²). So, we've specified our model as a GLM above and with a more traditional linear model specification below.
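To see the equivalence in R (a small sketch of my own), a Gaussian-family GLM with the default identity link should reproduce the ordinary least squares fit:

```r
# Sketch: for the identity link and Gaussian family, glm() and lm() agree.
fitLM  <- lm(Fertility ~ Agriculture, data = swiss)
fitGLM <- glm(Fertility ~ Agriculture, family = gaussian, data = swiss)
cbind(lm = coef(fitLM), glm = coef(fitGLM))  # the coefficients should match
```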
Let's try an example where the GLM is more necessary.

Example, logistic regression

Assume that our outcome is a 0, 1 variable. Let's model Yi ∼ Bernoulli(µi) so that E[Yi] = µi where 0 ≤ µi ≤ 1.

• Linear predictor: $\eta_i = \sum_{k=1}^p X_{ik} \beta_k$
• Link function: $g(\mu) = \eta = \log\left(\frac{\mu}{1 - \mu}\right)$

In this case, g is the (natural) log odds, referred to as the logit. Note then we can invert the logit function as:

$$\mu_i = \frac{\exp(\eta_i)}{1 + \exp(\eta_i)} \quad \text{and} \quad 1 - \mu_i = \frac{1}{1 + \exp(\eta_i)}$$

Some people like to call this the expit function.
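As a quick numeric check (my own sketch), the logit and its inverse can be written directly in R:

```r
# Sketch: the logit and its inverse (the "expit") undo one another.
logit <- function(mu)  log(mu / (1 - mu))
expit <- function(eta) exp(eta) / (1 + exp(eta))
expit(logit(0.73))           # recovers 0.73
expit(2) + 1 / (1 + exp(2))  # the two pieces sum to 1
```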