Regression models for data science (779323), page 4

File No. 779323, Regression models for data science, page 4. 2017-12-25, StudIzba


Text from the file (page 4)

For example, a value of 2 in normalized data means that the data point was two standard deviations larger than the mean. Normalization is very useful for creating data that are comparable across experiments, because it gets rid of any shifting or scaling effects.

The empirical covariance

This class is largely concerned with how variables covary, which is estimated by the empirical covariance. Consider now when we have pairs of data, $(X_i, Y_i)$. Their empirical covariance is defined as:

$$\mathrm{Cov}(X, Y) = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar X)(Y_i - \bar Y) = \frac{1}{n-1}\left(\sum_{i=1}^n X_i Y_i - n \bar X \bar Y\right)$$

This measure is of limited utility, since its units are the product of the units of the two variables.
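The two forms of the empirical covariance, and its dependence on units, can be checked numerically. The sketch below uses simulated data rather than Galton's; the seed, sample size, and coefficients are arbitrary assumptions chosen only for illustration.

```r
# Simulated paired data; the seed, sample size, and coefficients are arbitrary.
set.seed(1)
n <- 50
x <- rnorm(n, mean = 68, sd = 2)
y <- 0.5 * x + rnorm(n, mean = 34, sd = 2)

# Definition: sum of cross-products of deviations, divided by n - 1.
cov_def <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

# Shortcut form: raw cross-products minus n times the product of the means.
cov_short <- (sum(x * y) - n * mean(x) * mean(y)) / (n - 1)

# Base R's cov() uses the same n - 1 denominator.
c(definition = cov_def, shortcut = cov_short, builtin = cov(x, y))

# Units matter: expressing x in centimeters instead of inches
# multiplies the covariance by 2.54.
c(inches = cov(x, y), centimeters = cov(2.54 * x, y))
```

The three values on the first output line should agree, while the second line shows why the raw covariance is hard to compare across settings.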

A more useful definition normalizes the two variables first. The correlation is defined as:

$$\mathrm{Cor}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{S_x S_y}$$

where $S_x$ and $S_y$ are the estimates of the standard deviations for the X observations and Y observations, respectively. The correlation is simply the covariance of the separately normalized X and Y data. Because the data have been normalized, the correlation is a unit-free quantity and thus has more of a hope of being interpretable across settings.

Some facts about correlation

First, the order of the arguments is irrelevant: $\mathrm{Cor}(X, Y) = \mathrm{Cor}(Y, X)$. Secondly, it has to be between -1 and 1: $-1 \le \mathrm{Cor}(X, Y) \le 1$. Thirdly, the correlation is exactly -1 or 1 only when the observations fall perfectly on a negatively or positively sloped line, respectively. Fourthly, $\mathrm{Cor}(X, Y)$ measures the strength of the linear relationship between the two variables, with stronger relationships as $\mathrm{Cor}(X, Y)$ heads towards -1 or 1. Finally, $\mathrm{Cor}(X, Y) = 0$ implies no linear relationship.

Exercises

1. Take the Galton dataset and find the mean, standard deviation and correlation between the parental and child heights. Watch a video solution.²⁸
2. Center the parent and child variables and verify that the centered variable means are 0. Watch a video solution.²⁹
3. Rescale the parent and child variables and verify that the scaled variable standard deviations are 1. Watch a video solution.³⁰
4. Normalize the parental and child heights. Verify that the normalized variables have mean 0 and standard deviation 1 and take the correlation between them. Watch a video solution.³¹

²⁸ https://www.youtube.com/watch?v=6zq-excgkHg&list=PLpl-gQkQivXji7JK1OP1qS7zalwUBPrX0&index=4
²⁹ https://www.youtube.com/watch?v=OT9tn_jtzus&list=PLpl-gQkQivXji7JK1OP1qS7zalwUBPrX0&index=5
³⁰ https://www.youtube.com/watch?v=y32m9mjEQsk&list=PLpl-gQkQivXji7JK1OP1qS7zalwUBPrX0&index=6
³¹ https://www.youtube.com/watch?v=D7LmrbjenZk&list=PLpl-gQkQivXji7JK1OP1qS7zalwUBPrX0&index=7

Ordinary least squares

Watch this video before beginning.³²

³² https://www.youtube.com/watch?v=LapyH7MG3Q4&list=PLpl-gQkQivXjqHAJd2t-J_One_fYE55tC&index=6

Ordinary least squares (OLS) is the workhorse of statistics. It gives a way of taking complicated outcomes and explaining behavior (such as trends) using linearity. The simplest application of OLS is fitting a line.

General least squares for linear equations

Consider again the parent and child height data from Galton.

Figure: Plot of parent and child heights.

Let's try fitting the best line. Let $Y_i$ be the $i$th child's height and $X_i$ be the $i$th (average over the pair of) parental heights.

Consider finding the best line of the form

$$\text{Child Height} = \beta_0 + \beta_1 \, \text{Parent Height}.$$

Let's try using least squares by minimizing the following equation over $\beta_0$ and $\beta_1$:

$$\sum_{i=1}^n \{Y_i - (\beta_0 + \beta_1 X_i)\}^2.$$

Minimizing this equation will minimize the sum of the squared distances between the fitted line at the parents' heights ($\beta_0 + \beta_1 X_i$) and the observed child heights ($Y_i$).

The result actually has a closed form. Specifically, the least squares fit of the line

$$Y = \beta_0 + \beta_1 X$$

through the data pairs $(X_i, Y_i)$, with $Y_i$ as the outcome, obtains the line $Y = \hat\beta_0 + \hat\beta_1 X$ where:

$$\hat\beta_1 = \mathrm{Cor}(Y, X)\frac{\mathrm{Sd}(Y)}{\mathrm{Sd}(X)} \quad \text{and} \quad \hat\beta_0 = \bar Y - \hat\beta_1 \bar X.$$

At this point, a couple of notes are in order. First, the slope, $\hat\beta_1$, has the units of Y/X. Secondly, the intercept, $\hat\beta_0$, has the units of Y.

The line passes through the point $(\bar X, \bar Y)$, since $\hat\beta_0 + \hat\beta_1 \bar X = \bar Y$ by the definition of $\hat\beta_0$.
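The notes on units and on the point $(\bar X, \bar Y)$ can be illustrated with a small sketch. The data are simulated (not Galton's), and the helper name `ols`, the seed, and the coefficients are hypothetical choices made only for this example.

```r
# Simulated heights in inches; all numbers here are illustrative assumptions.
set.seed(2)
n <- 200
x_in <- rnorm(n, mean = 68, sd = 2)            # predictor, in inches
y_in <- 24 + 0.65 * x_in + rnorm(n, sd = 2)    # outcome, in inches

# Hypothetical helper: closed-form least squares estimates.
ols <- function(x, y) {
  b1 <- cor(y, x) * sd(y) / sd(x)
  b0 <- mean(y) - b1 * mean(x)
  c(intercept = b0, slope = b1)
}

fit_in <- ols(x_in, y_in)
fit_cm <- ols(2.54 * x_in, y_in)   # same predictor, now in centimeters

# The slope has units of Y/X, so measuring X in cm divides it by 2.54.
# The intercept has units of Y, so it does not change.
rbind(inches = fit_in, centimeters = fit_cm)

# The fitted line evaluated at mean(x) returns mean(y); this difference
# should be zero up to rounding error.
gap <- fit_in[["intercept"]] + fit_in[["slope"]] * mean(x_in) - mean(y_in)
gap
```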

If you center your Xs and Ys first, then the line will pass through the origin. Moreover, the slope is the same one you would get if you centered the data, $(X_i - \bar X, Y_i - \bar Y)$, and either fit a linear regression or a regression through the origin.

To elaborate, regression through the origin, which assumes that $\beta_0 = 0$, yields the following solution to the least squares criterion:

$$\hat\beta_1 = \frac{\sum_{i=1}^n X_i Y_i}{\sum_{i=1}^n X_i^2}.$$

This is exactly the correlation times the ratio of the standard deviations if both the Xs and Ys have been centered first. (Try it out using R to verify this!)

It is interesting to think about what happens when you reverse the roles of X and Y. Specifically, the slope of the regression line with X as the outcome and Y as the predictor is $\mathrm{Cor}(Y, X)\,\mathrm{Sd}(X)/\mathrm{Sd}(Y)$.

If you normalized the data, $\left\{\frac{X_i - \bar X}{\mathrm{Sd}(X)}, \frac{Y_i - \bar Y}{\mathrm{Sd}(Y)}\right\}$, the slope is simply the correlation, $\mathrm{Cor}(Y, X)$, regardless of which variable is treated as the outcome.

Revisiting Galton's data

Watch this video before beginning.³³

³³ https://www.youtube.com/watch?v=O7cDyrjWBBc&index=7&list=PLpl-gQkQivXjqHAJd2t-J_One_fYE55tC

Let's double check our calculations using R.

Fitting Galton's data using linear regression.

> library(UsingR); data(galton)            # galton ships with the UsingR package
> y <- galton$child                        # outcome: child heights
> x <- galton$parent                       # predictor: parental heights
> beta1 <- cor(y, x) * sd(y) / sd(x)       # closed-form slope
> beta0 <- mean(y) - beta1 * mean(x)       # closed-form intercept
> rbind(c(beta0, beta1), coef(lm(y ~ x)))  # row 1: by hand, row 2: lm
     (Intercept)      x
[1,]       23.94 0.6463
[2,]       23.94 0.6463

We can see that the result of lm is identical to hard coding the fit ourselves.
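As a further check, the least squares criterion itself can be minimized numerically and compared with the closed form. This is only a sketch on simulated data: the true intercept and slope, the seed, the starting point, and the choice of `optim` with the BFGS method are assumptions for illustration, not the way `lm` computes its fit.

```r
# Simulated data; the true intercept 1 and slope 2 are arbitrary assumptions.
set.seed(3)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)

# Least squares criterion: sum of squared distances between the
# observed outcomes and the line beta[1] + beta[2] * x.
sse <- function(beta) sum((y - (beta[1] + beta[2] * x))^2)

# Numerical minimization from an arbitrary starting point.
numeric_fit <- optim(c(0, 0), sse, method = "BFGS")$par

# Closed-form solution.
b1 <- cor(y, x) * sd(y) / sd(x)
b0 <- mean(y) - b1 * mean(x)

rbind(numeric = numeric_fit, closed_form = c(b0, b1))
```

The two rows should agree closely; any small difference comes from the stopping tolerance of the numerical optimizer.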

Let's reverse the outcome/predictor relationship.

> beta1 <- cor(y, x) * sd(x) / sd(y)
> beta0 <- mean(x) - beta1 * mean(y)
> rbind(c(beta0, beta1), coef(lm(x ~ y)))
     (Intercept)      y
[1,]       46.14 0.3256
[2,]       46.14 0.3256

Now let's show that regression through the origin yields an equivalent slope if you center the data first.

> yc <- y - mean(y)
> xc <- x - mean(x)
> beta1 <- sum(yc * xc) / sum(xc ^ 2)
> c(beta1, coef(lm(y ~ x))[2])
            x
0.6463 0.6463

Now let's show that normalizing variables results in the slope being the correlation.

> yn <- (y - mean(y))/sd(y)
> xn <- (x - mean(x))/sd(x)
> c(cor(y, x), cor(yn, xn), coef(lm(yn ~ xn))[2])
                  xn
0.4588 0.4588 0.4588

The image below plots the data again, the best fitting line and standard error bars for the fit.

We will work up to standard error bars later, but it is important to understand that our fitted line is estimated with error. You can find the code for the plot here.³⁴

³⁴ https://github.com/bcaffo/courses/blob/master/07_RegressionModels/01_03_ols/index.Rmd

Figure: Image of the data, the fitted line and error bars.

Showing the OLS result

If you would like to see a proof of why the ordinary least squares result works out to be the way that it is, watch this video.³⁵

³⁵ https://www.youtube.com/watch?v=COVQX8WZVA8&index=8&list=PLpl-gQkQivXjqHAJd2t-J_One_fYE55tC

Exercises

1. Install and load the package UsingR and load the father.son data with data(father.son). Get the linear regression fit where the son's height is the outcome and the father's height is the predictor. Give the intercept and the slope, plot the data and overlay the fitted regression line. Watch a video solution.³⁶
2. Refer to problem 1. Center the father and son variables and refit the model omitting the intercept. Verify that the slope estimate is the same as the linear regression fit from problem 1. Watch a video solution.³⁷
3. Refer to problem 1. Normalize the father and son data and see that the fitted slope is the correlation. Watch a video solution.³⁸
4. Go back to the linear regression line from problem 1. If a father's height was 63 inches, what would you predict the son's height to be? Watch a video solution.³⁹
5. Consider a data set where the standard deviation of the outcome variable is double that of the predictor. Also, the variables have a correlation of 0.3. If you fit a linear regression model, what would be the estimate of the slope? Watch a video solution.⁴⁰
6. Consider the previous problem. The outcome variable has a mean of 1 and the predictor has a mean of 0.5. What would be the intercept? Watch a video solution.⁴¹
7. True or false: if the predictor variable has mean 0, the estimated intercept from linear regression will be the mean of the outcome? Watch a video solution.⁴²
8. Consider problem 5 again. What would be the estimated slope if the predictor and outcome were reversed? Watch a video solution.⁴³

³⁶ https://www.youtube.com/watch?v=HH78kFrT-5k&index=8&list=PLpl-gQkQivXji7JK1OP1qS7zalwUBPrX0
³⁷ https://www.youtube.com/watch?v=Bf0euQ_-CuE&list=PLpl-gQkQivXji7JK1OP1qS7zalwUBPrX0&index=10
³⁸ https://www.youtube.com/watch?v=Bf0euQ_-CuE&list=PLpl-gQkQivXji7JK1OP1qS7zalwUBPrX0&index=10
³⁹ https://www.youtube.com/watch?v=46eu_SrKVNE&list=PLpl-gQkQivXji7JK1OP1qS7zalwUBPrX0&index=11
⁴⁰ https://www.youtube.com/watch?v=rRADoy09tXg&list=PLpl-gQkQivXji7JK1OP1qS7zalwUBPrX0&index=12
⁴¹ https://www.youtube.com/watch?v=TRxhUJB2zfg&list=PLpl-gQkQivXji7JK1OP1qS7zalwUBPrX0&index=13
⁴² https://www.youtube.com/watch?v=XBXL70A9eDw&list=PLpl-gQkQivXji7JK1OP1qS7zalwUBPrX0&index=14
⁴³ https://www.youtube.com/watch?v=kzmyzpHcNtg&list=PLpl-gQkQivXji7JK1OP1qS7zalwUBPrX0&index=15

Regression to the mean

Watch this video before beginning.⁴⁴

A historically famous idea, regression to the mean

Here is a fundamental question.

Why is it that the children of tall parents tend to be tall, but not as tall as their parents? Why do children of short parents tend to be short, but not as short as their parents? Conversely, why do parents of very short children tend to be short, but not as short as their child? And the same with parents of very tall children?

We can try this with anything that is measured with error. Why do the best performing athletes this year tend to do a little worse the following year? Why do the best performers on hard exams always do a little worse on the next hard exam?

These phenomena are all examples of so-called regression to the mean.

The concept of regression to the mean was introduced by Francis Galton in the paper “Regression towards mediocrity in hereditary stature”, The Journal of the Anthropological Institute of Great Britain and Ireland, Vol. 15 (1886). The idea served as a foundation for the discovery of linear regression.

Think of it this way: imagine that you simulated pairs of random normals.
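One way to carry out this thought experiment is sketched below. The settings are assumptions made for illustration only: the pairs are standard normals with correlation 0.5, and "extreme" is taken to mean the top 5 percent of first values.

```r
# Pairs of standard normals with correlation rho; rho, n, the seed, and the
# 5 percent cutoff are illustrative assumptions.
set.seed(4)
n <- 10000
rho <- 0.5
x <- rnorm(n)
y <- rho * x + sqrt(1 - rho^2) * rnorm(n)

# Select the pairs whose first value is in the top 5 percent.
top <- x > quantile(x, 0.95)

# Average of the extreme first values versus their paired second values.
c(first = mean(x[top]), second = mean(y[top]))

# Least squares fit of the second value on the first; for normalized
# variables the slope estimates the correlation.
coef(lm(y ~ x))
```

In runs like this, the second values paired with the most extreme first values are, on average, still above the mean but closer to it, which is the pattern described for tall parents and their children.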

Characteristics

File type: PDF file
Size: 3.77 MB
Material: Regression models for data science
Material type: Book
Subject: Mathematical modeling
Institution of higher education: Bauman Moscow State Technical University (МГТУ им. Н.Э.Баумана)

Book file list

regression-models-for-data-sciense-455223438-1514184926.rar
Regression models for data sciense.pdf
