# Regression Models for Data Science
  Interpreting Logistic Regression
  Visualizing fitting logistic regression curves
  Ravens logistic regression
  Some summarizing comments
  Exercises

Count data
  Poisson distribution
  Poisson distribution
  Linear regression
  Poisson regression
  Mean-variance relationship
  Rates
  Exercises

Bonus material
  How to fit functions using linear models
  Notes
  Harmonics using linear models
  Thanks!

# Preface

## About this book

This book is written as a companion book to the Regression Models¹ Coursera class as part of the Data Science Specialization². However, if you do not take the class, the book mostly stands on its own.
A useful component of the book is a series of YouTube videos³ that comprise the Coursera class.

The book is intended to be a low-cost introduction to the important field of regression models. The intended audience is students who are numerically and computationally literate and who would like to put those skills to use in Data Science or Statistics. The book is offered for free as a series of markdown documents on GitHub and in more convenient forms (epub, mobi) on LeanPub.

This book is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License⁴, which requires author attribution for derivative works, non-commercial use of derivative works, and that changes are shared in the same way as the original work.

## About the cover

The picture on the cover is a public domain image taken from Francis Galton’s paper on hereditary stature.
It represents an important leap in the development of regression and correlation, as well as regression to the mean.

¹https://www.coursera.org/course/regmods
²https://www.coursera.org/specialization/jhudatascience/1?utm_medium=courseDescripTop
³https://www.youtube.com/playlist?list=PLpl-gQkQivXjqHAJd2t-J_One_fYE55tC
⁴http://creativecommons.org/licenses/by-nc-sa/4.0/

# Introduction

## Before beginning

This book is designed as a companion to the Regression Models⁵ Coursera class as part of the Data Science Specialization⁶, a ten-course program offered by three faculty, Jeff Leek, Roger Peng and Brian Caffo, at the Johns Hopkins University Department of Biostatistics.

The videos associated with this book can be watched in full here⁷, though the relevant links to specific videos are placed at the appropriate locations throughout.

Before beginning, we assume that you have a working knowledge of the R programming language. If not, there is a wonderful Coursera class by Roger Peng that can be found here⁸.
In addition, students should know the basics of frequentist statistical inference. There is a Coursera class here⁹ and a LeanPub book here¹⁰.

The entirety of the book is on GitHub here¹¹. Please submit pull requests if you find errata! In addition, the course notes can also be found on GitHub here¹². While most code is in the book, all of the code for every figure and analysis in the book is in the R markdown files (.Rmd) for the respective lectures.

Finally, we should mention swirl (statistics with interactive R programming). swirl is an intelligent tutoring system developed by Nick Carchedi, with contributions by Sean Kross and Bill and Gina Croft.
It offers a way to learn R in R. Download swirl here¹³. There’s a swirl module for this course¹⁴! Try it out; it’s probably the most effective way to learn.

⁵https://www.coursera.org/course/regmods
⁶https://www.coursera.org/specialization/jhudatascience/1?utm_medium=courseDescripTop
⁷https://www.youtube.com/playlist?list=PLpl-gQkQivXjqHAJd2t-J_One_fYE55tC
⁸https://www.coursera.org/course/rprog
⁹https://www.coursera.org/course/statinference
¹⁰https://leanpub.com/LittleInferenceBook
¹¹https://github.com/bcaffo/regmodsbook
¹²https://github.com/bcaffo/courses/tree/master/07_RegressionModels
¹³http://swirlstats.com
¹⁴https://github.com/swirldev/swirl_courses#swirl-courses

## Regression models

Watch this video before beginning¹⁵

¹⁵https://www.youtube.com/watch?v=58ZPhK32sU8&index=1&list=PLpl-gQkQivXjqHAJd2t-J_One_fYE55tC

Regression models are the workhorse of data science. They are the most well described, practical, and theoretically understood models in statistics.
A data scientist well versed in regression models will be able to solve an incredible array of problems.

Perhaps the key insight for regression models is that they produce highly interpretable model fits. This is unlike machine learning algorithms, which often sacrifice interpretability for improved prediction performance or automation. These are, of course, valuable attributes in their own right. However, the benefit of simplicity, parsimony and interpretability offered by regression models (and their close generalizations) should make them a first tool of choice for any practical problem.

## Motivating examples

### Francis Galton’s height data

Francis Galton, the 19th century polymath, can be credited with discovering regression. In his landmark paper Regression Toward Mediocrity in Hereditary Stature¹⁶ he compared the heights of parents and their children. He was particularly interested in the idea that the children of tall parents tended to be tall also, but a little shorter than their parents.
Children of short parents tended to be short, but not quite as short as their parents. He referred to this as “regression to mediocrity” (or regression to the mean). In quantifying regression to the mean, he invented what we would call regression.

It is perhaps surprising that Galton’s specific work on height is still relevant today. In fact, this European Journal of Human Genetics manuscript¹⁷ compares Galton’s prediction models with those using modern high-throughput genomic technology (spoiler alert: Galton wins).

Some questions from Galton’s data come to mind. How would one fit a model that relates parent and child heights? How would one predict a child’s height based on their parents’? How would we quantify regression to the mean? In this class, we’ll answer all of these questions, plus many more.

### Simply Statistics versus Kobe Bryant

Simply Statistics¹⁸ is a blog by Jeff Leek, Roger Peng and Rafael Irizarry.
It is one of the most widely read statistics blogs, written by three of the top statisticians in academics. Rafa wrote a (somewhat tongue-in-cheek) post regarding ball hogging¹⁹ among NBA basketball players. (By the way, your author has played basketball with Rafael, who is quite good, but certainly doesn’t pass up shots; glass houses and whatnot.)

¹⁶http://galton.org/essays/1880-1889/galton-1886-jaigi-regression-stature.pdf
¹⁷http://www.nature.com/ejhg/journal/v17/n8/full/ejhg20095a.html
¹⁸http://simplystatistics.org/
¹⁹http://simplystatistics.org/2013/01/28/data-supports-claim-that-if-kobe-stops-ball-hogging-the-lakers-will-win-more/

Here are some key sentences:

• “Data supports the claim that if Kobe stops ball hogging the Lakers will win more”
• “Linear regression suggests that an increase of 1% in % of shots taken by Kobe results in a drop of 1.16 points (+/- 0.22) in score differential.”

In this book we will cover how to create summary statements like this using regression model building.
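Here is a minimal sketch of the kind of model behind a statement like that. The numbers below are simulated purely for illustration; they are not the data from Rafa’s post, so the fitted slope only loosely echoes the quoted 1.16:

```r
# Simulated illustration of regressing score differential on shot percentage.
# These values are invented for this sketch; they are NOT the real game data.
set.seed(1)
shots_pct  <- c(25, 28, 30, 32, 35, 38, 40, 42, 45, 48)   # % of team shots taken
score_diff <- 30 - 1.16 * shots_pct + rnorm(10, sd = 3)   # assumed relationship + noise
fit <- lm(score_diff ~ shots_pct)
summary(fit)$coefficients
# The "shots_pct" row estimates the change in score differential per
# one percentage point increase in shots taken, with its standard error.
```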
Note the nice interpretability of the linear regression model. With this model, Rafa numerically relates the impact of more shots taken on score differential.

## Summary notes: questions for this book

Regression models are incredibly handy statistical tools. One can use them to answer all sorts of questions. Consider three of the most common tasks for regression models (a short R preview follows the list):

1. Prediction. Eg: to use the parents’ heights to predict children’s heights.
2. Modeling. Eg: to try to find a parsimonious, easily described mean relationship between parental and child heights.
3. Covariation. Eg: to investigate the variation in child heights that appears unrelated to parental heights (residual variation), and to quantify what impact genotype information has beyond parental height in explaining child height.
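As a concrete preview, each of the three tasks maps onto a single line of a fitted linear model. This sketch uses the galton dataset from the UsingR package, which the next section introduces properly; the tools themselves are developed over the coming chapters:

```r
# A minimal sketch of the three tasks with Galton's data.
library(UsingR); data(galton)
fit <- lm(child ~ parent, data = galton)

predict(fit, newdata = data.frame(parent = 70))  # 1. Prediction: expected child height
coef(fit)                                        # 2. Modeling: intercept and slope
var(resid(fit))                                  # 3. Covariation: residual variation
```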
An important aspect, especially in questions 2 and 3, is assessing modeling assumptions. For example, it is important to figure out how, whether, and what assumptions are needed to generalize findings beyond the data in question. Presumably, if we find a relationship between parental and child heights, we’d like to extend that knowledge beyond the data used to build the model. This requires assumptions. In this book, we’ll cover the main assumptions necessary.

## Exploratory analysis of Galton’s Data

Watch this video before beginning²⁰

Let’s look at the data first. This data was created by Francis Galton in 1885.
Galton was a statistician who invented the term and concepts of regression and correlation, founded the journal Biometrika, and was the cousin of Charles Darwin.

You may need to run install.packages("UsingR") if the UsingR library is not installed. Let’s look at the marginal (parents disregarding children and children disregarding parents) distributions first. The parental distribution is all heterosexual couples. The parental average was corrected for gender by multiplying female heights by 1.08. Remember, Galton didn’t have regression to help figure out a better way to do this correction!

²⁰https://www.youtube.com/watch?v=1akVPR0LDsg&index=2&list=PLpl-gQkQivXjqHAJd2t-J_One_fYE55tC

Loading and plotting Galton’s data:

```r
library(UsingR); data(galton); library(reshape)
library(ggplot2)                     # loaded explicitly; needed for ggplot
long <- melt(galton)                 # stack child and parent heights into long format
g <- ggplot(long, aes(x = value, fill = variable))
g <- g + geom_histogram(colour = "black", binwidth = 1)
g <- g + facet_grid(. ~ variable)    # one panel per variable (child, parent)
g
```
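The faceted histogram shows the two marginal distributions side by side. If you also want quick numeric summaries to accompany the plot, they can be computed directly; this small addition is not part of the book’s original listing:

```r
# Numeric summaries of the same two marginal distributions
summary(galton$child)
summary(galton$parent)
```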