The largest first values would be the largest by chance, and the probability that the second simulation produces smaller values is high. In other words, P(Y < x | X = x) gets bigger as x heads to very large values. Similarly, P(Y > x | X = x) gets bigger as x heads to very small values. Think of the regression line as the intrinsic part and the regression to the mean as the result of noise. Unless Cor(Y, X) = 1 the intrinsic part isn't perfect, and so we should think about how much regression to the mean should occur.
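To make this concrete, here is a minimal simulation sketch (my own, not from the original text): we generate pairs with a chosen correlation rho and check that the second values paired with extreme first values are, on average, less extreme. The seed, sample size, and rho are arbitrary.

```r
# Hypothetical simulation of regression to the mean: generate pairs
# (x, y) with correlation rho and inspect the y's paired with extreme x's.
set.seed(1234)                             # arbitrary seed for reproducibility
n   <- 10000
rho <- 0.5                                 # assumed correlation for illustration
x <- rnorm(n)
y <- rho * x + sqrt(1 - rho^2) * rnorm(n)  # construction gives Cor(x, y) = rho
top <- x > quantile(x, 0.95)               # the most extreme 5% of first values
mean(x[top])                               # about 2.06: extreme by construction
mean(y[top])                               # about rho * 2.06: pulled toward 0
```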
In other words, what should we multiply tall parents' heights by to predict their children's heights?

Let's investigate this with Galton's father and son data. Suppose that we normalize X (father's height) and Y (son's height) so that they both have mean 0 and variance 1. Then, recall, our regression line passes through (0, 0) (the mean of the X and Y values). The slope of the regression line is Cor(Y, X), regardless of which variable is the outcome (recall, both standard deviations are 1). Notice that if X is the outcome and you create a plot where X is the horizontal axis, the slope of the least squares line that you plot is 1/Cor(Y, X). Let's plot the normalized father and son heights to investigate.

Code for the plot:

```r
library(UsingR)
data(father.son)
y <- (father.son$sheight - mean(father.son$sheight)) / sd(father.son$sheight)
x <- (father.son$fheight - mean(father.son$fheight)) / sd(father.son$fheight)
rho <- cor(x, y)
library(ggplot2)
g = ggplot(data.frame(x, y), aes(x = x, y = y))
g = g + geom_point(size = 5, alpha = .2, colour = "black")
g = g + geom_point(size = 4, alpha = .2, colour = "red")
g = g + geom_vline(xintercept = 0)
g = g + geom_hline(yintercept = 0)
g = g + geom_abline(intercept = 0, slope = 1)          # identity line
g = g + geom_abline(intercept = 0, slope = rho, size = 2)
g = g + geom_abline(intercept = 0, slope = 1 / rho, size = 2)
g = g + xlab("Father's height, normalized")
g = g + ylab("Son's height, normalized")
g
```

[Figure: Regression to the mean, illustrated.]

Let's investigate the plot and the regression fits.
If you had to predict a son's normalized height, it would be Cor(Y, X) * Xi, where Xi is the normalized father's height. Conversely, if you had to predict a father's normalized height, it would be Cor(Y, X) * Yi. Multiplication by this correlation shrinks the prediction toward 0 (regression toward the mean).
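As a quick illustration (my own sketch, reusing x, y, and rho from the plot code above), the predictions are just the correlation times the observed normalized value:

```r
# Predictions on the normalized scale, using x, y, rho from the plot code:
predicted_son    <- rho * x   # predicted son's height from the father's
predicted_father <- rho * y   # predicted father's height from the son's
# e.g., a father 2 SDs above the mean gives a predicted son rho * 2 SDs above:
rho * 2
```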
It is in this way that Galton used regression to account for regression toward the mean. If the correlation is 1 there is no regression to the mean (father's height would perfectly determine child's height and vice versa). Note that since Galton's original seminal paper, the idea of regression to the mean has been generalized and expanded upon. However, the core remains: in paired measurements, if there's randomness, then the extreme values of one element of the pair will likely be less extreme in the other element. The number of applications of this phenomenon is staggering. Some financial advisors recommend putting your money in your worst performing fund because of regression to the mean. (If there's a lot of noise, those are the most likely to gain in value.) An example that I've run into is that students performing the best on midterm exams often do much worse on the final.
Athletes often follow a phenomenal season with merely a good season. Whenever paired observations are being evaluated, it's a useful exercise to think about whether real intrinsic properties are being discussed or just regression to the mean.

Exercises

1. You have two noisy scales and a bunch of people that you'd like to weigh. You weigh each person on both scales. The correlation was 0.75. If you normalized each set of weights, what would you have to multiply the weight on one scale by to get a good estimate of the weight on the other scale? Watch a video solution.⁴⁵
2. Consider the previous problem. Someone's weight was 2 standard deviations above the mean of the group on the first scale. How many standard deviations above the mean would you estimate them to be on the second? Watch a video solution.⁴⁶
3. You ask a collection of husbands and wives to guess how many jellybeans are in a jar. The correlation is 0.2. The standard deviation for the husbands is 10 beans while the standard deviation for the wives is 8 beans. Assume that the data were centered so that 0 is the mean for each. The centered guess for a husband was 30 beans (above the mean).
What would be your best estimate of the wife's guess? Watch a video solution.⁴⁷

⁴⁵https://youtu.be/rZsnJ0EzVHo
⁴⁶http://youtu.be/2lHYXeRl0eg
⁴⁷https://youtu.be/htFH-4-vjS8

Statistical linear regression models

Watch this video before beginning.⁴⁸
⁴⁸https://www.youtube.com/watch?v=ewS1Kkzl8mw&list=PLpl-gQkQivXjqHAJd2t-J_One_fYE55tC&index=10

Up to this point, we've only considered estimation. Estimation is useful, but we also need to know how to extend our estimates to a population. This is the process of statistical inference. Our approach to statistical inference will be through a statistical model. At the bare minimum, we need a few distributional assumptions on the errors.
However, we'll focus on full model assumptions under Gaussianity.

Basic regression model with additive Gaussian errors

Consider developing a probabilistic model for linear regression. Our starting point will assume a systematic component via a line and then independent and identically distributed Gaussian errors. We can write the model out as:

Yi = β0 + β1 Xi + ϵi

Here, the ϵi are assumed to be independent and identically distributed as N(0, σ²).
Under this model,

E[Yi | Xi = xi] = µi = β0 + β1 xi

and

Var(Yi | Xi = xi) = σ².

This model implies that the Yi are independent and normally distributed with means β0 + β1 xi and variance σ². We could write this more compactly as

Yi | Xi = xi ∼ N(β0 + β1 xi, σ²).

While this specification of the model is perhaps better for advanced purposes, specifying the model as linear with additive error terms is generally more useful. With that specification, we can hypothesize about and discuss the nature of the errors. In fact, we'll even cover ways to estimate them to investigate our model assumptions.

Remember that our least squares estimates of β0 and β1 are:

β̂1 = Cor(Y, X) Sd(Y) / Sd(X) and β̂0 = Ȳ − β̂1 X̄.

It is convenient that, under our Gaussian additive error model, the maximum likelihood estimates of β0 and β1 are the least squares estimates.
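As a quick numerical check (a sketch with made-up parameter values, not from the text), we can simulate from this model and verify that lm agrees with the formulas above:

```r
# Simulate from Y = beta0 + beta1 * X + epsilon with made-up parameters,
# then compare lm() to the least squares formulas.
set.seed(42)                           # arbitrary seed
n <- 100
x <- rnorm(n)
y <- 0.5 + 2 * x + rnorm(n, sd = 0.5)  # beta0 = 0.5, beta1 = 2, sigma = 0.5
beta1hat <- cor(y, x) * sd(y) / sd(x)
beta0hat <- mean(y) - beta1hat * mean(x)
c(beta0hat, beta1hat)                  # hand computation
coef(lm(y ~ x))                        # matches the hand computation
```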
Interpreting regression coefficients, the intercept

Watch this video before beginning.⁴⁹
⁴⁹https://www.youtube.com/watch?v=71dDzKPYEdU&list=PLpl-gQkQivXjqHAJd2t-J_One_fYE55tC&index=11

Our model allows us to attach statistical interpretations to our parameters. Let's start with the intercept; β0 represents the expected value of the response when the predictor is 0. We can show this as:

E[Y | X = 0] = β0 + β1 × 0 = β0.

Note, the intercept isn't always of interest. For example, when X = 0 is impossible or far outside of the range of the data.
Take as a specific instance when X is blood pressure; no one is interested in studying blood pressure's impact on anything for values near 0.

There is a way to make your intercept more interpretable. Consider that:

Yi = β0 + β1 Xi + ϵi = β0 + aβ1 + β1 (Xi − a) + ϵi = β̃0 + β1 (Xi − a) + ϵi.

Therefore, shifting your X values by a changes the intercept, but not the slope. Often a is set to X̄, so that the intercept is interpreted as the expected response at the average X value.
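Here is a minimal sketch of this recentering (reusing the simulated x and y from the check above):

```r
# Centering the predictor at its mean: the slope is unchanged and the
# intercept becomes the expected response at the average x (here, mean(y)).
coef(lm(y ~ x))
coef(lm(y ~ I(x - mean(x))))   # same slope; intercept equals mean(y)
```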
Interpreting regression coefficients, the slope

Now that we understand how to interpret the intercept, let's try interpreting the slope. Our slope, β1, is the expected change in response for a 1 unit change in the predictor. We can show that as follows:

E[Y | X = x + 1] − E[Y | X = x] = β0 + β1 (x + 1) − (β0 + β1 x) = β1.

Notice that the interpretation of β1 is tied to the units of the X variable. Let's consider the impact of changing the units:

Yi = β0 + β1 Xi + ϵi = β0 + (β1 / a)(a Xi) + ϵi = β0 + β̃1 (a Xi) + ϵi.

Therefore, multiplication of X by a factor a results in dividing the coefficient by a factor of a. As an example, suppose that X is height in meters (m) and Y is weight in kilograms (kg). Then β1 has units kg/m.
Converting X to centimeters implies multiplying X by 100 cm/m. To get β1 in the right units if we had fit the model in meters, we have to divide by 100 cm/m. Or, we can write out the notation as:

X m × (100 cm / 1 m) = (100 X) cm and β1 (kg/m) × (1 m / 100 cm) = (β1 / 100) kg/cm.
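A quick numerical version of the same point (again reusing the simulated x and y): multiplying the predictor by 100 divides the fitted slope by 100.

```r
# Rescaling the predictor by a = 100 (e.g., meters to centimeters)
# divides the fitted slope by 100; the intercept is unchanged.
coef(lm(y ~ x))
coef(lm(y ~ I(x * 100)))   # slope is 1/100 of the slope above
```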
Using regression for prediction

Watch this video before beginning.⁵⁰
⁵⁰https://www.youtube.com/watch?v=5isJA7T5_VE&list=PLpl-gQkQivXjqHAJd2t-J_One_fYE55tC&index=12

Regression is quite useful for prediction. If we would like to guess the outcome at a particular value of the predictor, say X, the regression model guesses:

β̂0 + β̂1 X

In other words, just find the Y value on the line at the corresponding X value. Regression, especially linear regression, often doesn't produce the best prediction algorithms. However, it produces parsimonious and interpretable models along with the predictions. Also, as we'll see later, we'll be able to get easily described estimates of the uncertainty associated with our predictions.

Example

Let's analyze the diamond data set from the UsingR package. The data consists of diamond prices (in Singapore dollars) and diamond weights in carats. Carats are a standard measure of diamond mass, 0.2 grams. To get the data, use library(UsingR); data(diamond).

First let's plot the data. Here's the code I used:

```r
library(UsingR)
data(diamond)
library(ggplot2)
g = ggplot(diamond, aes(x = carat, y = price))
g = g + xlab("Mass (carats)")
g = g + ylab("Price (SIN $)")
g = g + geom_point(size = 7, colour = "black", alpha = 0.5)
g = g + geom_point(size = 5, colour = "blue", alpha = 0.2)
g = g + geom_smooth(method = "lm", colour = "black")
g
```

[Figure: Plot of the diamond data, price by mass in carats.]

First, let's fit the linear regression model.
This is done with the lm function in R (lm stands for linear model). The syntax is lm(Y ~ X) where Y is the response and X is the predictor.

```r
> fit <- lm(price ~ carat, data = diamond)
> coef(fit)
(Intercept)       carat
     -259.6      3721.0
```

The function coef grabs the fitted coefficients and conveniently names them for you. Therefore, we estimate an expected 3721.02 (SIN) dollar increase in price for every carat increase in the mass of a diamond.
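To use the fit for prediction at, say, a 0.20 carat diamond (an arbitrary example value, not from the text), we can compute the fitted line by hand or with predict:

```r
# Prediction at an arbitrary mass of 0.20 carats, by hand and via predict():
coef(fit)[1] + coef(fit)[2] * 0.20
predict(fit, newdata = data.frame(carat = 0.20))
```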