Even more interesting is that by the wisdom of the crowd, the final probabilities tend to match the percentage of times that event happens. That is, house declared 50 to 1 horses tend to win about 1 out of 51 times and lose 50 out of 51 times, even though that 50 to 1 was set by a random collection of mostly uninformed bettors. This is why even your sports junkie friend with seemingly endless up to date sports knowledge can't make a killing betting on sports; the crowd is just too smart as a group even though most of the individuals know much less.

Finally, and then I'll get back on track, many of the machine learning algorithms work on this principle of the wisdom of crowds: many small dumb models can make really good models.
This is often called ensemble learning, where a lot of independent weak classifiers are combined to make a strong one. Random forests and boosting come to mind as examples. In the Coursera Practical Machine Learning class, we cover ensemble learning algorithms.

Getting back to the issues at hand, recall that probabilities are between 0 and 1. Odds are between 0 and infinity. So, it makes sense to model the log of the odds (the logit), since it goes from minus infinity to plus infinity. The log of the odds is called the logit:

{$$}g = \mbox{logit}(p) = \log\{p / (1 - p)\}{$$}

We can go backwards from the logit to the probability with the so-called expit (inverse logit):

{$$}\mbox{expit}(g) = e^g / (1 + e^g) = 1 / (1 + e^{-g}) = p.{$$}
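As an aside (a quick sketch of my own, not one of the book's examples), base R provides these two transformations as qlogis (the logit) and plogis (the expit), so it's easy to confirm that they undo each other:

p = 0.75                 # an arbitrary probability
g = qlogis(p)            # logit: log(p / (1 - p)), about 1.1
plogis(g)                # expit: returns 0.75 again
log(p / (1 - p))         # same value as qlogis(p)
1 / (1 + exp(-g))        # same value as plogis(g)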
Modeling the odds

Let's come up with notation for modeling the odds. Recall that Yi was our outcome, a 0 or 1 indicator of whether or not the Ravens won game i. Let

{$$}p_i = P(Y_i = 1 ~|~ X_i = x_i, \beta_0, \beta_1){$$}

be the probability of a win for number of points {$$}x_i{$$}. Then the odds that they win is {$$}p_i / (1 - p_i){$$} and the log odds is {$$}\log\{p_i / (1 - p_i)\} = \mbox{logit}(p_i){$$}.

Logistic regression puts the model on the log of the odds (logit) scale. In other words,

{$$}\mbox{logit}(p_i) = \beta_0 + \beta_1 x_i{$$}

Or, equivalently, we could just say

{$$}P(Y_i = 1 ~|~ X_i = x_i, \beta_0, \beta_1) = p_i = \frac{\exp(\beta_0 + \beta_1 x_i)}{1 + \exp(\beta_0 + \beta_1 x_i)}{$$}

Interpreting Logistic Regression

Recall our model is:

{$$}\mbox{logit}\{P(Y_i = 1 ~|~ X_i = x_i)\} = \beta_0 + \beta_1 x_i{$$}

Let's write this as:

{$$}\log\{O(Y_i = 1 ~|~ X_i = x_i)\} = \beta_0 + \beta_1 x_i{$$}

where {$$}O(Y_i = 1 ~|~ X_i = x_i){$$} refers to the odds.
Interpreting {$$}\beta_0{$$} is straightforward: it's the log of the odds of the Ravens winning for a 0 point game. Just like in regular regression, for this to have meaning, a 0 X value has to have meaning. In this case, there's a structural consideration that's being ignored in that the Ravens can't win if they score 0 points (they can only tie or, much more likely, lose). This is an unfortunate assumption of our model.

For interpreting the β1 coefficient, consider the following:

{$$}\log\{O(Y_i = 1 ~|~ X_i = x_i + 1)\} - \log\{O(Y_i = 1 ~|~ X_i = x_i)\} = \log\left\{\frac{O(Y_i = 1 ~|~ X_i = x_i + 1)}{O(Y_i = 1 ~|~ X_i = x_i)}\right\} = \beta_1{$$}

So β1 is the log of the relative increase in the odds of the Ravens winning for a one point increase in score.
The ratio of two odds is called, not surprisingly, the odds ratio. So β1 is the log odds ratio of the Ravens winning associated with a one point increase in score.

We can get rid of the log by exponentiating and then get that exp(β1) is the odds ratio associated with a one point increase in score. It's a nifty fact that you can often perform this exponentiation in your head, since for numbers close to zero, exponentiation is about 1 + that number. So, if you have a logistic regression slope coefficient of 0.01, you know that e to that coefficient is about 1.01. So you know that the coefficient estimates a 1% increase in the odds of a success for every 1 unit increase in the regressor.
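As a quick check of that mental shortcut (my own sketch with a made-up coefficient, not from the text):

beta1 = 0.01                  # a hypothetical slope coefficient
exp(beta1)                    # about 1.01005, i.e., roughly 1 + 0.01
(exp(beta1) - 1) * 100        # approximately a 1% increase in the odds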
Visualizing fitting logistic regression curves

Watch this video before beginning.¹¹³

¹¹³https://youtu.be/-g3wtUAW1rU?list=PLpl-gQkQivXjqHAJd2t-J_One_fYE55tC

Let's visualize what the logistic regression model is fitting. Consider setting β0 to 0 and varying β1 for X, a regressor equally spaced between -10 and 10. Notice that the logistic curves vary in their curvature.

x = seq(-10, 10, length = 1000)
beta0 = 0; beta1s = seq(.25, 1.5, by = .1)
plot(c(-10, 10), c(0, 1), type = "n", xlab = "X", ylab = "Probability", frame = FALSE)
sapply(beta1s, function(beta1) {
  y = 1 / (1 + exp( -1 * ( beta0 + beta1 * x ) ))
  lines(x, y, type = "l", lwd = 3)
})

Plot of logistic curves for varying slope coefficients.

Try making the slope negative and see what happens.
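For instance, a minimal self-contained sketch of my own, analogous to the code above, with β0 = 0 and β1 = -1:

x = seq(-10, 10, length = 1000)
y = 1 / (1 + exp(-1 * (0 - 1 * x)))   # negative slope: beta0 = 0, beta1 = -1
plot(x, y, type = "l", lwd = 3, frame = FALSE, xlab = "X", ylab = "Probability")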
(It flips the curve from increasing to decreasing.) Now let's hold β1 fixed and vary β0.

x = seq(-10, 10, length = 1000)
beta0s = seq(-2, 2, by = .5); beta1 = 1
plot(c(-10, 10), c(0, 1), type = "n", xlab = "X", ylab = "Probability", frame = FALSE)
sapply(beta0s, function(beta0) {
  y = 1 / (1 + exp( -1 * ( beta0 + beta1 * x ) ))
  lines(x, y, type = "l", lwd = 3)
})

Plot of logistic curves for varying intercepts.

Notice that varying the intercept shifts the curve back and forth.
Let's superimpose some data with the fitted curve.

x = seq(-10, 10, length = 1000)
beta0 = 0; beta1 = 1
p = 1 / (1 + exp(-1 * (beta0 + beta1 * x)))
y = rbinom(prob = p, size = 1, n = length(p))
plot(x, y, frame = FALSE, xlab = "x", ylab = "y")
lines(lowess(x, y), type = "l", col = "blue", lwd = 3)
fit = glm(y ~ x, family = binomial)
lines(x, predict(fit, type = "response"), lwd = 3, col = "red")

Image of simulated binary data, fitted model (red) and lowess smooth (blue).

The plot above shows the simulated binary data (black points), the fitted logistic curve (red) and a lowess smoother through the data (blue). The lowess smoother shows a non-parametric estimate of the probability of a success at each x value.
Think of it as a moving proportion. Logistic regression gets to move around the intercept and slope of the logistic curve to fit the data well. Here the fit says that the probability of a 1 for low values of x is very small, the probability of a 1 for high values of x is high and it is intermediate at the points in the middle.

Ravens logistic regression

Watch this video before beginning.¹¹⁴

¹¹⁴https://youtu.be/lZCCj-IxYOA?list=PLpl-gQkQivXjqHAJd2t-J_One_fYE55tC

Now let's run our binary regression model on the Ravens data.

> logRegRavens = glm(ravensData$ravenWinNum ~ ravensData$ravenScore, family = "binomial")
> summary(logRegRavens)

Call:
glm(formula = ravensData$ravenWinNum ~ ravensData$ravenScore,
    family = "binomial")

Deviance Residuals:
   Min     1Q Median     3Q    Max
-1.758 -1.100  0.530  0.806  1.495

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)
(Intercept)            -1.6800     1.5541   -1.08     0.28
ravensData$ravenScore   0.1066     0.0667    1.60     0.11

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 24.435  on 19  degrees of freedom
Residual deviance: 20.895  on 18  degrees of freedom
AIC: 24.89

Number of Fisher Scoring iterations: 5

## plotting the fit
> plot(ravensData$ravenScore, logRegRavens$fitted, pch = 19, col = "blue",
       xlab = "Score", ylab = "Prob Ravens Win")

Fitted model for the Ravens data.

In this case, the data only covers some of the logistic curve, so that the full "S" of the curve isn't visible.
To interpret our coefficients, let's exponentiate them.

> exp(logRegRavens$coeff)
          (Intercept) ravensData$ravenScore
               0.1864                1.1125
> exp(confint(logRegRavens))
                         2.5 % 97.5 %
(Intercept)           0.005675  3.106
ravensData$ravenScore 0.996230  1.303

The first line of code shows that the exponentiated slope coefficient is 1.11. Thus, we estimate an 11% increase in the odds of winning per 1 point increase in score.
However, the data are variable and the confidence interval goes from 0.996 to 1.303. Since this interval contains 1 (or contains 0 on the log scale), it's not statistically significant. (It's pretty close, though.)

If we had included another variable in our model, say a home versus away game indicator, then our slope would be interpreted holding the value of that covariate fixed, just like in multivariable regression. We can also compare nested models using ANOVA and, by and large, our general model discussion carries over to this setting as well.
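For instance, a hedged sketch of what that comparison might look like (homeIndicator is a hypothetical 0/1 home-game variable, not a column of the ravensData dataset used here):

## hypothetical nested comparison: score only vs. score plus a home indicator
fit1 = glm(ravenWinNum ~ ravenScore, family = binomial, data = ravensData)
fit2 = glm(ravenWinNum ~ ravenScore + homeIndicator, family = binomial, data = ravensData)
anova(fit1, fit2, test = "Chisq")   # analysis of deviance for the nested models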
Some summarizing comments

Odds aren't probabilities. In binary GLMs, we model the log of the odds (logit) and our slope parameters are interpreted as log odds ratios. Odds ratios of 1, or log odds ratios of 0, are interpreted as no effect of the regressor on the outcome.

Exercises

1. Load the dataset Seatbelts as part of the datasets package via data(Seatbelts). Use as.data.frame to convert the object to a dataframe. Create a new outcome variable for whether or not greater than 119 drivers were killed that month. Fit a logistic regression GLM with this variable as the outcome and kms, PetrolPrice and law as predictors. Interpret your parameters. (A sketch of one possible setup appears after this exercise list.) Watch a video solution.¹¹⁵
2. Fit a binomial model with DriversKilled as the outcome and drivers as the total count, with kms, PetrolPrice and law as predictors; interpret your results. Watch a video solution.¹¹⁶
3. Refer to Question 1. Use the anova function to compare models with just law, law and PetrolPrice, and all three predictors. Watch a video solution.¹¹⁷

¹¹⁵https://www.youtube.com/watch?v=CXWZqzKdkp4&list=PLpl-gQkQivXji7JK1OP1qS7zalwUBPrX0&index=52
¹¹⁶https://www.youtube.com/watch?v=M2KLD_ZFgdo&list=PLpl-gQkQivXji7JK1OP1qS7zalwUBPrX0&index=53
¹¹⁷https://www.youtube.com/watch?v=npHpBLqkhLg&list=PLpl-gQkQivXji7JK1OP1qS7zalwUBPrX0&index=54
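For Exercise 1, a hedged sketch of one possible setup (the linked video gives the authoritative solution; the cutoff and predictor names come from the exercise statement, while manyKilled and fit are illustrative names of my own):

library(datasets)
data(Seatbelts)
seatbelts = as.data.frame(Seatbelts)
## binary outcome: more than 119 drivers killed that month
seatbelts$manyKilled = 1 * (seatbelts$DriversKilled > 119)
fit = glm(manyKilled ~ kms + PetrolPrice + law, family = binomial, data = seatbelts)
summary(fit)$coef
exp(coef(fit))   # exponentiated coefficients are odds ratios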
Count data

Watch this video before beginning.¹¹⁸

Acknowledgement to Jeff Leek for much of the code and organization of this chapter.

Many data take the form of unbounded count data. For example, consider the number of calls to a call center, the number of flu cases in an area or the number of hits to a web site.

In some of these cases the counts are clearly bounded. However, modeling the counts as unbounded is often done when the upper limit is not known or very large relative to the number of events. If the upper bound is known, the techniques we're discussing can be used to model the proportion or rate. The starting point for most count analysis is the Poisson distribution.

Poisson distribution

The Poisson distribution is the go-to distribution for modeling counts and rates. We'll define a rate as a count per unit of time.
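As a small illustration (my own sketch, not from the text), base R has the Poisson mass function and a Poisson random number generator built in:

## Poisson probabilities for counts 0 through 10 when the mean count is 2
round(dpois(0:10, lambda = 2), 3)
## simulated counts; a Poisson's mean and variance are both lambda
x = rpois(10000, lambda = 2)
mean(x); var(x)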