Regression models for data sciense (779323), страница 19
Текст из файла (страница 19)
For example your heart rate is often expressed in beats per minute. So,we might look at web hits per day, or disease cases per exposure time (incidence rates). Also, thoughnot exactly a rate, we can treat proportions as if rates when n is large and the success probability issmall.We would write that a random variable is Poisson, X ∼ Poisson(tλ), if its density function is:P (X = x) =(tλ)x e−tλx!where x = 0, 1, . .
.. The mean of the Poisson is E[X] = tλ, thus E[X/t] = λ. The variance ofthe Poisson is V ar(X) = tλ. The Poisson tends to a normal as tλ gets large and approximates abinomial with large n and small p where we would think of tλ as np.Here are some plots of the Poisson density to illustrate how it closely approximates a normal.par(mfrow = c(1, 3))plot(0 : 10, dpois(0 : 10, lambda = 2), type = "h", frame = FALSE)plot(0 : 20, dpois(0 : 20, lambda = 10), type = "h", frame = FALSE)plot(0 : 200, dpois(0 : 200, lambda = 100), type = "h", frame = FALSE)¹¹⁸https://youtu.be/YtotMuVmOUM?list=PLpl-gQkQivXjqHAJd2t-J_One_fYE55tC117Count dataPoisson densities as the mean increases.Poisson distributionLet’s analyze some data using the Poisson distribution. Consider the daily counts to Jeff Leek’s website: http://biostat.jhsph.edu/∼jleek/¹¹⁹We’re interested in the number of hits per day.
Since the unit of time is always one day, set t = 1and then the Poisson mean is interpreted as web hits per day. If we set t = 24 then our Poisson ratewould be interpreted as web hits per hour.Let’s load the data: {lang=r, line-numbers=off} ∼∼∼ > download.file(“https://dl.dropboxusercontent.com/u/7710864> load(“./data/gaData.rda”) > gaData$julian = julian(gaData$date) > head(gaData) date visits simplystats julian 1 2011-01-01 0 0 14975 2 2011-01-02 0 0 14976 3 2011-01-03 0 0 14977 4 2011-01-04 0 014978 5 2011-01-05 0 0 14979 6 2011-01-06 0 0 14980plot(gaData$julian,gaData$visits,pch=19,col=”darkgrey”,xlab=”Julian”,ylab=”Visits”) ∼∼∼¹¹⁹http://biostat.jhsph.edu/~jleek/118Count dataPlot of the count of web hits by day.Linear regressionWe could try to fit the data with linear regression. This is an often reasonable thing to do.
Specifically,we would start with the modelYi = β0 + beta1 xi + ϵiwhere Yi is the number of web hits on day i and xi is the day (expressed as a Julian date, the numberof days since January 1st, 1970).This model isn’t anywhere near as objectionable as when applied in the binary case. Two stickingpoints remain.
First, the response is a count and thus is non-negative, while our model doesn’tacknowledge that. Secondly, the errors are typically assumed Gaussian, which is not an accurateapproximation for small counts. As the counts get larger, straight application of linear or multivariable regression models becomes more compelling.119Count data1234> plot(gaData$julian,gaData$visits,pch=19,col="darkgrey",xlab="Julian",ylab="Vis\its")> lm1 = lm(gaData$visits ~ gaData$julian)> abline(lm1,col="red",lwd=3)Plot of the data plus the fitted line.The non-negativity could be handled by a (natural) log transformation of the outcome.
The log hasa special interpretation with respect to regression, so we cover it here. First, there is the issue of zerocounts (which can’t be logged). Often a +1 is added to all data before taking the log, a reasonablesolution but one that harms the nice interpretation properties of the log. Secondly, a square root orcube root transformation is often applied (which works just fine on zeros). While correcting nicelyfor skewness, this approach the issue of losing the nice interpretation of logs. Thus for the time being,let’s assume no zero counts in the discussion.Consider now the model:log(Yi ) = β0 + β1 xi + ϵiThe quantity eE[log(Y )] geometric mean of Y . When you take the natural log of outcomes and fit aregression model, your exponentiated coefficients estimate things about geometric means.
Thus oureβ0 is the geometric mean hits on day 0 while eβ1 is the relative increase or decrease in hits goingfrom one day to the next.Let’s try this briefly with Jeff’s data.120Count data> round(exp(coef(lm(I(log(gaData$visits + 1)) ~ gaData$julian))), 5)(Intercept) gaData$julian0.0001.002Thus our model estimates a 0.2% increase in geometric mean daily web hits each day.
What’snice about the geometric mean is it’s a multiplicative quantity. In this case it make sense to thinkmultiplicatively, as we would very naturally think in the terms of percent increases or decreases inthe daily rate of web traffic.Poisson regressionWatch this video before beginning.¹²⁰Poisson regression is similar to logging the outcome. However, instead we log the model meanexactly as in the binary chapter where we logged the modeled odds. This takes care of the problemof zero counts elegantly.Consider a model where we assume that Yi ∼ Poisson(µi ). andlog(E[Yi | Xi = xi ]) = log(µi ) = β0 + β1 xiNote that we’re not logging the outcome, we’re logging the assumed mean in the model.We interpret our coefficients as follows.
eβ0 is the expected mean of the outcome when xi = 0. Usingthe relationship:E[Yi | Xi = xi + 1]= eβ1E[Yi | Xi = xi ]we interpret {$$}eˆ{\beta_1}{$$} as the expected relative increase in the outcome for a unit change inthe regressor. If there’s more than one regressor in the model, then the coefficients are interpretedin the terms of the other regressors being held fixed.Let’s try it in R for Jeff’s data:> plot(gaData$julian,gaData$visits,pch=19,col="darkgrey",xlab="Julian",ylab="Vis\its")> glm1 = glm(gaData$visits ~ gaData$julian,family="poisson")> abline(lm1,col="red",lwd=3); lines(gaData$julian,glm1$fitted,col="blue",lwd=3)¹²⁰https://youtu.be/hg51LjG1xIc?list=PLpl-gQkQivXjqHAJd2t-J_One_fYE55tC121Count dataData with fitted Poisson regression line.Mean-variance relationshipThe Poisson model suggest a specific relationship between the mean and the variance.
Specifically,if Yi ∼ Poisson(µi ), then E[Yi ] = Var(Yi ). We can often check whether or not this relationshipapparently holds. For example, we can plot the fitted values (estimates E[Yi ]) by generalized versionof residuals for Poisson models. Bins of the> plot(glm1$fitted,glm1$residuals,pch=19,col="grey",ylab="Residuals",xlab="Fitte\d")122Count dataPlot of the fitted means versus the residuals.There are several methods for dealing with data that, while being counts, do not follow the meanvariance relationship assumed by the Poisson model.
Perhaps the easiest is to assume a so-calledquasi-Poisson model. This is from the idea of quasi-likelihood. Here, the model is extended tohave the variance be a constant (non-fixed) multiple of the mean. A very related approach areso-called negative binomial models. These models typically assume a more general mean/variancerelationship.Other approaches directly model the mean/variance relationship or rely on asymptotics to be robustto the assumption. We omit a full discussion of general of methods for addressing complex meanvariance relationships and simply show a quasi-Poisson fit for the data of this chapter.> glm2 = glm(visits ~ julian,family="quasipoisson", data = gaData)## Confidence interval expressed as a percentage> 100 * (exp(confint(glm2)) - 1)[2,]Waiting for profiling to be done...2.5 %97.5 %julian0.20729240.2520376## As compared to the standard Poisson interval> 100 * (exp(confint(glm1)) - 1)[2,]Waiting for profiling to be done...123Count datajulian2.5 %0.219244397.5 %0.2399335In this case the distinction between the intervals is minimal.
Again, we reiterate that this onlypursues one direction of model departure.RatesWe fit rates or proportions in Poisson models by including the temporal or sample size componentas a (natural) log offset in the model specification. Recall that Yi was the number of web hits.
LetWi be the number of hits directed to the site from the Simply Statistics BLOG site.Consider the model whereWi ∼ Poisson(µi )so thatlog(µi ) = β0 + β1 xi + log(Yi )This is a model for the proportion in the sense that µi is the expected count and our model is:log(µi /Yi ) = β0 + β1 xiIn this case we were interested in a proportion. If our interest was in rates, counts over a time period,such as incident rate (cases per time at risk), the time variable would be included as the offset.12345glm3 = glm(simplystats ~ julian(gaData$date),offset=log(visits+1),family="poisson",data=gaData)plot(julian(gaData$date),glm3$fitted,col="blue",pch=19,xlab="Date",ylab="Fitted \Counts")points(julian(gaData$date),glm1$fitted,col="red",pch=19)124Count dataPlot of the fitted rates.Exercises1.
Load the dataset Seatbelts as part of the datasets package via data(Seatbelts). Useas.data.frame to convert the object to a dataframe. Fit a Poisson regression GLM withUKDriversKilled as the outcome and kms, PetrolPrice and law as predictors. Interpret yourresults.
Watch a video solution.¹²¹2. Refer to question 1. Fit a linear model with the log of drivers killed as the outcome. Interpretyour results. Watch a video solution.¹²²3. Refer to question 1. Fit your Poisson log-linear model with drivers as a log offset (to considerthe proportion of drivers killed of those killed or seriously injured.) Watch a video solution.¹²³4. Refer to Question 1. Use the anova function to compare models with just law, law andPetrolPrice and all three predictors. Watch a video solution.¹²⁴¹²¹https://www.youtube.com/watch?v=TXO-SHOV_j4&list=PLpl-gQkQivXji7JK1OP1qS7zalwUBPrX0&index=55¹²²https://www.youtube.com/watch?v=7RyaIhmpM48&list=PLpl-gQkQivXji7JK1OP1qS7zalwUBPrX0&index=56¹²³https://www.youtube.com/watch?v=HylRM_XrUe0&list=PLpl-gQkQivXji7JK1OP1qS7zalwUBPrX0&index=57¹²⁴https://www.youtube.com/watch?v=ewfjP1i8gPs&list=PLpl-gQkQivXji7JK1OP1qS7zalwUBPrX0&index=58Bonus materialWatch this video before beginning.¹²⁵.This chapter is a bit of an mishmash of interesting things that one can accomplish with linear models.How to fit functions using linear modelsUp to this point, we’ve only consider fitting lines, planes and polynomials for linear models.