Count Dependent Variables

In this section we will cover how to model our dependent variable if it is a count variable. A count variable takes integer values greater than or equal to zero.

Poisson Regression

Poisson regression is designed for count dependent variables, since the response cannot be less than 0. The log-likelihood it maximizes to obtain its parameter estimates is:

\(L(\beta)=\sum_{i=1}^{n}\left[y_{i}\ln\lambda(z)-\lambda(z)-\ln(y_{i}!)\right]\)

where \(\lambda(z)=e^{z}\) and \(z=\beta_{0}+\beta_{1}x_{1}\), forming:

\(L(\beta)=\sum_{i=1}^{n}\left[y_{i}(\beta_{0}+\beta_{1}x_{1})-e^{\beta_{0}+\beta_{1}x_{1}}-\ln(y_{i}!)\right]\)

With a link function of:

log link: \(\ln(\lambda)=\beta_{0}+\beta_{1}x_{1}\), where \(\lambda\) is the expected count (Frees, 2010)
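
As a minimal sketch of the log-likelihood above, the function below evaluates it for a single predictor. The function name, the data, and the two sets of coefficients are made up for illustration and do not come from Frees (2010). The term \(\ln(y_{i}!)\) is computed with `math.lgamma(y + 1)`.

```python
import math

def poisson_log_likelihood(beta0, beta1, x, y):
    """Poisson log-likelihood for one predictor with a log link."""
    total = 0.0
    for xi, yi in zip(x, y):
        z = beta0 + beta1 * xi      # linear predictor
        lam = math.exp(z)           # lambda(z) = e^z
        # y_i * ln(lambda) - lambda - ln(y_i!), using ln(lambda) = z
        total += yi * z - lam - math.lgamma(yi + 1)
    return total

# Hypothetical data, invented for this example only
x = [0, 1, 2, 3]
y = [1, 2, 4, 7]

ll_a = poisson_log_likelihood(0.0, 0.65, x, y)  # slope close to the data
ll_b = poisson_log_likelihood(0.0, 0.10, x, y)  # slope far too small
print(ll_a, ll_b)
```

The coefficients with the larger log-likelihood describe the data better, which is the comparison a fitting routine makes when it searches for the maximum.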

Interpretation

For a one-unit increase in \(x_{1}\), we can expect the count to be multiplied by \(e^{b_{1}}\), where \(b_{1}\) is the estimate of \(\beta_{1}\).
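
To make the interpretation concrete, here is a small sketch with hypothetical coefficients (`b0` and `b1` are assumed values, not estimates from a real model). It checks that the ratio of predicted counts one unit apart equals \(e^{b_{1}}\).

```python
import math

b0 = 1.0   # hypothetical intercept
b1 = 0.25  # hypothetical slope

def predicted_count(x):
    """Predicted mean count under the log link."""
    return math.exp(b0 + b1 * x)

multiplier = math.exp(b1)                         # effect of a one-unit increase
ratio = predicted_count(3) / predicted_count(2)   # same quantity, computed directly
percent_change = (multiplier - 1) * 100

print(f"multiplier = {multiplier:.4f}, ratio = {ratio:.4f}")
print(f"percent change in expected count = {percent_change:.1f}%")
```

With these assumed values the expected count grows by roughly 28% for each one-unit increase in x.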

Fit of the Model

One way to check the fit of the model is through the Pearson residuals, where the formula for one Pearson residual is:

\(\frac{y_{i}-\hat{y}_{i}}{\sqrt{\hat{y}_{i}}}\)

This is just the raw residual divided by the estimated standard deviation of the response. Since we are using Poisson regression, the mean and variance are both equal to lambda, so the standard deviation is the square root of lambda (Frees, 2010).

If we sum the squares of these residuals, we get the Pearson chi-square statistic:

\(\sum_{i=1}^{n} \frac{(y_{i}-\hat{y}_{i})^{2}}{\hat{y}_{i}}\)

where smaller values are favorable.
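
The two formulas above can be sketched in a few lines. The observed counts and fitted values below are hypothetical and chosen so the arithmetic is easy to follow by hand.

```python
import math

def pearson_residuals(y, y_hat):
    """Raw residual divided by the square root of the fitted mean."""
    return [(yi - mi) / math.sqrt(mi) for yi, mi in zip(y, y_hat)]

def pearson_chi_square(y, y_hat):
    """Sum of squared Pearson residuals."""
    return sum(r ** 2 for r in pearson_residuals(y, y_hat))

# Hypothetical observed counts and fitted values
y = [2, 0, 5, 3]
y_hat = [1.0, 1.0, 4.0, 4.0]

residuals = pearson_residuals(y, y_hat)
chi_sq = pearson_chi_square(y, y_hat)
print(residuals)  # [1.0, -1.0, 0.5, -0.5]
print(chi_sq)     # 2.5
```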

Another statistic that can be used is the deviance. The deviance formula for a Poisson distribution is:

\(D=2\sum_{i=1}^{n}\left\{y_{i}\left[\ln\left(\frac{y_{i}}{\hat{y}_{i}}\right)-1\right]+\hat{y}_{i}\right\}\)

The deviance measures the gap between the best possible fit (a model that reproduces every observation exactly) and the current model. Again, smaller values are better.
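
A minimal sketch of the deviance calculation follows, using hypothetical counts and fitted values. One assumption is made explicit in the code: when \(y_{i}=0\), the term \(y_{i}\ln(y_{i}/\hat{y}_{i})\) is taken to be 0, which is its limiting value.

```python
import math

def poisson_deviance(y, y_hat):
    """Poisson deviance: 2 * sum of y*[ln(y / y_hat) - 1] + y_hat."""
    total = 0.0
    for yi, mi in zip(y, y_hat):
        # y * ln(y / y_hat) is treated as 0 when y = 0 (limiting value)
        log_term = yi * math.log(yi / mi) if yi > 0 else 0.0
        total += log_term - yi + mi
    return 2 * total

# Hypothetical observed counts and fitted values
y = [2, 0, 5, 3]
y_hat = [1.0, 1.0, 4.0, 4.0]

deviance = poisson_deviance(y, y_hat)
perfect = poisson_deviance([1, 2, 3], [1.0, 2.0, 3.0])  # fitted = observed
print(deviance, perfect)
```

When the fitted values equal the observed counts the deviance is zero, which matches the idea that it measures distance from the best possible fit.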

Poisson Regression with Exposures

When the time intervals or numbers of observations over which counts are recorded are not the same, we can use Poisson regression with exposures. For example, to figure out which subject in a school is the easiest, you might count the number of A’s received in each subject and deem whichever has the most A’s the easiest. However, there is a problem with this approach: the number of students in each subject may not be the same. For instance, a subject with 30 students that receives 25 A’s is much more likely to be an easy subject than a subject with 100 students that receives 30 A’s.

To adjust for this disparity in the number of students per subject, we add exposures. The predictions can then be interpreted as the number of A’s per student for each subject. Whichever subject has the highest proportion of A’s would be called the easiest.

The formula with exposures is similar to Poisson regression, but it starts as a ratio:

\(\frac{\text{A's received}}{\text{students}}=e^{\beta_{0}+\beta_{1}x_{1}}\)

\(\ln\left(\frac{\text{A's received}}{\text{students}}\right)=\beta_{0}+\beta_{1}x_{1}\)

\(\ln(\text{A's received})=\beta_{0}+\beta_{1}x_{1}+\underset{\text{offset}}{\ln(\text{students})}\)
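
The sketch below works through the classroom example and the offset form of the equation. The class sizes and A counts are the ones from the text; the coefficients `b0` and `b1` and the predictor value `x1` are hypothetical.

```python
import math

# (A's received, number of students) from the example in the text
subjects = {"30 students": (25, 30), "100 students": (30, 100)}
rates = {name: a / n for name, (a, n) in subjects.items()}
print(rates)  # A's per student for each subject

def expected_count(beta0, beta1, x, exposure):
    """ln(count) = beta0 + beta1 * x + ln(exposure), solved for the count."""
    return math.exp(beta0 + beta1 * x + math.log(exposure))

b0, b1, x1 = -0.5, 0.3, 1.0   # hypothetical values for illustration
count_30 = expected_count(b0, b1, x1, 30)
count_100 = expected_count(b0, b1, x1, 100)

# Dividing by the exposure recovers the same rate for both class sizes
print(count_30 / 30, count_100 / 100, math.exp(b0 + b1 * x1))
```

Because the offset enters with a coefficient fixed at 1, the expected count scales in proportion to the number of students while the rate per student depends only on the predictors.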

Models Besides Poisson

Poisson regression assumes that the mean is equal to the variance, which is known as equidispersion. If the variance is greater than the expected value then it is known as overdispersion. If the variance is less than the expected value it is known as underdispersion.
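
A rough first check is to compare the sample mean and variance of the counts. The data below are hypothetical, and this is only a crude sketch: it ignores the predictors, whereas the Poisson assumption concerns the mean and variance given the predictors.

```python
import statistics

# Hypothetical claim counts with many zeros and a few large values
counts = [0, 0, 1, 0, 7, 2, 0, 9, 1, 0]

mean = statistics.mean(counts)
variance = statistics.variance(counts)  # sample variance (n - 1 denominator)
ratio = variance / mean

if ratio > 1:
    label = "overdispersion"
elif ratio < 1:
    label = "underdispersion"
else:
    label = "equidispersion"

print(f"mean = {mean}, variance = {variance:.2f}, ratio = {ratio:.2f}: {label}")
```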

Addressing Overdispersion

The count models that can handle overdispersion, or variance greater than the mean, are the negative binomial, zero-inflated, hurdle, and heterogeneity models. The negative binomial is like a more flexible version of Poisson regression because it has an additional parameter, which lets the variance exceed the mean.

Addressing Underdispersion

The only model of those covered in Exam SRM that can handle underdispersion, where the variance is less than the mean, is the hurdle model. This means the hurdle model can handle both underdispersion and overdispersion.

Zero-Inflated Model

Sometimes in insurance a policyholder might not report a claim for fear of increasing rates. This can lead to an excess of zeros in the data. Some of these zeros arise naturally in the sample, from policyholders who did not have any accident. The rest come from policyholders who had an accident but chose not to report it. One model that can handle this type of data is the zero-inflated model.

The zero-inflated model works by building two models. The first is a model for binary data, such as a logistic model. Its outcome tells us whether a zero comes from non-reporting or is a real zero, which in this case means no accident occurred. P(y=1) from this model is the probability that an accident occurred but was not reported. The second model predicts the number of claims and is generally a Poisson or negative binomial model. Together, the pmf of the two models is:

\(\pi_{i}+(1-\pi_{i})g_{i}(0)\) if \(j=0\)

\((1-\pi_{i})g_{i}(j)\) if \(j \ne 0\)

where \(g_{i}(j)\) is the probability of count \(j\) under the Poisson model and \(\pi_{i}\) is P(y=1) from the binary model (Frees, 2010)
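
The zero-inflated pmf can be sketched as follows, assuming a Poisson count model. The values of `pi` and `lam` are hypothetical and are held fixed here, whereas in a fitted model they would vary by observation.

```python
import math

def poisson_pmf(j, lam):
    """Poisson probability of count j with mean lam."""
    return math.exp(-lam) * lam ** j / math.factorial(j)

def zero_inflated_pmf(j, pi, lam):
    """pi + (1 - pi) * g(0) at zero, (1 - pi) * g(j) otherwise."""
    g = poisson_pmf(j, lam)
    if j == 0:
        return pi + (1 - pi) * g
    return (1 - pi) * g

pi, lam = 0.3, 2.0  # hypothetical parameter values

p_zero_zip = zero_inflated_pmf(0, pi, lam)
p_zero_poisson = poisson_pmf(0, lam)
total = sum(zero_inflated_pmf(j, pi, lam) for j in range(50))

print(p_zero_zip, p_zero_poisson)  # the zero-inflated model puts more mass at 0
print(total)                       # probabilities sum to (approximately) 1
```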

Hurdle Models

A hurdle model is another method to use when there is an excess of zeros. The zeros in this case represent not overcoming the “hurdle”. The example used in Regression Modeling with Actuarial and Financial Applications is that a person must first decide to seek health care and then decide on the amount of health care. Seeking health care is the hurdle in this example, since it must happen before the second part of the process, the amount of health care, can take place. The pmf for a hurdle model is:

\(\pi\) if \(j=0\)

\((1-\pi) \frac{g_{i}(j)}{1-g_{i}(0)}\) if \(j \ne 0\)

where \(\pi\) is the probability of not overcoming the hurdle

The \(\frac{g_{i}(j)}{1-g_{i}(0)}\) term represents a zero-truncated Poisson distribution, where:

\(P(Y=y \mid Y>0)=\frac{P(Y=y)}{P(Y>0)}=\frac{P(Y=y)}{1-P(Y=0)}\) (Frees, 2010)
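
Here is a minimal sketch of the hurdle pmf with a zero-truncated Poisson for the positive counts. The parameter values are hypothetical. Unlike the zero-inflated model, the probability of zero is set directly by \(\pi\), so it can be larger or smaller than the Poisson probability of zero.

```python
import math

def poisson_pmf(j, lam):
    """Poisson probability of count j with mean lam."""
    return math.exp(-lam) * lam ** j / math.factorial(j)

def hurdle_pmf(j, pi, lam):
    """pi at zero, (1 - pi) times the zero-truncated Poisson otherwise."""
    if j == 0:
        return pi
    truncated = poisson_pmf(j, lam) / (1 - poisson_pmf(0, lam))
    return (1 - pi) * truncated

lam = 2.0  # hypothetical Poisson mean

# The zero probability is free to sit above or below the Poisson value
total_many_zeros = sum(hurdle_pmf(j, 0.60, lam) for j in range(50))
total_few_zeros = sum(hurdle_pmf(j, 0.05, lam) for j in range(50))

print(poisson_pmf(0, lam))                # Poisson probability of zero
print(total_many_zeros, total_few_zeros)  # both sum to (approximately) 1
```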

Heterogeneity Model

Instead of using a discrete distribution, the heterogeneity model models the response as a continuous mixture (Frees, 2010).

Sources:

Introduction to SAS. UCLA: Statistical Consulting Group. from https://stats.oarc.ucla.edu/sas/modules/introduction-to-the-features-of-sas/ (accessed August 22, 2021).

Grace-Martin, K. (2020, November 19). The exposure variable in poisson regression models. The Analysis Factor. Retrieved December 6, 2022, from https://www.theanalysisfactor.com/the-exposure-variable-in-poission-regression-models/#:~:text=Poisson%20models%20handle%20exposure%20variables,right%20side%20of%20the%20equation.

Frees, E. W. (2010). Regression modeling with actuarial and financial applications. Cambridge University Press.