12E3. When count data are zero-inflated, using a model that ignores zero-inflation will tend to induce which kind of inferential error?

Ignoring zero-inflation will tend to underestimate the rate of events. Why? Because a count distribution with extra zeros mixed in has a lower mean than the underlying count process on its own. So treating such data as if a single process generated all of the counts will result in a lower estimate of the mean rate.
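
A quick simulation makes this concrete. The numbers here are arbitrary choices for illustration: a true rate of 5 and a 30% chance of a structural zero.

set.seed(7)
N <- 1e4
true_rate <- 5      # rate of the underlying count process
p_zero <- 0.3       # probability an observation is a structural zero
zero <- rbinom( N , 1 , p_zero )            # 1 = forced zero
y <- (1 - zero) * rpois( N , true_rate )    # zero-inflated counts
mean( y )           # roughly true_rate*(1 - p_zero) = 3.5, well below 5

A plain Poisson model fit to y would center its rate estimate near the sample mean of about 3.5 rather than the true rate of 5.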

12E4. Over-dispersion is common in count data. Give an example of a natural process that might produce over-dispersed counts. Can you also give an example of a process that might produce under-dispersed counts?

Over-dispersion can arise simply from variation in underlying rates across units. For example, if we count the number of ice creams sold at several different shops on each day over an entire month, the pooled counts will likely be over-dispersed. This is because some shops sell more ice cream than others: they do not all have the same average rate of sales across days (a small simulation of this is sketched after this paragraph). Under-dispersion is considered less often. Under-dispersed count data show less variation than a Poisson process with the same mean would produce. One common process that might produce under-dispersed counts is autocorrelation, in which sequential observations are directly correlated with one another. This is the premise behind the Conway-Maxwell-Poisson (COM-Poisson) distribution, which arises from a state-dependent queuing model of the kind that is commonplace in the study of servers and production systems. Simply stated, when the rate at which jobs are completed depends upon how many jobs are waiting to be completed, counts may be highly autocorrelated. This reduces variation in the observed counts, resulting in under-dispersion.
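
Here is the ice cream example as a small simulation. Drawing shop rates from a log-normal is just an arbitrary way of giving shops different averages; none of these values come from the text.

set.seed(7)
n_shops <- 50
n_days <- 30
shop_rate <- rlnorm( n_shops , meanlog=3 , sdlog=0.5 )            # each shop's own daily rate
sales <- rpois( n_shops*n_days , rep( shop_rate , each=n_days ) ) # pooled daily counts
mean( sales )   # a pure Poisson would have variance close to this
var( sales )    # much larger than the mean: over-dispersion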

12H1. In 2014, a paper was published that was entitled “Female hurricanes are deadlier than male hurricanes.” As the title suggests, the paper claimed that hurricanes with female names have caused greater loss of life, and the explanation given is that people unconsciously rate female hurricanes as less dangerous and so are less likely to evacuate. Statisticians severely criticized the paper after publication. Here, you’ll explore the complete data used in the paper and consider the hypothesis that hurricanes with female names are deadlier. Load the data with:

library(rethinking)
data(Hurricanes)

Acquaint yourself with the columns by inspecting the help ?Hurricanes. In this problem, you’ll focus on predicting deaths using femininity of each hurricane’s name. Fit and interpret the simplest possible model, a Poisson model of deaths using femininity as a predictor. You can use quap or ulam. Compare the model to an intercept-only Poisson model of deaths. How strong is the association between femininity of name and deaths? Which storms does the model fit (retrodict) well? Which storms does it fit poorly?

The problem leaves you free to specify your own priors, so we’ll need to do some prior simulations again, of course. This means thinking about typical ranges of hurricane deaths (the intercept) and how much of an effect the name could possibly have (the slope). Let’s start with a very simple Poisson model predicting deaths using only femininity as a predictor. I’m going to standardize femininity, for the usual reasons. Then we’ll simulate from the priors and check that the prior predictions aren’t silly.

library(rethinking)
## rethinking (Version 2.42)
data(Hurricanes)
d <- Hurricanes
d$fmnnty_std <- ( d$femininity - mean(d$femininity) )/sd(d$femininity)  # standardize femininity
dat <- list( D=d$deaths , F=d$fmnnty_std )
# model formula - no fitting yet
f <- alist(
    D ~ dpois(lambda),
    log(lambda) <- a + bF*F,
    a ~ dnorm(1,1),
    bF ~ dnorm(0,1) )

Now if you aren’t an expert on hurricanes, you won’t have a sense of what the plausible range of D should be. The most deadly hurricane in recorded history is the 1900 “Great Galveston hurricane,” which killed somewhere between 6,000 and 10,000 people. Note that this storm doesn’t have a personal name, and it doesn’t appear in the sample, because storms didn’t get names until after 1950. Most storms since 1950 are also much less deadly than this, killing fewer than a dozen people. But occasionally there are storms that kill a hundred or more. Hurricane mortality is a thick-tailed phenomenon. So you might already have a hunch that a simple Poisson model is not going to do a good job. But let’s do our best. What do the priors above produce?

N <- 100
# sample intercepts and slopes from the priors
a <- rnorm(N,1,1)
bF <- rnorm(N,0,1)
F_seq <- seq( from=-2 , to=2 , length.out=30 )
# plot the implied prior trends on the outcome scale
plot( NULL , xlim=c(-2,2) , ylim=c(0,500) ,
    xlab="name femininity (std)" , ylab="deaths" )
for ( i in 1:N ) lines( F_seq , exp( a[i] + bF[i]*F_seq ) , col=grau() , lwd=1.5 )

This allows for some rather implausible extreme trends. But the typical trend is very modest, only a very small increase (or decrease) in deaths as femininity of the name changes. I’d rather use a tighter prior on the slope, because it makes no sense to think that the femininity of a storm’s name could increase the deaths by an order of magnitude. But let’s proceed with these priors and see what happens.
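
For comparison, here is the same prior simulation with a tighter slope prior. The standard deviation of 0.2 is just my own guess at something more restrained, not a value from the text.

N <- 100
a <- rnorm(N,1,1)
bF <- rnorm(N,0,0.2)    # tighter slope prior
plot( NULL , xlim=c(-2,2) , ylim=c(0,500) ,
    xlab="name femininity (std)" , ylab="deaths" )
for ( i in 1:N ) lines( F_seq , exp( a[i] + bF[i]*F_seq ) , col=grau() , lwd=1.5 )

These trends should stay well within an order of magnitude of the baseline rate, which is closer to how much influence a name could plausibly have.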

To fit the model:

m1 <- ulam( f , data=dat , chains=4 , log_lik=TRUE )
## Running MCMC with 4 sequential chains, with 1 thread(s) per chain...
## 
## All 4 chains finished successfully.
## Mean chain execution time: 0.1 seconds.
## Total execution time: 0.5 seconds.
precis( m1 )
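
The problem also asks for a comparison with an intercept-only Poisson model. Here is a minimal sketch of how that could look; the formula f0, the model name m1_int, and the compare() call are my additions, and their output is not shown here.

f0 <- alist(
    D ~ dpois(lambda),
    log(lambda) <- a,
    a ~ dnorm(1,1) )
m1_int <- ulam( f0 , data=dat , chains=4 , log_lik=TRUE )
compare( m1 , m1_int , func=PSIS )

Because both models are fit with log_lik=TRUE, compare() can contrast them using PSIS (or WAIC).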

Judging from the precis output, the model seems to think there is a reliable association between femininity of name and deaths. Now, in order to see which hurricanes the model retrodicts well, we can plot the implied trend over the raw data points. I’ll compute and plot the expected death count, the 89% interval of that expectation, and the 89% interval of the posterior predictive distribution of deaths (using Poisson sampling).

# plot raw data
plot( dat$F , dat$D , pch=16 , lwd=2 ,
col=rangi2 , xlab="femininity (std)" , ylab="deaths" )
# compute model-based trend
pred_dat <- list( F=seq(from=-2,to=1.5,length.out=30) )
lambda <- link( m1 , data=pred_dat )
lambda.mu <- apply(lambda,2,mean)
lambda.PI <- apply(lambda,2,PI)
# superimpose trend
lines( pred_dat$F , lambda.mu )
shade( lambda.PI , pred_dat$F )
# compute sampling distribution
deaths_sim <- sim(m1,data=pred_dat)
deaths_sim.PI <- apply(deaths_sim,2,PI)
# superimpose sampling interval as dashed lines
lines( pred_dat$F , deaths_sim.PI[1,] , lty=2 )
lines( pred_dat$F , deaths_sim.PI[2,] , lty=2 )

We can’t even see the 89% interval of the expected value, because it is so narrow. The sampling interval isn’t much wider. What you can see here is that femininity accounts for very little of the variation in deaths, especially at the high end. There’s a lot of over-dispersion, which is very common with Poisson models. As a consequence, this homogeneous Poisson model does a poor job for most of the hurricanes in the sample, as most of them lie outside the prediction envelope (the dashed boundaries). Any trend also seems to be driven by a small number of extreme storms with feminine names. As you might expect, PSIS does not like this:

stem( PSISk( m1 ) )
## Some Pareto k values are very high (>1). Set pointwise=TRUE to inspect individual points.
## 
##   The decimal point is 1 digit(s) to the left of the |
## 
##   -0 | 4421
##    0 | 11222344555555666777788888990000111112222222244566778899
##    2 | 12233455801344445778889
##    4 | 1
##    6 | 
##    8 | 051
##   10 | 0
##   12 | 3
##   14 | 
##   16 | 1
##   18 | 
##   20 | 
##   22 | 90

There are a handful of storms with k values above 1, and two above 2; we can check exactly which ones below. Anyway, this emphasizes that an understanding of the process, as opposed to a purely statistical model, is important!
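
To see which storms those are, we can index the data frame by the pointwise k values. This sketch assumes PSISk() returns one k per observation in the same row order as d, which is how it is used in the stem plot above.

k <- PSISk( m1 )
d[ k > 1 , c("name","deaths","femininity") ]

High k flags the observations that are most surprising to the model, so these will be the storms it retrodicts worst.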