Exercise 12 Solutions

12E3. When count data are zero-inflated, using a model that ignores zero-inflation will tend to induce which kind of inferential error?

Ignoring zero-inflation will tend to underestimate the rate of events. Why? Because a count distribution with extra zeros added to it will have a lower mean. So treating such data as single-process count data will result in a lower estimate for the mean rate.

12E4. Over-dispersion is common in count data. Give an example of a natural process that might produce over-dispersed counts. Can you also give an example of a process that might produce under-dispersed counts?

Over-dispersioncanarisesimplyfromvariationinunderlyingratesacrossunits.Forexample, if we count the number of ice creams sold by various ice cream shops for each day over an entire month, the aggregated counts will likely be over-dispersed. This is because some shops sell more ice cream than others—they do not all have the same average rate of sales across days.

Under-dispersion is considered less often. Under-dispersed count data has less variation than expected. One common process that might produce under-dispersed counts is when sequential ob- servations are directly correlated with one another, autocorrelation. This is the premise of Conway- Maxwell-Poisson (aka COM-Poisson) distributions, which arise from one model of this kind, the state-dependent queuing model, commonplace in the study of servers and production systems of many kinds. Simply stated, when the rate at which jobs are completed depends upon how many jobs are waiting to be completed, then counts may be highly autocorrelated. This reduces variation in the observed counts, resulting in under-dispersion.

12H1. In 2014, a paper was published that was entitled “Female hurricanes are deadlier than male hurricanes.”191 As the title suggests, the paper claimed that hurricanes with female names have caused greater loss of life, and the explanation given is that people unconsciously rate female hurricanes as less dangerous and so are less likely to evacuate. Statisticians severely criticized the paper after publication. Here, you’ll explore the complete data used in the paper and consider the hypothesis that hurricanes with female names are deadlier. Load the data with:

R code 12.38 library(rethinking) data(Hurricanes)

Acquaint yourself with the columns by inspecting the help ?Hurricanes. In this problem, you’ll focus on predicting deaths using femininity of each hurricane’s name. Fit and interpret the simplest possible model, a Poisson model of deaths using femininity as a predictor. You can use quap or ulam. Compare the model to an intercept-only Poisson model of deaths. How strong is the association between femininity of name and deaths? Which storms does the model fit (retrodict) well? Which storms does it fit poorly?

Note that femininity in the data set is on a 1-11 scale from totally masculine (1) to totally feminine (11) for name based on the average of 9 scores from 9 raters.

library(rethinking)

## Loading required package: rstan

## Loading required package: StanHeaders

## Loading required package: ggplot2

## rstan (Version 2.21.8, GitRev: 2e1f913d3ca3)

## For execution on a local, multicore CPU with excess RAM we recommend calling
## options(mc.cores = parallel::detectCores()).
## To avoid recompilation of unchanged Stan programs, we recommend calling
## rstan_options(auto_write = TRUE)

## Loading required package: cmdstanr

## This is cmdstanr version 0.5.3

## - CmdStanR documentation and vignettes: mc-stan.org/cmdstanr

## - CmdStan path: /Users/brianbeckage/.cmdstan/cmdstan-2.31.0

## - CmdStan version: 2.31.0

## 
## A newer version of CmdStan is available. See ?install_cmdstan() to install it.
## To disable this check set option or environment variable CMDSTANR_NO_VER_CHECK=TRUE.

## Loading required package: parallel

## rethinking (Version 2.31)

## 
## Attaching package: 'rethinking'

## The following object is masked from 'package:rstan':
## 
##     stan

## The following object is masked from 'package:stats':
## 
##     rstudent

data(Hurricanes)
d <- Hurricanes
head(d)

# Standardize feminity to a N(0,1) or standard normal.
# could also use the scale() function in base R
d$fmnnty_std <- ( d$femininity - mean(d$femininity) )/sd(d$femininity)
dat <- list( D=d$deaths , F=d$fmnnty_std )
head(dat)

## $D
##  [1]   2   4   3   1   0  60  20  20   0 200   7  15   1   0  22  50   0  46   3
## [20]   3   5  37   3  75   6   3  15   3 256  22   0   2   0 117   1  21   5   0
## [39]   1  15   5   2  21   3   0   1   4   8  12   5   3   5   0   1  13  21   3
## [58]  15  62   3   6   9   8  26  10   3   3   1   0  56   8   2   3  51   1  10
## [77]   7   8  25   5   1  15   1  62   5   1   1  52  84  41   5 159
## 
## $F
##  [1] -0.0009347453 -1.6707580424 -0.9133139565  0.9458703621  0.4810742824
##  [6]  0.4122162926  0.5499353710  0.8253673305  0.5327193242  0.9630864089
## [11] -0.2591568553  0.0679232445  0.9630864089  0.9630864089  0.9286574139
## [16]  0.7737253874  0.6015773141  0.8425833773  0.9802993570  0.3605712508
## [21]  0.7909383355  0.6360063090  0.8253673305  0.4810742824  0.6187933608
## [26]  0.4638613343  0.1539972812  0.6704353039  0.7048673975  0.8253673305
## [31]  0.5327193242  0.1884262762  0.9975154038  0.5843643659  0.6015773141
## [36]  0.6704353039  1.1352313836  0.0334942496 -1.5846840057 -1.5674710576
## [41] -1.3264649944 -1.2748199526  0.9458703621  0.9802993570 -1.5846840057
## [46] -1.4125390310  0.9114413671  0.8425833773 -1.4986130677  0.8942284190
## [51]  0.8081543823 -1.2059619628 -1.5330420626  0.4810742824  0.7048673975
## [56] -1.2059619628 -1.3781100361 -1.5846840057 -1.4125390310  0.9458703621
## [61]  0.1367812344  0.5327193242  0.5327193242  0.1195682863 -1.4125390310
## [66]  0.8081543823 -1.5158260158 -1.3953229842 -1.3781100361 -1.5330420626
## [71]  0.7737253874  1.1008023886  0.7392963925  0.8081543823 -0.8100238730
## [76] -1.2059619628 -0.2419408085 -1.2748199526 -1.7740450272  0.5327193242
## [81]  0.9802993570 -1.3436810411  0.7392963925  0.8425833773  0.5671483191
## [86] -1.3608939893  0.9458703621 -1.5674710576 -1.5158260158  0.7737253874
## [91] -1.4986130677  0.6876513507

f <- alist(
        D ~ dpois(lambda),
        log(lambda) <- a + bF*F,
        a ~ dnorm(0,10),
        bF ~ dnorm(0,10) )

m1 <- ulam( f , data=dat , chains=4 , log_lik=TRUE )

## In file included from /var/folders/4f/_h6ql5191cl7bc390kh2xd380000gn/T/RtmpmCalvw/model-1038eeaaaa44.hpp:1:
## In file included from /Users/brianbeckage/.cmdstan/cmdstan-2.31.0/stan/src/stan/model/model_header.hpp:4:
## In file included from /Users/brianbeckage/.cmdstan/cmdstan-2.31.0/stan/lib/stan_math/stan/math.hpp:19:
## In file included from /Users/brianbeckage/.cmdstan/cmdstan-2.31.0/stan/lib/stan_math/stan/math/rev.hpp:10:
## In file included from /Users/brianbeckage/.cmdstan/cmdstan-2.31.0/stan/lib/stan_math/stan/math/rev/fun.hpp:198:
## In file included from /Users/brianbeckage/.cmdstan/cmdstan-2.31.0/stan/lib/stan_math/stan/math/prim/functor.hpp:14:
## In file included from /Users/brianbeckage/.cmdstan/cmdstan-2.31.0/stan/lib/stan_math/stan/math/prim/functor/integrate_ode_rk45.hpp:6:
## In file included from /Users/brianbeckage/.cmdstan/cmdstan-2.31.0/stan/lib/stan_math/stan/math/prim/functor/ode_rk45.hpp:9:
## In file included from /Users/brianbeckage/.cmdstan/cmdstan-2.31.0/stan/lib/stan_math/lib/boost_1.78.0/boost/numeric/odeint.hpp:76:
## In file included from /Users/brianbeckage/.cmdstan/cmdstan-2.31.0/stan/lib/stan_math/lib/boost_1.78.0/boost/numeric/odeint/integrate/observer_collection.hpp:23:
## In file included from /Users/brianbeckage/.cmdstan/cmdstan-2.31.0/stan/lib/stan_math/lib/boost_1.78.0/boost/function.hpp:30:
## In file included from /Users/brianbeckage/.cmdstan/cmdstan-2.31.0/stan/lib/stan_math/lib/boost_1.78.0/boost/function/detail/prologue.hpp:17:
## In file included from /Users/brianbeckage/.cmdstan/cmdstan-2.31.0/stan/lib/stan_math/lib/boost_1.78.0/boost/function/function_base.hpp:21:
## In file included from /Users/brianbeckage/.cmdstan/cmdstan-2.31.0/stan/lib/stan_math/lib/boost_1.78.0/boost/type_index.hpp:29:
## In file included from /Users/brianbeckage/.cmdstan/cmdstan-2.31.0/stan/lib/stan_math/lib/boost_1.78.0/boost/type_index/stl_type_index.hpp:47:
## /Users/brianbeckage/.cmdstan/cmdstan-2.31.0/stan/lib/stan_math/lib/boost_1.78.0/boost/container_hash/hash.hpp:132:33: warnin

## g: 'unary_function<const std::error_category *, unsigned long>' is deprecated [-Wdeprecated-declarations]
##         struct hash_base : std::unary_function<T, std::size_t> {};
##                                 ^
## /Users/brianbeckage/.cmdstan/cmdstan-2.31.0/stan/lib/stan_math/lib/boost_1.78.0/boost/container_hash/hash.hpp:692:18: note: in instantiation of template class 'boost::hash_detail::hash_base<const std::error_category *>' requested here
##         : public boost::hash_detail::hash_base<T*>
##                  ^
## /Users/brianbeckage/.cmdstan/cmdstan-2.31.0/stan/lib/stan_math/lib/boost_1.78.0/boost/container_hash/hash.hpp:420:24: note: in instantiation of template class 'boost::hash<const std::error_category *>' requested here
##         boost::hash<T> hasher;
##                        ^
## /Users/brianbeckage/.cmdstan/cmdstan-2.31.0/stan/lib/stan_math/lib/boost_1.78.0/boost/container_hash/hash.hpp:551:9: note: in instantiation of function template specialization 'boost::hash_combine<const std::error_category *>' requested here
##         hash_combine(seed, &v.category());
##         ^
## /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/c++/v1/__functional/unary_function.h:23:29: note: 'unary_function<const std::error_category *, unsigned long>' has been explicitly marked deprecated here
## struct _LIBCPP_TEMPLATE_VIS _LIBCPP_DEPRECATED_IN_CXX11 unary_function
##                             ^

## /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/c++/v1/__config:825:41: note: expanded from macro '_LIBCPP_DEPRECATED_IN_CXX11'
## #    define _LIBCPP_DEPRECATED_IN_CXX11 _LIBCPP_DEPRECATED
##                                         ^
## /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/c++/v1/__config:810:49: note: expanded from macro '_LIBCPP_DEPRECATED'
## #      define _LIBCPP_DEPRECATED __attribute__((deprecated))
##                                                 ^

## 1 warning generated.

## Running MCMC with 4 sequential chains, with 1 thread(s) per chain...
## 
## Chain 1 Iteration:   1 / 1000 [  0%]  (Warmup) 
## Chain 1 Iteration: 100 / 1000 [ 10%]  (Warmup) 
## Chain 1 Iteration: 200 / 1000 [ 20%]  (Warmup) 
## Chain 1 Iteration: 300 / 1000 [ 30%]  (Warmup) 
## Chain 1 Iteration: 400 / 1000 [ 40%]  (Warmup) 
## Chain 1 Iteration: 500 / 1000 [ 50%]  (Warmup) 
## Chain 1 Iteration: 501 / 1000 [ 50%]  (Sampling) 
## Chain 1 Iteration: 600 / 1000 [ 60%]  (Sampling) 
## Chain 1 Iteration: 700 / 1000 [ 70%]  (Sampling) 
## Chain 1 Iteration: 800 / 1000 [ 80%]  (Sampling) 
## Chain 1 Iteration: 900 / 1000 [ 90%]  (Sampling) 
## Chain 1 Iteration: 1000 / 1000 [100%]  (Sampling) 
## Chain 1 finished in 0.1 seconds.
## Chain 2 Iteration:   1 / 1000 [  0%]  (Warmup) 
## Chain 2 Iteration: 100 / 1000 [ 10%]  (Warmup) 
## Chain 2 Iteration: 200 / 1000 [ 20%]  (Warmup) 
## Chain 2 Iteration: 300 / 1000 [ 30%]  (Warmup) 
## Chain 2 Iteration: 400 / 1000 [ 40%]  (Warmup) 
## Chain 2 Iteration: 500 / 1000 [ 50%]  (Warmup) 
## Chain 2 Iteration: 501 / 1000 [ 50%]  (Sampling) 
## Chain 2 Iteration: 600 / 1000 [ 60%]  (Sampling) 
## Chain 2 Iteration: 700 / 1000 [ 70%]  (Sampling) 
## Chain 2 Iteration: 800 / 1000 [ 80%]  (Sampling) 
## Chain 2 Iteration: 900 / 1000 [ 90%]  (Sampling) 
## Chain 2 Iteration: 1000 / 1000 [100%]  (Sampling) 
## Chain 2 finished in 0.1 seconds.
## Chain 3 Iteration:   1 / 1000 [  0%]  (Warmup) 
## Chain 3 Iteration: 100 / 1000 [ 10%]  (Warmup) 
## Chain 3 Iteration: 200 / 1000 [ 20%]  (Warmup) 
## Chain 3 Iteration: 300 / 1000 [ 30%]  (Warmup) 
## Chain 3 Iteration: 400 / 1000 [ 40%]  (Warmup) 
## Chain 3 Iteration: 500 / 1000 [ 50%]  (Warmup) 
## Chain 3 Iteration: 501 / 1000 [ 50%]  (Sampling) 
## Chain 3 Iteration: 600 / 1000 [ 60%]  (Sampling) 
## Chain 3 Iteration: 700 / 1000 [ 70%]  (Sampling) 
## Chain 3 Iteration: 800 / 1000 [ 80%]  (Sampling) 
## Chain 3 Iteration: 900 / 1000 [ 90%]  (Sampling) 
## Chain 3 Iteration: 1000 / 1000 [100%]  (Sampling) 
## Chain 3 finished in 0.1 seconds.
## Chain 4 Iteration:   1 / 1000 [  0%]  (Warmup) 
## Chain 4 Iteration: 100 / 1000 [ 10%]  (Warmup) 
## Chain 4 Iteration: 200 / 1000 [ 20%]  (Warmup) 
## Chain 4 Iteration: 300 / 1000 [ 30%]  (Warmup) 
## Chain 4 Iteration: 400 / 1000 [ 40%]  (Warmup) 
## Chain 4 Iteration: 500 / 1000 [ 50%]  (Warmup) 
## Chain 4 Iteration: 501 / 1000 [ 50%]  (Sampling) 
## Chain 4 Iteration: 600 / 1000 [ 60%]  (Sampling) 
## Chain 4 Iteration: 700 / 1000 [ 70%]  (Sampling) 
## Chain 4 Iteration: 800 / 1000 [ 80%]  (Sampling) 
## Chain 4 Iteration: 900 / 1000 [ 90%]  (Sampling) 
## Chain 4 Iteration: 1000 / 1000 [100%]  (Sampling) 
## Chain 4 finished in 0.1 seconds.
## 
## All 4 chains finished successfully.
## Mean chain execution time: 0.1 seconds.
## Total execution time: 0.5 seconds.

precis( m1 )

range(dat$F)

## [1] -1.774045  1.135231

# plot raw data
plot( dat$F , dat$D , pch=16 , lwd=2 ,
    col=rangi2 , xlab="femininity (std)" , ylab="deaths" )

# compute model-based trend

# First create a set of values of f to make predictions at
pred_dat <- list( F=seq(from=-2,to=1.5,length.out=30) )

# simulating the predicted lambdas at these values of f
# for our fitted model m1.  by default, link simulates
# a 1000 draws from the posterioir
lambda <- link( m1 , data=pred_dat )

# next taking the mean and PI which is a percentile interval
# PI is built in function in rethinking with a default 0.89 probability interval
lambda.mu <- apply(lambda,2,mean)
lambda.PI <- apply(lambda,2,PI)

# superimpose trend
lines( pred_dat$F , lambda.mu )
shade( lambda.PI , pred_dat$F )


# compute sampling distribution
deaths_sim <- sim(m1,data=pred_dat)
deaths_sim.PI <- apply(deaths_sim,2,PI)
# superimpose sampling interval as dashed lines
lines( pred_dat$F , deaths_sim.PI[1,] , lty=2 )
lines( pred_dat$F , deaths_sim.PI[2,] , lty=2 )

We can’t even see the 89% interval of the expected value, because it is so narrow. The sampling dis- tribution isn’t much wider itself. What you can see here is that femininity accounts for very little of the variation in deaths, especially at the high end. There’s a lot of over-dispersion, which is very common in Poisson models. As a consequence, this homogenous Poisson model does a poor job for most of the hurricanes in the sample, as most of them lie outside the prediction envelop (the dashed boundaries).

Any trend also seems to be driven by a small number of extreme storms with feminine names. As you might expect, PSIS does not like this:

stem(PSISk(m1))

## Some Pareto k values are very high (>1). Set pointwise=TRUE to inspect individual points.

## 
##   The decimal point is 1 digit(s) to the left of the |
## 
##   -0 | 2
##    0 | 12344445566788890000001123333445556666667777788999
##    2 | 0111222222333444444444555678991
##    4 | 13
##    6 | 
##    8 | 401
##   10 | 4
##   12 | 9
##   14 | 
##   16 | 3
##   18 | 
##   20 | 2
##   22 | 
##   24 | 4

This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.

Try executing this chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Cmd+Shift+Enter.

plot(cars)

Add a new chunk by clicking the Insert Chunk button on the toolbar or by pressing Cmd+Option+I.

When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the Preview button or press Cmd+Shift+K to preview the HTML file).

The preview shows you a rendered HTML copy of the contents of the editor. Consequently, unlike Knit, Preview does not run any R code chunks. Instead, the output of the chunk when it was last run in the editor is displayed.