Statistical Rethinking - Chapter 12

Monsters & Mixtures

Introduction

These are answers and solutions to the exercises at the end of chapter 12 in Satistical Rethinking 2 by Richard McElreath. I have created these notes as a part of my ongoing involvement in the AU Bayes Study Group. Much of my inspiration for these solutions, where necessary, has been obtained from the solutions provided to instructors by Richard McElreath himself.

R Environment

For today’s exercise, I load the following packages:

library(rethinking)
library(rstan)
library(ggplot2)
library(tidybayes)

Easy Exercises

Practice E1

Question: What is the difference between an ordered categorical variable and an unordered one? Define and then give an example of each.

Answer:
Ordered categorical variables are those which are expressed on ordinal scales. Not very helpful, right? Well, these are variables which establish a pre-defined number of distinct outcomes. These may be expressed as numbers, but don’t have to be. What’s special about these variables is that their values (i.e. categories) can be ordered meaningfully from one extreme to the another without implying equal distances between the values. As an example, think of sparrow (Passer domesticus) weight. We may have measured these in grams (as a continuous variable), but now want to model simply whether our sparrows are light-, medium-, or heavy-weights. This is an ordered categorical variable because we now how they can be ordered from one extreme to another, but we don’t assume that a sparrow has to become heavier by the same margin to classify as medium-weight instead of light-weight than to classify as heavy-weight instead of medium-weight.

Unordered categorical variables are much like their ordered counterpart, but come with an important distinction: we cannot order these in any meaningful way. Again, think of the common house sparrow (Passer domesticus) and their colouration patterns. We may want to record them as overall black, brown, or grey. These are categories, but we certainly cannot order these from one extreme to another.

Practice E2

Question: What kind of link function does an ordered logistic regression employ? How does it differ from an ordinary logit link?

Answer:
Cumulative logit - \(OrdLogit( = \phi, K)\). This link function defines a number of \(K-1\) cumulative, proportional probabilities (\(\phi\)) of each outcome category. The \(K-1\) \(\phi_i\) sum up to 1 so that the \(\phi_K = 1\) and can subsequently be dropped from the model. The link thus states that the linear model defines the log-cumulative odds of an event.

Practice E3

Question: When count data are zero-inflated, using a model that ignores zero-inflation will tend to induce which kind of inferential error?

Answer:
Under-prediction of the true rate of events. Zero-inflation means that counts of zero arise through more than one process at least one of which is not accounted for in our model. Subsequently our estimate of the true rate will be pushed closer to 0 than it truly is.

Practice E4

Question: Over-dispersion is common in count data. Give an example of a natural process that might produce over-dispersed counts. Can you also give an example of a process that might produce under- dispersed counts?

Answer:
Over-dispersion often comes about as a result of heterogeneity in rates across different sampling units/systems. As an example, think of lizard counts in different patches of a dryland area over a given period of time. The resulting count data will likely be over-dispersed since the rate at which lizards are observed will vary strongly across the different study sites.

Under-dispersion, on the other hand, shows less variation in the rates than would be expected. This is often the case when autocrrelation plays a role. For example, if we track our lizard abundances at each patch through time in half-day intervals, we are likely to end up with highly autocorrelated and under-dispersed counts/rates for each study site.

Medium Exercises

Practice M1

Question: At a certain university, employees are annually rated from 1 to 4 on their productivity, with 1 being least productive and 4 most productive. In a certain department at this certain university in a certain year, the numbers of employees receiving each rating were (from 1 to 4): 12, 36, 7, 41. Compute the log cumulative odds of each rating.

Answer:

n <- c(12, 36, 7, 41) # assignment
q <- n / sum(n) # proportions
p <- cumsum(q) # cumulative proportions
o <- p / (1 - p) # cumulative odds
log(o) # log-cumulative odds
## [1] -1.9459101  0.0000000  0.2937611        Inf

Practice M2

Question: Make a version of Figure 12.5 for the employee ratings data given just above.

Answer:

# plot raw proportions
plot(1:4, p,
  xlab = "rating", ylab = "cumulative proportion",
  xlim = c(0.7, 4.3), ylim = c(0, 1), xaxt = "n", cex = 3
)
axis(1, at = 1:4, labels = 1:4)
# plot gray cumulative probability lines
for (x in 1:4) lines(c(x, x), c(0, p[x]), col = "gray", lwd = 4)
# plot blue discrete probability segments
for (x in 1:4) lines(c(x, x) + 0.1, c(p[x] - q[x], p[x]), col = "slateblue", lwd = 4)
# add number labels
text(1:4 + 0.2, p - q / 2, labels = 1:4, col = "slateblue", cex = 3)

Practice M3

Question: Can you modify the derivation of the zero-inflated Poisson distribution (ZIPoisson) from the chapter to construct a zero-inflated binomial distribution?

Answer:
Let’s remind ourselves of the ZI-Poisson distribution from the chapter:

\[ ZIPoisson(p, \lambda) \]

with:
- p = probability of no count generating process occurring
- λ = rate at which counts are produced when process occurs

This is an extension of the Poisson distribution which is in itself a special version of the Binomial distribution with many trials and a low success rate. The zero-inflated Poisson may also be expressed as (\(F\) and \(S\) indicate binomial process has succeeded or failed in prohibiting the poisson process from happening, respectively):

\[Pr(0|p_0, λ) = Pr(F|p_0) + Pr(S|p_0)*Pr(0|λ) = p_0 + (1 − p_0) * exp(−λ)\] For zero-observations.

\[Pr(y|y>0,p_0,\lambda) = Pr(S|p_0)(0) + Pr(F|p_0)*Pr(y|\lambda) = (1-p_0)\frac{\lambda^yexp(-\lambda)}{y!}\] For non zero-observations.

So how do we now get to a binomial specification here? By changing the Poisson likelihood (\(exp(−λ)\)) to a Binomial likelihood (\((1 − q)^n\) with \(q\) denoting the probability of success):

\[Pr(0|p_0, q, n) = p_0 + (1 − p_0)(1 − q)^n\] For zero-observations.

\[Pr(y|p_0, q, n) = (1 − p_0) Binom(y, n, q) = (1 − p_0) \frac{n!}{y!(n − y)!}*q^y(1 − q)^{n−y}\] For non zero-observations.

Hard Exercises

Practice H1

Question: In 2014, a paper was published that was entitled “Female hurricanes are deadlier than male hurricanes.” As the title suggests, the paper claimed that hurricanes with female names have caused greater loss of life, and the explanation given is that people unconsciously rate female hurricanes as less dangerous and so are less likely to evacuate. Statisticians severely criticized the paper after publication. Here, you’ll explore the complete data used in the paper and consider the hypothesis that hurricanes with female names are deadlier. Load the data with:

library(rethinking)
data(Hurricanes)

Acquaint yourself with the columns by inspecting the help ?Hurricanes. In this problem, you’ll focus on predicting deaths using femininity of each hurricane’s name.

Fit and interpret the simplest possible model, a Poisson model of deaths using femininity as a predictor. You can use quap or ulam. Compare the model to an intercept-only Poisson model of deaths. How strong is the association between femininity of name and deaths? Which storms does the model fit (retrodict) well? Which storms does it fit poorly?

Answer: First, let’s prepare the data:

d <- Hurricanes # load data on object called d
d$fem_std <- (d$femininity - mean(d$femininity)) / sd(d$femininity) # standardised femininity
dat <- list(D = d$deaths, F = d$fem_std)

Now that we have standardised data for the feminity of our hurricane names which makes priors easier to formulate, we can specify our initial model idea:

# model formula
f <- alist(
  D ~ dpois(lambda), # poisson outcome distribution
  log(lambda) <- a + bF * F, # log-link for lambda with linear model
  # priors in log-space, 0 corresponds to outcome of 1
  a ~ dnorm(1, 1),
  bF ~ dnorm(0, 1)
)

But are these priors any good? Let’s simulate them why don’t we:

N <- 1e3
a <- rnorm(N, 1, 1)
bF <- rnorm(N, 0, 1)
F_seq <- seq(from = -2, to = 2, length.out = 30) # sequence from -2 to 2 because femininity data is standardised
plot(NULL,
  xlim = c(-2, 2), ylim = c(0, 500),
  xlab = "name femininity (std)", ylab = "deaths"
)
for (i in 1:N) {
  lines(F_seq,
    exp(a[i] + bF[i] * F_seq), # inverse link to get outcome scale
    col = grau(), lwd = 1.5
  )
}

I’d think that’s pretty alright. We allow for both positive and negative trends between death toll and femininity of hurricane name, but don’t have a lot of explosive trends in our priors. These strong trends are quite unintuitive. Our vast majority of trends however are very ambiguous and so I proceed with these priors and run the model:

mH1 <- ulam(f, data = dat, chains = 4, cores = 4, log_lik = TRUE)
precis(mH1)
##         mean         sd      5.5%     94.5%   n_eff     Rhat4
## a  2.9998653 0.02345130 2.9622623 3.0368836 1198.55 1.0013218
## bF 0.2378941 0.02423528 0.1999448 0.2766489 1367.56 0.9983856

So according to this, there is a positive relationship between hurricane name femininity and death toll. Which hurricanes do we actually retrodict well, though? Let’s plot, this:

# plot raw data
plot(dat$F, dat$D,
  pch = 16, lwd = 2,
  col = rangi2, xlab = "femininity (std)", ylab = "deaths"
)
# compute model-based trend
pred_dat <- list(F = seq(from = -2, to = 2, length.out = 1e2))
lambda <- link(mH1, data = pred_dat) # predict deaths
lambda.mu <- apply(lambda, 2, mean) # get mean prediction
lambda.PI <- apply(lambda, 2, PI) # get prediction interval
# superimpose trend
lines(pred_dat$F, lambda.mu)
shade(lambda.PI, pred_dat$F)
# compute sampling distribution
deaths_sim <- sim(mH1, data = pred_dat) # simulate posterior observations
deaths_sim.PI <- apply(deaths_sim, 2, PI) # get simulation interval
# superimpose sampling interval as dashed lines
lines(pred_dat$F, deaths_sim.PI[1, ], lty = 2)
lines(pred_dat$F, deaths_sim.PI[2, ], lty = 2)

Ok. There is quite a bit to unpack here. First of all, our model does not retrodict many of the hurricanes well even though it is quite certain of its predictions (grey shaded area which is hardly visible). Quite obviously, this model misses many of the hurricane death tolls to the right hand side of the above plot. This is a clear sign of over-dispersion which our model failed to account for. The weak, positive trend we are seeing here seems to be informed largely by these highly influential data points. We can assess whether and how influential some data points are with the Paraeto-K values (anything above 1 indicates an influential data point) following:

ggplot(as.data.frame(PSISk(mH1)), aes(x = PSISk(mH1))) +
  stat_halfeye() +
  theme_bw() +
  labs(title = "Paraeto-K values", subtitle = "Values > 1 indicate highly influential data")

Boy! Some hurricanes really do drive our model to a big extent!

Practice H2

Question: Counts are nearly always over-dispersed relative to Poisson. So fit a gamma-Poisson (aka negative-binomial) model to predict deaths using femininity. Show that the over-dispersed model no longer shows as precise a positive association between femininity and deaths, with an 89% interval that overlaps zero. Can you explain why the association diminished in strength?

Answer: To start this off, I load the library and data again, so much of the exercise and my solutions can stand by itself:

library(rethinking)
data(Hurricanes)
d <- Hurricanes # load data on object called d
d$fem_std <- (d$femininity - mean(d$femininity)) / sd(d$femininity) # standardised femininity
dat <- list(D = d$deaths, F = d$fem_std)

Again, with the data prepared, we fit our model - the same model as before just with a different outcome distribution:

mH2 <- ulam(
  alist(
    D ~ dgampois(lambda, scale),
    log(lambda) <- a + bF * F,
    a ~ dnorm(1, 1),
    bF ~ dnorm(0, 1),
    scale ~ dexp(1) # strictly positive hence why exponential prior
  ),
  data = dat, chains = 4, cores = 4, log_lik = TRUE
)
precis(mH2)
##            mean         sd        5.5%     94.5%    n_eff     Rhat4
## a     2.9735807 0.15873978  2.71980856 3.2273444 2096.943 0.9997735
## bF    0.2172123 0.15302770 -0.03677845 0.4554010 1538.676 0.9986574
## scale 0.4534511 0.06101545  0.36222346 0.5564104 1850.649 1.0007542

Cool. Our previously identified positive relationship between standardised femininity of hurricane name and death toll is still there albeit slightly diminished in magnitude. However, the credible interval around it has widened considerably and overlaps zero now.

Let’s compare the estimates of our models side by side:

plot(coeftab(mH1, mH2))

These shows quite clearly how our new model is much more uncertain of the parameters.

So what about the predictions of this new model? I plot them the exact same way as previously:

# plot raw data
plot(dat$F, dat$D,
  pch = 16, lwd = 2,
  col = rangi2, xlab = "femininity (std)", ylab = "deaths"
)
# compute model-based trend
pred_dat <- list(F = seq(from = -2, to = 2, length.out = 1e2))
lambda <- link(mH2, data = pred_dat)
lambda.mu <- apply(lambda, 2, mean)
lambda.PI <- apply(lambda, 2, PI)
# superimpose trend
lines(pred_dat$F, lambda.mu)
shade(lambda.PI, pred_dat$F)
# compute sampling distribution
deaths_sim <- sim(mH2, data = pred_dat)
deaths_sim.PI <- apply(deaths_sim, 2, PI)
# superimpose sampling interval as dashed lines
lines(pred_dat$F, deaths_sim.PI[1, ], lty = 2)
lines(pred_dat$F, deaths_sim.PI[2, ], lty = 2)

What’s there left to say other than: “Look at that increased uncertainty of our model” at this point? Well, we can talk about the accuracy of our predictions. They still blow. The uncertainty of our model is nice and all, but with a predictive accuracy like this why would we trust the model?

For now, let’s turn to the conceptual part of this exercise: “Why has the association diminished with the new model?” The question comes down to understanding what the gamma distribution does to our model. The gamma distribution allows for a death rate to be calculated for each outcome individually rather than one overall death rate for all hurricanes. These individual rates are sampled from a common distribution which is a function of the femininity of hurricane names. As a matter of fact, we can plot this:

post <- extract.samples(mH2)
par(mfrow = c(1, 3))
for (fem in -1:1) {
  for (i in 1:1e2) {
    curve(dgamma2(
      x, # where to calculate density
      exp(post$a[i] + post$bF[i] * fem), # linear model with inverse link applied
      post$scale[i] # scale for gamma
    ),
    from = 0, to = 70, xlab = "mean deaths", ylab = "Density",
    ylim = c(0, 0.19), col = col.alpha("black", 0.2),
    add = ifelse(i == 1, FALSE, TRUE)
    )
  }
  mtext(concat("femininity  =   ", fem))
}

These are the gamma distributions samples from the posterior distribution of death rates when assuming same femininity of name for all of them at three different levels of femininity. Yes, a distribution sampled from another distribution. The above plots simply show the uncertainty of which gamma distribution to settle on.

Since our model and gamma distributions are informed by a, bF, and the scale for the gamma distribution at the same time many combinations of a and bF are consistent with the data which results in a wider posterior distribution.

Finally, let’s look at Paraeto-K values and potentially influential data again:

ggplot(as.data.frame(PSISk(mH2)), aes(x = PSISk(mH2))) +
  stat_halfeye() +
  theme_bw() +
  labs(title = "Paraeto-K values", subtitle = "Values > 1 indicate highly influential data")

MUCH BETTER than before!

Practice H3

Question: In order to infer a strong association between deaths and femininity, it’s necessary to include an interaction effect. In the data, there are two measures of a hurricane’s potential to cause death: damage_norm and min_pressure. Consult ?Hurricanes for their meanings. It makes some sense to imagine that femininity of a name matters more when the hurricane is itself deadly. This implies an interaction between femininity and either or both of damage_norm and min_pressure. Fit a series of models evaluating these interactions. Interpret and compare the models. In interpreting the estimates, it may help to generate counterfactual predictions contrasting hurricanes with masculine and feminine names. Are the effect sizes plausible?

Answer: To start this off, I load the library and data again, so much of the exercise and my solutions can stand by itself:

library(rethinking)
data(Hurricanes)
d <- Hurricanes # load data on object called d
d$fem_std <- (d$femininity - mean(d$femininity)) / sd(d$femininity) # standardised femininity
dat <- list(D = d$deaths, F = d$fem_std)
dat$P <- standardize(d$min_pressure)
dat$S <- standardize(d$damage_norm)

The data is ready and I step into my model fitting procedure. Here, I start with a basic model which builds on the previous gamma-Poisson model by adding an interaction between femininity and min_pressure:

mH3a <- ulam(
  alist(
    D ~ dgampois(lambda, scale),
    log(lambda) <- a + bF * F + bP * P + bFP * F * P,
    a ~ dnorm(1, 1),
    c(bF, bP, bFP) ~ dnorm(0, 1),
    scale ~ dexp(1)
  ),
  data = dat, cores = 4, chains = 4, log_lik = TRUE
)
precis(mH3a)
##             mean         sd        5.5%      94.5%    n_eff     Rhat4
## a      2.7480924 0.14797040  2.51802706  2.9826248 2779.554 0.9991208
## bFP    0.3033002 0.14366822  0.07765289  0.5252648 2038.177 0.9998207
## bP    -0.6775038 0.13796539 -0.89876426 -0.4637189 2245.162 0.9994850
## bF     0.3020955 0.14253393  0.06907292  0.5218616 2758.835 1.0009592
## scale  0.5519101 0.08253778  0.42639517  0.6899619 2521.881 0.9991405

As minimum pressure gets lower, a storm grows stronger (I was confused by that myself when answering these exercises). Quite obviously, the lower the pressure in a storm, the more severe the storm, and the more people die which is reflected by the negative value in bP. bF is still estimated to be positive. This time, the interval doesn’t even overlap zero. Meanwhile, the interaction effect bFP is positive. I find it hard to interpret this so I’d rather plot some predictions against real data:

P_seq <- seq(from = -3, to = 2, length.out = 1e2) # pressure sequence
# 'masculine' storms
d_pred <- data.frame(F = -1, P = P_seq)
lambda_m <- link(mH3a, data = d_pred)
lambda_m.mu <- apply(lambda_m, 2, mean)
lambda_m.PI <- apply(lambda_m, 2, PI)
# 'feminine' storms
d_pred <- data.frame(F = 1, P = P_seq)
lambda_f <- link(mH3a, data = d_pred)
lambda_f.mu <- apply(lambda_f, 2, mean)
lambda_f.PI <- apply(lambda_f, 2, PI)
# Plotting, sqrt() to make differences easier to spot, can't use log because there are storm with zero deaths
plot(dat$P, sqrt(dat$D),
  pch = 1, lwd = 2, col = ifelse(dat$F > 0, "red", "dark gray"),
  xlab = "minimum pressure (std)", ylab = "sqrt(deaths)"
)
lines(P_seq, sqrt(lambda_m.mu), lty = 2)
shade(sqrt(lambda_m.PI), P_seq)
lines(P_seq, sqrt(lambda_f.mu), lty = 1, col = "red")
shade(sqrt(lambda_f.PI), P_seq, col = col.alpha("red", 0.2))

Our model expects masculine (grey) storms to be less deadly, on average, than feminine (red) ones. As pressure drops (toward the rightward side of the plot above), these differences become smaller and smaller. Quite evidently, some of these storms are influencing what our model predicts much more so than others:

ggplot(as.data.frame(PSISk(mH3a)), aes(x = PSISk(mH3a))) +
  stat_halfeye() +
  theme_bw() +
  labs(title = "Paraeto-K values", subtitle = "Values > 1 indicate highly influential data")

Let’s turn to the second variable we may want to add damage_norm - the damage caused by each storm:

mH3b <- ulam(
  alist(
    D ~ dgampois(lambda, scale),
    log(lambda) <- a + bF * F + bS * S + bFS * F * S,
    a ~ dnorm(1, 1),
    c(bF, bS, bFS) ~ dnorm(0, 1),
    scale ~ dexp(1)
  ),
  data = dat, chains = 4, cores = 4, log_lik = TRUE
)
precis(mH3b)
##             mean        sd        5.5%     94.5%    n_eff     Rhat4
## a     2.57124612 0.1337868  2.35859198 2.7875071 2385.847 1.0001830
## bFS   0.31046602 0.1988274 -0.02343345 0.6104912 2263.938 1.0016245
## bS    1.25128079 0.2248294  0.90735742 1.6330986 2179.841 1.0001087
## bF    0.08612896 0.1262273 -0.12038911 0.2904270 2270.913 0.9994756
## scale 0.68412665 0.1023655  0.52962462 0.8539851 2156.253 1.0014874

That just eradicated the effect of femininity of hurricane name (bF)! The newly added interaction parameter bFS is incredibly strong and positive. Again, let’s visualise this:

S_seq <- seq(from = -1, to = 5.5, length.out = 1e2) # damage sequence
# 'masculine' storms
d_pred <- data.frame(F = -1, S = S_seq)
lambda_m <- link(mH3b, data = d_pred)
lambda_m.mu <- apply(lambda_m, 2, mean)
lambda_m.PI <- apply(lambda_m, 2, PI)
# 'feminine' storms
d_pred <- data.frame(F = 1, S = S_seq)
lambda_f <- link(mH3b, data = d_pred)
lambda_f.mu <- apply(lambda_f, 2, mean)
lambda_f.PI <- apply(lambda_f, 2, PI)
# plot
plot(dat$S, sqrt(dat$D),
  pch = 1, lwd = 2, col = ifelse(dat$F > 0, "red", "dark gray"),
  xlab = "normalized damage (std)", ylab = "sqrt(deaths)"
)
lines(S_seq, sqrt(lambda_m.mu), lty = 2)
shade(sqrt(lambda_m.PI), S_seq)
lines(S_seq, sqrt(lambda_f.mu), lty = 1, col = "red")
shade(sqrt(lambda_f.PI), S_seq, col = col.alpha("red", 0.2))

We can clearly see how our model makes less of a distinction between masculine and feminine hurricanes overall at this point. Damage norm scales multiplicatively. The distances grow fast as we approach the rightward side of the plot. This is difficult for the model to account for. Hence why the model is underwhelming.

So why is the interaction effect so strong? Probably because of those 3-4 highly influential feminine storms at the upper-righthand corner of our plot above which implies that feminine storms are especially deadly when they are damaging to begin with. Personally, I don’t trust this association and would argue that there is no logical reason for it and most likely an artefact of the limited data availability.

Practice H4

Question: In the original hurricanes paper, storm damage (damage_norm) was used directly. This assumption implies that mortality increases exponentially with a linear increase in storm strength, because a Poisson regression uses a log link. So it’s worth exploring an alternative hypothesis: that the logarithm of storm strength is what matters. Explore this by using the logarithm of damage_norm as a predictor. Using the best model structure from the previous problem, compare a model that uses log(damage_norm) to a model that uses damage_norm directly. Compare their DIC/WAIC values as well as their implied predictions. What do you conclude?

Answer: To start this off, I load the library and data again, so much of the exercise and my solutions can stand by itself:

library(rethinking)
data(Hurricanes)
d <- Hurricanes # load data on object called d
d$fem_std <- (d$femininity - mean(d$femininity)) / sd(d$femininity) # standardised femininity
dat <- list(D = d$deaths, F = d$fem_std)
dat$S2 <- standardize(log(d$damage_norm))

Let’s fit the model as before and compare it to the previously identified best model:

mH4 <- ulam(
  alist(
    D ~ dgampois(lambda, scale),
    log(lambda) <- a + bF * F + bS * S2 + bFS * F * S2,
    a ~ dnorm(1, 1),
    c(bF, bS, bFS) ~ dnorm(0, 1),
    scale ~ dexp(1)
  ),
  data = dat, chains = 4, cores = 4, log_lik = TRUE
)
compare(mH3b, mH4, func = PSIS)
##          PSIS       SE    dPSIS      dSE    pPSIS       weight
## mH4  630.3395 31.13000  0.00000       NA 5.204153 1.000000e+00
## mH3b 671.2661 34.22309 40.92665 13.62399 7.131042 1.296855e-09

Model mH4 clearly outperforms the earlier (non-logarithmic) model mH3b. How do the parameter estimates look in comparison?

plot(coeftab(mH3b, mH4),
  labels = paste(rep(rownames(coeftab(mH3b, mH4)@coefs), each = 2),
    rep(c("Norm", "Log"), nrow(coeftab(mH3b, mH4)@coefs) * 2),
    sep = "-"
  )
)

With the log-transformed input, bFS has increased in magnitude. What do the resulting predictions look like?

S2_seq <- seq(from = -3, to = 1.8, length.out = 1e2)
# 'masculine' storms
d_pred <- data.frame(F = -1, S2 = S2_seq)
lambda_m <- link(mH4, data = d_pred)
lambda_m.mu <- apply(lambda_m, 2, mean)
lambda_m.PI <- apply(lambda_m, 2, PI)
# 'feminine' storms
d_pred <- data.frame(F = 1, S2 = S2_seq)
lambda_f <- link(mH4, data = d_pred)
lambda_f.mu <- apply(lambda_f, 2, mean)
lambda_f.PI <- apply(lambda_f, 2, PI)
# plot
plot(dat$S2, sqrt(dat$D),
  pch = 1, lwd = 2, col = ifelse(dat$F > 0, "red", "dark gray"),
  xlab = "normalized damage (std)", ylab = "sqrt(deaths)"
)
lines(S2_seq, sqrt(lambda_m.mu), lty = 2)
shade(sqrt(lambda_m.PI), S2_seq)
lines(S2_seq, sqrt(lambda_f.mu), lty = 1, col = "red")
shade(sqrt(lambda_f.PI), S2_seq, col = col.alpha("red", 0.2))

Now this model fits the data much better! Still not perfect, but much better.

Practice H5

Question: One hypothesis from developmental psychology, usually attributed to Carol Gilligan, proposes that women and men have different average tendencies in moral reasoning. Like most hypotheses in social psychology, it is merely descriptive. The notion is that women are more concerned with care (avoiding harm), while men are more concerned with justice and rights. Culture-bound nonsense? Yes. Descriptively accurate? Maybe.

Evaluate this hypothesis, using the Trolley data, supposing that contact provides a proxy for physical harm. Are women more or less bothered by contact than are men, in these data? Figure out the model(s) that is needed to address this question.

Answer: Again, let’s start by preparing the data:

library(rethinking)
data(Trolley)
d <- Trolley
dat <- list(
  R = d$response,
  A = d$action,
  I = d$intention,
  C = d$contact
)
dat$Gid <- ifelse(d$male == 1, 1L, 2L)

Now for a model. We use the same model skeleton as provided in the book. However, this time around, we want cutpoints in our ordered, cumulative logit for both genders separately:

dat$F <- 1L - d$male # indicator of femaleness to turn intercepts on and off
mH5 <- ulam(
  alist(
    R ~ dordlogit(phi, cutpoints),
    phi <- a * F + bA[Gid] * A + bC[Gid] * C + BI * I,
    BI <- bI[Gid] + bIA[Gid] * A + bIC[Gid] * C,
    c(bA, bI, bC, bIA, bIC)[Gid] ~ dnorm(0, 0.5),
    a ~ dnorm(0, 0.5),
    cutpoints ~ dnorm(0, 1.5)
  ),
  data = dat, chains = 4, cores = 4
)
precis(mH5, depth = 2)
##                     mean         sd        5.5%       94.5%     n_eff     Rhat4
## bIC[1]       -1.28776413 0.12452329 -1.49349376 -1.09043808 1210.4762 1.0001366
## bIC[2]       -1.15514189 0.13629426 -1.37697516 -0.93731630 1291.4985 0.9991484
## bIA[1]       -0.43745288 0.10184632 -0.60350695 -0.27738267 1525.4676 0.9991819
## bIA[2]       -0.42701493 0.11528001 -0.60877149 -0.24292151 1426.6589 0.9982597
## bC[1]        -0.47894931 0.09009984 -0.62355913 -0.33291247 1110.8138 1.0006476
## bC[2]        -0.20209081 0.09928662 -0.36387442 -0.04410148 1291.8395 0.9992070
## bI[1]        -0.33743538 0.07394627 -0.45406548 -0.22173726 1139.1604 1.0011958
## bI[2]        -0.25429907 0.08275479 -0.38873521 -0.12311571 1198.5963 0.9982721
## bA[1]        -0.59429031 0.07235707 -0.70868384 -0.47803201 1167.8686 1.0010484
## bA[2]        -0.33249883 0.07800468 -0.45872190 -0.20674815 1240.8940 0.9992808
## a            -0.78641535 0.08390429 -0.91792954 -0.65361595  900.4896 1.0011368
## cutpoints[1] -3.02272722 0.06529648 -3.12770540 -2.92015258 1084.9758 1.0014825
## cutpoints[2] -2.32822869 0.06142668 -2.42573329 -2.23193379 1099.1906 1.0005929
## cutpoints[3] -1.72837561 0.05989128 -1.82493937 -1.63501623 1041.7192 1.0015730
## cutpoints[4] -0.67227857 0.05731897 -0.76242491 -0.57960853 1056.7917 1.0018666
## cutpoints[5]  0.01783675 0.05619122 -0.07240868  0.10860230 1120.0222 1.0022789
## cutpoints[6]  0.94726231 0.05829195  0.85344993  1.04378423 1170.5995 1.0018747

The parameter estimates of interest here are:
1. a (-0.79) - main effect of being female on cumulative log-odds. On average women, have more moral qualms about the trolley problem it seems.
2. bC[2] (-0.2) - the interaction effect of being both female and in a contact scenario as opposed to bc[1] (-0.48) which is the same scenario but for men. Women in our set-up had less moral issues with contact events.

The latter goes against the previously stated hypothesis. Why is that? Because the people in our study are much more complex than just their genders.

Practice H6

Question: The data in data(Fish) are records of visits to a national park. See ?Fish for details. The question of interest is how many fish an average visitor takes per hour, when fishing. The problem is that not everyone tried to fish, so the fish_caught numbers are zero-inflated. As with the monks example in the chapter, there is a process that determines who is fishing (working) and another process that determines fish per hour (manuscripts per day), conditional on fishing (working). We want to model both. Otherwise we’ll end up with an underestimate of rate of fish extraction from the park.

You will model these data using zero-inflated Poisson GLMs. Predict fish_caught as a function of any of the other variables you think are relevant. One thing you must do, however, is use a proper Poisson offset/exposure in the Poisson portion of the zero-inflated model. Then use the hours variable to construct the offset. This will adjust the model for the differing amount of time individuals spent in the park.

Answer: One last time, for this week, we prepare some data:

library(rethinking)
data(Fish)
d <- Fish
str(d)
## 'data.frame':    250 obs. of  6 variables:
##  $ fish_caught: int  0 0 0 0 1 0 0 0 0 1 ...
##  $ livebait   : int  0 1 1 1 1 1 1 1 0 1 ...
##  $ camper     : int  0 1 0 1 0 1 0 0 1 1 ...
##  $ persons    : int  1 1 1 2 1 4 3 4 3 1 ...
##  $ child      : int  0 0 0 1 0 2 1 3 2 0 ...
##  $ hours      : num  21.124 5.732 1.323 0.548 1.695 ...

The model I want to build will look at the following variables:
- fish_caught. This is our response variable.
- livebait. I suggest that using livebait increases number of fish caught.
- camper. I assume that being a camper increases the chances that one goes fishing, but not necessarily the number of fish caught.
- persons. I assume that the number of people in a group increases how many fish are caught.
- child. Being a child should reasonably determine both whether one fishes (I assume children fish less - call it personal bias) and also that they are less effective than adults.
- hours. How long one fishes surely determines how many fish are caught. I want to include the base rate of these into my model of fish caught. Since said model will be log-linked, I log-transform them.

Let me fit that model:

d$loghours <- log(d$hours)
mH6 <- ulam(
  alist(
    # outcome distribution
    fish_caught ~ dzipois(p, mu),
    # linear model for probability of fishing
    logit(p) <- a0 + bC0 * camper + bc0 * child,
    # linear model of catching a number of fish
    log(mu) <- a + bb * livebait + bp * persons + bc * child + bl * loghours,
    c(a0, a) ~ dnorm(0, 1),
    c(bC0, bc0, bb, bp, bc, bl) ~ dnorm(0, 0.5)
  ),
  data = d, chains = 4, cores = 4, log_lik = TRUE
)
precis(mH6)
##           mean         sd       5.5%       94.5%    n_eff     Rhat4
## a   -2.0808366 0.24498827 -2.4844336 -1.70047414 1144.879 1.0007799
## a0  -0.4594813 0.27584773 -0.8874882 -0.03156759 1341.574 0.9994358
## bl   0.1683819 0.03488956  0.1148935  0.22670272 1407.210 1.0018504
## bc  -0.8362166 0.10888732 -1.0065860 -0.65821472 1469.922 1.0037577
## bp   0.8613102 0.04366810  0.7935201  0.93331197 1414.751 0.9998013
## bb   1.4132781 0.19281123  1.1202646  1.73085696 1210.522 1.0001570
## bc0  0.9744269 0.22258670  0.6239488  1.32434268 1580.138 1.0020808
## bC0 -0.6584943 0.29483444 -1.1374854 -0.18031343 1423.052 0.9991185

Remember that \(p\) stands for the probability of not going to fish. Overall, park visitors are pretty likely to fish (a0 =-0.46, logit scale). In line with my intuition, campers are more likely to fish (bC0) while children are less likely to fish (bc0).

Once one is actually fishing, one is not likely to catch much fish (a=-2.08). The parameters pertaining to number of fish caught are expressed on log-scale and so transforming a into the outcome scale (counts of fish) by exponentiating, an adult who fishes by themselves without livebait (this is what a refers to) catches around 0.1249302 fish on average. More people catch more fish (bp). Using livebait is effective (bb). In line with my intuition, children catch less fish than adults (bc). Lastly, the more time one spends fishing, the more fish one catches (bl).

Let’s turn to some actual model predictions of p and mu:

zip_link <- link(mH6)
str(zip_link)
## List of 2
##  $ p : num [1:2000, 1:250] 0.435 0.358 0.391 0.48 0.437 ...
##  $ mu: num [1:2000, 1:250] 0.668 0.462 0.533 0.697 0.244 ...

p cases provide estimates for additional zeros not obtained through the actual Poisson-process behind fishing catches whose output lies with mu.

For example, if $p =$0.39 (the inverse logit of a0 in the model above) and \(μ = 1\) (much higher than a in the model above for ease here), then the implied predictive distribution is:

zeros <- rbinom(
  n = 1e4, # number of samples
  size = 1,
  prob = inv_logit(precis(mH6)[2, 1]) # probability of going fishing
)
obs_fish <- (1 - zeros) * rpois(1e4, 1)
simplehist(obs_fish)

Let’s see what our model would predict given the estimated parameters:

fish_sim <- sim(mH6)
str(fish_sim)
##  num [1:1000, 1:250] 0 1 0 0 1 1 0 0 0 1 ...

This now contains a simulation output for each sample in our data! We could now plot ourselves into oblivion with counterfactual plots. We can also produce counterfactual posterior predictions. As an example, I assume a party of 1 adult spending 1 hour in the park without any use of livebait and not staying the night as a camper:

# new data
pred_dat <- list(
  loghours = log(1), # note that this is zero, the baseline rate
  persons = 1,
  child = 0,
  livebait = 0,
  camper = 0
)
# sim predictions - want expected number of fish, but must use both processes
fish_link <- link(mH6, data = pred_dat)
# summarize
p <- fish_link$p
mu <- fish_link$mu
(expected_fish_mean <- mean((1 - p) * mu))
## [1] 0.1840517
(expected_fish_PI <- PI((1 - p) * mu))
##        5%       94% 
## 0.1248585 0.2547767

This tells us that this hypothetical person is expected to catch 0.1840517 fish with the interval displayed above.

Session Info

sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18363)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252    LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C                           
## [5] LC_TIME=English_United Kingdom.1252    
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] tidybayes_2.3.1      rethinking_2.13      rstan_2.21.2         ggplot2_3.3.3        StanHeaders_2.21.0-7
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.6           mvtnorm_1.1-1        lattice_0.20-41      tidyr_1.1.3          prettyunits_1.1.1    ps_1.6.0             assertthat_0.2.1     digest_0.6.27        utf8_1.2.1          
## [10] V8_3.4.0             plyr_1.8.6           R6_2.5.0             backports_1.2.1      stats4_4.0.2         evaluate_0.14        coda_0.19-4          highr_0.9            blogdown_1.3        
## [19] pillar_1.6.0         rlang_0.4.10         curl_4.3             callr_3.7.0          jquerylib_0.1.3      R.utils_2.10.1       R.oo_1.24.0          rmarkdown_2.7        styler_1.4.1        
## [28] labeling_0.4.2       stringr_1.4.0        loo_2.4.1            munsell_0.5.0        compiler_4.0.2       xfun_0.22            pkgconfig_2.0.3      pkgbuild_1.2.0       shape_1.4.5         
## [37] htmltools_0.5.1.1    tidyselect_1.1.0     tibble_3.1.1         gridExtra_2.3        bookdown_0.21        arrayhelpers_1.1-0   codetools_0.2-16     matrixStats_0.58.0   fansi_0.4.2         
## [46] crayon_1.4.1         dplyr_1.0.5          withr_2.4.2          MASS_7.3-51.6        R.methodsS3_1.8.1    distributional_0.2.2 ggdist_2.4.0         grid_4.0.2           jsonlite_1.7.2      
## [55] gtable_0.3.0         lifecycle_1.0.0      DBI_1.1.1            magrittr_2.0.1       scales_1.1.1         RcppParallel_5.1.2   cli_2.4.0            stringi_1.5.3        farver_2.1.0        
## [64] bslib_0.2.4          ellipsis_0.3.1       generics_0.1.0       vctrs_0.3.7          rematch2_2.1.2       forcats_0.5.1        tools_4.0.2          svUnit_1.0.6         R.cache_0.14.0      
## [73] glue_1.4.2           purrr_0.3.4          processx_3.5.1       yaml_2.2.1           inline_0.3.17        colorspace_2.0-0     knitr_1.32           sass_0.3.1
Erik Kusch
Erik Kusch
PhD Student

In my research, I focus on statistical approaches to understanding complex processes and patterns in biology using a variety of data banks.

Related