Chapter 09

Markov Chain Monte Carlo


Introduction

These are answers and solutions to the exercises at the end of chapter 9 in Statistical Rethinking 2 by Richard McElreath. I have created these notes as part of my ongoing involvement in the AU Bayes Study Group. Where necessary, I have drawn inspiration for these solutions from Taras Svirskyi, William Wolf, and Corrie Bartelheimer, as well as from the solutions provided to instructors by Richard McElreath himself.

R Environment

For today’s exercise, I load the following packages:

library(rethinking)
library(rstan)
library(ggplot2)
library(tidybayes)

Easy Exercises

Practice E1

Question: Which of the following is a requirement of the simple Metropolis algorithm?

  1. The parameters must be discrete.
  2. The likelihood function must be Gaussian.
  3. The proposal distribution must be symmetric.

Answer:

  1. Not a requirement. Metropolis can accommodate both continuous and discrete parameters.
  2. Not a requirement. The likelihood function can take any form; Metropolis does not restrict it to a Gaussian.
  3. This is a requirement. The simple Metropolis algorithm relies on a symmetric proposal distribution; without symmetry, the acceptance ratio needs the Metropolis-Hastings correction (see the sketch below).
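
To make the symmetry requirement concrete, here is a minimal sketch (my own illustration, not code from the book) of the Metropolis algorithm for a standard Normal target. Because the proposal is symmetric, its density cancels out of the acceptance ratio, leaving only the ratio of target densities:

set.seed(42)
# Minimal Metropolis sketch targeting a standard Normal.
# rnorm(1, current, 1) is a symmetric proposal, so the acceptance
# ratio reduces to the ratio of target densities.
n_iter <- 1e4
samples <- numeric(n_iter)
current <- 0
for (i in 1:n_iter) {
  proposal <- rnorm(1, mean = current, sd = 1) # symmetric proposal
  ratio <- dnorm(proposal) / dnorm(current) # target density ratio
  if (runif(1) < ratio) current <- proposal
  samples[i] <- current
}
dens(samples) # closely matches dnorm(x, 0, 1)

With an asymmetric proposal, the ratio would additionally need the correction factor q(current | proposal) / q(proposal | current).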

Practice E2

Question: Gibbs sampling is more efficient than the Metropolis algorithm. How does it achieve this extra efficiency? Are there any limitations to the Gibbs sampling strategy?

Answer: Gibbs sampling uses adaptive proposals: each parameter is drawn directly from its full conditional distribution given the current values of all other parameters. Because such draws are always accepted, no proposals are wasted, which makes Gibbs more efficient than plain Metropolis. There are limitations, however: the full conditionals must be available in closed form (which in practice requires conjugate priors), and in high-dimensional models with strongly correlated parameters Gibbs, like Metropolis, tends to get stuck making small, inefficient steps. A minimal sketch of the conjugate-draw idea follows below.
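
To illustrate what adaptive proposals buy us, here is a minimal sketch (my own toy example, not from the book) of Gibbs sampling for a standard bivariate Normal with correlation rho. Each full conditional is itself Normal, so every draw is an exact sample from the conditional and nothing is ever rejected:

set.seed(42)
# Gibbs sampling a standard bivariate Normal with correlation rho.
# The full conditionals are x | y ~ Normal(rho * y, 1 - rho^2) and
# y | x ~ Normal(rho * x, 1 - rho^2), so each update is an exact draw.
rho <- 0.8
n_iter <- 1e4
x <- y <- numeric(n_iter)
for (i in 2:n_iter) {
  x[i] <- rnorm(1, mean = rho * y[i - 1], sd = sqrt(1 - rho^2))
  y[i] <- rnorm(1, mean = rho * x[i], sd = sqrt(1 - rho^2))
}
cor(x, y) # should come out close to rho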

Practice E3

Question: Which sort of parameters can Hamiltonian Monte Carlo not handle? Can you explain why?

Answer: Discrete parameters. HMC relies on gradients of the log-posterior, which it explores using a physics simulation. Discrete parameters do not define a continuous surface, so no gradients can be constructed for them.

Practice E4

Question: Explain the difference between the effective number of samples, n_eff as calculated by Stan, and the actual number of samples.

Answer: The effective number of samples (n_eff) estimates the number of ‘ideal’ (i.e. uncorrelated) samples that the chain is worth. Since MCMC algorithms explore the posterior as a chain of samples, each sample is usually correlated with the previous one to some extent. The actual number of samples is simply the number of draws the sampler produced, whereas n_eff tells us how many independent samples those draws are effectively equivalent to when estimating the posterior mean/distribution.

n_eff is usually smaller than the actual number of samples (unless the MCMC samples are anti-correlated, in which case it can even exceed it).
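
We can see this at work with a quick sketch (assuming the coda package, which rethinking loads as a dependency): an autocorrelated series has a much smaller effective sample size than its raw length suggests:

library(coda)
set.seed(42)
n <- 1e4
# An AR(1) series mimics an MCMC chain: each value is correlated
# with the previous one.
ar <- numeric(n)
for (i in 2:n) ar[i] <- 0.9 * ar[i - 1] + rnorm(1)
effectiveSize(as.mcmc(ar)) # far fewer than 1e4 effective samples
effectiveSize(as.mcmc(rnorm(n))) # independent draws: close to 1e4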

Practice E5

Question: Which value should Rhat approach, when a chain is sampling the posterior distribution correctly?

Answer: $\hat{R}$ (Rhat) compares the variance within each chain to the variance between chains. When all chains sample the same distribution, these two variances match and $\hat{R}$ approaches $1.0$ - i.e. it does not matter from which chain we infer parameters and predictions. Values higher than 1.0 can indicate problems in the model, and values much higher than 1 indicate serious issues.
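
As a toy check (a sketch using rstan's Rhat() helper, which takes a matrix of draws with one column per chain), chains that agree give Rhat near 1, while chains stuck in different regions inflate it:

set.seed(42)
# Two well-mixed chains sampling the same Normal: Rhat close to 1
good <- cbind(rnorm(1000), rnorm(1000))
rstan::Rhat(good)
# Two chains stuck in different regions: between-chain variance
# dwarfs within-chain variance, so Rhat comes out well above 1
bad <- cbind(rnorm(1000, mean = 0), rnorm(1000, mean = 5))
rstan::Rhat(bad)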

Practice E6

Question: Sketch a good trace plot for a Markov chain, one that is effectively sampling from the posterior distribution. What is good about its shape? Then sketch a trace plot for a malfunctioning Markov chain. What about its shape indicates malfunction?

Answer:

Good trace plot

y <- rnorm(1e4, mean = 1, sd = 2)
m.E6Good <- ulam(
  alist(
    y ~ dnorm(mu, sigma),
    mu <- alpha,
    alpha ~ dnorm(0, 10),
    sigma ~ dcauchy(0, 1)
  ),
  data = list(y = y),
  cores = 2,
  chains = 2,
  start = list(
    alpha = 0,
    sigma = 1
  )
)
traceplot(m.E6Good)

These trace plots show that the chains quickly find the region of highest posterior probability and then stay there, mixing well around a stationary mean.

Bad trace plot

y <- rnorm(1e4, mean = 1, sd = 2)
m.E6Bad <- ulam(
  alist(
    y ~ dnorm(mu, sigma),
    mu <- a1 + a2,
    a1 ~ dnorm(0, 10),
    a2 ~ dnorm(0, 10),
    sigma ~ dcauchy(0, 1)
  ),
  data = list(y = y),
  chains = 2,
  cores = 2,
  start = list(
    a1 = 0,
    a2 = 0,
    sigma = 1
  ),
)
traceplot(m.E6Bad)

This is a problem of unidentifiable parameters: a1 and a2 can cancel each other out to arrive at the correct mu. We therefore see non-stationary behaviour in the trace plots of a1 and a2, while the trace plot for sigma looks fine.
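
We can verify this explanation with a quick check on the fitted model: the sum of the two parameters is identified even though the individual parameters are not:

post <- extract.samples(m.E6Bad)
# a1 and a2 wander individually, but their sum is pinned down by the data
dens(post$a1 + post$a2) # concentrates near the true mean of 1
cor(post$a1, post$a2) # strongly negative: the two trade off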

Medium Exercises

Practice M1

Question: Re-estimate the terrain ruggedness model from the chapter, but now using a uniform prior and an exponential prior for the standard deviation, sigma. The uniform prior should be dunif(0,10) and the exponential should be dexp(1). Do the different priors have any detectable influence on the posterior distribution?

Answer: The ruggedness model in question is m8.3 in the book (or m9.1 in ulam() specification). First, I prepare the data like I did previously.

data(rugged)
d <- rugged
d$log_gdp <- log(d$rgdppc_2000)
d <- d[complete.cases(d$rgdppc_2000), ]
d$log_gdp_std <- d$log_gdp / mean(d$log_gdp)
d$rugged_std <- d$rugged / max(d$rugged)
d$cid <- ifelse(d$cont_africa == 1, 1, 2)
dd.trim <- list(
  log_gdp_std = d$log_gdp_std,
  rugged_std = d$rugged_std,
  cid = as.integer(d$cid)
)

Let’s fit that model with the different priors:

## Exponential prior for sigma
m.M1Exp <- ulam(
  alist(
    log_gdp_std ~ dnorm(mu, sigma),
    mu <- a[cid] + b[cid] * (rugged_std - 0.215),
    a[cid] ~ dnorm(1, 0.1),
    b[cid] ~ dnorm(0, 0.3),
    sigma ~ dexp(1)
  ),
  data = dd.trim,
  chains = 4,
  cores = 4,
)
## Uniform prior for sigma
m.M1Uni <- ulam(
  alist(
    log_gdp_std ~ dnorm(mu, sigma),
    mu <- a[cid] + b[cid] * (rugged_std - 0.215),
    a[cid] ~ dnorm(1, 0.1),
    b[cid] ~ dnorm(0, 0.3),
    sigma ~ dunif(0, 10)
  ),
  data = dd.trim,
  chains = 4,
  cores = 4,
)

Now on to inspect the model. Let’s start with the parameter estimates in comparison

coeftab(m.M1Exp, m.M1Uni)
##       m.M1Exp m.M1Uni
## a[1]     0.89    0.89
## a[2]     1.05    1.05
## b[1]     0.13    0.13
## b[2]    -0.14   -0.14
## sigma    0.11    0.11
## nobs      170     170

These are practically identical. What about the individual model outputs in more detail?

precis(m.M1Exp, depth = 2)
##             mean          sd        5.5%       94.5%    n_eff     Rhat4
## a[1]   0.8870817 0.015625699  0.86196179  0.91173540 2453.919 0.9995577
## a[2]   1.0507770 0.009968219  1.03527611  1.06640703 2834.441 0.9988734
## b[1]   0.1344067 0.074307822  0.01486287  0.25218389 2786.188 0.9993677
## b[2]  -0.1413442 0.054855132 -0.22964887 -0.05187494 2324.832 0.9983652
## sigma  0.1117154 0.006171670  0.10228974  0.12208002 2725.266 0.9988256
precis(m.M1Uni, depth = 2)
##             mean          sd        5.5%       94.5%    n_eff     Rhat4
## a[1]   0.8865936 0.015580736  0.86128553  0.91082334 2489.074 0.9989360
## a[2]   1.0501777 0.010010613  1.03404118  1.06614541 2152.883 1.0007549
## b[1]   0.1312147 0.074609926  0.01239339  0.24998421 2244.528 0.9993558
## b[2]  -0.1420136 0.054996077 -0.22957192 -0.05372842 2023.621 0.9987402
## sigma  0.1115782 0.006224722  0.10188964  0.12166315 3600.101 0.9990594

Again, these are very similar aside from small differences in the effective number of samples (n_eff): the exponential-prior model (m.M1Exp) comes out slightly ahead for most parameters, while sigma itself boasts a higher n_eff in the uniform-prior model (m.M1Uni). As such, we conclude that while the choice of prior has some influence on sampling efficiency, it does not detectably change the posterior distributions. Let me visualise this:

Plot_df <- data.frame(
  Posteriors = c(
    extract.samples(m.M1Exp, n = 1e4)$sigma,
    extract.samples(m.M1Uni, n = 1e4)$sigma
  ),
  Name = rep(c("Exp", "Uni"), each = 1e4),
  Model = rep(c("m.M1Exp", "m.M1Uni"), each = 1e4)
)

ggplot(Plot_df, aes(y = Model, x = Posteriors)) +
  stat_halfeye() +
  labs(x = "Parameter Estimate", y = "Model") +
  theme_bw()

That really does look the same to me.

Practice M2

Question: The Cauchy and exponential priors from the terrain ruggedness model are very weak. They can be made more informative by reducing their scale. Compare the dcauchy and dexp priors for progressively smaller values of the scaling parameter. As these priors become stronger, how does each influence the posterior distribution?

Answer: I write a for loop here to minimise code repetition:

RepTimes <- 4 # how many steps I want to try
ScalingFactor <- 10 # by what factor to make priors stronger
# empty lists to store models in
Explist <- as.list(rep(NA, RepTimes))
Caulist <- as.list(rep(NA, RepTimes))
# Loop over all models
for (Mod_Iter in 0:(RepTimes - 1)) {
  dd.trim$ScalingFactor <- ScalingFactor
  dd.trim$Mod_Iter <- Mod_Iter
  ## Exponential prior for sigma
  m.M2Exp <- ulam(
    alist(
      log_gdp_std ~ dnorm(mu, sigma),
      mu <- a[cid] + b[cid] * (rugged_std - 0.215),
      a[cid] ~ dnorm(1, 0.1),
      b[cid] ~ dnorm(0, 0.3),
      sigma ~ dexp(1 * ScalingFactor^Mod_Iter)
    ),
    data = dd.trim,
    chains = 4,
    cores = 4,
  )
  Explist[[Mod_Iter + 1]] <- m.M2Exp
  ## Cauchy prior for sigma
  m.M2Cau <- ulam(
    alist(
      log_gdp_std ~ dnorm(mu, sigma),
      mu <- a[cid] + b[cid] * (rugged_std - 0.215),
      a[cid] ~ dnorm(1, 0.1),
      b[cid] ~ dnorm(0, 0.3),
      sigma ~ dcauchy(0, 1 / ScalingFactor^Mod_Iter)
    ),
    data = dd.trim,
    chains = 4,
    cores = 4,
  )
  Caulist[[Mod_Iter + 1]] <- m.M2Cau
}
coeftab(Explist[[1]], Explist[[2]], Explist[[3]], Explist[[4]])
##       Explist[[1]] Explist[[2]] Explist[[3]] Explist[[4]]
## a[1]     0.89         0.89         0.89         0.89     
## a[2]     1.05         1.05         1.05         1.05     
## b[1]     0.13         0.13         0.13         0.13     
## b[2]    -0.14        -0.14        -0.14        -0.15     
## sigma    0.11         0.11         0.11         0.09     
## nobs      170          170          170          170
coeftab(Caulist[[1]], Caulist[[2]], Caulist[[3]], Caulist[[4]])
##       Caulist[[1]] Caulist[[2]] Caulist[[3]] Caulist[[4]]
## a[1]     0.89         0.89         0.89         0.89     
## a[2]     1.05         1.05         1.05         1.05     
## b[1]     0.14         0.13         0.13         0.13     
## b[2]    -0.14        -0.14        -0.14        -0.14     
## sigma    0.11         0.11         0.11         0.11     
## nobs      170          170          170          170

The more restrictive exponential priors decrease the estimate for sigma. The more restrictive Cauchy priors, on the other hand, seem to have no effect.

Let’s explore why this is by looking at the priors themselves:

par(mfrow = c(1, 2))
curve(dexp(x, 1),
  from = 0, to = 5, ylab = "Density", xlab = "sigma",
  col = "royalblue4"
)
curve(dexp(x, 10), from = 0, to = 5, add = T)
curve(dexp(x, 100), from = 0, to = 5, add = T, col = col.desat("red"))
curve(dexp(x, 1000), from = 0, to = 5, add = T, col = col.desat("green"))
mtext("Exponential Prior")
legend("topright",
  col = c("royalblue4", "black", col.desat("red"), col.desat("green")),
  lty = c(1, 1, 1), legend = c("Exp(1)", "Exp(10)", "Exp(100)", "Exp(1000)"), bty = "n"
)

curve(2 * dcauchy(x, 0, 1),
  from = 0, to = 5, ylab = "Density", xlab = "sigma",
  col = "royalblue4"
)
curve(2 * dcauchy(x, 0, 0.1), from = 0, to = 5, add = T, col = "black")
curve(2 * dcauchy(x, 0, 0.01), from = 0, to = 5, add = T, col = col.desat("red"))
curve(2 * dcauchy(x, 0, 0.001), from = 0, to = 5, add = T, col = col.desat("green"))
mtext("Cauchy Prior")
legend("topright",
  col = c("royalblue4", "black", col.desat("red"), col.desat("green")),
  lty = c(1, 1, 1), legend = c("Cauchy(0, 1)", "Cauchy(0, 0.1)", "Cauchy(0, 0.01)", "Cauchy(0, 0.001)"), bty = "n"
)

The Cauchy distributions keep much thicker tails, while the exponential distributions quickly concentrate near zero. This is why a concentrated Cauchy prior still allows the posterior more flexibility than an equally concentrated exponential prior.
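
We can quantify this with a quick check of how much prior mass each of the strongest priors leaves above the estimated sigma of roughly 0.11 (using the half-Cauchy tail, i.e. doubling the upper tail of pcauchy):

sigma_hat <- 0.11
# Prior mass above sigma_hat under the strongest priors from above
1 - pexp(sigma_hat, rate = 1000) # essentially zero (~1e-48)
2 * (1 - pcauchy(sigma_hat, 0, 0.001)) # roughly 0.006 - small but workable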

Practice M3

Question: Re-estimate one of the Stan models from the chapter, but at different numbers of warmup iterations. Be sure to use the same number of sampling iterations in each case. Compare the n_eff values.

Answer: The ruggedness model has served us well so far, so I continue with it. Here, I build the model with a fixed run length and fixed starting values for each run, varying only the warmup length:

start <- list(a = c(1, 1), b = c(0, 0), sigma = 1) # use fixed start values for comparability of runs
m.M3 <- ulam(
  alist(
    log_gdp_std ~ dnorm(mu, sigma),
    mu <- a[cid] + b[cid] * (rugged_std - 0.215),
    a[cid] ~ dnorm(1, 0.1),
    b[cid] ~ dnorm(0, 0.3),
    sigma ~ dexp(1)
  ),
  data = dd.trim,
  start = start,
  chains = 2, cores = 2,
  iter = 100
)
warm_list <- c(5, 10, 100, 500, 1000) # define warmup values to run through
n_eff <- matrix(NA, nrow = length(warm_list), ncol = 5) # first make matrix to hold n_eff results
for (i in 1:length(warm_list)) { # loop over warm_list and collect n_eff
  w <- warm_list[i]
  m_temp <- ulam(m.M3, chains = 2, cores = 2, iter = 1000 + w, warmup = w, start = start)
  n_eff[i, ] <- precis(m_temp, 2)$n_eff
}
colnames(n_eff) <- rownames(precis(m_temp, 2))
rownames(n_eff) <- warm_list
n_eff # columns show parameters, rows show n_eff
##             a[1]        a[2]        b[1]        b[2]       sigma
## 5       2.314186    1.587251    2.713325    1.270369    1.776862
## 10   2243.084776 2157.086156  737.957589 1010.214712  953.010860
## 100  1725.334719 2294.576251  878.481785 1177.016946 1122.495229
## 500  2999.738299 3282.963810 2292.173710 2737.037252 2200.949134
## 1000 2485.029304 3406.341675 2372.274092 2772.175825 2607.552453

As we can see, 5 warmup iterations fail completely, while just 10 already produce usable chains. n_eff keeps improving up to around 500 warmup iterations and then levels off. For this simple model, a few hundred warmup iterations are plenty, and even a warmup as short as 10 gets us surprisingly far.

Hard Exercises

Practice H1

Question: Run the model below and then inspect the posterior distribution and explain what it is accomplishing.

mp <- map2stan(
  alist(
    a ~ dnorm(0, 1),
    b ~ dcauchy(0, 1)
  ),
  data = list(y = 1),
  start = list(a = 0, b = 0),
  iter = 1e4,
  chains = 2, cores = 2,
  warmup = 100,
  WAIC = FALSE
)

Compare the samples for the parameters a and b. Can you explain the different trace plots, using what you know about the Cauchy distribution?

Answer: First of all, let’s inspect the posterior:

precis(mp)
##            mean         sd      5.5%    94.5%     n_eff    Rhat4
## a  0.0003388167  0.9988213 -1.601441 1.590561 12762.761 1.000120
## b -0.1918181852 13.8995715 -5.379742 5.346423  3892.011 1.000517

Oof. Those uncertainties don’t look good at all! So what does the model even do? Since it contains no likelihood term, it simply samples a from a Normal distribution with mean 0 and standard deviation 1, and b from a Cauchy distribution. Let’s look at the trace plots:

plot(mp, n_cols = 1, col = "royalblue4")

As we can see, there are quite a few extreme spikes in the samples for the Cauchy-distributed parameter (b). Why is that? The Cauchy distribution has very heavy tails, so values far from the centre retain non-negligible posterior probability and the chain occasionally jumps way out into those tails. Note that this also decreases n_eff. lp in the plot above is the log-posterior.
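
The heavy tails can be illustrated outside of Stan, too (a quick sketch of my own): because the Cauchy distribution has no defined mean, the running mean of Cauchy samples never settles down, unlike that of Normal samples:

set.seed(42)
n <- 1e4
# Running means: the Normal converges, the Cauchy keeps jumping
# whenever an extreme tail value is drawn
norm_means <- cumsum(rnorm(n)) / (1:n)
cauchy_means <- cumsum(rcauchy(n)) / (1:n)
plot(norm_means, type = "l", ylim = c(-6, 6),
  xlab = "sample", ylab = "running mean")
lines(cauchy_means, col = "royalblue4")
legend("topright", lty = 1, col = c("black", "royalblue4"),
  legend = c("Normal(0, 1)", "Cauchy(0, 1)"), bty = "n")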

Now let’s see how the samples we drew measure up against the underlying functions of a and b, respectively:

post <- extract.samples(mp)
par(mfrow = c(1, 2))
dens(post$a)
curve(dnorm(x, 0, 1), from = -4, to = 4, add = T, lty = 2)
legend("topright", lty = c(1, 2), legend = c("Sample", "Exact density"), bty = "n")
mtext("Normal")
dens(post$b, col = "royalblue4", xlim = c(-10, 10))
curve(dcauchy(x, 0, 1),
  from = -10, to = 10, add = T, lty = 2,
  col = "royalblue4"
)
mtext("Cauchy")

As we can see, the Normal distribution has been reconstructed well. The Cauchy distribution hasn’t: its heavy tails make the sampled density converge much more slowly towards the exact density.

Practice H2

Question: Recall the divorce rate example from Chapter 5. Repeat that analysis, using ulam() this time, fitting models m5.1, m5.2, and m5.3. Use compare to compare the models on the basis of WAIC or PSIS. Explain the results.

Answer: First, I need to load the data and prepare it for ulam():

data(WaffleDivorce)
d <- WaffleDivorce
d$D <- standardize(d$Divorce)
d$M <- standardize(d$Marriage)
d$A <- standardize(d$MedianAgeMarriage)
d_trim <- list(D = d$D, M = d$M, A = d$A)

Now I fit the models with ulam():

m5.1_stan <- ulam(
  alist(
    D ~ dnorm(mu, sigma),
    mu <- a + bA * A,
    a ~ dnorm(0, 0.2),
    bA ~ dnorm(0, 0.5),
    sigma ~ dexp(1)
  ),
  data = d_trim,
  chains = 4, cores = 4,
  log_lik = TRUE # this is needed to get the terms for calculating PSIS or WAIC
)
m5.2_stan <- ulam(
  alist(
    D ~ dnorm(mu, sigma),
    mu <- a + bM * M,
    a ~ dnorm(0, 0.2),
    bM ~ dnorm(0, 0.5),
    sigma ~ dexp(1)
  ),
  data = d_trim,
  chains = 4, cores = 4,
  log_lik = TRUE # this is needed to get the terms for calculating PSIS or WAIC
)
m5.3_stan <- ulam(
  alist(
    D ~ dnorm(mu, sigma),
    mu <- a + bA * A + bM * M,
    a ~ dnorm(0, 0.2),
    bA ~ dnorm(0, 0.5),
    bM ~ dnorm(0, 0.5),
    sigma ~ dexp(1)
  ),
  data = d_trim,
  chains = 4, cores = 4,
  log_lik = TRUE # this is needed to get the terms for calculating PSIS or WAIC
)

Now we compare the models:

compare(m5.1_stan, m5.2_stan, m5.3_stan, func = PSIS)
##               PSIS        SE     dPSIS       dSE    pPSIS       weight
## m5.1_stan 125.7210 12.708327  0.000000        NA 3.630705 0.7253039155
## m5.3_stan 127.6690 12.852350  1.947996 0.6705316 4.773054 0.2738533387
## m5.2_stan 139.2364  9.936093 13.515361 9.1363047 2.923975 0.0008427459
compare(m5.1_stan, m5.2_stan, m5.3_stan, func = WAIC)
##               WAIC        SE     dWAIC       dSE    pWAIC       weight
## m5.1_stan 125.7778 12.641919  0.000000        NA 3.659072 0.6960655494
## m5.3_stan 127.4407 12.591741  1.662916 0.6770545 4.658881 0.3030766321
## m5.2_stan 139.1754  9.813604 13.397613 9.2109285 2.893468 0.0008578185

WAIC tells the same story as PSIS: the model containing only age (m5.1_stan) wins, and the model with both predictors (m5.3_stan) does almost as well. Their respective PSIS and WAIC values are nearly identical, and between them these two models receive essentially all of the model weight. Let’s call these equal in performance and investigate why:

precis(m5.3_stan)
##                mean         sd       5.5%      94.5%    n_eff     Rhat4
## a     -0.0001904293 0.10140928 -0.1591984  0.1619373 1877.251 1.0002239
## bA    -0.6023698429 0.16025804 -0.8510854 -0.3467602 1085.019 1.0007578
## bM    -0.0550634908 0.16034205 -0.3109204  0.2015101 1187.780 0.9998155
## sigma  0.8275838910 0.08826874  0.7028130  0.9779113 1474.265 1.0028212

While m5.3_stan contains the marriage predictor, it is very unsure of its influence: the posterior for bM straddles zero. In practical terms, this means that m5.1_stan and m5.3_stan make basically the same predictions, which we can check directly.
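
A quick sketch of that check (using rethinking’s link() function on the two fitted models) compares their posterior mean predictions for the observed states:

# Posterior mean prediction of D for each state under both models
mu_m5.1 <- apply(link(m5.1_stan), 2, mean)
mu_m5.3 <- apply(link(m5.3_stan), 2, mean)
plot(mu_m5.1, mu_m5.3,
  xlab = "m5.1_stan predicted D", ylab = "m5.3_stan predicted D")
abline(0, 1, lty = 2) # points on this line = identical predictions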

Practice H3

Question: Sometimes changing a prior for one parameter has unanticipated effects on other parameters. This is because when a parameter is highly correlated with another parameter in the posterior, the prior influences both parameters. Here’s an example to work and think through.
Go back to the leg length example in Chapter 5. Here is the code again, which simulates height and leg lengths for 100 imagined individuals:

N <- 100 # number of individuals
height <- rnorm(N, 10, 2) # sim total height of each
leg_prop <- runif(N, 0.4, 0.5) # leg as proportion of height
leg_left <- leg_prop * height + rnorm(N, 0, 0.02) # sim left leg as proportion + error
leg_right <- leg_prop * height + rnorm(N, 0, 0.02) # sim right leg as proportion + error
d <- data.frame(height, leg_left, leg_right) # combine into data frame

And below is the model you fit before, resulting in a highly correlated posterior for the two beta parameters. This time, fit the model using ulam():

m5.8s <- ulam(
  alist(
    height ~ dnorm(mu, sigma),
    mu <- a + bl * leg_left + br * leg_right,
    a ~ dnorm(10, 100),
    bl ~ dnorm(2, 10),
    br ~ dnorm(2, 10),
    sigma ~ dexp(1)
  ),
  data = d,
  chains = 4,
  cores = 4,
  start = list(
    a = 10,
    bl = 0,
    br = 0.1,
    sigma = 1
  )
)

Compare the posterior distribution produced by the code above to the posterior distribution produced when you change the prior for br so that it is strictly positive:

m5.8s2 <- ulam(
  alist(
    height ~ dnorm(mu, sigma),
    mu <- a + bl * leg_left + br * leg_right,
    a ~ dnorm(10, 100),
    bl ~ dnorm(2, 10),
    br ~ dnorm(2, 10),
    sigma ~ dexp(1)
  ),
  data = d,
  chains = 4,
  cores = 4,
  constraints = list(br = "lower=0"),
  start = list(
    a = 10,
    bl = 0,
    br = 0.1,
    sigma = 1
  )
)

Note the constraints list. What this does is constrain the prior distribution of br so that it has positive probability only above zero. In other words, that prior ensures that the posterior distribution for br will have no probability mass below zero.
Compare the two posterior distributions for m5.8s and m5.8s2. What has changed in the posterior distribution of both beta parameters? Can you explain the change induced by the change in prior?

Answer: It’s probably easiest to just look at the posterior distributions of the beta prameters through the pairs() function:

pairs(m5.8s, main = "Model 1")

pairs(m5.8s2, main = "Model 2")

As we can see, the posterior distributions of the beta parameters have shifted drastically between the two models. Interestingly, bl and br were perfectly symmetric in m5.8s, but are skewed in m5.8s2. Given how the height of a person is approximated in both models (a + bl*leg_left + br*leg_right), and given that left and right leg lengths are nearly identical, the posteriors of bl and br are necessarily negatively correlated: only their sum is identified by the data, so a larger br must be offset by a smaller bl. Thus, by constraining br to be strictly positive in m5.8s2 (which skews its posterior), we have forced bl to be equally skewed in a mirror image of br.
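
The negative correlation and the identified sum are easy to confirm from the posterior samples of the models fitted above (a quick sketch):

post1 <- extract.samples(m5.8s)
post2 <- extract.samples(m5.8s2)
cor(post1$bl, post1$br) # strongly negative in the unconstrained model
cor(post2$bl, post2$br) # still strongly negative under the constraint
# The sum bl + br is what the data actually pin down:
dens(post1$bl + post1$br)
dens(post2$bl + post2$br, add = TRUE, col = "royalblue4")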

Practice H4

Question: For the two models fit in the previous problem, use WAIC or PSIS to compare the effective numbers of parameters for each model. You will need to use log_lik=TRUE to instruct ulam() to compute the terms that both WAIC and PSIS need. Which model has more effective parameters? Why?

Answer: Let’s run the models:

m.H4_1 <- ulam(
  alist(
    height ~ dnorm(mu, sigma),
    mu <- a + bl * leg_left + br * leg_right,
    a ~ dnorm(10, 100),
    bl ~ dnorm(2, 10),
    br ~ dnorm(2, 10),
    sigma ~ dexp(1)
  ),
  data = d,
  chains = 4,
  cores = 4,
  start = list(
    a = 10,
    bl = 0,
    br = 0.1,
    sigma = 1
  ),
  log_lik = TRUE
)
m.H4_2 <- ulam(
  alist(
    height ~ dnorm(mu, sigma),
    mu <- a + bl * leg_left + br * leg_right,
    a ~ dnorm(10, 100),
    bl ~ dnorm(2, 10),
    br ~ dnorm(2, 10),
    sigma ~ dexp(1)
  ),
  data = d,
  chains = 4,
  cores = 4,
  constraints = list(br = "lower=0"),
  start = list(
    a = 10,
    bl = 0,
    br = 0.1,
    sigma = 1
  ),
  log_lik = TRUE
)

Now we compare them with WAIC:

compare(m.H4_1, m.H4_2)
##            WAIC       SE     dWAIC      dSE    pWAIC    weight
## m.H4_1 182.1474 10.21060 0.0000000       NA 2.961292 0.6063273
## m.H4_2 183.0112  9.88398 0.8638001 2.349502 2.382919 0.3936727

The models are pretty much tied in WAIC. The model with the truncated prior (m.H4_2) is less flexible, as indicated by its smaller pWAIC (the effective number of parameters). This is because the constrained prior is more informative, so the posterior variance is smaller as a result.
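
Since pWAIC is computed from the posterior variance of the pointwise log-likelihood, the smaller pWAIC of m.H4_2 should show up as reduced posterior spread in the beta parameters; a quick check on the fitted models:

post1 <- extract.samples(m.H4_1)
post2 <- extract.samples(m.H4_2)
# Posterior standard deviations of the beta parameters
c(sd(post1$bl), sd(post1$br)) # unconstrained model: wider
c(sd(post2$bl), sd(post2$br)) # constrained model: narrower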

Practice H5

Question: Modify the Metropolis algorithm code from the chapter to handle the case that the island populations have a different distribution than the island labels. This means the island’s number will not be the same as its population.

Answer: First of all, we need our 10 islands with population sizes of 1-10, but in random order:

pop_size <- sample(1:10)

Now we can use the code from the chapter almost unaltered, save for one exception - we need to use indexing to translate island position into population size:

num_weeks <- 1e5
positions <- rep(NA, num_weeks)
current <- 10
for (i in 1:num_weeks) {
  positions[i] <- current # record current position
  proposal <- current + sample(c(-1, 1), size = 1) # flip coin to generate proposal
  # now make sure he loops around the archipelago
  if (proposal < 1) proposal <- 10
  if (proposal > 10) proposal <- 1
  prob_move <- pop_size[proposal] / pop_size[current] # move?
  current <- ifelse(runif(1) < prob_move, proposal, current)
}

To see if this works, we can plot population size against frequency of visit by the king:

f <- table(positions) # compute frequencies
plot(as.vector(f), pop_size,
  type = "n", # plot frequencies against relative population sizes
  xlab = "frequency", ylab = "population size"
) # empty plot
text(x = f, y = pop_size, labels = names(f)) # add names of islands / their positions
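
As a numeric check (a quick sketch), the long-run share of weeks spent on each island should match its share of the total population:

# Compare relative visit frequency to relative population size
round(cbind(
  visited = as.vector(f) / sum(f),
  expected = pop_size / sum(pop_size)
), 3)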

Practice H6

Question: Modify the Metropolis algorithm code from the chapter to write your own simple MCMC estimator for globe tossing data and model from Chapter 2.

Answer: We want to fit the following model:

$$w \sim \text{Binomial}(n, \theta)$$

$$\theta \sim \text{Uniform}(0, 1)$$

Our Metropolis algorithm looks like this:

set.seed(42)
# the globe tossing data
w <- 6
n <- 9
# prior on p
p_prior <- function(p) dunif(p, min = 0, max = 1)
# initializing MCMC
iter <- 1e4
p_sample <- rep(0, iter)
p_current <- 0.5 # start value
for (i in 1:iter) {
  p_sample[i] <- p_current # record current p
  p_proposal <- runif(1, min = 0, max = 1) # generate proposal
  # compute likelihood for current and proposal
  lkhd_current <- dbinom(w, n, p_current)
  lkhd_proposal <- dbinom(w, n, p_proposal)
  prob_proposal <- lkhd_proposal * p_prior(p_proposal)
  prob_current <- lkhd_current * p_prior(p_current)
  prob_accept <- prob_proposal / prob_current
  p_current <- ifelse(runif(1) < prob_accept, p_proposal, p_current)
}

Let’s visualise what happened here:

plot(p_sample, type = "l", col = "royalblue4")

Finally, let’s plot the posterior distribution:

dens(p_sample, col = "royalblue4", adj = 1)
curve(dbeta(x, w + 1, n - w + 1), from = 0, to = 1, add = T, lty = 2)
abline(v = median(p_sample))

Session Info

sessionInfo()
## R version 4.0.5 (2021-03-31)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19043)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252    LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C                           
## [5] LC_TIME=English_United Kingdom.1252    
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] tidybayes_2.3.1      rethinking_2.13      rstan_2.21.2         ggplot2_3.3.6        StanHeaders_2.21.0-7
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.7           mvtnorm_1.1-1        lattice_0.20-41      tidyr_1.1.3          prettyunits_1.1.1    ps_1.6.0             assertthat_0.2.1     digest_0.6.27        utf8_1.2.1          
## [10] V8_3.4.1             plyr_1.8.6           R6_2.5.0             backports_1.2.1      stats4_4.0.5         evaluate_0.14        coda_0.19-4          highr_0.9            blogdown_1.3        
## [19] pillar_1.6.0         rlang_0.4.11         curl_4.3.2           callr_3.7.0          jquerylib_0.1.4      R.utils_2.10.1       R.oo_1.24.0          rmarkdown_2.7        styler_1.4.1        
## [28] labeling_0.4.2       stringr_1.4.0        loo_2.4.1            munsell_0.5.0        compiler_4.0.5       xfun_0.22            pkgconfig_2.0.3      pkgbuild_1.2.0       shape_1.4.5         
## [37] htmltools_0.5.1.1    tidyselect_1.1.0     tibble_3.1.1         gridExtra_2.3        bookdown_0.22        arrayhelpers_1.1-0   codetools_0.2-18     matrixStats_0.61.0   fansi_0.4.2         
## [46] crayon_1.4.1         dplyr_1.0.5          withr_2.4.2          MASS_7.3-53.1        R.methodsS3_1.8.1    distributional_0.2.2 ggdist_2.4.0         grid_4.0.5           jsonlite_1.7.2      
## [55] gtable_0.3.0         lifecycle_1.0.0      DBI_1.1.1            magrittr_2.0.1       scales_1.1.1         KernSmooth_2.23-18   RcppParallel_5.1.2   cli_3.0.0            stringi_1.5.3       
## [64] farver_2.1.0         bslib_0.2.4          ellipsis_0.3.2       generics_0.1.0       vctrs_0.3.7          rematch2_2.1.2       forcats_0.5.1        tools_4.0.5          svUnit_1.0.6        
## [73] R.cache_0.14.0       glue_1.4.2           purrr_0.3.4          processx_3.5.1       yaml_2.2.1           inline_0.3.17        colorspace_2.0-0     knitr_1.33           sass_0.3.1