# Simulating replications {#simulating-replications}
```{r, echo = FALSE}
library(knitr)
opts_chunk$set(cache = TRUE, warning = FALSE, message = FALSE, tidy = FALSE, fig.height = 5, fig.width = 6.67, out.height = "3in", out.width = "4in")
options(digits = 5)
library(ggplot2)
theme_set(theme_bw())
library(scales)
```
In Chapter \@ref(simulation), we ran a single simulation of our players' batting averages, used it to perform estimation, and then examined whether our results were accurate. This is a valuable way to sanity-check the accuracy of the empirical Bayes method.
But what if we just got lucky? What if empirical Bayes shrinkage works about half the time, and if the players had batted a bit differently it would have given terrible results? Similarly, even if the method worked on 10,000 players, can we tell if it would have worked on 1000, or 100? These are important concerns if we want to trust the method on our real data.
In this final chapter, we'll extend the simulation from Chapter \@ref(simulation). Rather than simulating a single example, we'll create **50 simulations**, and run the empirical Bayes method on each of them. We'll also learn how to vary an input parameter, the number of players, which will let us examine how sensitive the empirical Bayes approach is to the number of observations.
## Setup
As usual, we start with code that sets up the variables analyzed in this chapter (in this case, the same code as in Chapter \@ref(simulation)).
```{r career}
library(Lahman)
library(dplyr)
library(tidyr)
library(purrr)
# Grab career batting average of non-pitchers
# (allow players that have pitched <= 3 games, like Ty Cobb)
pitchers <- Pitching %>%
group_by(playerID) %>%
summarize(gamesPitched = sum(G)) %>%
filter(gamesPitched > 3)
# include the "bats" (handedness) and "year" column for later
career <- Batting %>%
filter(AB > 0) %>%
anti_join(pitchers, by = "playerID") %>%
group_by(playerID) %>%
summarize(H = sum(H), AB = sum(AB))
library(ebbr)
library(broom)
prior <- ebb_fit_prior(career, H, AB)
alpha0 <- tidy(prior)$alpha
beta0 <- tidy(prior)$beta
```
## Replicating the beta-binomial simulation
The `crossing()` function from tidyr is very useful for performing multiple replications of a tidy simulation. Instead of repeating an operation in a loop, you can replicate your data within one data frame.[^speed]
[^speed]: These simulations of 50 replications are the slowest-running code examples in the entire book. If you're following along and want to speed it up, you could decrease the number of replications.
```{r sim_replications}
set.seed(2017)
sim_replications <- career %>%
crossing(replication = 1:50) %>%
mutate(p = rbeta(n(), alpha0, beta0),
H = rbinom(n(), AB, p))
```
After simulating values of $p$ and $H$, we can nest the data within each replication and use the purrr package's `map()` function to fit a prior to each. The result has one row for each of the 50 replications, with each replication's prior stored in a list column.[^manymodels]
```{r sim_replication_models_run, dependson = "sim_replications", eval = FALSE}
library(ebbr)
sim_replication_models <- sim_replications %>%
nest(-replication) %>%
mutate(prior = map(data, ~ ebb_fit_prior(., H, AB)))
```
```{r sim_replication_models, echo = FALSE}
load("intermediate-datasets/sim_replication_models.rda")
```
```{r dependson = "sim_replication_models"}
sim_replication_models
```
[^manymodels]: To learn more about the philosophy of storing models in a list column, check out [Chapter 25](http://r4ds.had.co.nz/many-models.html) of the book [R for Data Science](http://r4ds.had.co.nz/).
### Estimations of hyperparameters
In each replication, we started by estimating a prior distribution, in the form of $\alpha_0$ and $\beta_0$ hyperparameters. Since these estimated hyperparameters are the foundation of any empirical Bayes method, we'd like to know if they're consistently accurate.
```{r sim_replication_priors, dependson = "sim_replication_models"}
sim_replication_priors <- sim_replication_models %>%
unnest(map(prior, tidy), .drop = TRUE)
sim_replication_priors
```
Figure \@ref(fig:simreplicationpriors) shows our estimates of $\alpha_0$ and $\beta_0$ across all 50 replications, along with the true value of each shown as a dashed horizontal line. We notice that our estimates are mostly unbiased: they're about equally likely to be above or below the true parameter. We also note that the mean $\frac{\alpha_0}{\alpha_0+\beta_0}$ is almost always between .250 and .252. Since this is what every player is being shrunk towards, it's good that the estimate is so precise.
```{r simreplicationpriors, dependson = "sim_replication_priors", echo = FALSE, fig.cap = "Estimated hyperparameters $\\alpha_0$, $\\beta_0$, and the mean $\\frac{\\alpha_0}{\\alpha_0+\\beta_0}$ across 50 replications."}
true_values <- data_frame(parameter = c("alpha", "beta", "mean"),
true = c(alpha0, beta0, alpha0 / (alpha0 + beta0)))
sim_replication_priors %>%
gather(parameter, value, -replication) %>%
inner_join(true_values, by = "parameter") %>%
ggplot(aes(1, value)) +
geom_boxplot() +
geom_hline(aes(yintercept = true), color = "red", lty = 2) +
facet_wrap(~ parameter, scales = "free_y") +
labs(x = "",
y = "Estimated parameter (true value shown as red line)")
```
If we'd often estimated the hyperparameters poorly, we'd have reason to worry about using them as our prior. But our accuracy gives us confidence that we have enough data to apply the empirical Bayesian approach.
### Estimates, intervals, and hypothesis testing across replications
We can then examine whether the empirical Bayes shrinkage and credible intervals were consistently effective.
In Section \@ref(simulation-mse), we used the mean squared error (MSE) between the estimate and the true batting average $p$ as a metric for evaluating the method's performance, and for comparing it to the raw batting average $H / AB$. We can now repeat that comparison across the 50 replications (Figure \@ref(fig:simreplicationmse)).
```{r sim_replication_au, dependson = "sim_replication_models"}
sim_replication_au <- sim_replication_models %>%
unnest(map2(prior, data, augment))
```
```{r sim_replication_mse, dependson = "sim_replication_au"}
sim_replication_mse <- sim_replication_au %>%
rename(Raw = .raw, Shrunken = .fitted) %>%
gather(type, estimate, Raw, Shrunken) %>%
group_by(type, replication) %>%
summarize(mse = mean((estimate - p) ^ 2))
```
```{r simreplicationmse, dependson = "sim_replication_mse", echo = FALSE, fig.cap = "Comparison of the mean-squared error on 50 replications, using either the raw batting average or the shrunken batting average."}
ggplot(sim_replication_mse, aes(type, mse)) +
geom_boxplot() +
ylab("Mean squared error across 50 replications")
```
It looks like the MSE of the empirical Bayes shrunken estimates was always much lower than that of the raw estimates, and was fairly consistent in its range. This is a good sign: even across 50 replications, the method never failed "catastrophically." That is not true of all statistical methods!
In Section \@ref(credible-intervals) we also saw that the credible intervals were well calibrated, where 95% credible intervals generally contained the true value about 95% of the time. We can now see if this is consistently true across replications (Figure \@ref(fig:simreplicationcoverage)). Indeed, it looks like the coverage of a 95% credible interval was generally between 94.4% and 95.5%.
```{r simreplicationcoverage, dependson = "sim_replication_au", echo = FALSE, fig.cap = "Distribution of the coverage probability of a 95\\% credible interval across simulations."}
sim_replication_au %>%
mutate(cover = .low <= p & p <= .high) %>%
group_by(replication) %>%
summarize(coverage = mean(cover)) %>%
ggplot(aes(coverage)) +
geom_histogram(binwidth = .001) +
labs(x = "% of time true value was in a 95% credible interval")
```
Is it well calibrated at other levels: does an 80% credible interval contain the true value about 80% of the time? Figure \@ref(fig:estimatecredlevel) from the last chapter tried varying the level of the credible interval, and examined how it affected the coverage probability. We can now recreate that plot, but do so across all fifty replications (Figure \@ref(fig:simreplicationintervals)).
```{r sim_replication_intervals, dependson = "sim_replication_models"}
sim_replication_intervals <- sim_replication_models %>%
crossing(cred_level = c(seq(.5, .9, .05), .95)) %>%
unnest(pmap(list(prior, data, cred_level = cred_level), augment)) %>%
select(replication, cred_level, p, .low, .high)
```
```{r simreplicationintervals, echo = FALSE, dependson = "sim_replication_intervals", fig.cap = "Comparison of the level of the credible interval to the fraction of players where the interval contains the true value. Each line represents one replication of the simulation; the red line represents $x=y$."}
sim_replication_intervals %>%
mutate(cover = .low <= p & p <= .high) %>%
group_by(replication, cred_level) %>%
summarize(coverage = mean(cover)) %>%
ggplot(aes(cred_level, coverage, group = replication)) +
geom_line(alpha = .3) +
geom_abline(color = "red") +
labs(x = "Credibility level",
y = "% of credible intervals containing true parameter")
```
Each of these lines is one replication tracing from a "50% credible interval" to a "95% credible interval." Since all the replications are close to the red $x=y$ line, we can see that an X% credible interval contains the true value about X% of the time. This is an important lesson of tidy simulations: whenever you can make a plot to check the accuracy or calibration of a single simulation, you can also recreate that plot across many replications.
We can also examine our method for false discovery rate control, and see whether we can trust a q-value of (say) .05 to keep the FDR below 5%, just as we did in the last chapter in Figure \@ref(fig:qvaluetruefdr). The approach (code not shown) is similar to the one for credible interval coverage: group by each replication, then perform the same analysis we did on a single replication (Figure \@ref(fig:simreplicationproptests)).
```{r sim_replication_prop_tests, dependson = "sim_replication_au", echo = FALSE}
sim_replication_prop_tests <- sim_replication_au %>%
nest(-replication) %>%
unnest(map(data, add_ebb_prop_test, threshold = .3, sort = TRUE))
```
```{r simreplicationproptests, dependson = "sim_replication_prop_tests", echo = FALSE, fig.cap = "Comparison of the q-value threshold and the resulting false discovery rate. Each line represents one replication of the simulation; the red line represents $x=y$."}
sim_replication_prop_tests %>%
group_by(replication) %>%
mutate(fdr = cummean(p < .3)) %>%
ggplot(aes(.qvalue, fdr, group = replication)) +
geom_line(alpha = .3) +
geom_abline(color = "red") +
labs(x = "Q-value threshold",
y = "Proportion of false discoveries below this threshold")
```
Each of these lines represents a replication tracing along every possible q-value threshold. We see that the proportion of false discoveries below a q-value is sometimes higher than the q-value promises, and sometimes lower. That's OK: the promise of FDR control isn't that the false discovery rate will always be exactly 5% (that would be impossible due to random noise), but that it is controlled on average.
## Varying sample size
In these two chapters, we've re-simulated our set of baseball players, and confirmed that empirical Bayes generally performed well. In what situations might the method *not* work?
One example of a case where empirical Bayes performs poorly is if we had fewer observations. If there were only three or four batters, we'd have no ability to estimate their prior beta distribution accurately, and would therefore be shrinking the batting averages towards an arbitrary estimate. This is particularly dangerous because empirical Bayes doesn't account for the uncertainty in hyperparameter estimates, so our poor estimate of the prior would be all it had to go on.[^traditionalbayes]
[^traditionalbayes]: Traditional Bayesian methods handle this by modeling the uncertainty about the beta distribution explicitly. That is, instead of estimating the $\alpha_0$ and $\beta_0$ hyperparameters of the beta, they would place a *hyperprior* on the distributions of $\alpha_0$ and $\beta_0$, which would get updated with the evidence. This is challenging but well-studied; see Section 5.3 of Bayesian Data Analysis for an example that also uses the beta-binomial [@Gelman03].
How many observations are *enough* to use these methods? 100? 1000? While this book was being developed in a series of online posts, I often received this question, and never had a good answer. But through simulation, we have the opportunity to examine the effect of the sample size on the performance of empirical Bayes estimation.
### Simulating varying numbers of observations
Let's consider six possible sample sizes: 30, 100, 300, 1000, 3000, and 10,000 (10,000 is pretty close to our actual dataset size of `r nrow(career)`). We could perform simulations for each of these sample sizes, by randomly sampling from the set of players each time.[^replacement]
[^replacement]: To imitate the distribution that a new set of players might have, we resample with replacement, similarly to the process of bootstrapping.
```{r varying_size_priors_run, dependson = "career", eval = FALSE}
set.seed(2017)
# nifty trick for sampling different numbers within each group
# randomly shuffle with sample_frac(1), then filter
varying_size_sim <- career %>%
select(-H) %>%
crossing(size = c(30, 100, 300, 1000, 3000, 10000),
replication = 1:50) %>%
group_by(size, replication) %>%
sample_frac(1, replace = TRUE) %>%
filter(row_number() <= size) %>%
ungroup()
varying_size_priors <- varying_size_sim %>%
mutate(p = rbeta(n(), alpha0, beta0),
H = rbinom(n(), AB, p)) %>%
nest(-size, -replication) %>%
mutate(prior = map(data, ~ ebb_fit_prior(., H, AB)))
```
```{r varying_size_priors, dependson = "career", echo = FALSE}
load("intermediate-datasets/varying_size_priors.rda")
```
The first step of empirical Bayes is to estimate the prior hyperparameters $\alpha_0$ and $\beta_0$. How did the accuracy of these estimations depend on the sample size?
```{r varying_size_params, dependson = "varying_size_priors", echo = FALSE}
varying_size_params <- varying_size_priors %>%
unnest(map(prior, tidy), .drop = TRUE)
```
```{r varyingsizeparamsplot, dependson = "varying_size_params", echo = FALSE, fig.cap = "Estimated hyperparameters $\\alpha_0$, $\\beta_0$, and the mean $\\frac{\\alpha_0}{\\alpha_0+\\beta_0}$ across 50 replications for each sample size. One condition with much higher $\\alpha_0$ and $\\beta_0$ was removed for readability.", fig.width = 8, fig.height = 6}
true_values <- data_frame(parameter = c("alpha", "beta", "mean"),
true = c(alpha0, beta0, alpha0 / (alpha0 + beta0)))
varying_size_params %>%
filter(alpha < 1000) %>%
gather(parameter, estimate, alpha, beta, mean) %>%
inner_join(true_values, by = "parameter") %>%
ggplot(aes(factor(size), estimate)) +
geom_boxplot() +
geom_hline(aes(yintercept = true), color = "red", lty = 2) +
facet_wrap(~ parameter, scales = "free_y") +
labs(x = "Sample size",
y = "Estimated hyperparameter")
```
Figure \@ref(fig:varyingsizeparamsplot) shows that for smaller numbers of observations, there was greater variance in the hyperparameter estimates. This makes sense: more data gives more evidence, and therefore a more consistently accurate maximum likelihood estimate.
This performance helps illustrate the danger of empirical Bayes on smaller samples. Rather than shrinking towards the true mean of `r alpha0 / (alpha0 + beta0)`, the method may shrink everyone towards values below .24 or above .26. While that may seem like a small difference, it would affect *everyone* in the dataset as a systematic error. In a number of the replications with 30 players, the algorithm also greatly overestimated both $\alpha_0$ and $\beta_0$, which means its prior would have too little variance (and the algorithm would "over-shrink").
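To get a feel for what over-shrinking means, here is a quick sketch with hypothetical hyperparameters (chosen for illustration, not estimated from the data) showing how two priors with the same mean of .250 but very different $\alpha_0+\beta_0$ would shrink the same 10-for-30 batter, using the posterior mean $(\alpha_0 + H) / (\alpha_0 + \beta_0 + AB)$.
```{r overshrink_sketch}
# Hypothetical priors sharing a mean of .250 but differing in concentration;
# these hyperparameters are illustrative only, not estimated from the data
hits <- 10
at_bats <- 30

# moderately concentrated prior: alpha0 = 75, beta0 = 225
(hits + 75) / (at_bats + 75 + 225)

# overconfident prior: alpha0 = 750, beta0 = 2250
(hits + 750) / (at_bats + 750 + 2250)
```
The raw average is .333; the first prior shrinks it to about .258, while the overconfident prior pulls it nearly all the way back to .251, so the player's own record barely matters.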
```{r varying_size_au, dependson = "varying_size_priors", echo = FALSE}
varying_size_au <- varying_size_priors %>%
unnest(map2(prior, data, augment))
```
```{r varyingsizemse, dependson = "varying_size_au", echo = FALSE, fig.cap = "Distribution of the mean squared error (MSE) of empirical Bayes estimates across 50 replications, for simulated datasets of varying sizes."}
varying_size_au %>%
group_by(size, replication) %>%
summarize(mse = mean((.fitted - p) ^ 2)) %>%
ggplot(aes(factor(size), mse)) +
geom_boxplot() +
expand_limits(y = 0) +
labs(x = "Number of observations",
y = "Mean squared error")
```
How does this difference affect the accuracy of the empirical Bayes estimates? We can use the mean squared error (MSE) to quantify this. Figure \@ref(fig:varyingsizemse) shows that the most dramatic difference across sizes was the variance of the MSE. That is, at smaller sample sizes empirical Bayes could produce substantially less accurate estimates than at larger sizes, but it could also produce estimates that were *more* accurate.[^counterintuitive]
[^counterintuitive]: This may seem like a counter-intuitive result (how could the algorithm end up closer with less data?), but it's simply that with smaller samples one might get "lucky" with how the batters perform, with fewer of them hitting unusually far above or below their true $p$. This is one complication of comparing across sample sizes.
When we perform empirical Bayes on a real dataset, we don't have the luxury of knowing whether we were in a more accurate or less accurate "replication", so consistency matters. Thus, based on this MSE plot we might recommend sticking to datasets with at least 1,000 observations, or at the very least 100. Still, it's nice to see that even on 30 observations, we were often able to achieve accurate estimates.
### Coverage of credible intervals
Besides the accuracy of the estimates, we can examine the credible intervals. Are the intervals still well-calibrated? Or are they sometimes systematically overconfident or underconfident?
```{r varying_size_coverages, dependson = "varying_size_au", echo = FALSE}
varying_size_coverages <- varying_size_au %>%
mutate(cover = .low < p & p < .high) %>%
group_by(size, replication) %>%
summarize(coverage = mean(cover), number = sum(cover)) %>%
mutate(p.value = pbinom(number, size, .95),
fdr = p.adjust(p.value, method = "BH")) %>%
ungroup()
```
```{r varyingsizecoverages, dependson = "varying_size_coverages", echo = FALSE, fig.cap = "The coverage probabilities of 95\\% credible intervals, comparing 50 replications of each sample size. Cases where the coverage probability is lower than we'd expect by chance are shown as red points."}
set.seed(2017)
signif <- varying_size_coverages %>%
filter(fdr < .05)
ggplot(varying_size_coverages, aes(factor(size), coverage)) +
geom_boxplot(outlier.alpha = 0) +
geom_jitter(data = signif, color = "red", width = .2, height = 0) +
geom_hline(yintercept = .95, color = "red", lty = 2) +
scale_y_continuous(labels = percent_format()) +
labs(x = "Sample size",
y = "Coverage of 95% intervals across replications")
```
Figure \@ref(fig:varyingsizecoverages) shows the distribution of coverage proportions for each sample size. The coverage probabilities were generally centered around 95%, especially for the larger sample sizes. There is greater variation in the proportions for smaller sample sizes, but that's partly an artifact of the noise present in small samples (out of 30 95% intervals, it's easy for only 26 (86.7%) to contain the true $p$ just by bad luck).
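As a rough check of how much of that spread is pure chance (a back-of-the-envelope calculation, separate from the simulation itself), we can ask how often 26 or fewer of 30 intervals would cover the true value even if each interval really did have 95% coverage:
```{r coverage_chance_check}
# If each of 30 intervals truly had 95% coverage, the chance that 26 or
# fewer of them contain the true value is still about 6%
pbinom(26, size = 30, prob = .95)
```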
To separate this out, we show red points for replications where the coverage was low to a statistically significant extent.[^significantlow] We can see that there were 9 cases where we're confident the intervals were too narrow, all in the $n=30$ replications. There was one particularly disastrous replication where the credible intervals contained the true value only about a third of the time. (More on that in a moment.)
[^significantlow]: Here, statistical significance was determined by computing a p-value with a binomial test, then by selecting those with less than a 5% false discovery rate using the Benjamini-Hochberg correction.
What causes credible intervals to be poorly calibrated? Generally it happens when the prior is poorly estimated, and particularly *when the variance of the prior is underestimated*. When the model underestimates the prior variance, it shrinks too aggressively and produces credible intervals that are narrower than they should be. The variance of the estimated prior can be summarized by $\alpha_0+\beta_0$: the higher that sum, the smaller the variance of the beta distribution.
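To make that concrete, recall that a beta distribution with mean $\mu_0 = \frac{\alpha_0}{\alpha_0+\beta_0}$ has variance

$$\mbox{Var}(p) = \frac{\alpha_0\beta_0}{(\alpha_0+\beta_0)^2(\alpha_0+\beta_0+1)} = \frac{\mu_0(1-\mu_0)}{\alpha_0+\beta_0+1},$$

so for a fixed mean, the larger $\alpha_0+\beta_0$ is, the tighter the prior.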
```{r sizecoveragealphabeta, echo = FALSE, fig.cap = "The relationship between the estimated value of $\\alpha_0+\\beta_0$ in a particular replication and the resulting coverage probabilities. 95\\% shown as a horizontal dashed line, and the true value of $\\alpha_0+\\beta_0$ is shown as a vertical dotted line. Best fit lines are shown in blue."}
varying_size_coverages %>%
inner_join(varying_size_params, by = c("size", "replication")) %>%
ggplot(aes(alpha + beta, coverage)) +
geom_point() +
facet_wrap(~ size, scales = "free") +
geom_vline(xintercept = alpha0 + beta0, color = "red", lty = 3) +
geom_hline(yintercept = .95, color = "red", lty = 2) +
geom_smooth(method = "lm") +
labs(x = "Estimated alpha + beta (true value shown as dashed line)",
y = "Coverage of credible intervals interval")
```
Figure \@ref(fig:sizecoveragealphabeta) compares the estimate of $\alpha_0+\beta_0$ in each replication to the resulting credible interval coverage. When $\alpha_0+\beta_0$ is underestimated relative to the true value (the vertical dotted line), the credible intervals tend to be too conservative and include the true value more than 95% of the time; when $\alpha_0+\beta_0$ is overestimated, the intervals are too narrow.
We notice in particular the one replication of $n=30$ where $\alpha_0+\beta_0$ was dramatically overestimated, which resulted in only 1/3 of credible intervals containing the true value. This is the risk of empirical Bayes estimation for low sample sizes: we might estimate a very poor prior, and then treat it as though we're certain of its value.
This trend holds true across all sample sizes (as can be seen from the best-fit lines in blue), but its effect is trivial at sample sizes like 10,000, where coverage probabilities range only from 94.5% to 95.5%. This is because the initial estimates of $\alpha_0$ and $\beta_0$ were generally accurate, as we also saw in Figure \@ref(fig:varyingsizeparamsplot). Still, this shows how our accuracy in estimating the prior can affect the performance of the rest of our methods.
This chapter shows just a few examples of simulations we could perform. What if each player had half as many at-bats? What if we varied the algorithm used to estimate our hyperparameters, using the (much faster) [method of moments](http://stats.stackexchange.com/questions/12232/calculating-the-parameters-of-a-beta-distribution-using-the-mean-and-variance) to compute the beta prior, rather than maximum likelihood? I encourage you to explore other simulations you might be interested in.
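As a starting point for that last idea, here is a minimal method-of-moments sketch: it matches the mean and variance of the raw batting averages rather than maximizing the likelihood, and the `fit_beta_mm()` helper and the `AB >= 500` cutoff are illustrative choices, not part of the approach used elsewhere in this book.
```{r method_of_moments_sketch, dependson = "career"}
# Rough method-of-moments fit of a beta prior from raw batting averages.
# Averages from players with few at-bats inflate the variance, so this
# sketch restricts to players with at least 500 AB (an arbitrary cutoff).
fit_beta_mm <- function(x) {
  mu <- mean(x)
  sigma2 <- var(x)
  # solve mu = a / (a + b) and sigma2 = mu * (1 - mu) / (a + b + 1)
  total <- mu * (1 - mu) / sigma2 - 1
  list(alpha = mu * total, beta = (1 - mu) * total)
}

career %>%
  filter(AB >= 500) %>%
  with(fit_beta_mm(H / AB))
```
One could substitute a fit like this into the replication framework above and compare its hyperparameter estimates, MSE, and interval coverage against the maximum likelihood version.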
## Conclusion: "I have only proved it correct, not simulated it"
Computer scientist Donald Knuth has a [famous quote](https://staff.fnwi.uva.nl/p.vanemdeboas/knuthnote.pdf): **"Beware of bugs in the above code; I have only proved it correct, not tried it."** I feel the same way about statistical methods.
When I look at mathematical papers about statistical methods, the text tends to look something like:
> Smith et al (2008) proved several asymptotic properties of empirical Bayes estimators of the exponential family under regularity assumptions i, ii, and iii. We extend this to prove the estimator of Jones et al (2001) is inadmissible, in the case that $\hat{\theta}(x)$ is an unbiased estimator and $g(y)$ is convex...
This kind of paper is an important part of the field, but it does almost nothing for me. I'm not particularly good at manipulating equations, and I get rustier every year out of grad school. If I'm considering applying a statistical method to a dataset, papers and proofs like this won't help me judge whether it will work. ("Oh- I should have known that my data didn't follow regularity assumption ii!") What does help me is the approach we've used here, where I can see for myself just how accurate the method tends to be.
For example, I recently found myself working on a problem of logistic regression that I suspected had mislabeled outcomes (some zeroes turned to ones, and vice versa), and read up on [some robust logistic regression methods](https://www.jstor.org/stable/2345763?seq=1#page_scan_tab_contents), implemented in the [robust package](https://cran.r-project.org/web/packages/robust/robust.pdf). But I wasn't sure they would be effective on my data, so I [did some random simulation](http://rpubs.com/dgrtwo/235656) of mislabeled outcomes and applied the method. The method didn't work as well as I needed it to, which saved me from applying it to my data and thinking I'd solved the problem.
For this reason, no matter how much math and how many proofs there are to show that a method is reliable, I only feel comfortable with a method once I've worked with it on simulated data. It's also a great way to teach myself about a statistical method. I hope that in these simulation chapters, and indeed throughout this book, you have found my approach of concrete examples and simulation as useful as I have.