# Empirical Bayesian hierarchical modeling {#hierarchical-modeling}
```{r, echo = FALSE}
library(knitr)
opts_chunk$set(cache = TRUE, warning = FALSE, message = FALSE, tidy = FALSE, fig.height = 5, fig.width = 6.67, out.height = "3in", out.width = "4in")
options(digits = 3)
```
```{r cache = FALSE, echo = FALSE}
library(ggplot2)
theme_set(theme_bw())
```
Suppose you were a scout hiring a new baseball player, and were choosing between two who have each had 100 at-bats:
* A left-handed batter who has hit **30 hits / 100 at-bats**
* A right-handed batter who has hit **30 hits / 100 at-bats**
Who would you guess was the better batter?
This seems like a silly question: they both have the exact same batting record. But what if I told you that historically, left-handed batters are slightly better hitters than right-handed ones? How could you incorporate that evidence?
In Chapter \@ref(regression), we used the method of beta-binomial regression to incorporate information (specifically the number of at-bats a player had) into a per-player prior distribution. We did this to correct a bias of the algorithm, but we could do a lot more with this method: in particular, we can include other factors that might change our prior expectations of a player.
These are each particular applications of [Bayesian hierarchical modeling](https://en.wikipedia.org/wiki/Bayesian_hierarchical_modeling), where the priors for each player are not fixed, but rather depend on other latent variables. In our empirical Bayesian approach to hierarchical modeling, we'll estimate this prior using beta-binomial regression, and then apply it to each batter.[^otherapplications] This change in the model can influence our credible intervals, A/B testing, and the other methods we've explored.
[^otherapplications]: This strategy is useful in many applications beyond baseball. For example, if we were analyzing ad clickthrough rates on a website, we might notice that different countries have different clickthrough rates, and therefore fit a different prior for each country.
## Setup
As usual, we start with code that sets up the variables analyzed in this chapter.
```{r lahman}
library(gamlss)
library(dplyr)
library(tidyr)
library(Lahman)
library(ggplot2)
theme_set(theme_bw())
# Grab career batting average of non-pitchers
# (allow players that have pitched <= 3 games, like Ty Cobb)
pitchers <- Pitching %>%
  group_by(playerID) %>%
  summarize(gamesPitched = sum(G)) %>%
  filter(gamesPitched > 3)

# we're keeping some extra information for later in the chapter:
# a "bats" column and a "year" column
career <- Batting %>%
  filter(AB > 0) %>%
  anti_join(pitchers, by = "playerID") %>%
  group_by(playerID) %>%
  summarize(H = sum(H), AB = sum(AB), year = mean(yearID)) %>%
  mutate(average = H / AB)

# Add player names
career <- Master %>%
  tbl_df() %>%
  dplyr::select(playerID, nameFirst, nameLast, bats) %>%
  unite(name, nameFirst, nameLast, sep = " ") %>%
  inner_join(career, by = "playerID")
```
Based on our model in Chapter \@ref(regression), we perform beta-binomial regression using the `gamlss` package. This fits a model that allows the mean batting average $\mu$ to depend on the number of at-bats a player has had.
```{r fit1, results = 'hide'}
library(gamlss)
fit <- gamlss(cbind(H, AB - H) ~ log(AB),
              data = dplyr::select(career, -bats),
              family = BB(mu.link = "identity"))
```
The prior parameters $\alpha_0$ and $\beta_0$ can then be computed for each player from the fitted mean $\mu$ and the dispersion parameter $\sigma$.
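Under the mean/dispersion parameterization of the beta-binomial that `gamlss` uses here, the conversion is

$$\alpha_0 = \frac{\mu}{\sigma}, \qquad \beta_0 = \frac{1 - \mu}{\sigma},$$

which is exactly what the `mutate()` call below computes.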
```{r career_eb, dependson = "fit1"}
career_eb <- career %>%
  mutate(mu = fitted(fit, "mu"),
         sigma = fitted(fit, "sigma"),
         alpha0 = mu / sigma,
         beta0 = (1 - mu) / sigma,
         alpha1 = alpha0 + H,
         beta1 = beta0 + AB - H,
         estimate = alpha1 / (alpha1 + beta1))
```
Now we've corrected for one confounding factor, $\mbox{AB}$. One important aspect of this prediction is that it won't be useful when we've just hired a "rookie" player, and we're wondering what his batting average will be. This observed variable $\mbox{AB}$ is based on a player's *entire career*, such that a low number is evidence that a player didn't have much of a chance to bat.[^rookiepredict]
[^rookiepredict]: If we wanted to make a prediction for a rookie, we'd have to consider the distribution of possible $\mbox{AB}$'s the player could end up with and integrate over that, which is beyond the scope of this book.
But there's some information we *can* use even at the start of a player's career. Part of the philosophy of the Bayesian approach is to bring our *prior expectations* in mathematically. Let's try doing that with some factors that influence batting success.
## Right- and left-handed batters
It's well known in sabermetrics that left-handed batters [tend to bat slightly better](http://www.hardballtimes.com/the-advantage-of-batting-left-handed/).[^righthandedpitchers] The Lahman dataset provides that information in the `bats` column.
[^righthandedpitchers]: In fact, the general belief is that left-handed batters have an advantage *against right-handed pitchers*, but since most pitchers historically have been right-handed this evens out to an advantage.
```{r}
career %>%
  count(bats)
```
These letters represent "Both" ([switch hitters](https://en.wikipedia.org/wiki/Switch_hitter)), "Left", and "Right", respectively. One interesting feature is that while the ratio of righties to lefties is about 9-to-1 in the general population, in professional baseball it is only about 2-to-1. Managers like to hire left-handed batters; in itself, this is some evidence of a left-handed advantage! We also see a number of batters (mostly from earlier in the game's history) for whom we don't have handedness information; we'll filter them out of this analysis.
Incorporating this as a predictor is as simple as adding `bats` to the formula in the `gamlss` call (our beta-binomial regression).
```{r fit2, dependson = "lahman", results = "hide"}
# relevel to set right-handed batters as the baseline
# (factor() ensures this works whether bats is stored as character or factor)
career2 <- career %>%
  filter(!is.na(bats)) %>%
  mutate(bats = relevel(factor(bats), "R"))

fit2 <- gamlss(cbind(H, AB - H) ~ log(AB) + bats,
               data = career2,
               family = BB(mu.link = "identity"))
```
We can then look at the coefficients using the `tidy()` function from broom.
```{r dependson = "fit2"}
library(broom)
tidy(fit2)
```
According to our beta-binomial regression, there is indeed a statistically significant advantage to being left-handed, with lefties getting a hit about one percentage point more often. This may seem like a small effect, but over the course of many games it can certainly make a difference. In contrast, there's apparently no detectable advantage to being able to bat with both hands.
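To put that coefficient on a more concrete scale, here's a back-of-the-envelope calculation. This is just a sketch: the 600 at-bats is an assumed full season, and the `parameter` and `term` columns follow the `tidy()` output above.

```{r dependson = "fit2"}
coefs <- tidy(fit2)

# The estimated left-handedness effect: since mu uses the identity link,
# this is directly in units of batting average
lefty_effect <- coefs$estimate[coefs$parameter == "mu" & coefs$term == "batsL"]

# Extra hits we'd expect over a (hypothetical) 600 at-bat season
lefty_effect * 600
```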
For our empirical Bayes estimation, this means every combination of handedness and AB now has its own prior, as shown in Figure \@ref(fig:leftrightprior). Left-handed batters, and those with more at-bats in their career, generally have higher prior expectations.
```{r leftrightprior, dependson = "fit2", echo = FALSE, fig.cap = "The prior distribution according to the hierarchical model for players with particular combinations of AB and handedness."}
sigma <- fitted(fit2, "sigma")[1]

crossing(bats = c("L", "R"),
         AB = c(1, 10, 100, 1000, 10000)) %>%
  augment(fit2, newdata = .) %>%
  rename(mu = .fitted) %>%
  crossing(x = seq(.1, .36, .0005)) %>%
  mutate(alpha = mu / sigma,
         beta = (1 - mu) / sigma,
         density = dbeta(x, alpha, beta)) %>%
  ggplot(aes(x, density, color = factor(AB), lty = bats)) +
  geom_line() +
  labs(x = "Batting average",
       y = "Prior density",
       color = "AB",
       lty = "Batting hand")
```
Note that this prior can still easily be overcome by enough evidence. For example, consider our hypothetical pair of batters from the introduction, where each has a 30% success rate, but one is left-handed and one right-handed. If the batters had few at-bats (such as 10 or 100), we'd guess that the left-handed batter was better, but the posteriors for the two converge as AB increases to 1000 or 10,000 (Figure \@ref(fig:credintervals30)).
```{r credintervals30, dependson = c("fit2", "leftrightprior"), echo = FALSE, fig.cap = "Empirical Bayes estimates and 95\\% credible intervals for two hypothetical batters with a 30\\% success rate, one left-handed and one right-handed."}
crossing(bats = c("L", "R"),
         AB = c(10, 100, 1000, 10000)) %>%
  augment(fit2, newdata = .) %>%
  mutate(H = .3 * AB,
         alpha0 = .fitted / sigma,
         beta0 = (1 - .fitted) / sigma,
         alpha1 = alpha0 + H,
         beta1 = beta0 + AB - H,
         estimate = alpha1 / (alpha1 + beta1),
         conf.low = qbeta(.025, alpha1, beta1),
         conf.high = qbeta(.975, alpha1, beta1),
         record = paste(H, AB, sep = " / ")) %>%
  ggplot(aes(estimate, record, color = bats)) +
  geom_point() +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high)) +
  labs(x = "Estimate w/ 95% credible interval",
       y = "Batting record",
       color = "Batting hand")
```
## Over time
One of the most dramatic pieces of information we've "swept under the rug" in our analysis is the time period when each player was active. It's absurd to expect that players in the 1880s would have the same range of batting averages as players today, and we should take that into account in our estimates. The dataset therefore includes a `year` column for each player, representing the average of the years he was an active batter.
```{r dependson = "fit2", echo = FALSE}
career2 %>%
  head(6) %>%
  select(-playerID) %>%
  knitr::kable(booktabs = TRUE)
```
Could we simply fit a linear model with respect to year, such as `~ log(AB) + bats + year`? Well, before we fit a model we should always look at a graph. A boxplot comparing batting averages across decades is a good start (Figure \@ref(fig:decadeboxplot)).
```{r decadeboxplot, dependson = "fit2", echo = FALSE, fig.cap = "Boxplot of batting averages within each decade. To reduce the effect of noise, only players with at least 500 at-bats are included."}
career2 %>%
  mutate(decade = factor(round(year - 5, -1))) %>%
  filter(AB >= 500) %>%
  ggplot(aes(decade, average)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ylab("Batting average")
```
There's certainly a trend over time, but there's nothing linear about it: batting averages have both risen and fallen across the decades. If you're interested in baseball history and not just Bayesian statistics, you may notice that this graph traces the "power struggle" between offense and defense.
* The rise in the 1920s and 1930s marks [the end of the dead-ball era](https://en.wikipedia.org/wiki/Dead-ball_era#The_end_of_the_dead-ball_era), where hitting, especially home runs, became a more important part of the game
* The batting average "cools off" as pitchers adjust their technique, especially when [the range of the strike zone was increased in 1961](http://www.thisgreatgame.com/1968-baseball-history.html)
* Batting average rose again in the 1970s thanks to the [designated hitter rule](https://en.wikipedia.org/wiki/Designated_hitter), where pitchers (in one of the two main leagues) were no longer required to bat
* It looks like batting averages may [again be drifting downward](http://www.latimes.com/sports/la-sp-mlb-declining-offense-20150405-story.html)
In any case, we certainly can't fit a simple linear trend here. To capture the nonlinear trend, we could instead fit a [natural cubic spline](https://en.wikipedia.org/wiki/Spline_interpolation) using the [ns function](http://www.inside-r.org/r-doc/splines/ns):[^df]
[^df]: Why 5 degrees of freedom? Not very scientific: I just tried a few and picked one that roughly captured the shapes we saw in the boxplots. With too few degrees of freedom you can't capture the complex trend we're seeing here; with too many, you'll overfit to noise in your data. If you're not familiar with splines, what's important is that even in a linear model we can include nonlinear trends.
```{r career3, dependson = "fit2", results = "hide"}
library(splines)

fit3 <- gamlss(cbind(H, AB - H) ~ 0 + ns(year, df = 5) + bats + log(AB),
               data = career2,
               family = BB(mu.link = "identity"))
```
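If splines are new to you, it may help to peek at what `ns()` actually produces: a basis matrix with `df` columns, each a smooth function of `year`, and the regression simply fits one coefficient per column. (A quick inspection using the `career2` data from above.)

```{r dependson = "fit2"}
library(splines)

# One row per player, one column per spline basis function of year
year_basis <- ns(career2$year, df = 5)
dim(year_basis)
head(year_basis, 3)
```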
We now have a prior for each year, handedness, and number of at-bats. For example, Figure \@ref(fig:gamlssprior) shows the prior distribution for each year and handedness for a hypothetical player with AB = 1000.
```{r plot_gamlss_fit, dependson = "career3", echo = FALSE}
plot_gamlss_fit <- function(f) {
  career2 %>%
    dplyr::select(year, bats) %>%
    distinct() %>%
    filter(bats != "B") %>%
    mutate(AB = 1000) %>%
    augment(f, newdata = .) %>%
    rename(mu = .fitted) %>%
    mutate(sigma = fitted(f, "sigma")[1],
           alpha0 = mu / sigma,
           beta0 = (1 - mu) / sigma,
           conf_low = qbeta(.025, alpha0, beta0),
           conf_high = qbeta(.975, alpha0, beta0)) %>%
    ggplot(aes(year, mu, color = bats, group = bats)) +
    geom_line() +
    geom_ribbon(aes(ymin = conf_low, ymax = conf_high), linetype = 2, alpha = .1) +
    labs(x = "Year",
         y = "Prior distribution (mean + 95% quantiles)",
         color = "Batting hand")
}
```
```{r gamlssprior, dependson = c("career3", "plot_gamlss_fit"), echo = FALSE, fig.cap = "The prior distribution that would be used for a left- or right-handed player at a particular point in time. Shown are the mean and the 95\\% intervals for each prior."}
plot_gamlss_fit(fit3)
```
Note that those intervals don't represent uncertainty about our trend, but rather the 95% range of prior batting averages. Each combination of year and left/right handedness is a beta distribution, of which we're seeing just one cross-section.
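To make that concrete, here's a sketch of a single vertical slice of Figure \@ref(fig:gamlssprior): the full beta prior implied by `fit3` for a hypothetical left-handed batter with 1000 career at-bats in 1950 (the year, handedness, and AB values are illustrative choices).

```{r dependson = "career3"}
# Prior parameters for one combination of year, handedness, and AB
cross_section <- crossing(year = 1950, bats = "L", AB = 1000) %>%
  augment(fit3, newdata = .) %>%
  mutate(sigma = fitted(fit3, "sigma")[1],
         alpha0 = .fitted / sigma,
         beta0 = (1 - .fitted) / sigma)

# The full beta distribution this cross-section represents
crossing(x = seq(.15, .35, .001)) %>%
  mutate(density = dbeta(x, cross_section$alpha0, cross_section$beta0)) %>%
  ggplot(aes(x, density)) +
  geom_line() +
  labs(x = "Batting average", y = "Prior density")
```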
### Interaction terms
One of the implicit assumptions of the above model is that the effect of left-handedness hasn't changed over time. But this may not be true! We can change the formula to allow an [interaction term](https://en.wikipedia.org/wiki/Interaction_(statistics)) `ns(year, 5) * bats`, which lets the effect of handedness change over time.
```{r fit4, dependson = "career3", results = "hide"}
fit4 <- gamlss(cbind(H, AB - H) ~ 0 + ns(year, 5) * bats + log(AB),
               data = career2,
               family = BB(mu.link = "identity"))
```
```{r gamlssinteraction, dependson = c("plot_gamlss_fit", "fit4"), echo = FALSE, fig.cap = "The prior distribution that would be used for a left- or right-handed player at a particular point in time, allowing for an interaction term between time and handedness."}
plot_gamlss_fit(fit4)
```
These new priors are shown in Figure \@ref(fig:gamlssinteraction). Interestingly, we can now see that *the gap between left-handed and right-handed batters has been closing since the start of the game,* to the point that today it has all but disappeared. This suggests that managers and coaches may have learned how to deal with left-handed batters.[^percentpitcher]
[^percentpitcher]: Inspired by this, we might wonder whether the percentage of games started by left-handed pitchers has been going up over time, and indeed it has. This is one thing I like about fitting hierarchical models like these: they don't just improve your estimation, they can also give you insights into your data.
```{r pitcherhandtime, echo = FALSE, fig.cap = "The percentage of games in each year that were played with a left-handed pitcher.", eval = FALSE}
Pitching %>%
  dplyr::select(playerID, yearID, GS) %>%
  distinct() %>%
  inner_join(dplyr::select(Master, playerID, throws), by = "playerID") %>%
  count(yearID, throws, wt = GS) %>%
  filter(!is.na(throws)) %>%
  # compute each hand's share within each year
  group_by(yearID) %>%
  mutate(percent = n / sum(n)) %>%
  ungroup() %>%
  filter(throws == "L") %>%
  ggplot(aes(yearID, percent)) +
  geom_line() +
  geom_smooth() +
  scale_y_continuous(labels = scales::percent_format()) +
  xlab("Year") +
  ylab("% of games with left-handed pitcher")
```
### Posterior distributions
Let's go back to those two batters with a record of 30 hits out of 100 at-bats. We've now seen that this would be a different question in different years. Let's consider what it would look like in three different years, each 50 years apart.
```{r players_posterior, dependson = "fit4"}
players <- crossing(year = c(1915, 1965, 2015),
                    bats = c("L", "R"),
                    H = 30,
                    AB = 100)

players_posterior <- players %>%
  mutate(mu = predict(fit4, what = "mu", newdata = players),
         sigma = predict(fit4, what = "sigma",
                         newdata = players, type = "response"),
         alpha0 = mu / sigma,
         beta0 = (1 - mu) / sigma,
         alpha1 = alpha0 + H,
         beta1 = beta0 + AB - H)

players_posterior
```
We can see in Figure \@ref(fig:posteriortime) how these posterior distributions (the `alpha1` and `beta1` we computed) differ for each of the three. If this comparison had happened in 1915, you may have wanted to pick the left-handed batter. We wouldn't have been sure he was better (we'd need a method from Chapter \@ref(ab-testing) to determine that), but it was more likely than not. But today, there'd be basically no reason to: left- versus right-handedness carries almost no extra information.
```{r posteriortime, dependson = "players_posterior", echo = FALSE, fig.cap = "Posterior distributions for batters with a record of 30 / 100, whether left- or right-handed and at different points in time."}
players_posterior %>%
  crossing(x = seq(.15, .3, .001)) %>%
  mutate(density = dbeta(x, alpha1, beta1)) %>%
  ggplot(aes(x, density, color = bats)) +
  geom_line() +
  facet_wrap(~ year) +
  xlab("Batting average") +
  ylab("Posterior density")
```
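As a quick check on that 1915 comparison, we can simulate draws from the two posteriors and estimate the probability that the left-handed batter is truly better: a sketch of the approach from Chapter \@ref(ab-testing). (Note this relies on the row order from the `crossing()` call above, which lists the left-handed batter first within each year.)

```{r dependson = "players_posterior"}
post_1915 <- players_posterior %>%
  filter(year == 1915)

# Row 1 is the left-handed batter, row 2 the right-handed one
set.seed(2017)
lefty <- rbeta(1e5, post_1915$alpha1[1], post_1915$beta1[1])
righty <- rbeta(1e5, post_1915$alpha1[2], post_1915$beta1[2])
mean(lefty > righty)
```

If Figure \@ref(fig:posteriortime) is right, this probability should come out well above one half for 1915, and close to one half if we repeat the exercise for 2015.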
## Uncertainty in hyperparameters
We've followed the philosophy of empirical Bayes so far: we fit hyperparameters ($\alpha_0$, $\beta_0$, or our coefficients for time and handedness) for our model using maximum likelihood (e.g. beta-binomial regression), and then use that as the prior for each of our observations.
There's a problem I've been ignoring so far with the empirical Bayesian approach, which is that there's uncertainty in these hyperparameters as well. When we come up with an alpha and beta, or with particular coefficients over time, we are treating those as fixed knowledge, as if these are the priors we "entered" the experiment with. But in fact each of these parameters was estimated from this very same data, and each comes with a confidence interval that we're entirely ignoring. This is sometimes called the "double-dipping" problem among critics of empirical Bayes.
This wasn't a big deal when we were estimating just $\alpha_0$ and $\beta_0$ for the overall dataset: we had so much data, and were estimating so few parameters, that we could feel good about the approach. But now that we're fitting this many parameters, we're *pushing it*. Actually quantifying this, and choosing methods robust to the charge of "double-dipping", involves Bayesian methods outside the scope of this book. But I wanted to note that this chapter reaches approximately the edge of what empirical Bayes can be used for reliably.
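As a parting illustration, here's a rough look at the uncertainty we've been ignoring: approximate 95% confidence intervals for the `fit2` hyperparameters, using a crude normal approximation on the coefficients (note these are on the link scale for $\sigma$).

```{r dependson = "fit2"}
# Approximate 95% confidence intervals for each fitted hyperparameter
tidy(fit2) %>%
  mutate(conf.low = estimate - 1.96 * std.error,
         conf.high = estimate + 1.96 * std.error)
```

A fully Bayesian approach would propagate this uncertainty into every player's posterior, rather than treating the fitted prior as fixed.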