# Empirical Bayes estimation {#empirical-bayes}
```{r, echo = FALSE}
library(knitr)
opts_chunk$set(cache = TRUE, warning = FALSE, message = FALSE, echo = TRUE, tidy = FALSE, fig.height = 3, fig.width = 5)
options(digits = 3, tibble.print_min = 6)
```
```{r cache = FALSE, echo = FALSE}
library(ggplot2)
theme_set(theme_bw())
```
Which of these two proportions is higher: **4 out of 10**, or **300 out of 1000**? This sounds like a silly question. Obviously $4/10=.4$, which is greater than $300/1000=.3$.
But suppose you were a baseball recruiter, trying to decide which of two potential players is a better batter based on how many hits they get. One has achieved 4 hits in 10 chances, the other 300 hits in 1000 chances. While the first player has a higher proportion of hits, it's not a lot of evidence: a typical player tends to achieve a hit around 27% of the time, and this player's $4/10$ could be due to luck. The second player, on the other hand, has a lot of evidence that he's an above-average batter.
This chapter introduces a very useful statistical method for estimating a large number of proportions, called **empirical Bayes estimation**. It's designed to help you with data that looks like this:
```{r successtotal, echo = FALSE}
d <- data.frame(Success = c(11, 82, 2, 0, 1203, 5),
                Total = c(104, 1351, 26, 40, 7592, 166))
kable(d, booktabs = TRUE)
```
A lot of data takes the form of these success/total counts, where you want to estimate a "proportion of success" for each instance. Each row might represent:
* **An ad you're running**: Which of your ads have higher clickthrough rates, and which have lower? Note that I'm not talking about running an A/B test comparing two options (which will be discussed in Chapter \@ref(ab-testing)), but rather about ranking and analyzing a large list of choices.
* **A user on your site**: In my work at Stack Overflow, I might look at what fraction of a user's visits are to JavaScript questions, to [guess whether they are a web developer](http://kevinmontrose.com/2015/01/27/providence-machine-learning-at-stack-exchange/). In another application, you might consider how often a user decides to read an article they browse over, or to purchase a product they've clicked on.
When you work with pairs of successes/totals like this, you tend to get tripped up by the uncertainty in low counts. $1/2$ does not mean the same thing as $50/100$; nor does $0/1$ mean the same thing as $0/1000$.[^filteringdata]
[^filteringdata]: One common approach is to filter out all cases that don't meet some minimum. But this isn't always an option, since you could be throwing away useful information, and since it's not generally clear where to set a threshold.
In the last chapter, we learned how using the beta distribution to represent your prior expectations, and updating based on the new evidence, can help make your estimate more accurate and practical. In this chapter, we'll explore the method we'll be using for the rest of this book: **empirical Bayes estimation**, where a beta distribution fit on all observations is then used to improve each individually. What's great about this method is that as long as you have a lot of examples, *you don't need to bring in prior expectations*.
## Setup: the Lahman baseball dataset
In Chapter \@ref(beta-distribution), we made some vague guesses about the distribution of batting averages across history. Here we'll instead work with real data, and come up with a quantitative solution, using the `Batting` dataset from the excellent [Lahman package](https://cran.r-project.org/web/packages/Lahman/index.html). We'll prepare and clean the data a little first, using dplyr and tidyr.[^removepitchers]
[^removepitchers]: Pitchers tend to be unusually weak batters and should be analyzed separately, so we remove them using dplyr's `anti_join`. We'll revisit pitchers in depth in Chapter \@ref(mixture-models).
```{r lahman_03}
library(dplyr)
library(tidyr)
library(Lahman)
# Filter out pitchers
career <- Batting %>%
  filter(AB > 0) %>%
  anti_join(Pitching, by = "playerID") %>%
  group_by(playerID) %>%
  summarize(H = sum(H), AB = sum(AB)) %>%
  mutate(average = H / AB)
# Include names along with the player IDs
career <- Master %>%
  tbl_df() %>%
  dplyr::select(playerID, nameFirst, nameLast) %>%
  unite(name, nameFirst, nameLast, sep = " ") %>%
  inner_join(career, by = "playerID") %>%
  dplyr::select(-playerID)
```
We've prepared a dataset of players' career averages, including their number of hits, at-bats, and batting averages.
```{r dependson = "lahman_03"}
career
```
I wonder who the best batters in history were. Well, here are the ones with the highest batting average:
```{r dependson = "lahman_03", echo = FALSE, results = "asis"}
career %>%
  arrange(desc(average)) %>%
  head(5) %>%
  knitr::kable(booktabs = TRUE)
```
Err, that's not really what I was looking for. These aren't the best batters, they're just the batters who went up once or twice and got lucky. How about the worst batters?
```{r dependson = "lahman_03", echo = FALSE, results = "asis"}
career %>%
  arrange(average) %>%
  head(5) %>%
  knitr::kable(booktabs = TRUE)
```
Also not what I was looking for. That "average" is a really crummy estimate. **Let's make a better one!**
## Step 1: Estimate a prior from all your data {#mle-prior}
Consider the distribution of batting averages across players. Any time we want to examine a distribution, we'll generally use a histogram (Figure \@ref(fig:histogrambattingaverages)).[^filtering]
[^filtering]: For the sake of visualizing and fitting the prior distribution (though not for updating individual players), I've filtered for players with more than 500 at-bats, to reduce the amount of noise. We'll see how this model can be improved in Chapter \@ref(regression).
```{r histogrambattingaverages, dependson = "lahman_03", echo = FALSE, fig.cap = "A histogram of the batting averages of all players with more than 500 at-bats."}
career %>%
  filter(AB >= 500) %>%
  ggplot(aes(average)) +
  geom_histogram(binwidth = .005)
```
The first step of empirical Bayes estimation is to estimate a beta prior using this data. Estimating priors from the data you're currently analyzing is not the typical Bayesian approach: usually you decide on your priors ahead of time. There's a lot of debate and discussion about when and where it's appropriate to use empirical Bayesian methods, but it basically comes down to how many observations we have: if we have a lot, we can get a good estimate that doesn't depend much on any one individual. Empirical Bayes is an **approximation** to more exact Bayesian methods, and with the amount of data we have, it's a very good approximation.
So far, a beta distribution looks like a pretty appropriate choice based on the above histogram.[^whatbadchoice] So we know we want to fit the following model:
[^whatbadchoice]: What would have made the beta a bad choice? Well, suppose the histogram had two peaks, or three, instead of one. Then we might have needed a mixture of betas, as will be introduced in Chapter \@ref(mixture-models), or an even more complicated model.
$$X\sim\mbox{Beta}(\alpha_0,\beta_0)$$
We just need to pick $\alpha_0$ and $\beta_0$, which we call "hyper-parameters" of our model. We can fit these using maximum likelihood: finding the parameters that would maximize the probability of generating the distribution we see.[^mle]
[^mle]: There are many functions in R for fitting a probability distribution to data (`optim`, `mle`, `bbmle`, etc). You don't even have to use maximum likelihood: you could [use the mean and variance](http://stats.stackexchange.com/questions/12232), called the "method of moments". Here we'll use the `mle` function from the stats4 package, along with `dbetabinom.ab` from VGAM.
```{r mle, dependson = "lahman_03"}
library(stats4)
career_filtered <- career %>%
  filter(AB > 500)
# log-likelihood function
ll <- function(alpha, beta) {
  x <- career_filtered$H
  total <- career_filtered$AB
  -sum(VGAM::dbetabinom.ab(x, total, alpha, beta, log = TRUE))
}
# maximum likelihood estimation
m <- mle(ll, start = list(alpha = 1, beta = 10), method = "L-BFGS-B",
         lower = c(0.0001, .1))
ab <- coef(m)
alpha0 <- ab[1]
beta0 <- ab[2]
```
This comes up with $\alpha_0=`r alpha0`$ and $\beta_0=`r beta0`$. We can see from the red curve in Figure \@ref(fig:histogrambeta) that it fits pretty well!
```{r histogrambeta, dependson = "mle", echo = FALSE, fig.cap = "A histogram of the batting averages, showing the density of the beta distribution (fit with maximum likelihood) in red."}
career_filtered %>%
  ggplot() +
  geom_histogram(aes(average, y = ..density..), binwidth = .005) +
  stat_function(fun = function(x) dbeta(x, alpha0, beta0), color = "red",
                size = 1) +
  xlab("Batting average")
```
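As an aside, the "method of moments" mentioned in the footnote offers a quick alternative fit. Here's a minimal sketch, assuming we fit a beta directly to the filtered batting averages by matching their mean and variance; the helper name `fit_beta_mom` is made up for illustration, and the result won't exactly match the beta-binomial maximum likelihood fit above.
```{r, eval = FALSE}
# A rough sanity check, not part of the main analysis: estimate alpha and beta
# by matching the sample mean and variance of the filtered batting averages
fit_beta_mom <- function(x) {
  mu <- mean(x)
  sigma2 <- var(x)
  common <- mu * (1 - mu) / sigma2 - 1
  c(alpha = mu * common, beta = (1 - mu) * common)
}

fit_beta_mom(career_filtered$average)
```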
## Step 2: Use that distribution as a prior for each individual estimate
Now when we look at any individual to estimate their batting average, we'll start with our overall prior, and [update](https://en.wikipedia.org/wiki/Bayesian_inference) based on the individual evidence. I went over this process in detail in Chapter \@ref(beta-distribution): it's as simple as adding $\alpha_0$ to the number of hits, and $\alpha_0 + \beta_0$ to the total number of at-bats.
For example, consider our hypothetical batter from the introduction who went up 1000 times and got 300 hits. We would estimate his batting average as:
$$\frac{300+\alpha_0}{1000+\alpha_0+\beta_0}=\frac{300+`r round(alpha0, 1)`}{1000+`r round(alpha0, 1)`+`r round(beta0, 1)`}=`r (300 + alpha0) / (1000 + alpha0 + beta0)`$$
How about the batter who went up only 10 times and got 4 hits? We would estimate his batting average as:
$$\frac{4+\alpha_0}{10+\alpha_0+\beta_0}=\frac{4+`r round(alpha0, 1)`}{10+`r round(alpha0, 1)`+`r round(beta0, 1)`}=`r (4 + alpha0) / (10 + alpha0 + beta0)`$$
Thus, even though $\frac{4}{10}>\frac{300}{1000}$, we would guess that the $\frac{300}{1000}$ batter is better than the $\frac{4}{10}$ batter!
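If you'd like to check those two worked examples yourself, here's a quick sketch that plugs the fitted hyperparameters into the same formula:
```{r, eval = FALSE}
# Reproduce the two worked examples above using the fitted prior
(300 + alpha0) / (1000 + alpha0 + beta0)
(4 + alpha0) / (10 + alpha0 + beta0)
```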
Performing this calculation for all the batters is simple enough. We'll call this an "empirical Bayes estimate", saving it in the `eb_estimate` column.
```{r career_eb, dependson = "mle"}
career_eb <- career %>%
  mutate(eb_estimate = (H + alpha0) / (AB + alpha0 + beta0))
```
## Results
Now that we have a better estimate for each player's batting average, we can ask who the best batters are.
```{r dependson = "career_eb", echo = FALSE}
options(digits = 3)
career_eb %>%
  arrange(desc(eb_estimate)) %>%
  head(5) %>%
  kable(booktabs = TRUE)
options(digits = 1)
```
Who are the *worst* batters?
```{r dependson = "career_eb", echo = FALSE}
options(digits = 3)
career_eb %>%
  arrange(eb_estimate) %>%
  head(5) %>%
  kable(booktabs = TRUE)
options(digits = 1)
```
Notice that in each of these cases, empirical Bayes didn't simply pick the players who had 1 or 2 at-bats and hit on 100% of them. It found players who generally batted well, or poorly, across a long career. What a load off of our minds: we can start using these empirical Bayes estimates in downstream analyses and algorithms, and not worry that we're accidentally letting $0/1$ or $1/1$ cases ruin everything.
Overall, let's see how empirical Bayes changed all of the batting average estimates using a scatter plot (Figure \@ref(fig:ebestimatescatter)).
```{r ebestimatescatter, dependson = "career_eb", echo = FALSE, fig.cap = "A comparison between the raw batting average ($\\frac{H}{AB}$) and the empirical Bayes estimate ($\\frac{H+\\alpha_0}{AB+\\alpha_0+\\beta_0}$) for all batters.", fig.width = 6, fig.height = 5}
ggplot(career_eb, aes(average, eb_estimate, color = AB)) +
  geom_hline(yintercept = alpha0 / (alpha0 + beta0), color = "red", lty = 2) +
  geom_point() +
  geom_abline(color = "red") +
  scale_colour_gradient(trans = "log", breaks = 10 ^ (1:5)) +
  xlab("Batting average") +
  ylab("Empirical Bayes batting average")
```
The horizontal dashed red line marks $y=\frac{\alpha_0}{\alpha_0 + \beta_0}=`r sprintf("%.3f", alpha0 / (alpha0 + beta0))`$. That's what we would guess someone's batting average was if we had *no* evidence at all (if both $H$ and $AB$ were 0). Notice that points above that line tend to move down towards it, while points below it move up.
The diagonal red line marks $x=y$. Points that lie close to it are the ones that didn't get shrunk at all by empirical Bayes. Notice that they're the ones with the highest number of at-bats (the brightest blue): they have enough evidence that we're willing to believe the naive batting average estimate.
This process is often (including in the rest of this book) called **shrinkage**: the process of moving all our estimates towards the average. How much it moves these estimates depends on how much evidence we have: if we have very little evidence (4 hits out of 10) we move it a lot; if we have a lot of evidence (300 hits out of 1000) we move it only a little. That's shrinkage in a nutshell: *Extraordinary outliers require extraordinary evidence*.
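One way to make this "moving towards the average" concrete is to rewrite the estimate as a weighted average of each player's raw batting average and the prior mean, where the weight grows with the number of at-bats. Here's a minimal sketch, algebraically equivalent to `eb_estimate`; the column names `weight` and `shrunken` are made up for illustration.
```{r, eval = FALSE}
# Shrinkage as a weighted average: weight * (raw average) + (1 - weight) * (prior mean)
prior_mean <- alpha0 / (alpha0 + beta0)

career_eb %>%
  mutate(weight = AB / (AB + alpha0 + beta0),
         shrunken = weight * average + (1 - weight) * prior_mean)
```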
## So easy it feels like cheating
There are two steps in empirical Bayes estimation:
1. Estimate the overall distribution of your data.
2. Use that distribution as your prior for estimating each average.
Step 1 can be done once, "offline": analyze all your data and come up with some estimates of your overall distribution. Step 2 is done for each new observation you're considering. You might be estimating the success of a post or an ad, or classifying the behavior of a user in terms of how often they make a particular choice.
And because we're using the beta and the binomial, consider how *easy* that second step is. All we did was add one number to the successes, and add another number to the total. If you work in a company that puts models into production, you could build that into your system with a single line of code that takes nanoseconds to run.

    // We hired a Data Scientist to analyze our Big Data
    // and all we got was this lousy line of code.
    float estimate = (successes + `r round(alpha0, 1)`) / (total + `r round(alpha0 + beta0, 1)`);
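The same update could be written as an R helper; this is just a hedged sketch, and the function name `shrink_estimate` is made up here:
```{r, eval = FALSE}
# The same empirical Bayes update, wrapped as a reusable R function
shrink_estimate <- function(successes, total) {
  (successes + alpha0) / (total + alpha0 + beta0)
}

shrink_estimate(300, 1000)
```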
That really is so simple that it feels like cheating: like the kind of "fudge factor" you might throw into your code, with the intention of coming back to it later to do some real Machine Learning.
I bring this up to disprove the notion that statistical sophistication necessarily means dealing with complicated, burdensome algorithms. This Bayesian approach is based on sound principles, but it's still easy to implement. Conversely, next time you think "I only have time to implement a dumb hack," remember that you can use methods like these: it's a way to choose your fudge factor. Some dumb hacks are better than others!
But when anyone asks what you did, remember to call it "empirical Bayesian shrinkage towards a Beta prior." We statisticians have to keep up appearances.