# Bayesian A/B testing {#ab-testing}
```{r, echo = FALSE}
library(knitr)
opts_chunk$set(cache = TRUE, warning = FALSE, message = FALSE, tidy = FALSE, fig.height = 5, fig.width = 6.67, out.height = "3in", out.width = "4in")
options(digits = 3)
```
```{r cache = FALSE, echo = FALSE}
library(ggplot2)
theme_set(theme_bw())
```
Who is a better batter: [Mike Piazza](https://en.wikipedia.org/wiki/Mike_Piazza) or [Hank Aaron](https://en.wikipedia.org/wiki/Hank_Aaron)?
Well, Mike Piazza has a slightly higher career batting average (2127 hits / 6911 at-bats = 0.308) than Hank Aaron (3771 hits / 12364 at-bats = 0.305). But can we say with confidence that his skill is *actually* higher, or is it possible he just got lucky a bit more often?
In this series of posts about an empirical Bayesian approach to batting statistics, we've been estimating batting averages by modeling them as a binomial distribution with a beta prior. But we've been looking at a single batter at a time. What if we want to compare *two* batters, find a probability that one is better than the other, and estimate *by how much*?
This is a topic rather relevant to my own work and to the data science field, because understanding the difference between two proportions is important in **A/B testing**. One of the most common examples of A/B testing is comparing clickthrough rates ("out of X impressions, there have been Y clicks"), which on the surface is similar to our batting average estimation problem ("out of X at-bats, there have been Y hits").[^bayesianAB]
[^bayesianAB]: The differences between frequentist and Bayesian A/B testing are a topic I've [blogged about in greater depth](http://varianceexplained.org/r/bayesian-ab-testing/), particularly the problem of early stopping.
Here, we're going to look at an empirical Bayesian approach to comparing two batters. We'll define the problem in terms of the difference between each batter's posterior distribution, and look at four mathematical and computational strategies we can use to resolve this question. While we're focusing on baseball here, remember that similar strategies apply to A/B testing, and indeed to many Bayesian models.
## Setup
As usual, we start with code that sets up the variables analyzed in this chapter.
```{r lahman_06}
library(dplyr)
library(tidyr)
library(Lahman)
# Grab career batting average of non-pitchers
# (allow players that have pitched <= 3 games, like Ty Cobb)
pitchers <- Pitching %>%
group_by(playerID) %>%
summarize(gamesPitched = sum(G)) %>%
filter(gamesPitched > 3)
career <- Batting %>%
filter(AB > 0) %>%
anti_join(pitchers, by = "playerID") %>%
group_by(playerID) %>%
summarize(H = sum(H), AB = sum(AB)) %>%
mutate(average = H / AB)
# Add player names
career <- Master %>%
tbl_df() %>%
dplyr::select(playerID, nameFirst, nameLast) %>%
unite(name, nameFirst, nameLast, sep = " ") %>%
inner_join(career, by = "playerID")
# values estimated by maximum likelihood in Chapter 3
alpha0 <- 101.4
beta0 <- 287.3
# For each player, update the beta prior based on the evidence
# to get posterior parameters alpha1 and beta1
career_eb <- career %>%
mutate(eb_estimate = (H + alpha0) / (AB + alpha0 + beta0)) %>%
mutate(alpha1 = H + alpha0,
beta1 = AB - H + beta0) %>%
arrange(desc(eb_estimate))
```
## Comparing posterior distributions
So let's take a look at the two batters in question, Hank Aaron and Mike Piazza.
```{r two_players, dependson = "lahman_06"}
# while we're at it, save them as separate objects too for later:
aaron <- career_eb %>% filter(name == "Hank Aaron")
piazza <- career_eb %>% filter(name == "Mike Piazza")
two_players <- bind_rows(aaron, piazza)
two_players
```
We see that Piazza has a slightly higher average ($H / AB$), *and* a higher shrunken empirical Bayes estimate ($(H + \alpha_0) / (AB + \alpha_0 + \beta_0)$, where $\alpha_0$ and $\beta_0$ are our priors).
But is Piazza's *true* probability of getting a hit higher? Or is the difference due to chance? To answer, let's consider, just as we did in Chapter \@ref(credible-intervals) to find credible intervals, the actual posterior distributions. The posteriors give the range of plausible values for each player's "true" batting average after we've taken the evidence (their batting record) into account. Recall that these posterior distributions are beta distributions with parameters $\mbox{Beta}(\alpha_0 + H, \beta_0 + AB - H)$.
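Before comparing the two, it can help to look at each posterior on its own. Here's a quick sketch pulling a 95% credible interval from each player's posterior with `qbeta`, as in Chapter \@ref(credible-intervals):
```{r dependson = "two_players"}
# 95% credible interval for each player's true batting average,
# computed from the posterior beta parameters
two_players %>%
  mutate(low = qbeta(.025, alpha1, beta1),
         high = qbeta(.975, alpha1, beta1)) %>%
  dplyr::select(name, H, AB, eb_estimate, low, high)
```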
```{r aaronpiazza, dependson = "two_players", echo = FALSE, fig.cap = "Posterior distributions for the batting average of Hank Aaron and Mike Piazza."}
two_players %>%
crossing(x = seq(.28, .33, .00025)) %>%
mutate(density = dbeta(x, alpha1, beta1)) %>%
ggplot(aes(x, density, color = name)) +
geom_line() +
labs(x = "Batting average", color = "")
```
These posterior distributions (Figure \@ref(fig:aaronpiazza)) are therefore a probabilistic representation of our *uncertainty* in each estimate. When we ask the probability that Piazza is better, we're asking: "If I picked a random draw from Piazza's distribution and a random draw from Aaron's, what's the probability Piazza's draw is higher?"
Well, notice that those two distributions overlap *a lot*! There's enough uncertainty in each of those estimates that Aaron could easily be better than Piazza.
This changes if we throw another player in, retired Yankee Hideki Matsui (Figure \@ref(fig:aaronpiazzamatsui)). Hideki Matsui is a fine batter (above average for major league baseball), but not up to the level of Aaron and Piazza: notice that his posterior distribution of batting averages barely overlaps theirs. If we took a random draw from Matsui's distribution and from Piazza's, it's very unlikely that Matsui's would be higher.
```{r aaronpiazzamatsui, dependson = "two_players", echo = FALSE, fig.cap = "Posterior distributions for the batting average of Hank Aaron, Mike Piazza, and Hideki Matsui."}
career_eb %>%
filter(name %in% c("Hank Aaron", "Mike Piazza", "Hideki Matsui")) %>%
crossing(x = seq(.26, .33, .00025)) %>%
mutate(density = dbeta(x, alpha1, beta1)) %>%
ggplot(aes(x, density, color = name)) +
geom_line() +
labs(x = "Batting average", color = "")
```
We may be interested in the probability that Piazza is better than Aaron within our model. We can already tell from the graph that it's greater than 50%, but probably not much greater. How could we quantify it?
We'd need to know the *probability one beta distribution is greater than another*. This question is not trivial to answer, and I'm going to illustrate four routes that are common lines of attack in a Bayesian problem:
* Simulation of posterior draws
* Numerical integration
* Closed-form solution
* Closed-form approximation
Which of these approaches you choose depends on your particular problem, as well as on your computational constraints. In many cases an exact closed-form solution may not be known, or may not exist at all. In some cases (such as running machine learning in production) you may be heavily constrained for time, while in others (such as drawing conclusions for a scientific paper) you care more about precision.
### Simulation of posterior draws
If we don't want to do any math today (I'm sympathetic!), we could simply try simulation. We could use each player's $\alpha_1$ and $\beta_1$ parameters, draw a million values from each posterior using `rbeta`, and compare the results.
```{r dependson = "two_players"}
piazza_simulation <- rbeta(1e6, piazza$alpha1, piazza$beta1)
aaron_simulation <- rbeta(1e6, aaron$alpha1, aaron$beta1)
sim <- mean(piazza_simulation > aaron_simulation)
sim
```
This gives about a `r scales::percent(sim)` probability Piazza is better than Aaron! An answer like this is often good enough, depending on how much precision you need and how much computation you can afford. You could turn the number of draws up or down depending on how much you value speed versus precision.
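If you want to quantify that precision, note that the simulated probability is simply a sample proportion, so its standard error shrinks with the square root of the number of draws. A quick sketch:
```{r}
# Monte Carlo standard error of the simulated probability: with a
# million draws, the estimate is accurate to within a small fraction
# of a percentage point
sqrt(sim * (1 - sim) / 1e6)
```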
Notice we didn't have to do any mathematical derivation or proofs. Even if we had a much more complicated model, the process for simulating from it would still have been pretty straightforward. This is one of the reasons Bayesian simulation approaches like MCMC have become popular: computational power has gotten very cheap, while doing math is as expensive as ever.
### Integration
These two posteriors each have their own (independent) distribution, and together they form a *joint distribution*: that is, a density over particular pairs of $x$ and $y$. That joint distribution can be imagined as a density cloud (Figure \@ref(fig:densitycloud)).
```{r densitycloud, dependson = "two_players", echo = FALSE, fig.cap = "The joint probability density of Piazza's and Aaron's possible batting averages. Darker red means higher probability, and the black line represents the $x=y$ line.", dev = "png"}
library(tidyr)
x <- seq(.29, .318, .0002)
crossing(piazza_x = x, aaron_x = x) %>%
mutate(piazza_density = dbeta(piazza_x, piazza$alpha1, piazza$beta1),
aaron_density = dbeta(aaron_x, aaron$alpha1, aaron$beta1),
joint = piazza_density * aaron_density) %>%
ggplot(aes(piazza_x, aaron_x, fill = joint)) +
geom_tile() +
geom_abline() +
scale_fill_gradient2(low = "white", high = "red") +
labs(x = "Piazza batting average",
y = "Aaron batting average",
fill = "Joint density") +
theme(legend.position = "none")
```
Here, we're asking what fraction of the joint probability density lies below that black line, where Piazza's average is greater than Aaron's. Notice that a bit more of the cloud's mass lies below the line than above it, consistent with the posterior probability of about 60% that Piazza is better.
The way to calculate this quantitatively is numerical integration, which is how [Chris Stucchio approaches the problem in this post](https://web.archive.org/web/20150419163005/http://www.bayesianwitch.com/blog/2014/bayesian_ab_test.html) and [this Python script](https://gist.github.com/stucchio/9090456). Here's a simple approach in R.
```{r integration, dependson = "two_players"}
d <- .00002
limits <- seq(.29, .33, d)
sum(outer(limits, limits, function(x, y) {
(x > y) *
dbeta(x, piazza$alpha1, piazza$beta1) *
dbeta(y, aaron$alpha1, aaron$beta1) *
d ^ 2
}))
```
Like simulation, this is a bit on the "brute force" side. (And unlike simulation, the approach becomes intractable in problems with many dimensions, as opposed to the two dimensions here.)
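In this two-dimensional case there's also a middle ground: since the posteriors are independent, we can collapse the double integral into a single one, because ${\rm Pr}(p_{\mbox{Piazza}} > p_{\mbox{Aaron}}) = \int_0^1 f_{\mbox{Aaron}}(y)\left(1 - F_{\mbox{Piazza}}(y)\right)\,dy$, and hand the one-dimensional integrand to base R's `integrate`. Here's a sketch (the helper name `integrand` is an arbitrary choice):
```{r dependson = "two_players"}
# Integrate, over Aaron's posterior density, the probability that
# a draw from Piazza's posterior exceeds each value
integrand <- function(y) {
  dbeta(y, aaron$alpha1, aaron$beta1) *
    (1 - pbeta(y, piazza$alpha1, piazza$beta1))
}
integrate(integrand, 0, 1)$value
```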
### Closed-form solution
You don't need to be great at calculus to be a data scientist. But it's useful to know how to find people that *are* great at calculus. When it comes to A/B testing, the person to find is often [Evan Miller](http://www.evanmiller.org/).
[This post](http://www.evanmiller.org/bayesian-ab-testing.html#binary_ab_derivation) lays out a closed-form solution Miller derived for the probability a draw from one beta distribution is greater than a draw from another:
$$p_A \sim \mbox{Beta}(\alpha_A, \beta_A)$$
$$p_B \sim \mbox{Beta}(\alpha_B, \beta_B)$$
$${\rm Pr}(p_B > p_A) = \sum_{i=0}^{\alpha_B-1}\frac{B(\alpha_A+i,\beta_A+\beta_B)}{(\beta_B+i)
B(1+i, \beta_B)
B(\alpha_A, \beta_A)
}$$
Here $B$ is the [beta function](https://en.wikipedia.org/wiki/Beta_function). If you'd like an intuition behind this formula... well, you're on your own. But it's pretty straightforward to implement in R.[^hfunction]
[^hfunction]: I'm borrowing notation from [the Chris Stucchio post](https://www.chrisstucchio.com/blog/2014/bayesian_ab_decision_rule.html) and calling this function $h$.
```{r dependson = "two_players"}
h <- function(alpha_a, beta_a,
alpha_b, beta_b) {
j <- seq.int(0, round(alpha_b) - 1)
log_vals <- (lbeta(alpha_a + j, beta_a + beta_b) - log(beta_b + j) -
lbeta(1 + j, beta_b) - lbeta(alpha_a, beta_a))
1 - sum(exp(log_vals))
}
h(piazza$alpha1, piazza$beta1,
aaron$alpha1, aaron$beta1)
```
Having an exact solution is pretty handy![^exactc] So why did we even look at simulation/integration approaches? Well, the downsides are:
* *Not every problem has a solution like this.* And even if it does, we may not know it. That's why it's worth knowing how to run a simulation. (If nothing else, they let us check our math!)
* *This solution is slow for large $\alpha_B$, and not straightforward to vectorize*: notice the sum that iterates from 0 to $\alpha_B-1$. If we run A/B tests with thousands of clicks, this step is going to constrain us (though it's still usually faster than simulation or integration).
[^exactc]: Note that this solution is exact only for integer values of $\alpha_B$: we're rounding it here, which is a trivial difference in most of our examples but may matter in others.
### Closed-form approximation
As [this report points out](http://www.johndcook.com/fast_beta_inequality.pdf), there's a much faster approximation we can use. Notice that when $\alpha$ and $\beta$ are both fairly large, the beta distribution starts looking a lot like a normal distribution, so much so that a normal can closely approximate it. In fact, if you draw the normal approximation to the two players we've been considering, they are *visually indistinguishable* (Figure \@ref(fig:betanormal)).
```{r betanormal, dependson = "two_players", echo = FALSE, fig.cap = "The posterior beta distribution of batting averages for two players, shown alongside the normal approximation to each as a dashed line."}
two_players %>%
mutate(mu = alpha1 / (alpha1 + beta1),
var = alpha1 * beta1 / ((alpha1 + beta1) ^ 2 * (alpha1 + beta1 + 1))) %>%
crossing(x = seq(.28, .33, .00025)) %>%
mutate(density = dbeta(x, alpha1, beta1),
normal = dnorm(x, mu, sqrt(var))) %>%
ggplot(aes(x, density, group = name)) +
geom_line(aes(color = name)) +
geom_line(lty = 2)
```
The probability one normal variable is greater than another is *very easy to calculate*, much easier than for the beta!
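Specifically, if $X \sim N(\mu_X, \sigma_X^2)$ and $Y \sim N(\mu_Y, \sigma_Y^2)$ are independent, their difference is also normal: $X - Y \sim N(\mu_X - \mu_Y, \sigma_X^2 + \sigma_Y^2)$. Therefore

$${\rm Pr}(X > Y) = \Phi\left(\frac{\mu_X - \mu_Y}{\sqrt{\sigma_X^2 + \sigma_Y^2}}\right)$$

where $\Phi$ is the standard normal cumulative distribution function, which is exactly what the `pnorm` call below computes.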
```{r h_approx, dependson = "two_players"}
h_approx <- function(alpha_a, beta_a, alpha_b, beta_b) {
u1 <- alpha_a / (alpha_a + beta_a)
u2 <- alpha_b / (alpha_b + beta_b)
var1 <- (alpha_a * beta_a) /
((alpha_a + beta_a) ^ 2 * (alpha_a + beta_a + 1))
var2 <- (alpha_b * beta_b) /
((alpha_b + beta_b) ^ 2 * (alpha_b + beta_b + 1))
pnorm(0, u2 - u1, sqrt(var1 + var2))
}
h_approx(piazza$alpha1, piazza$beta1, aaron$alpha1, aaron$beta1)
```
This calculation is very fast, and (in R terms) it's *vectorizable*.
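For instance, since `h_approx` is built only from vectorized arithmetic and `pnorm`, we can compare Piazza to every batter in the data in a single call, no loop required. A sketch (the object name `piazza_vs_all` is an arbitrary choice):
```{r dependson = "h_approx"}
# Probability that Piazza is better than each player, computed in
# one vectorized call
piazza_vs_all <- h_approx(piazza$alpha1, piazza$beta1,
                          career_eb$alpha1, career_eb$beta1)
# share of players Piazza beats with at least 95% posterior probability
mean(piazza_vs_all > .95)
```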
The disadvantage is that for low $\alpha$ or low $\beta$, the normal approximation to the beta is going to fit rather poorly. While the simulation and integration approaches were inexact, this one will be *systematically biased*: in some cases it will always give too high an answer, and in some cases too low. But when we have priors $\alpha_0=`r alpha0`$ and $\beta_0=`r beta0`$, as we do here, our parameters are never going to be low, so we're safe using it.
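To get a sense of the size of that bias, we can compare the exact solution to the approximation on a pair of small-parameter betas (the parameter values are arbitrary, chosen just for illustration):
```{r}
# With parameters this small the two answers visibly disagree:
# the exact solution gives about 0.74, the approximation about 0.72
h(2, 10, 1, 10)
h_approx(2, 10, 1, 10)
```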
## Confidence and credible intervals
In classical (frequentist) statistics, you may have seen this kind of "compare two proportions" problem before, perhaps laid out as a "contingency table".
```{r dependson = "two_players", echo = FALSE}
two_players %>%
transmute(Player = name, Hits = H, Misses = AB - H) %>%
knitr::kable(booktabs = TRUE)
```
One of the most common ways to approach these contingency table problems is with Pearson's chi-squared test, implemented in R as `prop.test`.
```{r dependson = "two_players"}
prop.test(two_players$H, two_players$AB)
```
We see a non-significant p-value of .70, indicating the test couldn't find a difference.[^pvalue] Something else useful that `prop.test` gives you is a confidence interval for the difference between the two players. We learned in Chapter \@ref(credible-intervals) about using **credible intervals** to represent the uncertainty in each player's average. Now we'll use empirical Bayes to compute the credible interval about the *difference* in these two players.
[^pvalue]: We won't talk about p-values here (we talked a little about ways to translate between p-values and posterior probabilities in Chapter \@ref(hypothesis-testing)), but we can agree it would have been strange if the p-value were significant, given that the posterior distributions overlapped so much.
We could do this with simulation or integration, but we'll use our normal approximation approach since it's the most efficient (we'll also compute our posterior probability while we're at it).
```{r credible_interval_approx, dependson = "two_players"}
credible_interval_approx <- function(a, b, c, d) {
u1 <- a / (a + b)
u2 <- c / (c + d)
var1 <- a * b / ((a + b) ^ 2 * (a + b + 1))
var2 <- c * d / ((c + d) ^ 2 * (c + d + 1))
mu_diff <- u2 - u1
sd_diff <- sqrt(var1 + var2)
data_frame(posterior = pnorm(0, mu_diff, sd_diff),
estimate = mu_diff,
conf.low = qnorm(.025, mu_diff, sd_diff),
conf.high = qnorm(.975, mu_diff, sd_diff))
}
credible_interval_approx(piazza$alpha1, piazza$beta1,
aaron$alpha1, aaron$beta1)
```
It's not particularly exciting for this Piazza/Aaron comparison (notice it's very close to the confidence interval we calculated with `prop.test`). So let's select 20 random players, and compare each of them to Mike Piazza: how many players can we say are better than Piazza? We'll also calculate the confidence interval using `prop.test`, and compare them (Figure \@ref(fig:intervalcompare)).
```{r intervals, dependson = "credible_interval_approx", echo = FALSE}
set.seed(2016)
intervals <- career_eb %>%
filter(AB > 10) %>%
sample_n(20) %>%
group_by(name, H, AB) %>%
do(credible_interval_approx(piazza$alpha1, piazza$beta1, .$alpha1, .$beta1)) %>%
ungroup() %>%
mutate(name = reorder(paste0(name, " (", H, " / ", AB, ")"), -estimate))
```
```{r intervalcompare, dependson = "intervals", echo = FALSE, fig.cap = "Confidence and credible intervals for comparing the batting of 20 randomly selected players to Mike Piazza. Players are sorted in increasing order of the number of at-bats."}
f <- function(H, AB) broom::tidy(prop.test(c(H, piazza$H), c(AB, piazza$AB)))
prop_tests <- purrr::map2_df(intervals$H, intervals$AB, f) %>%
mutate(estimate = estimate1 - estimate2,
name = intervals$name)
all_intervals <- bind_rows(
mutate(intervals, type = "Credible"),
mutate(prop_tests, type = "Confidence")
)
all_intervals %>%
mutate(name = reorder(name, -AB, na.rm = TRUE)) %>%
ggplot(aes(x = estimate, y = name, color = type)) +
geom_point() +
geom_errorbarh(aes(xmin = conf.low, xmax = conf.high)) +
xlab("Piazza average - player average") +
ylab("Player")
```
Notice the same pattern we saw in Chapter \@ref(confidence-credible). When we don't have a lot of information about a player, their credible interval ends up narrower than their confidence interval, because we're able to use the prior to adjust our expectations (Harry Little's batting average may be lower than Mike Piazza's, but we're confident it's not .3 lower). When we do have a lot of information, the credible intervals and confidence intervals converge almost perfectly.[^derivation]
[^derivation]: This can be derived mathematically, based on the fact that `prop.test`'s confidence interval is in fact very similar to our normal approximation along with an uninformative prior and a small continuity correction, but it's left as an exercise for the reader.
Thus, we can think of empirical Bayes A/B credible intervals as being a way to "shrink" frequentist confidence intervals, by sharing power across players.
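As a quick numerical check of that footnote's claim, we can feed `credible_interval_approx` the raw hit and miss counts, which amounts to using no prior at all, and compare the result to `prop.test`'s interval:
```{r dependson = "credible_interval_approx"}
# With an (improper) uninformative prior, our normal-approximation
# credible interval...
credible_interval_approx(piazza$H, piazza$AB - piazza$H,
                         aaron$H, aaron$AB - aaron$H)
# ...nearly matches prop.test's confidence interval
prop.test(two_players$H, two_players$AB)$conf.int
```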