forked from poldrack/psych10-book
-
Notifications
You must be signed in to change notification settings - Fork 0
/
08-Resampling.Rmd
242 lines (165 loc) · 14.6 KB
/
08-Resampling.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
---
output:
bookdown::gitbook:
lib_dir: "book_assets"
includes:
in_header: google_analytics.html
pdf_document: default
html_document: default
---
# Resampling and simulation
```{r echo=FALSE,warning=FALSE,message=FALSE}
library(tidyverse)
library(ggplot2)
library(cowplot)
library(knitr)
set.seed(123456) # set random seed to exactly replicate results
# load the NHANES data library
library(NHANES)
# drop duplicated IDs within the NHANES dataset
NHANES <- NHANES %>%
dplyr::distinct(ID,.keep_all=TRUE)
NHANES_adult <- NHANES %>%
drop_na(Height) %>%
subset(Age>=18)
```
The use of computer simulations has become an essential aspect of modern statistics. For example, one of the most important books in practical computer science, called *Numerical Recipes*, says the following:
> "Offered the choice between mastery of a five-foot shelf of analytical statistics books and middling ability at performing statistical Monte Carlo simulations, we would surely choose to have the latter skill."
In this chapter we will introduce the concept of a Monte Carlo simulation and discuss how it can be used to perform statistical analyses.
## Monte Carlo simulation
The concept of Monte Carlo simulation was devised by the mathematicians Stan Ulam and Nicholas Metropolis, who were working to develop an atomic weapon for the US as part of the Manhattan Project. They needed to compute the average distance that a neutron would travel in a substance before it collided with an atomic nucleus, but they could not compute this using standard mathematics.
Ulam realized that these computations could be simulated using random numbers, just like a casino game. In a casino game such as a roulette wheel, numbers are generated at random; to estimate the probability of a specific outcome, you could play the game hundreds of times. Ulam's uncle had gambled at the Monte Carlo casino in Monaco, which is apparently where the name came from for this new technique.
There are four steps to performing a Monte Carlo simulation:
1. Define a domain of possible values
2. Generate random numbers within that domain from a probability distribution
3. Perform a computation using the random numbers
4. Combine the results across many repetitions
As an example, let's say that I want to figure out how much time to allow for an in-class quiz. Say that we know that the distribution of quiz completion times is normal, with mean of 5 minutes and standard deviation of 1 minute. Given this, how long does the test period need to be so that we expect all students to finish the exam 99% of the time? There are two ways to solve this problem. The first is to calculate the answer using a mathematical theory known as the statistics of extreme values. However, this involves complicated mathematics. Alternatively, we could use Monte Carlo simulation. To do this, we need to generate random samples from a normal distribution.
## Randomness in statistics
The term "random" is often used colloquially to refer to things that are bizarre or unexpected, but in statistics the term has a very specific meaning: A process is *random* if it is unpredictable. For example, if I flip a fair coin 10 times, the value of the outcome on one flip does not provide me with any information that lets me predict the outcome on the next flip. It's important to note that the fact that something is unpredictable doesn't necessarily mean that it is not deterministic. For example, when we flip a coin, the outcome of the flip is determined by the laws of physics; if we knew all of the conditions in enough detail, we should be able to predict the outcome of the flip. However, many factors combine to make the outcome of the coin flip unpredictable in practice.
Psychologists have shown that humans actually have a fairly bad sense of randomness. First, we tend to see patterns when they don't exist. In the extreme, this leads to the phenomenon of *pareidolia*, in which people will perceive familiar objects within random patterns (such as perceiving a cloud as a human face or seeing the Virgin Mary in a piece of toast). Second, humans tend to think of random processes as self-correcting, which leads us to expect that we are "due for a win" after losing many rounds in a game of chance, a phenomenon known as the "gambler's fallacy".
## Generating random numbers {#generating-random-numbers}
Running a Monte Carlo simulation requires that we generate random numbers. Generating truly random numbers (i.e. numbers that are completely unpredictable) is only possible through physical processes, such as the decay of atoms or the rolling of dice, which are difficult to obtain and/or too slow to be useful for computer simulation (though they can be obtained from the [NIST Randomness Beacon](https://www.nist.gov/programs-projects/nist-randomness-beacon])).
In general, instead of truly random numbers we use *pseudo-random* numbers generated using a computer algorithm; these numbers will seem random in the sense that they are difficult to predict, but the series of numbers will actually repeat at some point. For example, the random number generator used in R will repeat after $2^{19937} - 1$ numbers. That's far more than the number of seconds in the history of the universe, and we generally think that this is fine for most purposes in statistical analysis.
In R, there is a function to generate random numbers for each of the major probability distributions, such as:
* `runif()` - uniform distribution (all values between 0 and 1 equally)
* `rnorm()` - normal distribution
* `rbinom()` - binomial distribution (e.g. rolling the dice, coin flips)
Figure \@ref(fig:rngExamples) shows examples of numbers generated using the `runif()` and `rnorm()` functions, generated using the following code:
```{r rngExamples,echo=FALSE, fig.cap="Examples of random numbers generated from a uniform (left) or normal (right) distribution.",fig.width=8,fig.height=4,out.height='50%'}
p1 <-
tibble(
x = runif(10000)
) %>%
ggplot((aes(x))) +
geom_histogram(bins = 100) +
labs(title = "Uniform")
p2 <-
tibble(
x = rnorm(10000)
) %>%
ggplot(aes(x)) +
geom_histogram(bins = 100) +
labs(title = "Normal")
plot_grid(p1, p2, ncol = 3)
```
You can also generate random numbers for any distribution if you have a *quantile* function for the distribution. This is the inverse of the cumulative distribution function; instead of identifying the cumulative probabilities for a set of values, the quantile function identifies the values for a set of cumulative probabilities. Using the quantile function, we can generate random numbers from a uniform distribution, and then map those into the distribution of interest via its quantile function.
By default, R will generate a different set of random numbers every time you run one of the random number generator functions described above. However, it is also possible to generate exactly the same set of random numbers, by setting what is called the *random seed* to a specific value. We will do this in many of the examples in this book, in order to make sure that the examples are reproducible.
If we run the ```rnorm()``` function twice, it will give us different sets of pseudorandom numbers each time:
```{r}
print(rnorm(n = 5))
print(rnorm(n = 5))
```
However, if we set the random seed to the same value each time using the ```set.seed()``` function, then it will give us the same series of pseudorandom numbers each time:
```{r}
set.seed(12345)
print(rnorm(n = 5))
set.seed(12345)
print(rnorm(n = 5))
```
## Using Monte Carlo simulation
Let's go back to our example of exam finishing times. Let's say that I administer three quizzes and record the finishing times for each student for each exam, which might look like the distributions presented in Figure \@ref(fig:finishingTimes).
```{r finishingTimes, echo=FALSE,fig.cap="Simulated finishing time distributions.",fig.width=8,fig.height=4,out.height='50%'}
finishTimeDf <- tibble(finishTime=rnorm(3*150,mean=5,sd=1),
quiz=kronecker(c(1:3),rep(1,150)))
ggplot(finishTimeDf,aes(finishTime)) +
geom_histogram(bins=25) +
facet_grid(. ~ quiz) +
xlim(0,10)
```
However, what we really want to know is not what the distribution of finishing times looks like, but rather what the distribution of the *longest* finishing time for each quiz looks like. To do this, we can simulate the finishing time for a quiz, using the assumption that the finishing times are distributed normally, as stated above; for each of these simulated quizzes, we then record the longest finishing time. We repeat this simulation a large number of times (5000 should be enough) and record the distribution of finishing times, which is shown in Figure \@ref(fig:finishTimeSim).
```{r finishTimeSim,echo=FALSE,fig.cap="Distribution of maximum finishing times across simulations.",fig.width=4,fig.height=4,out.height='50%'}
# sample maximum value 5000 times and compute 99th percentile
nRuns <- 5000
sampSize <- 150
sampleMax <- function(sampSize = 150) {
samp <- rnorm(sampSize, mean = 5, sd = 1)
return(max(samp))
}
maxTime <- replicate(nRuns, sampleMax())
cutoff <- quantile(maxTime, 0.99)
tibble(maxTime) %>%
ggplot(aes(maxTime)) +
geom_histogram(bins = 100) +
geom_vline(xintercept = cutoff, color = "red")
```
This shows that the 99th percentile of the finishing time distribution falls at `r I(cutoff)`, meaning that if we were to give that much time for the quiz, then everyone should finish 99% of the time. It's always important to remember that our assumptions matter -- if they are wrong, then the results of the simulation are useless. In this case, we assumed that the finishing time distribution was normally distributed with a particular mean and standard deviation; if these assumptions are incorrect (and they almost certainly are), then the true answer could be very different.
## Using simulation for statistics: The bootstrap
So far we have used simulation to demonstrate statistical principles, but we can also use simulation to answer real statistical questions. In this section we will introduce a concept known as the *bootstrap* that lets us use simulation to quantify our uncertainty about statistical estimates. Later in the course, we will see other examples of how simulation can often be used to answer statistical questions, especially when theoretical statistical methods are not available or when their assumptions are too difficult to meet.
### Computing the bootstrap
In the section above, we used our knowledge of the sampling distribution of the mean to compute the standard error of the mean and confidence intervals. But what if we can't assume that the estimates are normally distributed, or we don't know their distribution? The idea of the bootstrap is to use the data themselves to estimate an answer. The name comes from the idea of pulling one's self up by one's own bootstraps, expressing the idea that we don't have any external source of leverage so we have to rely upon the data themselves. The bootstrap method was conceived by Bradley Efron of the Stanford Department of Statistics, who is one of the world's most influential statisticians.
The idea behind the bootstrap is that we repeatedly sample from the actual dataset; importantly, we sample *with replacement*, such that the same data point will often end up being represented multiple times within one of the samples. We then compute our statistic of interest on each of the bootstrap samples, and use the distribution of those estimates.
Let's start by using the bootstrap to estimate the sampling distribution of the mean, so that we can compare the result to the standard error of the mean (SEM) that we discussed earlier.
```{r echo=FALSE}
# perform the bootstrap to compute SEM and compare to parametric method
nRuns <- 2500
sampleSize <- 32
heightSample <-
NHANES_adult %>%
sample_n(sampleSize)
bootMeanHeight <- function(df) {
bootSample <- sample_n(df, dim(df)[1], replace = TRUE)
return(mean(bootSample$Height))
}
bootMeans <- replicate(nRuns, bootMeanHeight(heightSample))
SEM_standard <- sd(heightSample$Height) / sqrt(sampleSize)
SEM_bootstrap <- sd(bootMeans)
```
```{r bootstrapSEM,echo=FALSE,fig.cap="An example of bootstrapping to compute the standard error of the mean. The histogram shows the distribution of means across bootstrap samples, while the red line shows the normal distribution based on the sample mean and standard deviation.",fig.width=4,fig.height=4,out.height='50%'}
options(pillar.sigfig = 3)
tibble(bootMeans=bootMeans) %>%
ggplot(aes(bootMeans)) +
geom_histogram(aes(y=..density..),bins=50) +
stat_function(fun = dnorm, n = 100,
args = list(mean = mean(heightSample$Height),
sd = SEM_standard),
size=1.5,color='red'
)
```
Figure \@ref(fig:bootstrapSEM) shows that the distribution of means across bootstrap samples is fairly close to the theoretical estimate based on the assumption of normality. We can also use the bootstrap samples to compute a confidence interval for the mean, simply by computing the quantiles of interest from the distribution of bootstrap samples.
```{r echo=FALSE}
# compute bootstrap confidence interval
bootCI <- quantile(bootMeans, c(0.025, 0.975))
#bootCI$type <- c('Bootstrap')
# now let's compute the confidence intervals using the sample mean and SD
sampleMean <- mean(heightSample$Height)
normalCI <-
tibble(
"2.5%" = sampleMean - 1.96 * SEM_standard,
"97.5%" = sampleMean + 1.96 * SEM_standard
)
allCI <- rbind(normalCI, bootCI)
allCI$type <- c('Normal', 'Bootstrap')
allCI <- allCI %>% dplyr::select(type, `2.5%`, `97.5%`)
kable(allCI, digits=2,
caption="Confidence limits for normal distribution and bootstrap methods")
```
We would not usually employ the bootstrap to compute confidence intervals for the mean (since we can generally assume that the normal distribution is appropriate for the sampling distribution of the mean, as long as our sample is large enough), but this example shows how the method gives us roughly the same result as the standard method based on the normal distribution. The bootstrap would more often be used to generate standard errors for estimates of other statistics where we know or suspect that the normal distribution is not appropriate.
## Learning objectives
After reading this chapter, you should be able to:
* Describe the concept of a Monte Carlo simulation.
* Describe the meaning of randomness in statistics
* Obtain random numbers from the uniform and normal distributions
* Describe the concept of the bootstrap
## Suggested readings
- *Computer Age Statistical Inference: Algorithms, Evidence and Data Science*, by Bradley Efron and Trevor Hastie