forked from poldrack/psych10-book
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path07b-SamplingInR.Rmd
227 lines (170 loc) · 6.22 KB
/
07b-SamplingInR.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
---
output:
pdf_document: default
bookdown::gitbook:
lib_dir: "book_assets"
includes:
in_header: google_analytics.html
html_document: default
---
# Sampling in R
First we load the necessary libraries and set up the NHANES adult dataset
```{r warning=FALSE,message=FALSE}
library(tidyverse)
library(ggplot2)
library(knitr)
library(cowplot)
set.seed(123456)
opts_chunk$set(tidy.opts=list(width.cutoff=80))
options(tibble.width = 60)
# load the NHANES data library
library(NHANES)
# create a NHANES dataset without duplicated IDs
NHANES <-
NHANES %>%
distinct(ID, .keep_all = TRUE)
#create a dataset of only adults
NHANES_adult <-
NHANES %>%
filter(
Age >= 18
) %>%
drop_na(Height)
```
## Sampling error (Section \@ref(samplingerror))
Here we will repeatedly sample from the NHANES Height variable in order to obtain the sampling distribution of the mean.
```{r}
sampSize <- 50 # size of sample
nsamps <- 5000 # number of samples we will take
# set up variable to store all of the results
sampMeans <- tibble(meanHeight=rep(NA,nsamps))
# Loop through and repeatedly sample and compute the mean
for (i in 1:nsamps) {
sampMeans$meanHeight[i] <- NHANES_adult %>%
sample_n(sampSize) %>%
summarize(meanHeight=mean(Height)) %>%
pull(meanHeight)
}
```
Now let's plot the sampling distribution. We will also overlay the sampling distribution of the mean predicted on the basis of the population mean and standard deviation, to show that it properly describes the actual sampling distribution.
```{r fig.width=8,fig.height=4,out.height='50%'}
# pipe the sampMeans data frame into ggplot
sampMeans %>%
ggplot(aes(meanHeight)) +
# create histogram using density rather than count
geom_histogram(
aes(y = ..density..),
bins = 50,
col = "gray",
fill = "gray"
) +
# add a vertical line for the population mean
geom_vline(xintercept = mean(NHANES_adult$Height),
size=1.5) +
# add a label for the line
annotate(
"text",
x = 169.6,
y = .4,
label = "Population mean",
size=6
) +
# label the x axis
labs(x = "Height (inches)") +
# add normal based on population mean/sd
stat_function(
fun = dnorm, n = sampSize,
args = list(
mean = mean(NHANES_adult$Height),
sd = sd(NHANES_adult$Height)/sqrt(sampSize)
),
size = 1.5,
color = "black",
linetype='dotted'
)
```
## Central limit theorem
The central limit theorem tells us that the sampling distribution of the mean becomes normal as the sample size grows. Let's test this by sampling a clearly non-normal variable and look at the normality of the results using a Q-Q plot. We saw in Figure \@ref(fig:alcDist50) that the variable `AlcoholYear` is distributed in a very non-normal way. Let's first look at the Q-Q plot for these data, to see what it looks like. We will use the `stat_qq()` function from `ggplot2` to create the plot for us.
```{r fig.width=4, fig.height=4}
# prepare the dta
NHANES_cleanAlc <- NHANES %>%
drop_na(AlcoholYear)
ggplot(NHANES_cleanAlc, aes(sample=AlcoholYear)) +
stat_qq() +
# add the line for x=y
stat_qq_line()
```
We can see from this figure that the distribution is highly non-normal, as the Q-Q plot diverges substantially from the unit line.
Now let's repeatedly sample and compute the mean, and look at the resulting Q-Q plot. We will take samples of various sizes to see the effect of sample size. We will use a function from the `dplyr` package called `do()`, which can run a large number of analyses at once.
```{r}
set.seed(12345)
sampSizes <- c(16, 32, 64, 128) # size of sample
nsamps <- 1000 # number of samples we will take
# create the data frame that specifies the analyses
input_df <- tibble(sampSize=rep(sampSizes,nsamps),
id=seq(nsamps*length(sampSizes)))
# create a function that samples and returns the mean
# so that we can loop over it using replicate()
get_sample_mean <- function(sampSize){
meanAlcYear <-
NHANES_cleanAlc %>%
sample_n(sampSize) %>%
summarize(meanAlcoholYear = mean(AlcoholYear)) %>%
pull(meanAlcoholYear)
return(tibble(meanAlcYear = meanAlcYear, sampSize=sampSize))
}
# loop through sample sizes
# we group by id so that each id will be run separately by do()
all_results = input_df %>%
group_by(id) %>%
# "." refers to the data frame being passed in by do()
do(get_sample_mean(.$sampSize))
```
Now let's create separate Q-Q plots for the different sample sizes.
```{r fig.width=8, fig.height=8}
# create empty list to store plots
qqplots = list()
for (N in sampSizes){
sample_results <-
all_results %>%
filter(sampSize==N)
qqplots[[toString(N)]] <- ggplot(sample_results,
aes(sample=meanAlcYear)) +
stat_qq() +
# add the line for x=y
stat_qq_line(fullrange = TRUE) +
ggtitle(sprintf('N = %d', N)) +
xlim(-4, 4)
}
plot_grid(plotlist = qqplots)
```
This shows that the results become more normally distributed (i.e. following the straight line) as the samples get larger.
## Confidence intervals (Section \@ref(confidence-intervals))
Remember that confidence intervals are intervals that will contain the population parameter on a certain proportion of times. In this example we will walk through the simulation that was presented in Section \@ref(confidence-intervals) to show that this actually works properly. Here we will use a function called `do()` that lets us
```{r echo=FALSE}
# compute how often the confidence interval contains the true population mean
nsamples <- 2500
sampSize <- 100
ci_data <- tibble(run=seq(nsamples), sampSize=sampSize) %>%
group_by(run)
# create a function that takes a sample and returns the confidence interval
get_ci <- function(sampSize){
result <- NHANES_adult %>%
sample_n(sampSize) %>%
summarize(
mean = mean(Height),
sem = sd(Height) / sqrt(sampSize)
) %>%
mutate(
CI_upper = mean + 1.96 * sem,
CI_lower = mean - 1.96 * sem
)
}
ci_results <- do(ci_data, get_ci(sampSize))
pop_mean <- mean(NHANES_adult$Height)
ci_results <-
ci_results %>%
mutate(pop_mean_within_ci = CI_upper > pop_mean & CI_lower < pop_mean) %>%
summarize(p_popmean_within_ci=mean(pop_mean_within_ci)) %>%
pull()
```