-
Notifications
You must be signed in to change notification settings - Fork 58
/
Copy pathWeek6_MissingData.Rmd
750 lines (595 loc) · 30.9 KB
/
Week6_MissingData.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
---
title: "Missing Data"
author: "Joshua F. Wiley / Michelle Byrne"
date: "`r Sys.Date()`"
output:
tufte::tufte_html:
toc: true
number_sections: true
---
Download the raw `R` markdown code: https://github.com/michellebyrne1/MonashHonoursStatistics
```{r setup}
options(digits = 2)
## There are two new packages: (1) mice (2) VIM
## both are used for missing data.
## some people have reported also needing the zip pkg
# install.packages("zip") before the other packages install correctly
library(data.table)
library(JWileymisc)
library(mice)
library(VIM)
library(ggplot2)
```
# Missing Data: Multiple Imputation
Missing data are very common, but also problematic
- Results from the non-missing data may be biased
- Missing data cause a loss of efficiency
The easiest approach to "address" missing data is to only analyse
complete cases (i.e., list-wise deletion). This often leads to
inefficiency (e.g., discarding data on X and Z because observation on
variable Y are missing). Also, list wise deletion may bias results
unless the data are missing completely at random (MCAR).
Missing data often are classified:
- Missing completely at random (MCAR) when the missingness mechanism
is completely independent of the estimate of our parameter(s) of
interest.
- Missing at random (MAR) when the missingness mechanism is
*conditionally* independent of the estimate of our parameter(s) of
interest
- Missing not at random (MNAR) when the missingness mechanism is
associated with the estimate of our parameter(s) of interest
**When data are missing completely at random (MCAR)**, listwise
deletion will yield unbiased estimates of the true parameter(s) if the
data had not been missing.
**When data are missing at random (MAR)**,
it is possible to recover unbiased
estimates if the right other variables are present. Our lecture
example will be an example of MAR (see below).
When people had high stress, we set
their positive affect values to missing. So the missing data
mechanism was stress. Now, because stress and positive affect were
correlated, low positive affect values were more likely to be
missing. Thus, the complete cases only provide a biased
sample. However, we DID have stress measured, and this was
actually the cause of the missingness. So, if we condition on
stress, we can obtain unbiased results for positive affect. This
is MAR. The data are biased if using complete cases, but there
exists some other or combination of other variables that, if
conditioned correctly on, can recover unbiased estimates from the
data despite missingness.
**When data are missing not at random (MNAR)**,
you cannot recover unbiased estimates.
A simple example of this would be if less happy people
were more likely to be missing. So rather than stress driving
missingness, its actually the positive affect scores driving
missingness on positive affect. The data we would need to recover
unbiased estimates (positive affect) is then itself missing and so
we have no ability to adjust for it. We often talk about this
based on whether the parameter(s) we are interested in estimating
are also causes (or not) of the missingness. If mean positive
affect is a cause of missingness, then mean positive affect will
be MNAR. If mean stress is a cause of missingness on mean positive
affect and we have stress, then mean positive affect will be
MAR. If age is a cause of missingness on positive affect BUT mean
age is unrelated to mean positive affect, mean positive affect is
MCAR because even though data are systematically missing, for the
parameter we care about (mean positive affect) it is completely
random relative to the missingness.
Our parameters may be anything, such as a mean, a standard deviation,
a regression coefficient, etc. We will explore **multiple imputation
(MI)** as one robust way to address missing data.
## Multiple Imputation (MI) Theory
Multiple imputation is one robust way to address missing data. It
involves generate multiple, different datasets where in each one,
different, plausible values are imputed for the missing data,
resulting in each imputed dataset have "complete" data since missing
data are filled in by predictions.
As the problem with missing data is very generic and impacts every
statistic, even the most basic decriptive statistics, we need a
general language for talking about it.
- Let $Q$ be some population value (e.g., a mean, a regression
coefficient).
- Let $\hat{Q}$ be an estimate of $Q$ with some estimate of
uncertainty due to sampling variation, calculated typically in each
imputed dataset.
- Let $\bar{Q}$ be the average of a set of estimates, $\hat{Q}$ across
different imputed datasets, with some estimate of uncertainty both
due to sampling variation impacting $\hat{Q}$ and missing data
uncertainty (causing variation in $\hat{Q}$ from one imputed dataset
to the next.
We will not discuss how to estimate $\hat{Q}$ as this is specific to
the analysis being performed (e.g., mean, regression coefficient, mean
difference from t-test, etc.). It does not really matter for the
purpose of MI. All we need is an estimate and (valid) standard error.
The general steps in MI are as follows:
1. Start with the incomplete data (the raw dataset with missing
data).
2. Generate $m$ datasets with no missingness, by filling in *different*
plausible values for any missing data. We will discuss this more
later.
3. Perform the analysis of interest on **each** imputed dataset. That
is, the analysis of interest is repeated $m$ times. This generates
$m$ different $\hat{Q}$ estimates and associated standard errors.
4. Pool the results from the analyses run on each imputed dataset to
generate an overall estimate, $\bar{Q}$.
Previously in statistics, we have grown accustomed to reporting
$\hat{Q}$, whether that be a mean, correlation, regression
coefficient, or any other statistic of interest. With missing data
addressed using MI, we instead report $\bar{Q}$, which is defined as:
$$ \bar{Q} = \frac{1}{m}\sum_{i = 1}^{m}\hat{Q}_{i} $$
This is simply the average of the $\hat{Q}$ estimates from each
individual imputed dataset. What is more unique and forms an important
difference between multiple imputation and other simpler methods
(e.g., single mean imputation) is the variance estimate.
Let $\hat{V}_{i}$ be the variance estimate (e.g., squared standard
error) for the $ith$ imputed dataset for $\hat{Q}_{i}$. Then:
$$ \bar{V} = \frac{1}{m}\sum_{i = 1}^{m}\hat{V}_{i} $$
$\bar{V}$ is the average uncertainty estimate of $\hat{Q}$ across the
multiply imputed datasets.
However, there is another source of uncertainty in our estimate of
$\bar{Q}$. The between imputed dataset variation, which is:
$$ B = \frac{1}{m - 1}\sum_{i = 1}^{m}(\hat{Q}_{i} - \bar{Q})^2 $$
$B$ captures the variance in the estimates, $\hat{Q}$. Because the
observed, non-missing data never change from one imputed dataset to
another, the only reason that $\hat{Q}$ will change is when the
plausible values imputed for the missing data change. Thus $B$ is an
estimate of the uncertainty in $\bar{Q}$ due to missing data.
Finally, for practical reasons we do not generate an infinite number
of imputed datasets. We could simply continue multiply imputing more
datasets, but this takes additional time. Because of this we do not
have the *population* of imputed datasets, instead we have a (random)
sample of all possible multiply imputed datasets. This results in some
simulation error which is added to the uncertainty in our $\bar{Q}$
estimate. Specifically, this uncertainty is:
$$\frac{B}{m}$$
Now we finally have all the pieces to determine what the total
uncertainty in our average estimate, $\bar{Q}$ is:
$$T = \bar{V} + B + \frac{B}{m}$$
Practically, this can be thought of as: the uncertainty in the
estimate of our statistic of interest from multiply imputed data,
$\bar{Q}$ is due to: (1) sampling variation $\bar{V}$, (2) uncertainty
in how we imputed the missing data $B$, and (3) uncertainty because we
did not generate an infinite number of imputed datasets,
$\frac{B}{m}$. One implication is that we can reduce the uncertainty
somewhat, simply by generating more imputed datasets. However, this
yields diminishing returns and slows analysis (remember, the analysis
of interest must be run separately **on each imputed dataset**. 100
imputed datasets would mean running an analysis 100 times, which takes
more than running it for, say, 20 imputed datasets. In most cases,
with modern computers, people recommend 25 - 100 imputed datasets now.
## Multiple Imputation Steps
The general process to generate multiply imputed using a robust
approach: fully conditional specification (FCS) or multiple imputation
through chained equations (mice) is:
1. Start with a raw dataset with missing data
2. Fill in the missing data values with **initial** estimates.
*The first "fill in" estimates are generally fast and easy
to generate. For example, fill in missing using the median for each
variable or even random numbers within the observed data range.*
3. For each variable that has missing data, build a prediction model
from other variables.
*Typically, all other variables in the dataset
being imputed are used in the prediction model, but it is possible to
use a subset. The most common case is to use Generalized Linear Models
(GLMs) as the prediction model, so linear regression for continuous
outcomes, logistic regression for binary outcomes, etc. However, any
prediction model can be used, including more complex methods such as
machine learning.*
4. Use the model to predict the missing data.
*At this step, the average / best prediction from the
model is used without capturing uncertainty. That is because this is
the dataset used to build the "final" model for actually imputing the
missing data. While we want uncertainty in our imputations, we do not
want the uncertainty in the data used to build the imputation model.*
5. Repeat steps 2 and 3 until the predicted complete dataset does not
vary within some tolerance from one iteration to the next, indicating
convergence.
These steps may be easier to understand with a concrete example. Here
we setup a small dataset with age and inflammation.
```{r}
set.seed(12345)
x <- data.frame(Age = sort(round(runif(10, 30, 65))))
x$Inflammation <- rnorm(10, x$Age/10, sd = .2)
x$Missing <- as.logical(rep(FALSE:TRUE, c(9, 1))) # first indicate which (one) data point will be missing by creating a var called "missing"
scatterp <- ggplot(x, aes(Age, Inflammation)) +
stat_smooth(method = "lm", formula = y ~ x, se=FALSE) +
geom_point(aes(colour = Missing, size = Missing)) +
theme_classic() + theme(legend.position = c(.8, .2))
print(scatterp)
## create a missing dataset (i.e., where we said we wanted the missing data to be, replace the value of inflammation with NA)
y <- x
y$Inflammation[y$Missing] <- NA
```
Here is a for loop that iterates through filling in missing values,
first using the median and then based on linear regression.
The graphs following show the scatter plots with the filled in missing
value at the first iteration and then later iterations.
Watch how the imputed missing value changes from one iteration to the
next and how with later iterations, the amount of change gets
smaller and smaller.
The goal behind iterating is to build a prediction
model to impute the missing data. However, building that model in
the first place is difficult when the data on which the model
would be built are (partially) missing. To deal with this, we
iterate between predicting the missing data and (re)building a
model on the new dataset, now that missingness has been imputed.
Multiple iterations are required because every time we generate new
predictions to fill in the missing data, the dataset on which the
prediction model was built changes, so we must rebuild the
prediction model using the latest iteration of the dataset. This
process is repeated until the change from one iteration to the
next is very small. At that point, we say the results have
**converged**.
```{r, fig.show = "animate", animation.hook="gifski", warning = FALSE, message = FALSE}
## create an imputed dataset
yimp <- y
p <- vector("list", length = 10)
imps <- vector("numeric", length = 10)
# the first loop (i==1) uses the median, then subsequent loops build a linear model from the last loop (and extrapolate the next loop's missing value from that model)
for (i in 1:10) {
if (i == 1) {
imps[i] <- median(y$Inflammation, na.rm = TRUE)
} else if (i > 1) {
m <- lm(Inflammation ~ Age, data = yimp)
imps[i] <- fitted(m)[yimp$Missing]
}
yimp$Inflammation[yimp$Missing == TRUE] <- imps[i]
p[[i]] <- scatterp %+% yimp + ggtitle(sprintf("Iteration = %d", i))
if (i > 4) {
p[[i]] <- p[[i]] + coord_cartesian(xlim = c(55, 66), ylim = c(5.4, 7), expand=FALSE)
}
print(p[[i]]) # what we're printing here is a scatterplot of each loop
}
```
We can print out the imputed values too and see how those change. This
highlights how over iterations, there are big changes at first, and
with more iterations the changes are smaller and smaller. This is what
is meant by convergence where you converge to a final, stable value.
We could keep going by at some point we are close enough, we can stop.
We might decide,say, that once it does not change beyond the 2nd
decimal place, we stop.
```{r}
print(imps, digits = 5)
```
This gets us through Steps 2 to 5. The final step is to use those
final models to impute missing data,
generating the desired number of imputed datasets (e.g., 100).
This can be a rather complex step. Generating the
average prediction from, say, a linear regression is straight
forward. However, that average prediction does not take into account
that the linear regression model is uncertain in its prediction. To
take uncertainty into account, from parametric models like linear
regression, we draw random values from the predicted distribution. In
the case of linear regression that is: a normal distribution with mean
equal to the average prediction and standard deviation equal to the
residual standard deviation. This both takes the model into account
but also by using the residual standard deviation and drawing random
numbers, we ensure that when generating multiple imputed datasets, we
don't pretend we can fill in missing data values perfectly. We fill
them in with some error / uncertainty, which is important for the MI
to generate accurate standard errors, confidence intervals and
p-values. For non-parametric models, such as random forest machine
learning models, we do not have an assumed distribution so we cannot
draw random values from a distribution with some mean and standard
deviation estimate. Instead, uncertainty in predictions must be
captured in other ways, such as via bootstrapping. However, the
details of this are beyond the scope of this lecture. Finally, when
doing Bayesian modeling, the predicted values for missing data are
very easily drawn from the posterior distribution. A fact that means
its fairly common for MI to be implemented within a Bayesian
framework, even if behind the scenes.
Often, the prediction models are generalized linear models (GLMs),
such as linear regression. GLMs are familiar and fast, but there are
drawbacks.
- GLMs assume linear relationships on the scale of the link
function. This can be problematic for MI as if you are imputing, say,
50 variables it is very time consuming to manually check whether the
assumption is met.
- GLMs only include interactions between variables when specified by
the analyst. Often, analysts do not know which variables should
interact. If there are, say, 50 variables, the possible number of
interactions is overwhelming.
- In small datasets (e.g., 100 people) it is easily possible there may
be more variables then people. GLMs require that the sample size be
larger then the number of predictors, making them a poor choice for MI
in these cases.
We do not discuss it here, but many of these limitations can be
addressed using alternate prediction models, especially in machine
learning.
Once multiple imputed datasets are generated, the previously discussed
steps are performed. The analysis of interest is run on each dataset
to generate estimates for the parameters of interest, $\hat{Q}$ and
standard errors and these are combined (pooled) to generate an overall
estimate and an uncertainty estimate that accounts for sampling
variation, missing data uncertainty, and simulation error from finite
number of imputed datasets. These are then reported as usual
(estimate, standard error, confidence interval, p-value, etc.).
## MI in `R`
Before we can try MI in `R` we need to make a dataset with missing
values. It is possible to do multilevel imputation, but it is more
complicated and outside the scope of this lecture.
To begin with, we will make a single level dataset. For the sake of
example, we will take only complete data and then add missing data
ourselves. This allows us to compare different approaches with the
"truth" from the non-missing data (_i.e., sensitivity analysis - MLB_)
The code below is not important to understand. It just creates three
datasets:
1. The raw daily diary dataset we've worked with
2. A complete case, between person dataset with no missing values.
3. A between person dataset with missing values
```{r}
## read in the dataset
data(aces_daily)
d <- as.data.table(aces_daily)
## between person data, no missing - using the na.omit function
davg <- na.omit(d[, .(
Female = factor(na.omit(Female)[1], levels = 0:1),
Age = na.omit(Age)[1],
STRESS = mean(STRESS, na.rm = TRUE),
PosAff = mean(PosAff, na.rm = TRUE),
NegAff = mean(NegAff, na.rm = TRUE)),
by = UserID])
## create missing data
davgmiss <- copy(davg)
davgmiss[STRESS < 1, NegAff := NA]
davgmiss[STRESS > 4, PosAff := NA]
## random missingness on age and Female
set.seed(1234)
davgmiss[rbinom(.N, size = 1, prob = .1) == 1, Age := NA_real_]
davgmiss[rbinom(.N, size = 1, prob = .05) == 1, Female := NA]
## drop unneeded variables to make analysis easier
davgmiss[, UserID := NULL]
```
Before doing any imputation, it is a good idea to examine data. The
`VIM` package in `R` has helpful functions for exploring missing
data. We start by looking at the amount missing on each variable and
the patterns of missing data using the `aggr()` function. This shows
us that just over half of cases have complete data on all
variables.
The left hand side of this missing data plot shows the
proportion of missing data on each individual variable. Variables
where the bars have no height are variables with no (or very nearly
no) missing data. Variables with higher bars have a greater proportion
of missing observations on that variable.
The right hand side shows different patterns of missing data. Blue
indicates present data. Red indicates missing data. About 52% of
people have all variables present. The next most common pattern is to
have all variables present but be missing just negative affect, with
about 17% of people. The next most common pattern is having all
variable but to be missing positive affect, with about 13% of people.
Identifying the patterns of missing data can be useful for identifying
whether there is enough overlap of relevant variables to predict
missingness and to identify if there are any unexpected patterns that
might indicate problems with the data.
```{r}
aggr(davgmiss, prop = TRUE,
numbers = TRUE)
```
A margin plot is an augmented scatter plot that in the margins shows
the values on one variable when missing on the other along with a
boxplot. This shows us that when negative affect is missing, stress
scores are all low between 0 and 1. There are no missing stress scores
so there is only one boxplot for negative affect.
The margin plot starts with a scatter plot. Blue dots
are observed data. Red dots are imputed data. The values of these can
help to see if imputations fall outside the range of observed data,
fit with the rest of the trend from the observed data or identify any
other potential issues. Outside the scatter plot are up to two
boxplots on each axis. Examining the x axis, here STRESS, the blue
boxplot shows the distribution of stress when negative affect (the y
axis) is not missing. The red boxplot shows the distribution of stress
when negative affect (the y axis) **is** missing. These boxplots make
it clear that the distribution of stress is very different when
negative affect is missing or not. On the y axis, there is only a blue
boxplot, because no stress values are missing. Thus there can only be
the observed distribution of negative affect when stress is present.
```{r}
marginplot(davgmiss[,.(STRESS, NegAff)])
```
We can test whether levels of one variable differ by missing on
another variable. This is often done with simple analyses, like
t-tests or chi-square tests.
These tests tell us whether values on one variable
differ systematically when another variable is missing or present. For
example, do mean stress levels differ between people who are missing
negative affect versus people who are not missing negative affect?
These results can help to identify factors that may be relevant to
*why* people have missing data. Any variable that is related is worth
considering as a candidate variable to be included in any missing data
model (e.g., multiple imputation) even if you do not care about it for
your research question. These tests do not necessarily indicate
whether the data are MCAR, MAR, or MNAR.
```{r}
## does stress differ by missing on negative affect? (Spoiler: we literally told it to earlier!)
t.test(STRESS ~ is.na(NegAff), data = davgmiss)
## does stress differ by missing on positive affect?
t.test(STRESS ~ is.na(PosAff), data = davgmiss)
## does age differ by missing on negative affect?
t.test(Age ~ is.na(NegAff), data = davgmiss)
## does age differ by missing on positive affect?
t.test(Age ~ is.na(PosAff), data = davgmiss)
chisq.test(davgmiss$Female, is.na(davgmiss$PosAff))
chisq.test(davgmiss$Female, is.na(davgmiss$NegAff))
```
Actually conducting MI in `R` is relatively straight forward. A very
flexible, robust approach is available using the `mice` package and
the `mice()` function. You can control almost every aspect of the MI
in `mice()` but you also can use defaults.
`mice()` takes a dataset, the number of imputed datasets to generate,
the number of iterations, the random seed (make it reproducible) and
default imputation methods. The defaults are fine, but you can change
default imputation methods. Below we ask for all GLMs:
1. "norm" which is linear regression for numeric data
2. "logreg" which is logistic regression for factor data with 2 levels
3. "polyreg" which is polytomous or multinomial regression for
unordered factor data with 3+ levels
4. "polr" ordered logistic regression for factor data with 3+ ordered
levels
`R` will do the steps discussed to then generate imputed datasets.
We plot them to see if there is evidence of convergence. From 3
iterations, it may be too few to see evidence of convergence.
```{r}
mi.1 <- mice(
davgmiss,
m = 5, maxit = 3,
defaultMethod = c("norm", "logreg", "polyreg", "polr"),
seed = 1234, printFlag = FALSE)
## plot convergence diagnostics
plot(mi.1, PosAff + NegAff + Female ~ .it | .ms)
```
If we are not convinced the model converged, re-run with more
iterations. Convergence is when there is no clear systematic change
with additional iterations.
The convergence diagnostics show estimated values of
descriptive statistics on some of the variables that have missing
data (note, if no missing data are present on a variable, results will
be same for each iteration so the plot is not needed and will fail).
What we are looking for is no systematic change (e.g., increasing,
decreasing) with additional iterations. Also, we would hope that the
random process leads to more or less the same estimates across the
different imputed datasets. If that is the case, all imputations
should converge around the same average value. If there is a "split"
(e.g., if some imputations have high estimates and others, low
estimates) that would indicate non-convergence. These plots look
pretty good with the 3 + 20 = 23 iterations.
```{r}
## run an additional iterations
mi.1 <- mice.mids(
mi.1, maxit = 20,
printFlag = FALSE)
## plot convergence diagnostics
plot(mi.1, PosAff + NegAff + Female ~ .it | .ms)
```
Once we are satisfied with convergence, we can plot the imputed data
versus observed data. Imputed data are in red.
These diagnostic plots do not necessarily tell you
whether the imputation is "right" or "wrong". They show in blue the
observed density plots and in blue the observed data scatter
plots. The red lines and points represent the distributions from the
imputed data. If observed and imputed distributions and values are
similar, it may suggest that the overall distribution does not change
much when imputed. This may be indicative of the data being MCAR or
that the relevant variables are missing from the imputation model.
Impossible imputed values may warrant changing the imputation model
(e.g., if ages could only range from 0 to 6 years old, and imputed
ages are 15 or -4 that may be a sign that the imputation model [e.g.,
linear] is a poor fit in this instance and a non-linear or other more
complex model is needed to appropriately capture the data).
```{r}
densityplot(mi.1, ~ PosAff + NegAff + Age)
xyplot(mi.1, NegAff + PosAff ~ STRESS)
```
Once you have created the imputed datasets, you can analyse
them. Recall that analyses must be repeated on each individual
dataset. For models that are fitted with a formula interface, you can
use the special `with()` function which works with multiply imputed
datasets. It means that the analysis is automatically run on each
imputed dataset separately. If we view the results, we end up with a
bunch of separate linear regressions.
To pool the results from the individual regressions, we use the
`pool()` function.
The pooled output has several pieces:
- **estimate**: the pooled, average regression coefficients;
- **ubar**: the average variance (based on the squared standard errors)
associated with sampling variation, we called it $\bar{V}$;
- **b**: the between imputation variance;
- **t**: the total variance incorporating sampling variation, between
imputation variance, and simulation error;
- **dfcom**: the degrees of freedom for the complete data analysis (i.e.,
if there had not been missing data);
- **df**: the estimated residual degrees of freedom for the model, taking
into account missingness;
- **riv**: the relative increase variance due to missing data;
- **lambda** or $\lambda$: the proportion of total variance due to
missing data, that is: $\lambda = \frac{B + B/m}{T}$;
- **fmi**: the fraction of missing information
```{r}
mi.reg <- with(mi.1, lm(NegAff ~ PosAff + Age))
mi.reg
pool(mi.reg)
```
We can also summarise the pooled results with confidence intervals
using the `summary()` function on the pooled results.
Finally, we can get a pooled $R^2$ value using the
`pool.r.squared()` function on the analyses.
The summary output includes standard regression
output in this case.
- **estimate**: the pooled, average regression coefficients;
- **std.error**: the standard error of the estimates incorporating all
three sources of MI error, specifically: $se = \sqrt{T}$;
- **statistic**: $\frac{estimate}{std.error}$;
- **df**: the estimated degrees of freedom incorporating uncertainty due
to missing data, combined with the statistic to calculate p-values;
- **p.value**: the p-value, the probability of obtaining the estimate or
larger in the sample given that the true population value was zero.
- **2.5%**: the lower limit of the 95% confidence interval;
- **97.5%**: the upper limit of the 95% confidence interval.
```{r}
m.mireg <- summary(pool(mi.reg), conf.int=TRUE)
print(m.mireg)
pool.r.squared(mi.reg)
```
In general, you can use any analysis you want with multiple
imputation, replacing `lm()` with another function, including `glm()`
and `lmer()`.
Here is an example looking at a logistic regression (albeit a rather
silly one) predicting Female from negative affect and age.
We can select the coefficients and confidence intervals and
exponentiate them to get the odds ratios.
```{r}
mi.reg2 <- with(mi.1, glm(Female ~ NegAff + Age, family = binomial()))
m.mireg2 <- summary(pool(mi.reg2), conf.int=TRUE)
print(m.mireg2)
## odds ratios
exp(m.mireg2[, c("estimate", "2.5 %", "97.5 %")])
```
## Multiple Imputation Comparison (Sensitivity Analysis)
One thing that often is helpful is to compare results under different
models or assumptions. The code that follows is a bit complicated
(don't worry, you don't need to use it yourself), and is used to fit
the same model in three ways:
1. pooled from the multiple imputations that was already done,
labelled "MI Reg".
2. based on the true, non-missing data, since we made the missing data
ourselves and have the true data, labelled, "Truth".
3. Complete case analysis using listwise deletion, labelled "CC".
The graph that follows shows that compared to the true data, the
multiply imputed results have larger confidence intervals slightly. In
this case, the multiply imputed results are closer to the truth in the
absolute estimates (dots in the graph) than are the estimates from the
complete case analysis.
**In general we do not have the true estimates**. In this
example, we have the true estimates because we created the missing
data. In real world settings, all we could do is compare the complete
case analysis (i.e., using listwise deletion, labelled "CC") with the multiple
imputation results (labelled "MI Reg").
```{r}
m.true <- lm(NegAff ~ PosAff + Age, data = davg)
m.cc <- lm(NegAff ~ PosAff + Age, data = davgmiss)
res.true <- as.data.table(cbind(coef(m.true), confint(m.true)))
res.cc <- as.data.table(cbind(coef(m.cc), confint(m.cc)))
res.mireg <- as.data.table(m.mireg[, c("estimate", "2.5 %", "97.5 %")])
setnames(res.true, c("B", "LL", "UL"))
setnames(res.cc, c("B", "LL", "UL"))
setnames(res.mireg, c("B", "LL", "UL"))
res.compare <- rbind(
cbind(Type = "Truth", Param = names(coef(m.true)), res.true),
cbind(Type = "CC", Param = names(coef(m.true)), res.cc),
cbind(Type = "MI Reg", Param = names(coef(m.true)), res.mireg))
ggplot(res.compare, aes(factor(""),
y = B, ymin = LL, ymax = UL, colour = Type)) +
geom_pointrange(position = position_dodge(.4)) +
facet_wrap(~Param, scales = "free")
```
# Summary Table
Here is a little summary of some of the functions used in this
topic.
| Function | What it does |
|----------------|----------------------------------------------|
| `aggr()` | Create a plot of missingness by variable and patterns of mising data for an entire dataset |
| `marginplot()` | Create a scatter plot of the non missing data between two continuous variables with the margins showing where missing data are |
| `mice()` | Main worker function for multiple imputation through chained equations in `R`. Takes a dataset, the number of imputations and number of iterations as a minimum. |
| `mice.mids()` | A way to add more iterations to an existing multiply imputed object from `mice()` |
| `with()` | Function used to run models with a multiply imputed dataset. |
| `pool()` | Pool / aggregate results run on multiply imputed data. |
| `pool.r.squared()` | Calculate pooled $R^2$ value (only for `lm()` models) in multiply imputed analyses.|