-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathcfa.qmd
198 lines (132 loc) · 7.68 KB
/
cfa.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
# Confirmatory factor analysis
This walkthrough will take a look at a famous dataset in SEM, namely the @Holzinger_Swineford_1939 study of the cognitive ability of children in two schools. We will perform a Confirmatory Factor Analysis (CFA) on these data in R using lavaan. In developing these materials, I relied on two excellent sources: the CFA tutorial [on the lavaan website](https://lavaan.ugent.be/tutorial/cfa.html) as well as this video by [Sasha Epskamp](https://youtu.be/4Nlq4JubO6g).
Let's set up the environment and have a look.
```{r}
#| label: setup
#| message: false
library("lavaan")
library("tidyverse") # for some minimal data wrangling
```
```{r}
#| eval: false
?HolzingerSwineford1939
```
```
HolzingerSwineford1939 package:lavaan R Documentation
Holzinger and Swineford Dataset (9 Variables)
Description:
The classic Holzinger and Swineford (1939) dataset consists of
mental ability test scores of seventh- and eighth-grade children
from two different schools (Pasteur and Grant-White). In the
original dataset (available in the ‘MBESS’ package), there are
scores for 26 tests. However, a smaller subset with 9 variables is
more widely used in the literature (for example in Joreskog's 1969
paper, which also uses the 145 subjects from the Grant-White
school only).
Usage:
data(HolzingerSwineford1939)
Format:
A data frame with 301 observations of 15 variables.
‘id’ Identifier
‘sex’ Gender
‘ageyr’ Age, year part
‘agemo’ Age, month part
‘school’ School (Pasteur or Grant-White)
‘grade’ Grade
‘x1’ Visual perception
‘x2’ Cubes
‘x3’ Lozenges
‘x4’ Paragraph comprehension
‘x5’ Sentence completion
‘x6’ Word meaning
‘x7’ Speeded addition
‘x8’ Speeded counting of dots
‘x9’ Speeded discrimination straight and curved capitals
Source:
This dataset was originally retrieved from
http://web.missouri.edu/~kolenikovs/stata/hs-cfa.dta (link no
longer active) and converted to an R dataset.
References:
Holzinger, K., and Swineford, F. (1939). A study in factor
analysis: The stability of a bifactor solution. Supplementary
Educational Monograph, no. 48. Chicago: University of Chicago
Press.
Joreskog, K. G. (1969). A general approach to confirmatory maximum
likelihood factor analysis. _Psychometrika_, 34, 183-202.
```
The basic idea behind the dataset is that you have three indicators for each of three latent factors representing different cognitive abilities, as shown in @fig-hs below.
{#fig-hs}
The data from the study is 'built-in' to the lavaan package, which means that after you've loaded lavaan using `library("lavaan")`, the dataset will be accessible as the named variable `HolzingerSwineford1939`.
```{r}
HolzingerSwineford1939 |>
as_tibble() # convert from data.frame to tibble for printing
```
When performing CFA in lavaan, we can either write everything out in full lavaan syntax and fit the model using `lavaan()`, or we can use the convenience function `cfa()` which has appropriate defaults. We will do the latter.
The first step is to define the syntax of the model. To perform CFA, we need to use a new syntactic operator in lavaan, `=~`, which is used to define your measurement model. When you see `=~` you should read this as "... is measured by ...". This operator will appear in a formula such as `lv =~ i1 + i2 + ...` which means "latent variable `lv` is measured by indicators `i1`, `i2`, etc.
So, the model syntax to reproduce the diagram in @fig-hs would be as follows (with the syntax enclosed between single quotes, as is usual in lavaan).
```{r}
mod_hs <- '
visual =~ x1 + x2 + x3
textual =~ x4 + x5 + x6
speed =~ x7 + x8 + x9
'
```
That is really it, if we are using the function `cfa()`. Everything else in the model—the variances of the model indicators, the covariances among the latent factors—will be included in the model by default.
What's left is just to run `cfa()`.
```{r}
fit_hs <- cfa(mod_hs, data = HolzingerSwineford1939)
fit_hs ## just printing
```
Here we can see that we estimated 21 parameters and thus has 24 degrees of freedom. It also reports back that the default estimation algorithm of Maximum Likelihood (ML) was used. The model has a $\chi^2(24)=85.306$, with $p<.001$, so the null hypothesis of perfect fit is rejected (as it often is with moderately large datasets).
Let's get more information using `summary()`, including measures of fit.
```{r}
summary(fit_hs, fit.measures = TRUE)
```
The fit indices don't look spectacular, so perhaps we can look into ways to improve the model. The `modificationIndices()` function shows the $\chi^2$ value associated with various possible modifications to the original fit.
```{r}
modificationIndices(fit_hs) |>
arrange(desc(mi))
```
These are some ways you could improve model fit, but here you want to be careful. Note that some cross-loadings are being suggested (e.g., `visual =~ x9`) but you might want to avoid these unless you have some theoretical reason for including them.
Maybe what we would want to do instead would be to improve the model by allowing a few covariances between indicator variables. So let's restrict the set of possibilities to these.
```{r}
modificationIndices(fit_hs) |>
filter(op == "~~") |> # the syntactic symbol for covariances
arrange(desc(mi)) |> # order them in terms of the modification index
slice(1:15)
```
These are the top candidates for improving our model. First of these is to allow `x7` to covary with `x8`. Let's do that and check the improvement to the fit.
```{r}
mod_hs2 <- '
visual =~ x1 + x2 + x3
textual =~ x4 + x5 + x6
speed =~ x7 + x8 + x9
## included to improve model fit
x7 ~~ x8
'
fit_hs2 <- cfa(mod_hs2, data = HolzingerSwineford1939)
summary(fit_hs2, fit.measures=TRUE)
```
You will probably want to compare models to see if the inclusion is justified.
```{r}
anova(fit_hs2, fit_hs)
```
You could (in principal) continue this process until you get a satisfactory model fit. But let's put this aside and have a look at another useful thing you can do.
One thing you might want to know is: what are the correlations between these various latent factors? The output gives use covariances, which are hard to interpret, because the variances of the latent factors are not standardized. Rather than using unit loading identification (ULI), we can switch to unit variance identification (UVI); in other words, rather than having the first loading scaled to one, we allow the loading to vary but fix the variance of our latent factors to 1. We can do this by adding `std.lv=TRUE` option to our call to `cfa()`.
```{r}
fit_cfa2 <- cfa(mod_hs2, data = HolzingerSwineford1939,
std.lv = TRUE)
```
```{r}
#| echo: false
.txt <- capture.output(summary(fit_cfa2))
.t0 <- grep("^Covariances:", .txt)[1]
.t1 <- (grep("^Variances:", .txt)[1] - 1)
.cr <- parameterEstimates(fit_cfa2) |>
filter(op == "~~",
!grepl("^x[1-9]$", lhs),
lhs != rhs) |>
select(lhs, rhs, est)
cat(.txt[.t0:.t1], sep = "\n")
```
Now that we have this information, we can interpret the covariance between latent factors as correlations. We obtain small to medium positive correlations between all three factors: so, a correlation of `{r} round(.cr[["est"]][1], 3)` between `{r} .cr[["lhs"]][1]` and `{r} .cr[["rhs"]][1]`; of `{r} round(.cr[["est"]][2], 3)` between `{r} .cr[["lhs"]][2]` and `{r} .cr[["rhs"]][2]`; and of `{r} round(.cr[["est"]][3], 3)` between `{r} .cr[["lhs"]][3]` and `{r} .cr[["rhs"]][3]`.