-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathpath.qmd
247 lines (173 loc) · 9.82 KB
/
path.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
# Path analysis
## SEM software: lavaan
For estimating structural equation models in R we will use the [`lavaan`](https://www.lavaan.ugent.be/) add-on package [@Rosseel_2012] for R. You can install it by typing `install.packages("lavaan")` in your R console window.
The main workhorse function for lavaan is called, simply, `lavaan()`. This is a powerful function that lets you specify a wide array of models. You can see all the options by looking at the help page (`?lavaan::lavaan`):
```
lavaan package:lavaan R Documentation
Fit a Latent Variable Model
Description:
Fit a latent variable model.
Usage:
lavaan(model = NULL, data = NULL, ordered = NULL,
sampling.weights = NULL,
sample.cov = NULL, sample.mean = NULL, sample.th = NULL,
sample.nobs = NULL,
group = NULL, cluster = NULL, constraints = "",
WLS.V = NULL, NACOV = NULL, ov.order = "model",
slotOptions = NULL, slotParTable = NULL, slotSampleStats = NULL,
slotData = NULL, slotModel = NULL, slotCache = NULL,
sloth1 = NULL,
...)
```
The `...` in the function syntax suggests there are more options than are listed, and the help page says "See `lavOptions` for a complete list." And if you type `lavOptions()` you will see there are 113 additional options! This can be quite overwhelming for the novice user.
Fortunately, lavaan provides two 'convenience' functions that hide away some of this complexity and that automatically choose sensible defaults. These functions are:
| function | description |
|----------|------------------------------------|
| `sem()` | Fit a structural equation model |
| `cfa()` | Fit a confirmatory factor analysis |
In what follows, we'll be using `sem()` to fit a path model. The function documentation indicates what options of lavaan are set to by default (see `?lavOptions` for further information).
```
The ‘sem’ function is a wrapper for the more general lavaan
function, but setting the following default options: ‘int.ov.free
= TRUE’, ‘int.lv.free = FALSE’, ‘auto.fix.first = TRUE’ (unless
‘std.lv = TRUE’), ‘auto.fix.single = TRUE’, ‘auto.var = TRUE’,
‘auto.cov.lv.x = TRUE’, ‘auto.efa = TRUE’, ‘auto.th = TRUE’,
‘auto.delta = TRUE’, and ‘auto.cov.y = TRUE’.
```
Many of these options are only relevant for models including latent variables, which our path models will not include. Thus, when using path analyses, these are the defaults you should keep in mind.
| option setting | effect |
|--------------------|--------------------------------------------------------------|
| `int.ov.free=TRUE` | intercepts of observed variables are not fixed to zero |
| `auto.var=TRUE` | residual variances for observed variables are set free |
| `auto.cov.y=TRUE` | covariances of endogenous variables are modeled and set free |
## Example: SEM vs regression
We will begin our practical introduction to path analysis with an example that represents a simple extension from ordinary regression modeling—namely, a situation where you have multiple response variables.
Let's assume we're interested in the social determinants of [narcisissm](https://en.wikipedia.org/wiki/Narcissistic_personality_disorder) and we have given a (subclinical) population a narcissism questionnaire (scale goes from 1-100).
```{r}
#| label: narc-setup
#| include: false
options(tidyverse.quiet = TRUE)
library("tidyverse")
set.seed(1451)
.nobs <- 200L
## params for one DV model
.b_HP <- .1
.b_EP <- .3
.b_SB <- -.1
.err_sd <- 7
.errs <- rnorm(.nobs, sd = .err_sd)
.cvmx <- rbind(c(8 * 8, 8 * 4 * .1),
c(4 * 8 * .1, 4 * 4))
.preds0 <- MASS::mvrnorm(.nobs, mu = c(HP = 72, EP = 34),
Sigma = .cvmx)
.preds0[, 1] <- dplyr::case_when(.preds0[, 1] < 1 ~ 1,
.preds0[, 1] > 100 ~ 100,
.default = .preds0[, 1])
.preds0[, 2] <- dplyr::case_when(.preds0[, 2] < 1 ~ 1,
.preds0[, 2] > 100 ~ 100,
.default = .preds0[, 2])
## add number of siblings (0:5)
.preds <- cbind(.preds0,
sample(0:5, .nobs, TRUE, prob=c(.35,.30,.20,.10,.03,.02)))
## params for two DV model
.b_HP_VN <- .1
.b_HP_GN <- .1
.b_EP_VN <- -.1
.b_EP_GN <- .1
.b_SB_VN <- -.05
.b_SB_GN <- -.05
.corr <- 0
.resid_cvmx <- rbind(c(3^2, 3 * 4 * .corr),
c(3 * 4 * .corr, 4^2))
.resids <- MASS::mvrnorm(nrow(.preds), mu = c(VN = 0, GN = 0),
Sigma = .resid_cvmx)
# use the same predictor values for both models
narc_1dv <- tibble(HP = .preds[, 1],
EP = .preds[, 2],
SB = .preds[, 3],
Narc_r = 50 + .b_HP * HP + .b_EP * EP +
.b_SB * SB + .errs) |>
mutate(Narc = case_when(Narc_r < 1 ~ 1,
Narc_r > 100 ~ 100,
.default = Narc_r)) |>
select(-Narc_r)
write_csv(narc_1dv |> select(-SB), file.path("data", "narc_1.csv"))
write_csv(narc_1dv, file.path("data", "narc_1sib.csv"))
narc_2dv <- tibble(HP = .preds[, 1],
EP = .preds[, 2],
SB = .preds[, 3],
VN_r = 40 + .b_HP_VN * HP + .b_EP_VN * EP +
.b_SB_VN * SB + .resids[, 1],
GN_r = 60 + .b_HP_GN * HP + .b_EP_GN * EP +
.b_SB_GN * SB + .resids[, 2],
VN = case_when(VN_r < 1 ~ 1,
VN_r > 100 ~ 100,
.default = VN_r),
GN = case_when(GN_r < 1 ~ 1,
GN_r > 100 ~ 100,
.default = GN_r)) |>
select(-VN_r, -GN_r)
write_csv(narc_2dv |> select(-SB), file.path("data", "narc_2.csv"))
write_csv(narc_2dv, file.path("data", "narc_2sib.csv"))
```
To follow along, download the following files. NB: These files contain simulated (i.e., not real) data, so don't draw any conclusions about narcissism from the analyses.
| file | description |
|---------------------------------|--------------------------------------|
| [`narc_1.csv`](data/narc_1.csv) | narcisissm data, 1 outcome variable |
| [`narc_2.csv`](data/narc_2.csv) | narcisissm data, 2 outcome variables |
### single outcome data
We might have the following theory of narcisissm, where parenting style (specifically, the degree of "helicopter" parenting where parents smother their child with attention) and economic privilege jointly determine the level of adult narcissism.
{fig-align="center"}
Here, the SEM framework wouldn't buy us anything because we can fit this model using basic regression.
```{r}
library("tidyverse") # for read_csv
library("lavaan") # for SEM; we'll need these functions later
narc_1 <- read_csv("data/narc_1.csv", col_types = "ddd")
mod_reg <- lm(Narc ~ HP + EP, data = narc_1)
summary(mod_reg)
```
```{r}
sem_formula <- 'Narc ~ HP + EP'
mod_sem <- sem(sem_formula, data = narc_1)
summary(mod_sem, rsquare = TRUE)
```
```{r}
#| include: false
.dv_rv <- parameterEstimates(mod_sem) |>
filter(op == "~~", lhs == "Narc") |>
pull(est)
.dv_rv_sq <- sqrt(.dv_rv)
```
You can see that the regression coefficients closely match the ones from the regression. We also get an estimate of the residual variance for the DV of `{r} round(.dv_rv, 3)`, which is close to the "Residual standard error" in the regression model, because sqrt(`{r} round(.dv_rv, 3)`) = `{r} round(.dv_rv_sq, 3)`, which is close to `{r} round(sigma(mod_reg), 3)`.
The summary suggests that we have estimated 3 parameters. What are they? Let's use lavaan's `parTable()` function to see.
```{r}
parTable(mod_sem)
```
The 'free' column counts the free parameters. You'll note that `Narc ~ HP`, `Narc ~ EP` and `Narc ~~ Narc` are the (population) direct effects and variance that are being estimated. All the variances/covariances for the exogenous variables are just directly calculated from the sample, and so they have no standard errors (= 0).
We can also get confidence intervals using the `parameterEstimates()` function, which also puts the estimates in a table, making them easier to access for reporting.
```{r}
parameterEstimates(mod_sem)
```
We can also get standardized estimates. The `std.all` column is the one you want to look at.
```{r}
parameterEstimates(mod_sem, standardized = TRUE)
```
{fig-align="center"}
### Adding a second DV
Up to this point using SEM doesn't really buy us anything over and above ordinary regression. But the situation changes if we have more than one DV. Let's say that instead of a single measure we had a measure of two types of narcissism: vulnerable narcissism (VN) and grandiose narcissism (GN), which, for the sake of example, let's say are not mutually exclusive (i.e., a person can have a little bit of each).
We could perform two separate regressions, but this has the problem that each regression is performed independently of the other, which only makes sense if we are sure that the two DVs are totally uncorrelated. If they are correlated, then the estimates from this approach will be biased. Here, a SEM approach is useful because it can account for this covariance.

```{r}
narc_2 <- read_csv("data/narc_2.csv",
col_types = "dddd")
sem_formula2 <- '
VN ~ HP + EP
GN ~ HP + EP'
mod_sem2 <- sem(sem_formula2, data = narc_2)
summary(mod_sem2, rsquare = TRUE)
```
Let's have a look at the model estimates, including standardized coefficients, and add the (raw) coefficients to our path diagram.
```{r}
parameterEstimates(mod_sem2, standardized = TRUE)
```
