cfa.Rmd

# Confirmatory Factor Analysis


This mostly follows Bollen (1989) for maximum likelihood estimation of a confirmatory factor analysis. In the following example we will examine a situation where there are two underlying (correlated) latent variables for 8 observed responses.  The code as is will only  work with this toy data set.  Setup uses the <span class="pack" style = "">psych</span> and <span class="pack" style = "">mvtnorm</span> packages, and results are checked against the <span class="pack" style = "">lavaan</span> package.


## Data Setup

For the data we will simulate observed variables with specific loadings on two latent constructs (factors).

```{r cfa-setup}
library(tidyverse)

set.seed(123)

# loading matrix
lambda = matrix(
  c(1.0, 0.5, 0.8, 0.6, 0.0, 0.0, 0.0, 0.0,
    0.0, 0.0, 0.0, 0.0, 1.0, 0.7, 0.6, 0.8), 
  nrow = 2, 
  byrow = TRUE
)

# correlation of factors
phi = matrix(c(1, .25, .25, 1), nrow = 2, byrow = TRUE)  

# factors and some noise
factors = mvtnorm::rmvnorm(1000, mean = rep(0, 2), sigma = phi, "chol")
e = mvtnorm::rmvnorm(1000, sigma = diag(8))

# observed responses
y = 0 + factors%*%lambda + e

# Examine
#dim(y)
psych::describe(y)
# round(cor(y), 2)

# see the factor structure
psych::cor.plot(cor(y))

# example exploratory fa
#psych::fa(y, nfactors=2, rotate="oblimin") 
```


## Functions

We will have two separate estimation functions, one for the covariance matrix,
and another for the correlation matrix.

```{r cfa-cov-func}
# measurement model, covariance approach

# trace function, strangely absent from base R
tr <- function(mat) {
  sum(diag(mat), na.rm = TRUE)
}

cfa_cov <- function (parms, data) {
  # Arguments- 
  # parms: initial values (named)
  # data: raw data
  
  # Extract parameters by name
  
  l1 = c(1, parms[grep('l1', names(parms))])      # loadings for factor 1
  l2 = c(1, parms[grep('l2', names(parms))])      # loadings for factor 2
  
  cov0 = parms[grep('cov', names(parms))]         # factor covariance, variances
  
  # Covariance matrix
  S = cov(data)*((nrow(data)-1)/nrow(data))       # ML covariance div by N rather than N-1, the multiplier adjusts
  
  # loading estimates
  lambda = cbind(
    c(l1, rep(0,length(l2))),
    c(rep(0,length(l1)), l2)
  )
  
  # disturbances
  dist_init = parms[grep('dist', names(parms))]    
  disturbs  = diag(dist_init)
  
  # factor correlation
  phi_init = matrix(c(cov0[1], cov0[2], cov0[2], cov0[3]), 2, 2)  #factor cov/correlation matrix
  
  # other calculations and log likelihood
  sigtheta = lambda%*%phi_init%*%t(lambda) + disturbs
  
  # in Bollen p + q (but for the purposes of this just p) = tr(data)
  pq = dim(data)[2] 
  
  # a reduced version; Bollen 1989 p.107
  # ll = -(log(det(sigtheta)) + tr(S%*%solve(sigtheta)) - log(det(S)) - pq) 
  
  # this should be the same as Mplus H0 log likelihood
  ll = ( (-nrow(data)*pq/2) * log(2*pi) ) - 
    (nrow(data)/2) * ( log(det(sigtheta)) + tr(S%*%solve(sigtheta)) )
  
  -ll
}
```

We can use the correlation matrix for standardized results. Lines correspond to those in `cfa_cov`.

```{r cfa-cor-func}
cfa_cor <- function (parms, data) {
  
  l1 = parms[grep('l1', names(parms))]      # loadings for factor 1
  l2 = parms[grep('l2', names(parms))]      # loadings for factor 2
  cor0 = parms[grep('cor', names(parms))]   # factor correlation
  
  S = cor(data)
  
  lambda = cbind(
    c(l1, rep(0,length(l2))),
    c(rep(0,length(l1)), l2)
  )
  
  dist_init = parms[grep('dist', names(parms))]
  disturbs  = diag(dist_init)
  
  phi_init = matrix(c(1, cor0, cor0, 1), ncol=2)
  
  sigtheta = lambda%*%phi_init%*%t(lambda) + disturbs
  pq = dim(data)[2]
  
  #ll = ( log(det(sigtheta)) + tr(S%*%solve(sigtheta)) - log(det(S)) - pq )
  
  ll = ( (-nrow(data)*pq/2) * log(2*pi) ) - 
    (nrow(data)/2) * ( log(det(sigtheta)) + tr(S%*%solve(sigtheta)) )
  
  -ll
}
```


## Estimation

Corresponding to the functions, we will get results based on the covariance and
correlation matrix respectively.

### Raw

Set initial values.

```{r cfa-cov-init}
par_init_cov = c(rep(1, 6), rep(.05, 8), rep(.5, 3)) 
names(par_init_cov) = rep(c('l1','l2', 'dist', 'cov'), c(3, 3, 8, 3))
```


Estimate and extract.

```{r cfa-cov-est}
fit_cov = optim(
  par  = par_init_cov,
  fn   = cfa_cov,
  data = y,
  method  = "L-BFGS-B",
  lower   = 0
) 

loadings_cov = data.frame(
  f1 = c(1, fit_cov$par[1:3], rep(0, 4)),
  f2 = c(rep(0, 4), 1, fit_cov$par[4:6])
)

disturbances_cov = fit_cov$par[7:14]
```


### Standardized

```{r cfa-cor-init}
par_init_cor = c(rep(1, 8), rep(.05, 8), 0) #for cor
names(par_init_cor) = rep(c('l1', 'l2', 'dist', 'cor'), c(4, 4, 8, 1))
```

```{r cfa-cor-est}
fit_cor = optim(
  par  = par_init_cor,
  fn   = cfa_cor,
  data = y,
  method = "L-BFGS-B",
  lower  = 0,
  upper  = 1
)

loadings_cor = matrix(
  c(fit_cor$par[1:4], rep(0, 4), rep(0, 4), fit_cor$par[5:8]), 
  ncol = 2
)

disturbances_cor = fit_cor$par[9:16]
```


## Comparison

Gather results for summarizing.

```{r cfa-all}
results = list(
  raw = list(
    loadings = round(data.frame(loadings_cov, Variances = disturbances_cov), 3),
    cov.fact = round(matrix(c(fit_cov$par[c(15, 16, 16, 17)]), ncol = 2) , 3)
  ),
  
  standardized = list(
    loadings = round(
      data.frame(
        loadings_cor,
        Variances = disturbances_cor,
        Rsq = (1 - disturbances_cor)
      ), 3),
    cor.fact = round(matrix(c(1, fit_cor$par[c(17, 17)], 1), ncol = 2), 3)
  ),
  
  # note inclusion of intercepts for total number of par
  fit_lav = data.frame(
    ll  = fit_cov$value,
    AIC = 2*fit_cov$value + 2 * (length(par_init_cov) + ncol(y)),
    BIC = 2*fit_cov$value + log(nrow(y)) * (length(par_init_cov) + ncol(y))
  )  
)

results
```


Compare with <span class="pack" style = "">lavaan</span>.

```{r cfa-lavaan}
library(lavaan)

y = data.frame(y)

model = '
  F1  =~ X1 + X2 + X3 + X4
  F2  =~ X5 + X6 + X7 + X8
'

fit_lav = cfa(
  model,
  data  = y,
  mimic = 'Mplus',
  estimator = 'ML'
)

fit_lav_std = cfa(
  model,
  data  = y,
  mimic = 'Mplus',
  estimator = 'ML',
  std.lv = TRUE,
  std.ov = TRUE
) 

# note that lavaan does not count the intercepts among the free params for
# AIC/BIC by default, (can get its result via -2 * as.numeric(lls) + k *
# attr(lls, "df")), but the mimic='Mplus' should have them correspond to optim's
# results

summary(fit_lav,
        fit.measures = TRUE,
        standardized = TRUE)
```


### Mplus

If you have access to Mplus you can use Mplus Automation to prepare the data. The following code is in Mplus syntax and will produce the above model.

```{r mplus, eval=FALSE}
library(MplusAutomation)

prepareMplusData(data.frame(y), "factsim.dat")
```

```
MODEL:
 F1 BY X1-X4;
 F2 BY X5-X8;

results:
 STDYX;
```


## Source

Original code available at
https://github.com/m-clark/Miscellaneous-R-Code/blob/master/ModelFitting/cfa.R