Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

allow.empty.cell option produces error with empty cells #407

Open
homfunc opened this issue Dec 24, 2024 · 2 comments
Open

allow.empty.cell option produces error with empty cells #407

homfunc opened this issue Dec 24, 2024 · 2 comments

Comments

@homfunc
Copy link

homfunc commented Dec 24, 2024

I have some very skew ordinal data which has some gaps in some of the frequency tables for each of two groups and I received an error when using the allow.empty.cell = TRUE option which is required due to the following error:

Error : lavaan->lav_samplestats_step1():  
   some categories of variable `x3' are empty in group 1; frequencies are 
   [53, 13, 12, 4, 7, 6, 2, 1, 2, 0]
In addition: There were 25 warnings (use warnings() to see them)

Here is a reprex

library(tidyverse)
library(lavaan)

# simulate some data
sim_test_model <- "
  f1 =~ x1 + x2 + x3 + x4
  f2 =~ x5 + x6 + x7 + x8
"

variable_stats <- tribble(
  ~varname, ~mean, ~variance, ~skewness, ~kurtosis,
  "x1", 4.354913, 10.21996, 0.5958355, -0.9449583,
  "x2", 1.821965, 3.787711, 2.819906, 7.535725,
  "x3", 1.618497, 2.287157, 3.110957, 10.47878,
  "x4", 2.872832, 6.212976, 1.435709, 1.169378,
  "x5", 1.304046, 1.105366, 4.837008, 27.79657,
  "x6", 1.914451, 3.013506, 2.436427, 6.188786,
  "x7", 3.600000, 7.596759, 0.9209243, -0.2308455,
  "x8", 2.182659, 5.052245, 2.141474, 3.877923
)

simulated_data <-
  simulateData(
    model = sim_test_model,
    model.type = "cfa",
    sample.nobs = c(100, 40),
    ov.var = variable_stats$variance,
    skewness = variable_stats$skewness,
    kurtosis = variable_stats$kurtosis,
    return.type = "data.frame",
    seed = 99242
  ) %>%
  tibble()

# now adjust the means
simulated_data <- simulated_data %>%
  mutate(across(starts_with("x"),
    .fn = ~ (.x + (variable_stats %>% filter(varname == cur_column()) %>%
                     pull(mean)) - mean(.x))
  ))

# and integerfy the data using thresholds of c(-Inf, 0.5, 1.5, ..., 9.5, Inf)
simulated_data <- simulated_data %>%
  mutate(across(starts_with("x"),
    .fn = ~ cut(.x, breaks = c(-Inf, seq(0.5, 9.5), Inf), labels = FALSE)
  ))

fit_test_wlsmv_all_with_groups_allow_empty_cell_is_true <-
  cfa(
    model = sim_test_model,
    data = simulated_data,
    group = "group",
    estimator = "WLSMV",
    verbose = TRUE,
    ordered = TRUE,
    allow.empty.cell = TRUE,
  )

yields error

Error in A21[pstar.idx, th.idx_i] <- lav_matrix_crossprod(SC.COR[, pstar.idx],  : 
  number of items to replace is not a multiple of replacement length

That particular error is easy to fix...

index 7a2ef81..d6a97c6 100644
--- a/R/lav_muthen1984.R
+++ b/R/lav_muthen1984.R
@@ -331,12 +331,14 @@ muthen1984 <- function(Data = NULL,
         }
 
         # TH
-        A21[pstar.idx, th.idx_i] <-
+        th.idx_i_local <- th.idx_i[seq_len(dim(SC.COR.UNI$dx.th.y1)[2])]
+        th.idx_j_local <- th.idx_j[seq_len(dim(SC.COR.UNI$dx.th.y2)[2])]
+        A21[pstar.idx, th.idx_i_local] <-
           lav_matrix_crossprod(
             SC.COR[, pstar.idx],
             SC.COR.UNI$dx.th.y1
           )
-        A21[pstar.idx, th.idx_j] <-
+        A21[pstar.idx, th.idx_j_local] <-
           lav_matrix_crossprod(
             SC.COR[, pstar.idx],
             SC.COR.UNI$dx.th.y2

but that is not the end of the story as we now get NaN gradients when running

fit_test_wlsmv_group_1_allow_empty_cell_is_true <-
  cfa(
    model = sim_test_model,
    data = simulated_data %>% filter(group == 1),
    estimator = "WLSMV",
    verbose = TRUE,
    ordered = TRUE,
    allow.empty.cell = TRUE,
  )
@ecmerkle
Copy link
Contributor

I recently added this argument for use with blavaan (see #377), because the priors in a Bayesian model can help to handle empty cells (using bcfa() in place of cfa()).

I think an NaN gradient is the correct behavior for lavaan, because some of the threshold parameters cannot be informed by the data. But I can add something to the docs along with a more informative message about the use of allow.empty.cell.

@homfunc
Copy link
Author

homfunc commented Dec 28, 2024

Thanks for the information about blavaan and the reference to the PR.

I have run the example using blavaan and it does survive so that is positive. However, I would have thought (somewhat naiively I guess) that the allow.empty.cell would have simply filled in the empty cells with 0.5 (or whatever is specified in zero.add[2]) and returned something.

In particular, it seems strange to me that I can get a fit back when allow.empty.cell = FALSE for group 1:

fit_test_wlsmv_group_1_allow_empty_cell_is_false <-
  cfa(
    model = sim_test_model,
    data = simulated_data %>% filter(group == 1),
    estimator = "WLSMV",
    verbose = TRUE,
    ordered = TRUE,
    allow.empty.cell = FALSE,
  )

I would like to see some attempt by lavaan to use, in effect, a simple prior to 'fill in the blanks' when requested.

What are your thoughts?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants