
Expand check_heterogeneity_bias()'s output #812

Merged
37 commits merged on May 9, 2025

Conversation

strengejacke
Member

Fixes #810

@mattansb
Member

mattansb commented May 7, 2025

Okay, I like this a lot:

library(performance)

mlmRev::egsingle |> 
  check_group_variation(select = c("lowinc", "female", "math"),
                        by = c("schoolid", "childid"))
#> Check group variation
#> 
#> group    | variable |    type
#> -----------------------------
#> schoolid |   lowinc | between
#> schoolid |   female |  within
#> schoolid |     math |    both
#> childid  |   lowinc | between
#> childid  |   female | between
#> childid  |     math |    both


mlmRev::egsingle |> 
  check_group_variation(select = c("lowinc", "female", "math"),
                        by = c("schoolid", "childid"),
                        include_by = TRUE)
#> Check group variation
#> 
#> group    | variable |    type
#> -----------------------------
#> schoolid |  childid |  nested
#> schoolid |   lowinc | between
#> schoolid |   female |  within
#> schoolid |     math |    both
#> childid  | schoolid | between
#> childid  |   lowinc | between
#> childid  |   female | between
#> childid  |     math |    both



dat <- data.frame(
  id = rep(letters, each = 2),
  
  constant = "a",
  
  between_num = rep(rnorm(26), each = 2),
  within_num = rep(rnorm(2), times = 26),
  both_num = rnorm(52),
  
  between_fac = rep(LETTERS, each = 2),
  within_fac = rep(LETTERS[1:2], times = 26),
  both_fac = sample(LETTERS[1:5], size = 52, replace = TRUE)
)


dat |> 
  check_group_variation(by = "id")
#> Check id variation
#> 
#> variable    |    type
#> ---------------------
#> constant    |        
#> between_num | between
#> within_num  |  within
#> both_num    |    both
#> between_fac | between
#> within_fac  |  within
#> both_fac    |    both

Created on 2025-05-07 with reprex v2.1.1

I re-wrote a lot of the docs, to explain what is going on.

  • @strengejacke How do I make the printing method split by group?
  • Can this function replace (supersede) check_heterogeneity_bias()? If so, some of the docs from there can make their way into here.

@strengejacke
Member Author

Great! Printing should be fixed. You need to provide a list of data frames, so I just added a call to split(). The by argument only works for HTML format, where the {gt} package has a group_by argument (or something like that). I wonder why we haven't added this feature for textual output yet? Maybe we should/could add it to export_table() to make by also work for text format.
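The split-then-print idea can be sketched in base R. This is only an illustration under assumptions: the data frame `res` and its column names are made up to mimic the printed output above, and the loop is not the package's actual print method.

```r
# Hypothetical long-format result, mimicking the printed table above
res <- data.frame(
  group    = rep(c("schoolid", "childid"), each = 3),
  variable = rep(c("lowinc", "female", "math"), times = 2),
  type     = c("between", "within", "both", "between", "between", "both")
)

# Keep the original order of the grouping variables (not alphabetical)
res$group <- factor(res$group, levels = unique(res$group))

# One data frame per grouping variable, dropping the now redundant column
tables <- lapply(split(res, res$group), function(x) x[, c("variable", "type")])

# Print each table under its own caption
for (g in names(tables)) {
  cat("Check", g, "variation\n\n")
  print(tables[[g]], row.names = FALSE)
  cat("\n")
}
```

Converting `group` to a factor with explicit levels matters here because `split()` otherwise sorts the list elements alphabetically.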

I don't think this function will supersede check_heterogeneity_bias(); we'll probably keep both. If you look at the code and references (in demean()), you'll see that calculating group averages (or checking heterogeneity bias) is not straightforward for crossed/nested designs with multiple levels.

@mattansb
Member

mattansb commented May 8, 2025

Yes, getting group means is harder for complex nested designs, but isn't check_heterogeneity_bias() just checking for any within-group variance of the lowest grouping variable?

library(performance)
library(dplyr)

egsingle <- mlmRev::egsingle |> 
  group_by(childid) |> 
  filter(n() == 6L)

egsingle |> 
  check_group_variation(select = c("lowinc", "female", "year", "math"),
                        by = c("schoolid", "childid"))
#> Check group variation
#> 
#> group    | variable |    type
#> -----------------------------
#> schoolid |   lowinc | between
#> schoolid |   female |    both
#> schoolid |     year |  within
#> schoolid |     math |    both
#> childid  |   lowinc | between
#> childid  |   female | between
#> childid  |     year |  within
#> childid  |     math |    both

egsingle |> 
  check_heterogeneity_bias(select = c("lowinc", "female", "year", "math"),
                           by = c("schoolid", "childid"), nested = TRUE)
#> Possible heterogeneity bias due to following predictors: year, math

Created on 2025-05-08 with reprex v2.1.1

@strengejacke
Member Author

strengejacke commented May 8, 2025

just checking for any within-group variance of the lowest grouping variable

No, only for cross-classified, but not for nested, see https://github.com/easystats/datawizard/blob/4d78084f0676e1df60dc9eaf20298a2d810cb645/R/demean.R#L445-L477

See docs and references in https://easystats.github.io/datawizard/reference/demean.html


@mattansb
Member

mattansb commented May 8, 2025

check_heterogeneity_bias() only looks at the _within component, which for a nested design is the same as only looking at the lowest level variable:

# Nested
demean1 <- mlmRev::egsingle |> 
  datawizard::demean("math", by = c("schoolid/childid"), append = FALSE)
head(demean1)
#>   math_schoolid_between math_childid_between math_within
#> 1             0.1984639             1.328203  -0.3806667
#> 2             0.1984639             1.328203  -0.3926667
#> 3             0.1984639             1.328203   0.7733333
#> 4             0.1984639             1.340136  -2.8416000
#> 5             0.1984639             1.340136  -1.0996000
#> 6             0.1984639             1.340136   0.8914000

# Only by the lower-level grouping variable
demean2 <- mlmRev::egsingle |> 
  datawizard::demean("math", by = c("childid"), append = FALSE)
head(demean2)
#>   math_between math_within
#> 1     1.526667  -0.3806667
#> 2     1.526667  -0.3926667
#> 3     1.526667   0.7733333
#> 4     1.538600  -2.8416000
#> 5     1.538600  -1.0996000
#> 6     1.538600   0.8914000

# These are the same
all(demean1$math_within == demean2$math_within)
#> [1] TRUE

Created on 2025-05-08 with reprex v2.1.1

In other words, check_heterogeneity_bias() will return any variable that is labelled by check_group_variation() as within/both.
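Why the two within components coincide can be shown without {datawizard}. The following is a minimal base R sketch on made-up toy data, assuming the nested decomposition is "school mean, child mean minus school mean, deviation from child mean" (which is consistent with the numbers above, where 0.198 + 1.328 = 1.527); the variable names are illustrative only.

```r
# Toy data (made up): 2 schools, 2 children per school, 3 measurements each
set.seed(1)
d <- data.frame(
  school = rep(c("s1", "s2"), each = 6),
  child  = rep(c("c1", "c2", "c3", "c4"), each = 3),
  x      = rnorm(12)
)

school_mean <- ave(d$x, d$school)  # mean of x per school
child_mean  <- ave(d$x, d$child)   # mean of x per child

# Nested decomposition: school part + child-within-school part + within part
school_between <- school_mean
child_between  <- child_mean - school_mean
within_nested  <- d$x - school_between - child_between

# Demeaning by the lowest-level grouping variable only
within_child <- d$x - child_mean

# Same within component, up to floating point error
all.equal(within_nested, within_child)
```

The school mean cancels out algebraically, so the within part depends only on the lowest grouping level. `all.equal()` is used instead of `==` because the two computations can differ by rounding error.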

@mattansb
Member

mattansb commented May 8, 2025

I propose soft deprecating check_heterogeneity_bias() in favor of this new function that is more flexible and provides more specific labels for variables (:

@mattansb
Member

mattansb commented May 8, 2025

@strengejacke I'm reading Bell and Jones (2015) and Lee (2011) re: heterogeneity bias, and they define heterogeneity bias as cases where a within group variable also varies between groups (or: x is correlated with the random intercepts of the groups).

But check_heterogeneity_bias() doesn't check that "x" varies both between and within "groups" - only if it varies within groups. This means that a variable can be flagged by the function while only varying within groups (and not between), which would not lead to heterogeneity bias...

(The opposite is not true - any variable that might lead to heterogeneity bias will be flagged. So what I'm saying is that check_heterogeneity_bias() can have false positives, but not false negatives.)
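To make the false-positive case concrete, here is a rough base R sketch. The helpers `varies_within()` and `varies_between()` and the `tol` argument are hypothetical, handle numeric variables only, and are not the implementation used in {performance}.

```r
# Hypothetical helpers, not part of {performance}
varies_within <- function(x, g, tol = 1e-8) {
  # Does any observation deviate from its group mean?
  any(abs(x - ave(x, g)) > tol)
}
varies_between <- function(x, g, tol = 1e-8) {
  # Do the group means differ from each other?
  m <- tapply(x, g, mean)
  diff(range(m)) > tol
}

id <- rep(letters[1:5], each = 2)
within_only <- rep(c(0, 1), times = 5)           # same two values in every group
both        <- within_only + rep(1:5, each = 2)  # group means differ as well

# A check based on within-group variation alone flags both variables
c(varies_within(within_only, id), varies_within(both, id))

# Only the second one also varies between groups, which is the
# situation described above as heterogeneity bias
c(varies_between(within_only, id), varies_between(both, id))
```

Under this reading, a variable would only be a candidate for heterogeneity bias when both helpers return `TRUE`.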

@strengejacke
Member Author

P.S.: which publication is Lee 2011?

@mattansb
Member

mattansb commented May 8, 2025

Sorry, Li 2011.

So it would make sense to replace check_heterogeneity_bias() by check_group_variation() in the long run?

Yeah. Should be easy to replace all existing uses of it in easystats.

@strengejacke
Member Author

Should we flag those variables with "both", that may indicate heterogeneity bias?
I added a short sentence to the details, so users are aware of how to detect heterogeneity bias, in case they do not fully understand the paper.

@strengejacke
Member Author

@mattansb wdyt about the "new" print / type-column? (It adds an extra "(nested)" marker, because a "nested" variable can be "between" or "both".)

egsingle <- data.frame(
  schoolid = factor(rep(c("2020", "2820"), times = c(18, 6))),
  lowinc = rep(c(TRUE, FALSE), times = c(18, 6)),
  childid = factor(rep(
    c("288643371", "292020281", "292020361", "295341521"),
    each = 6
  )),
  female = rep(c(TRUE, FALSE), each = 12),
  year = rep(1:6, times = 4),
  math = c(
    -3.068, -1.13, -0.921, 0.463, 0.021, 2.035,
    -2.732, -2.097, -0.988, 0.227, 0.403, 1.623,
    -2.732, -1.898, -0.921, 0.587, 1.578, 2.3,
    -2.288, -2.162, -1.631, -1.555, -0.725, 0.097
  )
)

performance::check_group_variation(egsingle, by = c("schoolid", "childid"))
#> Check schoolid variation
#> 
#> variable |             type
#> ---------------------------
#> lowinc   | between (nested)
#> female   |             both
#> year     |           within
#> math     |             both
#> 
#> Check childid variation
#> 
#> variable |    type
#> ------------------
#> lowinc   | between
#> female   | between
#> year     |  within
#> math     |    both

Created on 2025-05-08 with reprex v2.1.1

@mattansb
Member

mattansb commented May 8, 2025

Not sure what you mean here - nested is defined differently than between (between is fixed).

@mattansb
Member

mattansb commented May 8, 2025

Oh, I see. This is a matter of perspective - nested variables also vary within each group (they are not fixed) but they also vary between groups (levels are not crossed), so maybe it is something between "between" and "both", which was why I chose to give it a separate label.

@strengejacke
Member Author

But we can have "nested both" and "nested between", that's why I thought this information is useful. See your example:

data.frame(group, variable1, variable2, variable3) |> 
  performance::check_group_variation(by = "group")
#> Check group variation
#> 
#> variable  |    type
#> -------------------
#> variable1 | between
#> variable2 |  within
#> variable3 |    both

c(
  variable1 = lme4::isNested(variable1, group),
  variable2 = lme4::isNested(variable2, group),
  variable3 = lme4::isNested(variable3, group)
)
#> variable1 variable2 variable3 
#>      TRUE     FALSE      TRUE
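For readers without the original example data, the nesting check can be approximated in base R. This is a sketch under assumptions: `is_nested_in()` is a hypothetical helper (it is not `lme4::isNested()`), and the variables are made up to reproduce the same TRUE/FALSE/TRUE pattern.

```r
# Hypothetical helper: does every distinct value of x occur in only one group?
is_nested_in <- function(x, g) {
  n_groups <- tapply(as.character(g), as.character(x),
                     function(v) length(unique(v)))
  all(n_groups == 1L)
}

group <- rep(c("g1", "g2", "g3"), each = 4)

v_between <- rep(c("a", "b", "c"), each = 4)                   # constant within each group
v_within  <- rep(c("a", "b"), times = 6)                       # same values in every group
v_both    <- paste0(group, "_", rep(c("x", "y"), times = 6))   # varies within, values unique to a group
v_shared  <- rep(c("a", "a", "b"), each = 4)                   # constant within, but "a" spans two groups

c(
  v_between = is_nested_in(v_between, group),
  v_within  = is_nested_in(v_within, group),
  v_both    = is_nested_in(v_both, group),
  v_shared  = is_nested_in(v_shared, group)
)
```

The last variable illustrates that being constant within groups does not imply nesting: `v_shared` would still count as varying only between groups, yet its value "a" is shared by two groups.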

@strengejacke
Member Author

Do you have examples for the use of numeric_as_factors, or tolerance_factor, so we can add those as tests?

@strengejacke strengejacke merged commit c6dfdcc into main May 9, 2025
19 of 24 checks passed
@strengejacke strengejacke deleted the strengejacke/issue810 branch May 9, 2025 06:12
@mattansb mattansb restored the strengejacke/issue810 branch May 9, 2025 06:55
@mattansb
Member

mattansb commented May 9, 2025

@strengejacke I think this might still need some work.

You can make two types of decisions:

  • Is variable X nested within groups
  • Is variable X crossed with groups
  • (Or other)

But also:

  • Is variable X constant within groups (varies only between)
  • Does variable X vary within groups
  • (Or not constant within and also varies between groups)

So we have the following combinations (with "--" marking impossible situations):

                    | nested    | crossed  | other
-----------------------------------------------------
varies only between | "between" | --       | "between"
varies only within  | --        | "within" | --
both                | "nested"  | --       | "both"
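The combinations in the table above can be turned into a small classifier. The following base R sketch is illustrative only: `classify_variation()` is a hypothetical function for categorical variables, "crossed" is taken to mean that every value occurs in every group (ignoring balance), and a constant variable returns `NA`. It is not the code used in check_group_variation().

```r
# Hypothetical classifier for a categorical variable x and grouping variable g
classify_variation <- function(x, g) {
  # Logical matrix: does value (column) occur in group (row)?
  tab <- unclass(table(as.character(g), as.character(x))) > 0
  if (ncol(tab) == 1L) return(NA_character_)   # constant, nothing to classify

  constant_within <- all(rowSums(tab) == 1)    # one value per group
  nested          <- all(colSums(tab) == 1)    # each value in exactly one group
  crossed         <- all(tab)                  # every value in every group

  if (constant_within) {
    "between"
  } else if (crossed) {
    "within"
  } else if (nested) {
    "nested"
  } else {
    "both"
  }
}

group <- rep(c("g1", "g2", "g3"), each = 4)
v_between <- rep(c("a", "b", "c"), each = 4)
v_within  <- rep(c("a", "b"), times = 6)
v_nested  <- paste0(group, "_", rep(c("x", "y"), times = 6))
v_both    <- c("a", "b", "a", "b", "a", "c", "a", "c", "b", "c", "b", "c")

sapply(
  list(between = v_between, within = v_within, nested = v_nested, both = v_both),
  classify_variation, g = group
)
```

With more than one group and a non-constant variable, "crossed" and "nested" cannot both hold, which matches the "--" cells in the table.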

@mattansb
Member

mattansb commented May 9, 2025

Currently in the code, you check if the variable is crossed (and possibly balanced) and if it is also nested - but this is an impossible situation.

I'm reverting your change, sorry.

mattansb added a commit that referenced this pull request May 9, 2025
Successfully merging this pull request may close these issues.

Expand check_heterogeneity_bias()'s output
2 participants