Expand `check_heterogeneity_bias()`'s output #812

strengejacke · 2025-04-30T10:53:38Z

Fixes #810

R/check_group_variation.R

mattansb · 2025-05-07T18:53:33Z

Okay, I like this a lot:

library(performance)

mlmRev::egsingle |> 
  check_group_variation(select = c("lowinc", "female", "math"),
                        by = c("schoolid", "childid"))
#> Check group variation
#> 
#> group    | variable |    type
#> -----------------------------
#> schoolid |   lowinc | between
#> schoolid |   female |  within
#> schoolid |     math |    both
#> childid  |   lowinc | between
#> childid  |   female | between
#> childid  |     math |    both


mlmRev::egsingle |> 
  check_group_variation(select = c("lowinc", "female", "math"),
                        by = c("schoolid", "childid"), 
                        
                        include_by = TRUE)
#> Check group variation
#> 
#> group    | variable |    type
#> -----------------------------
#> schoolid |  childid |  nested
#> schoolid |   lowinc | between
#> schoolid |   female |  within
#> schoolid |     math |    both
#> childid  | schoolid | between
#> childid  |   lowinc | between
#> childid  |   female | between
#> childid  |     math |    both



dat <- data.frame(
  id = rep(letters, each = 2),
  
  constant = "a",
  
  between_num = rep(rnorm(26), each = 2),
  within_num = rep(rnorm(2), times = 26),
  both_num = rnorm(52),
  
  between_fac = rep(LETTERS, each = 2),
  within_fac = rep(LETTERS[1:2], times = 26),
  both_fac = sample(LETTERS[1:5], size = 52, replace = TRUE)
)


dat |> 
  check_group_variation(by = "id")
#> Check id variation
#> 
#> variable    |    type
#> ---------------------
#> constant    |        
#> between_num | between
#> within_num  |  within
#> both_num    |    both
#> between_fac | between
#> within_fac  |  within
#> both_fac    |    both

Created on 2025-05-07 with reprex v2.1.1

I rewrote a lot of the docs to explain what is going on.
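To make the labels above concrete, here is a minimal base-R sketch of how a numeric variable could be classified. It is an illustration only: `classify_numeric_variation()` and its `tol` argument are hypothetical names, and the rule used (compare each value with its group mean) is an assumption, not the code in the package.

```r
# Hypothetical helper, NOT the implementation used by check_group_variation().
# Assumption: a numeric variable "varies within" if any value differs from its
# group mean, and "varies between" if the group means differ from each other.
classify_numeric_variation <- function(x, group, tol = 1e-8) {
  group_means <- ave(x, group, FUN = mean)
  varies_within <- any(abs(x - group_means) > tol)
  varies_between <- diff(range(group_means)) > tol

  if (varies_within && varies_between) {
    "both"
  } else if (varies_within) {
    "within"
  } else if (varies_between) {
    "between"
  } else {
    NA_character_ # constant: no variation at all
  }
}

# Deterministic toy data shaped like the `dat` example above
id <- rep(letters, each = 2)
between_num <- rep(seq_len(26), each = 2) # one value per id
within_num <- rep(c(-1, 1), times = 26)   # same two values in every id
both_num <- between_num + within_num      # differs within and between ids

classify_numeric_variation(between_num, id)
classify_numeric_variation(within_num, id)
classify_numeric_variation(both_num, id)
```

Factors need a different rule (crossed versus nested levels), so this sketch deliberately covers numeric input only.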

@strengejacke How do I make the printing method split by group?
Can this function replace (supersede) check_heterogeneity_bias()? If so, some of the docs from there can make their way into here.

strengejacke · 2025-05-08T06:51:59Z

Great! Printing should be fixed. You need to provide a list of data frames, so I just added a call to split(). The by argument only works for the HTML format, where the {gt} package has a group_by argument (or something like that). I wonder why we haven't added this feature for text output yet? Maybe we could add it to export_table() so that by also works for the text format.
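The idea of passing a list of data frames can be sketched in base R as follows. This is an assumption-laden illustration, not the package's print method: the `variation` data frame is made up to mirror the output shown earlier, and plain `print()` stands in for the real table formatting.

```r
# Illustrative result table (values copied from the first example's output)
variation <- data.frame(
  group = rep(c("schoolid", "childid"), each = 3),
  variable = rep(c("lowinc", "female", "math"), times = 2),
  type = c("between", "within", "both", "between", "between", "both")
)

# Keep groups in order of first appearance rather than alphabetical order
variation$group <- factor(variation$group, levels = unique(variation$group))

# One data frame per grouping variable
by_group <- split(variation[c("variable", "type")], variation$group)

# Print each piece under its own caption
for (g in names(by_group)) {
  cat("Check", g, "variation\n\n")
  print(by_group[[g]], row.names = FALSE)
  cat("\n")
}
```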

I don't think this function will supersede check_heterogeneity_bias(); we'll probably keep both. If you look at the code and references (in demean()), you'll see that calculating group averages (or checking heterogeneity bias) is not straightforward for crossed/nested designs with multiple levels.

mattansb · 2025-05-08T07:11:41Z

Yes, getting group means is harder for complex nested designs, but isn't check_heterogeneity_bias() just checking for any within-group variance of the lowest grouping variable?

library(performance)
library(dplyr)

egsingle <- mlmRev::egsingle |> 
  group_by(childid) |> 
  filter(n() == 6L)

egsingle |> 
  check_group_variation(select = c("lowinc", "female", "year", "math"),
                        by = c("schoolid", "childid"))
#> Check group variation
#> 
#> group    | variable |    type
#> -----------------------------
#> schoolid |   lowinc | between
#> schoolid |   female |    both
#> schoolid |     year |  within
#> schoolid |     math |    both
#> childid  |   lowinc | between
#> childid  |   female | between
#> childid  |     year |  within
#> childid  |     math |    both

egsingle |> 
  check_heterogeneity_bias(select = c("lowinc", "female", "year", "math"),
                           by = c("schoolid", "childid"), nested = TRUE)
#> Possible heterogeneity bias due to following predictors: year, math

Created on 2025-05-08 with reprex v2.1.1

strengejacke · 2025-05-08T08:33:13Z

> just checking for any within-group variance of the lowest grouping variable

No, only for cross-classified designs, not for nested ones; see https://github.com/easystats/datawizard/blob/4d78084f0676e1df60dc9eaf20298a2d810cb645/R/demean.R#L445-L477

See docs and references in https://easystats.github.io/datawizard/reference/demean.html

mattansb · 2025-05-08T09:38:26Z

check_heterogeneity_bias() only looks at the _within component, which for a nested design is the same as only looking at the lowest level variable:

# Nested
demean1 <- mlmRev::egsingle |> 
  datawizard::demean("math", by = c("schoolid/childid"), append = FALSE)
head(demean1)
#>   math_schoolid_between math_childid_between math_within
#> 1             0.1984639             1.328203  -0.3806667
#> 2             0.1984639             1.328203  -0.3926667
#> 3             0.1984639             1.328203   0.7733333
#> 4             0.1984639             1.340136  -2.8416000
#> 5             0.1984639             1.340136  -1.0996000
#> 6             0.1984639             1.340136   0.8914000

# Only by the lower-level grouping variable
demean2 <- mlmRev::egsingle |> 
  datawizard::demean("math", by = c("childid"), append = FALSE)
head(demean2)
#>   math_between math_within
#> 1     1.526667  -0.3806667
#> 2     1.526667  -0.3926667
#> 3     1.526667   0.7733333
#> 4     1.538600  -2.8416000
#> 5     1.538600  -1.0996000
#> 6     1.538600   0.8914000

# These are the same
all(demean1$math_within == demean2$math_within)
#> [1] TRUE

Created on 2025-05-08 with reprex v2.1.1

In other words, check_heterogeneity_bias() will return any variable that is labelled by check_group_variation() as within/both.
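The equality above can be reproduced by hand with a small base-R sketch. The decomposition used here (school mean, child mean minus school mean, value minus child mean) is an assumption that is consistent with the printed output, where 0.1984639 + 1.328203 matches the single-level between value 1.526667; the toy data and variable names are made up.

```r
# Toy nested design: children nested within schools (made-up values)
school <- rep(c("s1", "s2"), each = 6)
child <- rep(c("c1", "c2", "c3", "c4"), each = 3)
math <- c(1, 2, 3, 2, 4, 6, 0, 1, 2, 5, 5, 8)

school_mean <- ave(math, school, FUN = mean)
child_mean <- ave(math, child, FUN = mean)

# Assumed nested decomposition
school_between <- school_mean
child_between <- child_mean - school_mean
within_nested <- math - school_between - child_between

# De-meaning by the lowest-level grouping variable only
within_lowest <- math - child_mean

# The within components agree, and the three parts add back up to math
isTRUE(all.equal(within_nested, within_lowest))
isTRUE(all.equal(school_between + child_between + within_nested, math))
```

The sketch uses `all.equal()` rather than `==` because the two routes to the within component can differ by floating-point rounding.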

mattansb · 2025-05-08T11:11:03Z

I propose soft deprecating check_heterogeneity_bias() in favor of this new function that is more flexible and provides more specific labels for variables (:

mattansb · 2025-05-08T11:25:43Z

@strengejacke I'm reading Bell and Jones (2015) and Lee (2011) re: heterogeneity bias, and they define heterogeneity bias as cases where a within-group variable also varies between groups (or: x is correlated with the random intercepts of the groups).

But check_heterogeneity_bias() doesn't check that "x" varies both between and within "groups", only whether it varies within groups. This means that a variable can be flagged by the function while varying only within groups (and not between), which would not lead to heterogeneity bias...

(The opposite is not true - any variable that might lead to heterogeneity bias will be flagged. So what I'm saying is that check_heterogeneity_bias() can have false positives, but not false negatives.)
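The difference between the two rules can be spelled out with the childid labels and the check_heterogeneity_bias() message shown earlier in this thread. The snippet is only a sketch: the labels are typed in by hand rather than computed, and the object names are made up.

```r
# childid labels copied from the check_group_variation() output above
labels <- c(lowinc = "between", female = "between", year = "within", math = "both")

# Rule matching the check_heterogeneity_bias() output: any within-group variation
flagged_any_within <- names(labels)[labels %in% c("within", "both")]

# Stricter rule following the definition above: within AND between variation
flagged_both <- names(labels)[labels == "both"]

flagged_any_within
flagged_both

# Variables that only the looser rule flags (the potential false positives)
setdiff(flagged_any_within, flagged_both)
```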

strengejacke · 2025-05-08T11:33:59Z

P.S.: which publication is Lee 2011?

mattansb · 2025-05-08T12:11:03Z

Sorry, Li 2011, here.

> So it would make sense to replace check_heterogeneity_bias() by check_group_variation() in the long run?

Yeah. Should be easy to replace all existing uses of it in easystats.

strengejacke · 2025-05-08T12:49:44Z

Should we flag variables labelled "both", since those may indicate heterogeneity bias?
I added a short sentence to the details, so users know how to detect heterogeneity bias in case they do not fully understand the paper.

strengejacke · 2025-05-08T12:56:09Z

@mattansb wdyt about the "new" print / type column? It adds "(nested)" as extra information, because a nested variable can be either "between" or "both".

egsingle <- data.frame(
  schoolid = factor(rep(c("2020", "2820"), times = c(18, 6))),
  lowinc = rep(c(TRUE, FALSE), times = c(18, 6)),
  childid = factor(rep(
    c("288643371", "292020281", "292020361", "295341521"),
    each = 6
  )),
  female = rep(c(TRUE, FALSE), each = 12),
  year = rep(1:6, times = 4),
  math = c(
    -3.068, -1.13, -0.921, 0.463, 0.021, 2.035,
    -2.732, -2.097, -0.988, 0.227, 0.403, 1.623,
    -2.732, -1.898, -0.921, 0.587, 1.578, 2.3,
    -2.288, -2.162, -1.631, -1.555, -0.725, 0.097
  )
)

performance::check_group_variation(egsingle, by = c("schoolid", "childid"))
#> Check schoolid variation
#> 
#> variable |             type
#> ---------------------------
#> lowinc   | between (nested)
#> female   |             both
#> year     |           within
#> math     |             both
#> 
#> Check childid variation
#> 
#> variable |    type
#> ------------------
#> lowinc   | between
#> female   | between
#> year     |  within
#> math     |    both

Created on 2025-05-08 with reprex v2.1.1

mattansb · 2025-05-08T13:00:08Z

Not sure what you mean here - nested is defined differently than between (between is fixed).

mattansb · 2025-05-08T13:04:07Z

Oh, I see. This is a matter of perspective - nested variables also vary within each group (they are not fixed) but they also vary between groups (levels are not crossed), so maybe it is something between "between" and "both", which was why I chose to give it a separate label.

strengejacke · 2025-05-08T13:07:45Z

But we can have "nested both" and "nested between", that's why I thought this information is useful. See your example:

data.frame(group, variable1, variable2, variable3) |> 
  performance::check_group_variation(by = "group")
#> Check group variation
#> 
#> variable  |    type
#> -------------------
#> variable1 | between
#> variable2 |  within
#> variable3 |    both

c(
  variable1 = lme4::isNested(variable1, group),
  variable2 = lme4::isNested(variable2, group),
  variable3 = lme4::isNested(variable3, group)
)
#> variable1 variable2 variable3 
#>      TRUE     FALSE      TRUE

strengejacke · 2025-05-08T13:08:19Z

Do you have examples for the use of numeric_as_factors, or tolerance_factor, so we can add those as tests?

mattansb · 2025-05-09T07:06:49Z

@strengejacke I think this might still need some work.

You can make two types of decisions:

- Is variable X nested within groups?
- Is variable X crossed with groups?
- (Or other)

But also:

- Is variable X constant within groups (varies only between)?
- Does variable X vary only within groups?
- (Or is it not constant within groups and also varies between groups?)

So we have the following combinations (with "--" marking impossible situations):

|                     | nested    | crossed  | other     |
|---------------------|-----------|----------|-----------|
| varies only between | "between" | --       | "between" |
| varies only within  | --        | "within" | --        |
| both                | "nested"  | --       | "both"    |
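The combinations above can be turned into a small decision function for factor-like variables. This is a sketch under stated assumptions, not the code in the PR: `classify_factor_variation()` is a hypothetical name, "nested" is taken to mean that every level of the variable occurs in exactly one group, and "crossed" that every level occurs in every group.

```r
# Hypothetical helper, NOT the implementation in the package.
classify_factor_variation <- function(x, group) {
  # TRUE where a level of x (rows) occurs in a group (columns)
  tab <- table(as.character(x), as.character(group)) > 0
  if (nrow(tab) < 2) {
    return(NA_character_) # constant: nothing to classify
  }

  constant_within <- all(colSums(tab) == 1) # one level per group
  is_crossed <- all(tab)                    # every level in every group
  is_nested <- all(rowSums(tab) == 1)       # every level in exactly one group

  if (constant_within) {
    "between" # covers both the "nested" and the "other" column
  } else if (is_crossed) {
    "within"
  } else if (is_nested) {
    "nested"
  } else {
    "both"
  }
}

# Made-up data: four groups with four rows each
group <- rep(c("g1", "g2", "g3", "g4"), each = 4)
between_fac <- rep(c("A", "B", "C", "D"), each = 4)
within_fac <- rep(c("A", "B"), times = 8)
nested_fac <- paste0("n", rep(1:8, each = 2))
both_fac <- c("A", "B", "A", "B", "A", "C", "A", "C",
              "B", "C", "B", "C", "A", "A", "B", "B")

sapply(
  list(between = between_fac, within = within_fac,
       nested = nested_fac, both = both_fac),
  classify_factor_variation,
  group = group
)
```

Checking `constant_within` first reflects the table: a variable that is constant within groups is "between" whether or not its levels are unique to one group, and the "--" cells never need a branch because they cannot occur.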

mattansb · 2025-05-09T07:17:53Z

Currently in the code, you check if the variable is crossed (and possibly balanced) and if it is also nested, but this is an impossible situation.

I'm reverting your change, sorry.

strengejacke and others added 12 commits April 30, 2025 12:58

Expand check_heterogeneity_bias()'s output

9974498

Fixes #810

add function

0c73645

Merge branch 'main' into strengejacke/issue810

422b4d3

bump

42d61ea

docs, news

1d5b1fd

add details

1febdcb

fix example

b944b11

fixes

6ff019e

add checks

47f9f1c

optimize

61b25b7

Update check_group_variation.R

18a6ca2

Merge branch 'main' into strengejacke/issue810

5b12ee8

strengejacke commented May 6, 2025

R/check_group_variation.R (outdated, resolved)

strengejacke commented May 6, 2025

R/check_group_variation.R (outdated, resolved)

strengejacke and others added 4 commits May 6, 2025 13:24

stylo

f3d49c0

docs

b7c9626

Merge branch 'main' into strengejacke/issue810

4961dc2

re write doc and factor function

bbc551b

mattansb and others added 2 commits May 7, 2025 22:04

fix checks

770c71d

print groups, minor styling

fb5cf71

Accept mixed models as well

7089a09

mattansb added 2 commits May 8, 2025 14:04

more docs and examples

e243c64

Update DESCRIPTION

15bf944

strengejacke added 2 commits May 8, 2025 13:45

styler

b0ed2e9

"nested" as additional informatrion

642f879

mattansb and others added 7 commits May 8, 2025 15:17

fix conditional example

e460d58

drop unused function

733c8a2

add tests

850f8c0

soft deprecate check_heterogeneity_bias

4d87968

Merge branch 'strengejacke/issue810' of https://github.com/easystats/…

0c24bf3

…performance into strengejacke/issue810

docs-style, use quietly

fae733c

use expicit roxygen tags

4fe5573

strengejacke added 2 commits May 8, 2025 14:52

docs

229fa19

news

c60c263

strengejacke added 5 commits May 8, 2025 15:09

examples, rename

fbbdfb8

remove

f4e22f3

fix test

3fd7220

styler

4b493a0

add test

4095cec

strengejacke merged commit c6dfdcc into main May 9, 2025
19 of 24 checks passed

strengejacke deleted the strengejacke/issue810 branch May 9, 2025 06:12

mattansb restored the strengejacke/issue810 branch May 9, 2025 06:55

mattansb added a commit that referenced this pull request May 9, 2025

see #812

fbb1b15
