Improving check_model() for GLMs #376

bwiernik · 2021-11-08T23:19:50Z

The current selection of plots returned by check_model() for GLMs aren't ideal in a few ways.

~~1. They are missing a linearity check (fitted vs residuals). For binomial models, this should be a called to binned_residuals(). For other families, the standard check is fine.~~
~~2. For binomial models, the constant variance plot should be omitted.~~
3. For binomial models, the residual QQ plot is hard to interpret.
~~4. For non-bernoulli models, we should include a plot for checking overdispersion.~~

For the latter few points, the DHARMa package provides an easy-to-interpret approach for checking distributional assumptions from qq plots and problems with fitted vs residual plots using quantile residuals. We might consider soft-importing DHARMa or re-implementing those approaches.
https://cran.r-project.org/web/packages/DHARMa/vignettes/DHARMa.html
florianhartig/DHARMa#33

The text was updated successfully, but these errors were encountered:

jebyrnes · 2021-12-09T06:09:28Z

Would it be easier to pull in https://github.com/florianhartig/DHARMa as its very flexible in terms of model type?

bbolker · 2021-12-11T23:30:15Z

https://gist.github.com/bbolker/945e684dc148239364ab0f99c9a1beed

I realized that the residuals from binomials are not binomial (so constructing binomial CIs isn't useful); however I think that a binned plot using mean_cl_boot() to generate CIs is probably fine. See examples and decide what you think is useful or not ...

jebyrnes · 2022-02-25T14:34:00Z

What about using DHARMa and it’s quartile residuals approach? The qqunif plot that results is very intuitive if you are used to qqplots.

strengejacke · 2022-02-25T15:21:35Z

What about using DHARMa and it’s quartile residuals approach? The qqunif plot that results is very intuitive if you are used to qqplots.

Do you mean for binomial models?

jebyrnes · 2022-02-25T15:46:40Z

Yes? Or any glm. Also meant quantile not quartile. Stupid autocorrect. Ha!

mattansb · 2022-03-09T08:22:40Z

I also like the first one (:

I don't think that (most people) can diagnose dispersion by looking a a qq-plot, but maybe that's just me (:

strengejacke · 2022-03-15T10:41:39Z

@bwiernik How can we create the first plot variance versus expected)? We may include this in the forthcoming update.

bwiernik · 2022-03-15T22:01:47Z

tl;dr I think we should go with the second plot below. What do you think?

Okay, so I looked into what SAS was doing with that first plot. Super disappointing. It's actually just plotting the model-predicted mean agains the model-predicted variance (so for a Poisson model, always a straight line). So there's no actual observed data there, only predictions. I wouldn't even call it a diagnostic plot -- you need to have already fit a model with over dispersion before it will confirm that, yes, there was over dispersion that needed to handled.

So I played around and came up with two ideas

In the top plot, we show the standardized residuals. These should follow a normal distribution, so most points should be within ±2 and almost no points should be outside ±4. There is a pretty big number larger that +2.

The second plot takes the squared residuals and fits a geom_smooth() to them. This gives us the observed average squared residual (i.e., variance) for each Predicted value. It then shows the comparison against the model-predicted variance. In this plot, we see that the observed variance is much higher than expected at a bunch of ranges.

I think the second plot is probably better. It doesn't have individual points, but it is still based on the actual observed residuals, and I think it more clearly indicates the violation of the assumption.

What do you think?

Here is an example of the above data with a model better handling the dispersion (negative binomial, zero-inflated, with a variable dispersion model):

bwiernik · 2022-03-15T22:03:56Z

@bbolker Either of the above require computation of Pearson residuals and/or model-expected conditional variance. Do you think that a function could be added for the conditional variance in glmmTMB for the various count families, including with zero-inflation and variable dispersion (e.g., for nbinom2, V = Predicted * (1 + Predicted / Disp) * (1 - Prob) * (1 + Predicted * (1 + Predicted / Disp) * Prob))?

strengejacke · 2022-03-15T22:53:54Z

2nd plot looks great, and is easy to understand even for non-experts. Can you share some code, so I can implement it in performance (and see), or do you want to do that?

bwiernik · 2022-03-16T00:33:52Z

I'll through some code up

bwiernik · 2022-03-16T14:20:49Z

docvisit.txt

docvisit <- readr::read_table(file.path("docvisit.txt"))

docvisit

library(glmmTMB)

mp <- glmmTMB(
  doctorco ~ sex + illness + income + hscore, 
  data = docvisit,
  family = poisson()
)

mnb <- glmmTMB(
  doctorco ~ sex + illness + income + hscore, 
  data = docvisit,
  family = nbinom2()
)

mzip <- glmmTMB(
  doctorco ~ sex + illness + income + hscore, 
  ziformula = ~ age,
  data = docvisit,
  family = poisson()
)

mzinb <- glmmTMB(
  doctorco ~ sex + illness + income + hscore,
  ziformula = ~ age,
  data = docvisit,
  family = nbinom2()
)

mzinbd <- glmmTMB(
  doctorco ~ sex + illness + income + hscore + age,
  ziformula = ~ sex + illness + income + hscore + age,
  dispformula = ~ sex + illness + income + hscore + age,
  data = docvisit,
  family = nbinom2()
)

ep <- modelbased::estimate_expectation(mp) |> 
  dplyr::mutate(
    Residuals = -1 * Residuals,
    Res2 = Residuals^2,
    Disp = 0,
    Prob = 0,
    V = Predicted,
    PredV = predict(lm(Res2 ~ poly(Predicted, 2))),
    StdRes = residuals(mp, type = "pearson")
  )

enb <- modelbased::estimate_expectation(mnb) |> 
  dplyr::mutate(
    Residuals = -1 * Residuals,
    Res2 = Residuals^2,
    Disp = predict(mnb, type = "disp"),
    Prob = 0,
    V = Predicted * (1 + Predicted / Disp),
    PredV = predict(lm(Res2 ~ poly(Predicted, 2))),
    StdRes = residuals(mnb, type = "pearson")
  )

ezip <- modelbased::estimate_expectation(mzip) |> 
  dplyr::mutate(
    Residuals = -1 * Residuals,
    Res2 = Residuals^2,
    Disp = 0
    Prob = predict(mzip, type = "zprob"),
    V = Predicted * (1 - Prob) * (1 + Predicted * Prob),
    PredV = predict(lm(Res2 ~ poly(Predicted, 2))),
    StdRes = Residuals / sqrt(V)
  )

ezinb <- modelbased::estimate_expectation(mzinb) |> 
  dplyr::mutate(
    Residuals = -1 * Residuals,
    Res2 = Residuals^2,
    Disp = predict(mzinb, type = "disp"),
    Prob = predict(mzinb, type = "zprob"),
    V = Predicted * (1 + Predicted / Disp) * (1 - Prob) * (1 + Predicted * (1 + Predicted / Disp) * Prob),
    PredV = predict(lm(Res2 ~ poly(Predicted, 2))),
    StdRes = Residuals / sqrt(V)
  )

ezinbd <- modelbased::estimate_expectation(mzinbd) |> 
  dplyr::mutate(
    Residuals = -1 * Residuals,
    Res2 = Residuals^2,
    Disp = predict(mzinbd, type = "disp"),
    Prob = predict(mzinbd, type = "zprob"),
    V = Predicted * (1 + Predicted / Disp) * (1 - Prob) * (1 + Predicted * (1 + Predicted / Disp) * Prob),
    PredV = predict(lm(Res2 ~ poly(Predicted, 2))),
    StdRes = Residuals / sqrt(V)
  )

p1 <- ggplot(ezinbd) + aes(x = Predicted) + 
  geom_point(aes(y = StdRes)) + 
  geom_hline(
    yintercept = c(-2, 2, -4, 4), 
    linetype = c("solid", "solid", "dashed", "dashed"),
    color = c(rep(see::social_colors("green"), 2), rep(see::social_colors("blue"), 2))
  ) +
  labs(
    title = "Overdispersion and zero-inflation", 
    subtitle = "Most points should be within ±2 (green), few points outside ±4 (blue)",
    x = "Predicted mean",
    y = "Standardized resiuduals"
  ) +
  see::theme_modern(plot.title.space = 2)

p2 <- ggplot(ezinbd) + aes(x = Predicted) + 
  geom_smooth(aes(y = V), color = see::social_colors("blue"), se = FALSE) + 
  geom_smooth(aes(y = Res2), color = see::social_colors("green")) + 
  labs(
    title = "Overdispersion and zero-inflation", 
    subtitle = "Observed residual variance (green) should follow predicted residual variance (blue)",
    x = "Predicted mean",
    y = "Residual variance"
  ) +
  see::theme_modern(plot.title.space = 2)

p1 / p2

bwiernik · 2022-03-16T21:38:48Z

Awesome. Can you link to where this is implemented?

strengejacke · 2022-03-16T21:54:23Z

Awesome. Can you link to where this is implemented?

This works for check_model(), not yet for plot() (because it requires the forthcoming update from see on CRAN). I'll commit the latest changes to working PR branches, so you can reproduce the plots above.

strengejacke · 2022-03-16T22:04:06Z

Ok, you need the latest insight from GitHub/r-universe, performance from this PR (#404), and latest see from GitHub / r-universe. Then, the above examples work. If you install performance from master instead of PR, "only" the check_model() including overdispersion plots work.

DominiqueMakowski · 2022-03-18T00:07:14Z

This design is somewhat a bit at odds with our traditional opinionated API, rather than having different plot_types, I'd just pick one version which we think it's the best and stick with it (until the next improvement)... or, if they show different stuff, have the two overdispersion plots side by side in the corner (subplots)?

strengejacke · 2022-03-18T06:33:01Z

Yeah, I think for now we stick to plot 1, the default. I just implemented that option, which is undocumented for now, bnut we could make it in a similar way as with normality plots, where we have 3 different types:
https://easystats.github.io/see/articles/performance.html#check-for-normal-distributed-residuals

mattansb · 2022-03-18T11:17:26Z

This design is somewhat a bit at odds with our traditional opinionated API, rather than having different plot_types, I'd just pick one version which we think it's the best and stick with it

Agreed.

strengejacke · 2023-06-08T11:48:01Z

@mccarthy-m-g suggestion looks rather easy to implement.

mccarthy-m-g · 2023-06-08T16:25:03Z

@strengejacke What would a PR for this involve? I could get a draft started if this is the solution you want to go for.

bwiernik · 2023-06-10T17:44:13Z

Let's call the function check_residuals()

@mccarthy-m-g You can add the function to a new check_residuals.R file here in the performance package. Take a look at one of the other check_ functions like check_normality.R for an example of the documentation syntax and structure.

Then open a PR here and we can merge it in. After that, then we can move over to the see package repo and add the plotting function there.

mccarthy-m-g · 2023-06-17T23:15:51Z

Hi all, I just opened a new issue to discuss the implementation for check_residuals() (#595). There are a few things that should be resolved before getting a PR started.

strengejacke · 2024-03-17T12:23:37Z

This is the current development stage. We see a mismatch between the tests based on simulated residuals and generated plots for following families/models:

mnb / nbinom2() (plot suggests underdispersion)
mzinb / nbinom2() with ZI (plot suggests underdispersion)

library(performance)
library(glmmTMB)
library(readr)
docvisit <- read_table2("C:/Users/Daniel/Downloads/docvisit.txt")

mp <- glmmTMB(
  doctorco ~ sex + illness + income + hscore, 
  data = docvisit,
  family = poisson()
)
out <- check_overdispersion(mp)
out
#> # Overdispersion test
#> 
#>        dispersion ratio =    1.808
#>   Pearson's Chi-Squared = 9375.539
#>                 p-value =  < 0.001
#> Overdispersion detected.
plot(out)
#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

mnb <- glmmTMB(
  doctorco ~ sex + illness + income + hscore, 
  data = docvisit,
  family = nbinom2()
)
out <- check_overdispersion(mnb)
out
#> # Overdispersion test
#> 
#>  dispersion ratio = 1.005
#>           p-value = 0.816
#> No overdispersion detected.
plot(out)
#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

mzip <- glmmTMB(
  doctorco ~ sex + illness + income + hscore, 
  ziformula = ~ age,
  data = docvisit,
  family = poisson()
)
out <- check_overdispersion(mzip)
out
#> # Overdispersion test
#> 
#>  dispersion ratio =   1.417
#>           p-value = < 0.001
#> Overdispersion detected.
plot(out)
#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

mzinb <- glmmTMB(
  doctorco ~ sex + illness + income + hscore,
  ziformula = ~ age,
  data = docvisit,
  family = nbinom2()
)
out <- check_overdispersion(mzinb)
out
#> # Overdispersion test
#> 
#>  dispersion ratio = 1.031
#>           p-value =  0.64
#> No overdispersion detected.
plot(out)
#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

mzinbd <- glmmTMB(
  doctorco ~ sex + illness + income + hscore + age,
  ziformula = ~ sex + illness + income + hscore + age,
  dispformula = ~ sex + illness + income + hscore + age,
  data = docvisit,
  family = nbinom2()
)
out <- check_overdispersion(mzinbd)
out
#> # Overdispersion test
#> 
#>  dispersion ratio = 1.133
#>           p-value = 0.104
#> No overdispersion detected.
plot(out)
#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

^{Created on 2024-03-17 with reprex v2.1.0}

Looks like "nbinom2()" is currently inaccurate, the code we use is here:

performance/R/check_model_diagnostics.R

Line 370 in 35b5e19

.diag_overdispersion <- function(model, ...) {

(also pinging @bbolker and cross referencing to #654)

bwiernik added the Enhancement 💥 Implemented features can be improved or revised label Nov 8, 2021