-
-
Notifications
You must be signed in to change notification settings - Fork 96
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improving check_model() for GLMs #376
Comments
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
Would it be easier to pull in https://github.com/florianhartig/DHARMa as its very flexible in terms of model type? |
https://gist.github.com/bbolker/945e684dc148239364ab0f99c9a1beed I realized that the residuals from binomials are not binomial (so constructing binomial CIs isn't useful); however I think that a binned plot using |
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
What about using DHARMa and it’s quartile residuals approach? The qqunif plot that results is very intuitive if you are used to qqplots. |
Do you mean for binomial models? |
Yes? Or any glm. Also meant quantile not quartile. Stupid autocorrect. Ha! |
I also like the first one (: I don't think that (most people) can diagnose dispersion by looking a a qq-plot, but maybe that's just me (: |
@bwiernik How can we create the first plot variance versus expected)? We may include this in the forthcoming update. |
tl;dr I think we should go with the second plot below. What do you think? Okay, so I looked into what SAS was doing with that first plot. Super disappointing. It's actually just plotting the model-predicted mean agains the model-predicted variance (so for a Poisson model, always a straight line). So there's no actual observed data there, only predictions. I wouldn't even call it a diagnostic plot -- you need to have already fit a model with over dispersion before it will confirm that, yes, there was over dispersion that needed to handled. So I played around and came up with two ideas In the top plot, we show the standardized residuals. These should follow a normal distribution, so most points should be within ±2 and almost no points should be outside ±4. There is a pretty big number larger that +2. The second plot takes the squared residuals and fits a I think the second plot is probably better. It doesn't have individual points, but it is still based on the actual observed residuals, and I think it more clearly indicates the violation of the assumption. What do you think? Here is an example of the above data with a model better handling the dispersion (negative binomial, zero-inflated, with a variable dispersion model): |
@bbolker Either of the above require computation of Pearson residuals and/or model-expected conditional variance. Do you think that a function could be added for the conditional variance in glmmTMB for the various count families, including with zero-inflation and variable dispersion (e.g., for nbinom2, |
2nd plot looks great, and is easy to understand even for non-experts. Can you share some code, so I can implement it in performance (and see), or do you want to do that? |
I'll through some code up |
docvisit <- readr::read_table(file.path("docvisit.txt"))
docvisit
library(glmmTMB)
mp <- glmmTMB(
doctorco ~ sex + illness + income + hscore,
data = docvisit,
family = poisson()
)
mnb <- glmmTMB(
doctorco ~ sex + illness + income + hscore,
data = docvisit,
family = nbinom2()
)
mzip <- glmmTMB(
doctorco ~ sex + illness + income + hscore,
ziformula = ~ age,
data = docvisit,
family = poisson()
)
mzinb <- glmmTMB(
doctorco ~ sex + illness + income + hscore,
ziformula = ~ age,
data = docvisit,
family = nbinom2()
)
mzinbd <- glmmTMB(
doctorco ~ sex + illness + income + hscore + age,
ziformula = ~ sex + illness + income + hscore + age,
dispformula = ~ sex + illness + income + hscore + age,
data = docvisit,
family = nbinom2()
)
ep <- modelbased::estimate_expectation(mp) |>
dplyr::mutate(
Residuals = -1 * Residuals,
Res2 = Residuals^2,
Disp = 0,
Prob = 0,
V = Predicted,
PredV = predict(lm(Res2 ~ poly(Predicted, 2))),
StdRes = residuals(mp, type = "pearson")
)
enb <- modelbased::estimate_expectation(mnb) |>
dplyr::mutate(
Residuals = -1 * Residuals,
Res2 = Residuals^2,
Disp = predict(mnb, type = "disp"),
Prob = 0,
V = Predicted * (1 + Predicted / Disp),
PredV = predict(lm(Res2 ~ poly(Predicted, 2))),
StdRes = residuals(mnb, type = "pearson")
)
ezip <- modelbased::estimate_expectation(mzip) |>
dplyr::mutate(
Residuals = -1 * Residuals,
Res2 = Residuals^2,
Disp = 0
Prob = predict(mzip, type = "zprob"),
V = Predicted * (1 - Prob) * (1 + Predicted * Prob),
PredV = predict(lm(Res2 ~ poly(Predicted, 2))),
StdRes = Residuals / sqrt(V)
)
ezinb <- modelbased::estimate_expectation(mzinb) |>
dplyr::mutate(
Residuals = -1 * Residuals,
Res2 = Residuals^2,
Disp = predict(mzinb, type = "disp"),
Prob = predict(mzinb, type = "zprob"),
V = Predicted * (1 + Predicted / Disp) * (1 - Prob) * (1 + Predicted * (1 + Predicted / Disp) * Prob),
PredV = predict(lm(Res2 ~ poly(Predicted, 2))),
StdRes = Residuals / sqrt(V)
)
ezinbd <- modelbased::estimate_expectation(mzinbd) |>
dplyr::mutate(
Residuals = -1 * Residuals,
Res2 = Residuals^2,
Disp = predict(mzinbd, type = "disp"),
Prob = predict(mzinbd, type = "zprob"),
V = Predicted * (1 + Predicted / Disp) * (1 - Prob) * (1 + Predicted * (1 + Predicted / Disp) * Prob),
PredV = predict(lm(Res2 ~ poly(Predicted, 2))),
StdRes = Residuals / sqrt(V)
)
p1 <- ggplot(ezinbd) + aes(x = Predicted) +
geom_point(aes(y = StdRes)) +
geom_hline(
yintercept = c(-2, 2, -4, 4),
linetype = c("solid", "solid", "dashed", "dashed"),
color = c(rep(see::social_colors("green"), 2), rep(see::social_colors("blue"), 2))
) +
labs(
title = "Overdispersion and zero-inflation",
subtitle = "Most points should be within ±2 (green), few points outside ±4 (blue)",
x = "Predicted mean",
y = "Standardized resiuduals"
) +
see::theme_modern(plot.title.space = 2)
p2 <- ggplot(ezinbd) + aes(x = Predicted) +
geom_smooth(aes(y = V), color = see::social_colors("blue"), se = FALSE) +
geom_smooth(aes(y = Res2), color = see::social_colors("green")) +
labs(
title = "Overdispersion and zero-inflation",
subtitle = "Observed residual variance (green) should follow predicted residual variance (blue)",
x = "Predicted mean",
y = "Residual variance"
) +
see::theme_modern(plot.title.space = 2)
p1 / p2 |
This comment was marked as outdated.
This comment was marked as outdated.
Awesome. Can you link to where this is implemented? |
This comment was marked as outdated.
This comment was marked as outdated.
This works for |
Ok, you need the latest insight from GitHub/r-universe, performance from this PR (#404), and latest see from GitHub / r-universe. Then, the above examples work. If you install performance from master instead of PR, "only" the |
This comment was marked as outdated.
This comment was marked as outdated.
This design is somewhat a bit at odds with our traditional opinionated API, rather than having different |
Yeah, I think for now we stick to plot 1, the default. I just implemented that option, which is undocumented for now, bnut we could make it in a similar way as with normality plots, where we have 3 different types: |
Agreed. |
This comment was marked as outdated.
This comment was marked as outdated.
@mccarthy-m-g suggestion looks rather easy to implement. |
@strengejacke What would a PR for this involve? I could get a draft started if this is the solution you want to go for. |
Let's call the function @mccarthy-m-g You can add the function to a new check_residuals.R file here in the performance package. Take a look at one of the other check_ functions like check_normality.R for an example of the documentation syntax and structure. Then open a PR here and we can merge it in. After that, then we can move over to the see package repo and add the plotting function there. |
Hi all, I just opened a new issue to discuss the implementation for |
This is the current development stage. We see a mismatch between the tests based on simulated residuals and generated plots for following families/models:
library(performance)
library(glmmTMB)
library(readr)
docvisit <- read_table2("C:/Users/Daniel/Downloads/docvisit.txt")
mp <- glmmTMB(
doctorco ~ sex + illness + income + hscore,
data = docvisit,
family = poisson()
)
out <- check_overdispersion(mp)
out
#> # Overdispersion test
#>
#> dispersion ratio = 1.808
#> Pearson's Chi-Squared = 9375.539
#> p-value = < 0.001
#> Overdispersion detected.
plot(out)
#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")' mnb <- glmmTMB(
doctorco ~ sex + illness + income + hscore,
data = docvisit,
family = nbinom2()
)
out <- check_overdispersion(mnb)
out
#> # Overdispersion test
#>
#> dispersion ratio = 1.005
#> p-value = 0.816
#> No overdispersion detected.
plot(out)
#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")' mzip <- glmmTMB(
doctorco ~ sex + illness + income + hscore,
ziformula = ~ age,
data = docvisit,
family = poisson()
)
out <- check_overdispersion(mzip)
out
#> # Overdispersion test
#>
#> dispersion ratio = 1.417
#> p-value = < 0.001
#> Overdispersion detected.
plot(out)
#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")' mzinb <- glmmTMB(
doctorco ~ sex + illness + income + hscore,
ziformula = ~ age,
data = docvisit,
family = nbinom2()
)
out <- check_overdispersion(mzinb)
out
#> # Overdispersion test
#>
#> dispersion ratio = 1.031
#> p-value = 0.64
#> No overdispersion detected.
plot(out)
#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")' mzinbd <- glmmTMB(
doctorco ~ sex + illness + income + hscore + age,
ziformula = ~ sex + illness + income + hscore + age,
dispformula = ~ sex + illness + income + hscore + age,
data = docvisit,
family = nbinom2()
)
out <- check_overdispersion(mzinbd)
out
#> # Overdispersion test
#>
#> dispersion ratio = 1.133
#> p-value = 0.104
#> No overdispersion detected.
plot(out)
#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")' Created on 2024-03-17 with reprex v2.1.0 Looks like "nbinom2()" is currently inaccurate, the code we use is here: performance/R/check_model_diagnostics.R Line 370 in 35b5e19
|
The current selection of plots returned by
check_model()
for GLMs aren't ideal in a few ways.1. They are missing a linearity check (fitted vs residuals). For binomial models, this should be a called tobinned_residuals()
. For other families, the standard check is fine.2. For binomial models, the constant variance plot should be omitted.3. For binomial models, the residual QQ plot is hard to interpret.
4. For non-bernoulli models, we should include a plot for checking overdispersion.For the latter few points, the DHARMa package provides an easy-to-interpret approach for checking distributional assumptions from qq plots and problems with fitted vs residual plots using quantile residuals. We might consider soft-importing DHARMa or re-implementing those approaches.
https://cran.r-project.org/web/packages/DHARMa/vignettes/DHARMa.html
florianhartig/DHARMa#33
The text was updated successfully, but these errors were encountered: