-
Notifications
You must be signed in to change notification settings - Fork 16
Description
Hi,
I have trained a LMM of the form y ~ 1 + (1|a) + (1|a:b) and now want to predict y for different inputs. All the inputs are stored in a single dataframe, which I pass to predictInterval. In my query dataframe all a are present in the training dataset, while a:b are only sometimes present.
I noticed that multiple calls to predictInterval return significantly different (more than simple sampling noise) intervals in some cases. Surprisingly, the intervals predicted calling predictInterval on the entire dataframe and on each row individually are consistently different.
- If all
a:bin the query are present in the training dataset, I don't see issues. - If all
a:bin the query are not present in the training dataset, I don't see issues. - If some
a:bin the query are present in the training dataset and some are not, I see the issue.
I tried to debug the code of the function, and I believe that the problem lies here: https://github.com/jknowles/merTools/blob/master/R/merPredict.R#L246
tmp$var <- names(tmp[keep])[max.col(tmp[keep])]
With my input, tmp[keep] looks like this:
a1:b1 a2:b1
2 1 0
3 0 0
4 0 1
Queries 2 and 4 have a corresponding levels in the training set, while query 3 doesn't. I would now expect tmp$var to take the values a1:b1, NA, a2:b1, but instead it takes either a1:b1, a1:b1, a2:b1 or a1:b1, a2:b1, a2:b1 at random. There are two problems here:
max.colwill return a value (and thus select a column/level) even if the entire row in zero (instead of returning NA).- Since a zero row is a tie,
max.colby default selects a column at random.
I'm relatively new to R, and even more to LMMs, but my understanding is that when a level is missing, predictInterval will compute the random effect for another level selected at random from the (observed) levels of other queries. A solution that seems to work for me is to change the guilty line to:
tmp2 <- tmp[keep]
tmp2[!as.logical(rowSums(tmp2 != 0)),] <- NA
tmp$var <- names(tmp[keep])[max.col(tmp2)]
Quite dirty, but the idea is to fill the all-zeros rows with NAs, so that max.col will return NA for them.
Unfortunately I cannot share my actual model/dataset at the moment, but if the issue is unclear I'm happy to try to build a small example to reproduce the problem.