
Nonreproducible predictInterval results when mixing observed and unobserved levels #124

@dotPiano

Description

Hi,

I have trained an LMM of the form y ~ 1 + (1|a) + (1|a:b) and now want to predict y for different inputs. All the inputs are stored in a single dataframe, which I pass to predictInterval. In my query dataframe, all levels of a are present in the training dataset, while the a:b combinations are only sometimes present.

I noticed that in some cases multiple calls to predictInterval return significantly different intervals (differences larger than simple sampling noise). Surprisingly, the intervals obtained by calling predictInterval on the entire dataframe and on each row individually are also consistently different.

  • If all a:b in the query are present in the training dataset, I don't see issues.
  • If none of the a:b in the query are present in the training dataset, I don't see issues.
  • If some a:b in the query are present in the training dataset and some are not, I see the issue.

I tried to debug the code of the function, and I believe that the problem lies here: https://github.com/jknowles/merTools/blob/master/R/merPredict.R#L246

 tmp$var <- names(tmp[keep])[max.col(tmp[keep])] 

With my input, tmp[keep] looks like this:

                       a1:b1                     a2:b1
2                          1                         0
3                          0                         0
4                          0                         1

Queries 2 and 4 have corresponding levels in the training set, while query 3 doesn't. I would expect tmp$var to take the values a1:b1, NA, a2:b1, but instead it takes either a1:b1, a1:b1, a2:b1 or a1:b1, a2:b1, a2:b1 at random. There are two problems here:

  1. max.col will return a value (and thus select a column/level) even if the entire row is zero (instead of returning NA).
  2. Since an all-zero row is a tie across every column, max.col by default (ties.method = "random") selects a column at random.
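Both behaviours are easy to confirm in base R with a small indicator matrix mirroring tmp[keep] above (a sketch; the matrix and its values are made up to match the table, not taken from my data):

```r
# Indicator matrix mirroring tmp[keep]: row 2 has no observed level.
m <- matrix(c(1, 0, 0,
              0, 0, 1),
            nrow = 3,
            dimnames = list(NULL, c("a1:b1", "a2:b1")))

max.col(m)                         # row 2 gets 1 or 2 at random, never NA
max.col(m, ties.method = "first")  # deterministic, but still not NA
```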

I'm relatively new to R, and even newer to LMMs, but my understanding is that when a level is missing, predictInterval computes the random effect for another level, selected at random from the (observed) levels of the other queries. A workaround that seems to work for me is to change the offending line to:

tmp2 <- tmp[keep]
# Blank out all-zero rows so max.col() indexes them as NA
tmp2[!as.logical(rowSums(tmp2 != 0)), ] <- NA
tmp$var <- names(tmp[keep])[max.col(tmp2)]

Quite dirty, but the idea is to fill the all-zero rows with NAs, so that max.col will return NA for them.
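Applied to a toy stand-in for tmp[keep] matching the table above, the workaround behaves as intended (a sketch; the data frame and the name picked are made up for illustration, and only the middle three lines mirror the proposed fix):

```r
# Toy stand-in for tmp[keep]; row 2 has no observed a:b level.
tmp_keep <- data.frame(`a1:b1` = c(1, 0, 0),
                       `a2:b1` = c(0, 0, 1),
                       check.names = FALSE)

tmp2 <- tmp_keep
tmp2[!as.logical(rowSums(tmp2 != 0)), ] <- NA  # blank out all-zero rows
picked <- names(tmp_keep)[max.col(tmp2)]       # all-NA rows index as NA

picked  # "a1:b1" NA "a2:b1"
```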

Unfortunately I cannot share my actual model/dataset at the moment, but if the issue is unclear I'm happy to try to build a small example to reproduce the problem.
