
Nonreproducible predictInterval results when mixing observed and unobserved levels #124

@dotPiano

Description

Hi,

I have trained an LMM of the form y ~ 1 + (1|a) + (1|a:b) and now want to predict y for different inputs. All the inputs are stored in a single dataframe, which I pass to predictInterval. In my query dataframe, all levels of a are present in the training dataset, while the a:b combinations are only sometimes present.

I noticed that in some cases multiple calls to predictInterval return significantly different intervals (differences larger than simple sampling noise). Surprisingly, the intervals obtained by calling predictInterval on the entire dataframe and on each row individually are also consistently different.

  • If all a:b in the query are present in the training dataset, I don't see issues.
  • If none of the a:b in the query are present in the training dataset, I don't see issues.
  • If some a:b in the query are present in the training dataset and some are not, I see the issue.

I tried to debug the code of the function, and I believe that the problem lies here: https://github.com/jknowles/merTools/blob/master/R/merPredict.R#L246

 tmp$var <- names(tmp[keep])[max.col(tmp[keep])] 

With my input, tmp[keep] looks like this:

                       a1:b1                     a2:b1
2                          1                         0
3                          0                         0
4                          0                         1

Queries 2 and 4 have corresponding levels in the training set, while query 3 doesn't. I would expect tmp$var to take the values a1:b1, NA, a2:b1, but instead it takes either a1:b1, a1:b1, a2:b1 or a1:b1, a2:b1, a2:b1 at random. There are two problems here:

  1. max.col will return a value (and thus select a column/level) even if the entire row is zero (instead of returning NA).
  2. Since an all-zero row is a tie across every column, max.col by default (ties.method = "random") selects a column at random.
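Both behaviours are easy to confirm in base R with a small indicator matrix mirroring tmp[keep] above (a sketch; the matrix and its values are made up to match the table, not taken from my data):

```r
# Indicator matrix mirroring tmp[keep]: row 2 has no observed level.
m <- matrix(c(1, 0, 0,
              0, 0, 1),
            nrow = 3,
            dimnames = list(NULL, c("a1:b1", "a2:b1")))

max.col(m)                         # row 2 gets 1 or 2 at random, never NA
max.col(m, ties.method = "first")  # deterministic, but still not NA
```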

I'm relatively new to R, and even newer to LMMs, but my understanding is that when a level is missing, predictInterval computes the random effect for another level, selected at random from the (observed) levels of the other queries. A workaround that seems to work for me is to change the offending line to:

tmp2 <- tmp[keep]
# Blank out all-zero rows so max.col() indexes them as NA
tmp2[!as.logical(rowSums(tmp2 != 0)), ] <- NA
tmp$var <- names(tmp[keep])[max.col(tmp2)]

Quite dirty, but the idea is to fill the all-zero rows with NAs, so that max.col will return NA for them.
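Applied to a toy stand-in for tmp[keep] matching the table above, the workaround behaves as intended (a sketch; the data frame and the name picked are made up for illustration, and only the middle three lines mirror the proposed fix):

```r
# Toy stand-in for tmp[keep]; row 2 has no observed a:b level.
tmp_keep <- data.frame(`a1:b1` = c(1, 0, 0),
                       `a2:b1` = c(0, 0, 1),
                       check.names = FALSE)

tmp2 <- tmp_keep
tmp2[!as.logical(rowSums(tmp2 != 0)), ] <- NA  # blank out all-zero rows
picked <- names(tmp_keep)[max.col(tmp2)]       # all-NA rows index as NA

picked  # "a1:b1" NA "a2:b1"
```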

Unfortunately I cannot share my actual model/dataset at the moment, but if the issue is unclear I'm happy to try to build a small example to reproduce the problem.
