Error when using custom covariates for prediction model

**Describe the bug**
We are conducting a study on hospital sepsis prediction at discharge.
Following the guidelines, we selected as covariates the variation of several laboratory tests, considering both baseline and 48/72-hour measurements.
Since these variables are not included among the package’s default covariates, we defined custom covariates.
However, when running the model, we encounter an error.
Our hypothesis is that the issue may be related to the data structure: our records represent visits rather than individual patients. As a result, the model may not be able to correctly associate multiple values from the same patient with the appropriate visit.

**Set up (please run in R "sessionInfo()" and copy the output here):**

> R version 4.4.1 (2024-06-14 ucrt)
> Platform: x86_64-w64-mingw32/x64
> Running under: Windows 10 x64 (build 19044)
> 
> Matrix products: default
> 
> 
> locale:
> [1] LC_COLLATE=Italian_Italy.utf8  LC_CTYPE=Italian_Italy.utf8    LC_MONETARY=Italian_Italy.utf8
> [4] LC_NUMERIC=C                   LC_TIME=Italian_Italy.utf8    
> 
> time zone: Europe/Rome
> tzcode source: internal
> 
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base     
> 
> other attached packages:
>  [1] reticulate_1.45.0            SqlRender_1.19.4             PatientLevelPrediction_6.5.1 FeatureExtraction_3.12.0    
>  [5] Andromeda_1.2.0              dplyr_1.2.0                  CohortGenerator_1.1.0        R6_2.6.1                    
>  [9] DatabaseConnector_7.1.0      stringi_1.8.7                here_1.0.2                  
> 
> loaded via a namespace (and not attached):
>  [1] generics_0.1.4       RSQLite_2.4.6        lattice_0.22-6       magrittr_2.0.4       grid_4.4.1          
>  [6] fastmap_1.2.0        blob_1.3.0           rprojroot_2.1.1      jsonlite_2.0.0       Matrix_1.7-0        
> [11] backports_1.5.0      DBI_1.3.0            survival_3.8-6       urltools_1.7.3.1     purrr_1.2.1         
> [16] duckdb_1.4.4         cli_3.6.5            rlang_1.1.7          dbplyr_2.5.2         triebeard_0.4.1     
> [21] splines_4.4.1        bit64_4.6.0-1        withr_3.0.2          cachem_1.1.0         otel_0.2.0          
> [26] tools_4.4.1          memuse_4.2-3         ParallelLogger_3.5.1 memoise_2.0.1        checkmate_2.3.4     
> [31] vctrs_0.7.1          png_0.1-8            lifecycle_1.0.5      bit_4.6.0            pkgconfig_2.0.3     
> [36] rJava_1.0-14         pillar_1.11.1        glue_1.8.0           Rcpp_1.1.1           tibble_3.3.1        
> [41] tidyselect_1.2.1

**To Reproduce**
```
library(FeatureExtraction)
library(PatientLevelPrediction)
library(SqlRender)

# Define output folder ----
outputFolder <- here::here("path/to/folder")   
# Create output folder if it doesn't exist
if (!file.exists(outputFolder)){
  dir.create(outputFolder, recursive = TRUE)}

# Create a custom function to calculate the changes in laboratory test results 
# for each hospital admission
getMeasurementDeltaCovariateData <- function(connection,
                                             tempEmulationSchema = NULL,
                                             cdmDatabaseSchema,
                                             cohortTable = "#cohort_person",
                                             cohortIds = -1,
                                             cdmVersion = "5",
                                             rowIdField = "visit_occurrence_id",
                                             covariateSettings,
                                             aggregated = FALSE,
                                             minCharacterizationMean = 0.001) {
  
  writeLines("Building custom delta measurement covariates...")
  
  if (aggregated) {
    stop("Aggregated not supported")
  }
  
  renderTranslateExecuteSql(connection,
                            sql = readSql(here("sql", "custom-covariates-SETUP.sql")),
                            vocabulary_database_schema = cdmDatabaseSchema,
                            cohort_table = cohortTable,
                            cohort_id_ricoveri = 2,
                            cdm_schema = cdmDatabaseSchema
  )
  
  covariates <- renderTranslateQuerySql(
    connection,
    sql = readSql(here("sql", "custom-covariates-SELECT.sql")),
    snakeCaseToCamelCase = TRUE,
    cdm_schema = cdmDatabaseSchema
  )
  
  covariateNames <- unique(
    covariates[, c("covariateId", "conceptName", "measurementConceptId")]
  )
  
  # Remove conceptName 
  covariates <- covariates[, c("rowId", "covariateId", "covariateValue")]
  
  # Create covariateRef
  covariateRef <- data.frame(
    covariateId   = covariateNames$covariateId,
    covariateName = paste0(
      "Delta within 48h ±12h from baseline: ",
      covariateNames$conceptName
    ),
    analysisId    = 10000,
    conceptId     = covariateNames$measurementConceptId
  )
  analysisRef <- data.frame(
    analysisId = 10000,
    analysisName = "Measurement Delta 48h ±12h",
    domainId = "Measurement",
    startDay = -3,
    endDay = 0,
    isBinary = "N",
    missingMeansZero = "N"
  )
  
  # Create Andromeda object
  result <- Andromeda::andromeda(
    covariates = covariates,
    covariateRef = covariateRef,
    analysisRef = analysisRef
  )
  
  attr(result, "metaData") <- list(populationSize = length(unique(covariates$rowId)))
  class(result) <- "CovariateData"
  
  return(result)
}

# Create settings
createMeasurementDeltaSettings <- function() {
  covariateSettings <- list(useMeasurementDelta = TRUE)
  attr(covariateSettings, "fun") <- "getMeasurementDeltaCovariateData"
  class(covariateSettings) <- "covariateSettings"
  return(covariateSettings)
}

# ------------------------------------------------------------------------------
# Data and Feature Extraction
# ------------------------------------------------------------------------------

#Delta lab variables
MeasurementDelta <- createMeasurementDeltaSettings()

#test only for measurement_delta
covariateSettingsList <- list(MeasurementDelta)
                      
covariateData <- getDbCovariateData(
  connectionDetails = connectionDetails,
  cdmDatabaseSchema = cdmDatabaseSchema,
  cohortDatabaseSchema = cohortDatabaseSchema,
  cohortTable = cohortTable,
  cohortIds = 2, 
  rowIdField = "person_id",
  covariateSettings = covariateSettingsList
)

summary(covariateData)

covariates_overview <- merge(covariateData$covariates, covariateData$covariateRef, by = "covariateId")

databaseDetails <- createDatabaseDetails(
  connectionDetails = connectionDetails,
  cdmDatabaseSchema = cdmDatabaseSchema,
  cdmDatabaseName = cdmDatabaseName,
  cohortDatabaseSchema = resultsDatabaseSchema,
  cohortTable = cohortTable,
  targetId = 2,
  outcomeDatabaseSchema = resultsDatabaseSchema,
  outcomeTable = cohortTable,
  outcomeIds = 1,
  cdmVersion = 5
)

restrictPlpDataSettings <- createRestrictPlpDataSettings()

plpData <- getPlpData(
  databaseDetails = databaseDetails,
  covariateSettings = covariateSettingsList,
  restrictPlpDataSettings = restrictPlpDataSettings
)

# ------------------------------------------------------------------------------
# Model training (person split or time split)
# ------------------------------------------------------------------------------
populationSettings <- createStudyPopulationSettings(
  removeSubjectsWithPriorOutcome = F)

splitSettings <- createDefaultSplitSetting(
  trainFraction = 0.75,
  testFraction = 0.25,
  type = 'stratified',
  nfold = 5,
  splitSeed = 34568
)

sampleSettings <- createSampleSettings()

featureEngineeringSettings <- createSimpleImputer(method = "median", missingThreshold = 0.8)

preprocessSettings <- createPreprocessSettings(
  minFraction = 0,
  normalize = F,
  removeRedundancy = F
)

# Set ML models 
lrModel  <- setLassoLogisticRegression()

lrResults <- runPlp(
  plpData = plpData,
  outcomeId = 1,
  analysisId = 'singleDemo',
  analysisName = 'Demonstration of runPlp for training single PLP models',
  populationSettings = populationSettings,
  splitSettings = splitSettings,
  sampleSettings = sampleSettings,
  featureEngineeringSettings = featureEngineeringSettings,
  preprocessSettings = preprocessSettings,
  modelSettings = lrModel,
  logSettings = createLogSettings(),
  executeSettings = createExecuteSettings(
    runSplitData = T,
    runSampleData = T,
    runFeatureEngineering = T,
    runPreprocessData = T,
    runModelDevelopment = T,
    runCovariateSummary = F
  ),
  saveDirectory = file.path(getwd(), 'singlePlp'))
```

**PLP Log File**
[plpLog.txt](https://github.com/user-attachments/files/28588589/plpLog.txt)

**Additional context**
Cohorts JSON files:

- [2.json](https://github.com/user-attachments/files/28588648/2.json) (Inpatient visit cohort)
- [1.json](https://github.com/user-attachments/files/28588649/1.json) (Sepsis cohort)

Custom covariate files:

- [custom-covariates-SETUP.sql](https://github.com/user-attachments/files/28588751/custom-covariates-SETUP.sql)
- [custom-covariates-SELECT.sql](https://github.com/user-attachments/files/28588750/custom-covariates-SELECT.sql)

plpData[["cohorts"]]
| rowId | subjectId | targetId | cohortStartDate | daysFromObsStart | daysToCohortEnd | daysToObsEnd | ageYear | gender |
|-------|-----------|----------|-----------------|------------------|-----------------|--------------|---------|--------|
| 1  | 1  | 2 | 2023-01-01 | 9   | 0 | 473  | 46  | 8532 |
| 2  | 2  | 2 | 2022-07-16 | 324 | 0 | 5    | 102 | 8507 |
| 3  | 3  | 2 | 2021-09-15 | 28  | 0 | 1435 | 77  | 8532 |
| 4  | 3  | 2 | 2022-01-06 | 141 | 0 | 1322 | 77  | 8532 |
| 5  | 3  | 2 | 2022-02-05 | 171 | 0 | 1292 | 77  | 8532 |
| 6  | 3  | 2 | 2022-04-10 | 235 | 0 | 1228 | 78  | 8532 |
| 7  | 4  | 2 | 2020-08-07 | 125 | 0 | 1319 | 88  | 8507 |
| 8  | 5  | 2 | 2020-10-04 | 0   | 0 | 77   | 96  | 8507 |
| 9  | 6  | 2 | 2021-08-18 | 110 | 0 | 1342 | 84  | 8532 |
| 10 | 7  | 2 | 2021-08-27 | 0   | 0 | 2    | 39  | 8532 |
| 11 | 8  | 2 | 2021-01-10 | 27  | 0 | 1817 | 62  | 8507 |
| 12 | 9  | 2 | 2020-10-17 | 0   | 0 | 32   | 103 | 8532 |
| 13 | 10 | 2 | 2020-06-08 | 43  | 0 | 1855 | 55  | 8507 |
| 14 | 10 | 2 | 2020-10-03 | 160 | 0 | 1738 | 55  | 8507 |
| 15 | 10 | 2 | 2020-10-24 | 181 | 0 | 1717 | 55  | 8507 |
| 16 | 10 | 2 | 2021-04-27 | 366 | 0 | 1532 | 56  | 8507 |
| 17 | 10 | 2 | 2022-05-01 | 735 | 0 | 1163 | 57  | 8507 |

plpData[["outcome"]]
| rowId | outcomeId | daysToEvent |
|-------|-----------|-------------|
| 2     | 1         | 5           |



Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Error when using custom covariates for prediction model #658

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

rowId	subjectId	targetId	cohortStartDate	daysFromObsStart	daysToObsEnd	ageYear	gender
1	1	2	2023-01-01	9	473	46	8532
2	2	2	2022-07-16	324	5	102	8507
3	3	2	2021-09-15	28	1435	77	8532
4	3	2	2022-01-06	141	1322	77	8532
5	3	2	2022-02-05	171	1292	77	8532
6	3	2	2022-04-10	235	1228	78	8532
7	4	2	2020-08-07	125	1319	88	8507
8	5	2	2020-10-04	0	77	96	8507
9	6	2	2021-08-18	110	1342	84	8532
10	7	2	2021-08-27	0	2	39	8532
11	8	2	2021-01-10	27	1817	62	8507
12	9	2	2020-10-17	0	32	103	8532
13	10	2	2020-06-08	43	1855	55	8507
14	10	2	2020-10-03	160	1738	55	8507
15	10	2	2020-10-24	181	1717	55	8507
16	10	2	2021-04-27	366	1532	56	8507
17	10	2	2022-05-01	735	1163	57	8507

Error when using custom covariates for prediction model #658

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions