Hi @msuchard ,
I believe currently the intercept is initialized as zero (at least for logistic regression, only model I checked). Given we are often dealing with rare outcomes I think it would be better to use the logit of the outcome rate ( log(p / (1 - p)) ). In this case the model converges faster (as measured in fever iterations, see below) and seems to give the same results. I do think this could end up giving ~ 20-30% speed up for outcome rates of 1-3% which are common in our data. The example below for Eunomia gives a speedup of over 2x with outcome rate of 2%, but it's rather small data (n=1800, p=372). There's also a pure Cyclops reprex at the bottom giving about 30% fewer iterations but I had to change the simulateCyclopsData to give 1-3% outcome rates.
Of course I could supply startingCoefficients myself from the PLP to add this. But if there are no downsides maybe this should be the default in Cyclops ? What do you think ?
reprex:
library(PatientLevelPrediction)
library(Cyclops)
connectionDetails <- Eunomia::getEunomiaConnectionDetails()
#> attempting to download GiBleed
#> attempting to extract and load: /tmp/RtmpplDETg/GiBleed_5.3.zip to: /tmp/RtmpplDETg/GiBleed_5.3.sqlite
Eunomia::createCohorts(connectionDetails)
#> Cohorts created in table main.cohort
#> cohortId name
#> 1 1 Celecoxib
#> 2 2 Diclofenac
#> 3 3 GiBleed
#> 4 4 NSAIDs
#> description
#> 1 A simplified cohort definition for new users of celecoxib, designed specifically for Eunomia.
#> 2 A simplified cohort definition for new users ofdiclofenac, designed specifically for Eunomia.
#> 3 A simplified cohort definition for gastrointestinal bleeding, designed specifically for Eunomia.
#> 4 A simplified cohort definition for new users of NSAIDs, designed specifically for Eunomia.
#> count
#> 1 1844
#> 2 850
#> 3 479
#> 4 2694
outcomeId <- 3 # GIbleed
seed <- 42
databaseDetails <- createDatabaseDetails(
connectionDetails = connectionDetails,
cdmDatabaseSchema = "main",
cdmDatabaseName = "main",
cohortDatabaseSchema = "main",
cohortTable = "cohort",
outcomeDatabaseSchema = "main",
outcomeTable = "cohort",
targetId = 1,
outcomeIds = outcomeId,
cdmVersion = 5)
#> No cdm database id entered so using cdmDatabaseSchema - if cdmDatabaseSchema is the same for multiple different databases, please use cdmDatabaseId to specify a unique identifier for the database and version
covariateSettings <- FeatureExtraction::createDefaultCovariateSettings()
plpData <- getPlpData(databaseDetails = databaseDetails,
covariateSettings = covariateSettings,
restrictPlpDataSettings = createRestrictPlpDataSettings())
#> Connecting using SQLite driver
#>
#> Constructing the at risk cohort
#> | | | 0% | |======================= | 33% | |=============================================== | 67% | |======================================================================| 100%
#> Executing SQL took 0.026 secs
#> Fetching cohorts from server
#> Loading cohorts took 0.031 secs
#> Constructing features on server
#> | | | 0% | |= | 1% | |= | 2% | |== | 2% | |== | 3% | |=== | 4% | |=== | 5% | |==== | 5% | |==== | 6% | |===== | 7% | |===== | 8% | |====== | 8% | |====== | 9% | |======= | 9% | |======= | 10% | |======= | 11% | |======== | 11% | |======== | 12% | |========= | 12% | |========= | 13% | |========= | 14% | |========== | 14% | |========== | 15% | |=========== | 15% | |=========== | 16% | |============ | 17% | |============ | 18% | |============= | 18% | |============= | 19% | |============== | 20% | |============== | 21% | |=============== | 21% | |=============== | 22% | |================ | 22% | |================ | 23% | |================= | 24% | |================= | 25% | |================== | 25% | |================== | 26% | |=================== | 27% | |=================== | 28% | |==================== | 28% | |==================== | 29% | |===================== | 29% | |===================== | 30% | |====================== | 31% | |====================== | 32% | |======================= | 32% | |======================= | 33% | |======================== | 34% | |======================== | 35% | |========================= | 35% | |========================= | 36% | |========================== | 36% | |========================== | 37% | |========================== | 38% | |=========================== | 38% | |=========================== | 39% | |============================ | 39% | |============================ | 40% | |============================ | 41% | |============================= | 41% | |============================= | 42% | |============================== | 42% | |============================== | 43% | |=============================== | 44% | |=============================== | 45% | |================================ | 45% | |================================ | 46% | |================================= | 47% | |================================= | 48% | |================================== | 48% | |================================== | 49% | |=================================== | 50% | |==================================== | 51% | |==================================== | 52% | |===================================== | 52% | |===================================== | 53% | |====================================== | 54% | |====================================== | 55% | |======================================= | 55% | |======================================= | 56% | |======================================== | 57% | |======================================== | 58% | |========================================= | 58% | |========================================= | 59% | |========================================== | 59% | |========================================== | 60% | |========================================== | 61% | |=========================================== | 61% | |=========================================== | 62% | |============================================ | 62% | |============================================ | 63% | |============================================ | 64% | |============================================= | 64% | |============================================= | 65% | |============================================== | 65% | |============================================== | 66% | |=============================================== | 67% | |=============================================== | 68% | |================================================ | 68% | |================================================ | 69% | |================================================= | 70% | |================================================= | 71% | |================================================== | 71% | |================================================== | 72% | |=================================================== | 72% | |=================================================== | 73% | |==================================================== | 74% | |==================================================== | 75% | |===================================================== | 75% | |===================================================== | 76% | |====================================================== | 77% | |====================================================== | 78% | |======================================================= | 78% | |======================================================= | 79% | |======================================================== | 79% | |======================================================== | 80% | |========================================================= | 81% | |========================================================= | 82% | |========================================================== | 82% | |========================================================== | 83% | |=========================================================== | 84% | |=========================================================== | 85% | |============================================================ | 85% | |============================================================ | 86% | |============================================================= | 86% | |============================================================= | 87% | |============================================================= | 88% | |============================================================== | 88% | |============================================================== | 89% | |=============================================================== | 89% | |=============================================================== | 90% | |=============================================================== | 91% | |================================================================ | 91% | |================================================================ | 92% | |================================================================= | 92% | |================================================================= | 93% | |================================================================== | 94% | |================================================================== | 95% | |=================================================================== | 95% | |=================================================================== | 96% | |==================================================================== | 97% | |==================================================================== | 98% | |===================================================================== | 98% | |===================================================================== | 99% | |======================================================================| 100%
#> Executing SQL took 3.26 secs
#> Fetching data from server
#> Fetching data took 0.312 secs
#> Fetching outcomes from server
#> Loading outcomes took 0.0241 secs
population <- createStudyPopulation(
plpData,
outcomeId = outcomeId,
populationSettings = createStudyPopulationSettings(
minTimeAtRisk = 1,
riskWindowEnd = 15) # short TAR end for lower outcome rate
)
#> outcomeId: 3
#> binary: TRUE
#> includeAllOutcomes: TRUE
#> firstExposureOnly: FALSE
#> washoutPeriod: 0
#> removeSubjectsWithPriorOutcome: TRUE
#> priorOutcomeLookback: 99999
#> requireTimeAtRisk: TRUE
#> minTimeAtRisk: 1
#> restrictTarToCohortEnd: FALSE
#> riskWindowStart: 1
#> startAnchor: cohort start
#> riskWindowEnd: 15
#> endAnchor: cohort start
#> restrictTarToCohortEnd: FALSE
#> Removing subjects with prior outcomes (if any)
#> Removing non outcome subjects with insufficient time at risk (if any)
#> Outcome is 0 or 1
#> Population created with: 1800 observations, 1800 unique subjects and 38 outcomes
#> Population created in 0.044 secs
population$y <- population$outcomeCount
plpData$covariateData$outcomes <- population
cyclopsData <- convertToCyclopsData(plpData$covariateData$outcomes,
plpData$covariateData$covariates,
modelType = "lr",
addIntercept = TRUE)
prior <- createPrior("laplace", exclude = c(0), useCrossValidation = FALSE)
control <- createControl(noiseLevel = "quiet",
cvRepetitions = 1,
tolerance = 1e-9,
seed = seed)
out <- capture.output(fit <- fitCyclopsModel(cyclopsData, prior, control))
# Now fit the same model with starting coefficients
cyclopsData <- convertToCyclopsData(plpData$covariateData$outcomes,
plpData$covariateData$covariates,
modelType = "lr",
addIntercept = TRUE)
prior <- createPrior("laplace", exclude = c(0), useCrossValidation = FALSE)
control <- createControl(noiseLevel = "quiet",
cvRepetitions = 1,
seed = seed,
tolerance = 1e-9)
ncovars <- length(summary(cyclopsData)$covariateId)
startingCoefficients <- rep(0, ncovars)
outcomeRate <- mean(population$y)
startingCoefficients[1] <- log(outcomeRate / (1 - outcomeRate))
out2 <- capture.output(fit2 <- fitCyclopsModel(cyclopsData,
prior,
control,
startingCoefficients = startingCoefficients))
message("Original Cyclops fit complete in ", round(fit$timeFit, 4), " seconds. And ",
fit$iterations, " iterations. With logLikelihood: ", fit$log_likelihood)
#> Original Cyclops fit complete in 0.018 seconds. And 125 iterations. With logLikelihood: -166.622177178957
message("Modified Cyclops fit complete in ", round(fit2$timeFit, 4), " seconds. And ",
fit2$iterations, " iterations. With logLikelihood: ", fit2$log_likelihood)
#> Modified Cyclops fit complete in 0.0077 seconds. And 67 iterations. With logLikelihood: -166.622177181422
Created on 2025-01-29 with reprex v2.1.1
Pure Cyclops reprex:
library(Cyclops)
# modify https://github.com/OHDSI/Cyclops/blob/develop/R/Simulation.R#L80C1-L80C63
# strataBackgroundProb <- runif(nstrata,min=0.01,max=0.03)
seed <- 666
set.seed(seed)
nrows <- 10000
ncovars <- 500
data <- simulateCyclopsData(nstrata = 1,
nrows = nrows,
ncovars = ncovars,
model = "logistic")
#> Sparseness = 98.99538 %
cyclopsData <- convertToCyclopsData(data$outcomes,
data$covariates,
modelType = "lr",
addIntercept = TRUE)
#> Sorting covariates by covariateId and rowId
prior <- createPrior("laplace", exclude = c(0), useCrossValidation = FALSE)
control <- createControl(noiseLevel = "quiet",
cvType = "auto",
cvRepetitions = 1,
seed = seed)
out <- capture.output(fit <- fitCyclopsModel(cyclopsData, prior, control))
# Now fit the same model with starting coefficients
cyclopsData <- convertToCyclopsData(data$outcomes,
data$covariates,
modelType = "lr",
addIntercept = TRUE)
#> Sorting covariates by covariateId and rowId
prior <- createPrior("laplace", exclude = c(0), useCrossValidation = FALSE)
control <- createControl(noiseLevel = "quiet",
cvRepetitions = 1,
seed = seed)
startingCoefficients <- rep(0, ncovars + 1)
outcomeRate <- mean(data$outcomes$y)
startingCoefficients[1] <- log(outcomeRate / (1 - outcomeRate))
out2 <- capture.output(fit2 <- fitCyclopsModel(cyclopsData,
prior,
control,
startingCoefficients = startingCoefficients))
message("Original Cyclops fit complete in ", round(fit$timeFit, 4), " seconds. And ",
fit$iterations, " iterations. With logLikelihood: ", fit$log_likelihood)
#> Original Cyclops fit complete in 0.0686 seconds. And 17 iterations. With logLikelihood: -756.202262114616
message("Modified Cyclops fit complete in ", round(fit2$timeFit, 4), " seconds. And ",
fit2$iterations, " iterations. With logLikelihood: ", fit2$log_likelihood)
#> Modified Cyclops fit complete in 0.0472 seconds. And 12 iterations. With logLikelihood: -756.202248337887
Created on 2025-01-29 with reprex v2.1.1
Hi @msuchard ,
I believe currently the intercept is initialized as zero (at least for logistic regression, only model I checked). Given we are often dealing with rare outcomes I think it would be better to use the
logitof the outcome rate (log(p / (1 - p))). In this case the model converges faster (as measured in fever iterations, see below) and seems to give the same results. I do think this could end up giving ~ 20-30% speed up for outcome rates of 1-3% which are common in our data. The example below for Eunomia gives a speedup of over 2x with outcome rate of 2%, but it's rather small data (n=1800, p=372). There's also a pureCyclopsreprex at the bottom giving about 30% fewer iterations but I had to change thesimulateCyclopsDatato give 1-3% outcome rates.Of course I could supply
startingCoefficientsmyself from the PLP to add this. But if there are no downsides maybe this should be the default inCyclops? What do you think ?reprex:
Created on 2025-01-29 with reprex v2.1.1
Pure
Cyclopsreprex:Created on 2025-01-29 with reprex v2.1.1