PreTrained-Subtyping

A one-shot summary statistics-based knowledge transfer framework.

Outline

  1. Description
  2. Multi-system inflammatory syndrome in children (MIS-C) analysis with synthetic data

Description

Subtyping divides a broad and complex population into several distinct subgroups for a more comprehensive understanding of the population under study. However, subtyping tasks often require a large sample size to ensure reliable estimates. When the target study has a limited sample size, leveraging information from similar source studies can make subtyping in the target study more generalizable and robust, and can refine existing knowledge. We therefore proposed PreTrained-Subtyping, a one-shot, summary statistics-based framework that transfers the knowledge of shared subtypes pre-trained in a large-scale source study to a target study to achieve joint learning. This page provides R code that illustrates the framework using a classical subtyping approach, latent class analysis (LCA), and synthesized datasets of multi-system inflammatory syndrome in children (MIS-C) to explore MIS-C subphenotypes.

MIS-C analysis with synthetic data

MIS-C is a rare, severe post-acute hyperinflammatory disorder associated with SARS-CoV-2 infection. MIS-C patients commonly present with highly heterogeneous signs and symptoms, so identifying subphenotypes that further characterize MIS-C disease patterns can improve its recognition and management. We generated patient-level observations of the involvement of eight organ systems (i.e., eight binary variables) for one large-scale source study with 3,000 patients and one small target study with only 200 patients. The PreTrained-Subtyping approach is applied in the following steps.

Step 1: load the required R packages.

library(poLCA)
library(BayesLCA)
library(ROCR)

set.seed(200) # to reproduce the results

The R code has been tested in R version 4.4.1 (2024-06-14).

Step 2: extract pre-trained knowledge from the source study.

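In practice, the patient-level source study data are not available, and only the LCA parameter estimates and their standard deviations (standard errors) are needed from the source study. In this illustration, they are obtained by fitting an LCA to the synthetic source data.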

load("source_study_data.rda")
data.ex <- as.data.frame(ex.data)
for(i in 1:8) data.ex[,i] <- factor(data.ex[,i])

fmla <- as.formula(paste('cbind(',paste(colnames(data.ex),collapse = ','),')~1'))
C <- 3 # three latent classes
q <- 8 # eight binary variables
ex.fit <- poLCA(fmla, data.ex, nclass = C, maxiter = 1000, tol = 1e-10, probs.start = NULL, nrep = 10, verbose = FALSE)

# parameter estimates
ex.pi.mat <- matrix(unlist(ex.fit$probs), ncol=C, byrow=TRUE)[-seq(1, 2*q, by = 2),]
# the related standard deviations
ex.pi.sd <- matrix(unlist(ex.fit$probs.se), ncol=C, byrow=TRUE)[-seq(1, 2*q, by = 2),]
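
The reshaping in the last two assignments is compact, so the following toy sketch may help to see what it keeps. It assumes, as in poLCA, that probs is a list with one matrix per item (rows are latent classes, columns are response levels) and that the binary variables are coded so that the second level means "involved". The object names toy.probs, stacked, toy.pi.mat, and explicit, and all numbers, are made up for illustration.

```r
# Toy stand-in for poLCA's `probs` output: one C x 2 matrix per item,
# rows = latent classes, columns = Pr(level 1), Pr(level 2).
# All numbers below are made up for illustration.
C <- 3  # latent classes
q <- 2  # binary items (the analysis above uses q = 8)
toy.probs <- list(
  item1 = matrix(c(0.9, 0.1,
                   0.4, 0.6,
                   0.2, 0.8), nrow = C, byrow = TRUE),
  item2 = matrix(c(0.7, 0.3,
                   0.5, 0.5,
                   0.1, 0.9), nrow = C, byrow = TRUE)
)

# Same reshaping as in Step 2: unlist() walks each matrix column by column,
# so with byrow = TRUE the rows alternate Pr(level 1), Pr(level 2) per item.
stacked <- matrix(unlist(toy.probs), ncol = C, byrow = TRUE)

# Dropping the odd rows keeps Pr(level 2) only: a q x C matrix.
toy.pi.mat <- stacked[-seq(1, 2 * q, by = 2), ]

# A more explicit extraction of the same quantity, for comparison.
explicit <- t(sapply(toy.probs, function(m) m[, 2]))

print(toy.pi.mat)
print(all.equal(toy.pi.mat, unname(explicit)))
```

Each row of the result is one item and each column one latent class, which is why Step 3 transposes the matrix before building the priors.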

The pre-trained subphenotype information obtained from the source study, i.e., ex.pi.mat, is displayed in the following heatmap.
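
Because only summary statistics leave the source study, the hand-off between sites can be as small as two q x C matrices. The sketch below shows one way this exchange might look, assuming the source site serializes the matrices with saveRDS. The names toy.est, toy.se, summary.file, and src, the temporary file, and all values are hypothetical and not part of the repository.

```r
# Toy q x C summaries standing in for ex.pi.mat and ex.pi.sd (made-up values).
toy.est <- matrix(c(0.80, 0.30, 0.10,
                    0.60, 0.20, 0.50), nrow = 2, byrow = TRUE)
toy.se  <- matrix(0.02, nrow = 2, ncol = 3)

# Source site: write only the summaries; no patient-level rows are included.
# The path is a temporary placeholder, not a file from the repository.
summary.file <- tempfile(fileext = ".rds")
saveRDS(list(pi.mat = toy.est, pi.sd = toy.se), summary.file)

# Target site: read the summaries back and sanity-check them before use.
src <- readRDS(summary.file)
stopifnot(identical(dim(src$pi.mat), dim(src$pi.sd)),
          all(src$pi.mat >= 0 & src$pi.mat <= 1),
          all(src$pi.sd >= 0))
print(src$pi.mat)
```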

Step 3: incorporate the pre-trained knowledge from the source study into the latent class analysis in the target study.

load("target_study_data.rda")

blca.y <- data.blca(in.data)

# get parameter estimates from the source study
pi.mean <- t(ex.pi.mat)
# get parameter estimate standard deviations from the source study
pi.sd <- t(ex.pi.sd)

# prepare prior distributions using pre-trained knowledge
pi.upper <- pi.mean + 1.96*pi.sd
pi.upper[pi.upper > 1] <- 1
pi.lower <- pi.mean - 1.96*pi.sd
pi.lower[pi.lower < 0] <- 0
b <- (pi.upper - pi.lower)/4
a <- pi.mean
b[b < 1e-10] <- 1e-10
a[a < 1e-10] <- 1e-10
prior.alpha <- a^2*(1 - a)/b^2 - a
prior.beta <- prior.alpha*(1 - a)/a

# update pre-trained knowledge using data from the target study
PT.Sub.fit <- blca.em(blca.y, C, alpha = prior.alpha, beta = prior.beta, restarts = 10,
                      verbose = FALSE)

# the updated knowledge
PT.Sub.pi <- t(PT.Sub.fit$itemprob)

# the class mixing proportions in the target study
PT.Sub.lam <- PT.Sub.fit$classprob
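
The expressions for prior.alpha and prior.beta match the standard method-of-moments formulas for a Beta distribution with mean a and standard deviation b. The sketch below checks this numerically for a single made-up estimate. The helper beta_from_mean_sd is a hypothetical name introduced here, not a BayesLCA function, and est and se are illustrative values.

```r
# Hypothetical helper (not part of BayesLCA): Beta(alpha, beta) parameters
# whose mean is `m` and whose standard deviation is `s`.
beta_from_mean_sd <- function(m, s) {
  # the formulas give positive parameters only if s^2 < m * (1 - m)
  stopifnot(all(m > 0 & m < 1), all(s^2 < m * (1 - m)))
  alpha <- m^2 * (1 - m) / s^2 - m   # same expression as prior.alpha above
  beta  <- alpha * (1 - m) / m       # same expression as prior.beta above
  list(alpha = alpha, beta = beta)
}

# Made-up source-study summary for one item in one class.
est <- 0.30   # estimated item-response probability
se  <- 0.02   # its standard error

# As in Step 3: clip the 95% interval to [0, 1], then use width / 4 as the sd.
upper <- min(est + 1.96 * se, 1)
lower <- max(est - 1.96 * se, 0)
s <- (upper - lower) / 4

pr <- beta_from_mean_sd(est, s)

# Mean and sd implied by the resulting Beta parameters.
implied.mean <- pr$alpha / (pr$alpha + pr$beta)
implied.sd <- sqrt(pr$alpha * pr$beta /
                   ((pr$alpha + pr$beta)^2 * (pr$alpha + pr$beta + 1)))
print(c(alpha = pr$alpha, beta = pr$beta, mean = implied.mean, sd = implied.sd))
```

When the interval is not clipped, its width divided by 4 equals 0.98 times the standard error, so the prior standard deviation is close to the reported one. Clipping at 0 or 1 narrows the interval and therefore tightens the prior.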

The updated subphenotype information, i.e., PT.Sub.pi, is shown in the following heatmap.

The estimated subphenotype mixing proportions in the target study, i.e., PT.Sub.lam, are 0.44, 0.39, and 0.17 for classes 1, 2, and 3, respectively.
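
Given item probabilities and mixing proportions of this form, a patient can be assigned to a subphenotype with Bayes' rule under the usual LCA local-independence assumption. The following is a minimal sketch of that calculation, not the authors' code: item.prob and class.prob are toy stand-ins shaped like PT.Sub.pi (items by classes) and PT.Sub.lam, posterior_class is a hypothetical helper, and all values are made up.

```r
# Hypothetical stand-ins for the Step 3 outputs (values are made up):
# item.prob is q x C like PT.Sub.pi; class.prob has length C like PT.Sub.lam.
item.prob <- matrix(c(0.9, 0.2, 0.5,
                      0.8, 0.1, 0.6,
                      0.1, 0.7, 0.5), nrow = 3, byrow = TRUE)
class.prob <- c(0.5, 0.3, 0.2)

# Posterior class probabilities for one binary response vector y (length q).
posterior_class <- function(y, item.prob, class.prob) {
  # log-likelihood of y within each class (one column per class)
  loglik <- colSums(y * log(item.prob) + (1 - y) * log(1 - item.prob))
  w <- log(class.prob) + loglik
  w <- exp(w - max(w))   # subtract the max for numerical stability
  w / sum(w)
}

y <- c(1, 1, 0)  # systems 1 and 2 involved, system 3 not
post <- posterior_class(y, item.prob, class.prob)
print(round(post, 3))
print(which.max(post))  # most likely subphenotype for this patient
```

The sketch assumes all item probabilities lie strictly between 0 and 1; values of exactly 0 or 1 would need separate handling before taking logarithms.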