CNVRock — AMR gene copy-number variation in Klebsiella pneumoniae using variational autoencoders

Antimicrobial resistance (AMR) is one of the leading global health threats. In Klebsiella pneumoniae species complex (KpSC), resistance is driven not just by gene presence but by how many copies of a resistance gene a bacterium carries — chromosomal amplification (e.g. blaSHV) and plasmid copy-number variation (PCN) of clinically critical genes (blaKPC, blaNDM, blaCTX-M, blaOXA-48-like). Both effects matter clinically; conventional presence-callers see neither.

CNVRock detects both from short-read sequencing using:

a 1D convolutional VAE that learns a 10-dim latent embedding of genome-wide 1 kb-binned read depth,
a 6-state Gaussian HMM that segments the per-sample reconstructions into copy-number states,
separate per-gene CNV callers — CRR-based for chromosomal genes, PCN-based for plasmid-borne AMR genes (aggregated by gene family so multi-mapped reads between near-identical allele variants are correctly counted).

Scaling study (manuscript headline)

We report performance across five training-set sizes holding architecture, HMM, and CNV-caller parameters constant — only n varies:

Tier	n samples	Manifest
Phase 1 baseline	545	(legacy)
exp 32 / 5K	5,000	`assets/kpsc_expansion_subset_5k.tsv`
exp 33 / 10K	10,000	`assets/kpsc_expansion_subset_10k.tsv`
exp 34 / 20K	20,000	`assets/kpsc_expansion_subset_20k.tsv`
exp 36 / 40K	40,000	`assets/kpsc_expansion_subset_40k.tsv`

Each manifest is a strict superset of the previous one (see data/setup/select_expansion_subset.py) — any observed change in metrics is attributable to more training data, not to a different sample mix.

Results populate as experiments complete. See docs/08_scaling_study.md for the live table.

Pipeline at a glance

ENA (~78,000 KpSC FASTQ pairs)
    │
    │  IBM Aspera (ascp via bioconda aspera-cli, concur=10, MQ=0 GATK output)
    ▼
data/raw/fastq_subset/     ── BWA-MEM ── data/raw/bam_subset/  (kept)
                                                │
                                                ▼ GATK CollectReadCounts
                              data/raw/readcounts_subset_mq0/  (1 kb bins, chrom + plasmids)
                                                │
                          ┌─────────────────────┴─────────────────────┐
                          ▼                                            ▼
       data/inputs/KpSC-expansion-{N}-mq0-1000bp-npy/    KpSC-expansion-{N}-mq0-plasmid-1000bp-npy/
       (NC_016845.1 only, 5,334 bins)                    (7 gene families: blaKPC, blaCTX-M,
                          │                                            blaNDM, blaOXA-48-like,
                          │                                            blaTEM, qnrB, aac6-Ib-cr)
                          ▼                                            ▼
                ┌──────────────────────────────────────────────────────────┐
                │   VAE → Gaussian HMM → CNV callers → evaluation.txt      │
                │                  (models/train.py)                       │
                └──────────────────────────────────────────────────────────┘

Quickstart

# One-time setup (HPC):
conda env create -f environment.yml && conda activate cnvrock
conda install -y -c bioconda aspera-cli
ascli config ascp install
curl -sL https://www.ebi.ac.uk/bioimage-archive/help-download/ \
    | sed -n '/BEGIN OPENSSH PRIVATE KEY/,/END OPENSSH PRIVATE KEY/p' \
    > ~/.aspera/sdk/ebi_aspera_key.openssh
chmod 600 ~/.aspera/sdk/ebi_aspera_key.openssh

# Generate subset manifests (deterministic, seed=42):
python3 data/setup/select_expansion_subset.py

# Build 1 kb interval list across chrom + plasmids:
gatk CreateSequenceDictionary -R assets/HS11286_extended.fasta
gatk PreprocessIntervals \
    --reference assets/HS11286_extended.fasta \
    --bin-length 1000 --padding 0 \
    --interval-merging-rule OVERLAPPING_ONLY \
    --output assets/HS11286_extended_1kb.interval_list

# Download + align + count one tier (idempotent; samples already done are skipped):
sbatch --export=ALL,MANIFEST=assets/kpsc_expansion_subset_20k.tsv \
       --array=1-4999%10 hpc/aspera_subset_pipeline.sh
# (Plus three more chunks of size ≤5000 — SLURM MaxArraySize limit.)

# Build NPY stores (per tier):
python3 data/setup/readcounts_to_npy_kpsc.py \
    --counts-dir data/raw/readcounts_subset_mq0 \
    --manifest   assets/kpsc_expansion_subset_5k.tsv \
    --out-dir    data/inputs/KpSC-expansion-5k-mq0-1000bp-npy \
    --keep-contigs NC_016845.1

python3 data/setup/plasmid_genes_to_npy_kpsc.py \
    --counts-dir  data/raw/readcounts_subset_mq0 \
    --manifest    assets/kpsc_expansion_subset_5k.tsv \
    --gene-coords assets/plasmid_refs/plasmid_gene_coords.tsv \
    --families    assets/plasmid_refs/plasmid_gene_families.tsv \
    --out-dir     data/inputs/KpSC-expansion-5k-mq0-plasmid-1000bp-npy

# Train one experiment:
sbatch hpc/train_gpu.sh experiments/32/config.yaml

Full per-step explanation lives in the ReadTheDocs site.

Repository layout

assets/                    sample manifests, ground truth, reference + intervals
  HS11286_extended.fasta              chromosome + 11 plasmid contigs
  HS11286_extended_1kb.interval_list  7,365 1 kb bins for GATK CollectReadCounts
  kpsc_expansion_subset_{5k,10k,20k,40k,80k}.tsv  nested manifests
  kpsc_expansion_kleborate_gt_runlevel.tsv         Kleborate v3 ground truth
  amrfinder_gt_expansion.tsv                       AMRFinder+ ground truth
  plasmid_refs/
    plasmid_gene_coords.tsv     per-gene CDS coordinates on the extended ref
    plasmid_gene_families.tsv   gene-family groupings (aggregation key)

data/
  raw/
    fastq_subset/                  persistent FASTQs (Aspera downloads)
    bam_subset/                    persistent BWA BAMs
    readcounts_subset_mq0/         GATK 1 kb count TSVs
  inputs/
    KpSC-expansion-{N}-mq0-1000bp-npy/                chromosome NPY store
    KpSC-expansion-{N}-mq0-plasmid-1000bp-npy/        plasmid-family NPY store
  results/
    {32,33,34,36}_kpsc_expansion_{5k,10k,20k,40k}/   per-tier CNVRock outputs

hpc/
  aspera_subset_pipeline.sh        ascp + BWA + GATK per SLURM array task
  train_gpu.sh                     submits one CNVRock training run on an A40

data/setup/
  select_expansion_subset.py       stratified nested 5K/10K/20K/40K/80K
  readcounts_to_npy_kpsc.py        TSV → chromosome NPY store
  plasmid_genes_to_npy_kpsc.py     TSV → plasmid-family NPY store
  reannotate_plasmid_genes_blast.py     BLAST-based coord verification
  reannotate_plasmid_genes_genbank.py   GenBank-based coord verification

models/
  train.py                         single entry point: VAE + HMM + CNV + eval
  architecture/06_conv_vae.py      1D Conv-VAE (10-dim latent)
  hmm/02_gaussian_hmm.py           6-state Gaussian HMM segmenter
  cnv/06_gene_cnv_caller.py        chromosomal CRR-based caller
  cnv/07_plasmid_cnv_caller.py     plasmid PCN-based caller
  evaluation/04_kpsc_evaluation.py per-gene MCC / FNR / PPV
  experiments/32-36/               scaling-study experiment configs

docs/                              Sphinx + MyST documentation (publishes to RTD)
archive/experiments_pf/            historical Phase 1 experiments (exp 01–21)

Key parameter choices

These are documented at length in docs/09_methods.md; the highlights:

MQ filter = 0 for GATK CollectReadCounts. Multi-mapping between nearly-identical AMR allele variants (NDM-1 / NDM-5; OXA-48 / OXA-181; CTX-M-15 / -14 / -65 / -27) is the signal, not noise — filtering it out makes those genes invisible. We aggregate by gene family downstream.
Aspera concurrency = 10 per array job (EBI's per-source-IP cap).
Stratification: species × Bla_Carb × ST cap of 150 per stratum, with 1.5× carbapenemase-carrier oversample weighting.
Nested manifest construction: one weighted shuffle of the eligible pool, then prefixes of length 5K / 10K / 20K / 40K. Strictly nested by construction.

Documentation

The full pipeline narrative — overview, data acquisition, reference + intervals, subset selection, NPY stores, training, evaluation, scaling-study results, methods, reproducibility — is at cnvrock.readthedocs.io and builds automatically on every push to main from the docs/ folder.

Interactive diagnostics

A live Streamlit app showing the monitor / sample-viewer / latent-coverage pages on a 200-sample subsample of exp 32 is hosted at https://cnvrock.streamlit.app/. No install required for reviewers — just click and explore. The full pipeline (training + all 5 scaling tiers) reproduces locally via the Quickstart above; the app then auto-detects the full data/results/ and switches out of demo mode.

License

MIT. See LICENSE.

Name		Name	Last commit message	Last commit date
Latest commit History 210 Commits
.devcontainer		.devcontainer
analysis		analysis
archive/experiments_pf		archive/experiments_pf
assets		assets
data		data
diagnostics		diagnostics
docs		docs
hpc		hpc
models		models
tests		tests
tools		tools
.gitignore		.gitignore
.readthedocs.yaml		.readthedocs.yaml
CITATION.cff		CITATION.cff
LICENSE		LICENSE
README.md		README.md
environment.yml		environment.yml
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
write_report.py		write_report.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CNVRock — AMR gene copy-number variation in Klebsiella pneumoniae using variational autoencoders

Scaling study (manuscript headline)

Pipeline at a glance

Quickstart

Repository layout

Key parameter choices

Documentation

Interactive diagnostics

License

About

Uh oh!

Releases 1

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

CNVRock — AMR gene copy-number variation in Klebsiella pneumoniae using variational autoencoders

Scaling study (manuscript headline)

Pipeline at a glance

Quickstart

Repository layout

Key parameter choices

Documentation

Interactive diagnostics

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages