Skip to content

Divya1205/Hydra_Watch_AIxBio2026

Repository files navigation

HydraWatch

Embedding-based wastewater pathogen surveillance for federated hospital networks.

AIxBio Hackathon · Track 2 · Apart Research · April 2026

Local embeddings · Global signal · Reference-free early warning


What this is

HydraWatch is a reference-free, privacy-preserving wastewater pathogen surveillance pipeline designed for federated hospital networks. Each hospital sequences its own sewershed, embeds reads with DNABERT-2, and trains a local Transformer-encoder Variational Autoencoder (TE-VAE) on the classified read pool to define a site-normal baseline. A hybrid anomaly score flags reads in the unclassified pool — the blind spot where novel pathogens hide because reference-based tools cannot see them. Anomalies are clustered with HDBSCAN and tracked across timepoints to surface emerging signals.

Cross-site detection happens by query, not data: hospitals exchange ~3 KB cluster centroids, never raw reads or read-level embeddings.

Full methodology and results: submission/HydraWatch_report.pdf Slide deck: submission/HydraWatch_Track2_slides.pdf


System architecture

HydraWatch federated surveillance concept

Five hospitals each run a local DNABERT-2 + TE-VAE pipeline on their own wastewater data. When a site detects an emerging anomaly cluster, it sends a single 768-dim centroid query (~3 KB) to peer sites — raw reads and read-level embeddings stay on-prem. The regional layer matches centroids across nearby hospitals; the national layer triggers public-health alerts when a cluster signature appears in more than one region. Hospital D is the pilot site analysed in this report.


Headline result

On a three-timepoint NY hospital sewershed pilot (CASPER PRJNA1247874, September–November 2025), joint HDBSCAN clustering surfaces a dominant emerging cluster:

Cluster T1 T2 T3 Growth Pattern
6 284 122 3,506 ×12.3 Emerging — dominant signal
3 0 0 31 ×32 Emerging — low mass

The hybrid TE-VAE score cleanly separates classified from unclassified reads (0.33% vs 55.6% flagged at μ + 3σ).

A multi-view (DNA + protein) proof of concept on a separate CASPER sample (SRR37006656) shows that 40 of the top 50 anomalous reads are flagged by both DNABERT-2 and ESM-2 — the views are complementary, not redundant.

BLAST validation of the emerging cluster is queued and will be reported in a follow-up.


Pipeline overview

Wastewater FASTQ
   │
   ├── Trimmomatic (QC + trimming)
   ├── Kraken2 (reference classification)
   │     └── split into classified / unclassified pools
   │
   ├── Strip human reads + subsample (50K classified, 250K unclassified, R1 only)
   │
   ├── DNABERT-2 embedding (768-dim, frozen, mean-pooled, on Kaggle P100 GPU)
   │
   ├── TE-VAE training on classified pool
   │     └── hybrid anomaly score: robust z(recon) + robust z(log(latent Mahalanobis))
   │
   ├── HDBSCAN clustering on top 1% anomalies (pooled across timepoints)
   │     └── trajectory analysis: emerging / transient / declining
   │
   └── BLAST validation of representative reads (queued)

See submission/HydraWatch_report.pdf §3 for full methodology and Figure S1 for the architectural diagram.


Repository structure

Hydra_Watch_AIxBio2026/
├── README.md                              ← you are here
├── requirements.txt                       ← Python dependencies
├── LICENSE                                ← MIT
│
├── preprocessing/                         ← Stages 1–5 of the pipeline
│   ├── README.md                          ← preprocessing walkthrough
│   ├── prep_for_embedding.py              ← strip human reads, subsample classified to 50K
│   ├── subsample_unclassified.py          ← reservoir-sample unclassified to 250K
│   └── generate-dnabert2-embeddings.ipynb ← DNABERT-2 inference on Kaggle GPU
│
├── anomaly_detection/                     ← Stages 6–8: TE-VAE + clustering + BLAST prep
│   ├── README.md                          ← anomaly detection walkthrough
│   ├── tevae_anomaly_detection_hybrid.py  ← TE-VAE training + hybrid scoring
│   ├── tevae_plots.py                     ← deck/report figures (3 + UMAP)
│   ├── replot_tevae_components.py         ← regenerate distribution figure
│   └── extract_clusters_for_blast.py      ← BLAST FASTA extraction
│
├── multi_view/                            ← §6.3: ESM-2 proof of concept
│   ├── README.md                          ← multi-view extension notes
│   └── pandemic_plug_and_play.ipynb       ← DNA + protein dual-signal pipeline (SRR37006656)
│
├── results/                               ← outputs (samples; full data on request)
│   ├── tevae_cluster_trajectories.tsv
│   ├── tevae_threshold.txt
│   └── figures/
│
├── data/                                  ← reference files + SRA metadata (raw FASTQs not committed)
│   ├── README.md                          ← how to download from SRA + Kaggle
│   ├── SraRunTable.csv                    ← BioSample metadata for the four CASPER accessions
│   ├── pathogens_CASPER.txt               ← CASPER pathogen reference list (BLAST validation)
│   └── pathogen_protein_domains_conservation.txt  ← protein domain conservation reference
│
└── submission/                            ← final hackathon deliverables
    ├── HydraWatch_report.pdf
    └── HydraWatch_Track2_slides.pdf

Each subfolder has its own README explaining what the scripts do and how to run them in order.


Quick start

1. Set up the environment

git clone https://github.com/Divya1205/Hydra_Watch_AIxBio2026.git
cd Hydra_Watch_AIxBio2026
pip install -r requirements.txt

2. Get the data — three options

Option A — full reproducibility, raw SRA reads:

prefetch SRR37006657 SRR37006671 SRR37006667
fasterq-dump SRR37006657 SRR37006671 SRR37006667 --split-files

Then follow preprocessing/README.md from Step 1.

Option B — skip preprocessing, use the published Kaggle dataset:

🔗 https://www.kaggle.com/datasets/divyasitani/dataset-v3

Six preprocessed FASTAs ready for DNABERT-2 embedding. Attach to a Kaggle notebook and skip to preprocessing/README.md Step 5.

Option C — skip preprocessing AND embedding:

If a public embeddings dataset is available, download the 12 .npy files and place them in casper_data/ny_hospital_d/embeddings/25k/embeddings_combined/, then run the anomaly detection scripts directly.

3. Run the pipeline

# Preprocessing (see preprocessing/README.md for full details)
python preprocessing/prep_for_embedding.py SRR37006657   # repeat for SRR37006671, SRR37006667
python preprocessing/subsample_unclassified.py \
    SRR37006657_unclassified_for_embedding.fasta \
    SRR37006657_unclassified_250k.fasta \
    250000 --seed 42
# Then run preprocessing/generate-dnabert2-embeddings.ipynb on Kaggle GPU

# Anomaly detection (see anomaly_detection/README.md for full details)
python anomaly_detection/tevae_anomaly_detection_hybrid.py
python anomaly_detection/tevae_plots.py
python anomaly_detection/extract_clusters_for_blast.py \
    --fasta-dir <path/to/unclassified/fastas>

Outputs land in results_tevae/ (anomaly scores, cluster trajectories, figures, BLAST-ready FASTAs).


Method summary

Stage Tool Notes
QC + trimming Trimmomatic Standard parameters (LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50)
Reference classification Kraken2 PlusPF database
Strip human + subsample classified prep_for_embedding.py Drops taxon 9606, subsamples to 50K, seed=42
Subsample unclassified subsample_unclassified.py Vitter's Algorithm R, 250K reads, R1 only, seed=42
Embedding DNABERT-2 (117M, frozen) 768-dim, mean-pooled, max 512 tokens
Anomaly model TE-VAE 32-dim latent, β = 0.1, 50 epochs
Anomaly score Hybrid Robust z(recon) + robust z(log(latent Mahalanobis))
Threshold μ + 3σ on classified scores ~0.3% expected flag rate under Gaussianity
Clustering HDBSCAN min_cluster_size = 30, on 50 PCA components
Trajectory analysis Per-cluster T1/T2/T3 counts Emerging / transient / declining
Validation NCBI web blastn Queued for cluster 6 representative reads

Citation

If you use this work, please cite it as:

Sitani D, ElSayed M, Arrey F, Schutz H, Held S. (2026). HydraWatch: Embedding-based wastewater pathogen surveillance for federated hospital networks. AIxBio Hackathon Track 2, Apart Research. https://github.com/Divya1205/Hydra_Watch_AIxBio2026

GitHub also generates BibTeX and APA citations from CITATION.cff — click "Cite this repository" near the top of the repo page.

Underlying dataset:

Justen LJ et al. (2026). Deep untargeted wastewater metagenomic sequencing from sewersheds across the United States. medRxiv 2026-03 (CASPER consortium). BioProject PRJNA1247874.


Authors

Author Affiliation
Divya Sitani Independent Researcher
Mohammed ElSayed Helmut Schmidt Universität Hamburg
Frida Arrey Independent Researcher
Hanna Schutz Oxford Nanopore Technologies
Sascha Held Swissbit AG

With Apart Research.


Limitations

HydraWatch is a hackathon-scale pilot. Key caveats:

  • BLAST validation is queued, not yet completed for the TE-VAE clusters. The trajectory pattern (×12.3 emergence) is the embedding-space signal; sequence-level anchoring follows.
  • Single-site pilot. The federated multi-site architecture is described and motivated, but only a single-site three-timepoint pilot has been run.
  • TE-VAE trained on classified embeddings. The model's notion of "normal" inherits any biases of Kraken2's reference database.
  • Multi-view (ESM-2) is proof of concept only, on a single sample separate from the main pilot. Full integration into the TE-VAE pipeline is future work.

See report §6.3 for full limitations and future work.


License

MIT — see LICENSE.


Acknowledgements

Built on top of the SecureBio CASPER initiative dataset (PRJNA1247874). HydraWatch is complementary to, not a replacement for, reference-based wastewater surveillance. The two layers cover different failure modes.

About

Reference-free, privacy-preserving wastewater pathogen surveillance for federated hospital networks. DNABERT-2 embeddings + Transformer-encoder VAE detect emerging anomaly clusters in unclassified reads. AIxBio Hackathon Track 2, Apart Research.

Topics

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors