A comprehensive documentation hub for genetics and brain foundation models and their multimodal integration.
📖 Read the Docs | 🚀 Quick Start | 💡 Use Cases
A documentation-first knowledge base for researchers working with:
- 🧬 Genetic foundation models (Caduceus, DNABERT-2, Evo 2, GENERator)
- 🧠 Brain imaging models (BrainLM, Brain-JEPA, BrainMT, Brain Harmony, SwiFT)
- 🏥 Multimodal/Clinical models (BAGEL, MoT, M3FM, Me-LLaMA, TITAN, FMS-Medical)
- 🔗 Integration strategies for gene-brain-behavior-language analysis
Scope: Documentation, metadata cards, and integration patterns — not model implementation code.
# 1. Clone and setup
git clone https://github.com/allison-eunse/neuro-omics-kb.git
cd neuro-omics-kb
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
# 2. View documentation locally
mkdocs serve
# Visit http://localhost:8000
# 3. Validate metadata cards
python scripts/manage_kb.py validate models
New to foundation models? Start with:
- 📖 KB Overview - Understand the structure
- 🧬 Genetics Models Overview - DNA sequence models
- 🧠 Brain Models Overview - Neuroimaging models
- 🔗 Integration Strategy - How to combine modalities
This KB supports research across multiple modalities:
- 🧬 Genetics Research - Extract gene embeddings, analyze variant effects, predict phenotypes from DNA sequences
- 🧠 Brain Imaging - Process fMRI/sMRI, extract neuroimaging features, harmonize multi-site data
- 🔗 Multimodal Integration - Fuse gene + brain embeddings, gene-brain-behavior prediction, cross-modal alignment
- 🏥 Multimodal & Clinical Models - Use unified architectures (BAGEL, MoT, M3FM), process medical imaging with clinical text (TITAN, M3FM), leverage medical LLMs (Me-LLaMA)
- 🧪 Reproducible Research - Use validated pipelines, experiment configs, and quality gates for your cohorts
Example workflows:
- Gene-brain association discovery using WES + sMRI with CCA
- fMRI embedding extraction with BrainLM for MDD prediction
- Leave-one-gene-out (LOGO) attribution for gene importance
- Multimodal fusion for clinical decision support
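The CCA + permutation workflow listed above depends on a permutation null: shuffle the subject order in one modality, recompute the association statistic, and see how often the shuffled value matches or exceeds the observed one. The sketch below illustrates only that logic, using a plain Pearson correlation as a one-dimensional stand-in for the first canonical correlation. The function names and toy data are hypothetical and are not taken from the KB's configs or scripts.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def permutation_p_value(x, y, n_perm=1000, seed=0):
    """Two-sided permutation p-value for the association between x and y.

    Shuffling y breaks the subject-level pairing, which gives the null
    distribution of the statistic under "no association".
    """
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        if abs(pearson(x, y_perm)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Toy data (assumption): one gene-level score and one brain feature per subject.
data_rng = random.Random(42)
gene_score = [data_rng.gauss(0, 1) for _ in range(50)]
brain_feature = [g + data_rng.gauss(0, 0.5) for g in gene_score]

p_value = permutation_p_value(gene_score, brain_feature)
print(f"r = {pearson(gene_score, brain_feature):.3f}, p = {p_value:.4f}")
```

In a real analysis the statistic would be the canonical correlation fitted on gene and sMRI embeddings, and the model would be refitted inside each permutation.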
📚 Documentation (docs/)
- Code Walkthroughs - Step-by-step guides for 17 foundation models with consistent formatting
  - 🧬 Genetics (4): Caduceus, DNABERT-2, GENERator, Evo 2
  - 🧠 Brain (5): BrainLM, Brain-JEPA, Brain Harmony, BrainMT, SwiFT
  - 🏥 Multimodal/Clinical (6): BAGEL, MoT, M3FM, Me-LLaMA, TITAN, FMS-Medical catalog
- Integration Playbooks - Multimodal fusion strategies (late fusion → contrastive → TAPE)
- Data Schemas - UK Biobank, HCP, developmental cohorts
- Decision Logs - Architectural choices and research rationale
- Curated Papers - PDFs + Markdown summaries in docs/generated/kb_curated/
🏷️ Metadata Cards (kb/)
- Model Cards (model_cards/*.yaml) - 20 model cards (17 FMs + 3 ARPA-H/reference cards) with architecture specs, embedding recipes, integration hooks
- Dataset Cards (datasets/*.yaml) - 19 dataset specifications (6 UKB planning cards + 13 available/reference datasets)
- Paper Cards (paper_cards/*.yaml) - 30 research papers with structured takeaways
- Integration Cards (integration_cards/*.yaml) - Embedding strategies, harmonization methods, preprocessing pipelines
🔧 Tools & Scripts
- scripts/manage_kb.py - Validate YAML cards, query embedding strategies
- scripts/codex_gate.py - Quality gate for automated workflows
- scripts/fetch_external_repos.sh - Sync upstream model repositories
⚙️ Experiment Configs
Ready-to-run YAML templates in configs/experiments/:
- 01_cca_gene_smri.yaml - CCA + permutation baseline
- 02_prediction_baselines.yaml - Gene vs Brain vs Fusion
- 03_logo_gene_attribution.yaml - Gene attribution protocol
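As a rough illustration of the leave-one-gene-out (LOGO) idea behind the gene attribution protocol, the sketch below scores a prediction with all genes, then rescores it with each gene removed and reports the drop. The additive toy predictor, the gene names, and the choice of Pearson correlation as the score are all assumptions made for illustration; the actual protocol operates on foundation-model embeddings and may differ in every detail.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def logo_attribution(gene_features, phenotype, score_fn=pearson):
    """Return (full_score, {gene: full_score - score_without_gene}).

    gene_features maps a gene name to one value per subject. The predictor
    here is a plain sum over genes (a toy assumption, not a real model).
    """
    n_subjects = len(phenotype)

    def predict(genes):
        return [sum(gene_features[g][i] for g in genes) for i in range(n_subjects)]

    all_genes = list(gene_features)
    full_score = score_fn(predict(all_genes), phenotype)
    deltas = {}
    for gene in all_genes:
        kept = [g for g in all_genes if g != gene]
        deltas[gene] = full_score - score_fn(predict(kept), phenotype)
    return full_score, deltas


# Hypothetical per-subject gene scores for six subjects.
toy_features = {
    "GENE_A": [1, 2, 3, 4, 5, 6],                 # tracks the phenotype
    "GENE_B": [0.5, -0.5, 0.5, -0.5, 0.5, -0.5],  # alternating noise
    "GENE_C": [0, 0, 1, 0, 0, 1],                 # sparse signal
}
toy_phenotype = [1, 2, 3, 4, 5, 6]

full_score, deltas = logo_attribution(toy_features, toy_phenotype)
for gene, delta in sorted(deltas.items(), key=lambda kv: -kv[1]):
    print(f"{gene}: delta = {delta:+.3f}")
```

A large positive delta means the score falls when that gene is left out, so the gene carries signal the remaining genes do not replace.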
| Model | Best for | Context | Documentation |
|---|---|---|---|
| 🧬 Caduceus | RC-equivariant gene embeddings | DNA sequences | Walkthrough |
| 🧬 DNABERT-2 | Cross-species transfer | BPE tokenization | Walkthrough |
| 🧬 Evo 2 | Ultra-long regulatory regions | 1M context | Walkthrough |
| 🧬 GENERator | Generative modeling | 6-mer LM | Walkthrough |
| 🧬 HyenaDNA | Long-range DNA modeling | Efficient convolutions | Walkthrough |
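Two terms from the table above, reverse-complement (RC) handling and k-mer tokenization, can be made concrete with a short sketch. It shows a reverse-complement helper, a non-overlapping 6-mer split, and post-hoc RC averaging of an embedding. The averaging step is a generic way to obtain an RC-invariant embedding from a model without built-in equivariance; it is not a description of how any model listed here is implemented, and `toy_embed` is a hypothetical stand-in for a real encoder. The actual tokenizers of the listed models may differ from this split.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of a DNA string (A<->T, C<->G, N unchanged)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def kmer_tokens(seq, k=6):
    """Split a sequence into non-overlapping k-mers, dropping any remainder."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]


def toy_embed(seq):
    """Hypothetical encoder: per-base frequency vector in A, C, G, T order."""
    return [seq.count(base) / len(seq) for base in "ACGT"]


def rc_averaged_embedding(seq, embed_fn=toy_embed):
    """Average the embeddings of a sequence and its reverse complement.

    The result is identical for a sequence and its reverse complement,
    whatever embed_fn does with strand orientation.
    """
    forward = embed_fn(seq)
    reverse = embed_fn(reverse_complement(seq))
    return [(f + r) / 2 for f, r in zip(forward, reverse)]


sequence = "ACGTACGGTTAACC"
print(reverse_complement(sequence))
print(kmer_tokens(sequence))
print(rc_averaged_embedding(sequence))
```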
| Model | Modality | Best for | Documentation |
|---|---|---|---|
| 🧠 BrainLM | fMRI | Site-robust embeddings | Walkthrough |
| 🧠 Brain-JEPA | fMRI | Lower-latency option | Walkthrough |
| 🧠 Brain Harmony | sMRI + fMRI | Multi-modal fusion | Walkthrough |
| 🧠 BrainMT | sMRI/fMRI | Mamba efficiency | Walkthrough |
| 🧠 SwiFT | fMRI | Hierarchical spatiotemporal | Walkthrough |
| Model | Modalities | Best for | Documentation |
|---|---|---|---|
| 🏥 BAGEL | Vision + Text + Video | Unified multimodal FM with MoT | Walkthrough |
| 🏥 MoT | Text + Images + Speech | Sparse mixture-of-transformers | Walkthrough |
| 🏥 M3FM | CXR + Text | Multilingual medical reports | Walkthrough |
| 🏥 Me-LLaMA | Medical Text | LLM for clinical reasoning | Walkthrough |
| 🏥 TITAN | Histopathology | Whole-slide image analysis | Walkthrough |
| 🏥 FMS-Medical | Clinical Multi-modal | Medical foundation models catalog | Walkthrough |
30 structured paper cards documenting:
- 🧬 Genetics FMs (5): Caduceus, DNABERT-2, Evo 2, GENERator, HyenaDNA
- 🧠 Brain FMs (5): BrainLM, Brain-JEPA, Brain Harmony, BrainMT, SwiFT
- 🏥 Multimodal/Clinical FMs (6): BAGEL, MoT, M3FM, Me-LLaMA, TITAN, Flamingo
- 🔗 Integration & Methods (11): Ensemble integration, Multimodal FMs survey, MM-LLM imaging, Oncology review, Yoon BioKDD, RC-equivariant networks, RC consistency for DNA LMs, Systems & algorithms for multi-hybrid LMs, Brain MRI bias unlearning, Brain multisite harmonization (MURD), Site unlearning (Dinsdale)
- 🧬 Genomics & Population (2): GWAS diverse populations, PRS guide
- 📊 Tabular Baseline (1): TabPFN (2023)
- 📚 General (2): Representation learning, Foundation models overview
View all paper cards → | Browse summaries →
3 comprehensive integration cards synthesizing multimodal patterns:
- 🎯 Ensemble Integration - Model stacking, averaging, and meta-learning for late fusion
- 🏥 Oncology Multimodal Review - Early/intermediate/late fusion taxonomy from cancer research
- 🎨 Multimodal FM Patterns - Architectural patterns from BAGEL, MoT, M3FM for Brain-Omics Models
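The averaging variant of late fusion mentioned in the Ensemble Integration card can be sketched in a few lines: each modality produces its own per-subject probability, and the fused prediction is a weighted mean. This is a minimal sketch under stated assumptions; the function name, modality keys, and weights are hypothetical, and stacking or meta-learning would replace the fixed weights with a model fitted on held-out predictions.

```python
def late_fusion(prob_by_modality, weights=None):
    """Weighted average of per-subject probabilities across modalities.

    prob_by_modality maps a modality name to a list of probabilities,
    one per subject, all in the same subject order.
    weights maps a modality name to a non-negative weight (default: equal).
    """
    modalities = list(prob_by_modality)
    if weights is None:
        weights = {m: 1.0 for m in modalities}
    total = sum(weights[m] for m in modalities)
    n_subjects = len(prob_by_modality[modalities[0]])
    return [
        sum(weights[m] * prob_by_modality[m][i] for m in modalities) / total
        for i in range(n_subjects)
    ]


# Hypothetical outputs of a gene-only and a brain-only classifier.
predictions = {"gene": [0.2, 0.8], "brain": [0.4, 0.6]}

equal_fused = late_fusion(predictions)
weighted_fused = late_fusion(predictions, weights={"gene": 1.0, "brain": 3.0})
print(equal_fused, weighted_fused)
```

Because each modality is modeled separately, a subject with one missing modality can still be scored by dropping that key and renormalizing the weights.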
View integration design patterns → | Multimodal architectures →
# Show the full sMRI baseline recipe
python scripts/manage_kb.py ops strategy smri_free_surfer_pca512_v1
# Inspect harmonization metadata (e.g., MURD)
python scripts/manage_kb.py ops harmonization murd_t1_t2
python scripts/manage_kb.py validate models
python scripts/manage_kb.py validate datasets
# Cycle 1 – quick sanity before giving Codex control
python scripts/codex_gate.py --mode fast --label cycle1 --since origin/main
# Cycle 2 – full sweep before handing work back
python scripts/codex_gate.py --mode full --label cycle2 --since HEAD~1
In addition to standalone neurogenomics analyses, neuro-omics-kb serves as the documentation layer for multimodal brain–omics–LLM foundation model efforts, including ARPA-H–style Brain-Omics Model (BOM) initiatives.
This KB provides the foundation for escalating from late fusion → contrastive learning → unified multimodal architectures:
- Phase 1 (Current): Late fusion baselines with genetics + brain FMs
  - Tools: CCA+permutation, prediction baselines, partial correlations
  - Models: Caduceus/DNABERT-2 (genetics) + BrainLM/SwiFT (brain)
- Phase 2 (Near-term): Two-tower contrastive alignment
  - Patterns: InfoNCE, frozen encoders, small projectors
  - Reference: M3FM, oncology multimodal review
- Phase 3 (Long-term): Unified Brain-Omics Models
  - Architectures: MoT-style sparse transformers, BAGEL-style unified decoders
  - Integration: Gene-brain-behavior-language tokens with LLM as semantic hub
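The InfoNCE objective named in Phase 2 can be sketched without any deep learning framework. The example assumes that frozen gene and brain encoders have already produced embeddings and that small projectors have mapped them to a shared dimension, so row i of each input belongs to the same subject. The function names, the temperature value, and the toy vectors are assumptions for illustration, not the KB's prescribed recipe.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]


def info_nce(gene_emb, brain_emb, temperature=0.1):
    """Symmetric InfoNCE loss over paired gene and brain embeddings.

    Matching rows are positives; every other pairing in the batch is a
    negative. The loss averages the gene-to-brain and brain-to-gene terms.
    """
    genes = [l2_normalize(v) for v in gene_emb]
    brains = [l2_normalize(v) for v in brain_emb]
    n = len(genes)
    logits = [
        [sum(a * b for a, b in zip(g, br)) / temperature for br in brains]
        for g in genes
    ]
    gene_to_brain = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    brain_to_gene = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (gene_to_brain + brain_to_gene) / 2


# Toy projector outputs for two subjects (hypothetical values).
gene_batch = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(gene_batch, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = info_nce(gene_batch, [[0.0, 1.0], [1.0, 0.0]])
print(f"aligned: {aligned_loss:.5f}, swapped: {swapped_loss:.5f}")
```

With the encoders frozen, only the projector weights would receive gradients from this loss, which keeps the number of trainable parameters small.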
Read Integration Plan → | View Design Patterns →
- Model Implementations: See links in individual model cards
Maintainer: Allison Eun Se You
Purpose: Knowledge base for neuro-omics foundation model research
Scope: Documentation, metadata, integration strategies, and references to upstream implementations
Note: This is a documentation-first KB. Implementation code lives in the upstream repositories referenced throughout external_repos/ (a mix of tracked snapshots and fetch-on-demand placeholders).