allison-eunse/neuro-omics-kb

🧬🧠 Neuro-Omics Knowledge Base

A comprehensive documentation hub for genetics and brain foundation models and their multimodal integration.

📖 Read the Docs | 🚀 Quick Start | 💡 Use Cases


What is this?

A documentation-first knowledge base for researchers working with:

  • 🧬 Genetic foundation models (Caduceus, DNABERT-2, Evo 2, GENERator)
  • 🧠 Brain imaging models (BrainLM, Brain-JEPA, BrainMT, Brain Harmony, SwiFT)
  • 🏥 Multimodal/Clinical models (BAGEL, MoT, M3FM, Me-LLaMA, TITAN, FMS-Medical)
  • 🔗 Integration strategies for gene-brain-behavior-language analysis

Scope: Documentation, metadata cards, and integration patterns — not model implementation code.


🚀 Quick Start

# 1. Clone and set up
git clone https://github.com/allison-eunse/neuro-omics-kb.git
cd neuro-omics-kb
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# 2. View documentation locally
mkdocs serve
# Visit http://localhost:8000

# 3. Validate metadata cards
python scripts/manage_kb.py validate models

New to foundation models? Start with:

  1. 📖 KB Overview - Understand the structure
  2. 🧬 Genetics Models Overview - DNA sequence models
  3. 🧠 Brain Models Overview - Neuroimaging models
  4. 🔗 Integration Strategy - How to combine modalities

💡 Use Cases

This KB supports research across multiple modalities:

  • 🧬 Genetics Research - Extract gene embeddings, analyze variant effects, predict phenotypes from DNA sequences
  • 🧠 Brain Imaging - Process fMRI/sMRI, extract neuroimaging features, harmonize multi-site data
  • 🔗 Multimodal Integration - Fuse gene + brain embeddings, gene-brain-behavior prediction, cross-modal alignment
  • 🏥 Multimodal & Clinical Models - Use unified architectures (BAGEL, MoT, M3FM), process medical imaging with clinical text (TITAN, M3FM), leverage medical LLMs (Me-LLaMA)
  • 🧪 Reproducible Research - Use validated pipelines, experiment configs, and quality gates for your cohorts
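
As a rough illustration of the "fuse gene + brain embeddings" use case above, the sketch below standardizes each modality per feature and concatenates the two vectors for every subject, which is one simple form of late fusion. The function names and toy values are hypothetical and are not taken from this repository; real embeddings would come from the upstream models.

```python
from statistics import mean, pstdev


def zscore_columns(rows):
    """Standardize each feature column to zero mean and unit variance."""
    cols = list(zip(*rows))
    # Constant columns have zero spread; fall back to 1.0 to avoid dividing by zero.
    stats = [(mean(c), pstdev(c) or 1.0) for c in cols]
    return [[(v - m) / s for v, (m, s) in zip(row, stats)] for row in rows]


def late_fusion(gene_emb, brain_emb):
    """Concatenate per-subject gene and brain embeddings after scaling each modality."""
    if len(gene_emb) != len(brain_emb):
        raise ValueError("both modalities need the same subjects in the same order")
    g, b = zscore_columns(gene_emb), zscore_columns(brain_emb)
    return [gi + bi for gi, bi in zip(g, b)]


# Toy data (hypothetical): 3 subjects, 2 gene features and 1 brain feature each.
gene = [[0.1, 2.0], [0.3, 4.0], [0.5, 6.0]]
brain = [[10.0], [20.0], [30.0]]
fused = late_fusion(gene, brain)
print(len(fused), len(fused[0]))
```

Scaling each modality before concatenation keeps a high-variance modality from dominating a downstream linear model; whether that is the right choice depends on the cohort and the predictor.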

Example workflows:

  • Gene-brain association discovery using WES + sMRI with CCA
  • fMRI embedding extraction with BrainLM for MDD prediction
  • Leave-one-gene-out (LOGO) attribution for gene importance
  • Multimodal fusion for clinical decision support
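
The leave-one-gene-out (LOGO) workflow listed above can be sketched as follows: compute a prediction from all gene embeddings, then recompute it with each gene removed and record the change. This is a minimal sketch under stated assumptions: mean pooling, the fixed linear weights, and the gene names are all hypothetical stand-ins, not the protocol defined in this repository's configs.

```python
def mean_pool(gene_embeddings):
    """Average per-gene embedding vectors into one subject-level vector."""
    vectors = list(gene_embeddings.values())
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def predict(vector, weights):
    """Toy linear predictor standing in for a trained downstream model."""
    return sum(x * w for x, w in zip(vector, weights))


def logo_attribution(gene_embeddings, weights):
    """Score each gene by how much the prediction changes when it is left out."""
    full = predict(mean_pool(gene_embeddings), weights)
    scores = {}
    for gene in gene_embeddings:
        reduced = {g: v for g, v in gene_embeddings.items() if g != gene}
        scores[gene] = full - predict(mean_pool(reduced), weights)
    return scores


# Hypothetical gene names and 2-dimensional embeddings for a single subject.
embeddings = {"GENE_A": [1.0, 0.0], "GENE_B": [0.0, 1.0], "GENE_C": [1.0, 1.0]}
weights = [2.0, -1.0]
scores = logo_attribution(embeddings, weights)
for gene, delta in sorted(scores.items(), key=lambda kv: -abs(kv[1])):
    print(gene, round(delta, 3))
```

A positive score means the gene pushed the prediction up; a negative score means removing it raised the prediction. The sketch needs at least two genes, since leaving out the only gene would leave nothing to pool.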

📦 What's Inside

📚 Documentation (docs/)

  • Code Walkthroughs - Step-by-step guides for 17 foundation models with consistent formatting
    • 🧬 Genetics (4): Caduceus, DNABERT-2, GENERator, Evo 2
    • 🧠 Brain (5): BrainLM, Brain-JEPA, Brain Harmony, BrainMT, SwiFT
    • 🏥 Multimodal/Clinical (6): BAGEL, MoT, M3FM, Me-LLaMA, TITAN, FMS-Medical catalog
  • Integration Playbooks - Multimodal fusion strategies (late fusion → contrastive → TAPE)
  • Data Schemas - UK Biobank, HCP, developmental cohorts
  • Decision Logs - Architectural choices and research rationale
  • Curated Papers - PDFs + Markdown summaries in docs/generated/kb_curated/

🏷️ Metadata Cards (kb/)

  • Model Cards (model_cards/*.yaml) - 20 model cards (17 FMs + 3 ARPA-H/reference cards) with architecture specs, embedding recipes, integration hooks
  • Dataset Cards (datasets/*.yaml) - 19 dataset specifications (6 UKB planning cards + 13 available/reference datasets)
  • Paper Cards (paper_cards/*.yaml) - 30 research papers with structured takeaways
  • Integration Cards (integration_cards/*.yaml) - Embedding strategies, harmonization methods, preprocessing pipelines

Browse all cards →

🔧 Tools & Scripts

  • scripts/manage_kb.py - Validate YAML cards, query embedding strategies
  • scripts/codex_gate.py - Quality gate for automated workflows
  • scripts/fetch_external_repos.sh - Sync upstream model repositories

⚙️ Experiment Configs

Ready-to-run YAML templates in configs/experiments/:

  • 01_cca_gene_smri.yaml - CCA + permutation baseline
  • 02_prediction_baselines.yaml - Gene vs Brain vs Fusion
  • 03_logo_gene_attribution.yaml - Gene attribution protocol
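
To make the "CCA + permutation baseline" idea concrete, here is a minimal sketch of the permutation step only. It assumes one summary score per modality per subject; in that one-dimensional case the first canonical correlation equals the absolute Pearson correlation, so Pearson's r is used as a stand-in for full CCA (which would need a linear algebra library). The function names, defaults, and toy values are hypothetical and do not reflect the contents of 01_cca_gene_smri.yaml.

```python
import random
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def permutation_p_value(x, y, n_perm=1000, seed=0):
    """Compare the observed |r| with |r| after shuffling the subject pairing."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks the gene-brain pairing across subjects
        if abs(pearson_r(x, shuffled)) >= observed:
            hits += 1
    # The +1 terms count the observed pairing itself, so p is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)


# Toy per-subject scores (hypothetical), one value per modality.
gene_score = [0.1, 0.4, 0.35, 0.8, 0.9, 1.2, 1.1, 1.5]
brain_score = [1.0, 1.3, 1.2, 1.9, 2.1, 2.4, 2.2, 3.0]
observed, p_value = permutation_p_value(gene_score, brain_score)
print(round(observed, 3), p_value)
```

Shuffling one modality keeps each modality's marginal distribution intact while destroying the subject-level pairing, which is what the null hypothesis of "no gene-brain association" requires.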

🎯 Foundation Models

Genetics Models

| Model | Best for | Context | Documentation |
|---|---|---|---|
| 🧬 Caduceus | RC-equivariant gene embeddings | DNA sequences | Walkthrough |
| 🧬 DNABERT-2 | Cross-species transfer | BPE tokenization | Walkthrough |
| 🧬 Evo 2 | Ultra-long regulatory regions | 1M context | Walkthrough |
| 🧬 GENERator | Generative modeling | 6-mer LM | Walkthrough |
| 🧬 HyenaDNA | Long-range DNA modeling | Efficient convolutions | Walkthrough |

Brain Models

| Model | Modality | Best for | Documentation |
|---|---|---|---|
| 🧠 BrainLM | fMRI | Site-robust embeddings | Walkthrough |
| 🧠 Brain-JEPA | fMRI | Lower-latency option | Walkthrough |
| 🧠 Brain Harmony | sMRI + fMRI | Multi-modal fusion | Walkthrough |
| 🧠 BrainMT | sMRI/fMRI | Mamba efficiency | Walkthrough |
| 🧠 SwiFT | fMRI | Hierarchical spatiotemporal | Walkthrough |

Multimodal & Clinical Models

| Model | Modalities | Best for | Documentation |
|---|---|---|---|
| 🏥 BAGEL | Vision + Text + Video | Unified multimodal FM with MoT | Walkthrough |
| 🏥 MoT | Text + Images + Speech | Sparse mixture-of-transformers | Walkthrough |
| 🏥 M3FM | CXR + Text | Multilingual medical reports | Walkthrough |
| 🏥 Me-LLaMA | Medical Text | LLM for clinical reasoning | Walkthrough |
| 🏥 TITAN | Histopathology | Whole-slide image analysis | Walkthrough |
| 🏥 FMS-Medical | Clinical Multi-modal | Medical foundation models catalog | Walkthrough |

📋 Research Papers

30 structured paper cards documenting:

  • 🧬 Genetics FMs (5): Caduceus, DNABERT-2, Evo 2, GENERator, HyenaDNA
  • 🧠 Brain FMs (5): BrainLM, Brain-JEPA, Brain Harmony, BrainMT, SwiFT
  • 🏥 Multimodal/Clinical FMs (6): BAGEL, MoT, M3FM, Me-LLaMA, TITAN, Flamingo
  • 🔗 Integration & Methods (11): Ensemble integration, Multimodal FMs survey, MM-LLM imaging, Oncology review, Yoon BioKDD, RC-equivariant networks, RC consistency for DNA LMs, Systems & algorithms for multi-hybrid LMs, Brain MRI bias unlearning, Brain multisite harmonization (MURD), Site unlearning (Dinsdale)
  • 🧬 Genomics & Population (2): GWAS diverse populations, PRS guide
  • 📊 Tabular Baseline (1): TabPFN (2023)
  • 📚 General (2): Representation learning, Foundation models overview

View all paper cards → | Browse summaries →


🔗 Integration Strategies

Three integration cards synthesize multimodal patterns.

View integration design patterns → | Multimodal architectures →


🔗 Reference Commands

Trace Embedding Strategies

# Show the full sMRI baseline recipe
python scripts/manage_kb.py ops strategy smri_free_surfer_pca512_v1

# Inspect harmonization metadata (e.g., MURD)
python scripts/manage_kb.py ops harmonization murd_t1_t2

Validate YAML Cards

python scripts/manage_kb.py validate models
python scripts/manage_kb.py validate datasets
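
The checks that scripts/manage_kb.py actually performs are not shown in this README. As a hedged illustration of what validating a metadata card can involve, the sketch below checks a parsed card (a plain dict, as a YAML loader would return) against a small set of required fields. The schema, field names, and card contents are hypothetical and are not the repository's real card format.

```python
# Hypothetical schema: field name -> expected Python type after YAML parsing.
REQUIRED_FIELDS = {"id": str, "name": str, "modality": str}


def validate_card(card):
    """Return a list of problems found in one card; an empty list means it passed."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in card:
            problems.append(f"missing field: {field}")
        elif not isinstance(card[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
    return problems


# Hypothetical cards: one complete, one with a missing and a mistyped field.
good = {"id": "brainlm", "name": "BrainLM", "modality": "fMRI"}
bad = {"id": "caduceus", "modality": 3}

for card in (good, bad):
    issues = validate_card(card)
    print(card["id"], "OK" if not issues else issues)
```

Collecting every problem before reporting, rather than stopping at the first one, tends to suit batch validation of many cards in a single run.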

Codex Quality Gate

# Cycle 1 – quick sanity before giving Codex control
python scripts/codex_gate.py --mode fast --label cycle1 --since origin/main

# Cycle 2 – full sweep before handing work back
python scripts/codex_gate.py --mode full --label cycle2 --since HEAD~1

🌍 Role in Larger Ecosystems

In addition to standalone neurogenomics analyses, neuro-omics-kb serves as the documentation layer for multimodal brain–omics–LLM foundation model efforts, including ARPA-H–style Brain-Omics Model (BOM) initiatives.

ARPA-H Brain-Omics Model (BOM) Alignment

This KB supports a staged progression from late-fusion baselines to contrastive learning to unified multimodal architectures:

  • Phase 1 (Current): Late fusion baselines with genetics + brain FMs
    • Tools: CCA+permutation, prediction baselines, partial correlations
    • Models: Caduceus/DNABERT-2 (genetics) + BrainLM/SwiFT (brain)
  • Phase 2 (Near-term): Two-tower contrastive alignment
    • Patterns: InfoNCE, frozen encoders, small projectors
    • Reference: M3FM, oncology multimodal review
  • Phase 3 (Long-term): Unified Brain-Omics Models
    • Architectures: MoT-style sparse transformers, BAGEL-style unified decoders
    • Integration: Gene-brain-behavior-language tokens with LLM as semantic hub
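
The Phase 2 pattern (InfoNCE over two towers) can be sketched in plain Python. The example assumes each subject already has a projected gene vector and a projected brain vector, as the outputs of small projectors on frozen encoders would be, and computes a symmetric InfoNCE loss in which each subject's own pair is the positive and all other subjects are negatives. The function names, the temperature value, and the toy vectors are hypothetical; a real setup would use an autodiff framework to train the projectors.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(anchors, positives, temperature=0.1):
    """Mean cross-entropy of picking each anchor's paired positive among all positives."""
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(a)


def symmetric_info_nce(gene_proj, brain_proj, temperature=0.1):
    """Average the gene-to-brain and brain-to-gene InfoNCE terms."""
    return 0.5 * (
        info_nce(gene_proj, brain_proj, temperature)
        + info_nce(brain_proj, gene_proj, temperature)
    )


# Toy projected embeddings (hypothetical) for 3 subjects.
gene_proj = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned = [[0.9, 0.1], [0.1, 0.9], [-0.9, -0.1]]
misaligned = [aligned[1], aligned[2], aligned[0]]  # pairing rotated by one subject

loss_aligned = symmetric_info_nce(gene_proj, aligned)
loss_misaligned = symmetric_info_nce(gene_proj, misaligned)
print(loss_aligned < loss_misaligned)
```

The loss is low when each subject's gene and brain vectors point the same way and high when the pairing is scrambled, which is the signal that two-tower alignment optimizes.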

Read Integration Plan → | View Design Patterns →


🔗 Related Repositories

  • Model Implementations: See links in individual model cards

📧 Contact

Maintainer: Allison Eun Se You
Purpose: Knowledge base for neuro-omics foundation model research
Scope: Documentation, metadata, integration strategies, and references to upstream implementations


Note: This is a documentation-first KB. Implementation code lives in the upstream repositories referenced under external_repos/ (a mix of tracked snapshots and fetch-on-demand placeholders).
