DNA Diffusion

Generative modeling of regulatory DNA sequences with diffusion probabilistic models.

Documentation: https://pinellolab.github.io/DNA-Diffusion

Source Code: https://github.com/pinellolab/DNA-Diffusion


Introduction

DNA-Diffusion is a diffusion-based model for generating 200 bp cell type-specific synthetic regulatory elements.

Installation

Our preferred package and project manager is uv. Please follow the installation instructions in the uv documentation.

To clone the repository and install the necessary packages, run:

git clone https://github.com/pinellolab/DNA-Diffusion.git
cd DNA-Diffusion
uv sync

This will create a virtual environment in .venv and install all dependencies listed in the uv.lock file. The installation works on both CPU and GPU, but we recommend Linux with a recent GPU (e.g. an A100). For the exact dependency versions, please refer to the uv.lock file.

Usage

Data

We provide a small subset of the DHS Index dataset for training that is located at data/K562_hESCT0_HepG2_GM12878_12k_sequences_per_group.txt.
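Before training, it can help to check how many sequences each cell type contributes. The sketch below is only an illustration: it assumes the file is tab-separated with a header row and that the columns are named `sequence` and `TAG`. Those names are assumptions, so check the file header and pass the right column names if they differ. The function `summarize_sequences` is hypothetical and not part of the repository.

```python
import csv
from collections import Counter


def summarize_sequences(path, group_column="TAG", sequence_column="sequence", delimiter="\t"):
    """Count rows per group and tally sequence lengths in a delimited text file.

    The default column names and the tab delimiter are assumptions about the
    file layout; adjust them to match the actual header.
    """
    group_counts = Counter()
    length_counts = Counter()
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter=delimiter)
        for row in reader:
            group_counts[row[group_column]] += 1
            length_counts[len(row[sequence_column])] += 1
    return group_counts, length_counts
```

For example, `summarize_sequences("data/K562_hESCT0_HepG2_GM12878_12k_sequences_per_group.txt")` returns two counters: one keyed by group label and one keyed by sequence length, which makes it easy to confirm that every sequence has the expected length.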

If you would like to recreate the dataset, you can call:

uv run master_dataset_and_filter.py

This downloads all the necessary data and creates two files: data/master_dataset.ftr, containing the full dataset of ~3.59 million entries, and data/filtered_dataset.txt, containing the same subset of sequences as above.

To run the data curation process as a notebook, use the marimo notebook at notebooks/marimo_master_dataset_and_filter.py. A rendered version of the notebook is provided at notebooks/marimo_master_dataset_and_filter.ipynb.

The notebook can be opened and run with the following command:

uvx marimo edit notebooks/marimo_master_dataset_and_filter.py

All of the data processing files use uv to manage dependencies, so all required libraries are installed when you run the above commands. See the uv documentation for more information on how to run uv scripts.

Training

We provide a basic config file for training the DNA-Diffusion model on the same subset of chromatin-accessible regions described in the Data section above.

To train the model, run:

uv run train.py

We also provide a base config for debugging that will use a single sequence for training. You can override the default training script to use this debugging config by calling:

uv run train.py -cn train_debug

Model Checkpoint

We have uploaded the model checkpoint to Hugging Face. We provide both a .pt file and a .safetensors file for the model. The .safetensors file is recommended as it is more efficient and secure. Before generating sequences, download the model checkpoint and update the corresponding path in configs/sampling/default_hf.yaml or configs/sampling/default.yaml to point to the downloaded checkpoint.
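If both checkpoint formats end up in the same download directory, a small helper can pick the recommended one before you copy its path into the sampling config. This is a minimal sketch using only the standard library; `pick_checkpoint` is a hypothetical helper, not part of the repository, and it does not load the model or edit any config file.

```python
from pathlib import Path


def pick_checkpoint(directory):
    """Return the first checkpoint found, preferring .safetensors over .pt."""
    directory = Path(directory)
    for suffix in (".safetensors", ".pt"):
        matches = sorted(directory.glob(f"*{suffix}"))
        if matches:
            return matches[0]
    raise FileNotFoundError(f"No .safetensors or .pt file found in {directory}")
```

The returned path is what you would paste into the checkpoint entry of the sampling config.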

Sequence Generation

We provide a basic config file for generating sequences with the diffusion model, which produces 1000 sequences per cell type. By default, generation uses a guidance scale of 1.0; this can be tuned in sample.py via the cond_weight_to_metric parameter. To generate sequences, run:

uv run sample.py
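To give some intuition for the guidance scale mentioned above, the sketch below shows one common formulation of classifier-free guidance, in which the conditional and unconditional noise predictions are blended by a weight `w`. It is a generic illustration in plain Python; whether DNA-Diffusion uses exactly this weighting convention is an assumption, and `guided_noise` is a hypothetical name.

```python
def guided_noise(eps_cond, eps_uncond, guidance_scale):
    """Blend noise predictions as (1 + w) * eps_cond - w * eps_uncond.

    With guidance_scale = 0 this returns the conditional prediction unchanged;
    larger values push the result further away from the unconditional one.
    """
    w = guidance_scale
    return [(1.0 + w) * c - w * u for c, u in zip(eps_cond, eps_uncond)]
```

Under this convention, a larger scale trades sample diversity for stronger adherence to the conditioning label (here, the cell type).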

The default setup for sampling generates 1000 sequences per cell type. To generate one sequence per cell type instead, pass the following command-line overrides:

uv run sample.py sampling.number_of_samples=1 sampling.sample_batch_size=1
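When scripting several sampling runs, the same overrides can be assembled programmatically. The sketch below only builds the command shown above from two numbers; the helper names are hypothetical, and it uses no overrides beyond `sampling.number_of_samples` and `sampling.sample_batch_size`.

```python
import subprocess


def build_sample_command(number_of_samples, sample_batch_size, script="sample.py"):
    """Return the argument list for a sampling run with the given overrides."""
    return [
        "uv",
        "run",
        script,
        f"sampling.number_of_samples={number_of_samples}",
        f"sampling.sample_batch_size={sample_batch_size}",
    ]


def run_sampling(number_of_samples, sample_batch_size):
    """Run the sampling script; assumes uv is installed and the current
    working directory is the repository root."""
    command = build_sample_command(number_of_samples, sample_batch_size)
    return subprocess.run(command, check=True)
```

Calling `build_sample_command(1, 1)` reproduces the single-sequence command above as a list suitable for `subprocess.run`.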

To generate sequences using the trained model hosted on Hugging Face, run:

uv run sample_hf.py

Examples

We provide an example notebook for training and sampling with the diffusion model. This notebook runs the previous commands for training and sampling. See notebooks/train_sample.ipynb for more details.

We also provide a Jupyter notebook for generating sequences with the trained model hosted on Hugging Face. This notebook runs the previous sampling commands and shows some example outputs. See notebooks/sample.ipynb for more details.

Both examples were run on Google Colab using a T4 GPU.

Contributors ✨

Thanks goes to these wonderful people (emoji key):

Lucas Ferreira da Silva 🤔 💻
Luca Pinello 🤔
Simon 🤔 💻

This project follows the all-contributors specification. Contributions of any kind welcome!