Generative modeling of regulatory DNA sequences with diffusion probabilistic models.
Documentation: https://pinellolab.github.io/DNA-Diffusion
Source Code: https://github.com/pinellolab/DNA-Diffusion
DNA-Diffusion is a diffusion-based model for generating 200bp, cell type-specific synthetic regulatory elements.
Our preferred package/project manager is uv. Please follow its recommended installation instructions.
To clone the repository and install the necessary packages, run:
git clone https://github.com/pinellolab/DNA-Diffusion.git
cd DNA-Diffusion
uv sync
This will create a virtual environment in .venv and install all dependencies listed in the uv.lock file. The project is compatible with both CPU and GPU, but the preferred setup is Linux with a recent GPU (e.g., an A100). For the exact dependency versions, please refer to the uv.lock file.
We provide a small subset of the DHS Index dataset for training, located at data/K562_hESCT0_HepG2_GM12878_12k_sequences_per_group.txt.
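Once downloaded, the dataset can be inspected with standard tooling. Below is a minimal, hedged sketch of counting sequences per cell type from a tab-separated file; the column names ("sequence", "TAG") and the file layout are assumptions for illustration, not confirmed details of the repository's format.

```python
import csv
import io

# Hypothetical example data standing in for the filtered dataset file.
# We assume a TSV with a raw sequence column and a cell-type tag column.
sample = "sequence\tTAG\nACGTACGT\tK562\nTTGGCCAA\tGM12878\n"

def count_per_cell_type(handle):
    """Count how many sequences belong to each cell-type tag."""
    counts = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        counts[row["TAG"]] = counts.get(row["TAG"], 0) + 1
    return counts

counts = count_per_cell_type(io.StringIO(sample))
print(counts)  # {'K562': 1, 'GM12878': 1}
```

For the real file, replace the io.StringIO handle with open("data/K562_hESCT0_HepG2_GM12878_12k_sequences_per_group.txt") and adjust the column names to match the actual header.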
If you would like to recreate the dataset, you can call:
uv run master_dataset_and_filter.py
which will download all the necessary data and create a file data/master_dataset.ftr containing the full dataset of ~3.59 million sequences, and a file data/filtered_dataset.txt containing the same subset of sequences as above.
To run the data curation process as a notebook, a marimo notebook file is provided at notebooks/marimo_master_dataset_and_filter.py, with a rendered version at notebooks/marimo_master_dataset_and_filter.ipynb.
This notebook can be opened/run with the following command:
uvx marimo edit notebooks/marimo_master_dataset_and_filter.py
All of the data processing files use uv to manage dependencies, so all required libraries are installed when you run the commands above. See the uv documentation for more information on running uv scripts.
We provide a basic config file for training the diffusion model on the same subset of chromatin-accessible regions described in the data section above.
To train the model, call:
uv run train.py
We also provide a base config for debugging that trains on a single sequence. You can override the default training config by calling:
uv run train.py -cn train_debug
We have uploaded the model checkpoint to Hugging Face. We provide both a .pt file and a .safetensors file for the model; the .safetensors file is recommended as it is more efficient and secure. Before generating sequences, download the model checkpoint and update the corresponding path in configs/sampling/default_hf.yaml or configs/sampling/default.yaml to point to the downloaded checkpoint.
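Updating the config can also be scripted. The sketch below rewrites a checkpoint-path line in YAML-like text using only the standard library; the key name "checkpoint_path" and the example values are assumptions for illustration — check the actual keys in the config files under configs/sampling/.

```python
def set_checkpoint_path(config_text: str, new_path: str) -> str:
    """Replace the value of a 'checkpoint_path:' line, preserving indentation."""
    lines = []
    for line in config_text.splitlines():
        if line.strip().startswith("checkpoint_path:"):
            indent = line[: len(line) - len(line.lstrip())]
            line = f"{indent}checkpoint_path: {new_path}"
        lines.append(line)
    return "\n".join(lines)

# Hypothetical config fragment and checkpoint location.
cfg = "sampling:\n  checkpoint_path: /old/model.pt\n"
print(set_checkpoint_path(cfg, "/models/dna_diffusion.safetensors"))
```

A plain string rewrite avoids adding a YAML dependency, at the cost of only handling simple single-line keys; for nested or quoted values, editing the file by hand is safer.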
We provide a basic config file for generating sequences with the diffusion model, producing 1000 sequences per cell type. Generation uses a guidance scale of 1.0 by default; this can be tuned in sample.py via the cond_weight_to_metric parameter. To generate sequences, call:
uv run sample.py
The default sampling setup generates 1000 sequences per cell type. You can override this to generate one sequence per cell type with the following CLI flags:
uv run sample.py sampling.number_of_samples=1 sampling.sample_batch_size=1
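The generated sequences are DNA strings, which downstream analyses often convert to a one-hot encoding. Here is a minimal stdlib sketch; the A, C, G, T column order is a common convention assumed for illustration, not necessarily the ordering used inside the repository.

```python
ALPHABET = "ACGT"

def one_hot(seq: str) -> list[list[int]]:
    """Encode a DNA string as a (len(seq), 4) nested list of 0/1 values."""
    index = {base: i for i, base in enumerate(ALPHABET)}
    encoded = []
    for base in seq.upper():
        row = [0, 0, 0, 0]
        row[index[base]] = 1
        encoded.append(row)
    return encoded

print(one_hot("ACGT"))  # [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
```

For a full 200bp generated sequence this yields a 200×4 matrix, the usual input shape for sequence-based models.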
To generate sequences using the trained model hosted on Hugging Face, call:
uv run sample_hf.py
We provide an example notebook for training and sampling with the diffusion model. This notebook runs the previous commands for training and sampling.
See notebooks/train_sample.ipynb
for more details.
We also provide a Jupyter notebook for generating sequences with the diffusion model using the trained model hosted on Hugging Face. This notebook runs the previous commands for sampling and shows some example outputs.
See notebooks/sample.ipynb
for more details.
Both examples were run on Google Colab using a T4 GPU.
Thanks goes to these wonderful people (emoji key):
Lucas Ferreira da Silva 🤔 💻 |
Luca Pinello 🤔 |
Simon 🤔 💻 |
This project follows the all-contributors specification. Contributions of any kind welcome!