Commit

optimize
SeonghwanSeo committed Oct 11, 2024
1 parent a1dc0fa commit df41275
Showing 136 changed files with 8,052 additions and 8,584 deletions.
3 changes: 0 additions & 3 deletions .bandit

This file was deleted.

1 change: 0 additions & 1 deletion .git-blame-ignore-revs

This file was deleted.

16 changes: 12 additions & 4 deletions .gitignore
@@ -1,11 +1,19 @@
# Model cache
src/gflownet/models/cache/
data/building_blocks/Enamine*
envs/
data/building_blocks/
data/envs/
data/experiments/
logs/
debug/
.github/
wandb_run/
wandb_debug/
experiments/analysis
experiments/release-ckpt
storage*
job*.sh
slurm-*
wandb*
nogit*/
unidock_2024*

# Byte-compiled / optimized / DLL files
__pycache__/
71 changes: 0 additions & 71 deletions .pre-commit-config.yaml

This file was deleted.

4 changes: 0 additions & 4 deletions .yapfignore

This file was deleted.

1 change: 0 additions & 1 deletion CODEOWNERS

This file was deleted.

6 changes: 0 additions & 6 deletions Dockerfile.test

This file was deleted.

2 changes: 1 addition & 1 deletion LICENSE
@@ -1,6 +1,6 @@
MIT License

Copyright (c) 2020 Recursion Pharmaceuticals
Copyright (c) 2024 Seonghwan Seo

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
1 change: 0 additions & 1 deletion MANIFEST.in

This file was deleted.

89 changes: 33 additions & 56 deletions README.md
@@ -1,75 +1,52 @@


[![Build-and-Test](https://github.com/recursionpharma/gflownet/actions/workflows/build-and-test.yaml/badge.svg)](https://github.com/recursionpharma/gflownet/actions/workflows/build-and-test.yaml)
[![Code Quality](https://github.com/recursionpharma/gflownet/actions/workflows/code-quality.yaml/badge.svg)](https://github.com/recursionpharma/gflownet/actions/workflows/code-quality.yaml)
[![Python versions](https://img.shields.io/badge/Python-3.9%2B-blue)](https://www.python.org/downloads/)
[![Python versions](https://img.shields.io/badge/Python-3.10%2B-blue)](https://www.python.org/downloads/)
[![license: MIT](https://img.shields.io/badge/License-MIT-purple.svg)](LICENSE)

# gflownet

GFlowNet-related training and environment code on graphs.

**Primer**

[GFlowNet](https://yoshuabengio.org/2022/03/05/generative-flow-networks/), short for Generative Flow Network, is a novel generative modeling framework, particularly suited for discrete, combinatorial objects. Here in particular it is implemented for graph generation.

The idea behind GFN is to estimate flows in a (graph-theoretic) directed acyclic network*. The network represents all possible ways of constructing an object, and so knowing the flow gives us a policy which we can follow to sequentially construct objects. Such a sequence of partially constructed objects is a _trajectory_. *Perhaps confusingly, the _network_ in GFN refers to the state space, not a neural network architecture.

Here the objects we construct are themselves graphs (e.g. graphs of atoms), which are constructed node by node. To make policy predictions, we use a graph neural network. This GNN outputs per-node logits (e.g. add an atom to this atom, or add a bond between these two atoms), as well as per-graph logits (e.g. stop/"done constructing this object").

The GNN model can be trained on a mix of existing data (offline) and self-generated data (online), the latter being obtained by querying the model sequentially to obtain trajectories. For offline data, we can easily generate trajectories since we know the end state.
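As a rough illustration of the primer above, the following sketch computes the trajectory balance residual for a single trajectory in plain Python. The function name, argument names, and toy numbers are illustrative assumptions and are not taken from this repository's code; the formula is the standard squared difference between the forward quantity `log Z + sum(log P_F)` and the backward quantity `log R + sum(log P_B)`.

```python
import math


def trajectory_balance_loss(log_z, log_pf_steps, log_pb_steps, log_reward):
    """Squared trajectory balance residual for one trajectory.

    log_z: learned estimate of the log partition function (total flow).
    log_pf_steps: forward-policy log-probabilities, one per step.
    log_pb_steps: backward-policy log-probabilities, one per step.
    log_reward: log of the reward of the final object.
    """
    forward = log_z + sum(log_pf_steps)
    backward = log_reward + sum(log_pb_steps)
    return (forward - backward) ** 2


# Toy trajectory (hypothetical numbers): two construction steps plus "stop".
log_pf = [math.log(0.5), math.log(0.25), math.log(0.5)]
# Assumption: every state has exactly one parent, so each backward
# probability is 1 and its log is 0.
log_pb = [0.0, 0.0, 0.0]

# Z * prod(P_F) = 16 * 0.0625 = 1 = R, so the trajectory is balanced.
loss_balanced = trajectory_balance_loss(math.log(16.0), log_pf, log_pb, math.log(1.0))

# Doubling the reward leaves a residual of log(2) before squaring.
loss_unbalanced = trajectory_balance_loss(math.log(16.0), log_pf, log_pb, math.log(2.0))
```

In training, `log_z` and the policy log-probabilities would come from the model, and the loss would be averaged over a batch of sampled trajectories.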
# RxnFlow: Generative Flows on Synthetic Pathways for Drug Design

## Repo overview
Official implementation of ***Generative Flows on Synthetic Pathways for Drug Design*** by Seonghwan Seo, Minsu Kim, Tony Shen, Martin Ester, Jinkyu Park, Sungsoo Ahn, and Woo Youn Kim.

- [algo](src/gflownet/algo), contains GFlowNet algorithms implementations ([Trajectory Balance](https://arxiv.org/abs/2201.13259), [SubTB](https://arxiv.org/abs/2209.12782), [Flow Matching](https://arxiv.org/abs/2106.04399)), as well as some baselines. These implement how to sample trajectories from a model and compute the loss from trajectories.
- [data](src/gflownet/data), contains dataset definitions, data loading and data sampling utilities.
- [envs](src/gflownet/envs), contains environment classes; a graph-building environment base, and a molecular graph context class. The base environment is agnostic to what kind of graph is being made, and the context class specifies mappings from graphs to objects (e.g. molecules) and torch geometric Data.
- [examples](docs/examples), contains simple example implementations of GFlowNet.
- [models](src/gflownet/models), contains model definitions.
- [tasks](src/gflownet/tasks), contains training code.
  - [qm9](src/gflownet/tasks/qm9/qm9.py), temperature-conditional molecule sampler based on QM9's HOMO-LUMO gap data as a reward.
  - [seh_frag](src/gflownet/tasks/seh_frag.py), reproducing Bengio et al. 2021, fragment-based molecule design targeting the sEH protein.
  - [seh_frag_moo](src/gflownet/tasks/seh_frag_moo.py), same as the above, but with multi-objective optimization (incl. QED, SA, and molecular weight objectives).
- [utils](src/gflownet/utils), contains utilities (multiprocessing, metrics, conditioning).
- [`trainer.py`](src/gflownet/trainer.py), defines a general harness for training GFlowNet models.
- [`online_trainer.py`](src/gflownet/online_trainer.py), defines a typical online-GFN training loop.
[paper]

See [implementation notes](docs/implementation_notes.md) for more.
RxnFlow is a synthesis-oriented generative framework that aims to discover diverse drug candidates through the GFlowNet objective and a large action space.

## Getting started
- RxnFlow can operate on large synthetic action spaces comprising 1.2M building blocks and 71 reaction templates without memory overhead.
- RxnFlow can explore a broader chemical space in fewer reaction steps, resulting in higher diversity, higher potency, and lower synthetic complexity of generated molecules.
- RxnFlow can generate molecules with expanded or modified building block libraries without retraining.

A good place to get started is with the [sEH fragment-based MOO task](src/gflownet/tasks/seh_frag_moo.py). The file `seh_frag_moo.py` is runnable as-is (although you may want to change the default configuration in `main()`).
The implementation of this project builds upon [recursionpharma/gflownet](https://github.com/recursionpharma/gflownet), which is released under the MIT license.

## Installation
## Setup

### PIP

This package is installable as a PIP package, but since it depends on some torch-geometric package wheels, the `--find-links` arguments must be specified as well:
### Install

```bash
pip install -e . --find-links https://data.pyg.org/whl/torch-2.1.2+cu121.html
# python: 3.10
conda install openbabel
pip install -e . --find-links https://data.pyg.org/whl/torch-2.3.1+cu121.html
```
Or for CPU use:

```bash
pip install -e . --find-links https://data.pyg.org/whl/torch-2.1.2+cpu.html
```
### Data

To install or [depend on](https://matiascodesal.com/blog/how-use-git-repository-pip-dependency/) a specific tag, for example here `v0.0.10`, use the following scheme:
```bash
pip install git+https://github.com/recursionpharma/gflownet.git@v0.0.10 --find-links ...
```
To construct the synthetic action space, RxnFlow requires a reaction template set and a building block library.

If the package dependencies do not seem to work, you may need to install the exact frozen versions listed in `requirements/`, e.g. `pip install -r requirements/main-3.10.txt`.
The reaction template set used in the paper contains 13 unimolecular reactions and 58 bimolecular reactions and was constructed by [Cretu et al.](https://github.com/mirunacrt/synflownet). The template set is available under [data/template/hb_edited.txt](data/template/hb_edited.txt).

## Developing & Contributing
The Enamine building block library is available upon request at [https://enamine.net/building-blocks/building-blocks-catalog](https://enamine.net/building-blocks/building-blocks-catalog). We used the "Comprehensive Catalog" released on 2024.06.10.

External contributions are welcome.
- Use the Comprehensive Catalog

To install the developer dependencies:
```
pip install -e '.[dev]' --find-links https://data.pyg.org/whl/torch-2.1.2+cu121.html
```
```bash
cd data
# case1: single-step
python scripts/a_sdf_to_env.py -b <CATALOG_SDF> -d envs/enamine_all --cpu <CPU>

# case2: two-step
python scripts/b1_sdf_to_smi.py -b <CATALOG_SDF> -o building_blocks/blocks.smi --cpu <CPU>
python scripts/b2_smi_to_env.py -b building_blocks/blocks.smi -d envs/enamine_all --cpu <CPU> --skip_sanitize
```

- Use a custom SMILES file (`.smi`)

We use `tox` to run tests and linting, and `pre-commit` to run checks before committing.
To ensure that these checks pass, simply run `tox -e style` and `tox run` to run linters and tests, respectively.
```bash
python scripts/b2_smi_to_env.py -b <SMILES-FILE> -d ./envs/<ENV> --cpu <CPU>
```
2 changes: 0 additions & 2 deletions VERSION

This file was deleted.

5 changes: 5 additions & 0 deletions data/README.md
@@ -0,0 +1,5 @@
# Enamine Building Block Environment Construction

The Enamine building block library is not freely available. You can obtain the "Comprehensive Catalog" from https://enamine.net/building-blocks/building-blocks-catalog after registering and requesting access to the library.

In our manuscript, we used the June 10, 2024 version of the catalog. For reproducibility, we provide the list of Enamine IDs used in our paper at `experiments/paper_enamine_id.txt.gz`.
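
A minimal sketch of how such an ID list could be used to restrict a newer catalog to the blocks used in the paper. It assumes the gzip file holds one Enamine ID per line, which this README does not state; the helper names and the demo IDs are made up for illustration and are not part of the repository's scripts.

```python
import gzip
import tempfile
from pathlib import Path


def load_enamine_ids(path):
    """Read IDs from a gzip-compressed text file (assumed: one ID per line)."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def filter_blocks(rows, allowed_ids):
    """Keep (smiles, block_id) rows whose ID appears in allowed_ids."""
    return [(smiles, block_id) for smiles, block_id in rows if block_id in allowed_ids]


# Demo on a temporary file with placeholder IDs (not real Enamine IDs).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_ids.txt.gz"
    with gzip.open(demo_path, "wt", encoding="utf-8") as handle:
        handle.write("ID-A\nID-B\n\n")
    demo_ids = load_enamine_ids(demo_path)

kept = filter_blocks([("CCO", "ID-A"), ("CCN", "ID-Z")], demo_ids)
```

For the real file, the call would be `load_enamine_ids("experiments/paper_enamine_id.txt.gz")`, provided the one-ID-per-line assumption holds.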
32 changes: 32 additions & 0 deletions data/experiments/README.md
@@ -0,0 +1,32 @@
# Proteins used in experiments

## LIT-PCBA optimization

From https://drugdesign.unistra.fr/LIT-PCBA/

| Target | PDB ID | Center |
| ---------- | ------ | --------------------- |
| ADRB2 | 4ldo | -1.96, -12.27, -48.98 |
| ALDH1 | 5l2m | 34.43, -16.88, 13.77 |
| ESR_ago | 2p15 | -35.22, 4.64, 20.78 |
| ESR_antago | 2iok | 17.85, 35.51, 52.49 |
| FEN1 | 5fv7 | -16.81, -4.80, 0.62 |
| GBA | 2v3d | 32.44, 33.88, -19.56 |
| IDH1 | 4umx | 12.11, 28.09, 80.47 |
| KAT2A | 5h86 | -0.11, 5.73, 10.14 |
| MAPK1 | 4zzn | -15.69, 14.49, 42.72 |
| MTORC1 | 4dri | 35.38, 49.65, 36.21 |
| OPRK1 | 6b73 | 58.61, -24.16, -4.32 |
| PKM2 | 4jpg | 8.64, 2.94, 10.76 |
| PPARG | 5y2t | 8.30, -1.02, 46.32 |
| TP53 | 3zme | 89.32, 91.82, -44.87 |
| VDR | 3a2i | 11.38, -3.12, -31.57 |
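
The table above can be loaded into a lookup structure before setting up docking runs. The sketch below parses rows in this Markdown layout into a dictionary keyed by target name. The `Pocket` type and `parse_pockets` function are hypothetical names, only three rows are reproduced as an excerpt, and the Center column is assumed to hold x, y, z coordinates (the table does not state units).

```python
from typing import NamedTuple


class Pocket(NamedTuple):
    pdb_id: str
    center: tuple[float, float, float]


# Excerpt of the table above (header, separator, and three data rows).
TABLE_EXCERPT = """
| Target | PDB ID | Center |
| ------ | ------ | ------ |
| ADRB2 | 4ldo | -1.96, -12.27, -48.98 |
| ALDH1 | 5l2m | 34.43, -16.88, 13.77 |
| PPARG | 5y2t | 8.30, -1.02, 46.32 |
"""


def parse_pockets(table_text):
    """Map target name -> Pocket, skipping header and separator rows."""
    pockets = {}
    for line in table_text.strip().splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if len(cells) != 3:
            continue
        target, pdb_id, center_text = cells
        try:
            x, y, z = (float(value) for value in center_text.split(","))
        except ValueError:
            continue  # header or separator row: center is not three numbers
        pockets[target] = Pocket(pdb_id, (x, y, z))
    return pockets


pockets = parse_pockets(TABLE_EXCERPT)
```

Passing the full table text instead of the excerpt yields one entry per target.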

### Note

Because of a valence issue, we removed the 2623rd atom (O) from the PPARG structure.

## SBDD optimization (zero-shot sampling with pocket-conditioning)

From https://github.com/gnina/models/tree/master/data/CrossDocked2020
15,201 training pockets + 100 test pockets.