CHORISO (CHemical Organic ReactIon Smiles Omnibus) is a benchmarking suite for reaction prediction machine learning models.

We release:

A highly curated dataset of academic chemical reactions (download ChORISO and splits)
A suite of standardized evaluation metrics
A compilation of models for reaction prediction (choriso-models)

It is derived from the CJHIF dataset. This repo provides all the code used for dataset curation, splitting and analysis reported in the paper, as well as the metrics for evaluation of models.

🚀 Installation

First clone this repo:

git clone https://github.com/schwallergroup/choriso.git
cd choriso

Set up and activate the environment:

conda env create -f environment.yml
conda activate choriso
pip install rxnmapper --no-deps

🔥 Quick start

To download the preprocessed dataset and split it to obtain the corresponding train, validation and test sets, run the following command:

choriso --download_processed \
	--run split

After running a model from choriso-models, analyse its results with:

analyse --results_folders='path/to/results/folder'

Results will be stored in the same directory as benchmarking-results.

🧠 Advanced usage

This repo lets you reproduce the results in the paper through different flags and modes.

📥 Download preprocessed dataset:

choriso --download_processed \
	--out-dir data/processed/

⚙️ Preprocessing

Get the raw datasets (CJHIF, USPTO) and preprocess them. The --uspto flag runs the same processing pipeline on the raw USPTO data:

NOTE: To run the clean step you need to have NameRXN (v3.4.0) installed.

choriso --download_raw \
	--uspto \
	--data-dir=data/raw/ \
	--out-dir data/processed/ \
	--run clean \
	--run atom_map

🔍 Stereo check

For this step you need to have either downloaded the preprocessed dataset or run the preprocessing pipeline. The step finds reactions with stereochemistry issues and corrects them in the dataset.

choriso --run analysis

➗ Splitting

In the paper, we describe a splitting scheme to obtain test splits by product, product molecular weight and random. When doing the splitting, all the testing reactions go to a single test set file, with the split column indicating to which split they belong. To run the splitting:

choriso --run split
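Since all test reactions end up in one file, a downstream script usually has to separate them again by the split column. The following is only a minimal sketch of that step using the standard library: the `rxn_smiles` column name and the `random` / `product` labels are assumptions made for illustration, and the in-memory text stands in for the real test set file.

```python
import csv
import io
from collections import defaultdict


def group_by_split(tsv_text, split_column="split"):
    """Group the rows of a tab-separated table by the value of `split_column`."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    groups = defaultdict(list)
    for row in reader:
        groups[row[split_column]].append(row)
    return dict(groups)


# Toy stand-in for the test set file (column name and labels are hypothetical).
example = (
    "rxn_smiles\tsplit\n"
    "CCO>>CC=O\trandom\n"
    "CCBr.O>>CCO\tproduct\n"
    "CC(=O)O.CO>>CC(=O)OC\trandom\n"
)

groups = group_by_split(example)
for name, rows in groups.items():
    print(name, len(rows))
```

For a real file, the same function can be fed the result of reading the file as text; check the header of your test set file for the actual column names.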

By default, reactions with products below 150 a.m.u. go to the low MW set and reactions with products above 700 a.m.u. go to the high MW set. Both thresholds can be changed. For example, to create a low MW test split with a threshold of 100 a.m.u. and a high MW test split with a threshold of 750 a.m.u., run:

choriso --run split \
	--low_mw=100 \
	--high_mw=750
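To make the effect of the two thresholds concrete, the rule described above can be sketched as a small function. This is an illustration of the thresholding logic only, not the code choriso runs: the function name and the returned labels are hypothetical, and the product molecular weight is assumed to be computed elsewhere.

```python
def assign_mw_split(product_mw, low_mw=150.0, high_mw=700.0):
    """Return a hypothetical label for the MW test split a product falls into.

    Products lighter than `low_mw` go to the low MW set, products heavier
    than `high_mw` go to the high MW set, and everything else gets None.
    """
    if low_mw >= high_mw:
        raise ValueError("low_mw must be smaller than high_mw")
    if product_mw < low_mw:
        return "low_mw"
    if product_mw > high_mw:
        return "high_mw"
    return None


# With the default thresholds a 120 a.m.u. product is in the low MW set;
# lowering the threshold to 100 a.m.u. moves it out of that set.
print(assign_mw_split(120.0))
print(assign_mw_split(120.0, low_mw=100.0))
```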

You can optionally augment the SMILES to double the size of the training set:

choriso --run split \
	--augment

By default, the splitting will be done on the choriso dataset, which is called choriso.tsv. If you want to split a different dataset, you can specify the path to the dataset using the --split_file_name option. For example, to split the USPTO dataset, run:

choriso --run split \
    --split_file_name=uspto.tsv

📊 Logging

By default the execution of any step will store all results locally.

Optionally, you can log all results from the preprocessing to W&B by adding the --wandb_log flag to any step.

As an example

choriso --run clean \
	--wandb_log

will execute the clean step and upload all results (plots, metrics) to W&B.

📈 Metrics

You can also use the implemented metrics from the paper to evaluate your own results. We have adapted the evaluation pipeline to the files from the benchmarking repo. As an example:

analyse --results_folders='OpenNMT_Transformer'

This will launch the analysis on all the files in the OpenNMT_Transformer folder. The output files should follow the same structure as the example included in the benchmarking repo. The program computes the chemistry metrics by default, which require a template with radius=0 and a template with radius=1 (both columns should be present in the test set file).

Flagging individual reactions

You can use the metrics functions to check whether a specific reaction is regio- or stereoselective. As an example:

from choriso.metrics.selectivity import flag_regio_problem, flag_stereo_problem

regio_rxn = 'BrCc1ccccc1.C1CCOC1.C=CC(O)CO.[H-].[Na+]>>C=CC(O)COCc1ccccc1'
stereo_rxn = 'C=C(NC(C)=O)c1ccc(OC)cc1.ClCCl.[H][H].[Rh+]>>COc1ccc([C@@H](C)NC(C)=O)cc1'

print(flag_regio_problem(regio_rxn))
print(flag_stereo_problem(stereo_rxn))

The output will display the flagging labels:

True
True
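The reaction strings passed to these functions are reaction SMILES in which reactants and reagents sit to the left of `>>` and the product to the right. When preparing your own inputs, a small sanity check on that format can help. The helper below is a hypothetical illustration and not part of choriso; it only handles the `left>>right` form used in the examples above.

```python
def split_reaction_smiles(rxn_smiles):
    """Split a 'left>>right' reaction SMILES into two lists of molecule SMILES."""
    left, separator, right = rxn_smiles.partition(">>")
    if not separator:
        raise ValueError("expected a reaction SMILES containing '>>'")
    # Individual molecules on each side are separated by dots.
    precursors = [smiles for smiles in left.split(".") if smiles]
    products = [smiles for smiles in right.split(".") if smiles]
    return precursors, products


regio_rxn = "BrCc1ccccc1.C1CCOC1.C=CC(O)CO.[H-].[Na+]>>C=CC(O)COCc1ccccc1"
precursors, products = split_reaction_smiles(regio_rxn)
print(len(precursors), "precursors:", precursors)
print(len(products), "product:", products)
```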

👐 Contributing

Contributions, whether filing an issue, making a pull request, or forking, are appreciated. See CONTRIBUTING.md for more information on getting involved.

👋 Attribution

⚖️ License

The code in this package is licensed under the MIT License.

🍪 Cookiecutter

This package was created with @audreyfeldroy's cookiecutter package using @cthoyt's cookiecutter-snekpack template.

🛠️ For Developers

See developer instructions

This final section of the README is for those who want to get involved by making a code contribution.

Development Installation

To install in development mode, use the following:

$ git clone https://github.com/schwallergroup/choriso.git
$ cd choriso
$ pip install -e .

🥼 Testing

After cloning the repository and installing tox with pip install tox, the unit tests in the tests/ folder can be run reproducibly with:

$ tox

Additionally, these tests are automatically re-run with each commit in a GitHub Action.

📖 Building the Documentation

The documentation can be built locally using the following:

$ git clone https://github.com/schwallergroup/choriso.git
$ cd choriso
$ tox -e docs
$ open docs/build/html/index.html

The documentation build automatically installs the package as well as the docs extra specified in setup.cfg. Sphinx plugins like texext can be added there. They also need to be added to the extensions list in docs/source/conf.py.

📦 Making a Release

After installing the package in development mode and installing tox with pip install tox, the commands for making a new release are contained within the finish environment in tox.ini. Run the following from the shell:

$ tox -e finish

This script does the following:

Uses Bump2Version to switch the version number in setup.cfg, src/choriso/version.py, and docs/source/conf.py to not have the -dev suffix
Packages the code in both a tar archive and a wheel using build
Uploads to PyPI using twine. Be sure to have a .pypirc file configured to avoid the need for manual input at this step
Pushes to GitHub. You'll need to make a release for the commit where the version was bumped.
Bumps the version to the next patch. If you made big changes and want to bump the version by minor, you can run tox -e bumpversion -- minor afterwards.
