CHORISO (CHemical Organic ReactIon Smiles Omnibus) is a benchmarking suite for reaction prediction machine learning models.
We release:
- A highly curated dataset of academic chemical reactions (download ChORISO and splits)
- A suite of standardized evaluation metrics
- A compilation of models for reaction prediction (choriso-models)
The dataset is derived from the CJHIF dataset. This repo provides all the code used for the dataset curation, splitting and analysis reported in the paper, as well as the metrics for evaluating models.
First clone this repo:
git clone https://github.com/schwallergroup/choriso.git
cd choriso
Set up and activate the environment:
conda env create -f environment.yml
conda activate choriso
pip install rxnmapper --no-deps
To download the preprocessed dataset and split it to obtain the corresponding train, validation and test sets, run the following command:
choriso --download_processed \
--run split
After running a command from choriso-models, analyse your model's results using:
analyse --results_folders='path/to/results/folder'
Results will be stored in the same directory as benchmarking-results.
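If you would rather drive these quick-start commands from a Python script, the sketch below builds the argument lists for them. It is only an illustration: the helper names `choriso_command` and `analyse_command` are hypothetical and not part of the package, and the sketch uses only the flags shown above.

```python
import shlex


def choriso_command(steps, download_processed=False):
    """Build the argv list for a `choriso` call (hypothetical helper)."""
    argv = ["choriso"]
    if download_processed:
        argv.append("--download_processed")
    # Each requested step is passed with its own --run flag, as in the README.
    for step in steps:
        argv += ["--run", step]
    return argv


def analyse_command(results_folder):
    """Build the argv list for an `analyse` call (hypothetical helper)."""
    return ["analyse", f"--results_folders={results_folder}"]


split_cmd = choriso_command(["split"], download_processed=True)
analyse_cmd = analyse_command("path/to/results/folder")

# Print the commands as they would be typed in a shell.
print(shlex.join(split_cmd))
print(shlex.join(analyse_cmd))
```

To actually execute a command, pass the list to `subprocess.run(split_cmd, check=True)` inside the activated `choriso` environment.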
Advanced usage
This repo lets you reproduce the results in the paper through different flags and modes.
choriso --download_processed \
--out-dir data/processed/
Get the raw datasets (CJHIF, USPTO) and preprocess them. The --uspto flag runs the same processing pipeline for the raw USPTO data:
NOTE: To run the clean step you need to have NameRXN (v3.4.0) installed.
choriso --download_raw \
--uspto \
--data-dir=data/raw/ \
--out-dir data/processed/ \
--run clean \
--run atom_map
For this step you need to have either downloaded the preprocessed dataset or run the preprocessing pipeline. The step checks for reactions with stereochemistry issues and corrects the dataset.
choriso --run analysis
In the paper, we describe a splitting scheme to obtain test splits by product, by product molecular weight, and at random. When doing the splitting, all the testing reactions go to a single test set file, with the split column indicating to which split they belong. To run the splitting:
choriso --run split
By default, reactions with products below 150 a.m.u. go to the low MW set and reactions with products above 700 a.m.u. go to the high MW set. These values can be modified and adapted to your preferences. For example, to create a split to test on low MW with a threshold of 100 a.m.u., and another split on high MW with a threshold of 750 a.m.u., run:
choriso --run split \
--low_mw=100 \
--high_mw=750
You can optionally augment the SMILES to double the size of the training set:
choriso --run split \
--augment
By default, the splitting will be done on the choriso dataset, which is called choriso.tsv. If you want to split a different dataset, you can specify the path to the dataset using the --split_file_name option. For example, to split the USPTO dataset, run:
choriso --run split \
--split_file_name=uspto.tsv
By default, the execution of any step will store all results locally.
Optionally, you can log all results from the preprocessing to W&B using the --wandb_log flag at any step.
As an example,
choriso --run clean \
--wandb_log
will execute the selected step (clean, in this case) and upload all results (plots, metrics) to W&B.
You can also use the implemented metrics from the paper to evaluate your own results. We have adapted the evaluation pipeline to the files from the benchmarking repo. As an example:
analyse --results_folders='OpenNMT_Transformer'
This will launch the analysis on all the files in the OpenNMT_Transformer folder. The output files should have the same structure as the example included in the benchmarking repo. By default, the program computes the chemistry metrics, which require a template with radius=0 and a template with radius=1 (these columns must be present in the test set file).
You can use the metrics functions to check whether a specific reaction is regio- or stereoselective. As an example:
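Before launching the analysis, it can help to confirm that a test set file has the columns the chemistry metrics need and to see how many reactions fall in each split. The sketch below is a minimal illustration, not part of the package: the template column names `template_r0` and `template_r1`, the `rxn` column, and the split labels in the sample are assumptions made up for the example, so substitute the headers actually used in your file.

```python
import csv
import io
from collections import Counter

# Hypothetical header names for the radius=0 and radius=1 templates;
# check your own test set file for the real ones.
ASSUMED_TEMPLATE_COLUMNS = ("template_r0", "template_r1")


def inspect_test_set(tsv_text, required=ASSUMED_TEMPLATE_COLUMNS):
    """Return (missing required columns, number of rows per `split` value)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    header = reader.fieldnames or []
    missing = [name for name in required if name not in header]
    # Rows without a `split` column are counted under the empty string.
    per_split = Counter(row.get("split", "") for row in reader)
    return missing, dict(per_split)


# Made-up sample with placeholder reactions and templates.
sample = (
    "rxn\ttemplate_r0\tsplit\n"
    "A>>B\tt0\trandom\n"
    "C>>D\tt0\thigh_mw\n"
    "E>>F\tt0\trandom\n"
)

missing, per_split = inspect_test_set(sample)
print(missing)    # columns still needed before running the chemistry metrics
print(per_split)  # reactions per split label
```

For a file on disk, read its contents with `open(path, newline="").read()` and pass the text to `inspect_test_set`.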
from choriso.metrics.selectivity import flag_regio_problem, flag_stereo_problem
regio_rxn = 'BrCc1ccccc1.C1CCOC1.C=CC(O)CO.[H-].[Na+]>>C=CC(O)COCc1ccccc1'
stereo_rxn = 'C=C(NC(C)=O)c1ccc(OC)cc1.ClCCl.[H][H].[Rh+]>>COc1ccc([C@@H](C)NC(C)=O)cc1'
print(flag_regio_problem(regio_rxn))
print(flag_stereo_problem(stereo_rxn))
The output will display the flagging labels:
True
True
Contributions, whether filing an issue, making a pull request, or forking, are appreciated. See CONTRIBUTING.md for more information on getting involved.
The code in this package is licensed under the MIT License.
This package was created with @audreyfeldroy's cookiecutter package using @cthoyt's cookiecutter-snekpack template.
See developer instructions
This final section of the README is for those who want to get involved by making a code contribution.
To install in development mode, use the following:
$ git clone https://github.com/schwallergroup/choriso.git
$ cd choriso
$ pip install -e .
After cloning the repository and installing tox with pip install tox, the unit tests in the tests/ folder can be
run reproducibly with:
$ tox
Additionally, these tests are automatically re-run with each commit in a GitHub Action.
The documentation can be built locally using the following:
$ git clone https://github.com/schwallergroup/choriso.git
$ cd choriso
$ tox -e docs
$ open docs/build/html/index.html
The documentation build automatically installs the package as well as the docs
extra specified in the setup.cfg. Sphinx plugins
like texext can be added there. Additionally, they need to be added to the
extensions list in docs/source/conf.py.
After installing the package in development mode and installing
tox with pip install tox, the commands for making a new release are contained within the finish environment
in tox.ini. Run the following from the shell:
$ tox -e finish
This script does the following:
- Uses Bump2Version to switch the version number in the setup.cfg, src/choriso/version.py, and docs/source/conf.py to not have the -dev suffix
- Packages the code in both a tar archive and a wheel using build
- Uploads to PyPI using twine. Be sure to have a .pypirc file configured to avoid the need for manual input at this step
- Pushes to GitHub. You'll need to make a release going with the commit where the version was bumped.
- Bumps the version to the next patch. If you made big changes and want to bump the version by minor, you can use tox -e bumpversion -- minor after.