MultiModal explores how to generate molecular structures in SMILES format from spectroscopic information such as NMR, IR, and MS. The architecture combines convolutional encoders for the spectra with a Transformer decoder.
The dataset and several design choices are based on the paper at https://arxiv.org/pdf/2407.17492.
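A minimal PyTorch sketch of this encoder–decoder shape may help fix the idea. All module names, layer counts, and sizes below are illustrative assumptions, not the repository's actual implementation:

```python
import torch
import torch.nn as nn

class SpectrumEncoder(nn.Module):
    """Encode a 1-D spectrum (e.g. an IR or NMR trace) into a sequence of embeddings."""
    def __init__(self, d_model=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, d_model // 2, kernel_size=9, stride=4, padding=4),
            nn.ReLU(),
            nn.Conv1d(d_model // 2, d_model, kernel_size=9, stride=4, padding=4),
            nn.ReLU(),
        )

    def forward(self, x):              # x: (batch, points)
        h = self.conv(x.unsqueeze(1))  # (batch, d_model, points // 16)
        return h.transpose(1, 2)       # (batch, seq, d_model) for the decoder

class Spectrum2Smiles(nn.Module):
    """Conv encoder over the spectrum + Transformer decoder over SMILES tokens."""
    def __init__(self, vocab_size, d_model=128):
        super().__init__()
        self.encoder = SpectrumEncoder(d_model)
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, spectrum, tokens):
        memory = self.encoder(spectrum)
        tgt = self.embed(tokens)
        # Causal mask so each position only attends to earlier SMILES tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        out = self.decoder(tgt, memory, tgt_mask=mask)
        return self.head(out)          # (batch, seq, vocab) logits
```

In practice each modality (¹H NMR, ¹³C NMR, IR, MS) would get its own encoder and the resulting embeddings would be concatenated into one memory sequence; the single-encoder version above only shows the overall shape.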
- `models/` – encoder/decoder modules and the SMILES tokenizer
- `training/` – training scripts and vocabulary file
- `inference/` – inference utilities and decoding strategies
- `data/` – scripts for downloading and tokenizing the dataset
- `configs/` – YAML configuration files for experiments
- `utils/` – logging helpers and custom optimizers
- Gradio demos (`gradio_app*.py`) for interactive testing
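The actual SMILES tokenizer lives in `models/`; as a rough illustration of what atom-level SMILES tokenization does, here is a sketch using the widely used regex from the chemistry-NLP literature (this is an assumption about the approach, not the repository's code):

```python
import re

# Atom-level SMILES regex: bracket atoms, two-letter halogens, ring-bond
# numbers, and single-character tokens.
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p"
    r"|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into chemically meaningful tokens."""
    tokens = SMILES_PATTERN.findall(smiles)
    # Sanity check: tokenization must be lossless.
    assert "".join(tokens) == smiles, "unrecognized characters in SMILES"
    return tokens
```

For example, aspirin `CC(=O)Oc1ccccc1C(=O)O` splits into 21 tokens, keeping `Cl`/`Br` and bracket atoms like `[nH]` intact as single tokens.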
- Install dependencies

  ```bash
  pip install -r requirements.txt
  ```
- Download raw data

  ```bash
  python3 download_data.py
  # optionally, for parallel download
  python3 download_data_parallel.py
  ```

- Tokenize the data (optional if you download the prepared version)

  ```bash
  pip install rxn-chem-utils
  python3 create_tokenized_dataset_faster.py \
      --analytical_data data_extraction/multimodal_spectroscopic_dataset \
      --out_path tokenized_baseline \
      --h_nmr --c_nmr --ir --formula
  ```
- Alternatively, fetch pre-tokenized data and build the vocabulary

  ```bash
  python3 data/download_tokenized_dataset.py
  python3 data/build_vocab.py
  ```

Run the autoregressive training script with a configuration file:

```bash
torchrun --nproc_per_node=1 train_autoregressive.py --config configs/test_config.yaml
```

Configuration files in `configs/` control model size, dataset paths, and optimization parameters.
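Such a configuration file might look like the sketch below. Every key name and value here is a hypothetical illustration; consult `configs/test_config.yaml` for the schema the training script actually expects:

```yaml
# Hypothetical example; see configs/test_config.yaml for the real keys.
model:
  d_model: 256
  n_layers: 6
  n_heads: 8
data:
  train_path: tokenized_baseline/train
  vocab_path: training/vocab.json
optim:
  lr: 1.0e-4
  batch_size: 64
  max_steps: 100000
```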
After training, you can test different decoding strategies using:

```bash
python test_inference.py --config your_config_path --checkpoint your_checkpoint_path
```

`inference/INFERENCE_README.md` describes the available strategies: greedy decoding, beam search, sampling, nucleus sampling, and Entropix.
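Of these, nucleus (top-p) sampling is perhaps the least self-explanatory. A dependency-free sketch of the idea (not the repository's implementation):

```python
import math
import random

def nucleus_sample(logits, p=0.9, rng=random):
    """Top-p (nucleus) sampling: draw a token id from the smallest set of
    tokens whose cumulative probability mass exceeds p."""
    # Numerically stable softmax over the raw logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    # Rank tokens by probability, highest first.
    ranked = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)
    # Keep the smallest prefix ("nucleus") whose mass reaches p.
    nucleus, cum = [], 0.0
    for prob, idx in ranked:
        nucleus.append((prob, idx))
        cum += prob
        if cum >= p:
            break
    # Renormalize within the nucleus and sample.
    r = rng.random() * cum
    for prob, idx in nucleus:
        r -= prob
        if r <= 0:
            return idx
    return nucleus[-1][1]
```

Unlike top-k, the nucleus adapts its size: a confident model truncates to a few tokens, while a flat distribution keeps many candidates.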
The approach and background are discussed in detail in the preprint linked above.