MultiModal explores how to generate molecular structures in SMILES format from spectroscopic information such as NMR, IR, and MS. The architecture combines convolutional encoders for the spectra with a Transformer decoder.
The dataset and several design choices are based on the paper at https://arxiv.org/pdf/2407.17492.
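A minimal PyTorch sketch of this encoder–decoder shape may help fix the idea. All module names, layer counts, and sizes below are illustrative assumptions, not the repository's actual implementation:

```python
import torch
import torch.nn as nn

class SpectrumEncoder(nn.Module):
    """Encode a 1-D spectrum (e.g. an IR or NMR trace) into a sequence of embeddings."""
    def __init__(self, d_model=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, d_model // 2, kernel_size=9, stride=4, padding=4),
            nn.ReLU(),
            nn.Conv1d(d_model // 2, d_model, kernel_size=9, stride=4, padding=4),
            nn.ReLU(),
        )

    def forward(self, x):              # x: (batch, points)
        h = self.conv(x.unsqueeze(1))  # (batch, d_model, points // 16)
        return h.transpose(1, 2)       # (batch, seq, d_model) for the decoder

class Spectrum2Smiles(nn.Module):
    """Conv encoder over the spectrum + Transformer decoder over SMILES tokens."""
    def __init__(self, vocab_size, d_model=128):
        super().__init__()
        self.encoder = SpectrumEncoder(d_model)
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, spectrum, tokens):
        memory = self.encoder(spectrum)
        tgt = self.embed(tokens)
        # Causal mask so each position only attends to earlier SMILES tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        out = self.decoder(tgt, memory, tgt_mask=mask)
        return self.head(out)          # (batch, seq, vocab) logits
```

In practice each modality (¹H NMR, ¹³C NMR, IR, MS) would get its own encoder and the resulting embeddings would be concatenated into one memory sequence; the single-encoder version above only shows the overall shape.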
- `models/` – encoder/decoder modules and the SMILES tokenizer
- `training/` – training scripts and vocabulary file
- `inference/` – inference utilities and decoding strategies
- `data/` – scripts for downloading and tokenizing the dataset
- `configs/` – YAML configuration files for experiments
- `utils/` – logging helpers and custom optimizers
- Gradio demos (`gradio_app*.py`) for interactive testing
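The actual SMILES tokenizer lives in `models/`; as a rough illustration of what atom-level SMILES tokenization does, here is a sketch using the widely used regex from the chemistry-NLP literature (this is an assumption about the approach, not the repository's code):

```python
import re

# Atom-level SMILES regex: bracket atoms, two-letter halogens, ring-bond
# numbers, and single-character tokens.
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p"
    r"|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into chemically meaningful tokens."""
    tokens = SMILES_PATTERN.findall(smiles)
    # Sanity check: tokenization must be lossless.
    assert "".join(tokens) == smiles, "unrecognized characters in SMILES"
    return tokens
```

For example, aspirin `CC(=O)Oc1ccccc1C(=O)O` splits into 21 tokens, keeping `Cl`/`Br` and bracket atoms like `[nH]` intact as single tokens.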
- Install dependencies

  ```bash
  pip install -r requirements.txt
  ```
- Download raw data

  ```bash
  python3 download_data.py
  # optionally, for parallel download
  python3 download_data_parallel.py
  ```

- Tokenize the data (optional if you download the prepared version)

  ```bash
  pip install rxn-chem-utils
  python3 create_tokenized_dataset_faster.py \
      --analytical_data data_extraction/multimodal_spectroscopic_dataset \
      --out_path tokenized_baseline \
      --h_nmr --c_nmr --ir --formula
  ```
- Alternatively, fetch pre-tokenized data and build the vocabulary

  ```bash
  python3 data/download_tokenized_dataset.py
  python3 data/build_vocab.py
  ```

Run the autoregressive training script with a configuration file:

```bash
torchrun --nproc_per_node=1 train_autoregressive.py --config configs/test_config.yaml
```

Configuration files in `configs/` control model size, dataset paths, and optimization parameters.
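Such a configuration file might look like the sketch below. Every key name and value here is a hypothetical illustration; consult `configs/test_config.yaml` for the schema the training script actually expects:

```yaml
# Hypothetical example; see configs/test_config.yaml for the real keys.
model:
  d_model: 256
  n_layers: 6
  n_heads: 8
data:
  train_path: tokenized_baseline/train
  vocab_path: training/vocab.json
optim:
  lr: 1.0e-4
  batch_size: 64
  max_steps: 100000
```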
After training, you can test different decoding strategies using:

```bash
python test_inference.py --config your_config_path --checkpoint your_checkpoint_path
```

`inference/INFERENCE_README.md` describes the available strategies: greedy decoding, beam search, sampling, nucleus sampling, and Entropix.
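Of these, nucleus (top-p) sampling is perhaps the least self-explanatory. A dependency-free sketch of the idea (not the repository's implementation):

```python
import math
import random

def nucleus_sample(logits, p=0.9, rng=random):
    """Top-p (nucleus) sampling: draw a token id from the smallest set of
    tokens whose cumulative probability mass exceeds p."""
    # Numerically stable softmax over the raw logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    # Rank tokens by probability, highest first.
    ranked = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)
    # Keep the smallest prefix ("nucleus") whose mass reaches p.
    nucleus, cum = [], 0.0
    for prob, idx in ranked:
        nucleus.append((prob, idx))
        cum += prob
        if cum >= p:
            break
    # Renormalize within the nucleus and sample.
    r = rng.random() * cum
    for prob, idx in nucleus:
        r -= prob
        if r <= 0:
            return idx
    return nucleus[-1][1]
```

Unlike top-k, the nucleus adapts its size: a confident model truncates to a few tokens, while a flat distribution keeps many candidates.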
The approach and background are discussed in detail in the preprint linked above.