CLIP-OCSR: Bridging the Markush Gap in Optical Chemical Structure Recognition

Official implementation of the paper: "Bridging the Markush gap in optical chemical structure recognition via a CLIP-derived visual backbone and synthetic data generation".

CLIP-OCSR is a specialized encoder-decoder model for Optical Chemical Structure Recognition (OCSR). It focuses on the high-fidelity translation of complex chemical images into SMILES strings, with a particular emphasis on Markush structures ubiquitous in pharmaceutical patents.


🎮 Live Demo

A web demo is hosted on Hugging Face Spaces. Upload a chemical structure image, including one with complex Markush variations, and view the model's prediction without any local setup.

👉 Try CLIP-OCSR on Hugging Face Spaces


✨ Model Highlights

  • CLIP-Derived Visual Backbone: Utilizes a CLIP-RN50 encoder pretrained on chemical image-caption pairs for robust feature extraction.
  • Markush Structure Mastery: Specifically optimized to handle complex structural variations (substituent, frequency, and position variations).
  • Deterministic Post-processing: Employs an enumeration strategy to derive specific isomer sets from symbolic Pseudo-SMILES predictions.
  • MarkushGen Powered: Developed using the MarkushGen toolkit to overcome the scarcity of annotated Markush images.
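
The enumeration idea behind the deterministic post-processing can be illustrated with a minimal sketch. The Pseudo-SMILES syntax is not described in this README, so the input representation here is purely hypothetical: a scaffold given as a list of SMILES atom tokens, a substituent, and the token indices where it may attach. The function name `enumerate_position_isomers` and the `canonicalize` parameter are illustrative assumptions, not part of the repository.

```python
from typing import Callable, FrozenSet, Sequence


def enumerate_position_isomers(
    scaffold_tokens: Sequence[str],
    substituent: str,
    candidate_positions: Sequence[int],
    canonicalize: Callable[[str], str] = lambda s: s,
) -> FrozenSet[str]:
    """Attach `substituent` as a branch after each candidate atom token in turn.

    Hypothetical helper: the real Pseudo-SMILES format may look different.
    """
    isomers = set()
    for pos in candidate_positions:
        tokens = list(scaffold_tokens)
        # Write the substituent as a SMILES branch on the chosen atom.
        tokens[pos] = f"{tokens[pos]}({substituent})"
        # Canonicalizing lets symmetric duplicates collapse into one entry.
        isomers.add(canonicalize("".join(tokens)))
    return frozenset(isomers)


# Hypothetical input: toluene with a chlorine that may sit on any ring CH.
toluene = ["C", "c1", "c", "c", "c", "c", "c1"]
isomers = enumerate_position_isomers(toluene, "Cl", candidate_positions=range(2, 7))
```

With the default identity `canonicalize`, every attachment position yields a distinct string. Passing an RDKit-based canonicalizer should collapse the symmetric duplicates, leaving the ortho, meta, and para isomers. A predicted position variation could then be judged correct when its isomer set equals the set derived from the ground truth.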

Note: This repository currently provides the Evaluation Suite used to assess model accuracy on SMILES and Pseudo-SMILES (Markush) predictions. Model weights and training datasets are not included in the current release.


📊 Evaluation Suite

This toolkit provides the core logic for benchmarking OCSR models, especially those capable of generating Markush notations.

Key Capabilities:

  • Pseudo-SMILES Validation: Logic to verify if a predicted Markush string is chemically consistent with the ground truth.
  • Canonicalization: Leveraging RDKit for robust molecular identity comparison.
  • Accuracy Metrics: Scripts to compute exact match accuracy for both standard SMILES and Markush-specific Pseudo-SMILES.
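
As a rough illustration of how canonicalization and exact match accuracy fit together, here is a minimal sketch. It is not the code in `benchmark/`; the function names `rdkit_canonical` and `exact_match_accuracy` are assumptions made for this example. Only the RDKit calls `Chem.MolFromSmiles` and `Chem.MolToSmiles` are real library functions.

```python
from typing import Callable, Optional, Sequence


def rdkit_canonical(smiles: str) -> Optional[str]:
    """Return RDKit's canonical SMILES, or None if the string does not parse."""
    from rdkit import Chem  # lazy import: the metric itself has no hard dependency

    mol = Chem.MolFromSmiles(smiles)
    return None if mol is None else Chem.MolToSmiles(mol)


def exact_match_accuracy(
    predictions: Sequence[str],
    references: Sequence[str],
    canonicalize: Callable[[str], Optional[str]] = rdkit_canonical,
) -> float:
    """Fraction of predictions whose canonical form equals the reference's."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = 0
    for pred, ref in zip(predictions, references):
        pred_can = canonicalize(pred)
        # An unparsable prediction (None) always counts as a miss.
        if pred_can is not None and pred_can == canonicalize(ref):
            hits += 1
    return hits / len(predictions)
```

With RDKit installed, `exact_match_accuracy(["OCC", "c1ccccc1"], ["CCO", "C1=CC=CC=C1"])` should count both pairs as matches, because each pair describes the same molecule written in a different atom order or aromaticity notation.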

🚀 Getting Started

1. Installation

# Clone the repository
git clone https://github.com/SuninKingdom/CLIP-OCSR.git
cd CLIP-OCSR

# Create and activate the environment
conda create -n clip-ocsr python=3.8.18
conda activate clip-ocsr

# Install dependencies
pip install rdkit==2022.09.1
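
Before running the evaluation scripts, it can help to confirm that the environment resolved as intended. The snippet below is an optional sketch, not part of the repository; `check_environment` is a hypothetical helper, and the minimum version `(3, 8)` is an assumption based on the Python version pinned above.

```python
import importlib.util
import sys


def check_environment(required_python=(3, 8), module="rdkit"):
    """Report whether the interpreter and the named module look as expected."""
    return {
        # Compare only major and minor version against the assumed minimum.
        "python_ok": sys.version_info[:2] >= required_python,
        "python_version": ".".join(map(str, sys.version_info[:3])),
        # find_spec returns None when a top-level module cannot be located.
        "module_found": importlib.util.find_spec(module) is not None,
    }
```

Calling `print(check_environment())` inside the activated `clip-ocsr` environment should report `module_found` as `True` once RDKit is installed.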

2. Evaluation Examples

cd benchmark

# 1. Standard OCSR Evaluation (non-Markush structures)
python eval.py

# 2. Markush Evaluation (Substituent & Frequency variations)
python eval_subfrevar.py

# 3. Markush Evaluation (Position variations)
python eval_posvar.py
