chromatography-modeling

Data and supplementary information (modeling files etc) for the chromatography predictive modeling paper (https://doi.org/10.26434/chemrxiv-2025-s2qhv).

Three Jupyter notebooks are provided with the underlying data:

  1. Initial chromatograms and extraction of data from them.

  2. Modeling benchmark with cross-validation on the whole set (64 molecules) and external set predictions.

  3. Cross-validation and external prediction on an in-set separation (50 molecules as training and 14 as a test set, taken from the original set).

Chython and DOPtools are required for most visualizations. HPLCAnalysis is required for the chromatogram analysis.

Folder structure

Analysis_modeling folder contains all the initial chromatogram data files and the scripts to analyze and process the data, reproducing the results presented in the manuscript.

chromatogram_data folder contains the raw chromatogram data in text format. Every molecule has its dedicated file, produced by the SFC chromatography equipment after the HTE synthesis of the compound. Jupyter notebook 1 - Data extraction from chromatograms contains the code that converts the raw files into retention time values and, eventually, the Excel file with the full dataset.
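
The actual processing is done by HPLCAnalysis in notebook 1; as a rough illustration of the underlying step, the sketch below extracts the retention time of the tallest peak from a hypothetical two-column (time, signal) text trace (the real SFC export format may differ):

```python
import io

import numpy as np

def peak_retention_time(text: str) -> float:
    """Return the retention time of the tallest peak in a
    two-column (time, signal) chromatogram trace."""
    data = np.loadtxt(io.StringIO(text))
    times, signal = data[:, 0], data[:, 1]
    return float(times[np.argmax(signal)])

# Synthetic trace: a narrow Gaussian peak centered at t = 2.5 min.
t = np.linspace(0, 10, 1001)
s = np.exp(-((t - 2.5) ** 2) / 0.01)
trace = "\n".join(f"{ti:.3f}\t{si:.6f}" for ti, si in zip(t, s))
print(peak_retention_time(trace))  # → 2.5
```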

CV_results folder contains the results of the cross-validation on the full dataset for both raw retention time (RT) and retention factor (LnRF). For both properties, the descriptor files are located in descriptors folders, each containing subfolders for each descriptor type. The subfolders hold two types of files: .svm, the descriptor matrix in a sparse format, and .pkl, the pickled Python object of the descriptor calculator as defined by DOPtools.
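
The .svm files use the standard svmlight sparse format, so they can be inspected outside DOPtools as well. A minimal sketch, with a hypothetical file name and toy data standing in for a real descriptor matrix:

```python
import pickle  # needed for the .pkl calculator objects (see comment below)

import numpy as np
from sklearn.datasets import dump_svmlight_file, load_svmlight_file

# Toy descriptor matrix and property column standing in for real data.
X = np.array([[0.0, 1.5, 0.0], [2.0, 0.0, 3.0]])
y = np.array([4.2, 5.1])  # e.g. experimental retention times
dump_svmlight_file(X, y, "descriptors.svm", zero_based=True)

# Read it back: X_sp is a SciPy sparse matrix, y_back the property values.
X_sp, y_back = load_svmlight_file("descriptors.svm", zero_based=True)
print(X_sp.shape, y_back)

# The matching .pkl calculator would be restored with pickle, e.g.:
# with open("descriptors.pkl", "rb") as f:
#     calculator = pickle.load(f)
```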

The cross-validation itself was done on local servers using the DOPtools Command Line Interface (CLI), and only its results are included here. The commands follow the standard preparer and optimizer CLI invocations to launch the run:

foo@bar:~$ launch_preparer -i Training_set_64.csv -o $output_folder --property_col Exp_RT """or Exp_LnRF""" --structure_column SMILES """descriptor parameters depending on the descriptor type"""

foo@bar:~$ launch_optimizer -d $descriptor_folder -o $output_folder --ntrials 500 --cv_splits 5 --cv_repeats 5 -m SVR """or RFR"""

A full description of the commands is available in the original DOPtools paper (https://doi.org/10.1039/D4DD00399C) or in its repository.
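
The 5×5 repeated cross-validation scheme set by --cv_splits 5 --cv_repeats 5 can be sketched with scikit-learn on synthetic data (an illustration of the scheme only, not of the DOPtools internals; the shapes mirror the 64-molecule set):

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.svm import SVR

# Synthetic stand-in for the 64-molecule descriptor matrix and property.
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 10))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=64)

# 5-fold CV repeated 5 times -> 25 fold scores per hyperparameter setup.
cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=42)
scores = cross_val_score(SVR(), X, y, cv=cv, scoring="r2")
print(len(scores))  # → 25
```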

The descriptors were generated from the Training_set_64.csv file. The CLI produced optimization result folders for each descriptor type and method. The folders contain the full results of the optimization (all setups with scores) in the trials.all file, the 50 best trials sorted by score in trials.best, and a folder for the best trial (predictions, statistics, parameters). To save space, folders for the other parameter setups are not included. The code to reproduce the plots and external set predictions from these optimizations is provided in the Jupyter notebook 2 - Cross-validation and external test set.
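
The exact column layout of trials.all depends on the DOPtools version, but conceptually trials.best is the score-sorted head of trials.all. A hypothetical pandas sketch of that relationship (toy columns and values, not the real file schema):

```python
import pandas as pd

# Hypothetical trials table: one row per hyperparameter setup with its score.
trials = pd.DataFrame({
    "trial": range(6),
    "score": [0.61, 0.78, 0.55, 0.83, 0.70, 0.66],
})

# trials.best keeps the top 50 setups; here we take the top 3 of the toy table.
best = trials.sort_values("score", ascending=False).head(3)
print(best["trial"].tolist())  # → [3, 1, 4]
```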

The inset_results folder follows the same format; models for two descriptor types (ChyLine and CircuS) are given. Training and test sets were generated by randomly splitting the initial training set file, and their compositions are provided in the corresponding files in this folder. Optimization was performed, again with the DOPtools CLI, on the training set of 50 compounds for these two descriptor types only, as they showed the best performance in the previous experiments. The code for reproducing the results of the CV and the external validation is given in the Jupyter notebook 3 - In-set performance analysis.
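
The random 50/14 separation can be reproduced in spirit with scikit-learn (hypothetical IDs; the actual split is fixed by the composition files in this folder, so use those rather than re-splitting):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical molecule IDs standing in for the 64 compounds of the full set.
molecules = np.arange(64)

# Random split into 50 training and 14 test compounds.
train, test = train_test_split(molecules, test_size=14, random_state=0)
print(len(train), len(test))  # → 50 14
```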
