ComPlat/PubChem_datasourcing

Solubility Data Sourcing from PubChem (https://pubchem.ncbi.nlm.nih.gov/)

Overview

This repository contains the code, data, and models used to source a solubility dataset. The study uses the open PubChem database to source solubility data, compares the sourced data with literature and benchmark datasets, and compares model performance across different combinations of the datasets.

Workflow covering data gathering, preprocessing, model training, and evaluation of the sourced data

Workflow

Starting from PubChem's database of roughly 50 million molecules, 53,789 solubility records were curated and split into datasets X (29,652 records) and Y (18,548 records). Both datasets were validated against literature data (X: 357 matches; Y: 498 matches). Because dataset X showed the lower error, its predictive capacity was tested on the Huuskonen dataset, giving an MAE of 0.91 and an R² of 0.63. The workflow integrates data curation, validation, and prediction for solubility analysis.
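
The validation step described above amounts to matching sourced records against literature values for the same compound and inspecting how far apart they are. A minimal sketch in plain Python, assuming each record is keyed by a canonical identifier (a SMILES string here) and carries one solubility value. The function names, the choice of key, and the toy values are illustrative assumptions, not the repository's code:

```python
from typing import Dict, List, Tuple


def match_to_reference(
    sourced: Dict[str, float], reference: Dict[str, float]
) -> List[Tuple[str, float, float]]:
    """Return (key, sourced_value, reference_value) for keys present in both."""
    shared = sorted(set(sourced) & set(reference))
    return [(key, sourced[key], reference[key]) for key in shared]


def absolute_differences(
    pairs: List[Tuple[str, float, float]]
) -> List[Tuple[str, float]]:
    """Return (key, |sourced - reference|), largest discrepancy first."""
    diffs = [(key, abs(s - r)) for key, s, r in pairs]
    return sorted(diffs, key=lambda item: item[1], reverse=True)


# Toy data (hypothetical values, keyed by SMILES).
pubchem_x = {"CCO": 1.10, "c1ccccc1": -1.64, "CC(=O)O": 1.22}
literature = {"c1ccccc1": -1.58, "CC(=O)O": 1.20, "CCN": 1.30}

matches = match_to_reference(pubchem_x, literature)
print(len(matches), "matched records")
for key, diff in absolute_differences(matches):
    print(key, round(diff, 3))
```

Sorting by discrepancy makes it easy to inspect the records that disagree most with the literature before deciding which dataset to keep.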

Data files

  1. Clean_0k_50000k.csv (53,789 samples)
  2. more_clean_0k_50000k_29652.csv (29,652 samples)
  3. Pubchem_2nd filter_18548.csv (18,548 samples)
  4. Unique_train4_new24_new.csv (17,937 samples)
  5. Unique_test_new24.csv (1,282 samples)
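
After downloading the data, the sample counts listed above can be checked with a short script. This is a sketch that assumes the CSV files sit in the current directory, are UTF-8 encoded, and each have a single header row. None of this is confirmed by the repository, and the helper names are hypothetical:

```python
import csv
from pathlib import Path

# File names and sample counts as listed in this README.
EXPECTED_ROWS = {
    "Clean_0k_50000k.csv": 53789,
    "more_clean_0k_50000k_29652.csv": 29652,
    "Pubchem_2nd filter_18548.csv": 18548,
    "Unique_train4_new24_new.csv": 17937,
    "Unique_test_new24.csv": 1282,
}


def count_data_rows(path):
    """Count CSV records, excluding the header row."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header (assumed to be present)
        return sum(1 for _ in reader)


def check_files(data_dir="."):
    """Map each file name to (found_rows, expected_rows).

    found_rows is None when the file does not exist in data_dir.
    """
    report = {}
    for name, expected in EXPECTED_ROWS.items():
        path = Path(data_dir) / name
        found = count_data_rows(path) if path.exists() else None
        report[name] = (found, expected)
    return report


for name, (found, expected) in check_files().items():
    if found is None:
        status = "missing"
    else:
        status = "ok" if found == expected else "mismatch"
    print(f"{name}: found={found} expected={expected} [{status}]")
```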

Evaluation Notebooks

  • notebooks/: Jupyter notebooks containing the analysis, preprocessing, and model training code.
    • Pubchem_with_salt.ipynb
    • Pubchem_without salt.ipynb
    • Pubchem_litrerature_Reaxys(With_salt).ipynb
    • Pubchem_litrerature_Reaxys(Without_salt).ipynb
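
The notebooks come in "with salt" and "without salt" variants. How the notebooks separate the two is not described here, so the following is only a string-level sketch of one common heuristic: in SMILES, a `.` separates disconnected components, and salts usually appear as multi-component strings. Not every multi-component SMILES is a salt (hydrates and mixtures look the same), and the function names are hypothetical:

```python
from typing import List


def split_components(smiles: str) -> List[str]:
    """Split a SMILES string into its dot-separated components."""
    return [part for part in smiles.split(".") if part]


def has_multiple_components(smiles: str) -> bool:
    """True when the SMILES describes more than one disconnected component."""
    return len(split_components(smiles)) > 1


# Toy records (hypothetical): ethanol, sodium acetate, phenol.
records = ["CCO", "CC(=O)[O-].[Na+]", "c1ccccc1O"]

with_salt = list(records)
without_salt = [s for s in records if not has_multiple_components(s)]

print("with salt:", with_salt)
print("without salt:", without_salt)
```

A cheminformatics toolkit such as RDKit would be a more robust choice for real data, since it works on the parsed molecule rather than on the raw string.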

  • scripts/: Functions to create the descriptors used in the study.
    • utilities.py
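
The contents of utilities.py are not reproduced in this README. As a rough illustration of what descriptor generation with RDKit can look like, the sketch below computes a handful of standard RDKit descriptors for one SMILES string. The descriptor selection and the function name `compute_descriptors` are assumptions for illustration and may differ from what the study uses:

```python
# Names of functions in rdkit.Chem.Descriptors (an illustrative selection).
DESCRIPTOR_NAMES = ("MolWt", "MolLogP", "TPSA", "NumHDonors", "NumHAcceptors")


def compute_descriptors(smiles):
    """Return {descriptor_name: value} for one SMILES, or None if parsing fails."""
    # Imported here so this module still loads where RDKit is not installed.
    from rdkit import Chem
    from rdkit.Chem import Descriptors

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return {name: getattr(Descriptors, name)(mol) for name in DESCRIPTOR_NAMES}
```

With the environment from the installation steps active, `compute_descriptors("CCO")` returns a dictionary with one entry per name in `DESCRIPTOR_NAMES`.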

Installation steps

1 Clone the repository
Clone this repository (ComPlat/PubChem_datasourcing) with git and change into its directory.

2 Create a Conda environment with Python 3.8

  • conda create --name env python=3.8.11
  • conda activate env
  • conda install -c conda-forge rdkit=2023.9.5
  • pip install -r requirements.txt
  • pip install ipykernel
  • python -m ipykernel install --user --name env --display-name "Python 3.8 (env)"

3 Select the kernel Python 3.8 (env)

  • Run the Jupyter notebook cell by cell to reproduce the results.
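
Before running the notebooks, it can help to confirm that the selected kernel is the environment created above. A small sketch to paste into the first notebook cell. The module list is an assumption based on the install commands in step 2, and `check_environment` is a hypothetical helper:

```python
import importlib.util
import sys


def check_environment(required_modules=("rdkit", "ipykernel")):
    """Report the Python version and whether each module can be found."""
    report = {"python": "{}.{}.{}".format(*sys.version_info[:3])}
    for name in required_modules:
        # find_spec returns None when a top-level module is not installed.
        report[name] = importlib.util.find_spec(name) is not None
    return report


for key, value in check_environment().items():
    print(f"{key}: {value}")
```

If the reported Python version is not 3.8.x or a module shows `False`, the notebook is probably attached to a different kernel.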

License

This project is licensed under the MIT License.

Contact

For any questions or further information, you can contact:

Conclusion

This work highlights the predictive value of the completed dataset and the benefit of gathering data from sources as diverse as PubChem, Reaxys, and the literature. Further refinement and qualification of the dataset can extend its usefulness for predictive purposes beyond this exercise. The dataset performed adequately under both conditions, with only slight variation in accuracy: with salt, MAE 0.91 and R² 0.63; without salt, MAE 0.92 and R² 0.63. Comparisons with Reaxys and literature data showed a high correlation (R² 0.91), supporting the reliability of the dataset. PubChem data also showed strong potential for predictive modeling, with performance metrics (MAE 0.91, R² 0.63) in close agreement with established benchmarks.
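
For readers unfamiliar with the two metrics quoted above, the sketch below computes the mean absolute error (MAE) and the coefficient of determination (R²) in plain Python on toy values. It assumes the standard definitions. The notebooks may compute these with a library instead, and the R² reported for the literature comparison may be a squared correlation coefficient rather than this quantity:

```python
def mean_absolute_error(y_true, y_pred):
    """MAE: average of |true - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot (coefficient of determination)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values (hypothetical measured and predicted solubilities).
measured = [-2.0, -1.0, 0.0, 1.0]
predicted = [-1.5, -1.0, 0.5, 0.5]

print("MAE:", mean_absolute_error(measured, predicted))
print("R2:", r2_score(measured, predicted))
```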

For any questions or further information, please feel free to open an issue or contact me directly by email at mushtaq.ali@kit.edu.
