Solubility Data Sourcing from PubChem (https://pubchem.ncbi.nlm.nih.gov/)
This repository contains the code, data, and models used to source a solubility dataset from PubChem. The study uses the open PubChem database as a source of solubility data, compares the curated data with literature and benchmark datasets, and compares model performance across different combinations of the datasets.
From PubChem's database of 50 million molecules, 53,789 solubility records were curated and split into datasets X (29,652 records) and Y (18,548 records). Both were validated against literature data (X: 357 matches, Y: 498 matches). Because the X data showed the lower error, its predictive capacity was tested on the Huuskonen dataset, achieving an MAE of 0.91 and an R² of 0.63. Together these steps form a workflow that integrates data curation, validation, and prediction for solubility analysis.
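For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how MAE and R² are defined. It is not the code used in the notebooks; the function names and the sample values are made up for illustration, and the values are only assumed to be log-solubility numbers.

```python
from typing import Sequence


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute difference between measured and predicted values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    if ss_tot == 0:
        raise ValueError("R² is undefined when all measured values are identical")
    return 1.0 - ss_res / ss_tot


# Made-up values for illustration only; they are not taken from the datasets.
measured = [-1.0, -2.0, -3.0, -4.0]
predicted = [-1.5, -2.0, -2.5, -4.0]

print("MAE:", mean_absolute_error(measured, predicted))
print("R2: ", r2_score(measured, predicted))
```

A lower MAE means predictions sit closer to the measured values on average, while an R² closer to 1 means the model explains more of the variance in the measurements.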
Data files
- Clean_0k_50000k.csv (53,789 samples)
- more_clean_0k_50000k_29652.csv (29,652 samples)
- Pubchem_2nd filter_18548.csv (18,548 samples)
- Unique_train4_new24_new.csv (17,937 samples)
- Unique_test_new24.csv (1,282 samples)
- notebooks/: Jupyter notebooks containing the analysis, preprocessing, and model training code.
  - Pubchem_with_salt.ipynb
  - Pubchem_without salt.ipynb
  - Pubchem_litrerature_Reaxys(With_salt).ipynb
  - Pubchem_litrerature_Reaxys(Without_salt).ipynb
- scripts/: Functions to create the descriptors used in the study.
  - utilities.py
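The sample counts given for the data files above can be checked after download. The sketch below is not part of the repository; it assumes each CSV file has a single header row, is UTF-8 encoded, and sits in the directory the script is run from. The helper names `count_samples` and `check_counts` are hypothetical.

```python
import csv
from pathlib import Path

# Sample counts as stated in the data file list of this README.
EXPECTED_SAMPLES = {
    "Clean_0k_50000k.csv": 53789,
    "more_clean_0k_50000k_29652.csv": 29652,
    "Pubchem_2nd filter_18548.csv": 18548,
    "Unique_train4_new24_new.csv": 17937,
    "Unique_test_new24.csv": 1282,
}


def count_samples(path: Path) -> int:
    """Count data rows in a CSV file, assuming the first row is a header."""
    with path.open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the assumed header row
        return sum(1 for row in reader if row)  # blank lines are ignored


def check_counts(data_dir: Path) -> dict:
    """Return {filename: (found, expected)} for the listed files present in data_dir."""
    report = {}
    for name, expected in EXPECTED_SAMPLES.items():
        path = data_dir / name
        if path.is_file():
            report[name] = (count_samples(path), expected)
    return report


if __name__ == "__main__":
    # "." assumes the script is run from the directory holding the CSV files.
    for name, (found, expected) in check_counts(Path(".")).items():
        status = "ok" if found == expected else "mismatch"
        print(f"{name}: {found} rows (expected {expected}) {status}")
```

Files that are not found are skipped without an error, so an empty report usually means the script was started from the wrong directory.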
1 Clone the Repository
Run the following commands to clone the repository and change into its directory:
- git clone https://github.com/ComPlat/water-solubility-prediction.git
- cd water-solubility-prediction
2 Create a Conda Environment with Python 3.8
- conda create --name env python=3.8.11
- conda activate env
- conda install -c conda-forge rdkit=2023.9.5
- pip install -r requirements.txt
- pip install ipykernel
- python -m ipykernel install --user --name env --display-name "Python 3.8 (env)"
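Once the commands above have finished, a quick sanity check of the environment can save time before opening the notebooks. The following is a minimal sketch, not a script shipped with the repository. The helper names are hypothetical, and the package list covers only `rdkit` and `ipykernel` because the contents of requirements.txt are not shown here.

```python
import importlib.util
import sys


def python_matches(required=(3, 8), actual=None) -> bool:
    """True when the interpreter's major.minor version equals `required`."""
    actual = sys.version_info[:2] if actual is None else tuple(actual)
    return tuple(actual) == tuple(required)


def missing_packages(names):
    """Return the top-level import names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Python 3.8 active:", python_matches())
    # Only packages named in the install steps; extend with names from requirements.txt.
    missing = missing_packages(["rdkit", "ipykernel"])
    print("Missing packages:", ", ".join(missing) if missing else "none")
```

Run it with the `env` environment activated. A reported missing package suggests the corresponding install command did not complete.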
3 Select the Kernel Python 3.8 (env)
- Run the Jupyter notebook cell by cell to reproduce the results.
This project is licensed under the MIT License.
For any questions or further information, you can contact:
- Mushtaq Ali - mushtaq.ali@kit.edu
- Nicole Jung - nicole.jung@kit.edu
- Institution: https://www.ibcs.kit.edu
This work highlights the predictive value of the completed dataset and the benefit of gathering data from sources as diverse as PubChem, Reaxys, and the literature. Further refinement and qualification of the dataset could extend its usefulness for predictive purposes beyond this exercise. The dataset performed adequately under both conditions, with only slight variation in accuracy: with salt, MAE 0.91 and R² 0.63; without salt, MAE 0.92 and R² 0.63. Comparisons with Reaxys and literature data showed a high correlation (R² 0.91), supporting the reliability of the dataset. PubChem data also showed strong potential for predictive modeling, with performance metrics (MAE 0.91, R² 0.63) in close agreement with established benchmarks.
For any questions or further information, please feel free to open an issue or contact me directly by email at mushtaq.ali@kit.edu.


