Solubility Data Sourcing from PubChem (https://pubchem.ncbi.nlm.nih.gov/)

Overview

This repository contains the code, data, and models used to source a solubility dataset. The study uses the open PubChem database to source solubility data, compares it with literature and benchmark datasets, and compares model performance across different combinations of the datasets.

Workflow from data gathering and preprocessing to model training and evaluation of the sourced data

Workflow

From PubChem's database of 50 million molecules, 53,789 solubility records were curated and split into datasets X (29,652) and Y (18,548). Both were validated against literature data (X: 357 matches, Y: 498 matches). Because the X data showed the lower error, its predictive capacity was tested on the Huuskonen dataset, giving an MAE of 0.91 and an R² of 0.63. The workflow thus integrates data curation, validation, and prediction for solubility analysis.
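The validation step above (matching curated records to literature records, then scoring agreement with MAE and R²) can be sketched as follows. This is a minimal illustration, not the code used in the notebooks: the compound keys, the function names, and the toy values are all hypothetical, and the standard textbook definitions of MAE and R² are assumed.

```python
from typing import Dict, List, Tuple


def match_records(
    curated: Dict[str, float], reference: Dict[str, float]
) -> Tuple[List[float], List[float]]:
    """Pair up values for compounds that occur in both sets.

    Keys are hypothetical compound identifiers (for example a canonical
    structure string); the README does not state which key is used.
    """
    shared = sorted(set(curated) & set(reference))
    return [curated[k] for k in shared], [reference[k] for k in shared]


def mean_absolute_error(y_true: List[float], y_pred: List[float]) -> float:
    """MAE = mean of |y_true - y_pred|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true: List[float], y_pred: List[float]) -> float:
    """R^2 = 1 - SS_res / SS_tot (coefficient of determination)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy numbers for illustration only; they are not taken from the dataset.
curated = {"A": -1.0, "B": -2.5, "C": -4.0, "D": -0.5}
reference = {"A": -1.2, "B": -2.0, "C": -4.1, "E": -3.0}

y_curated, y_reference = match_records(curated, reference)
mae = mean_absolute_error(y_reference, y_curated)
r2 = r_squared(y_reference, y_curated)
print(f"matches={len(y_curated)} MAE={mae:.2f} R2={r2:.2f}")
```

Only compounds present in both sets contribute to the scores, which is why the number of matches is reported alongside the metrics.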

Data files

Clean_0k_50000k.csv (53,789 samples)
more_clean_0k_50000k_29652.csv (29,652 samples)
Pubchem_2nd filter_18548.csv (18,548 samples)
Unique_train4_new24_new.csv (17,937 samples)
Unique_test_new24.csv (1,282 samples)
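After downloading the data, the sample counts listed above can be checked with a short script. The sketch below uses only the standard library and assumes that each file has exactly one header row and that the listed counts exclude it; the helper names and the directory argument are hypothetical.

```python
import csv
import os
from typing import Dict, Optional, Tuple

# File names and counts as listed in this README.
EXPECTED_COUNTS = {
    "Clean_0k_50000k.csv": 53789,
    "more_clean_0k_50000k_29652.csv": 29652,
    "Pubchem_2nd filter_18548.csv": 18548,
    "Unique_train4_new24_new.csv": 17937,
    "Unique_test_new24.csv": 1282,
}


def count_samples(path: str) -> int:
    """Count non-empty data rows, skipping one header row (an assumption)."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the assumed header row
        return sum(1 for row in reader if row)


def check_counts(
    data_dir: str, expected: Dict[str, int]
) -> Dict[str, Tuple[int, Optional[int]]]:
    """Return {file name: (expected, found)} for every mismatch.

    A missing file is reported with found=None.
    """
    mismatches = {}
    for name, want in expected.items():
        path = os.path.join(data_dir, name)
        found = count_samples(path) if os.path.isfile(path) else None
        if found != want:
            mismatches[name] = (want, found)
    return mismatches
```

Calling `check_counts("Data", EXPECTED_COUNTS)` from the repository root would return an empty dictionary if every file is present with the listed count; the `Data` directory name is an assumption about where the files live.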

Evaluation Notebooks

notebooks/: Jupyter notebooks containing the analysis, preprocessing, and model training code.
- Pubchem_with_salt.ipynb
- Pubchem_without salt.ipynb
- Pubchem_litrerature_Reaxys(With_salt).ipynb
- Pubchem_litrerature_Reaxys(Without_salt).ipynb

scripts/: Functions to create the descriptors used in the study.
- utilities.py

Installation steps

1 Clone the Repository
Run the following command to clone the repository:

git clone https://github.com/ComPlat/water-solubility-prediction.git
cd water-solubility-prediction

2 Create a Conda Environment with Python 3.8

conda create --name env python=3.8.11
conda activate env
conda install -c conda-forge rdkit=2023.9.5
pip install -r requirements.txt
pip install ipykernel
python -m ipykernel install --user --name env --display-name "Python 3.8 (env)"

3 Select the Kernel "Python 3.8 (env)"

Run the Jupyter notebooks cell by cell to reproduce the results.
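Before running the notebooks, it may help to confirm that the selected kernel matches the environment created above. The following is a small sanity-check sketch, not part of the repository: the function name is hypothetical, and it only checks the Python minor version and whether the named packages can be found, not their exact versions.

```python
import importlib.util
import sys


def environment_report(expected_python=(3, 8), modules=("rdkit", "ipykernel")):
    """Report whether the interpreter and key packages look as expected.

    Returns a dict of booleans: "python_ok" plus one entry per module name.
    """
    report = {"python_ok": sys.version_info[:2] == tuple(expected_python)}
    for name in modules:
        # find_spec returns None when a top-level package is not installed.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, ok in environment_report().items():
        print(f"{key}: {'ok' if ok else 'missing or mismatched'}")
```

Running this in the first notebook cell shows quickly whether the wrong kernel was selected, since `rdkit` would typically be reported as missing outside the Conda environment.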

License

This project is licensed under the MIT License.

Contact

For any questions or further information, you can contact:

Mushtaq Ali - mushtaq.ali@kit.edu
Nicole Jung - nicole.jung@kit.edu
Institution: https://www.ibcs.kit.edu

Conclusion

This work highlights the predictive value of the completed dataset and the benefit of gathering data from sources as diverse as PubChem, Reaxys, and the literature. Further refinement and qualification of the dataset can expand its usefulness for predictive purposes beyond this exercise. The dataset performed adequately under both conditions, with only slight variation in accuracy: with salt, MAE 0.91 and R² 0.63; without salt, MAE 0.92 and R² 0.63. Comparisons with Reaxys and literature data showed a high correlation (R² 0.91), supporting the reliability of the dataset. PubChem data also showed strong potential for predictive modeling, with performance metrics in close agreement with established benchmarks (MAE 0.91, R² 0.63).

For any questions or further information, please feel free to open an issue or contact me directly by email at mushtaq.ali@kit.edu.
