pka prediction using ML model. by joanimato · Pull Request #51 · forlilab/molscrub

joanimato · 2026-02-26T23:18:20Z

This PR introduces a couple of major changes:

An ML model based on the sklearn ExtraTreesRegressor is used to compute the pKa of protonation sites. This model is trained using the pre-existing rules and rdkit Descriptors as input features. Because the model uses the rules which are site specific, it can therefore compute the pka for each protonatable site in the molecule. Seems to work pretty well so far, though we are limited by the training set. We can always retrain with a larger set or expand our rules.
Protonations are done combinatorially, with each possible combination of protonated and deprotonated sites considered. This increases the cost somewhat, with the total number of pKa computations being N*2^N, where N is the number of protonatable sites in the molecule. Testing with cysteine (N=3) resulted in molscrub being about 3-4 times slower. So not too bad. There might be some ways to speed this up (loading the model seems to take quite a bit of time), but that's a future problem.

For now this is optional, set with the --pka_model etr1 cli option, or the pka_model="etr1" when initializing the Scrub object. The default will still the rules-based method .

The plan for now is to update this iteratively, with the model being improved as new edge cases are discovered and fed into the model.

The code used to train the model is also included in the training folder. It's pretty or generalizable, but for now it should be ok. we can always improve this as more models are considered.

…raft)

…here

diogomart · 2026-03-11T22:59:55Z

Found an error

$ scrub.py "C1(C(C(C(C(C1OP(=O)(O)O)OP(=O)(O)O)OP(=O)(O)O)OP(=O)(O)O)OP(=O)(O)O)OP(=O)(O)O" -o ip6.sdf --pka_model etr1 --cpu 1 --debug

pKa ML model found:  /home/diogom/.cache/molscrub/ETR_latest_compressed.joblib
download skipped

Traceback (most recent call last):
  File "/home/diogom/micromamba/envs/dev/bin/scrub.py", line 440, in <module>
    isomer_list, log = scrub.scrub_and_catch_errors(input_mol)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/disk/diogom/code/molscrub/molscrub/core.py", line 191, in scrub_and_catch_errors
    raise e
  File "/disk/diogom/code/molscrub/molscrub/core.py", line 186, in scrub_and_catch_errors
    isomer_list_if_ok_else_input = self(input_mol)
                                   ^^^^^^^^^^^^^^^
  File "/disk/diogom/code/molscrub/molscrub/core.py", line 122, in __call__
    for mol_out in self.acid_base_conjugator(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/disk/diogom/code/molscrub/molscrub/protonate.py", line 83, in __call__
    return self.protonate_with_model(input_mol, ph_range_low, ph_range_high, model)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/disk/diogom/code/molscrub/molscrub/protonate.py", line 117, in protonate_with_model
    ml_pka = self.calculate_pka(rxn["original"], rxn, model=model)[0]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/disk/diogom/code/molscrub/molscrub/protonate.py", line 323, in calculate_pka
    x = self._prepare_data_for_model(mol, rxn_info)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/disk/diogom/code/molscrub/molscrub/protonate.py", line 352, in _prepare_data_for_model
    x["charge_diff"] = self._charge_diff(mol, rxn_info["protonated_atom"])
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/disk/diogom/code/molscrub/molscrub/protonate.py", line 368, in _charge_diff
    if atom_idx < 0 or atom_idx >= mol.GetNumAtoms():
       ^^^^^^^^^^^^
TypeError: '<' not supported between instances of 'NoneType' and 'int'

joanimato · 2026-03-12T18:28:41Z

The issue results due to RunReactants applying the reaction to multiple sites instead of one at a time, which I think is the expected behavior. I'm not sure if this is a bug in rdkit or if I'm missing something about the functionality. Or if maybe it's caused by the high symmetry, or something wrong with the smarts pattern.

I implemented a fix rejecting the reactions with multiple sites, it seems to work for now, not sure if it will have any unintended consequences.

joanimato added 3 commits February 23, 2026 13:55

feat: ML pka prediction and protonation

2a8c7b0

minor fixes

943cb39

feat: combinatoric determination of protonated sites

e100862

joanimato requested a review from diogomart February 26, 2026 23:18

joanimato added 5 commits February 27, 2026 14:33

feat: added code used to train the model for reproducibility (rough d…

34f5daa

…raft)

fix: account for new descriptors in newer versions of rdkit

c707880

feat: output predicted pka in the sdf file

8e6e74e

feat: download model from different repository instead of storing it …

d78deca

…here

removed pkl files from this repository

05bfa77

fix: remove multiples produced by RunReactants

360a027

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

pka prediction using ML model. #51

pka prediction using ML model. #51
joanimato wants to merge 9 commits intodevelopfrom
pka_prediction

joanimato commented Feb 26, 2026 •

edited

Loading

Uh oh!

diogomart commented Mar 11, 2026

Uh oh!

joanimato commented Mar 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

joanimato commented Feb 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

diogomart commented Mar 11, 2026

Uh oh!

joanimato commented Mar 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

joanimato commented Feb 26, 2026 •

edited

Loading