Skip to content

pka prediction using ML model. #51

Open
joanimato wants to merge 9 commits intodevelopfrom
pka_prediction
Open

pka prediction using ML model. #51
joanimato wants to merge 9 commits intodevelopfrom
pka_prediction

Conversation

@joanimato
Copy link
Copy Markdown
Contributor

@joanimato joanimato commented Feb 26, 2026

This PR introduces a couple of major changes:

  1. An ML model based on the sklearn ExtraTreesRegressor is used to compute the pKa of protonation sites. This model is trained using the pre-existing rules and rdkit Descriptors as input features. Because the model uses the rules which are site specific, it can therefore compute the pka for each protonatable site in the molecule. Seems to work pretty well so far, though we are limited by the training set. We can always retrain with a larger set or expand our rules.

  2. Protonations are done combinatorially, with each possible combination of protonated and deprotonated sites considered. This increases the cost somewhat, with the total number of pKa computations being N*2^N, where N is the number of protonatable sites in the molecule. Testing with cysteine (N=3) resulted in molscrub being about 3-4 times slower. So not too bad. There might be some ways to speed this up (loading the model seems to take quite a bit of time), but that's a future problem.

For now this is optional, set with the --pka_model etr1 cli option, or the pka_model="etr1" when initializing the Scrub object. The default will still the rules-based method .

The plan for now is to update this iteratively, with the model being improved as new edge cases are discovered and fed into the model.

The code used to train the model is also included in the training folder. It's pretty or generalizable, but for now it should be ok. we can always improve this as more models are considered.

@joanimato joanimato requested a review from diogomart February 26, 2026 23:18
@diogomart
Copy link
Copy Markdown
Contributor

Found an error

$ scrub.py "C1(C(C(C(C(C1OP(=O)(O)O)OP(=O)(O)O)OP(=O)(O)O)OP(=O)(O)O)OP(=O)(O)O)OP(=O)(O)O" -o ip6.sdf --pka_model etr1 --cpu 1 --debug

pKa ML model found:  /home/diogom/.cache/molscrub/ETR_latest_compressed.joblib
download skipped

Traceback (most recent call last):
  File "/home/diogom/micromamba/envs/dev/bin/scrub.py", line 440, in <module>
    isomer_list, log = scrub.scrub_and_catch_errors(input_mol)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/disk/diogom/code/molscrub/molscrub/core.py", line 191, in scrub_and_catch_errors
    raise e
  File "/disk/diogom/code/molscrub/molscrub/core.py", line 186, in scrub_and_catch_errors
    isomer_list_if_ok_else_input = self(input_mol)
                                   ^^^^^^^^^^^^^^^
  File "/disk/diogom/code/molscrub/molscrub/core.py", line 122, in __call__
    for mol_out in self.acid_base_conjugator(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/disk/diogom/code/molscrub/molscrub/protonate.py", line 83, in __call__
    return self.protonate_with_model(input_mol, ph_range_low, ph_range_high, model)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/disk/diogom/code/molscrub/molscrub/protonate.py", line 117, in protonate_with_model
    ml_pka = self.calculate_pka(rxn["original"], rxn, model=model)[0]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/disk/diogom/code/molscrub/molscrub/protonate.py", line 323, in calculate_pka
    x = self._prepare_data_for_model(mol, rxn_info)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/disk/diogom/code/molscrub/molscrub/protonate.py", line 352, in _prepare_data_for_model
    x["charge_diff"] = self._charge_diff(mol, rxn_info["protonated_atom"])
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/disk/diogom/code/molscrub/molscrub/protonate.py", line 368, in _charge_diff
    if atom_idx < 0 or atom_idx >= mol.GetNumAtoms():
       ^^^^^^^^^^^^
TypeError: '<' not supported between instances of 'NoneType' and 'int'

@joanimato
Copy link
Copy Markdown
Contributor Author

The issue results due to RunReactants applying the reaction to multiple sites instead of one at a time, which I think is the expected behavior. I'm not sure if this is a bug in rdkit or if I'm missing something about the functionality. Or if maybe it's caused by the high symmetry, or something wrong with the smarts pattern.

I implemented a fix rejecting the reactions with multiple sites, it seems to work for now, not sure if it will have any unintended consequences.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants