(Fix #352) Allow multiple matches from raw mol, when raw mol is symmetrical#353
(Fix #352) Allow multiple matches from raw mol, when raw mol is symmetrical#353rwxayheee wants to merge 3 commits intoforlilab:developfrom
Conversation
Explanation of #352This is not a bonding issue, I was thinking about something else that is not relevant. This special case was failing because in the past, only one (arbitrary) match is considered from function For unprotonated Serine, at the point of mapping evaluation against the template: Therefore, two matches of the MCS in SolutionThe vanishing of symmetry is evaluated by an empirical symmetry score, not sure if it works in all cases: And because the return type of |
|
Wow, this bug was really deep. I did not expect that. Thank you so much. |
|
The symmetry_score depends on whether the molecule has hydrogens or not: from rdkit import Chem
from collections import Counter
def symmetry_score(mol):
"""Returns a rough score of symmetry: higher = more symmetric."""
ranks = Chem.CanonicalRankAtoms(mol, breakTies=False)
rank_counts = Counter(ranks)
score = sum(count for count in rank_counts.values() if count > 1)
return score
p = Chem.SmilesParserParams()
p.removeHs=False
asp_template = Chem.MolFromSmiles("C(=O)C([H])(N[H])C([H])([H])C(=O)[O-]", p)
asp_template_noH = Chem.MolFromSmiles("C(=O)C([H])(N[H])C([H])([H])C(=O)[O-]")
asp_raw = Chem.MolFromSmiles("C(O)C(N)CC(O)O")
ser_template = Chem.MolFromSmiles("C(=O)C([H])(N[H])C([H])([H])O[H]", p)
ser_template_noH = Chem.MolFromSmiles("C(=O)C([H])(N[H])C([H])([H])O[H]")
ser_raw = Chem.MolFromSmiles("C(O)C(N)CO")
print(f"{symmetry_score(ser_template)=}")
print(f"{symmetry_score(ser_template_noH)=}")
print(f"{symmetry_score(ser_raw)=}")
print()
print(f"{symmetry_score(asp_template)=}")
print(f"{symmetry_score(asp_template_noH)=}")
print(f"{symmetry_score(asp_raw)=}")Here's the result: Thus, I think we need an alternate solution, as the raw mols may or may not have hydrogens. What about keeping multiple mapping only when the indices of link atoms are affected? |
We can exclude hydrogens from the symmetry evaluation. (But I think usually template has hydrogens, while raw mol is typically missing hydrogens so it shouldn’t matter) I have thought about using link atoms too. but the information is not passed on the layer. If you have a suggestion, please make the change and I can take a second look |
|
For templates and raw molecules that match, templates without hydrogens have a smaller symmetry score because they have bond orders, while all bonds are single in raw molecules. I'll look into implementing the link atoms and PR to your PR. |
Yes, this is why raw mol is more symmetrical than template and therefore multiple matches are needed. So I think the symmetry score should work to pick out symmetrical raw mol like unprotected serine. Maybe I still don’t understand the templates without hydrogen scenario you described.. I think the match needs to account for vanishing symmetry in this way, regardless of link atoms or not. But doing this can generate multiple protonated conformers and it’s not the current behavior (just construct an arbitrary protomer). To keep things simple, we can use link atoms for now. If people raise other issues about the template matching in the future, a redesgn would be helpful. |
|
I think that without hydrogens this statement is never true, but it's OK to always return all the mappings <- EDITED
Could you clarify what you mean by "vanishing symmetry"? To clarify, I think the best is to return all the matches accounting for symmetry. I really like your implementation. |
I commented in the code (but maybe it's bad and confusing language..). Here's an explanation: The match function is a function that tries to apply a template onto a raw mol. The vanishing of symmetry (raw -> template) means there are multiple equally valid mappings between raw mol and template. The raw mol in the specific case (unprotonated Serine) is lacking hydrogens and bond order, and therefore it becomes highly symmetrical. As you pointed out as an alternate solution, if taking into account the link atoms, this raw will not be symmetrical. But currently, the link atoms evaluation happens after the matching. Therefore, the solution in is PR is to return multiple matches for further evaluation. A potential problem is that this PR might produce some tied mappings that need manual selection & inspection. (This is actually needed, for a single, unprotonated Serine as there's no way to distinguish C(=O) and CB-OH?) So in theory, we should generate all possible outcomes for these but this might be out of the scope of Meeko at this moment, as we always recommend protonating receptors before making the PDBQT. Both #347 and #352 are applied cases that are kind of outside the original design scope. I understand the urgency but if there's no optimal solution, please don't rush to solve it. I recommend fixing the PDB files in applied projects, if possible. |
|
I think this is good to merge after https://github.com/rwxayheee/Meeko/pull/8/files |
I think this is addressed by this block, but let me know if you have concerns: |
|
Hi @diogomart Returning all matching and unifying by template names is not what I intended. Returning all matching: Unifying by template names: If you're looking for a quick fix for this issue, I will approve the changes. But the changes are very different from what I was planning. If we're not account for the symmetry, this PR should be closed and a different approach should be taken. |
If unprotonated ASP/GLU is mapping to ASH/GLH, there's a vanishing symmetry and two possible mappings under the same template key. Maybe it's not worth too much work at this time, to account for exceptions like unprotonated Serine or clashes in sidechains. The matching logic is too complicated and I feel bad for making it more complicated.. |
This seems to work for the specific case in #352 , but requires multiple matches from function match. Maybe it changed more than it should, and make the template matching even more complicated, but it seems to be a working solution.