Skip to content

(Fix #352) Allow multiple matches from raw mol, when raw mol is symmetrical#353

Open
rwxayheee wants to merge 3 commits intoforlilab:developfrom
rwxayheee:allow_multi_matches_from_raw
Open

(Fix #352) Allow multiple matches from raw mol, when raw mol is symmetrical#353
rwxayheee wants to merge 3 commits intoforlilab:developfrom
rwxayheee:allow_multi_matches_from_raw

Conversation

@rwxayheee
Copy link
Copy Markdown
Contributor

@rwxayheee rwxayheee commented Apr 25, 2025

This seems to work for the specific case in #352 , but requires multiple matches from function match. Maybe it changed more than it should, and make the template matching even more complicated, but it seems to be a working solution.

@rwxayheee rwxayheee marked this pull request as draft April 25, 2025 07:14
@rwxayheee rwxayheee requested a review from diogomart April 25, 2025 07:45
@rwxayheee rwxayheee marked this pull request as ready for review April 25, 2025 07:45
@rwxayheee
Copy link
Copy Markdown
Contributor Author

rwxayheee commented Apr 25, 2025

Explanation of #352

This is not a bonding issue, I was thinking about something else that is not relevant. This special case was failing because in the past, only one (arbitrary) match is considered from function mapping_by_mcs and match.

For unprotonated Serine, at the point of mapping evaluation against the template: mapping_by_mcs(self.mol, input_mol)
self.mol is a template.mol, which is asymmetric: (Smiles) [H]NC([H])(C=O)C([H])([H])O[H]
input_mol is the raw_mol, which is symmetric: (Smiles) [N][C]([C][O])[C][O]

Therefore, two matches of the MCS in raw_mol need to be considered, as there's a vanishing of symmetry when the template is enforced on raw_mol.

Solution

The vanishing of symmetry is evaluated by an empirical symmetry score, not sure if it works in all cases:

    def symmetry_score(mol):
        """Returns a rough score of symmetry: higher = more symmetric."""
        ranks = Chem.CanonicalRankAtoms(mol, breakTies=False)
        rank_counts = Counter(ranks)
        score = sum(count for count in rank_counts.values() if count > 1)
        return score

And because the return type of match and mapping_by_mcs have become lists, and one template can possibly have multiple mappings being evaluated, some minor adjustment is made (indexing, etc.)

@diogomart
Copy link
Copy Markdown
Contributor

Wow, this bug was really deep. I did not expect that. Thank you so much.

@diogomart
Copy link
Copy Markdown
Contributor

diogomart commented Apr 25, 2025

The symmetry_score depends on whether the molecule has hydrogens or not:

from rdkit import Chem
from collections import Counter

def symmetry_score(mol):
    """Returns a rough score of symmetry: higher = more symmetric."""
    ranks = Chem.CanonicalRankAtoms(mol, breakTies=False)
    rank_counts = Counter(ranks)
    score = sum(count for count in rank_counts.values() if count > 1)
    return score

p = Chem.SmilesParserParams()
p.removeHs=False
asp_template = Chem.MolFromSmiles("C(=O)C([H])(N[H])C([H])([H])C(=O)[O-]", p)
asp_template_noH = Chem.MolFromSmiles("C(=O)C([H])(N[H])C([H])([H])C(=O)[O-]")
asp_raw = Chem.MolFromSmiles("C(O)C(N)CC(O)O")

ser_template = Chem.MolFromSmiles("C(=O)C([H])(N[H])C([H])([H])O[H]", p)
ser_template_noH = Chem.MolFromSmiles("C(=O)C([H])(N[H])C([H])([H])O[H]")
ser_raw = Chem.MolFromSmiles("C(O)C(N)CO")


print(f"{symmetry_score(ser_template)=}")
print(f"{symmetry_score(ser_template_noH)=}")
print(f"{symmetry_score(ser_raw)=}")
print()
print(f"{symmetry_score(asp_template)=}")
print(f"{symmetry_score(asp_template_noH)=}")
print(f"{symmetry_score(asp_raw)=}")

Here's the result:

symmetry_score(ser_template)=2
symmetry_score(ser_template_noH)=0
symmetry_score(ser_raw)=4

symmetry_score(asp_template)=2
symmetry_score(asp_template_noH)=0
symmetry_score(asp_raw)=2

Thus, I think we need an alternate solution, as the raw mols may or may not have hydrogens. What about keeping multiple mapping only when the indices of link atoms are affected?

@rwxayheee
Copy link
Copy Markdown
Contributor Author

Thus, I think we need an alternate solution, as the raw mols may or may not have hydrogens. What about keeping multiple mapping only when the indices of link atoms are affected?

We can exclude hydrogens from the symmetry evaluation. (But I think usually template has hydrogens, while raw mol is typically missing hydrogens so it shouldn’t matter)

I have thought about using link atoms too. but the information is not passed on the layer. If you have a suggestion, please make the change and I can take a second look

@diogomart
Copy link
Copy Markdown
Contributor

For templates and raw molecules that match, templates without hydrogens have a smaller symmetry score because they have bond orders, while all bonds are single in raw molecules.

I'll look into implementing the link atoms and PR to your PR.

@rwxayheee
Copy link
Copy Markdown
Contributor Author

rwxayheee commented Apr 25, 2025

For templates and raw molecules that match, templates without hydrogens have a smaller symmetry score because they have bond orders, while all bonds are single in raw molecules

Yes, this is why raw mol is more symmetrical than template and therefore multiple matches are needed. So I think the symmetry score should work to pick out symmetrical raw mol like unprotected serine. Maybe I still don’t understand the templates without hydrogen scenario you described..

I think the match needs to account for vanishing symmetry in this way, regardless of link atoms or not. But doing this can generate multiple protonated conformers and it’s not the current behavior (just construct an arbitrary protomer). To keep things simple, we can use link atoms for now. If people raise other issues about the template matching in the future, a redesgn would be helpful.

@diogomart
Copy link
Copy Markdown
Contributor

diogomart commented Apr 25, 2025

I think that without hydrogens this statement is never true, but it's OK to always return all the mappings <- EDITED
https://github.com/rwxayheee/Meeko/blob/569654fe91957c4b882f8e66788596d21df6996f/meeko/polymer.py#L230

I think the match needs to account for vanishing symmetry in this way

Could you clarify what you mean by "vanishing symmetry"? To clarify, I think the best is to return all the matches accounting for symmetry. I really like your implementation.

@rwxayheee
Copy link
Copy Markdown
Contributor Author

rwxayheee commented Apr 25, 2025

Could you clarify what you mean by "vanishing symmetry"? To clarify, I think the best is to return all the matches accounting for symmetry. I really like your implementation.

I commented in the code (but maybe it's bad and confusing language..). Here's an explanation:

The match function is a function that tries to apply a template onto a raw mol. The vanishing of symmetry (raw -> template) means there are multiple equally valid mappings between raw mol and template.

The raw mol in the specific case (unprotonated Serine) is lacking hydrogens and bond order, and therefore it becomes highly symmetrical. As you pointed out as an alternate solution, if taking into account the link atoms, this raw will not be symmetrical. But currently, the link atoms evaluation happens after the matching. Therefore, the solution in is PR is to return multiple matches for further evaluation.

A potential problem is that this PR might produce some tied mappings that need manual selection & inspection. (This is actually needed, for a single, unprotonated Serine as there's no way to distinguish C(=O) and CB-OH?) So in theory, we should generate all possible outcomes for these but this might be out of the scope of Meeko at this moment, as we always recommend protonating receptors before making the PDBQT.

Both #347 and #352 are applied cases that are kind of outside the original design scope. I understand the urgency but if there's no optimal solution, please don't rush to solve it. I recommend fixing the PDB files in applied projects, if possible.

@diogomart
Copy link
Copy Markdown
Contributor

I think this is good to merge after https://github.com/rwxayheee/Meeko/pull/8/files

@diogomart
Copy link
Copy Markdown
Contributor

A potential problem is that this PR might produce some tied mappings that need manual selection & inspection. (This is actually needed, for a single, unprotonated Serine as there's no way to distinguish C(=O) and CB-OH?) So in theory, we should generate all possible outcomes for these but this might be out of the scope of Meeko at this moment, as we always recommend protonating receptors before making the PDBQT.

I think this is addressed by this block, but let me know if you have concerns:
https://github.com/diogomart/Meeko/blob/b129b22f17a43f0c292a807de1d41051139a4f1e/meeko/polymer.py#L1625-L1636

@rwxayheee
Copy link
Copy Markdown
Contributor Author

Hi @diogomart

Returning all matching and unifying by template names is not what I intended.

Returning all matching:
This can return too many mappings

Unifying by template names:
This may not be correct. Same template can have distinguishable mappings.

If you're looking for a quick fix for this issue, I will approve the changes. But the changes are very different from what I was planning. If we're not account for the symmetry, this PR should be closed and a different approach should be taken.

@rwxayheee
Copy link
Copy Markdown
Contributor Author

rwxayheee commented Apr 25, 2025

                # TODO 
                # We need mappings to cover different indices of link atoms only.
                # Carboxylates (e.g. ASP, GLU) will have two mappings for the
                # carboxylate oxygens, but we don't care about this and could filter
                # these out. We care about residues like serine, in which the
                # hydroxyl and carbonyl are equivalent in the raw mol because all
                # bonds are single.

If unprotonated ASP/GLU is mapping to ASH/GLH, there's a vanishing symmetry and two possible mappings under the same template key.

Maybe it's not worth too much work at this time, to account for exceptions like unprotonated Serine or clashes in sidechains. The matching logic is too complicated and I feel bad for making it more complicated..

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants