Chemical entity deprecation, adding enumeration for limited set of chemicals needed (at the moment) in submissions #2324

Merged: 17 commits, Jan 28, 2025

sierra-moxon
Member

@sierra-moxon sierra-moxon commented Jan 22, 2025

This PR deprecates the ChemicalEntity class and adds a ChemicalEntityEnum to simplify the validation and storage of the handful of chemicals currently in scope for support in NMDC workflows.

The chemicals we need are primarily represented in ChEBI, but a few proteases are only found in the MS ontology. After discussion, instead of trying to constrain the ChemicalEntity class to contain both MS and ChEBI identifiers and instead of subclassing ChemicalEntity into subclasses differentiated by their usage in workflows, we decided to establish a small enumeration of chemicals that we know are needed, adding meaning and descriptions to the permissible values as required.

I also deprecated chemical_entity_set, removed references to it, and updated the ranges that pointed to the ChemicalEntity class so that they point to ChemicalEntityEnum instead. This change also required updating the test data to remove the requirement to generate NMDC identifiers for these chemicals (as the PR stands now, the substances are represented by their labels, with meaning set to their respective ontology identifier).

I have two concerns:

  1. "known_as" now contains strings validated with our enumeration, but they are still just strings: if a label changes at ChEBI for one of our enumerated permissible values, we won't be notified. This is probably low-risk for the chemicals we are currently using.
    • We did discuss in our decision-making meeting here, that extracting inchi-key, inchi, and smiles for any enumerated permissible value using ChEBI identifiers/labels will be straightforward using the ChEBI ontology metadata. We did discuss that there would separately need to be code to translate a permissible value like this to the necessary links, definitions, and alternate identifiers (e.g. smiles, inchi-key, inchi) necessary for display of these chemicals.
  2. Not all permissible values in this enumeration have descriptions (the descriptions were drawn from a previous effort done by NMDC here: ). Should they all have descriptions? Are we missing any chemicals that we need?

It may be a bit tricky to review the schema changes in this PR - the extra files here besides schema and test files are a result of me running make squeaky-clean all test and committing the result. Let me know if I am not supposed to do that step here.

fixes #2121
fixes #2153

github-actions bot commented Jan 23, 2025

PR Preview Action v1.6.0

🚀 View preview at
https://microbiomedata.github.io/nmdc-schema/pr-preview/pr-2324/

Built to branch gh-pages at 2025-01-28 18:02 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@kheal
Contributor

kheal commented Jan 24, 2025

Thanks Sierra for getting this rolling.

I think the general rule of thumb is to only commit files in the src/ folder when making changes to the schema, unless you're working on changes to the build steps. It would be good to revert the changes to all the other files to avoid merge conflicts and to help people keep their local environments in sync when branching off the main branch. Usually I run make squeaky-clean locally, but only add / commit the src/ files.

Otherwise, with the two small changes I requested, this looks good to me.

@SamuelPurvine and @mslarae13 - are the chemicals/proteases you need included and sufficiently described? Easy to see it here if you don't want to dig through code https://microbiomedata.github.io/nmdc-schema/pr-preview/pr-2324/ChemicalEntityEnum/

@kheal
Contributor

kheal commented Jan 24, 2025

@sierra-moxon if we decide to merge this in, can you make sure to file a follow up ticket to move the deprecated elements to the deprecated yaml after the next release? That is per the deprecation guidelines.

@SamuelPurvine
Contributor

OK, I'm not sure I agree with these mappings?

remove HMDB
@sierra-moxon
Member Author

arg!!! this is so terrible: I used an LLM to reorder my curated permissible values alphabetically so I could better see if I had a complete list, and it messed up the mappings. ARG! fixing. thank you for catching that.

@sierra-moxon
Member Author

sierra-moxon commented Jan 24, 2025

This is what MS_1001917 looks like:
https://www.ebi.ac.uk/ols4/ontologies/ms/classes/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FMS_1001917
glutamyl endopeptidase - that is the label we'd like to use instead of "Glu-C"?

@SamuelPurvine
Contributor

> This is what MS_1001917 looks like:
> https://www.ebi.ac.uk/ols4/ontologies/ms/classes/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FMS_1001917
> glutamyl endopeptidase - that is the label we'd like to use instead of "Glu-C"?

No please, the exact name of Glu-C is what the parameter file wants to see.

@sierra-moxon
Member Author

In addition, isopropyl_alcohol in ChEBI is labeled propan-2-ol - do we want to use that label instead? (it has isopropyl alcohol as a synonym).

@SamuelPurvine
Contributor

> In addition, isopropyl_alcohol in ChEBI is labeled propan-2-ol - do we want to use that label instead? (it has isopropyl alcohol as a synonym).

I would keep it as is, only nerds really know what propan-2-ol is :) Was always a fun thing to throw newbies in the lab about...

@sierra-moxon
Member Author

> Thanks Sierra for getting this rolling.
>
> I think the general rule of thumb is to only commit files in the src/ folder when making changes to the schema, unless you're working on changes to the build steps. It would be good to revert the changes to all the other files to avoid merge conflicts and to help people keep their local environments in sync when branching off the main branch. Usually I run make squeaky-clean locally, but only add / commit the src/ files.

alright, done: I reverted all changed files outside of src/ except the pyproject.toml, poetry.lock, and Makefile as I also needed to upgrade the refscan package in this PR.

@sierra-moxon
Member Author

> @sierra-moxon if we decide to merge this in, can you make sure to file a follow up ticket to move the deprecated elements to the deprecated yaml after the next release? That is per the deprecation guidelines.

done: #2325

@sierra-moxon sierra-moxon requested a review from kheal January 24, 2025 01:06
@sierra-moxon
Member Author

@SamuelPurvine - alright, I believe I fixed the mappings. Another minor thing: are you happy with my capitalization, dashes, and underscore conventions in this Enum?

@sierra-moxon
Member Author

I think I am going to add a small test here that queries the mapping and the label and makes sure they, or one of their synonyms, agree, and if not, generate a warning. It should be pretty easy to do this with an API call to OLS. Pls hold off on merge until I get that in there.
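
A minimal sketch of the comparison such a test might make, assuming the term's label and synonyms have already been fetched from OLS (the request itself is left out, and the function names here are hypothetical). Names are normalized so that a permissible value like `isopropyl_alcohol` can match the synonym `isopropyl alcohol`.

```python
from __future__ import annotations

import warnings


def normalize(name: str) -> str:
    """Lowercase and treat underscores, dashes, and spaces as equivalent."""
    return name.lower().replace("_", " ").replace("-", " ").strip()


def check_permissible_value(pv_name: str, label: str, synonyms: list[str]) -> bool:
    """Return True if pv_name agrees with the ontology label or one of its synonyms.

    A mismatch produces a warning rather than a failure, as proposed above.
    `label` and `synonyms` are assumed to come from an OLS lookup of the
    permissible value's `meaning`; that lookup is not shown here.
    """
    candidates = {normalize(label)} | {normalize(s) for s in synonyms}
    if normalize(pv_name) in candidates:
        return True
    warnings.warn(f"{pv_name!r} matches neither label {label!r} nor its synonyms")
    return False
```

With the ChEBI example from this thread, `check_permissible_value("isopropyl_alcohol", "propan-2-ol", ["isopropyl alcohol"])` passes through the synonym rather than the label.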

@kheal
Contributor

kheal commented Jan 24, 2025

@sierra-moxon - would you consider switching the value and description of PVs?

So the value would be "CHEBI:30751" and the meaning would be "formic acid"? My gut is that the most standardized id for each chemical should be what we use as the value in the known_as. Is there a reason not to do this?

@sierra-moxon
Member Author

sierra-moxon commented Jan 24, 2025

meaning is more or less a reserved metadata element that helps translate the curie into a URI in the RDF serialization of the model. But we could do something like this:

e.g.

  ChemicalEntityEnum:
    permissible_values:
      Arg-C:
        meaning: MS:1001303

becomes:

  ChemicalEntityEnum:
    permissible_values:
      "MS:1001303":
        meaning: MS:1001303
        description: >-
          Arg-C: A cysteine protease that hydrolyzes peptide, ester, and amide bonds at the C-terminus of arginine and, with
          lower efficiency, lysine.

I much prefer the identifiers as the permissible value.
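
As a rough illustration of what `meaning` enables in the RDF serialization, the sketch below expands a CURIE into a URI using a prefix map. The OBO PURL pattern matches the MS_1001917 link earlier in this thread; the `PREFIXES` map and `expand_curie` function are hypothetical stand-ins, not the schema's actual prefix handling.

```python
# Hypothetical prefix map covering only the two ontologies discussed here.
PREFIXES = {
    "MS": "http://purl.obolibrary.org/obo/MS_",
    "CHEBI": "http://purl.obolibrary.org/obo/CHEBI_",
}


def expand_curie(curie: str) -> str:
    """Expand a CURIE such as 'MS:1001303' into a full URI."""
    prefix, _, local_id = curie.partition(":")
    if not local_id or prefix not in PREFIXES:
        raise ValueError(f"cannot expand CURIE: {curie!r}")
    return PREFIXES[prefix] + local_id
```

Either layout keeps this expansion available, since `meaning` holds the CURIE whether the permissible value itself is the label or the identifier.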

@SamuelPurvine
Contributor

SamuelPurvine commented Jan 24, 2025

For the proteases, can we please make sure that the name, the specific set of characters we had, are the thing that is available to the workflows, and not have to dig to find possible aliases and suchlike? Reason: The parameter file that runs the proteomics workflow has a specific set of nomenclature to set the enzyme that was used in the experiment. This then tells the program (MSGFPlus) where to cleave amino acids when coming up with candidate sequences for MS2 comparison. The characters listed in the enum are pulled directly from this parameter file, so I know that these are what it expects to see. We are doing this for workflow operation, as opposed to trying to make claims about biology or gene expressions or functions, etc.

Specific section of the parameter file reads:
# Enzyme ID
# 0 means unspecific cleavage (cleave after any residue)
# 1 means Trypsin (Default); optionally use this along with NTT=0 for a no-enzyme-specificity search of a tryptically digested sample
# 2: Chymotrypsin, 3: Lys-C, 4: Lys-N, 5: Glu-C, 6: Arg-C, 7: Asp-N, 8: alphaLP, 9: No Cleavage (for peptidomics)
EnzymeID=1

Workflow will need to look for the protease used in sample processing and match that to the ID number in the file. I would LOVE to also have No Cleavage as an option for proteomics workflow protease enums, but I suspect we'll cross that bridge if we ever want to do no enzyme searches (boring explanation redacted here).
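
The lookup described above could be sketched as follows. The names and IDs are copied from the parameter-file comment; the `ENZYME_IDS` table and `enzyme_id_line` function are hypothetical, and ID 0 (unspecific cleavage) is left out because it does not name a protease.

```python
# Enzyme names and IDs as listed in the parameter-file comment quoted above.
ENZYME_IDS = {
    "Trypsin": 1,
    "Chymotrypsin": 2,
    "Lys-C": 3,
    "Lys-N": 4,
    "Glu-C": 5,
    "Arg-C": 6,
    "Asp-N": 7,
    "alphaLP": 8,
    "No Cleavage": 9,
}


def enzyme_id_line(protease: str) -> str:
    """Return the `EnzymeID=` line for a protease name (exact match only)."""
    try:
        return f"EnzymeID={ENZYME_IDS[protease]}"
    except KeyError:
        raise ValueError(f"unknown protease name: {protease!r}") from None
```

The match is deliberately exact: `enzyme_id_line("Glu-C")` works, while an ontology label such as "glutamyl endopeptidase" raises an error, which is why the enum keeps the parameter file's spelling.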

Contributor

@mslarae13 mslarae13 left a comment

LGTM for the schema and examples. I leave the content changes to Sam.

@@ -1,7 +1,7 @@
 type: nmdc:MobilePhaseSegment
 substances_used:
 - type: nmdc:PortionOfSubstance
-  known_as: nmdc:chem-99-000003 # see src/data/valid/Database-chemical_entity_set-1.yaml
+  known_as: alphaLP
Contributor

@mslarae13 This chem needs to be corrected

Contributor

Looks like nmdc:chem-99-000003 in the previous schema was identified as methanol, likely as a placeholder. Rather than always replacing nmdc:chem-99-000003 with methanol, @sierra-moxon put in a variety of permissible values from the enum. So while alphaLP is unlikely to ever be used as the substance in MobilePhaseSegment, this does show that nothing validates which permissible values can be used with which classes: no chemical in the enum is limited to specific classes.

I think we can leave this as is for the example. If one day we expand validation to say "these are the only chemicals that can be used for MobilePhaseSegment", this example will need to be updated.
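
For illustration only, a per-class restriction of that kind might look like the sketch below. The allow-list, its contents, and the function name are all hypothetical; the schema in this PR has no such check, and the listed chemical names are placeholders rather than a reviewed list.

```python
from __future__ import annotations

# Hypothetical per-class allow-list; the schema in this PR has no such restriction.
ALLOWED_SUBSTANCES = {
    "nmdc:MobilePhaseSegment": {"methanol", "isopropyl_alcohol"},
}


def check_substance(class_name: str, known_as: str) -> list[str]:
    """Return a list of problems; empty means the value is acceptable.

    Classes without an allow-list entry are treated as unrestricted.
    """
    allowed = ALLOWED_SUBSTANCES.get(class_name)
    if allowed is None or known_as in allowed:
        return []
    return [f"{known_as!r} is not an allowed substance for {class_name}"]
```

Under this sketch the example above (`known_as: alphaLP` in a MobilePhaseSegment) would be reported, while `methanol` would pass.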

Contributor

@SamuelPurvine SamuelPurvine left a comment

Approved once Montana reviews/adjusts what I commented about

@turbomam
Member

Thanks for the great contributions everybody! This PR looks very good to me. I expect that I will have a few comments over the next half hour or so.

@turbomam turbomam self-requested a review January 24, 2025 19:42
Member

@turbomam turbomam left a comment

This was clearly a great team effort. I would prefer that all of the permissible value descriptions be drawn from the same source as the meaning, but it's nice that they are informative and consistent in style.

I have made a few targeted requests.

@sierra-moxon
Member Author

sierra-moxon commented Jan 28, 2025

Per MS meeting:

  • use labels for now, then re-evaluate when we have a proteomics workflow that changes based on these chemicals and/or a UI mockup for how these PVs will be used in the submission portal. (lots of good ideas here from @turbomam, @SamuelPurvine @mslarae13 @lamccue about PV subsets, etc. -- see notes in slides and meeting notes)

  • add an EC number for those proteases that don't have MS identifiers, and make an issue in the MS ontology repo to add alphaLP as an MS term with a good definition. Note: in reviewing the prefixes in bioregistry.io for EC numbers, I settled on the eccode prefix per their guidance, but I have open Slack messages to the developers here to get more feedback about the change (I'm part of the orgs that have historically used EC rather than eccode; this discussion should not hold up this PR). Edit: per @turbomam's review, we can stick with EC for now and do a separate PR to update all instances of "EC codes" used in the schema to use eccode when we have more information from bioregistry, etc.

@kheal
Contributor

kheal commented Jan 28, 2025

Looks like the last commit included files outside of src/. We should revert those before pushing, I think.

@sierra-moxon sierra-moxon requested a review from turbomam January 28, 2025 17:39
@sierra-moxon
Member Author

> Looks like the last commit included files outside of src/. We should revert those before pushing, I think.

All have been reverted besides the pyproject.toml, poetry.lock, and Makefile changes associated with updating refscan.

@sierra-moxon
Member Author

MS ontology ticket: HUPO-PSI/psi-ms-CV#366
MS ontology PR: HUPO-PSI/psi-ms-CV#367

5 participants