-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Chemical entity deprecation, adding enumeration for limited set of chemicals needed (at the moment) in submissions #2324
Conversation
|
Thanks Sierra for getting this rolling. I think the general rule of thumb is to only commit files in the src/ folder when making changes to the schema unless you're working on changes to the build steps. It would be good to revert the changes to all the other files to avoid merge conflicts and to help people keep their local environments in sync when branching off the main branch. Usually I run make-squeaky-clean locally, but only add / commit the src/ files. Otherwise, with the two small changes I requested, this looks good to me. @SamuelPurvine and @mslarae13 - are the chemicals/proteases you need included and sufficiently described? Easy to see it here if you don't want to dig through code https://microbiomedata.github.io/nmdc-schema/pr-preview/pr-2324/ChemicalEntityEnum/ |
@sierra-moxon if we decide to merge this in, can you make sure to file a follow up ticket to move the deprecated elements to the deprecated yaml after the next release? That is per the deprecation guidelines. |
OK, I'm not sure I agree with these mappings?
|
remove HMDB
arg!!! this is so terrible: I used a LLM to reorder my curated permissible values alphabetically so I could better see if I had a complete list, and it messed up the mappings. ARG! fixing. thank you for catching that. |
This is what MS_1001917 looks like: |
No please, the exact name of Glu-C is what the parameter file wants to see. |
In addition, |
…ma into chemical-entity
I would keep it as is, only nerds really know what propan-2-ol is :) Was always a fun thing to throw newbies in the lab about... |
alright, done: I reverted all changed files outside of |
done: #2325 |
@SamuelPurvine - alright, I believe I fixed the mappings. Another minor thing: are you happy with my capitalization, dashes, and underscore conventions in this Enum? |
I think I am going to add a small test here that queries the mapping and the label and makes sure they, or one of their synonyms, agree, and if not, generate a warning. It should be pretty easy to do this with an API call to OLS. Pls hold off on merge until I get that in there. |
@sierra-moxon - would you consider switching the value and description of PVs? So the value would be "CHEBI:30751" and the meaning would be "formic acid"? My gut is that the most standardized id for each chemical should be what we use as the value in the |
meaning is more or less a reserved metadata element that helps translate the curie into a URI in the RDF serialization of the model. But we could do something like this: e.g.
becomes:
I much prefer the identifiers as the permissible value. |
For the proteases, can we please make sure that the name, the specific set of characters we had, are the thing that is available to the workflows, and not have to dig to find possible aliases and suchlike? Reason: The parameter file that runs the proteomics workflow has a specific set of nomenclature to set the enzyme that was used in the experiment. This then tells the program (MSGFPlus) where to cleave amino acids when coming up with candidate sequences for MS2 comparison. The characters listed in the enum are pulled directly from this parameter file, so I know that these are what it expects to see. We are doing this for workflow operation, as opposed to trying to make claims about biology or gene expressions or functions, etc. Specific section of the parameter file reads: Workflow will need to look for the protease used in sample processing and match that to the ID number in the file. I would LOVE to also have No Cleavage as an option for proteomics workflow protease enums, but I suspect we'll cross that bridge if we ever want to do no enzyme searches (boring explanation redacted here). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM for the schema and examples. I leave the content changes to Sam.
@@ -1,7 +1,7 @@ | |||
type: nmdc:MobilePhaseSegment | |||
substances_used: | |||
- type: nmdc:PortionOfSubstance | |||
known_as: nmdc:chem-99-000003 # see src/data/valid/Database-chemical_entity_set-1.yaml | |||
known_as: alphaLP |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mslarae13 This chem needs to be corrected
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like nmdc:chem-99-000003
in the previous schema was identified as methanol. Likely as a placeholder. And rather than always replace nmdc:chem-99-000003
with methanol
@sierra-moxon put in a variety of permissible values from the enum. So while alphaLP is unlikely to ever be used as the substance in MobilePhaseSegment.. this does show that there's no validation or checks on the fact that no one chemical in that permissible value is limited to specific classes.
I think we can leave this as is for the example. And if one day, we expand validation to say "these are the only chemicals that can be used for MobilePhaseSegment" this example will need updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved once Montana reviews/adjusts what I commented about
Thanks for the great contributions everybody! This PR looks very good to me. I expect that I will have a few comments over then next half hour or so. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was clearly a great team effort. I would prefer that all of the permissible value descriptions be drawn from the same source as the meaning, but it's nice that they are informative and consistent in style.
I have made a few targeted requests.
Per MS meeting:
|
Looks like there were files committed that were not the src/ . files on the last commit. We should revert those before pushing, I think. |
All have been reverted besides the pyproject.toml, poetry.lock, and Makefile changes associated with updating refscan. |
MS ontology ticket: HUPO-PSI/psi-ms-CV#366 |
This PR deprecates the
ChemicalEntity
class and adds aChemicalEntityEnum
to simplify the validation and storage of the handful of chemicals currently in scope for support in NMDC workflows.The chemicals we need are primarily represented in ChEBI, but a few proteases are only found in the MS ontology. After discussion, instead of trying to constrain the
ChemicalEntity
class to contain both MS and ChEBI identifiers and instead of subclassingChemicalEntity
into subclasses differentiated by their usage in workflows, we decided to establish a small enumeration of chemicals that we know are needed, adding meaning and descriptions to the permissible values as required.I also deprecated
chemical_entity_set
, removed references to it, and updated ranges currently pointing to theChemicalEntity
class instead ofChemicalEntityEnum
. This change also required updating the test data to remove the requirement to generate NMDC identifiers for these chemicals (as the PR looks now, the substances are represented by their labels withmeaning
set to their respective ontology identifier).I have two concerns:
nmdc-schema/src/data/valid/Database-interleaved.yaml
Line 58 in bdf58f0
It may be a bit tricky to review the schema changes in this PR - the extra files here besides schema and test files are a result of me running
make squeaky-clean all test
and committing the result. Let me know if I am not supposed to do that step here.fixes #2121
fixes #2153