infer Biosample's env_package #22

Open
turbomam opened this issue Sep 3, 2024 · 6 comments · Fixed by #23

@turbomam
Member

turbomam commented Sep 3, 2024

@mslarae13 @cmungall @sierra-moxon @aclum

This issue describes a technique I can use to infer env_package annotations for Biosamples that don't have one, which in turn should improve the env_local_scale value set.

If we decide that we are comfortable with these env_package inferences, I would like to project them back into MongoDB (presumably with changesheets).

@turbomam turbomam self-assigned this Sep 3, 2024
@turbomam
Member Author

turbomam commented Sep 3, 2024

The environmental context triad squad is trying to determine the most reasonable env_broad_scale, env_local_scale and env_medium value sets for each of the environments/extensions/packages that NMDC supports. Our existing Biosamples are a great input for this, except that the majority do not have an env_package value:

Unique value counts for 'normalized_env_package':
normalized_env_package
NaN                                                6070
soil                                               1665
plant-associated                                    192
water                                               192
miscellaneous natural or artificial environment     140
Host-associated                                      61
Name: count, dtype: int64

I built a sparse matrix of the ancestors and descendants of each Biosample's env_broad_scale, env_local_scale and env_medium values, to take advantage of the hierarchical nature of EnvO. I then trained a random forest on 70% of the Biosamples that have an env_package and tested the model against the remaining 30%. The performance report below reflects the manual re-annotations described in the next two paragraphs.

After reviewing a first pass of env_package annotations at the Biosample level, I concluded that some of the Biosamples from nmdc:sty-11-r2h77870 (Bio-Scales: Defining plant gene function and its connection to ecosystem nitrogen and carbon cycle) and nmdc:sty-11-1t150432 (Defining the functional diversity of the Populus root microbiome) should have env_package values of 'plant-associated', due to env_medium values like 'leaf', 'root matter' and 'portion of plant tissue'.

I also made the judgement that all Biosamples in nmdc:sty-11-8fb6t785 (Deep subsurface shale carbon reservoir microbial communities from Ohio and West Virginia, USA) should use 'hydrocarbon resources-fluids_swabs' as their env_package. The random forest couldn't figure that out, because there weren't any 'hydrocarbon resources-fluids_swabs' Biosamples to train on.
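The manual corrections described above could be applied before training with a small rule function. This is only a sketch: the study IDs, env_medium labels and env_package values come from this comment, but the dictionary layout and the field names `study_id` and `env_medium_label` are hypothetical and are not taken from the NMDC schema or from the actual code in this repository.

```python
from typing import Optional

# Study-wide override, taken from the comment above.
STUDY_OVERRIDES = {
    "nmdc:sty-11-8fb6t785": "hydrocarbon resources-fluids_swabs",
}

# Studies and env_medium labels that suggest 'plant-associated' (from the comment above).
PLANT_STUDIES = {"nmdc:sty-11-r2h77870", "nmdc:sty-11-1t150432"}
PLANT_MEDIA = {"leaf", "root matter", "portion of plant tissue"}


def curate_env_package(biosample: dict) -> Optional[str]:
    """Return a curated env_package, or the existing value if no rule applies.

    The keys 'study_id', 'env_medium_label' and 'env_package' are assumed names.
    """
    study = biosample.get("study_id")
    if study in STUDY_OVERRIDES:
        return STUDY_OVERRIDES[study]
    if study in PLANT_STUDIES and biosample.get("env_medium_label") in PLANT_MEDIA:
        return "plant-associated"
    return biosample.get("env_package")


example = {"study_id": "nmdc:sty-11-1t150432", "env_medium_label": "root matter"}
print(curate_env_package(example))
```

Keeping the rules in plain data structures makes it easy to review them separately from the model, and to see which labels were curated rather than predicted.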

                                                 precision    recall  f1-score   support

                                Host-associated       1.00      1.00      1.00        23
             hydrocarbon resources-fluids_swabs       1.00      0.83      0.91         6
miscellaneous natural or artificial environment       1.00      1.00      1.00        39
                               plant-associated       1.00      1.00      1.00       124
                                           soil       1.00      1.00      1.00       496
                                          water       0.98      1.00      0.99        57

                                       accuracy                           1.00       745
                                      macro avg       1.00      0.97      0.98       745
                                   weighted avg       1.00      1.00      1.00       745
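For readers who want to see the general shape of the approach, here is a minimal sketch using scikit-learn and SciPy. It is not the code behind the report above. The tiny is-a hierarchy is a toy stand-in for EnvO, only ancestors are expanded (descendants could be added the same way), only one slot is used, and the 0.3 test fraction, the number of trees and all function names are assumptions made for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Toy is-a edges (child -> parents). Real edges would come from EnvO.
PARENTS = {
    "forest soil": ["soil"],
    "agricultural soil": ["soil"],
    "soil": ["environmental material"],
    "lake water": ["water"],
    "sea water": ["water"],
    "water": ["environmental material"],
}


def with_ancestors(term):
    """Return the term plus all of its ancestors in the toy hierarchy."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, []))
    return seen


def build_matrix(samples, slots):
    """Build a multi-hot sparse matrix with one column per (slot, term) pair."""
    columns, rows, cols = {}, [], []
    for i, sample in enumerate(samples):
        for slot in slots:
            for term in with_ancestors(sample[slot]):
                j = columns.setdefault((slot, term), len(columns))
                rows.append(i)
                cols.append(j)
    data = np.ones(len(rows), dtype=np.int8)
    matrix = csr_matrix((data, (rows, cols)), shape=(len(samples), len(columns)))
    return matrix, columns


# The issue uses env_broad_scale, env_local_scale and env_medium; one slot keeps the toy small.
SLOTS = ["env_medium"]
soil_terms = ["forest soil", "agricultural soil"]
water_terms = ["lake water", "sea water"]
samples = [{"env_medium": t} for t in soil_terms * 10 + water_terms * 10]
labels = ["soil"] * 20 + ["water"] * 20

X, columns = build_matrix(samples, SLOTS)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.3, stratify=labels, random_state=0
)
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(classification_report(y_test, predictions))
```

Expanding each term to its ancestors lets samples annotated with different but related terms share feature columns, which is what allows the classifier to generalize across the hierarchy.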

@turbomam turbomam linked a pull request Sep 3, 2024 that will close this issue
@turbomam
Member Author

turbomam commented Sep 3, 2024

@mslarae13 @cmungall @sierra-moxon @aclum

nmdc-production-biosamples-env_package-predictions

env_package heterogeneity of studies

I can discuss this tomorrow, but it isn't core to making the value sets

@turbomam
Member Author

turbomam commented Sep 3, 2024

Makefile entrypoints

@turbomam turbomam reopened this Sep 3, 2024
@turbomam
Member Author

turbomam commented Sep 4, 2024

The NaN env_package values are all, or at least mostly, from GOLD.

@turbomam
Member Author

turbomam commented Sep 5, 2024

@cmungall may have a better linkml-store solution. That would be preferable because it wouldn't rely on custom code, and it would take more (or all?) Biosample slots into consideration.

@aclum

aclum commented Sep 10, 2024
