
Add workflow for evaluating predictions #12

Draft
cthoyt wants to merge 7 commits into main from evaluate-predictions

Conversation

@cthoyt (Member) commented Dec 5, 2023

Closes #7

This workflow takes three inputs:

  1. Positive, manually curated mappings
  2. Negative, manually curated mappings
  3. Predicted mappings

and estimates metrics such as accuracy, precision, recall, and $F_1$ for the predictions. These are only estimates of the true metrics: the manually curated positive and negative mappings are likely incomplete, so there is bias in which mappings were curated (e.g., I always curate the easiest first, which skews my manual curations toward positive calls).
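The estimation above can be sketched as follows. This is a minimal illustration, not the actual SeMRA implementation: mappings are modeled as `(subject, object)` pairs, and only predictions that appear in one of the curated sets contribute to the confusion matrix, since the status of the rest is unknown.

```python
# Hypothetical sketch (not the actual SeMRA API) of estimating metrics
# for predicted mappings against curated positive/negative sets.
def evaluate_predictions(positives, negatives, predictions):
    """Estimate accuracy, precision, recall, and F1 for predictions."""
    positives, negatives, predictions = set(positives), set(negatives), set(predictions)
    tp = len(predictions & positives)  # predicted, curated as correct
    fp = len(predictions & negatives)  # predicted, curated as incorrect
    fn = len(positives - predictions)  # curated correct, not predicted
    tn = len(negatives - predictions)  # curated incorrect, not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn) if tp + fp + fn + tn else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy example: one true positive, one false positive, one false negative.
metrics = evaluate_predictions(
    positives={("chebi:1", "mesh:a"), ("chebi:2", "mesh:b")},
    negatives={("chebi:3", "mesh:c")},
    predictions={("chebi:1", "mesh:a"), ("chebi:3", "mesh:c")},
)  # precision = recall = f1 = 0.5, accuracy = 1/3
```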

Why is this useful?

Organizers of mapping tool competitions don't have to keep writing their own evaluation infrastructure. The process is:

  1. Curate (or generate) the gold standard correct and incorrect mappings
  2. Ask the competitors to generate their predictions in SSSOM
  3. Load them into this function and get results
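For step 3, predictions arrive as SSSOM TSV files. As a hedged sketch (assuming a minimal file with only `subject_id`, `predicate_id`, and `object_id` columns, and ignoring the YAML metadata header that real SSSOM files carry, which the sssom-py library handles properly), loading one into a set of pairs could look like:

```python
# Illustrative only: parse a minimal SSSOM TSV into (subject, object) pairs.
# Real SSSOM files have a metadata header and many more columns.
import csv
import io

def load_sssom_pairs(tsv_text):
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {(row["subject_id"], row["object_id"]) for row in reader}

example = (
    "subject_id\tpredicate_id\tobject_id\n"
    "chebi:1\tskos:exactMatch\tmesh:a\n"
)
pairs = load_sssom_pairs(example)  # {("chebi:1", "mesh:a")}
```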

Demonstration

This also comes with a demonstration that compares a combination of first-party ontology curations and third-party Biomappings curations against lexical mapping predictions made by Gilda. It reports the following when applied to a small number of OBO Foundry ontologies.

| prefix | completion | accuracy | precision | recall | $F_1$ |
|--------|-----------|----------|-----------|--------|-------|
| chebi  | 10.8%     | 98.0%    | 98.8%     | 99.1%  | 99.0% |
| cl     | 28.3%     | 53.7%    | 90.8%     | 47.9%  | 62.7% |
| clo    | 52.6%     | 34.9%    | 70.0%     | 38.9%  | 50.0% |
| doid   | 30.1%     | 26.8%    | 92.2%     | 26.3%  | 40.9% |
| go     | 38.0%     | 80.0%    | 81.8%     | 96.8%  | 88.7% |
| maxo   | 44.6%     | 86.4%    | 100.0%    | 86.4%  | 92.7% |
| uberon | 6.3%      | 11.2%    | 98.5%     | 11.1%  | 20.0% |
| vo     | 66.4%     | 79.1%    | 91.7%     | 77.2%  | 83.8% |

Completion refers to the percentage of predicted mappings that appear in the curated sets (positive or negative). Higher completion reduces the impact of curation bias; at 100% completion, the metrics are unbiased.
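Continuing the sketch above (again with illustrative names, not the actual SeMRA API), completion is just the fraction of predictions covered by either curated set:

```python
# Sketch of the "completion" metric: the share of predicted mappings
# that appear in either curated set. Names are illustrative.
def completion(positives, negatives, predictions):
    curated = set(positives) | set(negatives)
    predictions = set(predictions)
    if not predictions:
        return 0.0
    return len(predictions & curated) / len(predictions)

c = completion(
    positives={("a", "x")},
    negatives={("b", "y")},
    predictions={("a", "x"), ("c", "z")},
)  # 1 of 2 predictions is curated -> 0.5
```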

Note that lexical matching has fairly high precision (most of the predictions it makes are correct) but is more prone to false negatives, so accuracy varies. Some observations:

  • This leads to the DOID accuracy being pretty low.
  • ChEBI has no curations outside of Biomappings, so the number of false negatives is zero, meaning that the accuracy is a less useful metric (TBD, how to communicate that in the table).
  • CLO has a large number of duplicate terms, which results in an artificially low precision.

Caution

Mapping shouldn't be a competition. Make your predictions, curate them, contribute them to Biomappings or directly upstream, then everyone benefits and we don't have to keep playing this game.

@matentzn commented Dec 5, 2023

Wow this is such a cool idea.. Awesome man!

@codecov bot commented May 2, 2024

Codecov Report

Attention: Patch coverage is 0%, with 98 lines in your changes missing coverage. Please review.

❗ No coverage uploaded for pull request base (main@8d1d4b4).

| Files | Patch % | Lines |
|-------|---------|-------|
| src/semra/evaluate_prediction.py | 0.00% | 98 Missing ⚠️ |
Additional details and impacted files
@@           Coverage Diff           @@
##             main      #12   +/-   ##
=======================================
  Coverage        ?   28.57%           
=======================================
  Files           ?       32           
  Lines           ?     2390           
  Branches        ?      488           
=======================================
  Hits            ?      683           
  Misses          ?     1666           
  Partials        ?       41           

☔ View full report in Codecov by Sentry.



Development

Successfully merging this pull request may close these issues.

Automated evaluation of predicted mappings

2 participants