A Python script for processing taxonomic data and generating properly formatted taxonomy request files for the European Nucleotide Archive (ENA). This tool specialises in handling cases where species-level taxonomy IDs are not available. To support requests to ENA, it uses IDs from the GBIF Backbone Taxonomy, fetched via the GBIF Species API.
- Processes taxonomic metadata from CSV files
- Validates scientific names against GBIF taxonomy
- Implements hierarchical fallback for taxonomic identification: uses the species name if available, otherwise falls back to "Genus sp. {process_id}" if only the genus is available, or "Family sp. {process_id}" if only the family is available
- Performs taxonomic rank validation against GBIF: checks name existence and spelling, taxonomic status (accepted/synonym), taxonomy at higher ranks (order and class), match confidence (>95% for species, >90% for genus), and match type (exact/fuzzy), as per GBIF guidelines
- Handles synonyms and taxonomic updates
- Generates ENA-compliant request files
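The hierarchical fallback described in the list above can be sketched roughly as follows. This is a minimal illustration and not the script's actual code: the function names `proposed_name` and `_clean` are hypothetical, and treating `None`, empty strings, and `nan` as "not available" is an assumption about how missing values appear in the CSV.

```python
from typing import Optional


def _clean(value) -> str:
    """Return a stripped string; None and 'nan' count as empty (assumption)."""
    if value is None:
        return ""
    text = str(value).strip()
    return "" if text.lower() == "nan" else text


def proposed_name(process_id, species=None, genus=None, family=None) -> Optional[str]:
    """Pick the most specific name available for one record.

    Hypothetical helper: species name first, then "Genus sp. {process_id}",
    then "Family sp. {process_id}", and None if no rank is available.
    """
    species, genus, family = _clean(species), _clean(genus), _clean(family)
    if species:
        return species
    if genus:
        return f"{genus} sp. {process_id}"
    if family:
        return f"{family} sp. {process_id}"
    return None
```

For example, a record with only a genus, such as `proposed_name("BSNHM191-24", genus="Papomyia")`, yields a provisional name in the "Genus sp. {process_id}" form.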
- Python 3.6+
- Required Python packages:
- pandas
- pygbif
- logging (part of the Python standard library; no installation needed)
- Clone this repository:
git clone [repository-url]
- Install the required packages (pandas and pygbif):
pip install pandas pygbif
python ena_taxonomy_request.py path/to/sample_metadata.csv path/to/samples.csv output_prefix
- metadata.csv: Contains columns:
- Process ID
- phylum
- class
- order
- family
- genus
- species
- matched_rank
- taxid
- samples.csv: Contains columns:
- ID (i.e. Process ID)
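Before running the script, it can help to confirm that both input files carry the columns listed above. The sketch below is one possible pre-flight check, not part of the script itself: `missing_columns` is a hypothetical helper, and it uses the standard-library `csv` module (rather than pandas) only to keep the example dependency-free.

```python
import csv
import io

# Column names taken from the lists above.
METADATA_COLUMNS = [
    "Process ID", "phylum", "class", "order", "family",
    "genus", "species", "matched_rank", "taxid",
]
SAMPLES_COLUMNS = ["ID"]


def missing_columns(handle, required):
    """Return the required column names absent from a CSV header row.

    `handle` is any open text file or file-like object (hypothetical helper).
    """
    header = next(csv.reader(handle), [])
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]


# Usage with an in-memory file; a real run would pass open("path/to/file.csv").
example = io.StringIO("Process ID,phylum,class,order,family,genus,species\n")
print(missing_columns(example, METADATA_COLUMNS))
```

In this example the header lacks `matched_rank` and `taxid`, so those two names are reported as missing.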
The script generates several output files with the specified prefix:
- {prefix}_taxonomy_request.tsv: Main output file formatted for ENA submission
- {prefix}_tax_validation_fails.csv: Records that failed taxonomic validation
- {prefix}_gbif_inconsistent.tsv: Records with GBIF inconsistencies (synonyms, etc.)
- {prefix}.log: Detailed processing log
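The naming pattern above can be expressed as a small helper when post-processing the results. This is only a sketch: `output_paths` and its dictionary keys are hypothetical names, and the file suffixes are copied from the list above.

```python
from pathlib import Path


def output_paths(prefix):
    """Map a short label to each output file expected for a given prefix."""
    prefix = str(prefix)
    return {
        "request": Path(f"{prefix}_taxonomy_request.tsv"),
        "validation_fails": Path(f"{prefix}_tax_validation_fails.csv"),
        "gbif_inconsistent": Path(f"{prefix}_gbif_inconsistent.tsv"),
        "log": Path(f"{prefix}.log"),
    }


# List which of the expected files exist after a run with prefix "output_prefix".
for label, path in output_paths("output_prefix").items():
    print(label, path, path.exists())
```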
proposed_name | name_type | host | project_id | description |
---|---|---|---|---|
Apatania stylata | published_name | BGE | https://www.gbif.org/species/[GBIF ID] | |
Agapetus iridipennis | published_name | BGE | https://www.gbif.org/species/[GBIF ID] | |
Papomyia sp. BSNHM191-24 | novel_species | BGE | https://www.gbif.org/species/[GBIF ID] | |
usageKey | scientificName | canonicalName | rank | status | confidence | matchType | kingdom | phylum | order | family | genus | species | kingdomKey | phylumKey | classKey | orderKey | familyKey | genusKey | speciesKey | synonym | class | index | acceptedUsageKey |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
8753555 | Erotesis melanella McLachlan, 1884 | Erotesis melanella | SPECIES | SYNONYM | 98 | EXACT | Animalia | Arthropoda | Trichoptera | Leptoceridae | Adicella | Adicella melanella | 1 | 54 | 216 | 1003 | 4395 | 1436670 | 1436745 | True | Insecta | 5 | 1436745 |
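A record such as the one above can be screened with a few simple checks on the GBIF match fields. The sketch below is an assumption about how such a check might look, not the script's actual logic: `match_issues`, `MIN_CONFIDENCE`, and the issue labels are hypothetical, the thresholds follow the ">95% for species, >90% for genus" rule stated earlier, and flagging every non-exact match and every non-accepted status is a simplification.

```python
# Minimum confidence that must be exceeded, per rank (from the feature list).
MIN_CONFIDENCE = {"SPECIES": 95, "GENUS": 90}


def match_issues(match):
    """Return a list of issue labels for one GBIF match record (a dict).

    Field names (matchType, status, rank, confidence) follow the columns
    of the table above. An empty list means no issue was detected.
    """
    match_type = str(match.get("matchType", "NONE")).upper()
    if match_type == "NONE":
        return ["no_match"]

    issues = []
    if match_type != "EXACT":
        issues.append("not_exact")
    if str(match.get("status", "")).upper() != "ACCEPTED":
        issues.append("not_accepted")

    # Ranks without a listed threshold are not confidence-checked (assumption).
    threshold = MIN_CONFIDENCE.get(str(match.get("rank", "")).upper())
    if threshold is not None and match.get("confidence", 0) <= threshold:
        issues.append("low_confidence")
    return issues


row = {
    "usageKey": 8753555, "canonicalName": "Erotesis melanella",
    "rank": "SPECIES", "status": "SYNONYM", "confidence": 98,
    "matchType": "EXACT", "acceptedUsageKey": 1436745,
}
print(match_issues(row))
```

For the example row, only the synonym status is flagged, since the match is exact and its confidence of 98 exceeds the species threshold.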
- Dan Parsons @NHMUK