Skip to content

python script to create ENA submission form for generation of new species-level taxonomics IDs

License

Notifications You must be signed in to change notification settings

SchistoDan/ena-taxid-creation

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

26 Commits
 
 
 
 
 
 

Repository files navigation

ENA Taxonomy Request File Generator

A Python script for processing taxonomic data and generating properly formatted taxonomy request files for the European Nucleotide Archive (ENA). This tool specialises in handling cases where species-level taxonomy IDs are not available, and uses IDs from the GBIF Backbone taxonomy, fetched using the GBIF Species API to support requests to ENA.

Features

  • Processes taxonomic metadata from CSV files
  • Validates scientific names against GBIF taxonomy
  • Implements hierarchical fallback for taxonomic identification (uses species name if available, otherwise falls back to "Genus sp. {process_id}" if only genus is available, or "Family sp. {process_id}" if only family is available.
  • Performs taxonomic rank validation against GBIF (checks name existence and spelling, taxonomic status (accepted/synonym), validate taxonomy at higher ranks (order and class), and match confidence (>95% for species / >90% for genus) and match type (exact/fuzzy) as per GBIF guidlines.
  • Handles synonyms and taxonomic updates
  • Generates ENA-compliant request files

Prerequisites

  • Python 3.6+
  • Required Python packages:
    • pandas
    • pygbif
    • logging

Installation

  1. Clone this repository:
git clone [repository-url]
  1. Install required packages, e.g. pandas and pygbif
pip install pandas pygbif

Usage

python ena_taxonomy_request.py path/to/sample_metadata.csv path/to/samples.csv output_prefix

Input files

  • metadata.csv: Contains columns:

    • Process ID
    • phylum
    • class
    • order
    • family
    • genus
    • species
    • matched_rank
    • taxid
  • samples.csv: Contains columns:

    • ID (i.e. Process ID)

Output Files

The script generates several output files with the specified prefix:

  • {prefix}_taxonomy_request.tsv: Main output file formatted for ENA submission
  • {prefix}_tax_validation_fails.csv: Records that failed taxonomic validation
  • {prefix}_gbif_inconsistent.tsv: Records with GBIF inconsistencies (synonyms, etc.)
  • {prefix}.log: Detailed processing log

Example {prefix}_taxonomy_request.tsv

proposed_name name_type host project_id description
Apatania stylata published_name BGE https://www.gbif.org/species/[GBIF ID]
Agapetus iridipennis published_name BGE https://www.gbif.org/species/[GBIF ID]
Papomyia sp. BSNHM191-24 novel_species BGE https://www.gbif.org/species/[GBIF ID]

Example {prefix}_gbif_inconsistent.tsv

usageKey scientificName canonicalName rank status confidence matchType kingdom phylum order family genus species kingdomKey phylumKey classKey orderKey familyKey genusKey speciesKey synonym class index acceptedUsageKey
8753555 Erotesis melanella McLachlan, 1884 Erotesis melanella SPECIES SYNONYM 98 EXACT Animalia Arthropoda Trichoptera Leptoceridae Adicella Adicella melanella 1 54 216 1003 4395 1436670 1436745 True Insecta 5 1436745
  • Erotesis melanella McLachlan, 1884 == 8753555
  • Adicella melanella (McLachlan, 1884) == 1436745

Authors

  • Dan Parsons @NHMUK

About

python script to create ENA submission form for generation of new species-level taxonomics IDs

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages