Detailed reference for ToxProt25 analysis commands. For quick start and data processing, see the README.
The analysis pipeline generates figures and statistics from processed ToxProt data, enabling decade-based comparisons (2005 → 2015 → 2025) and full timeline trend analysis.
data/processed/toxprot/
│
├── toxprot_2005.csv
├── toxprot_2015.csv
└── toxprot_2025.csv
│
▼
toxprot analysis <command>
│
▼
figures/
├── summary statistics
├── taxonomic trends
├── protein families
├── sequence lengths
├── habitat distributions
├── tissue sources
├── PTM frequencies
├── GO term coverage
├── protein evidence
└── definition comparisons
All analysis commands inherit these options from the `toxprot analysis` group.

The `-d` option controls which ToxProt entries are included in the analysis:

| Value | Description |
|---|---|
| `venom_tissue` | Entries with venom tissue annotation (default) |
| `kw_toxin` | Entries with toxin keyword (KW-0800) |
| `both_only` | Only entries matching both criteria |
| `all` | All entries (venom tissue OR toxin keyword) |

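
The four values reduce to boolean combinations of two per-entry flags. The sketch below illustrates that logic only. The field names `venom_tissue` and `kw_toxin` and the `filter_entries` helper are assumptions made for the example, not the actual column names or functions in the toxprot code.

```python
# Hypothetical per-entry flags; the real CSV column names may differ.
DEFINITION_FILTERS = {
    "venom_tissue": lambda e: e["venom_tissue"],
    "kw_toxin": lambda e: e["kw_toxin"],
    "both_only": lambda e: e["venom_tissue"] and e["kw_toxin"],
    "all": lambda e: e["venom_tissue"] or e["kw_toxin"],
}


def filter_entries(entries, definition="venom_tissue"):
    """Return the entries selected by one of the four definition values."""
    keep = DEFINITION_FILTERS[definition]
    return [e for e in entries if keep(e)]


# Made-up entries covering the three possible flag combinations.
entries = [
    {"accession": "A", "venom_tissue": True, "kw_toxin": True},
    {"accession": "B", "venom_tissue": True, "kw_toxin": False},
    {"accession": "C", "venom_tissue": False, "kw_toxin": True},
]

for name in DEFINITION_FILTERS:
    print(name, [e["accession"] for e in filter_entries(entries, name)])
```

Keeping the rules in one dictionary makes it easy to check that `both_only` is the intersection and `all` is the union of the two single-criterion sets.
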
toxprot analysis summary # Uses venom_tissue (default)
toxprot analysis -d all summary # Include all ToxProt entries
toxprot analysis -d kw_toxin taxa # Only toxin keyword entries

| Option | Default | Description |
|---|---|---|
| `--data-dir` | `data/processed/toxprot/` | Input directory with CSVs |
| `--output-dir` | `figures/` | Output directory for figures |

All commands write their output to `figures/` unless `--output-dir` is set.

| Command | Years | Description |
|---|---|---|
| `pipeline` | All | Run all analyses sequentially |
| `summary` | 2005, 2015, 2025 | Dataset statistics table |
| `taxa` | 2005–2025 | Taxonomic distribution trends |
| `families` | 2005, 2015, 2025 | Protein family distributions |
| `length` | 2005, 2015, 2025 | Sequence length histograms |
| `habitat` | 2005, 2015, 2025 | Terrestrial vs marine habitats |
| `source-tissue` | 2005, 2015, 2025 | Tissue annotation evolution |
| `ptm` | 2005, 2015, 2025 | PTM frequency analysis |
| `go` | 2005–2025 | GO term coverage and depth |
| `protein-evidence` | 2008, 2015, 2025 | Evidence level transitions |
| `definitions` | 2025 | Selection criteria comparison |
| `protspace` | 2025 | Protein embedding analysis |

Runs all analysis commands sequentially with default parameters. Useful for regenerating all figures after data updates.
toxprot analysis pipeline # Run all analyses
toxprot analysis -d all pipeline # Use all entries (not just venom_tissue)
toxprot analysis pipeline --data-dir custom/ # Custom input directory

Output: All figure directories populated under figures/.
Generates a statistics table comparing key metrics across decade snapshots: entry counts, species diversity, taxonomic coverage, and annotation completeness.
toxprot analysis summary
toxprot analysis -d all summary
toxprot analysis summary -o figures/custom

Output: figures/dataset_summary_statistics.png
Metrics included:
- Total entries
- Unique protein families
- Missing protein family annotations (count & %)
- Fragment entries (count & %)
- PTM annotations (count & %)
- Toxic dose annotations (count & %)
- Species count
- Order count
Analyzes taxonomic composition across all 21 years. Generates trend plots showing top taxa evolution and alluvial diagrams identifying "newcomers" — taxonomic groups appearing in later releases.
toxprot analysis taxa # Default: Order level
toxprot analysis taxa --level Family # Family-level analysis
toxprot analysis taxa -l all # Generate all taxonomic levels
toxprot analysis taxa --skip-trend # Skip trend plot, only alluvial

Options:

| Option | Default | Description |
|---|---|---|
| `--level` | `Order` | Taxonomic level: Phylum, Class, Order, Family, or all |
| `--skip-trend` | False | Skip generating trend plot |

Output: figures/
- `top_taxa_trend.png`: Top 5 orders over all 21 years with silhouettes (Squamata/Cobra, Araneae/Spider, Neogastropoda/Conus, Scorpiones/Scorpion, Hymenoptera)
- `taxa_newcomers_alluvial_{level}.png`: Decade-step flow diagram (2005→2015→2025)

Notes:
- "Newcomers" are taxa present in a later year but absent in an earlier year
- Trend plot uses all years 2005-2025; alluvial uses decade snapshots
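
The newcomer definition above is a set difference between two snapshots. The sketch below shows that idea with placeholder taxon names. The `newcomers` helper and the sample data are illustrative assumptions, not the project's implementation or real ToxProt contents.

```python
def newcomers(earlier_taxa, later_taxa):
    """Taxa present in the later snapshot but absent from the earlier one."""
    return sorted(set(later_taxa) - set(earlier_taxa))


# Placeholder taxon names per decade snapshot (not real data).
orders_by_year = {
    2005: ["order_a", "order_b"],
    2015: ["order_a", "order_b", "order_c"],
    2025: ["order_a", "order_b", "order_c", "order_d"],
}

years = sorted(orders_by_year)
for earlier, later in zip(years, years[1:]):
    found = newcomers(orders_by_year[earlier], orders_by_year[later])
    print(f"{earlier} -> {later}: {found}")
```

Comparing consecutive snapshots pairwise matches the decade-step layout of the alluvial diagram (2005→2015, then 2015→2025).
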
Compares protein family distributions across decade snapshots. Shows which toxin families dominate and how their relative abundance changes over time.
toxprot analysis families
toxprot analysis families --top-n 15 # Show top 15 families
toxprot analysis -d all families

Options:

| Option | Default | Description |
|---|---|---|
| `--top-n` | `10` | Number of top families to display |

Output: figures/top_families_alluvial.png — Alluvial plot showing rank changes
Notes:
- Family names are normalized across years using 80+ mappings (e.g., "Snake toxin family" → "Snake three-finger toxin family")
- This enables accurate cross-year comparison despite naming changes
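
Name normalization of this kind can be pictured as a lookup table applied before counting. The sketch below uses the single mapping quoted in the note above. The `FAMILY_NAME_MAP` and `normalize_family` names are hypothetical, and the real mapping table (80+ entries) is not reproduced here.

```python
from collections import Counter

# One mapping taken from the note above; the real table has 80+ entries.
FAMILY_NAME_MAP = {
    "Snake toxin family": "Snake three-finger toxin family",
}


def normalize_family(name):
    """Map a historical family name to its current form; pass others through."""
    cleaned = name.strip()
    return FAMILY_NAME_MAP.get(cleaned, cleaned)


# Old and new spellings collapse into one bucket after normalization.
raw_names = ["Snake toxin family", "Snake three-finger toxin family"]
print(Counter(normalize_family(n) for n in raw_names))
```

Without this step, the same family would show up as two separate flows in a cross-year comparison.
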
Generates sequence length distribution histograms comparing 2005, 2015, and 2025 datasets. Useful for understanding how the size profile of characterized toxins has evolved.
toxprot analysis length
toxprot analysis -d all length
toxprot analysis length -o figures/custom

Output: figures/sequence_length_distribution.png
Visualization:
- Overlaid histograms (not stacked) with 25 AA bins
- Bins: 1-25, 26-50, 51-75, ..., 276-300, 301+
- Latest year plotted at back for visibility
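
The binning scheme listed above (25 AA wide, starting at 1, with an open-ended final bin) can be written as a small labelling function. This is a sketch of the arithmetic only. The `length_bin` name and its parameters are assumptions, not the project's code.

```python
def length_bin(length, width=25, cap=300):
    """Return the histogram bin label for a sequence length in amino acids."""
    if length < 1:
        raise ValueError("sequence length must be positive")
    if length > cap:
        return f"{cap + 1}+"
    # Bins start at 1, so shift by one before integer division.
    start = ((length - 1) // width) * width + 1
    return f"{start}-{start + width - 1}"


for n in (1, 25, 26, 300, 301):
    print(n, length_bin(n))
```

The shift by one before the integer division is what keeps lengths such as 25 and 300 in the bins that end on them rather than in the next bin.
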
Analyzes terrestrial vs marine habitat distributions. Creates a two-panel figure showing taxa by habitat and protein family evolution across environments.
toxprot analysis habitat
toxprot analysis habitat --top-n 20 # Show top 20 families per habitat
toxprot analysis -d all habitat

Options:

| Option | Default | Description |
|---|---|---|
| `--top-n` | `15` | Number of top families per habitat |

Output: figures/habitat.png
Panel A (Taxa by Habitat):
- Left: Stacked bars for Entries, Species, Protein Families (terrestrial/marine/shared)
- Right: Breakdown of exclusive vs shared family contributions
- Flow connections showing data distribution
Panel B (Dual-Habitat Families):
- Diverging horizontal bar chart (marine LEFT, terrestrial RIGHT)
- Top 15 families appearing in BOTH habitats
- Three color shades per habitat showing 2005→2015→2025 evolution
Tracks source tissue annotations over time. Shows which tissues (venom gland, skin secretion, etc.) are most commonly annotated and how this changes.
toxprot analysis source-tissue
toxprot analysis source-tissue --top-n 5 # Show top 5 tissues
toxprot analysis -d all source-tissue

Options:

| Option | Default | Description |
|---|---|---|
| `--top-n` | `10` | Number of top tissues to display |

Output: figures/source_tissue_alluvial.png
Notes:
- Tissues are exploded from semicolon-separated values in the source data
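
Exploding here means turning one semicolon-separated field into one item per tissue before counting. The sketch below shows that step with the standard library. The `explode_tissues` helper and the sample values are illustrative assumptions.

```python
from collections import Counter


def explode_tissues(raw_values):
    """Split semicolon-separated tissue strings into one item per tissue."""
    tissues = []
    for raw in raw_values:
        for part in (raw or "").split(";"):
            part = part.strip()
            if part:  # skip empty fragments from trailing semicolons
                tissues.append(part)
    return tissues


# Made-up field values, including empty and missing annotations.
raw = ["Venom gland", "Venom gland; Skin", "", None, "Skin;"]
counts = Counter(explode_tissues(raw))
print(counts.most_common())
```

An entry annotated with several tissues therefore contributes once to each of them. With pandas, `Series.str.split(";")` followed by `explode()` expresses the same operation.
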
Analyzes post-translational modification frequencies across decade snapshots. Shows which PTMs are most common in toxins and how annotation completeness has improved.
toxprot analysis ptm
toxprot analysis ptm --years 2010,2020 # Custom year comparison
toxprot analysis -d all ptm

Options:

| Option | Default | Description |
|---|---|---|
| `--years` | `2005,2015,2025` | Comma-separated years to compare |

Data source: PTM annotations extracted from UniProt XML `<feature>` elements:
- Feature types: `modified residue`, `glycosylation site`, `disulfide bond`, `cross-link`, `lipid moiety-binding region`
- Descriptions resolved using UniProt's `ptmlist.txt`
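
As a rough illustration of the extraction step, the sketch below counts `<feature>` elements of the listed types with the standard library XML parser. The `count_ptm_features` function and the sample XML are assumptions made for the example. Resolving descriptions against `ptmlist.txt` is not shown.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace used by UniProt XML entry files.
NS = {"up": "http://uniprot.org/uniprot"}
PTM_FEATURE_TYPES = {
    "modified residue",
    "glycosylation site",
    "disulfide bond",
    "cross-link",
    "lipid moiety-binding region",
}


def count_ptm_features(xml_text):
    """Count PTM-related <feature> elements by type across all entries."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for feature in root.iterfind(".//up:feature", NS):
        ftype = feature.get("type")
        if ftype in PTM_FEATURE_TYPES:
            counts[ftype] += 1
    return counts


# Tiny hand-written sample in the UniProt namespace, not a real entry.
SAMPLE = """<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <accession>X00000</accession>
    <feature type="signal peptide"/>
    <feature type="disulfide bond"/>
    <feature type="disulfide bond"/>
    <feature type="modified residue" description="Pyrrolidone carboxylic acid"/>
  </entry>
</uniprot>"""

print(count_ptm_features(SAMPLE))
```

Features of other types, such as the signal peptide in the sample, are ignored because they are not in the PTM type set.
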
Output: figures/ptm_overview.png
Panel A: Horizontal bar chart showing PTM type frequency, sorted by 2025 count
Panel B: 2×3 grid of histograms showing per-protein PTM count distributions (1-9, 10+) for top 6 types
PTM types tracked: Disulfide bond, Amidation, Glycosylation, Hydroxylation, Pyrrolidone carboxylic acid, Gamma-carboxyglutamic acid, D-amino acid, Bromination, Sulfation, Lipidation
Analyzes Gene Ontology term distributions across the full timeline. Shows GO term coverage, annotation depth, and category trends (molecular function, biological process, cellular component).
toxprot analysis go
toxprot analysis go --top-n 10 # Show top 10 GO terms
toxprot analysis -d all go

Options:

| Option | Default | Description |
|---|---|---|
| `--top-n` | `5` | Number of top GO terms to display |

Output: figures/go_terms_overview.png
Panels:
- Panel A: Total GO term annotations over time (3 lines: MF, BP, CC)
- Panel B: GO category coverage percentages (entries with ≥1 annotation)
- Panels C-E: Top 5 GO terms per category with evolution trends
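
The coverage figure in Panel B is the share of entries with at least one annotation in a category. The sketch below shows that calculation. The per-entry layout (a dict of GO ID lists keyed by `MF`, `BP`, `CC`) and the `go_coverage` helper are assumptions for illustration.

```python
def go_coverage(entries, categories=("MF", "BP", "CC")):
    """Percentage of entries with at least one GO term, per category."""
    total = len(entries)
    coverage = {}
    for cat in categories:
        annotated = sum(1 for e in entries if e.get(cat))
        coverage[cat] = 100.0 * annotated / total if total else 0.0
    return coverage


# Hypothetical layout: one dict per entry, GO IDs grouped by category.
entries = [
    {"MF": ["GO:0090729"], "BP": [], "CC": ["GO:0005576"]},
    {"MF": ["GO:0090729"], "BP": [], "CC": []},
    {"MF": [], "BP": [], "CC": ["GO:0005576"]},
    {"MF": ["GO:0090729"], "BP": [], "CC": ["GO:0005576"]},
]

print(go_coverage(entries))
```

Coverage counts entries, not annotations, so an entry with ten terms in a category weighs the same as an entry with one.
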
Notes:
- Uses GO hierarchy (`go-basic.obo`) for term relationships
- Analyzes all 21 years (2005-2025), not just decade snapshots
Tracks protein existence (PE) level transitions over time. Shows how evidence quality has improved as more toxins receive experimental validation.
toxprot analysis protein-evidence
toxprot analysis -d all protein-evidence
toxprot analysis protein-evidence -o figures/custom

Years: Uses 2008, 2015, 2025 (not 2005) because PE levels were introduced in UniProt 10.0 (March 2007).
PE Levels:
- PE 1: Evidence at protein level (experimental)
- PE 2: Evidence at transcript level
- PE 3: Inferred from homology
- PE 4: Predicted
- PE 5: Uncertain
Output: figures/protein_evidence_sankey.png
Alluvial diagram showing:
- PE category bars at each time point
- Flows between categories showing transitions
- "Removed" intermediate nodes showing proteins dropped from dataset
- Which PE categories lost the most proteins over time
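
The flows in such a diagram amount to counting (earlier level, later level) pairs per accession, with a placeholder for entries that disappear. The sketch below shows that counting step. The `pe_transitions` helper and the accessions are made up for illustration, and entries that first appear in the later snapshot are not handled.

```python
from collections import Counter


def pe_transitions(earlier, later):
    """Count PE-level transitions between two snapshots keyed by accession.

    Entries missing from the later snapshot are counted as "Removed".
    """
    flows = Counter()
    for accession, level in earlier.items():
        flows[(level, later.get(accession, "Removed"))] += 1
    return flows


# Made-up accessions and levels, for illustration only.
pe_2008 = {"T1": "Predicted", "T2": "Inferred from homology", "T3": "Predicted"}
pe_2015 = {"T1": "Evidence at protein level", "T2": "Inferred from homology"}

for (src, dst), n in pe_transitions(pe_2008, pe_2015).items():
    print(f"{src} -> {dst}: {n}")
```

Summing the flows that end in "Removed" per source level gives the categories that lost the most proteins between two time points.
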
Compares the two ToxProt selection criteria (venom tissue vs toxin keyword) for the 2025 dataset. Shows overlap and entries unique to each criterion.
toxprot analysis definitions
toxprot analysis definitions --year 2024 # Analyze different year
toxprot analysis definitions -o figures/custom

Options:

| Option | Default | Description |
|---|---|---|
| `--year` | `2025` | Year of dataset to use |

Output: figures/definition_comparison.png
Single-panel Venn→phylum→order flow figure showing entry-level criteria (left), phyla (centre), and orders (right). The Venn diagram shows overlap between venom tissue specificity and toxin keyword criteria, with flows through taxonomic levels.
figures/
├── dataset_summary_statistics.png
├── definition_comparison.png
├── go_terms_overview.png
├── habitat.png
├── protein_evidence_sankey.png
├── protspace_silhouette_comparison.csv
├── protspace_silhouette_comparison.png
├── ptm_overview.png
├── sequence_length_distribution.png
├── source_tissue_alluvial.png
├── taxa_newcomers_alluvial_order.png
├── top_families_alluvial.png
└── top_taxa_trend.png
Protein language model (ProtT5) embeddings enable 2D visualization of toxin relationships. Three variants are compared:

| Variant | Description | Silhouette Score |
|---|---|---|
| `full` | Full-length sequences | 0.262 |
| `mature` | Signal peptides removed | 0.397 |
| `mature_clean` | Mature sequences, fragments excluded | 0.474 |

Removing signal peptides and fragments improves clustering quality (silhouette: 0.262 → 0.474).
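
For readers unfamiliar with the metric, the sketch below computes a mean silhouette coefficient with Euclidean distance on toy 2D points. It is a plain-Python illustration of the standard definition, not the project's code. Which grouping labels and distance the `silhouette` subcommand uses is not stated here, so both are assumptions in this example. In practice a library routine such as scikit-learn's `silhouette_score` would be the usual choice.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient using Euclidean distance."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")

    def mean_dist(i, members):
        others = [j for j in members if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    scores = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        a = mean_dist(i, own)  # cohesion: mean distance within own cluster
        b = min(mean_dist(i, m) for lab, m in clusters.items() if lab != label)
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy coordinates: two tight groups far apart.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(silhouette_score(points, [0, 0, 1, 1]))  # labels match the geometry
print(silhouette_score(points, [0, 1, 0, 1]))  # labels cut across the groups
```

Scores near 1 mean points sit close to their own group and far from others, and negative scores mean the labels cut across the geometry. On that scale, a rise from 0.262 to 0.474 indicates tighter, better separated groups.
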

| Subcommand | Description |
|---|---|
| `generate-fasta` | Create FASTA files for embedding generation |
| `prepare` | Prepare metadata and filter H5 files |
| `run-umap` | Run UMAP dimensionality reduction |
| `silhouette` | Analyze clustering quality |
| `pipeline` | Run all steps (except Colab) |

UMAP visualizations are created manually via protspace.app.
See the ProtSpace Guide for the full pipeline and workflow.
- Data Processing Guide — Download, parse, and clean pipeline
- UniProt Release History — Swiss-Prot release versions
- ProtSpace Guide — Protein embedding analysis pipeline