This repository contains a list of datasets with annotated marine/freshwater imagery and the scripts we used to process, clean and aggregate them to create the Community Fish Detection (CFD) Dataset, which we used to train the Community Fish Detector.
This effort was supported by the following folks: Filippo Varini, Dan Morris, Kevin Barnard, Laura Chrobak, Oceane Boulais, Alexander Merdian-Tarko, Devi Ayyagari, Sonny Burniston, Mona Dhiflaoui, Joshua Chen
Email Filippo if anything seems off, or if you know of datasets we're missing.
We welcome contributions! If you know of an underwater fish dataset that isn't listed here, you can help by processing it and submitting a pull request. Check the Unprocessed datasets section at the bottom of this README for datasets we already know about but haven't processed yet — picking one from that list is a great way to start.
A dataset must meet the following requirements to be included:
- Underwater imagery — images must be captured below the water surface (above-water and aerial fish images are rejected)
- Contains fish — the dataset must include annotations (bounding boxes or segmentation masks) on fish or fish-like organisms
- Publicly available — the data must be downloadable without requiring special access or paid subscriptions
All datasets are normalized to a common format before merging. Your processing script must apply the following:
- Single category: all fish-related annotations must be compressed into a single category, `{"id": 1, "name": "fish"}`. Regardless of how many species or sub-categories the source dataset has, we merge them all into one. Use `compress_annotations_to_single_category()` from `datasets/utils/`.
- 1-indexed annotations: COCO category and annotation IDs must be 1-indexed (not 0-indexed). Use `convert_coco_annotations_from_0_indexed_to_1_indexed()` if needed.
- Filter out non-fish categories: if the source dataset contains non-fish categories (e.g. coral, crab, starfish), filter them out and keep only fish-related annotations. Define a `CATEGORIES_FILTER` list in your script to specify which source categories to keep.
- Prefix image filenames: all image filenames must be prefixed with the dataset shortname (e.g. `noaa_puget_000001.jpg`) to avoid filename collisions when datasets are merged. Use `add_dataset_shortname_prefix_to_image_names()`.
- Train/val split: split the dataset into training and validation sets. When possible, split by location, camera, video, or deployment rather than by random image selection. Use `split_coco_dataset_into_train_validation()`.
- COCO output format: the final annotations must be in COCO format with bounding boxes.
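As a rough illustration of what the filtering, single-category, 1-indexing, and prefixing rules do to a COCO dictionary, here is a self-contained sketch in plain Python. It does not call the helpers in `datasets/utils/`, whose signatures are not shown in this README. `normalize_coco`, the `rockfish`/`crab` categories, and the sample data are hypothetical names made up for this example; the shared helpers remain the way to apply the rules in a real script.

```python
import copy

FISH_CATEGORY = {"id": 1, "name": "fish"}


def normalize_coco(coco, shortname, categories_filter=None):
    """Hypothetical helper: return a copy of a COCO-style dict with the
    filter, single-category, 1-indexing and filename-prefix rules applied."""
    coco = copy.deepcopy(coco)

    # Keep only annotations whose source category is in the filter
    # (None means every source category counts as fish).
    id_to_name = {c["id"]: c["name"] for c in coco["categories"]}
    kept = [
        ann for ann in coco["annotations"]
        if categories_filter is None
        or id_to_name[ann["category_id"]] in categories_filter
    ]

    # Renumber annotation IDs from 1 and collapse every kept annotation
    # into the single fish category.
    for new_id, ann in enumerate(kept, start=1):
        ann["id"] = new_id
        ann["category_id"] = FISH_CATEGORY["id"]

    # Prefix filenames with the dataset shortname. This assumes flat
    # filenames with no sub-directories.
    for image in coco["images"]:
        image["file_name"] = f"{shortname}_{image['file_name']}"

    coco["annotations"] = kept
    coco["categories"] = [dict(FISH_CATEGORY)]
    return coco


# Made-up, 0-indexed source data with one fish and one non-fish category.
source = {
    "images": [{"id": 0, "file_name": "000001.jpg", "width": 640, "height": 480}],
    "categories": [{"id": 0, "name": "rockfish"}, {"id": 1, "name": "crab"}],
    "annotations": [
        {"id": 0, "image_id": 0, "category_id": 0, "bbox": [10, 20, 30, 40]},
        {"id": 1, "image_id": 0, "category_id": 1, "bbox": [50, 60, 20, 20]},
    ],
}

result = normalize_coco(source, "noaa_puget", categories_filter=["rockfish"])
print(result["images"][0]["file_name"])  # noaa_puget_000001.jpg
```

Working on a deep copy keeps the source annotations untouched, which makes it easier to compare before and after when debugging a new dataset.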
If you've come across a dataset that matches the acceptance criteria above but don't have the time or experience to write a processing script, you can still help:
- Add it to the Unprocessed datasets list at the bottom of this README via a pull request, or
- Email it to Filippo and we'll add it to the list
- Install dependencies: `pip install -r requirements.txt`
- Create a script at `datasets/<dataset_name>.py` that follows the 4-step pattern used by all other dataset scripts:
  - Download: download and extract the raw data to `data/raw/<dataset_name>/`
  - Process: convert annotations to COCO format, apply the processing rules above, and save to `data/processing/<dataset_name>/`
  - Preview: generate a sample annotated image and save it to `previews/`
  - Split: split into train/val and save to `data/final/<dataset_name>_train/` and `data/final/<dataset_name>_val/`
- Use shared utilities from `datasets/utils/`; see existing scripts like roboflow_fish.py for a straightforward example.
- Define module-level constants:
  - `DATASET_SHORTNAME`: a short, unique identifier (e.g. `"noaa_puget"`)
  - `CATEGORIES_FILTER`: list of source category names to keep (or `None` if all categories are fish)
- Add a dataset entry to this README under Processed datasets, following the same metadata format as the existing entries.
- Submit a pull request with your script, the preview image, and the README update.
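The module-level constants, the directory layout, and a group-aware split can be sketched as follows. This is an illustration under assumptions, not the repository's implementation: `example_reef` is a hypothetical dataset, the directory names assume `<dataset_name>` equals the shortname, and `split_by_group` is a made-up stand-in for `split_coco_dataset_into_train_validation()`, whose real signature is not shown here.

```python
import random
from collections import defaultdict
from pathlib import Path

# Module-level constants (values are hypothetical).
DATASET_SHORTNAME = "example_reef"
CATEGORIES_FILTER = None  # every source category is fish

# Directory layout for the four steps, assuming the dataset directory
# name matches the shortname.
RAW_DIR = Path("data/raw") / DATASET_SHORTNAME
PROCESSING_DIR = Path("data/processing") / DATASET_SHORTNAME
PREVIEW_DIR = Path("previews")
TRAIN_DIR = Path("data/final") / f"{DATASET_SHORTNAME}_train"
VAL_DIR = Path("data/final") / f"{DATASET_SHORTNAME}_val"


def split_by_group(coco, group_of, val_fraction=0.2, seed=0):
    """Hypothetical split: whole groups (videos, cameras, deployments) go
    to either train or val, so frames from one group never straddle both.
    Assumes the dataset has at least two groups."""
    groups = defaultdict(list)
    for image in coco["images"]:
        groups[group_of(image)].append(image)

    names = sorted(groups)
    random.Random(seed).shuffle(names)
    n_val = max(1, round(len(names) * val_fraction))
    val_names = set(names[:n_val])

    def subset(images):
        ids = {image["id"] for image in images}
        return {
            "images": images,
            "annotations": [a for a in coco["annotations"] if a["image_id"] in ids],
            "categories": coco["categories"],
        }

    train_images = [im for g in names if g not in val_names for im in groups[g]]
    val_images = [im for g in names if g in val_names for im in groups[g]]
    return subset(train_images), subset(val_images)


# Made-up data: 5 videos with 3 frames each, one box per frame.
images = [
    {"id": v * 10 + f, "file_name": f"example_reef_video{v}_frame{f}.jpg"}
    for v in range(1, 6)
    for f in range(1, 4)
]
coco = {
    "images": images,
    "annotations": [
        {"id": n, "image_id": im["id"], "category_id": 1, "bbox": [0, 0, 10, 10]}
        for n, im in enumerate(images, start=1)
    ],
    "categories": [{"id": 1, "name": "fish"}],
}

train, val = split_by_group(
    coco, group_of=lambda im: im["file_name"].rsplit("_frame", 1)[0]
)
print(len(train["images"]), len(val["images"]))  # 12 3
```

Splitting on the group key rather than on individual images is what keeps near-duplicate frames from the same video out of both sets.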
This repo includes a custom Claude Code skill that automates the entire dataset processing workflow — from downloading and analyzing the data, to generating the script, running it, and updating the documentation. To use it:
- Install Claude Code and open this repo
- Paste the dataset URL (e.g. a Zenodo link, a paper, a GitHub repo) and type `/add-dataset <url>`
- Claude Code will walk you through the full pipeline: research the dataset, test the download, write the processing script, run it, and update the docs
This is the fastest way to add a new dataset if you're already familiar with Claude Code.
Images with 67,990 bounding boxes on fish and crustaceans
Farrell DM, Ferriss B, Sanderson B, Veggerby K, Robinson L, Trivedi A, Pathak S, Muppalla S, Wang J, Morris D, Dodhia R. A labeled data set of underwater images of fish and crab species from five mesohabitats in Puget Sound WA USA. Scientific Data. 2023 Nov 13;10(1):799.
- Data downloadable via https from LILA (download link)
- License: CDLA-permissive 1.0
- Metadata raw format: COCO
- Categories/species: fish and crustaceans
- Vehicle type: N/A
- Image information: 77,739 images
- Annotation information: 67,990 bounding boxes
- Typical animal size in pixels: N/A
- Code to render sample annotated image: noaa_puget.py
Images of freshwater fish taken from underwater videos with 91,482 bounding boxes
- Data downloadable via https from LILA (download link)
- License: CDLA-permissive 1.0
- Metadata raw format: COCO
- Categories/species: fish
- Vehicle type: frames taken from underwater videos
- Image information: 262,050 images
- Annotation information: 91,482 bounding boxes
- Typical animal size in pixels: N/A
- Code to render sample annotated image: mit_river_herring.py
Annotated stereo imagery of orange roughy from 2019 Tasmanian survey, with expert-labeled bounding boxes for machine learning detection in fisheries science.
Scoulding, Ben; Maguire, Kylie; Orenstein, Eric; Jackett, Chris; & CSIRO (2025): Tasmanian Orange Roughy Stereo Image Machine Learning Dataset. v1. CSIRO. Data Collection. https://doi.org/10.25919/a90r-4962
- Data downloadable via https from the CSIRO Portal (download link)
- License: CC BY-NC-SA 4.0
- Metadata raw format: COCO
- Categories/species: fish, eel, corals and other benthic organisms.
- Vehicle type: Net-attached Acoustic and Optical System (AOS)
- Image information: 1,051 images
- Annotation information: 14,414 bounding boxes
- Typical animal size in pixels: N/A
- Code to render sample annotated image: torsi.py
2,027 images captured by diver-borne GoPro cameras from a variety of global coral reefs.
- Home: https://josauder.github.io/coralscapes/
- Data downloadable via HuggingFace from https://huggingface.co/datasets/EPFL-ECEO/coralscapes
- License: Apache-2.0
- Metadata raw format: Parquet
- Categories/species: fish, coral, human, rock, etc.
- Vehicle type: diver
- Image information: 2,027 images
- Annotation information: 174,000 segmentation annotations, of which 20,849 are fish
- Code to render sample annotated image: coralscapes.py
~1k images of fish/squid w/bounding boxes
Simon, K. (2018). Project Natick - Microsoft's Self-sufficient Underwater Datacenters. IndraStra Global, 4(6), 1-4. https://nbn-resolving.org/urn:nbn:de:0168-ssoar-57615-2
- Data downloadable via https from GitHub (download link)
- Metadata raw format: Pascal VOC
- Categories/species: fish, squid
- Vehicle type: fixed camera on structure
- Image information: 1118 RGB images (~5% of images have FN annotations)
- Annotation information: 998 bounding boxes
- Typical animal size in pixels: N/A
- Code to render sample annotated image: project_natick.py
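Because the merged dataset uses COCO boxes, Pascal VOC annotations like these need their corner coordinates converted to `[x, y, width, height]`. The sketch below shows that arithmetic with the standard library only. It is not taken from project_natick.py; `voc_xml_to_coco_bboxes` and the sample XML are hypothetical, and it ignores the 1-based pixel convention some VOC tools use.

```python
import xml.etree.ElementTree as ET


def voc_xml_to_coco_bboxes(xml_text):
    """Hypothetical helper: read every <object> in a Pascal VOC file and
    return its name with a COCO-style [x, y, width, height] box."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        xmin, ymin, xmax, ymax = (
            float(box.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        boxes.append(
            {"name": obj.findtext("name"), "bbox": [xmin, ymin, xmax - xmin, ymax - ymin]}
        )
    return boxes


# Made-up annotation file with a single object.
sample_xml = """
<annotation>
  <object>
    <name>fish</name>
    <bndbox><xmin>10</xmin><ymin>20</ymin><xmax>110</xmax><ymax>70</ymax></bndbox>
  </object>
</annotation>
"""

print(voc_xml_to_coco_bboxes(sample_xml))
# [{'name': 'fish', 'bbox': [10.0, 20.0, 100.0, 50.0]}]
```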
~1k images of fish w/bounding boxes
Solawetz, J. (2023, February 21). Fish object detection dataset. Roboflow. https://public.roboflow.com/object-detection/fish
- Data downloadable via https from Roboflow (download link)
- License: CC0 1.0
- Metadata raw format: multiple available
- Categories/species: 26 fish types (e.g. shark, tuna)
- Vehicle type: underwater cameras
- Image information: 1350 RGB images (the taxonomy is often inaccurate)
- Annotation information: 3142 bounding boxes
- Typical animal size in pixels: N/A
- Code to render sample annotated image: roboflow_fish.py
~40k images with a mix of classification, segmentation, and counting labels
Saleh A, Laradji IH, Konovalov DA, Bradley M, Vazquez D, Sheaves M. A realistic fish-habitat dataset to evaluate algorithms for underwater visual analysis. Scientific Reports. 2020 Sep 4;10(1):14671.
- Data downloadable via https from Queensland University (download link) (7.1GB)
- License: Code is MIT, data is implied-MIT
- Metadata raw format: png (segmentation masks)
- Categories/species:
- Vehicle type: underwater camera deployed over the side of a boat
- Image information: 311 images with segmentation masks
- Annotation information: 388 segmentation masks
- Code to render sample annotated image: deepfish.py
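Since the merged dataset needs bounding boxes, segmentation masks like these have to be turned into one box per fish. One simple way, sketched below, is to take the extent of each connected blob of foreground pixels. This is an assumption about a reasonable approach, not a description of what deepfish.py does; `mask_to_bboxes` is a hypothetical name, and touching or overlapping fish would end up merged into one box.

```python
from collections import deque


def mask_to_bboxes(mask):
    """Return one COCO-style [x, y, width, height] box per 4-connected
    blob of non-zero pixels in a 2D mask (a non-empty list of rows)."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y0 in range(height):
        for x0 in range(width):
            if not mask[y0][x0] or seen[y0][x0]:
                continue
            # Flood-fill this blob, tracking its extent as we go.
            queue = deque([(x0, y0)])
            seen[y0][x0] = True
            x_min = x_max = x0
            y_min = y_max = y0
            while queue:
                x, y = queue.popleft()
                x_min, x_max = min(x_min, x), max(x_max, x)
                y_min, y_max = min(y_min, y), max(y_max, y)
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    in_bounds = 0 <= nx < width and 0 <= ny < height
                    if in_bounds and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append([x_min, y_min, x_max - x_min + 1, y_max - y_min + 1])
    return boxes


# Made-up 4x5 mask with two separate blobs.
mask = [
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1],
]
print(mask_to_bboxes(mask))  # [[1, 0, 2, 2], [4, 2, 1, 2]]
```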
Bboxed images of pelagic fish and associated segmentations
Vaneeda Allken, Shale Rosen (2020) Deep Vision fish dataset https://doi.org/10.21335/NMDC-551736490
- Data downloadable via https from the Norwegian Marine Data Centre (download link)
- License: CC BY 4.0
- Metadata raw format: csv
- Categories/species: economically important pelagic species
- Vehicle type: pictures from fish tanks
- Image information: 1875 RGB images
- Annotation information: 4834 bounding boxes, segmentation masks
- Typical animal size in pixels: N/A
- Code to render sample annotated image: deep_vision.py
~90 videos with bounding boxes on fish. Largely redundant with BrackishMOT (see below).
Detection of Marine Animals in a New Underwater Dataset with Varying Visibility, Pedersen, Malte and Haurum, Joakim Bruslund and Gade, Rikke and Moeslund, Thomas B. and Madsen, Niels, June, 2019
- Data downloadable via https from Kaggle (download link)
- License: CC BY-SA 4.0
- Metadata raw format: AAU, COCO, YOLO
- Categories/species: fish, small fish, crab, shrimp, jellyfish, starfish
- Vehicle type: underwater cameras in brackish water
- Image information: 12,444 RGB images
- Annotation information: 35,565 bounding boxes
- Typical animal size in pixels: N/A
- Code to render sample annotated image: brackish.py
17 10-minute videos with tracking points
I. Kavasidis, S. Palazzo, R. Di Salvo, D. Giordano, C. Spampinato, An innovative web-based collaborative platform for video annotation, Multimedia Tools and Applications, vol. 70, pp. 413-432, 2013.
I. Kavasidis, S. Palazzo, R. Di Salvo, D. Giordano, C. Spampinato, A semi-automatic tool for detection and tracking ground truth generation in videos, Proceedings of the 1st International Workshop on Visual Interfaces for Ground Truth Collection in Computer Vision Applications, pp. 6:1-6:5, 2012.
- Data downloadable via https from GitHub (download link)
- Metadata raw format: XML, FLV
- Categories/species: N/A
- Vehicle type: N/A
- Image information: N/A
- Annotation information: N/A
- Typical animal size in pixels: N/A
- Code to render sample annotated image: f4k.py
14k boxes on fish in 20k images
Joly A., Goeau H., Glotin H., Spampinato C., Bonnet P., Vellinga W.-P., Planquè R., Rauber A., Palazzo S., Fisher R., et al., LifeCLEF 2015: multimedia life species identification challenges, International Conference of the Cross-Language Evaluation Forum for European Languages, pp. 462-483, Springer, 2015.
- Data downloadable via https from Zenodo (download link). Note, the dataset was originally hosted on SharePoint. We uploaded it to Zenodo to make it downloadable programmatically.
- Metadata raw format: XML
- Categories/species: marine ray-finned fish
- Vehicle type: N/A
- Image information: 20,000 images
- Annotation information: 14,000 bounding boxes
- Typical animal size in pixels: N/A
- Code to render sample annotated image: fishclef.py
Several thousand BRUV images with bounding boxes on fish and bait
- Data downloadable from Viame (download link)
- Metadata raw format: N/A
- Categories/species: General Fish, Fish Species, Bait and Algae
- Vehicle type: BRUV
- Image information: ~20,000
- Annotation information: bounding boxes
- Code to render sample annotated image: viame_fishtrack.py
~44k images of fish w/ ~83k bounding boxes
Jansen, A., Walden, D., Walker, S., & Buccella, C. (2022). A deep learning dataset for underwater object detection of tropical freshwater fish species in northern Australia (1.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.7250921
- Data downloadable via https from Zenodo (download link)
- License: CC BY 4.0
- Metadata raw format: json
- Categories/species: Ambassis agrammus, Ambassis macleayi, Amniataba percoides, Craterocephalus stercusmuscarum, Denariusa bandata, Glossamia aprion, Glossogobius spp., Hephaestus fuliginosus, Lates calcarifer, Leiopotherapon unicolor, Liza ordensis, Megalops cyprinoides, Melanotaenia nigrans, Melanotaenia splendida inornata, Mogurnda mogurnda, Nemetalosa erebi, Neoarius spp., Neosilurus spp., Oxyeleotris spp., Scleropages jardinii, Strongylura kreffti, Syncomistes butleri, Toxotes chatareus
- Vehicle type: RUV
- Image information: 44,112 images (images were derived from Remote Underwater Video (RUV) deployments in deep channel and shallow lowland billabongs, Kakadu National Park, Northern Territory Australia)
- Annotation information: 82,904 bounding boxes
- Typical animal size in pixels: N/A
- Code to render sample annotated image: kakadu.py
2200 images of zebrafish with individual IDs
Bruslund Haurum J, Karpova A, Pedersen M, Hein Bengtson S, Moeslund TB. Re-identification of zebrafish using metric learning. In Proceedings of the IEEE/CVF winter conference on applications of computer vision workshops 2020 (pp. 1-11).
- Data downloadable via https from Kaggle (download link)
- License: CC BY 4.0
- Metadata raw format: csv
- Categories/species: zebrafish
- Vehicle type: underwater camera in fish tank
- Image information: 2224 images
- Annotation information: AAU VAP bounding boxes
- Typical animal size in pixels: N/A
- Code to render sample annotated image: zebrafish.py
586 annotated underwater images of Orange Chromide (Etroplus maculatus) fish in South Indian pond environments with 10,607 bounding boxes
Vijayalakshmi M, Sasithradevi A (2024). Annotated underwater fish detection dataset from pond environments. Mendeley Data, V1. https://doi.org/10.17632/7w45jx35hd.1
- Data downloadable via https from Mendeley Data (download link)
- License: CC BY 4.0
- Metadata raw format: YOLO TXT
- Categories/species: Orange Chromide (Etroplus maculatus)
- Vehicle type: Crosstour CT9000 underwater camera at <4m depth
- Image information: 586 images (640x640)
- Annotation information: 10,607 bounding boxes
- Typical animal size in pixels: N/A
- Code to render sample annotated image: orange_chromide.py
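YOLO TXT labels store each box as a class index followed by a normalized center and size, so converting to COCO means scaling by the image dimensions and shifting from the center to the top-left corner. The sketch below shows that conversion. It assumes the standard `class x_center y_center width height` layout, which has not been verified against this dataset, and `yolo_line_to_coco_bbox` is a hypothetical name rather than a function from orange_chromide.py.

```python
def yolo_line_to_coco_bbox(line, image_width, image_height):
    """Hypothetical helper: convert one YOLO label line (normalized
    center/size) to a COCO-style [x, y, width, height] box in pixels."""
    _class_id, x_center, y_center, width, height = line.split()
    width_px = float(width) * image_width
    height_px = float(height) * image_height
    x_min = float(x_center) * image_width - width_px / 2
    y_min = float(y_center) * image_height - height_px / 2
    return [x_min, y_min, width_px, height_px]


# Made-up label on a 640x640 image.
print(yolo_line_to_coco_bbox("0 0.5 0.5 0.25 0.5", 640, 640))
# [240.0, 160.0, 160.0, 320.0]
```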
Katija, K., Orenstein, E., Schlining, B. et al. FathomNet: A global image database for enabling artificial intelligence in the ocean. Sci Rep 12, 15914 (2022). https://doi.org/10.1038/s41598-022-19939-2
The FathomNet Database is an open-source image database that can be used to train, test, and validate state-of-the-art artificial intelligence algorithms to help us understand our ocean and its inhabitants.
- Data downloadable via https from FathomNet (website)
- License: CC BY-NC-ND 4.0 with provision: Notwithstanding any contrary provisions of such license, all Images may be used for training and development of machine learning algorithms for commercial, academic, and government purposes. For all other uses of the Images, users should contact the original copyright holder indicated in the Database for the applicable Images. The users of the Images accept full responsibility for their use. (source)
- Metadata raw format: N/A (exportable via fathomnet-py to VOC, COCO, YOLO)
- Categories/species: many (2400+ concepts in the database)
- Vehicle type: primarily ROV
- Image information: 100k+ images (growing over time)
- Annotation information: 300k+ annotations (growing over time)
- Typical animal size in pixels: N/A
- Code to render sample annotated image: fathomnet.py
Two merged Roboflow datasets with bounding boxes on fish, sharks, rays, turtles and other reef species
- Data downloadable via Roboflow (manual download required)
- Metadata raw format: COCO (after Roboflow export)
- Categories/species: fish, shark, ray, turtle, and various reef fish families
- Vehicle type: underwater cameras
- Code to render sample annotated image: marine_detect.py
Boxes on 532,000 frames from 1,567 videos of salmon in two weirs
Atlas W, Ma S, Chou Y, Connors K, Scurfield D, Nam B, Ma X, Cleveland M, Doire J, Moore J, Shea R, Liu J. Wild salmon enumeration and monitoring using deep learning empowered detection and tracking. Frontiers in Marine Science. 2023 Sep 20;10. https://doi.org/10.3389/fmars.2023.1200408
- Data downloadable via https from GitHub (download link)
- License: CC BY 4.0
- Metadata raw format: YOLOv6
- Categories/species: pacific salmon
- Vehicle type: multi-object tracking (MOT) and object detection
- Image information: 532,000 frames from 1,567 videos
- Annotation information: bounding boxes
- Typical animal size in pixels: N/A
- Fishnet.AI: ~163k bounding boxes on ~35k images of fish and people on fishing vessels. Excluded because images are above-water.
- Visual Marine Animal Tracking (VMAT): 32 video sequences with bounding boxes on marine organisms from AUVs.
- OzFish: Excluded due to annotation quality issues.
- WildFish: Excluded because images are cropped.
- Angling Freshwater Fish Netherlands (AFFiNe): 7k images of 30 species, excluded because images are above water.
- Brook trout imagery for individual ID: Excluded because images are above water.
- 3D-ZeF: Excluded because images are above water and in lab environments.
- Croatian Fish: 794 images of 12 species, excluded because images are cropped.
- Amazonian Fish ML Classifier: Excluded because images are in staging environments.
- BrackishMOT: Same data as the Brackish Dataset.
- Brackish Underwater Dataset: Same data as the Brackish Dataset.
Datasets that we know exist but haven't evaluated or processed yet.
- Newfoundland Marine Refuge Fish Classification Dataset (N-MARINE) (~24k images of marine fish in Canada, with ~24k boxes)
- PomerFish: A dataset for fishes across Pomerania freshwater waterbodies in-situ environments (paper) (~20k segmentation masks on ~15k images)
- J-EDI: AI training dataset for detecting organisms from images taken during deep-sea survey. The detection targets are labeled into 19 rough categories such as "shrimp" and "fish" (~8,000 images, COCO annotation)
















