GitHub - mdugenne/PSSdb_Learning: This repository includes the code to model plankton spectral biogeography using the Pelagic Size Structure database taxa-specific products

Machine learning models applied to the Pelagic Size Structure database (PSSdb)

Mathilde Dugenne, Jessica Luo, Rainer Kiko, Marco Corrales-Ugalde, Todd O'Brien, Charles Stock, Jean-Olivier Irisson, Lars Stemmann, Fabien Lombard

This repository includes the code to model plankton spectral biogeography using the Pelagic Size Structure database taxa-specific products and the XGBoost machine learning library.

Acknowledgment: This work is funded by NOAA (Award #NA21OAR4310254)

Organisation

This repository contains:

configuration masterfile: file used to configure this GitHub repository. It contains the credentials needed to download environmental variables from NASA, Copernicus, or AVISO.

Attention: The repository includes a .gitignore file, which protects personal information and avoids tracking datafiles that exceed the GitHub upload limit (2 GB). Personal information (login and password) is required to download datasets hosted on EcoTaxa, NASA, Copernicus, or AVISO. Follow the instructions in the template configuration masterfile to save this protected information in a "configuration_masterfile.yaml" file.

scripts section: contains all scripts and functions developed for PSSdb_Learning. This section includes functions required at each step of the Workflow, paired with a numbered script, whose objective is to generate global predictions of taxa-specific Normalized Biovolume Size Spectrum using PSSdb data products (see PSSdb website). Numbered scripts should be run sequentially in order to generate the final PSSdb_Learning products.
data section: contains all datafiles, including PSSdb taxa-specific products (NBSS_ver_xx_xxxx), environmental factors (Environement, not tracked since files are too large), and model predictions (Model_output).
figures section: contains all figures generated for the associated paper.

Workflow

The workflow includes four steps (numbered 0 to 3) that should be run sequentially to train boosted decision trees and predict taxa-specific NBSS parameters globally.

Pre-steps: Generate taxa-specific products in PSSdb

Taxa-specific Normalized Biovolume Size Spectra (NBSS) are automatically generated by the PSSdb pipeline. First, imaging datasets are downloaded automatically from EcoTaxa, the platform for automated classification and manual validation.

Second, datafiles are converted to standard formats and units, and taxonomic annotations are standardized according to the World Register of Marine Species.

Third, each sample (UVP profile or scan) passes through a quality control to ensure that datasets ingested in PSSdb contain the correct information and are well validated (for UVP and scanners).

Fourth, samples in spatial and temporal proximity are aggregated in half-degree, weekly bins, and then averaged at the final resolution of PSSdb (1 degree, year and month) to ensure that repeated and rarer samples are equally represented.

Lastly, each spectrum is fitted with a log-linear regression to obtain estimates of NBSS parameters (slope, intercept, coefficient of determination) in each spatio-temporal bin.
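
The two-stage aggregation of the fourth pre-step can be illustrated with a minimal sketch. It assumes each sample is a hypothetical tuple `(latitude, longitude, date, value)`, assigns bins by flooring coordinates, and keys weekly bins by calendar month plus ISO week number. These conventions and the function names are illustrative assumptions, not the PSSdb pipeline's actual code.

```python
from collections import defaultdict
from datetime import date
import math


def mean(values):
    return sum(values) / len(values)


def aggregate(samples):
    """Average samples in half-degree/weekly bins, then in 1-degree/monthly bins."""
    # Stage 1: group samples by half-degree cell, year, month and ISO week.
    weekly = defaultdict(list)
    for lat, lon, day, value in samples:
        key = (math.floor(lat * 2) / 2, math.floor(lon * 2) / 2,
               day.year, day.month, day.isocalendar()[1])
        weekly[key].append(value)

    # Stage 2: average the weekly bin means within each 1-degree, monthly bin,
    # so that every weekly bin counts once regardless of how often it was sampled.
    monthly = defaultdict(list)
    for (lat, lon, year, month, _week), values in weekly.items():
        monthly[(math.floor(lat), math.floor(lon), year, month)].append(mean(values))
    return {key: mean(values) for key, values in monthly.items()}


# Three repeated samples in one weekly bin and a single sample in another.
samples = [
    (10.1, -19.9, date(2020, 6, 1), 10),
    (10.2, -19.8, date(2020, 6, 2), 10),
    (10.3, -19.7, date(2020, 6, 3), 10),
    (10.7, -19.2, date(2020, 6, 20), 40),
]
print(aggregate(samples))  # {(10, -20, 2020, 6): 25.0}
```

In this example the single sample weighs as much as the three repeated ones: the two-stage mean is 25, whereas a plain average over the four samples would give 17.5.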

Step 0: Check PSSdb taxa-specific products and generate linear fits

This script checks the latest taxa-specific products generated by the PSSdb pipeline and fits a log-linear regression to obtain NBSS parameters (intercept, slope, coefficient of determination, and size range) in each spatio-temporal bin (1x1 degree latitude/longitude, year and month).
python ~/GIT/PSSdb_Learning/scripts/0_explore_NBSS.py
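
As a rough illustration of the fit performed for each bin, the sketch below runs an ordinary least-squares regression on a single spectrum. It assumes both the size classes and the NBSS values are log10-transformed before fitting, a common convention for size spectra that this README does not spell out. The function name `fit_nbss` and the returned keys are hypothetical, not taken from `0_explore_NBSS.py`.

```python
import math


def fit_nbss(sizes, nbss):
    """Least-squares fit of log10(NBSS) against log10(size) for one spectrum."""
    x = [math.log10(s) for s in sizes]
    y = [math.log10(v) for v in nbss]
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    syy = sum((yi - y_mean) ** 2 for yi in y)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    # For a simple linear regression, R2 equals the squared correlation.
    r2 = sxy ** 2 / (sxx * syy) if syy > 0 else 1.0
    return {
        "slope": slope,
        "intercept": intercept,
        "r2": r2,
        "size_range": (min(sizes), max(sizes)),
    }


# Synthetic spectrum (made-up values) following NBSS = 100 / size.
sizes = [1e3, 1e4, 1e5, 1e6]
nbss = [100 / s for s in sizes]
print(fit_nbss(sizes, nbss))
```

For this synthetic spectrum the fitted slope is close to -1 and the intercept close to 2, since log10(100 / size) = 2 - log10(size).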

Step 1: Merge NBSS parameters (response variable) input dataset with environmental descriptors (explanatory variables)

This script merges the NBSS parameters with a set of environmental descriptors distributed by NASA, AVISO, Copernicus, or WOA, matched to the resolution of PSSdb datasets, to generate the input dataframe for the machine learning model.
python ~/GIT/PSSdb_Learning/scripts/1_merge_predictors.py
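
The merge amounts to a join on the spatio-temporal bin. The sketch below shows the idea with plain dictionaries; the key names (`latitude`, `longitude`, `year`, `month`) and the predictor columns are hypothetical, and the actual script may well rely on a dataframe library instead (with pandas, the equivalent would be `DataFrame.merge` on the same keys).

```python
def merge_predictors(nbss_rows, env_rows,
                     keys=("latitude", "longitude", "year", "month")):
    """Inner join of NBSS parameters and environmental descriptors on the bin keys."""
    # Index the environmental rows by their bin so each lookup is O(1).
    env_index = {tuple(row[k] for k in keys): row for row in env_rows}
    merged = []
    for row in nbss_rows:
        match = env_index.get(tuple(row[k] for k in keys))
        if match is not None:  # keep only bins present in both inputs
            merged.append({**match, **row})
    return merged


# Made-up rows: the second NBSS bin has no environmental match and is dropped.
nbss_rows = [
    {"latitude": 10, "longitude": -20, "year": 2020, "month": 6, "slope": -1.0},
    {"latitude": 11, "longitude": -20, "year": 2020, "month": 6, "slope": -0.9},
]
env_rows = [
    {"latitude": 10, "longitude": -20, "year": 2020, "month": 6,
     "temperature": 24.5, "chlorophyll": 0.12},
]
print(merge_predictors(nbss_rows, env_rows))
```

An inner join is only one possible choice: a left join would keep NBSS bins without predictors, at the cost of missing values in the model input.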

Step 2: Train boosted decision trees model

This script trains a boosted regression trees model using the input dataframe generated in step 1. Both the models (JSON format) and the model outputs (NetCDF format) are saved automatically and can be loaded using the script of step 3.
python ~/GIT/PSSdb_Learning/scripts/2_train_model.py
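
To make the idea of boosted regression trees concrete, here is a deliberately simplified, dependency-free sketch. It is not XGBoost and not the code of `2_train_model.py`: it swaps in one-split trees (stumps) fitted to squared-error residuals, and every name, hyperparameter, and data value below is a hypothetical choice for illustration. The model is a plain dictionary, so it can be written to JSON, loosely mirroring the JSON model files mentioned above.

```python
import json


def fit_stump(rows, residuals):
    """Return (feature, threshold, left_mean, right_mean) with the lowest squared error."""
    best = None
    for j in range(len(rows[0])):
        values = sorted({row[j] for row in rows})
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [r for row, r in zip(rows, residuals) if row[j] <= threshold]
            right = [r for row, r in zip(rows, residuals) if row[j] > threshold]
            left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((r - left_mean) ** 2 for r in left)
                   + sum((r - right_mean) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, threshold, left_mean, right_mean)
    if best is None:
        raise ValueError("no feature has two distinct values to split on")
    return best[1:]


def train(rows, y, n_rounds=50, learning_rate=0.3):
    """Fit stumps sequentially, each one on the residuals left by the previous ones."""
    base = sum(y) / len(y)
    predictions = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        j, threshold, left_mean, right_mean = fit_stump(rows, residuals)
        stumps.append((j, threshold, left_mean, right_mean))
        predictions = [
            p + learning_rate * (left_mean if row[j] <= threshold else right_mean)
            for p, row in zip(predictions, rows)
        ]
    return {"base": base, "learning_rate": learning_rate, "stumps": stumps}


def predict(model, row):
    """Sum the base value and the shrunken contribution of every stump."""
    out = model["base"]
    for j, threshold, left_mean, right_mean in model["stumps"]:
        out += model["learning_rate"] * (left_mean if row[j] <= threshold else right_mean)
    return out


# Made-up predictors (temperature, chlorophyll) and NBSS slopes.
rows = [[5.0, 0.4], [10.0, 0.3], [20.0, 0.2], [25.0, 0.1]]
slopes = [-0.8, -0.9, -1.1, -1.2]
model = train(rows, slopes)

# Round trip through JSON: tuples come back as lists, which predict() accepts.
restored = json.loads(json.dumps(model))
print([round(predict(restored, row), 3) for row in rows])
```

Each round fits a new stump to what the current ensemble still gets wrong, and the learning rate shrinks its contribution. XGBoost builds on the same additive principle with deeper trees, regularization, and second-order gradient information.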

Step 3: Check model predictions

This script checks the model predictions by generating climatologies and maps.
python ~/GIT/PSSdb_Learning/scripts/3_check_model.py
