Machine learning models applied to the Pelagic Size Structure database (PSSdb)
Mathilde Dugenne, Jessica Luo, Rainer Kiko, Marco Corrales-Ugalde, Todd O'Brien, Charles Stock, Jean-Olivier Irisson, Lars Stemmann, Fabien Lombard
This repository includes the code to model plankton spectral biogeography using the Pelagic Size Structure database taxa-specific products and the machine learning model xgboost.
Acknowledgment: This work is funded by NOAA (Award #NA21OAR4310254)
Organisation
This repository contains a:
- configuration masterfile: File used to configure this GitHub repository. This file contains the credentials information needed to download environmental variables from NASA, Copernicus, or AVISO
- scripts section: contains all scripts and functions developed for PSSdb_Learning. This section includes functions required at each step of the Workflow, paired with a numbered script, whose objective is to generate global predictions of taxa-specific Normalized Biovolume Size Spectrum using PSSdb data products (see PSSdb website). Numbered scripts should be run sequentially in order to generate the final PSSdb_Learning products.
- data section: contains all datafiles, including PSSdb taxa-specific products (NBSS_ver_xx_xxxx), environmental factors (Environement, not tracked since files are too large), and model predictions (Model_output).
- figures section: contains all figures generated for the associated paper.
Workflow
The workflow includes four steps (numbered 0 to 3) that should be run sequentially to train boosted decision trees and predict taxa-specific NBSS parameters globally.
Pre-steps: Generate taxa-specific products in PSSdb
Taxa-specific Normalized Biovolume Size Spectra (NBSS) are automatically generated by the PSSdb pipeline. First, imaging datasets are downloaded automatically from EcoTaxa, the platform for automated classification and manual validation.
Second, datafiles are standardized according to standard formats and units, and taxonomic annotations are standardized according to the World Register of Marine Species.
Third, each sample (UVP profile or scan) passes through a quality control step to ensure that datasets ingested in PSSdb contain the correct information and are well validated (for UVP and scanners).
Fourth, samples in spatial and temporal proximity are aggregated in half-degree, weekly bins, and then averaged at the final resolution of PSSdb (1 degree, year and month) so that repeated and rarer samples are equally represented.
Lastly, each spectrum is fitted with a log-linear regression to obtain estimates of NBSS parameters (slope, intercept, coefficient of determination) in each spatio-temporal bin.
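The two-stage averaging described in the fourth step can be sketched as follows. This is a minimal illustration with pandas, not the PSSdb pipeline code: the column names (`latitude`, `longitude`, `date`, `nbss`), the helper `aggregate_samples`, the sample values, and the convention of labelling each bin by its lower-left corner are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def aggregate_samples(samples: pd.DataFrame) -> pd.DataFrame:
    """Average in half-degree weekly bins, then in 1-degree year-month bins."""
    df = samples.copy()
    df["date"] = pd.to_datetime(df["date"])
    # Stage 1: half-degree, weekly bins (labelled by their lower-left corner).
    df["lat_half"] = np.floor(df["latitude"] / 0.5) * 0.5
    df["lon_half"] = np.floor(df["longitude"] / 0.5) * 0.5
    df["year"] = df["date"].dt.year
    df["month"] = df["date"].dt.month
    df["week"] = df["date"].dt.isocalendar().week.astype(int)
    stage1_keys = ["lat_half", "lon_half", "year", "month", "week"]
    weekly = df.groupby(stage1_keys, as_index=False)["nbss"].mean()
    # Stage 2: 1-degree, year-month bins, averaging the stage-1 means.
    weekly["lat_bin"] = np.floor(weekly["lat_half"])
    weekly["lon_bin"] = np.floor(weekly["lon_half"])
    stage2_keys = ["lat_bin", "lon_bin", "year", "month"]
    return weekly.groupby(stage2_keys, as_index=False)["nbss"].mean()


# Invented data: three repeated samples at one site in one week,
# plus a single sample elsewhere in the same 1-degree cell and month.
samples = pd.DataFrame(
    {
        "latitude": [10.1, 10.2, 10.3, 10.7],
        "longitude": [20.1, 20.2, 20.3, 20.7],
        "date": ["2021-06-01", "2021-06-02", "2021-06-03", "2021-06-20"],
        "nbss": [1.0, 1.0, 1.0, 5.0],
    }
)
result = aggregate_samples(samples)
plain_mean = samples["nbss"].mean()
```

In this example the plain mean of the four samples is 2.0, dominated by the repeated site, whereas the two-stage mean is 3.0 because each half-degree weekly bin contributes once.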
Step 0: Check PSSdb taxa-specific products and generate linear fits
This script checks the latest taxa-specific products generated by the PSSdb pipeline and fits a log-linear regression to obtain NBSS parameters (intercept, slope, coefficient of determination and size range) in each spatio-temporal bin (1x1 degree latitude/longitude, year month).
python ~/GIT/PSSdb_Learning/scripts/0_explore_NBSS.py
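As a rough illustration of the log-linear fit, the sketch below regresses log10(NBSS) on log10(size) with NumPy and returns the four parameters listed above. It is not taken from `0_explore_NBSS.py`: the function name `fit_nbss`, the use of base-10 logarithms, the choice of size variable, and the synthetic spectrum are assumptions.

```python
import numpy as np


def fit_nbss(size, nbss):
    """Least-squares fit of log10(nbss) = intercept + slope * log10(size)."""
    size = np.asarray(size, dtype=float)
    x = np.log10(size)
    y = np.log10(np.asarray(nbss, dtype=float))
    slope, intercept = np.polyfit(x, y, deg=1)
    # Coefficient of determination from residual and total sums of squares.
    residuals = y - (intercept + slope * x)
    r2 = 1.0 - np.sum(residuals**2) / np.sum((y - y.mean()) ** 2)
    return {
        "slope": float(slope),
        "intercept": float(intercept),
        "r2": float(r2),
        "size_min": float(size.min()),
        "size_max": float(size.max()),
    }


# Synthetic spectrum following an exact power law: nbss = 100 * size**-1.
size = np.array([1e3, 1e4, 1e5, 1e6])
nbss = 100.0 * size**-1.0
fit = fit_nbss(size, nbss)
```

For this noise-free spectrum the fit recovers a slope of -1 and an intercept of 2 up to floating-point error. Real spectra would first need empty or undersampled size classes removed, which is not shown here.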
Step 1: Merge NBSS parameters (response variable) input dataset with environmental descriptors (explanatory variables)
This script merges a set of environmental descriptors distributed by NASA, AVISO, Copernicus, or WOA at the resolution of PSSdb datasets to generate the input dataframe for the machine learning model.
python ~/GIT/PSSdb_Learning/scripts/1_merge_predictors.py
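Conceptually, the merge joins the response table and the predictor table on the shared spatio-temporal bin. The sketch below shows one way to do this with `pandas.DataFrame.merge`; the column names, the key set, and all values are invented for illustration and may differ from what `1_merge_predictors.py` uses.

```python
import pandas as pd

# Hypothetical NBSS parameters per 1-degree, year-month bin (response variable).
nbss_parameters = pd.DataFrame(
    {
        "latitude": [10.5, 10.5, 45.5],
        "longitude": [20.5, 20.5, -30.5],
        "year": [2021, 2021, 2020],
        "month": [6, 7, 1],
        "slope": [-1.0, -0.9, -1.2],
    }
)

# Hypothetical environmental descriptors on the same grid (explanatory variables).
environment = pd.DataFrame(
    {
        "latitude": [10.5, 10.5],
        "longitude": [20.5, 20.5],
        "year": [2021, 2021],
        "month": [6, 7],
        "temperature": [25.1, 26.3],
    }
)

keys = ["latitude", "longitude", "year", "month"]
# A left join keeps every NBSS bin; validate raises if a bin matches
# more than one row of environmental descriptors.
model_input = nbss_parameters.merge(
    environment, on=keys, how="left", validate="many_to_one"
)
missing_predictors = int(model_input["temperature"].isna().sum())
```

Bins without a matching environmental record keep their NBSS parameters and receive `NaN` predictors, so they can be counted or dropped explicitly before training. Joining on floating-point coordinates only works when both tables use identical bin labels.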
Step 2: Train boosted decision trees model
This script trains a boosted regression trees model using the input dataframe generated in step 1. Both models (JSON format) and model outputs (NetCDF format) are saved automatically and can be loaded using the script of step 3.
python ~/GIT/PSSdb_Learning/scripts/2_train_model.py




