Species Outlier Detection Using Machine Learning

This repository provides implementations for outlier detection in species observation data from the Finnish Biodiversity Information Facility (FinBIF) using species distribution modeling (SDM) and machine learning models. The workflow depends on the model, but usually includes data preparation, spatial sampling for background data, environmental enrichment, model training, evaluation, and visualization of results. The models combine reclassified CORINE land cover data, raster-based environmental variables, and occurrence data to predict species occurrence probabilities and identify potential outliers.
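
As a rough illustration of the "spatial sampling for background data" step, the sketch below draws random background points inside a bounding box and rejects candidates that fall too close to a known presence. It is not the notebook's implementation: the function name `sample_background_points`, the coordinates, and the 1 km exclusion distance are assumptions chosen for the example, and coordinates are assumed to be in a projected CRS with metre units.

```python
import math
import random


def sample_background_points(presences, bbox, n_points, min_distance,
                             seed=0, max_tries=100_000):
    """Draw random points inside bbox that are at least min_distance
    away from every presence point (simple rejection sampling).

    presences: list of (x, y) tuples in metres (hypothetical data)
    bbox: (min_x, min_y, max_x, max_y) in the same units
    """
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    min_x, min_y, max_x, max_y = bbox
    points = []
    tries = 0
    while len(points) < n_points and tries < max_tries:
        tries += 1
        candidate = (rng.uniform(min_x, max_x), rng.uniform(min_y, max_y))
        # Keep the candidate only if it is far enough from all presences.
        if all(math.dist(candidate, p) >= min_distance for p in presences):
            points.append(candidate)
    return points


# Made-up presence coordinates and study area, for illustration only.
presences = [(385_000.0, 6_672_000.0), (390_000.0, 6_680_000.0)]
bbox = (380_000.0, 6_660_000.0, 400_000.0, 6_690_000.0)
background = sample_background_points(presences, bbox,
                                      n_points=50, min_distance=1_000.0)
print(len(background))
```

Rejection sampling is the simplest option; a real workflow would probably also clip candidates to land area and may weight sampling by observation effort.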

Models Used

The repository includes three different approaches:

Unsupervised models
- The file unsupervised_models.ipynb
- Uses FinBIF occurrence data from api.laji.fi and several local raster data sets
- Flags atypical observations as outliers without separate training data
Random Forest (RF)
- The file random_forest_with_background_samples.ipynb
- Uses FinBIF occurrence data from api.laji.fi and several local raster data sets
- Flags atypical observations in the testing data as outliers
Supervised models for bird atlas data
- The file multiple_models_YKJ_squares.ipynb
- Uses bird atlas data, 10 km x 10 km YKJ squares and environmental data in one preprocessed file
- Calculates the mean result of three different models: Random Forest, Histogram Gradient Boosting and Maximum Entropy
- Predicts probabilities for each bird for 10 km x 10 km squares
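
The averaging step in the bird atlas approach can be pictured as follows. This is a minimal sketch using plain dictionaries, not the notebook's code: the function name `mean_probability`, the model keys, the square identifiers, and the probability values are all made up for illustration.

```python
from statistics import fmean


def mean_probability(per_model):
    """Average per-square probabilities across models.

    per_model: dict of model name -> dict of square id -> probability.
    Only squares predicted by every model are included in the result.
    """
    common = set.intersection(*(set(p) for p in per_model.values()))
    return {sq: fmean(p[sq] for p in per_model.values())
            for sq in sorted(common)}


# Hypothetical predictions for one species in two YKJ squares.
predictions = {
    "random_forest": {"667:338": 0.80, "668:338": 0.10},
    "hist_gradient_boosting": {"667:338": 0.70, "668:338": 0.20},
    "maxent": {"667:338": 0.90, "668:338": 0.30},
}
ensemble = mean_probability(predictions)
for square, probability in ensemble.items():
    print(square, round(probability, 2))
```

An unweighted mean treats the three models as equally reliable; weighting by each model's evaluation score would be an alternative design.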

Note: All models report some false-positive outliers, meaning they may classify valid observations as anomalies.

Dataset Sources

Due to size constraints (>1 GB), raster datasets are not included in this repository. You can download them from official sources or send me an email:

CORINE Land Cover 2018 (25 ha resolution), reclassified: SYKE Open Data
Elevation Model (25 m x 25 m): National Land Survey of Finland (NLS-FI)
Monthly Mean Temperature (1961-2023, 10 km x 10 km resolution): Finnish Meteorological Institute (FMI)
Monthly Precipitation (1961-2023, 10 km x 10 km resolution): Finnish Meteorological Institute (FMI)
Forest Biomass Data (m³/ha): Natural Resources Institute Finland (LUKE)
Coastline Length Calculation (YKJ 10 km x 10 km grids): Calculated from SYKE Ranta10 dataset
YKJ 10 km x 10 km squares: FinBIF

All data sets have been preprocessed. For details, see ML_methods_for_outlier_detection.pdf.

Installation

Dependencies

Install the required dependencies using:

pip install -r requirements.txt

Environment Variables

Store the following credentials in a .env file for API access:

VIRVA_ACCESS_TOKEN=Access token for sensitive data queries from api.laji.fi. Not open for everyone.
ACCESS_TOKEN=Access token for querying open data from api.laji.fi
ACCESS_EMAIL=Email for sensitive data queries from api.laji.fi
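
To check that the .env file is readable before running a notebook, something like the following could be used. It is a standard-library sketch, not how the notebooks load credentials (they may use a package such as python-dotenv instead). The helper `load_env_file` and the placeholder values are assumptions; the demo writes a temporary file so it runs without real tokens.

```python
import os
import tempfile


def load_env_file(path):
    """Parse KEY=VALUE lines from a .env file, skipping blanks and # comments."""
    values = {}
    with open(path, encoding="utf-8") as handle:
        for raw in handle:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Strip optional surrounding quotes from the value.
            values[key.strip()] = value.strip().strip('"').strip("'")
    return values


EXPECTED_KEYS = ("VIRVA_ACCESS_TOKEN", "ACCESS_TOKEN", "ACCESS_EMAIL")

# Demo with placeholder values written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    env_path = os.path.join(tmp, ".env")
    with open(env_path, "w", encoding="utf-8") as handle:
        handle.write("# placeholder values\n"
                     "ACCESS_TOKEN=your-open-data-token\n"
                     "ACCESS_EMAIL=you@example.org\n")
    config = load_env_file(env_path)

missing = [key for key in EXPECTED_KEYS if not config.get(key)]
print("Missing keys:", missing)
```

A missing VIRVA_ACCESS_TOKEN is expected for most users, since that token is not open to everyone.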

Usage

Usage depends completely on the method used.

For random_forest_with_background_samples.ipynb and unsupervised_models.ipynb, create the .env file, set the taxon_id parameter, and run the notebook. multiple_models_YKJ_squares.ipynb is more complicated because it needs a preprocessed GeoPackage file. For more details, contact the author or read ML_methods_for_outlier_detection.pdf.

Model Outputs

Unsupervised models: Each observation receives a probability score [0,1], where lower values indicate higher likelihood of being an outlier.
Random Forest: Each observation in the testing dataset receives a probability score [0,1], where lower values indicate higher likelihood of being an outlier.
Supervised models for bird atlas data: Each YKJ grid square receives a probability score [0,1], where lower values indicate lower breeding likelihood.
Visualization & Storage:
- All results can be interpolated to a continuous raster using the scripts/interpolate_results.py script.
- Results can also be visualized on a map without interpolation.
- Data can be exported as a GeoPackage (.gpkg) file for GIS applications.
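
Because lower scores mean a higher likelihood of being an outlier, a typical follow-up is to list the observations below some cut-off. The sketch below shows one way to do that. The function `flag_outliers`, the observation ids, and the 0.1 threshold are hypothetical; the repository does not prescribe a threshold, and a suitable value depends on the species and the model.

```python
def flag_outliers(scores, threshold=0.1):
    """Return the ids whose score is below threshold, lowest score first."""
    flagged = [(obs_id, score) for obs_id, score in scores.items()
               if score < threshold]
    return [obs_id for obs_id, _ in sorted(flagged, key=lambda item: item[1])]


# Made-up scores for four observations.
scores = {"obs-1": 0.92, "obs-2": 0.04, "obs-3": 0.35, "obs-4": 0.01}
suspects = flag_outliers(scores, threshold=0.1)
print(suspects)
```

Since all models report some false positives, flagged observations are candidates for manual review rather than confirmed errors.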

Method Selection Guide

Method	Unsupervised Models	Random Forest for All Data	Supervised Models for Bird Atlas Data
Data	All laji.fi observations	All laji.fi observations	Breeding probability indices from the Bird Atlas (YKJ grid)
Purpose	Assigns a probability [0–1] to all species observations to identify outliers based on selected variables.	Assigns a probability [0–1] to each test dataset observation to identify outliers based on selected variables.	Assigns a breeding probability to each species in each YKJ grid cell.
Advantages	No need for separate training or absence data. Easy to use for selected variables.	Well-supported by research. Allows reliability assessment using statistical metrics.	Uses high-quality Bird Atlas data. Produces clear results. Allows reliability assessment using statistical metrics.
Challenges	Difficult to assess model performance without comparison data. Sensitive to parameter choices. Accuracy depends on observation location precision. Differences between models.	Sensitive to parameter choices. Requires generation of absence data, as real absence data is available only for a few species (e.g., butterflies).	Variation within 10 km x 10 km grid cells may be greater than between them. Requires extensive data preprocessing. Predicts breeding probability rather than direct observation reliability. Differences between models.

Contact Information and Licence

For further inquiries, please contact me. You are free to use this code in any way you want.

Note: I'm not a biologist or machine learning specialist. Do not trust the models.
