This repository provides implementations for outlier detection in species observation data from the Finnish Biodiversity Information Facility (FinBIF) using species distribution modeling (SDM) and machine learning models. The workflow depends on the model, but usually includes data preparation, spatial sampling for background data, environmental enrichment, model training, evaluation, and visualization of results. The models integrate reclassified CORINE land cover data, raster-based environmental variables, and occurrence data to predict species probabilities and identify potential outliers.
The repository includes three different approaches:
- Unsupervised models
  - The file unsupervised_models.ipynb
  - Uses FinBIF occurrence data from api.laji.fi and several local raster data sets
  - Flags dissimilar observations as outliers without separate training data
- Random Forest (RF)
  - The file random_forest_with_background_samples.ipynb
  - Uses FinBIF occurrence data from api.laji.fi and several local raster data sets
  - Flags dissimilar observations in the testing data as outliers
- Supervised models for bird atlas data
  - The file multiple_models_YKJ_squares.ipynb
  - Uses bird atlas data, 10 km x 10 km YKJ squares and environmental data in one preprocessed file
  - Calculates the mean result of three models: Random Forest, Histogram Gradient Boosting and Maximum Entropy
  - Predicts probabilities for each bird species for 10 km x 10 km squares
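The averaging step in the bird atlas approach can be illustrated with a minimal sketch. It assumes equal weights and one probability per YKJ square from each model; the function name and the sample values are hypothetical and are not taken from the notebook.

```python
from statistics import fmean


def ensemble_mean(rf, hgb, maxent):
    """Average per-square probabilities from three models with equal weights."""
    if not (len(rf) == len(hgb) == len(maxent)):
        raise ValueError("all models must score the same squares")
    # zip pairs up the three scores that belong to the same square
    return [fmean(triple) for triple in zip(rf, hgb, maxent)]


# Hypothetical probabilities for three YKJ squares
rf = [0.9, 0.2, 0.5]
hgb = [0.8, 0.1, 0.6]
maxent = [0.7, 0.3, 0.4]

mean_probs = ensemble_mean(rf, hgb, maxent)
```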
Note: All models report some false-positive outliers, meaning they may classify valid observations as anomalies.
Due to size constraints (>1 GB), raster datasets are not included in this repository. You can download them from official sources or send me an email:
- CORINE Land Cover 2018 (25 ha resolution), reclassified: SYKE Open Data
- Elevation Model (25m x 25m): National Land Survey of Finland (NLS-FI)
- Monthly Mean Temperature (1961-2023, 10km x 10km resolution): Finnish Meteorological Institute (FMI)
- Monthly Precipitation (1961-2023, 10km x 10km resolution): Finnish Meteorological Institute (FMI)
- Forest Biomass Data (m3/ha): Natural Resources Institute Finland (LUKE)
- Coastline Length Calculation (YKJ 10km x 10km grids): Calculated from SYKE Ranta10 dataset
- YKJ 10 km x 10 km squares: FinBIF
All data sets have been preprocessed. Read more: ML methods for outlier detection
Install the required dependencies using:
pip install -r requirements.txt
Store the following credentials in a .env file for API access:
VIRVA_ACCESS_TOKEN=Access token for sensitive data queries from api.laji.fi. Not open for everyone.
ACCESS_TOKEN=Access token for querying open data from api.laji.fi
ACCESS_EMAIL=Email for sensitive data queries from api.laji.fi
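As a rough illustration of how these credentials could be read, the sketch below parses KEY=VALUE lines using only the standard library. The notebooks may load the file differently (for example with a dotenv package); `parse_env` and `missing_keys` are hypothetical helpers, and the sample values are placeholders.

```python
from pathlib import Path


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blanks, comments and malformed lines."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_env(path=".env"):
    """Read and parse a .env file from disk."""
    return parse_env(Path(path).read_text(encoding="utf-8"))


def missing_keys(values, required):
    """Return the required keys that are absent or empty."""
    return [name for name in required if not values.get(name)]


# Placeholder content; a real file would be read with load_env(".env")
sample = "ACCESS_TOKEN=abc123\nACCESS_EMAIL=me@example.org\n"
creds = parse_env(sample)
print(missing_keys(creds, ["ACCESS_TOKEN", "ACCESS_EMAIL", "VIRVA_ACCESS_TOKEN"]))
```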
Usage depends completely on the method used.
For random_forest_with_background_samples.ipynb and unsupervised_models.ipynb, set the taxon_if parameter after creating the .env file and run the notebook. multiple_models_YKJ_squares.ipynb is more complicated, as it needs a preprocessed GeoPackage file. For more details, ask me or read ML methods for outlier detection.
- Unsupervised models: Each observation receives a probability score [0,1], where lower values indicate a higher likelihood of being an outlier.
- Random Forest: Each observation in the testing dataset receives a probability score [0,1], where lower values indicate a higher likelihood of being an outlier.
- Supervised models for bird atlas data: Each YKJ grid square receives a probability score [0,1], where lower values indicate lower breeding likelihood.
- Visualization & Storage:
  - All results can be interpolated to a continuous raster using the scripts/interpolate_results.py script.
  - Results can also be visualized on a map without interpolation.
  - Data can be exported as a GeoPackage (.gpkg) file for GIS applications.
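Since lower scores indicate likelier outliers, one simple way to turn the scores into a review list is a fixed cutoff. The sketch below is an assumption rather than part of the notebooks: the function name, the observation ids and the 0.1 threshold are all hypothetical, and a suitable cutoff will depend on the species and model.

```python
def flag_outliers(scores, threshold=0.1):
    """Return (id, score) pairs below the threshold, most suspicious first."""
    flagged = [(obs_id, s) for obs_id, s in scores.items() if s < threshold]
    # Sort ascending so the lowest probability comes first
    return sorted(flagged, key=lambda item: item[1])


# Hypothetical observation ids and probability scores
scores = {"obs1": 0.92, "obs2": 0.03, "obs3": 0.40, "obs4": 0.08}
candidates = flag_outliers(scores, threshold=0.1)
```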
| Method | Unsupervised Models | Random Forest for All Data | Supervised Models for Bird Atlas Data |
|---|---|---|---|
| Data | All laji.fi observations | All laji.fi observations | Breeding probability indices from the Bird Atlas (YKJ grid) |
| Purpose | Assigns a probability [0–1] to all species observations to identify outliers based on selected variables. | Assigns a probability [0–1] to each test dataset observation to identify outliers based on selected variables. | Assigns a breeding probability to each species in each YKJ grid cell. |
| Advantages | No need for separate training or absence data. Easy to use for selected variables. | Well-supported by research. Allows reliability assessment using statistical metrics. | Uses high-quality Bird Atlas data. Produces clear results. Allows reliability assessment using statistical metrics. |
| Challenges | Difficult to assess model performance without comparison data. Sensitive to parameter choices. Accuracy depends on observation location precision. Differences between models. | Sensitive to parameter choices. Requires generation of absence data, as real absence data is available only for a few species (e.g., butterflies). | Variation within 10 km x 10 km grid cells may be greater than between them. Requires extensive data preprocessing. Predicts breeding probability rather than direct observation reliability. Differences between models. |
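The comparison notes that the Random Forest approach needs generated absence data. One common way to produce such background points is uniform random sampling inside a bounding box while keeping a minimum distance from known presences. The sketch below shows that idea only; it is not necessarily how the notebook samples, it assumes projected coordinates in metres, and all names and parameter values are hypothetical.

```python
import math
import random


def sample_background(presences, bbox, n, min_dist, seed=0, max_tries=100_000):
    """Draw n random points in bbox at least min_dist from every presence."""
    xmin, ymin, xmax, ymax = bbox
    rng = random.Random(seed)  # seeded for reproducible samples
    points = []
    tries = 0
    while len(points) < n:
        tries += 1
        if tries > max_tries:
            raise RuntimeError("could not place enough background points")
        x, y = rng.uniform(xmin, xmax), rng.uniform(ymin, ymax)
        # Reject candidates that fall too close to any presence point
        if all(math.hypot(x - px, y - py) >= min_dist for px, py in presences):
            points.append((x, y))
    return points


# Hypothetical presence coordinates and study area
presences = [(10.0, 10.0), (50.0, 60.0)]
background = sample_background(presences, bbox=(0, 0, 100, 100), n=20, min_dist=5.0)
```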
For further inquiries, please contact me. Also feel free to use this code in any way you want.
Note: I'm not a biologist or machine learning specialist. Do not trust the models.
