Phenobase/phenobase_data

Elasticsearch Loader Script

Overview

The loader.py script loads tabular data from CSV/TSV files into Elasticsearch. It validates each row against the field rules defined in data/columns.csv and verifies that trait values appear in data/traits.csv.

The script supports three loading modes:

  • machine: machine observation data
  • in_situ: in situ observation data
  • herbarium: herbarium record data

Each mode uses specific required fields defined in columns.csv.
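
A minimal sketch of how the per-mode required-field check could work, assuming columns.csv carries the field and *_required columns described under "Under the Hood"; the mode-to-column mapping and function names here are illustrative, not the loader's actual internals:

import pandas as pd

# Assumed mapping from --mode value to the matching columns.csv flag column.
MODE_COLUMN = {
    "machine": "machine_required",
    "in_situ": "inat_required",
    "herbarium": "herbarium_required",
}

def required_fields(columns_csv, mode):
    """Return the set of field names required for the given mode."""
    schema = pd.read_csv(columns_csv)
    # Required flags assumed to be true/false values in columns.csv.
    mask = schema[MODE_COLUMN[mode]].astype(str).str.lower() == "true"
    return set(schema.loc[mask, "field"])

def missing_fields(row, required):
    """Required fields that are absent or empty in a row dict."""
    return {f for f in required if not str(row.get(f, "")).strip()}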


Usage

usage: loader.py [-h] --mode {machine,in_situ,herbarium} [--drop-existing | --no-drop-existing] [--test] [--strict] [--batch-size BATCH_SIZE] [--progress-every PROGRESS_EVERY] data_dir
loader.py: error: the following arguments are required: data_dir, --mode

Arguments

Positional

data_dir Directory containing the CSV/TSV files to load.

Options

--mode {machine,in_situ,herbarium} (required)
--drop-existing / --no-drop-existing (default: --no-drop-existing)
--test Test mode (no ES insert).
--strict Reject rows with invalid field values after coercion.
--batch-size N Docs per bulk request (default: 5000).
--progress-every N Print progress every N rows (default: 50000).

Example

# Example load commands
python loader.py --mode=machine data/annotations.07.25.2025/ --no-drop-existing --batch-size 5000 --progress-every 50000
python loader.py --mode=in_situ data/npn.1956.01.01-2025.08.31/ --no-drop-existing --batch-size 5000 --progress-every 50000
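
The --batch-size and --progress-every flags map onto Elasticsearch's bulk API. A minimal sketch of that loop, assuming the elasticsearch Python client's streaming_bulk helper (names other than the flags are illustrative):

from elasticsearch.helpers import streaming_bulk

def index_rows(es, index, rows, batch_size=5000, progress_every=50000):
    """Bulk-index an iterable of row dicts, printing periodic progress."""
    actions = ({"_index": index, "_source": row} for row in rows)
    sent = 0
    for ok, _ in streaming_bulk(es, actions, chunk_size=batch_size,
                                raise_on_error=False):
        sent += 1
        if sent % progress_every == 0:
            print(f"{sent} rows indexed")
    return sent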

Under the Hood

traits.csv

  • This file maps ontology trait terms to a pipe-delimited list of parent terms, read roughly as sketched below.
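
A minimal sketch of loading that mapping; the column names "trait" and "parents" are assumptions about the file's layout:

import csv

def load_trait_parents(path="data/traits.csv"):
    """Map each ontology trait term to its list of parent terms."""
    with open(path, newline="") as fh:
        return {row["trait"]: row["parents"].split("|")
                for row in csv.DictReader(fh)}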

columns.csv

  • Defines schema for all fields that can be used in the Elasticsearch index.
  • Contains columns:
    • field: The name of the field in the data.
    • datatype: The expected type (e.g., text, integer, float, boolean, date, keyword).
    • machine_required, inat_required, herbarium_required: Indicate whether the field is required for the given mode.
  • Used for two purposes:
    1. Validating presence of required fields.
    2. Building Elasticsearch mappings dynamically.

transform.yaml (Optional)

A per-dataset YAML file for applying simple value transformations before ingestion. If transform.yaml is present in the data_dir, it is loaded automatically. Only the trait field is currently transformed using this mechanism.

Format:

trait_mappings:
  green leaves present: non-senescing unfolded true leaves present
  senescent leaves: senescing leaves present
  red leaves: colored leaves (non-green)

If a value in the trait column matches a key in trait_mappings (case-insensitive), it is replaced by the corresponding value before validation or Elasticsearch indexing.

This allows for normalizing heterogeneous trait values across datasets without modifying the main loader script.
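
A minimal sketch of that lookup; the case-insensitive match follows the description above, while the function names are illustrative and PyYAML is assumed for parsing:

import yaml

def load_trait_mappings(path):
    """Read trait_mappings from transform.yaml, lower-casing the keys."""
    with open(path) as fh:
        raw = (yaml.safe_load(fh) or {}).get("trait_mappings", {})
    return {key.lower(): value for key, value in raw.items()}

def apply_trait_mapping(value, mappings):
    """Replace a trait value when it matches a mapping key, else pass through."""
    return mappings.get(value.lower(), value)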

Elasticsearch Mapping

  • The script uses columns.csv to generate the index mapping.
  • If --drop-existing is passed, the script deletes the existing index and re-creates it from the generated mapping, as sketched below.
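
A minimal sketch of that flow, assuming the 8.x elasticsearch Python client and that the datatype column already holds Elasticsearch field types; function names are illustrative:

import pandas as pd
from elasticsearch import Elasticsearch

def build_mapping(columns_csv="data/columns.csv"):
    """Generate index properties from the columns.csv schema."""
    schema = pd.read_csv(columns_csv)
    return {row["field"]: {"type": row["datatype"]}
            for _, row in schema.iterrows()}

def recreate_index(es, index, properties):
    """Drop the index if it exists, then create it with the mapping."""
    es.indices.delete(index=index, ignore_unavailable=True)
    es.indices.create(index=index, mappings={"properties": properties})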

Error Reporting

  • Rows missing required fields or containing invalid values are logged.
  • A summary count of invalid rows is displayed after loading.

Requirements

  • Python 3.8+
  • Elasticsearch running locally or remotely (endpoint configured in the script or via a .env file)
  • pandas, elasticsearch, python-dotenv

Install dependencies:

pip install -r requirements.txt

Notes

  • The index name is determined by the mode (e.g., machine-records, inat-records).
  • Validation logic may be extended by modifying the script.
  • Ensure that columns.csv and traits.csv are present in the data/ directory relative to the working directory.

Author

PhenoBase Project | Biocode, LLC
