This project automates the retrieval and compilation of species-specific biological trait data by integrating biodiversity APIs with large language models. It is designed to scale from focused case studies to generalized, cross-species analyses.
The system implements a general-purpose trait extraction pipeline that:
- Uses Europe PMC / PubMed Central (PMC) to query scientific literature
- Retrieves PDFs, parses them, and applies LLM-based extraction prompts to pull out traits such as diet, size, habitat, or environmental associations
- Works for any list of species and any set of traits, driven by an Excel file and trait description mapping
- Provides a UI for easy use, supporting batch processing across taxa
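The literature query step can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes the public Europe PMC REST search endpoint, and the function names `build_search_url` and `extract_pmcids`, the query wording, and the open-access filter are hypothetical choices made for this example.

```python
from urllib.parse import urlencode

# Public Europe PMC REST search endpoint.
EUROPE_PMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_search_url(species: str, trait: str, page_size: int = 25) -> str:
    """Build a search URL for papers mentioning a species and a trait."""
    # Assumption: quoted species name AND trait keyword, restricted to
    # open-access records so that full text can be retrieved afterwards.
    query = f'"{species}" AND {trait} AND OPEN_ACCESS:Y'
    params = {
        "query": query,
        "format": "json",
        "resultType": "lite",
        "pageSize": page_size,
    }
    return f"{EUROPE_PMC_SEARCH}?{urlencode(params)}"


def extract_pmcids(response_json: dict) -> list[str]:
    """Collect PMC identifiers from a decoded search response."""
    results = response_json.get("resultList", {}).get("result", [])
    # Not every record has a PMC full-text identifier, so skip those without one.
    return [record["pmcid"] for record in results if "pmcid" in record]
```

The URL can be fetched with any HTTP client and the decoded JSON passed to `extract_pmcids`. Keeping URL construction and response parsing as pure functions makes both easy to check without network access.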
- Clone the repository (if not already done):
    git clone https://github.com/harpak-lab/Data-Compilation-Model.git
    cd Data-Compilation-Model
- Install dependencies:
    pip install -r requirements.txt
- Configure API keys:
  This project requires API keys for the IUCN Red List and OpenAI GPT to retrieve species traits and run extraction.
  - IUCN Red List API: go to https://api.iucnredlist.org/ and create an account to generate a new API key.
  - OpenAI GPT: sign up or log in at https://platform.openai.com/ to get an API key.
  Add the keys to a .env file in the project root (the same folder as requirements.txt):
    IUCN_API_KEY=your_iucn_api_key_here
    OPENAI_API_KEY=your_openai_api_key_here
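Before launching the tool, it can help to confirm that both keys are present. The sketch below is a standard-library check written for illustration; how the project itself loads the .env file is not stated here (it may use a package such as python-dotenv), and the helper names `parse_env_text` and `missing_keys` are hypothetical.

```python
def parse_env_text(text: str) -> dict[str, str]:
    """Parse KEY=value lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(env: dict[str, str],
                 required=("IUCN_API_KEY", "OPENAI_API_KEY")) -> list[str]:
    """Return the required key names that are absent or empty."""
    return [name for name in required if not env.get(name)]
```

For example, reading the file with `open(".env", encoding="utf-8").read()` and passing the result through both functions returns an empty list when the configuration is complete.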
- Make sure you're on the main branch and inside the correct directory:
    git checkout main
    cd scripts
- Run the GUI script:
    python3 gui.py
- In the popup window:
  - Upload your Excel file: first column = species, remaining columns = traits.
  - Upload your trait descriptions text file: UTF-8 encoded; each line in the format trait: description.
- Start extraction:
  - Click Start Data Extraction.
  - The system will query APIs, fetch papers, and extract trait data.
  - Results will be saved to: Data-Compilation-Model/02_generic_data_compilation/results/
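The two input files described in the steps above can be checked against each other before a long extraction run. The following is a minimal sketch under stated assumptions: the function names are hypothetical, and the Excel header row is passed in as a plain list of column names rather than read from the workbook, so the example stays dependency-free.

```python
def parse_trait_descriptions(text: str) -> dict[str, str]:
    """Parse lines of the form 'trait: description' into a dict."""
    traits = {}
    for number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        # Split on the first colon only, so descriptions may contain colons.
        name, sep, description = line.partition(":")
        if not sep or not name.strip() or not description.strip():
            raise ValueError(f"line {number}: expected 'trait: description'")
        traits[name.strip()] = description.strip()
    return traits


def undescribed_traits(columns: list[str], traits: dict[str, str]) -> list[str]:
    """Return trait columns (all but the first, species, column) with no description."""
    return [column for column in columns[1:] if column not in traits]
```

A call such as `undescribed_traits(["species", "diet", "habitat"], traits)` lists any trait column that lacks a description. The descriptions file can be read with `open(path, encoding="utf-8").read()`, matching the UTF-8 requirement above.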