Species Trait Data Compilation

This project automates the retrieval and compilation of species-specific biological trait data by integrating biodiversity APIs with large language models. It is designed to scale from focused case studies to generalized, cross-species analyses.

Generalized Data Pipeline

The system demonstrates a general-purpose trait extraction pipeline that:

- Queries the scientific literature through Europe PMC / PubMed Central (PMC)
- Retrieves PDFs, parses them, and applies LLM-based extraction prompts to pull out traits such as diet, size, habitat, or environmental associations
- Works for any list of species and any set of traits, driven by an Excel file and a trait description mapping
- Provides a UI that supports batch processing across taxa
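
As a rough illustration of the literature-query step, the sketch below builds a Europe PMC REST search URL for one species and one trait. The function name `build_search_url`, the query wording, and the open-access filter are assumptions made for this example and may not match what the scripts in this repository actually send.

```python
from urllib.parse import urlencode

# Public Europe PMC REST search endpoint.
EUROPE_PMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_search_url(species: str, trait: str, page_size: int = 25) -> str:
    """Return a search URL for open-access papers on one species/trait pair.

    Hypothetical helper: the query wording is an assumption, not the
    project's own query.
    """
    # Quote the species name so it is searched as a phrase.
    query = f'"{species}" AND {trait} AND OPEN_ACCESS:y'
    params = {"query": query, "format": "json", "pageSize": page_size}
    return f"{EUROPE_PMC_SEARCH}?{urlencode(params)}"


url = build_search_url("Rana temporaria", "diet")
```

The resulting URL can be fetched with any HTTP client. Building the URL separately from the request keeps the query easy to inspect before any network call is made.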

How to Use

Clone the repository (if not already done):
```
git clone https://github.com/harpak-lab/Data-Compilation-Model.git
cd Data-Compilation-Model
```

Install dependencies:
```
pip install -r requirements.txt
```
Configure API Keys:

This project requires API keys for the IUCN Red List API and the OpenAI API to retrieve species traits and run extraction.

IUCN Red List API

Go to https://api.iucnredlist.org/ and create an account to generate a new API key.

OpenAI GPT

Sign up or log in at https://platform.openai.com/ to get an API key.

Add keys to a .env file in the project root (same folder as requirements.txt).
```
IUCN_API_KEY=your_iucn_api_key_here
OPENAI_API_KEY=your_openai_api_key_here
```
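
To check that both keys are visible to Python before launching the GUI, something like the following standard-library sketch could be used. The helpers `load_env_file` and `missing_keys` are hypothetical names for illustration. The project itself may load the file differently (for example with a dotenv package), and this sketch overwrites any variable of the same name already set in the environment.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict:
    """Parse KEY=value lines from a .env file and copy them into os.environ."""
    loaded = {}
    env_path = Path(path)
    if not env_path.exists():
        return loaded
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without a KEY=value shape.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Drop optional surrounding quotes around the value.
        loaded[key.strip()] = value.strip().strip('"').strip("'")
    os.environ.update(loaded)
    return loaded


def missing_keys(required=("IUCN_API_KEY", "OPENAI_API_KEY")) -> list:
    """Return the names of required keys that are unset or empty."""
    return [name for name in required if not os.environ.get(name)]
```

Calling `load_env_file()` from the project root and then `missing_keys()` returns an empty list when both keys are present.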
Make sure you're on the main branch and inside the correct directory:
```
git checkout main
cd scripts
```
Run the GUI script:
```
python3 gui.py
```
In the popup window:
- Upload your Excel file: first column = species, remaining columns = traits.
- Upload your trait descriptions text file: UTF-8 encoded; each line in the format trait: description.
Start extraction:
- Click Start Data Extraction.
- The system will query APIs, fetch papers, and extract trait data.
- Results will be saved to: Data-Compilation-Model/02_generic_data_compilation/results/
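
The input formats described above can be validated before a long extraction run. The sketch below parses a trait descriptions file in the `trait: description` format and reports trait columns that have no description. Both function names are hypothetical, and the Excel header row is represented here as a plain list of column names rather than read from a spreadsheet.

```python
def parse_trait_descriptions(text: str) -> dict:
    """Map each trait name to its description from 'trait: description' lines."""
    traits = {}
    for line_number, raw_line in enumerate(text.splitlines(), start=1):
        line = raw_line.strip()
        if not line:
            continue  # ignore blank lines
        # Split on the first colon only, so descriptions may contain colons.
        name, sep, description = line.partition(":")
        if not sep or not name.strip() or not description.strip():
            raise ValueError(f"line {line_number}: expected 'trait: description'")
        traits[name.strip()] = description.strip()
    return traits


def undescribed_traits(columns: list, traits: dict) -> list:
    """Return trait columns (all but the first, species column) lacking a description."""
    return [col for col in columns[1:] if col not in traits]


sample = "diet: primary food sources\nbody size: adult length in mm\n"
traits = parse_trait_descriptions(sample)
missing = undescribed_traits(["species", "diet", "habitat"], traits)
```

In this example `missing` contains `habitat`, the one trait column without a matching description line.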
