Benchmarking LLMs on Environmental, Social, and Governance (ESG) and Sustainability Knowledge
EMNLP 2025 Main Conference Oral | Resource and Theme Paper Award nominations (Top 1%)
Website | Interactive Heatmap | ACL Anthology | PDF | Evaluation Guide | Dataset Docs
ESGenius is a multiple-choice benchmark for evaluating whether large language models understand ESG and sustainability knowledge at the level needed for standards-aware reasoning. It contains expert-written questions, source-grounded references, reproducible evaluation scripts, published result figures, and a lightweight GitHub Pages site for fast inspection.
| Item | Details |
|---|---|
| Paper | EMNLP 2025 Main Conference Oral |
| Recognition | Nominated for Resource and Theme Paper Awards (Top 1%) |
| Benchmark size | 1,136 multiple-choice questions |
| Answer protocol | A, B, C, D, plus Z for "Not sure" |
| Model results | 50 evaluated models with ranking figures and a question-level heatmap |
| References | Source document names, page references, and supporting excerpts in the reference CSV |
| Website | angel-ntu.github.io/ESGenius |
| Topics | llm, benchmark, esg, sustainability, nlp, evaluation, dataset, emnlp-2025 |
| License | Apache 2.0 |
| Goal | Start here |
|---|---|
| Read the paper | ACL Anthology record or PDF |
| Explore model behavior | Interactive heatmap |
| Download the benchmark | data/ESGenius_1136q.csv or data/ESGenius_1136q.json |
| Use source-grounded references | data/ESGenius_w_ref_1136q.csv |
| Reproduce evaluations | Evaluation guide |
| Cite the work | BibTeX or CITATION.cff |
Sustainability and ESG work is full of specialized terminology, reporting standards, and source-dependent distinctions. ESGenius is designed to test that knowledge directly rather than relying on generic factual recall.
- Covers sustainability reporting, climate disclosure, biodiversity, energy, governance, and standards-driven ESG reasoning.
- Draws on IPCC, GRI, SASB, ISO, IFRS/ISSB, TCFD, CDP, and related sustainability sources.
- Keeps a Z option for abstention-style behavior when a model is unsure.
- Provides both plain benchmark files and reference-aware files for retrieval or audit experiments.
- Includes open evaluation paths for local Hugging Face models, reference-aware prompting, and Dashscope-compatible Qwen APIs.
| Path | Purpose |
|---|---|
| index.html | Fast project homepage for GitHub Pages |
| heatmap.html | Full interactive Plotly heatmap for model-question inspection |
| assets/ | Homepage styles, JavaScript, and ESGenius logo |
| data/ESGenius_1136q.csv | Plain question set in CSV |
| data/ESGenius_1136q.json | Plain question set in JSON |
| data/ESGenius_w_ref_1136q.csv | Questions with source references and supporting excerpts |
| docs/evaluation.md | Detailed evaluation workflow guide |
| evaluation_utils.py | Shared loading, prompting, parsing, metrics, and Excel export utilities |
| eval_opensource.py | Local Hugging Face evaluation path |
| eval_opensource_rag.py | Simple reference-aware RAG evaluation path |
| eval_qwen_api.py | Dashscope-compatible Qwen API evaluation path |
| figures/ | Paper and site figures |
| results/ | Published result images and generated evaluation outputs |
| CITATION.cff | Repository and preferred paper citation metadata |
Create an environment:
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Copy the environment template:
cp .env.example .env
Run a small local smoke test:
python eval_opensource.py \
--dataset ESGenius_1136q.csv \
--models Qwen/Qwen2.5-0.5B-Instruct \
--limit 10
Results are written to results/ as Excel workbooks with summary and details sheets.
The public dataset lives in data/.
| File | Use |
|---|---|
| ESGenius_1136q.csv | Main CSV benchmark for standard evaluation |
| ESGenius_1136q.json | JSON mirror of the plain benchmark |
| ESGenius_w_ref_1136q.csv | Reference-aware version with ref_page, ref_doc, and source_text |
Core fields:
| Column | Description |
|---|---|
| query_id | Stable question identifier |
| new_id | Sequential question index |
| query | Question stem |
| A, B, C, D | Candidate answer options |
| Z | "Not sure" option |
| answer | Gold option label |
| ref_page, ref_doc, source_text | Reference metadata and excerpt in the reference CSV |
See data/README.md for schema notes and usage guidance.
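As a rough illustration of how the core fields fit together, the sketch below turns one row into a multiple-choice prompt. The sample row is invented placeholder text rather than a real ESGenius item, and `format_prompt` is a hypothetical helper; the prompt template used by the repository's scripts may differ.

```python
import csv
import io

# Placeholder row that mimics the documented columns; not a real ESGenius item.
SAMPLE_CSV = """query_id,new_id,query,A,B,C,D,Z,answer
q-demo,1,Which option is a placeholder?,First,Second,Third,Fourth,Not sure,B
"""

OPTION_LABELS = ("A", "B", "C", "D", "Z")


def format_prompt(row: dict) -> str:
    """Render one benchmark row as a prompt (hypothetical template)."""
    lines = [f"Question: {row['query']}"]
    lines += [f"{label}. {row[label]}" for label in OPTION_LABELS]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
prompt = format_prompt(rows[0])
print(prompt)
```

To try this on the real data, replace the in-memory string with an open file handle on data/ESGenius_1136q.csv.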
The repository provides three evaluation paths with shared parsing, normalization, metrics, and workbook-export utilities.
| Path | Script | Typical use |
|---|---|---|
| Local open-source models | eval_opensource.py | Run Hugging Face causal language models locally |
| Reference-aware prompting | eval_opensource_rag.py | Prepend source snippets from the reference CSV |
| Qwen API | eval_qwen_api.py | Evaluate Dashscope-compatible Qwen models with retry handling |
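The parsing and metrics steps shared by these paths can be pictured with a minimal sketch. `parse_choice` and `score` are hypothetical names, not functions from evaluation_utils.py, and treating Z as an abstention that is reported separately and never counted as correct is an assumption made here for illustration.

```python
import re
from typing import List, Optional


def parse_choice(response: str) -> Optional[str]:
    """Return the first standalone uppercase option letter, or None.

    Deliberately simple: a response that opens with the article "A"
    would be misread as option A.
    """
    match = re.search(r"\b([ABCDZ])\b", response)
    return match.group(1) if match else None


def score(predictions: List[Optional[str]], gold: List[str]) -> dict:
    """Compute accuracy plus abstention and unparsed rates (assumed metrics)."""
    total = len(gold)
    correct = sum(p == g for p, g in zip(predictions, gold))
    abstained = sum(p == "Z" for p in predictions)
    unparsed = sum(p is None for p in predictions)
    return {
        "accuracy": correct / total,
        "abstention_rate": abstained / total,
        "unparsed_rate": unparsed / total,
    }


# Made-up model responses and gold labels, for illustration only.
responses = ["B", "Answer: C.", "(Z) Not sure", "no idea"]
gold = ["B", "D", "A", "A"]
predictions = [parse_choice(r) for r in responses]
metrics = score(predictions, gold)
print(predictions, metrics)
```

Reporting abstentions separately from wrong answers keeps a cautious model distinguishable from a confidently mistaken one, which is the behavior the Z option is meant to surface.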
Reference-aware smoke test:
python eval_opensource_rag.py \
--dataset ESGenius_w_ref_1136q.csv \
--models Qwen/Qwen2.5-0.5B-Instruct \
--limit 10
Qwen API smoke test:
python eval_qwen_api.py \
--dataset ESGenius_1136q.csv \
--models Qwen2.5-Max \
--limit 10
For all options, output structure, and reproducibility notes, see docs/evaluation.md.
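For the reference-aware path, prepending a source snippet might look roughly like the sketch below. It assumes only the documented ref_doc, ref_page, and source_text columns; `build_reference_prompt` is a hypothetical helper, the row values are placeholders, and the layout actually used by eval_opensource_rag.py may differ.

```python
def build_reference_prompt(row: dict, question_prompt: str) -> str:
    """Prepend the row's source excerpt to an already formatted question."""
    header = f"Reference ({row['ref_doc']}, p. {row['ref_page']}):"
    return f"{header}\n{row['source_text']}\n\n{question_prompt}"


# Placeholder values, not taken from the real reference CSV.
ref_row = {
    "ref_doc": "Example Standard",
    "ref_page": "12",
    "source_text": "Placeholder excerpt from the source document.",
}
ref_prompt = build_reference_prompt(ref_row, "Question: placeholder?")
print(ref_prompt)
```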
The project website keeps the overview lightweight and sends detailed inspection to the full heatmap page:
Main ESGenius benchmark results. Additional figures are available in figures/ and on the project website.
Validate the static site locally:
python scripts/check_static_site.py
python -m http.server 8000
Then open http://127.0.0.1:8000/.
If you use ESGenius, please cite the EMNLP 2025 paper and repository metadata in CITATION.cff.
@inproceedings{he-etal-2025-esgenius,
title = "{ESG}enius: Benchmarking {LLM}s on Environmental, Social, and Governance ({ESG}) and Sustainability Knowledge",
author = "He, Chaoyue and Zhou, Xin and Wu, Yi and Yu, Xinjia and Zhang, Yan and Zhang, Lei and Wang, Di and Lyu, Shengfei and Xu, Hong and Xiaoqiao, Wang and Liu, Wei and Miao, Chunyan",
editor = "Christodoulopoulos, Christos and Chakraborty, Tanmoy and Rose, Carolyn and Peng, Violet",
booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.emnlp-main.739/",
doi = "10.18653/v1/2025.emnlp-main.739",
pages = "14612--14653",
ISBN = "979-8-89176-332-6"
}
Please see CONTRIBUTING.md for contribution guidance. For vulnerability reporting, see SECURITY.md.
This project is released under the Apache 2.0 License.

