A toolkit for running Q&A evaluations against LLM-powered chat agents using Langfuse datasets and experiments.
This repo provides scripts to:
- Load golden Q&A datasets into Langfuse (from JSON or markdown)
- Run evaluation experiments against any chat API
- Score responses using LLM judges and deterministic evaluators
- Track and compare results in the Langfuse dashboard

```
┌─────────────────┐      ┌─────────────────┐      ┌─────────────────┐
│   1. Dataset    │      │  2. Experiment  │      │   3. Scoring    │
│   (Q&A pairs)   │ ───▶ │  (Run through   │ ───▶ │  (Evaluate vs   │
│                 │      │   Chat Agent)   │      │   Ground Truth) │
└─────────────────┘      └─────────────────┘      └─────────────────┘
```
```bash
# Clone the repo
git clone https://github.com/crowdbotics/langfuse-evals.git
cd langfuse-evals
# Install dependencies
pip install -r requirements.txt
# Or with uv
uv pip install -r requirements.txt
```

Set up your environment variables:

```bash
# Langfuse (required)
export LANGFUSE_PUBLIC_KEY=pk-lf-xxx
export LANGFUSE_SECRET_KEY=sk-lf-xxx
export LANGFUSE_HOST=https://cloud.langfuse.com # or your self-hosted instance
# OpenAI (for LLM judge evaluator)
export OPENAI_API_KEY=sk-xxx
# Your chat API (optional, can also pass via CLI)
export CHAT_API_BASE=http://localhost:8000
export CHAT_API_TOKEN=your-token
```
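The Langfuse SDK reads the `LANGFUSE_*` variables from the environment. As a quick sanity check before loading anything, a minimal snippet like the following should confirm the credentials work (this helper script is not part of the repo; it only uses the standard `langfuse` client):

```python
# check_langfuse.py (not part of this repo): quick credential sanity check.
from langfuse import Langfuse

# The client picks up LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY and LANGFUSE_HOST
# from the environment variables exported above.
langfuse = Langfuse()

if langfuse.auth_check():
    print("Langfuse credentials look good")
else:
    print("Langfuse auth failed: check your keys and host")
```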
Load a golden dataset into Langfuse:

```bash
# From JSON
python scripts/load_dataset.py dataset.json
# From markdown
python scripts/load_dataset.py \
  --from-markdown questions.md \
  --name "my-codebase-qa"
```

Run an evaluation against your chat API:

```bash
python scripts/run_evaluation.py \
  --dataset my-codebase-qa \
  --project-id 123 \
  --name "Baseline v1.0"
```

Then open the Langfuse dashboard → Datasets → select your dataset → View runs to compare results.

Dataset files use this JSON format:

```json
{
  "name": "my-codebase-qa",
  "description": "Golden Q&A for evaluating codebase understanding",
  "metadata": {
    "source": "evaluation-questions.md",
    "version": "1.0"
  },
  "items": [
    {
      "input": "What is the main entry point of the application?",
      "expected_output": "The main entry point is `src/main.py` which initializes the FastAPI app...",
      "metadata": {
        "category": "Basic",
        "question_id": 1,
        "max_points": 3
      }
    }
  ]
}
```
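For reference, pushing items like these into Langfuse only needs the SDK's dataset helpers. A rough sketch of what `load_dataset.py` does (the script's actual implementation may differ, for example around `--overwrite` handling):

```python
# Sketch only: the repo's load_dataset.py may differ in details.
import json
from langfuse import Langfuse

langfuse = Langfuse()

with open("dataset.json") as f:
    data = json.load(f)

# Create (or reuse) the dataset itself
langfuse.create_dataset(
    name=data["name"],
    description=data.get("description"),
    metadata=data.get("metadata"),
)

# Add each Q&A pair as a dataset item
for item in data["items"]:
    langfuse.create_dataset_item(
        dataset_name=data["name"],
        input=item["input"],
        expected_output=item["expected_output"],
        metadata=item.get("metadata"),
    )
```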
The `--from-markdown` option parses Q&A from this format:

```markdown
### Question 1: Entry Point
**Question:** What is the main entry point of the application?
**Ground Truth Answer:** The main entry point is `src/main.py` which initializes the FastAPI application and registers all routers.
---
### Question 2: Authentication
**Question:** How does the authentication system work?
**Ground Truth Answer:** Authentication uses JWT tokens validated by the auth middleware...
```
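Parsing that layout is mostly a matter of splitting on the question headings and pulling out the two bold fields. An illustrative sketch (the repo's parser likely handles more, such as categories or point values):

```python
# Illustrative parser for the markdown Q&A layout above (not the repo's code).
import re

def parse_questions(markdown: str) -> list[dict]:
    items = []
    # Split on "### Question N:" headings; drop anything before the first one
    blocks = re.split(r"^### Question \d+:", markdown, flags=re.MULTILINE)[1:]
    for i, block in enumerate(blocks, start=1):
        question = re.search(
            r"\*\*Question:\*\*\s*(.+?)\s*\*\*Ground Truth Answer:\*\*", block, re.DOTALL
        )
        answer = re.search(
            r"\*\*Ground Truth Answer:\*\*\s*(.+?)\s*(?:---\s*)?$", block, re.DOTALL
        )
        if question and answer:
            items.append({
                "input": question.group(1).strip(),
                "expected_output": answer.group(1).strip(),
                "metadata": {"question_id": i},
            })
    return items
```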
Each run applies these evaluators to every response:

| Evaluator | Type | Description |
|---|---|---|
| `expert_score` | 0-3 | LLM judge comparing response to ground truth |
| `keyword_coverage` | 0-1 | Technical term overlap with expected answer |
| `completeness` | 0-1 | Response length appropriateness |
| `answered` | bool | Did the agent provide a substantive answer |
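
To make the deterministic scores concrete, a `keyword_coverage`-style overlap can be computed roughly like this (illustrative only; the repo's evaluator may extract technical terms differently):

```python
# Illustrative 0-1 term-overlap score in the spirit of keyword_coverage (not the repo's code).
import re

STOPWORDS = {"the", "a", "an", "is", "are", "and", "or", "of", "to", "in", "which", "that"}

def keyword_coverage(output: str, expected_output: str) -> float:
    """Fraction of terms from the expected answer that appear in the response."""
    def terms(text: str) -> set:
        words = re.findall(r"[a-z_][\w./]*", text.lower())
        return {w for w in words if w not in STOPWORDS and len(w) > 2}

    expected = terms(expected_output)
    if not expected:
        return 1.0
    return len(expected & terms(output)) / len(expected)
```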

Each run also reports aggregate metrics:

| Metric | Description |
|---|---|
| `total_score` | Sum of expert scores |
| `average_score` | Mean expert score with rating (Excellent/Good/etc.) |
| `answer_rate` | Percentage of questions with substantive answers |
| `category_breakdown` | Scores grouped by category metadata |
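
These roll-ups are plain aggregations over the per-item scores. A sketch of the idea (the result field names here are assumptions for illustration, and the Excellent/Good rating bands are not reproduced):

```python
# Illustrative aggregation of per-item results into run-level metrics.
# The result dict shape ("expert_score", "answered", "category") is an assumption.
from collections import defaultdict

def summarize(results: list) -> dict:
    total = sum(r["expert_score"] for r in results)
    answered = sum(1 for r in results if r["answered"])

    by_category = defaultdict(list)
    for r in results:
        by_category[r.get("category", "Uncategorized")].append(r["expert_score"])

    return {
        "total_score": total,
        "average_score": total / len(results) if results else 0.0,
        "answer_rate": 100.0 * answered / len(results) if results else 0.0,
        "category_breakdown": {c: sum(s) / len(s) for c, s in by_category.items()},
    }
```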
The default scripts are configured for CoreStory's chat API. To use them with a different API, modify the `send_message` function in `scripts/run_evaluation.py`:

```python
async def send_message(
    client: httpx.AsyncClient,
    project_id: int,
    conversation_id: int,
    message: str,
    timeout: float = 180.0
) -> str:
    """Adapt this function to your chat API's interface."""
    response = await client.post(
        f"{API_BASE_URL}/your/chat/endpoint",
        json={"message": message},
        timeout=timeout
    )
    response.raise_for_status()
    # Parse your API's response format
    return response.json()["answer"]
```

To add a custom evaluator, define a function that returns a Langfuse `Evaluation`:

```python
from langfuse import Evaluation

def my_custom_evaluator(
    *,
    input: str,             # The question
    output: str,            # Agent's response
    expected_output: str,   # Ground truth answer
    metadata: dict,         # Item metadata
    **kwargs
) -> Evaluation:
    # Your evaluation logic
    score = calculate_score(output, expected_output)
    return Evaluation(
        name="my_metric",
        value=score,
        comment="Explanation of score",
        data_type="NUMERIC"  # NUMERIC, BOOLEAN, or CATEGORICAL
    )
```

Add your evaluator to the `evaluators` list in `run_evaluation.py`.
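For example, a hypothetical boolean evaluator that checks whether the response cites at least one file path, using the same `Evaluation` shape as above (not shipped with the repo):

```python
# Hypothetical example evaluator (not part of the repo).
import re
from langfuse import Evaluation

def cites_file_path(*, input: str, output: str, expected_output: str, metadata: dict, **kwargs) -> Evaluation:
    """BOOLEAN score: does the response mention something that looks like a file path?"""
    has_path = bool(re.search(r"\b[\w./-]+\.(py|cbl|cpy|jcl|md)\b", output or ""))
    return Evaluation(
        name="cites_file_path",
        value=has_path,
        comment="Mentions a file path" if has_path else "No file path cited",
        data_type="BOOLEAN",
    )
```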
Full CLI reference:

```
usage: load_dataset.py [-h] [--from-markdown FILE] [--name NAME]
                       [--description DESC] [--overwrite] [--output-json FILE]
                       [file]

Arguments:
  file                  JSON dataset file to load

Options:
  --from-markdown FILE  Parse Q&A from markdown instead of JSON
  --name NAME           Dataset name (required with --from-markdown)
  --description DESC    Dataset description
  --overwrite           Overwrite if dataset exists
  --output-json FILE    Save parsed dataset to JSON
```
```
usage: run_evaluation.py [-h] --dataset NAME --project-id ID --name NAME
                         [--api-token TOKEN] [--api-base URL]
                         [--concurrency N] [--timeout SECONDS]
                         [--skip-llm-judge] [--list-datasets]

Required:
  --dataset NAME        Langfuse dataset name
  --project-id ID       Target project ID
  --name NAME           Experiment name

Options:
  --api-token TOKEN     API authentication token
  --api-base URL        API base URL (default: http://localhost:8000)
  --concurrency N       Max concurrent requests (default: 3)
  --timeout SECONDS     Request timeout (default: 180)
  --skip-llm-judge      Skip LLM judge (faster, no OpenAI cost)
  --list-datasets       List available datasets and exit
```
This repo includes an example golden dataset for evaluating AI understanding of the AWS CardDemo COBOL codebase:

```bash
# Load the CardDemo Q&A dataset
python scripts/load_dataset.py \
  --from-markdown examples/carddemo-eval-questions.md \
  --name "carddemo-cobol-qa"

# Run the evaluation (assuming the codebase is ingested at project_id 42)
python scripts/run_evaluation.py \
  --dataset carddemo-cobol-qa \
  --project-id 42 \
  --name "CardDemo Baseline"
```

License: MIT