A toolkit for running Q&A evaluations against LLM-powered chat agents using Langfuse datasets and experiments.
This repo provides scripts to:
- Load golden Q&A datasets into Langfuse (from JSON or markdown)
- Run evaluation experiments against any chat API
- Score responses using LLM judges and deterministic evaluators
- Track and compare results in the Langfuse dashboard

```
┌─────────────────┐      ┌─────────────────┐      ┌─────────────────┐
│   1. Dataset    │      │  2. Experiment  │      │   3. Scoring    │
│   (Q&A pairs)   │ ───▶ │  (Run through   │ ───▶ │  (Evaluate vs   │
│                 │      │   Chat Agent)   │      │   Ground Truth) │
└─────────────────┘      └─────────────────┘      └─────────────────┘
```
```bash
# Clone the repo
git clone https://github.com/crowdbotics/langfuse-evals.git
cd langfuse-evals
# Install dependencies
pip install -r requirements.txt
# Or with uv
uv pip install -r requirements.txt
```

Set up your environment variables:

```bash
# Langfuse (required)
export LANGFUSE_PUBLIC_KEY=pk-lf-xxx
export LANGFUSE_SECRET_KEY=sk-lf-xxx
export LANGFUSE_HOST=https://cloud.langfuse.com # or your self-hosted instance
# OpenAI (for LLM judge evaluator)
export OPENAI_API_KEY=sk-xxx
# Your chat API (optional, can also pass via CLI)
export CHAT_API_BASE=http://localhost:8000
export CHAT_API_TOKEN=your-token
```
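The Langfuse SDK reads the `LANGFUSE_*` variables from the environment. As a quick sanity check before loading anything, a minimal snippet like the following should confirm the credentials work (this helper script is not part of the repo; it only uses the standard `langfuse` client):

```python
# check_langfuse.py (not part of this repo): quick credential sanity check.
from langfuse import Langfuse

# The client picks up LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY and LANGFUSE_HOST
# from the environment variables exported above.
langfuse = Langfuse()

if langfuse.auth_check():
    print("Langfuse credentials look good")
else:
    print("Langfuse auth failed: check your keys and host")
```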
Load a golden dataset into Langfuse:

```bash
# From JSON
python scripts/load_dataset.py dataset.json
# From markdown
python scripts/load_dataset.py \
  --from-markdown questions.md \
  --name "my-codebase-qa"
```

Run an evaluation against your chat API:

```bash
python scripts/run_evaluation.py \
  --dataset my-codebase-qa \
  --project-id 123 \
  --name "Baseline v1.0"
```

Then open the Langfuse dashboard → Datasets → select your dataset → View runs to compare results.

Dataset files use this JSON format:

```json
{
  "name": "my-codebase-qa",
  "description": "Golden Q&A for evaluating codebase understanding",
  "metadata": {
    "source": "evaluation-questions.md",
    "version": "1.0"
  },
  "items": [
    {
      "input": "What is the main entry point of the application?",
      "expected_output": "The main entry point is `src/main.py` which initializes the FastAPI app...",
      "metadata": {
        "category": "Basic",
        "question_id": 1,
        "max_points": 3
      }
    }
  ]
}
```
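For reference, pushing items like these into Langfuse only needs the SDK's dataset helpers. A rough sketch of what `load_dataset.py` does (the script's actual implementation may differ, for example around `--overwrite` handling):

```python
# Sketch only: the repo's load_dataset.py may differ in details.
import json
from langfuse import Langfuse

langfuse = Langfuse()

with open("dataset.json") as f:
    data = json.load(f)

# Create (or reuse) the dataset itself
langfuse.create_dataset(
    name=data["name"],
    description=data.get("description"),
    metadata=data.get("metadata"),
)

# Add each Q&A pair as a dataset item
for item in data["items"]:
    langfuse.create_dataset_item(
        dataset_name=data["name"],
        input=item["input"],
        expected_output=item["expected_output"],
        metadata=item.get("metadata"),
    )
```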
The `--from-markdown` option parses Q&A from this format:

```markdown
### Question 1: Entry Point
**Question:** What is the main entry point of the application?
**Ground Truth Answer:** The main entry point is `src/main.py` which initializes the FastAPI application and registers all routers.
---
### Question 2: Authentication
**Question:** How does the authentication system work?
**Ground Truth Answer:** Authentication uses JWT tokens validated by the auth middleware...
```
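Parsing that layout is mostly a matter of splitting on the question headings and pulling out the two bold fields. An illustrative sketch (the repo's parser likely handles more, such as categories or point values):

```python
# Illustrative parser for the markdown Q&A layout above (not the repo's code).
import re

def parse_questions(markdown: str) -> list[dict]:
    items = []
    # Split on "### Question N:" headings; drop anything before the first one
    blocks = re.split(r"^### Question \d+:", markdown, flags=re.MULTILINE)[1:]
    for i, block in enumerate(blocks, start=1):
        question = re.search(
            r"\*\*Question:\*\*\s*(.+?)\s*\*\*Ground Truth Answer:\*\*", block, re.DOTALL
        )
        answer = re.search(
            r"\*\*Ground Truth Answer:\*\*\s*(.+?)\s*(?:---\s*)?$", block, re.DOTALL
        )
        if question and answer:
            items.append({
                "input": question.group(1).strip(),
                "expected_output": answer.group(1).strip(),
                "metadata": {"question_id": i},
            })
    return items
```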
Each run applies these evaluators to every response:

| Evaluator | Type | Description |
|---|---|---|
| `expert_score` | 0-3 | LLM judge comparing response to ground truth |
| `keyword_coverage` | 0-1 | Technical term overlap with expected answer |
| `completeness` | 0-1 | Response length appropriateness |
| `answered` | bool | Did the agent provide a substantive answer |
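
To make the deterministic scores concrete, a `keyword_coverage`-style overlap can be computed roughly like this (illustrative only; the repo's evaluator may extract technical terms differently):

```python
# Illustrative 0-1 term-overlap score in the spirit of keyword_coverage (not the repo's code).
import re

STOPWORDS = {"the", "a", "an", "is", "are", "and", "or", "of", "to", "in", "which", "that"}

def keyword_coverage(output: str, expected_output: str) -> float:
    """Fraction of terms from the expected answer that appear in the response."""
    def terms(text: str) -> set:
        words = re.findall(r"[a-z_][\w./]*", text.lower())
        return {w for w in words if w not in STOPWORDS and len(w) > 2}

    expected = terms(expected_output)
    if not expected:
        return 1.0
    return len(expected & terms(output)) / len(expected)
```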

Each run also reports aggregate metrics:

| Metric | Description |
|---|---|
| `total_score` | Sum of expert scores |
| `average_score` | Mean expert score with rating (Excellent/Good/etc.) |
| `answer_rate` | Percentage of questions with substantive answers |
| `category_breakdown` | Scores grouped by category metadata |
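
These roll-ups are plain aggregations over the per-item scores. A sketch of the idea (the result field names here are assumptions for illustration, and the Excellent/Good rating bands are not reproduced):

```python
# Illustrative aggregation of per-item results into run-level metrics.
# The result dict shape ("expert_score", "answered", "category") is an assumption.
from collections import defaultdict

def summarize(results: list) -> dict:
    total = sum(r["expert_score"] for r in results)
    answered = sum(1 for r in results if r["answered"])

    by_category = defaultdict(list)
    for r in results:
        by_category[r.get("category", "Uncategorized")].append(r["expert_score"])

    return {
        "total_score": total,
        "average_score": total / len(results) if results else 0.0,
        "answer_rate": 100.0 * answered / len(results) if results else 0.0,
        "category_breakdown": {c: sum(s) / len(s) for c, s in by_category.items()},
    }
```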
The default scripts are configured for CoreStory's chat API. To use them with a different API, modify the `send_message` function in `scripts/run_evaluation.py`:

```python
async def send_message(
    client: httpx.AsyncClient,
    project_id: int,
    conversation_id: int,
    message: str,
    timeout: float = 180.0
) -> str:
    """Adapt this function to your chat API's interface."""
    response = await client.post(
        f"{API_BASE_URL}/your/chat/endpoint",
        json={"message": message},
        timeout=timeout
    )
    response.raise_for_status()
    # Parse your API's response format
    return response.json()["answer"]
```

To add a custom evaluator, define a function that returns a Langfuse `Evaluation`:

```python
from langfuse import Evaluation

def my_custom_evaluator(
    *,
    input: str,             # The question
    output: str,            # Agent's response
    expected_output: str,   # Ground truth answer
    metadata: dict,         # Item metadata
    **kwargs
) -> Evaluation:
    # Your evaluation logic
    score = calculate_score(output, expected_output)
    return Evaluation(
        name="my_metric",
        value=score,
        comment="Explanation of score",
        data_type="NUMERIC"  # NUMERIC, BOOLEAN, or CATEGORICAL
    )
```

Add your evaluator to the `evaluators` list in `run_evaluation.py`.
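For example, a hypothetical boolean evaluator that checks whether the response cites at least one file path, using the same `Evaluation` shape as above (not shipped with the repo):

```python
# Hypothetical example evaluator (not part of the repo).
import re
from langfuse import Evaluation

def cites_file_path(*, input: str, output: str, expected_output: str, metadata: dict, **kwargs) -> Evaluation:
    """BOOLEAN score: does the response mention something that looks like a file path?"""
    has_path = bool(re.search(r"\b[\w./-]+\.(py|cbl|cpy|jcl|md)\b", output or ""))
    return Evaluation(
        name="cites_file_path",
        value=has_path,
        comment="Mentions a file path" if has_path else "No file path cited",
        data_type="BOOLEAN",
    )
```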
Full CLI reference:

```
usage: load_dataset.py [-h] [--from-markdown FILE] [--name NAME]
                       [--description DESC] [--overwrite] [--output-json FILE]
                       [file]

Arguments:
  file                  JSON dataset file to load

Options:
  --from-markdown FILE  Parse Q&A from markdown instead of JSON
  --name NAME           Dataset name (required with --from-markdown)
  --description DESC    Dataset description
  --overwrite           Overwrite if dataset exists
  --output-json FILE    Save parsed dataset to JSON
```
```
usage: run_evaluation.py [-h] --dataset NAME --project-id ID --name NAME
                         [--api-token TOKEN] [--api-base URL]
                         [--concurrency N] [--timeout SECONDS]
                         [--skip-llm-judge] [--list-datasets]

Required:
  --dataset NAME        Langfuse dataset name
  --project-id ID       Target project ID
  --name NAME           Experiment name

Options:
  --api-token TOKEN     API authentication token
  --api-base URL        API base URL (default: http://localhost:8000)
  --concurrency N       Max concurrent requests (default: 3)
  --timeout SECONDS     Request timeout (default: 180)
  --skip-llm-judge      Skip LLM judge (faster, no OpenAI cost)
  --list-datasets       List available datasets and exit
```
This repo includes an example golden dataset for evaluating AI understanding of the AWS CardDemo COBOL codebase:

```bash
# Load the CardDemo Q&A dataset
python scripts/load_dataset.py \
  --from-markdown examples/carddemo-eval-questions.md \
  --name "carddemo-cobol-qa"

# Run the evaluation (assuming the codebase is ingested at project_id 42)
python scripts/run_evaluation.py \
  --dataset carddemo-cobol-qa \
  --project-id 42 \
  --name "CardDemo Baseline"
```

License: MIT