
Langfuse Evaluations

A toolkit for running Q&A evaluations against LLM-powered chat agents using Langfuse datasets and experiments.

Overview

This repo provides scripts to:

  1. Load golden Q&A datasets into Langfuse (from JSON or markdown)
  2. Run evaluation experiments against any chat API
  3. Score responses using LLM judges and deterministic evaluators
  4. Track and compare results in the Langfuse dashboard
┌─────────────────┐      ┌─────────────────┐      ┌─────────────────┐
│  1. Dataset     │      │  2. Experiment  │      │  3. Scoring     │
│  (Q&A pairs)    │ ───▶ │  (Run through   │ ───▶ │  (Evaluate vs   │
│                 │      │   Chat Agent)   │      │   Ground Truth) │
└─────────────────┘      └─────────────────┘      └─────────────────┘

Quick Start

Installation

# Clone the repo
git clone https://github.com/crowdbotics/langfuse-evals.git
cd langfuse-evals

# Install dependencies
pip install -r requirements.txt

# Or with uv
uv pip install -r requirements.txt

Configuration

Set up your environment variables:

# Langfuse (required)
export LANGFUSE_PUBLIC_KEY=pk-lf-xxx
export LANGFUSE_SECRET_KEY=sk-lf-xxx
export LANGFUSE_HOST=https://cloud.langfuse.com  # or your self-hosted instance

# OpenAI (for LLM judge evaluator)
export OPENAI_API_KEY=sk-xxx

# Your chat API (optional, can also pass via CLI)
export CHAT_API_BASE=http://localhost:8000
export CHAT_API_TOKEN=your-token
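
The Langfuse SDK picks these variables up automatically, so a quick way to sanity-check the credentials before loading any data is something like:

from langfuse import Langfuse

# Reads LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY and LANGFUSE_HOST from the environment
langfuse = Langfuse()

# auth_check() returns True when the keys and host are valid
assert langfuse.auth_check(), "Langfuse credentials are not configured correctly"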

Usage

1. Load a Q&A Dataset

# From JSON
python scripts/load_dataset.py dataset.json

# From markdown
python scripts/load_dataset.py \
    --from-markdown questions.md \
    --name "my-codebase-qa"

2. Run an Evaluation

python scripts/run_evaluation.py \
    --dataset my-codebase-qa \
    --project-id 123 \
    --name "Baseline v1.0"

3. View Results

Open the Langfuse dashboard → Datasets → select your dataset → View runs

Dataset Formats

JSON Format

{
  "name": "my-codebase-qa",
  "description": "Golden Q&A for evaluating codebase understanding",
  "metadata": {
    "source": "evaluation-questions.md",
    "version": "1.0"
  },
  "items": [
    {
      "input": "What is the main entry point of the application?",
      "expected_output": "The main entry point is `src/main.py` which initializes the FastAPI app...",
      "metadata": {
        "category": "Basic",
        "question_id": 1,
        "max_points": 3
      }
    }
  ]
}
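
For reference, a file like this maps onto the standard Langfuse SDK dataset calls roughly as follows (a simplified sketch, not the exact contents of load_dataset.py):

import json
from langfuse import Langfuse

langfuse = Langfuse()

with open("dataset.json") as f:
    data = json.load(f)

# One dataset per file...
langfuse.create_dataset(
    name=data["name"],
    description=data.get("description"),
    metadata=data.get("metadata"),
)

# ...and one dataset item per entry in "items"
for item in data["items"]:
    langfuse.create_dataset_item(
        dataset_name=data["name"],
        input=item["input"],
        expected_output=item.get("expected_output"),
        metadata=item.get("metadata"),
    )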

Markdown Format

The --from-markdown option parses Q&A from this format:

### Question 1: Entry Point
**Question:** What is the main entry point of the application?

**Ground Truth Answer:** The main entry point is `src/main.py` which initializes the FastAPI application and registers all routers.

---

### Question 2: Authentication
**Question:** How does the authentication system work?

**Ground Truth Answer:** Authentication uses JWT tokens validated by the auth middleware...
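
A minimal parser for this layout could look like the sketch below (illustrative only; the actual parsing logic lives in scripts/load_dataset.py):

import re

def parse_markdown_qa(text: str) -> list[dict]:
    """Split on '---' separators and pull out question / ground-truth pairs."""
    items = []
    for block in text.split("\n---\n"):
        question = re.search(r"\*\*Question:\*\*\s*(.+?)(?:\n\s*\n|\Z)", block, re.S)
        answer = re.search(r"\*\*Ground Truth Answer:\*\*\s*(.+)", block, re.S)
        if question and answer:
            items.append({
                "input": question.group(1).strip(),
                "expected_output": answer.group(1).strip(),
            })
    return items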

Evaluators

Per-Item Evaluators

Evaluator          Type   Description
expert_score       0-3    LLM judge comparing the response to the ground truth
keyword_coverage   0-1    Technical term overlap with the expected answer
completeness       0-1    Response length appropriateness
answered           bool   Did the agent provide a substantive answer?
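
As a rough illustration of the deterministic evaluators, keyword_coverage amounts to measuring term overlap between the agent's response and the ground truth. A simplified sketch (not the exact implementation):

from langfuse import Evaluation

def keyword_coverage(*, output: str, expected_output: str, **kwargs) -> Evaluation:
    # Treat every word of four or more characters in the ground truth as a "technical term"
    terms = {w.lower() for w in expected_output.split() if len(w) >= 4}
    hits = sum(1 for t in terms if t in output.lower())
    coverage = hits / len(terms) if terms else 0.0
    return Evaluation(name="keyword_coverage", value=coverage, data_type="NUMERIC")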

Run-Level Aggregates

Metric              Description
total_score         Sum of expert scores
average_score       Mean expert score with rating (Excellent/Good/etc.)
answer_rate         Percentage of questions with substantive answers
category_breakdown  Scores grouped by category metadata
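
Conceptually these are plain aggregates over the per-item results, along the lines of (illustrative only):

def summarize(expert_scores: list[float], answered_flags: list[bool]) -> dict:
    # Total/average over the 0-3 expert scores, answer rate as a percentage
    return {
        "total_score": sum(expert_scores),
        "average_score": sum(expert_scores) / len(expert_scores),
        "answer_rate": 100 * sum(answered_flags) / len(answered_flags),
    }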

Customizing for Your Chat API

The default scripts are configured for CoreStory's chat API. To use with a different API, modify the send_message function in scripts/run_evaluation.py:

async def send_message(
    client: httpx.AsyncClient,
    project_id: int,
    conversation_id: int,
    message: str,
    timeout: float = 180.0
) -> str:
    """Adapt this function to your chat API's interface."""
    response = await client.post(
        f"{API_BASE_URL}/your/chat/endpoint",
        json={"message": message},
        timeout=timeout
    )
    response.raise_for_status()

    # Parse your API's response format
    return response.json()["answer"]
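
If your API expects bearer authentication, the token from CHAT_API_TOKEN can be attached once when the client is constructed rather than on every request (illustrative sketch):

import os
import httpx

client = httpx.AsyncClient(
    base_url=os.environ["CHAT_API_BASE"],
    headers={"Authorization": f"Bearer {os.environ['CHAT_API_TOKEN']}"},
)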

Writing Custom Evaluators

from langfuse import Evaluation

def my_custom_evaluator(
    *,
    input: str,           # The question
    output: str,          # Agent's response
    expected_output: str, # Ground truth answer
    metadata: dict,       # Item metadata
    **kwargs
) -> Evaluation:
    # Your evaluation logic
    score = calculate_score(output, expected_output)

    return Evaluation(
        name="my_metric",
        value=score,
        comment="Explanation of score",
        data_type="NUMERIC"  # NUMERIC, BOOLEAN, or CATEGORICAL
    )

Add your evaluator to the evaluators list in run_evaluation.py.
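
The wiring itself is just a list of evaluator callables; it might look something like this (variable names here are illustrative and may differ from the actual script):

evaluators = [
    expert_score,
    keyword_coverage,
    completeness,
    answered,
    my_custom_evaluator,
]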

CLI Reference

load_dataset.py

usage: load_dataset.py [-h] [--from-markdown FILE] [--name NAME]
                       [--description DESC] [--overwrite] [--output-json FILE]
                       [file]

Arguments:
  file                  JSON dataset file to load

Options:
  --from-markdown FILE  Parse Q&A from markdown instead of JSON
  --name NAME           Dataset name (required with --from-markdown)
  --description DESC    Dataset description
  --overwrite           Overwrite if dataset exists
  --output-json FILE    Save parsed dataset to JSON

run_evaluation.py

usage: run_evaluation.py [-h] --dataset NAME --project-id ID --name NAME
                         [--api-token TOKEN] [--api-base URL]
                         [--concurrency N] [--timeout SECONDS]
                         [--skip-llm-judge] [--list-datasets]

Required:
  --dataset NAME        Langfuse dataset name
  --project-id ID       Target project ID
  --name NAME           Experiment name

Options:
  --api-token TOKEN     API authentication token
  --api-base URL        API base URL (default: http://localhost:8000)
  --concurrency N       Max concurrent requests (default: 3)
  --timeout SECONDS     Request timeout (default: 180)
  --skip-llm-judge      Skip LLM judge (faster, no OpenAI cost)
  --list-datasets       List available datasets and exit

Example: AWS CardDemo COBOL Evaluation

This repo includes an example golden dataset for evaluating AI understanding of the AWS CardDemo COBOL codebase:

# Load the CardDemo Q&A dataset
python scripts/load_dataset.py \
    --from-markdown examples/carddemo-eval-questions.md \
    --name "carddemo-cobol-qa"

# Run evaluation (assuming codebase is ingested at project_id 42)
python scripts/run_evaluation.py \
    --dataset carddemo-cobol-qa \
    --project-id 42 \
    --name "CardDemo Baseline"

License

MIT
