Skip to content

BrandonShar/eval-rpg

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

6 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

πŸ€– EvalRPG Community Edition

Evaluate your AI agents and MCP servers one character at a time.

EvalRPG Community Edition is an open-source conversational evaluation tool that tests AI systems through realistic multi-turn conversations rather than isolated prompts. Unlike traditional evaluation tools, EvalRPG simulates real conversations where agents must maintain context, use tools, and adapt to user needs over multiple turns.

Open Source Python 3.12+ Streamlit

Check out the introductory video on Youtube!

youtube-thumbnail

🎯 Why Conversational Evaluations?

Traditional evaluation tools test AI systems with single-shot prompts, but real AI agents work through conversations. They need to:

  • Understand context across multiple turns
  • Use tools and external services (MCP servers)
  • Handle ambiguous requests and ask clarifying questions
  • Maintain conversation flow and user experience

EvalRPG bridges that gap by creating realistic conversational scenarios with diverse personas that test your agent's true capabilities.

✨ Key Features

🎭 Dynamic Persona Generation

  • AI-generated diverse test personas
  • Customizable persona characteristics
  • Realistic conversation starters
  • Multiple personas per evaluation run

πŸ”§ MCP Server Integration

  • Connect to external tools and services
  • Test real-world agent capabilities
  • HTTP streaming support
  • Multiple server configurations

πŸ“Š Comprehensive Analytics

  • Success/failure tracking
  • Conversation length analysis
  • Duration metrics
  • Detailed conversation logs

βš™οΈ Flexible Configuration

  • Custom completion criteria
  • Adjustable conversation limits
  • Multiple runs for reliability
  • Various AI model support

πŸš€ Getting Started

Option 1: Use the Streamlit Hosted Version

The easiest way to get started is to use the hosted Streamlit version. Simply upload your configuration and start evaluating!

Try EvalRPG Community Edition β†’

Option 2: Local Installation

Prerequisites

  • Python 3.12 or higher
  • An OpenAI API key (or compatible API)

Installation

  1. Clone the repository:

    git clone <repository-url>
    cd conversational-evals
  2. Install dependencies using uv (recommended):

    uv sync
  3. Run the application:

    streamlit run main.py

πŸ”„ How It Works

1️⃣ Generate Personas

Create diverse conversational personas with unique backgrounds, goals, and communication styles based on your evaluation needs.

  • Custom persona prompts
  • Automatic diversity
  • Realistic opening messages

2️⃣ Configure Agent

Set up your AI agent with system instructions and connect MCP servers for tool access.

  • Custom system prompts
  • MCP server integration
  • Model selection

3️⃣ Run Evaluations

Execute multi-turn conversations and analyze results with detailed success metrics and conversation logs.

  • Completion criteria
  • Turn limits
  • Multiple runs per persona

πŸ“– Usage Guide

Setting Up Your First Evaluation

  1. Configure your AI model in the sidebar:

    • Add your API key
    • Select your model (GPT-4, Claude, etc.)
    • Set base URL if using a custom endpoint
  2. Generate personas:

    • Write a prompt describing the types of users you want to simulate
    • Specify how many personas to generate
    • Click "Generate" to create diverse test characters
  3. Configure evaluation settings:

    • Define completion criteria (what constitutes success)
    • Set maximum conversation turns
    • Choose number of runs per persona
  4. Add MCP servers (optional):

    • Configure external tools your agent can use
    • Set up HTTP streaming endpoints
    • Test tool integrations
  5. Run evaluations:

    • Click "Run Evaluation" to start testing
    • Monitor progress in real-time
    • Review detailed results and conversation logs

Example Persona Prompt

Generate personas for testing a customer support chatbot for an e-commerce platform. 
Include users with different technical skill levels, various types of issues 
(billing, shipping, returns), and different communication styles (direct, verbose, confused).

Example Completion Criteria

The conversation is successful if the agent:
1. Correctly identifies the user's issue
2. Provides a clear solution or next steps
3. Maintains a helpful and professional tone throughout
4. Uses appropriate tools when needed (e.g., order lookup, refund processing)

Supported Models

  • Any OpenAI-compatible API endpoint (OpenAI, Anthropic, LiteLLM)

MCP Server Configuration

EvalRPG supports Model Context Protocol (MCP) servers for tool integration. Configure servers in the sidebar to test your agent's ability to use external tools and services.

πŸ“Š Understanding Results

Metrics Tracked

  • Success Rate: Percentage of conversations meeting completion criteria
  • Average Turns: Mean number of conversation turns
  • Duration: Time taken per conversation
  • Persona Performance: Success rates by persona type

Conversation Analysis

  • Full conversation logs with timestamps
  • Turn-by-turn analysis
  • Tool usage tracking
  • Error identification and categorization

🀝 Contributing

EvalRPG Community Edition is open source and welcomes contributions! Whether you're fixing bugs, adding features, or improving documentation, we'd love your help.

Development Setup

git clone <repository-url>
cd conversational-evals
uv sync --dev

Running Tests

# Add test commands here when available

πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ™ Acknowledgments

πŸ“ž Support

  • Issues: Report bugs and request features on our GitHub Issues page
  • Discussions: Join the conversation in GitHub Discussions
  • Documentation: Check out our Wiki for detailed guides

EvalRPG Community Edition - Evaluate your agents and MCP servers one character at a time. πŸ€–βœ¨

About

Evaluate your agents and MCP servers one character at a time.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages