🤖 EvalRPG Community Edition

Evaluate your AI agents and MCP servers one character at a time.

EvalRPG Community Edition is an open-source conversational evaluation tool that tests AI systems through realistic multi-turn conversations rather than isolated prompts. Unlike traditional evaluation tools, EvalRPG simulates real conversations where agents must maintain context, use tools, and adapt to user needs over multiple turns.

Check out the introductory video on Youtube!

🎯 Why Conversational Evaluations?

Traditional evaluation tools test AI systems with single-shot prompts, but real AI agents work through conversations. They need to:

Understand context across multiple turns
Use tools and external services (MCP servers)
Handle ambiguous requests and ask clarifying questions
Maintain conversation flow and user experience

EvalRPG bridges that gap by creating realistic conversational scenarios with diverse personas that test your agent's true capabilities.

✨ Key Features

🎭 Dynamic Persona Generation

AI-generated diverse test personas
Customizable persona characteristics
Realistic conversation starters
Multiple personas per evaluation run

🔧 MCP Server Integration

Connect to external tools and services
Test real-world agent capabilities
HTTP streaming support
Multiple server configurations

📊 Comprehensive Analytics

Success/failure tracking
Conversation length analysis
Duration metrics
Detailed conversation logs

⚙️ Flexible Configuration

Custom completion criteria
Adjustable conversation limits
Multiple runs for reliability
Various AI model support

🚀 Getting Started

Option 1: Use the Streamlit Hosted Version

The easiest way to get started is to use the hosted Streamlit version. Simply upload your configuration and start evaluating!

Try EvalRPG Community Edition →

Option 2: Local Installation

Prerequisites

Python 3.12 or higher
An OpenAI API key (or compatible API)

Installation

Clone the repository:

git clone <repository-url>
cd conversational-evals

Install dependencies using uv (recommended):
```
uv sync
```
Run the application:
```
streamlit run main.py
```

🔄 How It Works

1️⃣ Generate Personas

Create diverse conversational personas with unique backgrounds, goals, and communication styles based on your evaluation needs.

Custom persona prompts
Automatic diversity
Realistic opening messages

2️⃣ Configure Agent

Set up your AI agent with system instructions and connect MCP servers for tool access.

Custom system prompts
MCP server integration
Model selection

3️⃣ Run Evaluations

Execute multi-turn conversations and analyze results with detailed success metrics and conversation logs.

Completion criteria
Turn limits
Multiple runs per persona

📖 Usage Guide

Setting Up Your First Evaluation

Configure your AI model in the sidebar:
- Add your API key
- Select your model (GPT-4, Claude, etc.)
- Set base URL if using a custom endpoint
Generate personas:
- Write a prompt describing the types of users you want to simulate
- Specify how many personas to generate
- Click "Generate" to create diverse test characters
Configure evaluation settings:
- Define completion criteria (what constitutes success)
- Set maximum conversation turns
- Choose number of runs per persona
Add MCP servers (optional):
- Configure external tools your agent can use
- Set up HTTP streaming endpoints
- Test tool integrations
Run evaluations:
- Click "Run Evaluation" to start testing
- Monitor progress in real-time
- Review detailed results and conversation logs

Example Persona Prompt

Generate personas for testing a customer support chatbot for an e-commerce platform. 
Include users with different technical skill levels, various types of issues 
(billing, shipping, returns), and different communication styles (direct, verbose, confused).

Example Completion Criteria

The conversation is successful if the agent:
1. Correctly identifies the user's issue
2. Provides a clear solution or next steps
3. Maintains a helpful and professional tone throughout
4. Uses appropriate tools when needed (e.g., order lookup, refund processing)

Supported Models

Any OpenAI-compatible API endpoint (OpenAI, Anthropic, LiteLLM)

MCP Server Configuration

EvalRPG supports Model Context Protocol (MCP) servers for tool integration. Configure servers in the sidebar to test your agent's ability to use external tools and services.

📊 Understanding Results

Metrics Tracked

Success Rate: Percentage of conversations meeting completion criteria
Average Turns: Mean number of conversation turns
Duration: Time taken per conversation
Persona Performance: Success rates by persona type

Conversation Analysis

Full conversation logs with timestamps
Turn-by-turn analysis
Tool usage tracking
Error identification and categorization

🤝 Contributing

EvalRPG Community Edition is open source and welcomes contributions! Whether you're fixing bugs, adding features, or improving documentation, we'd love your help.

Development Setup

git clone <repository-url>
cd conversational-evals
uv sync --dev

Running Tests

# Add test commands here when available

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

Built with Streamlit for the web interface
Powered by OpenAI Agents for MCP integration
Uses Pydantic for data validation

📞 Support

Issues: Report bugs and request features on our GitHub Issues page
Discussions: Join the conversation in GitHub Discussions
Documentation: Check out our Wiki for detailed guides

EvalRPG Community Edition - Evaluate your agents and MCP servers one character at a time. 🤖✨

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
pages		pages
.gitignore		.gitignore
.python-version		.python-version
LICENSE		LICENSE
README.md		README.md
main.py		main.py
pages.py		pages.py
prompts.py		prompts.py
pyproject.toml		pyproject.toml
schemas.py		schemas.py
sidebar.py		sidebar.py
state.py		state.py
utils.py		utils.py
uv.lock		uv.lock

License

BrandonShar/eval-rpg

Folders and files

Latest commit

History

Repository files navigation