Evaluate your AI agents and MCP servers one character at a time.
EvalRPG Community Edition is an open-source conversational evaluation tool that tests AI systems through realistic multi-turn conversations rather than isolated prompts. Unlike traditional evaluation tools, EvalRPG simulates real conversations where agents must maintain context, use tools, and adapt to user needs over multiple turns.
Check out the introductory video on YouTube!
Traditional evaluation tools test AI systems with single-shot prompts, but real AI agents work through conversations. They need to:
- Understand context across multiple turns
- Use tools and external services (MCP servers)
- Handle ambiguous requests and ask clarifying questions
- Maintain conversation flow and user experience
EvalRPG bridges that gap by creating realistic conversational scenarios with diverse personas that test your agent's true capabilities.
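Conceptually, each evaluation is a loop in which a persona model plays the user and a judge checks the completion criteria after every agent turn. A minimal sketch, assuming hypothetical `persona`, `agent`, and `judge` objects (these names are illustrative, not EvalRPG's actual API):

```python
# Conceptual sketch of one multi-turn evaluation run. The persona, agent, and
# judge objects and their methods are illustrative, not EvalRPG's actual API.
def run_evaluation(persona, agent, judge, max_turns=10):
    history = [{"role": "user", "content": persona.opening_message}]
    for _ in range(max_turns):
        reply = agent.respond(history)            # agent turn (may call MCP tools)
        history.append({"role": "assistant", "content": reply})
        if judge.meets_criteria(history):         # completion criteria satisfied?
            return {"success": True, "turns": len(history) // 2, "log": history}
        followup = persona.next_message(history)  # simulated user takes a turn
        history.append({"role": "user", "content": followup})
    return {"success": False, "turns": max_turns, "log": history}
```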
**Persona Generation**

- AI-generated diverse test personas
- Customizable persona characteristics
- Realistic conversation starters
- Multiple personas per evaluation run

**MCP Server Integration**

- Connect to external tools and services
- Test real-world agent capabilities
- HTTP streaming support
- Multiple server configurations

**Evaluation Metrics**

- Success/failure tracking
- Conversation length analysis
- Duration metrics
- Detailed conversation logs

**Flexible Configuration**

- Custom completion criteria
- Adjustable conversation limits
- Multiple runs for reliability
- Support for a variety of AI models
The easiest way to get started is to use the hosted Streamlit version. Simply upload your configuration and start evaluating!
Try EvalRPG Community Edition →
- Python 3.12 or higher
- An OpenAI API key (or compatible API)
1. Clone the repository:

   ```bash
   git clone <repository-url>
   cd conversational-evals
   ```

2. Install dependencies using uv (recommended):

   ```bash
   uv sync
   ```

3. Run the application:

   ```bash
   streamlit run main.py
   ```
1. **Generate Personas**: Create diverse conversational personas with unique backgrounds, goals, and communication styles based on your evaluation needs.
   - Custom persona prompts
   - Automatic diversity
   - Realistic opening messages
2. **Configure Your Agent**: Set up your AI agent with system instructions and connect MCP servers for tool access (see the configuration sketch after this list).
   - Custom system prompts
   - MCP server integration
   - Model selection
3. **Run & Analyze**: Execute multi-turn conversations and analyze results with detailed success metrics and conversation logs.
   - Completion criteria
   - Turn limits
   - Multiple runs per persona
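Under the hood, the project validates its data with Pydantic (see the acknowledgments below), so a configuration like the one steps 2 and 3 describe can be pictured roughly as follows; every field name here is hypothetical, not the project's real schema:

```python
# Hypothetical configuration model using Pydantic (which the project uses for
# validation); the field names are illustrative, not the actual schema.
from pydantic import BaseModel, Field


class EvalConfig(BaseModel):
    system_prompt: str                      # agent's system instructions
    model: str = "gpt-4o"                   # model identifier
    completion_criteria: str                # what counts as a successful run
    max_turns: int = Field(10, ge=1)        # hard cap on conversation length
    runs_per_persona: int = Field(1, ge=1)  # repeat runs for reliability
    mcp_servers: list[str] = []             # streamable-HTTP MCP endpoints


config = EvalConfig(
    system_prompt="You are a helpful e-commerce support agent.",
    completion_criteria="The user's issue is identified and resolved.",
)
```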
1. Configure your AI model in the sidebar:
   - Add your API key
   - Select your model (GPT-4, Claude, etc.)
   - Set a base URL if using a custom endpoint
2. Generate personas:
   - Write a prompt describing the types of users you want to simulate
   - Specify how many personas to generate
   - Click "Generate" to create diverse test characters
3. Configure evaluation settings:
   - Define completion criteria (what constitutes success)
   - Set the maximum number of conversation turns
   - Choose the number of runs per persona
4. Add MCP servers (optional):
   - Configure external tools your agent can use
   - Set up HTTP streaming endpoints
   - Test tool integrations
5. Run evaluations:
   - Click "Run Evaluation" to start testing
   - Monitor progress in real time
   - Review detailed results and conversation logs
Example persona prompt:

```
Generate personas for testing a customer support chatbot for an e-commerce platform.
Include users with different technical skill levels, various types of issues
(billing, shipping, returns), and different communication styles (direct, verbose, confused).
```
Example completion criteria:

```
The conversation is successful if the agent:
1. Correctly identifies the user's issue
2. Provides a clear solution or next steps
3. Maintains a helpful and professional tone throughout
4. Uses appropriate tools when needed (e.g., order lookup, refund processing)
```
EvalRPG works with any OpenAI-compatible API endpoint (OpenAI, Anthropic, LiteLLM).
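Pointing the evaluator at a custom endpoint only requires a base URL. A minimal sketch with the official `openai` client; the URL, key, and model name below are placeholders:

```python
# Minimal sketch of calling an OpenAI-compatible endpoint (e.g., a LiteLLM
# proxy). The base URL, API key, and model name are placeholders.
from openai import OpenAI

client = OpenAI(
    api_key="sk-...",                     # your API key
    base_url="http://localhost:4000/v1",  # custom OpenAI-compatible endpoint
)
resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```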
EvalRPG supports Model Context Protocol (MCP) servers for tool integration. Configure servers in the sidebar to test your agent's ability to use external tools and services.
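For orientation, here is a minimal sketch of wiring a streamable-HTTP MCP server into an agent with the openai-agents SDK (which this project builds on); the endpoint URL and agent details are placeholders, and helper names may differ between SDK versions:

```python
# Sketch: attach a streamable-HTTP MCP server to an agent via the
# openai-agents SDK. The URL and agent details are placeholders.
import asyncio

from agents import Agent, Runner
from agents.mcp import MCPServerStreamableHttp


async def main():
    async with MCPServerStreamableHttp(
        params={"url": "http://localhost:8000/mcp"},  # hypothetical endpoint
    ) as server:
        agent = Agent(
            name="support-agent",
            instructions="You are a helpful e-commerce support agent.",
            mcp_servers=[server],
        )
        result = await Runner.run(agent, "Where is my order?")
        print(result.final_output)


asyncio.run(main())
```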
- Success Rate: Percentage of conversations meeting completion criteria
- Average Turns: Mean number of conversation turns
- Duration: Time taken per conversation
- Persona Performance: Success rates by persona type
- Full conversation logs with timestamps
- Turn-by-turn analysis
- Tool usage tracking
- Error identification and categorization
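These summary numbers are easy to recompute from exported run records, e.g. for a custom dashboard. A small sketch; the record fields below are illustrative, not EvalRPG's actual export schema:

```python
# Recompute summary metrics from run records; the field names are
# illustrative, not EvalRPG's actual export schema.
from statistics import mean

runs = [
    {"persona": "confused-shopper", "success": True, "turns": 6, "seconds": 41.2},
    {"persona": "power-user", "success": False, "turns": 10, "seconds": 88.5},
    {"persona": "power-user", "success": True, "turns": 4, "seconds": 29.8},
]

print(f"Success rate: {sum(r['success'] for r in runs) / len(runs):.0%}")
print(f"Average turns: {mean(r['turns'] for r in runs):.1f}")
print(f"Average duration: {mean(r['seconds'] for r in runs):.1f}s")

# Per-persona success rates
for p in sorted({r["persona"] for r in runs}):
    sub = [r for r in runs if r["persona"] == p]
    print(f"{p}: {sum(r['success'] for r in sub) / len(sub):.0%}")
```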
EvalRPG Community Edition is open source and welcomes contributions! Whether you're fixing bugs, adding features, or improving documentation, we'd love your help.
```bash
git clone <repository-url>
cd conversational-evals
uv sync --dev

# Add test commands here when available
```

This project is licensed under the MIT License - see the LICENSE file for details.
- Built with Streamlit for the web interface
- Powered by OpenAI Agents for MCP integration
- Uses Pydantic for data validation
- Issues: Report bugs and request features on our GitHub Issues page
- Discussions: Join the conversation in GitHub Discussions
- Documentation: Check out our Wiki for detailed guides
EvalRPG Community Edition - Evaluate your agents and MCP servers one character at a time. 🤖✨
