base: literature-eval-enhancements
Add custom GEval support with criteria, rubric, and evaluation_steps #37
Conversation
- Add `MetricConfig` class with support for custom criteria, rubric, and evaluation_steps
- Add `RubricItem` class for structured rubric definitions
- Implement `make_custom_geval()` to create GEval instances from `MetricConfig`
- Add a `model_validator` to enforce mutual exclusivity of criteria/evaluation_steps
- Update `EvalCase` to accept both string and `MetricConfig` metric specifications
- Fix `make_geval()` bug: remove the duplicate criteria parameter (keep evaluation_steps only)
- Improve error handling in the Claude coder (warning instead of exception)
- Add a comprehensive test suite for `MetricConfig` validation

Addresses limitations of generic evaluation criteria, which are not suitable for:

- Exact text extraction tasks
- Specific metadata field retrieval
- Binary decision evaluations

The custom GEval feature allows test-specific evaluation criteria for better accuracy and reliability. According to the DeepEval documentation, evaluation_steps provides more control than criteria.

Tests: added `tests/test_custom_geval.py` with 7 test cases covering all validation scenarios.
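To make the validation rule concrete, here is a minimal sketch assuming pydantic v2 and the field names listed above; the real classes live in `eval_model.py` and may differ in detail:

```python
# Hypothetical sketch of the MetricConfig / RubricItem classes described above.
# Field names come from the PR description; everything else is an assumption.
from typing import Optional

from pydantic import BaseModel, model_validator


class RubricItem(BaseModel):
    """One rubric entry: the score awarded and the criteria that earn it."""
    score: float
    criteria: str


class MetricConfig(BaseModel):
    """Custom GEval configuration for a single metric."""
    name: str
    criteria: Optional[str] = None
    evaluation_steps: Optional[list[str]] = None
    rubric: Optional[list[RubricItem]] = None

    @model_validator(mode="after")
    def check_exclusivity(self) -> "MetricConfig":
        # criteria and evaluation_steps are mutually exclusive
        if self.criteria and self.evaluation_steps:
            raise ValueError("Provide either criteria or evaluation_steps, not both")
        # at least one way of describing the evaluation must be given
        if not (self.criteria or self.evaluation_steps or self.rubric):
            raise ValueError("Provide criteria, evaluation_steps, or a rubric")
        return self
```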
Claude finished @justaddcoffee's task

Code Review Complete ✅

I've thoroughly reviewed the custom GEval enhancements PR. Overall, this is a well-designed, high-quality implementation that significantly improves the evaluation framework's flexibility. Here's my detailed analysis:

Strengths
- Excellent architecture and design
- Robust input validation
- Comprehensive test coverage
- Important bug fixes
Claude finished @justaddcoffee's task

Overall Assessment: APPROVE ✅

This is a well-designed, high-quality implementation that significantly improves the evaluation framework's flexibility. Here's my comprehensive review:

Strengths
- Excellent architecture and design
- Robust input validation
- Comprehensive test coverage
- Important bug fixes

Areas for Improvement

1. Rubric score range validation (`src/metacoder/evals/runner.py:107-118`)

```python
if not 0.0 <= item.score <= 1.0:
    raise ValueError(f"Rubric score {item.score} must be between 0.0 and 1.0")
```

Suggestion: add validation that rubric scores are within a valid range (e.g., 0.0-1.0).

2. Default criteria enhancement (`src/metacoder/evals/runner.py:137-139`)

```python
kwargs["criteria"] = "Evaluate the actual output based on the rubric criteria."
```

Suggestion: this fallback criteria could be more descriptive, or users could be allowed to specify custom criteria even when a rubric is given.

3. Minor performance optimization (`src/metacoder/evals/runner.py:357-360`)

```python
if metric_config and (metric_config.rubric or metric_config.criteria or metric_config.evaluation_steps):
```

Suggestion: this check can be simplified.

Security Assessment 🔒
Performance Assessment 🚀
Test Quality Assessment 📋: excellent test coverage with comprehensive scenarios.

Final Recommendation

This implementation is production-ready and addresses a real need for more flexible evaluation criteria. The code follows best practices, includes comprehensive testing, and maintains backward compatibility while adding powerful new functionality.

The minor improvement suggestions don't block approval and can be addressed in future iterations.
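To show what the first suggestion could look like in practice, here is a sketch that folds the score-range check into `RubricItem` itself as a pydantic field validator; this is illustrative only and not part of the PR:

```python
# Sketch of the reviewer's first suggestion: validate the rubric score range on
# the model rather than in make_custom_geval(). Hypothetical, not from the PR.
from pydantic import BaseModel, field_validator


class RubricItem(BaseModel):
    score: float
    criteria: str

    @field_validator("score")
    @classmethod
    def score_in_range(cls, v: float) -> float:
        if not 0.0 <= v <= 1.0:
            raise ValueError(f"Rubric score {v} must be between 0.0 and 1.0")
        return v
```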
Summary
This PR adds comprehensive support for custom GEval metrics with three configuration options: criteria, rubric, and evaluation_steps.
Motivation
The default `CorrectnessMetric` uses generic evaluation criteria that don't work well for specialized tasks such as exact text extraction, specific metadata field retrieval, and binary decision evaluations.
Custom metrics allow test-specific evaluation criteria for better accuracy and reliability.
Changes
Core Features
1. Add `MetricConfig` class (`eval_model.py`)
   - Fields: `name`, `criteria`, `evaluation_steps`, `rubric`
   - `@model_validator` enforces mutual exclusivity of `criteria` and `evaluation_steps`
   - Supports configuration via `criteria`, `evaluation_steps`, or `rubric`
2. Add `RubricItem` class (`eval_model.py`)
   - `score` and `criteria` fields
3. Implement `make_custom_geval()` (`runner.py`): creates a GEval instance from a `MetricConfig` (see the sketch after this list)
4. Update `EvalCase.metrics` (`eval_model.py`)
   - Changed from `List[str]` to `List[Union[str, MetricConfig]]`
5. Update runner logic (`runner.py`)
   - Dispatches to `make_custom_geval()` based on the presence of `criteria`, `evaluation_steps`, or `rubric`

Bug Fixes

6. Fix `make_geval()` bug (`runner.py`)
   - Removed the duplicate `criteria` parameter (was specifying both `criteria` AND `evaluation_steps`)
   - Keeps `evaluation_steps` only, for more reliable scoring
7. Improve error handling (`claude.py`)
   - Changed `raise ValueError` to `logger.warning` for non-auth errors

Tests

8. Add comprehensive test suite (`tests/test_custom_geval.py`)
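As referenced in item 3, here is a sketch of how `make_custom_geval()` might map a `MetricConfig` onto a DeepEval `GEval` metric. The `evaluation_params` choice and the rubric-to-criteria rendering are assumptions on my part (the review above quotes a fallback criteria string for the rubric-only case); the actual `runner.py` logic may differ:

```python
# Illustrative sketch only; not the actual runner.py implementation.
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

from metacoder.evals.eval_model import MetricConfig  # import path assumed


def make_custom_geval(config: MetricConfig) -> GEval:
    kwargs: dict = {
        "name": config.name,
        # Which test-case fields the judge sees is an assumption here.
        "evaluation_params": [
            LLMTestCaseParams.INPUT,
            LLMTestCaseParams.ACTUAL_OUTPUT,
            LLMTestCaseParams.EXPECTED_OUTPUT,
        ],
    }
    if config.evaluation_steps:
        # evaluation_steps gives more control than criteria per DeepEval docs.
        kwargs["evaluation_steps"] = config.evaluation_steps
    else:
        # Criteria path; when only a rubric is given, fall back to a generic
        # instruction (mirroring the snippet quoted in the review) and render
        # the rubric items into the criteria text.
        criteria = config.criteria or (
            "Evaluate the actual output based on the rubric criteria."
        )
        if config.rubric:
            rubric_text = "\n".join(
                f"- score {item.score}: {item.criteria}" for item in config.rubric
            )
            criteria = f"{criteria}\nRubric:\n{rubric_text}"
        kwargs["criteria"] = criteria
    return GEval(**kwargs)
```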
Usage Examples

Custom Criteria
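The original example did not survive page extraction; here is a minimal sketch of a criteria-only metric, with a hypothetical name and wording:

```python
# Hypothetical criteria-only metric for an exact text extraction task.
metric = MetricConfig(
    name="exact_title_match",
    criteria=(
        "The actual output must contain the paper title exactly as it appears "
        "in the expected output, with no paraphrasing or truncation."
    ),
)
```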
Custom Rubric
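Again a sketch, with hypothetical scores and wording, showing a rubric built from `RubricItem` entries:

```python
# Hypothetical graded rubric for a metadata extraction task.
metric = MetricConfig(
    name="metadata_extraction",
    rubric=[
        RubricItem(score=1.0, criteria="All requested metadata fields are correct."),
        RubricItem(score=0.5, criteria="Some fields are correct; others are missing or wrong."),
        RubricItem(score=0.0, criteria="No requested field is extracted correctly."),
    ],
)
```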
Custom Evaluation Steps
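A sketch of an evaluation_steps metric (mutually exclusive with criteria, per the validator); the steps and name are hypothetical:

```python
# Hypothetical step-by-step metric for a binary decision evaluation.
metric = MetricConfig(
    name="binary_decision",
    evaluation_steps=[
        "Check whether the actual output states a clear yes/no decision.",
        "Compare that decision to the expected output.",
        "Score 1.0 if they match, 0.0 otherwise.",
    ],
)
```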
Testing
Run the new test suite with `pytest tests/test_custom_geval.py` (7 test cases covering all validation scenarios).
Compatibility
Backward compatible: existing string metric names in `EvalCase.metrics` continue to work alongside `MetricConfig` entries.
References
Generated with Claude Code