
Conversation

@justaddcoffee
Contributor

Summary

This PR adds comprehensive support for custom GEval metrics with three configuration options:

  • criteria: Single-string evaluation criteria (auto-generates steps)
  • evaluation_steps: List of specific evaluation steps (more control and reliability)
  • rubric: Structured scoring guidelines with score ranges

Motivation

The default CorrectnessMetric uses generic evaluation criteria that don't work well for specialized tasks:

  • Exact text extraction ("retrieve the first sentence of section 2")
  • Specific metadata field retrieval ("title", "DOI", "first author")
  • Binary decisions ("is this paper retracted?")

Custom metrics allow test-specific evaluation criteria for better accuracy and reliability.

Changes

Core Features

1. Add MetricConfig class (eval_model.py)

  • Fields: name, criteria, evaluation_steps, rubric
  • @model_validator enforces mutual exclusivity of criteria and evaluation_steps
  • Requires at least one of: criteria, evaluation_steps, or rubric

2. Add RubricItem class (eval_model.py)

  • Structured rubric definitions with score and criteria fields
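
As a rough illustration, the two models described in items 1 and 2 might be declared along these lines (a sketch assuming Pydantic v2; the actual eval_model.py code may differ in detail):

# Sketch only -- field names follow the PR description above
from typing import List, Optional
from pydantic import BaseModel, model_validator

class RubricItem(BaseModel):
    """One rubric entry: a score and the criteria that earn it."""
    score: float
    criteria: str

class MetricConfig(BaseModel):
    """Configuration for a custom GEval metric."""
    name: str
    criteria: Optional[str] = None
    evaluation_steps: Optional[List[str]] = None
    rubric: Optional[List[RubricItem]] = None

    @model_validator(mode="after")
    def check_options(self):
        # criteria and evaluation_steps are mutually exclusive
        if self.criteria and self.evaluation_steps:
            raise ValueError("Provide either criteria or evaluation_steps, not both")
        # at least one of the three options must be given
        if not (self.criteria or self.evaluation_steps or self.rubric):
            raise ValueError("Provide at least one of: criteria, evaluation_steps, rubric")
        return self

With validation at the Pydantic layer, invalid combinations are rejected when the config is loaded rather than midway through an evaluation run.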

3. Implement make_custom_geval() (runner.py)

  • Creates GEval instances from MetricConfig
  • Supports all three configuration options
  • Includes INPUT parameter in evaluation_params (was missing before)
  • Comprehensive docstring with usage notes
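
A sketch of how make_custom_geval() could assemble a GEval from a MetricConfig, following the points above (DeepEval import paths and the Rubric mapping are assumptions; check runner.py and your installed DeepEval version):

# Sketch only -- mirrors the behaviour described above, not the exact PR code
from typing import Optional
from deepeval.metrics import GEval
from deepeval.metrics.g_eval import Rubric
from deepeval.models import DeepEvalBaseLLM
from deepeval.test_case import LLMTestCaseParams
from metacoder.evals.eval_model import MetricConfig  # as sketched under item 1

def make_custom_geval(
    metric_config: MetricConfig, model: Optional[DeepEvalBaseLLM] = None
) -> GEval:
    kwargs = {
        "name": metric_config.name,
        # INPUT is now included alongside the actual/expected outputs
        "evaluation_params": [
            LLMTestCaseParams.INPUT,
            LLMTestCaseParams.ACTUAL_OUTPUT,
            LLMTestCaseParams.EXPECTED_OUTPUT,
        ],
        "model": model,
    }
    # GEval accepts either criteria or evaluation_steps, never both
    if metric_config.evaluation_steps:
        kwargs["evaluation_steps"] = metric_config.evaluation_steps
    elif metric_config.criteria:
        kwargs["criteria"] = metric_config.criteria
    else:
        # rubric-only configs still need a criteria string for GEval
        kwargs["criteria"] = "Evaluate the actual output based on the rubric criteria."
    if metric_config.rubric:
        kwargs["rubric"] = [
            Rubric(score_range=(item.score, item.score), expected_outcome=item.criteria)
            for item in metric_config.rubric
        ]
    return GEval(**kwargs)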

4. Update EvalCase.metrics (eval_model.py)

  • Changed from List[str] to List[Union[str, MetricConfig]]
  • Allows both string format and object format
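
The corresponding EvalCase change is roughly the following (other fields abbreviated or assumed; only the metrics annotation is the point here):

# Sketch only
from typing import List, Union
from pydantic import BaseModel

class EvalCase(BaseModel):
    name: str
    input: str
    expected_output: str
    # a metric may be given as a plain name ("CorrectnessMetric")
    # or as a full MetricConfig object
    metrics: List[Union[str, MetricConfig]]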

5. Update runner logic (runner.py)

  • Detects MetricConfig objects and calls make_custom_geval()
  • Checks for criteria, evaluation_steps, or rubric presence
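
In the runner, that dispatch amounts to something like the sketch below (the helper name build_metrics and the make_geval() call signature are assumptions):

# Sketch only
def build_metrics(case: EvalCase) -> list:
    metrics = []
    for spec in case.metrics:
        if isinstance(spec, MetricConfig):
            # custom config: criteria, evaluation_steps, or rubric is present
            metrics.append(make_custom_geval(spec, model=None))
        else:
            # plain string name: default CorrectnessMetric path
            metrics.append(make_geval())
    return metrics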

Bug Fixes

6. Fix make_geval() bug (runner.py)

  • Removed duplicate criteria parameter (was specifying both criteria AND evaluation_steps)
  • Per DeepEval docs: "you can only provide either criteria or evaluation_steps, and not both"
  • Now uses only evaluation_steps for more reliable scoring
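
After the fix, the default metric is built with evaluation_steps only, roughly as follows (step wording and evaluation params here are illustrative, not the exact text in runner.py):

# Sketch only
from typing import Optional
from deepeval.metrics import GEval
from deepeval.models import DeepEvalBaseLLM
from deepeval.test_case import LLMTestCaseParams

def make_geval(model: Optional[DeepEvalBaseLLM] = None) -> GEval:
    return GEval(
        name="CorrectnessMetric",
        # evaluation_steps only -- the duplicate criteria argument was removed
        evaluation_steps=[
            "Compare the actual output against the expected output",
            "Penalize factual disagreements or missing information",
        ],
        evaluation_params=[
            LLMTestCaseParams.ACTUAL_OUTPUT,
            LLMTestCaseParams.EXPECTED_OUTPUT,
        ],
        model=model,
    )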

7. Improve error handling (claude.py)

  • Changed from raise ValueError to logger.warning for non-auth errors
  • Prevents single test failures from crashing entire evaluation run

Tests

8. Add comprehensive test suite (tests/test_custom_geval.py)

  • 7 test cases covering all validation scenarios:
    • evaluation_steps only ✓
    • criteria only ✓
    • rubric only ✓
    • mutual exclusivity validation ✓
    • at-least-one requirement ✓
    • criteria + rubric combination ✓
    • evaluation_steps + rubric combination ✓
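
For example, the mutual-exclusivity check in that suite might look roughly like this (test name, wording, and import path are assumed):

# Sketch only
import pytest
from pydantic import ValidationError
from metacoder.evals.eval_model import MetricConfig

def test_criteria_and_evaluation_steps_are_mutually_exclusive():
    with pytest.raises(ValidationError):
        MetricConfig(
            name="CorrectnessMetric",
            criteria="Check for an exact match",
            evaluation_steps=["Compare actual and expected output"],
        )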

Usage Examples

Custom Criteria

- name: exact_text_extraction
  metrics:
  - name: CorrectnessMetric
    criteria: "Check if actual output exactly matches expected output without paraphrasing"
  input: "What is the first sentence?"
  expected_output: "This is the first sentence."
  threshold: 0.9

Custom Rubric

- name: retraction_check
  metrics:
  - name: CorrectnessMetric
    rubric:
    - score: 0.0
      criteria: "Paper not retracted or unclear"
    - score: 1.0
      criteria: "Paper is retracted"
  input: "Is PMID:12345 retracted?"
  expected_output: "Yes"

Custom Evaluation Steps

- name: metadata_extraction
  metrics:
  - name: CorrectnessMetric
    evaluation_steps:
    - "Check if actual output contains the expected metadata field"
    - "Allow minor formatting differences (quotes, spacing)"
    - "Penalize missing or incorrect values"
  input: "What is the title?"
  expected_output: "Example Title"

Testing

  • All 7 new tests pass (pytest tests/test_custom_geval.py)
  • Mutual exclusivity validation works correctly
  • At-least-one requirement enforced
  • MetricConfig properly handles all three options
  • make_custom_geval() creates valid GEval instances

Documentation

  • Added comprehensive docstrings to MetricConfig and make_custom_geval()
  • Added usage examples in EvalCase docstring
  • Clear field descriptions with mutual exclusivity notes

Compatibility

  • ✅ Backward compatible: Existing configs using string format still work
  • ✅ No breaking changes to existing APIs
  • ✅ Default CorrectnessMetric behavior unchanged (except bug fix)


Generated with Claude Code

- Add MetricConfig class with support for custom criteria, rubric, and evaluation_steps
- Add RubricItem class for structured rubric definitions
- Implement make_custom_geval() to create GEval instances from MetricConfig
- Add model_validator to enforce mutual exclusivity of criteria/evaluation_steps
- Update EvalCase to accept both string and MetricConfig metric specifications
- Fix make_geval() bug: remove duplicate criteria parameter (keep evaluation_steps only)
- Improve error handling in Claude coder (warning instead of exception)
- Add comprehensive test suite for MetricConfig validation

Addresses limitations of generic evaluation criteria, which are not well suited to:
- Exact text extraction tasks
- Specific metadata field retrieval
- Binary decision evaluations

The custom GEval feature allows test-specific evaluation criteria for better
accuracy and reliability. evaluation_steps provides more control than criteria
according to DeepEval documentation.

Tests: Added test_custom_geval.py with 7 test cases covering all validation scenarios
@claude
Contributor

claude bot commented Nov 24, 2025

Claude finished @justaddcoffee's task


Code Review Complete ✅

I've thoroughly reviewed the custom GEval enhancements PR. Overall, this is a well-designed, high-quality implementation that significantly improves the evaluation framework's flexibility. Here's my detailed analysis:

Review Tasks

  • Read and analyze all changed files
  • Review MetricConfig and RubricItem class design
  • Analyze make_custom_geval() implementation
  • Check validation logic and error handling
  • Review test coverage and quality
  • Assess performance and security considerations
  • Provide final feedback and recommendations

Strengths

Excellent Architecture & Design

  • Clean separation of concerns: MetricConfig and RubricItem are well-designed data classes with clear responsibilities
  • Backward compatibility: Existing string-based metric configurations continue to work seamlessly
  • Type safety: Proper use of Union[str, MetricConfig] maintains type safety while adding flexibility
  • Clear validation: Pydantic validators enforce business rules at the data layer

Robust Input Validation

  • Mutual exclusivity: src/metacoder/evals/eval_model.py:27-44 properly enforces that criteria and evaluation_steps cannot both be specified
  • Required field validation: Ensures at least one of criteria, evaluation_steps, or rubric is provided
  • Well-structured error messages: Clear guidance on proper usage

Comprehensive Test Coverage

  • 7 test scenarios covering all validation edge cases
  • Tests all three configuration modes: criteria-only, evaluation_steps-only, rubric-only
  • Validates error conditions: Mutual exclusivity and required field enforcement
  • Tests combination scenarios: Rubric with criteria/evaluation_steps

Important Bug Fixes

  • Fixed duplicate parameter issue in make_geval() (src/metacoder/evals/runner.py:72-88) - resolves the criteria/evaluation_steps conflict by keeping only evaluation_steps
  • Improved error handling in claude.py:264 - changed from raise ValueError to logger.warning to prevent single test failures from crashing entire evaluation runs

⚠️ Areas for Improvement

1. Documentation & Type Hints

# src/metacoder/evals/runner.py:91-146
def make_custom_geval(
    metric_config: MetricConfig, model: Optional[DeepEvalBaseLLM] = None
) -> GEval:

Suggestion: The docstring is good, but could benefit from:

  • Example usage snippets for each configuration type
  • Parameter validation details
  • Return value specifications

2. Error Handling Edge Cases

# src/metacoder/evals/runner.py:107-118
rubrics = []
if metric_config.rubric:
    for item in metric_config.rubric:
        rubrics.append(
            Rubric(
                score_range=(item.score, item.score),
                expected_outcome=item.criteria,
            )
        )

Potential Issue: No validation that rubric scores are within valid ranges (e.g., 0.0-1.0). Consider adding:

if not 0.0 <= item.score <= 1.0:
    raise ValueError(f"Rubric score {item.score} must be between 0.0 and 1.0")

3. Default Criteria Logic

# src/metacoder/evals/runner.py:137-139
else:
    # Default criteria if only rubric provided
    kwargs["criteria"] = "Evaluate the actual output based on the rubric criteria."

Concern: This fallback criteria is quite generic. Consider making it more descriptive or allowing users to specify custom criteria even with rubrics.

4. Performance Consideration

# src/metacoder/evals/runner.py:357-360
if metric_config and (metric_config.rubric or metric_config.criteria or metric_config.evaluation_steps):
    # Use custom configuration if provided
    logger.info(f"Using custom configuration for {metric_name}")
    metric = make_custom_geval(metric_config, model=None)

Minor: The condition could be simplified since the validator already ensures at least one field is present:

if metric_config:
    metric = make_custom_geval(metric_config, model=None)

🔒 Security Assessment

  • No security concerns identified
  • Input validation properly sanitizes user input
  • No direct file system access or command execution
  • Pydantic models provide type safety barriers

🚀 Performance Assessment

  • Minimal performance impact
  • Lazy evaluation approach maintains efficiency
  • Type checking and validation happen at configuration time, not runtime
  • No memory leaks or resource management issues

📋 Test Quality Assessment

Excellent test coverage with comprehensive scenarios:

  • ✅ All validation paths tested
  • ✅ Error conditions properly validated
  • ✅ Normal use cases covered
  • ✅ Clear, readable test names
  • ✅ Good use of pytest fixtures and assertions

🎯 Overall Assessment

This is a high-quality, production-ready implementation that:

  1. Significantly improves functionality - Enables specialized evaluation criteria for domain-specific tasks
  2. Maintains backward compatibility - Existing configurations continue to work
  3. Follows best practices - Clean architecture, proper validation, comprehensive testing
  4. Includes important bug fixes - Resolves existing issues in the evaluation pipeline

Recommendation: ✅ APPROVE

The implementation is well-architected, thoroughly tested, and addresses a real need for more flexible evaluation criteria. The bug fixes alone make this a valuable contribution.


📝 Suggestions for Follow-up

  1. Consider adding rubric score range validation
  2. Add usage examples to the main README
  3. Consider creating a migration guide for users wanting to upgrade from string-based to object-based metric configs

