
Conversation

@justaddcoffee
Contributor

Summary

This PR adds comprehensive support for custom GEval metrics with three configuration options:

  • criteria: Single-string evaluation criteria (auto-generates steps)
  • evaluation_steps: List of specific evaluation steps (more control and reliability)
  • rubric: Structured scoring guidelines with score ranges

Motivation

The default CorrectnessMetric uses generic evaluation criteria that don't work well for specialized tasks:

  • Exact text extraction ("retrieve the first sentence of section 2")
  • Specific metadata field retrieval ("title", "DOI", "first author")
  • Binary decisions ("is this paper retracted?")

Custom metrics allow test-specific evaluation criteria for better accuracy and reliability.

Changes

Core Features

1. Add MetricConfig class (eval_model.py)

  • Fields: name, criteria, evaluation_steps, rubric
  • @model_validator enforces mutual exclusivity of criteria and evaluation_steps
  • Requires at least one of: criteria, evaluation_steps, or rubric

2. Add RubricItem class (eval_model.py)

  • Structured rubric definitions with score and criteria fields
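
As a rough illustration, the two models described in items 1 and 2 might be declared along these lines (a sketch assuming Pydantic v2; the actual eval_model.py code may differ in detail):

# Sketch only -- field names follow the PR description above
from typing import List, Optional
from pydantic import BaseModel, model_validator

class RubricItem(BaseModel):
    """One rubric entry: a score and the criteria that earn it."""
    score: float
    criteria: str

class MetricConfig(BaseModel):
    """Configuration for a custom GEval metric."""
    name: str
    criteria: Optional[str] = None
    evaluation_steps: Optional[List[str]] = None
    rubric: Optional[List[RubricItem]] = None

    @model_validator(mode="after")
    def check_options(self):
        # criteria and evaluation_steps are mutually exclusive
        if self.criteria and self.evaluation_steps:
            raise ValueError("Provide either criteria or evaluation_steps, not both")
        # at least one of the three options must be given
        if not (self.criteria or self.evaluation_steps or self.rubric):
            raise ValueError("Provide at least one of: criteria, evaluation_steps, rubric")
        return self

With validation at the Pydantic layer, invalid combinations are rejected when the config is loaded rather than midway through an evaluation run.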

3. Implement make_custom_geval() (runner.py)

  • Creates GEval instances from MetricConfig
  • Supports all three configuration options
  • Includes INPUT parameter in evaluation_params (was missing before)
  • Comprehensive docstring with usage notes
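
A sketch of how make_custom_geval() could assemble a GEval from a MetricConfig, following the points above (DeepEval import paths and the Rubric mapping are assumptions; check runner.py and your installed DeepEval version):

# Sketch only -- mirrors the behaviour described above, not the exact PR code
from typing import Optional
from deepeval.metrics import GEval
from deepeval.metrics.g_eval import Rubric
from deepeval.models import DeepEvalBaseLLM
from deepeval.test_case import LLMTestCaseParams
from metacoder.evals.eval_model import MetricConfig  # as sketched under item 1

def make_custom_geval(
    metric_config: MetricConfig, model: Optional[DeepEvalBaseLLM] = None
) -> GEval:
    kwargs = {
        "name": metric_config.name,
        # INPUT is now included alongside the actual/expected outputs
        "evaluation_params": [
            LLMTestCaseParams.INPUT,
            LLMTestCaseParams.ACTUAL_OUTPUT,
            LLMTestCaseParams.EXPECTED_OUTPUT,
        ],
        "model": model,
    }
    # GEval accepts either criteria or evaluation_steps, never both
    if metric_config.evaluation_steps:
        kwargs["evaluation_steps"] = metric_config.evaluation_steps
    elif metric_config.criteria:
        kwargs["criteria"] = metric_config.criteria
    else:
        # rubric-only configs still need a criteria string for GEval
        kwargs["criteria"] = "Evaluate the actual output based on the rubric criteria."
    if metric_config.rubric:
        kwargs["rubric"] = [
            Rubric(score_range=(item.score, item.score), expected_outcome=item.criteria)
            for item in metric_config.rubric
        ]
    return GEval(**kwargs)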

4. Update EvalCase.metrics (eval_model.py)

  • Changed from List[str] to List[Union[str, MetricConfig]]
  • Allows both string format and object format
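
The corresponding EvalCase change is roughly the following (other fields abbreviated or assumed; only the metrics annotation is the point here):

# Sketch only
from typing import List, Union
from pydantic import BaseModel

class EvalCase(BaseModel):
    name: str
    input: str
    expected_output: str
    # a metric may be given as a plain name ("CorrectnessMetric")
    # or as a full MetricConfig object
    metrics: List[Union[str, MetricConfig]]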

5. Update runner logic (runner.py)

  • Detects MetricConfig objects and calls make_custom_geval()
  • Checks for criteria, evaluation_steps, or rubric presence
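
In the runner, that dispatch amounts to something like the sketch below (the helper name build_metrics and the make_geval() call signature are assumptions):

# Sketch only
def build_metrics(case: EvalCase) -> list:
    metrics = []
    for spec in case.metrics:
        if isinstance(spec, MetricConfig):
            # custom config: criteria, evaluation_steps, or rubric is present
            metrics.append(make_custom_geval(spec, model=None))
        else:
            # plain string name: default CorrectnessMetric path
            metrics.append(make_geval())
    return metrics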

Bug Fixes

6. Fix make_geval() bug (runner.py)

  • Removed duplicate criteria parameter (was specifying both criteria AND evaluation_steps)
  • Per DeepEval docs: "you can only provide either criteria or evaluation_steps, and not both"
  • Now uses only evaluation_steps for more reliable scoring
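
After the fix, the default metric is built with evaluation_steps only, roughly as follows (step wording and evaluation params here are illustrative, not the exact text in runner.py):

# Sketch only
from typing import Optional
from deepeval.metrics import GEval
from deepeval.models import DeepEvalBaseLLM
from deepeval.test_case import LLMTestCaseParams

def make_geval(model: Optional[DeepEvalBaseLLM] = None) -> GEval:
    return GEval(
        name="CorrectnessMetric",
        # evaluation_steps only -- the duplicate criteria argument was removed
        evaluation_steps=[
            "Compare the actual output against the expected output",
            "Penalize factual disagreements or missing information",
        ],
        evaluation_params=[
            LLMTestCaseParams.ACTUAL_OUTPUT,
            LLMTestCaseParams.EXPECTED_OUTPUT,
        ],
        model=model,
    )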

7. Improve error handling (claude.py)

  • Changed from raise ValueError to logger.warning for non-auth errors
  • Prevents single test failures from crashing entire evaluation run

Tests

8. Add comprehensive test suite (tests/test_custom_geval.py)

  • 7 test cases covering all validation scenarios:
    • evaluation_steps only ✓
    • criteria only ✓
    • rubric only ✓
    • mutual exclusivity validation ✓
    • at-least-one requirement ✓
    • criteria + rubric combination ✓
    • evaluation_steps + rubric combination ✓
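
For example, the mutual-exclusivity check in that suite might look roughly like this (test name, wording, and import path are assumed):

# Sketch only
import pytest
from pydantic import ValidationError
from metacoder.evals.eval_model import MetricConfig

def test_criteria_and_evaluation_steps_are_mutually_exclusive():
    with pytest.raises(ValidationError):
        MetricConfig(
            name="CorrectnessMetric",
            criteria="Check for an exact match",
            evaluation_steps=["Compare actual and expected output"],
        )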

Usage Examples

Custom Criteria

- name: exact_text_extraction
  metrics:
  - name: CorrectnessMetric
    criteria: "Check if actual output exactly matches expected output without paraphrasing"
  input: "What is the first sentence?"
  expected_output: "This is the first sentence."
  threshold: 0.9

Custom Rubric

- name: retraction_check
  metrics:
  - name: CorrectnessMetric
    rubric:
    - score: 0.0
      criteria: "Paper not retracted or unclear"
    - score: 1.0
      criteria: "Paper is retracted"
  input: "Is PMID:12345 retracted?"
  expected_output: "Yes"

Custom Evaluation Steps

- name: metadata_extraction
  metrics:
  - name: CorrectnessMetric
    evaluation_steps:
    - "Check if actual output contains the expected metadata field"
    - "Allow minor formatting differences (quotes, spacing)"
    - "Penalize missing or incorrect values"
  input: "What is the title?"
  expected_output: "Example Title"

Testing

  • All 7 new tests pass (pytest tests/test_custom_geval.py)
  • Mutual exclusivity validation works correctly
  • At-least-one requirement enforced
  • MetricConfig properly handles all three options
  • make_custom_geval() creates valid GEval instances

Documentation

  • Added comprehensive docstrings to MetricConfig and make_custom_geval()
  • Added usage examples in EvalCase docstring
  • Clear field descriptions with mutual exclusivity notes

Compatibility

  • ✅ Backward compatible: Existing configs using string format still work
  • ✅ No breaking changes to existing APIs
  • ✅ Default CorrectnessMetric behavior unchanged (except bug fix)


Generated with Claude Code

- Add MetricConfig class with support for custom criteria, rubric, and evaluation_steps
- Add RubricItem class for structured rubric definitions
- Implement make_custom_geval() to create GEval instances from MetricConfig
- Add model_validator to enforce mutual exclusivity of criteria/evaluation_steps
- Update EvalCase to accept both string and MetricConfig metric specifications
- Fix make_geval() bug: remove duplicate criteria parameter (keep evaluation_steps only)
- Improve error handling in Claude coder (warning instead of exception)
- Add comprehensive test suite for MetricConfig validation

Addresses limitations of generic evaluation criteria, which are not well suited to:
- Exact text extraction tasks
- Specific metadata field retrieval
- Binary decision evaluations

The custom GEval feature allows test-specific evaluation criteria for better
accuracy and reliability. evaluation_steps provides more control than criteria
according to DeepEval documentation.

Tests: Added test_custom_geval.py with 7 test cases covering all validation scenarios
@claude
Contributor

claude bot commented Nov 24, 2025

Claude finished @justaddcoffee's task


Code Review Complete ✅

I've thoroughly reviewed the custom GEval enhancements PR. Overall, this is a well-designed, high-quality implementation that significantly improves the evaluation framework's flexibility. Here's my detailed analysis:

Review Tasks

  • Read and analyze all changed files
  • Review MetricConfig and RubricItem class design
  • Analyze make_custom_geval() implementation
  • Check validation logic and error handling
  • Review test coverage and quality
  • Assess performance and security considerations
  • Provide final feedback and recommendations

Strengths

Excellent Architecture & Design

  • Clean separation of concerns: MetricConfig and RubricItem are well-designed data classes with clear responsibilities
  • Backward compatibility: Existing string-based metric configurations continue to work seamlessly
  • Type safety: Proper use of Union[str, MetricConfig] maintains type safety while adding flexibility
  • Clear validation: Pydantic validators enforce business rules at the data layer

Robust Input Validation

  • Mutual exclusivity: src/metacoder/evals/eval_model.py:27-44 properly enforces that criteria and evaluation_steps cannot both be specified
  • Required field validation: Ensures at least one of criteria, evaluation_steps, or rubric is provided
  • Well-structured error messages: Clear guidance on proper usage

Comprehensive Test Coverage

  • 7 test scenarios covering all validation edge cases
  • Tests all three configuration modes: criteria-only, evaluation_steps-only, rubric-only
  • Validates error conditions: Mutual exclusivity and required field enforcement
  • Tests combination scenarios: Rubric with criteria/evaluation_steps

Important Bug Fixes

  • Fixed duplicate parameter issue in make_geval() (src/metacoder/evals/runner.py:72-88) - resolves the criteria/evaluation_steps conflict by keeping only evaluation_steps
  • Improved error handling in claude.py:264 - changed from raise ValueError to logger.warning to prevent single test failures from crashing entire evaluation runs

⚠️ Areas for Improvement

1. Documentation & Type Hints

# src/metacoder/evals/runner.py:91-146
def make_custom_geval(
    metric_config: MetricConfig, model: Optional[DeepEvalBaseLLM] = None
) -> GEval:

Suggestion: The docstring is good, but could benefit from:

  • Example usage snippets for each configuration type
  • Parameter validation details
  • Return value specifications

2. Error Handling Edge Cases

# src/metacoder/evals/runner.py:107-118
rubrics = []
if metric_config.rubric:
    for item in metric_config.rubric:
        rubrics.append(
            Rubric(
                score_range=(item.score, item.score),
                expected_outcome=item.criteria,
            )
        )

Potential Issue: No validation that rubric scores are within valid ranges (e.g., 0.0-1.0). Consider adding:

if not 0.0 <= item.score <= 1.0:
    raise ValueError(f"Rubric score {item.score} must be between 0.0 and 1.0")

3. Default Criteria Logic

# src/metacoder/evals/runner.py:137-139
else:
    # Default criteria if only rubric provided
    kwargs["criteria"] = "Evaluate the actual output based on the rubric criteria."

Concern: This fallback criteria is quite generic. Consider making it more descriptive or allowing users to specify custom criteria even with rubrics.

4. Performance Consideration

# src/metacoder/evals/runner.py:357-360
if metric_config and (metric_config.rubric or metric_config.criteria or metric_config.evaluation_steps):
    # Use custom configuration if provided
    logger.info(f"Using custom configuration for {metric_name}")
    metric = make_custom_geval(metric_config, model=None)

Minor: The condition could be simplified since the validator already ensures at least one field is present:

if metric_config:
    metric = make_custom_geval(metric_config, model=None)

🔒 Security Assessment

  • No security concerns identified
  • Input validation properly sanitizes user input
  • No direct file system access or command execution
  • Pydantic models provide type safety barriers

🚀 Performance Assessment

  • Minimal performance impact
  • Lazy evaluation approach maintains efficiency
  • Type checking and validation happen at configuration time, not runtime
  • No memory leaks or resource management issues

📋 Test Quality Assessment

Excellent test coverage with comprehensive scenarios:

  • ✅ All validation paths tested
  • ✅ Error conditions properly validated
  • ✅ Normal use cases covered
  • ✅ Clear, readable test names
  • ✅ Good use of pytest fixtures and assertions

🎯 Overall Assessment

This is a high-quality, production-ready implementation that:

  1. Significantly improves functionality - Enables specialized evaluation criteria for domain-specific tasks
  2. Maintains backward compatibility - Existing configurations continue to work
  3. Follows best practices - Clean architecture, proper validation, comprehensive testing
  4. Includes important bug fixes - Resolves existing issues in the evaluation pipeline

Recommendation: ✅ APPROVE

The implementation is well-architected, thoroughly tested, and addresses a real need for more flexible evaluation criteria. The bug fixes alone make this a valuable contribution.


📝 Suggestions for Follow-up

  1. Consider adding rubric score range validation
  2. Add usage examples to the main README
  3. Consider creating a migration guide for users wanting to upgrade from string-based to object-based metric configs

