
Conversation


Copilot AI commented Nov 24, 2025

According to the deepeval docs, GEval should be given either criteria OR evaluation_steps, not both. When both are provided, evaluation_steps takes priority and criteria is silently ignored. The current code specifies both, which is misleading.

Changes:

  • Remove unused criteria parameter from CorrectnessMetric in get_default_metrics()
# Before
"CorrectnessMetric": GEval(
    name="Correctness",
    criteria="Determine whether the actual output is factually correct...",  # ignored
    evaluation_steps=[...],  # this is what's actually used
    ...
)

# After
"CorrectnessMetric": GEval(
    name="Correctness",
    evaluation_steps=[...],
    ...
)
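
For reference, here is the corrected metric written out in full as a self-contained sketch (the dict name default_metrics is illustrative; the steps, threshold, and params are taken verbatim from the change, and the two imports are deepeval's public ones):

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

# Sketch of the full corrected metric: explicit evaluation_steps, no criteria.
default_metrics = {
    "CorrectnessMetric": GEval(
        name="Correctness",
        evaluation_steps=[
            "Check whether the facts in 'actual output' contradicts any facts in 'expected output'",
            "You should also heavily penalize omission of detail",
            "Vague language, or contradicting OPINIONS, are OK",
        ],
        threshold=0.8,
        evaluation_params=[
            LLMTestCaseParams.INPUT,
            LLMTestCaseParams.ACTUAL_OUTPUT,
            LLMTestCaseParams.EXPECTED_OUTPUT,
        ],
    ),
}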

Warning

Firewall rules blocked me from connecting to one or more addresses

I tried to connect to the following addresses, but was blocked by firewall rules:

  • api.ipify.org
    • Triggering command: /usr/bin/python python -m pytest tests/test_evals/test_REDACTED.py -v --ignore-glob=*llm* (dns block)
  • astral.sh
    • Triggering command: /usr/bin/curl curl -LsSf REDACTED (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

Original prompt

This section details the original issue you should resolve

<issue_title>Possible issue with default GEval produced in make_geval()</issue_title>
<issue_description>In make_geval() here, if I understand the code correctly, we are specifying the default criteria and evaluation steps:

        "CorrectnessMetric": GEval(
            name="Correctness",
            criteria="Determine whether the actual output is factually correct based on the expected output.",
            # NOTE: you can only provide either criteria or evaluation_steps, and not both
            evaluation_steps=[
                "Check whether the facts in 'actual output' contradicts any facts in 'expected output'",
                "You should also heavily penalize omission of detail",
                "Vague language, or contradicting OPINIONS, are OK",
            ],
            threshold=0.8,
            evaluation_params=[
                LLMTestCaseParams.INPUT,
                LLMTestCaseParams.ACTUAL_OUTPUT,
                LLMTestCaseParams.EXPECTED_OUTPUT,
            ],
        ),

This seems to be coming straight from the docs, but the code above as well as the docs say not to specify both criteria and evaluation_steps, so it's not clear what eval behavior metacoder is using here - i.e. is Deep Eval applying the criteria, or the evaluation_steps? Or is it applying both, and the docs are not correct?

In practice this seems to be behaving reasonably based on some manual checking, but we probably should be clear what the default eval is here</issue_description>

<agent_instructions>@copilot it looks to me that while the behavior of GEval when both criteria and evaluation_steps are given is undefined, in practice the evaluation_steps are used and criteria are ignored:

    def _generate_evaluation_steps(self) -> List[str]:
        if self.evaluation_steps:  # ← If evaluation_steps provided, use them
            return self.evaluation_steps
        # Otherwise, generate from criteria
So the behavior is:

  • If evaluation_steps is provided, it's used regardless of criteria
  • If evaluation_steps is None, then criteria is used to auto-generate steps
  • The docs say "mutually exclusive" but the code just prioritizes evaluation_steps
So I suggest we fix this by just changing the code to:

    "CorrectnessMetric": GEval(
        name="Correctness",
        # get rid of criteria here
        evaluation_steps=[
            "Check whether the facts in 'actual output' contradicts any facts in 'expected output'",
            "You should also heavily penalize omission of detail",
            "Vague language, or contradicting OPINIONS, are OK",
        ],
        threshold=0.8,
        evaluation_params=[
            LLMTestCaseParams.INPUT,
            LLMTestCaseParams.ACTUAL_OUTPUT,
            LLMTestCaseParams.EXPECTED_OUTPUT,
        ],
    ),

If you agree, make a PR and we can look over it</agent_instructions>

Comments on the Issue (you are @copilot in this section)

@justaddcoffee @dragon-ai-agent it looks to me that while the behavior of GEval when both criteria and evaluation_steps are given is undefined, in practice the evaluation_steps are used and criteria are ignored:
  def _generate_evaluation_steps(self) -> List[str]:
      if self.evaluation_steps:  # ← If evaluation_steps provided, use them
          return self.evaluation_steps
      # Otherwise, generate from criteria

So the behavior is:

  • If evaluation_steps is provided, it's used regardless of criteria
  • If evaluation_steps is None, then criteria is used to auto-generate steps
  • The docs say "mutually exclusive" but the code just prioritizes evaluation_steps

So I suggest we fix this by just changing the code to:

        "CorrectnessMetric": GEval(
            name="Correctness",
            # get rid of criteria here
            evaluation_steps=[
                "Check whether the facts in 'actual output' contradicts any facts in 'expected output'",
                "You should also heavily penalize omission of detail",
                "Vague language, or contradicting OPINIONS, are OK",
            ],
            threshold=0.8,
            evaluation_params=[
                LLMTestCaseParams.INPUT,
                LLMTestCaseParams.ACTUAL_OUTPUT,
                LLMTestCaseParams.EXPECTED_OUTPUT,
            ],
        ),
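
For contrast, the criteria-only form (the other branch of _generate_evaluation_steps above) would drop the explicit steps and let deepeval auto-generate them from the criteria at evaluation time. This is only a sketch of the alternative, not what I'm proposing:

        "CorrectnessMetric": GEval(
            name="Correctness",
            # criteria only: deepeval generates the evaluation steps from this
            criteria="Determine whether the actual output is factually correct based on the expected output.",
            threshold=0.8,
            evaluation_params=[
                LLMTestCaseParams.INPUT,
                LLMTestCaseParams.ACTUAL_OUTPUT,
                LLMTestCaseParams.EXPECTED_OUTPUT,
            ],
        ),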

If you agree, make a PR and we can look over it



According to deepeval docs, only criteria OR evaluation_steps should be
provided. When both are provided, evaluation_steps takes priority and
criteria is ignored. This fix removes the unused criteria parameter.

Co-authored-by: justaddcoffee <[email protected]>
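
A small regression test could lock this behavior in. This is a hypothetical sketch: the import path and the get_default_metrics name are assumptions based on the PR description, and it relies on GEval exposing the criteria and evaluation_steps attributes quoted above.

# Hypothetical test sketch; adjust the import to the real module that
# defines get_default_metrics() in this repository.
from metacoder.evals import get_default_metrics


def test_correctness_metric_has_steps_and_no_criteria():
    metric = get_default_metrics()["CorrectnessMetric"]
    assert metric.evaluation_steps  # explicit steps are still defined
    assert metric.criteria is None  # criteria is no longer passed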
Copilot AI changed the title from "[WIP] Fix possible issue with default GEval in make_geval function" to "Remove redundant criteria parameter from GEval instantiation" on Nov 24, 2025
Copilot AI requested a review from justaddcoffee November 24, 2025 21:03

@justaddcoffee justaddcoffee left a comment


yep, exactly

Development

Successfully merging this pull request may close these issues.

Possible issue with default GEval produced in make_geval()
