
Conversation


Copilot AI commented Nov 24, 2025

According to the deepeval docs, GEval should be given either criteria OR evaluation_steps, not both. When both are provided, evaluation_steps takes priority and criteria is silently ignored. The current code specifies both, which is misleading.

Changes:

  • Remove unused criteria parameter from CorrectnessMetric in get_default_metrics()
# Before
"CorrectnessMetric": GEval(
    name="Correctness",
    criteria="Determine whether the actual output is factually correct...",  # ignored
    evaluation_steps=[...],  # this is what's actually used
    ...
)

# After
"CorrectnessMetric": GEval(
    name="Correctness",
    evaluation_steps=[...],
    ...
)
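
For reference, here is the corrected metric written out in full as a self-contained sketch (the dict name default_metrics is illustrative; the steps, threshold, and params are taken verbatim from the change, and the two imports are deepeval's public ones):

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

# Sketch of the full corrected metric: explicit evaluation_steps, no criteria.
default_metrics = {
    "CorrectnessMetric": GEval(
        name="Correctness",
        evaluation_steps=[
            "Check whether the facts in 'actual output' contradicts any facts in 'expected output'",
            "You should also heavily penalize omission of detail",
            "Vague language, or contradicting OPINIONS, are OK",
        ],
        threshold=0.8,
        evaluation_params=[
            LLMTestCaseParams.INPUT,
            LLMTestCaseParams.ACTUAL_OUTPUT,
            LLMTestCaseParams.EXPECTED_OUTPUT,
        ],
    ),
}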

Warning

Firewall rules blocked me from connecting to one or more addresses

I tried to connect to the following addresses, but was blocked by firewall rules:

  • api.ipify.org
    • Triggering command: /usr/bin/python python -m pytest tests/test_evals/test_REDACTED.py -v --ignore-glob=*llm* (dns block)
  • astral.sh
    • Triggering command: /usr/bin/curl curl -LsSf REDACTED (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

Original prompt

This section details the original issue you should resolve

<issue_title>Possible issue with default GEval produced in make_geval()</issue_title>
<issue_description>In make_geval() here, if I understand the code correctly, we are specifying the default criteria and evaluation steps:

        "CorrectnessMetric": GEval(
            name="Correctness",
            criteria="Determine whether the actual output is factually correct based on the expected output.",
            # NOTE: you can only provide either criteria or evaluation_steps, and not both
            evaluation_steps=[
                "Check whether the facts in 'actual output' contradicts any facts in 'expected output'",
                "You should also heavily penalize omission of detail",
                "Vague language, or contradicting OPINIONS, are OK",
            ],
            threshold=0.8,
            evaluation_params=[
                LLMTestCaseParams.INPUT,
                LLMTestCaseParams.ACTUAL_OUTPUT,
                LLMTestCaseParams.EXPECTED_OUTPUT,
            ],
        ),

This seems to be coming straight from the docs, but the code above as well as the docs say not to specify both criteria and evaluation_steps, so it's not clear what eval behavior metacoder is using here - i.e. is Deep Eval applying the criteria, or the evaluation_steps? Or is it applying both, and the docs are not correct?

In practice this seems to be behaving reasonably based on some manual checking, but we probably should be clear what the default eval is here</issue_description>

<agent_instructions>@copilot it looks to me that while the behavior of GEval when both criteria and evaluation_steps are given is undefined, in practice the evaluation_steps are used and criteria are ignored:

    def _generate_evaluation_steps(self) -> List[str]:
        if self.evaluation_steps:  # ← If evaluation_steps provided, use them
            return self.evaluation_steps
        # Otherwise, generate from criteria
So the behavior is:

  • If evaluation_steps is provided, it's used regardless of criteria
  • If evaluation_steps is None, then criteria is used to auto-generate steps
  • The docs say "mutually exclusive" but the code just prioritizes evaluation_steps
So I suggest we fix this by just changing the code to:

    "CorrectnessMetric": GEval(
        name="Correctness",
        # get rid of criteria here
        evaluation_steps=[
            "Check whether the facts in 'actual output' contradicts any facts in 'expected output'",
            "You should also heavily penalize omission of detail",
            "Vague language, or contradicting OPINIONS, are OK",
        ],
        threshold=0.8,
        evaluation_params=[
            LLMTestCaseParams.INPUT,
            LLMTestCaseParams.ACTUAL_OUTPUT,
            LLMTestCaseParams.EXPECTED_OUTPUT,
        ],
    ),

If you agree, make a PR and we can look over it</agent_instructions>

Comments on the Issue (you are @copilot in this section)

@justaddcoffee @dragon-ai-agent it looks to me that while the behavior of GEval when both criteria and evaluation_steps are given is undefined, in practice the evaluation_steps are used and criteria are ignored:
  def _generate_evaluation_steps(self) -> List[str]:
      if self.evaluation_steps:  # ← If evaluation_steps provided, use them
          return self.evaluation_steps
      # Otherwise, generate from criteria

So the behavior is:

  • If evaluation_steps is provided, it's used regardless of criteria
  • If evaluation_steps is None, then criteria is used to auto-generate steps
  • The docs say "mutually exclusive" but the code just prioritizes evaluation_steps

So I suggest we fix this by just changing the code to:

        "CorrectnessMetric": GEval(
            name="Correctness",
            # get rid of criteria here
            evaluation_steps=[
                "Check whether the facts in 'actual output' contradicts any facts in 'expected output'",
                "You should also heavily penalize omission of detail",
                "Vague language, or contradicting OPINIONS, are OK",
            ],
            threshold=0.8,
            evaluation_params=[
                LLMTestCaseParams.INPUT,
                LLMTestCaseParams.ACTUAL_OUTPUT,
                LLMTestCaseParams.EXPECTED_OUTPUT,
            ],
        ),
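
For contrast, the criteria-only form (the other branch of _generate_evaluation_steps above) would drop the explicit steps and let deepeval auto-generate them from the criteria at evaluation time. This is only a sketch of the alternative, not what I'm proposing:

        "CorrectnessMetric": GEval(
            name="Correctness",
            # criteria only: deepeval generates the evaluation steps from this
            criteria="Determine whether the actual output is factually correct based on the expected output.",
            threshold=0.8,
            evaluation_params=[
                LLMTestCaseParams.INPUT,
                LLMTestCaseParams.ACTUAL_OUTPUT,
                LLMTestCaseParams.EXPECTED_OUTPUT,
            ],
        ),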

If you agree, make a PR and we can look over it



According to deepeval docs, only criteria OR evaluation_steps should be
provided. When both are provided, evaluation_steps takes priority and
criteria is ignored. This fix removes the unused criteria parameter.

Co-authored-by: justaddcoffee <[email protected]>
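
A small regression test could lock this behavior in. This is a hypothetical sketch: the import path and the get_default_metrics name are assumptions based on the PR description, and it relies on GEval exposing the criteria and evaluation_steps attributes quoted above.

# Hypothetical test sketch; adjust the import to the real module that
# defines get_default_metrics() in this repository.
from metacoder.evals import get_default_metrics


def test_correctness_metric_has_steps_and_no_criteria():
    metric = get_default_metrics()["CorrectnessMetric"]
    assert metric.evaluation_steps  # explicit steps are still defined
    assert metric.criteria is None  # criteria is no longer passed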
Copilot AI changed the title from "[WIP] Fix possible issue with default GEval in make_geval function" to "Remove redundant criteria parameter from GEval instantiation" on Nov 24, 2025
Copilot AI requested a review from justaddcoffee November 24, 2025 21:03

@justaddcoffee justaddcoffee left a comment


yep, exactly

Development

Successfully merging this pull request may close these issues.

Possible issue with default GEval produced in make_geval()
