Proposal: Semantic assertions for qualitative test validation #692

@richardesp

Description

Problem

Several "qualitative" tests in the suite are flaky because they validate LLM outputs with rigid comparisons (string matching, exact patterns). Generative models can rephrase or slightly shift tone between runs, causing tests to fail even when the response is functionally correct.

Related issues:

Both seem to share the same root cause: applying traditional, exact-match testing approaches to non-deterministic outputs.

(Following up on a brief Slack discussion with @psschwei, who suggested opening this issue 😄)

Proposal

One possible direction, from my perspective, could be a lightweight pytest-native semantic assertion layer that validates by meaning instead of exact matching. This would stay aligned with Mellea's existing LLM-as-a-judge approach, integrating it into pytest in a more "Pythonic" way, without introducing heavier evaluation frameworks.

The idea separates two validation dimensions:

  • Content(response): whether the model generated the correct information
  • Behaviour(response): whether the response behaves as expected (e.g. avoiding unexpected safety refusals)

Example 1

def test_kv_store_operations():
    response = m.instruct("Explain key-value store operations")
 
    # Content validation
    assert "key-value operations like get, set, delete" in Content(response)
    assert "unrelated topics" not in Content(response)
 
    # Behaviour validation
    assert "safety refusal or content rejection" not in Behaviour(response)
    assert "informative and direct" in Behaviour(response)

Example 2

def test_langchain_messages():
    response = m.instruct("Format this as a langchain message")
 
    assert "a properly structured message" in Content(response)
    assert "clear and well-formatted" in Behaviour(response)
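
A minimal sketch of what these wrappers might look like (hypothetical, not the PoC's actual implementation): `Content` and `Behaviour` overload `__contains__` so the assertions read naturally, with a toy token-overlap score standing in for the embedding model:

```python
# Hypothetical sketch of the proposed API. Similarity is faked here with
# token overlap; a real implementation would use sentence embeddings.

class SemanticWrapper:
    threshold = 0.2  # toy value: fraction of expectation tokens that must appear

    def __init__(self, response: str):
        self.response = response.lower()

    def __contains__(self, expectation: str) -> bool:
        tokens = expectation.lower().split()
        hits = sum(1 for t in tokens if t in self.response)
        return hits / len(tokens) >= self.threshold


class Content(SemanticWrapper):
    """Did the model produce the expected information?"""


class Behaviour(SemanticWrapper):
    """Does the response behave as expected (tone, refusals, ...)?"""


response = "Redis supports get, set and delete operations on keys."
assert "get, set, delete operations" in Content(response)
assert "Alan Turing" not in Content(response)
```

The key design point is that `assert ... in Content(response)` stays plain pytest: failures surface through normal assertion introspection, with no new runner or DSL.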

Proof of Concept

I put together a small PoC to explore this idea using ibm-granite/granite-embedding-30m-english for the semantic similarity layer, integrated as a pytest plugin.

The snippets below illustrate how the same assertion behaves when the expected intent is semantically present versus when it is not:

from pytest_semantic_assert import Content

def test_semantic_match_passes():
    response = "Redis is an in-memory key-value store used for caching."
    assert "key-value store" in Content(response)  # ✅ passes

def test_no_semantic_match_fails():
    response = "Redis is an in-memory key-value store used for caching."
    assert "Alan Turing" in Content(response)  # ❌ fails — similarity 0.51, below threshold 0.75
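
Under the hood, the check could reduce to a thresholded cosine similarity between embeddings. A self-contained sketch, with a toy bag-of-words embedder standing in for ibm-granite/granite-embedding-30m-english (the 0.4 threshold is tuned to this toy embedder; the PoC used 0.75 with real embeddings):

```python
# Sketch of the thresholded similarity check behind Content (assumed design:
# embed both texts, compare by cosine similarity, pass above a threshold).
# embed() is a toy bag-of-words stand-in for a real embedding model.
import math
from collections import Counter

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantically_contains(expectation: str, response: str,
                          threshold: float = 0.4) -> bool:
    # threshold is a toy value for the bag-of-words embedder
    return cosine(embed(expectation), embed(response)) >= threshold
```

Swapping the embedder for the granite model and exposing the threshold as a pytest option would be the only changes needed to turn this into the plugin behaviour shown above.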


Context

I looked at evaluation frameworks like DeepEval and similar tools, but they can introduce additional complexity for what could be a more lightweight, testing-oriented helper. This approach aims to stay minimal and aligned with Mellea's existing philosophy.

Open questions

  • From your perspective, does this direction make sense for how qualitative testing is currently approached in Mellea?
  • Are you already exploring similar approaches internally for handling flaky LLM-based tests?
  • Do you think there's a broader need in the Python/GenAI ecosystem for lightweight semantic testing helpers?

Happy to refine the approach or try applying it to one of the existing flaky tests if that's useful. Thanks in advance for your time!
