Skip to content

Add LLM-as-a-Judge evaluator using MLflow and OpenAI-compatible endpoints#72

Open
piotrhm wants to merge 7 commits into
mainfrom
mlflow-llmaj
Open

Add LLM-as-a-Judge evaluator using MLflow and OpenAI-compatible endpoints#72
piotrhm wants to merge 7 commits into
mainfrom
mlflow-llmaj

Conversation

@piotrhm

@piotrhm piotrhm commented Jun 15, 2026

Copy link
Copy Markdown
Collaborator

Description

Adds MlflowLLMJudgeEvaluator, a new BaseEvaluator implementation that uses an LLM as a judge to evaluate RAG quality. It integrates with MLflow's mlflow.genai.evaluate() framework via custom @scorer functions, while routing judge LLM calls through the OpenAI client to any OpenAI-compatible endpoint (vLLM, TGI, etc.).

Motivation

The existing UnitxtEvaluator uses algorithmic (non-LLM) metrics which are fast and deterministic but limited in capturing nuanced quality dimensions like helpfulness, coherence, or domain-specific correctness. LLM-as-a-Judge enables richer evaluation — users can now optimize RAG pipelines using LLM-judged metrics as the objective function in the HPO loop, with results tracked in MLflow.

MLflow 3.x's built-in Guidelines scorer hardcodes requests to api.openai.com for openai:/ URIs, ignoring OPENAI_API_BASE. This implementation works around that by using custom scorers that call the judge LLM directly via the OpenAI client with a configurable base_url.

Changes

  • Add ANSWER_RELEVANCE to MetricType in base_evaluator.py
  • Add MlflowLLMJudgeEvaluator, LLMJudgeConfig, CustomMetricDefinition in new mlflow_llm_judge_evaluator.py
  • Built-in judge prompts for all four metrics (answer_correctness, faithfulness, context_correctness, answer_relevance) on a 1-5 scale normalized to [0.0, 1.0]
  • Support for user-defined custom metrics via CustomMetricDefinition
  • Accept custom evaluator metric names in experiment.py optimization_metric validation
  • Conditional export in evaluator/__init__.py
  • Add mlflow>=3.0.0 and openai>=1.0.0 as optional dependencies (ai4rag[llm-judge])
  • Add design document docs/design/llm-as-judge-design.md

Testing

  • 30 unit tests covering config, scoring, normalization, eval data building, scorer construction, result formatting, full evaluate_metrics flow, and MetricType extensions
  • Integration tested against Llama 3.1 8B Instruct deployed on OpenShift via vLLM — correct answers scored 1.0, incorrect/hallucinated answers scored 0.0

Checklist

  • Tests added/updated
  • Documentation updated
  • Code follows style guide
  • All checks passing

@piotrhm piotrhm marked this pull request as draft June 15, 2026 16:23
piotrhm added 2 commits June 16, 2026 09:37
Signed-off-by: “Piotr <phelm@redhat.com>
Signed-off-by: “Piotr <phelm@redhat.com>
piotrhm and others added 4 commits June 16, 2026 09:59
Narrow except clause from Exception to (OpenAIError, ValueError, KeyError)
and reduce local variables in _format_results by inlining intermediates.

Signed-off-by: Piotr <phelm@redhat.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: “Piotr <phelm@redhat.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: “Piotr <phelm@redhat.com>
Add pytest.importorskip for mlflow and openai so tests are skipped
when these optional dependencies are not installed. Also remove
unused numpy import and apply Black formatting.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: “Piotr <phelm@redhat.com>
Llama and other open models often wrap JSON in markdown fences or
add preamble text. Add _extract_json() that tries direct parsing,
then markdown fence extraction, then first brace-pair extraction.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: “Piotr <phelm@redhat.com>
@piotrhm piotrhm marked this pull request as ready for review June 16, 2026 08:51

@LukaszCmielowski LukaszCmielowski left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add mlflow>=3.0.0 and openai>=1.0.0 as optional dependencies (ai4rag[llm-judge])

If there is no need for mlflow server I would not make it optional. LLMaJ should be mandatory setup to make autoRAG work with desired quality.

@jakub-walaszczyk jakub-walaszczyk left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While the implementation of MlflowEvaluator class is nice there are several things to reconsider:

  1. As pipelines-components code is the user of the evalutoar, we should probably keep the logic of metrics selection and validation within ai4rag. What I mean is that metrics for llmaaj should have some llmaaj prefix, and evaluator should be instantiated in the Experiment orchestrator class based on the metric name. The question is whether we want to use llmaaj only when explicitly requested, change metrics names for that purpose, use it as default? This should be clarified.
  2. We should definitely remove OpenAI endpoint default. If by any mistake we will send some user data to the undesired URL this will be very inapropriate.
  3. There is open question with model used for llmaaj metrics assesment. We assume user will pass model, but the user in the different context is us using it in pipelines-components. How do we select the model? How do we select settings? What is the mapping between user's provided metric and whether llmaaj will or will not be used.

These questions should be partially answered by the top design of how this functionality will be triggered.

Maybe we should calculate all of the metrics (why not) and the llmaaj ones simply start with llmaaj_ prefix, e.g. (llmaaj_faithfulness). In that case we can assume 2 evaluators will be used for the metrics assessment or single combined evaluator.

Comment thread pyproject.toml
Comment on lines +58 to +59
"mlflow>=3.0.0",
"openai>=1.0.0",

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be best to lock version to enable only patch changes (see ~=x.y.z). OpenAI client is released in the 2.y.z version already, and following this constrain we will enable this major shift as well. Let's use latest compatible version and lock it with ~=

ANSWER_CORRECTNESS = "answer_correctness"
FAITHFULNESS = "faithfulness"
CONTEXT_CORRECTNESS = "context_correctness"
ANSWER_RELEVANCE = "answer_relevance"

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This minimal mention in the docstring to follow used convention would be appreciated

Comment on lines +8 to +14
try:
from ai4rag.evaluator.mlflow_llm_judge_evaluator import (
LLMJudgeConfig,
MlflowLLMJudgeEvaluator,
)
except ImportError:
pass

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using silent pass when we end up in the import error seems not like an option. We should AT LEAST log something on the info level.

Comment on lines +67 to +71
base_url: str = "https://api.openai.com/v1"
api_key: str = ""
model: str = "gpt-4o-mini"
temperature: float = 0.0
custom_metrics: list[CustomMetricDefinition] = field(default_factory=list)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove this default with open AI URL. We do not want to end upt accidentally sending data to wrong endpoints. APIKEY should be required and base url should be required. same for model, no defaults.

Comment on lines +92 to +98
Where:
- 1 = completely fails the criterion
- 2 = mostly fails with some relevant elements
- 3 = partially meets the criterion
- 4 = mostly meets with minor gaps
- 5 = fully meets the criterion
"""

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For reach comparison and optimization, having 5 points scale seems to be somehow not enough. After normalization we will end up with 0.2 0.4. maybe forcing 10 points scale will make more sense to find right balance. To be discussed?

btw, isn't there a custom mlflow mechanism for that?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants