Add LLM-as-a-Judge evaluator using MLflow and OpenAI-compatible endpoints by piotrhm · Pull Request #72 · IBM/ai4rag

piotrhm · 2026-06-15T16:22:59Z

Description

Adds MlflowLLMJudgeEvaluator, a new BaseEvaluator implementation that uses an LLM as a judge to evaluate RAG quality. It integrates with MLflow's mlflow.genai.evaluate() framework via custom @scorer functions, while routing judge LLM calls through the OpenAI client to any OpenAI-compatible endpoint (vLLM, TGI, etc.).

Motivation

The existing UnitxtEvaluator uses algorithmic (non-LLM) metrics which are fast and deterministic but limited in capturing nuanced quality dimensions like helpfulness, coherence, or domain-specific correctness. LLM-as-a-Judge enables richer evaluation — users can now optimize RAG pipelines using LLM-judged metrics as the objective function in the HPO loop, with results tracked in MLflow.

MLflow 3.x's built-in Guidelines scorer hardcodes requests to api.openai.com for openai:/ URIs, ignoring OPENAI_API_BASE. This implementation works around that by using custom scorers that call the judge LLM directly via the OpenAI client with a configurable base_url.

Changes

Add ANSWER_RELEVANCE to MetricType in base_evaluator.py
Add MlflowLLMJudgeEvaluator, LLMJudgeConfig, CustomMetricDefinition in new mlflow_llm_judge_evaluator.py
Built-in judge prompts for all four metrics (answer_correctness, faithfulness, context_correctness, answer_relevance) on a 1-5 scale normalized to [0.0, 1.0]
Support for user-defined custom metrics via CustomMetricDefinition
Accept custom evaluator metric names in experiment.py optimization_metric validation
Conditional export in evaluator/__init__.py
Add mlflow>=3.0.0 and openai>=1.0.0 as optional dependencies (ai4rag[llm-judge])
Add design document docs/design/llm-as-judge-design.md

Testing

30 unit tests covering config, scoring, normalization, eval data building, scorer construction, result formatting, full evaluate_metrics flow, and MetricType extensions
Integration tested against Llama 3.1 8B Instruct deployed on OpenShift via vLLM — correct answers scored 1.0, incorrect/hallucinated answers scored 0.0

Checklist

Tests added/updated
Documentation updated
Code follows style guide
All checks passing

Signed-off-by: “Piotr <phelm@redhat.com>

Narrow except clause from Exception to (OpenAIError, ValueError, KeyError) and reduce local variables in _format_results by inlining intermediates. Signed-off-by: Piotr <phelm@redhat.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: “Piotr <phelm@redhat.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: “Piotr <phelm@redhat.com>

Add pytest.importorskip for mlflow and openai so tests are skipped when these optional dependencies are not installed. Also remove unused numpy import and apply Black formatting. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: “Piotr <phelm@redhat.com>

Llama and other open models often wrap JSON in markdown fences or add preamble text. Add _extract_json() that tries direct parsing, then markdown fence extraction, then first brace-pair extraction. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: “Piotr <phelm@redhat.com>

LukaszCmielowski

Add mlflow>=3.0.0 and openai>=1.0.0 as optional dependencies (ai4rag[llm-judge])

If there is no need for mlflow server I would not make it optional. LLMaJ should be mandatory setup to make autoRAG work with desired quality.

jakub-walaszczyk

While the implementation of MlflowEvaluator class is nice there are several things to reconsider:

As pipelines-components code is the user of the evalutoar, we should probably keep the logic of metrics selection and validation within ai4rag. What I mean is that metrics for llmaaj should have some llmaaj prefix, and evaluator should be instantiated in the Experiment orchestrator class based on the metric name. The question is whether we want to use llmaaj only when explicitly requested, change metrics names for that purpose, use it as default? This should be clarified.
We should definitely remove OpenAI endpoint default. If by any mistake we will send some user data to the undesired URL this will be very inapropriate.
There is open question with model used for llmaaj metrics assesment. We assume user will pass model, but the user in the different context is us using it in pipelines-components. How do we select the model? How do we select settings? What is the mapping between user's provided metric and whether llmaaj will or will not be used.

These questions should be partially answered by the top design of how this functionality will be triggered.

Maybe we should calculate all of the metrics (why not) and the llmaaj ones simply start with llmaaj_ prefix, e.g. (llmaaj_faithfulness). In that case we can assume 2 evaluators will be used for the metrics assessment or single combined evaluator.

jakub-walaszczyk · 2026-06-18T07:23:37Z

+    "mlflow>=3.0.0",
+    "openai>=1.0.0",


It would be best to lock version to enable only patch changes (see ~=x.y.z). OpenAI client is released in the 2.y.z version already, and following this constrain we will enable this major shift as well. Let's use latest compatible version and lock it with ~=

jakub-walaszczyk · 2026-06-18T07:26:41Z

    ANSWER_CORRECTNESS = "answer_correctness"
    FAITHFULNESS = "faithfulness"
    CONTEXT_CORRECTNESS = "context_correctness"
+    ANSWER_RELEVANCE = "answer_relevance"


This minimal mention in the docstring to follow used convention would be appreciated

jakub-walaszczyk · 2026-06-18T07:27:38Z

+try:
+    from ai4rag.evaluator.mlflow_llm_judge_evaluator import (
+        LLMJudgeConfig,
+        MlflowLLMJudgeEvaluator,
+    )
+except ImportError:
+    pass


Using silent pass when we end up in the import error seems not like an option. We should AT LEAST log something on the info level.

jakub-walaszczyk · 2026-06-18T07:31:49Z

+    base_url: str = "https://api.openai.com/v1"
+    api_key: str = ""
+    model: str = "gpt-4o-mini"
+    temperature: float = 0.0
+    custom_metrics: list[CustomMetricDefinition] = field(default_factory=list)


Please remove this default with open AI URL. We do not want to end upt accidentally sending data to wrong endpoints. APIKEY should be required and base url should be required. same for model, no defaults.

jakub-walaszczyk · 2026-06-18T07:35:52Z

+Where:
+- 1 = completely fails the criterion
+- 2 = mostly fails with some relevant elements
+- 3 = partially meets the criterion
+- 4 = mostly meets with minor gaps
+- 5 = fully meets the criterion
+"""


For reach comparison and optimization, having 5 points scale seems to be somehow not enough. After normalization we will end up with 0.2 0.4. maybe forcing 10 points scale will make more sense to find right balance. To be discussed?

btw, isn't there a custom mlflow mechanism for that?

piotrhm marked this pull request as draft June 15, 2026 16:23

piotrhm added 2 commits June 16, 2026 09:37

draft

18f4ee4

Signed-off-by: “Piotr <phelm@redhat.com>

lint fixes

5d73be7

Signed-off-by: “Piotr <phelm@redhat.com>

piotrhm force-pushed the mlflow-llmaj branch from f6f13a8 to 5d73be7 Compare June 16, 2026 07:37

piotrhm and others added 4 commits June 16, 2026 09:59

fix: update copyright year to match file creation date

4c1ab6e

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: “Piotr <phelm@redhat.com>

piotrhm marked this pull request as ready for review June 16, 2026 08:51

piotrhm requested review from LukaszCmielowski and jakub-walaszczyk June 16, 2026 08:51

LukaszCmielowski requested changes Jun 16, 2026

View reviewed changes

Merge branch 'main' into mlflow-llmaj

6d9f21d

jakub-walaszczyk requested changes Jun 18, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add LLM-as-a-Judge evaluator using MLflow and OpenAI-compatible endpoints#72

Add LLM-as-a-Judge evaluator using MLflow and OpenAI-compatible endpoints#72
piotrhm wants to merge 7 commits into
mainfrom
mlflow-llmaj

piotrhm commented Jun 15, 2026

Uh oh!

LukaszCmielowski left a comment

Uh oh!

jakub-walaszczyk left a comment

Uh oh!

jakub-walaszczyk Jun 18, 2026

Uh oh!

jakub-walaszczyk Jun 18, 2026

Uh oh!

jakub-walaszczyk Jun 18, 2026

Uh oh!

jakub-walaszczyk Jun 18, 2026

Uh oh!

jakub-walaszczyk Jun 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

piotrhm commented Jun 15, 2026

Description

Motivation

Changes

Testing

Checklist

Uh oh!

LukaszCmielowski left a comment

Choose a reason for hiding this comment

Uh oh!

jakub-walaszczyk left a comment

Choose a reason for hiding this comment

Uh oh!

jakub-walaszczyk Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

jakub-walaszczyk Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

jakub-walaszczyk Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

jakub-walaszczyk Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

jakub-walaszczyk Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants