Add LLM-as-a-Judge evaluator using MLflow and OpenAI-compatible endpoints #72
piotrhm wants to merge 7 commits into
Signed-off-by: “Piotr <phelm@redhat.com>
Narrow except clause from Exception to (OpenAIError, ValueError, KeyError) and reduce local variables in _format_results by inlining intermediates. Signed-off-by: Piotr <phelm@redhat.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: “Piotr <phelm@redhat.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: “Piotr <phelm@redhat.com>
Add pytest.importorskip for mlflow and openai so tests are skipped when these optional dependencies are not installed. Also remove unused numpy import and apply Black formatting. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: “Piotr <phelm@redhat.com>
Llama and other open models often wrap JSON in markdown fences or add preamble text. Add _extract_json() that tries direct parsing, then markdown fence extraction, then first brace-pair extraction. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: “Piotr <phelm@redhat.com>
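The three-step fallback described in the last commit message can be sketched roughly as follows. This is not the PR's `_extract_json()`; the name `extract_json_sketch`, the fence regex, and the reading of "first brace-pair" as "first `{` to last `}`" are assumptions made for illustration.

```python
import json
import re

# Build the fence marker programmatically so this snippet can itself sit in a fenced block.
FENCE = "`" * 3
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def extract_json_sketch(text: str) -> dict:
    """Return the first JSON object found by trying three strategies in order."""
    # 1. Direct parsing of the whole response.
    candidates = [text.strip()]

    # 2. Contents of the first markdown fence, if any.
    fence = FENCE_RE.search(text)
    if fence:
        candidates.append(fence.group(1).strip())

    # 3. Everything between the first "{" and the last "}" (assumed interpretation).
    start, end = text.find("{"), text.rfind("}")
    if 0 <= start < end:
        candidates.append(text[start : end + 1])

    for candidate in candidates:
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    raise ValueError("no JSON object found in model output")
```

Raising `ValueError` on failure would fit the narrowed `except (OpenAIError, ValueError, KeyError)` clause mentioned in the earlier commit, though whether the PR does this is not shown here.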
LukaszCmielowski left a comment
Add mlflow>=3.0.0 and openai>=1.0.0 as optional dependencies (ai4rag[llm-judge])
If no MLflow server is needed, I would not make it optional. LLMaJ should be a mandatory part of the setup for autoRAG to reach the desired quality.
jakub-walaszczyk left a comment
While the implementation of the MlflowEvaluator class is nice, there are several things to reconsider:
- As pipelines-components code is the user of the evaluator, we should probably keep the logic of metric selection and validation within ai4rag. What I mean is that metrics for llmaaj should have some llmaaj prefix, and the evaluator should be instantiated in the Experiment orchestrator class based on the metric name. The question is whether we want to use llmaaj only when explicitly requested, change metric names for that purpose, or use it as the default. This should be clarified.
- We should definitely remove the OpenAI endpoint default. If by any mistake we send user data to an undesired URL, this will be very inappropriate.
- There is an open question about the model used for llmaaj metrics assessment. We assume the user will pass the model, but the user in a different context is us, using it in pipelines-components. How do we select the model? How do we select settings? What is the mapping between the user's provided metric and whether llmaaj will or will not be used?
These questions should be partially answered by the top-level design of how this functionality will be triggered.
Maybe we should calculate all of the metrics (why not), with the llmaaj ones simply starting with the llmaaj_ prefix (e.g. llmaaj_faithfulness). In that case we can assume that either two evaluators or a single combined evaluator will be used for the metrics assessment.
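One way to picture the prefix-based routing proposed above is a small helper that splits requested metric names into two groups, one per evaluator. The helper name `split_metrics` and the constant `LLM_JUDGE_PREFIX` are hypothetical and only illustrate the idea; nothing here is taken from ai4rag.

```python
LLM_JUDGE_PREFIX = "llmaaj_"  # prefix suggested in the review comment


def split_metrics(metric_names: list[str]) -> tuple[list[str], list[str]]:
    """Split metric names into (algorithmic, LLM-judged) based on the prefix."""
    algorithmic: list[str] = []
    llm_judged: list[str] = []
    for name in metric_names:
        if name.startswith(LLM_JUDGE_PREFIX):
            # Strip the prefix so the judge evaluator sees the bare metric name.
            llm_judged.append(name[len(LLM_JUDGE_PREFIX):])
        else:
            algorithmic.append(name)
    return algorithmic, llm_judged
```

An orchestrator could then instantiate the LLM-judge evaluator only when the second list is non-empty, which would also settle when a judge model has to be configured.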
    "mlflow>=3.0.0",
    "openai>=1.0.0",
It would be best to lock the version so that only patch changes are allowed (see ~=x.y.z). The OpenAI client has already been released in a 2.y.z version, and the current constraint would allow that major shift as well. Let's use the latest compatible version and lock it with ~=.
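To illustrate the difference between the two constraint styles, here is a small stdlib-only sketch of the PEP 440 compatible-release rule (`~=X.Y.Z` means `>=X.Y.Z` and `==X.Y.*`). It assumes plain numeric versions without pre-release or post-release tags, and the version numbers in the usage note are illustrative only.

```python
def _parse(version: str) -> tuple[int, ...]:
    """Turn '1.5.3' into (1, 5, 3). Assumes purely numeric components."""
    return tuple(int(part) for part in version.split("."))


def _pad(version: tuple[int, ...], width: int) -> tuple[int, ...]:
    """Right-pad with zeros so that (1, 5) compares equal to (1, 5, 0)."""
    return version + (0,) * (width - len(version))


def at_least(candidate: str, minimum: str) -> bool:
    """Behaviour of a '>=minimum' constraint."""
    cand, low = _parse(candidate), _parse(minimum)
    width = max(len(cand), len(low))
    return _pad(cand, width) >= _pad(low, width)


def compatible_release(candidate: str, pinned: str) -> bool:
    """Behaviour of a '~=pinned' constraint: >= pinned, same leading components."""
    pin = _parse(pinned)
    if len(pin) < 2:
        raise ValueError("~= needs at least two version components")
    prefix = pin[:-1]  # every component except the last must match exactly
    cand = _pad(_parse(candidate), len(pin))
    return at_least(candidate, pinned) and cand[: len(prefix)] == prefix
```

With these helpers, `at_least("2.0.0", "1.0.0")` is true, which is the major shift the comment warns about, while `compatible_release("1.5.3", "1.5.0")` is true and `compatible_release("1.6.0", "1.5.0")` is false.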
    ANSWER_CORRECTNESS = "answer_correctness"
    FAITHFULNESS = "faithfulness"
    CONTEXT_CORRECTNESS = "context_correctness"
    ANSWER_RELEVANCE = "answer_relevance"
A minimal mention in the docstring, following the convention used elsewhere, would be appreciated.
    try:
        from ai4rag.evaluator.mlflow_llm_judge_evaluator import (
            LLMJudgeConfig,
            MlflowLLMJudgeEvaluator,
        )
    except ImportError:
        pass
Silently passing when we hit an import error does not seem like an option. We should AT LEAST log something at the info level.
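A minimal sketch of what logging instead of a silent `pass` could look like, using only the standard library. The helper name `optional_import` is hypothetical; the real code imports specific names from `ai4rag.evaluator.mlflow_llm_judge_evaluator`, so this only shows the logging pattern.

```python
import importlib
import logging
from types import ModuleType
from typing import Optional

logger = logging.getLogger(__name__)


def optional_import(module_name: str) -> Optional[ModuleType]:
    """Import an optional module, logging at info level when it is unavailable."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # The caller can still proceed without the feature, but the reason is recorded.
        logger.info("Optional module %s is unavailable: %s", module_name, exc)
        return None
```

Returning `None` keeps the call site explicit about the missing feature, for example by checking the result before exposing the LLM-judge evaluator.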
    base_url: str = "https://api.openai.com/v1"
    api_key: str = ""
    model: str = "gpt-4o-mini"
    temperature: float = 0.0
    custom_metrics: list[CustomMetricDefinition] = field(default_factory=list)
Please remove this default OpenAI URL. We do not want to end up accidentally sending data to the wrong endpoint. The API key and base URL should both be required. Same for the model: no defaults.
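As a rough sketch of the requested change, the three connection fields can simply lose their defaults, with an extra check that rejects empty strings. The class name `JudgeConfigSketch` and the `__post_init__` validation are assumptions for illustration, and `list[Any]` stands in for the PR's `list[CustomMetricDefinition]`.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class JudgeConfigSketch:
    # Required: no defaults, so data cannot silently go to an unintended endpoint.
    base_url: str
    api_key: str
    model: str
    # Optional settings keep the defaults shown in the reviewed snippet.
    temperature: float = 0.0
    custom_metrics: list[Any] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject empty values as well as missing ones (assumed extra safeguard).
        for name in ("base_url", "api_key", "model"):
            if not getattr(self, name):
                raise ValueError(f"{name} must be provided and non-empty")
```

Omitting any of the three required arguments then fails at construction time with a `TypeError`, and passing an empty string fails with a `ValueError`.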
    Where:
    - 1 = completely fails the criterion
    - 2 = mostly fails with some relevant elements
    - 3 = partially meets the criterion
    - 4 = mostly meets with minor gaps
    - 5 = fully meets the criterion
    """
For rich comparison and optimization, a 5-point scale seems not quite enough. After normalization we will end up with values like 0.2 and 0.4. Maybe forcing a 10-point scale would make more sense to find the right balance. To be discussed?
Btw, isn't there a custom MLflow mechanism for that?
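The granularity concern can be made concrete with a tiny sketch. It assumes normalization is simply `score / max_score`, which matches the 0.2 and 0.4 values mentioned above but is not confirmed by the PR; the function name `normalized_levels` is hypothetical.

```python
def normalized_levels(points: int) -> list[float]:
    """List the distinct normalized scores available on a 1..points scale."""
    # Assumed normalization: raw score divided by the maximum score.
    return [round(score / points, 2) for score in range(1, points + 1)]
```

Under that assumption a 5-point scale gives the optimizer only five distinct objective values, while a 10-point scale gives ten, in steps of 0.1 instead of 0.2.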
Description
Adds MlflowLLMJudgeEvaluator, a new BaseEvaluator implementation that uses an LLM as a judge to evaluate RAG quality. It integrates with MLflow's mlflow.genai.evaluate() framework via custom @scorer functions, while routing judge LLM calls through the OpenAI client to any OpenAI-compatible endpoint (vLLM, TGI, etc.).
Motivation
The existing UnitxtEvaluator uses algorithmic (non-LLM) metrics, which are fast and deterministic but limited in capturing nuanced quality dimensions like helpfulness, coherence, or domain-specific correctness. LLM-as-a-Judge enables richer evaluation: users can now optimize RAG pipelines using LLM-judged metrics as the objective function in the HPO loop, with results tracked in MLflow.
MLflow 3.x's built-in Guidelines scorer hardcodes requests to api.openai.com for openai:/ URIs, ignoring OPENAI_API_BASE. This implementation works around that by using custom scorers that call the judge LLM directly via the OpenAI client with a configurable base_url.
Changes
- ANSWER_RELEVANCE to MetricType in base_evaluator.py
- MlflowLLMJudgeEvaluator, LLMJudgeConfig, CustomMetricDefinition in new mlflow_llm_judge_evaluator.py
- CustomMetricDefinition
- experiment.py optimization_metric validation
- evaluator/__init__.py
- Add mlflow>=3.0.0 and openai>=1.0.0 as optional dependencies (ai4rag[llm-judge])
- docs/design/llm-as-judge-design.md
Testing
Checklist