Conversation
print(f" Page index key check: {matched}/{len(sampled)} sampled source_ids found")

def main() -> int:
Why not make this a tool we can call via import, instead of a main function.
Core evaluation logic has been moved into `nemo_retriever.evaluation` (an importable package, pip-installable via `nemo_retriever[eval]`).
Greptile Summary

This PR adds a substantial pluggable QA evaluation harness for retrieval quality measurement, introducing multi-tier scoring (Tier-1 retrieval recall, Tier-2 token F1, Tier-3 LLM-as-judge), a full graph-pipeline-based execution model, and support for swappable datasets, LLMs, and retrieval backends. The architecture is well-structured with clean protocol abstractions. Two bugs in the changed files warrant attention before merging:
| Filename | Overview |
|---|---|
| nemo_retriever/src/nemo_retriever/evaluation/retrieval_loader.py | New source operator that loads retrieval JSON + ground truth CSV into a DataFrame; has a P1 bug where a broad except ValueError swallows the actionable data_dir error for bo767_infographic. |
| tools/harness/src/nv_ingest_harness/utils/recall.py | Added get_retrieval_func helper and updated recall functions; Milvus path in both get_recall_scores and get_recall_scores_pdf_only still passes hybrid=sparse instead of hybrid=hybrid. |
| nemo_retriever/src/nemo_retriever/evaluation/orchestrator.py | Core QAEvalPipeline: well-structured threaded generation+judging+scoring loop; default-arg closure fix for late-binding is present and correct; aggregate output format is comprehensive. |
| nemo_retriever/src/nemo_retriever/evaluation/scoring.py | Multi-tier scoring with word-set Tier-1 check (previously flagged substring bug fixed), SQuAD-style token F1, and classify_failure with judge_error sentinel; logic is clean. |
| nemo_retriever/src/nemo_retriever/evaluation/config.py | Config loading, env-var expansion, and legacy→new format normalization; empty evaluations guard is present; check_unresolved_env called in runner for API keys. |
| nemo_retriever/src/nemo_retriever/evaluation/ground_truth.py | Dataset loaders for bo767_infographic, ViDoRe v3, and generic CSV; bo767_infographic now correctly validates data_dir before calling os.path.join. |
| nemo_retriever/src/nemo_retriever/evaluation/judges.py | LLM-as-judge with JSON parsing and regex fallback; returns JudgeResult(error=...) on failure instead of raising; empty_candidate short-circuit is correct. |
| nemo_retriever/src/nemo_retriever/evaluation/generators.py | Unified LiteLLMClient wrapping litellm; thinking_truncated sentinel when strip_think_tags returns empty; extra_params applied last, which can intentionally override call kwargs. |
| nemo_retriever/src/nemo_retriever/evaluation/runner.py | New run_eval_sweep function replaces deleted script logic; check_unresolved_env guards on API keys per-eval; timestamped JSON output per run is clean. |
| tools/harness/src/nv_ingest_harness/utils/qa/__init__.py | TopKRetriever correctly passes hybrid=self.hybrid to the Milvus retrieval function (previously flagged swap is resolved); collection existence pre-check is a good defensive pattern. |
| tools/harness/src/nv_ingest_harness/cases/qa_eval.py | Imports _expand_env_vars directly from nemo_retriever.evaluation.config (deduplication resolved); _build_retriever correctly defers heavy harness imports to the topk branch. |
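The table above describes the Tier-2 score as a SQuAD-style `token_f1`. A minimal sketch of what such a token-overlap F1 typically looks like; the tokenization and normalization details here are assumptions for illustration, not the PR's actual implementation:

```python
import re
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """SQuAD-style token-overlap F1 between a generated answer and the reference."""
    pred_tokens = re.findall(r"\w+", prediction.lower())
    ref_tokens = re.findall(r"\w+", reference.lower())
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, `token_f1("the cat", "the cat sat")` gives precision 1.0 and recall 2/3, so F1 is 0.8.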
Sequence Diagram
```mermaid
sequenceDiagram
    participant User
    participant CLI as evaluation/cli.py
    participant Loader as RetrievalLoaderOperator
    participant GT as ground_truth.py
    participant FR as FileRetriever/TopKRetriever
    participant Gen as QAGenerationOperator(LiteLLMClient)
    participant Judge as JudgingOperator(LLMJudge)
    participant Scorer as ScoringOperator(scoring.py)
    User->>CLI: retriever eval run --config eval_sweep.yaml
    CLI->>Loader: process(None)
    Loader->>GT: get_qa_dataset_loader(source)(data_dir)
    GT-->>Loader: qa_pairs list[dict]
    Loader->>FR: retrieve(query, top_k) per pair
    FR-->>Loader: RetrievalResult(chunks, metadata)
    Loader-->>Gen: DataFrame(query, reference_answer, context)
    Gen->>Gen: LiteLLMClient.generate() [ThreadPoolExecutor]
    Gen-->>Judge: DataFrame + answer, gen_error cols
    Judge->>Judge: LLMJudge.judge() [ThreadPoolExecutor]
    Judge-->>Scorer: DataFrame + judge_score, judge_reasoning cols
    Scorer->>Scorer: answer_in_context(), token_f1(), classify_failure()
    Scorer-->>CLI: DataFrame with Tier1/2/3 metrics + failure_mode
    CLI->>User: JSON results written to results_dir
```
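The file overview above notes that the orchestrator's threaded generation loop relies on a "default-arg closure fix for late-binding." Here is that pattern in isolation, with illustrative names rather than the PR's actual code: closures created in a loop all share one variable cell, so a default argument is used to snapshot each value at definition time before handing the callables to a `ThreadPoolExecutor`:

```python
from concurrent.futures import ThreadPoolExecutor

queries = ["q1", "q2", "q3"]

# Late-binding pitfall: each lambda reads `q` when *called*, so after the
# loop finishes, all three see the final value "q3".
late = [lambda: q for q in queries]

# Default-argument fix: `q=q` snapshots the current value at definition time.
bound = [lambda q=q: q for q in queries]

with ThreadPoolExecutor(max_workers=3) as pool:
    late_results = [f.result() for f in [pool.submit(fn) for fn in late]]
    bound_results = [f.result() for f in [pool.submit(fn) for fn in bound]]
```

Without the fix, `late_results` is `["q3", "q3", "q3"]`; with it, `bound_results` is `["q1", "q2", "q3"]`.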
Comments Outside Diff (1)
- tools/harness/src/nv_ingest_harness/utils/recall.py, lines 137-146: **`hybrid=sparse` should be `hybrid=hybrid`.** `nvingest_retrieval` is called with `hybrid=sparse`, so the `hybrid` parameter passed to `get_recall_scores` has zero effect on the Milvus path. The LanceDB path (via `get_retrieval_func`) correctly threads through `hybrid=hybrid`, making the inconsistency clear. The same fix applies to the identical call in `get_recall_scores_pdf_only` at line 258.
Prompt To Fix All With AI
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/evaluation/retrieval_loader.py
Line: 71-76
Comment:
**Overly-broad `except ValueError` swallows the `data_dir` error**
Both `get_qa_dataset_loader()` and `loader_fn(self._data_dir)` are inside the same `try` block. When `source="bo767_infographic"` and `self._data_dir=None`, `loader_fn(None)` raises `ValueError("bo767_infographic dataset requires data_dir to be set.")`. That ValueError is caught here and the code falls back to `load_generic_csv("bo767_infographic")`, which then raises a confusing `FileNotFoundError: CSV file not found: bo767_infographic` — swallowing the actionable message entirely.
Split the two operations so only the "unknown dataset" branch falls back to generic CSV:
```suggestion
try:
loader_fn = get_qa_dataset_loader(source)
except ValueError:
qa_pairs = load_generic_csv(source)
else:
qa_pairs = loader_fn(self._data_dir)
```
How can I resolve this? If you propose a fix, please make it concise.
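To see why the split matters, here is a runnable miniature of the pattern. The loader bodies are stand-ins; only the try/except/else shape mirrors the suggested fix. With the narrow `try`, an unknown dataset still falls back to the generic CSV loader, while the loader's own actionable `ValueError` propagates to the caller instead of being swallowed:

```python
def load_bo767(data_dir):
    # Stand-in for the real bo767_infographic loader.
    if data_dir is None:
        raise ValueError("bo767_infographic dataset requires data_dir to be set.")
    return [{"query": "...", "answer": "..."}]


def get_qa_dataset_loader(source):
    loaders = {"bo767_infographic": load_bo767}
    try:
        return loaders[source]
    except KeyError:
        raise ValueError(f"unknown dataset: {source}")


def load_qa_pairs(source, data_dir):
    # Only the "unknown dataset" ValueError triggers the generic-CSV fallback;
    # a ValueError raised *inside* the loader propagates intact.
    try:
        loader_fn = get_qa_dataset_loader(source)
    except ValueError:
        return [{"source": "generic_csv", "path": source}]  # stand-in fallback
    else:
        return loader_fn(data_dir)
```

Calling `load_qa_pairs("bo767_infographic", None)` now raises the actionable `data_dir` error rather than a confusing `FileNotFoundError` from the CSV fallback.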
---
This is a comment left during a code review.
Path: tools/harness/src/nv_ingest_harness/utils/recall.py
Line: 137-146
Comment:
**`hybrid=sparse` should be `hybrid=hybrid`**
`nvingest_retrieval` is called with `hybrid=sparse`, so the `hybrid` parameter passed to `get_recall_scores` has zero effect on the Milvus path. The LanceDB path (via `get_retrieval_func`) correctly threads through `hybrid=hybrid`, making the inconsistency clear. The same fix applies to the identical call in `get_recall_scores_pdf_only` at line 258.
```suggestion
batch_answers = nvingest_retrieval(
batch_queries,
collection_name,
hybrid=hybrid,
```
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: tools/harness/src/nv_ingest_harness/utils/recall.py
Line: 256-265
Comment:
**Same `hybrid=sparse` swap in `get_recall_scores_pdf_only`**
Same bug as the Milvus call in `get_recall_scores` (line 140): `hybrid=sparse` ignores the `hybrid` argument for all Milvus-backed PDF-only recall runs.
```suggestion
batch_answers = nvingest_retrieval(
batch_queries,
collection_name,
hybrid=hybrid,
```
How can I resolve this? If you propose a fix, please make it concise.

Reviews (10): Last reviewed commit: "restored singular column names for test"
jperez999 left a comment:
Moving in the right direction. Let's remove all the changes to the harness that are not in nemo_retriever; that will slim down the PR quite a bit. Also, unless you feel it is really helpful, let's remove all the extra tools you added and replace them with helper functions for those actions. We should refactor to make it possible to tack these operators onto the graph in graph_pipeline.py or onto the Retriever object already in use. We should be trying to reuse as much of the objects that we have as possible. Keep in mind, everything here is a discussion; if you feel it is better the way you have it, please explain it to me.
# ---------------------------------------------------------------------------

def run_agentic_retrieval(
So this is something that we need to do separately from the graph_pipeline.py entry point? Can't we just add the operators we want and use that same entry point? It would then allow us to make changes to the query file and datasets and should still get the same behavior.
--output data/test_retrieval/bo767_retrieval_dense.json
"""

from __future__ import annotations
Why create a whole new file to do what graph_pipeline already mostly does?
This script exists because retrieval-bench only works with HuggingFace datasets out of the box. We would need this file to load our extraction Parquets, expand chunk hits to full-page markdown, and output the FileRetriever JSON that our QA eval pipeline expects.
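The expansion step described in that reply can be sketched with plain dicts standing in for the loaded Parquet rows. All field names here (`source_id`, `page`, `score`, `text`) and the output shape are assumptions for illustration, not the actual FileRetriever JSON schema:

```python
# Chunk-level retrieval hits, as they might come back from the retriever.
chunk_hits = [
    {"query": "What is the revenue?", "source_id": "doc1.pdf", "page": 3, "score": 0.91},
    {"query": "What is the revenue?", "source_id": "doc1.pdf", "page": 7, "score": 0.64},
]

# Stand-in for the extraction Parquet: (source_id, page) -> full-page markdown.
pages = {
    ("doc1.pdf", 3): "# Page 3\nRevenue was $10M.",
    ("doc1.pdf", 7): "# Page 7\nOther details.",
}


def expand_hits(hits, pages):
    """Expand chunk-level hits to full-page markdown, grouped by query."""
    out = {}
    for hit in hits:
        out.setdefault(hit["query"], []).append({
            "source_id": hit["source_id"],
            "page": hit["page"],
            "score": hit["score"],
            "text": pages[(hit["source_id"], hit["page"])],
        })
    return out


retrieval_json = expand_hits(chunk_hits, pages)
```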
import json
import os

from nv_ingest_harness.cases.e2e import main as e2e_main
Again, it seems like you are creating a whole new graph specifically for this, when what I think we want is the ability to tack these operations onto any graph.
from nemo_retriever.evaluation.types import RetrievalResult

class TopKRetriever:
Why are you adding this in the harness? This should exist in nemo_retriever. All code changes in legacy nv-ingest can be removed unless they are necessary to make nemo_retriever work.
Moving it would pull harness dependencies into nemo_retriever, right? That isn't what we want. It makes more sense in my mind if the harness consumes the nemo_retriever protocol instead of vice versa.
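One way to keep that dependency direction one-way is a structural protocol on the nemo_retriever side. This sketch uses `typing.Protocol` with invented names (`Retriever`, `evaluate`, and the return shape), not the PR's actual classes; it only illustrates the direction being argued for, where the harness implements the interface and nemo_retriever never imports the harness:

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Retriever(Protocol):
    """Structural interface owned by nemo_retriever (hypothetical shape)."""
    def retrieve(self, query: str, top_k: int) -> list[dict]: ...


class FileRetriever:
    """Harness-side implementation; it satisfies Retriever structurally,
    so nemo_retriever needs no import of harness code."""
    def __init__(self, results: dict):
        self._results = results

    def retrieve(self, query: str, top_k: int) -> list[dict]:
        return self._results.get(query, [])[:top_k]


def evaluate(retriever: Retriever, query: str) -> list[dict]:
    # Evaluation code depends only on the protocol, not the implementation.
    return retriever.retrieve(query, top_k=5)
```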
Description
Capabilities:
Note: the CSV containing the Q-A pairs is a subset of the existing https://github.com/NVIDIA/NeMo-Retriever/blob/main/data/digital_corpora_10k_annotations.csv. There is currently a separate PR up with subset annotations for only bo767-specific files: #1730.
Checklist