fix: NameError during evaluation of llamaindex query engine (#2331)

Prigoistic · jjmachan · anistark · web-flow · commit 48fe70e6ffa6 · 2025-10-21T19:23:37.000+05:30
### Issue Link / Problem Description

- Fixes [#2330](#2330)
- Evaluating a LlamaIndex query engine raised a runtime NameError: `EvaluationResult` not defined, because it was imported only under `t.TYPE_CHECKING`. Intermittent LlamaIndex execution failures also led to `IndexError` during result collection due to mismatched lengths.

### Changes Made

- Import `EvaluationResult` at runtime from `ragas.dataset_schema` in `src/ragas/integrations/llama_index.py`.
- Make response/context collection robust:
  - Handle failed executor jobs (NaN placeholders) by inserting empty response/context to maintain alignment with dataset size.
  - Prevent `IndexError` during dataset augmentation.
- Light defensive checks to ensure stable evaluation even when some query-engine calls fail.

### Testing

- Automated tests added/updated

### How to Test

- Manual testing steps:
  1. Install for local dev: `uv run pip install -e . -e ./examples`
  2. Follow the LlamaIndex integration guide to set up a `query_engine` and `EvaluationDataset`: [docs](https://docs.ragas.io/en/stable/howtos/integrations/_llamaindex/)
  3. Ensure LlamaIndex LLM is configured with `n=1` (or unset) to avoid “n values greater than 1 not support” warnings.
  4. Run an evaluation that previously failed; it should complete without the `NameError` and without `IndexError` during result collection.
  5. Optional: run lints `uv run ruff check .`

### References

- Related issues: [#2330](#2330)
- Documentation: LlamaIndex integration how-to ([link](https://docs.ragas.io/en/stable/howtos/integrations/_llamaindex/))

### Screenshots/Examples (if applicable)

- N/A

---------

Co-authored-by: jjmachan <jamesjithin97@gmail.com>
Co-authored-by: Ani <5357586+anistark@users.noreply.github.com>
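The two failure modes described above can be reproduced in isolation. The following is a minimal sketch, not the ragas code itself: `Fraction` stands in for `EvaluationResult`, and `FakeResponse`, `half`, and `collect` are hypothetical names introduced only for illustration. It assumes, as the description states, that failed jobs come back from the executor as NaN.

```python
from __future__ import annotations

import math
import typing as t
from dataclasses import dataclass, field

if t.TYPE_CHECKING:
    # Seen by type checkers only; this import never runs at runtime.
    from fractions import Fraction


def half() -> Fraction:
    # The annotation is harmless (it is stored as a string), but using the
    # name at runtime raises NameError because it was never imported.
    return Fraction(1, 2)


@dataclass
class FakeResponse:
    # Hypothetical stand-in for a query-engine response object.
    response: t.Optional[str]
    contexts: t.List[str] = field(default_factory=list)


def collect(
    results: t.Sequence[t.Any],
) -> t.Tuple[t.List[t.Optional[str]], t.List[t.Optional[t.List[str]]]]:
    """Collect responses and contexts, keeping one slot per submitted query."""
    responses: t.List[t.Optional[str]] = []
    contexts: t.List[t.Optional[t.List[str]]] = []
    for r in results:
        # Assumption: a failed job is represented by a float NaN.
        if isinstance(r, float) and math.isnan(r):
            responses.append(None)
            contexts.append(None)
            continue
        responses.append(r.response if r.response is not None else "")
        contexts.append(list(r.contexts))
    return responses, contexts
```

Because every result, failed or not, contributes exactly one entry to each list, indexing the lists by sample position later cannot run past the end, which is the property the fix relies on.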
diff --git a/src/ragas/integrations/llama_index.py b/src/ragas/integrations/llama_index.py
@@ -1,9 +1,10 @@
 from __future__ import annotations
 
 import logging
+import math
 import typing as t
 
-from ragas.dataset_schema import EvaluationDataset, SingleTurnSample
+from ragas.dataset_schema import EvaluationDataset, EvaluationResult, SingleTurnSample
 from ragas.embeddings import LlamaIndexEmbeddingsWrapper
 from ragas.evaluation import evaluate as ragas_evaluate
 from ragas.executor import Executor
@@ -18,10 +19,10 @@
         BaseEmbedding as LlamaIndexEmbeddings,
     )
     from llama_index.core.base.llms.base import BaseLLM as LlamaindexLLM
+    from llama_index.core.base.response.schema import Response as LlamaIndexResponse
     from llama_index.core.workflow import Event
 
     from ragas.cost import TokenUsageParser
-    from ragas.evaluation import EvaluationResult
 
 
 logger = logging.getLogger(__name__)
@@ -78,12 +79,21 @@ def evaluate(
         exec.submit(query_engine.aquery, q, name=f"query-{i}")
 
     # get responses and retrieved contexts
-    responses: t.List[str] = []
-    retrieved_contexts: t.List[t.List[str]] = []
+    responses: t.List[t.Optional[str]] = []
+    retrieved_contexts: t.List[t.Optional[t.List[str]]] = []
     results = exec.results()
-    for r in results:
-        responses.append(r.response)
-        retrieved_contexts.append([n.node.text for n in r.source_nodes])
+    for i, r in enumerate(results):
+        # Handle failed jobs which are recorded as NaN in the executor
+        if isinstance(r, float) and math.isnan(r):
+            responses.append(None)
+            retrieved_contexts.append(None)
+            logger.warning(f"Query engine failed for query {i}: '{queries[i]}'")
+            continue
+
+        # Cast to LlamaIndex Response type for proper type checking
+        response: LlamaIndexResponse = t.cast("LlamaIndexResponse", r)
+        responses.append(response.response if response.response is not None else "")
+        retrieved_contexts.append([n.get_text() for n in response.source_nodes])
 
     # append the extra information to the dataset
     for i, sample in enumerate(samples):