
[FIX] Refactored indexing, extraction and retrieval #172

Open
wants to merge 21 commits into main
Conversation

@harini-venkataraman (Contributor) commented Mar 6, 2025

What

  • Separated indexing and extraction into distinct APIs in the prompt service
  • Refactored the SDK to accommodate the new APIs
  • Added subquestion retrieval using the LlamaIndex query engine
  • Refactored the retriever service
  • Skipped vector DB usage when the chunk size is zero
  • Added integration tests for the indexing and extraction APIs

...

Why

Prerequisite for the cell type merge
...

How

Refactored the existing code
...

Relevant Docs

Related Issues or PRs

Zipstack/unstract#1172
Zipstack/unstract#1149
...

Dependencies Versions / Env Variables

Notes on Testing

Added integration tests
...

Screenshots

...

Checklist

I have read and understood the Contribution Guidelines.

        )
        raise IndexingError(str(e)) from e

    def _delete_existing_nodes_on_reindex(self, vector_db, doc_id, doc_id_found):

The function can be renamed to delete_nodes. That way it can be reused elsewhere if there is a need to delete nodes...


Also, please add types for the arguments wherever possible.
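Combining both suggestions, a minimal sketch of a renamed, typed helper — the `VectorDB` protocol below is a stand-in for the SDK's actual class, not its real definition:

```python
from typing import Protocol


class VectorDB(Protocol):
    """Minimal stand-in for the SDK's vector DB wrapper."""

    def delete(self, ref_doc_id: str) -> None: ...


def delete_nodes(vector_db: VectorDB, doc_id: str) -> None:
    """Delete all nodes associated with doc_id, reusable outside reindexing."""
    vector_db.delete(ref_doc_id=doc_id)
```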

@CLAassistant commented Mar 12, 2025

CLA assistant check
All committers have signed the CLA.

@harini-venkataraman harini-venkataraman changed the title REFACTOR : Indexing API [FIX] Refactored indexing, extraction and retrieval Mar 12, 2025
@harini-venkataraman harini-venkataraman marked this pull request as ready for review March 12, 2025 14:26
            return documents
        except Exception as e:
            self.tool.stream_log(
                f"Error deleting nodes for {doc_id}: {e}",

Wrong error message.

        self._capture_metrics = capture_metrics
        self._metrics = {}

    def extract(self):
@gaya3-zipstack commented Mar 12, 2025

Where are we using this function?


class SimpleRetriever(BaseRetriever):
    def __init__(self, vector_db: VectorDB, prompt: str, doc_id: str, top_k: int):
        self.vector_db = vector_db

Can't we merely call the parent's constructor here? Better yet, we should be able to remove the `__init__` entirely, since the parent covers it, right?
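A minimal sketch of the suggestion, assuming `BaseRetriever` already stores these attributes (the base class shown here is illustrative, not the SDK's real one):

```python
class BaseRetriever:
    """Illustrative stand-in for the SDK's base retriever."""

    def __init__(self, vector_db, prompt: str, doc_id: str, top_k: int):
        self.vector_db = vector_db
        self.prompt = prompt
        self.doc_id = doc_id
        self.top_k = top_k


class SimpleRetriever(BaseRetriever):
    # No __init__ needed: the parent constructor covers every attribute
    pass
```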

class RetrievalError(SdkError):
    """Custom exception raised for errors during retrieval from VectorDB."""

    DEFAULT_MESSAGE = (

If this message is going all the way back to the user, then I think we need to shape it up a bit. The user may not connect what the query param here refers to. Instead, we should rephrase it in terms of parameters the user has control over.
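A hypothetical rephrasing along those lines, referencing things the user configures rather than internal query parameters — the wording and the `SdkError` stub here are illustrative, not the SDK's actual text:

```python
class SdkError(Exception):
    """Stand-in for the SDK's base error class."""


class RetrievalError(SdkError):
    """Custom exception raised for errors during retrieval from VectorDB."""

    DEFAULT_MESSAGE = (
        "Unable to retrieve context for the given prompt. Please verify the "
        "configured vector DB adapter and consider re-indexing the document."
    )

    def __init__(self, message: str = DEFAULT_MESSAGE) -> None:
        super().__init__(message)
```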

class ProcessingOptions:
    reindex: bool = False
    enable_highlight: bool = False
    usage_kwargs: dict[Any, Any] = field(default_factory=dict)

What does usage_kwargs carry here? Wondering why we keep it as an attribute of ProcessingOptions.


@log_elapsed(operation="INDEX")
@capture_metrics
def perform_indexing(

Wondering if we should retain the old function names like index or index_document. That would keep things familiar for people reading the code.


Also applicable in other places where relevant.

@chandrasekharan-zipstack left a comment

Left some comments - my main question is why we have both such helper methods for indexing in the SDK and an API call to the prompt-service

Comment on lines +13 to +14
        # TODO: Inherit from StreamMixin and avoid using BaseTool
        self.tool = tool

@harini-venkataraman do we do this only for logging? If that's the case, can we make stream_log() a static method and remove this dependency?
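One possible shape for that, sketched with assumed names — `StreamMixin.stream_log` and its output format are hypothetical, not the SDK's actual API:

```python
import sys


class StreamMixin:
    """Sketch: a logging mixin so helpers need no BaseTool reference."""

    @staticmethod
    def stream_log(message: str, level: str = "INFO") -> None:
        # Write log lines to stderr directly instead of going via the tool
        print(f"[{level}] {message}", file=sys.stderr)
```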

                self.tool.stream_log(f"No nodes found for {doc_id}")
        except Exception as e:
            self.tool.stream_log(
                f"Error querying {instance_identifiers.vector_db_instance_id}: {e},"

NIT: Logging an adapter's UUID is not useful to the user currently.

  1. Either show the UUID in the adapter listing page as well
  2. Or always use the adapter name when logging to the user

        except Exception as e:
            self.tool.stream_log(
                f"Error querying {instance_identifiers.vector_db_instance_id}: {e},"
                " proceeding to index",

I don't think we should assume that we will always index after calling this function. Ideally, this log should be the caller's responsibility.

Comment on lines +146 to +149
        except Exception as e:
            self.tool.stream_log(
                f"Unexpected error during indexing check: {e}", level=LogLevel.ERROR
            )

NIT: I don't think this is needed here. Say an actual error does happen:

  1. We lose context on the error since it's suppressed instead of propagated
  2. We might display some Pythonic error in the user's logs this way

It might be better to propagate the error up the call stack, ensure we log a trace, and respond to the user with something meaningful.
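A sketch of propagating with context instead of suppressing — the `_query_vector_db` helper and the error class shown here are stand-ins, not the SDK's actual code:

```python
class IndexingError(Exception):
    """Stand-in for the SDK's indexing error."""


def _query_vector_db(doc_id: str) -> bool:
    # Stand-in for the real indexing check; raises to demonstrate propagation
    raise RuntimeError("connection refused")


def check_index(doc_id: str) -> bool:
    try:
        return _query_vector_db(doc_id)
    except Exception as e:
        # Preserve the original traceback and surface a meaningful message
        raise IndexingError(f"Indexing check failed for '{doc_id}': {e}") from e
```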

            )
            raise SdkError(f"Error deleting nodes for {doc_id}: {e}") from e

    def _prepare_documents(self, doc_id, full_text) -> list:

NIT: Please add typing for these args

Comment on lines +44 to +51
        # UN-1288 For Pinecone, we are seeing an inconsistent case where
        # query with doc_id fails even though indexing just happened.
        # This causes the following retrieve to return no text.
        # To rule out any lag on the Pinecone vector DB write,
        # the following sleep is added
        # Note: This will not fix the issue. Since this issue is inconsistent
        # and not reproducible easily, this is just a safety net.
        time.sleep(2)

NIT: This should only apply to Pinecone then - why do we do it for all vector DBs?
cc: @gaya3-zipstack
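One way to scope the safety net, assuming the adapter name is available at this point — the function name and the string comparison are assumptions for illustration:

```python
import time


def wait_for_write_consistency(adapter_name: str, delay: float = 2.0) -> None:
    """Safety-net sleep applied only to Pinecone (see UN-1288)."""
    if adapter_name.lower() == "pinecone":
        time.sleep(delay)
```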


        return doc_id_found

    @log_elapsed(operation="INDEX")

Suggested change
@log_elapsed(operation="INDEX")

NIT: Remove this since you've already added it for the index API

        ]
        # Convert raw text to llama index usage Document
        documents = self._prepare_documents(doc_id, full_text)
        self._delete_existing_nodes_on_reindex(vector_db, doc_id, doc_id_found)

Why do we accept doc_id_found in this function? I feel we should get rid of it, and the caller should ensure we skip calling this function in the first place.

Comment on lines +50 to +61
    @log_elapsed(operation="INDEX")
    def index(
        self, payload: dict[str, Any], params: Optional[dict[str, str]] = None
    ) -> dict[str, Any]:
        url_path = "index"
        if self.is_public_call:
            url_path = "index-public"
        return self._post_call(
            url_path=url_path,
            payload=payload,
            params=params,
        )

@harini-venkataraman if we have an API to the prompt-service to take care of this, why do we have helper methods in the SDK? Shouldn't they be in prompt-service?

Comment on lines +6 to +10
class InstanceIdentifiers:
    embedding_instance_id: str
    vector_db_instance_id: str
    x2text_instance_id: str
    llm_instance_id: str

@harini-venkataraman I notice that we pass this object and vector_db / embedding adapter to functions. Can't we make use of the adapters alone everywhere? By making use of such UUIDs we force ourselves to use multiple DB queries every now and then
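A sketch of the alternative: resolve the adapters once and carry the objects instead of UUIDs, avoiding repeated DB lookups — the class and field names here are illustrative:

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class InstanceAdapters:
    """Resolved adapter objects, fetched once up front instead of per-UUID lookups."""

    embedding: Any
    vector_db: Any
    x2text: Any
    llm: Any
```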

4 participants