Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Community: Add sqldocstore support for ParentChildRetriever #30300

Open
wants to merge 24 commits into
base: master
Choose a base branch
from

Conversation

hslee16
Copy link
Contributor

@hslee16 hslee16 commented Mar 16, 2025

Description

This PR aims to add support for sqldocstore to be used by ParentDocumentRetriever class

ParentDocumentRetriever class is passing in Documents to mset here:
https://github.com/langchain-ai/langchain/blob/master/libs/langchain/langchain/retrievers/parent_document_retriever.py#L135

Issue

Using langchain_community.storage.sql.SQLStore with the ParentChildRetriever causes the following exception:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[7], line 1
----> 1 retriever.add_documents(data, ids=None)

File langchain/retrievers/parent_document_retriever.py:135, in ParentDocumentRetriever.add_documents(self, documents, ids, add_to_docstore, **kwargs)
    133 self.vectorstore.add_documents(docs, **kwargs)
    134 if add_to_docstore:
--> 135     self.docstore.mset(full_docs)

File langchain_community/storage/sql.py:236, in SQLStore.mset(self, key_value_pairs)
    228 self._mdelete(list(values.keys()), session)
    229 session.add_all(
    230     [
    231         LangchainKeyValueStores(namespace=self.namespace, key=k, value=v)
   (...)
    234     ]
    235 )
--> 236 session.commit()

File sqlalchemy/orm/session.py:2032, in Session.commit(self)
   2029 if trans is None:
   2030     trans = self._autobegin_t()
-> 2032 trans.commit(_to_root=True)

File <string>:2, in commit(self, _to_root)
...
...
...

TypeError: can't escape Document to binary

Add tests and docs

A new test has been added to make sure that SqlDocstore supports langchain document types.

Copy link

vercel bot commented Mar 16, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Skipped Deployment
Name Status Preview Comments Updated (UTC)
langchain ⬜️ Ignored (Inspect) Visit Preview Mar 20, 2025 3:33pm

@dosubot dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. community Related to langchain-community Ɑ: retriever Related to retriever module labels Mar 16, 2025
@hslee16 hslee16 changed the title Add sqldocstore support for ParentChildRetriever Community: Add sqldocstore support for ParentChildRetriever Mar 16, 2025
@hslee16 hslee16 requested a review from OskarStark March 17, 2025 23:17
@OskarStark
Copy link
Contributor

I propose to rename the PR title to:

feat(community): add `sqldocstore` support in `ParentChildRetriever`

@hslee16 hslee16 requested a review from OskarStark March 19, 2025 00:55
@a-holm
Copy link

a-holm commented Mar 30, 2025

Found two potential issues.

Metadata Loss: The _bytes_or_document method currently only stores the Document.page_content and discards the metadata. ParentDocumentRetriever relies on retrieving the full parent document (including metadata) from the docstore. Storing only the content will likely break the retriever's functionality later on, even though the initial mset call succeeds. A better approach would be to serialize the entire Document object (e.g., using pickle or JSON) into bytes for storage, ensuring that mget can retrieve the bytes and potentially deserialize them back into a full Document (perhaps via a wrapper class like EncoderBackedStore or similar patterns used elsewhere).

Inconsistent Type Hint: The type hint for key_value_pairs in mset was updated to include Document, but the corresponding hint in amset was not.

@hslee16
Copy link
Contributor Author

hslee16 commented Mar 31, 2025

@a-holm thanks for the comments! I'll address them ASAP.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
community Related to langchain-community Ɑ: retriever Related to retriever module size:M This PR changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants