-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Support for Cherry-Picking Specific Documents via Document IDs #66
base: master
Are you sure you want to change the base?
Conversation
…the functionality of the implemented methods with a simple local test querying a Milvus DB with document ID's. For now, the current functionality is such that the query contents are ignored as soon as doc id's are specified, next step is to create hybrid retrieval
… my environment to work for some reason I don't understand...
…stead of just logging when non are found
…ally and works perfectly. If doc ids are specified it will retrieve k most relevant docs normally with vector similarity search and then it checks if manual doc ids are specified, if they are it checks whether it already retrieved them in the similarity search, if it did not it will prepend them at the beginning of the json result
…ver to only look within specified documents instead of looking through entire DB
… the already provided code for better understanding, tested the functionality locally and it works.
src/mmore/run_retriever.py
Outdated
) | ||
for i, result in enumerate(raw_results[0]) | ||
] | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just use retriever.retrieve(…) with the empty doc_ids_list? It should work too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good Catch!!
And actually, it was my mistake to bypass the invoke
function and call retrieve
directly when doc ids were specified because doing so skipped the callback manager ( for logging, error handling, performance tracking, and consistent and robust retrieval process).
Although everything seemed to work at first glance, I later realized that by not using invoke
like it was initially done in the initial codebase I was missing out on these important features. So I have updated the code now such that it always calls invoke but with optional document_ids in kwargs
🔍 PR: Add Support for Cherry-Picking Specific Documents via Document IDs
Summary
This PR adds the ability to strictly filter the hybrid retrieval to user-specified document IDs. Rather than searching over the entire collection and later patching in documents, the hybrid search itself applies a filter expression (
id in [...]
) to ensure that only the desired doc IDs are considered.By merging the doc ID restriction directly into both dense and sparse searches, we retain the standard hybrid ranking logic but limit it to the documents the user requested. This is particularly beneficial for scenarios where a user (e.g. MOOVE) wants deterministic or curated doc sets, while still leveraging ML-based ranking (e.g. MOORE) to order those docs by relevance.
✅ Main Changes
Support for Document IDs in the Retrieval Flow
retriever.py
: Theretrieve()
method now has an optionaldocument_ids
argument. If provided, it attaches a filter expression (e.g.,id in ["doc1", "doc2", ...]
) to the dense and sparse search requests.read_queries()
(inrun_rag.py
): Now returns a list of dictionaries, each of which can include both the user query text and an optional list of document IDs.run_retriever.py
: Theretrieve()
function parses--document-ids
(if any) into a Python list before calling the updated retriever, which then restricts search accordingly.Removal of Old “Patch-In” Logic
get_documents_by_ids()
remains in the codebase if we need direct doc lookups in the future.CLI Behavior
--document-ids
/-d
flag can optionally be provided as a comma-separated list (e.g.,doc1,doc2,doc3
) in the retrieve part of the CLI.Bug Fixes & Minor Improvements
read_queries()
can return dictionaries withdocument_ids
as well asinput
text._get_relevant_documents()
to passdocument_ids
along toretrieve()
in order to keep it consistent with the new approach.expr
filter in logs, etc.).🧪 Usage Example
queries.jsonl
.--document-ids
is supplied, it will filter todoc1, doc2, doc3
only.results.json
.