Add Support for Cherry-Picking Specific Documents via Document IDs #66

leagrieder · 2025-03-21T13:16:03Z

🔍 PR: Add Support for Cherry-Picking Specific Documents via Document IDs

Summary

This PR adds the ability to strictly filter the hybrid retrieval to user-specified document IDs. Rather than searching over the entire collection and later patching in documents, the hybrid search itself applies a filter expression (id in [...]) to ensure that only the desired doc IDs are considered.

By merging the doc ID restriction directly into both dense and sparse searches, we retain the standard hybrid ranking logic but limit it to the documents the user requested. This is particularly beneficial for scenarios where a user (e.g. MOOVE) wants deterministic or curated doc sets, while still leveraging ML-based ranking (e.g. MOORE) to order those docs by relevance.

✅ Main Changes

Support for Document IDs in the Retrieval Flow
- retriever.py: The retrieve() method now has an optional document_ids argument. If provided, it attaches a filter expression (e.g., id in ["doc1", "doc2", ...]) to the dense and sparse search requests.
- read_queries() (in run_rag.py): Now returns a list of dictionaries, each of which can include both the user query text and an optional list of document IDs.
- run_retriever.py: The retrieve() function parses --document-ids (if any) into a Python list before calling the updated retriever, which then restricts search accordingly.
Removal of Old “Patch-In” Logic
- The earlier approach of retrieving everything and then fetch-missing-docs is gone. Now, if doc IDs are given, we filter within the search step.
- The helper get_documents_by_ids() remains in the codebase if we need direct doc lookups in the future.
CLI Behavior
- The --document-ids/-d flag can optionally be provided as a comma-separated list (e.g., doc1,doc2,doc3) in the retrieve part of the CLI.
Bug Fixes & Minor Improvements
- Ensured read_queries() can return dictionaries with document_ids as well as input text.
- Updated the logic in _get_relevant_documents() to pass document_ids along to retrieve() in order to keep it consistent with the new approach.
- Logging improvements for better transparency (e.g. showing the expr filter in logs, etc.).

🧪 Usage Example

mmore retrieve \
  --config-file index_config.yaml \
  --input-file queries.jsonl \
  --output-file results.json \
  --document-ids doc1,doc2,doc3

The system will read queries from queries.jsonl.
If --document-ids is supplied, it will filter to doc1, doc2, doc3 only.
Hybrid search ranks these filtered docs by relevance and saves the final results to results.json.

…the functionality of the implemented methods with a simple local test querying a Milvus DB with document ID's. For now, the current functionality is such that the query contents are ignored as soon as doc id's are specified, next step is to create hybrid retrieval

… my environment to work for some reason I don't understand...

…stead of just logging when non are found

…ally and works perfectly. If doc ids are specified it will retrieve k most relevant docs normally with vector similarity search and then it checks if manual doc ids are specified, if they are it checks whether it already retrieved them in the similarity search, if it did not it will prepend them at the beginning of the json result

…ver to only look within specified documents instead of looking through entire DB

… the already provided code for better understanding, tested the functionality locally and it works.

fabnemEPFL · 2025-03-24T13:26:51Z

src/mmore/run_retriever.py

+                )
+                for i, result in enumerate(raw_results[0])
+            ]
+        else:


Why not just use retriever.retrieve(…) with the empty doc_ids_list? It should work too

Good Catch!!
And actually, it was my mistake to bypass the invoke function and call retrieve directly when doc ids were specified because doing so skipped the callback manager ( for logging, error handling, performance tracking, and consistent and robust retrieval process).

Although everything seemed to work at first glance, I later realized that by not using invoke like it was initially done in the initial codebase I was missing out on these important features. So I have updated the code now such that it always calls invoke but with optional document_ids in kwargs

…nt_ids in the json file

leagrieder added 7 commits March 21, 2025 14:03

m

252eb29

had to copy paste the dependency modifications made by Fabrice to get…

5e29d81

… my environment to work for some reason I don't understand...

final testing and added a log if some document id's were not found in…

760b9dd

…stead of just logging when non are found

removed unnecessary added files for the pull request

075cd40

conflict

d395f6e

leagrieder marked this pull request as draft March 23, 2025 12:19

leagrieder added 2 commits March 23, 2025 14:52

modified the current cherry picking implementation to tell the retrie…

84f085d

…ver to only look within specified documents instead of looking through entire DB

added addditionnal comments for changes made and commented thoroughly…

7df6abd

… the already provided code for better understanding, tested the functionality locally and it works.

leagrieder marked this pull request as ready for review March 24, 2025 08:14

fabnemEPFL reviewed Mar 24, 2025

View reviewed changes

call invoke always instead of bypassing it!

d102a1c

leagrieder mentioned this pull request Mar 26, 2025

Issue62 unit tests #67

Open

leagrieder marked this pull request as draft March 26, 2025 05:51

leagrieder added 2 commits March 29, 2025 08:23

Fabrice Changes added to my branch

44ca046

cherry picking from RAG interface is now possible by providing docume…

343d142

…nt_ids in the json file

leagrieder marked this pull request as ready for review March 29, 2025 14:04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add Support for Cherry-Picking Specific Documents via Document IDs #66

Add Support for Cherry-Picking Specific Documents via Document IDs #66

leagrieder commented Mar 21, 2025 •

edited

Loading

fabnemEPFL Mar 24, 2025

leagrieder Mar 24, 2025 •

edited

Loading

Add Support for Cherry-Picking Specific Documents via Document IDs #66

Are you sure you want to change the base?

Add Support for Cherry-Picking Specific Documents via Document IDs #66

Conversation

leagrieder commented Mar 21, 2025 • edited Loading

🔍 PR: Add Support for Cherry-Picking Specific Documents via Document IDs

Summary

✅ Main Changes

🧪 Usage Example

fabnemEPFL Mar 24, 2025

Choose a reason for hiding this comment

leagrieder Mar 24, 2025 • edited Loading

Choose a reason for hiding this comment

leagrieder commented Mar 21, 2025 •

edited

Loading

leagrieder Mar 24, 2025 •

edited

Loading