Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Support for Cherry-Picking Specific Documents via Document IDs #66

Open
wants to merge 12 commits into
base: master
Choose a base branch
from

Conversation

leagrieder
Copy link

@leagrieder leagrieder commented Mar 21, 2025

🔍 PR: Add Support for Cherry-Picking Specific Documents via Document IDs

Summary

This PR adds the ability to strictly filter the hybrid retrieval to user-specified document IDs. Rather than searching over the entire collection and later patching in documents, the hybrid search itself applies a filter expression (id in [...]) to ensure that only the desired doc IDs are considered.

By merging the doc ID restriction directly into both dense and sparse searches, we retain the standard hybrid ranking logic but limit it to the documents the user requested. This is particularly beneficial for scenarios where a user (e.g. MOOVE) wants deterministic or curated doc sets, while still leveraging ML-based ranking (e.g. MOORE) to order those docs by relevance.

✅ Main Changes

  1. Support for Document IDs in the Retrieval Flow

    • retriever.py: The retrieve() method now has an optional document_ids argument. If provided, it attaches a filter expression (e.g., id in ["doc1", "doc2", ...]) to the dense and sparse search requests.
    • read_queries() (in run_rag.py): Now returns a list of dictionaries, each of which can include both the user query text and an optional list of document IDs.
    • run_retriever.py: The retrieve() function parses --document-ids (if any) into a Python list before calling the updated retriever, which then restricts search accordingly.
  2. Removal of Old “Patch-In” Logic

    • The earlier approach of retrieving everything and then fetch-missing-docs is gone. Now, if doc IDs are given, we filter within the search step.
    • The helper get_documents_by_ids() remains in the codebase if we need direct doc lookups in the future.
  3. CLI Behavior

    • The --document-ids/-d flag can optionally be provided as a comma-separated list (e.g., doc1,doc2,doc3) in the retrieve part of the CLI.
  4. Bug Fixes & Minor Improvements

    • Ensured read_queries() can return dictionaries with document_ids as well as input text.
    • Updated the logic in _get_relevant_documents() to pass document_ids along to retrieve() in order to keep it consistent with the new approach.
    • Logging improvements for better transparency (e.g. showing the expr filter in logs, etc.).

🧪 Usage Example

mmore retrieve \
  --config-file index_config.yaml \
  --input-file queries.jsonl \
  --output-file results.json \
  --document-ids doc1,doc2,doc3
  • The system will read queries from queries.jsonl.
  • If --document-ids is supplied, it will filter to doc1, doc2, doc3 only.
  • Hybrid search ranks these filtered docs by relevance and saves the final results to results.json.

…the functionality of the implemented methods with a simple local test querying a Milvus DB with document ID's. For now, the current functionality is such that the query contents are ignored as soon as doc id's are specified, next step is to create hybrid retrieval
… my environment to work for some reason I don't understand...
…ally and works perfectly. If doc ids are specified it will retrieve k most relevant docs normally with vector similarity search and then it checks if manual doc ids are specified, if they are it checks whether it already retrieved them in the similarity search, if it did not it will prepend them at the beginning of the json result
@leagrieder leagrieder marked this pull request as draft March 23, 2025 12:19
…ver to only look within specified documents instead of looking through entire DB
… the already provided code for better understanding, tested the functionality locally and it works.
@leagrieder leagrieder marked this pull request as ready for review March 24, 2025 08:14
)
for i, result in enumerate(raw_results[0])
]
else:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just use retriever.retrieve(…) with the empty doc_ids_list? It should work too

Copy link
Author

@leagrieder leagrieder Mar 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good Catch!!
And actually, it was my mistake to bypass the invoke function and call retrieve directly when doc ids were specified because doing so skipped the callback manager ( for logging, error handling, performance tracking, and consistent and robust retrieval process).

Although everything seemed to work at first glance, I later realized that by not using invoke like it was initially done in the initial codebase I was missing out on these important features. So I have updated the code now such that it always calls invoke but with optional document_ids in kwargs

@leagrieder leagrieder mentioned this pull request Mar 26, 2025
@leagrieder leagrieder marked this pull request as draft March 26, 2025 05:51
@leagrieder leagrieder marked this pull request as ready for review March 29, 2025 14:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants