@Izukimat
Collaborator

Summary

What does this PR do? Please provide a brief summary of the changes introduced.

  • Bug fix
  • New feature
  • Documentation update
  • Code quality / linting
  • Other (please describe):

Expand BaseKnowledgeStore and RAGSystem to support multimodal retrieval with separate collections strategy

Description

This PR expands the knowledge store layer to support multimodal embeddings and implements the separate collections strategy for multimodal RAG. Key changes include:

BaseKnowledgeStore Changes:

  • Added MultiModalEmbedding type supporting text, image, audio, and video embeddings
  • Updated retrieve() method signature to accept QueryEmbedding (Union of list[float] or MultiModalEmbedding)
  • Added new retrieve_by_modality() method for modality-specific collection queries
  • Updated both sync and async base classes
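A minimal sketch of how these types might fit together. Only `MultiModalEmbedding` and `QueryEmbedding` are names taken from this PR; the dataclass layout, the `available()` method, and the `normalize_query` helper are assumptions made for illustration and may not match the actual implementation.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class MultiModalEmbedding:
    """Hypothetical container: one optional vector per modality."""

    text: Optional[list[float]] = None
    image: Optional[list[float]] = None
    audio: Optional[list[float]] = None
    video: Optional[list[float]] = None

    def available(self) -> dict[str, list[float]]:
        # Keep only the modalities that actually carry an embedding.
        return {name: vec for name, vec in vars(self).items() if vec is not None}


# Union type so existing callers can keep passing a plain list[float].
QueryEmbedding = Union[list[float], MultiModalEmbedding]


def normalize_query(query_emb: QueryEmbedding) -> MultiModalEmbedding:
    """Hypothetical helper: treat a bare vector as a text-only query."""
    if isinstance(query_emb, MultiModalEmbedding):
        return query_emb
    return MultiModalEmbedding(text=list(query_emb))
```

Under this assumption, a text-only caller passes `[0.1, 0.2]` exactly as before, and a store can call `normalize_query(...).available()` to decide which modality collections to query.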

RAG System Changes:

  • Updated method signatures to accept str | Query for multimodal input support
  • Added _prepare_modality_embeddings() method to extract embeddings by modality from retriever tensors
  • Enhanced retrieve() method to query separate collections for each modality and merge results
  • Improved _format_context() with configuration-driven approach for modality-specific formatting
  • Added robust tensor dimension handling (1D, 2D, >2D cases)
  • Maintained full backward compatibility with existing text-only workflows
  • Assumes the separate collections strategy: one collection each for text, image, audio, and video
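The query-and-merge step described above could look roughly like the following. This is a sketch under stated assumptions, not the PR's code: the dict-based collections, the cosine scoring, and the "sort all hits by score" merge rule are all hypothetical, and the functions only borrow the names `retrieve` and `retrieve_by_modality` from the description.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity, returning 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_by_modality(collection, query_vec, top_k):
    """Toy stand-in for a store call: score every node in one collection."""
    scored = [(cosine(query_vec, vec), node_id) for node_id, vec in collection.items()]
    return sorted(scored, reverse=True)[:top_k]


def retrieve(collections, modality_embeddings, top_k=2):
    """Query each modality's own collection, then merge hits by score."""
    merged = []
    for modality, query_vec in modality_embeddings.items():
        collection = collections.get(modality)
        if not collection:
            continue  # no collection for this modality: skip it
        for score, node_id in retrieve_by_modality(collection, query_vec, top_k):
            merged.append((score, modality, node_id))
    merged.sort(key=lambda hit: hit[0], reverse=True)
    return merged[:top_k]
```

Passing only a `"text"` embedding reduces this to the existing text-only path. One caveat with this merge rule: scores from different embedding spaces are not necessarily comparable, so a real implementation may need per-modality normalization or quotas.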

Any information reviewers should be aware of:

  • This is a draft PR - concrete implementations (InMemoryKnowledgeStore, QdrantKnowledgeStore) and comprehensive tests are not yet included

Checklist

Before submitting your PR, please check off the following:

  • My code follows the existing style and conventions
  • I've run linting (make lint)
  • I've added/updated relevant documentation (will add in final version)
  • I've added/updated tests as needed (planned for final version)
  • I've verified integration with existing tools (HuggingFace, LlamaIndex, LangChain, etc. if applicable) (public interface unchanged, should work)
  • I've added an entry to the CHANGELOG.md (if applicable) (will add for final version)

@Izukimat Izukimat requested a review from nerdai July 13, 2025 14:24
Collaborator

@nerdai nerdai left a comment


Will take another look in a bit!



```python
# Union type for backward compatibility
QueryEmbedding = Union[list[float], MultiModalEmbedding]
```
Collaborator


Ohh, I kind of like how you did this better. Subtle difference, but I think cleaner than what I got.


Development

Successfully merging this pull request may close these issues.

[Feature] Expand BaseKnowledgeStore to accept multi-modal inputs when retrieving.

2 participants