A hybrid retrieval system that combines vector and term-based search with multi-stage reranking, inspired by Anthropic's Contextual Retrieval RAG announcement.
- Hybrid Retrieval System: Combines vector and term-based search for robust document retrieval
- Contextual Enhancement: Adds semantic context to each document chunk
- Multi-Stage Reranking: Reranks retrieved results with a Cohere rerank model and a cross-encoder
- Dual Storage System: Leverages both ChromaDB and Elasticsearch
- Deduplication: Ensures unique results across retrieval methods
- Python 3.8 or higher
- Docker and Docker Compose
- OpenAI API key
- Clone the repository:
    git clone https://github.com/Abiorh001/Contextual_rag.git
    cd Contextual_rag
- Set up the environment:
    uv venv  # or: python -m venv venv
    source venv/bin/activate
    uv sync
- Start the required services:
    docker-compose up -d
- Set up your environment variables:
    cp .env.example .env
    # Edit .env with your API keys and configurations
- Text Splitting: RecursiveTextSplitter implementation
- Chunk size: 400 tokens
- Overlap: 200 tokens
- Strategy: ChromaDB-optimized chunking
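To illustrate how a 400-token chunk size interacts with a 200-token overlap, here is a minimal sketch of fixed-window chunking. It is not the project's splitter: a recursive splitter also breaks on separators, which this sketch ignores. The function name `chunk_tokens` and the whitespace-style placeholder tokens are assumptions made for the example.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int = 400, overlap: int = 200) -> List[List[str]]:
    """Split a token list into fixed-size windows that overlap (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


# Placeholder tokens stand in for real tokenizer output.
tokens = [f"t{i}" for i in range(1000)]
chunks = chunk_tokens(tokens)
```

With these settings each window advances by 200 tokens, so every token away from the document edges appears in two chunks. That roughly doubles storage and embedding cost in exchange for fewer ideas being cut at a chunk boundary.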
- ChromaDB (Vector Store)
  - Docker-hosted vector database
  - Stores embeddings from text-embedding-ada-002
  - Maintains both original and contextualized chunks
  - Enables semantic search capabilities
- Elasticsearch (Term-Based Search)
  - BM25 algorithm implementation
  - Stores original and contextualized chunks
  - Enables term-based retrieval with IDF scoring
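For readers unfamiliar with BM25 and IDF scoring, the sketch below computes textbook BM25 scores in plain Python. It is only an illustration of the idea, not how Elasticsearch implements it internally. The function name `bm25_scores` and the sample documents are assumptions; `k1=1.2` and `b=0.75` are commonly used defaults.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: List[str], docs: List[List[str]], k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score each tokenized document against the query with textbook BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs if n_docs else 0.0
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue  # terms absent from the document contribute nothing
            # IDF with +1 inside the log so the value is never negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer-than-average documents are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    ["rag", "combines", "retrieval"],
    ["bm25", "keyword", "search"],
    ["vector", "search"],
]
scores = bm25_scores(["bm25", "search"], docs)
```

Rare terms get a higher IDF weight, so the document that matches both "bm25" and "search" outranks the one that matches only the more common "search".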
- Model: GPT-4o-mini
- Process: Generates contextual information for each chunk
- Storage: Maintains contextualized versions alongside originals
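The contextualization step can be sketched as follows: a prompt is built from the whole document plus one chunk, a model returns a short situating context, and that context is prepended to the chunk. The prompt wording, the function `contextualize_chunk`, and the `generate` callable are all hypothetical; in the real system `generate` would wrap a call to GPT-4o-mini.

```python
from typing import Callable

# Hypothetical prompt; the project's actual prompt may differ.
PROMPT_TEMPLATE = (
    "<document>\n{document}\n</document>\n"
    "Here is a chunk from the document:\n<chunk>\n{chunk}\n</chunk>\n"
    "Write a short context that situates this chunk within the document "
    "to improve search retrieval. Answer with the context only."
)


def contextualize_chunk(document: str, chunk: str, generate: Callable[[str], str]) -> str:
    """Return the chunk with model-generated context prepended."""
    prompt = PROMPT_TEMPLATE.format(document=document, chunk=chunk)
    context = generate(prompt).strip()
    # Fall back to the original chunk if the model returns nothing useful.
    return f"{context}\n\n{chunk}" if context else chunk


def fake_generate(prompt: str) -> str:
    # Stand-in for a real chat-completion call, used so the sketch runs offline.
    return "From a guide on retrieval-augmented generation."


contextualized = contextualize_chunk(
    document="RAG guide. Chunking splits text. Retrieval finds chunks.",
    chunk="Retrieval finds chunks.",
    generate=fake_generate,
)
```

Passing the model call in as a function keeps the prompt logic testable without an API key.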
- Vector Retrieval:
  - Query embedding via text-embedding-ada-002
  - Similarity search in ChromaDB
- Term-Based Retrieval:
  - BM25 scoring in Elasticsearch
  - Traditional keyword matching
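The vector retrieval path boils down to comparing a query embedding against stored chunk embeddings and keeping the closest ones. The sketch below shows this with cosine similarity on toy two-dimensional vectors. Cosine similarity is one common choice; which distance metric the project's ChromaDB collection uses is not stated here, and the names `cosine_similarity` and `top_k` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float], doc_vecs: List[Sequence[float]], k: int = 2) -> List[Tuple[int, float]]:
    """Return (index, similarity) pairs for the k most similar vectors."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings; real ones from text-embedding-ada-002 have 1536 dimensions.
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
results = top_k([1.0, 0.1], doc_vecs, k=2)
```

A vector database performs the same ranking with an approximate index, so it does not have to compare the query against every stored vector.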
- Primary Reranking: Cohere rerank model
- Secondary Reranking: Cross-encoder/ms-marco-MiniLM-L-6-v2
- Deduplication: Set-based removal of duplicate results
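Set-based deduplication can be sketched as below: a set tracks which keys have been seen, and the first occurrence of each result is kept in its original order. The function name `deduplicate`, the `key` parameter, and the dictionary shape of the results are assumptions for the example, not the project's actual data model.

```python
from typing import Callable, Hashable, Iterable, List, TypeVar

T = TypeVar("T")


def deduplicate(documents: Iterable[T], key: Callable[[T], Hashable] = lambda d: d) -> List[T]:
    """Keep the first occurrence of each document, preserving order."""
    seen = set()
    unique = []
    for doc in documents:
        k = key(doc)
        if k not in seen:
            seen.add(k)
            unique.append(doc)
    return unique


# Hypothetical result lists from two rerankers that share one document.
custom = [{"content": "A"}, {"content": "B"}]
cohere = [{"content": "B"}, {"content": "C"}]
merged = deduplicate(custom + cohere, key=lambda d: d["content"])
```

Preserving order matters here because the concatenated lists are already ranked; converting straight to a `set` would discard that ranking.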
Here's a complete example of how to use the system:
from contextual_rag import ContextualizedRAG

# Initialize the RAG system
rag = ContextualizedRAG(collection_name="test")

# Step 1: Document Chunking
document_chunking = rag.document_chunking(file_path="Data/test.csv")

# Step 2: Contextualize Documents
contextualize_documents = rag.contextualize_documents(
    documents=document_chunking,
    save_file_path="Data/contextualized_documents.csv"
)

# Step 3: Load Documents to Elasticsearch
contextualize_documents = rag.read_contextualized_csv(
    contextualized_file_path="Data/contextualized_documents.csv"
)
rag.load_documents_to_esbm25(documents=contextualize_documents)

# Step 4: Create Embeddings Store
rag.create_embeddings_store_temp(
    documents=contextualize_documents,
    embeddings_file_path="Data/embeddings.pkl"
)

# Step 5: Load Embeddings and Save to ChromaDB
rag.load_embeddings_save_chromadb(
    documents=contextualize_documents,
    embeddings_file_path="Data/embeddings.pkl"
)

# Step 6: Process Hybrid Search
query = "what is RAG"
hybrid_search_results = rag.process_hybrid_search(query=query)

# Step 7: Custom Reranking (Optional)
custom_reranking_results = rag.custom_reranking(
    query=query,
    documents=hybrid_search_results
)

# Step 8: Cohere Reranking (Optional)
cohere_reranking_results = rag.cohere_reranking(
    query=query,
    documents=hybrid_search_results
)

# Step 9: Deduplication (Optional, recommended when using both reranking methods)
documents = custom_reranking_results + cohere_reranking_results
final_results = rag.deduplication(documents=documents)
- Reranking Options:
  - You can use either custom reranking or Cohere reranking independently
  - Cohere reranking generally provides better results than custom reranking
  - Deduplication is recommended only when using both reranking methods
- Data Format:
  - Input data should be in CSV format with columns: url, title, content
  - Example data is provided in Data/test.csv
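Before running the pipeline on your own data, it can help to check that the CSV has the expected columns. The sketch below does this with the standard `csv` module. The helper `read_documents` and the sample row are hypothetical and not part of the project's API.

```python
import csv
import io
from typing import Dict, List, TextIO

REQUIRED_COLUMNS = ("url", "title", "content")


def read_documents(handle: TextIO) -> List[Dict[str, str]]:
    """Read rows from a CSV handle, failing early if a required column is missing."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    # Keep only the required columns; extra columns are ignored.
    return [{c: row[c] for c in REQUIRED_COLUMNS} for row in reader]


# In-memory sample so the sketch runs without a file on disk.
sample = io.StringIO(
    "url,title,content\n"
    "https://example.com/rag,What is RAG,RAG combines retrieval with generation.\n"
)
rows = read_documents(sample)
```

For a real file, pass `open(path, newline="", encoding="utf-8")` as the handle, which is how the `csv` module expects files to be opened.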
- Storage:
  - Elasticsearch is used for term-based search
  - ChromaDB is used for vector storage
  - Both services must be running via docker-compose
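A quick way to confirm both services are reachable before indexing is to probe their TCP ports. The sketch below assumes the common defaults of port 9200 for Elasticsearch and port 8000 for ChromaDB on localhost; check the repository's docker-compose file for the actual mappings. The helpers `is_port_open` and `check_services` are hypothetical.

```python
import socket
from typing import Dict, Tuple


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Assumed default ports; adjust to match your docker-compose configuration.
SERVICES: Dict[str, Tuple[str, int]] = {
    "Elasticsearch": ("localhost", 9200),
    "ChromaDB": ("localhost", 8000),
}


def check_services(services: Dict[str, Tuple[str, int]] = SERVICES) -> Dict[str, bool]:
    """Map each service name to whether its port accepts connections."""
    return {name: is_port_open(host, port) for name, (host, port) in services.items()}


if __name__ == "__main__":
    for name, ok in check_services().items():
        print(f"{name}: {'reachable' if ok else 'not reachable'}")
```

An open port only shows that something is listening; it does not confirm the service has finished starting up, so a failed first request may still call for a short retry.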
The system can be configured through environment variables or a configuration file:
OPENAI_API_KEY=your_api_key
COHERE_API_KEY=your_api_key
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Inspired by Anthropic's Contextual Retrieval RAG announcement
- Built with LangChain and other open-source tools
- Special thanks to all contributors
For support, please open an issue in the GitHub repository or contact the maintainers.
Made with ❤️ by Abiola Adeshina