A place recommendation system built for the lovely DATATHON. This implementation uses Retrieval-Augmented Generation (RAG) techniques with multimodal understanding to find places in NYC that match specific "vibes" and intangible qualities.
- Multimodal Understanding: Combines both textual and visual analysis to truly understand place characteristics
- Visual Attribute Detection: Analyzes images to detect atmospheres, lighting, colors, settings, and more
- Semantic Understanding: Captures the true meaning behind queries like "where to find hot guys" or "cafes to cowork from"
- True RAG Implementation: Uses modern embedding models and visual CLIP analysis for nuanced understanding
- Vibe Detection: Automatically identifies vibes and atmospheres from both text and visual data
- Contextual Explanations: Explains why each place matches your query across both text and visual dimensions
- Fast Results: Optimized to deliver results quickly despite complex processing
- Beautiful UI: Modern, responsive interface to display results
# Clone the repository
git clone https://github.com/Ashish-Reddy-T/kundi.git
cd kundi
# IMPORTANT: move the `places.csv`, `media.csv`, and `reviews.csv` files into this folder
# Install dependencies
pip install -r requirements.txt
For new users, follow these steps to set up the full pipeline:
- Analyze images with the CLIP-based analyzer:
# Process all place images and generate visual attributes
python place_image_analyzer.py --batch --places_csv "places.csv" --media_csv "media.csv"
# Generate a report of visual attributes (optional)
python place_image_analyzer.py --report
# Export embeddings (optional)
python place_image_analyzer.py --export_embeddings
- Generate image embeddings for RAG:
# Use the enhanced script with the analyzed data
python ingestion/rag_index_images.py --use_existing --analysis_path "place_clip_analysis_data.pkl"
- Generate text embeddings with visual attributes:
# Generate text embeddings incorporating visual attributes
python ingestion/rag_index.py --model bge-large
- Combine text and image embeddings:
# Combine with tuned weights
python ingestion/rag_index_combine.py --model bge-large --text_weight 0.7 --image_weight 0.3
# Each weight can be set to any value in the range [0, 1]
- Run the RAG application:
# Run the application with the combined data
python run_rag.py --embedding-model bge-large
- Access the web interface: Open your browser and go to http://localhost:8000
Once the initial setup is complete, you can simply run:
python run_rag.py --embedding-model bge-large
- CLIP Visual Analysis: Deep visual understanding of places
  - Detects visual attributes like atmosphere, setting, colors, lighting
  - Generates consistent visual embeddings
  - Enhances search with visual context
- Combined Text-Visual Embeddings: Weighted fusion of textual and visual understanding
- Sentence Transformers: Modern semantic embedding models
  - Options: MiniLM, BGE, MPNet
  - Captures nuanced meaning in text
- FAISS: High-performance similarity search
  - Fast retrieval even with thousands of items
  - Efficient cosine similarity calculations
- PlaceImageAnalyzer: Advanced CLIP-based image analysis
  - Detects 11 categories of visual attributes
  - Generates detailed atmospheric understanding
  - Maps visual attributes to vibes
- LangChain: Framework for LLM applications
  - Structured prompts for context and explanations
  - Support for multiple LLM providers
- Local LLMs: Via llama-cpp-python
  - Works offline with models like Llama, Mistral, etc.
- OpenAI API: For enhanced explanations
  - Set `OPENAI_API_KEY` in the `.env` file to use
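The "weighted fusion" in the list above could work roughly as follows. This is a minimal sketch, not the code in `ingestion/rag_index_combine.py`: it assumes the text and image vectors may have different dimensions, so it concatenates them after normalizing and scaling each one. The function names `l2_normalize` and `fuse_embeddings` are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; return a copy unchanged if it is all zeros."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse_embeddings(text_vec, image_vec, text_weight=0.7, image_weight=0.3):
    """Concatenate normalized text and image vectors, scaled by their weights.

    Assumption: concatenation is used because the two modalities can have
    different dimensions. The real script may combine them differently.
    """
    if not (0.0 <= text_weight <= 1.0 and 0.0 <= image_weight <= 1.0):
        raise ValueError("weights must lie in [0, 1]")
    text_part = [text_weight * x for x in l2_normalize(text_vec)]
    image_part = [image_weight * x for x in l2_normalize(image_vec)]
    # Normalize the result so inner product equals cosine similarity.
    return l2_normalize(text_part + image_part)


fused = fuse_embeddings([3.0, 4.0], [0.0, 2.0, 0.0])
print(len(fused))
```

One consequence of this particular design: the similarity between two fused vectors is a blend of the text and image cosine similarities weighted by the squares of the two weights, so 0.7 and 0.3 favor text more strongly than the raw numbers suggest.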
This system follows a sophisticated multimodal RAG architecture:
- Indexing Phase:
  - Process place data, reviews, and media
  - Analyze images with CLIP for visual attributes
  - Extract vibe attributes from both text and images
  - Generate high-quality text and image embeddings
  - Combine embeddings with appropriate weights
  - Build optimized vector index
- Retrieval Phase:
  - Process user query with semantic understanding
  - Expand query to capture related concepts
  - Retrieve relevant places using multimodal vector similarity
  - Apply contextual filtering (neighborhoods, vibes)
- Generation Phase:
  - Generate explanations for why places match the query
  - Provide context about each place's vibe and atmosphere
  - Create a cohesive, helpful response highlighting both visual and textual matches
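The core of the retrieval phase is a nearest-neighbor lookup by cosine similarity. The project uses FAISS for this; the sketch below is a brute-force, standard-library stand-in that shows the same idea on toy data. The place IDs and vectors are made up for illustration.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, place_vecs, k=3):
    """Return the k (place_id, score) pairs with the highest cosine similarity."""
    scored = ((pid, cosine(query_vec, vec)) for pid, vec in place_vecs.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Hypothetical 3-dimensional embeddings; real ones have hundreds of dimensions.
places = {
    "cafe_a": [0.9, 0.1, 0.0],
    "bar_b": [0.0, 1.0, 0.2],
    "park_c": [0.1, 0.0, 1.0],
}
results = top_k([1.0, 0.0, 0.0], places, k=2)
print([pid for pid, _ in results])
```

A vector index does the same ranking without scanning every place, which is what keeps retrieval fast as the dataset grows.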
The system processes place data from multiple sources:
- Structured Data: Name, location, tags, etc.
- Unstructured Data: Reviews, descriptions
- Visual Data: Images analyzed with CLIP to detect visual attributes:
  - Place type (restaurant, cafe, bar, etc.)
  - Setting (indoor, outdoor, rooftop, etc.)
  - Atmosphere (upscale, casual, romantic, etc.)
  - Lighting (bright, dim, candlelit, etc.)
  - Colors and materials
  - Furniture and decor
  - View characteristics
  - Crowd dynamics
  - Time context
  - Food and drink focus
  - Overall vibes
This data is processed to extract vibe attributes from both text and images. These attributes are combined with the original data and embedded using the selected Sentence Transformers model for text and CLIP for images.
When you search for something like "cafes to cowork from", the system:
- Understands the Concept: Recognizes "coworking" implies wifi, quiet, outlets, etc.
- Expands the Query: Adds related terms to improve results
- Extracts Constraints: Identifies location filters, price ranges, etc.
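The three query-understanding steps above can be illustrated with a small rule-based sketch. The expansion table, the neighborhood list, and the function name `understand_query` are all hypothetical; the actual application may rely on embeddings or an LLM rather than lookups like these.

```python
import re

# Hypothetical expansion table and neighborhood list, for illustration only.
CONCEPT_EXPANSIONS = {
    "cowork": ["wifi", "quiet", "outlets", "laptop friendly"],
    "romantic": ["date night", "dim lighting", "intimate"],
}
KNOWN_NEIGHBORHOODS = ["east village", "west village", "soho", "williamsburg"]


def understand_query(query):
    """Expand a query with related terms and pull out neighborhood constraints."""
    text = query.lower()
    expanded_terms = []
    for concept, related in CONCEPT_EXPANSIONS.items():
        if concept in text:
            expanded_terms.extend(related)
    # Whole-phrase match so "soho" does not fire inside another word.
    neighborhoods = [
        name for name in KNOWN_NEIGHBORHOODS
        if re.search(r"\b" + re.escape(name) + r"\b", text)
    ]
    return {
        "expanded_query": " ".join([query] + expanded_terms),
        "neighborhoods": neighborhoods,
    }


parsed = understand_query("cafes to cowork from in the East Village")
print(parsed["expanded_query"])
print(parsed["neighborhoods"])
```

The expanded query is what gets embedded, while the extracted neighborhoods are kept aside as hard filters.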
The system searches for places that match the expanded query:
- Vector Similarity: Finds places with similar text AND visual characteristics
- Balanced Retrieval: Weights visual and textual importance appropriately
- Filtering: Applies neighborhood and vibe filters
- Ranking: Orders results by relevance across both modalities
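Balanced retrieval, filtering, and ranking can be pictured as one small function. This sketch assumes each candidate already carries separate text and image similarity scores; the field names, the sample places, and the 0.7/0.3 split are illustrative assumptions rather than the application's actual data model.

```python
def filter_and_rank(candidates, neighborhood=None, required_vibe=None,
                    text_weight=0.7, image_weight=0.3, limit=5):
    """Filter by neighborhood and vibe, then sort by a weighted combined score."""
    kept = []
    for place in candidates:
        if neighborhood and place["neighborhood"].lower() != neighborhood.lower():
            continue
        if required_vibe and required_vibe not in place["vibes"]:
            continue
        combined = (text_weight * place["text_score"]
                    + image_weight * place["image_score"])
        kept.append({**place, "score": combined})
    kept.sort(key=lambda p: p["score"], reverse=True)
    return kept[:limit]


# Made-up candidates for illustration.
candidates = [
    {"name": "Place A", "neighborhood": "East Village", "vibes": ["work_friendly"],
     "text_score": 0.80, "image_score": 0.40},
    {"name": "Place B", "neighborhood": "East Village", "vibes": ["date_night"],
     "text_score": 0.90, "image_score": 0.90},
    {"name": "Place C", "neighborhood": "SoHo", "vibes": ["work_friendly"],
     "text_score": 0.95, "image_score": 0.95},
    {"name": "Place D", "neighborhood": "East Village",
     "vibes": ["work_friendly", "coffee_tea"],
     "text_score": 0.60, "image_score": 0.90},
]
ranked = filter_and_rank(candidates, neighborhood="east village",
                         required_vibe="work_friendly")
print([p["name"] for p in ranked])
```

Note how Place D overtakes Place A once the image score is included, even though its text score is lower.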
For each result, the system generates an explanation:
- Contextual Understanding: Why this place matches your query
- Visual Highlights: Emphasizes relevant visual attributes detected
- Text Highlights: Features from descriptions and reviews
- Natural Language: Presents information conversationally
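To make the explanation step concrete, here is a sketch of how a prompt for the LLM might be assembled from a query and one retrieved place. It uses plain string formatting instead of LangChain's prompt classes, and the template wording and the dictionary keys are assumptions made for this example.

```python
# Hypothetical prompt template; the project's real prompts may differ.
EXPLANATION_TEMPLATE = (
    "User query: {query}\n"
    "Place: {name}\n"
    "Visual attributes: {visual}\n"
    "Review highlights: {text}\n"
    "In two friendly sentences, explain why this place matches the query."
)


def build_explanation_prompt(query, place):
    """Fill the template, falling back to placeholders when data is missing."""
    return EXPLANATION_TEMPLATE.format(
        query=query,
        name=place["name"],
        visual=", ".join(place.get("visual_attributes", [])) or "none detected",
        text="; ".join(place.get("review_highlights", [])) or "none available",
    )


prompt = build_explanation_prompt(
    "romantic restaurants with dim lighting",
    {
        "name": "Example Bistro",  # made-up place
        "visual_attributes": ["dim lighting", "candlelit", "indoor"],
        "review_highlights": ["perfect for anniversaries"],
    },
)
print(prompt)
```

Keeping visual attributes and review highlights on separate lines gives the model what it needs to mention both kinds of evidence in its answer.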
# Run with specific embedding model
python run_rag.py --build-index --embedding-model minilm
# Adjust text vs image importance
python ingestion/rag_index_combine.py --model minilm --text_weight 0.7 --image_weight 0.3
# Run on specific port
python run_rag.py --port 8080
# Generate visual embeddings using existing analysis
python ingestion/rag_index_images.py --use_existing --analysis_path "place_clip_analysis_data.pkl"
# Analyze images with specific settings
python place_image_analyzer.py --batch --places_csv "places.csv" --media_csv "media.csv" --max_width 600 --max_height 600
# Generate visual attribute reports
python place_image_analyzer.py --report
python place_image_analyzer.py --vibes_report
Create a `.env` file to configure:
EMBEDDING_MODEL=bge-large
LLM_PROVIDER=local
OPENAI_API_KEY=your_key_here
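For readers curious how such settings might be read at startup, the sketch below parses `KEY=VALUE` lines and lets real environment variables override them. It uses only the standard library; whether the application uses a package such as python-dotenv instead is not stated here, and the defaults shown are assumptions.

```python
import os


def parse_env_lines(lines):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_settings(path=".env"):
    """Merge the .env file with the process environment (environment wins)."""
    file_values = {}
    if os.path.exists(path):
        with open(path, encoding="utf-8") as handle:
            file_values = parse_env_lines(handle)

    def lookup(key, default=None):
        return os.environ.get(key, file_values.get(key, default))

    return {
        "embedding_model": lookup("EMBEDDING_MODEL", "bge-large"),  # assumed default
        "llm_provider": lookup("LLM_PROVIDER", "local"),            # assumed default
        "openai_api_key": lookup("OPENAI_API_KEY"),
    }


print(parse_env_lines(["EMBEDDING_MODEL=bge-large", "# comment", "LLM_PROVIDER = local"]))
```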
- "cafes to cowork from"
- "matcha latte in the east village"
- "where can I spend a sunny day?"
- "romantic restaurants with dim lighting"
- "dance-y bars that have disco balls"
- "restaurants with outdoor seating and string lights"
- "cozy cafes with warm ambient lighting"
- "upscale cocktail bars with a view"
This implementation supports multiple embedding models:
| Model | Dimensions | Speed | Quality |
|---|---|---|---|
| MiniLM | 384 | ⚡⚡⚡ | ⭐⭐ |
| BGE-Small | 384 | ⚡⚡⚡ | ⭐⭐⭐ |
| BGE-Base | 768 | ⚡⚡ | ⭐⭐⭐⭐ |
| BGE-Large | 1024 | ⚡ | ⭐⭐⭐⭐⭐ |
| MPNet | 768 | ⚡⚡ | ⭐⭐⭐⭐ |
- Strictly speaking, the choice of model makes no real difference on this dataset; significant differences would only show up with a different dataset.
  For example, bge-large answered most of our queries in under 0.1 seconds.
The system analyzes images across 11 categories of visual attributes:
- Place Type: restaurant, cafe, bar, etc.
- Setting: indoor, outdoor, rooftop, waterfront, etc.
- Atmosphere: upscale, casual, romantic, lively, etc.
- Lighting: bright, dim, candlelit, colored, etc.
- Colors & Materials: wood, brick, colorful, monochrome, etc.
- Furniture & Decor: modern, vintage, minimalist, etc.
- View: skyline, water, garden, none, etc.
- Crowd: empty, sparse, filled, busy, etc.
- Time Context: daytime, evening, golden hour, etc.
- Food & Drink Focus: plated food, cocktails, coffee, etc.
- Vibes: date night, group hangout, instagram-worthy, etc.
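Within each category, CLIP-style zero-shot classification compares one image embedding against the embeddings of several candidate labels and turns the similarities into probabilities with a softmax. The sketch below shows only that scoring step on toy 2-dimensional unit vectors; in practice the vectors would come from a CLIP model, and the scale of 100 is an assumption modeled on the logit scale CLIP is commonly used with.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_attribute(image_vec, label_vecs, scale=100.0):
    """Return (best_label, probability) for unit-length image and label vectors."""
    labels = list(label_vecs)
    sims = [sum(a * b for a, b in zip(image_vec, label_vecs[name])) for name in labels]
    probs = softmax([scale * s for s in sims])
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs[best]


# Toy unit vectors standing in for CLIP embeddings of lighting labels.
lighting_labels = {
    "dim": [0.6, 0.8],
    "bright": [0.8, 0.6],
    "candlelit": [0.0, 1.0],
}
label, prob = classify_attribute([0.6, 0.8], lighting_labels)
print(label, round(prob, 3))
```

Running this once per category would yield one label and one confidence for each of the 11 categories.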
The system automatically categorizes places into vibes from both text and visual cues:
- date_night
- work_friendly
- outdoor_vibes
- group_hangout
- food_focus
- drinks_focus
- coffee_tea
- dancing_music
- quiet_relaxing
- upscale_fancy
- casual_lowkey
- unique_special
- trendy_cool
- budget_friendly
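As a rough illustration of how text and visual cues could both feed these vibe labels, the sketch below tags a place when any keyword for a vibe appears in its review text or in its detected visual labels. The keyword lists are invented for the example and cover only three of the vibes; the real system may use embeddings or learned mappings instead.

```python
import re

# Hypothetical keyword lists, for illustration only.
VIBE_KEYWORDS = {
    "date_night": ["romantic", "candlelit", "intimate"],
    "work_friendly": ["wifi", "outlets", "laptop"],
    "dancing_music": ["dj", "dance floor", "disco"],
}


def detect_vibes(review_text, visual_labels=()):
    """Return the sorted vibes whose keywords appear in the text or visual labels."""
    haystack = " ".join([review_text, *visual_labels]).lower()
    found = []
    for vibe, keywords in VIBE_KEYWORDS.items():
        # Word boundaries keep "dj" from matching inside "adjacent".
        if any(re.search(r"\b" + re.escape(k) + r"\b", haystack) for k in keywords):
            found.append(vibe)
    return sorted(found)


print(detect_vibes("Great wifi and plenty of outlets", ["candlelit"]))
```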
If you see a dimension mismatch error when using a different embedding model, you may need to update the `embed_text` function in `rag_app.py` to handle the dimensionality of that specific embedding model.
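A quick way to confirm this kind of problem is to compare the length of a query vector with what the model and the index expect. The sketch below is a hypothetical helper, not part of `rag_app.py`; the dimensions come from the model table above, while the lowercase keys other than `minilm` and `bge-large` are assumed names.

```python
# Dimensions from the model table; some keys are assumed CLI names.
MODEL_DIMENSIONS = {
    "minilm": 384,
    "bge-small": 384,
    "bge-base": 768,
    "bge-large": 1024,
    "mpnet": 768,
}


def check_query_vector(query_vec, model_name, index_dim):
    """Raise a descriptive ValueError if the query vector cannot be searched."""
    expected = MODEL_DIMENSIONS.get(model_name)
    if expected is not None and len(query_vec) != expected:
        raise ValueError(
            f"{model_name} should produce {expected} dimensions, got {len(query_vec)}"
        )
    if len(query_vec) != index_dim:
        raise ValueError(
            f"query has {len(query_vec)} dimensions but the index expects {index_dim}"
        )
    return True


print(check_query_vector([0.0] * 384, "minilm", 384))
```

If the second check fails while the first passes, the index was most likely built with a different model or with combined embeddings, and rebuilding it is usually the simpler fix.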
If you encounter issues with image processing:
- Check that your `places.csv` and `media.csv` have the correct format
- Try reducing the `--max_width` and `--max_height` parameters
- Ensure you have adequate memory for CLIP model loading
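To see what smaller `--max_width` and `--max_height` values mean for an image, the sketch below computes the size an image would be scaled to so that it fits inside the box without distortion. The function `fit_within` is hypothetical and only does the arithmetic; the analyzer's own resizing logic may differ.

```python
def fit_within(width, height, max_width=600, max_height=600):
    """Return (width, height) scaled down to fit the box, keeping aspect ratio.

    Images that already fit are returned unchanged (never enlarged).
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    scale = min(max_width / width, max_height / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within(1200, 800))   # landscape photo larger than the box
print(fit_within(300, 200))    # already small enough
```

Halving both limits cuts the pixel count to about a quarter, which is why lowering them is an effective first step when memory is tight.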
If the FAISS index fails to build or search:
- Check that the text and image embeddings match in count
- Try rebuilding the index with the `--build-index` option
- Ensure the embedding dimensions match what's expected in `rag_app.py`
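The first and last checks in this list can be automated before building the index. The following sketch assumes the embeddings are available as lists of vectors; the function name `validate_embeddings` and the message wording are made up for the example.

```python
def validate_embeddings(text_embeddings, image_embeddings):
    """Return a list of human-readable problems (empty if everything looks fine)."""
    problems = []
    if len(text_embeddings) != len(image_embeddings):
        problems.append(
            f"count mismatch: {len(text_embeddings)} text vs "
            f"{len(image_embeddings)} image embeddings"
        )
    for name, rows in (("text", text_embeddings), ("image", image_embeddings)):
        dims = {len(row) for row in rows}
        if len(dims) > 1:
            problems.append(f"{name} embeddings have mixed dimensions: {sorted(dims)}")
    return problems


text = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
image = [[0.1, 0.2]]
for problem in validate_embeddings(text, image):
    print(problem)
```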
- Vectorization: Fast similarity search with FAISS
- Cached Embeddings: Reuse query embeddings for similar searches
- Cached Explanations: Store explanations for common patterns
- Timeout Handling: Enforce limits on search and explanation time
- Batched Processing: Process data in optimal batches
- Weighted Embeddings: Balance text and visual importance
- Image Resizing: Process images at optimal dimensions for speed/quality balance
Made with LOVE from the VibeLabs team for the wonderful sponsor CORNER 💖