Vibe Search™ - Multimodal RAG-Based Place Recommender

A place recommendation system built for the lovely DATATHON. It uses Retrieval-Augmented Generation (RAG) with multimodal (text and image) understanding to find places in NYC that match specific "vibes" and other intangible qualities.

Key Features

Multimodal Understanding: Combines both textual and visual analysis to truly understand place characteristics
Visual Attribute Detection: Analyzes images to detect atmospheres, lighting, colors, settings, and more
Semantic Understanding: Captures the true meaning behind queries like "where to find hot guys" or "cafes to cowork from"
True RAG Implementation: Uses modern embedding models and visual CLIP analysis for nuanced understanding
Vibe Detection: Automatically identifies vibes and atmospheres from both text and visual data
Contextual Explanations: Explains why each place matches your query across both text and visual dimensions
Fast Results: Optimized to deliver results quickly despite complex processing
Beautiful UI: Modern, responsive interface to display results

Quick Start

Installation

# Clone the repository
git clone https://github.com/Ashish-Reddy-T/kundi.git
cd kundi

# IMPORTANT: move the `places.csv`, `media.csv` and `reviews.csv` files into this folder

# Install dependencies
pip install -r requirements.txt

Complete Implementation Steps

For new users, follow these steps to set up the full pipeline:

Analyze images with the CLIP-based analyzer:

# Process all place images and generate visual attributes
python place_image_analyzer.py --batch --places_csv "places.csv" --media_csv "media.csv"

# Generate a report of visual attributes (optional)
python place_image_analyzer.py --report

# Export embeddings (optional)
python place_image_analyzer.py --export_embeddings

Generate image embeddings for RAG:

# Use the enhanced script with the analyzed data
python ingestion/rag_index_images.py --use_existing --analysis_path "place_clip_analysis_data.pkl"

Generate text embeddings with visual attributes:

# Generate text embeddings incorporating visual attributes
python ingestion/rag_index.py --model bge-large

Combine text and image embeddings:

# Combine with tuned weights
python ingestion/rag_index_combine.py --model bge-large --text_weight 0.7 --image_weight 0.3
# Each weight can be adjusted freely within the range [0, 1]
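
The weighting step can be pictured roughly as follows. This is a minimal sketch, not the code in ingestion/rag_index_combine.py: it assumes each modality is L2-normalized, scaled by its weight, and then concatenated (text and CLIP vectors usually differ in size, so they cannot simply be added). The function names `l2_normalize` and `combine_embeddings` are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def combine_embeddings(text_vec, image_vec, text_weight=0.7, image_weight=0.3):
    """Weighted concatenation of a text and an image embedding (assumed scheme)."""
    for name, w in (("text_weight", text_weight), ("image_weight", image_weight)):
        if not 0.0 <= w <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {w}")
    text_part = [text_weight * x for x in l2_normalize(text_vec)]
    image_part = [image_weight * x for x in l2_normalize(image_vec)]
    # Re-normalize so inner product on the result behaves like cosine similarity.
    return l2_normalize(text_part + image_part)


# Toy 2-d "text" and 3-d "image" vectors; real embeddings are far larger.
combined = combine_embeddings([3.0, 4.0], [0.0, 2.0, 0.0])
```

With this scheme, raising `text_weight` makes review and description content dominate the similarity score, while raising `image_weight` favors what the photos show.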

Run the RAG application:

# Run the application with the combined data
python run_rag.py --embedding-model bge-large

Access the web interface: open your browser and go to http://localhost:8000

Quick Run (After Initial Setup)

Once the initial setup is complete, you can simply run:

python run_rag.py --embedding-model bge-large

Technologies Used

Multimodal Understanding

CLIP Visual Analysis: Deep visual understanding of places
- Detects visual attributes like atmosphere, setting, colors, lighting
- Generates consistent visual embeddings
- Enhances search with visual context
Combined Text-Visual Embeddings: Weighted fusion of textual and visual understanding

Embedding Models

Sentence Transformers: Modern semantic embedding models
- Options: MiniLM, BGE, MPNet
- Captures nuanced meaning in text

Vector Database

FAISS: High-performance similarity search
- Fast retrieval even with thousands of items
- Efficient cosine similarity calculations

Visual Analysis

PlaceImageAnalyzer: Advanced CLIP-based image analysis
- Detects 11 categories of visual attributes
- Generates detailed atmospheric understanding
- Maps visual attributes to vibes

LLM Integration

LangChain: Framework for LLM applications
- Structured prompts for context and explanations
- Support for multiple LLM providers

Optional LLMs

Local LLMs: Via llama-cpp-python
- Works offline with models like Llama, Mistral, etc.
OpenAI API: For enhanced explanations
- Set OPENAI_API_KEY in .env file to use

Architecture

This system follows a sophisticated multimodal RAG architecture:

Indexing Phase:
- Process place data, reviews, and media
- Analyze images with CLIP for visual attributes
- Extract vibe attributes from both text and images
- Generate high-quality text and image embeddings
- Combine embeddings with appropriate weights
- Build optimized vector index
Retrieval Phase:
- Process user query with semantic understanding
- Expand query to capture related concepts
- Retrieve relevant places using multimodal vector similarity
- Apply contextual filtering (neighborhoods, vibes)
Generation Phase:
- Generate explanations for why places match the query
- Provide context about each place's vibe and atmosphere
- Create a cohesive, helpful response highlighting both visual and textual matches

How Vibe Search Works

1. Multimodal Ingestion & Embedding

The system processes place data from multiple sources:

Structured Data: Name, location, tags, etc.
Unstructured Data: Reviews, descriptions
Visual Data: Images analyzed with CLIP to detect visual attributes
- Place type (restaurant, cafe, bar, etc.)
- Setting (indoor, outdoor, rooftop, etc.)
- Atmosphere (upscale, casual, romantic, etc.)
- Lighting (bright, dim, candlelit, etc.)
- Colors and materials
- Furniture and decor
- View characteristics
- Crowd dynamics
- Time context
- Food and drink focus
- Overall vibes

This data is processed to extract vibe attributes from both text and images, which are combined with the original data and embedded using state-of-the-art models.

2. Query Understanding & Expansion

When you search for something like "cafes to cowork from", the system:

Understands the Concept: Recognizes "coworking" implies wifi, quiet, outlets, etc.
Expands the Query: Adds related terms to improve results
Extracts Constraints: Identifies location filters, price ranges, etc.
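
The three steps above could be sketched as below. This is an illustration under stated assumptions, not the logic in rag_app.py: the `CONCEPT_TERMS` map, the `NEIGHBORHOODS` list, and the `expand_query` function are all hypothetical, and a real system may use an LLM or embeddings instead of keyword lookup.

```python
# Hypothetical concept map and neighborhood list, for illustration only.
CONCEPT_TERMS = {
    "cowork": ["wifi", "quiet", "outlets", "laptop friendly"],
    "romantic": ["date night", "dim lighting", "intimate"],
}
NEIGHBORHOODS = ["east village", "west village", "soho", "williamsburg"]


def expand_query(query):
    """Return the expanded query text plus any neighborhood constraint found."""
    lowered = query.lower()
    # Collect related terms for every concept keyword present in the query.
    extra = [t for key, terms in CONCEPT_TERMS.items() if key in lowered for t in terms]
    # Take the first known neighborhood mentioned, if any.
    neighborhood = next((n for n in NEIGHBORHOODS if n in lowered), None)
    expanded = query if not extra else f"{query} {' '.join(extra)}"
    return {"expanded": expanded, "neighborhood": neighborhood}


result = expand_query("cafes to cowork from in the East Village")
```

The expanded text would be what gets embedded, while the extracted neighborhood would be kept aside as a filter.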

3. Multimodal Similarity Search & Filtering

The system searches for places that match the expanded query:

Vector Similarity: Finds places with similar text AND visual characteristics
Balanced Retrieval: Weights visual and textual importance appropriately
Filtering: Applies neighborhood and vibe filters
Ranking: Orders results by relevance across both modalities
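
A minimal sketch of the search, filter, and rank steps, using plain Python instead of FAISS so it runs anywhere. The record layout (`name`, `neighborhood`, `vibes`, `embedding`) and the `search` function are assumptions for illustration. Filtering is applied before ranking here, which gives the same result as filtering afterwards when every place is scored.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def search(query_vec, places, neighborhood=None, vibe=None, top_k=5):
    """Filter by optional constraints, then rank by cosine similarity."""
    candidates = [
        p for p in places
        if (neighborhood is None or p["neighborhood"] == neighborhood)
        and (vibe is None or vibe in p["vibes"])
    ]
    scored = [(cosine(query_vec, p["embedding"]), p["name"]) for p in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Toy records with made-up names and 3-dimensional embeddings.
places = [
    {"name": "Cafe A", "neighborhood": "east village", "vibes": {"work_friendly"}, "embedding": [0.9, 0.1, 0.0]},
    {"name": "Bar B", "neighborhood": "soho", "vibes": {"dancing_music"}, "embedding": [0.0, 1.0, 0.0]},
    {"name": "Cafe C", "neighborhood": "east village", "vibes": {"coffee_tea"}, "embedding": [0.5, 0.5, 0.0]},
]
hits = search([1.0, 0.0, 0.0], places, neighborhood="east village")
```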

4. Explanation Generation

For each result, the system generates an explanation:

Contextual Understanding: Why this place matches your query
Visual Highlights: Emphasizes relevant visual attributes detected
Text Highlights: Features from descriptions and reviews
Natural Language: Presents information conversationally
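
One way to picture the explanation step is a prompt assembled from the query and the retrieved place. The sketch below is hypothetical: the `build_explanation_prompt` function, the field names, and the prompt wording are assumptions, and the actual project builds its prompts through LangChain.

```python
def build_explanation_prompt(query, place):
    """Assemble a plain-text prompt asking an LLM why a place matches a query."""
    visual = ", ".join(place.get("visual_attributes", [])) or "none detected"
    snippets = "\n".join(f"- {s}" for s in place.get("review_snippets", [])) or "- (no reviews)"
    return (
        f"User query: {query}\n"
        f"Place: {place['name']}\n"
        f"Visual attributes: {visual}\n"
        f"Review snippets:\n{snippets}\n"
        "In two sentences, explain conversationally why this place matches the query."
    )


# Made-up place record for illustration.
prompt = build_explanation_prompt(
    "romantic restaurants with dim lighting",
    {
        "name": "Example Bistro",
        "visual_attributes": ["dim", "candlelit"],
        "review_snippets": ["Perfect for a date."],
    },
)
```

Keeping visual attributes and review snippets on separate lines lets the model cite both modalities in its answer.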

Advanced Usage

Command Line Options

# Build the index and run with a specific embedding model
python run_rag.py --build-index --embedding-model minilm

# Adjust text vs image importance
python ingestion/rag_index_combine.py --model minilm --text_weight 0.7 --image_weight 0.3

# Run on specific port
python run_rag.py --port 8080

Visual Analysis Options

# Generate visual embeddings using existing analysis
python ingestion/rag_index_images.py --use_existing --analysis_path "place_clip_analysis_data.pkl"

# Analyze images with specific settings
python place_image_analyzer.py --batch --places_csv "places.csv" --media_csv "media.csv" --max_width 600 --max_height 600

# Generate visual attribute reports
python place_image_analyzer.py --report
python place_image_analyzer.py --vibes_report

Environment Variables

Create a .env file to configure:

EMBEDDING_MODEL=bge-large
LLM_PROVIDER=local
OPENAI_API_KEY=your_key_here
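
A sketch of how these variables might be read at startup. It assumes the .env file has already been loaded into the process environment (a package such as python-dotenv is commonly used for that); the `load_settings` function and the fallback defaults are hypothetical.

```python
import os


def load_settings(env=None):
    """Read the three settings shown above, falling back to assumed defaults."""
    env = os.environ if env is None else env
    return {
        "embedding_model": env.get("EMBEDDING_MODEL", "bge-large"),
        "llm_provider": env.get("LLM_PROVIDER", "local"),
        # None means no key was supplied, so OpenAI explanations would be skipped.
        "openai_api_key": env.get("OPENAI_API_KEY") or None,
    }


# Passing a dict makes the function easy to try without touching the real environment.
settings = load_settings({"EMBEDDING_MODEL": "minilm"})
```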

Example Queries

"cafes to cowork from"
"matcha latte in the east village"
"where can I spend a sunny day?"
"romantic restaurants with dim lighting"
"dance-y bars that have disco balls"
"restaurants with outdoor seating and string lights"
"cozy cafes with warm ambient lighting"
"upscale cocktail bars with a view"

Technical Details

Embedding Models

This implementation supports multiple embedding models:

Model	Dimensions	Speed	Quality
MiniLM	384	⚡⚡⚡	⭐⭐
BGE-Small	384	⚡⚡⚡	⭐⭐⭐
BGE-Base	768	⚡⚡	⭐⭐⭐⭐
BGE-Large	1024	⚡	⭐⭐⭐⭐⭐
MPNet	768	⚡⚡	⭐⭐⭐⭐

In practice, the choice of model makes little difference on this dataset; noticeable changes in quality should only appear if the dataset is altered.

For example, with bge-large most of our queries ran in under 0.1 seconds.

Visual Analysis Categories

The system analyzes images across 11 categories of visual attributes:

Place Type: restaurant, cafe, bar, etc.
Setting: indoor, outdoor, rooftop, waterfront, etc.
Atmosphere: upscale, casual, romantic, lively, etc.
Lighting: bright, dim, candlelit, colored, etc.
Colors & Materials: wood, brick, colorful, monochrome, etc.
Furniture & Decor: modern, vintage, minimalist, etc.
View: skyline, water, garden, none, etc.
Crowd: empty, sparse, filled, busy, etc.
Time Context: daytime, evening, golden hour, etc.
Food & Drink Focus: plated food, cocktails, coffee, etc.
Vibes: date night, group hangout, instagram-worthy, etc.
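
Within each category, a CLIP-style analyzer typically compares the image embedding with the embedding of a text prompt for each candidate label and keeps the best match. The sketch below shows only that scoring step with toy vectors. It is not the code in place_image_analyzer.py; the `classify` function is hypothetical, and the scale of 100 is an assumed temperature.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(image_vec, label_vecs, scale=100.0):
    """Pick the label whose unit-length embedding best matches the image."""
    labels = list(label_vecs)
    # Inner product equals cosine similarity because all vectors are unit length.
    sims = [sum(a * b for a, b in zip(image_vec, label_vecs[name])) for name in labels]
    probs = softmax([scale * s for s in sims])
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs[best]


# Toy unit vectors standing in for CLIP embeddings of one image and three lighting prompts.
lighting = {"bright": [1.0, 0.0], "dim": [0.0, 1.0], "candlelit": [0.6, 0.8]}
label, confidence = classify([0.0, 1.0], lighting)
```

Running the same scoring once per category would yield one attribute (with a confidence) for each of the 11 categories.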

Vibe Categories

The system automatically categorizes places into vibes from both text and visual cues:

date_night
work_friendly
outdoor_vibes
group_hangout
food_focus
drinks_focus
coffee_tea
dancing_music
quiet_relaxing
upscale_fancy
casual_lowkey
unique_special
trendy_cool
budget_friendly
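
As a rough illustration of tagging from both text and visual cues, the sketch below matches keywords against review text and detected visual attributes. The keyword lists and the `tag_vibes` function are hypothetical; the real categorization may rely on embeddings or CLIP scores instead of substring matching.

```python
# Hypothetical keyword lists for a few of the vibe categories above.
VIBE_KEYWORDS = {
    "work_friendly": ["wifi", "laptop", "outlets"],
    "date_night": ["romantic", "candlelit", "intimate"],
    "dancing_music": ["dj", "dance floor", "disco"],
}


def tag_vibes(text, visual_attributes=()):
    """Return sorted vibe tags whose keywords appear in text or visual attributes."""
    haystack = " ".join([text, *visual_attributes]).lower()
    return sorted(
        vibe for vibe, words in VIBE_KEYWORDS.items()
        if any(word in haystack for word in words)
    )


tags = tag_vibes("Great wifi and plenty of outlets.", visual_attributes=["candlelit"])
```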

Troubleshooting

"Unexpected text embedding size" Error

If you see this error when using a different embedding model, you may need to update the embed_text function in rag_app.py to handle the dimensionality of that specific embedding model.
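
A small guard like the following can make the mismatch explicit before it reaches the index. The sizes come from the embedding model table above. The `check_embedding_dim` function is a hypothetical helper, and the model keys other than `minilm` and `bge-large` are assumed spellings.

```python
# Output sizes taken from the embedding model table in this README.
EXPECTED_DIMS = {
    "minilm": 384,
    "bge-small": 384,
    "bge-base": 768,
    "bge-large": 1024,
    "mpnet": 768,
}


def check_embedding_dim(model_name, vector):
    """Raise a descriptive error when a vector does not match the model's size."""
    expected = EXPECTED_DIMS.get(model_name)
    if expected is None:
        raise KeyError(f"unknown embedding model: {model_name}")
    if len(vector) != expected:
        raise ValueError(
            f"{model_name} should produce {expected} dimensions, got {len(vector)}"
        )
    return expected


dim = check_embedding_dim("bge-large", [0.0] * 1024)
```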

Image Processing Issues

If you encounter issues with image processing:

Check that your places.csv and media.csv have the correct format
Try reducing the --max_width and --max_height parameters
Ensure you have adequate memory for CLIP model loading

FAISS Index Issues

If the FAISS index fails to build or search:

Check that the text and image embeddings match in count
Try rebuilding the index with the --build-index option
Ensure the embedding dimensions match what's expected in rag_app.py
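
To check the first point, it can help to compare the place ids behind the two embedding sets. The sketch below assumes each set comes with a parallel list of place ids; `find_misaligned` is a hypothetical helper, not part of the repository.

```python
def find_misaligned(text_ids, image_ids):
    """Report ids present in only one modality, plus whether ordering matches."""
    text_set, image_set = set(text_ids), set(image_ids)
    return {
        "missing_image": sorted(text_set - image_set),
        "missing_text": sorted(image_set - text_set),
        "same_order": list(text_ids) == list(image_ids),
    }


# Toy ids: place "p2" has a text embedding but no image embedding.
report = find_misaligned(["p1", "p2", "p3"], ["p1", "p3"])
```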

Performance Optimizations

Vectorization: Fast similarity search with FAISS
Cached Embeddings: Reuse query embeddings for similar searches
Cached Explanations: Store explanations for common patterns
Timeout Handling: Enforce limits on search and explanation time
Batched Processing: Process data in optimal batches
Weighted Embeddings: Balance text and visual importance
Image Resizing: Process images at optimal dimensions for speed/quality balance

Made with LOVE from the VibeLabs team for the wonderful sponsor CORNER 💖
