ChromaDB Enhanced Multi-Format Processing

High-performance ChromaDB server with built-in support for multiple state-of-the-art embedding models, enabling superior semantic search across PDFs, source code, and markdown with store-optimized chunking strategies.

Prerequisites

Docker Desktop - Install Docker
Python 3.8+ - Tested with 3.8 to 3.12
8GB+ RAM - allocated to Docker (for embedding models)
10GB+ disk space - (Docker image + model cache)

System Dependencies (for OCR)

macOS

brew install tesseract

# Ubuntu/Debian
sudo apt-get install tesseract-ocr

# Or skip Tesseract and use EasyOCR (pure Python)
pip install .[easyocr]

First-time Setup

# 1. Clone and enter directory
git clone <repository_url>
cd chroma-embedded

# 2. Create Python virtual environment (optional but recommended)
python3 -m venv venv
source venv/bin/activate

# 3. Install Python dependencies
pip install .

# 4. Verify dependencies
python3 check_deps.py

# 5. Create docker volumes for persistent storage
docker volume create chromadb-data # Your collections and embeddings
docker volume create chromadb-models # Hugging Face model cache (~2GB)

# 6. Build the docker image
./build.sh

# 7. Start the server
docker run -d --name chromadb-enhanced -p 9000:8000 -v chromadb-data:/data -v chromadb-models:/models chromadb-enhanced:latest

# 8. Verify server is running
curl http://localhost:9000/api/v2/heartbeat

Note: First startup downloads the Stella embedding model (~1.5GB). Subsequent startups use the cached model from the chromadb-models volume.

VPN Users

If behind a VPN with SSL inspection, the model downloads may fail with certificate errors. Temporarily disable the VPN during the first startup to allow model downloads, or pre-download models manually into the /models volume.

# Activate virtual environment
source venv/bin/activate

# Start the server (if not running)
docker start chromadb-enhanced

# Or if the container was removed, recreate it:
docker run -d --name chromadb-enhanced -p 9000:8000 -v chromadb-data:/data -v chromadb-models:/models chromadb-enhanced:latest

Upload content with automatic model-optimized chunking

(upload.sh is a thin wrapper over the unified fast Python uploader)

./upload.sh -i /path/to/pdfs --store pdf -e stella -c ResearchLibrary

Source code with AST-aware chunking (uses 400 tokens for Stella)

./upload.sh -i /path/to/source --store source-code -e stella -c CodeLibrary

Documentation with heading-aware markdown chunking (uses 430 tokens for Stella)

./upload.sh -i /path/to/markdown --store markdown -e stella -c DocsLibrary


## 📁 Project Structure

| File | Purpose |
|------|---------|
| `Dockerfile` | Multi-model ChromaDB Docker image |
| `build.sh` | Build script for Docker image |
| `server.sh` | Server management script |
| `upload.sh` | Thin wrapper for the unified fast Python uploader |
| `upload.py` | Unified uploader (PDF + OCR, source code, markdown) with client-side MPS embedding |
| `embedding_functions.py` | Client-side embedding: `detect_device()`, `load_model()`, `embed_documents()` + server-side classes |
| `chunk_utils.py` | Token-aware, AST-aware, and markdown heading-aware chunking utilities |
| `seed_manifest.py` | Seed `.chroma-uploads.json` manifest from existing ChromaDB collections |
| `compare_manifest.py` | Compare filesystem repos to manifest (find un-uploaded repos) |
| `test.sh` | Complete setup testing |
| `check_deps.py` | Dependency checker (OCR + ASTChunk) |
| `requirements.txt` | Python dependencies (includes ASTChunk) |
| `pyproject.toml` | Modern Python packaging |
| `.gitignore` | Git ignore rules |
| `LICENSE` | MIT license |

## 📋 Installation & Dependencies

### Python Dependencies

```bash
# Install all dependencies (includes ASTChunk and Tesseract wrapper)
pip install .

# Check all dependencies are working (OCR + AST parsing)
python3 check_deps.py

# Development install
pip install -e .[dev]

Key Dependencies Added

ASTChunk (astchunk>=0.1.0) - AST-aware source code chunking
Tree-sitter - Multi-language parsing support (Python, Java, TypeScript, C#, etc.)
Enhanced metadata extraction - Store-specific metadata for better retrieval

OCR Engine Setup

Choose your preferred OCR engine:

Option 1: Tesseract (Recommended - faster)

# Install system dependency
# macOS: brew install tesseract
# Ubuntu/Debian: sudo apt-get install tesseract-ocr
# CentOS/RHEL: sudo yum install tesseract

# Python wrapper already installed with: pip install .
# Ready to use (default engine)

Option 2: EasyOCR (Pure Python - no system deps)

# Install EasyOCR package
pip install .[easyocr]

# Use with --ocr-engine easyocr flag

🎯 Available Embedding Models

Model	Dimensions	Best For	Performance
stella	1024	Research papers, academic content	🥇 Top MTEB performer
modernbert	1024	General purpose, latest tech	🔬 State-of-the-art 2024
bge-large	1024	Production deployments	🏭 Battle-tested
default	384	Quick testing, compatibility	⚡ Fast, lightweight

🖥️ Client-Side Embedding (MPS / GPU)

Embedding is performed client-side using Apple Silicon's Metal Performance Shaders (MPS) or CUDA on NVIDIA GPUs. The Docker container stores pre-computed vectors without running any embedding models.

Device Detection

detect_device() automatically selects the best available device:

MPS (Apple Silicon GPU) — preferred on macOS
CUDA (NVIDIA GPU) — preferred on Linux
CPU — fallback

Usage

# Default: auto-detect device (MPS on Apple Silicon)
python3 upload.py -c projects -i /path/to/code --store source-code -e stella

# Force CPU (disable GPU)
python3 upload.py -c projects -i /path/to/code --store source-code -e stella --no-gpu

Querying

Queries must also embed client-side with the same model. The cdbsp and cdbsd bash functions handle this automatically — they load Stella locally and use query_embeddings instead of query_texts.

# Search source code (client-side Stella embedding)
cdbsp "authentication middleware" 5

# Search docs (client-side Stella embedding)
cdbsd "deployment instructions" 5

Do not use query_texts directly against collections uploaded with client-side embedding — the server's default model produces different-dimension vectors, causing a dimension mismatch error.

Dimension Mismatch Protection

If you try to upload to a collection that has embeddings with a different dimension (e.g., 384-dim from an old upload vs 1024-dim from Stella), the upload will abort with a clear message. Use --delete-collection to recreate with the correct dimensions.

📄 Store Types & Chunking Strategies

The upload script supports three optimized store types, each with tailored chunking and metadata extraction:

Store Type	Chunk Size	Overlap	Processing	Best For
`pdf`	Auto-optimized	10% overlap	OCR + Text extraction	Research papers, documents
`source-code`	Auto-optimized	5% overlap	AST-aware chunking	Code analysis, API understanding
`markdown`	Auto-optimized	Smart overlap	Heading-aware for markdown	README, wikis, tutorials

🧠 Model-Optimized Chunking

The system automatically optimizes chunk sizes for each embedding model:

Stella: 400 tokens/chunk with 50% safety buffer (~640 chars)
ModernBERT: 920 tokens/chunk (large context window)
BGE-Large: 400 tokens/chunk with 50% safety buffer
Default: 400 tokens/chunk with 50% safety buffer

AST-aware source code chunking:

Automatically splits large functions at statement boundaries
Preserves code structure and semantic meaning
Uses conservative sizing to prevent token limit violations

Heading-aware markdown chunking:

Respects H1-H6 heading hierarchy
Keeps sections together when they fit in token limits
Splits at subsection boundaries when sections are too large
Preserves heading context in chunk metadata

🔍 PDF Store Type

OCR Support: Automatic image-only PDF processing with Tesseract/EasyOCR
Language Support: 100+ OCR languages supported
Metadata: File size, extraction method, OCR confidence, image detection

💻 Source Code Store Type

Git Project-Aware: Automatically detects .git directories and tracks project-level changes
Smart Change Detection: Compares git commit hashes to detect when projects need re-indexing
Respects .gitignore: Uses git ls-files to only index tracked files (applies to all store types inside git repos)
Skips Common Junk Directories: node_modules, .venv, venv, __pycache__, .tox, dist, build are always excluded
AST-Aware Chunking: Respects function/class boundaries using ASTChunk
Language Support: 15+ programming languages (Python, Java, JS/TS, C#, Go, Rust, C/C++, PHP, Ruby, Kotlin, Scala, Swift)
Enhanced Metadata: Programming language, function/class detection, import analysis, line counts, git project context
Automatic Language Detection: Based on file extensions
Project Search Depth: Control how deep to search for nested git projects

📚 Markdown Store Type

Heading-Aware Markdown Chunking: Intelligently splits markdown at section boundaries
Structure Preservation: Respects H1-H6 heading hierarchy
Smart Splitting: Keeps sections together when possible, splits at subsections when needed
Enhanced Metadata: Heading hierarchy, section depth, primary heading per chunk
Content Analysis: Detects code blocks, links, and document structure
Supported Formats: Markdown (.md), text (.txt), reStructuredText (.rst), AsciiDoc (.adoc), HTML, XML

🔄 Git Project-Aware Processing

When using --store source-code, the system automatically detects and manages git projects with intelligent change detection:

Key Features

Automatic Discovery: Finds .git directories to identify project boundaries
Smart Change Detection: Compares git commit hashes to detect when re-indexing is needed
Clean Updates: Deletes all existing chunks for a project when its commit hash changes
Respects .gitignore: Uses git ls-files for all store types when inside a git repo (not just source-code)
Skips Junk Directories: node_modules, .venv, __pycache__, dist, build are always excluded from directory walks
Project Metadata: Every chunk includes git project context (name, commit hash, remote URL, branch)

Depth Control

--depth 1        # Only direct subdirectories (fast, good for organized workspaces)
--depth 2        # Two levels deep (includes some nested projects)
# No --depth     # Unlimited depth (finds all nested git projects)

Change Detection Workflow

First Run: Indexes all git-tracked files, stores commit hash with each chunk
Subsequent Runs: Compares stored vs current commit hash
If Changed: Deletes all project chunks and re-indexes all files
If Unchanged: Uses regular file-by-file processing for new files only

Benefits

Automatic Cleanup: Moved/deleted files are automatically removed
Project Context: Search results include which project and commit the code came from
Efficient Updates: Only re-processes projects that have actually changed
Workspace Friendly: Handles directories with multiple git projects gracefully

📝 Markdown Documentation Processing

When using --store markdown with markdown files (.md), the system automatically uses heading-aware chunking:

Key Features

Hierarchical Chunking: Splits at H1-H6 heading boundaries
Smart Section Grouping: Keeps related content together when it fits within token limits
Subsection Splitting: Automatically splits large sections at subsection boundaries
Heading Context: Each chunk includes full heading hierarchy in metadata
Token-Optimized: Respects model-specific token limits (430 tokens for Stella, 880 for ModernBERT)

Chunking Strategy

Parse Structure: Identifies all headings (H1-H6) and their content
Build Hierarchy: Tracks parent-child relationships between sections
Smart Grouping: Combines consecutive sections that fit within token limits
Intelligent Splitting: When sections exceed limits, splits at subsection boundaries
Metadata Enrichment: Adds heading hierarchy, section depth, and primary heading to each chunk

Metadata Fields

Each markdown chunk includes:

markdown_headings: Full heading hierarchy (e.g., "Introduction > Getting Started > Installation")
markdown_primary_heading: The main heading for this chunk
markdown_section_depth: Nesting level of the section (0 = no headings, 1 = H1, 2 = H2, etc.)
markdown_heading_aware: Flag indicating heading-aware chunking was used

Usage Examples

# Process markdown documentation with heading-aware chunking
./upload.sh -i /path/to/markdown/docs --store markdown -e stella -c MarkdownDocs

# Query by section using metadata filters (client-side embedding required)
python3 -c "
import sys
sys.path.insert(0, '.')
from embedding_functions import load_model, embed_documents
import chromadb

model = load_model('stella')
query_vec = embed_documents(model, ['How do I install?'])[0]

client = chromadb.HttpClient(host='localhost', port=9000)
collection = client.get_collection('MarkdownDocs')

# Find all chunks from 'Installation' section
results = collection.query(
    query_embeddings=[query_vec],
    where={'markdown_primary_heading': 'Installation'},
    n_results=5
)
"

# View heading structure of indexed documents
python3 -c "
import chromadb
client = chromadb.HttpClient(host='localhost', port=9000)
collection = client.get_collection('MarkdownDocs')
docs = collection.get(include=['metadatas'], limit=20)

for meta in docs['metadatas']:
    if 'markdown_headings' in meta:
        print(f'{meta[\"filename\"]}: {meta[\"markdown_headings\"]}')
"

Benefits

Better Semantic Search: Chunks aligned with document structure
Section-Aware Queries: Filter results by specific sections
Context Preservation: Full heading hierarchy provides better context
Improved Retrieval: More relevant results due to semantic boundaries

🔧 Server Management

Start Server

# Start with Stella embeddings (recommended)
./server.sh -m stella

# Start with ModernBERT on custom port
./server.sh -m modernbert -p 9001

# Start with BGE-Large for production
./server.sh -m bge-large

Server Operations

# View logs
./server.sh --logs

# Stop server
./server.sh --stop

# Restart with different model
./server.sh --restart -m modernbert

🚧 Known Limitations

ChromaDB Unique Metadata Values

ChromaDB currently does not provide built-in aggregate functions or SQL-like DISTINCT operations for efficiently retrieving unique metadata values. This limitation affects scenarios where you need to:

Get a list of unique project names from a large collection
Count distinct values in metadata fields
Perform aggregate operations on metadata

Current Workaround: The most efficient approach available is to retrieve metadata-only results in small batches and manually deduplicate using Python sets:

# Get all metadata without document content
all_metadatas = collection.get(include=["metadatas"])["metadatas"]

# Extract unique values using Python sets
unique_projects = {meta.get("git_project_name") for meta in all_metadatas}
unique_projects = list(unique_projects)

Community Request: This feature has been actively requested by the ChromaDB community. You can track progress and add your support at:

GitHub Issue: Query with unique metadata filter #2873

Impact: For large collections (thousands of documents), retrieving unique metadata values requires scanning all documents, which is the current best practice until native aggregation support is added to ChromaDB.

🚨 Payload Size Error Handling

When uploading large files (especially minified JavaScript or large source files), you may encounter "413 Payload Too Large" errors. The system now provides fail-fast error handling with clear recovery options:

Error Detection & Recovery

# If you get a payload error, the system will show:
❌ PAYLOAD TOO LARGE ERROR
📁 File: /path/to/aws-amplify.min.js
📊 File size: 1,234,567 bytes
🧩 Total chunks: 156
💾 Batch payload: ~2,500,000 characters

💡 RECOMMENDATIONS:
   Suggested chunk size: 800 tokens
   Suggested batch size: 25

🔧 RECOVERY OPTIONS:
   1. Reduce chunk size: --chunk-size 800 --batch-size 25
   2. Delete partial project: --delete-project my-project

Prevention & Optimization

# Preview chunk sizes before uploading (dry-run)
./upload.sh --dry-run -i /path/to/source --store source-code

# Upload with conservative settings for large files
./upload.sh -i /path/to/source --store source-code --chunk-size 800 --batch-size 25

# Auto-cleanup failed projects
./upload.sh -i /path/to/source --store source-code --delete-failed-project

Project Cleanup Commands

# Delete specific project from collection
./upload.sh --delete-project my-project-name -c MyCollection

# List available projects (shown when project not found)
./upload.sh --delete-project nonexistent -c MyCollection

📤 Multi-Format Upload Examples

PDF Processing (Research Papers & Documents)

# Basic PDF upload with OCR (auto-optimized: 460 tokens for Stella)
./upload.sh -i /path/to/pdfs --store pdf -e stella -c ResearchLibrary

# Multi-language OCR support
./upload.sh -i /path/to/pdfs --store pdf -e stella --ocr-language fra -c FrenchPapers
./upload.sh -i /path/to/pdfs --store pdf -e stella --ocr-engine easyocr --ocr-language es -c SpanishPapers

# Disable OCR for text-only PDFs (faster processing)
./upload.sh -i /path/to/pdfs --store pdf -e stella --disable-ocr -c TextOnlyPDFs

Source Code Processing (API Understanding & Analysis)

# Git project-aware source code chunking (auto-optimized: 400 tokens for Stella)
./upload.sh -i /path/to/source --store source-code -e stella -c CodeLibrary

# Only scan direct subdirectories for git projects
./upload.sh -i /workspace --store source-code -e stella -c MainProjects --depth 1

# Process specific git project (detects changes via commit hash)
./upload.sh -i ./my-project --store source-code -e stella -c MyProject --delete-collection
./upload.sh -i ./my-project --store source-code -e stella -c MyProject  # Re-run: only processes if changed

# Multi-project workspace processing
./upload.sh -i /workspace --store source-code -e stella -c AllProjects
./upload.sh -i /workspace --store source-code -e stella -c AllProjects --depth 2

# Language-specific collections
./upload.sh -i ./python_project --store source-code -e stella -c PythonCode
./upload.sh -i ./java_project --store source-code -e stella -c JavaCode

# Custom chunking only if needed (overrides auto-optimization)
./upload.sh -i /path/to/source --store source-code -e stella --chunk-size 300 -c SmallChunks

Markdown Processing (README, Wikis, Tutorials)

# Optimized markdown processing (auto-optimized: 430 tokens for Stella)
./upload.sh -i /path/to/docs --store markdown -e stella -c DocsLibrary

# Process specific markdown types
./upload.sh -i ./wiki --store markdown -e stella -c ProjectWiki
./upload.sh -i ./tutorials --store markdown -e stella -c Tutorials

Advanced Multi-Format Workflows

# Create specialized collections per content type
./upload.sh -i ./papers --store pdf -e stella -c Research --delete-collection
./upload.sh -i ./codebase --store source-code -e stella -c CodeAnalysis --delete-collection
./upload.sh -i ./documentation --store markdown -e stella -c ProjectDocs --delete-collection

# Git project-aware workflows
./upload.sh -i /workspace --store source-code -e stella -c WorkspaceCode --depth 1  # Top-level projects only
./upload.sh -i /workspace/thirdparty --store source-code -e stella -c ThirdPartyCode --depth 2  # Include nested libs

# Mixed source code and markdown
./upload.sh -i ./my-project --store source-code -e stella -c MyProject --delete-collection
./upload.sh -i ./my-project/docs --store markdown -e stella -c MyProjectDocs --delete-collection

# Custom chunking only when needed (overrides auto-optimization)
./upload.sh -i /path/to/files --store pdf --chunk-size 300 --chunk-overlap 30 -c SmallChunks

# Remote server deployment
./upload.sh -i /path/to/files --store pdf -h production-server.com -p 8000 -e modernbert

# Incremental git project updates (only re-processes changed projects)
./upload.sh -i /workspace --store source-code -e stella -c DevEnvironment  # Daily runs

🏗️ Architecture

Embeddings are computed client-side on Apple Silicon GPU (MPS) or CPU, not inside Docker. The Docker container is a pure storage layer.

flowchart TB
    subgraph Client["Client (macOS / Apple Silicon)"]
        direction TB
        UPLOAD["upload.py<br/>PDFs + OCR, Source Code + AST, Markdown"]
        QUERY["cdbsp / cdbsd<br/>query functions"]

        subgraph GPU["Stella-400M on MPS (GPU)"]
            EMBED["embed_documents()<br/>1024-dim vectors"]
        end

        DETECT["detect_device()<br/>mps / cuda / cpu"]
    end

    subgraph Docker["Docker Container (Storage Only)"]
        CHROMA["ChromaDB Server :9000"]
        STORE[("Collections<br/>projects (source-code, 1024d)<br/>docs (markdown/pdf, 1024d)")]
    end

    UPLOAD --> DETECT
    QUERY --> DETECT
    DETECT --> EMBED
    UPLOAD -->|"documents + metadatas"| CHROMA
    EMBED -->|"embeddings="| CHROMA
    EMBED -->|"query_embeddings="| CHROMA
    CHROMA --> STORE
    CHROMA -->|"ranked results"| QUERY

Key: Both uploads and queries embed client-side with the same Stella model, then pass pre-computed vectors to ChromaDB via embeddings= (uploads) or query_embeddings= (queries). The server never computes embeddings.

💻 Source Code Support

Supported Programming Languages

The source code store type supports 15+ programming languages with automatic detection:

Language	Extensions	AST Parser	Enhanced Metadata
Python	`.py`	✅ tree-sitter-python	Functions, classes, imports
Java	`.java`	✅ tree-sitter-java	Methods, classes, packages
JavaScript	`.js`, `.jsx`	✅ tree-sitter-typescript	Functions, objects, imports
TypeScript	`.ts`, `.tsx`	✅ tree-sitter-typescript	Types, interfaces, modules
C#	`.cs`	✅ tree-sitter-c-sharp	Methods, classes, namespaces
Go	`.go`	✅ tree-sitter-go	Functions, structs, packages
Rust	`.rs`	✅ tree-sitter-rust	Functions, traits, modules
C/C++	`.c`, `.cpp`	✅ tree-sitter-cpp	Functions, classes, includes
PHP	`.php`	✅ tree-sitter-php	Functions, classes, namespaces
Ruby	`.rb`	✅ tree-sitter-ruby	Methods, classes, modules
Kotlin	`.kt`	✅ (via Java parser)	Classes, functions, packages
Scala	`.scala`	✅ (via Java parser)	Objects, classes, traits
Swift	`.swift`	✅ (via C parser)	Functions, classes, protocols

AST-Aware Chunking Benefits

Traditional Text Chunking Problems:

# ❌ Basic chunking might split mid-function
def calculate_api_response(data):
    # Processing logic here...
    return result
# CHUNK BREAK - Context lost!

class DatabaseManager:
    def connect(self):

AST-Aware Chunking Solution:

# ✅ ASTChunk preserves semantic boundaries
def calculate_api_response(data):
    """Complete function with docstring intact"""
    # All related logic stays together
    return result

# New chunk starts at natural boundary
class DatabaseManager:
    """Complete class with all methods"""
    def connect(self):
        # Method implementation complete

Enhanced Metadata for Code Understanding

Each source code chunk includes rich metadata for precise retrieval:

{
  "store_type": "source-code",
  "programming_language": "python",
  "file_extension": ".py",
  "has_functions": true,
  "has_classes": true,
  "has_imports": true,
  "line_count": 45,
  "ast_chunked": true,
  "text_extraction_method": "astchunk_python"
}

API Understanding Use Cases

Perfect for:

🔍 API Discovery: Find similar function signatures across projects
📚 Usage Examples: Locate how specific APIs are used in practice
🔧 Implementation Patterns: Discover common coding patterns and practices
🐛 Error Handling: Find error handling approaches for specific scenarios
📖 Documentation Gap Filling: When official docs are lacking or incomplete

Query Examples:

"How to authenticate with REST APIs in Python?"
"Show me error handling patterns for database connections"
"Find examples of async/await usage in JavaScript"
"What are common patterns for dependency injection in Java?"

📚 Markdown Processing

Optimized for Technical Documentation

The markdown store type is specifically tuned for technical content:

Supported Formats:

Markdown (.md) - README files, wikis, technical guides
Text (.txt) - Plain text documentation
reStructuredText (.rst) - Python documentation standard
AsciiDoc (.adoc) - Technical documentation format
HTML (.html) - Web documentation
XML (.xml) - Structured documentation

Enhanced Content Analysis

Documentation chunks include intelligent content detection:

{
  "store_type": "markdown",
  "doc_type": "markdown",
  "has_code_blocks": true,
  "has_links": true,
  "line_count": 89,
  "text_extraction_method": "direct_read"
}

Documentation Use Cases

Perfect for:

📖 Project Onboarding: Quickly understand new codebases and their documentation
🔗 Cross-Reference Discovery: Find related documentation across different projects
💡 Best Practice Learning: Extract patterns and recommendations from documentation
🏗️ Architecture Understanding: Grasp system design from architectural docs
🚀 Setup Instructions: Locate installation and configuration guides

Query Examples:

"How to set up development environment for this project?"
"What are the deployment procedures and requirements?"
"Find architectural decisions and design patterns used"
"Show me configuration examples and environment variables"

🔄 Migration from Old Setup

If currently using PersistentClient or basic PDF-only setup:

# 1. Rebuild with enhanced capabilities
./build.sh

# 2. Start server
./server.sh -m stella

# 3. Migrate existing PDFs with explicit store type
./upload.sh -i /path/to/pdfs --store pdf -e stella --delete-collection

# 4. Add new content types
./upload.sh -i /path/to/source --store source-code -e stella -c CodeLibrary
./upload.sh -i /path/to/docs --store markdown -e stella -c DocsLibrary

Then update your claude.json MCP configuration to use localhost:9000.

🧪 Testing & Validation

Comprehensive Testing

# Run all tests (includes new store types)
./test.sh

# Test each store type individually
./upload.sh -i ./embedding_functions.py --store source-code -e stella -l 1 -c TestSource --delete-collection
./upload.sh -i ./README.md --store markdown -e stella -l 1 -c TestDocs --delete-collection
./upload.sh -i /path/to/test.pdf --store pdf -e stella -l 1 -c TestPDF --delete-collection

Verify AST Chunking

# Check if ASTChunk is working properly
python3 -c "
import astchunk
from astchunk import ASTChunkBuilder
print('✅ ASTChunk available and ready')
configs = {'max_chunk_size': 1000, 'language': 'python', 'metadata_template': 'default'}
chunker = ASTChunkBuilder(**configs)
print('✅ ASTChunk chunker initialized successfully')
"

Validate Store-Specific Metadata

# Query and inspect metadata for different store types
python3 -c "
import chromadb
client = chromadb.HttpClient(host='localhost', port=9000)

# Check source code metadata
try:
    collection = client.get_collection('TestSource')
    docs = collection.get(limit=1, include=['metadatas'])
    metadata = docs['metadatas'][0]
    print('Source Code Metadata:')
    print(f'  Language: {metadata.get(\"programming_language\", \"N/A\")}')
    print(f'  Has Functions: {metadata.get(\"has_functions\", \"N/A\")}')
    print(f'  AST Chunked: {metadata.get(\"ast_chunked\", \"N/A\")}')
    print('✅ Source code metadata validated')
except:
    print('⚠️  No source code collection found')

# Check markdown metadata
try:
    collection = client.get_collection('TestDocs')
    docs = collection.get(limit=1, include=['metadatas'])
    metadata = docs['metadatas'][0]
    print('Markdown Metadata:')
    print(f'  Doc Type: {metadata.get(\"doc_type\", \"N/A\")}')
    print(f'  Has Code Blocks: {metadata.get(\"has_code_blocks\", \"N/A\")}')
    print(f'  Has Links: {metadata.get(\"has_links\", \"N/A\")}')
    print('✅ Markdown metadata validated')
except:
    print('⚠️  No markdown collection found')
"

🎛️ Environment Variables

# Upload script configuration
export PDF_INPUT_PATH=/path/to/files     # Input path (works with all store types)

# Server configuration (Docker container — storage only, no embedding)
export CHROMA_EMBEDDING_MODEL=stella     # Legacy: server model (unused with client-side embedding)
export TRANSFORMERS_CACHE=/models        # Model cache directory (inside Docker)
export HF_HOME=/models                   # Hugging Face cache directory (inside Docker)

# Store-specific defaults (optional)
export DEFAULT_STORE_TYPE=pdf            # Default store type
# Note: Chunk sizes are now auto-optimized per embedding model

🔍 Troubleshooting

Server Won't Start

# Check Docker
docker ps

# View server logs
./server.sh --logs

# Restart server
./server.sh --restart

Upload Failures

# Test server connection
curl http://localhost:9000/api/v2/heartbeat

# Check all dependencies including OCR and ASTChunk
python3 -c "import chromadb, fitz, astchunk, PIL; print('✅ All Dependencies OK')"

# Test OCR functionality (EasyOCR)
python3 -c "import easyocr; print('✅ EasyOCR available')"

# Test Tesseract if using it
python3 -c "import pytesseract; print('Tesseract Version:', pytesseract.get_tesseract_version())"

# Test ASTChunk functionality
python3 -c "from astchunk import ASTChunkBuilder; print('✅ ASTChunk available')"

# Test with smaller uploads for each store type
./upload.sh -i /path/to/test.pdf --store pdf -e stella -l 1 -c TestPDF --delete-collection
./upload.sh -i ./embedding_functions.py --store source-code -e stella -l 1 -c TestCode --delete-collection

OCR Issues (PDF Store Type)

# EasyOCR issues (should work out of the box)
python3 -c "import easyocr; print('EasyOCR OK')"

# Tesseract issues (if using --ocr-engine tesseract)
tesseract --version
pip install .[tesseract]

# Test with OCR disabled if having issues
./upload.sh -i /path/to/pdfs --store pdf -e stella --disable-ocr -l 1 -c TestCollection --delete-collection

ASTChunk Issues (Source Code Store Type)

# Verify ASTChunk installation
python3 -c "import astchunk; from astchunk import ASTChunkBuilder; print('ASTChunk working')"

# Test with basic chunking fallback if ASTChunk fails
./upload.sh -i ./test.py --store source-code -e stella -l 1 -c TestFallback --delete-collection

# Check tree-sitter language parsers
python3 -c "
import tree_sitter_python
import tree_sitter_java
import tree_sitter_typescript
print('✅ Tree-sitter parsers available')
"

# Manual ASTChunk test
python3 -c "
from astchunk import ASTChunkBuilder
configs = {'max_chunk_size': 1000, 'language': 'python', 'metadata_template': 'default'}
chunker = ASTChunkBuilder(**configs)
result = chunker.chunkify('def hello(): print(\"Hello World\")')
print(f'✅ ASTChunk test successful: {len(result)} chunks')
"

Model Loading Issues

Ensure Docker has sufficient memory (8GB+ recommended)
Check network connectivity for model downloads
Verify disk space (~10GB for all models)

🎓 Best Practices

Store Type Selection

Choose the Right Store Type:
- --store pdf for research papers and documents
- --store source-code for API understanding and code analysis
- --store markdown for README files and technical guides
Collection Organization:
- Use descriptive collection names: ResearchLibrary, CodeLibrary, DocsLibrary
- Separate collections by content type for better semantic coherence
- Consider language-specific collections for source code: PythonCode, JavaCode

Embedding Model Strategy

Model Selection by Use Case:
- Stella (recommended): Best for research papers and technical content
- ModernBERT: Latest technology, good for mixed content
- BGE-Large: Production-ready, reliable for all content types

Processing Optimization

Model-Optimized Chunking (2024 Update):
- Use default auto-optimization for best results (no --chunk-size needed)
- System automatically respects each model's token limits with safety margins
- Source code benefits from AST-aware chunking (automatic with ASTChunk)
- Only override chunking for special requirements (e.g., very small chunks)
Resource Management:
- Ensure Docker has 8GB+ RAM for optimal performance
- ASTChunk requires additional memory for multiple language parsers
- Monitor disk space for model downloads (~10GB total)

Content-Specific Tips

PDF Processing:
- Enable OCR by default (handles image-only PDFs)
- Test with different OCR engines if accuracy issues occur
- Use --ocr-language for non-English documents
Source Code Processing:
- Let ASTChunk handle chunking automatically (preserves function boundaries)
- Include test files - they often contain the best usage examples
- Process entire project directories for complete context
Documentation Processing:
- Include all related docs in same collection for cross-referencing
- Markdown files provide the richest structural information
- Smaller chunk sizes work better for precise documentation retrieval

Quality Assurance

Testing & Validation:
- Always test with small uploads first (-l 5)
- Verify metadata is populated correctly for each store type
- Use python3 check_deps.py to validate all dependencies
Backup & Recovery:
- Backup collections before major changes
- Keep source files organized for re-processing if needed
- Document your embedding model choices for consistency

🔌 Claude Code MCP Integration

Setup Steps

Start ChromaDB Server:
```
./server.sh -m stella
```

Configure MCP in claude.json:

{
  "mcpServers": {
    "chroma-docker": {
      "command": "docker",
      "args": [
        "run", "-i", "--rm", "--network", "host",
        "mcp/chroma", "chroma-mcp",
        "--client-type", "http",
        "--host", "localhost",
        "--port", "9000",
        "--ssl", "false"
      ]
    }
  }
}

Test Connection:

curl http://localhost:9000/api/v2/heartbeat

Restart Claude Code to load the configuration

Benefits

✅ Client-Side GPU Embedding: Stella-400m on Apple Silicon MPS — no fan spin, fast uploads
✅ Multi-Format Support: PDFs, source code, and markdown in one system
✅ AST-Aware Code Analysis: Semantic chunking preserves function boundaries
✅ Enhanced Metadata: Store-specific metadata for precise retrieval
✅ OCR Support: Automatically processes image-only PDFs
✅ Dimension Mismatch Protection: Catches mismatched embeddings before upload
✅ Centralized Management: One server for all content types
✅ Research & Development Optimized: Designed for technical workflows

🔮 Future Enhancements

Support for additional embedding models
Model fine-tuning capabilities
Multi-modal embeddings (text + images)
Distributed embedding clusters
Model performance benchmarking

Name		Name	Last commit message	Last commit date
Latest commit History 19 Commits
.gitignore		.gitignore
BENCHMARK.md		BENCHMARK.md
CLAUDE.md		CLAUDE.md
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
build.sh		build.sh
check_deps.py		check_deps.py
chunk_utils.py		chunk_utils.py
compare_manifest.py		compare_manifest.py
embedding_functions.py		embedding_functions.py
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
seed_manifest.py		seed_manifest.py
server.sh		server.sh
setup.sh		setup.sh
test.sh		test.sh
test_incremental.py		test_incremental.py
upload.py		upload.py
upload.sh		upload.sh

Folders and files

Latest commit

History

Repository files navigation

ChromaDB Enhanced Multi-Format Processing

Prerequisites

System Dependencies (for OCR)

macOS

First-time Setup

VPN Users

Upload content with automatic model-optimized chunking

(upload.sh is a thin wrapper over the unified fast Python uploader)

Source code with AST-aware chunking (uses 400 tokens for Stella)

Documentation with heading-aware markdown chunking (uses 430 tokens for Stella)

Key Dependencies Added

OCR Engine Setup

🎯 Available Embedding Models

🖥️ Client-Side Embedding (MPS / GPU)

Device Detection

Usage

Querying

Dimension Mismatch Protection

📄 Store Types & Chunking Strategies

🧠 Model-Optimized Chunking

🔍 PDF Store Type

💻 Source Code Store Type

📚 Markdown Store Type

🔄 Git Project-Aware Processing

Key Features

Depth Control

Change Detection Workflow

Benefits

📝 Markdown Documentation Processing

Key Features

Chunking Strategy

Metadata Fields

Usage Examples

Benefits

🔧 Server Management

Start Server

Server Operations

🚧 Known Limitations

ChromaDB Unique Metadata Values

🚨 Payload Size Error Handling

Error Detection & Recovery

Prevention & Optimization

Project Cleanup Commands

📤 Multi-Format Upload Examples

PDF Processing (Research Papers & Documents)

Source Code Processing (API Understanding & Analysis)

Markdown Processing (README, Wikis, Tutorials)

Advanced Multi-Format Workflows

🏗️ Architecture

💻 Source Code Support

Supported Programming Languages

AST-Aware Chunking Benefits

Enhanced Metadata for Code Understanding

API Understanding Use Cases

📚 Markdown Processing

Optimized for Technical Documentation

Enhanced Content Analysis

Documentation Use Cases

🔄 Migration from Old Setup

🧪 Testing & Validation

Comprehensive Testing

Verify AST Chunking

Validate Store-Specific Metadata

🎛️ Environment Variables

🔍 Troubleshooting

Server Won't Start

Upload Failures

OCR Issues (PDF Store Type)

ASTChunk Issues (Source Code Store Type)

Model Loading Issues

🎓 Best Practices

Store Type Selection

Embedding Model Strategy

Processing Optimization

Content-Specific Tips

Quality Assurance

Packages