High-performance ChromaDB server with built-in support for multiple state-of-the-art embedding models, enabling superior semantic search across PDFs, source code, and markdown with store-optimized chunking strategies.
- Docker Desktop - Install Docker
- Python 3.8+ - Tested with 3.8 to 3.12
- 8GB+ RAM - allocated to Docker (for embedding models)
- 10GB+ disk space - (Docker image + model cache)
# macOS
brew install tesseract
# Ubuntu/Debian
sudo apt-get install tesseract-ocr
# Or skip Tesseract and use EasyOCR (pure Python)
pip install ".[easyocr]"

# 1. Clone and enter directory
git clone <repository_url>
cd chroma-embedded
# 2. Create Python virtual environment (optional but recommended)
python3 -m venv venv
source venv/bin/activate
# 3. Install Python dependencies
pip install .
# 4. Verify dependencies
python3 check_deps.py
# 5. Create docker volumes for persistent storage
docker volume create chromadb-data # Your collections and embeddings
docker volume create chromadb-models # Hugging Face model cache (~2GB)
# 6. Build the docker image
./build.sh
# 7. Start the server
docker run -d --name chromadb-enhanced -p 9000:8000 -v chromadb-data:/data -v chromadb-models:/models chromadb-enhanced:latest
# 8. Verify server is running
curl http://localhost:9000/api/v2/heartbeat

Note: First startup downloads the Stella embedding model (~1.5GB). Subsequent startups use the cached model from the chromadb-models volume.
If behind a VPN with SSL inspection, the model downloads may fail with certificate errors. Temporarily disable the VPN during the first startup to allow model downloads, or pre-download models manually into the /models volume.
# Activate virtual environment
source venv/bin/activate
# Start the server (if not running)
docker start chromadb-enhanced
# Or if the container was removed, recreate it:
docker run -d --name chromadb-enhanced -p 9000:8000 -v chromadb-data:/data -v chromadb-models:/models chromadb-enhanced:latest

./upload.sh -i /path/to/pdfs --store pdf -e stella -c ResearchLibrary
./upload.sh -i /path/to/source --store source-code -e stella -c CodeLibrary
./upload.sh -i /path/to/markdown --store markdown -e stella -c DocsLibrary
## 📁 Project Structure
| File | Purpose |
|------|---------|
| `Dockerfile` | Multi-model ChromaDB Docker image |
| `build.sh` | Build script for Docker image |
| `server.sh` | Server management script |
| `upload.sh` | Thin wrapper for the unified fast Python uploader |
| `upload.py` | Unified uploader (PDF + OCR, source code, markdown) with client-side MPS embedding |
| `embedding_functions.py` | Client-side embedding: `detect_device()`, `load_model()`, `embed_documents()` + server-side classes |
| `chunk_utils.py` | Token-aware, AST-aware, and markdown heading-aware chunking utilities |
| `seed_manifest.py` | Seed `.chroma-uploads.json` manifest from existing ChromaDB collections |
| `compare_manifest.py` | Compare filesystem repos to manifest (find un-uploaded repos) |
| `test.sh` | Complete setup testing |
| `check_deps.py` | Dependency checker (OCR + ASTChunk) |
| `requirements.txt` | Python dependencies (includes ASTChunk) |
| `pyproject.toml` | Modern Python packaging |
| `.gitignore` | Git ignore rules |
| `LICENSE` | MIT license |
## 📋 Installation & Dependencies
### Python Dependencies
```bash
# Install all dependencies (includes ASTChunk and Tesseract wrapper)
pip install .
# Check all dependencies are working (OCR + AST parsing)
python3 check_deps.py
# Development install
pip install -e ".[dev]"
```
- ASTChunk (`astchunk>=0.1.0`) - AST-aware source code chunking
- Tree-sitter - Multi-language parsing support (Python, Java, TypeScript, C#, etc.)
- Enhanced metadata extraction - Store-specific metadata for better retrieval
Choose your preferred OCR engine:
Option 1: Tesseract (Recommended - faster)
# Install system dependency
# macOS: brew install tesseract
# Ubuntu/Debian: sudo apt-get install tesseract-ocr
# CentOS/RHEL: sudo yum install tesseract
# Python wrapper already installed with: pip install .
# Ready to use (default engine)

Option 2: EasyOCR (Pure Python - no system deps)
# Install EasyOCR package
pip install ".[easyocr]"
# Use with --ocr-engine easyocr flag

| Model | Dimensions | Best For | Performance |
|---|---|---|---|
| stella | 1024 | Research papers, academic content | 🥇 Top MTEB performer |
| modernbert | 1024 | General purpose, latest tech | 🔬 State-of-the-art 2024 |
| bge-large | 1024 | Production deployments | 🏭 Battle-tested |
| default | 384 | Quick testing, compatibility | ⚡ Fast, lightweight |
Embedding is performed client-side using Apple Silicon's Metal Performance Shaders (MPS) or CUDA on NVIDIA GPUs. The Docker container stores pre-computed vectors without running any embedding models.
detect_device() automatically selects the best available device:
- MPS (Apple Silicon GPU) — preferred on macOS
- CUDA (NVIDIA GPU) — preferred on Linux
- CPU — fallback
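The priority order above can be sketched as a small pure function. This is an illustration only, not the actual `detect_device()` in `embedding_functions.py`: the names `pick_device` and `probe_torch` are hypothetical, and the sketch assumes PyTorch is the library that reports MPS and CUDA availability.

```python
from typing import Tuple


def pick_device(mps_available: bool, cuda_available: bool, no_gpu: bool = False) -> str:
    """Return 'mps', 'cuda', or 'cpu' following the priority list above."""
    if no_gpu:
        # Mirrors the intent of the --no-gpu flag: always use the CPU
        return "cpu"
    if mps_available:
        return "mps"
    if cuda_available:
        return "cuda"
    return "cpu"


def probe_torch() -> Tuple[bool, bool]:
    """Ask PyTorch which accelerators exist; report none if it is not installed."""
    try:
        import torch
    except ImportError:
        return (False, False)
    # Older PyTorch builds have no MPS backend, so look it up defensively
    mps_backend = getattr(torch.backends, "mps", None)
    mps_ok = bool(mps_backend is not None and mps_backend.is_available())
    return (mps_ok, bool(torch.cuda.is_available()))


if __name__ == "__main__":
    print(pick_device(*probe_torch()))
```

Keeping the decision separate from the hardware probe makes the priority order easy to check on a machine that has no GPU at all.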
# Default: auto-detect device (MPS on Apple Silicon)
python3 upload.py -c projects -i /path/to/code --store source-code -e stella
# Force CPU (disable GPU)
python3 upload.py -c projects -i /path/to/code --store source-code -e stella --no-gpu

Queries must also embed client-side with the same model. The `cdbsp` and `cdbsd` bash functions handle this automatically: they load Stella locally and use `query_embeddings` instead of `query_texts`.
# Search source code (client-side Stella embedding)
cdbsp "authentication middleware" 5
# Search docs (client-side Stella embedding)
cdbsd "deployment instructions" 5

Do not use `query_texts` directly against collections uploaded with client-side embedding. The server's default model produces vectors of a different dimension, which causes a dimension mismatch error.
If you try to upload to a collection that has embeddings with a different dimension (e.g., 384-dim from an old upload vs 1024-dim from Stella), the upload will abort with a clear message. Use --delete-collection to recreate with the correct dimensions.
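A minimal sketch of such a guard, assuming one existing vector from the collection is compared with one freshly computed vector before any write. The names `check_dimension` and `DimensionMismatchError` are hypothetical, and the uploader's real check and message may differ.

```python
from typing import Optional, Sequence


class DimensionMismatchError(ValueError):
    """Raised when new vectors do not match the collection's stored dimension."""


def check_dimension(
    existing: Optional[Sequence[float]],
    new: Sequence[float],
    collection_name: str,
) -> None:
    """Abort early if the collection already holds vectors of another size."""
    if existing is None:
        # Empty collection: there is nothing to conflict with
        return
    if len(existing) != len(new):
        raise DimensionMismatchError(
            f"Collection '{collection_name}' holds {len(existing)}-dim embeddings, "
            f"but this upload produces {len(new)}-dim vectors. "
            "Re-run with --delete-collection to recreate it."
        )
```

In practice the `existing` vector could be read with `collection.get(limit=1, include=["embeddings"])`. Running the check once per upload, before the first batch, avoids leaving a partially written collection behind.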
The upload script supports three optimized store types, each with tailored chunking and metadata extraction:
| Store Type | Chunk Size | Overlap | Processing | Best For |
|---|---|---|---|---|
| `pdf` | Auto-optimized | 10% overlap | OCR + Text extraction | Research papers, documents |
| `source-code` | Auto-optimized | 5% overlap | AST-aware chunking | Code analysis, API understanding |
| `markdown` | Auto-optimized | Smart overlap | Heading-aware for markdown | README, wikis, tutorials |
The system automatically optimizes chunk sizes for each embedding model:
- Stella: 400 tokens/chunk with 50% safety buffer (~640 chars)
- ModernBERT: 920 tokens/chunk (large context window)
- BGE-Large: 400 tokens/chunk with 50% safety buffer
- Default: 400 tokens/chunk with 50% safety buffer
AST-aware source code chunking:
- Automatically splits large functions at statement boundaries
- Preserves code structure and semantic meaning
- Uses conservative sizing to prevent token limit violations
Heading-aware markdown chunking:
- Respects H1-H6 heading hierarchy
- Keeps sections together when they fit in token limits
- Splits at subsection boundaries when sections are too large
- Preserves heading context in chunk metadata
- OCR Support: Automatic image-only PDF processing with Tesseract/EasyOCR
- Language Support: 100+ OCR languages supported
- Metadata: File size, extraction method, OCR confidence, image detection
- Git Project-Aware: Automatically detects `.git` directories and tracks project-level changes
- Smart Change Detection: Compares git commit hashes to detect when projects need re-indexing
- Respects .gitignore: Uses `git ls-files` to only index tracked files (applies to all store types inside git repos)
- Skips Common Junk Directories: `node_modules`, `.venv`, `venv`, `__pycache__`, `.tox`, `dist`, `build` are always excluded
- AST-Aware Chunking: Respects function/class boundaries using ASTChunk
- Language Support: 15+ programming languages (Python, Java, JS/TS, C#, Go, Rust, C/C++, PHP, Ruby, Kotlin, Scala, Swift)
- Enhanced Metadata: Programming language, function/class detection, import analysis, line counts, git project context
- Automatic Language Detection: Based on file extensions
- Project Search Depth: Control how deep to search for nested git projects
- Heading-Aware Markdown Chunking: Intelligently splits markdown at section boundaries
- Structure Preservation: Respects H1-H6 heading hierarchy
- Smart Splitting: Keeps sections together when possible, splits at subsections when needed
- Enhanced Metadata: Heading hierarchy, section depth, primary heading per chunk
- Content Analysis: Detects code blocks, links, and document structure
- Supported Formats: Markdown (`.md`), text (`.txt`), reStructuredText (`.rst`), AsciiDoc (`.adoc`), HTML, XML
When using --store source-code, the system automatically detects and manages git projects with intelligent change detection:
- Automatic Discovery: Finds `.git` directories to identify project boundaries
- Smart Change Detection: Compares git commit hashes to detect when re-indexing is needed
- Clean Updates: Deletes all existing chunks for a project when its commit hash changes
- Respects .gitignore: Uses `git ls-files` for all store types when inside a git repo (not just source-code)
- Skips Junk Directories: `node_modules`, `.venv`, `__pycache__`, `dist`, `build` are always excluded from directory walks
- Project Metadata: Every chunk includes git project context (name, commit hash, remote URL, branch)
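The file-discovery rules above might look roughly like the sketch below: prefer `git ls-files` when the directory is a git repo, otherwise walk the tree while pruning junk directories. The function names are hypothetical and the real logic in `upload.py` may differ. The exclusion set is the longer of the two lists given in this README.

```python
import os
import subprocess
from typing import List, Optional

JUNK_DIRS = {"node_modules", ".venv", "venv", "__pycache__", ".tox", "dist", "build"}


def tracked_files(root: str) -> Optional[List[str]]:
    """Return git-tracked paths relative to root, or None if git cannot list them."""
    try:
        out = subprocess.run(
            ["git", "-C", root, "ls-files", "-z"],
            capture_output=True, text=True, check=True,
        )
    except (subprocess.CalledProcessError, FileNotFoundError):
        return None  # not a git repo, or git is not installed
    # -z separates paths with NUL, so unusual file names survive intact
    return sorted(p for p in out.stdout.split("\0") if p)


def walk_files(root: str) -> List[str]:
    """Walk root, never descending into junk directories or .git."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk skips these subtrees entirely
        dirnames[:] = [d for d in dirnames if d not in JUNK_DIRS and d != ".git"]
        for name in filenames:
            found.append(os.path.relpath(os.path.join(dirpath, name), root))
    return sorted(found)


def list_files(root: str) -> List[str]:
    """Use the git index when available, else fall back to the filtered walk."""
    tracked = tracked_files(root)
    return tracked if tracked is not None else walk_files(root)
```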
--depth 1 # Only direct subdirectories (fast, good for organized workspaces)
--depth 2 # Two levels deep (includes some nested projects)
# No --depth # Unlimited depth (finds all nested git projects)

- First Run: Indexes all git-tracked files, stores commit hash with each chunk
- Subsequent Runs: Compares stored vs current commit hash
- If Changed: Deletes all project chunks and re-indexes all files
- If Unchanged: Uses regular file-by-file processing for new files only
- Automatic Cleanup: Moved/deleted files are automatically removed
- Project Context: Search results include which project and commit the code came from
- Efficient Updates: Only re-processes projects that have actually changed
- Workspace Friendly: Handles directories with multiple git projects gracefully
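The first-run, changed, and unchanged cases above reduce to comparing two commit hashes. A minimal sketch, assuming the stored hash has already been read back from the project's chunk metadata (`current_commit` and `plan_update` are hypothetical names, not the uploader's API):

```python
import subprocess
from typing import Optional


def current_commit(project_dir: str) -> Optional[str]:
    """Return the HEAD commit hash, or None if project_dir is not a usable git repo."""
    try:
        out = subprocess.run(
            ["git", "-C", project_dir, "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
    except (subprocess.CalledProcessError, FileNotFoundError):
        return None
    return out.stdout.strip()


def plan_update(stored_hash: Optional[str], current_hash: Optional[str]) -> str:
    """Decide what an upload run should do for one project."""
    if stored_hash is None:
        return "index_all"           # first run: nothing indexed yet
    if stored_hash != current_hash:
        return "delete_and_reindex"  # commit changed: drop old chunks, index again
    return "new_files_only"          # unchanged: only process files not yet indexed
```

A call such as `plan_update(stored, current_commit("./my-project"))` then selects the branch to run for each project.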
When using --store markdown with markdown files (.md), the system automatically uses heading-aware chunking:
- Hierarchical Chunking: Splits at H1-H6 heading boundaries
- Smart Section Grouping: Keeps related content together when it fits within token limits
- Subsection Splitting: Automatically splits large sections at subsection boundaries
- Heading Context: Each chunk includes full heading hierarchy in metadata
- Token-Optimized: Respects model-specific token limits (430 tokens for Stella, 880 for ModernBERT)
- Parse Structure: Identifies all headings (H1-H6) and their content
- Build Hierarchy: Tracks parent-child relationships between sections
- Smart Grouping: Combines consecutive sections that fit within token limits
- Intelligent Splitting: When sections exceed limits, splits at subsection boundaries
- Metadata Enrichment: Adds heading hierarchy, section depth, and primary heading to each chunk
Each markdown chunk includes:
- `markdown_headings`: Full heading hierarchy (e.g., "Introduction > Getting Started > Installation")
- `markdown_primary_heading`: The main heading for this chunk
- `markdown_section_depth`: Nesting level of the section (0 = no headings, 1 = H1, 2 = H2, etc.)
- `markdown_heading_aware`: Flag indicating heading-aware chunking was used
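To make the parse, group, and enrich steps concrete, here is a simplified sketch of heading-aware chunking. It is not the code in `chunk_utils.py`. It assumes whitespace-separated words as a stand-in for real tokens, supports only ATX (`#`) headings, and does not split a single section that is larger than the limit. All function names are hypothetical. The metadata keys are the ones listed above.

```python
import re
from typing import Dict, List

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
FENCE = "`" * 3  # code-fence marker, built this way to keep the example readable


def split_sections(text: str) -> List[Dict]:
    """Split markdown at every heading, tagging each section with its heading path."""
    sections: List[Dict] = []
    path: List[str] = []    # current hierarchy, e.g. ["Intro", "Install"]
    levels: List[int] = []  # heading level of each entry in path
    current: Dict = {"headings": [], "depth": 0, "lines": []}
    in_fence = False
    for line in text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # '#' lines inside code fences are not headings
        match = None if in_fence else HEADING_RE.match(line)
        if match:
            if any(l.strip() for l in current["lines"]):
                sections.append(current)
            level = len(match.group(1))
            # Drop headings at the same or a deeper level before adding this one
            while levels and levels[-1] >= level:
                levels.pop()
                path.pop()
            levels.append(level)
            path.append(match.group(2))
            current = {"headings": list(path), "depth": level, "lines": [line]}
        else:
            current["lines"].append(line)
    if any(l.strip() for l in current["lines"]):
        sections.append(current)
    return sections


def count_tokens(text: str) -> int:
    """Crude token estimate; a real tokenizer would give different numbers."""
    return len(text.split())


def build_chunk(sections: List[Dict]) -> Dict:
    """Join sections into one chunk; metadata comes from the first section."""
    first = sections[0]
    return {
        "text": "\n".join("\n".join(s["lines"]) for s in sections),
        "metadata": {
            "markdown_headings": " > ".join(first["headings"]),
            "markdown_primary_heading": first["headings"][-1] if first["headings"] else "",
            "markdown_section_depth": first["depth"],
            "markdown_heading_aware": True,
        },
    }


def chunk_markdown(text: str, max_tokens: int = 430) -> List[Dict]:
    """Group consecutive sections while they fit within max_tokens."""
    chunks: List[Dict] = []
    buffer: List[Dict] = []
    used = 0
    for section in split_sections(text):
        size = count_tokens("\n".join(section["lines"]))
        if buffer and used + size > max_tokens:
            chunks.append(build_chunk(buffer))
            buffer, used = [], 0
        buffer.append(section)
        used += size
    if buffer:
        chunks.append(build_chunk(buffer))
    return chunks
```

Taking the metadata from the first section of each chunk is one possible choice; it keeps `markdown_primary_heading` aligned with the text a reader sees first.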
# Process markdown documentation with heading-aware chunking
./upload.sh -i /path/to/markdown/docs --store markdown -e stella -c MarkdownDocs
# Query by section using metadata filters (client-side embedding required)
python3 -c "
import sys
sys.path.insert(0, '.')
from embedding_functions import load_model, embed_documents
import chromadb
model = load_model('stella')
query_vec = embed_documents(model, ['How do I install?'])[0]
client = chromadb.HttpClient(host='localhost', port=9000)
collection = client.get_collection('MarkdownDocs')
# Find all chunks from 'Installation' section
results = collection.query(
query_embeddings=[query_vec],
where={'markdown_primary_heading': 'Installation'},
n_results=5
)
"
# View heading structure of indexed documents
python3 -c "
import chromadb
client = chromadb.HttpClient(host='localhost', port=9000)
collection = client.get_collection('MarkdownDocs')
docs = collection.get(include=['metadatas'], limit=20)
for meta in docs['metadatas']:
    if 'markdown_headings' in meta:
        print(f'{meta[\"filename\"]}: {meta[\"markdown_headings\"]}')
"

- Better Semantic Search: Chunks aligned with document structure
- Section-Aware Queries: Filter results by specific sections
- Context Preservation: Full heading hierarchy provides better context
- Improved Retrieval: More relevant results due to semantic boundaries
# Start with Stella embeddings (recommended)
./server.sh -m stella
# Start with ModernBERT on custom port
./server.sh -m modernbert -p 9001
# Start with BGE-Large for production
./server.sh -m bge-large

# View logs
./server.sh --logs
# Stop server
./server.sh --stop
# Restart with different model
./server.sh --restart -m modernbert

ChromaDB currently does not provide built-in aggregate functions or SQL-like DISTINCT operations for efficiently retrieving unique metadata values. This limitation affects scenarios where you need to:
- Get a list of unique project names from a large collection
- Count distinct values in metadata fields
- Perform aggregate operations on metadata
Current Workaround: The most efficient approach available is to retrieve metadata-only results in small batches and manually deduplicate using Python sets:
# Get all metadata without document content
all_metadatas = collection.get(include=["metadatas"])["metadatas"]
# Extract unique values using a Python set, skipping chunks that lack the field
unique_projects = {meta["git_project_name"] for meta in all_metadatas if meta.get("git_project_name")}
unique_projects = list(unique_projects)

Community Request: This feature has been actively requested by the ChromaDB community. You can track progress and add your support at:
- GitHub Issue: Query with unique metadata filter #2873
Impact: For large collections (thousands of documents), retrieving unique metadata values requires scanning all documents, which is the current best practice until native aggregation support is added to ChromaDB.
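For large collections, the same scan can be paged so that only one batch of metadata is held in memory at a time. The sketch below relies on the `limit` and `offset` parameters of `Collection.get()`; the helper name `unique_metadata_values` and the default batch size are assumptions for illustration.

```python
from typing import Any, Set


def unique_metadata_values(collection: Any, key: str, batch_size: int = 1000) -> Set[str]:
    """Scan metadata page by page and collect the distinct values stored under key."""
    seen: Set[str] = set()
    offset = 0
    while True:
        page = collection.get(include=["metadatas"], limit=batch_size, offset=offset)
        metadatas = page["metadatas"] or []
        if not metadatas:
            break  # past the last record
        for meta in metadatas:
            value = (meta or {}).get(key)
            if value is not None:
                seen.add(value)
        offset += len(metadatas)
    return seen
```

For example, `unique_metadata_values(collection, "git_project_name")` returns the set of project names. The scan is still linear in the number of chunks; paging only bounds memory use.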
When uploading large files (especially minified JavaScript or large source files), you may encounter "413 Payload Too Large" errors. The system now provides fail-fast error handling with clear recovery options:
# If you get a payload error, the system will show:
❌ PAYLOAD TOO LARGE ERROR
📁 File: /path/to/aws-amplify.min.js
📊 File size: 1,234,567 bytes
🧩 Total chunks: 156
💾 Batch payload: ~2,500,000 characters
💡 RECOMMENDATIONS:
Suggested chunk size: 800 tokens
Suggested batch size: 25
🔧 RECOVERY OPTIONS:
1. Reduce chunk size: --chunk-size 800 --batch-size 25
2. Delete partial project: --delete-project my-project

# Preview chunk sizes before uploading (dry-run)
./upload.sh --dry-run -i /path/to/source --store source-code
# Upload with conservative settings for large files
./upload.sh -i /path/to/source --store source-code --chunk-size 800 --batch-size 25
# Auto-cleanup failed projects
./upload.sh -i /path/to/source --store source-code --delete-failed-project

# Delete specific project from collection
./upload.sh --delete-project my-project-name -c MyCollection
# List available projects (shown when project not found)
./upload.sh --delete-project nonexistent -c MyCollection

# Basic PDF upload with OCR (auto-optimized: 460 tokens for Stella)
./upload.sh -i /path/to/pdfs --store pdf -e stella -c ResearchLibrary
# Multi-language OCR support
./upload.sh -i /path/to/pdfs --store pdf -e stella --ocr-language fra -c FrenchPapers
./upload.sh -i /path/to/pdfs --store pdf -e stella --ocr-engine easyocr --ocr-language es -c SpanishPapers
# Disable OCR for text-only PDFs (faster processing)
./upload.sh -i /path/to/pdfs --store pdf -e stella --disable-ocr -c TextOnlyPDFs

# Git project-aware source code chunking (auto-optimized: 400 tokens for Stella)
./upload.sh -i /path/to/source --store source-code -e stella -c CodeLibrary
# Only scan direct subdirectories for git projects
./upload.sh -i /workspace --store source-code -e stella -c MainProjects --depth 1
# Process specific git project (detects changes via commit hash)
./upload.sh -i ./my-project --store source-code -e stella -c MyProject --delete-collection
./upload.sh -i ./my-project --store source-code -e stella -c MyProject # Re-run: only processes if changed
# Multi-project workspace processing
./upload.sh -i /workspace --store source-code -e stella -c AllProjects
./upload.sh -i /workspace --store source-code -e stella -c AllProjects --depth 2
# Language-specific collections
./upload.sh -i ./python_project --store source-code -e stella -c PythonCode
./upload.sh -i ./java_project --store source-code -e stella -c JavaCode
# Custom chunking only if needed (overrides auto-optimization)
./upload.sh -i /path/to/source --store source-code -e stella --chunk-size 300 -c SmallChunks

# Optimized markdown processing (auto-optimized: 430 tokens for Stella)
./upload.sh -i /path/to/docs --store markdown -e stella -c DocsLibrary
# Process specific markdown types
./upload.sh -i ./wiki --store markdown -e stella -c ProjectWiki
./upload.sh -i ./tutorials --store markdown -e stella -c Tutorials

# Create specialized collections per content type
./upload.sh -i ./papers --store pdf -e stella -c Research --delete-collection
./upload.sh -i ./codebase --store source-code -e stella -c CodeAnalysis --delete-collection
./upload.sh -i ./documentation --store markdown -e stella -c ProjectDocs --delete-collection
# Git project-aware workflows
./upload.sh -i /workspace --store source-code -e stella -c WorkspaceCode --depth 1 # Top-level projects only
./upload.sh -i /workspace/thirdparty --store source-code -e stella -c ThirdPartyCode --depth 2 # Include nested libs
# Mixed source code and markdown
./upload.sh -i ./my-project --store source-code -e stella -c MyProject --delete-collection
./upload.sh -i ./my-project/docs --store markdown -e stella -c MyProjectDocs --delete-collection
# Custom chunking only when needed (overrides auto-optimization)
./upload.sh -i /path/to/files --store pdf --chunk-size 300 --chunk-overlap 30 -c SmallChunks
# Remote server deployment
./upload.sh -i /path/to/files --store pdf -h production-server.com -p 8000 -e modernbert
# Incremental git project updates (only re-processes changed projects)
./upload.sh -i /workspace --store source-code -e stella -c DevEnvironment # Daily runs

Embeddings are computed client-side (Apple Silicon MPS, NVIDIA CUDA, or CPU), not inside Docker. The Docker container is a pure storage layer.
flowchart TB
subgraph Client["Client (macOS / Apple Silicon)"]
direction TB
UPLOAD["upload.py<br/>PDFs + OCR, Source Code + AST, Markdown"]
QUERY["cdbsp / cdbsd<br/>query functions"]
subgraph GPU["Stella-400M on MPS (GPU)"]
EMBED["embed_documents()<br/>1024-dim vectors"]
end
DETECT["detect_device()<br/>mps / cuda / cpu"]
end
subgraph Docker["Docker Container (Storage Only)"]
CHROMA["ChromaDB Server :9000"]
STORE[("Collections<br/>projects (source-code, 1024d)<br/>docs (markdown/pdf, 1024d)")]
end
UPLOAD --> DETECT
QUERY --> DETECT
DETECT --> EMBED
UPLOAD -->|"documents + metadatas"| CHROMA
EMBED -->|"embeddings="| CHROMA
EMBED -->|"query_embeddings="| CHROMA
CHROMA --> STORE
CHROMA -->|"ranked results"| QUERY
Key: Both uploads and queries embed client-side with the same Stella model, then pass pre-computed vectors to ChromaDB via embeddings= (uploads) or query_embeddings= (queries). The server never computes embeddings.
The source code store type supports 15+ programming languages with automatic detection:
| Language | Extensions | AST Parser | Enhanced Metadata |
|---|---|---|---|
| Python | `.py` | ✅ tree-sitter-python | Functions, classes, imports |
| Java | `.java` | ✅ tree-sitter-java | Methods, classes, packages |
| JavaScript | `.js`, `.jsx` | ✅ tree-sitter-typescript | Functions, objects, imports |
| TypeScript | `.ts`, `.tsx` | ✅ tree-sitter-typescript | Types, interfaces, modules |
| C# | `.cs` | ✅ tree-sitter-c-sharp | Methods, classes, namespaces |
| Go | `.go` | ✅ tree-sitter-go | Functions, structs, packages |
| Rust | `.rs` | ✅ tree-sitter-rust | Functions, traits, modules |
| C/C++ | `.c`, `.cpp` | ✅ tree-sitter-cpp | Functions, classes, includes |
| PHP | `.php` | ✅ tree-sitter-php | Functions, classes, namespaces |
| Ruby | `.rb` | ✅ tree-sitter-ruby | Methods, classes, modules |
| Kotlin | `.kt` | ✅ (via Java parser) | Classes, functions, packages |
| Scala | `.scala` | ✅ (via Java parser) | Objects, classes, traits |
| Swift | `.swift` | ✅ (via C parser) | Functions, classes, protocols |
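Extension-based detection of this kind is usually a dictionary lookup. The sketch below uses the extensions from the table. The language identifiers on the right are assumptions, apart from `python`, which matches the `programming_language` value in the metadata example in this README. `detect_language` is a hypothetical name.

```python
import os
from typing import Optional

# Extensions come from the table above; the identifiers are illustrative
EXTENSION_TO_LANGUAGE = {
    ".py": "python",
    ".java": "java",
    ".js": "javascript",
    ".jsx": "javascript",
    ".ts": "typescript",
    ".tsx": "typescript",
    ".cs": "csharp",
    ".go": "go",
    ".rs": "rust",
    ".c": "c",
    ".cpp": "cpp",
    ".php": "php",
    ".rb": "ruby",
    ".kt": "kotlin",
    ".scala": "scala",
    ".swift": "swift",
}


def detect_language(path: str) -> Optional[str]:
    """Map a file path to a language name, or None for unsupported files."""
    _, ext = os.path.splitext(path)
    return EXTENSION_TO_LANGUAGE.get(ext.lower())
```

Files that map to `None` would presumably go through plain text chunking or be skipped; which of the two the uploader does is not stated here.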
Traditional Text Chunking Problems:
# ❌ Basic chunking might split mid-function
def calculate_api_response(data):
    # Processing logic here...
    return result
# CHUNK BREAK - Context lost!
class DatabaseManager:
    def connect(self):

AST-Aware Chunking Solution:
# ✅ ASTChunk preserves semantic boundaries
def calculate_api_response(data):
    """Complete function with docstring intact"""
    # All related logic stays together
    return result
# New chunk starts at natural boundary
class DatabaseManager:
    """Complete class with all methods"""
    def connect(self):
        # Method implementation complete

Each source code chunk includes rich metadata for precise retrieval:
{
  "store_type": "source-code",
  "programming_language": "python",
  "file_extension": ".py",
  "has_functions": true,
  "has_classes": true,
  "has_imports": true,
  "line_count": 45,
  "ast_chunked": true,
  "text_extraction_method": "astchunk_python"
}

Perfect for:
- 🔍 API Discovery: Find similar function signatures across projects
- 📚 Usage Examples: Locate how specific APIs are used in practice
- 🔧 Implementation Patterns: Discover common coding patterns and practices
- 🐛 Error Handling: Find error handling approaches for specific scenarios
- 📖 Documentation Gap Filling: When official docs are lacking or incomplete
Query Examples:
- "How to authenticate with REST APIs in Python?"
- "Show me error handling patterns for database connections"
- "Find examples of async/await usage in JavaScript"
- "What are common patterns for dependency injection in Java?"
The markdown store type is specifically tuned for technical content:
Supported Formats:
- Markdown (`.md`) - README files, wikis, technical guides
- Text (`.txt`) - Plain text documentation
- reStructuredText (`.rst`) - Python documentation standard
- AsciiDoc (`.adoc`) - Technical documentation format
- HTML (`.html`) - Web documentation
- XML (`.xml`) - Structured documentation
Documentation chunks include intelligent content detection:
{
  "store_type": "markdown",
  "doc_type": "markdown",
  "has_code_blocks": true,
  "has_links": true,
  "line_count": 89,
  "text_extraction_method": "direct_read"
}

Perfect for:
- 📖 Project Onboarding: Quickly understand new codebases and their documentation
- 🔗 Cross-Reference Discovery: Find related documentation across different projects
- 💡 Best Practice Learning: Extract patterns and recommendations from documentation
- 🏗️ Architecture Understanding: Grasp system design from architectural docs
- 🚀 Setup Instructions: Locate installation and configuration guides
Query Examples:
- "How to set up development environment for this project?"
- "What are the deployment procedures and requirements?"
- "Find architectural decisions and design patterns used"
- "Show me configuration examples and environment variables"
If currently using PersistentClient or basic PDF-only setup:
# 1. Rebuild with enhanced capabilities
./build.sh
# 2. Start server
./server.sh -m stella
# 3. Migrate existing PDFs with explicit store type
./upload.sh -i /path/to/pdfs --store pdf -e stella --delete-collection
# 4. Add new content types
./upload.sh -i /path/to/source --store source-code -e stella -c CodeLibrary
./upload.sh -i /path/to/docs --store markdown -e stella -c DocsLibrary

Then update your claude.json MCP configuration to use localhost:9000.
# Run all tests (includes new store types)
./test.sh
# Test each store type individually
./upload.sh -i ./embedding_functions.py --store source-code -e stella -l 1 -c TestSource --delete-collection
./upload.sh -i ./README.md --store markdown -e stella -l 1 -c TestDocs --delete-collection
./upload.sh -i /path/to/test.pdf --store pdf -e stella -l 1 -c TestPDF --delete-collection

# Check if ASTChunk is working properly
python3 -c "
import astchunk
from astchunk import ASTChunkBuilder
print('✅ ASTChunk available and ready')
configs = {'max_chunk_size': 1000, 'language': 'python', 'metadata_template': 'default'}
chunker = ASTChunkBuilder(**configs)
print('✅ ASTChunk chunker initialized successfully')
"

# Query and inspect metadata for different store types
python3 -c "
import chromadb
client = chromadb.HttpClient(host='localhost', port=9000)
# Check source code metadata
try:
    collection = client.get_collection('TestSource')
    docs = collection.get(limit=1, include=['metadatas'])
    metadata = docs['metadatas'][0]
    print('Source Code Metadata:')
    print(f' Language: {metadata.get(\"programming_language\", \"N/A\")}')
    print(f' Has Functions: {metadata.get(\"has_functions\", \"N/A\")}')
    print(f' AST Chunked: {metadata.get(\"ast_chunked\", \"N/A\")}')
    print('✅ Source code metadata validated')
except Exception:
    # Collection missing or empty
    print('⚠️ No source code collection found')
# Check markdown metadata
try:
    collection = client.get_collection('TestDocs')
    docs = collection.get(limit=1, include=['metadatas'])
    metadata = docs['metadatas'][0]
    print('Markdown Metadata:')
    print(f' Doc Type: {metadata.get(\"doc_type\", \"N/A\")}')
    print(f' Has Code Blocks: {metadata.get(\"has_code_blocks\", \"N/A\")}')
    print(f' Has Links: {metadata.get(\"has_links\", \"N/A\")}')
    print('✅ Markdown metadata validated')
except Exception:
    # Collection missing or empty
    print('⚠️ No markdown collection found')
"

# Upload script configuration
export PDF_INPUT_PATH=/path/to/files # Input path (works with all store types)
# Server configuration (Docker container — storage only, no embedding)
export CHROMA_EMBEDDING_MODEL=stella # Legacy: server model (unused with client-side embedding)
export TRANSFORMERS_CACHE=/models # Model cache directory (inside Docker)
export HF_HOME=/models # Hugging Face cache directory (inside Docker)
# Store-specific defaults (optional)
export DEFAULT_STORE_TYPE=pdf # Default store type
# Note: Chunk sizes are now auto-optimized per embedding model

# Check Docker
docker ps
# View server logs
./server.sh --logs
# Restart server
./server.sh --restart

# Test server connection
curl http://localhost:9000/api/v2/heartbeat
# Check all dependencies including OCR and ASTChunk
python3 -c "import chromadb, fitz, astchunk, PIL; print('✅ All Dependencies OK')"
# Test OCR functionality (EasyOCR)
python3 -c "import easyocr; print('✅ EasyOCR available')"
# Test Tesseract if using it
python3 -c "import pytesseract; print('Tesseract Version:', pytesseract.get_tesseract_version())"
# Test ASTChunk functionality
python3 -c "from astchunk import ASTChunkBuilder; print('✅ ASTChunk available')"
# Test with smaller uploads for each store type
./upload.sh -i /path/to/test.pdf --store pdf -e stella -l 1 -c TestPDF --delete-collection
./upload.sh -i ./embedding_functions.py --store source-code -e stella -l 1 -c TestCode --delete-collection

# EasyOCR issues (should work out of the box)
python3 -c "import easyocr; print('EasyOCR OK')"
# Tesseract issues (if using --ocr-engine tesseract)
tesseract --version
pip install ".[tesseract]"
# Test with OCR disabled if having issues
./upload.sh -i /path/to/pdfs --store pdf -e stella --disable-ocr -l 1 -c TestCollection --delete-collection

# Verify ASTChunk installation
python3 -c "import astchunk; from astchunk import ASTChunkBuilder; print('ASTChunk working')"
# Test with basic chunking fallback if ASTChunk fails
./upload.sh -i ./test.py --store source-code -e stella -l 1 -c TestFallback --delete-collection
# Check tree-sitter language parsers
python3 -c "
import tree_sitter_python
import tree_sitter_java
import tree_sitter_typescript
print('✅ Tree-sitter parsers available')
"
# Manual ASTChunk test
python3 -c "
from astchunk import ASTChunkBuilder
configs = {'max_chunk_size': 1000, 'language': 'python', 'metadata_template': 'default'}
chunker = ASTChunkBuilder(**configs)
result = chunker.chunkify('def hello(): print(\"Hello World\")')
print(f'✅ ASTChunk test successful: {len(result)} chunks')
"

- Ensure Docker has sufficient memory (8GB+ recommended)
- Check network connectivity for model downloads
- Verify disk space (~10GB for all models)
- Choose the Right Store Type:
  - `--store pdf` for research papers and documents
  - `--store source-code` for API understanding and code analysis
  - `--store markdown` for README files and technical guides
- Collection Organization:
  - Use descriptive collection names: `ResearchLibrary`, `CodeLibrary`, `DocsLibrary`
  - Separate collections by content type for better semantic coherence
  - Consider language-specific collections for source code: `PythonCode`, `JavaCode`
- Model Selection by Use Case:
  - Stella (recommended): Best for research papers and technical content
  - ModernBERT: Latest technology, good for mixed content
  - BGE-Large: Production-ready, reliable for all content types
- Model-Optimized Chunking (2024 Update):
  - Use default auto-optimization for best results (no `--chunk-size` needed)
  - System automatically respects each model's token limits with safety margins
  - Source code benefits from AST-aware chunking (automatic with ASTChunk)
  - Only override chunking for special requirements (e.g., very small chunks)
- Resource Management:
  - Ensure Docker has 8GB+ RAM for optimal performance
  - ASTChunk requires additional memory for multiple language parsers
  - Monitor disk space for model downloads (~10GB total)
- PDF Processing:
  - Enable OCR by default (handles image-only PDFs)
  - Test with different OCR engines if accuracy issues occur
  - Use `--ocr-language` for non-English documents
- Source Code Processing:
  - Let ASTChunk handle chunking automatically (preserves function boundaries)
  - Include test files, which often contain the best usage examples
  - Process entire project directories for complete context
- Documentation Processing:
  - Include all related docs in the same collection for cross-referencing
  - Markdown files provide the richest structural information
  - Smaller chunk sizes work better for precise documentation retrieval
- Testing & Validation:
  - Always test with small uploads first (`-l 5`)
  - Verify metadata is populated correctly for each store type
  - Use `python3 check_deps.py` to validate all dependencies
- Backup & Recovery:
  - Backup collections before major changes
  - Keep source files organized for re-processing if needed
  - Document your embedding model choices for consistency
- Start ChromaDB Server: `./server.sh -m stella`
- Configure MCP in claude.json:

      {
        "mcpServers": {
          "chroma-docker": {
            "command": "docker",
            "args": [
              "run", "-i", "--rm", "--network", "host",
              "mcp/chroma", "chroma-mcp",
              "--client-type", "http",
              "--host", "localhost",
              "--port", "9000",
              "--ssl", "false"
            ]
          }
        }
      }

- Test Connection: `curl http://localhost:9000/api/v2/heartbeat`
- Restart Claude Code to load the configuration
- ✅ Client-Side GPU Embedding: Stella-400m on Apple Silicon MPS — no fan spin, fast uploads
- ✅ Multi-Format Support: PDFs, source code, and markdown in one system
- ✅ AST-Aware Code Analysis: Semantic chunking preserves function boundaries
- ✅ Enhanced Metadata: Store-specific metadata for precise retrieval
- ✅ OCR Support: Automatically processes image-only PDFs
- ✅ Dimension Mismatch Protection: Catches mismatched embeddings before upload
- ✅ Centralized Management: One server for all content types
- ✅ Research & Development Optimized: Designed for technical workflows
- Support for additional embedding models
- Model fine-tuning capabilities
- Multi-modal embeddings (text + images)
- Distributed embedding clusters
- Model performance benchmarking