```sql
-- Document chunks with metadata
CREATE TABLE chunks (
    id TEXT PRIMARY KEY,
    source TEXT NOT NULL,       -- Original file path
    title TEXT,                 -- Extracted section title
    content TEXT NOT NULL,      -- Chunk text
    chunk_index INTEGER,        -- Position within source
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
```
```sql
-- FTS5 full-text index (external content, backed by the chunks table)
CREATE VIRTUAL TABLE chunks_fts USING fts5(
    content,
    title,
    content=chunks,
    content_rowid=rowid
);
```
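Because `chunks_fts` is declared with `content=chunks`, it is an external-content table: SQLite does not update it when `chunks` changes, so the schema needs sync triggers (the test suite's "FTS triggers" cover this behavior). A minimal sketch of the pattern against an in-memory database; the trigger names are illustrative, not necessarily the project's, and an `UPDATE` trigger (delete-then-insert) is omitted for brevity:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE chunks (
    id TEXT PRIMARY KEY,
    source TEXT NOT NULL,
    title TEXT,
    content TEXT NOT NULL
);
CREATE VIRTUAL TABLE chunks_fts USING fts5(
    content, title, content=chunks, content_rowid=rowid
);
-- Keep the external-content index in sync with the base table
CREATE TRIGGER chunks_ai AFTER INSERT ON chunks BEGIN
    INSERT INTO chunks_fts(rowid, content, title)
    VALUES (new.rowid, new.content, new.title);
END;
CREATE TRIGGER chunks_ad AFTER DELETE ON chunks BEGIN
    INSERT INTO chunks_fts(chunks_fts, rowid, content, title)
    VALUES ('delete', old.rowid, old.content, old.title);
END;
""")
conn.execute(
    "INSERT INTO chunks VALUES ('c1', 'a.md', 'Meshing', 'Refine the mesh near boundaries')"
)
rows = conn.execute(
    "SELECT rowid FROM chunks_fts WHERE chunks_fts MATCH 'mesh'"
).fetchall()
print(rows)  # the new row is searchable without a manual index rebuild
```

Without the insert trigger, the `MATCH` query above would return nothing until the index was rebuilt.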
```sql
-- Vector embeddings (via sqlite-vec)
CREATE VIRTUAL TABLE chunks_vec USING vec0(
    embedding float[384]        -- Dimension matches model
);
```
```sql
-- Source file tracking for incremental updates
CREATE TABLE sources (
    path TEXT PRIMARY KEY,
    hash TEXT NOT NULL,
    indexed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
```

Documents are split using a header-aware algorithm:
- Primary split: Markdown headers (`##`, `###`, etc.)
- Secondary split: if a section exceeds `chunk_size`, split on paragraph boundaries
- Tertiary split: if still too large, split on sentence boundaries
- Overlap: each chunk includes `chunk_overlap` characters from the previous chunk's end
Section titles are preserved as metadata for better context in search results.
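The splitting order above can be sketched as follows. This is a simplified illustration, not the project's implementation: it omits title extraction and uses plain character counts, but shows the header → paragraph → sentence fallback and the trailing overlap carry-over:

```python
import re

def split_chunks(text: str, chunk_size: int = 1500, chunk_overlap: int = 200) -> list[str]:
    """Header-aware splitter sketch: headers first, then paragraphs, then sentences."""
    # Primary: split on Markdown headers (##, ###, ...)
    sections = re.split(r"(?m)^(?=#{2,6} )", text)
    chunks: list[str] = []
    for section in sections:
        if len(section) <= chunk_size:
            if section.strip():
                chunks.append(section)
            continue
        # Secondary: paragraph boundaries
        buf = ""
        for part in section.split("\n\n"):
            # Tertiary: sentence boundaries, if a paragraph is still too large
            pieces = re.split(r"(?<=[.!?]) ", part) if len(part) > chunk_size else [part]
            for p in pieces:
                if buf and len(buf) + len(p) > chunk_size:
                    chunks.append(buf)
                    buf = buf[-chunk_overlap:]  # carry overlap from previous chunk's end
                buf = (buf + "\n\n" + p) if buf else p
        if buf.strip():
            chunks.append(buf)
    return chunks

doc = "## Intro\nShort intro.\n\n## Detail\n" + ("A sentence about meshing. " * 200)
chunks = split_chunks(doc)
print(len(chunks), max(len(c) for c in chunks))
```

The short "Intro" section survives as a single chunk, while the oversized "Detail" section is broken at sentence boundaries into roughly `chunk_size`-character pieces.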
Hybrid search combines FTS5 keyword matching with vector similarity using Reciprocal Rank Fusion (RRF):
score = Σ 1/(k + rank_i)
Where k=60 (standard RRF constant) and rank_i is the rank from each method.
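The fusion falls out directly from the formula: each ranked list contributes `1/(k + rank)` per document, and the sums are sorted. A sketch with illustrative document IDs:

```python
def rrf_fuse(keyword_ids: list[str], semantic_ids: list[str], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over methods of 1 / (k + rank_d)."""
    scores: dict[str, float] = {}
    for ranking in (keyword_ids, semantic_ids):
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Documents ranked by BOTH methods beat documents ranked highly by only one
fused = rrf_fuse(["a", "b", "c"], ["b", "c", "d"])
print(fused)  # ['b', 'c', 'a', 'd']
```

Note how `b` (ranked 2nd and 1st) outscores `a` (ranked 1st by keyword only): 1/62 + 1/61 > 1/61. The large constant `k=60` damps the advantage of a single top rank.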
Keyword search uses SQLite FTS5 with BM25 ranking. Best for exact terms, API names, and error messages.
Semantic search uses cosine similarity on embeddings. It handles vocabulary mismatch ("make grid finer" → "mesh refinement").
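Cosine similarity itself is just the normalized dot product of two embedding vectors. A minimal stdlib version for illustration (in practice the server delegates this to sqlite-vec, and real vectors are 384-dimensional, not 3):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """cos(theta) = (a . b) / (|a| * |b|); 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Toy 3-d "embeddings" standing in for model output
q = [0.1, 0.9, 0.2]
sim_close = cosine_similarity(q, [0.1, 0.8, 0.3])  # similar direction -> near 1.0
sim_far = cosine_similarity(q, [0.9, 0.1, 0.0])    # different direction -> much lower
print(round(sim_close, 3), round(sim_far, 3))
```

Because the score depends only on direction, two chunks phrased very differently but embedded near each other ("make grid finer" vs. "mesh refinement") still score high.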
```
--output PATH        Output database path (default: db/docs.db)
--chunk-size N       Target chunk size in characters (default: 1500)
--chunk-overlap N    Overlap between chunks (default: 200)
--embedding-model    Model name (default: BAAI/bge-small-en-v1.5)
--no-embeddings      Skip embedding generation (testing only)
--verbose            Show progress details
```
```bash
./scripts/convert_html.sh /path/to/html ./markdown
uv run python build_index.py ./markdown --output db/docs.db
```

The script hashes files and skips unchanged content.
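The skip logic amounts to comparing a content hash against the `sources` table from the schema above. A sketch; the hash algorithm (SHA-256 here) and function name are assumptions, not the script's actual code:

```python
import hashlib
import sqlite3

def needs_reindex(conn: sqlite3.Connection, path: str, content: bytes) -> bool:
    """True if the file is new or its content changed since last indexing."""
    digest = hashlib.sha256(content).hexdigest()  # hash choice is an assumption
    row = conn.execute("SELECT hash FROM sources WHERE path = ?", (path,)).fetchone()
    if row is not None and row[0] == digest:
        return False  # unchanged: skip re-chunking and re-embedding
    conn.execute(
        "INSERT INTO sources(path, hash) VALUES (?, ?) "
        "ON CONFLICT(path) DO UPDATE SET hash = excluded.hash",
        (path, digest),
    )
    return True

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sources (path TEXT PRIMARY KEY, hash TEXT NOT NULL)")
r1 = needs_reindex(conn, "a.md", b"hello")   # new file
r2 = needs_reindex(conn, "a.md", b"hello")   # unchanged
r3 = needs_reindex(conn, "a.md", b"hello!")  # content changed
print(r1, r2, r3)  # True False True
```

Since embedding generation dominates indexing time, skipping unchanged files makes re-runs on a mostly stable corpus nearly instant.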
```bash
uv run python -c "
import sqlite3
conn = sqlite3.connect('db/docs.db')
print(f'Chunks: {conn.execute(\"SELECT COUNT(*) FROM chunks\").fetchone()[0]}')
print(f'Sources: {conn.execute(\"SELECT COUNT(*) FROM sources\").fetchone()[0]}')
"
```

For documentation with semantic HTML (`<h1>`, `<h2>`, `<p>` tags):
```bash
./scripts/convert_html.sh /path/to/html/docs ./markdown
```

For other formats:
```bash
pandoc -f rst -t gfm input.rst -o output.md    # Sphinx RST
pandoc -f docx -t gfm input.docx -o output.md  # Word docs
```

For projects that host Markdown docs on GitHub:
```bash
python scripts/fetch_github_docs.py <owner/repo> <docs_path> <output_dir> [--ref <branch>]
```

Example:

```bash
python scripts/fetch_github_docs.py some-org/lib docs ./markdown/lib --ref v2.0
```

The script uses the GitHub API to recursively download all .md files while preserving directory structure.
For projects hosted on ReadTheDocs (or similar Sphinx-based sites):
```bash
python scripts/fetch_rtd_docs.py <base_url> <output_dir>
```

Example:

```bash
python scripts/fetch_rtd_docs.py https://mph.readthedocs.io/en/stable/ ./markdown/MPh
```

The script crawls from the base URL, following internal links, and converts HTML to Markdown using pandoc. It extracts the main content and skips navigation and theme elements.
```bash
uv run pytest tests/ -v
```

Tests cover chunking, indexing, FTS triggers, RRF scoring, and search functionality. All tests use temporary databases and clean up after themselves.
Skills help LLMs know when to use your MCP server. Create a skill file for each documentation set.
Both Claude Code and Codex use the Agent Skills specification:
````markdown
---
name: your-docs
description: Search YOUR_PRODUCT documentation. Use when asked about [list key topics, features, common questions].
---

# Your Documentation Search

Use the `search_docs` MCP tool to find documentation.

## When to use

- [List specific use cases]
- [Topics this documentation covers]
- [Types of questions it answers]

## Prerequisites

The MCP server must be configured:

```bash
claude mcp add --transport stdio your-docs -- docs-mcp --db your-docs.db
```
````

| IDE | User-level location |
|---|---|
| Claude Code | ~/.claude/skills/your-docs/SKILL.md |
| Codex CLI | ~/.codex/skills/your-docs/SKILL.md |
- Be specific in the description - include keywords users would mention
- List concrete examples - helps the LLM match user queries to your skill
- Update prerequisites - use the correct MCP add command for each IDE
The default model is BAAI/bge-small-en-v1.5. You can change it with `--embedding-model`.

| Model | Dimensions | Size | Speed | Quality | Notes |
|---|---|---|---|---|---|
| BAAI/bge-small-en-v1.5 | 384 | 130MB | Fast | Good | Default, best balance |
| BAAI/bge-base-en-v1.5 | 768 | 440MB | Medium | Better | More accurate, 2x slower |
| BAAI/bge-large-en-v1.5 | 1024 | 1.3GB | Slow | Best | Diminishing returns for docs |
| all-MiniLM-L6-v2 | 384 | 90MB | Fastest | OK | Smaller, less accurate |
- bge-small (default): Best for most use cases. Good accuracy, fast indexing.
- bge-base: Use if search quality matters more than indexing time.
- bge-large: Rarely needed. The accuracy gain over base is marginal for documentation.
- MiniLM: Use if disk space or memory is constrained.
sentence-transformers auto-detects CUDA. On a GPU, even bge-large indexes quickly.
```bash
# Check if GPU is available
python -c "import torch; print(torch.cuda.is_available())"
```

| Setting | Effect |
|---|---|
| Smaller chunks (500-1000) | More precise matches, more chunks to search, larger database |
| Larger chunks (2000-3000) | More context per result, fewer chunks, may include irrelevant content |
| Default (1500) | Good balance for technical documentation |
| Setting | Effect |
|---|---|
| No overlap (0) | Smallest database, may miss matches at chunk boundaries |
| Small overlap (100-200) | Default, catches most boundary cases |
| Large overlap (300+) | Better boundary matching, larger database, more redundancy |
| Mode | Best for | Speed |
|---|---|---|
| `keyword` | Exact terms, API names, error codes, CLI testing | Instant |
| `semantic` | Natural language, vocabulary mismatch, conceptual queries | Slower (model load) |
| `hybrid` | Production use, best overall results | Slower (model load) |
- Without embeddings: ~1000 files/second
- With embeddings (CPU): ~50 chunks/second
- With embeddings (GPU): ~500 chunks/second
For large documentation sets (10k+ files), use `--no-embeddings` first to verify conversion worked, then rebuild with embeddings.