
Commit 1d0b736: docs cleanup
1 parent 0cee9fc
9 files changed, 345 additions & 24 deletions

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -1,6 +1,7 @@
 # This project
 markdown/
 db/
+nul
 
 # Byte-compiled / optimized / DLL files
 __pycache__/

AGENTS.md

Lines changed: 2 additions & 2 deletions
@@ -12,12 +12,12 @@ Local documentation retrieval system using MCP (Model Context Protocol). Indexes
 HTML docs → pandoc → Markdown → build_index.py → SQLite DB → mcp_server.py → MCP tools
 ```
 
-**Key files:**
+Key files:
 - `build_index.py` — Indexes Markdown into SQLite (FTS5 + optional sqlite-vec embeddings)
 - `mcp_server.py` — MCP server exposing `search_docs`, `get_chunk`, `list_sources` tools
 - `scripts/convert_html.sh` — Batch HTML→Markdown conversion via pandoc
 
-**Database schema:**
+Database schema:
 - `chunks` — Document chunks with id, source, title, content, chunk_index
 - `chunks_fts` — FTS5 virtual table for keyword search
 - `chunks_vec` — sqlite-vec virtual table for embeddings (optional)

README.md

Lines changed: 21 additions & 2 deletions
@@ -2,6 +2,14 @@
 
 Make engineering documentation searchable by LLM coding assistants (Claude Code, Cursor, Codex CLI). Uses SQLite FTS5 for keyword search and vector embeddings for semantic search. Single file, no external services.
 
+## Why This Architecture
+
+Hybrid search gives you the best of both worlds. FTS5 handles exact matches—API names, error messages, symbols. Embeddings handle vocabulary mismatch—when someone searches "make grid finer near edges" instead of "mesh refinement."
+
+SQLite FTS5 + sqlite-vec keeps everything in one file. No vector database to operate, no Docker, no external services.
+
+What this replaces: grep (no ranking), Qdrant/Weaviate (operational overhead), local LLMs (slow, no accuracy benefit for retrieval).
+
 ## Requirements
 
 - Python 3.11+
@@ -72,14 +80,25 @@ For COMSOL-specific conversion, see [docs/comsol.md](docs/comsol.md). You can ad
 
 ## MCP Tools
 
-There are a few key MCP commands for the LLM.
-
 | Tool | Description |
 |------|-------------|
 | `search_docs` | Hybrid keyword + semantic search. Returns matching chunks with scores. |
 | `get_chunk` | Retrieve a specific chunk by ID. |
 | `list_sources` | List all indexed source files. |
 
+Example `search_docs` response:
+```json
+[
+  {
+    "chunk_id": "comsol_ref_mesh.24.80.md:0",
+    "source": "comsol_ref_mesh.24.80.md",
+    "title": "Mesh Refinement",
+    "content": "Use Refine to refine a mesh by splitting elements...",
+    "score": 0.032
+  }
+]
+```
+
 ## Publishing Databases
 
 You can publish database snapshots for ease of use using the following example command:
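For context on the FTS5 half of the hybrid design described in this commit, here is a minimal, self-contained sketch using Python's stdlib `sqlite3` (illustrative table and rows, not the project's real schema; assumes an SQLite build with FTS5 enabled, which ships with most Python distributions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE chunks_fts USING fts5(content)")
conn.executemany(
    "INSERT INTO chunks_fts(content) VALUES (?)",
    [
        ("Use Refine to refine a mesh by splitting elements",),
        ("Boundary conditions for the AC/DC module",),
    ],
)
# bm25() is lower-is-better in FTS5, so negate it to rank best matches first
rows = conn.execute(
    "SELECT content, -bm25(chunks_fts) AS score "
    "FROM chunks_fts WHERE chunks_fts MATCH ? ORDER BY score DESC",
    ("mesh",),
).fetchall()
print(rows[0][0])
```

This is exactly the kind of query where keyword search shines: the literal token "mesh" is matched and ranked, with no embedding model involved.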

build_index.py

Lines changed: 2 additions & 1 deletion
@@ -400,7 +400,8 @@ def main():
     for i, md_path in enumerate(md_files):
         # Check if file has changed
         current_hash = file_hash(md_path)
-        relative_path = str(md_path.relative_to(args.source_dir))
+        # Normalize path separators for cross-platform consistency
+        relative_path = str(md_path.relative_to(args.source_dir)).replace("\\", "/")
 
         existing = conn.execute(
             "SELECT hash FROM sources WHERE path = ?", (relative_path,)

docs/comsol.md

Lines changed: 3 additions & 2 deletions
@@ -12,8 +12,9 @@ When installing COMSOL, select:
 
 COMSOL 6.4 HTML documentation default paths:
 
-- **Windows:** `C:\Program Files\COMSOL\COMSOL64\Multiphysics\doc\help\wtpwebapps\ROOT\doc\`
-- **Linux:** `/usr/local/comsol/multiphysics/doc/help/wtpwebapps/ROOT/doc`
+- Windows: `C:\Program Files\COMSOL\COMSOL64\Multiphysics\doc\help\wtpwebapps\ROOT\doc\`
+- macOS: `/Applications/COMSOL64/Multiphysics/doc/help/wtpwebapps/ROOT/doc/`
+- Linux: `/usr/local/comsol/multiphysics/doc/help/wtpwebapps/ROOT/doc/`
 
 The HTML files are spread across subdirectories (`comsol_ref_manual/`, `acdc_module/`, etc.).

docs/development.md

Lines changed: 118 additions & 4 deletions
@@ -38,10 +38,10 @@ CREATE TABLE sources (
 
 Documents are split using a header-aware algorithm:
 
-1. **Primary split:** Markdown headers (`##`, `###`, etc.)
-2. **Secondary split:** If a section exceeds `chunk_size`, split on paragraph boundaries
-3. **Tertiary split:** If still too large, split on sentence boundaries
-4. **Overlap:** Each chunk includes `chunk_overlap` characters from the previous chunk's end
+1. Primary split: Markdown headers (`##`, `###`, etc.)
+2. Secondary split: If a section exceeds `chunk_size`, split on paragraph boundaries
+3. Tertiary split: If still too large, split on sentence boundaries
+4. Overlap: Each chunk includes `chunk_overlap` characters from the previous chunk's end
 
 Section titles are preserved as metadata for better context in search results.
 
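The splitting algorithm documented in this hunk can be sketched roughly as follows. This is a hypothetical illustration of steps 2-4 for a single header section, not the project's actual chunker, and the sentence regex is a simplification:

```python
import re

def chunk_section(text: str, chunk_size: int = 1500, chunk_overlap: int = 200) -> list[str]:
    """Sketch: paragraph split, then sentence split, then pack with overlap."""
    if len(text) <= chunk_size:
        return [text]
    # Secondary split: paragraph boundaries
    parts = re.split(r"\n\n+", text)
    # Tertiary split: sentence boundaries for still-oversized paragraphs
    units: list[str] = []
    for p in parts:
        if len(p) > chunk_size:
            units.extend(re.split(r"(?<=[.!?])\s+", p))
        else:
            units.append(p)
    # Pack units, carrying chunk_overlap characters from the previous chunk's end
    chunks: list[str] = []
    current = ""
    for u in units:
        if current and len(current) + len(u) + 1 > chunk_size:
            chunks.append(current)
            current = current[-chunk_overlap:]
        current = (current + " " + u).strip()
    if current:
        chunks.append(current)
    return chunks
```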
@@ -112,3 +112,117 @@ For other formats:
 pandoc -f rst -t gfm input.rst -o output.md # Sphinx RST
 pandoc -f docx -t gfm input.docx -o output.md # Word docs
 ```
+
+## Running Tests
+
+```bash
+uv run pytest tests/ -v
+```
+
+Tests cover chunking, indexing, FTS triggers, RRF scoring, and search functionality. All tests use temporary databases and clean up after themselves.
+
+## Creating Skills for New Documentation
+
+Skills help LLMs know when to use your MCP server. Create a skill file for each documentation set.
+
+### Skill Format
+
+Both Claude Code and Codex use the [Agent Skills specification](https://agentskills.io/specification):
+
+```markdown
+---
+name: your-docs
+description: Search YOUR_PRODUCT documentation. Use when asked about [list key topics, features, common questions].
+---
+
+# Your Documentation Search
+
+Use the `search_docs` MCP tool to find documentation.
+
+## When to use
+
+- [List specific use cases]
+- [Topics this documentation covers]
+- [Types of questions it answers]
+
+## Prerequisites
+
+The MCP server must be configured:
+\`\`\`bash
+claude mcp add --transport stdio your-docs -- docs-mcp --db your-docs.db
+\`\`\`
+```
+
+### Skill Locations
+
+| IDE | User-level location |
+|-----|---------------------|
+| Claude Code | `~/.claude/skills/your-docs.md` |
+| Codex CLI | `~/.codex/skills/your-docs.md` |
+
+### Tips
+
+- Be specific in the description - include keywords users would mention
+- List concrete examples - helps the LLM match user queries to your skill
+- Update prerequisites - use the correct MCP add command for each IDE
+
+## Embedding Models
+
+The default model is `BAAI/bge-small-en-v1.5`. You can change it with `--embedding-model`.
+
+| Model | Dimensions | Size | Speed | Quality | Notes |
+|-------|------------|------|-------|---------|-------|
+| `BAAI/bge-small-en-v1.5` | 384 | 130MB | Fast | Good | Default, best balance |
+| `BAAI/bge-base-en-v1.5` | 768 | 440MB | Medium | Better | More accurate, 2x slower |
+| `BAAI/bge-large-en-v1.5` | 1024 | 1.3GB | Slow | Best | Diminishing returns for docs |
+| `all-MiniLM-L6-v2` | 384 | 90MB | Fastest | OK | Smaller, less accurate |
+
+### Recommendations
+
+- bge-small (default): Best for most use cases. Good accuracy, fast indexing.
+- bge-base: Use if search quality matters more than indexing time.
+- bge-large: Rarely needed. The accuracy gain over base is marginal for documentation.
+- MiniLM: Use if disk space or memory is constrained.
+
+### GPU Acceleration
+
+sentence-transformers auto-detects CUDA. On a GPU, even bge-large indexes quickly.
+
+```bash
+# Check if GPU is available
+python -c "import torch; print(torch.cuda.is_available())"
+```
+
+## Performance Tuning
+
+### Chunk Size
+
+| Setting | Effect |
+|---------|--------|
+| Smaller chunks (500-1000) | More precise matches, more chunks to search, larger database |
+| Larger chunks (2000-3000) | More context per result, fewer chunks, may include irrelevant content |
+| Default (1500) | Good balance for technical documentation |
+
+### Chunk Overlap
+
+| Setting | Effect |
+|---------|--------|
+| No overlap (0) | Smallest database, may miss matches at chunk boundaries |
+| Small overlap (100-200) | Default, catches most boundary cases |
+| Large overlap (300+) | Better boundary matching, larger database, more redundancy |
+
+### When to Use Each Search Mode
+
+| Mode | Best for | Speed |
+|------|----------|-------|
+| `keyword` | Exact terms, API names, error codes, CLI testing | Instant |
+| `semantic` | Natural language, vocabulary mismatch, conceptual queries | Slower (model load) |
+| `hybrid` | Production use, best overall results | Slower (model load) |
+
+### Indexing Performance
+
+- Without embeddings: ~1000 files/second
+- With embeddings (CPU): ~50 chunks/second
+- With embeddings (GPU): ~500 chunks/second
+
+For large documentation sets (10k+ files), use `--no-embeddings` first to verify conversion worked, then rebuild with embeddings.
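Back-of-envelope arithmetic for the throughput figures above, assuming roughly 5 chunks per file (an assumption for illustration, not a measured ratio):

```python
files = 10_000
chunks = files * 5            # assumption: ~5 chunks per file
cpu_rate, gpu_rate = 50, 500  # chunks/second, from the table above

cpu_minutes = chunks / cpu_rate / 60
gpu_minutes = chunks / gpu_rate / 60
print(round(cpu_minutes, 1), round(gpu_minutes, 1))  # → 16.7 1.7
```

At these rates a 10k-file set costs minutes with a GPU but closer to a quarter hour on CPU, which is why the no-embeddings dry run first is worthwhile.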

docs/troubleshooting.md

Lines changed: 39 additions & 0 deletions
@@ -46,3 +46,42 @@ The embedding model loads on first semantic search. This is a one-time cost per
 ```bash
 docs-mcp --db comsol.db --test "query" --mode keyword
 ```
+
+## Windows-specific issues
+
+### "Database is locked" or temp file errors
+
+SQLite WAL mode can cause file locking issues on Windows. The server handles this automatically, but if you see errors:
+
+1. Close any other programs accessing the database
+2. Delete `.db-wal` and `.db-shm` files if present
+3. Restart the MCP server
+
+### NPX commands fail with "Connection closed"
+
+On native Windows (not WSL), wrap NPX commands with `cmd /c`:
+
+```powershell
+# Instead of: npx -y some-package
+cmd /c npx -y some-package
+```
+
+### Path issues
+
+Always use forward slashes or escaped backslashes in config files:
+
+```json
+{
+  "args": ["--db", "C:/Users/name/docs-mcp/comsol.db"]
+}
+```
+
+Or use a bare database filename:
+
+```json
+{
+  "args": ["--db", "comsol.db"]
+}
+```
+
+(Database files in `%LOCALAPPDATA%\docs-mcp\` are found automatically.)
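The forward-slash advice exists because in JSON a backslash starts an escape sequence, so Windows paths must double every backslash. A quick check with a hypothetical path:

```python
import json

# Forward slashes need no escaping in JSON
ok = json.loads('{"args": ["--db", "C:/Users/name/docs-mcp/comsol.db"]}')

# Backslashes must be doubled inside the JSON text to parse as single ones
escaped = json.loads('{"args": ["--db", "C:\\\\Users\\\\name\\\\comsol.db"]}')
print(escaped["args"][1])
```

A single unescaped backslash (e.g. `"C:\Users\..."`) is simply invalid JSON and the config will fail to load.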

mcp_server.py

Lines changed: 68 additions & 13 deletions
@@ -26,6 +26,9 @@
 )
 logger = logging.getLogger(__name__)
 
+# Valid search modes for search_docs
+VALID_SEARCH_MODES = ("keyword", "semantic", "hybrid")
+
 
 def get_data_dir() -> Path:
     """Get the default data directory for database files."""
@@ -89,24 +92,60 @@ def get_embedding_model():
     return _embedding_model
 
 
+def sanitize_fts_query(query: str) -> str:
+    """
+    Sanitize a query string for safe use with FTS5 MATCH.
+
+    FTS5 has special syntax (AND, OR, NOT, quotes, parentheses, etc.)
+    that can cause errors or unexpected behavior. This wraps each word
+    in quotes to treat them as literals.
+    """
+    # Split on whitespace and filter empty tokens
+    tokens = query.split()
+    if not tokens:
+        return ""
+
+    # Escape double quotes within tokens and wrap each in quotes
+    # This treats each word as a literal phrase, joined by implicit AND
+    escaped = []
+    for token in tokens:
+        # Escape any existing double quotes
+        safe_token = token.replace('"', '""')
+        escaped.append(f'"{safe_token}"')
+
+    return " ".join(escaped)
+
+
 def search_fts(query: str, limit: int) -> list[tuple[str, float]]:
     """Full-text search using FTS5. Returns (chunk_id, score) pairs."""
+    # Handle empty or whitespace-only queries
+    if not query or not query.strip():
+        return []
+
     conn = get_connection()
 
-    # BM25 scoring (lower is better in FTS5, so we negate)
-    results = conn.execute(
-        """
-        SELECT c.id, -bm25(chunks_fts, 1, 10) as score
-        FROM chunks_fts
-        JOIN chunks c ON chunks_fts.rowid = c.rowid
-        WHERE chunks_fts MATCH ?
-        ORDER BY score DESC
-        LIMIT ?
-        """,
-        (query, limit),
-    ).fetchall()
+    # Sanitize query to prevent FTS5 syntax errors
+    safe_query = sanitize_fts_query(query)
+    if not safe_query:
+        return []
 
-    return results
+    try:
+        # BM25 scoring (lower is better in FTS5, so we negate)
+        results = conn.execute(
+            """
+            SELECT c.id, -bm25(chunks_fts, 1, 10) as score
+            FROM chunks_fts
+            JOIN chunks c ON chunks_fts.rowid = c.rowid
+            WHERE chunks_fts MATCH ?
+            ORDER BY score DESC
+            LIMIT ?
+            """,
+            (safe_query, limit),
+        ).fetchall()
+        return results
+    except sqlite3.OperationalError as e:
+        logger.warning(f"FTS5 search error: {e}")
+        return []
 
 
 def search_vec(query: str, limit: int) -> list[tuple[str, float]]:
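To see what the new sanitizer produces, here is the same logic from the diff as a standalone snippet, exercised on a query containing FTS5 operator words and a stray quote:

```python
def sanitize_fts_query(query: str) -> str:
    # Wrap each whitespace-separated token in double quotes so FTS5
    # treats it as a literal phrase; double any embedded quotes.
    tokens = query.split()
    if not tokens:
        return ""
    escaped = []
    for token in tokens:
        safe_token = token.replace('"', '""')
        escaped.append(f'"{safe_token}"')
    return " ".join(escaped)

print(sanitize_fts_query('mesh AND "refinement'))
```

The operator word `AND` comes out as the literal phrase `"AND"`, and the unbalanced quote is doubled instead of breaking MATCH parsing.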
@@ -147,6 +186,16 @@ def reciprocal_rank_fusion(
     Combine multiple ranked lists using Reciprocal Rank Fusion.
 
     RRF score = sum(1 / (k + rank_i)) for each list where item appears
+
+    Args:
+        results_lists: List of ranked result lists, each containing (id, score) tuples
+        k: Ranking constant that controls how much weight is given to lower-ranked items.
+           Default of 60 is from the original RRF paper (Cormack et al., 2009) and works
+           well in practice. Lower k gives more weight to top results; higher k makes
+           the ranking more uniform.
+
+    Returns:
+        Combined list of (id, rrf_score) tuples, sorted by score descending
     """
     scores: dict[str, float] = {}
 
@@ -204,7 +253,13 @@ def search_docs_impl(query: str, limit: int = 10, mode: str = "hybrid") -> list[
 
     Returns:
         List of matching chunks with scores
+
+    Raises:
+        ValueError: If mode is not one of the valid search modes
     """
+    if mode not in VALID_SEARCH_MODES:
+        raise ValueError(f"Invalid search mode '{mode}'. Must be one of: {VALID_SEARCH_MODES}")
+
     results_lists = []
 
     if mode in ("keyword", "hybrid"):
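The RRF formula documented in the docstring above can be sketched as a standalone function. The commit shows only the start of the real function body, so this is an illustration consistent with the formula, not the project's code:

```python
def reciprocal_rank_fusion(results_lists, k=60):
    # RRF: each list contributes 1 / (k + rank) per item, rank starting at 1;
    # items appearing in several lists accumulate score across them.
    scores: dict[str, float] = {}
    for results in results_lists:
        for rank, (chunk_id, _score) in enumerate(results, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

fts_results = [("a", 5.0), ("b", 4.0)]   # keyword ranking
vec_results = [("b", 0.9), ("c", 0.8)]   # semantic ranking
fused = reciprocal_rank_fusion([fts_results, vec_results])
print([cid for cid, _ in fused])  # → ['b', 'a', 'c']
```

Note that only ranks matter: `b` wins because it appears in both lists, even though its raw scores are never directly comparable across the two retrievers. That rank-only property is why RRF needs no score normalization between BM25 and cosine distance.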
