
Commit 2761ace

iterating on extraction
1 parent cc12d58 commit 2761ace

4 files changed

Lines changed: 248 additions & 40 deletions


.gitignore

Lines changed: 2 additions & 2 deletions
@@ -1,6 +1,6 @@
 # This project
-markdown/**/*.md
-*.db*
+markdown/
+db/
 
 # Byte-compiled / optimized / DLL files
 __pycache__/

README.md

Lines changed: 53 additions & 36 deletions
@@ -40,21 +40,31 @@ uv sync
 ./scripts/convert_html.sh /path/to/html/docs ./markdown
 
 # Build index WITHOUT embeddings first (fast, for testing)
-uv run python build_index.py ./markdown --output docs.db --no-embeddings
+uv run python build_index.py ./markdown --output db/docs.db --no-embeddings
 
 # Test that search works
-uv run python mcp_server.py --db docs.db --test "your query here"
+uv run python mcp_server.py --db db/docs.db --test "your query here"
 
 # Once satisfied, rebuild WITH embeddings (required for production use)
-uv run python build_index.py ./markdown --output docs.db
+uv run python build_index.py ./markdown --output db/docs.db
 
 # Run the MCP server
-uv run python mcp_server.py --db docs.db
+uv run python mcp_server.py --db db/docs.db
 
 # Configure your IDE (see below)
 ```
 
-## Example: Comsol Documentation
+## Application-Specific Conversion
+
+Some applications use non-semantic HTML (CSS classes instead of proper heading tags). We provide specialized conversion scripts for these cases.
+
+### Comsol Documentation
+
+Options to select during install:
+- Install application libraries for selected products
+- Install documentation relevant to selected products
+
+Comsol's HTML documentation uses CSS classes like `Head1_DVD`, `Body_text_DVD` instead of semantic `<h1>`, `<p>` tags. The specialized Python script handles this structure correctly.
 
 Comsol 6.4's HTML documentation is installed by default on Windows at:
 
@@ -64,52 +74,57 @@ C:\Program Files\COMSOL\COMSOL64\Multiphysics\doc\help\wtpwebapps\ROOT\doc\
 
 The HTML files are spread across subdirectories (e.g., `comsol_ref_manual/`, `acdc_module/`, etc.).
 
-### Windows (PowerShell)
-
-The conversion script requires bash. Use Git Bash (included with Git for Windows) or WSL:
+#### Windows
 
 ```powershell
-# From Git Bash
-./scripts/convert_html.sh "/c/Program Files/COMSOL/COMSOL64/Multiphysics/doc/help/wtpwebapps/ROOT/doc" ./markdown
-
-# From WSL
-./scripts/convert_html.sh "/mnt/c/Program Files/COMSOL/COMSOL64/Multiphysics/doc/help/wtpwebapps/ROOT/doc" ./markdown
+# Convert using the Comsol-specific script
+uv run python scripts/convert_comsol_html.py "C:\Program Files\COMSOL\COMSOL64\Multiphysics\doc\help\wtpwebapps\ROOT\doc" ./markdown
 
 # Build index (no embeddings for initial test)
-uv run python build_index.py ./markdown --output comsol.db --no-embeddings
+uv run python build_index.py ./markdown --output db/comsol.db --no-embeddings
 
 # Test search
-uv run python mcp_server.py --db comsol.db --test "mesh refinement"
+uv run python mcp_server.py --db db/comsol.db --test "mesh refinement"
 
 # Rebuild with embeddings for production
-uv run python build_index.py ./markdown --output comsol.db
+uv run python build_index.py ./markdown --output db/comsol.db
 ```
 
-### macOS/Linux (if Comsol installed locally)
+#### macOS/Linux
 
 ```bash
 # Typical Linux path
-./scripts/convert_html.sh /usr/local/comsol/multiphysics/doc/help/wtpwebapps/ROOT/doc ./markdown
+uv run python scripts/convert_comsol_html.py /usr/local/comsol/multiphysics/doc/help/wtpwebapps/ROOT/doc ./markdown
 
 # Or copy docs from Windows machine first
-./scripts/convert_html.sh ./comsol_docs_copy ./markdown
+uv run python scripts/convert_comsol_html.py ./comsol_docs_copy ./markdown
 
 # Build and test
-uv run python build_index.py ./markdown --output comsol.db --no-embeddings
-uv run python mcp_server.py --db comsol.db --test "boundary conditions"
+uv run python build_index.py ./markdown --output db/comsol.db --no-embeddings
+uv run python mcp_server.py --db db/comsol.db --test "boundary conditions"
 
 # Production build with embeddings
-uv run python build_index.py ./markdown --output comsol.db
+uv run python build_index.py ./markdown --output db/comsol.db
 ```
 
-### Expected Output
+#### Expected Output
 
 A typical Comsol installation produces:
 - ~8,000–15,000 HTML files
 - ~20,000–40,000 chunks after indexing
 - ~50–150 MB database with embeddings
 - ~5–10 minutes for full rebuild with embeddings
 
+## Generic HTML Conversion
+
+For documentation that uses semantic HTML (proper `<h1>`, `<h2>`, `<p>` tags), use the generic conversion script:
+
+```bash
+./scripts/convert_html.sh /path/to/html/docs ./markdown
+```
+
+This uses pandoc to convert HTML to GitHub-Flavored Markdown. Works well for most documentation systems.
+
 ## Project Structure
 
 ```
@@ -118,9 +133,11 @@ local-docs-mcp/
 ├── build_index.py  # Indexing script
 ├── mcp_server.py  # MCP server exposing search tools
 ├── scripts/
-│   └── convert_html.sh  # HTML to Markdown conversion
+│   ├── convert_html.sh  # Generic HTML to Markdown (pandoc)
+│   └── convert_comsol_html.py  # Comsol-specific conversion
 ├── markdown/  # Converted docs (gitignored)
-└── docs.db  # SQLite database (gitignored)
+└── db/  # SQLite databases (gitignored)
+    └── docs.db
 ```
 
 ## Detailed Setup
@@ -177,19 +194,19 @@ pandoc -f docx -t gfm input.docx -o output.md
 
 For initial testing (fast, seconds):
 ```bash
-uv run python build_index.py ./markdown --output docs.db --no-embeddings
+uv run python build_index.py ./markdown --output db/docs.db --no-embeddings
 ```
 
 For production use (with embeddings, minutes):
 ```bash
-uv run python build_index.py ./markdown --output docs.db
+uv run python build_index.py ./markdown --output db/docs.db
 ```
 
 Embeddings enable semantic search—finding "mesh refinement" when someone searches "make grid finer." Without embeddings, only exact keyword matching works. Always use embeddings for actual team usage.
 
 Options:
 ```
---output PATH Output database path (default: docs.db)
+--output PATH Output database path (default: db/docs.db)
 --chunk-size N Target chunk size in characters (default: 1500)
 --chunk-overlap N Overlap between chunks (default: 200)
 --embedding-model Model name (default: BAAI/bge-small-en-v1.5)
@@ -202,7 +219,7 @@ Rebuild vs. incremental: The script hashes files and skips unchanged content. A
 ### 5. Test the MCP Server Locally
 
 ```bash
-uv run python mcp_server.py --db docs.db
+uv run python mcp_server.py --db db/docs.db
 ```
 
 The server communicates via stdio. For testing, you can pipe JSON-RPC messages, but it's easier to just configure your IDE and test there.
@@ -218,7 +235,7 @@ Create or edit `.cursor/mcp.json` in your project root:
   "mcpServers": {
     "docs": {
       "command": "uv",
-      "args": ["run", "python", "/absolute/path/to/mcp_server.py", "--db", "/absolute/path/to/docs.db"],
+      "args": ["run", "python", "/absolute/path/to/mcp_server.py", "--db", "/absolute/path/to/db/docs.db"],
       "cwd": "/absolute/path/to/local-docs-mcp"
     }
   }
@@ -236,7 +253,7 @@ Add to your Claude Code MCP configuration:
   "mcpServers": {
     "docs": {
       "command": "uv",
-      "args": ["run", "python", "mcp_server.py", "--db", "docs.db"],
+      "args": ["run", "python", "mcp_server.py", "--db", "db/docs.db"],
       "cwd": "/absolute/path/to/local-docs-mcp"
     }
   }
@@ -248,7 +265,7 @@ Add to your Claude Code MCP configuration:
 Configure via `~/.codex/config.json` or use the CLI:
 
 ```bash
-codex mcp add docs "uv run python /path/to/mcp_server.py --db /path/to/docs.db"
+codex mcp add docs "uv run python /path/to/mcp_server.py --db /path/to/db/docs.db"
 ```
 
 ## MCP Tools Exposed
@@ -372,7 +389,7 @@ Section titles are preserved as metadata for better context in search results.
 ./scripts/convert_html.sh /path/to/html ./markdown
 
 # Rebuild index with embeddings (incremental—skips unchanged files)
-uv run python build_index.py ./markdown --output docs.db
+uv run python build_index.py ./markdown --output db/docs.db
 
 # Restart MCP server (or it picks up changes on next query)
 ```
@@ -383,13 +400,13 @@ uv run python build_index.py ./markdown --output docs.db
 # Check database integrity
 uv run python -c "
 import sqlite3
-conn = sqlite3.connect('docs.db')
+conn = sqlite3.connect('db/docs.db')
 print(f'Chunks: {conn.execute(\"SELECT COUNT(*) FROM chunks\").fetchone()[0]}')
 print(f'Sources: {conn.execute(\"SELECT COUNT(*) FROM sources\").fetchone()[0]}')
 "
 
 # Test search from command line
-uv run python mcp_server.py --db docs.db --test "boundary conditions"
+uv run python mcp_server.py --db db/docs.db --test "boundary conditions"
 ```
 
 ## Troubleshooting
@@ -415,7 +432,7 @@ uv add sqlite-vec --reinstall
 
 ### "MCP server not connecting"
 
-1. Test manually: `uv run python mcp_server.py --db docs.db` should start without errors
+1. Test manually: `uv run python mcp_server.py --db db/docs.db` should start without errors
 2. Check paths in IDE config are absolute
 3. Check `cwd` is set correctly
 4. Look at IDE's MCP debug logs
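Note on the class-based conversion this README change describes: the contents of `scripts/convert_comsol_html.py` are not shown in this diff, so the following is only a minimal sketch of the general idea, assuming the two class names quoted in the README (`Head1_DVD`, `Body_text_DVD`). The mapping table, the `ClassBasedConverter` class, and `html_to_markdown` are hypothetical names for illustration and are not taken from the repository.

```python
from html.parser import HTMLParser

# Assumed mapping from CSS class to Markdown prefix. Only these two class
# names appear in the README; the prefixes are an illustrative choice.
CLASS_TO_PREFIX = {"Head1_DVD": "# ", "Body_text_DVD": ""}


class ClassBasedConverter(HTMLParser):
    """Collect text from elements whose class is listed in CLASS_TO_PREFIX."""

    def __init__(self):
        super().__init__()
        self.blocks = []     # finished Markdown blocks, in document order
        self._prefix = None  # prefix of the block being collected, if any
        self._tag = None     # tag name that opened the current block
        self._depth = 0      # nesting depth of same-named tags inside it
        self._text = []

    def handle_starttag(self, tag, attrs):
        if self._prefix is not None:
            # Already inside a tracked block: only track same-tag nesting.
            if tag == self._tag:
                self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        for cls in classes:
            if cls in CLASS_TO_PREFIX:
                self._prefix, self._tag = CLASS_TO_PREFIX[cls], tag
                self._depth, self._text = 0, []
                break

    def handle_data(self, data):
        if self._prefix is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._prefix is None or tag != self._tag:
            return
        if self._depth > 0:
            self._depth -= 1
            return
        # Collapse runs of whitespace so each block becomes a single line.
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append(self._prefix + text)
        self._prefix = None


def html_to_markdown(html: str) -> str:
    """Convert class-tagged HTML to Markdown blocks separated by blank lines."""
    parser = ClassBasedConverter()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.blocks)
```

Elements whose class is not in the table are skipped entirely, which is why a generic converter that relies on `<h1>` and `<p>` tags would miss the heading structure in this kind of HTML.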

build_index.py

Lines changed: 5 additions & 2 deletions
@@ -296,8 +296,8 @@ def main():
     parser.add_argument(
         "--output", "-o",
         type=Path,
-        default=Path("docs.db"),
-        help="Output database path (default: docs.db)",
+        default=Path("db/docs.db"),
+        help="Output database path (default: db/docs.db)",
     )
     parser.add_argument(
         "--chunk-size",
@@ -333,6 +333,9 @@ def main():
     if not args.source_dir.is_dir():
         print(f"Error: {args.source_dir} is not a directory", file=sys.stderr)
         sys.exit(1)
+
+    # Ensure output directory exists
+    args.output.parent.mkdir(parents=True, exist_ok=True)
 
     # Get embedding dimension from model (or default)
     embedding_dim = 384  # Default for bge-small
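Note on the new `mkdir` call: with the default now pointing at `db/docs.db`, the parent directory may not exist on a fresh checkout, so it has to be created before the database is opened. The sketch below isolates that one line to show its behavior. `ensure_output_dir` is a hypothetical helper name used only for illustration; the commit inlines the call in `main()`.

```python
from pathlib import Path
import tempfile


def ensure_output_dir(output: Path) -> Path:
    """Create the parent directory of `output` if needed and return it."""
    # parents=True creates missing intermediate directories;
    # exist_ok=True makes repeated calls a no-op instead of an error.
    output.parent.mkdir(parents=True, exist_ok=True)
    return output.parent


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "db" / "docs.db"
        created = ensure_output_dir(target)
        print(created.is_dir())
        # A second call on the same path does not raise.
        ensure_output_dir(target)
```

A bare filename such as `docs.db` still works with this pattern, because its parent is the current directory, which already exists and is accepted by `exist_ok=True`.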
