Problem
Local GGUF embedding works well for privacy/offline use, but has friction:
- Requires downloading a ~300 MB+ model on first run
- Slow on CPU; needs Metal/CUDA for reasonable throughput
- Limited to models that fit in local memory
There's no way to use cloud embedding APIs (OpenAI, Gemini, Ollama remote, etc.) without forking.
Proposed solution
Add three environment variables to activate API-based embedding as an alternative to local GGUF:
```
QMD_EMBED_API_URL=    # base URL (presence activates API mode)
QMD_EMBED_API_KEY=    # API key
QMD_EMBED_API_MODEL=  # model name
```
The API type is auto-detected from the URL, so no extra configuration is needed:
- URL containing `googleapis.com` → Gemini (batchEmbedContents)
- anything else → OpenAI-compatible (OpenAI, Ollama, LM Studio, etc.)
Local mode continues to work exactly as before when QMD_EMBED_API_URL is unset.
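The activation and auto-detection rule above could be as small as this sketch (Python for illustration only; the function name and return values are hypothetical, not from the PR):

```python
import os

def embed_backend() -> str:
    """Pick the embedding backend from the proposed environment variables.

    Rule from the proposal: presence of QMD_EMBED_API_URL activates API
    mode; "googleapis.com" in the URL selects the Gemini protocol, and
    any other URL is treated as OpenAI-compatible.
    """
    url = os.environ.get("QMD_EMBED_API_URL", "")
    if not url:
        return "local"   # GGUF path, behavior unchanged
    if "googleapis.com" in url:
        return "gemini"  # batchEmbedContents protocol
    return "openai"      # OpenAI-compatible /embeddings endpoint
```

Because detection keys off the URL alone, pointing `QMD_EMBED_API_URL` at a local Ollama or LM Studio server gets the OpenAI-compatible path with no further flags.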
Benchmark (63 Korean+English chunks, M3 Max)
| Model | Per chunk | Dims | Cost |
|---|---|---|---|
| embeddinggemma-300M (local/Metal) | 72ms | 768 | $0 |
| text-embedding-3-small | 16ms | 1536 | $0.020/1M tokens |
| text-embedding-3-large | 13ms | 3072 | $0.130/1M tokens |
| gemini-embedding-001 | 38ms | 3072 | free tier |
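The two auto-detected API types use different wire formats. A minimal sketch of the request shapes (function name and return structure are hypothetical; the endpoints and body layouts follow the public OpenAI and Gemini REST docs, not code from the PR):

```python
def build_embed_request(backend: str, base_url: str, api_key: str,
                        model: str, texts: list[str]) -> dict:
    """Build an HTTP request description for one batch of texts.

    Assumes the OpenAI-compatible path is POST {base_url}/embeddings and
    the Gemini path is POST .../models/{model}:batchEmbedContents, as
    documented by the respective providers.
    """
    if backend == "gemini":
        return {
            "url": f"{base_url}/v1beta/models/{model}:batchEmbedContents",
            "headers": {"x-goog-api-key": api_key},
            # Gemini wraps each text in its own per-item request object
            "body": {"requests": [
                {"model": f"models/{model}",
                 "content": {"parts": [{"text": t}]}} for t in texts]},
        }
    # OpenAI-compatible servers embed the whole batch in one call
    return {
        "url": f"{base_url}/embeddings",
        "headers": {"Authorization": f"Bearer {api_key}"},
        "body": {"model": model, "input": texts},
    }
```

Both formats accept a batch per request, which is what makes the per-chunk latencies in the table competitive with local inference despite the network round trip.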
Implementation
PR #427 has a working implementation with a benchmark script.