
Feature request: OpenAI-compatible and Gemini API embedding support #428

@ysys143

Description


Problem

Local GGUF embedding works well for privacy/offline use, but has friction:

  • Requires downloading a ~300MB+ model on first run
  • Slow on CPU; needs Metal/CUDA for reasonable throughput
  • Limited to models that fit in local memory

There's no way to use cloud embedding APIs (OpenAI, Gemini, Ollama remote, etc.) without forking.

Proposed solution

Add three environment variables to activate API-based embedding as an alternative to local GGUF:

```
QMD_EMBED_API_URL=   # base URL; presence activates API mode
QMD_EMBED_API_KEY=   # API key
QMD_EMBED_API_MODEL= # model name
```
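The activation rule could be as simple as checking whether the base URL is set. A minimal sketch (the function name is illustrative; only the environment variable name comes from this proposal):

```python
import os

def embedding_mode() -> str:
    """Return "api" when QMD_EMBED_API_URL is set, otherwise "local".

    Presence of the base URL alone activates API mode; local GGUF
    remains the default so existing setups are unaffected.
    """
    if os.environ.get("QMD_EMBED_API_URL"):
        return "api"
    return "local"
```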

The API type is auto-detected from the URL, so no extra config is needed:

  • googleapis.com → Gemini (batchEmbedContents)
  • anything else → OpenAI-compatible (OpenAI, Ollama, LM Studio, etc.)
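A sketch of that detection rule (matching on the host rather than the raw string, to avoid false positives from query strings or paths; exact matching behavior would be up to the implementation in PR #427):

```python
from urllib.parse import urlparse

def detect_api_type(base_url: str) -> str:
    """Map googleapis.com hosts to Gemini; treat everything else
    (OpenAI, Ollama, LM Studio, ...) as OpenAI-compatible."""
    host = urlparse(base_url).hostname or ""
    if host == "googleapis.com" or host.endswith(".googleapis.com"):
        return "gemini"
    return "openai"
```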

Local mode continues to work exactly as before when QMD_EMBED_API_URL is unset.
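The two backends differ mainly in payload shape: OpenAI-compatible endpoints take `{"model", "input"}`, while Gemini's `batchEmbedContents` wraps each text in a per-request `content` object. A sketch of the payloads (shapes per the public OpenAI and Gemini API docs; the helper name is illustrative):

```python
def build_embed_payload(api_type: str, model: str, texts: list[str]) -> dict:
    """Build the JSON body for a batch embedding request."""
    if api_type == "gemini":
        # Gemini batchEmbedContents: one request object per input text.
        return {
            "requests": [
                {"model": f"models/{model}",
                 "content": {"parts": [{"text": t}]}}
                for t in texts
            ]
        }
    # OpenAI-compatible: a single request carries the whole batch.
    return {"model": model, "input": texts}
```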

Benchmark (63 Korean+English chunks, M3 Max)

| Model | Per chunk | Dims | Cost |
|---|---|---|---|
| embeddinggemma-300M (local/Metal) | 72 ms | 768 | $0 |
| text-embedding-3-small | 16 ms | 1536 | $0.020/1M tokens |
| text-embedding-3-large | 13 ms | 3072 | $0.130/1M tokens |
| gemini-embedding-001 | 38 ms | 3072 | free tier |

Implementation

PR #427 has a working implementation with benchmark script.
