feat(voice): local Whisper transcription with pluggable backends #118

Open
Lunatik-006 wants to merge 1 commit into chigwell:main from Lunatik-006:feat/voice-transcription

@Lunatik-006

Summary

Adds local voice/audio transcription via Whisper. Audio never leaves the host.

Two new tools registered with the FastMCP server:

  • transcribe_voice(chat_id, message_id, language=None, account=None) — downloads a voice or audio attachment from a message and returns the transcribed text.
  • voice_transcription_info() — reports the active backend, device, and config (useful for debugging hardware selection).

The whole feature is opt-in via optional extras — without any of them installed the tools return a clean transcription_unavailable JSON blob and nothing else changes. No torch dependency is added by default.
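
A minimal sketch of how that graceful degradation could look, assuming the tools check for installed backend packages before doing any work. The helper names and the exact shape of the JSON payload are hypothetical; only the `transcription_unavailable` identifier and the extras names come from this PR.

```python
import importlib.util
import json

# Import names of the three optional backend packages.
BACKEND_PACKAGES = {
    "faster_whisper": "faster_whisper",
    "openvino": "openvino_genai",
    "mlx": "mlx_whisper",
}


def installed_backends(find_spec=importlib.util.find_spec):
    """Return backend names whose package can be located, without importing it."""
    return [
        name
        for name, package in BACKEND_PACKAGES.items()
        if find_spec(package) is not None
    ]


def transcribe_or_unavailable(available, run_transcription=None):
    """Return a JSON string: the transcript, or a transcription_unavailable blob."""
    if not available:
        # Hypothetical payload shape; the PR only names the error identifier.
        return json.dumps(
            {
                "error": "transcription_unavailable",
                "hint": 'install one of the extras, e.g. pip install -e ".[voice]"',
            }
        )
    return json.dumps({"text": run_transcription()})
```

Probing with `find_spec` avoids importing heavy packages just to learn whether they are present, which keeps the default install footprint unchanged.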

Backend strategy

Three pluggable backends, selected by env or auto-detected at startup:

| Hardware | Extras | Backend |
| --- | --- | --- |
| Universal CPU on any OS, NVIDIA CUDA | `pip install -e ".[voice]"` | faster-whisper (CTranslate2) |
| Intel CPU / Iris Xe iGPU / Arc dGPU | `pip install -e ".[voice-openvino]"` | OpenVINO GenAI |
| Apple Silicon (M1/M2/M3/M4) | `pip install -e ".[voice-mlx]"` | mlx-whisper |

Auto-detect priority (when WHISPER_BACKEND=auto, the default):

  1. Apple Silicon → MLX → MPS
  2. NVIDIA CUDA → faster-whisper → cuda
  3. Intel discrete/integrated GPU → OpenVINO → GPU.{N} (highest-indexed → typically the discrete dGPU)
  4. CPU fallback via faster-whisper (or OpenVINO if it's the only one installed) with a one-time whisper_cpu_only warning, silenceable via WHISPER_WARN_CPU=false.
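
The priority order above can be sketched as a pure function over already-probed hardware facts. This is an illustration, not the code in `detect.py`: the `Hardware` dataclass, `pick_backend`, and the device strings are assumptions, and the one-time CPU warning is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Hardware:
    """Hypothetical result of a hardware probe."""

    apple_silicon: bool = False
    cuda: bool = False
    intel_gpus: list = field(default_factory=list)  # e.g. ["GPU.0", "GPU.1"]


def _gpu_index(device):
    """Numeric suffix of an OpenVINO-style device name; 0 when there is none."""
    suffix = device.partition(".")[2]
    return int(suffix) if suffix.isdigit() else 0


def pick_backend(hw, installed):
    """Return (backend, device) following the auto-detect priority list."""
    if hw.apple_silicon and "mlx" in installed:
        return ("mlx", "mps")
    if hw.cuda and "faster_whisper" in installed:
        return ("faster_whisper", "cuda")
    if hw.intel_gpus and "openvino" in installed:
        # Highest index is typically the discrete GPU.
        return ("openvino", max(hw.intel_gpus, key=_gpu_index))
    # CPU fallback: prefer faster-whisper, use OpenVINO if it is the only one.
    if "faster_whisper" in installed:
        return ("faster_whisper", "cpu")
    if "openvino" in installed:
        return ("openvino", "CPU")
    return (None, None)
```

Keeping the decision separate from the probing makes each branch easy to unit test with synthetic `Hardware` values.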

Config (env vars)

| Var | Default | Meaning |
| --- | --- | --- |
| `WHISPER_ENABLED` | `true` | Set to `false` to disable transcription tools entirely. |
| `WHISPER_BACKEND` | `auto` | `auto` / `faster_whisper` / `openvino` / `mlx`. |
| `WHISPER_DEVICE` | `auto` | `auto` / `cpu` / `cuda` / `gpu` / `gpu.0` / `gpu.1` |
| `WHISPER_MODEL` | `base` | `tiny` / `base` / `small` / `medium` / `large-v3` / `large-v3-turbo` |
| `WHISPER_LANGUAGE` | (unset = auto-detect) | ISO 639-1 code (`ru`, `en`, …). |
| `WHISPER_WARN_CPU` | `true` | Log a one-time warning when running CPU-only. |
| `WHISPER_CACHE_DIR` | `~/.cache/telegram-mcp/whisper` | Where weights are downloaded. |
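
A sketch of how these variables might be parsed into a config object. The env var names and defaults come from the table; the field names, the `from_env` constructor, and the set of strings treated as false are assumptions rather than the actual contents of `config.py`.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping, Optional

# Assumption: these spellings all count as "false".
_FALSE_VALUES = {"false", "0", "no", "off"}


def _flag(env, name, default):
    """Read a boolean env var, returning the default when it is unset."""
    raw = env.get(name)
    return default if raw is None else raw.strip().lower() not in _FALSE_VALUES


@dataclass(frozen=True)
class WhisperConfig:
    enabled: bool
    backend: str
    device: str
    model: str
    language: Optional[str]
    warn_cpu: bool
    cache_dir: Path

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "WhisperConfig":
        env = os.environ if env is None else env
        return cls(
            enabled=_flag(env, "WHISPER_ENABLED", True),
            backend=env.get("WHISPER_BACKEND", "auto").strip().lower(),
            device=env.get("WHISPER_DEVICE", "auto").strip().lower(),
            model=env.get("WHISPER_MODEL", "base").strip(),
            # Empty or unset means "let Whisper auto-detect the language".
            language=(env.get("WHISPER_LANGUAGE") or "").strip().lower() or None,
            warn_cpu=_flag(env, "WHISPER_WARN_CPU", True),
            cache_dir=Path(
                env.get("WHISPER_CACHE_DIR", "~/.cache/telegram-mcp/whisper")
            ).expanduser(),
        )
```

Accepting an explicit mapping instead of always reading `os.environ` lets tests exercise defaults and overrides without mutating the process environment.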

Module layout

telegram_mcp/voice/
├── __init__.py              # public facade: transcribe(...), get_backend_info()
├── config.py                # WhisperConfig from env vars
├── detect.py                # hardware/backend auto-detection
└── backends/
    ├── base.py              # WhisperBackend ABC
    ├── faster_whisper.py    # CPU/CUDA via CTranslate2
    ├── openvino.py          # Intel via OpenVINO GenAI
    └── mlx.py               # Apple Silicon via MLX
telegram_mcp/tools/voice.py  # MCP tools transcribe_voice + voice_transcription_info

Backends are imported lazily — users only pay for the deps they install.
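
One way the lazy import could work, sketched under the assumption that each backend lives in the module shown in the layout above and is imported only when selected. `load_backend` and its error message are hypothetical; the module paths and extras names are the ones listed in this PR.

```python
import importlib

# backend name -> (module path from the layout above, extras name)
_BACKENDS = {
    "faster_whisper": ("telegram_mcp.voice.backends.faster_whisper", "voice"),
    "openvino": ("telegram_mcp.voice.backends.openvino", "voice-openvino"),
    "mlx": ("telegram_mcp.voice.backends.mlx", "voice-mlx"),
}


def load_backend(name, import_module=importlib.import_module):
    """Import a backend module on first use and return it.

    Nothing heavy is imported at server startup; a missing extra surfaces
    as a clear error only when that backend is actually requested.
    """
    if name not in _BACKENDS:
        raise ValueError(f"unknown backend: {name!r}")
    module_path, extra = _BACKENDS[name]
    try:
        return import_module(module_path)
    except ImportError as exc:
        raise RuntimeError(
            f'backend {name!r} is not installed; try pip install -e ".[{extra}]"'
        ) from exc
```

The `import_module` parameter exists only so the sketch can be tested without the real packages installed.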

Tests

tests/test_voice_{config,detect,facade}.py — 30 unit tests covering:

  • env-driven config parsing (defaults, overrides, alias handling)
  • hardware detection / backend auto-pick (Apple Silicon, CUDA, Intel GPU, CPU fallback)
  • OpenVINO device selection (specific GPU.N, generic GPU → highest index, no GPU → CPU)
  • explicit backend pin without the corresponding extra installed → clean error
  • disabled-via-env path

All 30 pass on Windows / Python 3.13. Existing test suite is untouched.
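
The OpenVINO device-selection cases listed above can be illustrated with a small pure function. This is a sketch, not the PR's implementation: `select_openvino_device` is a hypothetical name, and falling back to the highest-indexed GPU when a specific `GPU.N` is absent is an assumption. In real use the device list would come from something like `openvino.Core().available_devices`.

```python
def _gpu_index(device):
    """Numeric suffix of a device name such as "GPU.1"; 0 for a bare "GPU"."""
    suffix = device.partition(".")[2]
    return int(suffix) if suffix.isdigit() else 0


def select_openvino_device(requested, available):
    """Map a WHISPER_DEVICE value onto one of the available OpenVINO devices."""
    wanted = requested.strip().upper()
    gpus = sorted(
        (d.upper() for d in available if d.upper().startswith("GPU")),
        key=_gpu_index,
    )
    if wanted == "CPU" or not gpus:
        return "CPU"  # explicit CPU, or no GPU present at all
    if wanted in gpus:
        return wanted  # specific GPU.N that exists
    return gpus[-1]  # generic "gpu" / "auto": highest index


# Example with a hypothetical laptop exposing an iGPU and a dGPU.
devices = ["CPU", "GPU.0", "GPU.1"]
choice = select_openvino_device("gpu", devices)
```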

Performance reference

Whisper-base, 11 s English clip, OpenVINO backend (this PR), best of 3 runs:

| Device | Inference | × realtime |
| --- | --- | --- |
| Intel Arc A550M (dGPU) | ~160 ms | 67× |
| Intel Iris Xe (iGPU) | ~330 ms | 33× |
| i7-12700H, 20 threads (OpenVINO CPU) | ~600 ms | 18× |
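
For reference, the realtime factor is the clip duration divided by the inference time. The sketch below recomputes it from the rounded latencies; `realtime_factor` is a helper introduced here for illustration. Because the latencies are approximate, the results match the reported figures only roughly (for example, 11 s / 0.160 s is about 69 against the reported 67×), presumably because the reported factors were derived from unrounded timings.

```python
def realtime_factor(clip_seconds, inference_ms):
    """How many seconds of audio are processed per second of compute."""
    return clip_seconds / (inference_ms / 1000.0)


CLIP_SECONDS = 11  # length of the English test clip

# Rounded latencies as reported for each device.
factors = {
    "Intel Arc A550M": realtime_factor(CLIP_SECONDS, 160),
    "Intel Iris Xe": realtime_factor(CLIP_SECONDS, 330),
    "i7-12700H CPU": realtime_factor(CLIP_SECONDS, 600),
}

for device, factor in factors.items():
    print(f"{device}: about {factor:.0f}x realtime")
```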

Why bump telethon to 1.43.2

Telethon 1.43 adds a tmp_auth_key column to the SQLite sessions table while keeping the schema's CURRENT_VERSION = 8. Session files written by Telethon 1.43+ tooling cannot be loaded by 1.42 — TelegramClient.__init__ raises ValueError: too many values to unpack (expected 5) because 1.42 unpacks 5 columns where the row now has 6.

The bump:

  • Satisfies the existing >=1.42.0 constraint.
  • Is fully backwards compatible with older session files.
  • Lets users who migrate from other Telethon-based MCPs reuse their existing session.
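
The failure mode is easy to reproduce outside Telethon. The sketch below builds an in-memory SQLite table with six columns and unpacks a row into five names, which is the pattern described above. The column names other than `tmp_auth_key` and the placeholder values are assumptions for illustration; this is not Telethon's own code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Six columns: the five an older reader expects, plus tmp_auth_key.
conn.execute(
    "create table sessions ("
    "dc_id integer primary key, server_address text, port integer, "
    "auth_key blob, takeout_id integer, tmp_auth_key blob)"
)
conn.execute(
    "insert into sessions values (?, ?, ?, ?, ?, ?)",
    (2, "127.0.0.1", 443, b"\x00" * 8, None, None),
)
row = conn.execute("select * from sessions").fetchone()


def load_like_old_reader(row):
    """Unpack exactly five columns, as a reader written for the old layout would."""
    dc_id, server_address, port, auth_key, takeout_id = row
    return dc_id, server_address, port


try:
    load_like_old_reader(row)
    error = None
except ValueError as exc:
    # The message begins with "too many values to unpack (expected 5".
    error = str(exc)

conn.close()
```

Because the schema version number did not change, an older reader has no signal that the row shape differs, so the mismatch only shows up at unpack time.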

Backwards compatibility

  • No existing tool changed; only two new ones added.
  • No new required deps. pip install telegram-mcp (without extras) keeps the same install footprint.
  • The transcribe_voice tool degrades gracefully if no whisper extras are installed.

Test plan

  • Unit tests pass (pytest tests/test_voice_*.py → 30 passed)
  • Server boots with both voice tools registered (verified via JSON-RPC stdio probe; tool count goes from 110 to 112)
  • voice_transcription_info() returns config + active backend
  • End-to-end transcribe_voice against a real voice message (post-merge, on the user's hardware — Intel Arc A550M)

🤖 Generated with Claude Code

Two new MCP tools:
- transcribe_voice(chat_id, message_id, language=None) — downloads voice
  or audio attachment and runs Whisper locally; returns text.
- voice_transcription_info() — reports active backend/device/model.

Backends (all optional via extras, lazy-loaded):
- [voice]          → faster-whisper (universal CPU + NVIDIA CUDA)
- [voice-openvino] → OpenVINO GenAI (Intel CPU/Iris Xe iGPU/Arc dGPU)
- [voice-mlx]      → mlx-whisper (Apple Silicon M1-M4)

Auto-detect priority at startup:
  Apple Silicon → MLX → MPS
  NVIDIA CUDA   → faster-whisper → cuda
  Intel GPU     → OpenVINO → GPU.{N} (highest = discrete)
  CPU fallback  → faster-whisper (or OpenVINO if it's the only one
                  installed) with one-time "this will be slow" warning,
                  silenceable via WHISPER_WARN_CPU=false.

Config (env vars, all optional):
  WHISPER_ENABLED, WHISPER_BACKEND, WHISPER_DEVICE, WHISPER_MODEL,
  WHISPER_LANGUAGE, WHISPER_WARN_CPU, WHISPER_CACHE_DIR.

Without any voice extra installed the new tools return a clear
transcription_unavailable JSON blob — the rest of the server is
unaffected. No torch dep is added by default.

Tests:
  30 unit tests covering config parsing, hardware detection, OpenVINO
  device selection, and the unavailable-backend path.

Bench (Whisper-base, 11s English clip, this PR's OpenVINO backend):
  Intel Arc A550M  ~160 ms  (67x realtime)
  Intel Iris Xe    ~330 ms  (33x realtime)
  i7-12700H CPU    ~600 ms  (18x realtime)

Also bumps telethon to 1.43.2: 1.43 adds a tmp_auth_key column to
the SQLite sessions table while keeping CURRENT_VERSION = 8. This
caused ValueError: too many values to unpack (expected 5) on session
files written by 1.43+ tooling. 1.43.2 satisfies the existing
>=1.42.0 constraint and is fully backwards compatible.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@chigwell (Owner) left a comment

Thank you for the contribution! Could you please fix the black formatting?
