feat(voice): local Whisper transcription with pluggable backends #118
Open
Lunatik-006 wants to merge 1 commit into
Two new MCP tools:
- transcribe_voice(chat_id, message_id, language=None): downloads a voice
or audio attachment and runs Whisper locally; returns the text.
- voice_transcription_info(): reports the active backend, device, and model.
Backends (all optional via extras, lazy-loaded):
- [voice] → faster-whisper (universal CPU + NVIDIA CUDA)
- [voice-openvino] → OpenVINO GenAI (Intel CPU/Iris Xe iGPU/Arc dGPU)
- [voice-mlx] → mlx-whisper (Apple Silicon M1-M4)
Auto-detect priority at startup:
Apple Silicon → MLX → MPS
NVIDIA CUDA → faster-whisper → cuda
Intel GPU → OpenVINO → GPU.{N} (highest = discrete)
CPU fallback → faster-whisper (or OpenVINO if it's the only one
installed) with a one-time "this will be slow" warning,
silenceable via WHISPER_WARN_CPU=false.
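The priority order above can be sketched as a small pure function. This is a minimal illustration, not the PR's code: the `Hardware` dataclass, the `pick_backend` name, the backend keys, and the returned device strings are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hardware:
    """Hypothetical snapshot of what hardware detection found on the host."""
    apple_silicon: bool = False
    nvidia_cuda: bool = False
    intel_gpus: tuple = ()  # OpenVINO-style names, e.g. ("GPU.0", "GPU.1")


def pick_backend(hw: Hardware, installed: set) -> tuple:
    """Return (backend, device) following the priority list above."""
    if hw.apple_silicon and "mlx" in installed:
        return ("mlx", "mps")
    if hw.nvidia_cuda and "faster_whisper" in installed:
        return ("faster_whisper", "cuda")
    if hw.intel_gpus and "openvino" in installed:
        # The highest index is treated as the discrete GPU.
        best = max(hw.intel_gpus, key=lambda name: int(name.split(".")[1]))
        return ("openvino", best)
    # CPU fallback: prefer faster-whisper, use OpenVINO only if it is alone.
    if "faster_whisper" in installed:
        return ("faster_whisper", "cpu")
    if "openvino" in installed:
        return ("openvino", "cpu")
    return ("none", "none")
```

Keeping the decision separate from the actual hardware probing makes each branch easy to unit-test with fabricated `Hardware` values.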
Config (env vars, all optional):
WHISPER_ENABLED, WHISPER_BACKEND, WHISPER_DEVICE, WHISPER_MODEL,
WHISPER_LANGUAGE, WHISPER_WARN_CPU, WHISPER_CACHE_DIR.
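As an illustration of how these variables might be read, here is a minimal sketch. The `VoiceConfig` and `load_config` names and the accepted truthy spellings are assumptions; the defaults follow the config table in the PR description, except the `WHISPER_LANGUAGE` default, which is assumed here to be unset.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping, Optional

_TRUTHY = {"1", "true", "yes", "on"}  # assumed spellings, not confirmed by the PR


def _env_bool(env: Mapping[str, str], name: str, default: bool) -> bool:
    """Parse a boolean variable, falling back to the default when unset or empty."""
    raw = env.get(name, "").strip().lower()
    return default if raw == "" else raw in _TRUTHY


@dataclass(frozen=True)
class VoiceConfig:
    enabled: bool
    backend: str
    device: str
    model: str
    language: Optional[str]
    warn_cpu: bool
    cache_dir: Path


def load_config(env: Optional[Mapping[str, str]] = None) -> VoiceConfig:
    """Build the config from a mapping (os.environ when none is given)."""
    env = os.environ if env is None else env
    return VoiceConfig(
        enabled=_env_bool(env, "WHISPER_ENABLED", True),
        backend=env.get("WHISPER_BACKEND", "auto").strip().lower(),
        device=env.get("WHISPER_DEVICE", "auto").strip().lower(),
        model=env.get("WHISPER_MODEL", "base").strip(),
        language=env.get("WHISPER_LANGUAGE", "").strip() or None,
        warn_cpu=_env_bool(env, "WHISPER_WARN_CPU", True),
        cache_dir=Path(
            env.get("WHISPER_CACHE_DIR", "~/.cache/telegram-mcp/whisper")
        ).expanduser(),
    )
```

Passing the environment as a mapping keeps the parser testable without touching the real process environment.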
Without any voice extra installed the new tools return a clear
transcription_unavailable JSON blob — the rest of the server is
unaffected. No torch dep is added by default.
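The degraded path can be pictured roughly as follows. The helper names and every JSON key are hypothetical; only the `transcription_unavailable` identifier comes from this PR.

```python
import json
from typing import Callable, Sequence


def unavailable_blob(reason: str) -> str:
    """Build the error blob; the key names here are illustrative assumptions."""
    return json.dumps(
        {
            "error": "transcription_unavailable",
            "reason": reason,
            "hint": 'install an extra, e.g. pip install -e ".[voice]"',
        }
    )


def run_tool(available_backends: Sequence[str], transcribe: Callable[[], str]) -> str:
    """Return the transcript as JSON, or the error blob if no backend is installed."""
    if not available_backends:
        return unavailable_blob("no voice extra installed")
    return json.dumps({"text": transcribe()})
```

Returning a structured blob instead of raising keeps the tool callable in every install, so a client can show the hint rather than a stack trace.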
Tests:
30 unit tests covering config parsing, hardware detection, OpenVINO
device selection, and the unavailable-backend path.
Bench (Whisper-base, 11s English clip, this PR's OpenVINO backend):
Intel Arc A550M ~160 ms (67x realtime)
Intel Iris Xe ~330 ms (33x realtime)
i7-12700H CPU ~600 ms (18x realtime)
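The "x realtime" figures are the clip length divided by the processing time. A small sketch of that arithmetic, using the rounded latencies quoted above (the function name is made up for the example):

```python
def realtime_factor(clip_seconds: float, latency_ms: float) -> float:
    """How many seconds of audio are processed per second of wall-clock time."""
    return clip_seconds * 1000.0 / latency_ms


# Rounded latencies from the bench above, for the 11 s clip.
for device, latency_ms in [
    ("Intel Arc A550M", 160),
    ("Intel Iris Xe", 330),
    ("i7-12700H CPU", 600),
]:
    print(f"{device}: about {realtime_factor(11.0, latency_ms):.0f}x realtime")
```

Because the quoted latencies are approximate, the results land close to, but not exactly on, the figures in the bench (for example 11 s / 0.6 s is roughly 18x).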
Also bumps telethon to 1.43.2: 1.43 adds a tmp_auth_key column to
the SQLite sessions table while keeping CURRENT_VERSION = 8. This
caused ValueError: too many values to unpack (expected 5) on session
files written by 1.43+ tooling. 1.43.2 satisfies the existing
>=1.42.0 constraint and is fully backwards compatible.
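The failure is plain tuple unpacking: code written for a five-column row receives six values. The sketch below reproduces the error class with fabricated rows; the column names and values are assumptions for illustration and are not copied from Telethon.

```python
# Hypothetical session rows: five columns before 1.43, six afterwards.
ROW_1_42 = (2, "149.154.167.51", 443, b"auth-key", None)
ROW_1_43 = ROW_1_42 + (b"tmp-auth-key",)  # extra tmp_auth_key column


def read_row_like_1_42(row: tuple) -> int:
    """Unpack exactly five columns, as a pre-1.43 loader would."""
    dc_id, server_address, port, auth_key, takeout_id = row
    return dc_id


def try_read(row: tuple) -> str:
    """Return "ok" or the error text, without letting the exception escape."""
    try:
        read_row_like_1_42(row)
        return "ok"
    except ValueError as exc:
        return f"ValueError: {exc}"


print(try_read(ROW_1_42))
print(try_read(ROW_1_43))
```

Since the schema version number did not change, a version check alone cannot catch the extra column, which is why the mismatch surfaces only at unpack time.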
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
chigwell approved these changes on May 4, 2026.
chigwell (Owner) requested changes on May 4, 2026 and left a comment:
Thank you for the contribution! Could you please fix the black formatting?
Summary
Adds local voice/audio transcription via Whisper. Audio never leaves the host.
Two new tools registered with the FastMCP server:
- transcribe_voice(chat_id, message_id, language=None, account=None): downloads a voice or audio attachment from a message and returns the transcribed text.
- voice_transcription_info(): reports the active backend, device, and config (useful for debugging hardware selection).
The whole feature is opt-in via optional extras: without any of them installed the tools return a clean transcription_unavailable JSON blob and nothing else changes. No torch dependency is added by default.
Backend strategy
Three pluggable backends, selected by env or auto-detected at startup:
- pip install -e ".[voice]" → faster-whisper (CTranslate2)
- pip install -e ".[voice-openvino]" → OpenVINO GenAI
- pip install -e ".[voice-mlx]" → mlx-whisper
Auto-detect priority (when WHISPER_BACKEND=auto, the default):
- Apple Silicon → MLX
- NVIDIA CUDA → faster-whisper
- Intel GPU → OpenVINO on GPU.{N} (highest-indexed → typically the discrete dGPU)
- CPU fallback, with a one-time whisper_cpu_only warning, silenceable via WHISPER_WARN_CPU=false.
Config (env vars)
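| Variable | Default | Values / notes |
| --- | --- | --- |
| WHISPER_ENABLED | true | false to disable transcription tools entirely. |
| WHISPER_BACKEND | auto | auto / faster_whisper / openvino / mlx. |
| WHISPER_DEVICE | auto | auto / cpu / cuda / gpu / gpu.0 / gpu.1 … |
| WHISPER_MODEL | base | tiny / base / small / medium / large-v3 / large-v3-turbo |
| WHISPER_LANGUAGE | | Language code (ru, en, …). |
| WHISPER_WARN_CPU | true | |
| WHISPER_CACHE_DIR | ~/.cache/telegram-mcp/whisper | |
Module layout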
Backends are imported lazily — users only pay for the deps they install.
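A lazy-loading pattern of this kind could look like the sketch below. The `load_backend` helper and the `BACKEND_MODULES` mapping are hypothetical, and the import names listed for each backend are assumptions about the packages' top-level modules, not taken from this PR.

```python
import importlib
import importlib.util
from types import ModuleType
from typing import Mapping

# Assumed top-level import names for each backend's package.
BACKEND_MODULES = {
    "faster_whisper": "faster_whisper",
    "openvino": "openvino_genai",
    "mlx": "mlx_whisper",
}


def load_backend(name: str, modules: Mapping[str, str] = BACKEND_MODULES) -> ModuleType:
    """Import a backend's package only when it is actually requested."""
    module_name = modules[name]
    # find_spec checks availability without importing the (heavy) package.
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(f"backend {name!r} needs the {module_name!r} package")
    return importlib.import_module(module_name)
```

Nothing is imported at module load time, so a server started without any voice extra never touches the heavy dependencies.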
Tests
tests/test_voice_{config,detect,facade}.py: 30 unit tests covering config parsing, hardware detection, OpenVINO device selection, and the unavailable-backend path.
All 30 pass on Windows / Python 3.13. Existing test suite is untouched.
Performance reference
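Whisper-base, 11 s English clip, OpenVINO backend (this PR), best of 3 runs:
Intel Arc A550M ~160 ms (67x realtime)
Intel Iris Xe ~330 ms (33x realtime)
i7-12700H CPU ~600 ms (18x realtime)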
Why bump telethon to 1.43.2
Telethon 1.43 adds a tmp_auth_key column to the SQLite sessions table while keeping the schema's CURRENT_VERSION = 8. Session files written by Telethon 1.43+ tooling cannot be loaded by 1.42: TelegramClient.__init__ raises ValueError: too many values to unpack (expected 5) because 1.42 unpacks 5 columns where the row now has 6.
The bump:
- satisfies the existing >=1.42.0 constraint.
- is fully backwards compatible.
Backwards compatibility
- pip install telegram-mcp (without extras) keeps the same install footprint.
- The transcribe_voice tool degrades gracefully if no whisper extras are installed.
Test plan
- pytest tests/test_voice_*.py → 30 passed
- voice_transcription_info() returns config + active backend
- transcribe_voice against a real voice message (post-merge, on the user's hardware: Intel Arc A550M)
🤖 Generated with Claude Code