feat(voice): local Whisper transcription with pluggable backends by Lunatik-006 · Pull Request #118 · chigwell/telegram-mcp

Lunatik-006 · 2026-05-04T08:36:38Z

Summary

Adds local voice/audio transcription via Whisper. Audio never leaves the host.

Two new tools registered with the FastMCP server:

transcribe_voice(chat_id, message_id, language=None, account=None) — downloads a voice or audio attachment from a message and returns the transcribed text.
voice_transcription_info() — reports the active backend, device, and config (useful for debugging hardware selection).

The whole feature is opt-in via optional extras — without any of them installed the tools return a clean transcription_unavailable JSON blob and nothing else changes. No torch dependency is added by default.
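
As a rough illustration of the opt-in behaviour described above, the check below probes for the three backend packages and builds an error payload when none is importable. The import names `faster_whisper`, `openvino_genai` and `mlx_whisper` are the packages' usual module names; the helper names and the exact JSON shape (including the `hint` field) are assumptions for this sketch, not the PR's actual code.

```python
import importlib.util
import json

# Extra name -> top-level module that the extra is expected to provide.
BACKEND_MODULES = {
    "voice": "faster_whisper",
    "voice-openvino": "openvino_genai",
    "voice-mlx": "mlx_whisper",
}


def installed_extras(find_spec=importlib.util.find_spec):
    """Return the extras whose backend module can be located, without importing it."""
    return [
        extra
        for extra, module in BACKEND_MODULES.items()
        if find_spec(module) is not None
    ]


def unavailable_response(find_spec=importlib.util.find_spec):
    """Return None when a backend is installed, else a JSON error blob (hypothetical shape)."""
    if installed_extras(find_spec):
        return None
    return json.dumps(
        {
            "error": "transcription_unavailable",
            "hint": 'install one of: pip install -e ".[voice]", ".[voice-openvino]", ".[voice-mlx]"',
        }
    )
```

Because `find_spec` only locates a module and does not execute it, a probe like this would not pull in heavy dependencies just to answer "is anything installed?".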

Backend strategy

Three pluggable backends, selected by env or auto-detected at startup:

| Hardware | Extras | Backend |
| --- | --- | --- |
| Universal CPU on any OS, NVIDIA CUDA | `pip install -e ".[voice]"` | `faster-whisper` (CTranslate2) |
| Intel CPU / Iris Xe iGPU / Arc dGPU | `pip install -e ".[voice-openvino]"` | OpenVINO GenAI |
| Apple Silicon (M1/M2/M3/M4) | `pip install -e ".[voice-mlx]"` | `mlx-whisper` |

Auto-detect priority (when WHISPER_BACKEND=auto, the default):

1. Apple Silicon → MLX → MPS
2. NVIDIA CUDA → faster-whisper → cuda
3. Intel discrete/integrated GPU → OpenVINO → GPU.{N} (the highest index, which is typically the discrete GPU)
4. CPU fallback via faster-whisper (or OpenVINO if it's the only one installed) with a one-time whisper_cpu_only warning, silenceable via WHISPER_WARN_CPU=false.
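
The priority order above can be sketched as a pure function over detected hardware and installed backends. The `Hardware` dataclass, the `pick_backend` name, and the choice to fall through to the next rule when hardware is present but its extra is missing are all assumptions made for illustration; the PR's `detect.py` may behave differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hardware:
    """Hypothetical summary of what hardware detection found."""
    apple_silicon: bool = False
    nvidia_cuda: bool = False
    intel_gpu_indices: tuple = ()  # e.g. (0, 1) for GPU.0 and GPU.1


def pick_backend(hw, installed):
    """Return (backend, device) following the documented priority, or None."""
    if hw.apple_silicon and "mlx" in installed:
        return ("mlx", "MPS")
    if hw.nvidia_cuda and "faster_whisper" in installed:
        return ("faster_whisper", "cuda")
    if hw.intel_gpu_indices and "openvino" in installed:
        # The highest index is typically the discrete GPU.
        return ("openvino", f"GPU.{max(hw.intel_gpu_indices)}")
    # CPU fallback: prefer faster-whisper, use OpenVINO only if it is all we have.
    if "faster_whisper" in installed:
        return ("faster_whisper", "cpu")
    if "openvino" in installed:
        return ("openvino", "CPU")
    return None
```

For example, `pick_backend(Hardware(intel_gpu_indices=(0, 1)), {"openvino"})` selects OpenVINO on `GPU.1`, while the same call with no GPU indices falls back to the CPU.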

Config (env vars)

| Var | Default | Meaning |
| --- | --- | --- |
| `WHISPER_ENABLED` | `true` | Set to `false` to disable transcription tools entirely. |
| `WHISPER_BACKEND` | `auto` | `auto` / `faster_whisper` / `openvino` / `mlx`. |
| `WHISPER_DEVICE` | `auto` | `auto` / `cpu` / `cuda` / `gpu` / `gpu.0` / `gpu.1` … |
| `WHISPER_MODEL` | `base` | `tiny` / `base` / `small` / `medium` / `large-v3` / `large-v3-turbo` |
| `WHISPER_LANGUAGE` | (unset = auto-detect) | ISO 639-1 code (`ru`, `en`, …). |
| `WHISPER_WARN_CPU` | `true` | Log a one-time warning when running CPU-only. |
| `WHISPER_CACHE_DIR` | `~/.cache/telegram-mcp/whisper` | Where weights are downloaded. |
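
A minimal sketch of how these variables could be read into a config object. The variable names and defaults come from the table; the field names, the `from_env` constructor, and the set of accepted "false" spellings are assumptions, since the PR only states that `WhisperConfig` is built from env vars.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Optional

_FALSE_VALUES = {"false", "0", "no", "off"}  # assumed aliases for "false"


def _env_bool(env, name, default):
    """Read a boolean variable; anything not in _FALSE_VALUES counts as true."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() not in _FALSE_VALUES


@dataclass(frozen=True)
class WhisperConfig:
    enabled: bool
    backend: str
    device: str
    model: str
    language: Optional[str]
    warn_cpu: bool
    cache_dir: Path

    @classmethod
    def from_env(cls, env=None):
        """Build the config from a mapping (defaults to os.environ)."""
        env = os.environ if env is None else env
        return cls(
            enabled=_env_bool(env, "WHISPER_ENABLED", True),
            backend=env.get("WHISPER_BACKEND", "auto").strip().lower(),
            device=env.get("WHISPER_DEVICE", "auto").strip().lower(),
            model=env.get("WHISPER_MODEL", "base").strip(),
            # Unset or empty means "let Whisper auto-detect the language".
            language=(env.get("WHISPER_LANGUAGE") or "").strip() or None,
            warn_cpu=_env_bool(env, "WHISPER_WARN_CPU", True),
            cache_dir=Path(
                env.get("WHISPER_CACHE_DIR", "~/.cache/telegram-mcp/whisper")
            ).expanduser(),
        )
```

Passing a plain dict instead of `os.environ` keeps a parser like this easy to unit-test without touching the process environment.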

Module layout

telegram_mcp/voice/
├── __init__.py              # public facade: transcribe(...), get_backend_info()
├── config.py                # WhisperConfig from env vars
├── detect.py                # hardware/backend auto-detection
└── backends/
    ├── base.py              # WhisperBackend ABC
    ├── faster_whisper.py    # CPU/CUDA via CTranslate2
    ├── openvino.py          # Intel via OpenVINO GenAI
    └── mlx.py               # Apple Silicon via MLX
telegram_mcp/tools/voice.py  # MCP tools transcribe_voice + voice_transcription_info

Backends are imported lazily — users only pay for the deps they install.
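
One common way to get this kind of lazy loading is a registry of dotted paths that are only imported when a backend is actually requested. The sketch below assumes that approach; the class names in the registry (`FasterWhisperBackend` and so on), the `transcribe` signature, and `load_backend` are hypothetical and not taken from the PR.

```python
import importlib
from abc import ABC, abstractmethod


class WhisperBackend(ABC):
    """Minimal stand-in for the backend interface (assumed signature)."""

    @abstractmethod
    def transcribe(self, audio_path, language=None):
        """Return the transcribed text for the given audio file."""


# Hypothetical registry: backend name -> "module path:class name".
_BACKENDS = {
    "faster_whisper": "telegram_mcp.voice.backends.faster_whisper:FasterWhisperBackend",
    "openvino": "telegram_mcp.voice.backends.openvino:OpenVINOBackend",
    "mlx": "telegram_mcp.voice.backends.mlx:MLXBackend",
}


def load_backend(name, registry=_BACKENDS):
    """Import the backend module on first use and return its class."""
    module_path, _, class_name = registry[name].partition(":")
    # The heavy dependency is only imported here, not at server startup.
    module = importlib.import_module(module_path)
    return getattr(module, class_name)
```

With this layout, importing the registry itself costs nothing; an `ImportError` only surfaces when a backend whose extra is missing is explicitly requested.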

Tests

tests/test_voice_{config,detect,facade}.py — 30 unit tests covering:

env-driven config parsing (defaults, overrides, alias handling)
hardware detection / backend auto-pick (Apple Silicon, CUDA, Intel GPU, CPU fallback)
OpenVINO device selection (specific GPU.N, generic GPU → highest index, no GPU → CPU)
explicit backend pin without the corresponding extra installed → clean error
disabled-via-env path
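
The OpenVINO device-selection cases listed above (specific `GPU.N`, generic `GPU` → highest index, no GPU → CPU) can be sketched as a function over a list of device names. This is an assumption-laden illustration: the function name is made up, a bare `GPU` entry is treated as index 0, and falling back to CPU when a specific `GPU.N` is missing is a guess rather than documented behaviour.

```python
import re

_GPU = re.compile(r"GPU(?:\.(\d+))?")


def _gpu_index(name):
    """Return the GPU index for names like 'GPU' or 'GPU.1', else None."""
    match = _GPU.fullmatch(name.strip().upper())
    if match is None:
        return None
    return int(match.group(1) or 0)


def select_openvino_device(requested, available):
    """Map a requested device ('auto', 'cpu', 'gpu', 'gpu.1') to an available one."""
    gpus = {}
    for name in available:
        index = _gpu_index(name)
        if index is not None:
            gpus[index] = name

    wanted = requested.strip().upper()
    if wanted == "CPU" or not gpus:
        return "CPU"

    match = _GPU.fullmatch(wanted)
    if match and match.group(1) is not None:
        # Specific GPU.N requested: honour it if present (CPU fallback is assumed).
        return gpus.get(int(match.group(1)), "CPU")

    # Generic 'GPU' or 'AUTO': pick the highest index, typically the discrete GPU.
    return gpus[max(gpus)]
```

For instance, with `available = ["CPU", "GPU.0", "GPU.1"]`, a request for `gpu` resolves to `GPU.1` and a request for `gpu.0` stays on `GPU.0`.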

All 30 pass on Windows / Python 3.13. Existing test suite is untouched.

Performance reference

Whisper-base, 11 s English clip, OpenVINO backend (this PR), best of 3 runs:

| Device | Inference | × realtime |
| --- | --- | --- |
| Intel Arc A550M (dGPU) | ~160 ms | 67× |
| Intel Iris Xe (iGPU) | ~330 ms | 33× |
| i7-12700H, 20 threads (OpenVINO CPU) | ~600 ms | 18× |
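
For readers unfamiliar with the "× realtime" column, it is presumably the clip duration divided by the inference time. The helper below is a small sketch of that arithmetic using the rounded figures from the table; because both the 11 s clip length and the timings are approximate, the recomputed Arc value comes out slightly above the reported 67×.

```python
def realtime_factor(clip_seconds, inference_ms):
    """How many times faster than realtime the transcription ran."""
    return clip_seconds / (inference_ms / 1000.0)


# Rounded inputs taken from the table above.
arc = realtime_factor(11, 160)      # about 68.8
iris_xe = realtime_factor(11, 330)  # about 33.3
cpu = realtime_factor(11, 600)      # about 18.3
```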

Why bump `telethon` to 1.43.2

Telethon 1.43 adds a tmp_auth_key column to the SQLite sessions table while keeping the schema's CURRENT_VERSION = 8. Session files written by Telethon 1.43+ tooling cannot be loaded by 1.42 — TelegramClient.__init__ raises ValueError: too many values to unpack (expected 5) because 1.42 unpacks 5 columns where the row now has 6.
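
The failure mode can be reproduced in isolation with `sqlite3`: a row with six columns cannot be unpacked into five names. The snippet below is a standalone sketch, not Telethon's code; the first five column names are assumed to match Telethon's `sessions` table, `tmp_auth_key` comes from the description above, and the address is a documentation placeholder.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Six-column layout, as written by newer tooling (column names partly assumed).
conn.execute(
    "create table sessions ("
    "dc_id integer primary key, server_address text, port integer, "
    "auth_key blob, takeout_id integer, tmp_auth_key blob)"
)
conn.execute(
    "insert into sessions values (?, ?, ?, ?, ?, ?)",
    (2, "203.0.113.1", 443, b"", None, None),
)
row = conn.execute("select * from sessions").fetchone()


def load_five_columns(session_row):
    """Mimic an older reader that expects exactly five columns."""
    dc_id, server_address, port, auth_key, takeout_id = session_row
    return dc_id, server_address, port


try:
    load_five_columns(row)
    error_message = None
except ValueError as exc:
    # Message begins with "too many values to unpack".
    error_message = str(exc)
```

Because the schema version number did not change, an older reader has no signal that the row shape differs, which is why the error shows up as a bare unpacking failure rather than a migration message.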

The bump:

Satisfies the existing >=1.42.0 constraint.
Is fully backwards compatible with older session files.
Lets users who migrate from other Telethon-based MCPs reuse their existing session.

Backwards compatibility

No existing tool changed; only two new ones added.
No new required deps. pip install telegram-mcp (without extras) keeps the same install footprint.
The transcribe_voice tool degrades gracefully if no whisper extras are installed.

Test plan

Unit tests pass (pytest tests/test_voice_*.py → 30 passed)
Server boots with the voice tool registered (verified via JSON-RPC stdio probe; tool count goes from 110 → 112)
voice_transcription_info() returns config + active backend
End-to-end transcribe_voice against a real voice message (post-merge, on the user's hardware — Intel Arc A550M)

🤖 Generated with Claude Code

Two new MCP tools:
- transcribe_voice(chat_id, message_id, language=None): downloads a voice or audio attachment and runs Whisper locally; returns text.
- voice_transcription_info(): reports active backend/device/model.

Backends (all optional via extras, lazy-loaded):
- [voice] → faster-whisper (universal CPU + NVIDIA CUDA)
- [voice-openvino] → OpenVINO GenAI (Intel CPU/Iris Xe iGPU/Arc dGPU)
- [voice-mlx] → mlx-whisper (Apple Silicon M1-M4)

Auto-detect priority at startup:
- Apple Silicon → MLX → MPS
- NVIDIA CUDA → faster-whisper → cuda
- Intel GPU → OpenVINO → GPU.{N} (highest = discrete)
- CPU fallback → faster-whisper (or OpenVINO if it's the only one installed) with one-time "this will be slow" warning, silenceable via WHISPER_WARN_CPU=false.

Config (env vars, all optional): WHISPER_ENABLED, WHISPER_BACKEND, WHISPER_DEVICE, WHISPER_MODEL, WHISPER_LANGUAGE, WHISPER_WARN_CPU, WHISPER_CACHE_DIR.

Without any voice extra installed the new tools return a clear transcription_unavailable JSON blob; the rest of the server is unaffected. No torch dep is added by default.

Tests: 30 unit tests covering config parsing, hardware detection, OpenVINO device selection, and the unavailable-backend path.

Bench (Whisper-base, 11s English clip, this PR's OpenVINO backend):
- Intel Arc A550M ~160 ms (67x realtime)
- Intel Iris Xe ~330 ms (33x realtime)
- i7-12700H CPU ~600 ms (18x realtime)

Also bumps telethon to 1.43.2: 1.43 adds a tmp_auth_key column to the SQLite sessions table while keeping CURRENT_VERSION = 8. This caused ValueError: too many values to unpack (expected 5) on session files written by 1.43+ tooling. 1.43.2 satisfies the existing >=1.42.0 constraint and is fully backwards compatible.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chigwell

Thank you for the contribution! Could you please fix the black formatting?

chigwell approved these changes May 4, 2026

chigwell requested changes May 4, 2026

Lunatik-006 wants to merge 1 commit into chigwell:main from Lunatik-006:feat/voice-transcription
