Skip to content

KiranRajeev-KV/vox

Repository files navigation

Vox

Press a key. Speak. Text appears.

A local, offline, GPU-accelerated voice-to-text tool for Linux. Audio never leaves your machine. LLM cleanup runs locally by default, or via any OpenAI-compatible cloud provider you configure.

FeaturesInstallationUsageConfigurationBenchmarksDevelopment


What Is Vox?

Vox is a personal voice-to-text tool for Linux. Press a hotkey, speak naturally, and your words are transcribed and pasted directly into whatever window is focused — your editor, browser, terminal, notes app, anything.

It runs Whisper locally on your GPU, cleans up your speech with an LLM of your choice, and gets out of your way. No accounts, no subscriptions, no telemetry. Inspired by Wispr Flow and Superwhisper.


Features

  • Global hotkey — configurable single keys (f9, space) and modifier combos (ctrl+shift+space), works across all apps
  • Two recording modes — toggle (press once to start, press again to stop) or push-to-talk (hold to record)
  • Local transcription — faster-whisper with Whisper large-v3, runs on your GPU via CUDA
  • Voice Activity Detection — Silero VAD v6 bundled in faster-whisper, no extra dependency
  • Smart cleanup — filler word removal and grammar cleanup via any OpenAI-compatible LLM (Ollama, Groq, OpenRouter, OpenAI, etc.)
  • Optimistic paste — raw text appears immediately, LLM cleanup runs silently in the background and replaces the text when ready
  • Context-aware replace — skips LLM text replacement in terminals and editors where synthetic keyboard input is unsafe, with configurable blacklist and stale timeout
  • Clipboard fallback — if direct paste fails, copies to clipboard and notifies you
  • Recording indicator — thin bar at the top of the screen while the mic is live (tkinter, click-through DOCK window)
  • Sound cues — subtle start/stop beeps with configurable volume
  • Full history — every transcription saved to SQLite with FTS5 full-text search
  • Personal dictionary — correct Whisper's persistent misrecognitions (say "vox dict add ")
  • Web UI — browse and search your transcription history in a minimal Flask interface
  • CLI toolsvox history, vox search, vox stats, vox dict from the command line
  • Real-time streaming — optional CarelessWhisper backend shows partial text on the indicator as you speak
  • Privacy-first — audio never leaves your machine. Transcribed text stays local when using a local LLM; sent only to the configured endpoint if using a cloud provider

Requirements

System

  • Arch Linux (or any systemd-based Linux)
  • X11 (Wayland not supported — xdotool requires X11)
  • NVIDIA GPU with CUDA (recommended — CPU fallback works but is slower)
  • uv for Python environment management

System Packages

sudo pacman -S xdotool libnotify xprop

LLM Provider (Optional)

Vox works without an LLM — you'll get raw Whisper output with filler word removal. For grammar cleanup, configure any OpenAI-compatible endpoint:

Local (Ollama):

ollama pull phi3-mini

Cloud providers — no local install needed, just configure base_url + model in config.toml:

Provider base_url Example model
Ollama (local) http://localhost:11434/v1 phi3-mini
Groq https://api.groq.com/openai/v1 llama-3.1-8b-instant
OpenRouter https://openrouter.ai/api/v1 google/gemini-flash-1.5
OpenAI https://api.openai.com/v1 gpt-4o-mini

Streaming Transcription (Optional)

Vox supports real-time streaming transcription via CarelessWhisper, showing partial text on the indicator bar as you speak. This requires an NVIDIA GPU with CUDA 13 support and ~2.5GB VRAM for the small model.

just setup-streaming   # clone WhisperRT-Streaming, sync deps, login to HF

Then enable in config.toml:

[streaming]
enabled = true
model = "small"           # base, small, large-v2
chunk_size_ms = 300       # decode frequency
beam_size = 0             # 0 = greedy, 1+ = beam search

When streaming is enabled, the indicator bar expands to 40px with 12px font to display live partial transcriptions. The indicator collapses back to normal height when idle.

Offline fallback: If streaming fails to start, Vox falls back to the standard faster-whisper offline transcription automatically.


Installation

git clone https://github.com/KiranRajeev-KV/vox.git
cd vox
uv sync
uv sync --group dev          # install dev tools (optional)
cp config.toml.example config.toml
just install-hooks           # set up pre-commit hooks

Edit config.toml to configure your hotkey, transcription model, and LLM provider.


Usage

Start Vox

just run

# Or directly
uv run main.py
uv run main.py --debug       # verbose logging to stderr
uv run main.py --version     # print version and exit

Record

  1. Press F9 (or your configured hotkey) to start recording
  2. Speak naturally
  3. Press F9 again to stop (toggle mode) or release the key (push-to-talk)
  4. Text appears in the focused window

A thin red bar at the top of your screen indicates the mic is live. You'll hear a subtle beep on start and stop.

History & Search

uv run main.py history                    # browse all transcriptions
uv run main.py history --search "query"   # full-text search (FTS5)
uv run main.py stats                      # usage statistics

Web UI

uv run main.py web                        # launch Flask UI on port 8080

Browse your transcription history, search, and view individual sessions in a minimal web interface.


Configuration

All settings live in config.toml:

[hotkey]
trigger = "f9"                    # f1-f20, ctrl, alt, shift, space, etc.
mode = "toggle"                   # "toggle" or "push_to_talk"

[audio]
device = ""                       # "" = default, or integer index
sample_rate = 16000
channels = 1

[transcription]
model = "large-v3-turbo"              # large-v3, large-v3-turbo, medium, small, tiny
device = "cuda"                   # "cuda" or "cpu"
compute_type = "int8"             # "int8", "float16", "float32"
language = ""                     # "" = auto-detect, or "en", "ml", "hi"
vad = true

[llm]
enabled = false                   # set true to enable LLM cleanup
base_url = "http://localhost:11434/v1"
api_key = ""                      # or set VOX_LLM_API_KEY env var
model = "phi3-mini"
timeout_seconds = 10

[output]
method = "xdotool"
fallback_to_clipboard = true
replace_strategy = "select"       # "select" (overwrite), "append", "skip"
replace_timeout_seconds = 5
replace_blacklist = ["Alacritty", "kitty", "st", "Vim", "nvim", "Code"]

[indicator]
style = "bar"
position = "top"
height = 4
width = "200"                     # "full" or pixel value
color = "#ff4444"
opacity = 0.9

[history]
enabled = true
db_path = "~/.local/share/vox/history.db"
max_entries = 10000

[sounds]
enabled = true
start_sound = "assets/start.wav"
stop_sound = "assets/stop.wav"
volume = 0.7

Full config reference: see config.toml.example.


GPU / VRAM Notes

Tested on RTX 4050 (6GB VRAM):

Component VRAM When Active
Whisper large-v3 (int8) ~2.5GB During transcription
Local LLM (e.g. phi3-mini) ~2.3GB During LLM cleanup only
Both simultaneously Never Pipeline is sequential

When using a local provider (Ollama), Vox automatically passes keep_alive=0 so the model unloads from VRAM immediately after inference, preventing contention with Whisper. Cloud providers use no local VRAM for the LLM.

Low on VRAM? Use a smaller Whisper model:

[transcription]
model = "medium"

Or disable LLM entirely:

[llm]
enabled = false

Architecture

image

Threading model: The critical path (transcribe → process → paste) runs inline on the Main thread for offline transcription. With streaming enabled, a dedicated CWWorkerThread handles GPU decode to avoid blocking the sounddevice callback.

Daemon threads:

  • HotkeyThread — pynput listener, sends start/stop/shutdown to control_queue
  • IndicatorThread — tkinter mainloop, polls indicator_queue every 50ms
  • LLMThread — spawned per-session after paste, runs cleanup, dies on completion
  • StreamerThread — polls transcriber for partial updates, drives indicator text
  • CWWorkerThread — CarelessWhisper GPU decode loop (streaming only)

State machine: IDLE → RECORDING → TRANSCRIBING → PROCESSING → PASTING → IDLE

LLM cleanup runs asynchronously alongside IDLE — a new recording can start while cleanup is in progress.

Streaming state machine: IDLE → RECORDING (streaming) → PASTING → IDLE — transcription runs inline via the worker thread, bypassing TRANSCRIBING and PROCESSING states for lower latency.


Benchmarks

Vox includes a WER (Word Error Rate) regression benchmark suite that tests multiple Whisper models against LibriSpeech test-clean samples. Results are streamed via HuggingFace — no full datasets downloaded.

Running Benchmarks

# Quick benchmark (4 models)
just benchmark

# Full benchmark (12 models)
just benchmark-full

# Specific dataset
just benchmark-single clean

Results

image

The benchmark suite tests 12 Whisper models including large-v3, large-v3-turbo, medium, small, tiny, base, large-v1, large-v2, and distilled variants. Metrics tracked:

Metric Description
WER (mean/median/p95) Word Error Rate across samples
Latency (mean/median/p95) Transcription time per sample
LLM delta WER(raw) − WER(cleaned) — positive = LLM helped
Paste success rate Logged in SQLite stats table

Results are appended to tests/benchmark_results.jsonl and charts are generated in tests/benchmark_charts/.


Project Structure

vox/
├── config.toml              # Active configuration
├── config.toml.example      # Full config reference with docs
├── pyproject.toml           # Dependencies and metadata
├── justfile                 # Task runner commands
├── pyrightconfig.json       # Strict type checking config
├── .pre-commit-config.yaml  # ruff + pyright hooks
├── main.py                  # CLI entry point, logging, history/stats/web
│
├── vox/
│   ├── __init__.py          # __version__ = "0.1.0"
│   ├── config.py            # TOML loader, frozen dataclass settings
│   ├── pipeline.py          # State machine, thread wiring, main loop
│   ├── hotkey.py            # pynput listener, toggle + push-to-talk
│   ├── recorder.py          # sounddevice audio capture
│   ├── transcriber.py       # Factory function, delegates to backends
│   ├── transcriber_base.py  # ABC: shared interface and types
│   ├── transcriber_fw.py    # Faster-whisper backend (offline)
│   ├── transcriber_cw.py    # CarelessWhisper backend (streaming)
│   ├── transcriber_cw_decode.py  # Decoding options for CW
│   ├── processor.py         # Filler words + LLM cleanup
│   ├── output.py            # OutputBackend abstraction, xdotool + clipboard
│   ├── indicator.py         # tkinter thin bar overlay + live text
│   ├── sounds.py            # WAV loading, playback, tone generation
│   ├── history.py           # SQLite + FTS5 full-text search
│   ├── dictionary.py        # Personal vocabulary corrections
│   └── web.py               # Flask web UI for history
│
├── tests/
│   ├── conftest.py          # Shared fixtures
│   ├── test_transcriber.py  # Factory function tests
│   ├── test_transcriber_base.py  # ABC contract tests
│   ├── test_transcriber_fw.py   # Faster-whisper backend tests
│   ├── test_transcriber_cw.py   # CarelessWhisper streaming tests
│   ├── test_pipeline.py     # Pipeline state machine tests
│   ├── test_indicator.py    # Indicator window tests
│   ├── test_integration.py  # End-to-end pipeline tests
│   ├── test_benchmark.py    # WER regression benchmarks
│   └── benchmark_charts/    # Generated PNG charts
│
├── third_party/
│   └── WhisperRT-Streaming/  # CarelessWhisper model (git submodule)
│
└── assets/
    ├── start.wav            # Recording start sound cue
    └── stop.wav             # Recording stop sound cue

Development

Tech Stack

Layer Technology
Language Python 3.12
Package manager uv
Transcription faster-whisper + Whisper large-v3
Audio capture sounddevice
Hotkeys pynput
LLM client openai (universal OpenAI-compatible client)
Clipboard pyperclip
Web UI Flask
Database SQLite + FTS5
Linting/formatting ruff
Type checking pyright
Testing pytest + jiwer
Task runner just

Common Commands

just run          # start Vox
just debug        # start with verbose logging
just test         # run all tests
just lint         # ruff check
just fmt          # ruff format (auto-fix)
just typecheck    # pyright strict check
just benchmark    # quick WER regression test
just deadcode     # vulture dead code detection
just web          # launch Flask history UI
just setup-streaming  # install CarelessWhisper streaming deps

Pre-commit Hooks

Hooks run ruff and pyright automatically on every commit. A commit that fails either check is rejected.

just install-hooks   # first-time setup

Testing

uv sync --group dev              # install dev dependencies

uv run pytest                    # run all tests
uv run pytest tests/ -v --tb=short  # verbose
uv run pytest tests/test_transcriber.py  # single module

# Marked tests
uv run pytest -m slow            # integration tests (real audio)
uv run pytest -m benchmark       # benchmark suite

Test layers:

  • Unit tests — each module in isolation, mocked audio/GPU/xdotool (~3,000+ lines)
  • Integration tests — real WhisperModel with real LibriSpeech audio streaming
  • Benchmark tests — WER regression across 12 Whisper models with chart generation

Troubleshooting

Problem Solution
Hotkey not working Make sure Vox is running as your user, not root. Check for hotkey conflicts with your WM.
Paste not working in some apps Electron apps intercept keyboard events. Vox falls back to clipboard automatically — press Ctrl+V.
Text replacement misfires in terminal/Vim Add the window class to replace_blacklist in config.toml. Find class: xdotool getactivewindow getwindowclassname
LLM is slow on first use (local) Vox sends a warmup ping on startup to pre-load the model. Wait a few seconds after launch before your first recording. Cloud providers don't have this issue.
Out of VRAM Use model = "medium" or model = "small", or set [llm] enabled = false
Mic not found Check arecord -l for available devices. Set device = <index> in [audio] section.

Roadmap

See PLAN.md for the full feature roadmap and build order.

Planned

  • Idea expansion mode (LLM expands rough voice notes into structured writeups)
  • Custom vocabulary / personal dictionary (post-correct Whisper misheard words)
  • Snippets / voice shortcuts (say a cue, expand to full text)
  • Transcription search UI (minimal GUI or rofi integration)
  • Per-app profiles (different post-processing per active window)
  • Multilingual support (Malayalam, Hindi — Whisper handles this natively)
  • WhisperX two-mode architecture (faster-whisper for quick dictation, WhisperX for long-form with word-level timestamps)

License

MIT

About

Local, offline, GPU-accelerated voice-to-text for Linux. Press a hotkey, speak, text appears. Audio never leaves your machine.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages