A multifunctional AI Voice Assistant that integrates a local LLM (Ollama), Speech-to-Text (Whisper), and Text-to-Speech (VoiceVox) to provide a seamless voice interaction experience. It supports information retrieval via web search, application launching, and media playback control.
This project is a Python-based voice assistant designed to run locally on Windows. It features a GUI for visual feedback and a robust backend for handling voice commands. The assistant can:
- Understand natural language queries in Japanese.
- Perform hybrid searches (Wikipedia + DuckDuckGo + Specialized Sites like Qiita/Zenn).
- Launch local applications (Notepad, Calculator, Browser, etc.).
- Speak responses using high-quality TTS (VoiceVox).
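For illustration, here is one way the "hybrid search" merging could work: interleave hits from each backend (Wikipedia, DuckDuckGo, specialized sites) and drop duplicate URLs. The helper name and result shape below are a sketch, not the repo's actual implementation:

```python
from itertools import zip_longest

def merge_search_results(*result_lists, limit=5):
    """Round-robin across sources so no single backend dominates the
    top hits, skipping results whose URL was already seen."""
    seen, merged = set(), []
    for round_ in zip_longest(*result_lists):
        for item in round_:
            if item is None or item["url"] in seen:
                continue
            seen.add(item["url"])
            merged.append(item)
            if len(merged) == limit:
                return merged
    return merged

# Illustrative data only -- real results would come from the search libraries.
wiki = [{"title": "Python", "url": "https://ja.wikipedia.org/wiki/Python"}]
ddg = [
    {"title": "Python.org", "url": "https://www.python.org"},
    {"title": "Python", "url": "https://ja.wikipedia.org/wiki/Python"},  # duplicate, dropped
]
results = merge_search_results(wiki, ddg)
```

Round-robin ordering keeps the merged list balanced even when one backend returns many more hits than the others.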
- Launch: Run `main.py` (after ensuring prerequisites are met). The GUI will appear.
- Speak: The system automatically detects voice activity (VAD). Speak your command or question clearly.
- Example: "今日のニュースを教えて" (Tell me today's news)
- Example: "メモ帳を開いて" (Open Notepad)
- Transcribe: The audio is converted to text using `faster-whisper`.
- Think: The AI (Ollama/Llama 3) analyzes the intent.
- If a search is needed, it queries the web first.
- If a tool is needed (Open App, Music), it executes the tool.
- Reply: The AI generates a concise response in Japanese.
- Speak Back: The response is read aloud using VoiceVox.
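The "Think" step talks to Ollama over HTTP. A minimal sketch of one chat turn using only the standard library (the payload shape follows Ollama's `/api/chat` endpoint; the helper name is ours, not the repo's):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # matches ollama.base_url in config.json

def build_chat_payload(model, system_prompt, history, user_text):
    """Assemble the message list Ollama's /api/chat endpoint expects."""
    messages = [{"role": "system", "content": system_prompt}]
    messages += history  # prior turns, capped by max_turns in config.json
    messages.append({"role": "user", "content": user_text})
    return {"model": model, "messages": messages, "stream": False}

payload = build_chat_payload("llama3.2:3b", "日本語で簡潔に答えて", [], "メモ帳を開いて")

# Sending the request only works while the Ollama server is running:
try:
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        reply = json.loads(resp.read())["message"]["content"]
except OSError:
    reply = None  # server not reachable in this environment
```

With `"stream": False`, Ollama returns a single JSON object whose `message.content` field holds the assistant's reply.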
- Language: Python 3.10+
- GUI: Tkinter (Standard Python GUI)
- Speech-to-Text (STT): faster-whisper (Optimized Whisper implementation)
- Large Language Model (LLM): Ollama running `llama3.2:3b`
- Text-to-Speech (TTS): VoiceVox (Local HTTP Server)
- Audio I/O: `sounddevice`, `soundfile`
- Voice Activity Detection: `webrtcvad`
- Search: `duckduckgo_search`, `wikipedia`
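A note on `webrtcvad`: it only accepts 16-bit mono PCM in 10/20/30 ms frames at 8/16/32/48 kHz, which is why `sample_rate` and `frame_ms` in the config must match. A small sketch of the frame arithmetic (helper names are ours, for illustration):

```python
def vad_frame_bytes(sample_rate=16000, frame_ms=30, sample_width=2):
    """Bytes per frame fed to webrtcvad: it only accepts 10/20/30 ms
    frames of 16-bit mono PCM at 8/16/32/48 kHz."""
    assert frame_ms in (10, 20, 30)
    assert sample_rate in (8000, 16000, 32000, 48000)
    return sample_rate * frame_ms // 1000 * sample_width

def end_silence_frames(end_silence_ms=1000, frame_ms=30):
    """How many consecutive silent frames end an utterance."""
    return end_silence_ms // frame_ms

# With the config defaults (16 kHz, 30 ms): 480 samples * 2 bytes = 960 bytes/frame.
# Classifying a frame would then look like (requires `pip install webrtcvad`):
#   vad = webrtcvad.Vad(3)             # vad_mode from config.json
#   is_voiced = vad.is_speech(frame_bytes, 16000)
```

So with `end_silence_duration_ms: 1000` and `frame_ms: 30`, roughly 33 consecutive silent frames mark the end of speech.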
All configurable settings are stored in config.json.
```jsonc
{
  "audio": {
    "sample_rate": 16000,            // Audio sample rate (Hz)
    "frame_ms": 30,                  // Frame duration for VAD (ms)
    "vad_mode": 3,                   // VAD aggressiveness (0-3)
    "start_voiced_frames": 5,        // Voiced frames to trigger speech start
    "end_silence_duration_ms": 1000  // Silence duration to end speech (ms)
  },
  "whisper": {
    "model_size": "medium",          // Model size (tiny, base, small, medium, large-v2)
    "device": "cuda",                // "cuda" for GPU, "cpu" for CPU
    "compute_type": "int8"           // Quantization (float16, int8)
  },
  "ollama": {
    "base_url": "http://localhost:11434",  // Ollama API URL
    "model": "llama3.2:3b",                // Model tag
    "max_turns": 10                        // Context history limit
  },
  "voicevox": {
    "base_url": "http://localhost:50021",  // VoiceVox API URL
    "speaker_id": 3                        // Speaker ID (3 = Zundamon Normal)
  },
  "prompts": {
    "system_prompt": "...",          // Main persona prompt
    "intent_router_prompt": "..."    // Search intent classification prompt
  }
}
```

(The comments above document each field; standard JSON does not support comments, so they must not appear in the actual `config.json`.)

- Ollama: Must be installed and running on port `11434`. Ensure you have pulled the model: `ollama pull llama3.2:3b`
- VoiceVox: Must be installed and running (the Engine) on port `50021`.
- GPU: A CUDA-capable GPU is highly recommended for `faster-whisper` and Ollama for acceptable latency. If using CPU, change `whisper.device` to `"cpu"` in `config.json` (it will be significantly slower).
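Before launching, it can help to verify that both local servers are reachable. A minimal sketch using only the standard library (`/api/tags` and `/version` are commonly used status endpoints of Ollama and the VoiceVox Engine; the helper name is ours):

```python
import urllib.request

def server_up(url, timeout=2.0):
    """Return True if an HTTP GET to `url` succeeds with status 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False  # connection refused, timeout, DNS failure, ...

# Endpoints assumed from the config defaults above:
ollama_ok = server_up("http://localhost:11434/api/tags")
voicevox_ok = server_up("http://localhost:50021/version")
```

Running such a check at startup lets the assistant print a clear error ("start VoiceVox first") instead of failing mid-conversation.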
This project is licensed under the MIT License.