A multifunctional AI Voice Assistant that integrates a local LLM (Ollama), Speech-to-Text (Whisper), and Text-to-Speech (VoiceVox) to provide a seamless voice interaction experience. It supports information retrieval via web search, application launching, and media playback control.
This project is a Python-based voice assistant designed to run locally on Windows. It features a GUI for visual feedback and a robust backend for handling voice commands. The assistant can:
- Understand natural language queries in Japanese.
- Perform hybrid searches (Wikipedia + DuckDuckGo + Specialized Sites like Qiita/Zenn).
- Launch local applications (Notepad, Calculator, Browser, etc.).
- Speak responses using high-quality TTS (VoiceVox).
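For illustration, here is one way the "hybrid search" merging could work: interleave hits from each backend (Wikipedia, DuckDuckGo, specialized sites) and drop duplicate URLs. The helper name and result shape below are a sketch, not the repo's actual implementation:

```python
from itertools import zip_longest

def merge_search_results(*result_lists, limit=5):
    """Round-robin across sources so no single backend dominates the
    top hits, skipping results whose URL was already seen."""
    seen, merged = set(), []
    for round_ in zip_longest(*result_lists):
        for item in round_:
            if item is None or item["url"] in seen:
                continue
            seen.add(item["url"])
            merged.append(item)
            if len(merged) == limit:
                return merged
    return merged

# Illustrative data only -- real results would come from the search libraries.
wiki = [{"title": "Python", "url": "https://ja.wikipedia.org/wiki/Python"}]
ddg = [
    {"title": "Python.org", "url": "https://www.python.org"},
    {"title": "Python", "url": "https://ja.wikipedia.org/wiki/Python"},  # duplicate, dropped
]
results = merge_search_results(wiki, ddg)
```

Round-robin ordering keeps the merged list balanced even when one backend returns many more hits than the others.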
- Launch: Run `main.py` (after ensuring prerequisites are met). The GUI will appear.
- Speak: The system automatically detects voice activity (VAD). Speak your command or question clearly.
- Example: "今日のニュースを教えて" (Tell me today's news)
- Example: "メモ帳を開いて" (Open Notepad)
- Transcribe: The audio is converted to text using `faster-whisper`.
- Think: The AI (Ollama/Llama 3) analyzes the intent.
- If a search is needed, it queries the web first.
- If a tool is needed (Open App, Music), it executes the tool.
- Reply: The AI generates a concise response in Japanese.
- Speak Back: The response is read aloud using VoiceVox.
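The "Think" step talks to Ollama over HTTP. A minimal sketch of one chat turn using only the standard library (the payload shape follows Ollama's `/api/chat` endpoint; the helper name is ours, not the repo's):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # matches ollama.base_url in config.json

def build_chat_payload(model, system_prompt, history, user_text):
    """Assemble the message list Ollama's /api/chat endpoint expects."""
    messages = [{"role": "system", "content": system_prompt}]
    messages += history  # prior turns, capped by max_turns in config.json
    messages.append({"role": "user", "content": user_text})
    return {"model": model, "messages": messages, "stream": False}

payload = build_chat_payload("llama3.2:3b", "日本語で簡潔に答えて", [], "メモ帳を開いて")

# Sending the request only works while the Ollama server is running:
try:
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        reply = json.loads(resp.read())["message"]["content"]
except OSError:
    reply = None  # server not reachable in this environment
```

With `"stream": False`, Ollama returns a single JSON object whose `message.content` field holds the assistant's reply.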
- Language: Python 3.10+
- GUI: Tkinter (Standard Python GUI)
- Speech-to-Text (STT): faster-whisper (Optimized Whisper implementation)
- Large Language Model (LLM): Ollama running `llama3.2:3b`
- Text-to-Speech (TTS): VoiceVox (Local HTTP Server)
- Audio I/O: `sounddevice`, `soundfile`
- Voice Activity Detection: `webrtcvad`
- Search: `duckduckgo_search`, `wikipedia`
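A note on `webrtcvad`: it only accepts 16-bit mono PCM in 10/20/30 ms frames at 8/16/32/48 kHz, which is why `sample_rate` and `frame_ms` in the config must match. A small sketch of the frame arithmetic (helper names are ours, for illustration):

```python
def vad_frame_bytes(sample_rate=16000, frame_ms=30, sample_width=2):
    """Bytes per frame fed to webrtcvad: it only accepts 10/20/30 ms
    frames of 16-bit mono PCM at 8/16/32/48 kHz."""
    assert frame_ms in (10, 20, 30)
    assert sample_rate in (8000, 16000, 32000, 48000)
    return sample_rate * frame_ms // 1000 * sample_width

def end_silence_frames(end_silence_ms=1000, frame_ms=30):
    """How many consecutive silent frames end an utterance."""
    return end_silence_ms // frame_ms

# With the config defaults (16 kHz, 30 ms): 480 samples * 2 bytes = 960 bytes/frame.
# Classifying a frame would then look like (requires `pip install webrtcvad`):
#   vad = webrtcvad.Vad(3)             # vad_mode from config.json
#   is_voiced = vad.is_speech(frame_bytes, 16000)
```

So with `end_silence_duration_ms: 1000` and `frame_ms: 30`, roughly 33 consecutive silent frames mark the end of speech.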
All configurable settings are stored in config.json.
```jsonc
{
  "audio": {
    "sample_rate": 16000,            // Audio sample rate (Hz)
    "frame_ms": 30,                  // Frame duration for VAD (ms)
    "vad_mode": 3,                   // VAD aggressiveness (0-3)
    "start_voiced_frames": 5,        // Voiced frames to trigger speech start
    "end_silence_duration_ms": 1000  // Silence duration to end speech (ms)
  },
  "whisper": {
    "model_size": "medium",          // Model size (tiny, base, small, medium, large-v2)
    "device": "cuda",                // "cuda" for GPU, "cpu" for CPU
    "compute_type": "int8"           // Quantization (float16, int8)
  },
  "ollama": {
    "base_url": "http://localhost:11434",  // Ollama API URL
    "model": "llama3.2:3b",                // Model tag
    "max_turns": 10                        // Context history limit
  },
  "voicevox": {
    "base_url": "http://localhost:50021",  // VoiceVox API URL
    "speaker_id": 3                        // Speaker ID (3 = Zundamon Normal)
  },
  "prompts": {
    "system_prompt": "...",          // Main persona prompt
    "intent_router_prompt": "..."    // Search intent classification prompt
  }
}
```

(The comments above document each field; standard JSON does not support comments, so they must not appear in the actual `config.json`.)

- Ollama: Must be installed and running on port `11434`. Ensure you have pulled the model: `ollama pull llama3.2:3b`
- VoiceVox: Must be installed and running (the Engine) on port `50021`.
- GPU: A CUDA-capable GPU is highly recommended for `faster-whisper` and Ollama for acceptable latency. If using CPU, change `whisper.device` to `"cpu"` in `config.json` (it will be significantly slower).
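Before launching, it can help to verify that both local servers are reachable. A minimal sketch using only the standard library (`/api/tags` and `/version` are commonly used status endpoints of Ollama and the VoiceVox Engine; the helper name is ours):

```python
import urllib.request

def server_up(url, timeout=2.0):
    """Return True if an HTTP GET to `url` succeeds with status 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False  # connection refused, timeout, DNS failure, ...

# Endpoints assumed from the config defaults above:
ollama_ok = server_up("http://localhost:11434/api/tags")
voicevox_ok = server_up("http://localhost:50021/version")
```

Running such a check at startup lets the assistant print a clear error ("start VoiceVox first") instead of failing mid-conversation.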
This project is licensed under the MIT License.