Sample backend server for the Avatar Chat Widget
See the Avatar Chat Widget in action -> Try Nyx
Real-time voice-to-avatar interaction server combining AI agents (OpenAI Realtime API, Google Gemini Live API or your custom agent) with the Wav2Arkit model for synchronized avatar facial animation. This is an example server that powers the Avatar Chat Widget by processing audio streams and generating ARKit blendshapes for realistic facial animations.
- Real-time Voice-to-Voice AI: OpenAI Realtime API or Google Gemini Live API integration for natural conversation
- Facial Animation Sync: Wav2Arkit model for ARKit-compatible blendshapes
- Modular Agent System: Pluggable agents (sample OpenAI/Gemini or custom implementations)
- WebSocket Communication: Low-latency bidirectional streaming
- CPU Acceleration: ONNX-optimized inference for real-time performance without GPU
- Production Ready: Docker support, health checks, logging, authentication
- AI Customer Support Agents: Replace chatbots with branded, empathetic 3D avatars
- Interactive Kiosks: Voice-enabled avatars for retail or information desks
- Virtual Concierges: Hospitality assistants that can see (via multimodal input) and speak
- Education & Training: Roleplay scenarios with responsive virtual characters
This server acts as the central brain between the client (Widget) and the AI (LLM):
- Audio In: Receives microphone audio from the user (Client)
- Agent Processing: Forwards audio to OpenAI/Gemini/Custom Agent
- Response Generation: Receives audio response from Agent
- Facial Animation: Runs Wav2Arkit model (ONNX) to generate lip-sync frames
- Sync Out: Streams synchronized Audio + Blendshape packets to Client at 30 FPS
- Prerequisites
- Quick Start
- Configuration
- API Documentation
- Docker Usage
- Authentication
- Performance Best Practices
- Agent Modularity
- Development
- License
- Acknowledgments
- Python 3.10.x (exact version required)
- uv package manager
- One of the following API keys (depending on your chosen agent):
- OpenAI API key with Realtime API access (for
sample_openaiagent) - Google Gemini API key (for
sample_geminiagent)
- OpenAI API key with Realtime API access (for
- ONNX Runtime (CPU-optimized, included in dependencies)
- Docker 20.10+
- Docker Compose 2.0+
- One of the following API keys (depending on your chosen agent):
- OpenAI API key with Realtime API access (for
sample_openaiagent) - Google Gemini API key (for
sample_geminiagent)
- OpenAI API key with Realtime API access (for
# 1. Install uv (if not already installed)
# Windows:
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
# macOS/Linux:
curl -LsSf https://astral.sh/uv/install.sh | sh
# 2. Clone the repository
git clone https://github.com/myned-ai/avatar-chat-server.git
cd avatar-chat-server
# 3. Install dependencies
uv sync
# 4. Download the Wav2Arkit model
pip install -U "huggingface_hub[cli]"
mkdir -p src/pretrained_models
huggingface-cli download myned-ai/wav2arkit_cpu --local-dir src/pretrained_models
# 5. Configure environment
cp .env.example .env
# Edit .env with your OPENAI_API_KEY and other settings
# 6. Run server
uv run python src/main.pyServer will start at http://localhost:8080
Test the server: Open test.html in your browser to test the Avatar Chat Widget with your local server. Make sure AUTH_ENABLED=false in your .env file for testing.
# 1. Clone and configure
git clone https://github.com/myned-ai/avatar-chat-server.git
cd avatar-chat-server
cp .env.example .env
# Edit .env with your settings
# 2. Download the ONNX model
pip install -U "huggingface_hub[cli]"
mkdir -p src/pretrained_models
huggingface-cli download myned-ai/wav2arkit_cpu --local-dir src/pretrained_models
# 3. Build and run (production)
docker-compose up -d
# 4. View logs
docker-compose logs -f
# 5. Stop server
docker-compose down
# Development mode (with hot reload)
docker-compose --profile dev upAll settings can be configured via environment variables or .env file.
At least one API key is required, depending on your chosen AGENT_TYPE:
| Variable | Description |
|---|---|
OPENAI_API_KEY |
Your OpenAI API key (required if using sample_openai agent) |
GEMINI_API_KEY |
Your Google Gemini API key (required if using sample_gemini agent) |
| Variable | Default | Description |
|---|---|---|
| Agent Configuration | ||
AGENT_TYPE |
sample_openai |
Agent type: sample_openai, sample_gemini, remote |
AGENT_URL |
(none) | WebSocket URL for remote agent (e.g., ws://agent-service:8080/ws) |
| OpenAI Configuration | ||
OPENAI_MODEL |
gpt-realtime |
Realtime API model |
OPENAI_VOICE |
alloy |
Voice: alloy, ash, ballad, coral, echo, sage, shimmer, verse, marin, cedar |
OPENAI_VOICE_SPEED |
1.2 |
Voice speed (0.25–1.5, where 1.0 is normal) |
OPENAI_TRANSCRIPT_SPEED |
15.0 |
Transcript character emission speed (chars/sec) for client sync |
OPENAI_TRANSCRIPTION_MODEL |
gpt-4o-transcribe |
Transcription model for user speech |
OPENAI_TRANSCRIPTION_LANGUAGE |
en |
Language for transcription (ISO-639-1 code) |
OPENAI_VAD_TYPE |
semantic_vad |
Voice Activity Detection type: server_vad, semantic_vad |
OPENAI_VAD_THRESHOLD |
0.9 |
VAD threshold (0.0–1.0, higher = less sensitive) |
OPENAI_VAD_SILENCE_DURATION_MS |
500 |
Silence duration before turn ends (ms) |
OPENAI_VAD_PREFIX_PADDING_MS |
300 |
Audio to include before detected speech (ms) |
OPENAI_NOISE_REDUCTION |
near_field |
Noise reduction: near_field, far_field, or empty to disable |
| Gemini Configuration | ||
GEMINI_MODEL |
gemini-2.5-flash-native-audio-preview-12-2025 |
Live API model |
GEMINI_VOICE |
Leda |
Voice: Puck, Charon, Kore, Fenrir, Aoede, Leda, Orus, Zephyr |
GEMINI_API_VERSION |
v1alpha |
Gemini API version |
GEMINI_THINKING_BUDGET |
-1 |
Thinking budget: 0=disabled, -1=dynamic, 1–32768=fixed tokens |
GEMINI_GOOGLE_SEARCH_GROUNDING |
false |
Enable Google Search grounding for real-time information |
GEMINI_PROACTIVE_AUDIO |
false |
Allow model to decide not to respond if content is not relevant |
GEMINI_CONTEXT_WINDOW_COMPRESSION |
true |
Enable context window compression for longer sessions |
GEMINI_INPUT_SAMPLE_RATE |
16000 |
Audio sample rate sent to Gemini |
GEMINI_OUTPUT_SAMPLE_RATE |
24000 |
Audio sample rate received from Gemini |
GEMINI_TRANSCRIPT_SPEED |
15.0 |
Transcript character emission speed (chars/sec) |
GEMINI_VAD_START_SENSITIVITY |
START_SENSITIVITY_LOW |
VAD start sensitivity |
GEMINI_VAD_END_SENSITIVITY |
END_SENSITIVITY_LOW |
VAD end sensitivity |
GEMINI_TURN_COVERAGE |
TURN_INCLUDES_ALL_INPUT |
Turn coverage mode |
| Assistant Configuration | ||
ASSISTANT_INSTRUCTIONS |
see .env.example | System prompt for the AI assistant |
| Model Configuration | ||
ONNX_MODEL_PATH |
./pretrained_models/wav2arkit_cpu.onnx |
Path to ONNX model weights (CPU-only) |
| Server Configuration | ||
SERVER_HOST |
0.0.0.0 |
Server bind address |
SERVER_PORT |
8080 |
Server port |
USE_SSL |
false |
Enable SSL (WebSocket endpoint becomes wss://) |
DEBUG |
false |
Enable debug logging (verbose output) |
DEBUG_AUDIO_CAPTURE |
false |
Save incoming audio to files for debugging |
| Audio Configuration | ||
INPUT_SAMPLE_RATE |
24000 |
Input audio sample rate (widget format) |
OUTPUT_SAMPLE_RATE |
24000 |
Output audio sample rate (for playback and lip-sync) |
WAV2ARKIT_SAMPLE_RATE |
16000 |
Wav2Arkit model expected sample rate |
BLENDSHAPE_FPS |
30 |
Output blendshape frame rate |
AUDIO_CHUNK_DURATION |
0.5 |
Audio chunk duration in seconds for Wav2Arkit processing |
TRANSCRIPT_CHARS_PER_SECOND |
16.0 |
Transcript timing estimation (chars/sec) |
| Authentication | ||
AUTH_ENABLED |
false |
Enable HMAC token authentication |
AUTH_SECRET_KEY |
(auto-generated) | Secret key for HMAC signing (generate with openssl rand -hex 32) |
AUTH_TOKEN_TTL |
3600 |
Token time-to-live in seconds |
AUTH_ALLOWED_ORIGINS |
localhost:5173,... |
CORS allowed origins (comma-separated) |
AUTH_ENABLE_RATE_LIMITING |
true |
Enable rate limiting |
See .env.example for complete configuration template.
GET /inf- Server info and statusGET /health- Health check endpointGET /docs- OpenAPI documentation (Swagger UI)GET /redoc- ReDoc documentationPOST /api/auth/token- Generate HMAC auth token (if authentication enabled)
Connect to ws://localhost:8080/ws (or wss:// for production with TLS)
With authentication:
ws://localhost:8080/ws?token=YOUR_AUTH_TOKEN
| Type | Payload | Description |
|---|---|---|
text |
{"type": "text", "data": "Hello"} |
Send text message to AI |
audio_stream_start |
{"type": "audio_stream_start", "userId": "user123"} |
Start audio streaming session |
audio |
{"type": "audio", "data": "<base64>"} |
Audio chunk (e.g. PCM16, 24kHz mono, base64-encoded) |
interrupt |
{"type": "interrupt"} |
Explicitly interrupt AI response |
ping |
{"type": "ping"} |
Heartbeat to keep connection alive |
| Type | Description |
|---|---|
config |
Sent on connection. Contains negotiated audio settings (see below) |
audio_start |
AI started responding. Includes sessionId, turnId, sampleRate, format |
sync_frame |
Synchronized audio + blendshape frame at 30 FPS (see below) |
audio_chunk |
Fallback audio-only chunk when Wav2Arkit model is unavailable |
audio_end |
AI finished responding. Includes sessionId, turnId |
transcript_delta |
Streaming text fragment with timing offsets (see below) |
transcript_done |
Complete transcript for a turn (role: "user" or "assistant") |
avatar_state |
Avatar state change ("Listening" or "Responding") |
interrupt |
User interrupted AI response. Includes turnId, offsetMs |
error |
Error message |
pong |
Heartbeat response with timestamp |
config — Sent immediately on WebSocket connection:
{
"type": "config",
"audio": {
"inputSampleRate": 24000
}
}audio_start — AI started a new response turn:
{
"type": "audio_start",
"sessionId": "...",
"turnId": "turn_1234567890_abcdef01",
"sampleRate": 24000,
"format": "audio/pcm16",
"timestamp": 1234567890123
}sync_frame — Synchronized audio + blendshape data (30 FPS):
{
"type": "sync_frame",
"weights": { "jawOpen": 0.3, "mouthSmile_L": 0.5, ... },
"audio": "<base64-pcm16>",
"sessionId": "...",
"turnId": "...",
"timestamp": 1234567890123,
"frameIndex": 0
}audio_chunk — Fallback when Wav2Arkit model is not available (no blendshapes):
{
"type": "audio_chunk",
"data": "<base64-pcm16>",
"sessionId": "...",
"timestamp": 1234567890123
}transcript_delta — Streaming text with timing offsets for client sync:
{
"type": "transcript_delta",
"text": "Hello",
"role": "assistant",
"turnId": "...",
"sessionId": "...",
"timestamp": 1234567890123,
"startOffset": 0,
"endOffset": 312
}interrupt — Sent when user interrupts an AI response:
{
"type": "interrupt",
"timestamp": 1234567890123,
"turnId": "...",
"offsetMs": 1500
}The weights object in sync_frame messages contains 52 ARKit-compatible blendshape coefficients (0.0-1.0):
{
"browInnerUp": 0.0,
"browDown_L": 0.0,
"browDown_R": 0.0,
"jawOpen": 0.3,
"mouthSmile_L": 0.5,
"mouthSmile_R": 0.5,
...
}See ARKit Blendshape Documentation for complete list.
The Dockerfile uses a multi-stage build optimized for CPU-only production:
- Base Stage: Python 3.10-slim (Debian-based) with system dependencies
- Dependencies Stage: Fast dependency installation with uv
- Production Stage: Minimal image with non-root user, health checks
- Development Stage: Hot reload support for development
# Build image
docker build -t avatar-chat-server .
# Run (CPU-only)
docker run -d \
--name avatar-chat-server \
-p 8080:8080 \
--env-file .env \
--restart unless-stopped \
avatar-chat-server
# View logs
docker logs -f avatar-chat-server
# Health check
curl http://localhost:8080/health-
Default (Production):
docker-compose up -d- Optimized production build
- Runs as non-root user
- Health checks enabled
- Auto-restart on failure
-
Development:
docker-compose --profile dev up- Hot reload enabled
- Source code mounted as volume
- Debug logging enabled
You can give the assistant custom knowledge to use when answering questions. Set KNOWLEDGE_BASE_SOURCE to a URL — the content is fetched at runtime and injected into the assistant's system prompt:
docker run -d \
--name avatar-chat-server \
-p 8080:8080 \
--env-file .env \
-e KNOWLEDGE_BASE_SOURCE=https://example.com/my-knowledge.md \
avatar-chat-serverFor local development (without Docker), you can also use a local file path (e.g. KNOWLEDGE_BASE_SOURCE=data/knowledge.md).
The server uses HMAC-SHA256 signed tokens for authentication. Tokens are origin-bound and time-limited.
Set in .env:
AUTH_ENABLED=true
AUTH_SECRET_KEY=your-secret-key-here-generate-with-openssl-rand-hex-32
AUTH_ALLOWED_ORIGINS=https://yourwebsite.com,https://www.yourwebsite.comIf AUTH_SECRET_KEY is not set, the server will auto-generate one at startup (not suitable for production — the key will change on each restart).
openssl rand -hex 32The token endpoint validates the Origin header and returns a signed token for that origin:
curl -X POST http://localhost:8080/api/auth/token \
-H "Origin: https://yourwebsite.com"Response:
{
"token": "base64-encoded-hmac-token",
"ttl": 3600,
"origin": "https://yourwebsite.com"
}Include in WebSocket URL:
ws://localhost:8080/ws?token=YOUR_AUTH_TOKEN
The authentication middleware implements a 4-layer security model:
- Origin Validation — Whitelist check against
AUTH_ALLOWED_ORIGINS - HMAC Token Verification — Signature and expiry validation
- Rate Limiting — Token bucket algorithm (if enabled)
- Monitoring — Audit logging
When enabled, uses a token bucket algorithm with two levels:
- Per-domain: 100 token capacity, refills at 10 tokens/sec
- Per-session: 30 token capacity, refills at 5 tokens/sec
Exceeding limits results in WebSocket-level rejection (not HTTP 429).
- CPU: 4+ cores recommended for real-time ONNX inference (8+ cores optimal)
- RAM: 8GB+ (16GB recommended)
- Network: Low-latency connection to AI API (< 100ms recommended)
The server uses ONNX Runtime for CPU-optimized inference. For best performance:
- ONNX Runtime automatically uses the best available SIMD instructions (SSE, AVX2, AVX-512)
- Ensure sufficient RAM (16GB+ recommended)
- Monitor RTF (Real-Time Factor) in debug logs; aim for <1.0
- If audio drops occur, consider reducing
audio_chunk_durationin config
The server implements several performance optimizations for real-time operation:
- Model Warmup: Eliminates first-inference delay
- orjson: 3-5x faster JSON serialization
- Base64 Pre-encoding: Moves encoding off critical 30 FPS broadcast path
- Bounded Queues: Prevents memory exhaustion under load
- Fire-and-forget Tasks: Non-blocking broadcast to WebSocket clients
Check server logs for performance indicators:
# Debug mode shows timing information
DEBUG=true uv run python src/main.pyKey metrics:
- WebSocket connection count
- Audio processing latency
- Model inference time
- Queue depths
The server uses a modular agent system allowing different conversational AI backends:
- sample_openai: Uses OpenAI Realtime API (default)
- sample_gemini: Uses Google Gemini Live API
Implement the BaseAgent interface for custom AI services:
from agents import BaseAgent, ConversationState
class MyCustomAgent(BaseAgent):
@property
def is_connected(self) -> bool:
# Return connection status
...
@property
def state(self) -> ConversationState:
# Return current conversation state
...
@property
def transcript_speed(self) -> float:
# Transcript chars/sec for client sync (e.g., 15.0)
...
def set_event_handlers(self, **kwargs) -> None:
# Store callbacks: on_audio_delta, on_transcript_delta,
# on_response_start, on_response_end, on_user_transcript,
# on_interrupted, on_error
...
async def connect(self) -> None:
# Connect to your AI service
...
def send_text_message(self, text: str) -> None:
# Send text to AI
...
def append_audio(self, audio_bytes: bytes) -> None:
# Send audio to AI (PCM16 bytes)
...
async def disconnect(self) -> None:
# Disconnect from your AI service
...Set AGENT_TYPE=remote and AGENT_URL=ws://your-agent-service/ws for remote agents.
- Set
AGENT_TYPEin.env - Provide required API keys
- Restart the server
The core chat-server (WebSocket handling, blendshape generation) remains unchanged.
uv sync --group devThis project uses Ruff for linting and formatting, and ty for static type checking.
# Check for linting issues
uv run ruff check src/
# Auto-fix linting issues
uv run ruff check src/ --fix
# Fix with unsafe fixes (use with caution)
uv run ruff check src/ --fix --unsafe-fixes# Check formatting
uv run ruff format src/ --check
# Format code
uv run ruff format src/# Run static type analysis
uv run ty check src/# Lint, format, and type check
uv run ruff check src/ --fix && uv run ruff format src/ && uv run ty check src/uv run pytestThis project is licensed under the MIT License - see the LICENSE file for details.
- Apple ARKit for the blendshape specification standard
- LAM Audio2Expression for facial animation model
- Wav2Vec 2.0 for speech representation learning
- OpenAI for Realtime API
- Google for Gemini Live API