
Released by @Josh-XT on 26 Mar 13:56

ezLocalai v1.0.8

Video Generation (NEW)

  • LTX-2.3 GGUF Video Generation — Full text-to-video, image-to-video, and video-to-video generation powered by LTX-2.3 with GGUF quantization (Q4_K_M, Q8_0)
    • New POST /v1/videos/generations endpoint
    • Sequential CPU offload for low-VRAM GPUs (8GB+)
    • Multimodal outputs including synchronized audio generation (24kHz)
    • Multi-frame conditioning for video-to-video workflows
    • Text encoder strategies automatically selected based on available memory: GPU BNB 4-bit, CPU Quanto INT8, or CPU BF16
  • Configurable via VIDEO_MODEL environment variable (default: "none", set to "unsloth/LTX-2.3-GGUF" to enable)
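A minimal client sketch for the new endpoint. The endpoint path comes from these notes; the request-body field names (`prompt`, `mode`, `image`) and the port are assumptions in the OpenAI style, not confirmed behavior:

```python
def build_video_request(prompt, mode="text-to-video", image_b64=None):
    """Build a request body for POST /v1/videos/generations.

    Field names other than the endpoint path are assumptions modeled on
    OpenAI-style APIs, not confirmed by the release notes.
    """
    body = {"prompt": prompt, "mode": mode}
    if image_b64 is not None:
        body["image"] = image_b64  # conditioning frame for image-to-video
    return body

body = build_video_request("a sailboat at sunset")
# To send (assuming a local server and an API key):
# import requests
# resp = requests.post("http://localhost:8091/v1/videos/generations",
#                      headers={"Authorization": "Bearer YOUR_API_KEY"},
#                      json=body)
```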

Video Understanding (NEW)

  • Vision-language models can now process video inputs in chat completions
  • Scene-change detection automatically identifies key frames for efficient analysis
  • Audio track extraction and transcription via Whisper for combined visual + audio understanding
  • Supports video URLs and base64-encoded video inputs
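Video inputs presumably ride along in the chat-completions message body. This sketch mirrors OpenAI's `image_url` content-part convention; the `video_url` part type is an assumption, not a documented field:

```python
def video_chat_messages(question, video_url):
    """Build a chat-completions message list with a video attachment.

    The "video_url" content-part type mirrors OpenAI's "image_url"
    convention; the exact field name is an assumption.
    """
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "video_url", "video_url": {"url": video_url}},
        ],
    }]

messages = video_chat_messages(
    "What happens in this clip?", "https://example.com/clip.mp4"
)
```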

Image Editing (NEW)

  • FLUX.2-klein-4B with GGUF quantization — Replaces previous image model with a unified architecture supporting both image generation and editing
    • New POST /v1/images/edits endpoint
    • 15 GGUF quantization options (Q2_K through F16)
    • Sequential CPU offload support matching video pipeline
  • Configurable via IMG_MODEL environment variable (default: "unsloth/FLUX.2-klein-4B-GGUF")
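A sketch of calling the new edits endpoint, following the multipart convention of OpenAI's images/edits API; the exact form-field names ezlocalai accepts are an assumption:

```python
def build_image_edit_form(prompt, model="unsloth/FLUX.2-klein-4B-GGUF"):
    """Form fields for POST /v1/images/edits.

    Field names follow the OpenAI images/edits convention; the names
    ezlocalai actually accepts are an assumption.
    """
    return {"prompt": prompt, "model": model}

form = build_image_edit_form("make the sky purple")
# The source image would go as a multipart file upload alongside the form:
# import requests
# with open("input.png", "rb") as f:
#     resp = requests.post("http://localhost:8091/v1/images/edits",
#                          data=form, files={"image": f})
```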

Speaker Diarization (NEW)

  • Automatic speaker identification in audio transcriptions using MFCC-based clustering
  • Persistent voice print storage with session-scoped UUIDs and 24-hour TTL
  • Auto-detects the number of speakers using the inconsistency coefficient
  • Cross-chunk speaker consistency — same physical speaker maintains the same ID across segments
  • SRT and WebVTT subtitle output with speaker labels
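The labeled subtitle output can be illustrated with a small renderer. The segment tuple shape `(start_sec, end_sec, speaker_id, text)` and the `[SPEAKER_n]` label style are illustrative choices, not ezlocalai's internal format:

```python
def to_srt(segments):
    """Render diarized segments as SRT with speaker labels.

    `segments` is a list of (start_sec, end_sec, speaker_id, text);
    this shape is an illustration, not ezlocalai's internal format.
    """
    def ts(sec):
        # SRT timestamps are HH:MM:SS,mmm
        h, rem = divmod(int(sec), 3600)
        m, s = divmod(rem, 60)
        ms = int(round((sec - int(sec)) * 1000))
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    blocks = []
    for i, (start, end, speaker, text) in enumerate(segments, 1):
        blocks.append(f"{i}\n{ts(start)} --> {ts(end)}\n[{speaker}] {text}\n")
    return "\n".join(blocks)

srt = to_srt([(0.0, 1.5, "SPEAKER_1", "Hello there.")])
```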

Distributed Inference & Server Roles (NEW)

  • Voice Server (VOICE_SERVER) — Offload TTS/STT to a dedicated ezlocalai instance
  • Image Server (IMAGE_SERVER) — Offload image and video generation to a dedicated instance
  • Text Server (TEXT_SERVER) — Offload LLM text completion to a dedicated instance
  • Each server role supports independent API keys
  • Lazy loading of voice models by default (LAZY_LOAD_VOICE=true)
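The routing idea can be sketched as an env-var lookup per task type. The environment variable names come from these notes; the helper name, task keys, and run-locally-when-unset behavior are assumptions:

```python
import os

# Map task types to the server-role env vars named in the release notes.
ROLE_ENV = {
    "tts": "VOICE_SERVER", "stt": "VOICE_SERVER",
    "image": "IMAGE_SERVER", "video": "IMAGE_SERVER",
    "text": "TEXT_SERVER",
}

def resolve_backend(task, environ=os.environ):
    """Return the remote base URL for a task, or None to run locally.

    The helper name and the "unset means local" behavior are assumptions
    about how these variables are used.
    """
    return environ.get(ROLE_ENV[task]) or None
```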

Fallback Server Support (NEW)

  • Route requests to a remote server when local resources are exhausted
  • Supports both ezlocalai instances and OpenAI-compatible APIs as fallback targets
  • Configurable memory threshold (FALLBACK_MEMORY_THRESHOLD) for combined VRAM + RAM monitoring
  • Automatic queue-based fallback with configurable wait timeout (QUEUE_WAIT_TIMEOUT)
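A sketch of the threshold check, assuming FALLBACK_MEMORY_THRESHOLD is a fraction of combined VRAM + RAM; the exact semantics of the variable are an assumption:

```python
def should_fall_back(vram_used, vram_total, ram_used, ram_total,
                     threshold=0.9):
    """Decide whether to route a request to the fallback server.

    Treats FALLBACK_MEMORY_THRESHOLD as a fraction of combined
    VRAM + RAM usage; that interpretation is an assumption.
    """
    used = vram_used + ram_used
    total = vram_total + ram_total
    return total > 0 and used / total >= threshold
```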

Request Queue System (NEW)

  • Async request queue with configurable concurrency limits (MAX_CONCURRENT_REQUESTS, MAX_QUEUE_SIZE)
  • Request lifecycle tracking: queued → processing → completed/failed
  • Cancellation support for fallback routing
  • Request history and metrics (total processed, failed, queued, active)
  • Configurable request timeout (REQUEST_TIMEOUT)
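The lifecycle above can be sketched with an asyncio semaphore bounding concurrency. Names are illustrative; only the queued → processing → completed/failed flow and the two limits come from the notes:

```python
import asyncio

async def run_with_queue(jobs, max_concurrent=2, max_queue_size=64):
    """Minimal sketch of the queued-request lifecycle:
    queued -> processing -> completed/failed. Names are illustrative."""
    if len(jobs) > max_queue_size:
        raise RuntimeError("queue full")   # analogous to MAX_QUEUE_SIZE
    sem = asyncio.Semaphore(max_concurrent)  # MAX_CONCURRENT_REQUESTS
    results = {}

    async def worker(job_id, coro):
        async with sem:  # request leaves the queue, starts processing
            try:
                results[job_id] = ("completed", await coro)
            except Exception as exc:
                results[job_id] = ("failed", exc)

    await asyncio.gather(*(worker(i, c) for i, c in enumerate(jobs)))
    return results

async def fake_request(x):
    await asyncio.sleep(0)  # stand-in for real inference work
    return x * 2

results = asyncio.run(run_with_queue([fake_request(i) for i in range(4)]))
```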

LLM Improvements

  • Default model updated to Qwen3.5-4B-GGUF with new chat templates for Qwen3.5 4B, 2B, and 0.8B variants
  • Automatically reduces the number of GPU-offloaded layers on out-of-memory errors instead of abandoning GPU offload entirely
  • Multi-GPU improvements — Dynamic tensor split calculation based on free VRAM across all GPUs
  • Parallel inference slots (N_PARALLEL) — Auto-scales targeting ~32K context per slot (max 16 slots)
  • Configurable KV cache type (QUANT_TYPE default: Q4_K_XL)
  • Configurable LLM_MAX_TOKENS (default: 40000) and REASONING_BUDGET for thinking tokens
  • Improved concurrency handling and logging
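Two of the policies above reduce to short formulas. These are illustrations of the stated behavior (~32K context per slot, split proportional to free VRAM), not the actual implementation:

```python
def auto_parallel_slots(n_ctx, target_per_slot=32_000, max_slots=16):
    """Auto-scale N_PARALLEL: one slot per ~32K of context, capped at 16.

    The exact rounding is an assumption; only the target and cap come
    from the release notes.
    """
    return max(1, min(max_slots, n_ctx // target_per_slot))

def tensor_split(free_vram):
    """Split tensors across GPUs proportionally to each GPU's free VRAM."""
    total = sum(free_vram)
    return [v / total for v in free_vram]
```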

Platform Support

  • NVIDIA Jetson — New jetson.Dockerfile and docker-compose-jetson.yml with platform-specific import handling
  • Raspberry Pi 5 with AI HAT+ 2 — New rpi.Dockerfile, docker-compose-rpi.yml, and rpi-requirements.txt
  • Chatterbox TTS — Watermarker compatibility fix for perth module

Testing & CI

  • Expanded GitHub Actions test workflow with TTS and STT testing enabled
  • Tests use Qwen3.5-0.8B for faster CI runs
  • Improved test reliability with better concurrency handling

Docker & Dependencies

  • Updated to official xllamacpp v0.2.12 releases
  • Updated setuptools to 78.1.1
  • Updated CUDA and ROCm requirements files
  • All Docker Compose files updated with new environment variables for distributed inference