### LTX-2.3 GGUF Video Generation (NEW)
- Full text-to-video, image-to-video, and video-to-video generation powered by LTX-2.3 with GGUF quantization (Q4_K_M, Q8_0)
- New `POST /v1/videos/generations` endpoint
- Sequential CPU offload for low-VRAM GPUs (8 GB+)
- Multimodal outputs, including synchronized audio generation (24 kHz)
- Multi-frame conditioning for video-to-video workflows
- Text encoder strategy selected automatically based on available memory: GPU BNB 4-bit, CPU Quanto INT8, or CPU BF16
- Configurable via the `VIDEO_MODEL` environment variable (default: `"none"`; set to `"unsloth/LTX-2.3-GGUF"` to enable)
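A request to the new endpoint might look like the sketch below. Only the `POST /v1/videos/generations` path and the `unsloth/LTX-2.3-GGUF` model name come from the notes above; the host/port, the `prompt` field name, and the bearer-token header are illustrative assumptions modeled on the OpenAI-style API shape.

```python
import json
import urllib.request

# Hypothetical request body; only the endpoint path and the model name
# come from the release notes -- the field names are assumptions.
payload = {
    "model": "unsloth/LTX-2.3-GGUF",
    "prompt": "A paper boat drifting down a rain-soaked street",
}

req = urllib.request.Request(
    "http://localhost:8091/v1/videos/generations",  # assumed host/port
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer YOUR_API_KEY",  # placeholder key
    },
    method="POST",
)
# urllib.request.urlopen(req)  # uncomment against a running instance
```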
### Video Understanding (NEW)
- Vision-language models can now process video inputs in chat completions
- Scene-change detection automatically identifies key frames for efficient analysis
- Audio track extraction and transcription via Whisper for combined visual + audio understanding
- Supports both video URLs and base64-encoded video inputs
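A chat-completions message carrying a video input could be shaped like the following. The `video_url` content-part type is an assumption modeled on OpenAI's `image_url` convention; the notes only state that video URLs and base64 video are accepted.

```python
# Hypothetical chat-completions message with a video input part.
# The "video_url" part type is an assumption patterned on OpenAI's
# "image_url" content parts.
message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Summarize what happens in this clip."},
        {
            "type": "video_url",
            "video_url": {"url": "https://example.com/clip.mp4"},
        },
    ],
}
```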
### Image Editing (NEW)
- FLUX.2-klein-4B with GGUF quantization — replaces the previous image model with a unified architecture supporting both image generation and editing
- New `POST /v1/images/edits` endpoint
- 15 GGUF quantization options (Q2_K through F16)
- Sequential CPU offload support, matching the video pipeline
- Configurable via the `IMG_MODEL` environment variable (default: `"unsloth/FLUX.2-klein-4B-GGUF"`)
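An edit request could be assembled as in the sketch below. Only the `POST /v1/images/edits` path and the default `IMG_MODEL` value come from the notes; sending the source image as a base64 field in a JSON body (rather than, say, multipart form data) and the `prompt`/`image` field names are assumptions.

```python
import base64
import json

# Placeholder bytes standing in for a real PNG file's contents.
image_bytes = b"\x89PNG\r\n\x1a\n"
image_b64 = base64.b64encode(image_bytes).decode("ascii")

# Hypothetical JSON body for POST /v1/images/edits; field names are
# assumptions -- only the path and model name are from the notes.
payload = {
    "model": "unsloth/FLUX.2-klein-4B-GGUF",
    "prompt": "Replace the sky with a sunset",
    "image": image_b64,
}
body = json.dumps(payload)
```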
### Speaker Diarization (NEW)
- Automatic speaker identification in audio transcriptions using MFCC-based clustering
- Persistent voice-print storage with session-scoped UUIDs and a 24-hour TTL
- Auto-detects the number of speakers using the inconsistency coefficient
- Cross-chunk speaker consistency — the same physical speaker keeps the same ID across segments
- SRT and WebVTT subtitle output with speaker labels
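Rendering a diarized transcript as SRT amounts to prefixing each cue with its speaker label. A minimal sketch of the general idea (the `(start, end, speaker, text)` segment shape and the `SPEAKER_n` label style are assumptions, not the project's actual internals):

```python
def fmt_ts(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

def to_srt(segments):
    """segments: iterable of (start, end, speaker, text) tuples (assumed shape)."""
    cues = []
    for i, (start, end, speaker, text) in enumerate(segments, 1):
        cues.append(f"{i}\n{fmt_ts(start)} --> {fmt_ts(end)}\n{speaker}: {text}\n")
    return "\n".join(cues)

print(to_srt([
    (0.0, 2.5, "SPEAKER_0", "Hello."),
    (2.5, 4.0, "SPEAKER_1", "Hi there."),
]))
```

WebVTT output differs only in the header line and the `.` (instead of `,`) millisecond separator, so the same segment list serves both formats.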
### Distributed Inference & Server Roles (NEW)
- Voice Server (`VOICE_SERVER`) — offload TTS/STT to a dedicated ezlocalai instance
- Image Server (`IMAGE_SERVER`) — offload image and video generation to a dedicated instance
- Text Server (`TEXT_SERVER`) — offload LLM text completion to a dedicated instance
- Each server role supports an independent API key
- Voice models are lazy-loaded by default (`LAZY_LOAD_VOICE=true`)
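A split deployment might be configured roughly as follows. Only `VOICE_SERVER`, `IMAGE_SERVER`, `TEXT_SERVER`, and `LAZY_LOAD_VOICE` appear in the notes; the URLs and the per-role `*_SERVER_API_KEY` variable names are illustrative assumptions.

```shell
# Example .env fragment: one role per dedicated instance.
# URLs and the *_SERVER_API_KEY names are assumptions; only the
# *_SERVER and LAZY_LOAD_VOICE variables come from the release notes.
VOICE_SERVER=http://voice-host:8091
VOICE_SERVER_API_KEY=voice-key
IMAGE_SERVER=http://image-host:8091
IMAGE_SERVER_API_KEY=image-key
TEXT_SERVER=http://text-host:8091
TEXT_SERVER_API_KEY=text-key
LAZY_LOAD_VOICE=true
```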
### Fallback Server Support (NEW)
- Routes requests to a remote server when local resources are exhausted
- Supports both ezlocalai instances and OpenAI-compatible APIs as fallback targets
- Configurable memory threshold (`FALLBACK_MEMORY_THRESHOLD`) for combined VRAM + RAM monitoring
- Automatic queue-based fallback with a configurable wait timeout (`QUEUE_WAIT_TIMEOUT`)
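Fallback behavior might be wired up like this. `FALLBACK_MEMORY_THRESHOLD` and `QUEUE_WAIT_TIMEOUT` are named in the notes; the `FALLBACK_SERVER` variable name, the threshold scale (fraction of combined VRAM + RAM), and the timeout unit (seconds) are assumptions.

```shell
# Example .env fragment for fallback routing. The FALLBACK_SERVER name,
# the threshold scale, and the timeout unit are assumptions; only
# FALLBACK_MEMORY_THRESHOLD and QUEUE_WAIT_TIMEOUT come from the notes.
FALLBACK_SERVER=https://api.openai.com
FALLBACK_MEMORY_THRESHOLD=0.9   # assumed: fraction of combined VRAM + RAM
QUEUE_WAIT_TIMEOUT=60           # assumed: seconds to wait before falling back
```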
### Request Queue System (NEW)
- Async request queue with configurable concurrency limits (`MAX_CONCURRENT_REQUESTS`, `MAX_QUEUE_SIZE`)
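The general pattern behind such a queue maps naturally onto asyncio primitives: a semaphore caps in-flight work at `MAX_CONCURRENT_REQUESTS`, and a bounded `asyncio.Queue` rejects work beyond `MAX_QUEUE_SIZE`. The sketch below illustrates the pattern under those assumptions; it is not the project's actual implementation.

```python
import asyncio

MAX_CONCURRENT_REQUESTS = 2  # variable names from the notes; values illustrative
MAX_QUEUE_SIZE = 8

async def handle(request, sem, results):
    # The semaphore caps concurrent handlers at MAX_CONCURRENT_REQUESTS.
    async with sem:
        await asyncio.sleep(0)  # stand-in for real inference work
        results.append(f"done:{request}")

async def main():
    sem = asyncio.Semaphore(MAX_CONCURRENT_REQUESTS)
    # A bounded queue: put_nowait raises QueueFull past MAX_QUEUE_SIZE,
    # which is where a fallback server (if configured) would take over.
    queue = asyncio.Queue(maxsize=MAX_QUEUE_SIZE)
    for i in range(5):
        queue.put_nowait(i)
    results = []
    workers = [handle(queue.get_nowait(), sem, results) for _ in range(queue.qsize())]
    await asyncio.gather(*workers)
    return results

print(asyncio.run(main()))
```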