
Released by @Josh-XT on 26 Mar 13:56

ezLocalai v1.0.8

Video Generation (NEW)

  • LTX-2.3 GGUF Video Generation — Full text-to-video, image-to-video, and video-to-video generation powered by LTX-2.3 with GGUF quantization (Q4_K_M, Q8_0)
    • New POST /v1/videos/generations endpoint
    • Sequential CPU offload for low-VRAM GPUs (8GB+)
    • Multimodal outputs including synchronized audio generation (24kHz)
    • Multi-frame conditioning for video-to-video workflows
    • Text encoder strategies automatically selected based on available memory: GPU BNB 4-bit, CPU Quanto INT8, or CPU BF16
  • Configurable via VIDEO_MODEL environment variable (default: "none", set to "unsloth/LTX-2.3-GGUF" to enable)
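A minimal client sketch for the new endpoint. The endpoint path comes from these notes; the request-body field names (`prompt`, `mode`, `image`) and the port are assumptions in the OpenAI style, not confirmed behavior:

```python
def build_video_request(prompt, mode="text-to-video", image_b64=None):
    """Build a request body for POST /v1/videos/generations.

    Field names other than the endpoint path are assumptions modeled on
    OpenAI-style APIs, not confirmed by the release notes.
    """
    body = {"prompt": prompt, "mode": mode}
    if image_b64 is not None:
        body["image"] = image_b64  # conditioning frame for image-to-video
    return body

body = build_video_request("a sailboat at sunset")
# To send (assuming a local server and an API key):
# import requests
# resp = requests.post("http://localhost:8091/v1/videos/generations",
#                      headers={"Authorization": "Bearer YOUR_API_KEY"},
#                      json=body)
```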

Video Understanding (NEW)

  • Vision-language models can now process video inputs in chat completions
  • Scene-change detection automatically identifies key frames for efficient analysis
  • Audio track extraction and transcription via Whisper for combined visual + audio understanding
  • Supports video URLs and base64-encoded video inputs
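Video inputs presumably ride along in the chat-completions message body. This sketch mirrors OpenAI's `image_url` content-part convention; the `video_url` part type is an assumption, not a documented field:

```python
def video_chat_messages(question, video_url):
    """Build a chat-completions message list with a video attachment.

    The "video_url" content-part type mirrors OpenAI's "image_url"
    convention; the exact field name is an assumption.
    """
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "video_url", "video_url": {"url": video_url}},
        ],
    }]

messages = video_chat_messages(
    "What happens in this clip?", "https://example.com/clip.mp4"
)
```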

Image Editing (NEW)

  • FLUX.2-klein-4B with GGUF quantization — Replaces previous image model with a unified architecture supporting both image generation and editing
    • New POST /v1/images/edits endpoint
    • 15 GGUF quantization options (Q2_K through F16)
    • Sequential CPU offload support matching video pipeline
  • Configurable via IMG_MODEL environment variable (default: "unsloth/FLUX.2-klein-4B-GGUF")
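A sketch of calling the new edits endpoint, following the multipart convention of OpenAI's images/edits API; the exact form-field names ezlocalai accepts are an assumption:

```python
def build_image_edit_form(prompt, model="unsloth/FLUX.2-klein-4B-GGUF"):
    """Form fields for POST /v1/images/edits.

    Field names follow the OpenAI images/edits convention; the names
    ezlocalai actually accepts are an assumption.
    """
    return {"prompt": prompt, "model": model}

form = build_image_edit_form("make the sky purple")
# The source image would go as a multipart file upload alongside the form:
# import requests
# with open("input.png", "rb") as f:
#     resp = requests.post("http://localhost:8091/v1/images/edits",
#                          data=form, files={"image": f})
```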

Speaker Diarization (NEW)

  • Automatic speaker identification in audio transcriptions using MFCC-based clustering
  • Persistent voice print storage with session-scoped UUIDs and 24-hour TTL
  • Auto-detects the number of speakers using the inconsistency coefficient
  • Cross-chunk speaker consistency — same physical speaker maintains the same ID across segments
  • SRT and WebVTT subtitle output with speaker labels
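The labeled subtitle output can be illustrated with a small renderer. The segment tuple shape `(start_sec, end_sec, speaker_id, text)` and the `[SPEAKER_n]` label style are illustrative choices, not ezlocalai's internal format:

```python
def to_srt(segments):
    """Render diarized segments as SRT with speaker labels.

    `segments` is a list of (start_sec, end_sec, speaker_id, text);
    this shape is an illustration, not ezlocalai's internal format.
    """
    def ts(sec):
        # SRT timestamps are HH:MM:SS,mmm
        h, rem = divmod(int(sec), 3600)
        m, s = divmod(rem, 60)
        ms = int(round((sec - int(sec)) * 1000))
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    blocks = []
    for i, (start, end, speaker, text) in enumerate(segments, 1):
        blocks.append(f"{i}\n{ts(start)} --> {ts(end)}\n[{speaker}] {text}\n")
    return "\n".join(blocks)

srt = to_srt([(0.0, 1.5, "SPEAKER_1", "Hello there.")])
```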

Distributed Inference & Server Roles (NEW)

  • Voice Server (VOICE_SERVER) — Offload TTS/STT to a dedicated ezlocalai instance
  • Image Server (IMAGE_SERVER) — Offload image and video generation to a dedicated instance
  • Text Server (TEXT_SERVER) — Offload LLM text completion to a dedicated instance
  • Each server role supports independent API keys
  • Lazy loading of voice models by default (LAZY_LOAD_VOICE=true)
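The routing idea can be sketched as an env-var lookup per task type. The environment variable names come from these notes; the helper name, task keys, and run-locally-when-unset behavior are assumptions:

```python
import os

# Map task types to the server-role env vars named in the release notes.
ROLE_ENV = {
    "tts": "VOICE_SERVER", "stt": "VOICE_SERVER",
    "image": "IMAGE_SERVER", "video": "IMAGE_SERVER",
    "text": "TEXT_SERVER",
}

def resolve_backend(task, environ=os.environ):
    """Return the remote base URL for a task, or None to run locally.

    The helper name and the "unset means local" behavior are assumptions
    about how these variables are used.
    """
    return environ.get(ROLE_ENV[task]) or None
```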

Fallback Server Support (NEW)

  • Route requests to a remote server when local resources are exhausted
  • Supports both ezlocalai instances and OpenAI-compatible APIs as fallback targets
  • Configurable memory threshold (FALLBACK_MEMORY_THRESHOLD) for combined VRAM + RAM monitoring
  • Automatic queue-based fallback with configurable wait timeout (QUEUE_WAIT_TIMEOUT)
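A sketch of the threshold check, assuming FALLBACK_MEMORY_THRESHOLD is a fraction of combined VRAM + RAM; the exact semantics of the variable are an assumption:

```python
def should_fall_back(vram_used, vram_total, ram_used, ram_total,
                     threshold=0.9):
    """Decide whether to route a request to the fallback server.

    Treats FALLBACK_MEMORY_THRESHOLD as a fraction of combined
    VRAM + RAM usage; that interpretation is an assumption.
    """
    used = vram_used + ram_used
    total = vram_total + ram_total
    return total > 0 and used / total >= threshold
```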

Request Queue System (NEW)

  • Async request queue with configurable concurrency limits (MAX_CONCURRENT_REQUESTS, MAX_QUEUE_SIZE)
  • Request lifecycle tracking: queued → processing → completed/failed
  • Cancellation support for fallback routing
  • Request history and metrics (total processed, failed, queued, active)
  • Configurable request timeout (REQUEST_TIMEOUT)
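The lifecycle above can be sketched with an asyncio semaphore bounding concurrency. Names are illustrative; only the queued → processing → completed/failed flow and the two limits come from the notes:

```python
import asyncio

async def run_with_queue(jobs, max_concurrent=2, max_queue_size=64):
    """Minimal sketch of the queued-request lifecycle:
    queued -> processing -> completed/failed. Names are illustrative."""
    if len(jobs) > max_queue_size:
        raise RuntimeError("queue full")   # analogous to MAX_QUEUE_SIZE
    sem = asyncio.Semaphore(max_concurrent)  # MAX_CONCURRENT_REQUESTS
    results = {}

    async def worker(job_id, coro):
        async with sem:  # request leaves the queue, starts processing
            try:
                results[job_id] = ("completed", await coro)
            except Exception as exc:
                results[job_id] = ("failed", exc)

    await asyncio.gather(*(worker(i, c) for i, c in enumerate(jobs)))
    return results

async def fake_request(x):
    await asyncio.sleep(0)  # stand-in for real inference work
    return x * 2

results = asyncio.run(run_with_queue([fake_request(i) for i in range(4)]))
```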

LLM Improvements

  • Default model updated to Qwen3.5-4B-GGUF with new chat templates for Qwen3.5 4B, 2B, and 0.8B variants
  • Automatically reduces the number of GPU-offloaded layers on out-of-memory errors instead of abandoning GPU offload entirely
  • Multi-GPU improvements — Dynamic tensor split calculation based on free VRAM across all GPUs
  • Parallel inference slots (N_PARALLEL) — Auto-scales targeting ~32K context per slot (max 16 slots)
  • Configurable KV cache type (QUANT_TYPE default: Q4_K_XL)
  • Configurable LLM_MAX_TOKENS (default: 40000) and REASONING_BUDGET for thinking tokens
  • Improved concurrency handling and logging
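Two of the policies above reduce to short formulas. These are illustrations of the stated behavior (~32K context per slot, split proportional to free VRAM), not the actual implementation:

```python
def auto_parallel_slots(n_ctx, target_per_slot=32_000, max_slots=16):
    """Auto-scale N_PARALLEL: one slot per ~32K of context, capped at 16.

    The exact rounding is an assumption; only the target and cap come
    from the release notes.
    """
    return max(1, min(max_slots, n_ctx // target_per_slot))

def tensor_split(free_vram):
    """Split tensors across GPUs proportionally to each GPU's free VRAM."""
    total = sum(free_vram)
    return [v / total for v in free_vram]
```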

Platform Support

  • NVIDIA Jetson — New jetson.Dockerfile and docker-compose-jetson.yml with platform-specific import handling
  • Raspberry Pi 5 with AI HAT+ 2 — New rpi.Dockerfile, docker-compose-rpi.yml, and rpi-requirements.txt
  • Chatterbox TTS — Watermarker compatibility fix for perth module

Testing & CI

  • Expanded GitHub Actions test workflow with TTS and STT testing enabled
  • Tests use Qwen3.5-0.8B for faster CI runs
  • Improved test reliability with better concurrency handling

Docker & Dependencies

  • Updated to official xllamacpp v0.2.12 releases
  • Updated setuptools to 78.1.1
  • Updated CUDA and ROCm requirements files
  • All Docker Compose files updated with new environment variables for distributed inference