Releases: DevXT-LLC/ezlocalai

v1.0.8

26 Mar 13:56

ezLocalai v1.0.8

Video Generation (NEW)

  • LTX-2.3 GGUF Video Generation — Full text-to-video, image-to-video, and video-to-video generation powered by LTX-2.3 with GGUF quantization (Q4_K_M, Q8_0)
    • New POST /v1/videos/generations endpoint
    • Sequential CPU offload for low-VRAM GPUs (8GB+)
    • Multimodal outputs including synchronized audio generation (24kHz)
    • Multi-frame conditioning for video-to-video workflows
    • Text encoder strategies automatically selected based on available memory: GPU BNB 4-bit, CPU Quanto INT8, or CPU BF16
  • Configurable via VIDEO_MODEL environment variable (default: "none", set to "unsloth/LTX-2.3-GGUF" to enable)
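
As a minimal sketch, a request to the new endpoint might be built like this. The field names (`prompt`, `model`, `video`) are assumptions based on the OpenAI-style endpoints ezlocalai exposes; check the server's API docs for the authoritative schema.

```python
import json
import urllib.request

def build_video_request(prompt, model="unsloth/LTX-2.3-GGUF", input_video=None):
    # Assumed request shape for POST /v1/videos/generations; the
    # actual field names may differ.
    body = {"model": model, "prompt": prompt}
    if input_video is not None:
        body["video"] = input_video  # base64 video for video-to-video
    return body

def post_video_generation(base_url, api_key, body):
    # Sends the request; requires a running ezlocalai instance.
    req = urllib.request.Request(
        f"{base_url}/v1/videos/generations",
        data=json.dumps(body).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    return urllib.request.urlopen(req)
```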

Video Understanding (NEW)

  • Vision-language models can now process video inputs in chat completions
  • Scene-change detection automatically identifies key frames for efficient analysis
  • Audio track extraction and transcription via Whisper for combined visual + audio understanding
  • Supports video URLs and base64-encoded video inputs
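
A chat-completions message carrying a video input could look like the following. The `video_url` content part mirrors the `image_url` convention of OpenAI-style vision APIs; the exact key ezlocalai expects is an assumption here, so verify against the server's request schema.

```python
def build_video_chat_message(text, video_url):
    # One user message for /v1/chat/completions with a video part.
    # "video_url" is assumed by analogy with the "image_url" part type.
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "video_url", "video_url": {"url": video_url}},
        ],
    }
```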

Image Editing (NEW)

  • FLUX.2-klein-4B with GGUF quantization — Replaces previous image model with a unified architecture supporting both image generation and editing
    • New POST /v1/images/edits endpoint
    • 15 GGUF quantization options (Q2_K through F16)
    • Sequential CPU offload support matching video pipeline
  • Configurable via IMG_MODEL environment variable (default: "unsloth/FLUX.2-klein-4B-GGUF")

Speaker Diarization (NEW)

  • Automatic speaker identification in audio transcriptions using MFCC-based clustering
  • Persistent voice print storage with session-scoped UUIDs and 24-hour TTL
  • Auto-detects the number of speakers using an inconsistency coefficient
  • Cross-chunk speaker consistency — same physical speaker maintains the same ID across segments
  • SRT and WebVTT subtitle output with speaker labels
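
For illustration, one diarized SRT cue with a speaker label can be formatted like this; the `[SPEAKER_N]` label style is an assumption, not necessarily the exact format the transcription endpoint emits.

```python
def srt_entry(index, start, end, speaker, text):
    # Format one SRT cue with a speaker label prefixed to the text.
    def ts(seconds):
        # SRT timestamps use HH:MM:SS,mmm with a comma before millis.
        h, rem = divmod(int(seconds), 3600)
        m, s = divmod(rem, 60)
        ms = int(round((seconds - int(seconds)) * 1000))
        return f"{h:02}:{m:02}:{s:02},{ms:03}"
    return f"{index}\n{ts(start)} --> {ts(end)}\n[{speaker}] {text}\n"
```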

Distributed Inference & Server Roles (NEW)

  • Voice Server (VOICE_SERVER) — Offload TTS/STT to a dedicated ezlocalai instance
  • Image Server (IMAGE_SERVER) — Offload image and video generation to a dedicated instance
  • Text Server (TEXT_SERVER) — Offload LLM text completion to a dedicated instance
  • Each server role supports independent API keys
  • Lazy loading of voice models by default (LAZY_LOAD_VOICE=true)

Fallback Server Support (NEW)

  • Route requests to a remote server when local resources are exhausted
  • Supports both ezlocalai instances and OpenAI-compatible APIs as fallback targets
  • Configurable memory threshold (FALLBACK_MEMORY_THRESHOLD) for combined VRAM + RAM monitoring
  • Automatic queue-based fallback with configurable wait timeout (QUEUE_WAIT_TIMEOUT)
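
The threshold check can be sketched as below. This models FALLBACK_MEMORY_THRESHOLD as a fraction of combined VRAM + RAM in use; the real implementation may weigh the two pools differently, so treat this as illustrative only.

```python
def should_fall_back(vram_used, vram_total, ram_used, ram_total, threshold=0.9):
    # Route to the fallback server once combined memory pressure
    # (VRAM + RAM together) crosses the configured threshold.
    combined = (vram_used + ram_used) / (vram_total + ram_total)
    return combined >= threshold
```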

Request Queue System (NEW)

  • Async request queue with configurable concurrency limits (MAX_CONCURRENT_REQUESTS, MAX_QUEUE_SIZE)
  • Request lifecycle tracking: queued → processing → completed/failed
  • Cancellation support for fallback routing
  • Request history and metrics (total processed, failed, queued, active)
  • Configurable request timeout (REQUEST_TIMEOUT)
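
In spirit, the concurrency cap works like an async semaphore around each request. This is a minimal sketch of the MAX_CONCURRENT_REQUESTS behavior only; the real queue also tracks history, cancellation, and timeouts.

```python
import asyncio

class RequestQueue:
    # Minimal async queue: at most max_concurrent requests run at once.
    def __init__(self, max_concurrent=2):
        self.semaphore = asyncio.Semaphore(max_concurrent)
        self.processed = 0

    async def run(self, coro_fn, *args):
        async with self.semaphore:  # waits here when the cap is reached
            result = await coro_fn(*args)
            self.processed += 1
            return result

async def demo():
    q = RequestQueue(max_concurrent=2)

    async def work(x):
        await asyncio.sleep(0)  # stand-in for real inference work
        return x * 2

    return await asyncio.gather(*(q.run(work, i) for i in range(4)))
```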

LLM Improvements

  • Default model updated to Qwen3.5-4B-GGUF with new chat templates for Qwen3.5 4B, 2B, and 0.8B variants
  • Auto-reduce GPU layers on OOM instead of falling back entirely
  • Multi-GPU improvements — Dynamic tensor split calculation based on free VRAM across all GPUs
  • Parallel inference slots (N_PARALLEL) — Auto-scales targeting ~32K context per slot (max 16 slots)
  • Configurable KV cache type (QUANT_TYPE default: Q4_K_XL)
  • Configurable LLM_MAX_TOKENS (default: 40000) and REASONING_BUDGET for thinking tokens
  • Improved concurrency handling and logging

Platform Support

  • NVIDIA Jetson — New jetson.Dockerfile and docker-compose-jetson.yml with platform-specific import handling
  • Raspberry Pi 5 with AI HAT+ 2 — New rpi.Dockerfile, docker-compose-rpi.yml, and rpi-requirements.txt
  • Chatterbox TTS — Watermarker compatibility fix for perth module

Testing & CI

  • Expanded GitHub Actions test workflow with TTS and STT testing enabled
  • Tests use Qwen3.5-0.8B for faster CI runs
  • Improved test reliability with better concurrency handling

Docker & Dependencies

  • Updated to official xllamacpp v0.2.12 releases
  • Updated setuptools to 78.1.1
  • Updated CUDA and ROCm requirement files
  • All Docker Compose files updated with new environment variables for distributed inference

v1.0.7

24 Mar 16:22

What's Changed

  • Add speaker diarization support to audio transcriptions by @Copilot in #72

New Contributors

  • @Copilot made their first contribution in #72

Full Changelog: v1.0.6...v1.0.7

v1.0.6

21 Feb 14:05
2fc4f7c

What's Changed

Full Changelog: v1.0.5...v1.0.6

v1.0.5

02 Dec 16:10
15e4167

What's Changed

Full Changelog: v1.0.4...v1.0.5

v1.0.4

26 Nov 18:21

ezLocalai v1.0.4

Summary

This release represents a complete modernization of ezlocalai's AI stack, making it truly "ez" - simpler configuration, smarter defaults, and better performance. The result is a cleaner codebase with fewer dependencies, automatic optimization, and minimal required configuration.

Philosophy: VRAM is auto-detected, context is dynamic, GPU layers are auto-calibrated. Just set your models and go.

Key Changes

1. Dynamic Context Sizing

No more guessing context sizes! The system automatically sizes context to fit your prompt:

  • Estimates prompt tokens (chars / 4)
  • Rounds up to nearest 32k boundary (32k, 64k, 96k, 128k)
  • Recalibrates GPU layers if context increases
  • Result: Much faster inference on short prompts (128k→32k = 63% speedup)
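
The sizing rule above amounts to a small pure function. This sketch assumes "32k" means 32,000 tokens and omits any cap at the model's maximum context, both of which the actual implementation may handle differently.

```python
def dynamic_context(prompt, step=32_000):
    # Estimate tokens as chars / 4, then round up to the next 32k
    # boundary (32k, 64k, 96k, ...), with 32k as the minimum.
    est_tokens = len(prompt) // 4
    steps = max(1, -(-est_tokens // step))  # ceiling division
    return steps * step
```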

2. Auto VRAM Detection

VRAM budget is now detected automatically at startup:

  • Queries torch.cuda.get_device_properties()
  • Rounds down to nearest GB for safety margin
  • No more manual VRAM_BUDGET configuration
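
The rounding step is simple enough to show directly. In the real code the byte total comes from `torch.cuda.get_device_properties()`; here it is a plain argument so the rounding logic stands alone.

```python
def vram_budget_gb(total_bytes):
    # Round detected VRAM down to whole GiB for a safety margin.
    return total_bytes // (1024 ** 3)
```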

3. Simplified Environment Variables

Removed env vars (now automatic):

Variable            Reason
  • GPU_LAYERS: auto-calibrated based on VRAM
  • VRAM_BUDGET: auto-detected from GPU
  • LLM_MAX_TOKENS: dynamic based on prompt size
  • IMG_ENABLED: detected from IMG_MODEL being set
  • EMBEDDING_ENABLED: always available
  • IMG_DEVICE: auto-detects CUDA availability

Renamed:

  • SD_MODEL → IMG_MODEL (clearer naming)

Simplified DEFAULT_MODEL format:

# Old: model@tokens,model@tokens
DEFAULT_MODEL=model1@64000,model2@8192

# New: just list models (context is dynamic)
DEFAULT_MODEL=model1,model2
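
A parser that accepts both forms might simply drop the legacy `@tokens` suffix, as in this sketch (how ezlocalai itself handles leftover old-format values is not stated here).

```python
def parse_default_model(value):
    # Split the comma-separated list and strip any legacy "@tokens"
    # suffix, since context is now sized dynamically.
    models = []
    for entry in value.split(","):
        entry = entry.strip()
        if entry:
            models.append(entry.split("@", 1)[0])
    return models
```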

4. Vision Model Fallback

When a non-vision model receives an image request:

  1. System detects images in request
  2. Finds a vision-capable model from available models
  3. Uses vision model to describe images
  4. Prepends description to prompt
  5. Processes with requested model

This allows coding models (non-vision) to receive image context!
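
The five steps above can be sketched as a single function. `describe_image` and `model` are stand-ins for the vision-model call and the requested model's completion call; the actual control flow in ezlocalai is more involved.

```python
def complete_with_vision_fallback(prompt, images, model, describe_image):
    # If the request carries images and the target model lacks vision,
    # describe each image and prepend the descriptions to the prompt.
    if images:
        descriptions = [describe_image(img) for img in images]
        prompt = "\n".join(descriptions) + "\n" + prompt
    return model(prompt)
```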

5. LLM Engine: llama-cpp-python → xllamacpp

Replaced llama-cpp-python with xllamacpp:

  • Unified LLM + Vision via multimodal projector (mmproj)
  • Native estimate_gpu_layers() function for fast calibration
  • Cleaner API through xllamacpp.Server

6. Multi-Model Hot-Swap

Support for multiple LLMs with automatic hot-swap:

DEFAULT_MODEL=unsloth/Qwen3-VL-4B-Instruct-GGUF,unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF
  • First model loads at startup with baseline 32k context
  • Models swap on demand based on request
  • Context-aware: recalibrates GPU layers when context size changes

7. Pre-Calibration at Startup

All models in DEFAULT_MODEL are pre-calibrated at startup:

  • Uses xllamacpp's native estimate_gpu_layers()
  • Calibrates at 32k baseline context
  • Caches results for instant swaps
  • Recalibrates on-demand for larger contexts
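
The caching behavior reduces to memoizing per (model, context) pair, as in this sketch. `estimate_layers` stands in for xllamacpp's `estimate_gpu_layers()`; the cache key and eviction policy here are assumptions.

```python
class CalibrationCache:
    # Compute GPU layers once per (model, context) pair and reuse the
    # result so hot-swaps are instant; recalibrate only on a cache miss.
    def __init__(self, estimate_layers):
        self.estimate_layers = estimate_layers
        self.cache = {}

    def layers_for(self, model, context=32_000):
        key = (model, context)
        if key not in self.cache:
            self.cache[key] = self.estimate_layers(model, context)
        return self.cache[key]
```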

8. Text-to-Speech: XTTS → Chatterbox TTS

Replaced XTTS with Chatterbox TTS:

  • Modern Llama 3.2 backbone (0.5B params)
  • Better voice cloning quality
  • Preloaded at startup to warm cache (~38s → ~5s on reload)
  • Lazy-loaded on demand, unloaded after to free VRAM

9. Image Generation: SDXL-Turbo → SDXL-Lightning

Migrated to ByteDance/SDXL-Lightning:

  • Better image quality at low step counts
  • GPU: 2-step generation (blazing fast)
  • CPU: 4-step generation (stable)
  • Auto-detects CUDA for device selection

10. Embeddings: ONNX → BGE-M3

Replaced ONNX embedding with native BAAI/bge-m3:

  • Removed ONNX/Optimum dependencies
  • 1024-dimensional embeddings
  • Native PyTorch GPU acceleration

11. Pip-Installable CLI

New lightweight CLI for easy installation and management:

pip install ezlocalai
ezlocalai start

Features:

  • Auto-detects GPU (NVIDIA) or falls back to CPU mode
  • Auto-installs prerequisites on Linux (Docker, NVIDIA Container Toolkit)
  • Simple commands: start, stop, restart, status, logs
  • Configurable: --model, --uri, --api-key, --ngrok, --whisper, --img-model
  • Minimal dependencies: Only click and requests (no heavy ML libraries)

# Examples
ezlocalai start --model unsloth/gemma-3-4b-it-GGUF
ezlocalai start --api-key my-secret-key --ngrok <token>
ezlocalai logs -f

Performance Benchmarks

Test System

  • CPU: 12th Gen Intel Core i9-12900KS (24 cores)
  • GPU: NVIDIA GeForce RTX 4090 (24GB VRAM)
  • Models Tested:
    • unsloth/Qwen3-VL-4B-Instruct-GGUF - 4B vision model (Q4_K_XL quantization)
    • unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF - 30B MoE coding model (Q4_K_XL quantization)

LLM Inference Performance (RTX 4090)

Model           | Size      | Tokens | Time  | Speed
Qwen3-VL-4B     | 4B        | 100    | 0.47s | 213 tok/s
Qwen3-VL-4B     | 4B        | 134    | 0.64s | 209 tok/s
Qwen3-Coder-30B | 30B (MoE) | 100    | 1.53s | 65 tok/s
Qwen3-Coder-30B | 30B (MoE) | 300    | 4.41s | 68 tok/s

Key Observations

  • 4B Vision Model: ~210 tok/s average - excellent for interactive use
  • 30B Coding Model: ~65-68 tok/s - great for code generation despite size
  • Hot-swap time: ~1s between models (pre-calibrated)
  • Dynamic context: 32k baseline, scales up as needed

GPU Layer Calibration

Auto-calibrated at startup for 32k context:

Model           | GPU Layers | VRAM Usage
Qwen3-VL-4B     | 37 layers  | ~12GB
Qwen3-Coder-30B | 45 layers  | ~24GB

Minimal Configuration

.env file (that's it!):

# ezlocalai Configuration - keeping it "ez"
# VRAM is auto-detected, context is dynamic, GPU layers are auto-calibrated

MAIN_GPU=0
DEFAULT_MODEL=unsloth/Qwen3-VL-4B-Instruct-GGUF,unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF

# Image generation (set to enable, leave empty to disable)
IMG_MODEL=

# Speech-to-text model
WHISPER_MODEL=base

# Queue settings
MAX_CONCURRENT_REQUESTS=2

API Compatibility

All API endpoints remain unchanged. This is a drop-in upgrade.

New CLI

A lightweight pip-installable CLI for managing ezlocalai Docker containers:

pip install ezlocalai
ezlocalai start

The CLI:

  • Auto-detects GPU and uses CUDA or CPU mode appropriately
  • Auto-installs prerequisites (Docker, NVIDIA Container Toolkit) on Linux
  • Builds CUDA image locally from source (too large for DockerHub)
  • Persists data in ~/.ezlocalai/data/ (models, outputs, voices)
  • No prompts - designed to be as "ez" as possible

Commands:

  • ezlocalai start [--model X] - Start with optional model override
  • ezlocalai stop - Stop the container
  • ezlocalai restart - Restart with same config
  • ezlocalai status - Show running status and configuration
  • ezlocalai logs [-f] - View container logs
  • ezlocalai update - Pull latest CPU image / rebuild CUDA image

v1.0.3

26 Nov 15:36

ezLocalai v1.0.3

v1.0.2

26 Nov 14:44

ezLocalai v1.0.2

v1.0.1

26 Nov 14:39

ezLocalai v1.0.1

v1.0.0

26 Nov 05:33
229de8a

Choose a tag to compare

ezLocalai v1.0.0

Summary

This release represents a complete modernization of ezlocalai's AI stack, making it truly "ez" - simpler configuration, smarter defaults, and better performance. The result is a cleaner codebase with fewer dependencies, automatic optimization, and minimal required configuration.

Philosophy: VRAM is auto-detected, context is dynamic, GPU layers are auto-calibrated. Just set your models and go.

Key Changes

1. Dynamic Context Sizing

No more guessing context sizes! The system automatically sizes context to fit your prompt:

  • Estimates prompt tokens (chars / 4)
  • Rounds up to nearest 32k boundary (32k, 64k, 96k, 128k)
  • Recalibrates GPU layers if context increases
  • Result: Much faster inference on short prompts (128k→32k = 63% speedup)

2. Auto VRAM Detection

VRAM budget is now detected automatically at startup:

  • Queries torch.cuda.get_device_properties()
  • Rounds down to nearest GB for safety margin
  • No more manual VRAM_BUDGET configuration

3. Simplified Environment Variables

Removed env vars (now automatic):

Variable            Reason
GPU_LAYERS          Auto-calibrated based on VRAM
VRAM_BUDGET         Auto-detected from GPU
LLM_MAX_TOKENS      Dynamic based on prompt size
IMG_ENABLED         Detected from IMG_MODEL being set
EMBEDDING_ENABLED   Always available
IMG_DEVICE          Auto-detects CUDA availability

Renamed:

Old        New        Reason
SD_MODEL   IMG_MODEL  Clearer naming

Simplified DEFAULT_MODEL format:

# Old: model@tokens,model@tokens
DEFAULT_MODEL=model1@64000,model2@8192

# New: just list models (context is dynamic)
DEFAULT_MODEL=model1,model2
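
A migration helper for old-format values could simply drop the @tokens suffix, since context is now dynamic. This is a hypothetical illustration, not part of ezlocalai:

```python
# Sketch: convert an old-style DEFAULT_MODEL value ("model@tokens,...") to
# the new style ("model,..."). Hypothetical helper, for illustration only.

def migrate_default_model(value: str) -> str:
    models = [entry.split("@", 1)[0] for entry in value.split(",") if entry]
    return ",".join(models)

print(migrate_default_model("model1@64000,model2@8192"))  # -> model1,model2
```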

4. Vision Model Fallback

When a non-vision model receives an image request:

  1. System detects images in request
  2. Finds a vision-capable model from available models
  3. Uses vision model to describe images
  4. Prepends description to prompt
  5. Processes with requested model

This allows coding models (non-vision) to receive image context!

5. LLM Engine: llama-cpp-python → xllamacpp

Replaced llama-cpp-python with xllamacpp:

  • Unified LLM + Vision via multimodal projector (mmproj)
  • Native estimate_gpu_layers() function for fast calibration
  • Cleaner API through xllamacpp.Server

6. Multi-Model Hot-Swap

Support for multiple LLMs with automatic hot-swap:

DEFAULT_MODEL=unsloth/Qwen3-VL-4B-Instruct-GGUF,unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF

  • First model loads at startup with baseline 32k context
  • Models swap on demand based on request
  • Context-aware: recalibrates GPU layers when context size changes

7. Pre-Calibration at Startup

All models in DEFAULT_MODEL are pre-calibrated at startup:

  • Uses xllamacpp's native estimate_gpu_layers()
  • Calibrates at 32k baseline context
  • Caches results for instant swaps
  • Recalibrates on-demand for larger contexts
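
The pre-calibration and caching behavior can be sketched as follows. The estimator callable stands in for xllamacpp's native estimate_gpu_layers(); the class and all names are hypothetical, not ezlocalai's actual code.

```python
# Sketch of startup pre-calibration with on-demand recalibration for larger
# contexts. Illustrative only; the real estimator is xllamacpp's.

BASELINE_CONTEXT = 32_000

class LayerCache:
    def __init__(self, estimate_layers):
        self._estimate = estimate_layers
        self._cache = {}  # (model, context) -> GPU layer count

    def precalibrate(self, models):
        """Calibrate every configured model at the 32k baseline at startup."""
        for model in models:
            self.layers_for(model, BASELINE_CONTEXT)

    def layers_for(self, model, context):
        """Return cached GPU layers, recalibrating for unseen contexts."""
        key = (model, context)
        if key not in self._cache:
            self._cache[key] = self._estimate(model, context)
        return self._cache[key]

# Stub estimator: larger contexts leave VRAM for fewer GPU layers.
cache = LayerCache(lambda model, ctx: 48 - ctx // 32_000)
cache.precalibrate(["vl-4b", "coder-30b"])
print(cache.layers_for("vl-4b", 32_000))   # cached at startup -> 47
print(cache.layers_for("vl-4b", 128_000))  # larger context recalibrates -> 44
```

Caching the baseline result at startup is what makes hot-swaps near-instant; only a context increase pays the recalibration cost.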

8. Text-to-Speech: XTTS → Chatterbox TTS

Replaced XTTS with Chatterbox TTS:

  • Modern Llama 3.2 backbone (0.5B params)
  • Better voice cloning quality
  • Preloaded at startup to warm cache (~38s → ~5s on reload)
  • Lazy-loaded on demand, unloaded afterward to free VRAM

9. Image Generation: SDXL-Turbo → SDXL-Lightning

Migrated to ByteDance/SDXL-Lightning:

  • Better image quality at low step counts
  • GPU: 2-step generation (blazing fast)
  • CPU: 4-step generation (stable)
  • Auto-detects CUDA for device selection

10. Embeddings: ONNX → BGE-M3

Replaced ONNX embedding with native BAAI/bge-m3:

  • Removed ONNX/Optimum dependencies
  • 1024-dimensional embeddings
  • Native PyTorch GPU acceleration

11. Pip-Installable CLI

New lightweight CLI for easy installation and management:

pip install ezlocalai
ezlocalai start

Features:

  • Auto-detects GPU (NVIDIA) or falls back to CPU mode
  • Auto-installs prerequisites on Linux (Docker, NVIDIA Container Toolkit)
  • Simple commands: start, stop, restart, status, logs
  • Configurable: --model, --uri, --api-key, --ngrok, --whisper, --img-model
  • Minimal dependencies: Only click and requests (no heavy ML libraries)

# Examples
ezlocalai start --model unsloth/gemma-3-4b-it-GGUF
ezlocalai start --api-key my-secret-key --ngrok <token>
ezlocalai logs -f

Performance Benchmarks

Test System

  • CPU: 12th Gen Intel Core i9-12900KS (24 cores)
  • GPU: NVIDIA GeForce RTX 4090 (24GB VRAM)
  • Models Tested:
    • unsloth/Qwen3-VL-4B-Instruct-GGUF - 4B vision model (Q4_K_XL quantization)
    • unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF - 30B MoE coding model (Q4_K_XL quantization)

LLM Inference Performance (RTX 4090)

Model            Size       Tokens  Time   Speed
Qwen3-VL-4B      4B         100     0.47s  213 tok/s
Qwen3-VL-4B      4B         134     0.64s  209 tok/s
Qwen3-Coder-30B  30B (MoE)  100     1.53s  65 tok/s
Qwen3-Coder-30B  30B (MoE)  300     4.41s  68 tok/s

Key Observations

  • 4B Vision Model: ~210 tok/s average - excellent for interactive use
  • 30B Coding Model: ~65-68 tok/s - great for code generation despite size
  • Hot-swap time: ~1s between models (pre-calibrated)
  • Dynamic context: 32k baseline, scales up as needed

GPU Layer Calibration

Auto-calibrated at startup for 32k context:

Model            GPU Layers  VRAM Usage
Qwen3-VL-4B      37          ~12GB
Qwen3-Coder-30B  45          ~24GB

Minimal Configuration

.env file (that's it!):

# ezlocalai Configuration - keeping it "ez"
# VRAM is auto-detected, context is dynamic, GPU layers are auto-calibrated

MAIN_GPU=0
DEFAULT_MODEL=unsloth/Qwen3-VL-4B-Instruct-GGUF,unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF

# Image generation (set to enable, leave empty to disable)
IMG_MODEL=

# Speech-to-text model
WHISPER_MODEL=base

# Queue settings
MAX_CONCURRENT_REQUESTS=2

API Compatibility

All API endpoints remain unchanged. This is a drop-in upgrade.

New CLI

A lightweight pip-installable CLI for managing ezlocalai Docker containers:

pip install ezlocalai
ezlocalai start

The CLI:

  • Auto-detects GPU and uses CUDA or CPU mode appropriately
  • Auto-installs prerequisites (Docker, NVIDIA Container Toolkit) on Linux
  • Builds the CUDA image locally from source (the image is too large to host on DockerHub)
  • Persists data in ~/.ezlocalai/data/ (models, outputs, voices)
  • No prompts - designed to be as "ez" as possible

Commands:

  • ezlocalai start [--model X] - Start with optional model override
  • ezlocalai stop - Stop the container
  • ezlocalai restart - Restart with same config
  • ezlocalai status - Show running status and configuration
  • ezlocalai logs [-f] - View container logs
  • ezlocalai update - Pull latest CPU image / rebuild CUDA image

v0.1.14

10 Sep 19:14
cf53709

Choose a tag to compare

What's Changed

New Contributors

Full Changelog: v0.1.13...v0.1.14