Releases: DevXT-LLC/ezlocalai
v1.0.8
ezLocalai v1.0.8
Video Generation (NEW)
- LTX-2.3 GGUF Video Generation — Full text-to-video, image-to-video, and video-to-video generation powered by LTX-2.3 with GGUF quantization (Q4_K_M, Q8_0)
- New `POST /v1/videos/generations` endpoint
- Sequential CPU offload for low-VRAM GPUs (8GB+)
- Multimodal outputs including synchronized audio generation (24kHz)
- Multi-frame conditioning for video-to-video workflows
- Text encoder strategies automatically selected based on available memory: GPU BNB 4-bit, CPU Quanto INT8, or CPU BF16
- Configurable via the `VIDEO_MODEL` environment variable (default: `"none"`; set to `"unsloth/LTX-2.3-GGUF"` to enable)
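A minimal sketch of calling the new endpoint from Python. The host/port, payload field names, and binary response handling are assumptions; check your instance's `/docs` page for the exact schema:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8091/v1"  # assumption: adjust host/port to your deployment


def build_video_request(prompt: str, model: str = "unsloth/LTX-2.3-GGUF") -> dict:
    # Field names are illustrative, mirroring the OpenAI-style generation endpoints.
    return {"model": model, "prompt": prompt}


def generate_video(prompt: str, api_key: str = "none") -> bytes:
    req = urllib.request.Request(
        f"{BASE_URL}/videos/generations",
        data=json.dumps(build_video_request(prompt)).encode(),
        headers={"Content-Type": "application/json", "Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req, timeout=600) as resp:  # generation can take minutes
        return resp.read()
```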
Video Understanding (NEW)
- Vision-language models can now process video inputs in chat completions
- Scene-change detection automatically identifies key frames for efficient analysis
- Audio track extraction and transcription via Whisper for combined visual + audio understanding
- Supports video URLs and base64-encoded video inputs
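For example, a video can ride along in an ordinary chat-completion message as a content part. The `video_url` part type below is an assumption modeled on the `image_url` convention; verify it against your instance:

```python
def video_chat_messages(question: str, video_url: str) -> list:
    # "video_url" part type is assumed by analogy with the OpenAI "image_url" format;
    # base64-encoded video inputs should also work per the release notes.
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "video_url", "video_url": {"url": video_url}},
            ],
        }
    ]
```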
Image Editing (NEW)
- FLUX.2-klein-4B with GGUF quantization — Replaces previous image model with a unified architecture supporting both image generation and editing
- New `POST /v1/images/edits` endpoint
- 15 GGUF quantization options (Q2_K through F16)
- Sequential CPU offload support matching video pipeline
- Configurable via the `IMG_MODEL` environment variable (default: `"unsloth/FLUX.2-klein-4B-GGUF"`)
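A sketch of assembling an edit request body. A JSON body with a base64-encoded image is an assumption; the OpenAI reference endpoint uses multipart form data, so confirm the accepted format against your instance's `/docs`:

```python
import base64


def build_edit_request(image_bytes: bytes, prompt: str,
                       model: str = "unsloth/FLUX.2-klein-4B-GGUF") -> dict:
    # POST this as JSON to /v1/images/edits (body shape is an assumption).
    return {
        "model": model,
        "image": base64.b64encode(image_bytes).decode(),
        "prompt": prompt,
    }
```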
Speaker Diarization (NEW)
- Automatic speaker identification in audio transcriptions using MFCC-based clustering
- Persistent voice print storage with session-scoped UUIDs and 24-hour TTL
- Auto-detects number of speakers using inconsistency coefficient
- Cross-chunk speaker consistency — same physical speaker maintains the same ID across segments
- SRT and WebVTT subtitle output with speaker labels
Distributed Inference & Server Roles (NEW)
- Voice Server (`VOICE_SERVER`) — Offload TTS/STT to a dedicated ezlocalai instance
- Image Server (`IMAGE_SERVER`) — Offload image and video generation to a dedicated instance
- Text Server (`TEXT_SERVER`) — Offload LLM text completion to a dedicated instance
- Each server role supports independent API keys
- Lazy loading of voice models by default (`LAZY_LOAD_VOICE=true`)
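A minimal `.env` sketch for a node that keeps the LLM local and offloads voice and image work. Hostnames, ports, and the `*_API_KEY` variable names are placeholders/assumptions; only the role variables themselves come from the notes above:

```
# Keep LLM local; offload voice and image/video to dedicated instances
VOICE_SERVER=http://voice-box:8091
VOICE_SERVER_API_KEY=voice-secret
IMAGE_SERVER=http://image-box:8091
IMAGE_SERVER_API_KEY=image-secret
LAZY_LOAD_VOICE=true
```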
Fallback Server Support (NEW)
- Route requests to a remote server when local resources are exhausted
- Supports both ezlocalai instances and OpenAI-compatible APIs as fallback targets
- Configurable memory threshold (`FALLBACK_MEMORY_THRESHOLD`) for combined VRAM + RAM monitoring
- Automatic queue-based fallback with configurable wait timeout (`QUEUE_WAIT_TIMEOUT`)
Request Queue System (NEW)
- Async request queue with configurable concurrency limits (`MAX_CONCURRENT_REQUESTS`, `MAX_QUEUE_SIZE`)
- Request lifecycle tracking: queued → processing → completed/failed
- Cancellation support for fallback routing
- Request history and metrics (total processed, failed, queued, active)
- Configurable request timeout (`REQUEST_TIMEOUT`)
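Putting the fallback and queue knobs together, a sketch of the relevant `.env` entries (values are illustrative; treating the timeouts as seconds and the threshold as a fraction are assumptions):

```
MAX_CONCURRENT_REQUESTS=4
MAX_QUEUE_SIZE=50
REQUEST_TIMEOUT=300
QUEUE_WAIT_TIMEOUT=60
FALLBACK_MEMORY_THRESHOLD=0.9
```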
LLM Improvements
- Default model updated to `Qwen3.5-4B-GGUF`, with new chat templates for the Qwen3.5 4B, 2B, and 0.8B variants
- Auto-reduce GPU layers on OOM instead of falling back entirely
- Multi-GPU improvements — Dynamic tensor split calculation based on free VRAM across all GPUs
- Parallel inference slots (`N_PARALLEL`) — Auto-scales targeting ~32K context per slot (max 16 slots)
- Configurable KV cache type (`QUANT_TYPE`, default: `Q4_K_XL`)
- Configurable `LLM_MAX_TOKENS` (default: 40000) and `REASONING_BUDGET` for thinking tokens
- Improved concurrency handling and logging
Platform Support
- NVIDIA Jetson — New
jetson.Dockerfileanddocker-compose-jetson.ymlwith platform-specific import handling - Raspberry Pi 5 with AI HAT+ 2 — New
rpi.Dockerfile,docker-compose-rpi.yml, andrpi-requirements.txt - Chatterbox TTS — Watermarker compatibility fix for perth module
Testing & CI
- Expanded GitHub Actions test workflow with TTS and STT testing enabled
- Tests use Qwen3.5-0.8B for faster CI runs
- Improved test reliability with better concurrency handling
Docker & Dependencies
- Updated to official xllamacpp v0.2.12 releases
- Updated setuptools to 78.1.1
- Updated CUDA and ROCm requirements files
- All Docker Compose files updated with new environment variables for distributed inference
v1.0.7
v1.0.6
v1.0.5
v1.0.4
ezLocalai v1.0.4
Summary
This release represents a complete modernization of ezlocalai's AI stack, making it truly "ez" - simpler configuration, smarter defaults, and better performance. The result is a cleaner codebase with fewer dependencies, automatic optimization, and minimal required configuration.
Philosophy: VRAM is auto-detected, context is dynamic, GPU layers are auto-calibrated. Just set your models and go.
Key Changes
1. Dynamic Context Sizing
No more guessing context sizes! The system automatically sizes context to fit your prompt:
- Estimates prompt tokens (chars / 4)
- Rounds up to nearest 32k boundary (32k, 64k, 96k, 128k)
- Recalibrates GPU layers if context increases
- Result: Much faster inference on short prompts (128k→32k = 63% speedup)
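The sizing rule above can be sketched in a few lines (whether a "32k boundary" means 32,000 or 32,768 tokens is an internal detail; 32,000 is assumed here):

```python
def dynamic_context(prompt_chars: int, step: int = 32_000, max_ctx: int = 128_000) -> int:
    """Estimate tokens as chars/4, then round up to the next 32k boundary (capped)."""
    est_tokens = prompt_chars // 4
    ctx = ((est_tokens + step - 1) // step) * step  # ceiling to a step boundary
    return max(step, min(ctx, max_ctx))
```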
2. Auto VRAM Detection
VRAM budget is now detected automatically at startup:
- Queries `torch.cuda.get_device_properties()`
- Rounds down to the nearest GB for a safety margin
- No more manual `VRAM_BUDGET` configuration
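The round-down step is simple integer arithmetic; in a live process `total_bytes` would come from `torch.cuda.get_device_properties(0).total_memory`:

```python
def vram_budget_gb(total_bytes: int) -> int:
    # Round down to the nearest whole GB as a safety margin.
    return total_bytes // (1024 ** 3)

# In a live process:
#   import torch
#   budget = vram_budget_gb(torch.cuda.get_device_properties(0).total_memory)
```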
3. Simplified Environment Variables
Removed env vars (now automatic):
| Variable | Reason |
|---|---|
| `GPU_LAYERS` | Auto-calibrated based on VRAM |
| `VRAM_BUDGET` | Auto-detected from GPU |
| `LLM_MAX_TOKENS` | Dynamic based on prompt size |
| `IMG_ENABLED` | Detected from `IMG_MODEL` being set |
| `EMBEDDING_ENABLED` | Always available |
| `IMG_DEVICE` | Auto-detects CUDA availability |
Renamed:
| Old | New | Reason |
|---|---|---|
| `SD_MODEL` | `IMG_MODEL` | Clearer naming |
Simplified DEFAULT_MODEL format:
```
# Old: model@tokens,model@tokens
DEFAULT_MODEL=model1@64000,model2@8192

# New: just list models (context is dynamic)
DEFAULT_MODEL=model1,model2
```
4. Vision Model Fallback
When a non-vision model receives an image request:
- System detects images in request
- Finds a vision-capable model from available models
- Uses vision model to describe images
- Prepends description to prompt
- Processes with requested model
This allows coding models (non-vision) to receive image context!
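The flow above, as a sketch with hypothetical helpers (`describe` and `complete` stand in for the vision and text pipelines; the notes say the description is prepended to the prompt, modeled here as a system message):

```python
def route_with_vision_fallback(request, models, describe, complete):
    # Hypothetical signatures: describe(model, image_parts) -> str,
    # complete(model, messages) -> response; `models` maps name -> {"vision": bool}.
    image_parts = [
        part
        for message in request["messages"]
        if isinstance(message.get("content"), list)
        for part in message["content"]
        if part.get("type") == "image_url"
    ]
    if image_parts and not models[request["model"]].get("vision"):
        # Find any vision-capable model and use it to describe the images.
        vision_model = next(n for n, meta in models.items() if meta.get("vision"))
        description = describe(vision_model, image_parts)
        # Prepend the description so the non-vision model still gets image context.
        request["messages"].insert(
            0, {"role": "system", "content": f"Image context: {description}"}
        )
    return complete(request["model"], request["messages"])
```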
5. LLM Engine: llama-cpp-python → xllamacpp
Replaced llama-cpp-python with xllamacpp:
- Unified LLM + Vision via multimodal projector (mmproj)
- Native `estimate_gpu_layers()` function for fast calibration
- Cleaner API through `xllamacpp.Server`
6. Multi-Model Hot-Swap
Support for multiple LLMs with automatic hot-swap:
```
DEFAULT_MODEL=unsloth/Qwen3-VL-4B-Instruct-GGUF,unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF
```
- First model loads at startup with baseline 32k context
- Models swap on demand based on request
- Context-aware: recalibrates GPU layers when context size changes
7. Pre-Calibration at Startup
All models in DEFAULT_MODEL are pre-calibrated at startup:
- Uses xllamacpp's native `estimate_gpu_layers()`
- Calibrates at 32k baseline context
- Caches results for instant swaps
- Recalibrates on-demand for larger contexts
8. Text-to-Speech: XTTS → Chatterbox TTS
Replaced XTTS with Chatterbox TTS:
- Modern Llama 3.2 backbone (0.5B params)
- Better voice cloning quality
- Preloaded at startup to warm cache (~38s → ~5s on reload)
- Lazy-loaded on demand, unloaded after to free VRAM
9. Image Generation: SDXL-Turbo → SDXL-Lightning
Migrated to ByteDance/SDXL-Lightning:
- Better image quality at low step counts
- GPU: 2-step generation (blazing fast)
- CPU: 4-step generation (stable)
- Auto-detects CUDA for device selection
10. Embeddings: ONNX → BGE-M3
Replaced ONNX embedding with native BAAI/bge-m3:
- Removed ONNX/Optimum dependencies
- 1024-dimensional embeddings
- Native PyTorch GPU acceleration
11. Pip-Installable CLI
New lightweight CLI for easy installation and management:
```
pip install ezlocalai
ezlocalai start
```
Features:
- Auto-detects GPU (NVIDIA) or falls back to CPU mode
- Auto-installs prerequisites on Linux (Docker, NVIDIA Container Toolkit)
- Simple commands: `start`, `stop`, `restart`, `status`, `logs`
- Configurable: `--model`, `--uri`, `--api-key`, `--ngrok`, `--whisper`, `--img-model`
- Minimal dependencies: only `click` and `requests` (no heavy ML libraries)
```
# Examples
ezlocalai start --model unsloth/gemma-3-4b-it-GGUF
ezlocalai start --api-key my-secret-key --ngrok <token>
ezlocalai logs -f
```
Performance Benchmarks
Test System
- CPU: 12th Gen Intel Core i9-12900KS (24 cores)
- GPU: NVIDIA GeForce RTX 4090 (24GB VRAM)
- Models Tested:
  - `unsloth/Qwen3-VL-4B-Instruct-GGUF` - 4B vision model (Q4_K_XL quantization)
  - `unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF` - 30B MoE coding model (Q4_K_XL quantization)
LLM Inference Performance (RTX 4090)
| Model | Size | Tokens | Time | Speed |
|---|---|---|---|---|
| Qwen3-VL-4B | 4B | 100 | 0.47s | 213 tok/s |
| Qwen3-VL-4B | 4B | 134 | 0.64s | 209 tok/s |
| Qwen3-Coder-30B | 30B (MoE) | 100 | 1.53s | 65 tok/s |
| Qwen3-Coder-30B | 30B (MoE) | 300 | 4.41s | 68 tok/s |
Key Observations
- 4B Vision Model: ~210 tok/s average - excellent for interactive use
- 30B Coding Model: ~65-68 tok/s - great for code generation despite size
- Hot-swap time: ~1s between models (pre-calibrated)
- Dynamic context: 32k baseline, scales up as needed
GPU Layer Calibration
Auto-calibrated at startup for 32k context:
| Model | GPU Layers | VRAM Usage |
|---|---|---|
| Qwen3-VL-4B | 37 layers | ~12GB |
| Qwen3-Coder-30B | 45 layers | ~24GB |
Minimal Configuration
`.env` file (that's it!):
```
# ezlocalai Configuration - keeping it "ez"
# VRAM is auto-detected, context is dynamic, GPU layers are auto-calibrated
MAIN_GPU=0
DEFAULT_MODEL=unsloth/Qwen3-VL-4B-Instruct-GGUF,unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF
# Image generation (set to enable, leave empty to disable)
IMG_MODEL=
# Speech-to-text model
WHISPER_MODEL=base
# Queue settings
MAX_CONCURRENT_REQUESTS=2
```
API Compatibility
All API endpoints remain unchanged. This is a drop-in upgrade.
New CLI
A lightweight pip-installable CLI for managing ezlocalai Docker containers:
```
pip install ezlocalai
ezlocalai start
```
The CLI:
- Auto-detects GPU and uses CUDA or CPU mode appropriately
- Auto-installs prerequisites (Docker, NVIDIA Container Toolkit) on Linux
- Builds the CUDA image locally from source (too large for Docker Hub)
- Persists data in `~/.ezlocalai/data/` (models, outputs, voices)
- No prompts - designed to be as "ez" as possible
Commands:
- `ezlocalai start [--model X]` - Start with optional model override
- `ezlocalai stop` - Stop the container
- `ezlocalai restart` - Restart with same config
- `ezlocalai status` - Show running status and configuration
- `ezlocalai logs [-f]` - View container logs
- `ezlocalai update` - Pull latest CPU image / rebuild CUDA image
v1.0.3
v1.0.2
v1.0.1
v1.0.0
ezLocalai v1.0.0
Summary
This release represents a complete modernization of ezlocalai's AI stack, making it truly "ez" - simpler configuration, smarter defaults, and better performance. The result is a cleaner codebase with fewer dependencies, automatic optimization, and minimal required configuration.
Philosophy: VRAM is auto-detected, context is dynamic, GPU layers are auto-calibrated. Just set your models and go.
Key Changes
1. Dynamic Context Sizing
No more guessing context sizes! The system automatically sizes context to fit your prompt:
- Estimates prompt tokens (chars / 4)
- Rounds up to nearest 32k boundary (32k, 64k, 96k, 128k)
- Recalibrates GPU layers if context increases
- Result: Much faster inference on short prompts (128k→32k = 63% speedup)
2. Auto VRAM Detection
VRAM budget is now detected automatically at startup:
- Queries
torch.cuda.get_device_properties() - Rounds down to nearest GB for safety margin
- No more manual
VRAM_BUDGETconfiguration
3. Simplified Environment Variables
Removed env vars (now automatic):
| Variable | Reason |
|---|---|
GPU_LAYERS |
Auto-calibrated based on VRAM |
VRAM_BUDGET |
Auto-detected from GPU |
LLM_MAX_TOKENS |
Dynamic based on prompt size |
IMG_ENABLED |
Detected from IMG_MODEL being set |
EMBEDDING_ENABLED |
Always available |
IMG_DEVICE |
Auto-detects CUDA availability |
Renamed:
| Old | New | Reason |
|---|---|---|
SD_MODEL |
IMG_MODEL |
Clearer naming |
Simplified DEFAULT_MODEL format:
# Old: model@tokens,model@tokens
DEFAULT_MODEL=model1@64000,model2@8192
# New: just list models (context is dynamic)
DEFAULT_MODEL=model1,model24. Vision Model Fallback
When a non-vision model receives an image request:
- System detects images in request
- Finds a vision-capable model from available models
- Uses vision model to describe images
- Prepends description to prompt
- Processes with requested model
This allows coding models (non-vision) to receive image context!
5. LLM Engine: llama-cpp-python → xllamacpp
Replaced llama-cpp-python with xllamacpp:
- Unified LLM + Vision via multimodal projector (mmproj)
- Native
estimate_gpu_layers()function for fast calibration - Cleaner API through
xllamacpp.Server
6. Multi-Model Hot-Swap
Support for multiple LLMs with automatic hot-swap:
```
DEFAULT_MODEL=unsloth/Qwen3-VL-4B-Instruct-GGUF,unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF
```

- First model loads at startup with a baseline 32k context
- Models swap on demand based on the request
- Context-aware: recalibrates GPU layers when the context size changes
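Because swaps are driven by the request, selecting a model is just a matter of setting the `model` field in a standard chat-completions body. A hedged example (the port and API key are placeholders, not documented defaults):

```python
def build_chat_request(model: str, prompt: str) -> dict:
    """Build an OpenAI-compatible /v1/chat/completions body for a given model."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

body = build_chat_request(
    "unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF",
    "Write a binary search in Python.",
)

# To send it (requires a running ezlocalai server; URL/key are placeholders):
# import requests
# resp = requests.post("http://localhost:8091/v1/chat/completions",
#                      headers={"Authorization": "Bearer YOUR_KEY"}, json=body)
```

If the named model is not currently loaded, the server hot-swaps to it before answering.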
7. Pre-Calibration at Startup
All models in DEFAULT_MODEL are pre-calibrated at startup:
- Uses xllamacpp's native `estimate_gpu_layers()`
- Calibrates at a 32k baseline context
- Caches results for instant swaps
- Recalibrates on demand for larger contexts
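The calibrate-once, cache-forever pattern can be sketched as follows. `estimate_layers` stands in for xllamacpp's `estimate_gpu_layers()`; the cache structure and function names are illustrative, not ezlocalai's actual internals.

```python
BASELINE_CONTEXT = 32_768
_calibration_cache: dict[tuple[str, int], int] = {}

def calibrate(model: str, context: int, estimate_layers) -> int:
    """Return the cached GPU-layer count, calibrating on first use."""
    key = (model, context)
    if key not in _calibration_cache:
        _calibration_cache[key] = estimate_layers(model, context)
    return _calibration_cache[key]

def precalibrate(models: list[str], estimate_layers) -> None:
    """Warm the cache at the 32k baseline so model swaps are instant."""
    for model in models:
        calibrate(model, BASELINE_CONTEXT, estimate_layers)
```

A request needing a larger context simply misses the cache for that `(model, context)` pair and triggers a one-time recalibration.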
8. Text-to-Speech: XTTS → Chatterbox TTS
Replaced XTTS with Chatterbox TTS:
- Modern Llama 3.2 backbone (0.5B params)
- Better voice cloning quality
- Preloaded once at startup to warm the cache (~38s cold load → ~5s on reload)
- Lazy-loaded on demand thereafter, then unloaded to free VRAM
9. Image Generation: SDXL-Turbo → SDXL-Lightning
Migrated to ByteDance/SDXL-Lightning:
- Better image quality at low step counts
- GPU: 2-step generation (blazing fast)
- CPU: 4-step generation (stable)
- Auto-detects CUDA for device selection
10. Embeddings: ONNX → BGE-M3
Replaced ONNX embedding with native BAAI/bge-m3:
- Removed ONNX/Optimum dependencies
- 1024-dimensional embeddings
- Native PyTorch GPU acceleration
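Embeddings remain available through the OpenAI-compatible endpoint. A hedged sketch of the request body (the model name, URL, and key below are placeholders; the shape follows the OpenAI embeddings schema that ezlocalai mirrors):

```python
import json

def build_embeddings_request(texts: list[str], model: str = "bge-m3") -> dict:
    """Build the JSON body for POST /v1/embeddings."""
    return {"model": model, "input": texts}

body = build_embeddings_request(["hello world"])
print(json.dumps(body))

# To send it (requires a running server; URL/key are placeholders):
# import requests
# resp = requests.post("http://localhost:8091/v1/embeddings",
#                      headers={"Authorization": "Bearer YOUR_KEY"}, json=body)
# vectors = [d["embedding"] for d in resp.json()["data"]]  # 1024-dim each
```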
11. Pip-Installable CLI
New lightweight CLI for easy installation and management:
```
pip install ezlocalai
ezlocalai start
```

Features:
- Auto-detects GPU (NVIDIA) or falls back to CPU mode
- Auto-installs prerequisites on Linux (Docker, NVIDIA Container Toolkit)
- Simple commands: `start`, `stop`, `restart`, `status`, `logs`
- Configurable: `--model`, `--uri`, `--api-key`, `--ngrok`, `--whisper`, `--img-model`
- Minimal dependencies: only `click` and `requests` (no heavy ML libraries)
```
# Examples
ezlocalai start --model unsloth/gemma-3-4b-it-GGUF
ezlocalai start --api-key my-secret-key --ngrok <token>
ezlocalai logs -f
```

Performance Benchmarks
Test System
- CPU: 12th Gen Intel Core i9-12900KS (24 cores)
- GPU: NVIDIA GeForce RTX 4090 (24GB VRAM)
- Models Tested:
  - `unsloth/Qwen3-VL-4B-Instruct-GGUF` - 4B vision model (Q4_K_XL quantization)
  - `unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF` - 30B MoE coding model (Q4_K_XL quantization)
LLM Inference Performance (RTX 4090)
| Model | Size | Tokens | Time | Speed |
|---|---|---|---|---|
| Qwen3-VL-4B | 4B | 100 | 0.47s | 213 tok/s |
| Qwen3-VL-4B | 4B | 134 | 0.64s | 209 tok/s |
| Qwen3-Coder-30B | 30B (MoE) | 100 | 1.53s | 65 tok/s |
| Qwen3-Coder-30B | 30B (MoE) | 300 | 4.41s | 68 tok/s |
Key Observations
- 4B Vision Model: ~210 tok/s average - excellent for interactive use
- 30B Coding Model: ~65-68 tok/s - great for code generation despite size
- Hot-swap time: ~1s between models (pre-calibrated)
- Dynamic context: 32k baseline, scales up as needed
GPU Layer Calibration
Auto-calibrated at startup for 32k context:
| Model | GPU Layers | VRAM Usage |
|---|---|---|
| Qwen3-VL-4B | 37 layers | ~12GB |
| Qwen3-Coder-30B | 45 layers | ~24GB |
Minimal Configuration
.env file (that's it!):
```
# ezlocalai Configuration - keeping it "ez"
# VRAM is auto-detected, context is dynamic, GPU layers are auto-calibrated
MAIN_GPU=0
DEFAULT_MODEL=unsloth/Qwen3-VL-4B-Instruct-GGUF,unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF
# Image generation (set to enable, leave empty to disable)
IMG_MODEL=
# Speech-to-text model
WHISPER_MODEL=base
# Queue settings
MAX_CONCURRENT_REQUESTS=2
```

API Compatibility
All API endpoints remain unchanged. This is a drop-in upgrade.
New CLI
A lightweight pip-installable CLI for managing ezlocalai Docker containers:
```
pip install ezlocalai
ezlocalai start
```

The CLI:
- Auto-detects GPU and uses CUDA or CPU mode appropriately
- Auto-installs prerequisites (Docker, NVIDIA Container Toolkit) on Linux
- Builds the CUDA image locally from source (too large for DockerHub)
- Persists data in `~/.ezlocalai/data/` (models, outputs, voices)
- No prompts - designed to be as "ez" as possible

Commands:
- `ezlocalai start [--model X]` - Start with optional model override
- `ezlocalai stop` - Stop the container
- `ezlocalai restart` - Restart with the same config
- `ezlocalai status` - Show running status and configuration
- `ezlocalai logs [-f]` - View container logs
- `ezlocalai update` - Pull latest CPU image / rebuild CUDA image
v0.1.14
What's Changed
- Add TENSOR_SPLIT env var; by @JamesonRGrieve in #46
- Add a ternary to fix single GPUs; by @JamesonRGrieve in #47
- Use global environment by @Josh-XT in #48
New Contributors
- @JamesonRGrieve made their first contribution in #46
Full Changelog: v0.1.13...v0.1.14