Releases: DevXT-LLC/ezlocalai

v1.0.8

26 Mar 13:56

ezLocalai v1.0.8

Video Generation (NEW)

  • LTX-2.3 GGUF Video Generation — Full text-to-video, image-to-video, and video-to-video generation powered by LTX-2.3 with GGUF quantization (Q4_K_M, Q8_0)
    • New POST /v1/videos/generations endpoint
    • Sequential CPU offload for low-VRAM GPUs (8GB+)
    • Multimodal outputs including synchronized audio generation (24kHz)
    • Multi-frame conditioning for video-to-video workflows
    • Text encoder strategies automatically selected based on available memory: GPU BNB 4-bit, CPU Quanto INT8, or CPU BF16
  • Configurable via VIDEO_MODEL environment variable (default: "none", set to "unsloth/LTX-2.3-GGUF" to enable)
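
As a minimal sketch, a request to the new endpoint might be built like this. The field names (`prompt`, `model`, `video`) are assumptions based on the OpenAI-style endpoints ezlocalai exposes; check the server's API docs for the authoritative schema.

```python
import json
import urllib.request

def build_video_request(prompt, model="unsloth/LTX-2.3-GGUF", input_video=None):
    # Assumed request shape for POST /v1/videos/generations; the
    # actual field names may differ.
    body = {"model": model, "prompt": prompt}
    if input_video is not None:
        body["video"] = input_video  # base64 video for video-to-video
    return body

def post_video_generation(base_url, api_key, body):
    # Sends the request; requires a running ezlocalai instance.
    req = urllib.request.Request(
        f"{base_url}/v1/videos/generations",
        data=json.dumps(body).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    return urllib.request.urlopen(req)
```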

Video Understanding (NEW)

  • Vision-language models can now process video inputs in chat completions
  • Scene-change detection automatically identifies key frames for efficient analysis
  • Audio track extraction and transcription via Whisper for combined visual + audio understanding
  • Supports video URLs and base64-encoded video inputs
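
A chat-completions message carrying a video input could look like the following. The `video_url` content part mirrors the `image_url` convention of OpenAI-style vision APIs; the exact key ezlocalai expects is an assumption here, so verify against the server's request schema.

```python
def build_video_chat_message(text, video_url):
    # One user message for /v1/chat/completions with a video part.
    # "video_url" is assumed by analogy with the "image_url" part type.
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "video_url", "video_url": {"url": video_url}},
        ],
    }
```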

Image Editing (NEW)

  • FLUX.2-klein-4B with GGUF quantization — Replaces previous image model with a unified architecture supporting both image generation and editing
    • New POST /v1/images/edits endpoint
    • 15 GGUF quantization options (Q2_K through F16)
    • Sequential CPU offload support matching video pipeline
  • Configurable via IMG_MODEL environment variable (default: "unsloth/FLUX.2-klein-4B-GGUF")

Speaker Diarization (NEW)

  • Automatic speaker identification in audio transcriptions using MFCC-based clustering
  • Persistent voice print storage with session-scoped UUIDs and 24-hour TTL
  • Auto-detects the number of speakers using an inconsistency coefficient
  • Cross-chunk speaker consistency — same physical speaker maintains the same ID across segments
  • SRT and WebVTT subtitle output with speaker labels
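
For illustration, one diarized SRT cue with a speaker label can be formatted like this; the `[SPEAKER_N]` label style is an assumption, not necessarily the exact format the transcription endpoint emits.

```python
def srt_entry(index, start, end, speaker, text):
    # Format one SRT cue with a speaker label prefixed to the text.
    def ts(seconds):
        # SRT timestamps use HH:MM:SS,mmm with a comma before millis.
        h, rem = divmod(int(seconds), 3600)
        m, s = divmod(rem, 60)
        ms = int(round((seconds - int(seconds)) * 1000))
        return f"{h:02}:{m:02}:{s:02},{ms:03}"
    return f"{index}\n{ts(start)} --> {ts(end)}\n[{speaker}] {text}\n"
```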

Distributed Inference & Server Roles (NEW)

  • Voice Server (VOICE_SERVER) — Offload TTS/STT to a dedicated ezlocalai instance
  • Image Server (IMAGE_SERVER) — Offload image and video generation to a dedicated instance
  • Text Server (TEXT_SERVER) — Offload LLM text completion to a dedicated instance
  • Each server role supports independent API keys
  • Lazy loading of voice models by default (LAZY_LOAD_VOICE=true)

Fallback Server Support (NEW)

  • Route requests to a remote server when local resources are exhausted
  • Supports both ezlocalai instances and OpenAI-compatible APIs as fallback targets
  • Configurable memory threshold (FALLBACK_MEMORY_THRESHOLD) for combined VRAM + RAM monitoring
  • Automatic queue-based fallback with configurable wait timeout (QUEUE_WAIT_TIMEOUT)
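
The threshold check can be sketched as below. This models FALLBACK_MEMORY_THRESHOLD as a fraction of combined VRAM + RAM in use; the real implementation may weigh the two pools differently, so treat this as illustrative only.

```python
def should_fall_back(vram_used, vram_total, ram_used, ram_total, threshold=0.9):
    # Route to the fallback server once combined memory pressure
    # (VRAM + RAM together) crosses the configured threshold.
    combined = (vram_used + ram_used) / (vram_total + ram_total)
    return combined >= threshold
```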

Request Queue System (NEW)

  • Async request queue with configurable concurrency limits (MAX_CONCURRENT_REQUESTS, MAX_QUEUE_SIZE)
  • Request lifecycle tracking: queued → processing → completed/failed
  • Cancellation support for fallback routing
  • Request history and metrics (total processed, failed, queued, active)
  • Configurable request timeout (REQUEST_TIMEOUT)
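
In spirit, the concurrency cap works like an async semaphore around each request. This is a minimal sketch of the MAX_CONCURRENT_REQUESTS behavior only; the real queue also tracks history, cancellation, and timeouts.

```python
import asyncio

class RequestQueue:
    # Minimal async queue: at most max_concurrent requests run at once.
    def __init__(self, max_concurrent=2):
        self.semaphore = asyncio.Semaphore(max_concurrent)
        self.processed = 0

    async def run(self, coro_fn, *args):
        async with self.semaphore:  # waits here when the cap is reached
            result = await coro_fn(*args)
            self.processed += 1
            return result

async def demo():
    q = RequestQueue(max_concurrent=2)

    async def work(x):
        await asyncio.sleep(0)  # stand-in for real inference work
        return x * 2

    return await asyncio.gather(*(q.run(work, i) for i in range(4)))
```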

LLM Improvements

  • Default model updated to Qwen3.5-4B-GGUF with new chat templates for Qwen3.5 4B, 2B, and 0.8B variants
  • Auto-reduce GPU layers on OOM instead of falling back entirely
  • Multi-GPU improvements — Dynamic tensor split calculation based on free VRAM across all GPUs
  • Parallel inference slots (N_PARALLEL) — Auto-scales targeting ~32K context per slot (max 16 slots)
  • Configurable KV cache type (QUANT_TYPE default: Q4_K_XL)
  • Configurable LLM_MAX_TOKENS (default: 40000) and REASONING_BUDGET for thinking tokens
  • Improved concurrency handling and logging

Platform Support

  • NVIDIA Jetson — New jetson.Dockerfile and docker-compose-jetson.yml with platform-specific import handling
  • Raspberry Pi 5 with AI HAT+ 2 — New rpi.Dockerfile, docker-compose-rpi.yml, and rpi-requirements.txt
  • Chatterbox TTS — Watermarker compatibility fix for perth module

Testing & CI

  • Expanded GitHub Actions test workflow with TTS and STT testing enabled
  • Tests use Qwen3.5-0.8B for faster CI runs
  • Improved test reliability with better concurrency handling

Docker & Dependencies

  • Updated to official xllamacpp v0.2.12 releases
  • Updated setuptools to 78.1.1
  • Updated CUDA and ROCm requirement files
  • All Docker Compose files updated with new environment variables for distributed inference

v1.0.7

24 Mar 16:22

What's Changed

  • Add speaker diarization support to audio transcriptions by @Copilot in #72

New Contributors

  • @Copilot made their first contribution in #72

Full Changelog: v1.0.6...v1.0.7

v1.0.6

21 Feb 14:05
2fc4f7c

What's Changed

Full Changelog: v1.0.5...v1.0.6

v1.0.5

02 Dec 16:10
15e4167

What's Changed

Full Changelog: v1.0.4...v1.0.5

v1.0.4

26 Nov 18:21

ezLocalai v1.0.4

Summary

This release represents a complete modernization of ezlocalai's AI stack, making it truly "ez" - simpler configuration, smarter defaults, and better performance. The result is a cleaner codebase with fewer dependencies, automatic optimization, and minimal required configuration.

Philosophy: VRAM is auto-detected, context is dynamic, GPU layers are auto-calibrated. Just set your models and go.

Key Changes

1. Dynamic Context Sizing

No more guessing context sizes! The system automatically sizes context to fit your prompt:

  • Estimates prompt tokens (chars / 4)
  • Rounds up to nearest 32k boundary (32k, 64k, 96k, 128k)
  • Recalibrates GPU layers if context increases
  • Result: Much faster inference on short prompts (128k→32k = 63% speedup)
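
The sizing rule above amounts to a small pure function. This sketch assumes "32k" means 32,000 tokens and omits any cap at the model's maximum context, both of which the actual implementation may handle differently.

```python
def dynamic_context(prompt, step=32_000):
    # Estimate tokens as chars / 4, then round up to the next 32k
    # boundary (32k, 64k, 96k, ...), with 32k as the minimum.
    est_tokens = len(prompt) // 4
    steps = max(1, -(-est_tokens // step))  # ceiling division
    return steps * step
```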

2. Auto VRAM Detection

VRAM budget is now detected automatically at startup:

  • Queries torch.cuda.get_device_properties()
  • Rounds down to nearest GB for safety margin
  • No more manual VRAM_BUDGET configuration
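
The rounding step is simple enough to show directly. In the real code the byte total comes from `torch.cuda.get_device_properties()`; here it is a plain argument so the rounding logic stands alone.

```python
def vram_budget_gb(total_bytes):
    # Round detected VRAM down to whole GiB for a safety margin.
    return total_bytes // (1024 ** 3)
```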

3. Simplified Environment Variables

Removed env vars (now automatic):

Variable            Reason
  • GPU_LAYERS: auto-calibrated based on VRAM
  • VRAM_BUDGET: auto-detected from GPU
  • LLM_MAX_TOKENS: dynamic based on prompt size
  • IMG_ENABLED: detected from IMG_MODEL being set
  • EMBEDDING_ENABLED: always available
  • IMG_DEVICE: auto-detects CUDA availability

Renamed:

  • SD_MODEL → IMG_MODEL (clearer naming)

Simplified DEFAULT_MODEL format:

# Old: model@tokens,model@tokens
DEFAULT_MODEL=model1@64000,model2@8192

# New: just list models (context is dynamic)
DEFAULT_MODEL=model1,model2
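
A parser that accepts both forms might simply drop the legacy `@tokens` suffix, as in this sketch (how ezlocalai itself handles leftover old-format values is not stated here).

```python
def parse_default_model(value):
    # Split the comma-separated list and strip any legacy "@tokens"
    # suffix, since context is now sized dynamically.
    models = []
    for entry in value.split(","):
        entry = entry.strip()
        if entry:
            models.append(entry.split("@", 1)[0])
    return models
```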

4. Vision Model Fallback

When a non-vision model receives an image request:

  1. System detects images in request
  2. Finds a vision-capable model from available models
  3. Uses vision model to describe images
  4. Prepends description to prompt
  5. Processes with requested model

This allows coding models (non-vision) to receive image context!
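
The five steps above can be sketched as a single function. `describe_image` and `model` are stand-ins for the vision-model call and the requested model's completion call; the actual control flow in ezlocalai is more involved.

```python
def complete_with_vision_fallback(prompt, images, model, describe_image):
    # If the request carries images and the target model lacks vision,
    # describe each image and prepend the descriptions to the prompt.
    if images:
        descriptions = [describe_image(img) for img in images]
        prompt = "\n".join(descriptions) + "\n" + prompt
    return model(prompt)
```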

5. LLM Engine: llama-cpp-python → xllamacpp

Replaced llama-cpp-python with xllamacpp:

  • Unified LLM + Vision via multimodal projector (mmproj)
  • Native estimate_gpu_layers() function for fast calibration
  • Cleaner API through xllamacpp.Server

6. Multi-Model Hot-Swap

Support for multiple LLMs with automatic hot-swap:

DEFAULT_MODEL=unsloth/Qwen3-VL-4B-Instruct-GGUF,unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF
  • First model loads at startup with baseline 32k context
  • Models swap on demand based on request
  • Context-aware: recalibrates GPU layers when context size changes

7. Pre-Calibration at Startup

All models in DEFAULT_MODEL are pre-calibrated at startup:

  • Uses xllamacpp's native estimate_gpu_layers()
  • Calibrates at 32k baseline context
  • Caches results for instant swaps
  • Recalibrates on-demand for larger contexts
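
The caching behavior reduces to memoizing per (model, context) pair, as in this sketch. `estimate_layers` stands in for xllamacpp's `estimate_gpu_layers()`; the cache key and eviction policy here are assumptions.

```python
class CalibrationCache:
    # Compute GPU layers once per (model, context) pair and reuse the
    # result so hot-swaps are instant; recalibrate only on a cache miss.
    def __init__(self, estimate_layers):
        self.estimate_layers = estimate_layers
        self.cache = {}

    def layers_for(self, model, context=32_000):
        key = (model, context)
        if key not in self.cache:
            self.cache[key] = self.estimate_layers(model, context)
        return self.cache[key]
```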

8. Text-to-Speech: XTTS → Chatterbox TTS

Replaced XTTS with Chatterbox TTS:

  • Modern Llama 3.2 backbone (0.5B params)
  • Better voice cloning quality
  • Preloaded at startup to warm cache (~38s → ~5s on reload)
  • Lazy-loaded on demand, unloaded after to free VRAM

9. Image Generation: SDXL-Turbo → SDXL-Lightning

Migrated to ByteDance/SDXL-Lightning:

  • Better image quality at low step counts
  • GPU: 2-step generation (blazing fast)
  • CPU: 4-step generation (stable)
  • Auto-detects CUDA for device selection

10. Embeddings: ONNX → BGE-M3

Replaced ONNX embedding with native BAAI/bge-m3:

  • Removed ONNX/Optimum dependencies
  • 1024-dimensional embeddings
  • Native PyTorch GPU acceleration

11. Pip-Installable CLI

New lightweight CLI for easy installation and management:

pip install ezlocalai
ezlocalai start

Features:

  • Auto-detects GPU (NVIDIA) or falls back to CPU mode
  • Auto-installs prerequisites on Linux (Docker, NVIDIA Container Toolkit)
  • Simple commands: start, stop, restart, status, logs
  • Configurable: --model, --uri, --api-key, --ngrok, --whisper, --img-model
  • Minimal dependencies: Only click and requests (no heavy ML libraries)

# Examples
ezlocalai start --model unsloth/gemma-3-4b-it-GGUF
ezlocalai start --api-key my-secret-key --ngrok <token>
ezlocalai logs -f

Performance Benchmarks

Test System

  • CPU: 12th Gen Intel Core i9-12900KS (24 cores)
  • GPU: NVIDIA GeForce RTX 4090 (24GB VRAM)
  • Models Tested:
    • unsloth/Qwen3-VL-4B-Instruct-GGUF - 4B vision model (Q4_K_XL quantization)
    • unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF - 30B MoE coding model (Q4_K_XL quantization)

LLM Inference Performance (RTX 4090)

Model           | Size      | Tokens | Time  | Speed
Qwen3-VL-4B     | 4B        | 100    | 0.47s | 213 tok/s
Qwen3-VL-4B     | 4B        | 134    | 0.64s | 209 tok/s
Qwen3-Coder-30B | 30B (MoE) | 100    | 1.53s | 65 tok/s
Qwen3-Coder-30B | 30B (MoE) | 300    | 4.41s | 68 tok/s

Key Observations

  • 4B Vision Model: ~210 tok/s average - excellent for interactive use
  • 30B Coding Model: ~65-68 tok/s - great for code generation despite size
  • Hot-swap time: ~1s between models (pre-calibrated)
  • Dynamic context: 32k baseline, scales up as needed

GPU Layer Calibration

Auto-calibrated at startup for 32k context:

Model           | GPU Layers | VRAM Usage
Qwen3-VL-4B     | 37 layers  | ~12GB
Qwen3-Coder-30B | 45 layers  | ~24GB

Minimal Configuration

.env file (that's it!):

# ezlocalai Configuration - keeping it "ez"
# VRAM is auto-detected, context is dynamic, GPU layers are auto-calibrated

MAIN_GPU=0
DEFAULT_MODEL=unsloth/Qwen3-VL-4B-Instruct-GGUF,unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF

# Image generation (set to enable, leave empty to disable)
IMG_MODEL=

# Speech-to-text model
WHISPER_MODEL=base

# Queue settings
MAX_CONCURRENT_REQUESTS=2

API Compatibility

All API endpoints remain unchanged. This is a drop-in upgrade.

New CLI

A lightweight pip-installable CLI for managing ezlocalai Docker containers:

pip install ezlocalai
ezlocalai start

The CLI:

  • Auto-detects GPU and uses CUDA or CPU mode appropriately
  • Auto-installs prerequisites (Docker, NVIDIA Container Toolkit) on Linux
  • Builds CUDA image locally from source (too large for DockerHub)
  • Persists data in ~/.ezlocalai/data/ (models, outputs, voices)
  • No prompts - designed to be as "ez" as possible

Commands:

  • ezlocalai start [--model X] - Start with optional model override
  • ezlocalai stop - Stop the container
  • ezlocalai restart - Restart with same config
  • ezlocalai status - Show running status and configuration
  • ezlocalai logs [-f] - View container logs
  • ezlocalai update - Pull latest CPU image / rebuild CUDA image

v1.0.3

26 Nov 15:36

ezLocalai v1.0.3

v1.0.2

26 Nov 14:44

ezLocalai v1.0.2

v1.0.1

26 Nov 14:39

ezLocalai v1.0.1

v1.0.0

26 Nov 05:33
229de8a

Choose a tag to compare

ezLocalai v1.0.0

Summary

This release represents a complete modernization of ezlocalai's AI stack, making it truly "ez" - simpler configuration, smarter defaults, and better performance. The result is a cleaner codebase with fewer dependencies, automatic optimization, and minimal required configuration.

Philosophy: VRAM is auto-detected, context is dynamic, GPU layers are auto-calibrated. Just set your models and go.

Key Changes

1. Dynamic Context Sizing

No more guessing context sizes! The system automatically sizes context to fit your prompt:

  • Estimates prompt tokens (chars / 4)
  • Rounds up to nearest 32k boundary (32k, 64k, 96k, 128k)
  • Recalibrates GPU layers if context increases
  • Result: Much faster inference on short prompts (128k→32k = 63% speedup)

2. Auto VRAM Detection

VRAM budget is now detected automatically at startup:

  • Queries torch.cuda.get_device_properties()
  • Rounds down to nearest GB for safety margin
  • No more manual VRAM_BUDGET configuration

3. Simplified Environment Variables

Removed env vars (now automatic):

Variable            Reason
GPU_LAYERS          Auto-calibrated based on VRAM
VRAM_BUDGET         Auto-detected from GPU
LLM_MAX_TOKENS      Dynamic based on prompt size
IMG_ENABLED         Detected from IMG_MODEL being set
EMBEDDING_ENABLED   Always available
IMG_DEVICE          Auto-detects CUDA availability

Renamed:

Old        New        Reason
SD_MODEL   IMG_MODEL  Clearer naming

Simplified DEFAULT_MODEL format:

# Old: model@tokens,model@tokens
DEFAULT_MODEL=model1@64000,model2@8192

# New: just list models (context is dynamic)
DEFAULT_MODEL=model1,model2
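
A migration helper for old-format values could simply drop the @tokens suffix, since context is now dynamic. This is a hypothetical illustration, not part of ezlocalai:

```python
# Sketch: convert an old-style DEFAULT_MODEL value ("model@tokens,...") to
# the new style ("model,..."). Hypothetical helper, for illustration only.

def migrate_default_model(value: str) -> str:
    models = [entry.split("@", 1)[0] for entry in value.split(",") if entry]
    return ",".join(models)

print(migrate_default_model("model1@64000,model2@8192"))  # -> model1,model2
```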

4. Vision Model Fallback

When a non-vision model receives an image request:

  1. System detects images in request
  2. Finds a vision-capable model from available models
  3. Uses vision model to describe images
  4. Prepends description to prompt
  5. Processes with requested model

This allows coding models (non-vision) to receive image context!

5. LLM Engine: llama-cpp-python → xllamacpp

Replaced llama-cpp-python with xllamacpp:

  • Unified LLM + Vision via multimodal projector (mmproj)
  • Native estimate_gpu_layers() function for fast calibration
  • Cleaner API through xllamacpp.Server

6. Multi-Model Hot-Swap

Support for multiple LLMs with automatic hot-swap:

DEFAULT_MODEL=unsloth/Qwen3-VL-4B-Instruct-GGUF,unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF

  • First model loads at startup with baseline 32k context
  • Models swap on demand based on request
  • Context-aware: recalibrates GPU layers when context size changes

7. Pre-Calibration at Startup

All models in DEFAULT_MODEL are pre-calibrated at startup:

  • Uses xllamacpp's native estimate_gpu_layers()
  • Calibrates at 32k baseline context
  • Caches results for instant swaps
  • Recalibrates on-demand for larger contexts
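
The pre-calibration and caching behavior can be sketched as follows. The estimator callable stands in for xllamacpp's native estimate_gpu_layers(); the class and all names are hypothetical, not ezlocalai's actual code.

```python
# Sketch of startup pre-calibration with on-demand recalibration for larger
# contexts. Illustrative only; the real estimator is xllamacpp's.

BASELINE_CONTEXT = 32_000

class LayerCache:
    def __init__(self, estimate_layers):
        self._estimate = estimate_layers
        self._cache = {}  # (model, context) -> GPU layer count

    def precalibrate(self, models):
        """Calibrate every configured model at the 32k baseline at startup."""
        for model in models:
            self.layers_for(model, BASELINE_CONTEXT)

    def layers_for(self, model, context):
        """Return cached GPU layers, recalibrating for unseen contexts."""
        key = (model, context)
        if key not in self._cache:
            self._cache[key] = self._estimate(model, context)
        return self._cache[key]

# Stub estimator: larger contexts leave VRAM for fewer GPU layers.
cache = LayerCache(lambda model, ctx: 48 - ctx // 32_000)
cache.precalibrate(["vl-4b", "coder-30b"])
print(cache.layers_for("vl-4b", 32_000))   # cached at startup -> 47
print(cache.layers_for("vl-4b", 128_000))  # larger context recalibrates -> 44
```

Caching the baseline result at startup is what makes hot-swaps near-instant; only a context increase pays the recalibration cost.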

8. Text-to-Speech: XTTS → Chatterbox TTS

Replaced XTTS with Chatterbox TTS:

  • Modern Llama 3.2 backbone (0.5B params)
  • Better voice cloning quality
  • Preloaded at startup to warm cache (~38s → ~5s on reload)
  • Lazy-loaded on demand, unloaded afterward to free VRAM

9. Image Generation: SDXL-Turbo → SDXL-Lightning

Migrated to ByteDance/SDXL-Lightning:

  • Better image quality at low step counts
  • GPU: 2-step generation (blazing fast)
  • CPU: 4-step generation (stable)
  • Auto-detects CUDA for device selection

10. Embeddings: ONNX → BGE-M3

Replaced ONNX embedding with native BAAI/bge-m3:

  • Removed ONNX/Optimum dependencies
  • 1024-dimensional embeddings
  • Native PyTorch GPU acceleration

11. Pip-Installable CLI

New lightweight CLI for easy installation and management:

pip install ezlocalai
ezlocalai start

Features:

  • Auto-detects GPU (NVIDIA) or falls back to CPU mode
  • Auto-installs prerequisites on Linux (Docker, NVIDIA Container Toolkit)
  • Simple commands: start, stop, restart, status, logs
  • Configurable: --model, --uri, --api-key, --ngrok, --whisper, --img-model
  • Minimal dependencies: Only click and requests (no heavy ML libraries)

# Examples
ezlocalai start --model unsloth/gemma-3-4b-it-GGUF
ezlocalai start --api-key my-secret-key --ngrok <token>
ezlocalai logs -f

Performance Benchmarks

Test System

  • CPU: 12th Gen Intel Core i9-12900KS (24 cores)
  • GPU: NVIDIA GeForce RTX 4090 (24GB VRAM)
  • Models Tested:
    • unsloth/Qwen3-VL-4B-Instruct-GGUF - 4B vision model (Q4_K_XL quantization)
    • unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF - 30B MoE coding model (Q4_K_XL quantization)

LLM Inference Performance (RTX 4090)

Model            Size       Tokens  Time   Speed
Qwen3-VL-4B      4B         100     0.47s  213 tok/s
Qwen3-VL-4B      4B         134     0.64s  209 tok/s
Qwen3-Coder-30B  30B (MoE)  100     1.53s  65 tok/s
Qwen3-Coder-30B  30B (MoE)  300     4.41s  68 tok/s

Key Observations

  • 4B Vision Model: ~210 tok/s average - excellent for interactive use
  • 30B Coding Model: ~65-68 tok/s - great for code generation despite size
  • Hot-swap time: ~1s between models (pre-calibrated)
  • Dynamic context: 32k baseline, scales up as needed

GPU Layer Calibration

Auto-calibrated at startup for 32k context:

Model            GPU Layers  VRAM Usage
Qwen3-VL-4B      37          ~12GB
Qwen3-Coder-30B  45          ~24GB

Minimal Configuration

.env file (that's it!):

# ezlocalai Configuration - keeping it "ez"
# VRAM is auto-detected, context is dynamic, GPU layers are auto-calibrated

MAIN_GPU=0
DEFAULT_MODEL=unsloth/Qwen3-VL-4B-Instruct-GGUF,unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF

# Image generation (set to enable, leave empty to disable)
IMG_MODEL=

# Speech-to-text model
WHISPER_MODEL=base

# Queue settings
MAX_CONCURRENT_REQUESTS=2

API Compatibility

All API endpoints remain unchanged. This is a drop-in upgrade.

New CLI

A lightweight pip-installable CLI for managing ezlocalai Docker containers:

pip install ezlocalai
ezlocalai start

The CLI:

  • Auto-detects GPU and uses CUDA or CPU mode appropriately
  • Auto-installs prerequisites (Docker, NVIDIA Container Toolkit) on Linux
  • Builds the CUDA image locally from source (the image is too large to host on DockerHub)
  • Persists data in ~/.ezlocalai/data/ (models, outputs, voices)
  • No prompts - designed to be as "ez" as possible

Commands:

  • ezlocalai start [--model X] - Start with optional model override
  • ezlocalai stop - Stop the container
  • ezlocalai restart - Restart with same config
  • ezlocalai status - Show running status and configuration
  • ezlocalai logs [-f] - View container logs
  • ezlocalai update - Pull latest CPU image / rebuild CUDA image

v0.1.14

10 Sep 19:14
cf53709

Choose a tag to compare

What's Changed

New Contributors

Full Changelog: v0.1.13...v0.1.14