feat: INT4/INT8 quantization + expert offloading for consumer hardware by oyi77 · Pull Request #74 · kyegomez/OpenMythos

oyi77 · 2026-05-20T03:31:57Z

Summary

Enables running OpenMythos MoE models on consumer hardware (RTX 3060 12GB) through INT4/INT8 weight quantization and GPU↔CPU↔NVMe expert offloading.

Changes

open_mythos/quantization.py (388 lines)

QuantizedLinear: Memory-efficient quantized linear layer with INT4/INT8 support
INT4 packing: two 4-bit values per byte (4x compression relative to 16-bit weights, before per-group scale overhead)
Group-wise scaling (configurable group_size)
quantize_model(): Model-level quantization (MoE experts only by default)
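The packing and group-wise scaling described above can be sketched in plain Python. This is a minimal illustration, not the code in open_mythos/quantization.py: the function names (`quantize_group_int4`, `pack_int4`, `unpack_int4`, `quantize_int4`, `dequantize_int4`), the symmetric scheme with a +8 offset, and the low-nibble-first byte layout are all assumptions made for the example.

```python
from typing import List, Tuple


def quantize_group_int4(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric INT4 quantization of one group (hypothetical helper).

    Returns unsigned 4-bit codes in [0, 15] plus the group's scale.
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    # Signed levels lie in [-7, 7]; add 8 so each code fits an unsigned nibble.
    codes = [min(15, max(0, round(v / scale) + 8)) for v in values]
    return codes, scale


def pack_int4(codes: List[int]) -> bytes:
    """Pack pairs of 4-bit codes into single bytes, low nibble first."""
    if len(codes) % 2:
        raise ValueError("need an even number of 4-bit codes")
    return bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))


def unpack_int4(packed: bytes) -> List[int]:
    """Inverse of pack_int4: split each byte back into two 4-bit codes."""
    codes: List[int] = []
    for byte in packed:
        codes.append(byte & 0x0F)
        codes.append(byte >> 4)
    return codes


def quantize_int4(weights: List[float], group_size: int) -> Tuple[bytes, List[float]]:
    """Quantize a flat weight list with one scale per group of group_size values."""
    if group_size <= 0 or group_size % 2 or len(weights) % group_size:
        raise ValueError("group_size must be positive, even, and divide len(weights)")
    packed = bytearray()
    scales: List[float] = []
    for start in range(0, len(weights), group_size):
        codes, scale = quantize_group_int4(weights[start:start + group_size])
        packed += pack_int4(codes)
        scales.append(scale)
    return bytes(packed), scales


def dequantize_int4(packed: bytes, scales: List[float], group_size: int) -> List[float]:
    """Recover approximate weights: remove the +8 offset, then apply the group scale."""
    codes = unpack_int4(packed)
    return [(code - 8) * scales[i // group_size] for i, code in enumerate(codes)]
```

In this sketch each weight costs half a byte plus its share of one scale per group, so a larger `group_size` lowers the scale overhead at the cost of a coarser fit within each group.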

open_mythos/expert_offloader.py (330 lines)

ExpertOffloader: LRU-based expert caching across memory hierarchy (GPU→CPU→NVMe)
Automatic expert loading on-demand during inference
Statistics tracking (hit rates, evictions)
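The LRU behaviour across tiers can be sketched with two `OrderedDict` instances standing in for the GPU and CPU tiers and a loader callback standing in for NVMe. The class `TwoTierLRU`, its parameters, and the statistic names are hypothetical and chosen for illustration; the actual `ExpertOffloader` may be organised differently.

```python
from collections import OrderedDict
from typing import Any, Callable, Dict


class TwoTierLRU:
    """Toy expert cache: a small fast tier backed by a larger slow tier.

    Hypothetical sketch; load_from_disk stands in for reading an expert from NVMe.
    """

    def __init__(self, fast_capacity: int, slow_capacity: int,
                 load_from_disk: Callable[[int], Any]) -> None:
        if fast_capacity < 1 or slow_capacity < 1:
            raise ValueError("capacities must be at least 1")
        self.fast_capacity = fast_capacity
        self.slow_capacity = slow_capacity
        self.load_from_disk = load_from_disk
        self.fast: "OrderedDict[int, Any]" = OrderedDict()  # e.g. GPU-resident experts
        self.slow: "OrderedDict[int, Any]" = OrderedDict()  # e.g. CPU-resident experts
        self.stats: Dict[str, int] = {
            "fast_hits": 0, "slow_hits": 0, "disk_loads": 0, "evictions": 0,
        }

    def get(self, expert_id: int) -> Any:
        """Return the expert, promoting it to the fast tier on a miss."""
        if expert_id in self.fast:
            self.fast.move_to_end(expert_id)  # mark as most recently used
            self.stats["fast_hits"] += 1
            return self.fast[expert_id]
        if expert_id in self.slow:
            expert = self.slow.pop(expert_id)
            self.stats["slow_hits"] += 1
        else:
            expert = self.load_from_disk(expert_id)
            self.stats["disk_loads"] += 1
        self._insert_fast(expert_id, expert)
        return expert

    def _insert_fast(self, expert_id: int, expert: Any) -> None:
        """Insert into the fast tier, demoting the least recently used entry if full."""
        self.fast[expert_id] = expert
        if len(self.fast) > self.fast_capacity:
            old_id, old_expert = self.fast.popitem(last=False)
            self.stats["evictions"] += 1
            self.slow[old_id] = old_expert
            if len(self.slow) > self.slow_capacity:
                # Dropped from memory entirely; still recoverable via load_from_disk.
                self.slow.popitem(last=False)
```

A call such as `TwoTierLRU(2, 2, loader).get(expert_id)` then serves repeated requests from the fast tier and only falls back to the loader when the expert is in neither tier, which is the pattern hit-rate and eviction counters are meant to expose.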

Supporting files

examples/quantized_inference.py: Demo script
tests/test_quantization.py: Unit tests

Usage

from open_mythos import OpenMythos, mythos_1b
from open_mythos.quantization import quantize_model
from open_mythos.expert_offloader import ExpertOffloader

model = OpenMythos(mythos_1b())
model = quantize_model(model, bits=4, group_size=128)

offloader = ExpertOffloader(model, gpu_experts=4, cache_experts=16)
offloader.prepare()

All existing functionality preserved. Quantization is opt-in.

…ardware

- open_mythos/quantization.py: INT4/INT8 weight quantization with group-wise scaling
  - QuantizedLinear: Memory-efficient quantized linear layer (4x compression)
  - quantize_model(): Model-level quantization (MoE experts only by default)
  - Supports INT4 packing (two 4-bit values per byte)
- open_mythos/expert_offloader.py: GPU/CPU/NVMe expert management
  - ExpertOffloader: LRU-based expert caching across memory hierarchy
  - Automatic expert loading on-demand during inference
  - Statistics tracking (hit rates, evictions)
- examples/quantized_inference.py: Demo script for consumer hardware
- tests/test_quantization.py: Unit tests for both modules

Enables:
- mythos_1b on 8GB VRAM (RTX 3060)
- mythos_3b on 12GB VRAM with expert offloading
- mythos_500b/1t with aggressive offloading (GPU + CPU + NVMe)

Co-authored-by: BerkahKarya <coder@berkahkarya.com>

quantization.py:
- Replace assert with proper ValueError/TypeError exceptions
- Add logging for quantization progress tracking
- Add __repr__ to QuantizedLinear for debugging
- Extract _dequantize_weight() method (cleaner forward pass)
- Remove unused math import
- Fix duplicate docstring in quantize_moe_experts
- Add input validation to quantize_model()

expert_offloader.py:
- Fix bug: expert.state_dict → expert.state_dict() (missing parentheses)
- Add bounds checking for expert_id access
- Add proper KeyError/IndexError/AttributeError for invalid access
- Add __repr__ to ExpertOffloader for debugging
- Add input validation for layer_name existence

All changes maintain backward compatibility.
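The move from assert to explicit exceptions matters because Python drops assert statements when run with the -O flag, so assert-based checks can silently disappear. A minimal sketch of that style of validation follows; the helper name `validate_quant_args` and the exact rules (bits limited to 4 or 8, a positive integer group_size) are assumptions based on the PR description, not the code actually added to quantize_model().

```python
def validate_quant_args(bits: int, group_size: int) -> None:
    """Hypothetical argument check in the style the commit describes.

    Raises TypeError for wrong types and ValueError for unsupported values,
    so the checks still run under `python -O`, unlike assert statements.
    """
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(bits, bool) or not isinstance(bits, int):
        raise TypeError(f"bits must be an int, got {type(bits).__name__}")
    if bits not in (4, 8):
        raise ValueError(f"bits must be 4 or 8, got {bits}")
    if isinstance(group_size, bool) or not isinstance(group_size, int):
        raise TypeError(f"group_size must be an int, got {type(group_size).__name__}")
    if group_size <= 0:
        raise ValueError(f"group_size must be positive, got {group_size}")
```

Callers can then catch `ValueError` or `TypeError` specifically instead of a generic `AssertionError`.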

oyi77 and others added 2 commits May 20, 2026 10:23

oyi77 wants to merge 2 commits into kyegomez:main from oyi77:feature/int4-quantization
