
feat: INT4/INT8 quantization + expert offloading for consumer hardware #74

Open
oyi77 wants to merge 2 commits into kyegomez:main from oyi77:feature/int4-quantization


oyi77 commented May 20, 2026


Summary

Enables running OpenMythos MoE models on consumer hardware (RTX 3060 12GB) through INT4/INT8 weight quantization and GPU↔CPU↔NVMe expert offloading.

Changes

open_mythos/quantization.py (388 lines)

  • QuantizedLinear: Memory-efficient quantized linear layer with INT4/INT8 support
  • INT4 packing: two 4-bit values per byte (4x compression relative to 16-bit weights, before per-group scale overhead)
  • Group-wise scaling (configurable group_size)
  • quantize_model(): Model-level quantization (MoE experts only by default)
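
The packing and group-wise scaling described in the bullets above can be illustrated with a minimal, standard-library-only sketch. It is not the code in open_mythos/quantization.py: the function names (quantize_group, pack_int4, unpack_int4, quantize, dequantize), the symmetric [-8, 7] code range, and the low-nibble-first byte layout are all assumptions chosen for illustration, and the PR itself presumably operates on tensors rather than Python lists.

```python
from typing import List, Tuple


def quantize_group(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric INT4 quantization of one group (assumed scheme).

    Returns integer codes clamped to [-8, 7] and the group's scale.
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return codes, scale


def pack_int4(codes: List[int]) -> bytes:
    """Pack signed 4-bit codes two per byte, low nibble first."""
    if len(codes) % 2 != 0:
        raise ValueError("need an even number of codes to pack")
    out = bytearray()
    for lo, hi in zip(codes[0::2], codes[1::2]):
        # Masking with 0x0F keeps the two's-complement nibble of each code.
        out.append((lo & 0x0F) | ((hi & 0x0F) << 4))
    return bytes(out)


def unpack_int4(packed: bytes) -> List[int]:
    """Inverse of pack_int4: recover signed codes in [-8, 7]."""
    codes = []
    for byte in packed:
        for nibble in (byte & 0x0F, byte >> 4):
            codes.append(nibble - 16 if nibble >= 8 else nibble)
    return codes


def quantize(weights: List[float], group_size: int) -> Tuple[bytes, List[float]]:
    """Quantize a flat weight list with one scale per group of group_size."""
    if group_size <= 0 or group_size % 2 != 0:
        raise ValueError("group_size must be a positive even integer")
    if len(weights) % group_size != 0:
        raise ValueError("len(weights) must be a multiple of group_size")
    packed = bytearray()
    scales = []
    for start in range(0, len(weights), group_size):
        codes, scale = quantize_group(weights[start:start + group_size])
        packed += pack_int4(codes)
        scales.append(scale)
    return bytes(packed), scales


def dequantize(packed: bytes, scales: List[float], group_size: int) -> List[float]:
    """Rebuild approximate weights: code times the scale of its group."""
    codes = unpack_int4(packed)
    return [c * scales[i // group_size] for i, c in enumerate(codes)]


weights = [0.7, -0.7, 0.1, 0.0, 3.5, -4.0, 1.0, 2.0]
packed, scales = quantize(weights, group_size=4)
print(len(packed))  # 4 bytes hold the 8 quantized weights
```

In this sketch each reconstructed weight differs from the original by at most half of its group's scale, which is why a smaller group_size tends to reduce error at the cost of storing more scales.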

open_mythos/expert_offloader.py (330 lines)

  • ExpertOffloader: LRU-based expert caching across memory hierarchy (GPU→CPU→NVMe)
  • Automatic expert loading on-demand during inference
  • Statistics tracking (hit rates, evictions)
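
As a rough illustration of the LRU caching idea above, the following sketch models a two-tier cache backed by a slower loader. It is not the ExpertOffloader implementation: the class name TwoTierLRU, its parameters (fast_slots, slow_slots, load_from_disk), and the statistics keys are hypothetical, with the fast tier standing in for GPU memory, the slow tier for CPU memory, and the loader for NVMe.

```python
from collections import OrderedDict
from typing import Any, Callable, Dict


class TwoTierLRU:
    """Hypothetical two-tier LRU cache (fast tier ~ GPU, slow tier ~ CPU)."""

    def __init__(self, fast_slots: int, slow_slots: int,
                 load_from_disk: Callable[[int], Any]) -> None:
        if fast_slots < 1 or slow_slots < 0:
            raise ValueError("fast_slots must be >= 1 and slow_slots >= 0")
        self.fast_slots = fast_slots
        self.slow_slots = slow_slots
        self.load_from_disk = load_from_disk
        self.fast: "OrderedDict[int, Any]" = OrderedDict()
        self.slow: "OrderedDict[int, Any]" = OrderedDict()
        self.stats: Dict[str, int] = {
            "fast_hits": 0, "slow_hits": 0, "disk_loads": 0, "evictions": 0,
        }

    def get(self, expert_id: int) -> Any:
        """Return an expert, promoting it to the fast tier on a miss."""
        if expert_id in self.fast:
            self.fast.move_to_end(expert_id)  # mark as most recently used
            self.stats["fast_hits"] += 1
            return self.fast[expert_id]
        if expert_id in self.slow:
            value = self.slow.pop(expert_id)
            self.stats["slow_hits"] += 1
        else:
            value = self.load_from_disk(expert_id)
            self.stats["disk_loads"] += 1
        self._promote(expert_id, value)
        return value

    def _promote(self, expert_id: int, value: Any) -> None:
        """Insert into the fast tier, demoting the least recently used entry."""
        self.fast[expert_id] = value
        if len(self.fast) > self.fast_slots:
            old_id, old_value = self.fast.popitem(last=False)
            self.stats["evictions"] += 1
            self.slow[old_id] = old_value
            if len(self.slow) > self.slow_slots:
                # Dropped from memory; it can be reloaded from disk later.
                self.slow.popitem(last=False)


cache = TwoTierLRU(fast_slots=2, slow_slots=2,
                   load_from_disk=lambda i: f"expert-{i}")
for expert_id in (0, 1, 0, 2, 1):
    cache.get(expert_id)
print(cache.stats)
```

A real offloader would also move tensors between devices on promotion and demotion; the sketch only tracks which tier holds each expert.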

Supporting files

  • examples/quantized_inference.py: Demo script
  • tests/test_quantization.py: Unit tests

Usage

from open_mythos import OpenMythos, mythos_1b
from open_mythos.quantization import quantize_model
from open_mythos.expert_offloader import ExpertOffloader

model = OpenMythos(mythos_1b())
model = quantize_model(model, bits=4, group_size=128)

offloader = ExpertOffloader(model, gpu_experts=4, cache_experts=16)
offloader.prepare()

All existing functionality preserved. Quantization is opt-in.

oyi77 and others added 2 commits May 20, 2026 10:23
…ardware

- open_mythos/quantization.py: INT4/INT8 weight quantization with group-wise scaling
  - QuantizedLinear: Memory-efficient quantized linear layer (4x compression)
  - quantize_model(): Model-level quantization (MoE experts only by default)
  - Supports INT4 packing (two 4-bit values per byte)

- open_mythos/expert_offloader.py: GPU/CPU/NVMe expert management
  - ExpertOffloader: LRU-based expert caching across memory hierarchy
  - Automatic expert loading on-demand during inference
  - Statistics tracking (hit rates, evictions)

- examples/quantized_inference.py: Demo script for consumer hardware
- tests/test_quantization.py: Unit tests for both modules

Enables:
- mythos_1b on 8GB VRAM (RTX 3060)
- mythos_3b on 12GB VRAM with expert offloading
- mythos_500b/1t with aggressive offloading (GPU + CPU + NVMe)
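
Whether a given model fits in a given amount of VRAM depends on weight storage plus the per-group scales. The following back-of-the-envelope sketch estimates that storage; the function names, the 16-bit scale width, and the one-billion-weight figure are illustrative assumptions, not measurements of any mythos configuration, and activations, the KV cache, and unquantized layers are ignored.

```python
def quantized_bytes(num_weights: int, bits: int = 4,
                    group_size: int = 128, scale_bits: int = 16) -> int:
    """Bytes for packed weights plus one scale per group (assumed layout)."""
    num_groups = -(-num_weights // group_size)  # ceiling division
    total_bits = num_weights * bits + num_groups * scale_bits
    return -(-total_bits // 8)


def compression_vs_fp16(num_weights: int, bits: int = 4,
                        group_size: int = 128) -> float:
    """Ratio of FP16 storage to quantized storage for the same weights."""
    fp16_bytes = num_weights * 2
    return fp16_bytes / quantized_bytes(num_weights, bits, group_size)


hypothetical_weights = 1_000_000_000  # illustrative count, not a real model size
print(quantized_bytes(hypothetical_weights))             # 515625000
print(round(compression_vs_fp16(hypothetical_weights), 2))  # 3.88
```

Under these assumptions INT4 with group_size=128 costs 4.125 bits per weight, so the effective ratio against FP16 is slightly below 4x.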

Co-authored-by: BerkahKarya <coder@berkahkarya.com>

quantization.py:
- Replace assert with proper ValueError/TypeError exceptions
- Add logging for quantization progress tracking
- Add __repr__ to QuantizedLinear for debugging
- Extract _dequantize_weight() method (cleaner forward pass)
- Remove unused math import
- Fix duplicate docstring in quantize_moe_experts
- Add input validation to quantize_model()
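
One reason to prefer explicit exceptions over assert is that assert statements are skipped when Python runs with the -O flag, so the checks would silently disappear. A minimal sketch of such validation follows; the helper name validate_quant_args and the exact set of rules are assumptions, not the checks implemented in quantize_model().

```python
SUPPORTED_BITS = (4, 8)  # the PR describes INT4 and INT8 support


def validate_quant_args(bits: int, group_size: int) -> None:
    """Raise TypeError/ValueError for unusable arguments (hypothetical helper)."""
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(bits, bool) or not isinstance(bits, int):
        raise TypeError(f"bits must be an int, got {type(bits).__name__}")
    if bits not in SUPPORTED_BITS:
        raise ValueError(f"bits must be one of {SUPPORTED_BITS}, got {bits}")
    if isinstance(group_size, bool) or not isinstance(group_size, int):
        raise TypeError(
            f"group_size must be an int, got {type(group_size).__name__}"
        )
    if group_size <= 0:
        raise ValueError(f"group_size must be positive, got {group_size}")


validate_quant_args(bits=4, group_size=128)  # passes silently

try:
    validate_quant_args(bits=3, group_size=128)
except ValueError as exc:
    print(f"rejected: {exc}")
```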

expert_offloader.py:
- Fix bug: expert.state_dict → expert.state_dict() (missing parentheses)
- Add bounds checking for expert_id access
- Add proper KeyError/IndexError/AttributeError for invalid access
- Add __repr__ to ExpertOffloader for debugging
- Add input validation for layer_name existence
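
The state_dict fix is easy to miss because the unparenthesized form does not fail immediately: it evaluates to a bound method rather than a dictionary. The sketch below reproduces that with a stand-in class and adds a bounds check of the kind listed above; FakeExpert and get_expert_state are hypothetical names, not part of expert_offloader.py.

```python
from typing import Any, Dict, List


class FakeExpert:
    """Stand-in for a module that exposes state_dict() (illustration only)."""

    def __init__(self, weight: float) -> None:
        self.weight = weight

    def state_dict(self) -> Dict[str, Any]:
        return {"weight": self.weight}


def get_expert_state(experts: List[FakeExpert], expert_id: int) -> Dict[str, Any]:
    """Return one expert's state, validating the index first."""
    if isinstance(expert_id, bool) or not isinstance(expert_id, int):
        raise TypeError(
            f"expert_id must be an int, got {type(expert_id).__name__}"
        )
    if not 0 <= expert_id < len(experts):
        raise IndexError(
            f"expert_id {expert_id} out of range for {len(experts)} experts"
        )
    return experts[expert_id].state_dict()  # note the call parentheses


experts = [FakeExpert(1.0), FakeExpert(2.0)]

wrong = experts[0].state_dict    # bound method, not the saved state
right = experts[0].state_dict()  # the actual dictionary

print(callable(wrong), isinstance(wrong, dict))  # True False
print(right)  # {'weight': 1.0}
```

Rejecting negative indices here is a design choice for the sketch; plain list indexing would otherwise accept them and silently return an expert counted from the end.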

All changes maintain backward compatibility.