Here is the development roadmap for H2 2025. We will pin this roadmap in Issues, and most of our subsequent work will be tracked by updating it here. In MLLM's documentation, we will archive each version of the roadmap and provide an outlook. Contributions and feedback are welcome.
We plan to release a major MLLM version every year. The version for H2 2025 will be 2.0.0, and the main updates planned for this version are listed in the Focus section.
Focus
- Refactoring from mllm-v1: Implement a more streamlined project structure; introduce a simple and user-friendly eager mode; provide MLLM static graph IR
- Support for more backends: CANN (P0) and CUDA / AMD NPU (P1)
- Experimental attempt: Compilation from MLLM static graph IR to NPU backend
- Provide user-friendly components such as pymllm, mllm-cli, and MllmCSdk to expand the adoption of the MLLM project
- Enhance the benchmarking system with a focus on optimizing Arm Kernels
Engine
- ✔️ Async I/O between disk and host memory for MoE models @chenghuaWang
- ✔️ `inplace` and `redirect` API for memory reuse (see the sketch after this list). Check `mllm/models/qwen3/modeling_qwen3_fa2.hpp`. @chenghuaWang
- ✔️ Op Plugin system. @chenghuaWang @oreomaker
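
To illustrate the memory-reuse idea behind the `inplace`/`redirect` API, here is a minimal sketch contrasting an out-of-place SiLU with an in-place variant that reuses the input buffer. The `Tensor` type and function names below are hypothetical, not MLLM's real API; refer to `mllm/models/qwen3/modeling_qwen3_fa2.hpp` for the actual usage.

```cpp
// Hypothetical sketch of in-place memory reuse; types and names are
// illustrative only, not MLLM's real inplace/redirect API.
#include <cmath>
#include <cstddef>
#include <vector>

struct Tensor {
  std::vector<float> data;  // backing buffer
};

// Out-of-place: allocates a fresh output buffer on every call.
Tensor silu(const Tensor& x) {
  Tensor out;
  out.data.resize(x.data.size());
  for (std::size_t i = 0; i < x.data.size(); ++i) {
    const float v = x.data[i];
    out.data[i] = v / (1.0f + std::exp(-v));  // SiLU(v) = v * sigmoid(v)
  }
  return out;
}

// In-place variant: reuses the input buffer, avoiding the extra allocation.
// A "redirect"-style reuse would instead alias the op's output to an
// already-allocated tensor.
void silu_inplace(Tensor& x) {
  for (float& v : x.data) {
    v = v / (1.0f + std::exp(-v));
  }
}
```

For memory-bound decode workloads, avoiding these intermediate allocations reduces both peak memory and allocator traffic.
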
Model coverage
- ✔️ DeepSeek-OCR @chenghuaWang
- ✔️ SmolLM3-3B @nuozhihan
- InternLM2.5-1.8B @Sp0tless
- ✔️ MiniCPM2.6-Omni @oreomaker
- Gemma3 @jialilve
- InternVL @jialilve
- Qwen3-VL
- Qwen3-Next
Kernels
- Arm: KleidiAI SME kernels are supported, as the latest SoCs include SME capability. @chenghuaWang
- Arm: Improve Mllm-Blas performance.
- X86: FP32 kernels built on top of highway (see the sketch after this list). @HayzelHan
- X86: Quantized kernels using the GGUF format. @HayzelHan
- ✔️ Commons: Paged Attention kernels based on mllm's zen-file system. @chenghuaWang
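
As context for the highway-based FP32 kernels above, here is a minimal sketch of an element-wise multiply written against Highway's portable SIMD API. The function name `MulF32` is made up for illustration, and a production kernel would likely use Highway's `foreach_target` dynamic-dispatch machinery rather than this static-dispatch form.

```cpp
// Minimal FP32 kernel sketch using Highway (static dispatch only).
// MulF32 is a hypothetical name, not an MLLM kernel.
#include <cstddef>

#include "hwy/highway.h"

namespace hn = hwy::HWY_NAMESPACE;

void MulF32(const float* a, const float* b, float* out, std::size_t n) {
  const hn::ScalableTag<float> d;       // widest available float vector
  const std::size_t lanes = hn::Lanes(d);
  std::size_t i = 0;
  for (; i + lanes <= n; i += lanes) {
    const auto va = hn::LoadU(d, a + i);  // unaligned vector loads
    const auto vb = hn::LoadU(d, b + i);
    hn::StoreU(hn::Mul(va, vb), d, out + i);
  }
  for (; i < n; ++i) out[i] = a[i] * b[i];  // scalar tail
}
```
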
Backends
- Qualcomm NPU: Migrated and refactored the QNN backend from mllm-v1 to mllm-v2, leveraging the framework’s static-graph compilation to execute on QNN. @oreomaker @jialilve @Sp0tless
- CUDA: Future-proofed for VLA models with native FP8 and MXFP4 support. (CUDA + TileLang) @jialilve @chenghuaWang
- CANN: Ready for next-gen VLA models, optimized for Ascend AI accelerators. @lywbarca @yuerqiqi @chenghuaWang
Performance
- Benchmark MLLM, llama.cpp, and MNN using Q4_K-like quantization settings. @jialilve
- ✔️ Fast version of Qwen3 using: 1. manual memory planning, 2. fused kernels, 3. in-place operators, etc. @chenghuaWang
- ✔️ Use Tracy and Perfetto for performance measurement (see the sketch after this list). @chenghuaWang
- (Optional) ARM PMU Tools setup
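
For the Tracy item above, the sketch below shows the kind of zone instrumentation one might add around a decode step. `decode_step` is a hypothetical function, and MLLM's actual integration may differ; build with `TRACY_ENABLE` defined and link the Tracy client so the profiler can attach.

```cpp
// Minimal Tracy instrumentation sketch; decode_step is hypothetical.
#include <tracy/Tracy.hpp>

void decode_step() {
  ZoneScopedN("decode_step");  // named zone shown on the Tracy timeline
  // ... attention, MLP, sampling ...
}

int main() {
  for (int i = 0; i < 128; ++i) {
    decode_step();
    FrameMark;  // marks one "frame" (here: one decoded token)
  }
}
```
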
Quantization
- kai: Quantize on any machine, pack on Arm (make `mllm-convertor --pipeline xxx_kai_pipeline` available on any device).
- GGUF: GGUF Q4_K and Q6_K quantization methods on `.mllm` files. @HayzelHan
Compile
KV Cache Management
- Quantized KVCache: int8 per token (see the sketch after this list).
- ✔️ Prefix Cache and Paged Attn: Support multi-turn chat. @chenghuaWang
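
To make the per-token int8 KV-cache idea concrete, here is a small, self-contained sketch of per-token absmax quantization and dequantization. The struct and function names are illustrative and do not reflect MLLM's actual KV-cache code.

```cpp
// Hypothetical per-token int8 KV-cache quantization: one fp32 scale per
// token's key/value vector. Not MLLM's real implementation.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

struct QuantizedToken {
  std::vector<int8_t> q;  // quantized head_dim values for one token
  float scale;            // per-token dequantization scale
};

QuantizedToken quantize_token(const std::vector<float>& kv) {
  float absmax = 0.0f;
  for (float v : kv) absmax = std::max(absmax, std::fabs(v));
  QuantizedToken out;
  out.scale = absmax > 0.0f ? absmax / 127.0f : 1.0f;
  out.q.reserve(kv.size());
  for (float v : kv) {
    out.q.push_back(static_cast<int8_t>(std::lround(v / out.scale)));
  }
  return out;
}

// Dequantize when the token is read back during attention.
std::vector<float> dequantize_token(const QuantizedToken& t) {
  std::vector<float> out;
  out.reserve(t.q.size());
  for (int8_t q : t.q) out.push_back(q * t.scale);
  return out;
}
```

Per-token scaling roughly halves KV-cache memory relative to fp16 while keeping the quantization error localized to each token's vector.
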
Pymllm
- ✔️ Waiting: Awaiting PyPI's approval of our organization's application. pymllm is now available on macOS (`pip install pymllm`).
Production Stack
biaji and chenghuaWang