Here is the development roadmap for H2 2025. We will pin this roadmap in Issues, and most of our subsequent work will be tracked by updating it here. In MLLM's documentation, we will archive each version of the roadmap and provide an outlook. Contributions and feedback are welcome.
We plan to release a major MLLM version every year. The version for H2 2025 will be 2.0.0, and the main updates planned for this version are listed in the Focus section.
Focus
- Refactoring from mllm-v1: Implement a more streamlined project structure; introduce a simple and user-friendly eager mode; provide MLLM static graph IR
- Support for more backends: CANN (P0) and CUDA / AMD NPU (P1)
- Experimental attempt: Compilation from MLLM static graph IR to NPU backend
- Provide user-friendly components such as pymllm, mllm-cli, and MllmCSdk to expand the adoption of the MLLM project
- Enhance the benchmarking system with a focus on optimizing Arm Kernels
Engine
- ✔️ Async I/O between disk and host memory for MoE models @chenghuaWang
- ✔️ `inplace` and `redirect` API for memory reuse (see the sketch after this list). Check `mllm/models/qwen3/modeling_qwen3_fa2.hpp`. @chenghuaWang
- ✔️ Op Plugin system. @chenghuaWang @oreomaker
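
To illustrate the memory-reuse idea behind the `inplace`/`redirect` API, here is a minimal sketch contrasting an out-of-place SiLU with an in-place variant that reuses the input buffer. The `Tensor` type and function names below are hypothetical, not MLLM's real API; refer to `mllm/models/qwen3/modeling_qwen3_fa2.hpp` for the actual usage.

```cpp
// Hypothetical sketch of in-place memory reuse; types and names are
// illustrative only, not MLLM's real inplace/redirect API.
#include <cmath>
#include <cstddef>
#include <vector>

struct Tensor {
  std::vector<float> data;  // backing buffer
};

// Out-of-place: allocates a fresh output buffer on every call.
Tensor silu(const Tensor& x) {
  Tensor out;
  out.data.resize(x.data.size());
  for (std::size_t i = 0; i < x.data.size(); ++i) {
    const float v = x.data[i];
    out.data[i] = v / (1.0f + std::exp(-v));  // SiLU(v) = v * sigmoid(v)
  }
  return out;
}

// In-place variant: reuses the input buffer, avoiding the extra allocation.
// A "redirect"-style reuse would instead alias the op's output to an
// already-allocated tensor.
void silu_inplace(Tensor& x) {
  for (float& v : x.data) {
    v = v / (1.0f + std::exp(-v));
  }
}
```

For memory-bound decode workloads, avoiding these intermediate allocations reduces both peak memory and allocator traffic.
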
Model coverage
- ✔️ DeepSeek-OCR @chenghuaWang
- ✔️ SmolLM3-3B @nuozhihan
- InternLM2.5-1.8B @Sp0tless
- ✔️ MiniCPM2.6-Omni @oreomaker
- Gemma3 @jialilve
- InternVL @jialilve
- Qwen3-VL
- Qwen3-Next
Kernels
- Arm: KleidiAI SME kernels are supported, as the latest SoCs include SME capability. @chenghuaWang
- Arm: Improve Mllm-Blas performance.
- X86: FP32 kernels built on top of highway (see the sketch after this list). @HayzelHan
- X86: Quantized kernels using the GGUF format. @HayzelHan
- ✔️ Commons: Paged Attention kernels based on mllm's zen-file system. @chenghuaWang
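
As context for the highway-based FP32 kernels above, here is a minimal sketch of an element-wise multiply written against Highway's portable SIMD API. The function name `MulF32` is made up for illustration, and a production kernel would likely use Highway's `foreach_target` dynamic-dispatch machinery rather than this static-dispatch form.

```cpp
// Minimal FP32 kernel sketch using Highway (static dispatch only).
// MulF32 is a hypothetical name, not an MLLM kernel.
#include <cstddef>

#include "hwy/highway.h"

namespace hn = hwy::HWY_NAMESPACE;

void MulF32(const float* a, const float* b, float* out, std::size_t n) {
  const hn::ScalableTag<float> d;       // widest available float vector
  const std::size_t lanes = hn::Lanes(d);
  std::size_t i = 0;
  for (; i + lanes <= n; i += lanes) {
    const auto va = hn::LoadU(d, a + i);  // unaligned vector loads
    const auto vb = hn::LoadU(d, b + i);
    hn::StoreU(hn::Mul(va, vb), d, out + i);
  }
  for (; i < n; ++i) out[i] = a[i] * b[i];  // scalar tail
}
```
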
Backends
- Qualcomm NPU: Migrated and refactored the QNN backend from mllm-v1 to mllm-v2, leveraging the framework’s static-graph compilation to execute on QNN. @oreomaker @jialilve @Sp0tless
- CUDA: Future-proofed for VLA models with native FP8 and MXFP4 support. (CUDA + TileLang) @jialilve @chenghuaWang
- CANN: Ready for next-gen VLA models, optimized for Ascend AI accelerators. @lywbarca @yuerqiqi @chenghuaWang
Performance
- Benchmark MLLM, llama.cpp, and MNN using Q4_K-like quantization settings. @jialilve
- ✔️ Fast version of Qwen3 using: 1. manual memory planning, 2. fused kernels, 3. in-place operators, etc. @chenghuaWang
- ✔️ Use Tracy and Perfetto for performance measurement (see the sketch after this list). @chenghuaWang
- (Optional) ARM PMU Tools setup
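
For the Tracy item above, the sketch below shows the kind of zone instrumentation one might add around a decode step. `decode_step` is a hypothetical function, and MLLM's actual integration may differ; build with `TRACY_ENABLE` defined and link the Tracy client so the profiler can attach.

```cpp
// Minimal Tracy instrumentation sketch; decode_step is hypothetical.
#include <tracy/Tracy.hpp>

void decode_step() {
  ZoneScopedN("decode_step");  // named zone shown on the Tracy timeline
  // ... attention, MLP, sampling ...
}

int main() {
  for (int i = 0; i < 128; ++i) {
    decode_step();
    FrameMark;  // marks one "frame" (here: one decoded token)
  }
}
```
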
Quantization
- kai: Quantize on any machine, pack on Arm (make `mllm-convertor --pipeline xxx_kai_pipeline` available on any device).
- GGUF: GGUF Q4_K and Q6_K quantization methods on `.mllm` files. @HayzelHan
Compile
KV Cache Management
- Quantized KVCache: int8 per token (see the sketch after this list).
- ✔️ Prefix Cache and Paged Attn: Support multi-turn chat. @chenghuaWang
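
To make the per-token int8 KV-cache idea concrete, here is a small, self-contained sketch of per-token absmax quantization and dequantization. The struct and function names are illustrative and do not reflect MLLM's actual KV-cache code.

```cpp
// Hypothetical per-token int8 KV-cache quantization: one fp32 scale per
// token's key/value vector. Not MLLM's real implementation.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

struct QuantizedToken {
  std::vector<int8_t> q;  // quantized head_dim values for one token
  float scale;            // per-token dequantization scale
};

QuantizedToken quantize_token(const std::vector<float>& kv) {
  float absmax = 0.0f;
  for (float v : kv) absmax = std::max(absmax, std::fabs(v));
  QuantizedToken out;
  out.scale = absmax > 0.0f ? absmax / 127.0f : 1.0f;
  out.q.reserve(kv.size());
  for (float v : kv) {
    out.q.push_back(static_cast<int8_t>(std::lround(v / out.scale)));
  }
  return out;
}

// Dequantize when the token is read back during attention.
std::vector<float> dequantize_token(const QuantizedToken& t) {
  std::vector<float> out;
  out.reserve(t.q.size());
  for (int8_t q : t.q) out.push_back(q * t.scale);
  return out;
}
```

Per-token scaling roughly halves KV-cache memory relative to fp16 while keeping the quantization error localized to each token's vector.
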
Pymllm
- ✔️ Waiting: Awaiting PyPI's approval of our organization's application. pymllm is now available on macOS (`pip install pymllm`).
Production Stack
biaji and chenghuaWang