| Documentation | Website | RoadMap | 中文 |
The core principle of Unified Cache Manager (UCM) is to persist the LLM KVCache and eliminate redundant computation through multiple retrieval mechanisms. UCM not only supports prefix caching but also offers a variety of training-free sparse attention retrieval methods, delivering higher performance on extremely long-sequence inference tasks. Additionally, UCM provides a PD disaggregation solution based on a storage-compute separation architecture, which enables more straightforward and flexible management of heterogeneous computing resources. When integrated with vLLM, UCM achieves a 3-10x reduction in inference latency across various scenarios, including multi-turn dialogue and long-context reasoning tasks.
As model sizes grow, the KV cache becomes larger and sparser, especially for long-sequence requests. To reduce GPU memory usage, offloading the full KV cache to external storage while keeping only partial or compressed KV in GPU memory has become a popular direction. This also reduces GPU computation and allows longer sequences and larger batch sizes during decoding.
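As a rough sketch of this direction (illustrative only, not UCM's actual code), the snippet below persists a full KV block to external storage and keeps only a strided subset of tokens on the GPU; the function names and file layout are invented for the example.

```python
# Illustrative sketch only: offload_block/load_block are hypothetical names.
import torch

def offload_block(kv: torch.Tensor, path: str, keep_every: int = 4) -> torch.Tensor:
    """Persist the full KV block externally; keep a strided token subset on the GPU."""
    torch.save(kv.cpu(), path)                # full copy goes to external storage
    return kv[..., ::keep_every, :].clone()   # partial KV retained in GPU memory

def load_block(path: str, device: str = "cuda") -> torch.Tensor:
    """Fetch the full block back when the retained subset is not enough."""
    return torch.load(path, map_location=device)
```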
There are many possible choices for sparse KV cache algorithms, and recent papers point out that no single method fits all scenarios and all models. It is therefore better to build a common framework into which different sparse algorithms can be plugged, much like the KV connector for PC (prefix caching).
All gray boxes in the diagram represent existing classes in vLLM version 0.9.2, the green boxes indicate components newly added by UCM, and the light green boxes show potential future subclass extensions based on this framework.
UcmSparseBase is the base class for the different sparse algorithms. Just like the KV connector design, it hooks into a few places in the scheduler and in layer.py to load, dump, and compute sparse KV blocks.
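A minimal sketch of what such a plugin might look like is shown below; the hook names and the toy top-k block selection are assumptions for illustration, not the actual UcmSparseBase interface.

```python
# Hypothetical sketch: method names are illustrative, not the real UCM interface.
import torch

class UcmSparseBase:
    """Hooks that the scheduler and layer.py would call (names assumed)."""
    def build_sparse_meta(self, request): ...           # scheduler: decide blocks to keep
    def maybe_load_blocks(self, layer_name, meta): ...  # layer: fetch needed KV from storage
    def maybe_dump_blocks(self, layer_name, kv): ...    # layer: persist evicted KV

class TopKAttention(UcmSparseBase):
    """Toy algorithm: keep the k cached blocks whose keys best match the query."""
    def __init__(self, k: int = 8):
        self.k = k

    def select_blocks(self, query: torch.Tensor, block_keys: torch.Tensor) -> torch.Tensor:
        # Score each cached block by dot-product similarity, then keep the top-k.
        scores = torch.einsum("d,nd->n", query, block_keys)
        return torch.topk(scores, min(self.k, scores.numel())).indices
```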
SparseKVManager allows users to define custom KV block allocation for different algorithms. To keep all implementations unified under the SparseKVBase framework, the system calls through the SparseKVBase base class, while the actual allocation logic lives in each sparse algorithm's subclass.
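The pattern described here is classic base-class dispatch; a hedged sketch (the registration decorator and the allocation policy are invented for illustration, not UCM's API) could look like this:

```python
# Hypothetical sketch of base-class dispatch; none of these names are UCM's API.
_REGISTRY: dict[str, type] = {}

def register_sparse_algo(name: str):
    """Let each sparse algorithm register its own block-allocation policy."""
    def wrap(cls):
        _REGISTRY[name] = cls
        return cls
    return wrap

@register_sparse_algo("topk")
class TopKAllocator:
    def allocate_blocks(self, num_tokens: int, block_size: int = 16) -> int:
        # Toy policy: allocate GPU blocks only for the ~25% of tokens retained.
        return (num_tokens // 4 + block_size - 1) // block_size

class SparseKVManager:
    """Callers only touch the manager; behavior comes from the registered subclass."""
    def __init__(self, algo_name: str):
        self.algo = _REGISTRY[algo_name]()

    def allocate_blocks(self, num_tokens: int) -> int:
        return self.algo.allocate_blocks(num_tokens)
```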
KVStoreBase decouples sparse algorithms from external storage. It defines the methods for communicating with external storage, so any sparse algorithm can work seamlessly with any external storage system. The core concept is to identify blocks by IDs and offsets; this approach not only suits sparse scenarios but also naturally accommodates prefix caching. The KVStoreConnector links it to the current KVConnectorBase_V1 to provide PC (Prefix Caching) functionality. For example, NFSStore serves as a reference implementation that can store KVCache on a local filesystem in single-machine scenarios or through NFS mount points in multi-server environments.
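To make the ID-plus-offset addressing concrete, here is a hedged, file-backed sketch in the spirit of NFSStore; the class name, method signatures, and on-disk layout are assumptions, not the actual KVStoreBase API.

```python
# Hypothetical sketch: blocks addressed by (block_id, offset), backed by files
# under a directory that could be local or an NFS mount point.
import os
import torch

class FileKVStore:
    def __init__(self, root: str):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _path(self, block_id: str) -> str:
        return os.path.join(self.root, f"{block_id}.pt")

    def lookup(self, block_id: str) -> bool:
        # Prefix caching and sparse retrieval both reduce to "is this block persisted?"
        return os.path.exists(self._path(block_id))

    def dump(self, block_id: str, offset: int, kv: torch.Tensor) -> None:
        blob = torch.load(self._path(block_id)) if self.lookup(block_id) else {}
        blob[offset] = kv.cpu()                  # one tensor per offset within the block
        torch.save(blob, self._path(block_id))

    def load(self, block_id: str, offset: int, device: str = "cuda") -> torch.Tensor:
        return torch.load(self._path(block_id), map_location=device)[offset]
```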
- Prefix Cache
- Cache Blend
- Model Window Extrapolation
- Prefill Offload
- Sparse Attention
- Sparse Attention Offload
- Heterogeneous PD Disaggregation
For usage of these features, please refer to Quick Start.
| Branch | Status | vLLM version |
|---|---|---|
| main | Maintained | v0.9.2 |
| develop | Maintained | v0.9.2 |
For technical questions and feature requests, please use GitHub Issues.
UCM is licensed under the MIT License with additional conditions. Please read the LICENSE file for details.