
@vaibverm

Objective:

This PR introduces the KV blocking technique for CausalLM models, where the K/V cache is read and processed block by block in the attention computation. The number of desired KV blocks is defined at model initialization in the `from_pretrained` call, so that the ONNX graph is exported with the required number of KV blocks. This requires the changes listed below.
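For context, a minimal usage sketch (assuming the QEfficient `QEFFAutoModelForCausalLM` entry point; the model name and block count are just examples):

```python
from QEfficient import QEFFAutoModelForCausalLM

# Sketch under the assumption that qaic_config accepts the
# num_kv_blocks key this PR introduces.
model = QEFFAutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",          # example model
    qaic_config={"num_kv_blocks": 4},   # export ONNX with 4 KV blocks
)
```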

Changes:

  1. The SoftMax is changed from a regular SoftMax to an online SoftMax, where a running maximum and a cumulative denominator are tracked and updated as each block is processed, so the result stays mathematically equivalent to a regular SoftMax (see the first sketch after this list).
  2. Changes to the CTXGather and CTXGatherCB custom ops to read only one block's worth of data per cache gather/read.
  3. Changes to the read_only function in QEffDynamicCache to allow reading the cache block by block rather than as the full K/V cache.
  4. Generation of an attention mask per block.
  5. Changes to the eager_attention_forward implementation in the Llama model to support BlockedKV attention and the online SoftMax.
  6. Wrapping the num_kv_blocks variable inside qaic_config to keep the calling style consistent.
  7. A new PyTorch transform to pass the num_kv_blocks variable to the QEffLlamaAttention block (see the second sketch after this list).
  8. A new constant for num_kv_blocks.
  9. New tests that exercise the BlockedKV feature both on and off.
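To make item 1 concrete, here is a minimal PyTorch sketch of the online-SoftMax recurrence over KV blocks. It is a didactic reimplementation, not the PR's actual kernel; the function name and shapes are illustrative:

```python
import torch

def blocked_online_softmax_attention(q, k_blocks, v_blocks, mask_blocks):
    """q: (B, H, Tq, d); each K/V block: (B, H, block_len, d);
    each mask block: additive mask (large negative at masked positions)."""
    scale = q.shape[-1] ** -0.5
    running_max = torch.full(q.shape[:-1] + (1,), float("-inf"))
    denom = torch.zeros_like(running_max)
    out = torch.zeros_like(q)
    for k, v, mask in zip(k_blocks, v_blocks, mask_blocks):
        scores = q @ k.transpose(-1, -2) * scale + mask   # (B, H, Tq, block_len)
        block_max = scores.amax(dim=-1, keepdim=True)
        new_max = torch.maximum(running_max, block_max)
        correction = torch.exp(running_max - new_max)     # rescale prior state
        p = torch.exp(scores - new_max)
        denom = denom * correction + p.sum(dim=-1, keepdim=True)
        out = out * correction + p @ v
        running_max = new_max
    return out / denom
```

The invariant is that after the last block, `out / denom` equals the regular full-SoftMax attention output in exact arithmetic, which is the accuracy property item 1 refers to.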
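For item 7, a hypothetical sketch of how a module-level transform might propagate the variable; the class name and `apply` signature are illustrative, not the PR's actual code:

```python
import torch.nn as nn

class BlockedKVTransform:
    """Hypothetical transform: walk the model and hand num_kv_blocks
    to every QEffLlamaAttention module."""

    @classmethod
    def apply(cls, model: nn.Module, num_kv_blocks: int) -> nn.Module:
        for module in model.modules():
            if module.__class__.__name__ == "QEffLlamaAttention":
                module.num_kv_blocks = num_kv_blocks
        return model
```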

Please review and feel free to suggest changes and tests.

@vbaddi (Contributor) commented Nov 14, 2025

Thanks @vaibverm
Could you please address the conflicts and run the lint/format?

@vaibverm force-pushed the main branch 2 times, most recently from bc0ef8b to 5997515 on November 14, 2025 08:01
@vaibverm (Author)

Hi @vbaddi,
I have addressed the conflicts but some workflows need approval. Would you be able to approve those?
