
@vaibverm

Objective:

This PR introduces the KV blocking technique for CausalLM models, where the K/V cache is read and processed block by block in the attention computation. The number of desired KV blocks is defined at model initialization in the `from_pretrained` call, so that the ONNX graph is exported with the required number of KV blocks. This requires the changes listed below.
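For context, a minimal usage sketch (assuming the QEfficient `QEFFAutoModelForCausalLM` entry point; the model name and block count are just examples):

```python
from QEfficient import QEFFAutoModelForCausalLM

# Sketch under the assumption that qaic_config accepts the
# num_kv_blocks key this PR introduces.
model = QEFFAutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",          # example model
    qaic_config={"num_kv_blocks": 4},   # export ONNX with 4 KV blocks
)
```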

Changes:

  1. The SoftMax is changed from a regular SoftMax to an online SoftMax, where a running maximum and a cumulative denominator are tracked and updated as each block is processed, so the result stays mathematically equivalent to a regular SoftMax (see the first sketch after this list).
  2. Changes to the CTXGather and CTXGatherCB custom ops to read only one block's worth of data per cache gather/read.
  3. Changes to the read_only function in QEffDynamicCache to allow reading the cache block by block rather than as the full K/V cache.
  4. Generation of an attention mask per block.
  5. Changes to the eager_attention_forward implementation in the Llama model to support BlockedKV attention and the online SoftMax.
  6. Wrapping the num_kv_blocks variable inside qaic_config to keep the calling style consistent.
  7. A new PyTorch transform to pass the num_kv_blocks variable to the QEffLlamaAttention block (see the second sketch after this list).
  8. A new constant for num_kv_blocks.
  9. New tests that exercise the BlockedKV feature both on and off.
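To make item 1 concrete, here is a minimal PyTorch sketch of the online-SoftMax recurrence over KV blocks. It is a didactic reimplementation, not the PR's actual kernel; the function name and shapes are illustrative:

```python
import torch

def blocked_online_softmax_attention(q, k_blocks, v_blocks, mask_blocks):
    """q: (B, H, Tq, d); each K/V block: (B, H, block_len, d);
    each mask block: additive mask (large negative at masked positions)."""
    scale = q.shape[-1] ** -0.5
    running_max = torch.full(q.shape[:-1] + (1,), float("-inf"))
    denom = torch.zeros_like(running_max)
    out = torch.zeros_like(q)
    for k, v, mask in zip(k_blocks, v_blocks, mask_blocks):
        scores = q @ k.transpose(-1, -2) * scale + mask   # (B, H, Tq, block_len)
        block_max = scores.amax(dim=-1, keepdim=True)
        new_max = torch.maximum(running_max, block_max)
        correction = torch.exp(running_max - new_max)     # rescale prior state
        p = torch.exp(scores - new_max)
        denom = denom * correction + p.sum(dim=-1, keepdim=True)
        out = out * correction + p @ v
        running_max = new_max
    return out / denom
```

The invariant is that after the last block, `out / denom` equals the regular full-SoftMax attention output in exact arithmetic, which is the accuracy property item 1 refers to.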
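For item 7, a hypothetical sketch of how a module-level transform might propagate the variable; the class name and `apply` signature are illustrative, not the PR's actual code:

```python
import torch.nn as nn

class BlockedKVTransform:
    """Hypothetical transform: walk the model and hand num_kv_blocks
    to every QEffLlamaAttention module."""

    @classmethod
    def apply(cls, model: nn.Module, num_kv_blocks: int) -> nn.Module:
        for module in model.modules():
            if module.__class__.__name__ == "QEffLlamaAttention":
                module.num_kv_blocks = num_kv_blocks
        return model
```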

Please review and feel free to suggest changes and tests.

@vbaddi (Contributor) commented Nov 14, 2025

Thanks @vaibverm
Could you please address the conflicts and run the lint/format?

@vaibverm force-pushed the main branch 2 times, most recently from bc0ef8b to 5997515 on November 14, 2025 08:01
@vaibverm (Author)

Hi @vbaddi,
I have addressed the conflicts but some workflows need approval. Would you be able to approve those?
