
Why is StreamingLLM incompatible with GQA while DuoAttention is compatible? #15

BoxuanYang opened this issue Jan 19, 2025 · 0 comments


Hi there,

I am reading your DuoAttention paper, and one paragraph confuses me:

Despite numerous efforts to overcome the challenges of attention mechanisms in long-context
inference, significant computational and memory issues persist. Architectural modifications, such
as Grouped-Query Attention (GQA)(Ainslie et al., 2023), require model pre-training and fail to
reduce computational costs. Linear Attention methods (Gu & Dao, 2023; Poli et al., 2023), while
less demanding in terms of computation and memory, often underperform in long-context scenarios
compared to Transformer models. Approximative attention methods, such as H2O (Zhang et al.,
2023b), StreamingLLM (Xiao et al., 2023b), TOVA (Oren et al., 2024), and FastGen (Ge et al., 2024),
often compromise accuracy in long-context applications and are incompatible with essential KV cache
optimization techniques like GQA.

Why is StreamingLLM incompatible with GQA while DuoAttention is compatible? It seems to me that each layer contains both retrieval heads and streaming heads, so it may well be the case that the query heads sharing the same KV matrices include both retrieval and streaming heads. In that case, the full KV cache would still need to be stored for that attention group.
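
To make my concern concrete, here is a minimal sketch (my own, not from the paper or the repository) of how I picture per-head KV pruning interacting with GQA groups. The head counts, the streaming/retrieval labels, the sequence length, and the sink + recent budget are all made-up numbers for illustration.

```python
# Sketch of the concern: with GQA, several query heads share one KV head,
# so a per-query-head eviction policy can only shrink a group's shared KV
# cache if *every* query head mapped to that group tolerates eviction.
# All quantities below are hypothetical.

num_query_heads = 32
num_kv_heads = 8                      # GQA: 4 query heads share each KV head
group_size = num_query_heads // num_kv_heads

# Hypothetical per-query-head labels: True = "streaming" head that only needs
# sink + recent tokens, False = "retrieval" head that needs the full KV cache.
is_streaming_head = [i % 5 != 0 for i in range(num_query_heads)]

seq_len = 16384                       # tokens currently in context
sink_plus_recent = 4 + 256            # attention-sink + recent-window budget

kept_entries = 0
for g in range(num_kv_heads):
    heads_in_group = is_streaming_head[g * group_size:(g + 1) * group_size]
    # The shared KV cache of this group can only be pruned if *all* query
    # heads in the group are streaming heads; a single retrieval head forces
    # the whole group to keep the full cache.
    if all(heads_in_group):
        kept_entries += sink_plus_recent
    else:
        kept_entries += seq_len

print(f"KV entries kept per layer: {kept_entries} "
      f"(vs. {num_kv_heads * seq_len} with no pruning)")
```

With labels like these, most groups end up keeping the full cache, so the memory saving largely disappears. If DuoAttention instead made the retrieval/streaming decision per KV head (i.e., per GQA group) rather than per query head, this conflict would not arise, but I could not tell from the paper whether that is what happens.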
