Hi there,

I am reading your DuoAttention paper and one paragraph confuses me:
Despite numerous efforts to overcome the challenges of attention mechanisms in long-context inference, significant computational and memory issues persist. Architectural modifications, such as Grouped-Query Attention (GQA) (Ainslie et al., 2023), require model pre-training and fail to reduce computational costs. Linear Attention methods (Gu & Dao, 2023; Poli et al., 2023), while less demanding in terms of computation and memory, often underperform in long-context scenarios compared to Transformer models. Approximative attention methods, such as H2O (Zhang et al., 2023b), StreamingLLM (Xiao et al., 2023b), TOVA (Oren et al., 2024), and FastGen (Ge et al., 2024), often compromise accuracy in long-context applications and are incompatible with essential KV cache optimization techniques like GQA.
Why is StreamingLLM incompatible with GQA while DuoAttention is compatible? It seems to me that each layer contains both retrieval heads and streaming heads, so it may well be the case that the attention heads sharing the same KV matrices include both retrieval and streaming heads. In that case, the full KV matrices would need to be stored for that attention group.
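To make the concern concrete, here is a minimal sketch of the scenario I have in mind (all head counts, window sizes, and the random head split below are invented for illustration and are not taken from the DuoAttention code): whenever even one retrieval head falls into a GQA KV group, that group's shared cache seems to need to be kept in full.

```python
import torch

# Hypothetical sketch of the concern above: with GQA, several query heads
# share one KV head. If a KV group contains at least one retrieval head,
# the shared cache for that group must be kept in full; only groups made
# up entirely of streaming heads can shrink to sinks + a recent window.
# All names and numbers here are invented for illustration.

num_query_heads = 32
num_kv_groups = 8                        # GQA: 4 query heads share each KV head
group_size = num_query_heads // num_kv_groups
seq_len = 8192                           # tokens currently in the context
sink_tokens, recent_tokens = 4, 256      # StreamingLLM-style sinks + recent window

# Per-query-head classification (e.g. from learned gating, as in DuoAttention).
is_retrieval_head = torch.rand(num_query_heads) > 0.5

kept_tokens = 0
for g in range(num_kv_groups):
    heads = is_retrieval_head[g * group_size:(g + 1) * group_size]
    if heads.any():
        # A retrieval head shares this KV head -> full KV cache for the group.
        kept_tokens += seq_len
    else:
        # Pure streaming group -> constant-size cache.
        kept_tokens += min(seq_len, sink_tokens + recent_tokens)

print(f"average KV tokens kept per group: {kept_tokens / num_kv_groups:.0f}")
```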