Commit 35a7a07

feat: final good 👍👍👍👍
1 parent 68a067a commit 35a7a07

1 file changed

Lines changed: 3 additions & 3 deletions

_posts/2025-06-04-Final.md

@@ -257,11 +257,11 @@ During the Decoding phase, each request reuses previously computed Key and Value
 - MHA Block: Emphasis on **GEMV** operations increases
 - QKV Generation (GEMM):
 $$XW_Q,\ XW_K,\ XW_V: [1, d_{\text{emb}}] \times [d_{\text{emb}}, d_{\text{emb}}]$$
-These are vector-matrix multiplications.
+These are vector-matrix multiplications. Although a single request may appear as a GEMV, multiple decoding requests share the same weights. Thus, in practice, this is processed as a GEMM with shape:
+$$[N_{batches}, d_{emb}] \times [d_{emb}, d_{emb}]$$
 For Key and Value, the previous KV Cache is loaded from memory and concatenated. The KV matrices have shape:
 $$K,V: [N_{prev}+1, d_{emb}]$$
-Although a single request may appear as a GEMV, multiple decoding requests share the same weights. Thus, in practice, this is processed as a GEMM with shape:
-$$[N_{batches}, d_{emb}] \times [d_{emb}, d_{emb}]$$
+
 - **Attention** :
 $$Q \times K^T \times V: [1, \frac{d_{emb}}{H}]\times[\frac{d_{emb}}{H}, N_{prev}+1] \times [N_{prev}+1, \frac{d_{emb}}{H}]$$
 **Even with batching, each request in the decoding process maintains its own KV Cache and must independently load and process its own data. As a result, these requests cannot be efficiently handled in a batched manner. This leads to low compute utilization(GEMV) and disproportionately high memory utilization during attention in the decoding phase.**
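
The shapes discussed in this hunk can be illustrated with a small pure-Python sketch. It is not taken from the post: the helper names (`matmul`, `transpose`, `rand_matrix`), the toy sizes, and the per-request cache lengths in `prev_lens` are all assumptions chosen for illustration. It uses a single head (H = 1) and omits softmax and scaling, following the simplified $$Q \times K^T \times V$$ form above. The point it tries to show is that the Q projection for several decoding requests can be stacked into one GEMM because the weights are shared, while attention has to run per request because each KV cache has its own length.

```python
import random

def matmul(a, b):
    """Plain-Python matrix product: [m, k] x [k, n] -> [m, n]."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def transpose(m):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]

def rand_matrix(rows, cols, rng):
    """Hypothetical helper: a rows x cols matrix of random floats."""
    return [[rng.random() for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_emb = 4                  # toy embedding width (assumption)
prev_lens = [2, 5, 3]      # N_prev for each request (assumption)
n_batches = len(prev_lens)

# Shared projection weights: [d_emb, d_emb]
W_Q = rand_matrix(d_emb, d_emb, rng)

# One new token per decoding request: each x is [1, d_emb]
xs = [rand_matrix(1, d_emb, rng) for _ in range(n_batches)]

# Per-request view: n_batches separate vector-matrix products (GEMV)
q_per_request = [matmul(x, W_Q)[0] for x in xs]

# Batched view: stack the tokens into [n_batches, d_emb] and run one GEMM
X = [x[0] for x in xs]
Q_batched = matmul(X, W_Q)

# Each request owns a KV cache of shape [N_prev + 1, d_emb]
kv_caches = [(rand_matrix(n + 1, d_emb, rng), rand_matrix(n + 1, d_emb, rng))
             for n in prev_lens]

# Attention cannot be stacked the same way: K and V differ per request,
# so each request performs its own [1, d] x [d, N_prev + 1] x [N_prev + 1, d]
score_shapes = []
outputs = []
for q, (K, V) in zip(Q_batched, kv_caches):
    scores = matmul([q], transpose(K))   # [1, N_prev + 1]
    out = matmul(scores, V)              # [1, d_emb]
    score_shapes.append((len(scores), len(scores[0])))
    outputs.append(out[0])

print(Q_batched == q_per_request)  # True
print(score_shapes)                # [(1, 3), (1, 6), (1, 4)]
```

The differing second dimension in `score_shapes` is what prevents a single dense GEMM over all requests: every request reads a differently sized K and V from memory and does only a vector-sized amount of compute on it.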
