feat: final good 👍👍👍👍

20190511 · 20190511 · commit 35a7a072e72a · 2025-05-30T14:42:08.000+09:00
diff --git a/_posts/2025-06-04-Final.md b/_posts/2025-06-04-Final.md
@@ -257,11 +257,11 @@ During the Decoding phase, each request reuses previously computed Key and Value
   - MHA Block: Emphasis on **GEMV** operations increases
     - QKV Generation (GEMM):  
       $$XW_Q,\ XW_K,\ XW_V: [1, d_{\text{emb}}] \times [d_{\text{emb}}, d_{\text{emb}}]$$  
-      These are vector-matrix multiplications.
+      For a single request, these are vector-matrix multiplications (GEMV). Because all decoding requests in a batch share the same weights, their input vectors are stacked and the projection is processed as a single GEMM with shape:  
+      $$[N_{batches}, d_{emb}] \times [d_{emb}, d_{emb}]$$  
       For Key and Value, the previous KV Cache is loaded from memory and concatenated. The KV matrices have shape:  
       $$K,V: [N_{prev}+1, d_{emb}]$$   
-      Although a single request may appear as a GEMV, multiple decoding requests share the same weights. Thus, in practice, this is processed as a GEMM with shape:  
-      $$[N_{batches}, d_{emb}] \times [d_{emb}, d_{emb}]$$  
+  
     - **Attention** :  
       $$Q \times K^T \times V:  [1, \frac{d_{emb}}{H}]\times[\frac{d_{emb}}{H}, N_{prev}+1] \times [N_{prev}+1, \frac{d_{emb}}{H}]$$  
       **Even with batching, each request in the decoding phase maintains its own KV Cache and must independently load and process its own data. As a result, attention cannot be merged into a single batched GEMM across requests. This leads to low compute utilization (GEMV) and disproportionately high memory utilization during attention in the decoding phase.**
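
The contrast between the batched QKV projection and per-request attention can be sketched in NumPy. This is only an illustrative sketch under assumed values: `d_emb`, `H`, `prev_lens`, and the random tensors are hypothetical, only one head is shown, and the softmax (omitted from the shape formula above) is included for completeness.

```python
import numpy as np

# Hypothetical sizes chosen only for illustration.
d_emb, H = 64, 4
d_head = d_emb // H
prev_lens = [5, 9, 2]            # N_prev differs per request
n_batches = len(prev_lens)

rng = np.random.default_rng(0)
W_q = rng.standard_normal((d_emb, d_emb))    # weights shared by all requests
X = rng.standard_normal((n_batches, d_emb))  # one new token per request

# QKV generation: the shared weights let the requests be stacked into one GEMM.
Q = X @ W_q                                  # [N_batches, d_emb]

# Attention: every request owns a KV cache of a different length,
# so each one is a separate GEMV-like computation.
score_shapes, outputs = [], []
for b, n_prev in enumerate(prev_lens):
    K = rng.standard_normal((n_prev + 1, d_head))  # stand-in for this request's cached K
    V = rng.standard_normal((n_prev + 1, d_head))  # stand-in for this request's cached V
    q = Q[b, :d_head][None, :]                     # [1, d_emb/H], head 0 only
    scores = q @ K.T                               # [1, N_prev+1]
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                       # softmax over cached positions
    outputs.append(weights @ V)                    # [1, d_emb/H]
    score_shapes.append(scores.shape)
```

The projection touches `W_q` once for all requests, while the loop reads a distinct `K` and `V` per request, which is why the attention step stays memory-bound even when the batch grows.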