## Abstract
Our blog post focuses on optimizing the serving of deep learning models on large-scale servers in distributed systems, with an emphasis on improving memory efficiency and reducing latency.
> System Architecture Aspect
Large Language Models (LLMs) have mostly been developed in the GPT style, built on the Decoder part of the Transformer. As the context length of an LLM grows, inference performance becomes highly dependent on how well the Attention operation is optimized. LLMs such as GPT therefore operate in two major stages, Prefill and Decoding, each with distinct computational characteristics. For example, the prompt tokens are summarized into a KV Cache during the Prefill stage, and during Decoding each batched request maintains its own KV Cache. Loading these caches on every step introduces overhead, resulting in a memory bottleneck.<d-cite key="PIMIsAllYouNeed, neupims"></d-cite>
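
To make the two stages concrete, here is a minimal, framework-agnostic sketch of the generation loop; the `prefill` and `decode_step` methods and the KV Cache handling are illustrative placeholders, not code from either paper.

```python
# Illustrative sketch of the two-stage LLM generation loop (not from the papers).
def generate(model, prompt_tokens, max_new_tokens):
    # Prefill: process the whole prompt at once (GEMM-heavy, compute-bound)
    # and build this request's KV Cache.
    kv_cache, logits = model.prefill(prompt_tokens)
    output = []
    token = logits.argmax()
    for _ in range(max_new_tokens):
        output.append(token)
        # Decoding: one token at a time (GEMV-heavy, memory-bound).
        # Every step must re-load the growing KV Cache from memory.
        logits, kv_cache = model.decode_step(token, kv_cache)
        token = logits.argmax()
    return output
```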
In particular, the KV Cache grows with the sequence length and becomes a key source of memory bandwidth bottlenecks. **As the size of the prompt increases, the memory load during the Attention operation introduces** significant overhead in LLM serving. This blog, based on NeuPIMs and the paper PIM is All You Need, presents a new perspective: ***workloads should be carefully partitioned by their computational characteristics, such as GEMM versus GEMV operations and layer structure, and handled with distinct batching strategies depending on the type of accelerator used***.<d-cite key="GEMM_GEMV"></d-cite>
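
As a rough sanity check on why this load dominates, the sketch below estimates the KV Cache footprint of a single request; the configuration is an illustrative Llama2-70B-like setting (80 layers, 8 KV heads under grouped-query attention, head dimension 128), not a measurement from the cited papers.

```python
# Back-of-the-envelope KV Cache size for one request (FP16 storage).
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, dtype_bytes=2):
    # 2x for Key and Value, stored for every layer and every cached token.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes

# Example: a Llama2-70B-like configuration with an 8K-token context.
size = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=8192)
print(f"{size / 2**30:.2f} GiB per request")  # ~2.5 GiB, re-read on every decode step
```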
> Network Aspect
As deep learning models, including large language models (LLMs), continue to grow in scale, it has become increasingly difficult to train them on a single GPU. This has led to a growing interest in **Distributed Deep Learning** (DDL), which enables models to be trained in parallel across multiple hardware devices. While DDL offers clear advantages in scalability, it also introduces a critical challenge: communication overhead between devices. In particular, **intra-node communication** (GPU-to-GPU within a single server) can be handled efficiently using high-performance communication libraries such as NVIDIA's NCCL (NVIDIA Collective Communication Library). However, **inter-node communication** (between GPU servers) often relies on Ethernet-based connections, which are inherently limited by physical bandwidth and latency constraints.
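
For context, the gradient synchronization that DDL relies on is usually expressed as a collective operation such as AllReduce; a minimal PyTorch sketch using the NCCL backend might look like the following. The tensor is a stand-in for local gradients, and the process-group environment (RANK, WORLD_SIZE, MASTER_ADDR, ...) is assumed to be provided by a launcher such as torchrun.

```python
import torch
import torch.distributed as dist

# Minimal sketch: sum-AllReduce of local gradients with the NCCL backend.
dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

grads = torch.randn(1024, device="cuda")      # stand-in for local gradients
dist.all_reduce(grads, op=dist.ReduceOp.SUM)  # fast over NVLink within a node,
grads /= dist.get_world_size()                # Ethernet-bound across nodes

dist.destroy_process_group()
```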
To address these limitations, intelligent network interface cards (SmartNICs) have emerged as a promising solution.<d-cite key="OmNICCL, DirectReduce, FPGANIC, OptimusNIC, SqueezeNIC"></d-cite> In this blog post, we explore ***how recent research suggests that SmartNICs can be leveraged to optimize communication overhead between system nodes in distributed deep learning environments***.
GPUs and NPUs operate more efficiently when arithmetic intensity is high. Thus, while they are well suited to high-intensity operations such as GEMM, they tend to show low utilization on matrix-vector multiplications such as GEMV.
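
A quick way to see the gap is to compare arithmetic intensity, i.e. FLOPs per byte of operand traffic; the sketch below uses simplified operand counts and FP16 storage, and the matrix dimensions are illustrative.

```python
# Simplified arithmetic intensity: FLOPs per byte of operand traffic (FP16).
def intensity(m, k, n, dtype_bytes=2):
    flops = 2 * m * k * n                              # multiply-accumulate count
    bytes_moved = (m * k + k * n + m * n) * dtype_bytes
    return flops / bytes_moved

print(intensity(4096, 4096, 4096))  # GEMM: ~1365 FLOPs/byte -> compute-bound
print(intensity(1, 4096, 4096))     # GEMV: ~1 FLOP/byte     -> memory-bound
```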
### Explanation of Detailed Operations in LLM Models
> Transformer & GPT
{% include figure.html path="assets/img/2025-06-04-Final/transformer.png" class="col-10" %}
{% include figure.html path="assets/img/2025-06-04-Final/NeuPIMs.png" class="col-10" %}
> Proposed Architecture
NeuPIMs <d-cite key="neupims"></d-cite> addresses a key limitation of traditional PIM architectures, where memory mode and PIM mode (for GEMV operations) could not be executed simultaneously. To overcome this, NeuPIMs introduces an architecture that integrates a lightweight NPU and advanced PIM within the same chip, enabling efficient processing of decoding attention operations.
In particular, traditional PIM units are located near memory and share the same buffer for both memory load operations and GEMV computations, making concurrent execution infeasible. To resolve this, NeuPIMs implements a dual-buffer system, allowing memory loading and GEMV execution to occur in parallel, thereby improving decoding efficiency and overall throughput.
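
The intuition behind the dual-buffer design can be illustrated with a generic ping-pong buffering loop; this is only a conceptual software sketch of overlapping loads with compute, not NeuPIMs' hardware interface, and `load_kv_tile` and `gemv` are hypothetical placeholders.

```python
from concurrent.futures import ThreadPoolExecutor

# Conceptual ping-pong (double) buffering: while the GEMV for tile i runs,
# the KV tile for i+1 is loaded into the other buffer, so memory traffic
# and compute overlap in time.
def pim_gemv_stream(tiles, load_kv_tile, gemv):
    results = []
    with ThreadPoolExecutor(max_workers=1) as loader:
        pending = loader.submit(load_kv_tile, tiles[0])              # prime buffer A
        for i in range(len(tiles)):
            current = pending.result()                               # buffer holding tile i
            if i + 1 < len(tiles):
                pending = loader.submit(load_kv_tile, tiles[i + 1])  # fill buffer B
            results.append(gemv(current))                            # compute while B loads
    return results
```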
This overlapping of memory-bound and compute-bound workloads allows NeuPIMs to effectively mitigate both memory and compute bottlenecks, leading to improved parallelism and higher overall throughput in LLM serving.
> Evaluation Results
{% include figure.html path="assets/img/2025-06-04-Final/NeuPIMs_result.png" class="col-10" %}
### PIM is All You Need
> Proposed Architecture
The paper <d-cite key="PIMIsAllYouNeed"></d-cite> presents an architecture designed to address the increasing context length in LLMs by leveraging the high energy efficiency of PIM compared to GPUs and TPUs. In this architecture, PIM units are responsible for GEMV operations, while a custom-designed low-power PNM (Processing-Near-Memory) device, placed near the DRAM controller, handles GEMM computations.
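
At a high level, this amounts to routing each operator by its shape; the snippet below is a schematic of such dispatch, with `run_on_pim` and `run_on_pnm` as hypothetical stand-ins for the two device back-ends rather than an API from the paper.

```python
# Schematic operator routing (illustrative; run_on_pim/run_on_pnm are hypothetical).
def dispatch(op, run_on_pim, run_on_pnm):
    m, k, n = op.shape  # (rows of A, shared dim, cols of B)
    if m == 1 or n == 1:
        # Matrix-vector product (e.g. decoding attention): memory-bound -> PIM banks.
        return run_on_pim(op)
    # Matrix-matrix product (e.g. QKV/FFN projections): compute-bound -> PNM near
    # the DRAM controller, which also hosts softmax and activation units.
    return run_on_pnm(op)
```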
The proposed PNM is not limited to GEMM; it also includes lightweight components such as reduce trees for softmax, exponent processors, and RISC-V cores to support essential functions like activation operations (e.g., GeLU, ReLU). This co-design enables efficient and low-power LLM serving by distributing tasks to specialized near-memory processing elements.
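
To make the role of the reduce trees and exponent processors concrete, here is the standard numerically stable softmax they would accelerate, written in plain NumPy; this shows only the math, not the paper's hardware pipeline.

```python
import numpy as np

# Numerically stable softmax: the max and sum reductions map onto reduce trees,
# and the elementwise exp maps onto the exponent processors described above.
def softmax(scores):
    shifted = scores - np.max(scores, axis=-1, keepdims=True)  # reduction 1: max
    exps = np.exp(shifted)                                      # elementwise exp
    return exps / np.sum(exps, axis=-1, keepdims=True)          # reduction 2: sum

print(softmax(np.array([2.0, 1.0, 0.1])))  # -> approximately [0.659, 0.242, 0.099]
```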
In NeuPIMs, all operations except for the GEMV in decoding are handled by the NPU. In contrast, PIM is All You Need takes a different approach: it offloads all operations except for attention to the PNM device, which is placed near the DRAM controller. These operations are then executed by broadcasting and gathering data across multiple devices, enabling efficient distributed execution across a network of lightweight, near-memory processing units.
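
A simplified software picture of that broadcast-and-gather pattern is sketched below; the column-sharded weights and per-device GEMM callable are illustrative assumptions, not the paper's actual interconnect protocol.

```python
import numpy as np

# Simplified broadcast-compute-gather pattern across near-memory devices
# (illustrative; the device model is a stand-in, not the paper's interconnect).
def broadcast_gather_gemm(activations, weight_shards, local_gemm):
    partials = []
    for shard in weight_shards:                               # broadcast activations
        partials.append(local_gemm(activations, shard))       # each device computes its slice
    return np.concatenate(partials, axis=-1)                  # gather the output columns

x = np.random.randn(4, 8)
shards = [np.random.randn(8, 4) for _ in range(4)]            # column-sharded weight
y = broadcast_gather_gemm(x, shards, lambda a, w: a @ w)
print(y.shape)  # (4, 16)
```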
> Evaluation Results
{% include figure.html path="assets/img/2025-06-04-Final/Results.png" class="col-10" %}
- Models: Llama2-70B
Traditional INA techniques utilize network switches as aggregators to offload and accelerate collective communication operations such as AllReduce. However, despite their potential, network switches are not well-suited for high-performance computing (HPC) environments.<d-cite key="DirectReduce"></d-cite>
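
The aggregation pattern itself is simple to sketch: every worker ships its gradients to an aggregator, which sums them and returns the result to all workers. The toy example below is purely illustrative and involves no real network stack.

```python
import numpy as np

# Schematic of in-network aggregation: workers send gradients to an aggregator
# (a switch, or in the papers below a SmartNIC), which sums and broadcasts back.
def aggregator_allreduce(worker_grads):
    aggregated = np.sum(worker_grads, axis=0)          # reduction happens "in the network"
    return [aggregated.copy() for _ in worker_grads]   # broadcast result to every worker

grads = [np.ones(4) * rank for rank in range(4)]
print(aggregator_allreduce(grads)[0])  # [6. 6. 6. 6.] on every worker
```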
Therefore, the papers introduced below explore **the use of SmartNICs, modern programmable network devices, as aggregators instead of traditional network switches**.
### SmartNIC for Ring-AllReduce
> Proposed Architecture
We begin by introducing ***DirectReduce*** <d-cite key="DirectReduce"></d-cite>, a technique that offloads Ring-AllReduce operations onto SmartNICs. The paper highlights several inefficiencies in the traditional Ring-AllReduce communication pattern, as outlined below.
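
For reference, the baseline communication pattern being offloaded, a ring reduce-scatter followed by a ring all-gather, can be simulated in a few lines of Python; this is the generic algorithm, not DirectReduce's SmartNIC implementation.

```python
# Pure-Python simulation of Ring-AllReduce on N workers: a reduce-scatter pass
# followed by an all-gather pass, each taking N-1 ring steps (illustrative only).
def ring_allreduce(worker_chunks):
    n = len(worker_chunks)
    data = [list(c) for c in worker_chunks]          # data[worker][chunk]
    # Reduce-scatter: each worker forwards a partially reduced chunk to its
    # right-hand neighbour, which adds it to its local copy.
    for step in range(n - 1):
        sends = [(w, (w - step) % n, data[w][(w - step) % n]) for w in range(n)]
        for w, chunk, value in sends:
            data[(w + 1) % n][chunk] += value
    # All-gather: fully reduced chunks circulate around the ring.
    for step in range(n - 1):
        sends = [(w, (w + 1 - step) % n, data[w][(w + 1 - step) % n]) for w in range(n)]
        for w, chunk, value in sends:
            data[(w + 1) % n][chunk] = value
    return data

out = ring_allreduce([[1, 10, 100], [2, 20, 200], [3, 30, 300]])
print(out[0])  # [6, 60, 600] -- every worker ends with the elementwise sum
```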
<br>
### Zero-Sparse AllReduce and SmartNIC Offloading
> Proposed Architecture
The next paper, ***OmNICCL*** <d-cite key="OmNICCL"></d-cite>, introduces not only a SmartNIC offloading mechanism but also a Zero-Sparse AllReduce algorithm, which aims to reduce the overall amount of data transferred during communication. However, since this blog focuses primarily on SmartNIC-based solutions, we will briefly introduce the Zero-Sparse algorithm and then shift our attention back to the SmartNIC-related aspects.
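
Because the paper's exact protocol is not reproduced here, the snippet below only illustrates the general idea of skipping zero values before communication: gradients are compressed to (index, value) pairs, summed, and written back into a dense buffer.

```python
import numpy as np

# Generic illustration of skipping zeros before an AllReduce (not OmNICCL's
# actual Zero-Sparse protocol): send only (index, value) pairs for non-zero
# entries, reduce them, then scatter back into a dense buffer.
def compress(dense):
    idx = np.flatnonzero(dense)
    return idx, dense[idx]

def sparse_sum(compressed_list, length):
    out = np.zeros(length)
    for idx, vals in compressed_list:
        np.add.at(out, idx, vals)       # accumulate contributions per index
    return out

grads = [np.array([0.0, 1.0, 0.0, 2.0]), np.array([3.0, 0.0, 0.0, 4.0])]
print(sparse_sum([compress(g) for g in grads], length=4))  # [3. 1. 0. 6.]
```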