
Commit 8979cfa

committed
feat: final good 👍👍👍
1 parent bcaedce commit 8979cfa

1 file changed

Lines changed: 11 additions & 7 deletions

File tree

_posts/2025-06-04-Final.md

@@ -77,12 +77,12 @@ _styles: >
## Abstract
Our blog post focuses on optimizing the serving of deep learning models on large-scale servers in distributed systems, with an emphasis on improving memory efficiency and reducing latency.

- > System architecture aspect
+ > System Architecture Aspect

Large Language Models (LLMs) have mostly been developed as GPT-style models, which are built on the decoder of the Transformer. As the context length of an LLM increases, inference performance becomes highly dependent on how well the Attention operation is optimized. Accordingly, LLMs such as GPT operate in two major stages, Prefill and Decoding, each with distinct computational characteristics: the input prompt is processed all at once in the Prefill stage, and afterwards each batched request maintains its own KV Cache throughout Decoding. Repeatedly loading these caches introduces overhead, resulting in a memory bottleneck.<d-cite key="PIMIsAllYouNeed, neupims"></d-cite>
In particular, the KV Cache grows with the sequence length and becomes a key source of memory bandwidth bottlenecks. **As the size of the prompt increases, the memory load during the Attention operation introduces** significant overhead in LLM serving. This blog, based on NeuPIMs and the paper PIM is All You Need, presents a new perspective that ***computational characteristics such as GEMM and GEMV operations, as well as the layer structure, should be carefully segmented and treated with distinct batching strategies depending on the type of accelerator used***.<d-cite key="GEMM_GEMV"></d-cite>
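To make this distinction concrete, below is a minimal NumPy sketch of our own (single attention head, arbitrary sizes, no masking, hypothetical dimensions): the Prefill stage processes the whole prompt with matrix-matrix products (GEMM), while every Decoding step issues matrix-vector products (GEMV) against a KV Cache that grows by one entry per generated token.

```python
import numpy as np

d, prompt_len = 64, 512                           # illustrative sizes
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# --- Prefill: all prompt tokens at once -> matrix-matrix (GEMM) work ---
X = rng.standard_normal((prompt_len, d))          # prompt embeddings
Q, K, V = X @ Wq, X @ Wk, X @ Wv                  # GEMMs
ctx = softmax(Q @ K.T / np.sqrt(d)) @ V           # GEMMs
kv_cache = (K, V)                                 # kept per request

# --- Decoding: one new token per step -> matrix-vector (GEMV) work ---
x_t = rng.standard_normal(d)                      # current token embedding
q_t, k_t, v_t = x_t @ Wq, x_t @ Wk, x_t @ Wv      # GEMVs
K = np.vstack([kv_cache[0], k_t])                 # the KV Cache grows every step
V = np.vstack([kv_cache[1], v_t])
attn = softmax(K @ q_t / np.sqrt(d))              # GEMV over the entire cache
out_t = attn @ V                                  # another cache-sized GEMV
kv_cache = (K, V)

print(ctx.shape, out_t.shape)                     # (512, 64) (64,)
```

Because each decode step touches the entire cache while performing only a couple of operations per loaded element, its cost is dominated by memory traffic rather than compute.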

- > Network aspect
+ > Network Aspect

As deep learning models, including large language models (LLMs), continue to grow in scale, it has become increasingly difficult to train them on a single GPU. This has led to a growing interest in **Distributed Deep Learning** (DDL), which enables models to be trained in parallel across multiple hardware devices. While DDL offers clear advantages in scalability, it also introduces a critical challenge: communication overhead between devices. In particular, **intra-node communication** (GPU-to-GPU within a single server) can be handled efficiently over high-bandwidth interconnects using communication libraries such as NVIDIA's NCCL (NVIDIA Collective Communication Library). However, **inter-node communication** (between GPU servers) often relies on Ethernet-based connections, which are inherently limited by physical bandwidth and latency constraints.
To address these limitations, intelligent network interface cards (SmartNICs) have emerged as a promising solution.<d-cite key="OmNICCL, DirectReduce, FPGANIC, OptimusNIC, SqueezeNIC"></d-cite> In this blog post, we explore ***how recent research suggests that SmartNICs can be leveraged to optimize communication overhead between system nodes in distributed deep learning environments***.
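For readers less familiar with collective communication, the snippet below is a generic illustration of the AllReduce collective that dominates gradient synchronization in DDL. It is our own minimal sketch using PyTorch's `torch.distributed` with the CPU-only `gloo` backend and an arbitrary localhost rendezvous address, not the setup used in the papers above; SmartNIC-based approaches target exactly this kind of collective.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int):
    # Join the process group; real DDL training would use the NCCL backend on GPUs.
    os.environ["MASTER_ADDR"] = "127.0.0.1"   # arbitrary local rendezvous
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Stand-in for a local gradient shard: every rank holds a different tensor.
    grad = torch.full((4,), float(rank))

    # AllReduce sums the tensors across ranks; this collective is the
    # communication step whose overhead the SmartNIC work tries to hide.
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)

    print(f"rank {rank}: {grad.tolist()}")    # identical result on every rank
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 4
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```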
@@ -128,7 +128,7 @@ Therefore, GPUs and NPUs, which are optimized for high compute intensity, perfor
GPUs and NPUs operate more efficiently when arithmetic intensity is high. Thus, while they are well suited for high-intensity operations like GEMM, they tend to show low utilization for matrix-vector multiplications such as GEMV, whose arithmetic intensity does not grow with matrix size.
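A rough back-of-envelope estimate (our own, assuming square N x N operands in FP16 and counting only compulsory reads and writes) shows why: GEMM's arithmetic intensity grows linearly with N, while GEMV's stays near one FLOP per byte regardless of matrix size, leaving the compute units starved for data.

```python
# Rough arithmetic-intensity estimate (FLOPs per byte moved), assuming FP16
# operands (2 bytes) and counting only compulsory reads/writes of the operands.
BYTES = 2

def gemm_intensity(n: int) -> float:
    flops = 2 * n**3                      # C = A @ B: n^2 dot products of length n
    traffic = 3 * n**2 * BYTES            # read A and B, write C, each once
    return flops / traffic

def gemv_intensity(n: int) -> float:
    flops = 2 * n**2                      # y = A @ x
    traffic = (n**2 + 2 * n) * BYTES      # read A and x, write y
    return flops / traffic

for n in (1024, 4096):
    print(f"n={n}: GEMM ~{gemm_intensity(n):.0f} FLOP/B, GEMV ~{gemv_intensity(n):.2f} FLOP/B")
# GEMM intensity keeps growing with n; GEMV stays near 1 FLOP/B, i.e. memory-bound.
```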

### Explanation of Detailed Operations in LLM Model
- > transformer & GPT
+ > Transformer & GPT

{% include figure.html path="assets/img/2025-06-04-Final/transformer.png" class="col-10" %}

@@ -293,6 +293,8 @@ Based on the distinct computational characteristics of prefill and decoding phas

{% include figure.html path="assets/img/2025-06-04-Final/NeuPIMs.png" class="col-10" %}

+ > Proposed Architecture
+
NeuPIMs <d-cite key="neupims"></d-cite> addresses a key limitation of traditional PIM architectures, where memory mode and PIM mode (for GEMV operations) could not be executed simultaneously. To overcome this, NeuPIMs introduces an architecture that integrates a lightweight NPU and advanced PIM within the same chip, enabling efficient processing of decoding attention operations.

In particular, traditional PIM units are located near memory and share the same buffer for both memory load operations and GEMV computations, making concurrent execution infeasible. To resolve this, NeuPIMs implements a dual-buffer system, allowing memory loading and GEMV execution to occur in parallel, thereby improving decoding efficiency and overall throughput.
@@ -304,7 +306,7 @@ By employing a dual-buffer system, NeuPIMs enables the batching of N requests in

This overlapping of memory-bound and compute-bound workloads allows NeuPIMs to effectively mitigate both memory and compute bottlenecks, leading to improved parallelism and higher overall throughput in LLM serving.
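The sketch below is a toy illustration of our own (Python threads standing in for the NPU and PIM units, with made-up sizes; it is not NeuPIMs' actual scheduler or microarchitecture): the batch is split into two sub-batches, and while one sub-batch's compute-bound projection/FFN GEMM runs on the "NPU" side, the other sub-batch's memory-bound attention GEMVs over their KV caches run on the "PIM" side.

```python
# Toy model of sub-batch overlapping (threads stand in for the NPU / PIM units).
from concurrent.futures import ThreadPoolExecutor
import numpy as np

rng = np.random.default_rng(0)
d, ffn, cache_len, half = 64, 256, 2048, 8        # illustrative sizes

W = rng.standard_normal((d, ffn))                 # stand-in projection/FFN weight

def npu_gemm(acts):                               # compute-bound: GEMM on activations
    return acts @ W

def pim_gemv(kv_caches, queries):                 # memory-bound: GEMV per request's cache
    return np.stack([kv @ q for kv, q in zip(kv_caches, queries)])

sub_a = rng.standard_normal((half, d))            # sub-batch A: runs QKV/FFN GEMMs
sub_b_q = rng.standard_normal((half, d))          # sub-batch B: runs decode attention
sub_b_kv = [rng.standard_normal((cache_len, d)) for _ in range(half)]

with ThreadPoolExecutor(max_workers=2) as pool:   # the two workloads overlap in time
    gemm_future = pool.submit(npu_gemm, sub_a)
    gemv_future = pool.submit(pim_gemv, sub_b_kv, sub_b_q)
    print(gemm_future.result().shape, gemv_future.result().shape)  # (8, 256) (8, 2048)
```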

- > NeuPIMs Results
+ > Evaluation Results

{% include figure.html path="assets/img/2025-06-04-Final/NeuPIMs_result.png" class="col-10" %}

@@ -317,6 +319,8 @@ This overlapping of memory-bound and compute-bound workloads allows NeuPIMs to e

### PIM is All You Need

+ > Proposed Architecture
+
The paper <d-cite key="PIMIsAllYouNeed"></d-cite> presents an architecture designed to address the increasing context length in LLMs by leveraging the high energy efficiency of PIM compared to GPUs and TPUs. In this architecture, PIM units are responsible for GEMV operations, while a custom-designed low-power PNM (Processing-Near-Memory) device, placed near the DRAM controller, handles GEMM computations.

The proposed PNM is not limited to GEMM; it also includes lightweight components such as reduce trees for softmax, exponent processors, and RISC-V cores to support essential functions like activation operations (e.g., GeLU, ReLU). This co-design enables efficient and low-power LLM serving by distributing tasks to specialized near-memory processing elements.
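As a functional illustration (our own Python sketch, not the paper's hardware design), the softmax used in attention decomposes naturally into exactly these building blocks: a max reduce for numerical stability, an elementwise exponent stage, a sum reduce, and a final normalization, with activations such as GeLU handled as a separate elementwise kernel.

```python
import numpy as np

def softmax_stages(x: np.ndarray) -> np.ndarray:
    """Softmax spelled out as the stages a near-memory unit could map onto
    dedicated blocks: max reduce -> elementwise exponent -> sum reduce -> scale."""
    m = x.max()               # reduce tree: maximum (for numerical stability)
    e = np.exp(x - m)         # exponent processor: elementwise exp
    s = e.sum()               # reduce tree: sum of exponentials
    return e / s              # elementwise normalization

def gelu(x: np.ndarray) -> np.ndarray:
    """tanh-approximation GeLU, the kind of activation a small RISC-V core can run."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

scores = np.random.default_rng(0).standard_normal(16)
print(softmax_stages(scores).sum())       # ~1.0
print(gelu(np.array([-1.0, 0.0, 1.0])))   # approximately [-0.159, 0., 0.841]
```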
@@ -327,7 +331,7 @@ The proposed PNM is not limited to GEMM; it also includes lightweight components
In NeuPIMs, all operations except for the GEMV in decoding are handled by the NPU. In contrast, PIM is All You Need takes a different approach: it offloads all operations except for attention to the PNM device, which is placed near the DRAM controller. These operations are then executed by broadcasting and gathering data across multiple devices, enabling efficient distributed execution across a network of lightweight, near-memory processing units.


- > PIM is ALL you need Results
+ > Evaluation Results

{% include figure.html path="assets/img/2025-06-04-Final/Results.png" class="col-10" %}
- Models: Llama2-70B
@@ -355,7 +359,7 @@ As model sizes and datasets continue to grow, and as server systems scale accord
Traditional INA techniques utilize network switches as aggregators to offload and accelerate collective communication operations such as AllReduce. However, despite their potential, network switches are not well-suited for high-performance computing (HPC) environments.<d-cite key="DirectReduce"></d-cite>
Therefore, the papers introduced below explore **the use of SmartNICs, modern programmable network devices, as aggregators instead of traditional network switches**.
### SmartNIC for Ring-AllReduce
- >Proposed technique
+ > Proposed Architecture

We begin by introducing ***DirectReduce*** <d-cite key="DirectReduce"></d-cite>, a technique that offloads Ring-AllReduce operations onto SmartNICs. The paper highlights several inefficiencies in the traditional Ring-AllReduce communication pattern, as outlined below.
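Before turning to those inefficiencies, the following is a minimal single-process simulation of our own (P ranks, P chunks) of the textbook Ring-AllReduce pattern: P-1 reduce-scatter steps followed by P-1 all-gather steps, after which every rank holds the fully reduced tensor. This is the communication pattern that DirectReduce offloads onto SmartNICs.

```python
import numpy as np

def ring_allreduce(chunks_per_rank):
    """Single-process simulation: P ranks, each holding P chunks. After P-1
    reduce-scatter steps and P-1 all-gather steps, every rank has the full sum."""
    P = len(chunks_per_rank)
    data = [list(chunks) for chunks in chunks_per_rank]    # data[rank][chunk]

    # Reduce-scatter: at step s, rank r sends chunk (r - s) to rank (r + 1),
    # which adds it to its own copy of that chunk.
    for s in range(P - 1):
        for r in range(P):
            c = (r - s) % P
            data[(r + 1) % P][c] = data[(r + 1) % P][c] + data[r][c]

    # All-gather: at step s, rank r forwards the fully reduced chunk (r + 1 - s)
    # to rank (r + 1), which overwrites its stale copy.
    for s in range(P - 1):
        for r in range(P):
            c = (r + 1 - s) % P
            data[(r + 1) % P][c] = data[r][c].copy()

    return data

# 4 ranks, 4 chunks of 2 elements each; chunk c on rank r starts as r + 10*c.
ranks = [[np.full(2, r + 10 * c, dtype=float) for c in range(4)] for r in range(4)]
out = ring_allreduce(ranks)
print(out[0][0], out[3][2])   # every rank ends with [6. 6.] ... [86. 86.] etc.
```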

@@ -418,7 +422,7 @@ The findings can be summarized in the following table.
<br>

### Zero-Sparse AllReduce and SmartNIC offloading
- >Proposed technique
+ > Proposed Architecture

The next paper, ***OmNICCL*** <d-cite key="OmNICCL"></d-cite>, introduces not only an offloading mechanism to SmartNICs but also proposes a Zero-Sparse AllReduce algorithm, which aims to reduce the overall amount of data transferred during communication. However, since this blog focuses primarily on SmartNIC-based solutions, we will briefly introduce the Zero-Sparse algorithm and then shift our attention back to the SmartNIC-related aspects.
* **Zero-Sparse AllReduce**
