## Abstract
Our blog post focuses on optimizing the serving of deep learning models on large-scale servers in distributed systems, with an emphasis on improving memory efficiency and reducing latency.
> System Architecture Aspect
Large Language Models (LLMs) have mostly been developed in the GPT style, built on the Decoder part of the Transformer. As the context length of an LLM grows, inference performance becomes highly dependent on how well the Attention operation is optimized. LLMs such as GPT therefore operate in two major stages, Prefill and Decoding, each with distinct computational characteristics. For example, the prompt tokens are summarized into a KV Cache during the Prefill stage, and during Decoding each batched request maintains its own KV Cache. Loading these caches on every step introduces overhead, resulting in a memory bottleneck.<d-cite key="PIMIsAllYouNeed, neupims"></d-cite>
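
To make the two stages concrete, here is a minimal, framework-agnostic sketch of the generation loop; the `prefill` and `decode_step` methods and the KV Cache handling are illustrative placeholders, not code from either paper.

```python
# Illustrative sketch of the two-stage LLM generation loop (not from the papers).
def generate(model, prompt_tokens, max_new_tokens):
    # Prefill: process the whole prompt at once (GEMM-heavy, compute-bound)
    # and build this request's KV Cache.
    kv_cache, logits = model.prefill(prompt_tokens)
    output = []
    token = logits.argmax()
    for _ in range(max_new_tokens):
        output.append(token)
        # Decoding: one token at a time (GEMV-heavy, memory-bound).
        # Every step must re-load the growing KV Cache from memory.
        logits, kv_cache = model.decode_step(token, kv_cache)
        token = logits.argmax()
    return output
```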
In particular, the KV Cache grows with the sequence length and becomes a key source of memory bandwidth bottlenecks. **As the size of the prompt increases, the memory load during the Attention operation introduces** significant overhead in LLM serving. This blog, based on NeuPIMs and the paper PIM is All You Need, presents a new perspective: ***workloads should be carefully partitioned by their computational characteristics, such as GEMM versus GEMV operations and layer structure, and handled with distinct batching strategies depending on the type of accelerator used***.<d-cite key="GEMM_GEMV"></d-cite>
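
As a rough sanity check on why this load dominates, the sketch below estimates the KV Cache footprint of a single request; the configuration is an illustrative Llama2-70B-like setting (80 layers, 8 KV heads under grouped-query attention, head dimension 128), not a measurement from the cited papers.

```python
# Back-of-the-envelope KV Cache size for one request (FP16 storage).
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, dtype_bytes=2):
    # 2x for Key and Value, stored for every layer and every cached token.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes

# Example: a Llama2-70B-like configuration with an 8K-token context.
size = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=8192)
print(f"{size / 2**30:.2f} GiB per request")  # ~2.5 GiB, re-read on every decode step
```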
> Network Aspect
As deep learning models, including large language models (LLMs), continue to grow in scale, it has become increasingly difficult to train them on a single GPU. This has led to a growing interest in **Distributed Deep Learning** (DDL), which enables models to be trained in parallel across multiple hardware devices. While DDL offers clear advantages in scalability, it also introduces a critical challenge: communication overhead between devices. In particular, **intra-node communication** (GPU-to-GPU within a single server) can be handled efficiently using high-performance communication libraries such as NVIDIA's NCCL (NVIDIA Collective Communication Library). However, **inter-node communication** (between GPU servers) often relies on Ethernet-based connections, which are inherently limited by physical bandwidth and latency constraints.
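
For context, the gradient synchronization that DDL relies on is usually expressed as a collective operation such as AllReduce; a minimal PyTorch sketch using the NCCL backend might look like the following. The tensor is a stand-in for local gradients, and the process-group environment (RANK, WORLD_SIZE, MASTER_ADDR, ...) is assumed to be provided by a launcher such as torchrun.

```python
import torch
import torch.distributed as dist

# Minimal sketch: sum-AllReduce of local gradients with the NCCL backend.
dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

grads = torch.randn(1024, device="cuda")      # stand-in for local gradients
dist.all_reduce(grads, op=dist.ReduceOp.SUM)  # fast over NVLink within a node,
grads /= dist.get_world_size()                # Ethernet-bound across nodes

dist.destroy_process_group()
```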
To address these limitations, intelligent network interface cards (SmartNICs) have emerged as a promising solution.<d-cite key="OmNICCL, DirectReduce, FPGANIC, OptimusNIC, SqueezeNIC"></d-cite> In this blog post, we explore ***how recent research suggests that SmartNICs can be leveraged to optimize communication overhead between system nodes in distributed deep learning environments***.
GPUs and NPUs operate more efficiently when arithmetic intensity is high. Thus, while they are well suited to high-intensity operations such as GEMM, they tend to show low utilization on matrix-vector multiplications such as GEMV.
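
A quick way to see the gap is to compare arithmetic intensity, i.e. FLOPs per byte of operand traffic; the sketch below uses simplified operand counts and FP16 storage, and the matrix dimensions are illustrative.

```python
# Simplified arithmetic intensity: FLOPs per byte of operand traffic (FP16).
def intensity(m, k, n, dtype_bytes=2):
    flops = 2 * m * k * n                              # multiply-accumulate count
    bytes_moved = (m * k + k * n + m * n) * dtype_bytes
    return flops / bytes_moved

print(intensity(4096, 4096, 4096))  # GEMM: ~1365 FLOPs/byte -> compute-bound
print(intensity(1, 4096, 4096))     # GEMV: ~1 FLOP/byte     -> memory-bound
```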
### Explanation of Detailed Operations in LLM Models
> Transformer & GPT
{% include figure.html path="assets/img/2025-06-04-Final/transformer.png" class="col-10" %}
{% include figure.html path="assets/img/2025-06-04-Final/NeuPIMs.png" class="col-10" %}
> Proposed Architecture
NeuPIMs <d-cite key="neupims"></d-cite> addresses a key limitation of traditional PIM architectures, where memory mode and PIM mode (for GEMV operations) could not be executed simultaneously. To overcome this, NeuPIMs introduces an architecture that integrates a lightweight NPU and advanced PIM within the same chip, enabling efficient processing of decoding attention operations.
In particular, traditional PIM units are located near memory and share the same buffer for both memory load operations and GEMV computations, making concurrent execution infeasible. To resolve this, NeuPIMs implements a dual-buffer system, allowing memory loading and GEMV execution to occur in parallel, thereby improving decoding efficiency and overall throughput.
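
The intuition behind the dual-buffer design can be illustrated with a generic ping-pong buffering loop; this is only a conceptual software sketch of overlapping loads with compute, not NeuPIMs' hardware interface, and `load_kv_tile` and `gemv` are hypothetical placeholders.

```python
from concurrent.futures import ThreadPoolExecutor

# Conceptual ping-pong (double) buffering: while the GEMV for tile i runs,
# the KV tile for i+1 is loaded into the other buffer, so memory traffic
# and compute overlap in time.
def pim_gemv_stream(tiles, load_kv_tile, gemv):
    results = []
    with ThreadPoolExecutor(max_workers=1) as loader:
        pending = loader.submit(load_kv_tile, tiles[0])              # prime buffer A
        for i in range(len(tiles)):
            current = pending.result()                               # buffer holding tile i
            if i + 1 < len(tiles):
                pending = loader.submit(load_kv_tile, tiles[i + 1])  # fill buffer B
            results.append(gemv(current))                            # compute while B loads
    return results
```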
This overlapping of memory-bound and compute-bound workloads allows NeuPIMs to effectively mitigate both memory and compute bottlenecks, leading to improved parallelism and higher overall throughput in LLM serving.
> Evaluation Results
{% include figure.html path="assets/img/2025-06-04-Final/NeuPIMs_result.png" class="col-10" %}
### PIM is All You Need
> Proposed Architecture
The paper <d-cite key="PIMIsAllYouNeed"></d-cite> presents an architecture designed to address the increasing context length in LLMs by leveraging the high energy efficiency of PIM compared to GPUs and TPUs. In this architecture, PIM units are responsible for GEMV operations, while a custom-designed low-power PNM (Processing-Near-Memory) device, placed near the DRAM controller, handles GEMM computations.
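
At a high level, this amounts to routing each operator by its shape; the snippet below is a schematic of such dispatch, with `run_on_pim` and `run_on_pnm` as hypothetical stand-ins for the two device back-ends rather than an API from the paper.

```python
# Schematic operator routing (illustrative; run_on_pim/run_on_pnm are hypothetical).
def dispatch(op, run_on_pim, run_on_pnm):
    m, k, n = op.shape  # (rows of A, shared dim, cols of B)
    if m == 1 or n == 1:
        # Matrix-vector product (e.g. decoding attention): memory-bound -> PIM banks.
        return run_on_pim(op)
    # Matrix-matrix product (e.g. QKV/FFN projections): compute-bound -> PNM near
    # the DRAM controller, which also hosts softmax and activation units.
    return run_on_pnm(op)
```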
The proposed PNM is not limited to GEMM; it also includes lightweight components such as reduce trees for softmax, exponent processors, and RISC-V cores to support essential functions like activation operations (e.g., GeLU, ReLU). This co-design enables efficient and low-power LLM serving by distributing tasks to specialized near-memory processing elements.
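
To make the role of the reduce trees and exponent processors concrete, here is the standard numerically stable softmax they would accelerate, written in plain NumPy; this shows only the math, not the paper's hardware pipeline.

```python
import numpy as np

# Numerically stable softmax: the max and sum reductions map onto reduce trees,
# and the elementwise exp maps onto the exponent processors described above.
def softmax(scores):
    shifted = scores - np.max(scores, axis=-1, keepdims=True)  # reduction 1: max
    exps = np.exp(shifted)                                      # elementwise exp
    return exps / np.sum(exps, axis=-1, keepdims=True)          # reduction 2: sum

print(softmax(np.array([2.0, 1.0, 0.1])))  # -> approximately [0.659, 0.242, 0.099]
```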
In NeuPIMs, all operations except for the GEMV in decoding are handled by the NPU. In contrast, PIM is All You Need takes a different approach: it offloads all operations except for attention to the PNM device, which is placed near the DRAM controller. These operations are then executed by broadcasting and gathering data across multiple devices, enabling efficient distributed execution across a network of lightweight, near-memory processing units.
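
A simplified software picture of that broadcast-and-gather pattern is sketched below; the column-sharded weights and per-device GEMM callable are illustrative assumptions, not the paper's actual interconnect protocol.

```python
import numpy as np

# Simplified broadcast-compute-gather pattern across near-memory devices
# (illustrative; the device model is a stand-in, not the paper's interconnect).
def broadcast_gather_gemm(activations, weight_shards, local_gemm):
    partials = []
    for shard in weight_shards:                               # broadcast activations
        partials.append(local_gemm(activations, shard))       # each device computes its slice
    return np.concatenate(partials, axis=-1)                  # gather the output columns

x = np.random.randn(4, 8)
shards = [np.random.randn(8, 4) for _ in range(4)]            # column-sharded weight
y = broadcast_gather_gemm(x, shards, lambda a, w: a @ w)
print(y.shape)  # (4, 16)
```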
> Evaluation Results
{% include figure.html path="assets/img/2025-06-04-Final/Results.png" class="col-10" %}
- Models: Llama2-70B
Traditional INA techniques utilize network switches as aggregators to offload and accelerate collective communication operations such as AllReduce. However, despite their potential, network switches are not well-suited for high-performance computing (HPC) environments.<d-cite key="DirectReduce"></d-cite>
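
The aggregation pattern itself is simple to sketch: every worker ships its gradients to an aggregator, which sums them and returns the result to all workers. The toy example below is purely illustrative and involves no real network stack.

```python
import numpy as np

# Schematic of in-network aggregation: workers send gradients to an aggregator
# (a switch, or in the papers below a SmartNIC), which sums and broadcasts back.
def aggregator_allreduce(worker_grads):
    aggregated = np.sum(worker_grads, axis=0)          # reduction happens "in the network"
    return [aggregated.copy() for _ in worker_grads]   # broadcast result to every worker

grads = [np.ones(4) * rank for rank in range(4)]
print(aggregator_allreduce(grads)[0])  # [6. 6. 6. 6.] on every worker
```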
Therefore, the papers introduced below explore **the use of SmartNICs, modern programmable network devices, as aggregators instead of traditional network switches**.
### SmartNIC for Ring-AllReduce
> Proposed Architecture
We begin by introducing ***DirectReduce*** <d-cite key="DirectReduce"></d-cite>, a technique that offloads Ring-AllReduce operations onto SmartNICs. The paper highlights several inefficiencies in the traditional Ring-AllReduce communication pattern, as outlined below.
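
For reference, the baseline communication pattern being offloaded, a ring reduce-scatter followed by a ring all-gather, can be simulated in a few lines of Python; this is the generic algorithm, not DirectReduce's SmartNIC implementation.

```python
# Pure-Python simulation of Ring-AllReduce on N workers: a reduce-scatter pass
# followed by an all-gather pass, each taking N-1 ring steps (illustrative only).
def ring_allreduce(worker_chunks):
    n = len(worker_chunks)
    data = [list(c) for c in worker_chunks]          # data[worker][chunk]
    # Reduce-scatter: each worker forwards a partially reduced chunk to its
    # right-hand neighbour, which adds it to its local copy.
    for step in range(n - 1):
        sends = [(w, (w - step) % n, data[w][(w - step) % n]) for w in range(n)]
        for w, chunk, value in sends:
            data[(w + 1) % n][chunk] += value
    # All-gather: fully reduced chunks circulate around the ring.
    for step in range(n - 1):
        sends = [(w, (w + 1 - step) % n, data[w][(w + 1 - step) % n]) for w in range(n)]
        for w, chunk, value in sends:
            data[(w + 1) % n][chunk] = value
    return data

out = ring_allreduce([[1, 10, 100], [2, 20, 200], [3, 30, 300]])
print(out[0])  # [6, 60, 600] -- every worker ends with the elementwise sum
```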
<br>
### Zero-Sparse AllReduce and SmartNIC Offloading
> Proposed Architecture
The next paper, ***OmNICCL*** <d-cite key="OmNICCL"></d-cite>, introduces not only a SmartNIC offloading mechanism but also a Zero-Sparse AllReduce algorithm, which aims to reduce the overall amount of data transferred during communication. However, since this blog focuses primarily on SmartNIC-based solutions, we will briefly introduce the Zero-Sparse algorithm and then shift our attention back to the SmartNIC-related aspects.
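
Because the paper's exact protocol is not reproduced here, the snippet below only illustrates the general idea of skipping zero values before communication: gradients are compressed to (index, value) pairs, summed, and written back into a dense buffer.

```python
import numpy as np

# Generic illustration of skipping zeros before an AllReduce (not OmNICCL's
# actual Zero-Sparse protocol): send only (index, value) pairs for non-zero
# entries, reduce them, then scatter back into a dense buffer.
def compress(dense):
    idx = np.flatnonzero(dense)
    return idx, dense[idx]

def sparse_sum(compressed_list, length):
    out = np.zeros(length)
    for idx, vals in compressed_list:
        np.add.at(out, idx, vals)       # accumulate contributions per index
    return out

grads = [np.array([0.0, 1.0, 0.0, 2.0]), np.array([3.0, 0.0, 0.0, 4.0])]
print(sparse_sum([compress(g) for g in grads], length=4))  # [3. 1. 0. 6.]
```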