
Commit f9c7786

[None][feat] Add layer wise benchmarks (#8777)
Signed-off-by: Tailing Yuan <[email protected]>
1 parent f666ad2 commit f9c7786

14 files changed: +928 -0 lines changed


.gitignore

Lines changed: 1 addition & 0 deletions
@@ -82,4 +82,5 @@ compile_commands.json
 .devcontainer/docker-compose.override.yml

 # Enroot sqsh files
+enroot/sw-tensorrt-docker+*.sqsh
 enroot/tensorrt_llm.devel.sqsh
Lines changed: 95 additions & 0 deletions
@@ -0,0 +1,95 @@
# Layer-wise Benchmarks

## Generate profiles

### Run with MPI

**Step 1:** Start a container using Docker, Enroot, or another runtime. Refer to `../../jenkins/current_image_tags.properties` for the Docker image URI.

**Step 2:** In the container, install `tensorrt_llm`:

```bash
pip install -e ../..
```
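
As a quick sanity check that the editable install is visible inside the container (assuming the package exposes the usual `__version__` attribute):

```bash
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
```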

**Step 3:** In the container, run benchmarks and generate profiles:

```bash
# Run DeepSeek-R1
NP=4 ./mpi_launch.sh ./run_single.sh config_ctx.yaml
NP=4 ./mpi_launch.sh ./run_single.sh config_gen.yaml

# Run DeepSeek-V3.2-Exp
NP=4 ./mpi_launch.sh ./run_single.sh config_ctx.yaml --model deepseek-ai/DeepSeek-V3.2-Exp --tokens-per-block 64 --moe-backend DEEPGEMM
NP=4 ./mpi_launch.sh ./run_single.sh config_gen.yaml --model deepseek-ai/DeepSeek-V3.2-Exp --tokens-per-block 64 --moe-backend DEEPGEMM

# Run DeepSeek-V3.2-Exp with 32k context length
NP=4 ./mpi_launch.sh ./run_single.sh config_ctx.yaml --model deepseek-ai/DeepSeek-V3.2-Exp --tokens-per-block 64 --max-seq-len $((32768 + 1024 + 4)) --max-num-tokens $((32768 + 1024 + 4)) --moe-backend DEEPGEMM --batch-size 1 --seq-len-q 32769
NP=4 ./mpi_launch.sh ./run_single.sh config_gen.yaml --model deepseek-ai/DeepSeek-V3.2-Exp --tokens-per-block 64 --max-seq-len $((32768 + 1024 + 4)) --moe-backend DEEPGEMM --seq-len-kv-cache 32769

# Run with attention TP
NP=4 ./mpi_launch.sh ./run_single.sh config_gen.yaml --no-enable-attention-dp
NP=4 ./mpi_launch.sh ./run_single.sh config_ctx.yaml --no-enable-attention-dp

# Run with attention TP and TRTLLMGen
NP=4 TRTLLM_ENABLE_PDL=1 ./mpi_launch.sh ./run_single.sh config_ctx.yaml --no-enable-attention-dp --moe-backend TRTLLM
NP=4 TRTLLM_ENABLE_PDL=1 ./mpi_launch.sh ./run_single.sh config_gen.yaml --no-enable-attention-dp --moe-backend TRTLLM

# Run with MTP3
NP=4 ./mpi_launch.sh ./run_single.sh config_gen.yaml --batch-size 32 --seq-len-q 4

# Run 4 layers
NP=4 ./mpi_launch.sh ./run_single.sh config_ctx.yaml --layer-indices 5,6,7,8
NP=4 ./mpi_launch.sh ./run_single.sh config_gen.yaml --layer-indices 5,6,7,8

# Scale DEP=16 MNNVL down to 4 GPUs: reduces the number of experts; uses MNNVL A2A if applicable
NP=4 ./mpi_launch.sh ./run_single.sh config_gen.yaml --scaled-from 16 --moe-backend WIDEEP

# Scale TEP=16 down to 4 GPUs: reduces the number of attention heads and experts
NP=4 ./mpi_launch.sh ./run_single.sh config_gen.yaml --scaled-from 16 --no-enable-attention-dp

# Run with DeepEP A2A
NP=4 TRTLLM_FORCE_ALLTOALL_METHOD=DeepEP ./mpi_launch.sh ./run_single.sh config_ctx.yaml --moe-backend WIDEEP
NP=4 TRTLLM_FORCE_ALLTOALL_METHOD=DeepEP ./mpi_launch.sh ./run_single.sh config_gen.yaml --moe-backend WIDEEP
```
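
Each rank writes an Nsight Systems report to `profiles/run_single/run_single_ep<WORLD_SIZE>_rank<RANK>.nsys-rep` (see `run_single.sh`). Until the parsing tooling below lands, one way to take a quick look, assuming `nsys` is available on the host or in the container:

```bash
ls profiles/run_single/
# Summarize CUDA kernels and NVTX ranges for rank 0 of a 4-GPU run
nsys stats profiles/run_single/run_single_ep4_rank0.nsys-rep
```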

### Run with Slurm

> Tip: If you already have a running job with the environment installed, skip steps 1 and 2 and go straight to step 3. In that case, the job must have been started with `--container-name aaa`, and if the container name is not "layer_wise_benchmarks", run `export CONTAINER_NAME=aaa`.

**Step 1:** On the controller node, allocate one or more nodes and record the `SLURM_JOB_ID`:

```bash
SLURM_JOB_ID=$(NODES=4 TIME=02:00:00 ./slurm_alloc.sh)
```

Fill in the variables in `./slurm_alloc.sh` before running it.
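
The variables at the top of `./slurm_alloc.sh` are commented out and site-specific; the values below are placeholders, not defaults:

```bash
# Inside ./slurm_alloc.sh (placeholder values; adjust to your cluster)
ACCOUNT=my_account
PARTITION=my_partition
EXTRA_ARGS="--gres gpu:4"
```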

**Step 2:** Start a container and install `tensorrt_llm`. Run the following command on the controller node:

```bash
SLURM_JOB_ID=$SLURM_JOB_ID ./slurm_init_containers.sh
```

It uses the image recorded in `../../jenkins/current_image_tags.properties`. The image is downloaded to `../../enroot/` only once.

**Step 3:** Run benchmarks to generate profiles. Run the following commands on the controller node, where `NODES` &le; the number of allocated nodes:

```bash
# Run DeepSeek-R1 with wide EP: uses MNNVL A2A if applicable
SLURM_JOB_ID=$SLURM_JOB_ID NODES=4 NP=16 ./slurm_launch.sh ./run_single.sh config_gen.yaml --moe-backend WIDEEP

# Run with attention TP and TRTLLMGen
SLURM_JOB_ID=$SLURM_JOB_ID NODES=4 NP=16 TRTLLM_ENABLE_PDL=1 ./slurm_launch.sh ./run_single.sh config_gen.yaml --no-enable-attention-dp --moe-backend TRTLLM

# Run with DeepEPLowLatency
SLURM_JOB_ID=$SLURM_JOB_ID NODES=4 NP=16 TRTLLM_FORCE_ALLTOALL_METHOD=DeepEPLowLatency ./slurm_launch.sh ./run_single.sh config_gen.yaml --moe-backend WIDEEP

# You can run 4-GPU and 8-GPU tasks without reallocating the Slurm job
SLURM_JOB_ID=$SLURM_JOB_ID NODES=1 NP=4 ./slurm_launch.sh ./run_single.sh config_ctx.yaml
SLURM_JOB_ID=$SLURM_JOB_ID NODES=2 NP=8 ./slurm_launch.sh ./run_single.sh config_ctx.yaml
```

## Parse profiles

Coming soon.
Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
model: nvidia/DeepSeek-R1-0528-FP4-v2
layer_indices: [5]
run_type: CTX
scaled_from: null

# KV cache related args
tokens_per_block: 32
max_seq_len: 9220 # 8192 + 1024 + 4
enable_attention_dp: true

# Model init args
max_num_tokens: 20480
moe_backend: CUTLASS
use_cuda_graph: false

# Per iteration args
batch_size: 1
seq_len_q: 8193
seq_len_kv_cache: 0
balance_method: Balanced
balance_ratio: null
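
Every field in this context-phase config corresponds to a command-line flag of `run_single.py`, and a flag passed on the command line takes precedence over the YAML value. A small sketch with illustrative override values:

```bash
# Override selected YAML fields from the command line (values are illustrative)
NP=4 ./mpi_launch.sh ./run_single.sh config_ctx.yaml --batch-size 2 --seq-len-q 4097 --moe-backend WIDEEP
```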
Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
model: nvidia/DeepSeek-R1-0528-FP4-v2
layer_indices: [5]
run_type: GEN
scaled_from: null

# KV cache related args
tokens_per_block: 32
max_seq_len: 9220 # 8192 + 1024 + 4
enable_attention_dp: true

# Model init args
max_num_tokens: 4096 # MTP3 as max
moe_backend: CUTLASS
use_cuda_graph: true

# Per iteration args
batch_size: 128
seq_len_q: 1 # Set to (1 + MTP)
seq_len_kv_cache: 8193
balance_method: Balanced
balance_ratio: null
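
In this generation-phase config, `seq_len_q` is the number of tokens processed per request per step, i.e. 1 plus the number of MTP draft tokens (hence the `MTP3 as max` sizing comment on `max_num_tokens`). For example, a decode step with MTP3, matching the README invocation:

```bash
# Decode step with 3 draft tokens per request (seq_len_q = 1 + 3)
NP=4 ./mpi_launch.sh ./run_single.sh config_gen.yaml --batch-size 32 --seq-len-q 4
```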
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
#!/bin/bash

set -euo pipefail

# Clear inherited Slurm and MPI environment variables
unset $(env | grep -i slurm | awk -F'=' '{print $1}')
unset $(env | grep MPI | awk -F'=' '{print $1}')

set -x
mpirun --allow-run-as-root --np ${NP} "$@"
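
For reference, this launcher is invoked as in the README, with `NP` giving the number of MPI ranks:

```bash
NP=4 ./mpi_launch.sh ./run_single.sh config_ctx.yaml
```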
Lines changed: 159 additions & 0 deletions
@@ -0,0 +1,159 @@
import argparse

import numpy as np
import nvtx
import torch
import yaml

from tensorrt_llm._torch.autotuner import AutoTuner, autotune
from tensorrt_llm._torch.modules.multi_stream_utils import with_multi_stream
from tensorrt_llm._utils import local_mpi_rank, mpi_rank, mpi_world_size
from tensorrt_llm.tools.layer_wise_benchmarks.deepseekv3_runner import (
    BalanceMethod, DeepSeekV3Runner)


def comma_separated_ints(s):
    return [int(x) for x in s.split(",")]


# Parse cmdline
parser = argparse.ArgumentParser()
parser.add_argument("config_path", type=str)
parser.add_argument("--model", type=str, help="Pretrained model name or path")
parser.add_argument(
    "--layer-indices",
    type=comma_separated_ints,
    help="Comma separated indices of layers, should be a contiguous range")
parser.add_argument("--run-type", type=str, choices=["CTX", "GEN"])
parser.add_argument("--scaled-from", type=int)
# KV cache related args
parser.add_argument("--tokens-per-block", type=int)
parser.add_argument("--max-seq-len", type=int)
group = parser.add_mutually_exclusive_group(required=False)
group.add_argument("--enable-attention-dp",
                   action="store_true",
                   dest="enable_attention_dp")
group.add_argument("--no-enable-attention-dp",
                   action="store_false",
                   dest="enable_attention_dp")
parser.set_defaults(enable_attention_dp=None)
# Model init args
parser.add_argument("--max-num-tokens", type=int)
parser.add_argument("--moe-backend", type=str)
group = parser.add_mutually_exclusive_group(required=False)
group.add_argument("--use-cuda-graph",
                   action="store_true",
                   dest="use_cuda_graph")
group.add_argument("--no-use-cuda-graph",
                   action="store_false",
                   dest="use_cuda_graph")
parser.set_defaults(use_cuda_graph=None)
# Per iteration args
parser.add_argument("--batch-size", type=int)
parser.add_argument("--seq-len-q", type=int)
parser.add_argument("--seq-len-kv-cache", type=int)
parser.add_argument("--balance-method", type=str)
parser.add_argument("--balance-ratio", type=float)
args = parser.parse_args()
with open(args.config_path) as f:
    config = yaml.safe_load(f)
del args.config_path
for k, v in vars(args).items():
    if v is None:
        setattr(args, k, config[k])
print(args)

# MPI args
rank = mpi_rank()
world_size = mpi_world_size()
local_rank = local_mpi_rank()
torch.cuda.set_device(local_rank)

# Create KV cache manager
mapping = DeepSeekV3Runner.create_mapping(
    enable_attention_dp=args.enable_attention_dp)
max_batch_size = 2048
kv_cache_manager = DeepSeekV3Runner.create_kv_cache_manager(
    args.model,
    mapping,
    tokens_per_block=args.tokens_per_block,
    max_batch_size=max_batch_size,
    max_seq_len=args.max_seq_len,
    layer_indices=args.layer_indices)
attn_workspace = torch.empty((0, ), device="cuda", dtype=torch.int8)

# Create other global objects
AutoTuner.get().clear_cache()
capture_stream = torch.cuda.Stream()

# Create Runner
runner = DeepSeekV3Runner(args.model,
                          mapping,
                          moe_backend=args.moe_backend,
                          layer_indices=args.layer_indices,
                          scaled_from=args.scaled_from,
                          max_seq_len=args.max_seq_len,
                          max_num_tokens=args.max_num_tokens,
                          use_cuda_graph=args.use_cuda_graph)

# Warm up
assert args.batch_size <= max_batch_size
assert args.seq_len_q + args.seq_len_kv_cache <= args.max_seq_len
run_pack = runner.create_run_pack(args.run_type,
                                  batch_size=args.batch_size,
                                  seq_len_q=args.seq_len_q,
                                  seq_len_kv_cache=args.seq_len_kv_cache,
                                  kv_cache_manager=kv_cache_manager,
                                  attn_workspace=attn_workspace)
runner.replace_routing_method(balance_method=BalanceMethod[args.balance_method],
                              balance_ratio=args.balance_ratio)
capture_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(capture_stream):
    run_pack()
    with autotune():
        run_pack()
torch.cuda.current_stream().wait_stream(capture_stream)
torch.cuda.synchronize()

# Profile: capture graph and replay it
torch.cuda.cudart().cudaProfilerStart()
if args.use_cuda_graph:
    with with_multi_stream(True):
        g = torch.cuda.CUDAGraph()
        with torch.cuda.graph(g,
                              stream=capture_stream,
                              capture_error_mode="global"):
            run_pack()

warmup_times = 20
run_times = 100
events = [
    torch.cuda.Event(enable_timing=True)
    for _ in range(warmup_times + run_times + 1)
]
for i in range(warmup_times + run_times):
    events[i].record()
    with nvtx.annotate(
            f"b={args.batch_size} s={args.seq_len_q} EP{world_size}"):
        if args.use_cuda_graph:
            g.replay()
        else:
            run_pack()
events[-1].record()
torch.cuda.synchronize()

# Print statistics
# Print before `cudaProfilerStop` to ensure messages are included in the profile
time_list = [
    start.elapsed_time(stop) for start, stop in zip(events, events[1:])
]
time_list = time_list[warmup_times:]
print(f"[RANK {rank}]"
      f" min {np.min(time_list) * 1000:.1f}"
      f" max {np.max(time_list) * 1000:.1f}"
      f" mean {np.mean(time_list) * 1000:.1f}"
      f" median {np.median(time_list) * 1000:.1f}"
      f" P90 {np.percentile(time_list, 90) * 1000:.1f}"
      f" (us)")

torch.cuda.cudart().cudaProfilerStop()
Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
#!/bin/bash

set -euo pipefail

if [ -v OMPI_COMM_WORLD_SIZE ]; then
    export WORLD_SIZE=$OMPI_COMM_WORLD_SIZE
    export RANK=$OMPI_COMM_WORLD_RANK
    export LOCAL_RANK=$OMPI_COMM_WORLD_LOCAL_RANK
    export NODE_RANK=$OMPI_COMM_WORLD_NODE_RANK
fi

if [ "$RANK" -eq 0 ]; then
    export TLLM_LOG_LEVEL=INFO
fi

PROFILE=${PROFILE:-1}
GPU_METRICS=${GPU_METRICS:-0}
if [ "$PROFILE" -eq 1 ]; then
    PROFILE_FOLDER=profiles/run_single
    mkdir -p ${PROFILE_FOLDER}
    PROFILE_CMD="nsys profile
        -t cuda,nvtx -s none
        --cpuctxsw none --cuda-event-trace false
        --cuda-graph-trace node
        -c cudaProfilerApi --capture-range-end stop
        -o ${PROFILE_FOLDER}/run_single_ep${WORLD_SIZE}_rank${RANK}.nsys-rep
        --force-overwrite true"
    if [ "$GPU_METRICS" -eq 1 ]; then
        PROFILE_CMD+=" --gpu-metrics-devices $LOCAL_RANK
            --gpu-metrics-frequency 10000"
    fi
else
    PROFILE_CMD=
fi

set -x
$PROFILE_CMD python3 -u run_single.py "$@"
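
The wrapper honors two toggles: `PROFILE` (default 1) wraps the run in `nsys profile`, and `GPU_METRICS` (default 0) additionally samples GPU metrics. A sketch following the same environment-variable pattern as the README examples:

```bash
# Timing-only run without Nsight Systems
PROFILE=0 NP=4 ./mpi_launch.sh ./run_single.sh config_gen.yaml

# Profile with GPU metrics sampling enabled
GPU_METRICS=1 NP=4 ./mpi_launch.sh ./run_single.sh config_gen.yaml
```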
Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
#!/bin/bash

set -euo pipefail

# ACCOUNT=
# PARTITION=
# EXTRA_ARGS="--gres gpu:4"
TIME=${TIME:-01:00:00}

set -x
salloc -A "$ACCOUNT" \
    -p "$PARTITION" \
    -N "$NODES" \
    --segment "$NODES" \
    $EXTRA_ARGS \
    -t "$TIME" \
    --no-shell \
    2>&1 \
    | tee >(cat >&2) \
    | awk '/Granted job allocation/ {print $NF}'
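
The final `awk` filter prints only the job ID from salloc's "Granted job allocation" message (the full output is still echoed to stderr via `tee`), so the script's stdout can be captured directly, as the README does:

```bash
SLURM_JOB_ID=$(NODES=4 TIME=02:00:00 ./slurm_alloc.sh)
echo "$SLURM_JOB_ID"
```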
