🚀 The feature, motivation and pitch
Starting from a baseline of 1150 tokens per die for large-EP inference, per-die throughput improves by roughly 10% in the AFD decoupled (attention/FFN disaggregated) deployment scenario.
Test configuration
```bash
bash infer.sh --MODEL_DIR=/xxx/deepseekv3-lite-base-latest_bugtest \
  --EXE_MODE="dynamo" \
  --FFN_MODE="dynamo" \
  --BATCH_SIZE=864 \
  --LAYER_NUM=10 \
  --WORLD_SIZE=16 \
  --ATTN_DIES=12 \
  --NEXT_N=1 \
  --N_ROUTED_EXPERTS_PER_RANK=3 \
  --REMAINDER_ROUTER_EXPERT=0 \
  --EXPERTS_SHARE_NUM_COPY=1 \
  --DENSE_TP_SIZE=4 \
  --ON_CLOUD=0 \
  --ENABLE_CACHE_COMPILE=0 \
  --ENABLE_SUPERKERNEL=0 \
  --ENABLE_PREFETCH=1 \
  --LAYER_OUT="FA" \
  --ACTUAL_SEQ_LEN=4096
```
With this die-ratio allocation, a throughput improvement of about 10% is achieved, reaching approximately 1260+ tokens per die.
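For reference, below is a minimal sketch of the die split and expert placement these flags appear to imply. It assumes `WORLD_SIZE` is the total number of dies, `ATTN_DIES` of which serve attention while the rest serve the FFN/MoE side, and that each FFN rank hosts `N_ROUTED_EXPERTS_PER_RANK` routed experts plus `EXPERTS_SHARE_NUM_COPY` copies of the shared expert; these semantics are inferred from the flag names, not confirmed by `infer.sh`.

```python
# Hypothetical sanity check for the AFD die split implied by the flags above.
# Parameter semantics are assumed from the flag names, not taken from infer.sh.

def afd_split(world_size: int, attn_dies: int,
              n_routed_experts_per_rank: int,
              remainder_router_expert: int,
              experts_share_num_copy: int) -> dict:
    """Derive the attention/FFN die split and routed-expert placement."""
    ffn_dies = world_size - attn_dies              # dies left for the FFN/MoE side
    routed_experts = (ffn_dies * n_routed_experts_per_rank
                      + remainder_router_expert)   # total routed experts hosted
    return {
        "attn_dies": attn_dies,
        "ffn_dies": ffn_dies,
        "attn_to_ffn_ratio": attn_dies / ffn_dies,
        "routed_experts_total": routed_experts,
        "shared_expert_copies": experts_share_num_copy,
    }

# Values from the test configuration above:
print(afd_split(world_size=16, attn_dies=12,
                n_routed_experts_per_rank=3,
                remainder_router_expert=0,
                experts_share_num_copy=1))
# -> 12 attention dies : 4 FFN dies (a 3:1 split), 12 routed experts in total
```

Under these assumptions, the 12:4 (3:1) attention-to-FFN split is the "ratio allocation" referred to above, and 1150 × 1.10 ≈ 1265 tokens per die is consistent with the reported ~1260+.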
Alternatives
No response
Additional context
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.