[Feature] Consolidate performance benchmark datasets #14036

Open: wants to merge 8 commits into main
Conversation

@JenZhao (Contributor) commented on Feb 28, 2025

Addressing #13351

Benchmark Serving Results

after the change

| Dataset | Backend | Successful requests | Benchmark duration (s) | Total input tokens |
|---|---|---|---|---|
| sonnet | openai-chat | 1000 | 30.94 | 546875 |
| hf-vision-arena | openai-chat | 500 | 60.53 | 33418 |
| hf | openai-chat | 1000 | 85.57 | 11428 |
| sonnet | vllm | 1000 | 30.46 | 546875 |
| sharegpt | vllm | 1000 | 35.16 | 217393 |
| random | vllm | 1000 | 42.69 | 1024000 |
| burstgpt | vllm | 1000 | 102.80 | 768960 |

before the change

| Dataset | Backend | Successful requests | Benchmark duration (s) | Total input tokens |
|---|---|---|---|---|
| sonnet | openai-chat | 1000 | 30.24 | 546875 |
| hf-vision-arena | openai-chat | 500 | 59.43 | 33418 |
| hf | openai-chat | 1000 | 85.70 | 11428 |
| sonnet | vllm | 1000 | 29.61 | 546875 |
| sharegpt | vllm | 1000 | 41.31 | 217393 |
| random | vllm | 1000 | 51.89 | 1024000 |
| burstgpt | vllm | 1000 | 99.73 | 768960 |
```bash
MODEL_NAME="Qwen/Qwen2-VL-7B-Instruct"
python3 benchmarks/benchmark_serving.py --backend openai-chat --model ${MODEL_NAME} --endpoint /v1/chat/completions --dataset-name sonnet --dataset-path benchmarks/sonnet.txt --num-prompts ${NUM_PROMPTS}
python3 benchmarks/benchmark_serving.py --model ${MODEL_NAME} --backend openai-chat --endpoint /v1/chat/completions --dataset-name hf --dataset-path lmarena-ai/vision-arena-bench-v0.1 --hf-split train --num-prompts ${NUM_PROMPTS} --request-rate 1000 --percentile-metrics ttft,tpot,e2el
python3 benchmarks/benchmark_serving.py --model ${MODEL_NAME} --backend openai-chat --endpoint /v1/chat/completions --dataset-name hf --dataset-path lmms-lab/LLaVA-OneVision-Data --hf-split train --hf-subset "chart2text(cauldron)" --num-prompts ${NUM_PROMPTS} --request-rate 1000 --percentile-metrics ttft,tpot,e2el
python3 benchmarks/benchmark_serving.py --backend vllm --model ${MODEL_NAME} --dataset-name sonnet --dataset-path benchmarks/sonnet.txt --num-prompts ${NUM_PROMPTS}
python3 benchmarks/benchmark_serving.py --backend vllm --model ${MODEL_NAME} --dataset-name sharegpt --dataset-path /home/jovyan/data/vllm_benchmark_datasets/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts ${NUM_PROMPTS}
python3 benchmarks/benchmark_serving.py --backend vllm --model ${MODEL_NAME} --dataset-name random --num-prompts ${NUM_PROMPTS}
python3 benchmarks/benchmark_serving.py --backend vllm --model ${MODEL_NAME} --dataset-name burstgpt --dataset-path /home/jovyan/data/vllm_benchmark_datasets/BurstGPT_without_fails_2.csv --num-prompts ${NUM_PROMPTS}
```

Benchmark Throughput Results

after the change

| Dataset | Processed Prompts | Throughput (requests/s) | Total tokens/s | Output tokens/s |
|---|---|---|---|---|
| random | 10 | 50.44 | 1513.07 | 1008.71 |
| ShareGPT | 10 | 1.66 | 605.33 | 378.11 |
| sonnet | 10 | 7.62 | 4960.96 | 1142.38 |
| burstgpt | 10 | 2.17 | 2999.05 | 406.72 |

before the change (sonnet and burstgpt were not supported)

| Dataset | Processed Prompts | Throughput (requests/s) | Total tokens/s | Output tokens/s |
|---|---|---|---|---|
| random | 10 | 51.13 | 1534.02 | 1022.68 |
| ShareGPT | 10 | 1.66 | 604.19 | 377.39 |
| sonnet | 10 | n/a | n/a | n/a |
| burstgpt | 10 | n/a | n/a | n/a |
MODEL="NousResearch/Hermes-3-Llama-3.1-8B"
"VLLM_USE_V1=1 python3 benchmarks/benchmark_throughput.py --model $MODEL --input_len 10 --output_len 20 --dataset-name random --num-prompts $NUM_PROMPTS"
"VLLM_USE_V1=1 python3 benchmarks/benchmark_throughput.py --model $MODEL --dataset /home/jovyan/vllm/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts $NUM_PROMPTS"
"VLLM_USE_V1=1 python3 benchmarks/benchmark_throughput.py --model $MODEL --dataset-name sonnet --dataset benchmarks/sonnet.txt --num-prompts $NUM_PROMPTS"
"VLLM_USE_V1=1 python3 benchmarks/benchmark_throughput.py --model $MODEL --dataset /home/jovyan/data/vllm_benchmark_datasets/BurstGPT_without_fails_2.csv --dataset-name burstgpt --num-prompts $NUM_PROMPTS"

Benchmark Throughput Results - Image Support

The command is copied from #9851. Since the COCO dataset is too large, I hard-coded the script to use only one image:

```python
# Hard-coded to use this single image (requires `from PIL import Image`).
multi_modal_data["image"] = Image.open("000000000009.jpg").convert("RGB")
```

after the change (1000 requests)

Throughput: 8.61 requests/s, 1839.93 total tokens/s, 1697.75 output tokens/s

before the change (1000 requests)

Throughput: 8.59 requests/s, 1835.35 total tokens/s, 1693.53 output tokens/s

```bash
python benchmarks/benchmark_throughput.py \
    --model mistral-community/pixtral-12b \
    --max-model-len=8192 \
    --dataset sharegpt4v_instruct_gpt4-vision_cap100k.json
```

LoRA request test

Commands are copied from PR #11267.
after the change

| Dataset | Num Prompts | Max LoRAs | Max LoRA Rank | Enable LoRA | Async Engine | Throughput (requests/s) | Total tokens/s | Output tokens/s |
|---|---|---|---|---|---|---|---|---|
| ShareGPT | 1000 | 1 | 8 | Yes | No | 11.66 | 5610.75 | 2742.90 |
| ShareGPT | 1000 | 4 | 8 | Yes | No | 11.59 | 5575.73 | 2725.78 |
| ShareGPT | 1000 | N/A | N/A | No | Yes | 17.42 | 8383.51 | 4098.41 |
| ShareGPT | 1000 | 1 | 8 | Yes | Yes | 11.50 | 5535.98 | 2706.35 |
| ShareGPT | 1000 | 4 | 8 | Yes | Yes | 11.25 | 5412.76 | 2646.11 |

before the change

| Dataset | Num Prompts | Max LoRAs | Max LoRA Rank | Enable LoRA | Async Engine | Throughput (requests/s) | Total tokens/s | Output tokens/s |
|---|---|---|---|---|---|---|---|---|
| ShareGPT | 1000 | 1 | 8 | Yes | No | 10.84 | 5216.17 | 2550.01 |
| ShareGPT | 1000 | 4 | 8 | Yes | No | 10.80 | 5197.68 | 2540.97 |
| ShareGPT | 1000 | N/A | N/A | No | Yes | 16.75 | 8061.23 | 3940.86 |
| ShareGPT | 1000 | 1 | 8 | Yes | Yes | 11.08 | 5332.47 | 2606.86 |
| ShareGPT | 1000 | 4 | 8 | Yes | Yes | 10.84 | 5215.25 | 2549.56 |
  "python3 benchmarks/benchmark_throughput.py --model meta-llama/Llama-2-7b-hf --backend vllm --dataset ./ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts $NUM_PROMPTS --max-loras 1 --max-lora-rank 8 --enable-lora --lora-path \"yard1/llama-2-7b-sql-lora-test\""
  "python3 benchmarks/benchmark_throughput.py --model meta-llama/Llama-2-7b-hf --backend vllm --dataset ./ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts $NUM_PROMPTS --max-loras 4 --max-lora-rank 8 --enable-lora --lora-path \"yard1/llama-2-7b-sql-lora-test\""
  "python3 benchmarks/benchmark_throughput.py --model meta-llama/Llama-2-7b-hf --backend vllm --dataset ./ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts $NUM_PROMPTS --async-engine"
  "python3 benchmarks/benchmark_throughput.py --model meta-llama/Llama-2-7b-hf --backend vllm --dataset ./ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts $NUM_PROMPTS --async-engine --max-loras 1 --max-lora-rank 8 --enable-lora --lora-path \"yard1/llama-2-7b-sql-lora-test\""
  "python3 benchmarks/benchmark_throughput.py --model meta-llama/Llama-2-7b-hf --backend vllm --dataset ./ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts $NUM_PROMPTS --async-engine --max-loras 4 --max-lora-rank 8 --enable-lora --lora-path \"yard1/llama-2-7b-sql-lora-test\""

The scripts used to generate the tables above are here.

Signed-off-by: Jennifer Zhao <[email protected]>

JenZhao marked this pull request as ready for review on March 1, 2025.
ywang96 self-assigned this on March 1, 2025.
(excerpt from the diff under review)

```python
prompt_formatted = tokenizer.apply_chat_template(
    message, add_generation_prompt=True, tokenize=False)
prompt_len = len(tokenizer(prompt_formatted).input_ids)
```
@JenZhao (Contributor, Author) commented:
I wonder whether prompt_len should always be based on prompt_formatted. When the unformatted prompt is the one returned, should prompt_len instead be computed as len(tokenizer(prompt).input_ids)?

For example, in the original code here, should prompt_len be based on len(tokenizer(prompt).input_ids)? (screenshot of the original code omitted)
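To make the comparison concrete, here is a small sketch of the two length computations being discussed; the tokenizer and message content are placeholders, and any chat model's tokenizer illustrates the point:

```python
from transformers import AutoTokenizer

# Placeholder model and message, used only to contrast the two lengths.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct")

prompt = "Pick three lines from the sonnet and continue the poem."
message = [{"role": "user", "content": prompt}]
prompt_formatted = tokenizer.apply_chat_template(
    message, add_generation_prompt=True, tokenize=False)

# Option 1: length of the chat-template-formatted prompt (what the code above does).
formatted_len = len(tokenizer(prompt_formatted).input_ids)
# Option 2: length of the raw prompt (what the comment suggests when the raw
# prompt is the one actually returned).
raw_len = len(tokenizer(prompt).input_ids)

print(formatted_len, raw_len)  # formatted_len also counts chat-template tokens
```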
