[Feature] Consolidate performance benchmark datasets #14036

Open: wants to merge 8 commits into main
Conversation

@JenZhao (Contributor) commented on Feb 28, 2025

Addressing #13351

Benchmark Serving Results

after the change

| Dataset | Backend | Successful requests | Benchmark duration (s) | Total input tokens |
|---|---|---|---|---|
| sonnet | openai-chat | 1000 | 30.94 | 546875 |
| hf-vision-arena | openai-chat | 500 | 60.53 | 33418 |
| hf | openai-chat | 1000 | 85.57 | 11428 |
| sonnet | vllm | 1000 | 30.46 | 546875 |
| sharegpt | vllm | 1000 | 35.16 | 217393 |
| random | vllm | 1000 | 42.69 | 1024000 |
| burstgpt | vllm | 1000 | 102.80 | 768960 |

before the change

| Dataset | Backend | Successful requests | Benchmark duration (s) | Total input tokens |
|---|---|---|---|---|
| sonnet | openai-chat | 1000 | 30.24 | 546875 |
| hf-vision-arena | openai-chat | 500 | 59.43 | 33418 |
| hf | openai-chat | 1000 | 85.70 | 11428 |
| sonnet | vllm | 1000 | 29.61 | 546875 |
| sharegpt | vllm | 1000 | 41.31 | 217393 |
| random | vllm | 1000 | 51.89 | 1024000 |
| burstgpt | vllm | 1000 | 99.73 | 768960 |
```bash
MODEL_NAME="Qwen/Qwen2-VL-7B-Instruct"
python3 benchmarks/benchmark_serving.py --backend openai-chat --model ${MODEL_NAME} --endpoint /v1/chat/completions --dataset-name sonnet --dataset-path benchmarks/sonnet.txt --num-prompts ${NUM_PROMPTS}
python3 benchmarks/benchmark_serving.py --model ${MODEL_NAME} --backend openai-chat --endpoint /v1/chat/completions --dataset-name hf --dataset-path lmarena-ai/vision-arena-bench-v0.1 --hf-split train --num-prompts ${NUM_PROMPTS} --request-rate 1000 --percentile-metrics ttft,tpot,e2el
python3 benchmarks/benchmark_serving.py --model ${MODEL_NAME} --backend openai-chat --endpoint /v1/chat/completions --dataset-name hf --dataset-path lmms-lab/LLaVA-OneVision-Data --hf-split train --hf-subset "chart2text(cauldron)" --num-prompts ${NUM_PROMPTS} --request-rate 1000 --percentile-metrics ttft,tpot,e2el
python3 benchmarks/benchmark_serving.py --backend vllm --model ${MODEL_NAME} --dataset-name sonnet --dataset-path benchmarks/sonnet.txt --num-prompts ${NUM_PROMPTS}
python3 benchmarks/benchmark_serving.py --backend vllm --model ${MODEL_NAME} --dataset-name sharegpt --dataset-path /home/jovyan/data/vllm_benchmark_datasets/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts ${NUM_PROMPTS}
python3 benchmarks/benchmark_serving.py --backend vllm --model ${MODEL_NAME} --dataset-name random --num-prompts ${NUM_PROMPTS}
python3 benchmarks/benchmark_serving.py --backend vllm --model ${MODEL_NAME} --dataset-name burstgpt --dataset-path /home/jovyan/data/vllm_benchmark_datasets/BurstGPT_without_fails_2.csv --num-prompts ${NUM_PROMPTS}
```

Benchmark Throughput Results

after the change

| Dataset | Processed Prompts | Throughput (requests/s) | Total tokens/s | Output tokens/s |
|---|---|---|---|---|
| random | 10 | 50.44 | 1513.07 | 1008.71 |
| ShareGPT | 10 | 1.66 | 605.33 | 378.11 |
| sonnet | 10 | 7.62 | 4960.96 | 1142.38 |
| burstgpt | 10 | 2.17 | 2999.05 | 406.72 |

before the change (sonnet and burstgpt were not supported)

| Dataset | Processed Prompts | Throughput (requests/s) | Total tokens/s | Output tokens/s |
|---|---|---|---|---|
| random | 10 | 51.13 | 1534.02 | 1022.68 |
| ShareGPT | 10 | 1.66 | 604.19 | 377.39 |
| sonnet | 10 | n/a | n/a | n/a |
| burstgpt | 10 | n/a | n/a | n/a |
MODEL="NousResearch/Hermes-3-Llama-3.1-8B"
"VLLM_USE_V1=1 python3 benchmarks/benchmark_throughput.py --model $MODEL --input_len 10 --output_len 20 --dataset-name random --num-prompts $NUM_PROMPTS"
"VLLM_USE_V1=1 python3 benchmarks/benchmark_throughput.py --model $MODEL --dataset /home/jovyan/vllm/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts $NUM_PROMPTS"
"VLLM_USE_V1=1 python3 benchmarks/benchmark_throughput.py --model $MODEL --dataset-name sonnet --dataset benchmarks/sonnet.txt --num-prompts $NUM_PROMPTS"
"VLLM_USE_V1=1 python3 benchmarks/benchmark_throughput.py --model $MODEL --dataset /home/jovyan/data/vllm_benchmark_datasets/BurstGPT_without_fails_2.csv --dataset-name burstgpt --num-prompts $NUM_PROMPTS"

Benchmark Throughput Results - Image Support

The command is copied from #9851. Since the COCO dataset is too large, I hard-coded the script to use only one image:

```python
# Hard-coded to use this single image (requires `from PIL import Image`).
multi_modal_data["image"] = Image.open("000000000009.jpg").convert("RGB")
```

after the change (1000 requests)

Throughput: 8.61 requests/s, 1839.93 total tokens/s, 1697.75 output tokens/s

before the change (1000 requests)

Throughput: 8.59 requests/s, 1835.35 total tokens/s, 1693.53 output tokens/s

```bash
python benchmarks/benchmark_throughput.py \
    --model mistral-community/pixtral-12b \
    --max-model-len=8192 \
    --dataset sharegpt4v_instruct_gpt4-vision_cap100k.json
```

LoRA request test

Commands are copied from PR #11267.
after the change

| Dataset | Num Prompts | Max LoRAs | Max LoRA Rank | Enable LoRA | Async Engine | Throughput (requests/s) | Total tokens/s | Output tokens/s |
|---|---|---|---|---|---|---|---|---|
| ShareGPT | 1000 | 1 | 8 | Yes | No | 11.66 | 5610.75 | 2742.90 |
| ShareGPT | 1000 | 4 | 8 | Yes | No | 11.59 | 5575.73 | 2725.78 |
| ShareGPT | 1000 | N/A | N/A | No | Yes | 17.42 | 8383.51 | 4098.41 |
| ShareGPT | 1000 | 1 | 8 | Yes | Yes | 11.50 | 5535.98 | 2706.35 |
| ShareGPT | 1000 | 4 | 8 | Yes | Yes | 11.25 | 5412.76 | 2646.11 |

before the change

| Dataset | Num Prompts | Max LoRAs | Max LoRA Rank | Enable LoRA | Async Engine | Throughput (requests/s) | Total tokens/s | Output tokens/s |
|---|---|---|---|---|---|---|---|---|
| ShareGPT | 1000 | 1 | 8 | Yes | No | 10.84 | 5216.17 | 2550.01 |
| ShareGPT | 1000 | 4 | 8 | Yes | No | 10.80 | 5197.68 | 2540.97 |
| ShareGPT | 1000 | N/A | N/A | No | Yes | 16.75 | 8061.23 | 3940.86 |
| ShareGPT | 1000 | 1 | 8 | Yes | Yes | 11.08 | 5332.47 | 2606.86 |
| ShareGPT | 1000 | 4 | 8 | Yes | Yes | 10.84 | 5215.25 | 2549.56 |
  "python3 benchmarks/benchmark_throughput.py --model meta-llama/Llama-2-7b-hf --backend vllm --dataset ./ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts $NUM_PROMPTS --max-loras 1 --max-lora-rank 8 --enable-lora --lora-path \"yard1/llama-2-7b-sql-lora-test\""
  "python3 benchmarks/benchmark_throughput.py --model meta-llama/Llama-2-7b-hf --backend vllm --dataset ./ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts $NUM_PROMPTS --max-loras 4 --max-lora-rank 8 --enable-lora --lora-path \"yard1/llama-2-7b-sql-lora-test\""
  "python3 benchmarks/benchmark_throughput.py --model meta-llama/Llama-2-7b-hf --backend vllm --dataset ./ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts $NUM_PROMPTS --async-engine"
  "python3 benchmarks/benchmark_throughput.py --model meta-llama/Llama-2-7b-hf --backend vllm --dataset ./ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts $NUM_PROMPTS --async-engine --max-loras 1 --max-lora-rank 8 --enable-lora --lora-path \"yard1/llama-2-7b-sql-lora-test\""
  "python3 benchmarks/benchmark_throughput.py --model meta-llama/Llama-2-7b-hf --backend vllm --dataset ./ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts $NUM_PROMPTS --async-engine --max-loras 4 --max-lora-rank 8 --enable-lora --lora-path \"yard1/llama-2-7b-sql-lora-test\""

The scripts used to generate the tables above are here.

Signed-off-by: Jennifer Zhao <[email protected]>

JenZhao marked this pull request as ready for review on March 1, 2025.
ywang96 self-assigned this on March 1, 2025.
(excerpt from the diff under review)

```python
prompt_formatted = tokenizer.apply_chat_template(
    message, add_generation_prompt=True, tokenize=False)
prompt_len = len(tokenizer(prompt_formatted).input_ids)
```
@JenZhao (Contributor, Author) commented:
I wonder whether prompt_len should always be based on prompt_formatted. When the unformatted prompt is the one returned, should prompt_len instead be computed as len(tokenizer(prompt).input_ids)?

For example, in the original code here, should prompt_len be based on len(tokenizer(prompt).input_ids)? (screenshot of the original code omitted)
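To make the comparison concrete, here is a small sketch of the two length computations being discussed; the tokenizer and message content are placeholders, and any chat model's tokenizer illustrates the point:

```python
from transformers import AutoTokenizer

# Placeholder model and message, used only to contrast the two lengths.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct")

prompt = "Pick three lines from the sonnet and continue the poem."
message = [{"role": "user", "content": prompt}]
prompt_formatted = tokenizer.apply_chat_template(
    message, add_generation_prompt=True, tokenize=False)

# Option 1: length of the chat-template-formatted prompt (what the code above does).
formatted_len = len(tokenizer(prompt_formatted).input_ids)
# Option 2: length of the raw prompt (what the comment suggests when the raw
# prompt is the one actually returned).
raw_len = len(tokenizer(prompt).input_ids)

print(formatted_len, raw_len)  # formatted_len also counts chat-template tokens
```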
