Describe the bug
Setting GUIDELLM__MAX_WORKER_PROCESSES=1 caps concurrency at exactly 100 requests, as measured from the server side. The client side reports full concurrency, which suggests the bottleneck sits somewhere after the request thread is launched.
Expected behavior
The actual maximum concurrency per worker should be much higher than 100.
Environment
Include all relevant environment information:
- OS: Fedora Linux 42 (Container Image) on OCP 4.19
- Python version: 3.13.7
- GuideLLM version: v0.4.0
To Reproduce
Exact steps to reproduce the behavior:
export GUIDELLM__MAX_WORKER_PROCESSES=1
guidellm benchmark \
  --target http://localhost:8000 \
  --rate-type concurrent \
  --rate "128" \
  --max-seconds 120 \
  --data "prompt_tokens=256,output_tokens=128"

Observe from the server side that there are never more than 100 waiting + running requests.
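For anyone reproducing this, the server-side check can be scripted. The following is a minimal sketch, not part of GuideLLM. It assumes the server exposes Prometheus-format metrics at a `/metrics` path and uses vLLM-style gauge names; the report does not say which server was used, so both the URL and the gauge names are assumptions to adjust.

```python
import re
import urllib.request

# Assumed gauge names (vLLM-style); the report does not name the server.
ASSUMED_GAUGES = ("vllm:num_requests_running", "vllm:num_requests_waiting")

# Prometheus text format sample line: name{labels} value [timestamp]
_SAMPLE_RE = re.compile(r"^([A-Za-z_:][A-Za-z0-9_:]*)(\{.*\})?\s+(\S+)")


def in_flight_requests(metrics_text: str, gauges=ASSUMED_GAUGES) -> float:
    """Sum the given gauges across all label sets in a metrics payload."""
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE_RE.match(line)
        if match and match.group(1) in gauges:
            total += float(match.group(3))
    return total


def fetch_metrics(url: str = "http://localhost:8000/metrics") -> str:
    """Fetch the raw metrics text (the URL path is an assumption)."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return response.read().decode("utf-8")
```

Calling `in_flight_requests(fetch_metrics())` in a loop during the benchmark and tracking the maximum gives the waiting + running figure.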
Additional context
The per-worker cap stays the same with more workers, e.g. GUIDELLM__MAX_WORKER_PROCESSES=2 results in a maximum concurrency of 200. Our default is 10 workers, which puts the overall cap at 1,000; 1,000+ concurrency is pretty rare for single-node tests.
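The observations in this report fit a simple model: each worker tops out at 100 in-flight requests, so the server sees the smaller of the requested concurrency and workers × 100. The sketch below only restates that arithmetic. The function name and the `per_worker_cap` parameter are hypothetical and do not correspond to anything in GuideLLM's code.

```python
def observed_concurrency(requested: int, workers: int, per_worker_cap: int = 100) -> int:
    """Model of the reported behavior, not GuideLLM's implementation.

    per_worker_cap=100 is the value observed in this report.
    """
    return min(requested, workers * per_worker_cap)


# Reproduction case from this report: --rate 128 with one worker.
print(observed_concurrency(requested=128, workers=1))   # 100
# Default of 10 workers: the cap (1,000) is above the request, so it goes unnoticed.
print(observed_concurrency(requested=128, workers=10))  # 128
```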