Per-worker concurrency is capped at 100 #487

@sjmonson

Describe the bug
Setting GUIDELLM__MAX_WORKER_PROCESSES=1 caps concurrency at exactly 100 requests, as measured server-side. The client reports the full requested concurrency, so the bottleneck is likely somewhere after the request thread is launched.

Expected behavior
The actual maximum concurrency per worker should be much higher than 100.

Environment
  1. OS: Fedora Linux 42 (Container Image) on OCP 4.19
  2. Python version: 3.13.7
  3. GuideLLM version: v0.4.0

To Reproduce
Exact steps to reproduce the behavior:

export GUIDELLM__MAX_WORKER_PROCESSES=1
guidellm benchmark \
    --target http://localhost:8000 \
    --rate-type concurrent \
    --rate "128" \
    --max-seconds 120 \
    --data "prompt_tokens=256,output_tokens=128"

Observe server-side that the sum of waiting and running requests never exceeds 100.
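For anyone wanting to check the server-side number themselves, here is a rough sketch of one way to do it. It assumes the server exposes Prometheus text-format metrics with vLLM-style gauges named `vllm:num_requests_running` and `vllm:num_requests_waiting`; those names, the metrics URL, and the helpers `in_flight_requests` and `poll_peak` are illustrative assumptions and not part of GuideLLM.

```python
import re
import time
import urllib.request

# Assumed gauge names (vLLM-style); adjust for the server under test.
GAUGES = ("vllm:num_requests_running", "vllm:num_requests_waiting")

# Prometheus sample line: name{labels} value [timestamp]
SAMPLE_RE = re.compile(r"^([A-Za-z_:][A-Za-z0-9_:]*)(\{.*\})?\s+(\S+)")


def in_flight_requests(metrics_text: str, gauges=GAUGES) -> float:
    """Sum the selected gauges across all label sets in a metrics scrape."""
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match and match.group(1) in gauges:
            total += float(match.group(3))
    return total


def poll_peak(url: str, seconds: float = 120.0, interval: float = 1.0) -> float:
    """Scrape `url` repeatedly and return the highest in-flight count seen."""
    peak = 0.0
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        with urllib.request.urlopen(url, timeout=5) as resp:
            text = resp.read().decode("utf-8")
        peak = max(peak, in_flight_requests(text))
        time.sleep(interval)
    return peak
```

Running something like `poll_peak("http://localhost:8000/metrics")` alongside the benchmark (the `/metrics` path is an assumption) should report a peak close to the requested rate of 128 if there is no cap, and 100 if the behavior described here is present.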

Additional context
The per-worker cap holds with more workers, e.g. GUIDELLM__MAX_WORKER_PROCESSES=2 results in a maximum concurrency of 200. Our default is 10 workers, and 1,000+ concurrency is pretty rare for single-node tests, so the cap is rarely hit with default settings.
