Expose --queue-size, --queue-timeout-secs, and --rate-limit-tokens-per-second CLI flags by pedramr · Pull Request #185 · vllm-project/router

pedramr · 2026-06-17T04:33:17Z

Purpose

The standalone Rust vllm-router binary hardcodes three queue / rate-limiting
knobs in CliArgs::to_router_config() and exposes no CLI flags for them:

- queue_size = 100
- queue_timeout_secs = 60
- rate_limit_tokens_per_second = None

…even though RouterConfig already supports all three fields and the Python
launcher (py_src/vllm_router/router_args.py) already exposes them as
--queue-size / --queue-timeout-secs / --rate-limit-tokens-per-second.
This is flag drift between the two entry points: settings that are available
through the Python launcher cannot be set at all when running the binary directly.

In particular, binary users cannot set --queue-size 0, which disables the
concurrency queue so the limiter sheds immediately with HTTP 429 (fail-fast)
instead of queuing up to 100 requests for up to 60s.
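
To make the fail-fast case concrete, here is a minimal model of the admission decision described above. It is a sketch only: `Admission`, `admit`, and their parameters are hypothetical names, not the router's actual limiter, and it assumes a full queue is also shed with HTTP 429.

```rust
/// Hypothetical outcome of an admission check (not the router's real type).
#[derive(Debug, PartialEq)]
pub enum Admission {
    /// A concurrency slot is free, so the request runs immediately.
    Proceed,
    /// No slot is free, but the queue has room; wait up to the queue timeout.
    Enqueue,
    /// No slot and no queue room: shed immediately with HTTP 429.
    Reject429,
}

/// Simplified decision, assuming `queue_size` is the maximum number of waiting requests.
pub fn admit(in_flight: usize, max_concurrent: usize, queued: usize, queue_size: usize) -> Admission {
    if in_flight < max_concurrent {
        Admission::Proceed
    } else if queued < queue_size {
        Admission::Enqueue
    } else {
        // With queue_size == 0 the branch above can never be taken,
        // so every request at the concurrency limit lands here.
        Admission::Reject429
    }
}
```

In this model, `admit(8, 8, 0, 0)` rejects right away, while `admit(8, 8, 0, 100)` would queue the request.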

This PR adds the three flags to CliArgs and threads them through
to_router_config(). Defaults are identical to the previously hardcoded
values (100 / 60 / None), so behavior is unchanged unless a flag is
explicitly passed. The new --help text mirrors the Python launcher's wording.
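
For readers who want the shape of the change without opening the diff, the sketch below shows the idea of defaults that match the old hardcoded values and are overridden only when a flag is passed. It is std-only and hypothetical: `QueueFlags`, `parse_value`, and `parse_queue_flags` are illustrative names, the field types are assumptions, and the real CliArgs relies on the binary's existing argument parser rather than a hand-written loop.

```rust
use std::str::FromStr;

/// Hypothetical stand-in for the three new CliArgs fields.
#[derive(Debug, PartialEq)]
pub struct QueueFlags {
    pub queue_size: usize,
    pub queue_timeout_secs: u64,
    pub rate_limit_tokens_per_second: Option<usize>,
}

impl Default for QueueFlags {
    fn default() -> Self {
        // Same values that were previously hardcoded: 100 / 60 / None.
        Self { queue_size: 100, queue_timeout_secs: 60, rate_limit_tokens_per_second: None }
    }
}

/// Parse the value that follows `flag`, reporting a readable error on failure.
fn parse_value<T: FromStr>(flag: &str, raw: Option<&&str>) -> Result<T, String> {
    let raw = raw.ok_or_else(|| format!("{flag} requires a value"))?;
    raw.parse::<T>()
        .map_err(|_| format!("invalid value for {flag}: {raw}"))
}

/// Walk `--flag value` pairs; flags that are absent keep their defaults.
pub fn parse_queue_flags(args: &[&str]) -> Result<QueueFlags, String> {
    let mut flags = QueueFlags::default();
    let mut it = args.iter();
    while let Some(&name) = it.next() {
        match name {
            "--queue-size" => flags.queue_size = parse_value(name, it.next())?,
            "--queue-timeout-secs" => flags.queue_timeout_secs = parse_value(name, it.next())?,
            "--rate-limit-tokens-per-second" => {
                flags.rate_limit_tokens_per_second = Some(parse_value(name, it.next())?)
            }
            other => return Err(format!("unknown flag: {other}")),
        }
    }
    Ok(flags)
}
```

Calling `parse_queue_flags(&["--queue-size", "0"])` changes only the queue size and leaves the other two values at their defaults, which is the "behavior is unchanged unless a flag is explicitly passed" property.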

Test Plan

Built and exercised the release binary with stable Rust (rustc 1.96.0):

- cargo fmt --check
- cargo check
- cargo clippy --all-targets
- cargo build --release
- ./target/release/vllm-router --help lists the three new flags with defaults
- Boot the binary with --queue-size 0 and confirm RouterConfig::validate()
  accepts it

Test Result

cargo fmt --check, cargo check, cargo clippy --all-targets, and
cargo build --release: all clean.
--help output:
- --queue-size <QUEUE_SIZE> — [default: 100]
- --queue-timeout-secs <QUEUE_TIMEOUT_SECS> — [default: 60]
- --rate-limit-tokens-per-second <RATE_LIMIT_TOKENS_PER_SECOND> — falls back
  to --max-concurrent-requests when unset
Booting with --queue-size 0 logs Configuration validated successfully and
proceeds past config validation (the limiter then runs queue-less, shedding
with HTTP 429 at the concurrency limit).
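
The --help note for --rate-limit-tokens-per-second (fall back to --max-concurrent-requests when unset) amounts to a simple default. A minimal sketch, assuming both values are unsigned integers; `effective_refill_rate` is a hypothetical helper name, not a function in the router:

```rust
/// Hypothetical helper: pick the token-bucket refill rate.
/// An explicit flag value wins; otherwise fall back to the concurrency limit.
pub fn effective_refill_rate(
    rate_limit_tokens_per_second: Option<usize>,
    max_concurrent_requests: usize,
) -> usize {
    rate_limit_tokens_per_second.unwrap_or(max_concurrent_requests)
}
```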

Essential Elements of an Effective PR Description Checklist

- The purpose of the PR: fix the binary/Python-launcher flag drift so the
  queue and rate-limit knobs are configurable from the binary.
- The test plan: cargo fmt/check/clippy/build --release plus a --help
  and --queue-size 0 boot check.
- The test results: all checks clean; --help and --queue-size 0 boot
  verified (see above).

…r-second CLI flags

The Rust vllm-router binary hardcoded queue_size=100, queue_timeout_secs=60, and rate_limit_tokens_per_second=None in CliArgs::to_router_config(), even though RouterConfig supports all three and the Python launcher (router_args.py) already exposes them. This drift means binary users cannot disable the concurrency queue (--queue-size 0) for fail-fast 429 shedding, tune the queue timeout, or set an explicit token-bucket refill rate.

Add the three flags to CliArgs and thread them through to_router_config(). Defaults match the previously hardcoded values (100 / 60 / None), so behavior is unchanged unless a flag is explicitly passed.

Signed-off-by: Pedram Razavi <pedram.razavi@gmail.com>

pedramr wants to merge 1 commit into vllm-project:main from pedramr:expose-rate-limit-cli-flags-upstream
