Expose --queue-size, --queue-timeout-secs, and --rate-limit-tokens-per-second CLI flags #185

Open
pedramr wants to merge 1 commit into vllm-project:main from pedramr:expose-rate-limit-cli-flags-upstream

pedramr commented Jun 17, 2026

Purpose

The standalone Rust vllm-router binary hardcodes three queue / rate-limiting
knobs in CliArgs::to_router_config() and exposes no CLI flags for them:

  • queue_size = 100
  • queue_timeout_secs = 60
  • rate_limit_tokens_per_second = None

…even though RouterConfig already supports all three fields and the Python
launcher (py_src/vllm_router/router_args.py) already exposes them as
--queue-size / --queue-timeout-secs / --rate-limit-tokens-per-second.
The two entry points have drifted apart: settings that work through the Python
launcher cannot be set at all when running the binary directly.

In particular, binary users cannot set --queue-size 0, which disables the
concurrency queue so the limiter sheds immediately with HTTP 429 (fail-fast)
instead of queuing up to 100 requests for up to 60s.
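To illustrate the fail-fast behavior described above, here is a minimal sketch of an admission decision. It is a simplified model, not the router's actual limiter: the `Admission` enum, the `admit` function, and its parameters are hypothetical names introduced for this example.

```rust
/// Outcome of an admission check (hypothetical type for illustration).
#[derive(Debug, PartialEq)]
pub enum Admission {
    /// A concurrency slot is free; handle the request now.
    Proceed,
    /// No slot is free, but the queue has room; wait up to the queue timeout.
    Enqueue,
    /// No slot and no queue room; shed the request with HTTP 429.
    Reject429,
}

/// Simplified model of the limiter's decision, assuming a fixed concurrency
/// limit and a bounded waiting queue.
pub fn admit(
    in_flight: usize,
    max_concurrent: usize,
    queued: usize,
    queue_size: usize,
) -> Admission {
    if in_flight < max_concurrent {
        Admission::Proceed
    } else if queued < queue_size {
        // With queue_size == 0 this branch is never taken.
        Admission::Enqueue
    } else {
        Admission::Reject429
    }
}
```

In this model, `queue_size = 0` means a request arriving at the concurrency limit is rejected immediately, while the default of 100 would let it wait in the queue.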

This PR adds the three flags to CliArgs and threads them through
to_router_config(). Defaults are identical to the previously hardcoded
values (100 / 60 / None), so behavior is unchanged unless a flag is
explicitly passed.
The new --help text mirrors the Python launcher's wording.
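The shape of the change can be sketched roughly as follows. This is a trimmed-down, std-only illustration rather than the actual diff: the real `CliArgs` and `RouterConfig` have many more fields, the field types (`usize`, `u64`, `Option<usize>`) are assumptions, and the argument-parser attributes on `CliArgs` are omitted.

```rust
/// Stand-in for the real RouterConfig; only the three fields from this PR
/// are modeled, and their types are assumed.
#[derive(Debug, Clone, PartialEq)]
pub struct RouterConfig {
    pub queue_size: usize,
    pub queue_timeout_secs: u64,
    pub rate_limit_tokens_per_second: Option<usize>,
}

/// Stand-in for the real CliArgs (parser attributes omitted).
#[derive(Debug, Clone)]
pub struct CliArgs {
    pub queue_size: usize,
    pub queue_timeout_secs: u64,
    pub rate_limit_tokens_per_second: Option<usize>,
}

impl Default for CliArgs {
    /// Defaults match the previously hardcoded values (100 / 60 / None).
    fn default() -> Self {
        Self {
            queue_size: 100,
            queue_timeout_secs: 60,
            rate_limit_tokens_per_second: None,
        }
    }
}

impl CliArgs {
    /// Thread the parsed values through instead of hardcoding them.
    pub fn to_router_config(&self) -> RouterConfig {
        RouterConfig {
            queue_size: self.queue_size,
            queue_timeout_secs: self.queue_timeout_secs,
            rate_limit_tokens_per_second: self.rate_limit_tokens_per_second,
        }
    }
}
```

With no flags passed, `CliArgs::default().to_router_config()` yields the same 100 / 60 / None values as before; overriding `queue_size` to 0 carries through unchanged.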

Test Plan

Built and exercised the release binary with stable Rust (rustc 1.96.0):

  • cargo fmt --check
  • cargo check
  • cargo clippy --all-targets
  • cargo build --release
  • ./target/release/vllm-router --help lists the three new flags with defaults
  • Boot the binary with --queue-size 0 and confirm RouterConfig::validate()
    accepts it

Test Result

  • cargo fmt --check, cargo check, cargo clippy --all-targets, and
    cargo build --release: all clean.
  • --help output:
    • --queue-size <QUEUE_SIZE> [default: 100]
    • --queue-timeout-secs <QUEUE_TIMEOUT_SECS> [default: 60]
    • --rate-limit-tokens-per-second <RATE_LIMIT_TOKENS_PER_SECOND> — falls back
      to --max-concurrent-requests when unset
  • Booting with --queue-size 0 logs Configuration validated successfully and
    proceeds past config validation (the limiter then runs queue-less, shedding
    with HTTP 429 at the concurrency limit).
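The fallback noted in the --help text for --rate-limit-tokens-per-second can be expressed as a one-line resolution step. The helper below is hypothetical (the name `effective_refill_rate` and the `usize` types are assumptions); it only illustrates the documented "falls back to --max-concurrent-requests when unset" rule, not where the router actually performs it.

```rust
/// Resolve the token-bucket refill rate: use the explicit value when given,
/// otherwise fall back to the concurrency limit (hypothetical helper).
pub fn effective_refill_rate(
    rate_limit_tokens_per_second: Option<usize>,
    max_concurrent_requests: usize,
) -> usize {
    rate_limit_tokens_per_second.unwrap_or(max_concurrent_requests)
}
```

For example, with the flag unset and a concurrency limit of 64, the resolved rate is 64; passing an explicit value of 10 yields 10.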

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR — fix the binary/Python-launcher flag drift so the
    queue and rate-limit knobs are configurable from the binary.
  • The test plan — cargo fmt/check/clippy/build --release plus a --help
    and --queue-size 0 boot check.
  • The test results — all checks clean; --help and --queue-size 0 boot
    verified (see above).

…r-second CLI flags

The Rust vllm-router binary hardcoded queue_size=100, queue_timeout_secs=60, and
rate_limit_tokens_per_second=None in CliArgs::to_router_config(), even though
RouterConfig supports all three and the Python launcher (router_args.py) already
exposes them. This drift means binary users cannot disable the concurrency queue
(--queue-size 0) for fail-fast 429 shedding, tune the queue timeout, or set an
explicit token-bucket refill rate.

Add the three flags to CliArgs and thread them through to_router_config().
Defaults match the previously hardcoded values (100 / 60 / None), so behavior is
unchanged unless a flag is explicitly passed.

Signed-off-by: Pedram Razavi <pedram.razavi@gmail.com>