Expose --queue-size, --queue-timeout-secs, and --rate-limit-tokens-per-second CLI flags #185
Open
pedramr wants to merge 1 commit into
…r-second CLI flags

The Rust vllm-router binary hardcoded queue_size=100, queue_timeout_secs=60, and rate_limit_tokens_per_second=None in CliArgs::to_router_config(), even though RouterConfig supports all three and the Python launcher (router_args.py) already exposes them.

This drift means binary users cannot disable the concurrency queue (--queue-size 0) for fail-fast 429 shedding, tune the queue timeout, or set an explicit token-bucket refill rate.

Add the three flags to CliArgs and thread them through to_router_config(). Defaults match the previously hardcoded values (100 / 60 / None), so behavior is unchanged unless a flag is explicitly passed.

Signed-off-by: Pedram Razavi <pedram.razavi@gmail.com>
Purpose
The standalone Rust vllm-router binary hardcodes three queue / rate-limiting knobs in CliArgs::to_router_config() and exposes no CLI flags for them:

- queue_size = 100
- queue_timeout_secs = 60
- rate_limit_tokens_per_second = None

…even though RouterConfig already supports all three fields and the Python launcher (py_src/vllm_router/router_args.py) already exposes them as --queue-size / --queue-timeout-secs / --rate-limit-tokens-per-second.

This is a flag drift between the two entry points: configuration that works via the Python launcher silently cannot be set when running the binary directly. In particular, binary users cannot set --queue-size 0, which disables the concurrency queue so the limiter sheds immediately with HTTP 429 (fail-fast) instead of queuing up to 100 requests for up to 60s.
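To make the fail-fast behavior concrete, here is a minimal sketch of an admission decision that has the property described above. It is not the router's actual limiter: the `Admission` enum, the `admit` function, and its parameters are hypothetical names chosen for illustration, and returning 429 when a non-empty queue is full is an assumption.

```rust
/// Outcome of trying to admit one request (hypothetical type, for illustration only).
#[derive(Debug, PartialEq)]
enum Admission {
    /// A concurrency slot is free, so the request runs immediately.
    Run,
    /// No slot is free but the wait queue has room; wait at most `timeout_secs`.
    Queue { timeout_secs: u64 },
    /// No slot and no queue room; reject with this HTTP status.
    Shed { status: u16 },
}

/// Decide what to do with a new request given the current load.
/// Assumption: a full queue is answered with 429, like the queue-less case.
fn admit(
    in_flight: usize,
    max_concurrent: usize,
    queued: usize,
    queue_size: usize,
    queue_timeout_secs: u64,
) -> Admission {
    if in_flight < max_concurrent {
        Admission::Run
    } else if queued < queue_size {
        // With queue_size == 0 this branch can never be taken.
        Admission::Queue { timeout_secs: queue_timeout_secs }
    } else {
        Admission::Shed { status: 429 }
    }
}
```

Under these assumptions, `admit(4, 4, 0, 0, 60)` sheds with 429 at once, while the same load with `queue_size = 100` would wait in the queue for up to 60 seconds.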
This PR adds the three flags to CliArgs and threads them through to_router_config(). Defaults are identical to the previously hardcoded values (100 / 60 / None), so behavior is unchanged unless a flag is explicitly passed. The new --help text mirrors the Python launcher's wording.

Test Plan
Built and exercised the release binary with stable Rust (rustc 1.96.0):
- cargo fmt --check
- cargo check
- cargo clippy --all-targets
- cargo build --release
- ./target/release/vllm-router --help lists the three new flags with defaults
- Boot with --queue-size 0 and confirm RouterConfig::validate() accepts it
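As an illustration of what the plan checks, the following is a simplified, standard-library-only sketch of threading the three flags into the config with defaults equal to the previously hardcoded values. It is not the PR's diff: the `parse` helper is hypothetical (the real binary's argument parsing is not shown here), both structs are reduced to the three fields in question, and the field types are assumptions.

```rust
/// Reduced stand-in for the binary's CLI arguments (field types are assumed).
#[derive(Debug, Clone, PartialEq)]
struct CliArgs {
    queue_size: usize,
    queue_timeout_secs: u64,
    rate_limit_tokens_per_second: Option<usize>,
}

/// Reduced stand-in for the router configuration.
#[derive(Debug, PartialEq)]
struct RouterConfig {
    queue_size: usize,
    queue_timeout_secs: u64,
    rate_limit_tokens_per_second: Option<usize>,
}

impl Default for CliArgs {
    /// Defaults equal the previously hardcoded values: 100 / 60 / None.
    fn default() -> Self {
        Self { queue_size: 100, queue_timeout_secs: 60, rate_limit_tokens_per_second: None }
    }
}

impl CliArgs {
    /// Hypothetical parser for `--flag value` pairs covering only the three flags.
    fn parse(args: &[&str]) -> Result<Self, String> {
        let mut out = Self::default();
        let mut it = args.iter();
        while let Some(flag) = it.next() {
            let value = it.next().ok_or_else(|| format!("missing value for {flag}"))?;
            match *flag {
                "--queue-size" => {
                    out.queue_size = value.parse::<usize>().map_err(|e| format!("{flag}: {e}"))?;
                }
                "--queue-timeout-secs" => {
                    out.queue_timeout_secs =
                        value.parse::<u64>().map_err(|e| format!("{flag}: {e}"))?;
                }
                "--rate-limit-tokens-per-second" => {
                    out.rate_limit_tokens_per_second =
                        Some(value.parse::<usize>().map_err(|e| format!("{flag}: {e}"))?);
                }
                other => return Err(format!("unknown flag {other}")),
            }
        }
        Ok(out)
    }

    /// Thread the parsed values through instead of hardcoding them.
    fn to_router_config(&self) -> RouterConfig {
        RouterConfig {
            queue_size: self.queue_size,
            queue_timeout_secs: self.queue_timeout_secs,
            rate_limit_tokens_per_second: self.rate_limit_tokens_per_second,
        }
    }
}
```

In this sketch, parsing an empty argument list yields 100 / 60 / None, so a caller that passes none of the flags sees the same config as before the change.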
Test Result
- cargo fmt --check, cargo check, cargo clippy --all-targets, and cargo build --release: all clean.
- --help output:
  - --queue-size <QUEUE_SIZE> [default: 100]
  - --queue-timeout-secs <QUEUE_TIMEOUT_SECS> [default: 60]
  - --rate-limit-tokens-per-second <RATE_LIMIT_TOKENS_PER_SECOND> (falls back to --max-concurrent-requests when unset)
- --queue-size 0 logs "Configuration validated successfully" and proceeds past config validation (the limiter then runs queue-less, shedding with HTTP 429 at the concurrency limit).
Essential Elements of an Effective PR Description Checklist
- queue and rate-limit knobs are configurable from the binary.
- cargo fmt/check/clippy/build --release plus a --help and --queue-size 0 boot check.
- --help and --queue-size 0 boot verified (see above).