
Conversation

@PeaBrane
Contributor

@PeaBrane PeaBrane commented Oct 22, 2025

Overview:

A continuation of #3847

Major fixes/chores:

  • Refactored the KvManager to publish KV events directly over NATS instead of relying on intermediate relays
  • Removed a very expensive op where we were sending ForwardPassMetrics after every generated token (instead of after every forward pass)
  • Switched hit rate tracking to a running mean data structure (this was the second most expensive op)
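
The running-mean idea in the last bullet can be sketched as follows. This is only an illustration in Python, not the Rust structure used in the PR: the class name `RunningMean`, the fixed window, and the method names are all hypothetical. The sketch keeps a cached sum next to a bounded window, so each update costs O(1) instead of re-summing the history.

```python
from collections import deque


class RunningMean:
    """Hypothetical fixed-window running mean with O(1) updates."""

    def __init__(self, window):
        if window <= 0:
            raise ValueError("window must be positive")
        self._window = window
        self._values = deque()
        self._total = 0.0

    def push(self, value):
        # Add the new sample, then evict the oldest one once the window is
        # exceeded, adjusting the cached sum instead of recomputing it.
        self._values.append(value)
        self._total += value
        if len(self._values) > self._window:
            self._total -= self._values.popleft()

    def mean(self):
        # An empty tracker reports 0.0 rather than dividing by zero.
        return self._total / len(self._values) if self._values else 0.0


tracker = RunningMean(window=3)
for hit_rate in (1.0, 0.0, 1.0, 1.0):
    tracker.push(hit_rate)
```

Whether the PR uses a windowed or a cumulative mean is not stated in the description; the windowed variant is an assumption made here for illustration.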

Other minor fixes/chores:

  • Limited the mocker random token range to 100 - 200 to make detokenization failures less likely
  • Updated mocker timing estimates with new planner sweeps on H200

Scoped for future:

  • Some benchmarking with it
  • Publish events over ZMQ so that our KV event publisher can be tested in CI as well
  • Simulate NIXL transfer latency (currently assumed to be 0)

Signed-off-by: PeaBrane <[email protected]>
Signed-off-by: PeaBrane <[email protected]>
@pull-request-size pull-request-size bot added size/L and removed size/M labels Oct 22, 2025
@PeaBrane PeaBrane marked this pull request as ready for review October 22, 2025 22:06
@PeaBrane PeaBrane requested review from a team as code owners October 22, 2025 22:06
@coderabbitai
Contributor

coderabbitai bot commented Oct 22, 2025

Walkthrough

This PR introduces worker type awareness to the Dynamo Mocker engine by adding prefill/decode mode flags, centralizing CLI argument parsing, and propagating an is_prefill flag through Python and Rust layers to control engine behavior including KV event publishing, max token handling, and endpoint model type selection.

Changes

Cohort (File(s)): Change Summary

  • Mocker shell script invocation (benchmarks/router/run_engines.sh): Appends mode-aware flags to MOCKER_ARGS per worker loop: --is-prefill-worker when MODE is "prefill", --is-decode-worker when MODE is "decode"
  • Python CLI argument parsing (components/src/dynamo/mocker/args.py): New module providing parse_args() for comprehensive CLI interface and create_temp_engine_args_file(args) to build engine config from CLI arguments, write to temp JSON file, and return path. Supports worker-type flags, KV events toggling, and legacy extra engine args file
  • Python worker refactoring (components/src/dynamo/mocker/main.py): Removed inline cmd_line_args() function; replaced with imports from .args. Worker now uses parse_args() and either consumes provided extra_engine_args or generates temp file via create_temp_engine_args_file(). EntrypointArgs construction now includes is_prefill=args.is_prefill_worker
  • Rust engine configuration (launch/dynamo-run/src/lib.rs, lib/llm/src/entrypoint.rs): Added is_prefill: bool field to EngineConfig::StaticCore variant, updating the enum signature and construction path for Mocker engine output
  • Rust bindings and entrypoint (lib/bindings/python/rust/llm/entrypoint.rs): Added is_prefill: bool field to EntrypointArgs struct with PyO3 binding (default false). Updated constructor signature and propagated field through engine selection and Mocker engine configuration
  • Endpoint model type selection (lib/llm/src/entrypoint/input/endpoint.rs): Updated StaticCore pattern match to destructure is_prefill. Sets model_type to Prefill when is_prefill is true, otherwise Chat | Completions
  • Mocker engine logic (lib/llm/src/mocker/engine.rs): For prefill workers: override max_tokens to 1, add dummy disaggregated_params to output payload. KV events publishing now requires both enable_prefix_caching and publish_kv_events to be true (previously only enable_prefix_caching)
  • Mocker configuration (lib/llm/src/mocker/protocols.rs): Added publish_kv_events: bool (default true) and is_prefill: bool (default false) fields to MockEngineArgs struct; wired through JSON builder path for extra_args overrides
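
The engine-side rules in the change summary above (when KV events are published, how prefill workers cap output, and which model type is registered) can be condensed into a small sketch. It is written in Python for brevity and is not the Rust implementation: `EngineArgsSketch` and the three helper functions are hypothetical names, and the `enable_prefix_caching` default is an assumption.

```python
from dataclasses import dataclass


@dataclass
class EngineArgsSketch:
    # Hypothetical stand-in for the fields named in the summary. The defaults
    # for publish_kv_events and is_prefill follow the summary; the default for
    # enable_prefix_caching is an assumption.
    enable_prefix_caching: bool = True
    publish_kv_events: bool = True
    is_prefill: bool = False


def should_publish_kv_events(args: EngineArgsSketch) -> bool:
    # Both switches must be on; prefix caching alone is no longer enough.
    return args.enable_prefix_caching and args.publish_kv_events


def effective_max_tokens(args: EngineArgsSketch, requested: int) -> int:
    # Prefill workers emit a single token regardless of the request.
    return 1 if args.is_prefill else requested


def model_type(args: EngineArgsSketch) -> str:
    # Illustrative labels only; the Rust code selects a ModelType value.
    return "Prefill" if args.is_prefill else "Chat|Completions"
```

For example, `EngineArgsSketch(publish_kv_events=False)` models a decode worker: prefix caching stays on, but `should_publish_kv_events` returns `False`.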

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

The changes follow a consistent pattern of threading an is_prefill flag through multiple layers (Python CLI → Rust bindings → engine config → worker logic). While the file count is moderate-to-high (~9 files), the modifications are largely homogeneous plumbing changes with localized logic implementations in the engine and endpoint layers. Understanding the flow requires tracing across layers, but individual edits remain straightforward.

Poem

🐰 A prefill flag hops through the code,

From shell to Python, Rust to load,

With KV events and tokens refined,

Workers now know their dispatch kind! ✨

Pre-merge checks

❌ Failed checks (1 warning, 2 inconclusive)
  • Docstring Coverage (⚠️ Warning)
    Explanation: Docstring coverage is 63.64% which is insufficient. The required threshold is 80.00%.
    Resolution: You can run @coderabbitai generate docstrings to improve docstring coverage.
  • Title Check (❓ Inconclusive)
    Explanation: The title “feat: mocker disagg” is too terse and uses an unclear abbreviation that does not clearly convey the main change of adding prefill mode support and disaggregated parameters in the mocker engine. It is related to the mocker component but is vague about the actual feature being introduced.
    Resolution: Please revise the title to explicitly describe the primary feature, for example “feat: add prefill worker support and disaggregated parameters to mocker engine,” to make the change clear at a glance.
  • Description Check (❓ Inconclusive)
    Explanation: The pull request description provides substantive information about the changes in the Overview section, including specific details about major fixes (KvManager refactoring, ForwardPassMetrics removal, hit rate tracking), minor fixes (mocker token range, timing estimates), and future work. However, the description deviates significantly from the required template structure by lacking three specified section headings: a dedicated "#### Details:" section, a "#### Where should the reviewer start?" section calling out specific files for review, and a "#### Related Issues:" section with action keywords (though #3847 is mentioned inline in the Overview). While the content quality is good and directly relevant to the PR objectives, the structural mismatch with the explicit template creates ambiguity about whether it fully satisfies the documentation requirements.
    Resolution: To fully meet the template requirements, consider reorganizing the description to include all four sections with their specified headings. Specifically, add a dedicated "#### Details:" section that summarizes the changes, a "#### Where should the reviewer start?" section that calls out key files like lib/llm/src/mocker/engine.rs and lib/llm/src/mocker/protocols.rs for focused review, and a "#### Related Issues:" section that properly references #3847 using an action keyword such as "Relates to #3847". This will ensure the description aligns with the repository's documentation standards.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
lib/bindings/python/rust/llm/entrypoint.rs (1)

255-269: Propagate args.is_prefill to MockEngineArgs

MockEngineArgs is loaded or defaulted without considering the endpoint’s args.is_prefill, so the mock engine won’t enforce prefill limits. After constructing mocker_args, assign the flag:

-            let mocker_args = if let Some(extra_args_path) = args.extra_engine_args {
+            let mut mocker_args = if let Some(extra_args_path) = args.extra_engine_args {
                 MockEngineArgs::from_json_file(&extra_args_path)? 
             } else {
                 MockEngineArgs::default()
             };
+            mocker_args.is_prefill = args.is_prefill;

launch/dynamo-run/src/lib.rs (1)

144-149: Add a prefill_worker flag and wire it to is_prefill

The Flags struct (launch/dynamo-run/src/flags.rs) currently has no prefill indicator, so is_prefill is always set to false. Introduce a --prefill-worker boolean in Flags and pass its value to is_prefill when constructing EngineConfig::StaticCore in launch/dynamo-run/src/lib.rs.

components/src/dynamo/mocker/main.py (1)

26-34: Decode worker flag silently ignored when --extra-engine-args is used

Right now, if someone launches the mocker with both --is-decode-worker and a custom --extra-engine-args JSON, the branch on Lines 26-34 just forwards that file untouched. As a result, publish_kv_events stays whatever the JSON dictated (often True), so decode workers keep emitting KV events even though the CLI flag promises “does not publish KV events.” This is a regression compared to the non-JSON path where create_temp_engine_args_file forces publish_kv_events=False. Please make sure the decode/no-kv toggles are applied regardless of how the extra args are supplied (e.g., merge the override into a temp copy of the supplied JSON or error out if it conflicts).
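
One way to apply the worker-type toggles regardless of how the extra args are supplied is to merge them into a temp copy of the user's JSON, as suggested above. The following is a minimal sketch under stated assumptions, not the code in main.py: the helper name `apply_worker_overrides` and its parameters are hypothetical, and only the `is_prefill` and `publish_kv_events` keys come from the review.

```python
import json
import tempfile
from pathlib import Path


def apply_worker_overrides(extra_args_path, is_prefill, is_decode):
    """Hypothetical helper: copy the user's JSON and force worker-type fields."""
    if is_prefill and is_decode:
        raise ValueError("a worker cannot be both prefill and decode")

    config = json.loads(Path(extra_args_path).read_text())
    if is_prefill:
        config["is_prefill"] = True
    if is_decode:
        # The CLI help promises that decode workers do not publish KV events.
        config["publish_kv_events"] = False

    # Write to a new temp file so the user's original file is left untouched.
    with tempfile.NamedTemporaryFile(
        mode="w", suffix=".json", delete=False
    ) as f:
        json.dump(config, f)
        return Path(f.name)
```

The alternative the review mentions, erroring out when the JSON conflicts with the flag, is stricter but avoids silently changing a value the user wrote explicitly.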

🧹 Nitpick comments (5)
lib/llm/src/mocker/protocols.rs (1)

101-108: JSON wiring for publish_kv_events and is_prefill looks correct; add a small mapping test.

The builder defaults and parsing logic are sound. Add a unit test to lock behavior and guard against regressions.

Example:

#[test]
fn loads_publish_kv_and_is_prefill() {
    let tmp = tempfile::NamedTempFile::new().unwrap();
    std::fs::write(tmp.path(), r#"{ "publish_kv_events": false, "is_prefill": true }"#).unwrap();
    let args = MockEngineArgs::from_json_file(tmp.path()).unwrap();
    assert!(!args.publish_kv_events);
    assert!(args.is_prefill);
}

Also applies to: 132-145, 226-236

lib/llm/src/entrypoint/input/endpoint.rs (1)

70-92: Correctly attaches Prefill vs Chat|Completions based on is_prefill.

Good conditional routing of model type with no behavior change for non-prefill.

Consider a trace log on attach indicating the chosen model_type for easier debugging.

lib/llm/src/mocker/engine.rs (1)

359-366: Also bound scheduler’s requested tokens in prefill mode.

You cap streamed tokens to 1, but DirectRequest.max_output_tokens remains the original value. This can overproduce scheduler work and signals. Clamp it to 1 when is_prefill.

Example adjustment (within generate):

let is_prefill = self.engine_args.is_prefill;
let requested_max = request
    .stop_conditions
    .max_tokens
    .expect("max_output_tokens must be specified for mocker") as usize;

let effective_max = if is_prefill { 1 } else { requested_max };

let direct_request = DirectRequest {
    tokens: request.token_ids.clone(),
    max_output_tokens: effective_max,
    uuid: Some(request_uuid),
    dp_rank,
};

Optional: if a completion signal arrives before effective_max, send a graceful length finish instead of an error to avoid noisy failures in prefill.

components/src/dynamo/mocker/args.py (2)

180-192: Make worker-type flags mutually exclusive.

Prevent accidental --is-prefill-worker + --is-decode-worker combos.

Patch:

-    # Worker type configuration
-    parser.add_argument(
+    # Worker type configuration (mutually exclusive)
+    group = parser.add_mutually_exclusive_group()
+    group.add_argument(
         "--is-prefill-worker",
         action="store_true",
         default=False,
         help="Register as Prefill model type instead of Chat+Completions (default: False)",
     )
-    parser.add_argument(
+    group.add_argument(
         "--is-decode-worker",
         action="store_true",
         default=False,
         help="Mark this as a decode worker which does not publish KV events (default: False)",
     )
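
The effect of the patch above can be checked with a small standalone parser. This sketch reuses only the two flag names from the patch; the program name `mocker-sketch` and the helper `parse_worker_flags` are hypothetical and exist only to show how argparse reacts when both flags are passed.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="mocker-sketch")
    # Flags in the same mutually exclusive group cannot be combined.
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--is-prefill-worker", action="store_true")
    group.add_argument("--is-decode-worker", action="store_true")
    return parser


def parse_worker_flags(argv):
    """Return (is_prefill, is_decode), or None if argparse rejects argv."""
    try:
        ns = build_parser().parse_args(argv)
    except SystemExit:
        # argparse prints a usage error and exits when both flags are given.
        return None
    return ns.is_prefill_worker, ns.is_decode_worker
```

With this grouping, a launch script that accidentally passes both flags fails at startup instead of producing a worker with contradictory settings.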

53-61: Consider cleaning up the temp JSON automatically.

If the main doesn’t delete it, register an atexit hook here to remove the file.

Example:

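import atexit
from pathlib import Path
# ...
temp_path = Path(f.name)
# Remove the temp file at exit; missing_ok=True makes a separate exists() check unnecessary.
atexit.register(lambda p=temp_path: p.unlink(missing_ok=True))

📜 Review details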

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fb294b9 and 127883b.

📒 Files selected for processing (9)
  • benchmarks/router/run_engines.sh (1 hunks)
  • components/src/dynamo/mocker/args.py (1 hunks)
  • components/src/dynamo/mocker/main.py (2 hunks)
  • launch/dynamo-run/src/lib.rs (1 hunks)
  • lib/bindings/python/rust/llm/entrypoint.rs (4 hunks)
  • lib/llm/src/entrypoint.rs (1 hunks)
  • lib/llm/src/entrypoint/input/endpoint.rs (2 hunks)
  • lib/llm/src/mocker/engine.rs (3 hunks)
  • lib/llm/src/mocker/protocols.rs (3 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#3184
File: docs/architecture/kv_cache_routing.md:70-73
Timestamp: 2025-09-23T20:08:37.105Z
Learning: PeaBrane prefers to keep documentation diagrams simplified to avoid visual overload, even when this means sacrificing some technical precision for the sake of clarity and comprehension. They prioritize pedagogical effectiveness over exhaustive technical detail in architectural diagrams.
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#2756
File: lib/llm/src/kv_router/subscriber.rs:36-44
Timestamp: 2025-08-29T10:03:48.330Z
Learning: PeaBrane prefers to keep PRs contained in scope and is willing to defer technical improvements to future PRs when the current implementation works for the immediate use case. They acknowledge technical debt but prioritize deliverability over completeness in individual PRs.
🧬 Code graph analysis (4)
lib/llm/src/entrypoint/input/endpoint.rs (2)
lib/llm/src/model_card.rs (2)
  • model_type (550-550)
  • model_type (711-713)
lib/bindings/python/src/dynamo/_core.pyi (1)
  • ModelType (889-896)
components/src/dynamo/mocker/args.py (1)
lib/llm/src/mocker/protocols.rs (1)
  • default (111-115)
components/src/dynamo/mocker/main.py (1)
components/src/dynamo/mocker/args.py (2)
  • create_temp_engine_args_file (19-61)
  • parse_args (64-202)
lib/bindings/python/rust/llm/entrypoint.rs (2)
lib/llm/src/discovery/model_manager.rs (1)
  • new (70-81)
lib/llm/src/local_model.rs (21)
  • model_path (90-93)
  • model_name (95-98)
  • endpoint_id (100-103)
  • endpoint_id (399-401)
  • context_length (105-108)
  • router_config (136-139)
  • router_config (372-374)
  • kv_cache_block_size (111-114)
  • http_host (116-119)
  • http_host (356-358)
  • http_port (121-124)
  • http_port (360-362)
  • tls_cert_path (126-129)
  • tls_cert_path (364-366)
  • extra_engine_args (166-169)
  • namespace (141-144)
  • namespace (380-382)
  • custom_backend_metrics_endpoint (181-184)
  • custom_backend_metrics_endpoint (384-386)
  • custom_backend_metrics_polling_interval (186-189)
  • custom_backend_metrics_polling_interval (388-390)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
  • GitHub Check: Mirror Repository to GitLab
  • GitHub Check: tests (lib/bindings/python)
  • GitHub Check: tests (.)
  • GitHub Check: clippy (.)
  • GitHub Check: tests (launch/dynamo-run)
  • GitHub Check: clippy (launch/dynamo-run)
  • GitHub Check: tests (lib/runtime/examples)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (2)
components/src/dynamo/mocker/args.py (1)

45-52: Good: decode disables KV events, prefill sets is_prefill.

This matches the engine semantics introduced in Rust.

benchmarks/router/run_engines.sh (1)

201-205: Prefill/decode worker flags are correctly supported by mocker CLI

The flags --is-prefill-worker and --is-decode-worker are already defined in components/src/dynamo/mocker/args.py and handled in components/src/dynamo/mocker/main.py, so the script will not trigger unknown-argument errors.

Signed-off-by: PeaBrane <[email protected]>
Signed-off-by: PeaBrane <[email protected]>
Signed-off-by: PeaBrane <[email protected]>
@grahamking
Contributor

grahamking commented Oct 23, 2025

@PeaBrane This PR is three or more different things. Could you split it into more focused PRs? One for the worker_id removal. One for the --is-prefill param. And so on.

That makes it much easier to locate a change when you git blame, easier to revert a specific change, easier to review, easier for someone reading the logs to understand what is changing in the project.

@PeaBrane
Contributor Author

PeaBrane commented Oct 23, 2025

@grahamking Thanks, I will try to do better. But I think this is one of those cases where the changes are too correlated to be easily broken down into separate PRs. The additional context here is that we would like to get disagg mockers into a functional state fast for router/planner benchmarking, and ideally we need it soon.

The core changes should be contained to the mockers, so they should not affect the core dynamo components. That being said, I will try to see if I can break it down into a series of PRs.


4 participants