- 
                Notifications
    You must be signed in to change notification settings 
- Fork 655
feat: mocker disagg #3833
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat: mocker disagg #3833
Conversation
Signed-off-by: PeaBrane <[email protected]>
Signed-off-by: PeaBrane <[email protected]>
| WalkthroughThis PR introduces worker type awareness to the Dynamo Mocker engine by adding prefill/decode mode flags, centralizing CLI argument parsing, and propagating an is_prefill flag through Python and Rust layers to control engine behavior including KV event publishing, max token handling, and endpoint model type selection. Changes
 Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes The changes follow a consistent pattern of threading an  Poem
 Pre-merge checks❌ Failed checks (1 warning, 2 inconclusive)
 Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment  | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️  Outside diff range comments (3)
lib/bindings/python/rust/llm/entrypoint.rs (1)
255-269: Propagateargs.is_prefilltoMockEngineArgs
MockEngineArgsis loaded or defaulted without considering the endpoint’sargs.is_prefill, so the mock engine won’t enforce prefill limits. After constructingmocker_args, assign the flag:- let mocker_args = if let Some(extra_args_path) = args.extra_engine_args { + let mut mocker_args = if let Some(extra_args_path) = args.extra_engine_args { MockEngineArgs::from_json_file(&extra_args_path)? } else { MockEngineArgs::default() }; + mocker_args.is_prefill = args.is_prefill;launch/dynamo-run/src/lib.rs (1)
144-149: Add aprefill_workerflag and wire it tois_prefillThe
Flagsstruct (launch/dynamo-run/src/flags.rs) currently has no prefill indicator, sois_prefillis always set tofalse. Introduce a--prefill-workerboolean inFlagsand pass its value tois_prefillwhen constructingEngineConfig::StaticCorein launch/dynamo-run/src/lib.rs.components/src/dynamo/mocker/main.py (1)
26-34: Decode worker flag silently ignored when--extra-engine-argsis usedRight now, if someone launches the mocker with both
--is-decode-workerand a custom--extra-engine-argsJSON, the branch on Lines 26-34 just forwards that file untouched. As a result,publish_kv_eventsstays whatever the JSON dictated (oftenTrue), so decode workers keep emitting KV events even though the CLI flag promises “does not publish KV events.” This is a regression compared to the non-JSON path wherecreate_temp_engine_args_fileforcespublish_kv_events=False. Please make sure the decode/no-kv toggles are applied regardless of how the extra args are supplied (e.g., merge the override into a temp copy of the supplied JSON or error out if it conflicts).
🧹 Nitpick comments (5)
lib/llm/src/mocker/protocols.rs (1)
101-108: JSON wiring forpublish_kv_eventsandis_prefilllooks correct; add a small mapping test.The builder defaults and parsing logic are sound. Add a unit test to lock behavior and guard against regressions.
Example:
#[test] fn loads_publish_kv_and_is_prefill() { let tmp = tempfile::NamedTempFile::new().unwrap(); std::fs::write(tmp.path(), r#"{ "publish_kv_events": false, "is_prefill": true }"#).unwrap(); let args = MockEngineArgs::from_json_file(tmp.path()).unwrap(); assert!(!args.publish_kv_events); assert!(args.is_prefill); }Also applies to: 132-145, 226-236
lib/llm/src/entrypoint/input/endpoint.rs (1)
70-92: Correctly attaches Prefill vs Chat|Completions based onis_prefill.Good conditional routing of model type with no behavior change for non-prefill.
Consider a trace log on attach indicating the chosen
model_typefor easier debugging.lib/llm/src/mocker/engine.rs (1)
359-366: Also bound scheduler’s requested tokens in prefill mode.You cap streamed tokens to 1, but
DirectRequest.max_output_tokensremains the original value. This can overproduce scheduler work and signals. Clamp it to 1 whenis_prefill.Example adjustment (within
generate):let is_prefill = self.engine_args.is_prefill; let requested_max = request .stop_conditions .max_tokens .expect("max_output_tokens must be specified for mocker") as usize; let effective_max = if is_prefill { 1 } else { requested_max }; let direct_request = DirectRequest { tokens: request.token_ids.clone(), max_output_tokens: effective_max, uuid: Some(request_uuid), dp_rank, };Optional: if a completion signal arrives before
effective_max, send a graceful length finish instead of an error to avoid noisy failures in prefill.components/src/dynamo/mocker/args.py (2)
180-192: Make worker-type flags mutually exclusive.Prevent accidental
--is-prefill-worker+--is-decode-workercombos.Patch:
- # Worker type configuration - parser.add_argument( + # Worker type configuration (mutually exclusive) + group = parser.add_mutually_exclusive_group() + group.add_argument( "--is-prefill-worker", action="store_true", default=False, help="Register as Prefill model type instead of Chat+Completions (default: False)", ) - parser.add_argument( + group.add_argument( "--is-decode-worker", action="store_true", default=False, help="Mark this as a decode worker which does not publish KV events (default: False)", )
53-61: Consider cleaning up the temp JSON automatically.If the main doesn’t delete it, register an atexit hook here to remove the file.
Example:
import atexit # ... temp_path = Path(f.name) atexit.register(lambda p=temp_path: p.exists() and p.unlink(missing_ok=True))
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- benchmarks/router/run_engines.sh(1 hunks)
- components/src/dynamo/mocker/args.py(1 hunks)
- components/src/dynamo/mocker/main.py(2 hunks)
- launch/dynamo-run/src/lib.rs(1 hunks)
- lib/bindings/python/rust/llm/entrypoint.rs(4 hunks)
- lib/llm/src/entrypoint.rs(1 hunks)
- lib/llm/src/entrypoint/input/endpoint.rs(2 hunks)
- lib/llm/src/mocker/engine.rs(3 hunks)
- lib/llm/src/mocker/protocols.rs(3 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#3184
File: docs/architecture/kv_cache_routing.md:70-73
Timestamp: 2025-09-23T20:08:37.105Z
Learning: PeaBrane prefers to keep documentation diagrams simplified to avoid visual overload, even when this means sacrificing some technical precision for the sake of clarity and comprehension. They prioritize pedagogical effectiveness over exhaustive technical detail in architectural diagrams.
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#2756
File: lib/llm/src/kv_router/subscriber.rs:36-44
Timestamp: 2025-08-29T10:03:48.330Z
Learning: PeaBrane prefers to keep PRs contained in scope and is willing to defer technical improvements to future PRs when the current implementation works for the immediate use case. They acknowledge technical debt but prioritize deliverability over completeness in individual PRs.
🧬 Code graph analysis (4)
lib/llm/src/entrypoint/input/endpoint.rs (2)
lib/llm/src/model_card.rs (2)
model_type(550-550)
model_type(711-713)lib/bindings/python/src/dynamo/_core.pyi (1)
ModelType(889-896)
components/src/dynamo/mocker/args.py (1)
lib/llm/src/mocker/protocols.rs (1)
default(111-115)
components/src/dynamo/mocker/main.py (1)
components/src/dynamo/mocker/args.py (2)
create_temp_engine_args_file(19-61)
parse_args(64-202)
lib/bindings/python/rust/llm/entrypoint.rs (2)
lib/llm/src/discovery/model_manager.rs (1)
new(70-81)lib/llm/src/local_model.rs (21)
model_path(90-93)
model_name(95-98)
endpoint_id(100-103)
endpoint_id(399-401)
context_length(105-108)
router_config(136-139)
router_config(372-374)
kv_cache_block_size(111-114)
http_host(116-119)
http_host(356-358)
http_port(121-124)
http_port(360-362)
tls_cert_path(126-129)
tls_cert_path(364-366)
extra_engine_args(166-169)
namespace(141-144)
namespace(380-382)
custom_backend_metrics_endpoint(181-184)
custom_backend_metrics_endpoint(384-386)
custom_backend_metrics_polling_interval(186-189)
custom_backend_metrics_polling_interval(388-390)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
- GitHub Check: Mirror Repository to GitLab
- GitHub Check: tests (lib/bindings/python)
- GitHub Check: tests (.)
- GitHub Check: clippy (.)
- GitHub Check: tests (launch/dynamo-run)
- GitHub Check: clippy (launch/dynamo-run)
- GitHub Check: tests (lib/runtime/examples)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (2)
components/src/dynamo/mocker/args.py (1)
45-52: Good: decode disables KV events, prefill setsis_prefill.This matches the engine semantics introduced in Rust.
benchmarks/router/run_engines.sh (1)
201-205: Prefill/decode worker flags are correctly supported by mocker CLIThe flags
--is-prefill-workerand--is-decode-workerare already defined incomponents/src/dynamo/mocker/args.pyand handled incomponents/src/dynamo/mocker/main.py, so the script will not trigger unknown-argument errors.
Signed-off-by: PeaBrane <[email protected]>
Signed-off-by: PeaBrane <[email protected]>
Signed-off-by: PeaBrane <[email protected]>
Signed-off-by: PeaBrane <[email protected]>
| @PeaBrane This PR is three or more different things. Could you split it into more focused PRs? One for the worker_id removal. One for the  That makes it much easier to locate a change when you  | 
| @grahamking Thanks, I will try to do better. But I think this one of those cases, where the changes are too correlated to be easily broken down into separate PRs. The additional context here is we would to get disagg mockers in a functional state fast for router/planner benchmarking, and ideally we need it soon. The core changes should be contained to the mockers, so should not affect the core dynamo components. That being said, I will try to see if I can break it down to a series of PRs | 
Signed-off-by: PeaBrane <[email protected]>
Signed-off-by: PeaBrane <[email protected]>
Signed-off-by: PeaBrane <[email protected]>
Signed-off-by: PeaBrane <[email protected]>
Signed-off-by: PeaBrane <[email protected]>
Overview:
A continuation of #3847
Major fixes/chores:
KvManagerwould publish the kv events directly over NATs instead of relying on intermediate relaysForwardPassMetricsafter every token generated (instead of after every forward pass)Other minor fixes/chores:
100 - 200so less likely to encounter detokenization failuresScoped for future: