[Hackathon] meta-backend: realistic transport plugin (latency, jitter, queueing, loss)#10
Open
mariagorskikh wants to merge 3 commits into
The Tier 1 simulator hardwired zero-latency delivery (the event-queue push used time = now), which made mean_latency, throughput, and duration metrics report 0.0 for any scenario and left timing-sensitive protocols untestable. This introduces NetworkModel: a small Protocol with one method, schedule(sender, target, payload_size, t_now, rng) -> float | None, that the simulator queries for every send. Returning a time advances the deliver event; returning None signals a transport-level drop. The default ZeroLatencyNetworkModel reproduces the previous behavior byte-for-byte, so existing traces and validators are unchanged. The hook also distinguishes between scenario-level failure-injection drops (reason: failure_injection / partition) and transport-level drops (reason: network) in the JSONL trace. Tests cover backwards-compat, latency propagation, drop semantics, and determinism under a custom NetworkModel.
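A minimal sketch of the hook contract described above may help. Only the `schedule` signature and the name `ZeroLatencyNetworkModel` come from this PR; the `str` agent ids, `FixedDelayLossyModel`, and `simulate_sends` are hypothetical stand-ins, not the code in `nest_core`.

```python
from __future__ import annotations

import random
from typing import Protocol


class NetworkModel(Protocol):
    """Hook queried once per send (signature as quoted in the PR text)."""

    def schedule(self, sender: str, target: str, payload_size: int,
                 t_now: float, rng: random.Random) -> float | None:
        ...


class ZeroLatencyNetworkModel:
    """Deliver at the send time, matching the old `time = now` behaviour."""

    def schedule(self, sender, target, payload_size, t_now, rng):
        return t_now


class FixedDelayLossyModel:
    """Hypothetical custom model: constant delay plus Bernoulli loss."""

    def __init__(self, delay_s: float, loss_rate: float) -> None:
        self.delay_s = delay_s
        self.loss_rate = loss_rate

    def schedule(self, sender, target, payload_size, t_now, rng):
        if rng.random() < self.loss_rate:
            return None  # transport-level drop
        return t_now + self.delay_s


def simulate_sends(model: NetworkModel, seed: int, n: int = 5):
    """Toy stand-in for the simulator loop: one send per simulated second."""
    rng = random.Random(seed)  # seeded RNG keeps runs reproducible
    return [model.schedule("a", "b", 64, float(t), rng) for t in range(n)]
```

Because all randomness flows through the `rng` argument, two calls to `simulate_sends` with the same model and seed return the same list of delivery times and drops.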
…, loss)

The plugin gives NEST a per-hop network model that exercises every existing latency-aware metric the simulator already supports but which the bundled zero-latency in_memory transport leaves at 0.

Knobs (all configurable via layers.transport_config in scenario YAML):
- base_latency_ms: mean propagation per hop
- jitter_sigma: lognormal jitter shape (heavy tail, like real networks)
- bandwidth_bps: per-agent egress link rate, which forces serialization delay (payload_size * 8 / bandwidth) and back-to-back queueing
- max_queue_bytes: drop-tail load shedding on the egress queue
- loss_rate: per-hop Bernoulli packet loss
- links: per-pair overrides for modeling cross-region or flaky links

The scenario runner picks transport: realistic out of the YAML and forwards transport_config to RealisticNetwork.from_config; everything else (scenarios, validators, agents) is unchanged. Determinism is preserved: the simulator passes a seeded RNG so byte-identical traces still hold across runs with the same seed.

Includes 28 unit tests (validation, latency, queueing, loss, determinism, from_config) and 5 runner-level end-to-end tests, plus scenarios/marketplace_realistic.yaml as a worked example.
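To illustrate why a lognormal shape gives a heavy tail, here is a small sketch that samples per-hop latency and compares the median, mean, and 99th percentile. It assumes the jitter is parameterized so that the distribution's mean equals `base_latency_ms`; the plugin's actual parameterization is not shown here, and `sample_hop_latency_ms` is a hypothetical helper.

```python
import math
import random
import statistics


def sample_hop_latency_ms(base_latency_ms: float, jitter_sigma: float,
                          rng: random.Random) -> float:
    """Draw one latency sample whose mean is base_latency_ms (assumption)."""
    # For a lognormal, mean = exp(mu + sigma^2 / 2), so shift mu accordingly.
    mu = math.log(base_latency_ms) - jitter_sigma ** 2 / 2
    return rng.lognormvariate(mu, jitter_sigma)


rng = random.Random(42)
samples = sorted(sample_hop_latency_ms(20.0, 0.5, rng) for _ in range(20_000))
mean = statistics.fmean(samples)
median = samples[len(samples) // 2]
p99 = samples[int(0.99 * len(samples))]
print(f"median={median:.1f} ms  mean={mean:.1f} ms  p99={p99:.1f} ms")
```

The median sits below the mean and the 99th percentile sits well above it, which is the asymmetric tail a symmetric Gaussian jitter would not produce.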
Layer picked: Transport (#1)
Why
The README is candid about it:
So an entire family of protocol properties — tail latency, retry/backoff
behavior, deadline budgets, congestion response, queue-shed strategies —
is currently invisible to NEST users. The metrics module already
computes mean_latency, throughput, and duration; they just always
report 0.0 because the only shipped transport is zero-latency.
This PR plugs that hole.
Core idea
Two layers, kept deliberately small:
- NetworkModel hook in the simulator (nest_core.sim.network). A Protocol with one method: schedule(sender, target, payload_size, t_now, rng) -> float | None. The simulator queries it for every send; the returned time becomes the deliver event's timestamp. None means transport-level drop. Default is ZeroLatencyNetworkModel, so existing traces are byte-identical without code changes.
- RealisticNetwork reference plugin (nest_plugins_reference.transport.realistic). Implements NetworkModel with the small set of knobs a backend engineer actually reaches for:
  - base_latency_ms + jitter_sigma: lognormal jitter so the tail behaves like a real network (heavy, asymmetric), not a Gaussian toy.
  - bandwidth_bps: payload-size-aware serialization delay (bytes * 8 / bw). A 1 KB message on a 1 Mbps link takes about 8 ms to serialize, versus about 0.5 ms for a 64 B message. Back-to-back sends serialize: the second message can't depart until the first finishes transmitting. This is where mean_latency stops being constant and starts to show the load curve.
  - max_queue_bytes: drop-tail backpressure when the egress queue overflows. The crude-but-honest baseline; a real engineer can swap in CoDel later.
  - loss_rate: per-hop Bernoulli packet loss at the link layer, orthogonal to (and separately attributable from) the scenario's failures.message_drop.
  - links: per-pair overrides. Specific (sender, target) pairs can carry their own latency / jitter / bandwidth / loss for modeling cross-region hops or hot pairs.
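The interaction of these knobs can be sketched for a single sender's egress link. This is an illustrative model under stated assumptions, not the `RealisticNetwork` implementation: the class name `EgressLinkSketch`, the order of the queue and loss checks, the choice to count the message currently being transmitted against `max_queue_bytes`, and the mean-preserving jitter factor are all hypothetical.

```python
import random


class EgressLinkSketch:
    """One sender's egress link: serialization, queueing, drop-tail, loss."""

    def __init__(self, base_latency_ms: float, jitter_sigma: float,
                 bandwidth_bps: float, max_queue_bytes: int,
                 loss_rate: float) -> None:
        self.base_latency_s = base_latency_ms / 1000.0
        self.jitter_sigma = jitter_sigma
        self.bandwidth_bps = bandwidth_bps
        self.max_queue_bytes = max_queue_bytes
        self.loss_rate = loss_rate
        self.busy_until = 0.0  # time the link finishes its current backlog
        self._backlog = []     # (finish_time, size) of untransmitted messages

    def schedule(self, payload_size: int, t_now: float,
                 rng: random.Random):
        # Forget messages that have already left the link.
        self._backlog = [(t, n) for t, n in self._backlog if t > t_now]
        queued = sum(n for _, n in self._backlog)
        if queued + payload_size > self.max_queue_bytes:
            return None  # drop-tail: no room in the egress queue
        if rng.random() < self.loss_rate:
            return None  # Bernoulli loss on the link
        # Back-to-back sends serialize: wait for the link to free up.
        start = max(t_now, self.busy_until)
        finish = start + payload_size * 8 / self.bandwidth_bps
        self.busy_until = finish
        self._backlog.append((finish, payload_size))
        # Multiplicative lognormal jitter with mean 1 (mu = -sigma^2 / 2).
        s = self.jitter_sigma
        jitter = rng.lognormvariate(-s * s / 2, s)
        return finish + self.base_latency_s * jitter
```

With `jitter_sigma=0` and `loss_rate=0`, two 1000-byte sends at `t_now=0` on a 1 Mbps link with 10 ms base latency are delivered 8 ms apart, since the second message waits for the first to finish serializing.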
Drops in the trace now carry a reason field: "network" (this plugin or any custom NetworkModel), "failure_injection" (scenario-level Bernoulli drop), or "partition" (cross-group send). Attribution that previously didn't exist.
How to test
Build-from-source (uv) or just run pytest after editable installs:
End-to-end via the bundled scenario:
nest run scenarios/marketplace_realistic.yaml  # trace now has non-zero ts everywhere; report.html shows real latency curves

Quick interactive sanity check (what I used to validate the wiring):
Before this PR: mean_latency == 0.0, duration == 0.0, throughput == 0.0.

Key assumptions
- … must produce a byte-identical trace under the same seed. Default network_model=None short-circuits to the zero-latency model used before; the simulator's RNG plumbing splits failure-injection and network-model RNGs so byzantine/partition draws don't shift.
- … into NetworkModel.schedule, so traces remain byte-identical across runs at the same seed, including jitter and loss.
- … is for stressing the protocol that runs on top of TCP, not for reimplementing TCP. The README's "no TCP/gRPC/HTTP" limitation still stands and is reworded to reflect the new option.
- … {from, to, ...}), forwarded verbatim into RealisticNetwork.from_config. Malformed entries are silently dropped rather than failing the run, the same failure mode the scenario loader uses for partition groups.
Persona
Meta backend engineer who has spent too many quarters tuning Thrift /
MCRouter under load and thinks "tail latency" first, "happy path"
second.
Future work (deliberately out of scope here)
- … protocols have something to react to.
- … a→b slower than b→a).
- … star, datacenter clos, hub-and-spoke) instead of enumerating pairs.
- … second reference plugin, keeping realistic as the "physical layer".
- HtmlReport panel with latency CDFs / P50-P99 per pair, surfacing the data that's now in the trace.
https://claude.ai/code/session_01C5j2D4MgCkPgsjSCqBVpWW
Generated by Claude Code