[Hackathon] meta-backend: realistic transport plugin (latency, jitter, queueing, loss) by mariagorskikh · Pull Request #10 · projnanda/nandatown

mariagorskikh · 2026-05-26T19:01:07Z

Layer picked: Transport (#1)

Why

The README is candid about it:

"The default transport is zero-latency. ... mean_latency / duration
will both be 0.0 in your trace. Latency numbers become meaningful
only when ... you write a transport plugin that introduces per-hop delay."

So an entire family of protocol properties — tail latency, retry/backoff
behavior, deadline budgets, congestion response, queue-shed strategies —
is currently invisible to NEST users. The metrics module already
computes mean_latency, throughput, and duration; they just always
report 0.0 because the only shipped transport is zero-latency.

This PR plugs that hole.

Core idea

Two layers, kept deliberately small:

1. NetworkModel hook in the simulator (nest_core.sim.network).
   A Protocol with one method:
   schedule(sender, target, payload_size, t_now, rng) -> float | None.
   The simulator queries it for every send; the returned time becomes
   the deliver event's timestamp. None means transport-level drop.
   Default is ZeroLatencyNetworkModel, so existing traces are
   byte-identical without code changes.
2. RealisticNetwork reference plugin (nest_plugins_reference.transport.realistic).
   Implements NetworkModel with the small set of knobs a backend
   engineer actually reaches for:
- base_latency_ms + jitter_sigma — lognormal jitter so the
  tail behaves like a real network (heavy, asymmetric), not a Gaussian toy.
- bandwidth_bps — payload-size-aware serialization delay
  (bytes * 8 / bw). On a 1 Mbps link, a 1 KB message takes about 8 ms
  to serialize, versus about 0.5 ms for a 64 B message.
- Egress queueing — each sender has its own virtual egress link.
  Back-to-back sends serialize: the second message can't depart until
  the first finishes transmitting. This is where mean_latency stops
  being constant and starts to show the load curve.
- max_queue_bytes — drop-tail backpressure when the egress queue
  overflows. The crude-but-honest baseline; a real engineer can swap
  in CoDel later.
- loss_rate — per-hop Bernoulli packet loss at the link layer,
  orthogonal to (and separately attributable from) the scenario's
  failures.message_drop.
- Per-link overrides — single (sender, target) pairs can carry
  their own latency / jitter / bandwidth / loss for modeling
  cross-region hops or hot pairs.

Drops in the trace now carry a reason field: "network" (this plugin
or any custom NetworkModel), "failure_injection" (scenario-level
Bernoulli drop), or "partition" (cross-group send). Attribution that
previously didn't exist.
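As a rough illustration of what the reason field enables, the sketch below tallies drops per reason from JSONL trace lines. Only the reason field and its three values come from this description; the "event" key, the "drop" and "deliver" values, and the sample records are assumptions about the trace schema.

```python
import json
from collections import Counter
from typing import Iterable


def drops_by_reason(lines: Iterable[str]) -> Counter:
    """Count drop records per reason in an iterable of JSONL lines."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # "event" / "drop" are assumed names; only "reason" is from the PR text.
        if record.get("event") == "drop":
            counts[record.get("reason", "unknown")] += 1
    return counts


# Hand-written sample records, not real trace output.
sample = [
    '{"event": "drop", "reason": "network"}',
    '{"event": "deliver"}',
    '{"event": "drop", "reason": "partition"}',
    '{"event": "drop", "reason": "network"}',
]
print(drops_by_reason(sample))
```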

How to test

Build-from-source (uv) or just run pytest after editable installs:

# all green: 240 tests (38 reference plugin + 16 hypothesis + everything else)
pytest packages/nest-core/tests/ packages/nest-plugins-reference/tests/

# the new surface specifically
pytest packages/nest-plugins-reference/tests/test_realistic_transport.py -v   # 28 tests
pytest packages/nest-core/tests/test_network_model.py -v                       # 9 tests
pytest packages/nest-core/tests/test_runner_realistic.py -v                    # 5 tests

End-to-end via the bundled scenario:

nest run scenarios/marketplace_realistic.yaml
# trace now has non-zero ts everywhere; report.html shows real latency curves

Quick interactive sanity check (what I used to validate the wiring):

import asyncio
from nest_core.scenario import ScenarioConfig
from nest_core.runner import ScenarioRunner

cfg = ScenarioConfig.from_yaml("scenarios/marketplace_realistic.yaml")
cfg.duration = "ticks: 3000"

async def go():
    r = ScenarioRunner(cfg)
    await r.run()
    print(r.metrics)

asyncio.run(go())
# {'mean_latency': 0.0055, 'throughput': 14735, 'duration': 0.131, ...}

Before this PR: mean_latency == 0.0, duration == 0.0, throughput == 0.0.

Key assumptions

- Backwards compatibility is non-negotiable. Every existing scenario
  must produce a byte-identical trace under the same seed. Default
  network_model=None short-circuits to the zero-latency model used
  before; the simulator's RNG plumbing splits failure-injection and
  network-model RNGs so byzantine/partition draws don't shift.
- Determinism is preserved. The simulator passes its own seeded RNG
  into NetworkModel.schedule, so traces remain byte-identical across
  runs at the same seed, including jitter and loss.
- The model stays inside Tier 1. No threads, no real sockets. This
  is for stressing the protocol that runs on top of TCP, not for
  reimplementing TCP. The README's "no TCP/gRPC/HTTP" limitation still
  stands and is reworded to reflect the new option.
- Per-link config is a flat list in YAML ({from, to, ...}),
  forwarded verbatim into RealisticNetwork.from_config. Malformed
  entries are silently dropped rather than failing the run, the same
  failure mode the scenario loader uses for partition groups.
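The RNG split mentioned above can be sketched as follows. This is one possible way to do it, not necessarily how the simulator does: split_rngs is a hypothetical helper, and it assumes the pre-existing failure-injection stream was seeded directly with the scenario seed.

```python
import random
from typing import List, Tuple


def split_rngs(seed: int) -> Tuple[random.Random, random.Random]:
    """Derive separate failure-injection and network-model streams."""
    # Assumption: the legacy stream was Random(seed); leaving it untouched
    # keeps failure-injection draws identical to older runs.
    failure_rng = random.Random(seed)
    # String seeds are hashed deterministically by random.Random.
    network_rng = random.Random(f"{seed}:network_model")
    return failure_rng, network_rng


def failure_draws(seed: int, network_draws_per_send: int) -> List[float]:
    """Failure-injection draws for five sends, with interleaved network draws."""
    failure_rng, network_rng = split_rngs(seed)
    draws = []
    for _ in range(5):
        for _ in range(network_draws_per_send):
            network_rng.random()  # e.g. jitter or loss draws
        draws.append(failure_rng.random())
    return draws


# The failure stream does not shift, however many network draws happen.
print(failure_draws(42, 0) == failure_draws(42, 3))
```

With a single shared RNG, every jitter or loss draw would advance the stream and change which sends the failure injector drops; separate streams avoid that coupling.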

Persona

Meta backend engineer who has spent too many quarters tuning Thrift /
MCRouter under load and thinks "tail latency" first, "happy path"
second.

Future work (deliberately out of scope here)

- AQM (CoDel / PIE) and ECN signaling on the egress queue so adaptive
  protocols have something to react to.
- Asymmetric per-direction link config (a→b slower than b→a).
- A topology helper: build per-link config from a graph YAML (rings,
  star, datacenter clos, hub-and-spoke) instead of enumerating pairs.
- TCP-like behaviors layered on top (windowing, fast-retransmit) as a
  second reference plugin, keeping realistic as the "physical layer".
- An HtmlReport panel with latency CDFs / P50-P99 per pair, surfacing
  the data that's now in the trace.

https://claude.ai/code/session_01C5j2D4MgCkPgsjSCqBVpWW

Generated by Claude Code

The Tier 1 simulator hardwired zero-latency delivery (the event-queue push used time = now), which made mean_latency, throughput, and duration metrics report 0.0 for any scenario and left timing-sensitive protocols untestable. This introduces NetworkModel: a small Protocol with one method, schedule(sender, target, payload_size, t_now, rng) -> float | None, that the simulator queries for every send. Returning a time advances the deliver event; returning None signals a transport-level drop. The default ZeroLatencyNetworkModel reproduces the previous behavior byte-for-byte, so existing traces and validators are unchanged. The hook also distinguishes between scenario-level failure-injection drops (reason: failure_injection / partition) and transport-level drops (reason: network) in the JSONL trace. Tests cover backwards-compat, latency propagation, drop semantics, and determinism under a custom NetworkModel.

…, loss)

The plugin gives NEST a per-hop network model that exercises every existing latency-aware metric the simulator already supports but which the bundled zero-latency in_memory transport leaves at 0.

Knobs (all configurable via layers.transport_config in scenario YAML):
- base_latency_ms: mean propagation per hop
- jitter_sigma: lognormal jitter shape (heavy tail, like real networks)
- bandwidth_bps: per-agent egress link rate, which forces serialization delay (payload_size * 8 / bandwidth) and back-to-back queueing
- max_queue_bytes: drop-tail load shedding on the egress queue
- loss_rate: per-hop Bernoulli packet loss
- links: per-pair overrides for modeling cross-region or flaky links

The scenario runner picks transport: realistic out of the YAML and forwards transport_config to RealisticNetwork.from_config; everything else (scenarios, validators, agents) is unchanged. Determinism is preserved: the simulator passes a seeded RNG so byte-identical traces still hold across runs with the same seed.

Includes 28 unit tests (validation, latency, queueing, loss, determinism, from_config) and 5 runner-level end-to-end tests, plus scenarios/marketplace_realistic.yaml as a worked example.
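For the links overrides, a tolerant parser along the lines described in the PR (malformed entries silently skipped) might look like the sketch below. The from and to keys and the knob names come from this description; parse_links, the exact validation rules, and the returned structure are hypothetical and may not match RealisticNetwork.from_config.

```python
from typing import Any, Dict, Tuple

# Knob names taken from the PR text; which ones are overridable per link
# is an assumption.
OVERRIDABLE = ("base_latency_ms", "jitter_sigma", "bandwidth_bps", "loss_rate")


def parse_links(raw: Any) -> Dict[Tuple[str, str], Dict[str, float]]:
    """Turn a flat list of {from, to, ...} entries into per-pair overrides."""
    links: Dict[Tuple[str, str], Dict[str, float]] = {}
    if not isinstance(raw, list):
        return links
    for entry in raw:
        if not isinstance(entry, dict):
            continue  # malformed entry: skip instead of failing the run
        sender, target = entry.get("from"), entry.get("to")
        if not isinstance(sender, str) or not isinstance(target, str):
            continue
        overrides = {
            key: float(entry[key])
            for key in OVERRIDABLE
            if isinstance(entry.get(key), (int, float))
            and not isinstance(entry.get(key), bool)
        }
        links[(sender, target)] = overrides
    return links


parsed = parse_links([
    {"from": "us-east", "to": "eu-west", "base_latency_ms": 80, "loss_rate": 0.01},
    {"from": "orphan"},   # missing "to": skipped
    "not-a-mapping",      # wrong type: skipped
])
print(parsed)
```

Silent skipping keeps a typo in one link from aborting a long run, at the cost of making such typos easy to miss; logging a warning per skipped entry would be a reasonable middle ground.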


claude added 2 commits May 26, 2026 18:59

sourcery-ai Bot reviewed May 26, 2026


fix(ci): satisfy pyright

cb6c1da

mariagorskikh mentioned this pull request May 26, 2026

[Platform] Open problems + charter + judging doc #13

Merged

mariagorskikh wants to merge 3 commits into main from hackathon/meta-backend-realistic-transport
