[Hackathon] meta-backend: realistic transport plugin (latency, jitter, queueing, loss)#10

Open
mariagorskikh wants to merge 3 commits into
mainfrom
hackathon/meta-backend-realistic-transport

@mariagorskikh

Collaborator

Layer picked: Transport (#1)

Why

The README is candid about it:

"The default transport is zero-latency. ... mean_latency / duration
will both be 0.0 in your trace. Latency numbers become meaningful
only when ... you write a transport plugin that introduces per-hop delay."

So an entire family of protocol properties — tail latency, retry/backoff
behavior, deadline budgets, congestion response, queue-shed strategies —
is currently invisible to NEST users. The metrics module already
computes mean_latency, throughput, and duration; they just always
report 0.0 because the only shipped transport is zero-latency.

This PR plugs that hole.

Core idea

Two layers, kept deliberately small:

  1. NetworkModel hook in the simulator (nest_core.sim.network).
    A Protocol with one method:
    schedule(sender, target, payload_size, t_now, rng) -> float | None.
    The simulator queries it for every send; the returned time becomes
    the deliver event's timestamp. None means transport-level drop.
    Default is ZeroLatencyNetworkModel, so existing traces are
    byte-identical without code changes.

  2. RealisticNetwork reference plugin (nest_plugins_reference.transport.realistic).
    Implements NetworkModel with the small set of knobs a backend
    engineer actually reaches for:

    • base_latency_ms + jitter_sigma — lognormal jitter so the
      tail behaves like a real network (heavy, asymmetric), not a Gaussian toy.
    • bandwidth_bps — payload-size-aware serialization delay
      (bytes * 8 / bw). A 1 KB message on a 1 Mbps link takes about 8 ms to
      serialize, roughly 7.5 ms more than a 64 B message (about 0.5 ms).
    • Egress queueing — each sender has its own virtual egress link.
      Back-to-back sends serialize: the second message can't depart until
      the first finishes transmitting. This is where mean_latency stops
      being constant and starts to show the load curve.
    • max_queue_bytes — drop-tail backpressure when the egress queue
      overflows. The crude-but-honest baseline; a real engineer can swap
      in CoDel later.
    • loss_rate — per-hop Bernoulli packet loss at the link layer,
      orthogonal to (and separately attributable from) the scenario's
      failures.message_drop.
    • Per-link overrides — single (sender, target) pairs can carry
      their own latency / jitter / bandwidth / loss for modeling
      cross-region hops or hot pairs.
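
To show how the knobs above compose, here is a minimal sketch of a per-send delay computation. It is not the plugin's code: the class name ToyRealisticNetwork, the order of the steps, the multiplicative lognormal jitter with median 1, and the way the queue backlog is estimated are all assumptions made for illustration. Per-link overrides are left out.

```python
import random
from typing import Dict, Optional


class ToyRealisticNetwork:
    """Illustrative stand-in, NOT the RealisticNetwork plugin itself."""

    def __init__(self, base_latency_ms: float = 5.0, jitter_sigma: float = 0.3,
                 bandwidth_bps: float = 1_000_000,
                 max_queue_bytes: Optional[int] = None,
                 loss_rate: float = 0.0) -> None:
        self.base_latency_s = base_latency_ms / 1000.0
        self.jitter_sigma = jitter_sigma
        self.bandwidth_bps = bandwidth_bps
        self.max_queue_bytes = max_queue_bytes
        self.loss_rate = loss_rate
        # Time at which each sender's virtual egress link becomes idle.
        self._link_free_at: Dict[str, float] = {}

    def schedule(self, sender: str, target: str, payload_size: int,
                 t_now: float, rng: random.Random) -> Optional[float]:
        # 1. Link-layer Bernoulli loss (assumed to happen before queueing).
        if rng.random() < self.loss_rate:
            return None

        # 2. Egress queueing: wait until earlier sends have left the link.
        start = max(t_now, self._link_free_at.get(sender, 0.0))

        # 3. Drop-tail: estimate the backlog in bytes from the waiting time.
        if self.max_queue_bytes is not None:
            backlog_bytes = (start - t_now) * self.bandwidth_bps / 8.0
            if backlog_bytes + payload_size > self.max_queue_bytes:
                return None

        # 4. Serialization delay: bytes * 8 / bandwidth.
        departed = start + payload_size * 8.0 / self.bandwidth_bps
        self._link_free_at[sender] = departed

        # 5. Propagation delay scaled by lognormal jitter (median factor 1.0).
        factor = 1.0
        if self.jitter_sigma > 0:
            factor = rng.lognormvariate(0.0, self.jitter_sigma)
        return departed + self.base_latency_s * factor
```

With jitter disabled, two back-to-back 1000-byte sends from the same agent on a 1 Mbps link arrive at 13 ms and 21 ms: the second one waits 8 ms for the first to finish serializing, which is the load-dependent behavior described in the egress-queueing bullet.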

Drops in the trace now carry a reason field: "network" (this plugin
or any custom NetworkModel), "failure_injection" (scenario-level
Bernoulli drop), or "partition" (cross-group send). Attribution that
previously didn't exist.
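
As an illustration of what the reason field makes possible, the sketch below tallies drops per cause from JSONL trace lines. The trace schema is assumed: the "event" key and its "drop" value are hypothetical names chosen for this example, and only the reason field and its three values come from the description above.

```python
import json
from collections import Counter
from typing import Iterable


def drops_by_reason(jsonl_lines: Iterable[str]) -> Counter:
    """Count drop events per reason; the "event" key is an assumed field name."""
    counts: Counter = Counter()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the trace
        record = json.loads(line)
        if record.get("event") == "drop":
            counts[record.get("reason", "unknown")] += 1
    return counts
```

Passing an open trace file to drops_by_reason would separate transport-level loss ("network") from scenario-level drops ("failure_injection", "partition") in one pass.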

How to test

Build from source with uv, or run pytest after editable installs:

# all green: 240 tests (38 reference plugin + 16 hypothesis + everything else)
pytest packages/nest-core/tests/ packages/nest-plugins-reference/tests/

# the new surface specifically
pytest packages/nest-plugins-reference/tests/test_realistic_transport.py -v   # 28 tests
pytest packages/nest-core/tests/test_network_model.py -v                       # 9 tests
pytest packages/nest-core/tests/test_runner_realistic.py -v                    # 5 tests

End-to-end via the bundled scenario:

nest run scenarios/marketplace_realistic.yaml
# trace now has non-zero ts everywhere; report.html shows real latency curves

Quick interactive sanity check (what I used to validate the wiring):

import asyncio
from nest_core.scenario import ScenarioConfig
from nest_core.runner import ScenarioRunner

cfg = ScenarioConfig.from_yaml("scenarios/marketplace_realistic.yaml")
cfg.duration = "ticks: 3000"

async def go():
    r = ScenarioRunner(cfg)
    await r.run()
    print(r.metrics)

asyncio.run(go())
# {'mean_latency': 0.0055, 'throughput': 14735, 'duration': 0.131, ...}

Before this PR: mean_latency == 0.0, duration == 0.0, throughput == 0.0.

Key assumptions

  • Backwards compatibility is non-negotiable. Every existing scenario
    must produce a byte-identical trace under the same seed. Default
    network_model=None short-circuits to the zero-latency model used
    before; the simulator's RNG plumbing splits failure-injection and
    network-model RNGs so byzantine/partition draws don't shift.
  • Determinism is preserved. The simulator passes its own seeded RNG
    into NetworkModel.schedule, so traces remain byte-identical across
    runs at the same seed, including jitter and loss.
  • The model stays inside Tier 1. No threads, no real sockets. This
    is for stressing the protocol that runs on top of TCP, not for
    reimplementing TCP. The README's "no TCP/gRPC/HTTP" limitation still
    stands and is reworded to reflect the new option.
  • Per-link config is a flat list in YAML ({from, to, ...}),
    forwarded verbatim into RealisticNetwork.from_config. Malformed
    entries are silently dropped rather than failing the run — same
    failure mode the scenario loader uses for partition groups.
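
The per-link behavior in the last bullet could look roughly like the sketch below. This is not RealisticNetwork.from_config: parse_links and resolve are hypothetical helpers, the override keys are assumed to reuse the global knob names, and links are treated as symmetric because per-direction config is listed as future work.

```python
from typing import Any, Dict, Iterable, Optional, Tuple

# Assumed override keys, mirroring the global knob names.
OVERRIDE_KEYS = ("base_latency_ms", "jitter_sigma", "bandwidth_bps", "loss_rate")
LinkKey = Tuple[str, str]


def _parse_entry(entry: Any) -> Optional[Tuple[LinkKey, Dict[str, float]]]:
    """Return (key, overrides) for a well-formed entry, else None."""
    if not isinstance(entry, dict):
        return None
    sender, target = entry.get("from"), entry.get("to")
    if not isinstance(sender, str) or not isinstance(target, str):
        return None
    overrides: Dict[str, float] = {}
    for key in OVERRIDE_KEYS:
        if key in entry:
            value = entry[key]
            # bool is a subclass of int, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                return None
            overrides[key] = float(value)
    a, b = sorted((sender, target))  # symmetric link: order does not matter
    return (a, b), overrides


def parse_links(raw_links: Optional[Iterable[Any]]) -> Dict[LinkKey, Dict[str, float]]:
    """Build the per-link table, silently skipping malformed entries."""
    links: Dict[LinkKey, Dict[str, float]] = {}
    for entry in raw_links or []:
        parsed = _parse_entry(entry)
        if parsed is not None:
            links[parsed[0]] = parsed[1]
    return links


def resolve(defaults: Dict[str, float], links: Dict[LinkKey, Dict[str, float]],
            sender: str, target: str) -> Dict[str, float]:
    """Merge global defaults with any override for this pair."""
    a, b = sorted((sender, target))
    return {**defaults, **links.get((a, b), {})}
```

Skipping bad entries keeps a run alive, at the cost that a typo in a link entry quietly falls back to the global defaults.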

Persona

Meta backend engineer who has spent too many quarters tuning Thrift /
MCRouter under load and thinks "tail latency" first, "happy path"
second.

Future work (deliberately out of scope here)

  • AQM (CoDel / PIE) and ECN signaling on the egress queue so adaptive
    protocols have something to react to.
  • Asymmetric per-direction link config (a→b slower than b→a).
  • A topology helper: build per-link config from a graph YAML (rings,
    star, datacenter clos, hub-and-spoke) instead of enumerating pairs.
  • TCP-like behaviors layered on top (windowing, fast-retransmit) as a
    second reference plugin, keeping realistic as the "physical layer".
  • An HtmlReport panel with latency CDFs / P50-P99 per pair, surfacing
    the data that's now in the trace.

https://claude.ai/code/session_01C5j2D4MgCkPgsjSCqBVpWW


Generated by Claude Code

claude added 2 commits May 26, 2026 18:59
The Tier 1 simulator hardwired zero-latency delivery (the event-queue
push used time = now), which made mean_latency, throughput, and
duration metrics report 0.0 for any scenario and left timing-sensitive
protocols untestable.

This introduces NetworkModel: a small Protocol with one method,
schedule(sender, target, payload_size, t_now, rng) -> float | None,
that the simulator queries for every send. Returning a time advances
the deliver event; returning None signals a transport-level drop.
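
A minimal sketch of the hook's shape, under stated assumptions: agent identifiers are taken to be strings, the two example models (ZeroLatency, FixedDelayWithLoss) and the dispatch helper are hypothetical, and dispatch is only a toy stand-in for the simulator's send path, not its actual code.

```python
import random
from typing import List, Optional, Protocol, Tuple


class NetworkModel(Protocol):
    """One-method hook with the signature quoted in the commit message."""

    def schedule(self, sender: str, target: str, payload_size: int,
                 t_now: float, rng: random.Random) -> Optional[float]:
        ...


class ZeroLatency:
    """Delivers at the send time, i.e. the pre-existing behavior."""

    def schedule(self, sender, target, payload_size, t_now, rng):
        return t_now


class FixedDelayWithLoss:
    """Hypothetical custom model: constant delay plus Bernoulli loss."""

    def __init__(self, delay_s: float, loss_rate: float) -> None:
        self.delay_s = delay_s
        self.loss_rate = loss_rate

    def schedule(self, sender, target, payload_size, t_now, rng):
        if rng.random() < self.loss_rate:
            return None  # transport-level drop
        return t_now + self.delay_s


def dispatch(model: NetworkModel,
             sends: List[Tuple[float, str, str, int]],
             rng: random.Random):
    """Toy send path: returns (deliveries sorted by time, drop records)."""
    deliveries, drops = [], []
    for t_now, sender, target, size in sends:
        deliver_at = model.schedule(sender, target, size, t_now, rng)
        if deliver_at is None:
            drops.append({"sender": sender, "target": target, "reason": "network"})
        else:
            deliveries.append((deliver_at, sender, target))
    deliveries.sort()
    return deliveries, drops
```

Because Protocol uses structural typing, a custom model only needs a matching schedule method; it does not have to inherit from NetworkModel.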

The default ZeroLatencyNetworkModel reproduces the previous behavior
byte-for-byte, so existing traces and validators are unchanged. The
hook also distinguishes between scenario-level failure-injection
drops (reason: failure_injection / partition) and transport-level
drops (reason: network) in the JSONL trace.

Tests cover backwards-compat, latency propagation, drop semantics,
and determinism under a custom NetworkModel.
…, loss)

The plugin gives NEST a per-hop network model that exercises every
latency-aware metric the simulator already computes but that the
bundled zero-latency in_memory transport leaves at 0.

Knobs (all configurable via layers.transport_config in scenario YAML):
- base_latency_ms: mean propagation per hop
- jitter_sigma: lognormal jitter shape (heavy tail, like real networks)
- bandwidth_bps: per-agent egress link rate, which forces serialization
  delay (payload_size * 8 / bandwidth) and back-to-back queueing
- max_queue_bytes: drop-tail load shedding on the egress queue
- loss_rate: per-hop Bernoulli packet loss
- links: per-pair overrides for modeling cross-region or flaky links

The scenario runner picks transport: realistic out of the YAML and
forwards transport_config to RealisticNetwork.from_config; everything
else (scenarios, validators, agents) is unchanged. Determinism is
preserved: the simulator passes a seeded RNG so byte-identical traces
still hold across runs with the same seed.
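
One way the determinism guarantee can coexist with backwards compatibility is to give failure injection and the network model separate RNG streams derived from the same scenario seed. The sketch below is an assumption about how such a split might be arranged, not the simulator's actual plumbing; make_streams and failure_draws are hypothetical names.

```python
import random
from typing import List, Tuple


def make_streams(seed: int) -> Tuple[random.Random, random.Random]:
    """Derive two independent, reproducible streams from one seed.

    String seeds are hashed deterministically by random.Random, so the
    same scenario seed always yields the same pair of streams.
    """
    return random.Random(f"{seed}:failure"), random.Random(f"{seed}:network")


def failure_draws(seed: int, n_sends: int, use_network: bool) -> List[float]:
    """Record the failure-injection draws for n_sends sends."""
    failure_rng, network_rng = make_streams(seed)
    draws = []
    for _ in range(n_sends):
        draws.append(failure_rng.random())
        if use_network:
            # Jitter draws consume only the network stream.
            network_rng.lognormvariate(0.0, 0.3)
    return draws
```

Under this arrangement, failure_draws(42, 100, False) equals failure_draws(42, 100, True): turning on a network model that consumes randomness does not shift the failure-injection sequence, while a different seed changes both streams.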

Includes 28 unit tests (validation, latency, queueing, loss,
determinism, from_config) and 5 runner-level end-to-end tests, plus
scenarios/marketplace_realistic.yaml as a worked example.

@sourcery-ai sourcery-ai Bot left a comment

Sorry @mariagorskikh, you have reached your weekly rate limit of 500000 diff characters.

Please try again later or upgrade to continue using Sourcery

2 participants