Cross-machine firewall traversal: verify real mesh formation works behind institutional / corporate firewalls #60

@jeremymanning

Summary

The production NAT traversal stack (libp2p relay v2 + DCUtR + AutoNAT) is wired into the daemon and validated in-process by tests/nat_traversal.rs — a three-node test (relay + two NAT'd clients) where the broker dispatches a real WASM job to an executor through a relay circuit in ~5ms. But we have NOT verified this works across real machines on real networks, and there is strong evidence it does NOT work out of the box behind common institutional firewalls.

What we actually verified

  • tests/nat_traversal.rs runs and passes. Trace shows real Noise handshakes, real CBOR-serialized TaskDispatchRequest, real wasmtime execution, real TaskDispatchResponse::Succeeded. This validates the code paths exist and compose correctly.
  • 802 tests passing, CI green on Linux/macOS/Windows + Sandbox KVM + swtpm.
  • The daemon builds and runs; it dials bootstrap addresses; it listens on TCP and QUIC; it subscribes to gossip topics.

What is NOT verified and what went wrong when we tried

Testing from tensor02.dartmouth.edu (Rocky Linux 9, behind Dartmouth institutional firewall):

  1. Daemon starts, logs 8 bootstrap dials (3 worldcompute placeholder + 5 public libp2p).
  2. Zero ConnectionEstablished events appear in the log after 60+ seconds of observation.
  3. Raw TCP probes (nc -z 104.131.131.82 4001) SUCCEED, but libp2p's long-lived dials silently fail.
  4. No error messages surface — libp2p swallows dial failures into tracing::warn! which isn't printed by default.
  5. The test tests/nat_traversal.rs uses 127.0.0.1 so there is no actual NAT and no actual firewall exercised; it proves the protocol logic but not the NAT behavior.
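The gap between item 3 (raw TCP probes succeed) and item 2 (no connections) suggests probing one layer above `nc -z`. The following is a minimal sketch, assuming the remote peer speaks multistream-select 1.0.0 directly over TCP and answers the header once it has received ours; `probe_multistream` and `ProbeOutcome` are hypothetical names for illustration, not part of the daemon.

```rust
use std::io::{Read, Write};
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

/// Length-prefixed multistream-select header: 0x13 (19) + "/multistream/1.0.0\n".
const MULTISTREAM_HEADER: &[u8] = b"\x13/multistream/1.0.0\n";

#[derive(Debug, PartialEq)]
pub enum ProbeOutcome {
    /// The TCP connect itself failed or timed out.
    ConnectFailed,
    /// TCP connected, but no valid header came back (reset, EOF, timeout, or garbage).
    ConnectedButNoProtocol,
    /// The peer returned the multistream header, so application bytes flow both ways.
    ProtocolOk,
}

/// Hypothetical helper: connect, send the multistream header, expect it echoed back.
pub fn probe_multistream(addr: SocketAddr, timeout: Duration) -> ProbeOutcome {
    let mut stream = match TcpStream::connect_timeout(&addr, timeout) {
        Ok(s) => s,
        Err(_) => return ProbeOutcome::ConnectFailed,
    };
    // Bound both directions so a silently dropping firewall cannot hang the probe.
    let _ = stream.set_read_timeout(Some(timeout));
    let _ = stream.set_write_timeout(Some(timeout));

    if stream.write_all(MULTISTREAM_HEADER).is_err() {
        return ProbeOutcome::ConnectedButNoProtocol;
    }
    let mut buf = [0u8; 20];
    match stream.read_exact(&mut buf) {
        Ok(()) if buf.as_slice() == MULTISTREAM_HEADER => ProbeOutcome::ProtocolOk,
        _ => ProbeOutcome::ConnectedButNoProtocol,
    }
}
```

Run from tensor02 against 104.131.131.82:4001, a `ConnectedButNoProtocol` result would indicate that the firewall admits the TCP connection but interferes with the bytes that follow. `nc -z` cannot detect this, because it never sends application data.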

Why this is critical, not an edge case

Anyone at a university, national lab, enterprise, hospital, government, or cloud provider is likely behind a comparable firewall. That covers the majority of high-value potential donors — the very systems World Compute most needs. If cross-firewall meshing doesn't work out of the box, the project cannot achieve its core mission. This is not deferrable.

Three hypotheses for why tensor02 can't form a mesh

  1. Daemon exits silently before connection completes. The startup log shows only initial output; no connection events arrive. Process might die; we need to confirm it stays alive for minutes while dialing.
  2. Dials fail but errors are invisible. libp2p_swarm::DialFailure events go to tracing at warn or debug level. With RUST_LOG=debug we'd see the actual reason.
  3. Connection establishes then dies at Noise handshake. Stateful firewalls that permit SYN but block protocol upgrades (Dartmouth has been observed to do this on other protocols) would show brief connects followed by immediate closes with auth errors.

Each hypothesis has a concrete diagnostic step.
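To make the three hypotheses mutually distinguishable, the observations from a debug run can be reduced to a few tallies and mapped to a hypothesis. This is a rough sketch only: the `Observation` fields and the `classify` function are hypothetical, and the tallies are assumed to be counted by hand from the debug log rather than parsed from any particular log format.

```rust
#[derive(Debug, PartialEq)]
pub enum Hypothesis {
    /// Hypothesis 1: the daemon process is no longer running.
    DaemonExited,
    /// Hypothesis 2: dials fail, and the reasons only show at warn/debug level.
    DialsFailInvisibly,
    /// Hypothesis 3: TCP connects succeed, then the Noise handshake fails.
    DiesAtHandshake,
    /// None of the three hypotheses matches the tallies.
    Unexplained,
}

/// Hand-counted tallies from one observation window (all field names are hypothetical).
pub struct Observation {
    pub process_alive: bool,
    pub dial_errors_logged: u32,
    pub tcp_connects: u32,
    pub handshake_failures: u32,
}

pub fn classify(o: &Observation) -> Hypothesis {
    if !o.process_alive {
        return Hypothesis::DaemonExited;
    }
    // Check the more specific signature first: a connect followed by a handshake error.
    if o.tcp_connects > 0 && o.handshake_failures > 0 {
        return Hypothesis::DiesAtHandshake;
    }
    if o.dial_errors_logged > 0 {
        return Hypothesis::DialsFailInvisibly;
    }
    Hypothesis::Unexplained
}
```

The handshake check precedes the generic dial-error check because a handshake failure would also be logged as a dial error, so the ordering keeps hypothesis 3 from being absorbed into hypothesis 2.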

Acceptance criteria

  • Daemon running on a machine behind a real institutional firewall (tensor02 or equivalent) successfully maintains a libp2p connection to at least one public bootstrap relay for 10+ minutes continuously.
  • That daemon successfully obtains a relay reservation visible in its logs as NewListenAddr(/p2p/<relay>/p2p-circuit/p2p/<self>).
  • A second daemon on a different network can dial the first daemon via the reserved circuit address and establish a working connection.
  • A real WASM job dispatched from the second daemon to the first completes successfully and returns the expected result bytes.
  • Evidence artifact (log file, screenshots, or trace) captured and committed under evidence/phase1/firewall-traversal/.
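The reservation criterion can be checked mechanically by confirming that a logged address has the `/p2p/<relay>/p2p-circuit/p2p/<self>` shape. A minimal sketch follows, assuming the address is available as a plain multiaddr string; `parse_circuit_addr` is a hypothetical helper that only inspects the string and does not validate peer IDs.

```rust
/// Extracts (relay_peer_id, target_peer_id) from a relay circuit multiaddr string,
/// e.g. ".../p2p/<relay>/p2p-circuit/p2p/<self>". Returns None for any other shape.
pub fn parse_circuit_addr(addr: &str) -> Option<(&str, &str)> {
    let parts: Vec<&str> = addr.split('/').collect();
    let i = parts.iter().position(|p| *p == "p2p-circuit")?;

    // The relay's peer ID must directly precede the circuit marker.
    if i < 2 || parts[i - 2] != "p2p" {
        return None;
    }
    let relay = parts[i - 1];

    // The reserved (target) peer ID must directly follow it.
    if parts.get(i + 1) != Some(&"p2p") {
        return None;
    }
    let target = *parts.get(i + 2)?;

    if relay.is_empty() || target.is_empty() {
        return None;
    }
    Some((relay, target))
}
```

The second daemon in the next criterion would dial exactly this kind of address, so the same check can confirm that the address handed to it names the expected relay and target.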

Investigation plan

Phase 1 — Diagnose why tensor02 can't connect out (foreground with RUST_LOG=debug):

  1. Run daemon in foreground SSH session (no daemonization, no backgrounding).
  2. Set RUST_LOG=info,libp2p_swarm=debug,libp2p_tcp=debug,libp2p_quic=debug,libp2p_noise=debug,libp2p_dns=debug,libp2p_relay=debug (the value ends at libp2p_relay=debug).
  3. Observe for 5 minutes. Record every dial attempt, every failure reason.
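The filter in step 2 is long enough to mistype, so it can be assembled from a target list instead. This is a small sketch; `rust_log_filter` and `PHASE1_DEBUG_TARGETS` are hypothetical names, and the targets are the ones listed in step 2.

```rust
/// Log targets raised to debug in Phase 1, taken from step 2 above.
pub const PHASE1_DEBUG_TARGETS: &[&str] = &[
    "libp2p_swarm",
    "libp2p_tcp",
    "libp2p_quic",
    "libp2p_noise",
    "libp2p_dns",
    "libp2p_relay",
];

/// Builds a RUST_LOG value: a default level followed by one `target=debug`
/// directive per entry, comma-separated.
pub fn rust_log_filter(default_level: &str, debug_targets: &[&str]) -> String {
    let mut directives = vec![default_level.to_string()];
    directives.extend(debug_targets.iter().map(|t| format!("{}=debug", t)));
    directives.join(",")
}
```

Calling `rust_log_filter("info", PHASE1_DEBUG_TARGETS)` yields the string from step 2, which can then be exported as RUST_LOG before launching the daemon in the foreground.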

Phase 2 — Apply the appropriate fix based on diagnosis:

  • If DNS resolution of /dnsaddr/... fails: ensure we always have raw IP fallbacks; consider bundling a DoH/DoT resolver.
  • If TCP is blocked but QUIC isn't (or vice versa): wire transport-negotiation preference logic.
  • If Noise handshake fails due to MTU or SYN-cookies: consider enabling QUIC-only mode or adding TCP MSS clamping options.
  • If the firewall blocks ALL outbound traffic except HTTP/HTTPS: ship a WebSocket transport for libp2p over port 443 as last-resort fallback. This is the most likely scenario and the most impactful fix.
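The branches above amount to a lookup from diagnosis to planned fix. The sketch below records that lookup in code; the `Diagnosis` and `Fix` enums and the `planned_fix` function are hypothetical and mirror the four bullets without implementing any of the fixes.

```rust
/// Possible Phase 1 findings, one per bullet above.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Diagnosis {
    DnsaddrResolutionFails,
    OneTransportBlocked,
    NoiseHandshakeFails,
    OnlyHttpHttpsAllowed,
}

/// The fix planned for each finding.
#[derive(Debug, PartialEq)]
pub enum Fix {
    /// Always keep raw IP fallbacks; consider a bundled DoH/DoT resolver.
    RawIpFallbacks,
    /// Prefer whichever of TCP or QUIC is not blocked.
    TransportPreference,
    /// QUIC-only mode or TCP MSS clamping options.
    QuicOnlyOrMssClamp,
    /// WebSocket transport over port 443 as the last-resort fallback.
    WebSocketOn443,
}

pub fn planned_fix(diagnosis: Diagnosis) -> Fix {
    match diagnosis {
        Diagnosis::DnsaddrResolutionFails => Fix::RawIpFallbacks,
        Diagnosis::OneTransportBlocked => Fix::TransportPreference,
        Diagnosis::NoiseHandshakeFails => Fix::QuicOnlyOrMssClamp,
        Diagnosis::OnlyHttpHttpsAllowed => Fix::WebSocketOn443,
    }
}
```

Because the match is exhaustive, adding a new diagnosis variant later forces a compile error until a fix is chosen for it.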

Phase 3 — Verify end-to-end:

  • Run one daemon on tensor02 and a second daemon on a local machine on a different network, then dispatch a real WASM job between them, as described in the acceptance criteria.

Risk register

  • If the firewall truly blocks outbound libp2p AND we can't ship a fallback, we would need to deploy a public WebSocket-over-443 relay node (one of only a few in existence today). This contradicts the "no paid infrastructure" design goal — except: any donor with a publicly-reachable machine can run one, so it's volunteer-sustainable. We'd need to document this clearly.
  • WebSocket transport adds complexity and an extra dependency. Worth it only if TCP/QUIC direct fails consistently.

Scope

This issue is the next spec's north star. It should be spec 005-firewall-traversal or similar. The expected output is: a production donor daemon can participate in the World Compute mesh from behind any common firewall without manual network configuration, proven on real hardware.
