Summary
The production NAT traversal stack (libp2p relay v2 + DCUtR + AutoNAT) is wired into the daemon and validated in-process by tests/nat_traversal.rs — a three-node test (relay + two NAT'd clients) where the broker dispatches a real WASM job to an executor through a relay circuit in ~5ms. But we have NOT verified this works across real machines on real networks, and there is strong evidence it does NOT work out of the box behind common institutional firewalls.
What we actually verified
- tests/nat_traversal.rs runs and passes. Trace shows real Noise handshakes, real CBOR-serialized TaskDispatchRequest, real wasmtime execution, real TaskDispatchResponse::Succeeded. This validates the code paths exist and compose correctly.
- 802 tests passing, CI green on Linux/macOS/Windows + Sandbox KVM + swtpm.
- The daemon builds and runs; it dials bootstrap addresses; it listens on TCP and QUIC; it subscribes to gossip topics.
What is NOT verified and what went wrong when we tried
Testing from tensor02.dartmouth.edu (Rocky Linux 9, behind Dartmouth institutional firewall):
- Daemon starts, logs 8 bootstrap dials (3 worldcompute placeholder + 5 public libp2p).
- Zero ConnectionEstablished events appear in the log after 60+ seconds of observation.
- Raw TCP probes (nc -z 104.131.131.82 4001) SUCCEED, but libp2p's long-lived dials silently fail.
- No error messages surface: libp2p swallows dial failures into tracing::warn!, which isn't printed by default.
- The test tests/nat_traversal.rs uses 127.0.0.1, so there is no actual NAT and no actual firewall exercised; it proves the protocol logic but not the NAT behavior.
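The gap between "nc -z succeeds" and "libp2p dials fail" can be narrowed with a probe that connects and then holds the socket open for a while. The following is a minimal sketch using only the Rust standard library, not libp2p itself; the names ProbeOutcome and probe are hypothetical, and the timeouts are arbitrary assumptions. It only distinguishes a failed connect, an early close or reset, and a connection that survives the hold window.

```rust
use std::io::{ErrorKind, Read};
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

/// Outcome of a probe that connects and then holds the socket open briefly.
#[derive(Debug, PartialEq)]
enum ProbeOutcome {
    /// TCP connect failed (refused, filtered, or timed out).
    ConnectFailed,
    /// Connected, but the stream was closed or reset before `hold` elapsed.
    ClosedEarly,
    /// Connected, and the stream stayed open for `hold` or the peer sent data.
    StayedOpen,
}

/// Connect to `addr`, then wait up to `hold` for the peer to close or speak.
/// `hold` must be non-zero: the standard library rejects a zero read timeout.
fn probe(addr: SocketAddr, connect_timeout: Duration, hold: Duration) -> ProbeOutcome {
    let mut stream = match TcpStream::connect_timeout(&addr, connect_timeout) {
        Ok(s) => s,
        Err(_) => return ProbeOutcome::ConnectFailed,
    };
    if stream.set_read_timeout(Some(hold)).is_err() {
        return ProbeOutcome::ClosedEarly;
    }
    let mut buf = [0u8; 1];
    match stream.read(&mut buf) {
        // Orderly close (FIN) from the peer or a middlebox.
        Ok(0) => ProbeOutcome::ClosedEarly,
        // The peer sent bytes, so the path carries more than a bare handshake.
        Ok(_) => ProbeOutcome::StayedOpen,
        // Read timeout: nothing happened for `hold`, the connection is still up.
        Err(e) if matches!(e.kind(), ErrorKind::WouldBlock | ErrorKind::TimedOut) => {
            ProbeOutcome::StayedOpen
        }
        // Reset or any other I/O error.
        Err(_) => ProbeOutcome::ClosedEarly,
    }
}

fn main() {
    // Same address and port as the nc probe above.
    let addr: SocketAddr = "104.131.131.82:4001".parse().expect("valid socket address");
    let outcome = probe(addr, Duration::from_secs(5), Duration::from_secs(10));
    println!("{addr}: {outcome:?}");
}
```

A ClosedEarly result from tensor02 would be consistent with a firewall that permits the SYN but tears the connection down afterwards; StayedOpen would point the investigation back at the daemon or at libp2p's own upgrade steps.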
Why this is critical, not an edge case
Anyone at a university, national lab, enterprise, hospital, government, or cloud provider is likely behind a comparable firewall. That covers the majority of high-value potential donors — the very systems World Compute most needs. If cross-firewall meshing doesn't work out of the box, the project cannot achieve its core mission. This is not deferrable.
Three hypotheses for why tensor02 can't form a mesh
- Daemon exits silently before connection completes. The startup log shows only initial output; no connection events arrive. Process might die; we need to confirm it stays alive for minutes while dialing.
- Dials fail but errors are invisible. libp2p_swarm::DialFailure events go to tracing at warn or debug level. With RUST_LOG=debug we'd see the actual reason.
- Connection establishes then dies at Noise handshake. Stateful firewalls that permit SYN but block protocol upgrades (Dartmouth has been observed to do this on other protocols) would show brief connects followed by immediate closes with auth errors.
Each hypothesis has a concrete diagnostic step.
Investigation plan
Phase 1 — Diagnose why tensor02 can't connect out (foreground with RUST_LOG=debug):
- Run daemon in foreground SSH session (no daemonization, no backgrounding).
- RUST_LOG=info,libp2p_swarm=debug,libp2p_tcp=debug,libp2p_quic=debug,libp2p_noise=debug,libp2p_dns=debug,libp2p_relay=debug.
- Observe for 5 minutes. Record every dial attempt, every failure reason.
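The Phase 1 steps can be sketched as a small standard-library wrapper that starts the daemon in the foreground with the filter above and checks whether the process survives the observation window (which also covers the "daemon exits silently" hypothesis). This is an illustration only: the daemon's binary path is not stated in this issue, so it is taken from the command line, and the helper names (rust_log_filter, spawn_in_foreground, watch) are hypothetical.

```rust
use std::io;
use std::process::{Child, Command, ExitStatus};
use std::thread::sleep;
use std::time::{Duration, Instant};

/// libp2p targets raised to debug level in the investigation plan.
const LIBP2P_DEBUG_TARGETS: &[&str] = &[
    "libp2p_swarm",
    "libp2p_tcp",
    "libp2p_quic",
    "libp2p_noise",
    "libp2p_dns",
    "libp2p_relay",
];

/// Build the RUST_LOG value: `info` by default, `debug` for each libp2p target.
fn rust_log_filter() -> String {
    let mut parts = vec!["info".to_string()];
    parts.extend(LIBP2P_DEBUG_TARGETS.iter().map(|t| format!("{t}=debug")));
    parts.join(",")
}

/// Start the daemon as a child that inherits this terminal's stdio,
/// so its log output stays in the foreground SSH session.
fn spawn_in_foreground(program: &str, args: &[String]) -> io::Result<Child> {
    Command::new(program)
        .args(args)
        .env("RUST_LOG", rust_log_filter())
        .spawn()
}

/// Poll the child until `window` elapses.
/// Returns `Some(status)` if it exited early, `None` if it was still running.
fn watch(child: &mut Child, window: Duration) -> io::Result<Option<ExitStatus>> {
    let deadline = Instant::now() + window;
    while Instant::now() < deadline {
        if let Some(status) = child.try_wait()? {
            return Ok(Some(status));
        }
        sleep(Duration::from_millis(500));
    }
    Ok(None)
}

fn main() -> io::Result<()> {
    // The daemon binary and its arguments come from the command line,
    // because the real path is an unknown here.
    let mut argv = std::env::args().skip(1);
    let Some(program) = argv.next() else {
        eprintln!("usage: <this-tool> <daemon-binary> [daemon-args...]");
        return Ok(());
    };
    let daemon_args: Vec<String> = argv.collect();

    let mut child = spawn_in_foreground(&program, &daemon_args)?;
    match watch(&mut child, Duration::from_secs(300))? {
        Some(status) => println!("daemon exited during the window: {status}"),
        None => {
            println!("daemon still alive after 5 minutes");
            // End the observation run; the log on the terminal is the evidence.
            child.kill()?;
            child.wait()?;
        }
    }
    Ok(())
}
```

An early exit reported here would favor the first hypothesis; a process that stays alive for the full window moves attention to the dial failures in the debug log.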
Phase 2 — Apply the appropriate fix based on diagnosis:
- If DNS resolution of /dnsaddr/... fails: ensure we always have raw IP fallbacks; consider bundling a DoH/DoT resolver.
- If TCP is blocked but QUIC isn't (or vice versa): wire transport-negotiation preference logic.
- If Noise handshake fails due to MTU or SYN-cookies: consider enabling QUIC-only mode or adding TCP MSS clamping options.
- If the firewall blocks ALL outbound traffic except HTTP/HTTPS: ship a WebSocket transport for libp2p over port 443 as last-resort fallback. This is the most likely scenario and the most impactful fix.
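The Phase 2 branches above amount to a small decision table. One way to write it down so that the Phase 1 result maps to exactly one fix is the sketch below; the type and function names (Diagnosis, Mitigation, mitigation) are hypothetical and do not exist in the daemon today.

```rust
/// Transports the daemon already listens on.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Transport {
    Tcp,
    Quic,
}

/// What Phase 1 found. One variant per branch of the Phase 2 list.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Diagnosis {
    /// `/dnsaddr/...` bootstrap addresses do not resolve.
    DnsaddrResolutionFails,
    /// The named transport is blocked while the other one works.
    TransportBlocked(Transport),
    /// TCP connects, but the Noise handshake fails (MTU or SYN-cookie issues).
    NoiseHandshakeFails,
    /// Everything except outbound HTTP/HTTPS is blocked.
    OnlyHttpAllowed,
}

/// The fix each diagnosis calls for.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Mitigation {
    /// Always ship raw IP fallbacks; optionally bundle a DoH/DoT resolver.
    RawIpFallbacks,
    /// Prefer the transport that still works.
    PreferTransport(Transport),
    /// Enable QUIC-only mode or add TCP MSS clamping options.
    QuicOnlyOrMssClamping,
    /// Last resort: WebSocket transport over port 443.
    WebSocketOver443,
}

/// Map a Phase 1 diagnosis to its Phase 2 fix.
fn mitigation(d: Diagnosis) -> Mitigation {
    match d {
        Diagnosis::DnsaddrResolutionFails => Mitigation::RawIpFallbacks,
        Diagnosis::TransportBlocked(Transport::Tcp) => {
            Mitigation::PreferTransport(Transport::Quic)
        }
        Diagnosis::TransportBlocked(Transport::Quic) => {
            Mitigation::PreferTransport(Transport::Tcp)
        }
        Diagnosis::NoiseHandshakeFails => Mitigation::QuicOnlyOrMssClamping,
        Diagnosis::OnlyHttpAllowed => Mitigation::WebSocketOver443,
    }
}

fn main() {
    let cases = [
        Diagnosis::DnsaddrResolutionFails,
        Diagnosis::TransportBlocked(Transport::Tcp),
        Diagnosis::TransportBlocked(Transport::Quic),
        Diagnosis::NoiseHandshakeFails,
        Diagnosis::OnlyHttpAllowed,
    ];
    for d in cases {
        println!("{d:?} -> {:?}", mitigation(d));
    }
}
```

Because the match is exhaustive, adding a new diagnosis variant forces a decision about its fix at compile time.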
Phase 3 — Verify end-to-end:
- Daemon on tensor02 + daemon on local + real WASM job flow, as described in acceptance criteria.
Risk register
- If the firewall truly blocks outbound libp2p AND we can't ship a fallback, we would need to deploy a public WebSocket-over-443 relay node (one of only a few in existence today). This contradicts the "no paid infrastructure" design goal — except: any donor with a publicly-reachable machine can run one, so it's volunteer-sustainable. We'd need to document this clearly.
- WebSocket transport adds complexity and an extra dependency. Worth it only if TCP/QUIC direct fails consistently.
Scope
This issue is the next spec's north star. It should be spec 005-firewall-traversal or similar. The expected output is: a production donor daemon can participate in the World Compute mesh from behind any common firewall without manual network configuration, proven on real hardware.
Acceptance criteria
- [...] NewListenAddr(/p2p/<relay>/p2p-circuit/p2p/<self>).
- [...] evidence/phase1/firewall-traversal/.
Related
- tests/nat_traversal.rs: in-process validation of the code paths this issue tests over real networks.
- src/network/nat.rs: existing STUN-based NAT type detection.