coworld hosted-play WS sessions die at ~40 s due to default upstream ping settings #17

@JBoggsy

Where the fix lives: the actual line that needs changing is in the private Metta-AI/metta repo at app_backend/src/metta/app_backend/routes/coworld_routes.py:737 (and the symmetric replay-proxy hop at :826). Filing here because Metta-AI/metta has issues disabled, and this surfaces through the public coworld hosted-game CLI.

Summary

Every WebSocket connection to a coworld hosted-game play session is forcibly closed at ~40 seconds, regardless of game state, slot count, or client behavior. The cutoff matches the Python websockets library's default ping watchdog (ping_interval=20 + ping_timeout=20) on the proxy's upstream connection to the in-cluster game pod. This makes hosted-play unusable for any Among Them session, since lobby + RoleReveal alone is ~10 s and any meaningful play sits well above 40 s.

This is hosted-play-only. Tournament/league episodes use direct in-cluster service WS (packages/coworld/src/coworld/runner/kubernetes_runner.py:421) and don't traverse the FastAPI play-session proxy, so they are unaffected.

Reproduction

Fresh coworld hosted-game create cow_a7418f9b-…-91bb9655bc76 --variant default, three configurations. Both input-echoing configurations close at the same wall-clock time; the no-echo run closes earlier, with a different close code:

| Setup | Frames received | Closed at | Close code |
|---|---|---|---|
| 1 slot, raw client, no ping, no input echo | 650 | +27.1 s | 1012 (Service Restart) |
| 1 slot, raw client, no ping, input echo every frame | 971 | +40.3 s | 1006 (abnormal, no close frame) |
| 8 slots claimed (anonymous), 8 raw clients, all echoing input | ~974 each | +40.3 s (within 0.1 s of each other) | 1006 |

974 frames at 24 Hz ≈ 40.6 s — matches ping_interval=20 + ping_timeout=20. Session status stays ready/running and frames flow continuously at 24 fps for the whole window — not an idle timeout, not a natural game-end.
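As a quick sanity check on the arithmetic above, the frame counts from the reproduction table can be converted to elapsed time at the stated 24 Hz and compared against the observed close times. This is only a sketch; `frames_to_seconds` is a helper name made up for illustration, and the numbers are copied from the table.

```python
# Frame counts and close times copied from the reproduction table.
TICK_HZ = 24  # frame rate stated in the issue

observed = [
    ("1 slot, no echo", 650, 27.1),
    ("1 slot, echo", 971, 40.3),
    ("8 slots, echo", 974, 40.3),
]


def frames_to_seconds(frames: int, hz: int = TICK_HZ) -> float:
    """Convert a frame count to elapsed seconds at a fixed frame rate."""
    return frames / hz


for label, frames, closed_at in observed:
    estimate = frames_to_seconds(frames)
    print(f"{label}: {frames} frames ~ {estimate:.1f} s (observed close at +{closed_at} s)")
```

All three estimates land within a few tenths of a second of the observed close times, which supports the reading that frames were flowing at the full rate right up to the cutoff.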

Full reproducer (~30 LOC, stdlib + websockets):

import asyncio, json, subprocess, time, urllib.parse, urllib.request
import websockets

CLI = "/path/to/coworld"
SERVER = "https://api.observatory.softmax-research.net"
COWORLD_ID = "cow_a7418f9b-4f4e-4f93-bfa4-91bb9655bc76"  # among_them

session = json.loads(subprocess.check_output([
    CLI, "hosted-game", "create", COWORLD_ID, "--variant", "default", "--json",
]))
session_id = session["session_id"]

# Anonymous join (no auth) bypasses the same-user-returns-same-slot shortcut
req = urllib.request.Request(
    f"{SERVER}/v2/coworlds/play/session/{session_id}/join", method="POST", data=b"")
req.add_header("content-type", "application/json")
with urllib.request.urlopen(req) as r:
    join = json.load(r)
ws_url = urllib.parse.parse_qs(urllib.parse.urlsplit(join["player_url"]).query)["address"][0]

async def watch():
    start = time.monotonic()
    frames = 0
    async with websockets.connect(ws_url, ping_interval=None, max_size=None) as ws:
        try:
            async for msg in ws:
                frames += 1
                if isinstance(msg, (bytes, bytearray)) and len(msg) == 8192:
                    await ws.send(bytes([0, 0]))  # NOOP input
        except websockets.exceptions.ConnectionClosed as exc:
            print(f"closed at +{time.monotonic() - start:.1f}s frames={frames} code={exc.code}")

asyncio.run(watch())

Consistent output across runs:

closed at +40.3s frames=971 code=1006

Root cause

The connection chain has two stitched WS sessions:

[client] <— Session A (wss/443) —> [FastAPI proxy] <— Session B (TCP→WS) —> [Among Them (mummy)]
                                                       via coworld_play_proxy.py
                                                       (raw TCP pipe; not a WS proxy)

Metta-AI/metta:app_backend/src/metta/app_backend/routes/coworld_routes.py:737:

async with websockets.connect(target_url, additional_headers=headers, ssl=ssl_context) as upstream:
    await websocket.accept()
    upstream_task = asyncio.create_task(_upstream_to_websocket(websocket, upstream))
    downstream_task = asyncio.create_task(_websocket_to_upstream(websocket, upstream))
    done, pending = await asyncio.wait({upstream_task, downstream_task}, return_when=asyncio.FIRST_COMPLETED)
    ...

No ping_interval / ping_timeout are passed → websockets.connect uses the library defaults (20 + 20). Session B sends its first ping at t=20 s and gives up if no pong by t=40 s. When Session B closes, asyncio.wait(..., FIRST_COMPLETED) fires, cancels the downstream task, and Session A closes — which is what clients observe.
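The teardown described above can be illustrated with a stdlib-only sketch. The two coroutines below are stand-ins for `_upstream_to_websocket` and `_websocket_to_upstream` (their real bodies are not shown in this issue), the timings are shortened, and the cancel loop after `asyncio.wait` is an assumption about what the elided `...` does, based on the behavior described in the paragraph above.

```python
import asyncio


async def upstream_pump(log):
    # Stand-in for _upstream_to_websocket: returns when "Session B" closes.
    await asyncio.sleep(0.05)
    log.append("upstream closed")


async def downstream_pump(log):
    # Stand-in for _websocket_to_upstream: would otherwise run for the whole game.
    try:
        await asyncio.sleep(60)
    except asyncio.CancelledError:
        log.append("downstream cancelled")
        raise


async def proxy_session():
    log = []
    up = asyncio.create_task(upstream_pump(log))
    down = asyncio.create_task(downstream_pump(log))
    done, pending = await asyncio.wait({up, down}, return_when=asyncio.FIRST_COMPLETED)
    # Assumed teardown: cancel whichever pump is still running.
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)
    return log


result = asyncio.run(proxy_session())
print(result)
```

The point of the sketch is that the client-facing pump has no say in the matter: as soon as the upstream side finishes for any reason, the healthy downstream side is cancelled with it.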

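Either mummy isn't pong-ing for some reason, or pongs aren't surviving the in-cluster TCP pipe (less likely, since coworld_play_proxy.py is byte-transparent). Either way the proxy's own watchdog is what ends the session at exactly its default window. The 1006 that clients see means Session A went away without a close frame. Note that the websockets documentation describes a keepalive ping timeout as closing the affected connection (Session B here) with code 1011, so the 1006 is most likely produced when the proxy tears down Session A without sending a close frame, rather than by the watchdog itself. The 27 s / 1012 outlier in the no-echo run is a separate, graceful close from upstream (probably a no-client-input watchdog inside the game container).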
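For reading reproducer output, a small lookup of the close codes relevant to this issue may help. The meanings follow RFC 6455 and the IANA WebSocket close code registry; 1011 is included because the websockets documentation describes keepalive timeouts as closing with that code, although it was not observed on the client side here. `describe_close_code` is a hypothetical helper, not part of any library.

```python
# Close codes relevant to this issue (RFC 6455 section 7.4.1 and the IANA registry).
CLOSE_CODES = {
    1006: "Abnormal Closure (no close frame received; reported locally, never sent on the wire)",
    1011: "Internal Error",
    1012: "Service Restart",
}


def describe_close_code(code: int) -> str:
    """Return a human-readable label for a WebSocket close code."""
    return CLOSE_CODES.get(code, f"unlisted close code {code}")


for code in (1012, 1006):
    print(code, "->", describe_close_code(code))
```

Because 1006 is only ever synthesized by the receiving side, seeing it in the reproducer says the client-facing connection was dropped without a closing handshake; it does not by itself identify which hop initiated the teardown.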

Suggested fix

One-line change at coworld_routes.py:737 (and the symmetric replay-proxy site at :826). Pick one:

# Option A — disable upstream-side pinging entirely.
# Simple. Relies on TCP/mummy to surface dropped peers.
async with websockets.connect(
    target_url, additional_headers=headers, ssl=ssl_context,
    ping_interval=None,
) as upstream:

# Option B — pragmatic longer window (recommended).
# Raises the watchdog to 180 s; long enough for a real Among Them game.
async with websockets.connect(
    target_url, additional_headers=headers, ssl=ssl_context,
    ping_interval=120, ping_timeout=60,
) as upstream:

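Option B is the safer middle ground: it keeps dead-peer detection on the upstream hop. Note that if upstream never answers pings at all, Option B only moves the cutoff from about 40 s to about 180 s (first ping at 120 s, then 60 s to wait for the pong), whereas Option A removes it.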
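To compare the options, here is a rough calculation of when the upstream watchdog would fire if the peer never answers a ping. It assumes the keepalive model described in this issue (first ping after `ping_interval` seconds, connection dropped if no pong arrives within `ping_timeout` seconds); `worst_case_cutoff` is a name made up for this sketch.

```python
from typing import Optional


def worst_case_cutoff(ping_interval: Optional[float],
                      ping_timeout: Optional[float]) -> Optional[float]:
    """Seconds until the keepalive watchdog fires if the peer never pongs.

    Returns None when the watchdog cannot fire: either no pings are sent
    (ping_interval=None) or pings are sent but never time out (ping_timeout=None).
    """
    if ping_interval is None or ping_timeout is None:
        return None
    return ping_interval + ping_timeout


settings = {
    "library defaults": (20, 20),
    "Option A": (None, 20),
    "Option B": (120, 60),
}

for name, (interval, timeout) in settings.items():
    cutoff = worst_case_cutoff(interval, timeout)
    print(f"{name}: {'never' if cutoff is None else f'{cutoff} s'}")
```

Under that assumption the defaults give the observed 40 s, Option B gives 180 s, and Option A never cuts the session on keepalive grounds.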

Independently, worth checking whether mummy is in fact auto-pong-ing in this deployment — if it isn't, that's a real bug in bitworld/among_them/server.nim and the metta-side change is defense-in-depth.

Impact

coworld hosted-game is currently effectively broken for Among Them. The default variant has roleRevealTicks=120 (5 s) and startWaitTicks=120 (5 s) before the Playing phase even begins, so only about 30 s of actual play fits before the 40 s proxy watchdog fires. Anyone using hosted-play for bot-vs-bot or human-vs-bot Among Them sees their connection silently drop ~40 s in.