Skip to content

feat(scripts): generalize collect_trajectories.py to all task envs#25

Open
SourasishBasu wants to merge 1 commit into
mainfrom
feat/collect-trajectories-multi-env
Open

feat(scripts): generalize collect_trajectories.py to all task envs#25
SourasishBasu wants to merge 1 commit into
mainfrom
feat/collect-trajectories-multi-env

Conversation

@SourasishBasu

@SourasishBasu SourasishBasu commented Apr 26, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Adds TASK_PROFILES so scripts/collect_trajectories.py can target any of the four envs (postgres, notebook, dependent-type-checker, libexpat-to-x86asm) via --task. Postgres defaults are preserved verbatim — the existing PG flow is unaffected.
  • New CLI: --task, --runtime (auto-detects docker/podman), --image, --base-port, --container-prefix. All optional with backward-compatible defaults.
  • Adds a runtime monkey-patch of openenv.core.env_client.ws_connect to inject ping_interval=None. Without this, long single-turn agent bursts (>20s of WS silence) trip the websockets library's keepalive and kill the episode mid-flight with 1011 keepalive ping timeout. Guarded by an assert hasattr(...) so it fails loudly if upstream layout changes. TODO: remove once openenv-core exposes ping_interval/ping_timeout kwargs and we bump the pin in pyproject.toml.
  • .gitignore: adds artifacts/ so trajectory outputs stay local.

Test runs

All four reward shapes verified end-to-end on 2026-04-26. Trajectories archived locally at artifacts/trajectories-2026-04-26.zip (285 KB) — happy to share on request.

Task Reward shape Episodes Reward(s) Phase pi_session.jsonl
notebook reward_json 1 0.6476 DONE (server-computed) ✅ 269 KB
dependent-type-checker reward_json_score 2 0.4093, 0.3000 (mean 0.3547) EXECUTING (offline backfill) ✅ 175 KB / 234 KB
libexpat-to-x86asm reward_json_score 1 0.3843 EXECUTING (offline backfill) ✅ 453 KB
postgres ratio not re-run; profile mirrors prior hardcoded constants

Notebook verification (reward_json, server-computed)

3 turns, 254 tool calls, 33 min wall.

  • plan_score = 0.667, S1=0.840, S2=0.000 (agent burned both attempts), S3=0.815
  • Reward formula sanity: 0.25·0.667 + 0.60·(0.84+0+0.815)/3 + 0.10·1.0 + 0.05·1.0 = 0.6477 ≈ server's 0.6476 ✓
  • Episode survived three interim server-side Session timeout warnings without losing the WS — the keepalive monkey-patch is doing its job.

Type-checker verification (reward_json_score, offline backfill)

2 episodes, both reached phase=EXECUTING then offline-backfilled by _compute_reward_offline(). _reward_backfilled: true flag set on both. Demonstrates the script correctly distinguishes server-computed vs. backfilled rewards.

Libexpat verification (reward_json_score, offline backfill)

1 episode, single subtask impl-libexpat scored 0.21, plan_score 0.43, reward 0.3843 — formula reproduces.

Caveats

  • The monkey-patch is documented inline

🤖 Generated with Claude Code

Adds task profiles (postgres / notebook / type-checker / libexpat) so the
trajectory collector can be pointed at any env via --task. Postgres
defaults are preserved bit-for-bit (image, episode/message timeouts,
container prefix, base port) so the existing PG flow is unaffected.

New CLI flags: --task, --runtime (auto-detects docker/podman), --image,
--base-port, --container-prefix.

Also patches openenv-core's WebSocket client at runtime to disable the
keepalive ping. The agent can run a single turn for >10 minutes without
yielding back to the WS event loop, which starves pong replies and kills
the connection mid-episode (1011 keepalive ping timeout). Patch is
guarded by an assertion so it fails loudly if openenv-core's layout
changes — TODO to remove once ping_interval/ping_timeout kwargs are
exposed upstream.

Verified end-to-end across all four reward shapes; trajectories at
artifacts/trajectories-2026-04-26.zip locally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant