feat(scripts): generalize collect_trajectories.py to all task envs by SourasishBasu · Pull Request #25 · 3xcaffeine/frontier-swe-openenv

SourasishBasu · 2026-04-26T12:46:31Z

Summary

Adds TASK_PROFILES so scripts/collect_trajectories.py can target any of the four envs (postgres, notebook, dependent-type-checker, libexpat-to-x86asm) via --task. Postgres defaults are preserved verbatim — the existing PG flow is unaffected.
New CLI: --task, --runtime (auto-detects docker/podman), --image, --base-port, --container-prefix. All optional with backward-compatible defaults.
Adds a runtime monkey-patch of openenv.core.env_client.ws_connect to inject ping_interval=None. Without this, long single-turn agent bursts (>20s of WS silence) trip the websockets library's keepalive and kill the episode mid-flight with 1011 keepalive ping timeout. Guarded by an assert hasattr(...) so it fails loudly if upstream layout changes. TODO: remove once openenv-core exposes ping_interval/ping_timeout kwargs and we bump the pin in pyproject.toml.
.gitignore: adds artifacts/ so trajectory outputs stay local.
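The keepalive patch described above can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: the helper name `disable_ws_keepalive` and the stand-in `fake_env_client` namespace are hypothetical, used so the sketch runs without openenv-core installed. In the real script the target would be `openenv.core.env_client`. The sketch uses `setdefault`, so an explicit `ping_interval` passed by a caller still wins; that is an assumption about the desired behavior.

```python
import functools
import types


def disable_ws_keepalive(module, attr="ws_connect"):
    """Wrap module.<attr> so calls get ping_interval=None unless one is given."""
    # Fail loudly if the upstream layout changed and the name is gone.
    assert hasattr(module, attr), f"{module!r} has no attribute {attr!r}"
    original = getattr(module, attr)

    @functools.wraps(original)
    def patched(*args, **kwargs):
        # ping_interval=None switches off the websockets keepalive pings.
        kwargs.setdefault("ping_interval", None)
        return original(*args, **kwargs)

    setattr(module, attr, patched)
    return original  # returned so the caller can restore it later


# Hypothetical stand-in for openenv.core.env_client; it just echoes its kwargs.
def _fake_ws_connect(url, **kwargs):
    return {"url": url, **kwargs}


fake_env_client = types.SimpleNamespace(ws_connect=_fake_ws_connect)
disable_ws_keepalive(fake_env_client)
conn = fake_env_client.ws_connect("ws://localhost:8000/ws")
```

Returning the original callable keeps the patch easy to undo once upstream exposes the keepalive kwargs.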

Test runs

All four reward shapes verified end-to-end on 2026-04-26. Trajectories archived locally at artifacts/trajectories-2026-04-26.zip (285 KB) — happy to share on request.

| Task | Reward shape | Episodes | Reward(s) | Phase | `pi_session.jsonl` |
| --- | --- | --- | --- | --- | --- |
| `notebook` | `reward_json` | 1 | 0.6476 | DONE (server-computed) | ✅ 269 KB |
| `dependent-type-checker` | `reward_json_score` | 2 | 0.4093, 0.3000 (mean 0.3547) | EXECUTING (offline backfill) | ✅ 175 KB / 234 KB |
| `libexpat-to-x86asm` | `reward_json_score` | 1 | 0.3843 | EXECUTING (offline backfill) | ✅ 453 KB |
| `postgres` | `ratio` | n/a | not re-run; profile mirrors prior hardcoded constants | n/a | n/a |

Notebook verification (`reward_json`, server-computed)

3 turns, 254 tool calls, 33 min wall.

plan_score = 0.667, S1=0.840, S2=0.000 (agent burned both attempts), S3=0.815
Reward formula sanity: 0.25·0.667 + 0.60·(0.84+0+0.815)/3 + 0.10·1.0 + 0.05·1.0 = 0.6477 ≈ server's 0.6476 ✓
Episode survived three interim server-side Session timeout warnings without losing the WS — the keepalive monkey-patch is doing its job.
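The reward sanity check above can be reproduced in a few lines. This is a sketch under stated assumptions: the weights and inputs are taken from the formula in this description, while the function name and the labels `term_a` and `term_b` for the two trailing terms (weights 0.10 and 0.05, both 1.0 in this run) are hypothetical, since the description does not name them.

```python
def notebook_reward(plan_score, subtask_scores, term_a=1.0, term_b=1.0):
    """Weighted sum matching the sanity-check arithmetic in the PR description."""
    mean_subtask = sum(subtask_scores) / len(subtask_scores)
    return (
        0.25 * plan_score
        + 0.60 * mean_subtask
        + 0.10 * term_a  # hypothetical label for the 0.10-weighted term
        + 0.05 * term_b  # hypothetical label for the 0.05-weighted term
    )


# Inputs from the notebook episode: plan_score and S1, S2, S3.
reward = notebook_reward(0.667, [0.840, 0.000, 0.815])
server_reward = 0.6476
print(f"recomputed={reward:.4f} server={server_reward:.4f}")
```

The small gap to the server's value is consistent with the inputs being rounded to three decimals.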

Type-checker verification (`reward_json_score`, offline backfill)

2 episodes; both reached phase=EXECUTING and were then backfilled offline by _compute_reward_offline(). The _reward_backfilled: true flag is set on both, which demonstrates that the script correctly distinguishes server-computed from backfilled rewards.
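One way to picture the server-computed versus backfilled distinction is the sketch below. It is an illustration only: the function `resolve_reward`, the dictionary shape of an episode, and the `compute_offline` callable standing in for `_compute_reward_offline()` are all assumptions, since the real signature is not shown here. Only the `phase` values and the `_reward_backfilled` flag name come from this description.

```python
from typing import Any, Callable, Dict

Episode = Dict[str, Any]


def resolve_reward(
    episode: Episode, compute_offline: Callable[[Episode], float]
) -> Episode:
    """Return a copy of the episode with a reward and a backfill flag."""
    result = dict(episode)
    if episode.get("phase") == "DONE" and episode.get("reward") is not None:
        # The server finished the episode and already reported a reward.
        result["_reward_backfilled"] = False
    else:
        # The episode stopped early (e.g. phase EXECUTING): compute offline.
        result["reward"] = compute_offline(episode)
        result["_reward_backfilled"] = True
    return result


# Stand-in offline scorers; the real computation is not reproduced here.
server_side = resolve_reward({"phase": "DONE", "reward": 0.6476}, lambda ep: 0.0)
backfilled = resolve_reward({"phase": "EXECUTING", "reward": None}, lambda ep: 0.4093)
```

Keeping the flag alongside the reward lets downstream consumers filter or weight backfilled episodes separately.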

Libexpat verification (`reward_json_score`, offline backfill)

1 episode: the single subtask impl-libexpat scored 0.21, plan_score was 0.43, and the reward was 0.3843; the formula reproduces this value.

Caveats

The monkey-patch is documented inline

🤖 Generated with Claude Code

Adds task profiles (postgres / notebook / type-checker / libexpat) so the trajectory collector can be pointed at any env via --task. Postgres defaults are preserved bit-for-bit (image, episode/message timeouts, container prefix, base port) so the existing PG flow is unaffected.

New CLI flags: --task, --runtime (auto-detects docker/podman), --image, --base-port, --container-prefix.

Also patches openenv-core's WebSocket client at runtime to disable the keepalive ping. The agent can run a single turn for >10 minutes without yielding back to the WS event loop, which starves pong replies and kills the connection mid-episode (1011 keepalive ping timeout). Patch is guarded by an assertion so it fails loudly if openenv-core's layout changes. TODO: remove once ping_interval/ping_timeout kwargs are exposed upstream.

Verified end-to-end across all four reward shapes; trajectories at artifacts/trajectories-2026-04-26.zip locally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
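The task profiles and CLI flags described in this commit message could be wired together roughly as shown below. This is a sketch, not the contents of scripts/collect_trajectories.py: the `TaskProfile` fields, every image name, port, and container prefix, the `postgres` default for `--task`, and the docker-before-podman probe order are placeholder assumptions. Only the task names, flag names, and reward shapes are taken from this PR.

```python
import argparse
import shutil
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskProfile:
    image: str
    base_port: int
    container_prefix: str
    reward_shape: str


# Placeholder values throughout; only task names and reward shapes are real.
TASK_PROFILES = {
    "postgres": TaskProfile("example/postgres-env", 8000, "pg", "ratio"),
    "notebook": TaskProfile("example/notebook-env", 8100, "nb", "reward_json"),
    "dependent-type-checker": TaskProfile(
        "example/typechecker-env", 8200, "tc", "reward_json_score"
    ),
    "libexpat-to-x86asm": TaskProfile(
        "example/libexpat-env", 8300, "expat", "reward_json_score"
    ),
}


def detect_runtime(which=shutil.which):
    """Return the first container runtime found on PATH."""
    for candidate in ("docker", "podman"):
        if which(candidate):
            return candidate
    raise RuntimeError("neither docker nor podman found on PATH")


def build_parser():
    parser = argparse.ArgumentParser(description="Collect agent trajectories.")
    parser.add_argument("--task", choices=sorted(TASK_PROFILES), default="postgres")
    parser.add_argument("--runtime", choices=("docker", "podman"), default=None)
    parser.add_argument("--image", default=None)
    parser.add_argument("--base-port", type=int, default=None)
    parser.add_argument("--container-prefix", default=None)
    return parser


def resolve_config(argv, which=shutil.which):
    """Merge CLI overrides on top of the selected task profile."""
    args = build_parser().parse_args(argv)
    profile = TASK_PROFILES[args.task]
    return {
        "task": args.task,
        "runtime": args.runtime or detect_runtime(which),
        "image": args.image or profile.image,
        "base_port": profile.base_port if args.base_port is None else args.base_port,
        "container_prefix": args.container_prefix or profile.container_prefix,
        "reward_shape": profile.reward_shape,
    }


config = resolve_config(["--task", "notebook", "--runtime", "docker"])
```

Making every flag optional and falling back to the profile is what keeps an invocation with no arguments on the Postgres path.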

SourasishBasu wants to merge 1 commit into main from feat/collect-trajectories-multi-env
