feat: stream ACP trajectory incrementally to disk#566
Draft
Yiminnn wants to merge 1 commit into
Draft
Conversation
Before: acp_trajectory.jsonl was written once, at end-of-run, from _build_rollout_result. Crashes mid-run left no trajectory on disk; followers couldn't see in-flight tool calls. After: an optional `on_change` callback on ACPSession fires after every mutating method (record_user_prompt, mark_prompt_end, handle_update). Rollout wires it to a TrajectoryWriter that atomically rewrites acp_trajectory.jsonl (tmp + os.replace) on each ACP update. Final write at end-of-run is unchanged and idempotent. Trajectory format and viewer are unchanged — same JSONL events, just present on disk earlier. Multi-scene safety: each scene wires its writer through make_trajectory_sink(writer, prior=self._trajectory[:]), so scene N's sink prepends the captured cumulative trajectory from prior scenes. Without this, each scene would overwrite the file with only its own session's events. Robustness: - non-destructive _snapshot_session_trajectory used by the live writer; pending_text stays merge-able for mark_prompt_end - TrajectoryWriter sweeps any stale .tmp left by a previous crashed run - handle_update no longer fires on_change for unrecognized sessionUpdate types (future ACP versions, agent extensions) - callback failures log at ERROR (was WARNING) so silent streaming degradation is visible End-to-end verified on Daytona with openhands + gemini/gemini-3.1-flash-lite on SkillsBench citation-check: reward=1.0, file lands at t=0 with the user_message and grows through tool_call status transitions.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
acp_trajectory.jsonlis now written to disk as the agent runs, not just at end-of-run. Mid-run crashes leave a usable partial trajectory; followers cancatthe file during the run to see live tool calls.Format and viewer are unchanged. Each line is still a merged event from
_capture_session_trajectory; the writer just snapshots the current cumulative state on every ACP update.Mechanism
ACPSessiongains an optionalon_change: Callable[[ACPSession], None]invoked afterrecord_user_prompt,mark_prompt_end, and any recognizedhandle_updatebranch.TrajectoryWriter(new) writes payloads atomically viatmp + os.replace; dedupes byte-identical re-snapshots._snapshot_session_trajectoryis non-destructive (does not flush_pending_text), so streamed chunks stay mergeable untilmark_prompt_endruns and produces the canonical merged events.Rollout._attach_trajectory_writerwires the writer inconnect()andconnect_as()(multi-role), capturing the cumulativeself._trajectoryfrom prior scenes viamake_trajectory_sink(writer, prior)— so scene-N's sink writesprior_scenes + current_session, not just the current session._build_rollout_resultnow goes throughTrajectoryWriter.write_final(idempotent atomic write — important for oracle / scraped-fallback paths where no live writer ran).Why this is safe
viewer.py,judge.py.tmpl,tests/test_smoke.py, etc.).TestMultiSceneCumulativeStreamingcover this.handle_updateis single-thread cooperative within the asyncio event loop, so the syncwrite_text + os.replaceper update is consistent with the existing model. See follow-up note below for one MEDIUM finding deferred to a separate PR..tmpfrom a SIGKILLed writer is swept by the nextTrajectoryWriter.__init__.handle_updateskips_notify_changefor unrecognizedsessionUpdatetypes (no wasted snapshots, no risk if a future ACP version adds new event kinds)._notify_changenow logs atERROR(wasWARNING) — silent streaming degradation in a 64-concurrency log would otherwise be easy to miss.End-to-end verified
openhands+gemini/gemini-3.1-flash-liteon SkillsBenchcitation-checkvia Daytona — reward 1.0, no errors, 2 tool calls,trajectory_source: acp,partial_trajectory: False. File lands at t=0 with theuser_messageand grows through each tool-call status transition:user_messagecatcompletededitcompletedagent_messageOn
mainthe file would have first appeared at the verifier-finished mark (~10:05:53 in this run).Test plan
tests/test_capture_trajectory.py:TestTrajectoryWriter— file appears on first event, tool_call pending → completed transitions visible, exceptions swallowed, atomic rewrites (no torn lines / leftover.tmp),write_finaloverwrites.TestSnapshotSessionTrajectory— non-destructive snapshot preserves_pending_textuntilmark_prompt_end.TestMultiSceneCumulativeStreaming— prior scenes preserved on disk during current scene; empty current session doesn't wipe history; caller-mutated cumulative list does not double-count.TestHandleUpdateUnknownType— unknownsessionUpdateskipson_change; known types still notify.TestTrajectoryWriterStaleTmpCleanup— stale.tmpswept on__init__.openhands + gemini/gemini-3.1-flash-lite + citation-check: reward 1.0.Known follow-ups (out of scope for this PR)
TrajectoryWriter.flushperforms syncwrite_text + os.replacefrom the asyncio event loop (~1–3 ms per ACP update; ~100–300 ms cumulative per chunky prompt with 100+ streamed chunks). Bounded but not free. Would require making theon_changecallback async-aware. Defer to a follow-up unless a real run shows measurable cost.openhands + daytonareportsusage_source: unavailableby design (the host-side proxy is unreachable from the remote sandbox). Tracked in OpenHands integration: ACP CLI path does not expose stable token/cost metrics; need SDK-runner or equivalent metrics bridge #183 / ACP trials do not persist provider token/cost telemetry #249 / ACP trials do not persist provider token/cost telemetry #250.