feat(eval): two-agent file-edit tracker eval (#64) #65
Open
codevibesmatter wants to merge 6 commits into
Adds approved spec for an eval that proves the session-scoped file-edit tracker (b5d2c95) correctly isolates the `committed` stop-condition per session when two real concurrent Claude agents work in the same project.

Spec covers:
- Harness: agents[] affordance on EvalScenario + Promise.allSettled, scenarioStartSha capture, session-ID-aware getSessionState
- Assertions: assertTwoCommitsSinceStart + assertCommitsScopedToEachSession (no caller-supplied session ID; discovers sessions at checkpoint time)
- Scenario: eval/scenarios/two-agent-tracker.ts with disjoint prompts against the tanstack-start fixture
- VP2 fault injection to prove the eval actually measures the tracker

Review: 2 passes, final score 88/100, all gaps addressed inline.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Phase 1 of spec 64 (two-agent file-edit tracker eval):
- Add AgentSpec type; extend EvalScenario with optional agents?: AgentSpec[] (mutually exclusive with prompt)
- Capture scenarioStartSha after fixtureSetup; expose via EvalContext.startSha
- Extend getSessionState(sessionId?) to read a specific session's state.json
- Add Promise.allSettled multi-agent branch writing per-agent transcripts
- Enforce runtime invariant: exactly one of prompt or agents must be set
- New eval/harness.test.ts covering mutual exclusivity and sessionId lookup

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
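The invariant and the multi-agent branch described in the Phase 1 commit might look roughly like the sketch below. It is illustrative only: the fields on `AgentSpec`, the helper names `assertPromptXorAgents` and `runAgents`, and the injected `AgentRunner` callback are assumptions, not the PR's actual code (the real harness dispatches SDK `query()` calls). Treating an empty `agents` array as "not set" is also an assumption.

```typescript
// Hypothetical shapes: the PR names AgentSpec and EvalScenario but does not show their fields.
interface AgentSpec {
  name: string;
  prompt: string;
}

interface EvalScenario {
  id: string;
  prompt?: string;
  agents?: AgentSpec[];
}

// Runtime invariant: exactly one of `prompt` or `agents` must be set.
// Assumption: an empty agents array counts as "not set".
function assertPromptXorAgents(scenario: EvalScenario): void {
  const hasPrompt = scenario.prompt !== undefined;
  const hasAgents = scenario.agents !== undefined && scenario.agents.length > 0;
  if (hasPrompt === hasAgents) {
    throw new Error(`scenario ${scenario.id}: exactly one of prompt or agents must be set`);
  }
}

// Stand-in for whatever actually runs one agent (hypothetical callback).
type AgentRunner = (agent: AgentSpec) => Promise<string>;

interface AgentOutcome {
  name: string;
  ok: boolean;
  detail: string;
}

// Run all agents concurrently. Promise.allSettled waits for every agent,
// so one agent failing does not hide the result of the other.
async function runAgents(agents: AgentSpec[], run: AgentRunner): Promise<AgentOutcome[]> {
  const settled = await Promise.allSettled(agents.map((agent) => run(agent)));
  return settled.map((result, i) => ({
    name: agents[i].name,
    ok: result.status === "fulfilled",
    detail: result.status === "fulfilled" ? result.value : String(result.reason),
  }));
}
```

Using `Promise.allSettled` rather than `Promise.all` means a rejection from one agent still leaves the other agent's outcome available for the checkpoints to inspect.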
Phase 2 of spec 64:
- assertTwoCommitsSinceStart: verifies exactly 2 non-merge commits since ctx.startSha using git rev-list --count --no-merges
- assertCommitsScopedToEachSession: matches each active session to its commit via edits.jsonl file-set intersection, then asserts commit files ⊆ edits ∪ ALLOWLIST (bun.lockb, bun.lock, package-lock.json, *.tsbuildinfo; glob-matched)
- 13 new unit tests covering all 7 spec test_cases via real temp git repos

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
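The scoping rule in the Phase 2 commit (commit files ⊆ edits ∪ ALLOWLIST) can be illustrated with a small sketch. The allowlist entries come from the commit message; everything else is an assumption made for illustration: the helper names `globToRegExp`, `unexpectedFiles`, and `sessionsTouching`, matching globs against the file's basename, and representing each session's edits.jsonl as an in-memory `Set` of paths. The PR's own `matchesAllowlist` may be implemented differently.

```typescript
// Allowlist entries as listed in the commit message.
const ALLOWLIST = ["bun.lockb", "bun.lock", "package-lock.json", "*.tsbuildinfo"];

// Minimal glob support: `*` only, everything else is matched literally.
function globToRegExp(pattern: string): RegExp {
  const escaped = pattern
    .split("*")
    .map((part) => part.replace(/[.+?^${}()|[\]\\]/g, "\\$&"))
    .join(".*");
  return new RegExp(`^${escaped}$`);
}

// Assumption: patterns are matched against the basename of the path.
function matchesAllowlist(path: string): boolean {
  const base = path.split("/").pop() ?? path;
  return ALLOWLIST.some((pattern) => globToRegExp(pattern).test(base));
}

// Files in the commit that are neither session edits nor allowlisted.
// An empty result means files(commit) ⊆ edits ∪ ALLOWLIST holds.
function unexpectedFiles(commitFiles: string[], sessionEdits: Set<string>): string[] {
  return commitFiles.filter((file) => !sessionEdits.has(file) && !matchesAllowlist(file));
}

// Sessions whose recorded edits intersect the commit's file set.
function sessionsTouching(
  commitFiles: string[],
  editsBySession: Map<string, Set<string>>,
): string[] {
  const matches: string[] = [];
  editsBySession.forEach((edits, sessionId) => {
    if (commitFiles.some((file) => edits.has(file))) matches.push(sessionId);
  });
  return matches;
}
```

In this sketch a commit is attributed to a session when `sessionsTouching` returns exactly that session, and the scoping assertion passes when `unexpectedFiles` is empty for the matched pair.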
Phase 3 of spec 64 (register the two-agent-tracker scenario):
- New eval/scenarios/two-agent-tracker.ts: two disjoint-file prompts, fixtureSetup pre-installs deps, checkpoints assert 2 commits + per-session commit scoping
- Register in eval/run.ts scenario registry

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
- harness: multi-agent branch now uses isolated canUseTool (allow-all) instead of the single-agent closure that shared abortController, sessionId, and pendingQuestion across both agents
- assertions: empty edits.jsonl no longer short-circuits assertCommitsScopedToEachSession; falls through to standard zero-match candidate-list diagnostic per spec tc3
- harness: remove dead scenario.prompt ?? '' fallback on single-agent path (invariant already guarantees it is set)
- assertions: document matchesAllowlist glob limitations (* only)
- harness.test: note multi-agent SDK dispatch is covered by VP1

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
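The first fix above is about not sharing mutable closure state between concurrent agents. A minimal sketch of that idea follows, assuming a factory that builds fresh state per agent. The names `createAgentContext`, `AgentState`, and `ToolDecision` are hypothetical, the callback signature is a simplified stand-in rather than the SDK's real `canUseTool` type, and the `toolCalls` log exists only to make the isolation observable. The abort controller mentioned in the commit is omitted to keep the example small.

```typescript
// Simplified stand-ins; the real SDK permission types may differ.
type ToolDecision = { behavior: "allow" };
type CanUseTool = (toolName: string) => Promise<ToolDecision>;

interface AgentState {
  sessionId: string | null;
  pendingQuestion: string | null;
  toolCalls: string[]; // illustrative only, not from the PR
}

// Each call returns a brand-new state object and a callback that closes
// over that object only, so two agents never see each other's state.
function createAgentContext(): { state: AgentState; canUseTool: CanUseTool } {
  const state: AgentState = { sessionId: null, pendingQuestion: null, toolCalls: [] };
  const canUseTool: CanUseTool = async (toolName) => {
    state.toolCalls.push(toolName);
    return { behavior: "allow" }; // allow-all, as in the multi-agent branch
  };
  return { state, canUseTool };
}
```

Calling `createAgentContext()` once per agent, instead of defining one callback outside the loop, is what keeps a write from one agent from leaking into the other.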
Research mode previously ended with "no issue update". Close now creates a follow-up issue capturing summary, findings, and follow-ups, linked back to the research doc. This makes research output discoverable and actionable outside the research doc itself.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Summary
Implements spec #64: a two-agent eval that proves the per-session file-edit tracker (commit b5d2c95) scopes the `committed` stop-condition correctly under real concurrency.

- `EvalScenario.agents?: AgentSpec[]` spawns concurrent `query()` calls via `Promise.allSettled`; per-agent transcripts avoid append collisions; a runtime invariant enforces mutual exclusivity with `prompt`.
- `getSessionState(projectDir, sessionId?)`: an explicit ID reads a specific session; omitting it preserves the latest-by-`updatedAt` behavior.
- `assertTwoCommitsSinceStart()` and `assertCommitsScopedToEachSession()`: the latter matches each session to its commit via edits.jsonl intersection and asserts `files(commit) ⊆ edits ∪ ALLOWLIST` (with glob-matched `*.tsbuildinfo`, etc).
- `eval/scenarios/two-agent-tracker.ts`: two disjoint-file prompts, `fixtureSetup: ['bun install']`.

Test plan
Closes #64
🤖 Generated with Claude Code