Skip to content

feat(engine): add verification, cost tracking, and smarter completion#331

Open
highimpact-dev wants to merge 7 commits intosubsy:mainfrom
highimpact-dev:feat/engine-improvements
Open

feat(engine): add verification, cost tracking, and smarter completion#331
highimpact-dev wants to merge 7 commits intosubsy:mainfrom
highimpact-dev:feat/engine-improvements

Conversation

@highimpact-dev
Copy link

@highimpact-dev highimpact-dev commented Feb 23, 2026

Summary

  • Verification gates: configurable commands run after agent signals completion, retry with error injection on failure
  • Auto-commit defaults: autoCommit: true default, improved commit messages with iteration context, --no-auto-commit opt-out
  • Model escalation: start with cheaper model, escalate on verification failure, fully wired through config pipeline
  • Cross-iteration context: structured git diff summaries fed to subsequent iterations instead of raw output
  • Completion detection hardening: pluggable strategies (promise-tag, relaxed-tag, heuristic) — configurable per project
  • Acceptance criteria validation: parse AC text for executable assertions (shell commands, file checks), run as verification commands
  • Cost tracking: per-session cost accumulation with model-aware pricing, TUI dashboard display, configurable alert threshold
  • Parallel first-class: auto-detect mode as default, --conflict-timeout CLI flag, improved help text

Test plan

  • bun run typecheck passes
  • bun run build passes
  • bun test — 3350 pass, 4 pre-existing failures (unrelated beads tracker tests)
  • 65+ new tests across 6 test files covering all new modules
  • Manual: run with --start-model sonnet --escalate-model opus to verify escalation
  • Manual: set verification.commands in config and confirm gate runs post-completion
  • Manual: verify cost display appears in TUI dashboard

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Post-completion verification gate with formatted failure feedback injected into retries
    • Model escalation (start cheaper, escalate after failures)
    • Per-session cost tracking with alerts and dashboard cost display
    • Cross-iteration diff summaries included in prompts
    • Pluggable completion-detection strategies
    • Acceptance-criteria validation converted into executable verification checks
    • Auto-commit enabled by default; richer commit messages
  • Configuration & CLI Enhancements

    • Flags to control start/escalate models, conflict-timeout, and explicit auto-commit toggles
  • Documentation

    • Advanced configuration docs and CLI examples updated

highimpact-dev and others added 5 commits February 22, 2026 18:30
- Verification gates between iterations
- Auto-commit defaults (enable by default)
- Model escalation strategy (sonnet → opus on failure)
- Cross-iteration context (structured diff summaries)
- Completion detection hardening (strategy pattern)
- Acceptance criteria validation (auto-extract executable checks)
- Cost tracking (model-aware pricing, TUI display, threshold alerts)
- First-class parallel execution (auto-detect independent tasks)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add configurable verification commands that run after agent signals
completion but before task is marked done. Includes retry with error
injection, output truncation, and proper cleanup on skip/abort.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ation

Step 2: autoCommit defaults to true, improved commit messages with
iteration number, --no-auto-commit CLI flag for opt-out.

Step 3: Model escalation starts with cheaper model, escalates on
verification failure. Config properly flows through buildConfig pipeline.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…on detection

Step 4: Generate structured diff summaries after each iteration, feed
as rolling context to subsequent prompts instead of raw output history.

Step 5: Pluggable completion detection strategies - promise-tag (default),
relaxed-tag, and heuristic. Configurable per project.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…llel first-class

Step 6: Parse acceptance criteria for executable assertions (shell
commands, file existence checks). Run as verification commands with
proper shell escaping.

Step 7: Track cumulative cost per session with model-aware pricing.
Display in TUI dashboard. Configurable alert threshold pauses engine.

Step 8: Default parallel mode to 'auto' (detect independent tasks).
Add --conflict-timeout CLI flag. Improved help text.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@vercel
Copy link

vercel bot commented Feb 23, 2026

@highimpact-dev is attempting to deploy a commit to the plgeek Team on Vercel.

A member of the Team first needs to authorize it.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 23, 2026

Walkthrough

Adds an eight-step engine improvement suite: post-completion verification gate, model escalation, cost tracking, pluggable completion detection, acceptance-criteria validation, cross-iteration diff context, auto-commit defaults, and first-class parallel execution; extends config, templates, engine loop, TUI and adds many tests.

Changes

Cohort / File(s) Summary
Specs & Roadmap
(.claude/spec-loop.local.md), spec/20260222-engine-improvements/PLAN.md, spec/20260222-engine-improvements/*/*.{TASK,TODO,COMPLETED}.md
New eight-step plan, per-step TASK/TODO/COMPLETED docs and a local spec-loop record.
Config types & build
src/config/types.ts, src/config/index.ts
Adds VerificationConfig, ModelEscalationConfig, CostConfig, CompletionConfig and defaults; extends StoredConfig/RalphConfig/RuntimeOptions (autoCommit, startModel, escalateModel, conflictTimeout) and merges CLI overrides into runtime config.
Engine: verification gate
src/engine/verification.ts, tests/engine/verification.test.ts
New runVerification/formatVerificationErrors APIs; sequential shell-command verification gate with timeout/retry and tests; integrated into engine loop.
Engine: model escalation
src/engine/model-escalation.ts, tests/engine/model-escalation.test.ts
Per-task escalation state and helpers (createEscalationState, getModelForTask, record/clear attempts); CLI overrides wired; emits model:escalated events.
Engine: completion detection
src/engine/completion-strategies.ts, tests/engine/completion-detection.test.ts
Pluggable completion strategies (promise-tag, relaxed-tag, heuristic) and detectCompletion orchestrator; replaces inline regex check and is configurable.
Engine: acceptance-criteria → verification
src/engine/ac-validator.ts, tests/engine/ac-validator.test.ts
Parses task.acceptanceCriteria into ExecutableAC and converts to verification commands; AC-derived commands are prepended to verification runs.
Engine: diff summariser (cross-iteration context)
src/engine/diff-summarizer.ts, tests/engine/diff-summarizer.test.ts
generateDiffSummary + formatDiffContext produce DiffSummary objects; engine stores rolling window (max 5) and injects formatted diffContext into prompts; IterationResult optionally contains diffSummary.
Engine: cost tracking
src/engine/cost-tracker.ts, tests/engine/cost-tracker.test.ts
CostTracker class with model-aware pricing, addIteration/getSnapshot/formatCost, emits cost:updated and cost:threshold-exceeded events; session persistence updated for cumulative cost.
Engine core integration
src/engine/index.ts, src/engine/types.ts
Wires verification (with AC), model escalation, diff summaries, cost tracking and completion detection into runIteration/buildPrompt; extends EngineEvent union (verification/model/cost) and IterationResult (diffSummary); adds internal state (verificationRetryMap, escalationState, recentDiffSummaries, costTracker). High-impact file.
Templates & trackers
src/templates/types.ts, src/templates/engine.ts, src/templates/builtin.ts, src/plugins/trackers/builtin/json/template.hbs
Adds verificationErrors and diffContext to TemplateVariables/ExtendedTemplateContext; templates and tracker partials render conditional sections for previous verification failures and recent diff context.
Auto-commit
src/engine/auto-commit.ts, src/engine/auto-commit.test.ts
performAutoCommit signature adds optional iteration; commit message formatted to include iteration/agent context; default autoCommit behaviour changed via config/build.
CLI & run command
src/commands/run.tsx, src/commands/run.test.ts
Adds --start-model/--escalate-model/--conflict-timeout/--auto-commit/--no-auto-commit flags; exports resolveParallelMode; default parallel fallback becomes auto; parsing and help updated; tests added.
Session & persistence
src/session/types.ts, src/session/index.ts
Adds cumulativeCost?: number to SessionMetadata and updateSessionIteration to persist cumulativeCost.
TUI: cost display
src/tui/components/ProgressDashboard.tsx, src/tui/components/RunApp.tsx
Adds totalCost prop/state, listens for cost events, and displays cumulative cost in the dashboard.
Tests
tests/engine/*, tests/*
Extensive new test suites covering verification, completion detection, diff summariser, cost tracker, model escalation, AC parsing, auto-commit, and CLI behaviour.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Engine as ExecutionEngine
    participant Agent as AgentProcess
    participant CompDetect as CompletionDetection
    participant Verification as VerificationGate
    participant Escalation as ModelEscalation
    participant Cost as CostTracker
    participant Templates
    participant Session

    Client->>Engine: runIteration()
    Engine->>Agent: executeAgent()
    Agent-->>Engine: AgentExecutionResult
    Engine->>CompDetect: detectCompletion(result, strategies)
    CompDetect-->>Engine: { completed, matchedStrategy }

    alt Completed
        Engine->>Verification: runVerification(cwd, commands)
        Verification-->>Engine: VerificationResult
        alt Verification passed
            Engine->>Cost: addIteration(tokens, model)
            Cost-->>Engine: iterationCost, snapshot
            Engine->>Engine: generateDiffSummary()
            Engine->>Session: updateSessionIteration(iteration, diffSummary, snapshot.totalCost)
            Engine->>Templates: buildPrompt(... diffContext)
            Engine-->>Client: IterationResult { diffSummary, costSnapshot }
        else Verification failed and retries remain
            Engine->>Templates: buildPrompt(verificationErrors)
            Engine->>Escalation: recordTaskAttempt(taskId)
            Engine-->>Client: IterationResult (retry queued)
        else Verification failed, no retries
            Engine->>Engine: markTaskDone()
            Engine-->>Client: IterationResult (completed with verification failures)
        end
    else Not completed
        Engine->>Escalation: getModelForTask(taskId)
        Escalation-->>Engine: selectedModel
        Engine->>Engine: emit model:escalated? / set model
        Engine-->>Client: IterationResult (continue)
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately and concisely summarizes the primary changes: verification gates, cost tracking, and improved completion detection. It captures the core enhancements without noise.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Tip

Issue Planner is now in beta. Read the docs and try it out! Share your feedback on Discord.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

Note

Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
src/tui/components/ProgressDashboard.tsx (1)

254-276: ⚠️ Potential issue | 🟡 Minor

Row 2 can overflow the 45-character fixed-width column when both aggregateUsage and totalCost are populated.

With a short tracker name the row can reach ~57+ characters: "Tracker: json · Σ I/O/T: 1.2M/500K/1.7M · Cost: $1.2345". The fixed width: 45 on the right column will silently clip the cost portion.

Consider either widening the column, moving Cost: to its own row (Row 3 or 4), or only showing cost when aggregateUsage is absent.

♻️ Option — give cost its own row
-        {/* Row 2: Tracker + cost */}
+        {/* Row 2: Tracker + token totals */}
         <box style={{ flexDirection: 'row' }}>
           <text fg={colors.fg.secondary}>Tracker: </text>
           <text fg={colors.accent.tertiary}>{trackerName}</text>
           {aggregateUsage && (
             <>
               <text fg={colors.fg.muted}> · </text>
               <text fg={colors.fg.secondary}>Σ I/O/T: </text>
               <text fg={colors.accent.secondary}>{formatTokenCount(aggregateUsage.inputTokens)}</text>
               <text fg={colors.fg.muted}>/</text>
               <text fg={colors.accent.primary}>{formatTokenCount(aggregateUsage.outputTokens)}</text>
               <text fg={colors.fg.muted}>/</text>
               <text fg={colors.status.info}>{formatTokenCount(aggregateUsage.totalTokens)}</text>
             </>
           )}
-          {totalCost !== undefined && totalCost > 0 && (
-            <>
-              <text fg={colors.fg.muted}> · </text>
-              <text fg={colors.fg.secondary}>Cost: </text>
-              <text fg={colors.accent.primary}>${totalCost.toFixed(4)}</text>
-            </>
-          )}
         </box>

+        {/* Row 2b: Cost (own row to avoid overflow) */}
+        {totalCost !== undefined && totalCost > 0 && (
+          <box style={{ flexDirection: 'row' }}>
+            <text fg={colors.fg.secondary}>Cost: </text>
+            <text fg={colors.accent.primary}>${totalCost.toFixed(4)}</text>
+          </box>
+        )}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/tui/components/ProgressDashboard.tsx` around lines 254 - 276, The Row 2
layout in ProgressDashboard (the <box> rendering trackerName, aggregateUsage and
totalCost) can overflow the fixed-width column; update the component so Cost is
rendered on its own row instead of inline: remove the totalCost block from the
existing <box> that contains trackerName/aggregateUsage and add a new separate
<box> (e.g., "Row 3: Cost") that conditionally renders the Cost label and the
${totalCost.toFixed(4)} when totalCost is defined and > 0; keep existing helpers
like formatTokenCount and the same color styles so only the JSX structure
changes (move the totalCost conditional rendering out of the
tracker/aggregateUsage <box> into its own <box>).
src/tui/components/RunApp.tsx (1)

1840-1885: ⚠️ Potential issue | 🟡 Minor

Initialise totalCost from engine state on mount to avoid displaying $0.00 after session resume.

The useEffect at lines 1840–1885 initialises taskUsageMap, activeAgent, rateLimitState, and detectedModel from engine.getState(), but does not read state.costSnapshot to initialise totalCost. After a session resume or component remount mid-run, the cost widget will display $0.00 until the next cost:updated event fires, despite the engine already having the accumulated cost in state.

Suggested fix — seed totalCost from state.costSnapshot
+  if (state.costSnapshot?.totalCost) {
+    setTotalCost(state.costSnapshot.totalCost);
+  }
   if (state.currentModel) {
     setDetectedModel(state.currentModel);
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/tui/components/RunApp.tsx` around lines 1840 - 1885, The mount effect
that reads engine.getState() should also seed the total cost from the engine
state: when engine.getState() returns a state with a costSnapshot, call the
existing setter (setTotalCost) with the appropriate value (e.g.,
state.costSnapshot.total or the field used to represent accumulated cost) so the
UI shows the correct cost immediately after resume; update the useEffect that
references engine and agentName to check for state.costSnapshot and call
setTotalCost(state.costSnapshot.<totalField>) alongside the other
initializations (e.g., setTaskUsageMap, setActiveAgentState, setRateLimitState,
setDetectedModel).
src/engine/index.ts (1)

470-475: ⚠️ Potential issue | 🟡 Minor

generatePromptPreview does not include verificationErrors or diffContext in the extended context.

buildPrompt (Line 155) now accepts verificationErrors and diffContext, and these are wired into the template context. However, generatePromptPreview builds its own extendedContext (Lines 471–475) without these fields. This means prompt previews will never show verification errors or diff context blocks, which could confuse users trying to debug why an agent received certain context.

Consider adding the missing fields (even as empty strings) to keep parity with the real prompt path.

Proposed fix
     const extendedContext = {
       recentProgress,
       codebasePatterns,
       prd: prdContext ?? undefined,
+      verificationErrors: '',
+      diffContext: formatDiffContext(this.recentDiffSummaries),
     };
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/engine/index.ts` around lines 470 - 475, generatePromptPreview builds an
extendedContext object that omits verificationErrors and diffContext, so
previews never show those blocks; update the extendedContext inside
generatePromptPreview to include verificationErrors and diffContext (e.g.,
verificationErrors: verificationErrors ?? "" and diffContext: diffContext ?? "")
so it matches buildPrompt's template context and defaults to empty strings when
absent; ensure you reference the same variable names used elsewhere
(verificationErrors, diffContext, prdContext) so the preview and real prompt
paths remain consistent.
🟡 Minor comments (26)
spec/20260222-engine-improvements/06-acceptance-criteria-validation/TODO.md-3-3 (1)

3-3: ⚠️ Potential issue | 🟡 Minor

Status appears stale.

The AI summary confirms src/engine/ac-validator.ts, its tests, and engine integration are all implemented. If Step 6 is indeed complete, update the status to ✅ Complete and add a corresponding COMPLETED.md (as done for other steps).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@spec/20260222-engine-improvements/06-acceptance-criteria-validation/TODO.md`
at line 3, Update the TODO step status from "⬜ Not Started" to "✅ Complete" in
spec/20260222-engine-improvements/06-acceptance-criteria-validation/TODO.md and
create a corresponding COMPLETED.md for this step mirroring the pattern used by
other steps; reference the implemented artifacts (src/engine/ac-validator.ts,
its tests, and the engine integration) in the COMPLETED.md so reviewers can
verify completion.
spec/20260222-engine-improvements/02-auto-commit-defaults/TASK.md-20-20 (1)

20-20: ⚠️ Potential issue | 🟡 Minor

Wrong file extension — run.ts should be run.tsx.

The CLI command file is src/commands/run.tsx, not run.ts.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@spec/20260222-engine-improvements/02-auto-commit-defaults/TASK.md` at line
20, The comment references the wrong file extension: update any mention of
`src/commands/run.ts` to the correct `src/commands/run.tsx` (e.g., in the
TASK.md line "CLI options parsed in `src/commands/run.ts`") so documentation and
diffs point to the actual CLI command file `run.tsx`; ensure any other
references to `run.ts` (README, comments, or task descriptions) are also
corrected to `run.tsx`.
.claude/spec-loop.local.md-7-10 (1)

7-10: ⚠️ Potential issue | 🟡 Minor

Circuit breaker state contradicts its own trigger conditions.

no_progress_count: 2 is below the 3-iteration open threshold, and error_count: 0 is far below the 5-iteration error threshold, yet circuit_breaker: open is set. If the spec-loop agent reads this field to decide whether to proceed, the premature open state will stall it unnecessarily. Either the counters or the breaker state need to be reconciled.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/spec-loop.local.md around lines 7 - 10, The circuit_breaker state is
inconsistent with threshold fields: with no_progress_count = 2 (<3) and
error_count = 0 (<5) the breaker should not be "open"; update the spec so the
circuit_breaker field matches the counters (e.g., set circuit_breaker: closed)
or else increment no_progress_count/error_count to meet the open thresholds;
ensure the change is applied to the keys no_progress_count, error_count, and
circuit_breaker so the spec-loop agent reads a consistent state.
spec/20260222-engine-improvements/08-parallel-first-class/TODO.md-3-3 (1)

3-3: ⚠️ Potential issue | 🟡 Minor

Stale status — step is already complete.

The COMPLETED.md for this same step ships in the same PR and confirms all items are done. Update the status to reflect completion, e.g. ✅ Complete.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@spec/20260222-engine-improvements/08-parallel-first-class/TODO.md` at line 3,
Update the stale status line in TODO.md: replace the existing header "## Status:
⬜ Not Started" with a completed state such as "## Status: ✅ Complete" (since the
corresponding COMPLETED.md shows the step is done) so the TODO reflects the real
status; keep the rest of the file intact.
src/engine/cost-tracker.ts-40-54 (1)

40-54: ⚠️ Potential issue | 🟡 Minor

No guard against negative token counts.

addIteration(-1000, 0) would silently decrement totalCost, inputCost, and totalInputTokens. A caller passing inverted or invalid usage data could corrupt the running session total.

🛡️ Proposed fix — validate inputs
  addIteration(inputTokens: number, outputTokens: number, model?: string): number {
+   if (inputTokens < 0 || outputTokens < 0) {
+     throw new Error(`Token counts must be non-negative (got input=${inputTokens}, output=${outputTokens})`);
+   }
    const pricing = this.getPricing(model);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/engine/cost-tracker.ts` around lines 40 - 54, The addIteration method
currently allows negative inputTokens/outputTokens which can corrupt snapshot
totals; add validation at the start of addIteration (before calling getPricing)
to ensure inputTokens and outputTokens are numbers, finite, and >= 0, and throw
a descriptive RangeError (or TypeError) on invalid values; keep the rest of the
logic (getPricing, computing inputCost/outputCost, updating
this.snapshot.totalCost, inputCost, outputCost, totalInputTokens,
totalOutputTokens, and iterationCosts) unchanged so only valid non-negative
token counts are recorded.
.claude/spec-loop.local.md-31-34 (1)

31-34: ⚠️ Potential issue | 🟡 Minor

Corrupted content in document body — machine-state leaked outside YAML front-matter.

Lines 31–34 contain raw state fragments (last_completed_step: 5, 0) that appear to have been accidentally appended outside the front-matter block. This could cause the spec-loop agent to fail when parsing the document.

🛠️ Proposed fix — remove orphaned lines
 When circuit breaker opens, analyze and fix before continuing.
-last_completed_step: 5
-0
-last_completed_step: 5
-0
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/spec-loop.local.md around lines 31 - 34, Remove the stray
machine-state fragments `last_completed_step: 5` and `0` that were appended
outside the YAML front-matter; open the file, locate the orphaned occurrences of
the `last_completed_step` and `0` tokens outside the front-matter block and
delete them so the document contains only a properly closed YAML front-matter
(ensure `---` start/end markers remain intact) and no leaked machine state that
could break the spec-loop parser.
src/engine/ac-validator.ts-7-11 (1)

7-11: ⚠️ Potential issue | 🟡 Minor

file-contains is declared but unreachable, and its grep invocation is broken

The ExecutableAC union (line 9) and the acToVerificationCommands switch (line 76) both reference 'file-contains', and the JSDoc at line 19–21 promises detection of "X contains Y" / "X includes Y" patterns. However, parseExecutableCriteria has no code branch that produces a 'file-contains' entry — so the case is permanently unreachable.

Additionally, the grep command at line 77 has no file argument; grep -q 'pattern' with no file reads from stdin and would block until the verification timeout fires (even though || true prevents a non-zero exit afterwards).

Either implement the detection logic and fix the command to include a target file, or remove the 'file-contains' variant from the union and the switch until it is properly implemented.

🐛 Proposed fix: remove the dead variant for now
 export interface ExecutableAC {
   original: string;
-  type: 'command' | 'file-exists' | 'file-contains';
+  type: 'command' | 'file-exists';
   assertion: string;
 }
 export function acToVerificationCommands(acs: ExecutableAC[]): string[] {
   return acs.map(ac => {
     switch (ac.type) {
       case 'command':
         return ac.assertion;
       case 'file-exists':
         return `test -e '${shellEscape(ac.assertion)}'`;
-      case 'file-contains':
-        return `grep -q '${shellEscape(ac.assertion)}' || true`; // soft check
       default:
         return '';
     }
   }).filter(Boolean);
 }

Also applies to: 69-82

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/engine/ac-validator.ts` around lines 7 - 11, The AC variant
'file-contains' is unreachable and its grep invocation lacks a file argument;
remove this dead variant and its handling until implemented: update the
ExecutableAC union to drop 'file-contains', remove the 'file-contains' case in
acToVerificationCommands, and clean up any JSDoc/comments in
parseExecutableCriteria referencing "X contains Y"; ensure
parseExecutableCriteria only emits the remaining types (e.g., 'file-exists' and
'command') so acToVerificationCommands and tests remain consistent with
ExecutableAC.
src/config/types.ts-411-412 (1)

411-412: ⚠️ Potential issue | 🟡 Minor

autoCommit documented as default: true but absent from DEFAULT_CONFIG

The updated JSDoc on line 411 states (default: true), but DEFAULT_CONFIG (line 532) has no autoCommit field. Consumers that derive defaults from DEFAULT_CONFIG will receive undefined, not true. Either add autoCommit: true to DEFAULT_CONFIG, or remove the default claim from the comment and document it in the consuming engine code where the actual fallback lives.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/config/types.ts` around lines 411 - 412, The JSDoc for the autoCommit
field claims a default of true but DEFAULT_CONFIG does not include autoCommit,
so consumers using DEFAULT_CONFIG get undefined; fix by adding autoCommit: true
to DEFAULT_CONFIG (in the same module where DEFAULT_CONFIG is defined) or
remove/adjust the JSDoc default and instead document the fallback where the
consuming engine applies the true default; reference the autoCommit property and
DEFAULT_CONFIG constant to locate and update the code.
src/engine/ac-validator.ts-56-59 (1)

56-59: ⚠️ Potential issue | 🟡 Minor

looksLikeCommand: s.includes(' run ') is overly broad

The substring run is common in plain English acceptance-criteria text (e.g. "the tests should run without errors", "make sure CI can run the suite"). This risks misclassifying non-executable criteria as shell commands.

🐛 Proposed fix
 function looksLikeCommand(s: string): boolean {
   const cmdPrefixes = ['bun ', 'npm ', 'npx ', 'node ', 'git ', 'curl ', 'test '];
-  return cmdPrefixes.some(p => s.startsWith(p)) || s.includes(' run ');
+  return cmdPrefixes.some(p => s.startsWith(p));
 }

If you want to keep run, tighten it to patterns like bun run or npm run which are already covered by the prefix list anyway.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/engine/ac-validator.ts` around lines 56 - 59, The looksLikeCommand
function is misclassifying plain text because of the broad s.includes(' run ')
check; update it to remove that generic check and instead match only explicit
package-run patterns (e.g., "npm run", "bun run", "yarn run", "pnpm run") or
rely on the existing cmdPrefixes array; modify looksLikeCommand to drop
s.includes(' run ') and, if needed, replace it with a targeted pattern check for
those specific run commands so ordinary phrases like "tests should run" are not
treated as shell commands.
src/engine/diff-summarizer.ts-29-35 (1)

29-35: ⚠️ Potential issue | 🟡 Minor

Renamed and copied files produce malformed paths in the diff summary

In the short-format (and porcelain v1), the → PATH2 part is shown when a file corresponds to a different path, i.e. the file is renamed. Concretely, a staged rename emits R old.ts -> new.ts; line.substring(3) therefore yields "old.ts -> new.ts", which is neither a valid path nor either of the actual filenames. That string then lands in filesChanged and is injected verbatim into agent prompts.

In the -z format the -> separator is omitted and the field order is reversed, which makes it machine-friendly. Switching to --porcelain -z (NUL-delimited) and splitting on \0 is the recommended approach for machine parsing, or at minimum detect and strip the -> … suffix for R/C entries:

🐛 Proposed minimal fix for rename/copy entries
   for (const line of lines) {
     const status = line.substring(0, 2).trim();
-    const file = line.substring(3);
+    // Renames/copies: "R  old.ts -> new.ts" — keep only the destination path
+    const rawFile = line.substring(3);
+    const file = rawFile.includes(' -> ') ? rawFile.split(' -> ').pop()! : rawFile;
     if (status === 'A' || status === '??') filesAdded.push(file);
     else if (status === 'D') filesDeleted.push(file);
     else filesChanged.push(file);
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/engine/diff-summarizer.ts` around lines 29 - 35, The current loop in
diff-summarizer (iterating over lines, using status = line.substring(0,2) and
file = line.substring(3)) mishandles rename/copy entries (e.g. "R  old -> new")
producing malformed paths; update the parsing so that when status indicates
rename/copy (status startsWith 'R' or 'C' or the file string contains "->") you
extract the destination path (the text after "->") and trim it before pushing
into filesAdded/filesChanged/filesDeleted, and ideally switch the upstream
parsing to use git --porcelain -z and split on '\0' to avoid ambiguous
separators — adjust references to variables lines, status, file, filesAdded,
filesDeleted, filesChanged and the loop that builds them.
src/tui/components/RunApp.tsx-1815-1818 (1)

1815-1818: ⚠️ Potential issue | 🟡 Minor

cost:threshold-exceeded provides no distinct user notification.

The handler updates the cost display but relies entirely on the subsequent engine:paused event to communicate the pause to the user. There is no toast or banner indicating why execution stopped, which could confuse users who did not configure the threshold themselves (e.g., on a shared config).

💡 Suggested improvement — surface a threshold toast
 case 'cost:threshold-exceeded':
   // Engine will auto-pause; update cost display
   setTotalCost(event.snapshot.totalCost);
+  setInfoFeedback(`Cost threshold reached (${event.snapshot.totalCost.toFixed(4)}). Execution paused.`);
   break;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/tui/components/RunApp.tsx` around lines 1815 - 1818, The
'cost:threshold-exceeded' handler in RunApp.tsx only calls setTotalCost and
doesn't notify the user why execution stopped; update the case for
'cost:threshold-exceeded' to also surface a user-facing notification (e.g., call
the existing toast/enqueueSnackbar/notify function used elsewhere in RunApp)
with a clear message that execution was paused due to a cost threshold and
include the current total (event.snapshot.totalCost); keep the cost update but
add the toast/banner call so users immediately see why the run paused before the
separate 'engine:paused' event arrives.
spec/20260222-engine-improvements/01-verification-gates/TODO.md-1-21 (1)

1-21: ⚠️ Potential issue | 🟡 Minor

Stale planning document — same issue as the Step 2 TODO.

Status shows "⬜ Not Started" with all tasks unchecked, while the verification gate implementation (verification.ts, event types, engine wiring, and full test suite) is demonstrably complete. Update or remove this file to avoid confusion.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@spec/20260222-engine-improvements/01-verification-gates/TODO.md` around lines
1 - 21, The TODO planning doc is stale and conflicts with the implemented
verification work; update or remove
spec/20260222-engine-improvements/01-verification-gates/TODO.md to reflect the
current state by either deleting the file or marking tasks done and adding notes
linking to the real implementations (e.g., VerificationConfig and
DEFAULT_VERIFICATION_CONFIG in src/config/types.ts, verification?:
VerificationConfig on StoredConfig/RalphConfig, runVerification() and
formatVerificationErrors() in src/engine/verification.ts,
VerificationStartedEvent/VerificationPassedEvent/VerificationFailedEvent in
src/engine/types.ts, engine integration in src/engine/index.ts,
verificationErrors in TemplateVariables in src/templates/types.ts, and template
changes in src/plugins/trackers/builtin/json/template.hbs) so the checklist and
Status field accurately represent completed work.
tests/engine/ac-validator.test.ts-73-92 (1)

73-92: ⚠️ Potential issue | 🟡 Minor

No test coverage for the file-contains case in acToVerificationCommands.

acToVerificationCommands has a file-contains branch (grep -q '…' || true), but parseExecutableCriteria never produces that type — making the branch unreachable via the normal flow. The test suite has no test for this case, and || true means it would always succeed even if the content is absent, which silently defeats the verification intent.

Consider either removing the dead branch from acToVerificationCommands or, if file-contains is intended for future use, adding an explicit test and removing the || true safety valve before it becomes active.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/engine/ac-validator.test.ts` around lines 73 - 92,
acToVerificationCommands contains a dead "file-contains" branch (the grep -q '…'
|| true form) that parseExecutableCriteria never produces; either remove that
branch to avoid misleading behavior or, if you intend to support it, update
parseExecutableCriteria to emit type: 'file-contains' and add a unit test in
ac-validator.test.ts covering the expected output (e.g., assertion path and
content => "grep -q '…' <file>" without the "|| true" safety) and remove the "||
true" so failures surface; refer to acToVerificationCommands and
parseExecutableCriteria when making the change and add/modify the test similar
to the existing cases in ac-validator.test.ts.
spec/20260222-engine-improvements/02-auto-commit-defaults/TODO.md-1-21 (1)

1-21: ⚠️ Potential issue | 🟡 Minor

Stale planning document — status and task boxes are out of date.

This file records "⬜ Not Started" with all tasks unchecked, yet the corresponding COMPLETED.md documents the full implementation. Leaving the TODO in this state alongside a COMPLETED counterpart is misleading for future contributors. Consider either:

  • Updating the status and ticking all boxes to reflect completion, or
  • Deleting the file, since COMPLETED.md is the canonical record.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@spec/20260222-engine-improvements/02-auto-commit-defaults/TODO.md` around
lines 1 - 21, The TODO.md planning document is stale (still shows "⬜ Not
Started") while the COMPLETED.md records the work as done; either update this
TODO to reflect completion by changing the status to "✅ Completed" and checking
off all tasks (or adding a brief "moved to COMPLETED.md" note), or remove the
TODO.md entirely so only the canonical COMPLETED.md remains; make this change in
the spec/20260222-engine-improvements/02-auto-commit-defaults/TODO.md file.
spec/20260222-engine-improvements/02-auto-commit-defaults/COMPLETED.md-30-33 (1)

30-33: ⚠️ Potential issue | 🟡 Minor

Missing language specifier on fenced code block (markdownlint MD040).

Add a language tag to suppress the linter warning:

✏️ Proposed fix
-```
+```text
 bun run typecheck   PASS
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@spec/20260222-engine-improvements/02-auto-commit-defaults/COMPLETED.md`
around lines 30 - 33, The fenced code block in COMPLETED.md is missing a
language specifier (triggering markdownlint MD040); update the triple-backtick
that opens the block to include a language tag (e.g., change ``` to ```text or
```bash) so the block reads like "```text" (or your preferred language) to
satisfy the linter and preserve the existing content (the block containing "bun
run typecheck   PASS", "bun run build       PASS", "bun test            PASS").
spec/20260222-engine-improvements/06-acceptance-criteria-validation/COMPLETED.md-39-42 (1)

39-42: ⚠️ Potential issue | 🟡 Minor

Missing language specifier on fenced code block (markdownlint MD040).

✏️ Proposed fix
-```
+```text
 bun run typecheck  ✓ (clean)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@spec/20260222-engine-improvements/06-acceptance-criteria-validation/COMPLETED.md`
around lines 39 - 42, The fenced code block containing the lines "bun run
typecheck  ✓ (clean)" / "bun run build      ✓ (clean)" / "bun test
tests/engine/ac-validator.test.ts  ✓ (17 pass, 0 fail)" is missing a language
specifier; update the opening fence from ``` to ```text so the block becomes
```text and includes the three lines, which satisfies markdownlint MD040.
src/config/index.ts-704-704 (1)

704-704: ⚠️ Potential issue | 🟡 Minor

autoCommit: true default is a silent breaking change for existing users.

Before this PR, unset autoCommit meant no auto-commits. After this PR, every existing deployment without an explicit autoCommit = false in their config will start auto-committing after upgrading. This should be prominently highlighted in the changelog/release notes, and a --no-auto-commit deprecation notice or migration guide should accompany the change.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/config/index.ts` at line 704, The change made autoCommit default to true
(autoCommit: options.autoCommit ?? storedConfig.autoCommit ?? true) which
silently enables auto-commits for existing users; revert to the prior implicit
"off" behavior by changing the default to false (e.g., use options.autoCommit ??
storedConfig.autoCommit ?? false) and add a clear deprecation/migration path:
log a one-time warning when autoCommit is undefined recommending explicit
opt-in/opt-out, add a CLI/flag handler for --no-auto-commit/--auto-commit to
make opt-in explicit, and document the change in the release notes/changelog.
Ensure you update the code that reads autoCommit (the autoCommit assignment
using options and storedConfig) and add the warning where configuration is
initialized.
tests/engine/diff-summarizer.test.ts-90-142 (1)

90-142: ⚠️ Potential issue | 🟡 Minor

Missing test coverage for the filesDeleted path.

generateDiffSummaryLocal has an explicit else if (status === 'D') filesDeleted.push(file) branch (line 54), and DiffSummary exposes filesDeleted as a first-class field. There is currently no test that:

  • verifies a deleted tracked file populates result.filesDeleted
  • verifies the "Deleted:" prefix appears in result.summary
✅ Suggested test additions
+  test('deleted tracked files populate filesDeleted', async () => {
+    // Remove a file that was committed in initGitRepo
+    const { rm } = await import('node:fs/promises');
+    await rm(join(tempDir, 'README.md'));
+
+    const result = await generateDiffSummaryLocal(tempDir);
+    expect(result).not.toBeNull();
+    expect(result!.filesDeleted).toContain('README.md');
+    expect(result!.filesAdded).toHaveLength(0);
+    expect(result!.filesChanged).toHaveLength(0);
+  });
+
+  test('summary contains Deleted prefix for removed files', async () => {
+    const { rm } = await import('node:fs/promises');
+    await rm(join(tempDir, 'README.md'));
+
+    const result = await generateDiffSummaryLocal(tempDir);
+    expect(result).not.toBeNull();
+    expect(result!.summary).toContain('Deleted:');
+    expect(result!.summary).toContain('README.md');
+  });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/engine/diff-summarizer.test.ts` around lines 90 - 142, Add a test that
removes a tracked file and asserts the deleted-path behavior: call unlink/remove
on an already-committed file (e.g., 'README.md'), run
generateDiffSummaryLocal(tempDir), then assert result is not null,
result.filesDeleted contains 'README.md' and result.summary contains the
"Deleted:" prefix; reference generateDiffSummaryLocal and the
DiffSummary.filesDeleted field to locate where to add this test.
spec/20260222-engine-improvements/05-completion-detection/TASK.md-61-69 (1)

61-69: ⚠️ Potential issue | 🟡 Minor

Remove dead code and clarify misleading comment in spec and implementation.

The spec file contains the dead code as identified, but the actual implementation in src/engine/completion-strategies.ts (lines 27–34) has already been corrected and does not contain the unused stripped variable. However, both the spec and the actual implementation retain a misleading comment claiming "Match even inside markdown code blocks" when neither implementation actually strips fence markers before testing.

Spec (TASK.md lines 61–69):

  • Remove the no-op const stripped = ... line.
  • Replace the misleading comment with an accurate description: the strategy extends promiseTagStrategy with an alternative promise: complete pattern but does not handle code fences.

Actual implementation (src/engine/completion-strategies.ts line 30):

  • Update the comment to clarify that the strategy does not detect tags inside markdown code fences, only the explicit tag or the informal pattern.

If stripping code fences is genuinely required, implement it correctly by testing against the stripped string instead of the original.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@spec/20260222-engine-improvements/05-completion-detection/TASK.md` around
lines 61 - 69, Remove the dead no-op `const stripped = ...` line from the spec's
`relaxedTagStrategy` and change the misleading comment "Match even inside
markdown code blocks" to a truthful description that this strategy extends
`promiseTagStrategy` by also matching the informal `promise: complete` pattern
but does not strip markdown fences; also update the comment in the actual
implementation's `relaxedTagStrategy` (same symbol) to the same clarified
wording. If you instead want to support detection inside fenced code blocks,
implement proper fence-stripping and run the regex tests against the stripped
string (replace uses of `result.stdout` with the stripped variable) rather than
leaving the current no-op.
src/config/index.ts-705-712 (1)

705-712: ⚠️ Potential issue | 🟡 Minor

Consider applying DEFAULT_MODEL_ESCALATION as base in config construction for clarity.

When storedConfig.modelEscalation is undefined and the user provides only --start-model, the constructed config object becomes { enabled: true, startModel: 'x' } with escalateModel and escalateAfter absent. While the engine applies DEFAULT_MODEL_ESCALATION as a base (at line 1006 in src/engine/index.ts) before calling getModelForTask, applying defaults at config construction would be more explicit and defensive.

Additionally, the ?? undefined patterns on lines 712–713 are redundant when the value is already undefined | T (the nullish coalescing does nothing).

Suggested improvement
-    modelEscalation: (options.startModel || options.escalateModel)
-      ? {
-          ...storedConfig.modelEscalation,
-          enabled: true,
-          ...(options.startModel ? { startModel: options.startModel } : {}),
-          ...(options.escalateModel ? { escalateModel: options.escalateModel } : {}),
-        }
-      : storedConfig.modelEscalation ?? undefined,
+    modelEscalation: (options.startModel || options.escalateModel)
+      ? {
+          ...DEFAULT_MODEL_ESCALATION,
+          ...storedConfig.modelEscalation,
+          enabled: true,
+          ...(options.startModel ? { startModel: options.startModel } : {}),
+          ...(options.escalateModel ? { escalateModel: options.escalateModel } : {}),
+        }
+      : storedConfig.modelEscalation,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/config/index.ts` around lines 705 - 712, The modelEscalation block should
use DEFAULT_MODEL_ESCALATION as the base when constructing the config so partial
CLI flags (like options.startModel) get the full default shape; change the logic
that builds modelEscalation (symbol: modelEscalation) to merge
DEFAULT_MODEL_ESCALATION, then overlay storedConfig.modelEscalation and finally
overlay flags (options.startModel / options.escalateModel) so escalateModel and
escalateAfter are present when appropriate; also remove the redundant "??
undefined" nullish coalescing on the fallback to storedConfig.modelEscalation
since it's already undefined | T.
spec/20260222-engine-improvements/04-cross-iteration-context/TODO.md-1-3 (1)

1-3: ⚠️ Potential issue | 🟡 Minor

TODO status shows "Not Started" but Step 4 is marked complete in PLAN.md.

The PLAN.md completion log (line 151) records Step 4 as completed on 2026-02-22, and the AI summary confirms the implementation exists. This TODO file appears to have been left un-updated. Consider updating the status or removing this file if a separate COMPLETED.md already exists.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@spec/20260222-engine-improvements/04-cross-iteration-context/TODO.md` around
lines 1 - 3, Update the TODO file "Step 4: Cross-Iteration Context - Todo" so
its status matches PLAN.md (Step 4 marked complete on 2026-02-22): either change
the "## Status:" line from "⬜ Not Started" to a completed state with the
completion date, or remove this TODO.md if a separate COMPLETED.md already
records the finished work; ensure the header "Step 4: Cross-Iteration Context -
Todo" and status line are updated to keep documentation consistent with the
PLAN.md entry.
spec/20260222-engine-improvements/06-acceptance-criteria-validation/TASK.md-58-70 (1)

58-70: ⚠️ Potential issue | 🟡 Minor

Backtick/single-quote regex may cause false-positive command extraction

The regex /[`']([^`']+)[`']/ matches content between mismatched delimiters (e.g., `foo' or 'foo`). This is likely intentional for flexibility with agent output variations, but could extract non-command text from ACs that use single quotes for emphasis (e.g., The 'login' page should work), feeding it to looksLikeCommand. The looksLikeCommand guard mitigates this somewhat but phrases like 'node modules should be installed' would match the 'node ' prefix.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@spec/20260222-engine-improvements/06-acceptance-criteria-validation/TASK.md`
around lines 58 - 70, The current regex /[`']([^`']+)[`']/ can match mismatched
delimiters and produce false-positive command captures; update
parseExecutableCriteria so you only capture content enclosed by the same
delimiter (prefer backticks for shell commands) — e.g., replace the single
combined pattern with a strict backtick-only match (/`([^`]+)`/) and optionally
a separate single-quote match (/\'([^']+)\'/) that is only used if
looksLikeCommand returns true; adjust the cmdMatch handling in
parseExecutableCriteria (the cmdMatch variable and the branch that pushes
type:'command') accordingly so you no longer extract text from mismatched
delimiters.
spec/20260222-engine-improvements/07-cost-tracking/TASK.md-93-95 (1)

93-95: ⚠️ Potential issue | 🟡 Minor

Shallow spread does not deep-copy iterationCosts array

getSnapshot() returns { ...this.snapshot }, which copies primitive fields but shares the iterationCosts array reference. A caller doing snapshot.iterationCosts.push(…) would mutate the tracker's internal state. The spec should note that a proper copy (e.g., spreading the array) is needed.

   getSnapshot(): CostSnapshot {
-    return { ...this.snapshot };
+    return { ...this.snapshot, iterationCosts: [...this.snapshot.iterationCosts] };
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@spec/20260222-engine-improvements/07-cost-tracking/TASK.md` around lines 93 -
95, getSnapshot() currently returns a shallow copy ({ ...this.snapshot }) which
preserves the same iterationCosts array reference; change it to return a copy
that clones the iterationCosts array as well (e.g., return { ...this.snapshot,
iterationCosts: [...this.snapshot.iterationCosts] }) so callers cannot mutate
the tracker's internal state via snapshot.iterationCosts; if iterationCosts
contains mutable objects, also deep-copy those elements as needed to fully
isolate the snapshot from internal state.
src/commands/run.tsx-534-543 (1)

534-543: ⚠️ Potential issue | 🟡 Minor

i++ placement inconsistent with other numeric flag parsers — value arg may be re-parsed as a flag

When nextArg is present but fails the parseInt validation (e.g., --conflict-timeout abc), i is not incremented, so "abc" is re-processed as a flag name in the next loop iteration. Compare with --iterations (line 378) where i++ is outside the inner if, correctly consuming the value argument regardless of parse success. The same inconsistency exists for --listen-port.

Proposed fix
       case '--conflict-timeout': {
         if (nextArg && !nextArg.startsWith('-')) {
           const parsed = parseInt(nextArg, 10);
           if (!isNaN(parsed) && parsed > 0) {
             options.conflictTimeout = parsed;
-            i++;
           }
+          i++;
         }
         break;
       }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/commands/run.tsx` around lines 534 - 543, The --conflict-timeout case
currently only increments i when parseInt succeeds, causing the provided value
token (e.g., "abc") to be re-parsed as a flag; change the logic in the case
'--conflict-timeout' to always increment i whenever nextArg exists and does not
start with '-' (like the --iterations case) but only assign
options.conflictTimeout when parsed value is a valid number > 0; apply the same
fix pattern to the --listen-port case so both flags consume their value token
even if validation fails.
spec/20260222-engine-improvements/PLAN.md-44-57 (1)

44-57: ⚠️ Potential issue | 🟡 Minor

Dependency graph has Steps 4 and 6 swapped.

The graph labels Step 4 as "(AC validation)" and Step 6 as "(cross-iteration context)", but the table below (lines 61–70) correctly shows Step 4 = Cross-iteration context and Step 6 = Acceptance criteria validation. The arrow Step 1 → Step 4 should read Step 1 → Step 6, and the labels for 4 and 6 need swapping.

Proposed fix
-Step 1 (verification) ─► Step 4 (AC validation)
+Step 1 (verification) ─► Step 6 (AC validation)
          │
 Step 2 (auto-commit) ──── independent
          │
 Step 3 (model escalation) ─► Step 7 (cost tracking)
          │
-Step 5 (completion detection) ── independent
+Step 4 (cross-iteration context) ── independent-Step 6 (cross-iteration context) ── independent
+Step 5 (completion detection) ── independent
          │
 Step 8 (parallel first-class) ── independent
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@spec/20260222-engine-improvements/PLAN.md` around lines 44 - 57, The
dependency graph mislabels Step 4 and Step 6 and the edge from Step 1 points to
the wrong step; update the diagram so Step 4 is labeled "cross-iteration
context" and Step 6 is labeled "AC validation" (swapping the two labels), and
change the arrow "Step 1 (verification) ─► Step 4 (AC validation)" to "Step 1
(verification) ─► Step 6 (AC validation)"; ensure any other references in the
same diagram to Step 4/6 reflect this swap so the diagram matches the table
below.
spec/20260222-engine-improvements/07-cost-tracking/TASK.md-101-108 (1)

101-108: ⚠️ Potential issue | 🟡 Minor

Model matching with includes() may produce incorrect results for some model names

The getPricing method uses model.includes(k) as a fallback matcher. This is too permissive — for example, a model name such as "opus-wrapper" would match the short key 'opus' and receive incorrect pricing. Whilst the current MODEL_PRICING keys are unlikely to cause false matches in typical Claude model names, the logic lacks precision.

Consider instead: (1) enforce startsWith() only and remove includes(), relying on full versioned keys like 'claude-opus-4-6' for prefix matching, or (2) sort keys from longest to shortest before matching to prioritise more-specific entries.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@spec/20260222-engine-improvements/07-cost-tracking/TASK.md` around lines 101
- 108, The getPricing method currently uses model.includes(k) which is too
permissive and can produce false matches; update getPricing (referencing the
getPricing function, MODEL_PRICING and the key variable) to remove the
includes() fallback and instead either (A) only match exact equality or prefix
with startsWith(k) (relying on full versioned keys in MODEL_PRICING), or (B)
sort Object.keys(MODEL_PRICING) by descending key length first and then find the
first key where model === k || model.startsWith(k) so more specific keys win;
implement one of these options and return MODEL_PRICING['sonnet'] as the safe
default when no key is found.

Comment on lines +100 to +113
export function acToVerificationCommands(acs: ExecutableAC[], cwd: string): string[] {
return acs.map(ac => {
switch (ac.type) {
case 'command':
return ac.assertion;
case 'file-exists':
return `test -e "${ac.assertion}"`;
case 'file-contains':
return `grep -q "${ac.assertion}" || true`; // soft check
default:
return '';
}
}).filter(Boolean);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Command injection risk: user-derived strings interpolated into shell commands without escaping

ac.assertion values are extracted from user-authored acceptance criteria via regex and placed directly into shell command strings with double-quote interpolation (lines 106, 108). A crafted AC like Running `bun test"; rm -rf /` passes would produce a dangerous verification command. The AI summary mentions a shellEscape internal helper — ensure the implementation applies it to all interpolated values.

Additionally, the file-contains case (line 108) generates grep -q "${ac.assertion}" || true without a target file path, which would read from stdin and hang.

Finally, the cwd parameter is declared but never used in the function body.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@spec/20260222-engine-improvements/06-acceptance-criteria-validation/TASK.md`
around lines 100 - 113, The acToVerificationCommands function is vulnerable
because ac.assertion and cwd are interpolated into shell strings without
escaping and the 'file-contains' branch emits a grep that reads from stdin;
update acToVerificationCommands to call the internal shellEscape helper on every
interpolated value (use shellEscape(ac.assertion) for file paths/patterns and
shellEscape(cwd) when prepending a working-directory change), change the
'file-exists' case to use test -e "shellEscape(ac.assertion)", change the
'file-contains' case to run grep safely (e.g. grep -qr --
"shellEscape(ac.assertion)" . || true to search under the given cwd) and prefix
each returned command with cd "shellEscape(cwd)" && ... so cwd is actually used;
keep 'command' behavior but ensure you only interpolate cwd (not raw
ac.assertion) into the wrapper command so user commands are not concatenated
into unescaped shell contexts.

Comment on lines +64 to +71
private getPricing(model?: string): ModelPricing {
if (!model) return MODEL_PRICING['sonnet']; // safe default
// Try exact match, then prefix match
const key = Object.keys(MODEL_PRICING).find(
k => model === k || model.startsWith(k) || model.includes(k)
);
return key ? MODEL_PRICING[key] : MODEL_PRICING['sonnet'];
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Version-specific pricing keys are unreachable dead code, and the matching strategy is fragile.

Object.keys iterates in insertion order. Because 'opus' precedes 'claude-opus-4-6' (and likewise for sonnet/haiku), the condition model.includes('opus') is satisfied for the short key before the version-specific key is ever checked. This makes every 'claude-*' key permanently unreachable — if they were ever assigned different pricing, it would silently be ignored.

Additionally, model.includes(k) performs unanchored substring matching: any model whose name happens to contain a tier word (e.g. a hypothetical "new-opus-lite") would silently receive the wrong price tier.

A safer approach separates match phases:

♻️ Proposed fix — prioritised, non-ambiguous match
-  private getPricing(model?: string): ModelPricing {
-    if (!model) return MODEL_PRICING['sonnet']; // safe default
-    // Try exact match, then prefix match
-    const key = Object.keys(MODEL_PRICING).find(
-      k => model === k || model.startsWith(k) || model.includes(k)
-    );
-    return key ? MODEL_PRICING[key] : MODEL_PRICING['sonnet'];
-  }
+  private getPricing(model?: string): ModelPricing {
+    if (!model) return MODEL_PRICING['sonnet'];
+    // 1. Exact match
+    if (MODEL_PRICING[model]) return MODEL_PRICING[model];
+    // 2. Prefix match (model name starts with a known key)
+    const prefixKey = Object.keys(MODEL_PRICING).find(k => model.startsWith(k));
+    if (prefixKey) return MODEL_PRICING[prefixKey];
+    // 3. Substring match (model name contains a known key)
+    const subKey = Object.keys(MODEL_PRICING).find(k => model.includes(k));
+    return subKey ? MODEL_PRICING[subKey] : MODEL_PRICING['sonnet'];
+  }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
private getPricing(model?: string): ModelPricing {
if (!model) return MODEL_PRICING['sonnet']; // safe default
// Try exact match, then prefix match
const key = Object.keys(MODEL_PRICING).find(
k => model === k || model.startsWith(k) || model.includes(k)
);
return key ? MODEL_PRICING[key] : MODEL_PRICING['sonnet'];
}
private getPricing(model?: string): ModelPricing {
if (!model) return MODEL_PRICING['sonnet'];
// 1. Exact match
if (MODEL_PRICING[model]) return MODEL_PRICING[model];
// 2. Prefix match (model name starts with a known key)
const prefixKey = Object.keys(MODEL_PRICING).find(k => model.startsWith(k));
if (prefixKey) return MODEL_PRICING[prefixKey];
// 3. Substring match (model name contains a known key)
const subKey = Object.keys(MODEL_PRICING).find(k => model.includes(k));
return subKey ? MODEL_PRICING[subKey] : MODEL_PRICING['sonnet'];
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/engine/cost-tracker.ts` around lines 64 - 71, getPricing currently uses
unanchored includes and insertion-order key scanning so short keys like "opus"
shadow versioned keys like "claude-opus-4-6"; fix by changing the matching
strategy in getPricing: first check for exact match of model in MODEL_PRICING,
then perform a prefix match using keys sorted by descending length (so more
specific/versioned keys like "claude-opus-4-6" are matched before "opus"), and
do not use model.includes(k); fall back to MODEL_PRICING['sonnet'] if no match
is found. Ensure you reference getPricing and MODEL_PRICING when implementing
this change.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (2)
src/engine/index.ts (1)

1363-1374: promiseComplete is misleading now that completion uses pluggable strategies.

With strategies like 'relaxed-tag' or 'heuristic', the variable no longer specifically reflects a <promise>COMPLETE</promise> signal. Renaming to completionSignalled (or agentSignalledComplete) would keep the semantics accurate as more strategies are added.

♻️ Suggested rename
-     const completionResult = detectCompletion(
-       agentResult,
-       this.config.completion?.strategies ?? ['promise-tag'],
-     );
-     const promiseComplete = completionResult.completed;
+     const completionResult = detectCompletion(
+       agentResult,
+       this.config.completion?.strategies ?? ['promise-tag'],
+     );
+     const completionSignalled = completionResult.completed;

      // ...
-     let taskCompleted = promiseComplete;
+     let taskCompleted = completionSignalled;

      // ...
-     } else if (!promiseComplete) {
+     } else if (!completionSignalled) {

      // ...
        promiseComplete,   // in IterationResult
+       // rename field to completionSignalled in IterationResult if types permit
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/engine/index.ts` around lines 1363 - 1374, The variable name
promiseComplete is misleading now that detectCompletion supports pluggable
strategies; rename promiseComplete to completionSignalled (or
agentSignalledComplete) and update the assignment and any uses (e.g., let
completionSignalled = completionResult.completed) and the following
taskCompleted initialization (let taskCompleted = completionSignalled) as well
as the surrounding comment to reflect "completion signalled by strategy" rather
than an explicit <promise>COMPLETE</promise>; update references to
detectCompletion, completionResult, and this.config.completion?.strategies
accordingly.
src/commands/run.tsx (1)

697-706: Default parallel mode changed to 'auto' — existing projects without an explicit parallel.mode will now receive dependency-graph-based parallel scheduling on ralph-tui run.

Previously the fallback was sequential. Any project relying on implicit sequential behaviour will silently begin running tasks in parallel (when the graph analysis recommends it). Consider adding a release note or migration warning, or defaulting only for new projects that have never stored a config.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/commands/run.tsx` around lines 697 - 706, The resolveParallelMode
function currently defaults to 'auto' even for projects with an existing
storedConfig missing a parallel.mode; change it so 'auto' is returned only when
there is no storedConfig at all (new projects), and when storedConfig exists but
has no parallel.mode fall back to the previous sequential default ('never').
Update the final return to check storedConfig explicitly (e.g., if storedConfig
== null return 'auto'; else return storedConfig.parallel?.mode ?? 'never'),
keeping the CLI flag checks (options.serial/options.parallel) intact;
references: resolveParallelMode, options.serial, options.parallel,
storedConfig.parallel.mode.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/commands/run.tsx`:
- Around line 548-557: The --conflict-timeout branch fails to consume the next
token when parsing fails, leaving the invalid value as the next arg; update the
handler in run.tsx so that when nextArg exists and !nextArg.startsWith('-') you
always increment i (consume the token) regardless of whether parseInt(nextArg,
10) yields a valid positive number; keep assigning options.conflictTimeout only
on successful parse (using parseInt and the parsed > 0 check) but move or add
i++ so the token is consumed in both success and failure paths, matching the
behavior of the other integer flags (variables: i, nextArg, parseInt,
options.conflictTimeout).

In `@src/engine/index.ts`:
- Around line 1019-1022: Remove the redundant console output that corrupts the
TUI: in the escalation path where this.emit(event) is called (the model
escalation logic referencing previousModel, escalatedModel and attempts), delete
the console.log(`[model-escalation] Escalating model: ${previousModel} →
${escalatedModel} after ${attempts} failed attempt(s)`); line so only the
"model:escalated" event is emitted and no direct stdout writes occur.
- Around line 1553-1563: The cost threshold check currently fires every
iteration because snapshot.totalCost is monotonic; add a private boolean flag
(e.g., costThresholdTripped) to the class and change the logic around the
alertThreshold check in the method that contains
this.config.cost?.alertThreshold, snapshot.totalCost, this.emit and this.pause
so that you only emit `cost:threshold-exceeded` and call this.pause() when
alertThreshold > 0 && !this.costThresholdTripped && snapshot.totalCost >=
alertThreshold, and then set this.costThresholdTripped = true after
emitting/pausing (optionally provide a method to clear/reset the flag if you
ever allow decreasing cost or manual override).

---

Nitpick comments:
In `@src/commands/run.tsx`:
- Around line 697-706: The resolveParallelMode function currently defaults to
'auto' even for projects with an existing storedConfig missing a parallel.mode;
change it so 'auto' is returned only when there is no storedConfig at all (new
projects), and when storedConfig exists but has no parallel.mode fall back to
the previous sequential default ('never'). Update the final return to check
storedConfig explicitly (e.g., if storedConfig == null return 'auto'; else
return storedConfig.parallel?.mode ?? 'never'), keeping the CLI flag checks
(options.serial/options.parallel) intact; references: resolveParallelMode,
options.serial, options.parallel, storedConfig.parallel.mode.

In `@src/engine/index.ts`:
- Around line 1363-1374: The variable name promiseComplete is misleading now
that detectCompletion supports pluggable strategies; rename promiseComplete to
completionSignalled (or agentSignalledComplete) and update the assignment and
any uses (e.g., let completionSignalled = completionResult.completed) and the
following taskCompleted initialization (let taskCompleted = completionSignalled)
as well as the surrounding comment to reflect "completion signalled by strategy"
rather than an explicit <promise>COMPLETE</promise>; update references to
detectCompletion, completionResult, and this.config.completion?.strategies
accordingly.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1fb98f7 and 069d253.

📒 Files selected for processing (2)
  • src/commands/run.tsx
  • src/engine/index.ts

Comment on lines +548 to +557
case '--conflict-timeout': {
if (nextArg && !nextArg.startsWith('-')) {
const parsed = parseInt(nextArg, 10);
if (!isNaN(parsed) && parsed > 0) {
options.conflictTimeout = parsed;
i++;
}
}
break;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

--conflict-timeout does not consume nextArg on parse failure, inconsistent with other parsers.

When an invalid value is provided (e.g., --conflict-timeout abc), i is not incremented so abc becomes the next arg in the loop. For all other integer flags (--iterations, --delay, --listen-port) the outer if block always ends with i++, consuming the token regardless of parse validity.

🐛 Proposed fix
      case '--conflict-timeout': {
        if (nextArg && !nextArg.startsWith('-')) {
          const parsed = parseInt(nextArg, 10);
          if (!isNaN(parsed) && parsed > 0) {
            options.conflictTimeout = parsed;
-           i++;
          }
+         i++;
        }
        break;
      }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
case '--conflict-timeout': {
if (nextArg && !nextArg.startsWith('-')) {
const parsed = parseInt(nextArg, 10);
if (!isNaN(parsed) && parsed > 0) {
options.conflictTimeout = parsed;
i++;
}
}
break;
}
case '--conflict-timeout': {
if (nextArg && !nextArg.startsWith('-')) {
const parsed = parseInt(nextArg, 10);
if (!isNaN(parsed) && parsed > 0) {
options.conflictTimeout = parsed;
}
i++;
}
break;
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/commands/run.tsx` around lines 548 - 557, The --conflict-timeout branch
fails to consume the next token when parsing fails, leaving the invalid value as
the next arg; update the handler in run.tsx so that when nextArg exists and
!nextArg.startsWith('-') you always increment i (consume the token) regardless
of whether parseInt(nextArg, 10) yields a valid positive number; keep assigning
options.conflictTimeout only on successful parse (using parseInt and the parsed
> 0 check) but move or add i++ so the token is consumed in both success and
failure paths, matching the behavior of the other integer flags (variables: i,
nextArg, parseInt, options.conflictTimeout).

Comment on lines +1019 to +1022
failedAttempts: attempts,
};
this.emit(event);
console.log(`[model-escalation] Escalating model: ${previousModel} → ${escalatedModel} after ${attempts} failed attempt(s)`);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

console.log in TUI mode will corrupt the display; the model:escalated event already surfaces this information.

The model:escalated event is emitted at Line 1021 — the console.log at Line 1022 is redundant and writes to the same stdout stream as the TUI renderer, causing visual corruption whenever escalation fires in interactive mode. The existing console.log calls for rate-limit (Line 924) carry the same risk but are pre-existing; avoid adding more.

🐛 Proposed fix — remove the redundant log
          if (attempts >= escalationConfig.escalateAfter && previousModel) {
            const event: ModelEscalatedEvent = {
              type: 'model:escalated',
              timestamp: new Date().toISOString(),
              taskId: task.id,
              previousModel,
              newModel: escalatedModel,
              failedAttempts: attempts,
            };
            this.emit(event);
-           console.log(`[model-escalation] Escalating model: ${previousModel} → ${escalatedModel} after ${attempts} failed attempt(s)`);
          }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
failedAttempts: attempts,
};
this.emit(event);
console.log(`[model-escalation] Escalating model: ${previousModel}${escalatedModel} after ${attempts} failed attempt(s)`);
failedAttempts: attempts,
};
this.emit(event);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/engine/index.ts` around lines 1019 - 1022, Remove the redundant console
output that corrupts the TUI: in the escalation path where this.emit(event) is
called (the model escalation logic referencing previousModel, escalatedModel and
attempts), delete the console.log(`[model-escalation] Escalating model:
${previousModel} → ${escalatedModel} after ${attempts} failed attempt(s)`); line
so only the "model:escalated" event is emitted and no direct stdout writes
occur.

Comment on lines +1553 to +1563
const alertThreshold = this.config.cost?.alertThreshold ?? 0;
if (alertThreshold > 0 && snapshot.totalCost >= alertThreshold) {
this.emit({
type: 'cost:threshold-exceeded',
timestamp: endedAt.toISOString(),
snapshot,
threshold: alertThreshold,
});
// Pause the engine so the user can decide whether to continue
this.pause();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Cost threshold fires on every iteration once crossed, creating an infinite pause loop.

snapshot.totalCost is monotonically increasing. Once totalCost >= alertThreshold, every completed iteration will emit cost:threshold-exceeded and call this.pause(). The user resumes, the next iteration completes, the cost is still above the threshold, and the engine pauses again immediately.

A threshold crossing guard is needed — only pause when the threshold is first exceeded:

🐛 Proposed fix — track whether the threshold has already been tripped

Add a private flag to the class:

private costThresholdTripped = false;

Then guard the pause:

          const alertThreshold = this.config.cost?.alertThreshold ?? 0;
-         if (alertThreshold > 0 && snapshot.totalCost >= alertThreshold) {
+         if (!this.costThresholdTripped && alertThreshold > 0 && snapshot.totalCost >= alertThreshold) {
+           this.costThresholdTripped = true;
            this.emit({
              type: 'cost:threshold-exceeded',
              timestamp: endedAt.toISOString(),
              snapshot,
              threshold: alertThreshold,
            });
            // Pause the engine so the user can decide whether to continue
            this.pause();
          }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/engine/index.ts` around lines 1553 - 1563, The cost threshold check
currently fires every iteration because snapshot.totalCost is monotonic; add a
private boolean flag (e.g., costThresholdTripped) to the class and change the
logic around the alertThreshold check in the method that contains
this.config.cost?.alertThreshold, snapshot.totalCost, this.emit and this.pause
so that you only emit `cost:threshold-exceeded` and call this.pause() when
alertThreshold > 0 && !this.costThresholdTripped && snapshot.totalCost >=
alertThreshold, and then set this.costThresholdTripped = true after
emitting/pausing (optionally provide a method to clear/reset the flag if you
ever allow decreasing cost or manual override).

@subsy
Copy link
Owner

subsy commented Feb 23, 2026

@highimpact-dev Thanks for the PR - the feature set here is solid and the modular approach with per-feature tests is well done. I have two concerns before this can merge.

First, the hardcoded model pricing table in cost-tracker.ts is already stale (e.g., Opus listed at $15/$75 vs current $5/$25) and will be a recurring maintenance burden. I'd prefer we either pull cost data from the API response, make it user-configurable, or remove the built-in price list entirely rather than shipping something that will silently give wrong numbers.

Second, a PR of this scope (eight new features, new config options, new CLI flags, and a breaking default change to autoCommit) needs user-facing documentation. Could you add docs for the new configuration options and a migration note for the autoCommit default change?

The CodeRabbit review also flagged a command injection issue in ac-validator.ts that should be addressed before merge.

- cost-tracker: remove hardcoded MODEL_PRICING table (was stale at $15/$75
  for Opus). CostTracker now accepts pricing via constructor from CostConfig.
  Token counts are always tracked; dollar costs require user-configured pricing.
- config/types: add CostConfig.pricing field with TOML example showing current
  prices — users own their pricing data so it stays accurate.
- engine/index: pass config.cost?.pricing to CostTracker constructor.
- ac-validator: reject command assertions containing shell metacharacters
  (; | & > < ` $ etc.) to prevent command injection from crafted AC text.
- tests: update cost-tracker tests to inject pricing via constructor (opus
  price corrected to $5/$25); fix ac-validator test quote mismatch (single
  quotes, matching implementation); add injection-rejection test.
- README: document verification, model escalation, completion strategies,
  cost tracking, and autoCommit config options. Add migration note for
  autoCommit default change (false → true) and new CLI flags.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

♻️ Duplicate comments (3)
src/engine/cost-tracker.ts (1)

65-72: Substring matching in getPricing is still order-dependent for user-supplied keys.

The previous review flagged that Object.keys(...).find(k => model.includes(k)) can match a shorter key before a more specific one (e.g. user configures both "opus" and "claude-opus-4-6""opus" wins if it appears first). The exact-match-first step is good, but the substring fallback should sort candidates by descending key length so longer, more specific keys are preferred.

♻️ Suggested improvement
   private getPricing(model?: string): ModelPricing | null {
     if (!model || Object.keys(this.pricing).length === 0) return null;
     // Exact match first
     if (this.pricing[model]) return this.pricing[model];
-    // Substring match: find a key whose name appears in the model string
-    const key = Object.keys(this.pricing).find(k => model.includes(k));
+    // Substring match: prefer the longest (most specific) key
+    const key = Object.keys(this.pricing)
+      .filter(k => model.includes(k))
+      .sort((a, b) => b.length - a.length)[0];
     return key ? this.pricing[key] : null;
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/engine/cost-tracker.ts` around lines 65 - 72, getPricing currently falls
back to a substring lookup using Object.keys(this.pricing).find(k =>
model.includes(k)), which can choose a shorter user-supplied key before a more
specific one; change the substring fallback so you sort candidate keys by
descending length and then pick the first key whose name is included in model.
Keep the existing exact-match check (this.pricing[model]) intact, and replace
the find call with something like
Object.keys(this.pricing).sort((a,b)=>b.length-a.length).find(... ) so
longer/more specific pricing keys win when resolving pricing in getPricing.
src/engine/index.ts (2)

1553-1564: Cost threshold fires repeatedly once crossed — still unaddressed.

This was flagged in a previous review. snapshot.totalCost is monotonically increasing, so once totalCost >= alertThreshold, every subsequent iteration will re-trigger the pause. The user resumes, the next iteration completes, and the engine pauses again immediately.

A one-shot guard flag is needed:

🐛 Proposed fix — track threshold trip

Add a private field:

private costThresholdTripped = false;

Then guard the check:

           const alertThreshold = this.config.cost?.alertThreshold ?? 0;
-          if (alertThreshold > 0 && snapshot.totalCost >= alertThreshold) {
+          if (!this.costThresholdTripped && alertThreshold > 0 && snapshot.totalCost >= alertThreshold) {
+            this.costThresholdTripped = true;
             this.emit({
               type: 'cost:threshold-exceeded',
               timestamp: endedAt.toISOString(),
               snapshot,
               threshold: alertThreshold,
             });
             // Pause the engine so the user can decide whether to continue
             this.pause();
           }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/engine/index.ts` around lines 1553 - 1564, Add a one-shot guard to avoid
re-triggering the cost threshold pause: introduce a private boolean field (e.g.,
costThresholdTripped = false) on the class and change the check around
snapshot.totalCost/alertThreshold so the emit + this.pause() only run when
alertThreshold > 0, snapshot.totalCost >= alertThreshold, and
costThresholdTripped is false; after emitting and pausing set
costThresholdTripped = true. Use the existing symbols snapshot.totalCost,
alertThreshold, this.emit(...) and this.pause() and set/reset the new
costThresholdTripped flag accordingly (no reset needed if cost is monotonic).

1022-1023: console.log in the escalation path will corrupt TUI output — still unaddressed.

This was flagged in a previous review. The model:escalated event at line 1022 already surfaces this information to listeners; the console.log at line 1023 writes directly to stdout and will corrupt the TUI display in interactive mode.

🐛 Proposed fix — remove the redundant log
           this.emit(event);
-          console.log(`[model-escalation] Escalating model: ${previousModel} → ${escalatedModel} after ${attempts} failed attempt(s)`);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/engine/index.ts` around lines 1022 - 1023, Remove the direct stdout write
that corrupts the TUI by deleting the console.log line that prints model
escalation (the line using previousModel, escalatedModel and attempts)
immediately after this.emit(event); rely on the already-emitted
"model:escalated" event for listeners to surface this info (or, if persistent
logging is required, replace the console.log with a non-stdout logger call such
as processLogger.debug instead of writing to stdout).
🧹 Nitpick comments (2)
src/engine/index.ts (2)

1376-1435: Verification gate logic is well-structured, with one concern about silent completion on retry exhaustion.

When verificationRetries >= verifyConfig.maxRetries (line 1417), the task is marked as complete despite failing verification. The inline comment says "skip the verification gate and mark done", which is a deliberate choice. However, this silently marks a task complete even though its verification never passed — consider emitting a distinct event (e.g. verification:skipped or a warning) so the user and TUI have visibility into this.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/engine/index.ts` around lines 1376 - 1435, The code currently clears
verification state and marks the task done when verificationRetries >=
verifyConfig.maxRetries, but emits no distinct event; modify the branch where
verificationRetries >= verifyConfig.maxRetries (referencing verificationRetries
and verifyConfig.maxRetries) to emit a new event like type:
'verification:skipped' (or 'verification:exhausted') with timestamp, task,
failures (use verifyResult.results.filter(r => !r.passed).map(r => r.command)),
totalRetries: verifyConfig.maxRetries, and a message explaining verification
exhausted before clearing lastVerificationErrors and deleting the retry map so
the TUI and users can see the skipped verification; keep the existing cleanup
behavior afterward.

1376-1385: AC-derived commands are prepended without deduplication.

If a user also manually lists the same command in verification.commands (e.g. "bun test"), it will run twice — once from the AC extraction and once from the config. This is unlikely to cause breakage but wastes time. A simple Set-based dedup would help.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/engine/index.ts` around lines 1376 - 1385, The code prepends AC-derived
commands (getAcVerificationCommands(task.metadata)) to configured
verification.commands (from DEFAULT_VERIFICATION_CONFIG merged into
baseVerifyConfig) without deduplication, causing duplicate commands to run;
modify the verifyConfig construction to merge the two arrays into a deduplicated
list while preserving order (AC commands first, then any remaining configured
commands) — e.g., build a new commands array by iterating acCommands then
baseVerifyConfig.commands and adding each command only if not already seen (use
a Set to track seen commands) and assign that deduped array to
verifyConfig.commands.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@README.md`:
- Around line 229-233: Update the stale Haiku pricing entry: change the
"claude-haiku-4-5" cost.pricing values from inputPer1M = 0.80 and outputPer1M =
4.0 to the current values inputPer1M = 1.0 and outputPer1M = 5.0 in the README
example; also make the identical update for the matching entry in
src/config/types.ts (look for the "claude-haiku-4-5" key and the
inputPer1M/outputPer1M fields) so examples and typed config stay consistent.

In `@src/config/types.ts`:
- Around line 509-520: Update the JSDoc `@example` for the pricing field to use
the current Haiku 4.5 rates: change the "claude-haiku-4-5" example values from
inputPer1M = 0.80, outputPer1M = 4.0 to inputPer1M = 1.0, outputPer1M = 5.0 so
the example in the pricing?: Record<string, { inputPer1M: number; outputPer1M:
number }> block reflects the correct $1/$5 per 1M tokens pricing.

In `@src/engine/ac-validator.ts`:
- Around line 83-84: The 'file-contains' switch branch in ac-validator.ts
returns a grep command that lacks a filename and will hang; update the
'file-contains' case in the switch (in src/engine/ac-validator.ts) to either
remove this unreachable branch entirely or change the returned command to
include an explicit file argument (e.g., accept or derive a filename/path from
the assertion object such as ac.path/ac.target and return something like "grep
-q -- '${shellEscape(ac.assertion)}' ${shellEscape(filePath)} || true"), and if
you can't yet supply a path add a clear TODO comment explaining the needed file
argument and leave a well-formed fallback (or remove the case). Ensure the
unique symbol 'file-contains' is handled consistently with other cases.
- Around line 67-89: The SHELL_METACHAR_RE used in acToVerificationCommands()
doesn't match newline or carriage-return so assertions containing embedded
newlines can bypass the blocklist; update the SHELL_METACHAR_RE to include \n
and \r (e.g. add \n\r into the character class) or explicitly reject assertions
that contain \n or \r before returning them in acToVerificationCommands/command
case so any assertion with embedded newlines/carriage-returns returns '' and
cannot be passed to the shell.

---

Duplicate comments:
In `@src/engine/cost-tracker.ts`:
- Around line 65-72: getPricing currently falls back to a substring lookup using
Object.keys(this.pricing).find(k => model.includes(k)), which can choose a
shorter user-supplied key before a more specific one; change the substring
fallback so you sort candidate keys by descending length and then pick the first
key whose name is included in model. Keep the existing exact-match check
(this.pricing[model]) intact, and replace the find call with something like
Object.keys(this.pricing).sort((a,b)=>b.length-a.length).find(... ) so
longer/more specific pricing keys win when resolving pricing in getPricing.

In `@src/engine/index.ts`:
- Around line 1553-1564: Add a one-shot guard to avoid re-triggering the cost
threshold pause: introduce a private boolean field (e.g., costThresholdTripped =
false) on the class and change the check around
snapshot.totalCost/alertThreshold so the emit + this.pause() only run when
alertThreshold > 0, snapshot.totalCost >= alertThreshold, and
costThresholdTripped is false; after emitting and pausing set
costThresholdTripped = true. Use the existing symbols snapshot.totalCost,
alertThreshold, this.emit(...) and this.pause() and set/reset the new
costThresholdTripped flag accordingly (no reset needed if cost is monotonic).
- Around line 1022-1023: Remove the direct stdout write that corrupts the TUI by
deleting the console.log line that prints model escalation (the line using
previousModel, escalatedModel and attempts) immediately after this.emit(event);
rely on the already-emitted "model:escalated" event for listeners to surface
this info (or, if persistent logging is required, replace the console.log with a
non-stdout logger call such as processLogger.debug instead of writing to
stdout).

---

Nitpick comments:
In `@src/engine/index.ts`:
- Around line 1376-1435: The code currently clears verification state and marks
the task done when verificationRetries >= verifyConfig.maxRetries, but emits no
distinct event; modify the branch where verificationRetries >=
verifyConfig.maxRetries (referencing verificationRetries and
verifyConfig.maxRetries) to emit a new event like type: 'verification:skipped'
(or 'verification:exhausted') with timestamp, task, failures (use
verifyResult.results.filter(r => !r.passed).map(r => r.command)), totalRetries:
verifyConfig.maxRetries, and a message explaining verification exhausted before
clearing lastVerificationErrors and deleting the retry map so the TUI and users
can see the skipped verification; keep the existing cleanup behavior afterward.
- Around line 1376-1385: The code prepends AC-derived commands
(getAcVerificationCommands(task.metadata)) to configured verification.commands
(from DEFAULT_VERIFICATION_CONFIG merged into baseVerifyConfig) without
deduplication, causing duplicate commands to run; modify the verifyConfig
construction to merge the two arrays into a deduplicated list while preserving
order (AC commands first, then any remaining configured commands) — e.g., build
a new commands array by iterating acCommands then baseVerifyConfig.commands and
adding each command only if not already seen (use a Set to track seen commands)
and assign that deduped array to verifyConfig.commands.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 069d253 and 0c3e684.

📒 Files selected for processing (7)
  • README.md
  • src/config/types.ts
  • src/engine/ac-validator.ts
  • src/engine/cost-tracker.ts
  • src/engine/index.ts
  • tests/engine/ac-validator.test.ts
  • tests/engine/cost-tracker.test.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/engine/ac-validator.test.ts
  • tests/engine/cost-tracker.test.ts

Comment on lines +229 to +233
[cost.pricing]
"claude-opus-4-6" = { inputPer1M = 5.0, outputPer1M = 25.0 }
"claude-sonnet-4-6" = { inputPer1M = 3.0, outputPer1M = 15.0 }
"claude-haiku-4-5" = { inputPer1M = 0.80, outputPer1M = 4.0 }
```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Haiku pricing example is stale — same issue as in src/config/types.ts.

Line 232 lists "claude-haiku-4-5" = { inputPer1M = 0.80, outputPer1M = 4.0 } but current Haiku 4.5 pricing is $1 / $5. Since users are likely to copy these examples verbatim, the values should be accurate.

📝 Suggested fix
-"claude-haiku-4-5"   = { inputPer1M = 0.80, outputPer1M = 4.0  }
+"claude-haiku-4-5"   = { inputPer1M = 1.0, outputPer1M = 5.0  }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
[cost.pricing]
"claude-opus-4-6" = { inputPer1M = 5.0, outputPer1M = 25.0 }
"claude-sonnet-4-6" = { inputPer1M = 3.0, outputPer1M = 15.0 }
"claude-haiku-4-5" = { inputPer1M = 0.80, outputPer1M = 4.0 }
```
[cost.pricing]
"claude-opus-4-6" = { inputPer1M = 5.0, outputPer1M = 25.0 }
"claude-sonnet-4-6" = { inputPer1M = 3.0, outputPer1M = 15.0 }
"claude-haiku-4-5" = { inputPer1M = 1.0, outputPer1M = 5.0 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@README.md` around lines 229 - 233, Update the stale Haiku pricing entry:
change the "claude-haiku-4-5" cost.pricing values from inputPer1M = 0.80 and
outputPer1M = 4.0 to the current values inputPer1M = 1.0 and outputPer1M = 5.0
in the README example; also make the identical update for the matching entry in
src/config/types.ts (look for the "claude-haiku-4-5" key and the
inputPer1M/outputPer1M fields) so examples and typed config stay consistent.

Comment on lines +509 to +520
/**
* Model pricing in USD per 1M tokens. No built-in defaults are provided —
* configure this to enable dollar-cost estimation. Token counts are always tracked.
*
* @example
* [cost.pricing]
* "claude-opus-4-6" = { inputPer1M = 5.0, outputPer1M = 25.0 }
* "claude-sonnet-4-6" = { inputPer1M = 3.0, outputPer1M = 15.0 }
* "claude-haiku-4-5" = { inputPer1M = 0.80, outputPer1M = 4.0 }
*/
pricing?: Record<string, { inputPer1M: number; outputPer1M: number }>;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Pricing example in JSDoc shows outdated Haiku rate.

The @example block lists Haiku 4.5 at inputPer1M = 0.80, outputPer1M = 4.0, but current Haiku 4.5 pricing is $1 / $5 per 1M tokens. Since the whole rationale for removing built-in pricing was to avoid shipping stale numbers, the documentation example should be accurate to avoid copy-paste errors.

📝 Suggested fix
-   * "claude-haiku-4-5" = { inputPer1M = 0.80, outputPer1M = 4.0 }
+   * "claude-haiku-4-5" = { inputPer1M = 1.0, outputPer1M = 5.0 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/config/types.ts` around lines 509 - 520, Update the JSDoc `@example` for
the pricing field to use the current Haiku 4.5 rates: change the
"claude-haiku-4-5" example values from inputPer1M = 0.80, outputPer1M = 4.0 to
inputPer1M = 1.0, outputPer1M = 5.0 so the example in the pricing?:
Record<string, { inputPer1M: number; outputPer1M: number }> block reflects the
correct $1/$5 per 1M tokens pricing.

Comment on lines +67 to +89
const SHELL_METACHAR_RE = /[;&|><`$\\{}()!]/;

/**
* Convert executable AC into verification commands.
*/
export function acToVerificationCommands(acs: ExecutableAC[]): string[] {
return acs.map(ac => {
switch (ac.type) {
case 'command':
// Reject commands containing shell metacharacters to prevent injection.
// Acceptance criteria come from user-authored task files but may be
// processed by AI agents, so we enforce a conservative allowlist.
if (SHELL_METACHAR_RE.test(ac.assertion)) return '';
return ac.assertion;
case 'file-exists':
return `test -e '${shellEscape(ac.assertion)}'`;
case 'file-contains':
return `grep -q '${shellEscape(ac.assertion)}' || true`; // soft check
default:
return '';
}
}).filter(Boolean);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Command injection via embedded newlines bypasses the metacharacter blocklist.

SHELL_METACHAR_RE does not match \n or \r. An acceptance criterion like Running `bun test\nrm -rf /` passes would extract the command, pass the metacharacter check, and execute both bun test and rm -rf / when the string is passed to a shell. Since acceptance criteria may be authored or modified by AI agents (as the comment on line 77–78 acknowledges), this is exploitable.

Add \n and \r to the blocklist:

🐛 Proposed fix
-const SHELL_METACHAR_RE = /[;&|><`$\\{}()!]/;
+const SHELL_METACHAR_RE = /[;&|><`$\\{}()!\n\r]/;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/engine/ac-validator.ts` around lines 67 - 89, The SHELL_METACHAR_RE used
in acToVerificationCommands() doesn't match newline or carriage-return so
assertions containing embedded newlines can bypass the blocklist; update the
SHELL_METACHAR_RE to include \n and \r (e.g. add \n\r into the character class)
or explicitly reject assertions that contain \n or \r before returning them in
acToVerificationCommands/command case so any assertion with embedded
newlines/carriage-returns returns '' and cannot be passed to the shell.

Comment on lines +83 to +84
case 'file-contains':
return `grep -q '${shellEscape(ac.assertion)}' || true`; // soft check
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

file-contains branch produces a broken grep command (no filename argument).

grep -q '${pattern}' || true reads from stdin, which will hang or return immediately with no match. This branch is currently unreachable (no parse rule produces 'file-contains'), but if it's ever wired up it will silently malfunction. Either remove it or fix the command to include a file path.

♻️ Suggested fix — remove unreachable branch or add a TODO
       case 'file-contains':
-        return `grep -q '${shellEscape(ac.assertion)}' || true`; // soft check
+        // TODO: file-contains not yet produced by parseExecutableCriteria
+        return '';
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
case 'file-contains':
return `grep -q '${shellEscape(ac.assertion)}' || true`; // soft check
case 'file-contains':
// TODO: file-contains not yet produced by parseExecutableCriteria
return '';
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/engine/ac-validator.ts` around lines 83 - 84, The 'file-contains' switch
branch in ac-validator.ts returns a grep command that lacks a filename and will
hang; update the 'file-contains' case in the switch (in
src/engine/ac-validator.ts) to either remove this unreachable branch entirely or
change the returned command to include an explicit file argument (e.g., accept
or derive a filename/path from the assertion object such as ac.path/ac.target
and return something like "grep -q -- '${shellEscape(ac.assertion)}'
${shellEscape(filePath)} || true"), and if you can't yet supply a path add a
clear TODO comment explaining the needed file argument and leave a well-formed
fallback (or remove the case). Ensure the unique symbol 'file-contains' is
handled consistently with other cases.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants