Dev#2491
Conversation
fix: report standalone brain install in serve status
Adds agent_error observe health flag for agent/executor error lifecycle rows and regression coverage.
Co-authored-by: Genie Automagik <genie@namastex.ai>
Daily metrics update for 2026-05-24. Stats: 47 commits, 0 releases this week, +1319/-339 LoC (7d). VELOCITY.md and charts refreshed. https://claude.ai/code/session_01993o1C5JAog4AzSD2WtFu3
…rdcoding 8432 Migration 002 hardcoded CANONICAL_PORT=8432 and killed any postmaster on a different port. After the autopg-v3 socket-singleton cutover the canonical backbone binds 5432, so this was (a) inert on healthy 5432 hosts and, worse, (b) in a mixed window with a stray legacy postmaster on 8432 alongside canonical 5432, it treated 8432 as canonical and stopped the REAL 5432 backbone. Now discovers the canonical port from the autopg/pgserve binary (`<bin> status --json`). When no canonical binary is installed or it can't report a port, the migration is a safe no-op — we never stop a postmaster we can't positively distinguish from canonical. Extracted selectLegacyEmbedded() as a pure, unit-tested helper. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Reuse extractPgservePortFromStatus (moved to src/lib/pgserve-status.ts, re-exported from update.ts) so the migration tolerates nested instance.port / runtime.port status shapes — reading only top-level .port would null-resolve and make the migration a permanent no-op on those hosts (Codex P1). - Guard the parsed port with > 0 so Number(null|false|'')===0 is rejected (Gemini). - selectLegacyEmbedded now returns ALL non-canonical postmasters (filter, not find); apply stops each and validate waits for all to exit — otherwise a second stray survives and validate fails after 5s (Gemini). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…al-port-discovery fix(migrations): 002 discovers canonical pgserve port instead of hardcoding 8432
- releases_24h: 2 (v4.260525.1, v4.260525.2) - merged_prs_7d: 30 - avg_merge_time: 1.3h - ship_rate: 100% - daily_stats_count: 61 https://claude.ai/code/session_01A9HUwDC1aLZ45z8N5ViygD
Claude Code fails when asked to --resume a session whose JSONL file is missing (e.g. after a cleanup or on a fresh machine). By rewriting the flag to --session-id we keep the same identifier but force a fresh session, which always succeeds.
…cripts" This reverts commit 771e1c3.
buildOmniSpawnParams now always emits sessionId (never resume), so buildLaunchCommand produces --session-id <id>. Unlike --resume, this flag attaches to an existing JSONL transcript when present but gracefully starts a fresh session with the same id when the transcript is missing (e.g. after cleanup or on a fresh machine) — preventing hard failures on respawn. Fix is applied at the source (where the command is built), not in the transport layer (tmux-launch-script), which is now provider-agnostic. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When respawning a per-chat agent with a prior Claude session id, emit --resume (not --session-id) so Claude reattaches to the existing JSONL. Add a 3s liveness check after launch: if --resume silently fails (JSONL missing), fall back to a fresh --session-id so the inbound message is not lost. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
fix(omni): use --resume for respawn with JSONL-missing fallback
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Run ID: ⛔ Files ignored due to path filters (5)
📒 Files selected for processing (26)
📝 WalkthroughWalkthroughThis PR bundles release version metadata updates (4.260606.2), metrics data snapshots, and functional enhancements: shared utilities for pgserve status parsing and tmux script generation; hardened test database setup with fail-closed behavior; multi-target legacy postgres cleanup via runtime port discovery; agent terminal error health signaling; and standalone brain status with dependency injection. ChangesRelease artifacts, infrastructure libraries, and robustness hardening
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces a script-based path for complex tmux spawns to prevent command corruption, improves test database isolation by failing closed on setup failures, refactors migration 002 to dynamically discover the canonical pgserve port, and adds an agent_error health flag to agent observability. Feedback on these changes suggests optimizing the 3-second delay during session resumption by polling the process status, using args instead of comm in isPaneProcessRunning to ensure correct process detection on macOS for interpreted runtimes, and adding a timeout to the health check fetch in printStandaloneBrainStatus to prevent potential hangs.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| await new Promise((r) => setTimeout(r, 3000)); | ||
| const processName = resolveOmniPaneProcessName(entry.provider); | ||
| const resumed = await isPaneProcessRunning(paneId, processName); |
There was a problem hiding this comment.
Using an unconditional 3000ms delay on every resume introduces a significant latency penalty on the happy path (which is the most common case).
Instead of waiting for the full 3 seconds unconditionally, we can poll isPaneProcessRunning at short intervals (e.g., every 500ms) and return early as soon as the process is detected as running. This keeps the resume flow extremely fast on success while still maintaining the 3-second fallback safety window on failure.
const processName = resolveOmniPaneProcessName(entry.provider);
let resumed = false;
for (let i = 0; i < 6; i++) {
resumed = await isPaneProcessRunning(paneId, processName);
if (resumed) break;
await new Promise((r) => setTimeout(r, 500));
}| // Check direct children and grandchildren for the target process name | ||
| const output = exec( | ||
| `pgrep -la -P ${panePid} 2>/dev/null; for cpid in $(pgrep -P ${panePid} 2>/dev/null); do pgrep -la -P "$cpid" 2>/dev/null; done; true`, | ||
| `pgrep -la -P ${panePid} 2>/dev/null; for cpid in $(pgrep -P ${panePid} 2>/dev/null); do pgrep -la -P "$cpid" 2>/dev/null; ps -p "$cpid" -o comm= 2>/dev/null; done; true`, |
There was a problem hiding this comment.
On macOS, pgrep -la is not supported (as -a is invalid), so the command relies on ps -p "$cpid" -o comm= to get the process name. However, if the process is running under an interpreter (such as node or bun), comm will only return the interpreter name (e.g., node or bun), causing the processName check (e.g., claude or genie) to fail.
Using args (or command) instead of comm will return the full command line including arguments, ensuring correct detection of interpreted processes on macOS.
| `pgrep -la -P ${panePid} 2>/dev/null; for cpid in $(pgrep -P ${panePid} 2>/dev/null); do pgrep -la -P "$cpid" 2>/dev/null; ps -p "$cpid" -o comm= 2>/dev/null; done; true`, | |
| `pgrep -la -P ${panePid} 2>/dev/null; for cpid in $(pgrep -P ${panePid} 2>/dev/null); do pgrep -la -P "$cpid" 2>/dev/null; ps -p "$cpid" -o args= 2>/dev/null; done; true`, |
| } | ||
| try { | ||
| const fetchHealth = deps.fetchImpl ?? fetch; | ||
| const resp = await fetchHealth(`http://127.0.0.1:${config.port}/healthz`); |
There was a problem hiding this comment.
If the brain server is unresponsive or deadlocked but the port remains open, a fetch request without a timeout can hang indefinitely. This would cause the genie serve status command to block forever.
Adding a short timeout (e.g., 3 seconds) using AbortSignal.timeout prevents potential hangs and ensures the status command remains responsive.
const resp = await fetchHealth(`http://127.0.0.1:${config.port}/healthz`, {
signal: AbortSignal.timeout(3000),
});References
- It is acceptable to use hardcoded numeric limits (magic numbers) in non-critical fallback logic, especially when they serve as intentional caps to prevent performance issues like excessive I/O.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 99b2bf40a0
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| .join(' '); | ||
| const cmd = envPrefix ? `${envPrefix} ${launch.command}` : launch.command; | ||
| const scriptPath = writeTmuxLaunchScript(`omni-${chatId}`, cmd); | ||
| await executeTmux(`send-keys -t '${paneId}' "source ${scriptPath}" Enter`); |
There was a problem hiding this comment.
Quote the sourced launch script path
On hosts where the home directory contains spaces or shell metacharacters, this sends the pane a command like source /Users/Jane Doe/.genie/..., so the pane shell splits the path and fails to source the script, preventing omni spawn/resume from launching. The native tmux path quotes the generated script path before execution; this source path needs the same protection when building the command sent to tmux.
Useful? React with 👍 / 👎.
Summary by CodeRabbit
New Features
Bug Fixes
Tests
Chores