fix: add heartbeat to prevent liveness timeout killing skill sessions #1937
Print-mode CLI sessions (used by /recreate, /rebase, etc.) produce no stdout during tool use. The parent process's LivenessWatchdog fires after first_output_timeout (600s) and kills the skill subprocess, causing silent failures on complex PRs that need >10 min of tool work. Add a HeartbeatTimer that periodically prints a marker to the skill subprocess's stdout, keeping the outer watchdog alive. The heartbeat interval is set to half of first_output_timeout (default 300s).
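A minimal sketch of the heartbeat idea described above, not the code from this PR: a daemon thread that writes a marker line to a stream at a fixed interval until it is stopped. The class name `HeartbeatSketch`, its constructor parameters, and the context-manager interface are assumptions made for illustration; only the `[still working...]` marker and the OSError/ValueError handling come from the PR text.

```python
import sys
import threading


class HeartbeatSketch:
    """Hypothetical stand-in for the PR's heartbeat timer (illustration only)."""

    def __init__(self, interval, stream=None, marker="[still working...]"):
        self.interval = interval  # seconds between heartbeats
        self.stream = stream if stream is not None else sys.stdout
        self.marker = marker
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # Event.wait() returns False on timeout and True once stop() sets the flag,
        # so the loop emits one marker per interval and exits promptly on stop().
        while not self._stop.wait(self.interval):
            try:
                self.stream.write(self.marker + "\n")
                self.stream.flush()
            except (OSError, ValueError):
                # Stream closed or pipe broken: nothing more can be written.
                break

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        if self._thread.is_alive():
            self._thread.join()

    def __enter__(self):
        self.start()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.stop()
```

Wrapping the long-running call in `with HeartbeatSketch(300):` would emit a marker every 300 seconds, half of the 600s `first_output_timeout` quoted above, and stop the thread when the block exits.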
PR Review: fix: add heartbeat to prevent liveness timeout killing skill sessions
Solid fix for a real production timeout. Merge-ready with minor suggestions. The heartbeat design correctly separates the inner CLI pipe (
🟡 Important
1. Test name promises cancellation check but doesn't verify it (`koan/tests/test_claude_step.py`, L648-659)
Named Options to strengthen:
As-is, the test only proves
🟢 Suggestions
1. Unnecessary heartbeat when LivenessWatchdog is disabled (`koan/app/claude_step.py`, L607-621)
When Consider yielding `_heartbeat = max(60, _fot // 2) if _fot > 0 else None`
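The expression in the suggestion above can be read as a small helper. This is a sketch under assumptions: the function name `heartbeat_interval` and the `floor` parameter are hypothetical, and a non-positive timeout is taken to mean that the watchdog is disabled, as the `_fot > 0` guard implies.

```python
from typing import Optional


def heartbeat_interval(first_output_timeout: int, floor: int = 60) -> Optional[int]:
    """Return seconds between heartbeats, or None when no heartbeat is needed."""
    if first_output_timeout <= 0:
        # Assumed meaning: the watchdog is disabled, so nothing needs keeping alive.
        return None
    # Half the timeout, but never more often than once per `floor` seconds.
    return max(floor, first_output_timeout // 2)
```

With the default 600s timeout this gives 300s. One property worth noticing is that for timeouts shorter than the floor, the returned interval is longer than the timeout itself, so the heartbeat would arrive too late to help.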
Checklist
To rebase specific severity levels, mention me:
Silent Failure Analysis
🟡 **MEDIUM**: swallowed exception (`koan/app/claude_step.py:279-281`)
Risk: The heartbeat thread silently swallows OSError/ValueError and stops emitting, which means the very watchdog timeout it was designed to prevent could now fire and kill a legitimate long-running process, with no log trace explaining why heartbeats stopped.
Fix: Log a single debug-level message before breaking (e.g. via run_log.log_safe) so operators can distinguish 'heartbeat stopped due to pipe error' from 'heartbeat never started'.
Automated review by Kōan (Claude · model claude-opus-4-6)
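The fix proposed in the silent-failure analysis could look roughly like the sketch below. The signature of `run_log.log_safe` is not shown in this PR, so the standard-library `logging` module stands in for it here; the helper name `emit_heartbeat` and the logger name are likewise hypothetical.

```python
import logging

logger = logging.getLogger("heartbeat_sketch")  # stand-in for the project's run_log


def emit_heartbeat(stream, marker="[still working...]"):
    """Write one heartbeat line; return False after logging if the stream is unusable."""
    try:
        stream.write(marker + "\n")
        stream.flush()
        return True
    except (OSError, ValueError) as exc:
        # One debug message so operators can tell "stopped on pipe error"
        # apart from "never started".
        logger.debug("heartbeat stopped: %s: %s", type(exc).__name__, exc)
        return False
```

A heartbeat loop would call `emit_heartbeat` once per interval and break on the first `False`, which keeps the log to a single message per failure.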
What
Add a periodic heartbeat mechanism to `run_claude_step` that keeps the parent process's LivenessWatchdog alive during print-mode CLI sessions.
Why
`/recreate` (and other skills using `run_claude_step`) run Claude CLI in print mode, which produces NO stdout during tool use. The outer `first_output_timeout` (600s) kills the skill subprocess after 10 minutes of silence, even though Claude is actively working (reading files, writing code).
This caused `/recreate` to fail silently on complex PRs (e.g. PR #1088: dedicated chat process feature) that need >10 minutes of tool work before any text output.
How
- `_HeartbeatTimer` class that periodically prints `[still working...]` to the skill subprocess's stdout
- `first_output_timeout / 2` (default 300s): enough margin to prevent false kills
Testing
- `TestRunClaude` and `TestRunClaudeStep` tests pass
Quality Report
Changes: 2 files changed, 134 insertions(+)
Code scan: 1 issue(s) found
`koan/app/claude_step.py:279`: debug print statement
Tests: passed (403 tests)
Branch hygiene: clean
Generated by Kōan