Skip to content

fix: add heartbeat to prevent liveness timeout killing skill sessions#1937

Merged
atoomic merged 1 commit into
Anantys-oss:mainfrom
Koan-Bot:koan.atoomic/fix-recreate-liveness-timeout
Jun 15, 2026
Merged

fix: add heartbeat to prevent liveness timeout killing skill sessions#1937
atoomic merged 1 commit into
Anantys-oss:mainfrom
Koan-Bot:koan.atoomic/fix-recreate-liveness-timeout

Conversation

@Koan-Bot

@Koan-Bot Koan-Bot commented Jun 14, 2026

Copy link
Copy Markdown
Collaborator

What

Add a periodic heartbeat mechanism to run_claude_step that keeps the parent process's LivenessWatchdog alive during print-mode CLI sessions.

Why

/recreate (and other skills using run_claude_step) run Claude CLI in print mode, which produces NO stdout during tool use. The outer first_output_timeout (600s) kills the skill subprocess after 10 minutes of silence — even though Claude is actively working (reading files, writing code).

This caused /recreate to fail silently on complex PRs (e.g. PR #1088: dedicated chat process feature) that need >10 minutes of tool work before any text output.

How

  • Added _HeartbeatTimer class that periodically prints [still working...] to the skill subprocess's stdout
  • The heartbeat goes to the OUTER process (run.py) via the skill subprocess's stdout pipe, NOT to the inner CLI's captured output (separate pipe)
  • Interval set to first_output_timeout / 2 (default 300s) — enough margin to prevent false kills
  • Timer is a daemon thread that auto-stops when the inner CLI process exits

Testing

  • 2 new tests: heartbeat emission during silent CLI + timer cancellation on completion
  • All existing TestRunClaude and TestRunClaudeStep tests pass
  • Full test suite passes (0 failures)

Quality Report

Changes: 2 files changed, 134 insertions(+)

Code scan: 1 issue(s) found

  • koan/app/claude_step.py:279 — debug print statement

Tests: passed (403
tests)

Branch hygiene: clean

Generated by Kōan

Print-mode CLI sessions (used by /recreate, /rebase, etc.) produce no
stdout during tool use. The parent process's LivenessWatchdog fires
after first_output_timeout (600s) and kills the skill subprocess,
causing silent failures on complex PRs that need >10 min of tool work.

Add a HeartbeatTimer that periodically prints a marker to the skill
subprocess's stdout, keeping the outer watchdog alive. The heartbeat
interval is set to half of first_output_timeout (default 300s).
@Koan-Bot Koan-Bot requested a review from Koanex June 14, 2026 13:34
@atoomic atoomic self-assigned this Jun 14, 2026
@Koan-Bot

Copy link
Copy Markdown
Collaborator Author

PR Review — fix: add heartbeat to prevent liveness timeout killing skill sessions

Solid fix for a real production timeout. Merge-ready with minor suggestions.

The heartbeat design correctly separates the inner CLI pipe (proc.stdout, captured by stream_with_timeout) from the outer subprocess stdout (sys.stdout, read by the parent's LivenessWatchdog). Threading is clean: Event.wait(interval) for interruptible sleep, daemon thread for automatic cleanup, cancel() in a finally block, and proc.poll() guard to stop when the CLI exits. The API change is backward-compatible (heartbeat_interval=None default). The interval calculation (max(60, FOT // 2)) gives healthy margin before the watchdog fires.

  • test_heartbeat_cancelled_after_completion doesn't verify cancellation — name overpromises
  • Heartbeat still fires when FOT=0 (watchdog disabled) — harmless but wasteful
  • The code scan's "debug print at line 279" is a false positive — print("[still working...]") IS the feature

🟡 Important

1. Test name promises cancellation check but doesn't verify it (`koan/tests/test_claude_step.py`, L648-659)

Named test_heartbeat_cancelled_after_completion but only asserts result["success"] is True. Doesn't verify the timer was actually cancelled or that the thread stopped.

Options to strengthen:

  • Assert the heartbeat thread is no longer alive after run_claude returns (patch _start_heartbeat to capture the timer, then check timer._thread.is_alive()).
  • Or patch _HeartbeatTimer.cancel and assert it was called once.
  • Or use capsys to verify no [still working...] was emitted (since the CLI finishes instantly before the 1s interval).

As-is, the test only proves run_claude doesn't crash with heartbeat_interval set, which is valuable but doesn't match what the name claims.

def test_heartbeat_cancelled_after_completion(self, mock_popen):
    proc = _fake_proc(["ok\n"], returncode=0)
    mock_popen.return_value = (proc, lambda: None)
    result = run_claude(
        ["claude", "-p", "test"], "/project",
        heartbeat_interval=1,
    )
    assert result["success"] is True

🟢 Suggestions

1. Unnecessary heartbeat when LivenessWatchdog is disabled (`koan/app/claude_step.py`, L607-621)

When get_first_output_timeout() returns 0, the parent's LivenessWatchdog is not created (checked in run.py:2143: if first_output_timeout > 0). The heartbeat would print [still working...] to stdout for nothing.

Consider yielding None instead of 120 when FOT is disabled:

_heartbeat = max(60, _fot // 2) if _fot > 0 else None

run_claude already guards on heartbeat_interval and heartbeat_interval > 0, so None skips the heartbeat entirely.

_fot = get_first_output_timeout()
_heartbeat = max(60, _fot // 2) if _fot > 0 else 120

Checklist

  • No hardcoded secrets
  • No command injection
  • Thread safety
  • Resource cleanup in error paths
  • Backward compatibility
  • Test coverage for new behavior
  • Test assertions match test names — warning #1
  • No mutable default arguments

To rebase specific severity levels, mention me: @Koan-Bot rebase critical (fixes 🔴 only), @Koan-Bot rebase important (fixes 🔴 + 🟡), or just @Koan-Bot rebase for all.


Silent Failure Analysis

🟡 **MEDIUM** — swallowed exception (`koan/app/claude_step.py:279-281`)

Risk: The heartbeat thread silently swallows OSError/ValueError and stops emitting, which means the very watchdog timeout it was designed to prevent could now fire and kill a legitimate long-running process — with no log trace explaining why heartbeats stopped.

try:
    print("[still working...]", flush=True)
except (OSError, ValueError):
    break

Fix: Log a single debug-level message before breaking (e.g. via run_log.log_safe) so operators can distinguish 'heartbeat stopped due to pipe error' from 'heartbeat never started'.


Automated review by Kōan (Claude · model claude-opus-4-6) HEAD=637aa33 5 min 47s

@atoomic atoomic marked this pull request as ready for review June 15, 2026 02:30
@atoomic atoomic merged commit 1e9fc88 into Anantys-oss:main Jun 15, 2026
6 checks passed
@Koan-Bot Koan-Bot deleted the koan.atoomic/fix-recreate-liveness-timeout branch June 16, 2026 00:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants