feat: dedicated chat process to prevent API contention during missions#1088

Draft
sukria-koan0 wants to merge 8 commits into main from koan/implement-1084
@sukria-koan0

Collaborator

Summary

Decouples Telegram chat handling from mission execution by introducing a dedicated chat process. When a mission runs, the Claude API is no longer hammered by three concurrent callers (mission + chat + outbox formatting) — chat now has its own independent process and retry strategy.

Closes #1084

Changes

  • Phase 1: Outbox formatting skips Claude during active missions (uses fallback_format() to free one API caller)
  • Phase 2: New chat_process.py watches chat-inbox.jsonl for chat requests; chat_context.py extracted as shared module; awake.py routes to chat process with fallback to worker thread
  • Phase 3: Chat process integrated into pid_manager.py (start_chat(), start_all(), stop_processes(), format_status_all()); make chat / make logs updated
  • Phase 4: Exponential backoff retry (3 attempts, 2s/5s/10s) for empty responses (the main symptom of API contention); mission-awareness via .koan-status
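The Phase 4 retry policy can be sketched roughly as follows. The delays (2s/5s/10s) come from the description above; everything else is an assumption for illustration: `call_with_backoff`, the `ask` callable, and the injectable `sleep` are hypothetical names, and the sketch reads "3 attempts" as one initial call plus up to three retries. The actual chat_process.py may be structured differently.

```python
import time
from typing import Callable, Sequence

# Delays in seconds, taken from the PR description (2s/5s/10s).
RETRY_DELAYS: Sequence[float] = (2.0, 5.0, 10.0)


def call_with_backoff(
    ask: Callable[[int], str],
    delays: Sequence[float] = RETRY_DELAYS,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call ask(attempt) until it returns non-empty text or retries run out.

    `ask` is a hypothetical stand-in for the Claude CLI invocation; it
    receives the attempt index so a caller could switch to a lighter
    context on retries.
    """
    response = ask(0).strip()
    for attempt, delay in enumerate(delays, start=1):
        if response:
            break  # got a usable answer, stop retrying
        sleep(delay)  # wait before the next attempt
        response = ask(attempt).strip()
    return response
```

Passing `sleep` as a parameter keeps the helper testable without real waiting; an empty string on return means every attempt came back empty.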

Test plan

  • All 11,111 existing tests pass
  • New unit tests for outbox mission-aware formatting (5 tests)
  • New unit tests for chat inbox/outbox JSONL protocol (6 tests)
  • New unit tests for chat routing logic (3 tests)
  • New unit tests for retry constants and mission awareness (5 tests)
  • Manual: send chat messages while a mission runs — verify reliable responses

Generated by Kōan /implement

@atoomic

atoomic commented Apr 1, 2026

Collaborator

@Koan-Bot review

@Koan-Bot

Koan-Bot commented Apr 1, 2026

Collaborator

PR Review — feat: dedicated chat process to prevent API contention during missions

Well-structured architectural change that cleanly decouples chat from mission execution. The extraction of chat_context.py is solid, and the file-based inbox protocol is a reasonable IPC mechanism for this use case. However, there's one blocking issue: soul/summary/project_path are loaded once at startup and never refreshed, so the chat process serves stale personality context until restarted. The duplicated _is_mission_active() and overly aggressive queue-depth-1 busy message should also be addressed before merge. Test coverage for the new code is good, and the existing test patch updates are correct.


🔴 Blocking

1. Soul and summary loaded once at startup, never refreshed (koan/app/chat_process.py, L290-295)
Soul, summary, and project_path are loaded once in main() and passed into every process_chat_request() call forever. If the user edits soul.md or summary.md (or adds a project), the chat process serves stale context until manually restarted.

The run loop and awake.py read these on every iteration/call. The chat process should do the same, or at minimum reload periodically (e.g., check mtime and reload on change).

Suggested fix: move _load_soul(), _load_summary(), and _resolve_project_path() calls inside the for entry in entries loop, or add a lightweight mtime-based cache that reloads every N seconds.

    soul = _load_soul()
    summary = _load_summary()
    project_path = _resolve_project_path()
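One way to picture the mtime-based option mentioned above is a small cache that re-reads a file only when its modification time changes. This is a minimal sketch, not the PR's code: `MtimeCache` is a hypothetical helper, and it assumes the soul and summary are plain text files that can be read with `Path.read_text()`.

```python
from pathlib import Path
from typing import Dict, Tuple


class MtimeCache:
    """Cache file contents and reload only when the file's mtime changes."""

    def __init__(self) -> None:
        # path -> (mtime in nanoseconds, cached text)
        self._cache: Dict[Path, Tuple[int, str]] = {}

    def read(self, path: Path) -> str:
        try:
            mtime = path.stat().st_mtime_ns
        except FileNotFoundError:
            # A missing file yields empty context and clears any stale entry.
            self._cache.pop(path, None)
            return ""
        cached = self._cache.get(path)
        if cached is not None and cached[0] == mtime:
            return cached[1]  # unchanged since last read
        text = path.read_text(encoding="utf-8")
        self._cache[path] = (mtime, text)
        return text
```

Calling `cache.read(soul_path)` inside the per-entry loop would then cost one `stat()` per message when nothing changed. Reloading unconditionally on every message is simpler still and may be cheap enough.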

🟡 Important

1. Invalid JSON lines accumulate in inbox forever (koan/app/chat_process.py, L88-105)
read_and_clear_inbox() only truncates the file when entries is non-empty. If the inbox contains only malformed JSON lines (e.g., from a partial write crash), entries remains empty, the file is never truncated, and these bad lines persist across every poll cycle, triggering a JSONDecodeError on each poll (about twice per second) indefinitely.

Fix: truncate the file unconditionally after reading (whether or not valid entries were parsed), or at least truncate when lines were read but none parsed.

                if entries:
                    f.seek(0)
                    f.truncate()
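A read-and-clear routine with unconditional truncation might look like the sketch below. It assumes a Unix host (it uses `fcntl.flock`, as the PR's write path does) and takes the inbox path as a parameter for illustration; the real read_and_clear_inbox() signature and entry schema are not shown in this thread.

```python
import fcntl
import json
from pathlib import Path
from typing import Any, List


def read_and_clear_inbox(inbox: Path) -> List[Any]:
    """Return parsed JSONL entries and empty the file, even if lines are malformed."""
    entries: List[Any] = []
    if not inbox.exists():
        return entries
    with open(inbox, "r+", encoding="utf-8") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # block writers while we read and clear
        try:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                try:
                    entries.append(json.loads(line))
                except json.JSONDecodeError:
                    # Bad line (e.g. partial write): skip it; the truncate
                    # below discards it so it is not re-read on the next poll.
                    continue
            # Truncate regardless of whether any entry parsed.
            f.seek(0)
            f.truncate()
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
    return entries
```

The trade-off is that a malformed line is lost rather than retried, which seems acceptable for a line that could never parse anyway.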

2. Busy-message for a single pending request is too aggressive (koan/app/awake.py, L449-456)
_route_to_chat_process rejects new messages when any request is already pending in the inbox. This means the chat process effectively has a queue depth of 1, a regression from the worker thread model, which would at least queue new messages while the previous chat was in progress.

Consider allowing a small queue (e.g., 3-5 messages) before sending the "Busy" message. Alternatively, just queue unconditionally and let the FIFO processing handle backpressure naturally — the user will see responses arrive in order.

    if has_pending_requests():
        send_telegram("⏳ Busy with a previous message. Try again in a moment.")
        return True
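The bounded-queue alternative could be sketched as below. All names here (`MAX_QUEUE_DEPTH`, `pending_count`, `enqueue_or_reject`) and the `{"text": ...}` entry shape are hypothetical, and file locking is left out for brevity, so this only illustrates the threshold logic. The PR itself later dropped the check and queues unconditionally.

```python
import json
from pathlib import Path

MAX_QUEUE_DEPTH = 3  # hypothetical threshold; the review suggests 3-5


def pending_count(inbox: Path) -> int:
    """Count non-blank lines currently waiting in the inbox."""
    if not inbox.exists():
        return 0
    with open(inbox, encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())


def enqueue_or_reject(inbox: Path, text: str, max_depth: int = MAX_QUEUE_DEPTH) -> bool:
    """Append the message and return True, or return False when the queue is full."""
    if pending_count(inbox) >= max_depth:
        return False  # caller would send the "Busy" reply here
    with open(inbox, "a", encoding="utf-8") as f:
        f.write(json.dumps({"text": text}) + "\n")
    return True
```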

3. Duplicated _is_mission_active() logic (koan/app/chat_process.py, L148)
_is_mission_active() is copy-pasted identically in both chat_process.py (line 148) and outbox_manager.py (line 259). This violates DRY and will drift over time (e.g., if new status strings are added). Extract to a shared module like app.signals which already owns STATUS_FILE.

def _is_mission_active() -> bool:

4. Fragile KOAN_ROOT derivation via parent traversal (koan/app/outbox_manager.py, L266)
self._instance_dir.parent / STATUS_FILE assumes that instance_dir's parent is KOAN_ROOT. This is true by convention (instance/ lives at KOAN_ROOT/instance/), but it's an implicit coupling. If someone passes a non-standard instance_dir, _is_mission_active silently checks the wrong path.

Consider accepting koan_root as a constructor parameter (or deriving it from the env var) for explicitness.

status_file = self._instance_dir.parent / STATUS_FILE

5. _resolve_project_path always picks the first project (koan/app/chat_process.py, L78-83)
For multi-project setups, _resolve_project_path() unconditionally returns the first project's path. Chat messages aren't project-scoped, so the Claude CLI will always run in that project's directory. This means tool access (Read/Glob/Grep) is limited to one project's tree.

This isn't necessarily wrong (the old awake.py had the same behavior), but it's worth documenting as a known limitation, or considering passing the current project context from the user's message.

    projects = get_known_projects()
    if projects:
        return projects[0][1]

🟢 Suggestions

1. _get_last_message_id catches SystemExit (koan/app/chat_process.py, L246-253)
except (SystemExit, Exception) catches SystemExit which prevents the process from exiting cleanly if sys.exit() is called inside get_messaging_provider(). This is likely unintentional — SystemExit should propagate. Use except Exception alone.

    try:
        from app.messaging import get_messaging_provider
        provider = get_messaging_provider()
        ids = provider.get_last_message_ids()
        return ids[-1] if ids else 0
    except (SystemExit, Exception):
        return 0
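The difference between the two except clauses can be shown in isolation. `SystemExit` derives from `BaseException`, not `Exception`, so a bare `except Exception` lets it propagate. In this sketch, `risky_provider` is a made-up stand-in for a factory that calls `sys.exit()`; whether get_messaging_provider() actually does so is not confirmed here.

```python
import sys


def risky_provider() -> None:
    """Hypothetical stand-in for a factory that exits on fatal misconfiguration."""
    sys.exit(2)


def swallow_exit() -> int:
    """Mirrors the flagged pattern: SystemExit is caught and hidden."""
    try:
        risky_provider()
    except (SystemExit, Exception):
        return 0
    return 1


def propagate_exit() -> int:
    """Suggested pattern: ordinary errors are handled, SystemExit escapes."""
    try:
        risky_provider()
    except Exception:
        return 0
    return 1
```

`swallow_exit()` returns 0 as if nothing happened, while `propagate_exit()` raises `SystemExit` so the interpreter can shut down normally.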

2. Retry loop imports on every attempt (koan/app/chat_process.py, L193-200)
All the imports inside process_chat_request() (build_chat_prompt, clean_chat_response, run_cli, build_full_command, etc.) execute on every call. While Python caches modules, this is ~10 import lookups per chat message. Consider moving them to module-level or to the top of the function before the retry loop.

    import subprocess
    from app.chat_context import build_chat_prompt, clean_chat_response
    from app.cli_exec import run_cli
    from app.cli_provider import build_full_command

3. TOCTOU in _route_to_chat_process (koan/app/awake.py, L432-440)
_is_chat_process_running() checks the PID file, then has_pending_requests() checks the inbox, then write_to_inbox() writes. The chat process could die between the PID check and the write, leaving a message in the inbox that nobody processes until the chat process restarts. This is low-risk (message will be processed on next startup), but worth documenting in a comment.

    if not _is_chat_process_running():
        return False

    from app.chat_process import write_to_inbox, has_pending_requests
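For context, a PID-file liveness check of the kind discussed above is often written with signal 0, which tests for the process's existence without delivering a signal. This is a generic Unix sketch under assumed names (`is_pid_alive`, a PID file holding one integer), not the PR's _is_chat_process_running(). The result is only a snapshot: the process can still exit right after the check, which is the race the review describes.

```python
import os
from pathlib import Path


def is_pid_alive(pid_file: Path) -> bool:
    """Best-effort check that the PID recorded in pid_file is a live process."""
    try:
        pid = int(pid_file.read_text(encoding="utf-8").strip())
    except (OSError, ValueError):
        return False  # missing, unreadable, or malformed PID file
    if pid <= 0:
        return False  # 0 and negative values address process groups, not one PID
    try:
        os.kill(pid, 0)  # signal 0: existence/permission check only
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True  # process exists but belongs to another user
    return True
```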


Automated review by Kōan

@Koan-Bot

Koan-Bot commented Apr 1, 2026

Collaborator

@atoomic: @Koan-Bot review

I've completed the review. The architecture is sound — decoupling chat into its own process with a JSONL inbox is a clean approach. The main blocking issue is that chat_process.py loads soul, summary, and project_path once at startup (lines 290-295) and never refreshes them. If the user updates soul.md or the memory summary evolves, the chat process serves stale context until manually restarted. The existing awake.py and run.py reload these per-call, so this is a behavioral regression.

Other notable findings: _is_mission_active() is duplicated in chat_process.py and outbox_manager.py (should be extracted to app.signals), the queue-depth-1 busy message in _route_to_chat_process is overly aggressive (rejecting messages when just one is pending), and _get_last_message_id() catches SystemExit which could prevent clean shutdown. See inline comments for details.

@atoomic

atoomic commented Apr 2, 2026

Collaborator

@sukria-koan0 rebase

@atoomic

atoomic commented Apr 11, 2026

Collaborator

@sukria-koan0 rr

@sukria-koan0

sukria-koan0 commented Apr 12, 2026

Collaborator Author

PR Review — feat: dedicated chat process to prevent API contention during missions

Well-structured decomposition that cleanly separates chat from mission execution. The chat_context.py extraction is a good refactor, the JSONL inbox protocol is appropriate for this IPC pattern, and test coverage for the new code is solid. However, there's one blocking issue: soul/summary/project_path are loaded once at startup and never refreshed, causing stale personality context until process restart. The duplicated _is_mission_active() and the overly restrictive queue-depth-1 busy check should also be addressed. The malformed-JSON accumulation bug in the inbox is a latent issue that will cause log spam after a crash. Fix the blocking issue and address the warnings before merging.


🔴 Blocking

1. Soul, summary, and project_path loaded once, never refreshed (`koan/app/chat_process.py`, L290-295)

soul, summary, and project_path are loaded once in main() and reused for the entire lifetime of the process. If the user edits soul.md, the memory summary changes, or a project is added, the chat process serves stale context indefinitely.

awake.py and run.py reload these per-call. The chat process should do the same. The simplest fix: move _load_soul(), _load_summary(), and _resolve_project_path() inside the for-entry loop in main(), or at least inside process_chat_request() itself. These are cheap file reads — no reason to cache them across the process lifetime.

soul = _load_soul()
summary = _load_summary()
project_path = _resolve_project_path()

🟡 Important

1. Malformed JSON lines in inbox are never cleared (`koan/app/chat_process.py`, L88-105)

read_and_clear_inbox() only truncates the file when entries is non-empty. If the inbox contains only malformed JSON (e.g., from a partial write during a crash), the file is never truncated and these bad lines are logged as JSONDecodeError every 0.5s forever.

Fix: track whether any lines were read (not just valid entries) and truncate whenever any were read, even if none parsed. For example:

lines_read = False
for line in f:
    lines_read = True
    ...
if lines_read:
    f.seek(0)
    f.truncate()
    f.flush()

The current code truncates only when valid entries were parsed:

if entries:
    f.seek(0)
    f.truncate()
    f.flush()

2. Queue depth of 1 is a regression from worker thread model (`koan/app/awake.py`, L449-456)

_route_to_chat_process() rejects new messages when any request is already pending in the inbox. This is more restrictive than the previous worker-thread model, which would queue messages naturally. A fast typist sending two messages in succession will get a 'Busy' response on the second one, even though the FIFO model would handle it fine.

Consider either removing the has_pending_requests() check entirely (the chat process handles FIFO naturally) or raising the threshold to allow 3-5 queued messages before sending the busy response.

if has_pending_requests():
    send_telegram("⏳ Busy with a previous message. Try again in a moment.")
    return True

3. Duplicated _is_mission_active() implementation (`koan/app/chat_process.py`, L140-155)

The same _is_mission_active() logic appears in both chat_process.py and outbox_manager.py (identical implementation: read .koan-status, check for 'executing mission' or 'skill dispatch'). This should be a shared utility — if the status file format changes or new status strings are added, both copies must be updated.

Suggested: extract to a shared module (e.g., app/signals.py alongside STATUS_FILE, or a new function in app/utils.py) and import it from both places.

def _is_mission_active() -> bool:
    status = status_file.read_text().strip().lower()
    return "executing mission" in status or "skill dispatch" in status
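A shared helper along the lines suggested above might look like this. The two status strings are the ones quoted in this review; the `.koan-status` filename matches the PR description, but the exact value of STATUS_FILE in app.signals, the `ACTIVE_MARKERS` name, and the `koan_root` parameter are assumptions made for the sketch.

```python
from pathlib import Path

STATUS_FILE = ".koan-status"  # assumed value; app.signals owns the real constant
ACTIVE_MARKERS = ("executing mission", "skill dispatch")


def is_mission_active(koan_root: Path) -> bool:
    """Return True when the status file reports a running mission or skill."""
    try:
        status = (koan_root / STATUS_FILE).read_text(encoding="utf-8").strip().lower()
    except OSError:
        # Missing or unreadable status file is treated as "no mission".
        return False
    return any(marker in status for marker in ACTIVE_MARKERS)
```

Taking the root directory explicitly also sidesteps deriving it from `instance_dir.parent`, which the earlier review flagged as an implicit coupling. Both chat_process.py and outbox_manager.py could then import the one function.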

🟢 Suggestions

1. process_chat_request takes stale soul/summary as params (`koan/app/chat_process.py`, L159-162)

The function signature takes soul, summary, and project_path as parameters, which encodes the assumption that these are loaded once externally. If you fix the blocking issue by reloading per-call, these parameters become unnecessary — the function could load them itself, simplifying the API and making it impossible to pass stale values.

Alternatively, if you want to keep the parameter-based approach for testability, document that callers must provide fresh values.

def process_chat_request(text: str, soul: str, summary: str, project_path: str) -> None:

2. Prompt guard result not acted upon in chat process (`koan/app/chat_process.py`, L184-186)

The prompt guard scan runs and logs a warning if blocked, but the chat request proceeds regardless. In awake.py's handle_chat(), the same guard result also just logs and continues, so this is consistent — but it's worth confirming that warn-only is the intended behavior for chat. If chat tools are truly read-only, this is fine, but it should be documented in the guard config or a comment.

if guard_result.blocked:
    _log(f"WARNING chat guard: {guard_result.reason} | {text[:100]}")

3. write_to_inbox uses append mode without atomic write (`koan/app/chat_process.py`, L116-124)

write_to_inbox() opens the file in append mode with flock, which is fine for single-line JSON entries on local filesystems. However, the rest of the codebase uses atomic_write() (temp file + rename) for shared files. Here it's less critical since JSONL is append-only and the lock prevents concurrent writes, but if a write is interrupted mid-line (e.g., SIGKILL during f.write), the next line will be corrupted. This is mitigated by the malformed-JSON handling in read_and_clear_inbox() — but only if that handler also truncates (see the related warning).

with open(CHAT_INBOX, "a") as f:
    fcntl.flock(f, fcntl.LOCK_EX)
    try:
        f.write(entry + "\n")
        f.flush()
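For reference, a complete locked append in the style of the excerpt above could be written as follows. This is a sketch under assumptions: a Unix host (`fcntl`), a hypothetical `{"text": ...}` entry shape, and the inbox path passed as a parameter rather than read from CHAT_INBOX. It does not make the write atomic; an interrupted write can still leave a partial last line, which is why the reader should discard unparseable lines.

```python
import fcntl
import json
import os
from pathlib import Path


def write_to_inbox(inbox: Path, text: str) -> None:
    """Append one JSON line to the inbox under an exclusive advisory lock."""
    line = json.dumps({"text": text}, ensure_ascii=False) + "\n"
    with open(inbox, "a", encoding="utf-8") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # serialize with the reader and other writers
        try:
            f.write(line)
            f.flush()
            os.fsync(f.fileno())  # push the line to disk before releasing the lock
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
```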


Automated review by Kōan

fa1f051
791b9b1
0913c7b
504c9de
0add031
2cb0ff8

@sukria-koan0

Collaborator Author

@atoomic: @sukria-koan0 rr

@atoomic The PR has been rebased (per the earlier request) but the review findings from the initial bot review still stand. The blocking issue — soul/summary/project_path loaded once at startup in chat_process.py:main() (lines 290-295) and never refreshed — has not been addressed yet. The _is_mission_active() duplication and queue-depth-1 busy message are also still present.

sukria-koan0 and others added 7 commits April 12, 2026 11:44
When a mission is actively running, outbox messages now use
fallback_format() instead of invoking Claude CLI for formatting.
This eliminates one concurrent API caller during missions, reducing
contention that causes chat timeouts.

Refs #1084

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Create chat_context.py: shared prompt building extracted from awake.py
- Create chat_process.py: standalone process with JSONL inbox/outbox
  protocol, PID file, and graceful SIGTERM handling
- Update awake.py to route chat through dedicated process when running,
  with automatic fallback to worker thread when process is not available
- Update test patches to match new module structure

Refs #1084

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Add "chat" to PROCESS_NAMES in pid_manager.py
- Add start_chat() function for launching chat_process.py
- Include chat in start_all(), stop_processes(), format_status_all()
- Add `make chat` target to Makefile
- Update `make logs` to tail chat.log

Refs #1084

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Chat process retries up to 3 times with 2s/5s/10s backoff on empty
  responses (the main symptom of API contention during missions)
- First attempt uses full context; retries use lite context + shorter timeout
- Detect active missions via .koan-status for logging context
- Extract _retry_chat_lite helper in awake.py fallback path
- Empty responses in awake.py fallback now trigger lite retry instead of
  immediately showing "I didn't get a response"

Refs #1084

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Fix prompt guard tests to patch functions at their new locations
  (app.conversation_history, app.config) after chat_context extraction
- Add diagnostic output to bare except blocks in chat_process.py
  to satisfy silent exception enforcement

Refs #1084

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Update CLAUDE.md: 3-process architecture, new modules, commands
- Add chat-inbox.jsonl to instance directory listing

Refs #1084

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…cation, remove queue-depth-1 rejection

d` are picked up without restarting the chat process. Per reviewer's blocking feedback.

- **Truncate inbox unconditionally after reading** (`chat_process.py:read_and_clear_inbox()`): Removed the `if entries:` guard around `f.seek(0); f.truncate()`. Previously, malformed JSON lines would persist forever since `entries` would be empty and the file was never truncated. Per reviewer's important feedback.

- **Remove queue-depth-1 busy rejection** (`awake.py:_route_to_chat_process()`): Removed the `has_pending_requests()` check that rejected new messages when any request was already pending. Messages are now always queued to the inbox and processed FIFO by the chat process. Per reviewer's feedback that depth-1 rejection was a regression from the worker thread model.

- **Updated test for queue behavior** (`test_chat_process.py`): Renamed `test_busy_when_pending_requests` to `test_queues_when_pending_requests` and updated assertions to verify both messages are queued (no busy message sent).

@sukria-koan0

Collaborator Author

Rebase: feat: dedicated chat process to prevent API contention during missions

Branch koan/implement-1084 rebased onto main and force-pushed.

Diff: 11 files changed, 1000 insertions(+), 278 deletions(-)

Review feedback was analyzed and applied.

Changes

d` are picked up without restarting the chat process. Per reviewer's blocking feedback.

  • Truncate inbox unconditionally after reading (chat_process.py:read_and_clear_inbox()): Removed the if entries: guard around f.seek(0); f.truncate(). Previously, malformed JSON lines would persist forever since entries would be empty and the file was never truncated. Per reviewer's important feedback.

  • Remove queue-depth-1 busy rejection (awake.py:_route_to_chat_process()): Removed the has_pending_requests() check that rejected new messages when any request was already pending. Messages are now always queued to the inbox and processed FIFO by the chat process. Per reviewer's feedback that depth-1 rejection was a regression from the worker thread model.

  • Updated test for queue behavior (test_chat_process.py): Renamed test_busy_when_pending_requests to test_queues_when_pending_requests and updated assertions to verify both messages are queued (no busy message sent).

Actions

  • Already-solved check: negative (confidence=high, reasoning=None of the recent commits on main introduce a dedicated chat process, chat-inbox.jsonl protocol, or)
  • Resolved merge conflicts (2 round(s))
  • Rebased koan/implement-1084 onto origin/main
  • Applied review feedback
  • Pre-push CI check: previous run passed
  • Force-pushed koan/implement-1084 to origin
  • CI check enqueued in ## CI (async)

CI

CI will be checked asynchronously.


Automated by Kōan

sukria force-pushed the koan/implement-1084 branch from 2cb0ff8 to 37ba88e on April 12, 2026 09:46

@sukria-koan0

Collaborator Author

Rebase: feat: dedicated chat process to prevent API contention during missions

Branch koan/implement-1084 rebased onto main and force-pushed.

Diff: 11 files changed, 1038 insertions(+), 302 deletions(-)

Review feedback was analyzed and applied.

Actions

  • Already-solved check: negative (confidence=high, reasoning=None of the recent commits on main introduce a dedicated chat process, chat-inbox.jsonl protocol, or)
  • Rebased koan/implement-1084 onto origin/main
  • Pre-push CI check: previous run #24303825169 failed
  • Applied pre-push CI fix
  • Pre-push CI fix applied
  • Force-pushed koan/implement-1084 to origin
  • CI check enqueued in ## CI (async)

CI

CI will be checked asynchronously.


Automated by Kōan

Development

Successfully merging this pull request may close these issues.

Telegram chat handler should live in its own process, able to invoke Claude even while a mission runs

4 participants