|
| 1 | +# Comprehensive Architecture Review |
| 2 | + |
| 3 | +Date: 2026-02-22 |
| 4 | +Scope: Repository-wide architecture review (boundaries, runtime flow, resilience, security, operability, and testability) |
| 5 | + |
| 6 | +## Findings (Ordered by Severity) |
| 7 | + |
| 8 | +### High |
| 9 | + |
| 10 | +1. Single long-lived token reused across API and WebSocket auth |
| 11 | +- Impact: Token leakage grants broad control until process restart. |
| 12 | +- Evidence: `src/bin/beamcode.ts:322`, `src/http/server.ts:13` |
| 13 | +- Detail: A single `consumerToken` is injected into HTML and reused for `/api/*` plus WebSocket `?token` auth. |
| 14 | +- Recommendation: Split token scopes (API vs WS), add rotation/revocation, and enforce short lifetime. |
| 15 | + |
| 16 | +2. Process-local session state limits resilience and horizontal scaling |
| 17 | +- Impact: Restart/failover can drop in-flight session metadata and pending control state. |
| 18 | +- Evidence: `src/core/session/session-repository.ts:120`, `src/core/session-coordinator.ts:108` |
| 19 | +- Detail: Session state is held in-process (`Map`) with local snapshoting, with no cluster-safe coordination. |
| 20 | +- Recommendation: Introduce shared session backing (e.g., Redis/DB) or explicitly document single-node constraints. |
| 21 | + |
| 22 | +### Medium |
| 23 | + |
| 24 | +1. HTTP rename path bypasses domain/policy command flow |
| 25 | +- Impact: Audit/RBAC/policy hooks can miss state mutations. |
| 26 | +- Evidence: `src/http/api-sessions.ts:158` |
| 27 | +- Detail: Rename mutates registry and broadcasts directly rather than going through coordinator/domain pipeline. |
| 28 | +- Recommendation: Route through coordinator/bridge commands so all mutations emit uniform domain events. |
| 29 | + |
| 30 | +2. Lifecycle transition invariants are not enforced |
| 31 | +- Impact: Runtime can enter invalid states, causing misleading lifecycle-driven behavior. |
| 32 | +- Evidence: `src/core/session/session-runtime.ts:336`, `src/core/session/session-lifecycle.ts:17` |
| 33 | +- Detail: Invalid transitions are logged but still applied. |
| 34 | +- Recommendation: Reject invalid transitions (no mutation) and optionally emit explicit error events. |
| 35 | + |
| 36 | +3. Queued message state is not durably persisted |
| 37 | +- Impact: Queued work can disappear after restart/crash. |
| 38 | +- Evidence: `src/core/session/message-queue-handler.ts:80`, `src/core/session/session-runtime.ts:150` |
| 39 | +- Detail: Queue slot updates stay in memory without corresponding persistence. |
| 40 | +- Recommendation: Persist queue-slot state on change and restore it during startup. |
| 41 | + |
| 42 | +4. Root redirect can point to stale/deleted session |
| 43 | +- Impact: `/` may redirect users to dead session IDs. |
| 44 | +- Evidence: `src/bin/beamcode.ts:393`, `src/http/server.ts:60` |
| 45 | +- Detail: `activeSessionId` is set during startup and not consistently updated during session churn. |
| 46 | +- Recommendation: Sync active session ID on create/delete/close events from coordinator. |
| 47 | + |
| 48 | +5. Entrypoint behavior has limited default test coverage |
| 49 | +- Impact: CLI flag/shutdown/bootstrap regressions can ship unnoticed. |
| 50 | +- Evidence: `vitest.config.ts:5`, `src/bin/beamcode.ts:57` |
| 51 | +- Detail: Default test configuration excludes `src/bin/**`. |
| 52 | +- Recommendation: Add targeted unit tests for arg parsing, lifecycle wiring, and shutdown ordering. |
| 53 | + |
| 54 | +6. Default observability is weak without optional flags |
| 55 | +- Impact: Degraded visibility into session health and failure trends. |
| 56 | +- Evidence: `src/bin/beamcode.ts:300`, `src/adapters/console-metrics-collector.ts:60`, `src/http/health.ts:4` |
| 57 | +- Detail: Console metrics are mostly debug-level and minimal by default. |
| 58 | +- Recommendation: Promote critical signals to default output and expose key counters in health/metrics. |
| 59 | + |
| 60 | +### Low |
| 61 | + |
| 62 | +1. Tunnel restart/failure signals are not strongly surfaced |
| 63 | +- Impact: Public connectivity degradation may go unnoticed. |
| 64 | +- Evidence: `src/relay/cloudflared-manager.ts:183` |
| 65 | +- Detail: Restart logic exists, but health/metrics surfacing is limited. |
| 66 | +- Recommendation: Emit explicit tunnel health metrics/events and include them in health checks. |
| 67 | + |
| 68 | +2. Message tracing summary is global, not session-scoped |
| 69 | +- Impact: Harder to diagnose per-session issues in multi-session runs. |
| 70 | +- Evidence: `src/core/messaging/message-tracer.ts:353` |
| 71 | +- Detail: Summary aggregates process-wide sets rather than per-session views. |
| 72 | +- Recommendation: Track summary by session ID and surface session-tagged diagnostics. |
| 73 | + |
| 74 | +3. Consumer-plane composition is tightly coupled to runtime internals |
| 75 | +- Impact: Higher refactor cost and weaker transport/domain separation. |
| 76 | +- Evidence: `src/core/session-bridge/compose-consumer-plane.ts:56` |
| 77 | +- Detail: Consumer composition reaches deeply into runtime state and helpers. |
| 78 | +- Recommendation: Narrow interfaces so transport interacts via explicit domain/service contracts. |
| 79 | + |
| 80 | +## Strengths |
| 81 | + |
| 82 | +- The bounded-context architecture is clear and mostly reflected in implementation (`docs/architecture.md:51`). |
| 83 | +- `SessionCoordinator` and `SessionBridge` centralize lifecycle flow and context composition (`src/core/session-coordinator.ts:108`, `src/core/session-bridge.ts:24`). |
| 84 | +- Runtime ownership is concentrated in session runtime/repository abstractions, which provides a solid foundation for future hardening (`src/core/session/session-runtime.ts:1`, `src/core/session/session-repository.ts:1`). |
| 85 | + |
| 86 | +## Open Questions |
| 87 | + |
| 88 | +1. Is single-node operation an intentional product constraint, or should active-active/multi-instance support be a target? |
| 89 | +2. Is shared API/WS token behavior acceptable only for local trust mode, or intended for tunneled/remote usage? |
| 90 | +3. Should queued message durability be a guaranteed contract across restarts? |
| 91 | + |
| 92 | +## Recommended Next Steps |
| 93 | + |
| 94 | +1. Implement token scope separation and rotation/revocation. |
| 95 | +2. Enforce lifecycle transition validity in runtime state machine. |
| 96 | +3. Persist queued-message state and restore on startup. |
| 97 | +4. Sync root redirect target with live session lifecycle events. |
| 98 | +5. Expand test coverage for CLI entrypoint and flag interactions. |
| 99 | +6. Improve default observability and health surfacing for critical failures. |
| 100 | + |
| 101 | +## Remediation Plan: Process-Local Session State |
| 102 | + |
| 103 | +### Phase 1 (implemented in this branch) |
| 104 | + |
| 105 | +1. Explicitly codify single-node constraints in runtime and docs. |
| 106 | +2. Expose deployment topology in `/health` so operators and automation can detect unsupported horizontal scaling. |
| 107 | +3. Emit startup warning to reduce accidental multi-instance assumptions. |
| 108 | + |
| 109 | +Delivered changes: |
| 110 | +- `/health` now includes: |
| 111 | + - `deployment.topology = "single-node"` |
| 112 | + - `deployment.session_state_scope = "process-local"` |
| 113 | + - `deployment.horizontal_scaling = "unsupported"` |
| 114 | +- Startup now logs an explicit process-local session-state warning. |
| 115 | +- CLI "already running" guidance now clarifies that separate `--data-dir` instances are isolated, not clustered. |
| 116 | +- Documentation updated in `README.md` and `docs/architecture.md`. |
| 117 | + |
| 118 | +### Phase 2 (in progress) |
| 119 | + |
| 120 | +1. Introduce a shared coordination contract for live session ownership/leases. |
| 121 | +2. Add pluggable distributed backend (Redis/DB) behind the contract. |
| 122 | +3. Gate session mutation on lease ownership to prevent split-brain writes. |
| 123 | + |
| 124 | +Delivered in this branch: |
| 125 | +- Added a `SessionLeaseCoordinator` contract with in-memory default implementation. |
| 126 | +- Added lease ownership checks at central runtime mutation ingress (`RuntimeApi`) and in mutating `SessionRuntime` methods. |
| 127 | +- Added lifecycle lease semantics: `getOrCreateSession` now acquires/validates lease ownership; `removeSession`/`closeSession` release leases. |
| 128 | +- Routed consumer/backend mutation paths through lease-aware runtime APIs where possible. |
| 129 | + |
| 130 | +### Phase 3 (next) |
| 131 | + |
| 132 | +1. Add multi-instance integration tests (failover, reconnect, queue durability). |
| 133 | +2. Add metrics for lease contention/failover and propagate into health/readiness checks. |
0 commit comments