Commit 2809376

Merge pull request #121 from teng-lin/fix/process-local-session-state-scaling
Harden session-state topology signaling and add lease coordination seam
2 parents 49917c5 + 452937b commit 2809376

22 files changed: +685 -19 lines changed

README.md

Lines changed: 13 additions & 0 deletions
@@ -164,6 +164,19 @@ Then open `http://localhost:5174`.
 - **Rate limiting**: Token bucket per consumer (configurable)
 - **Circuit breaker**: Sliding window prevents CLI restart cascades

+## Deployment Topology
+
+BeamCode currently runs as a **single-node runtime** for session coordination:
+
+- Live session state is **process-local** (in-memory runtime objects)
+- Persistent storage supports restart recovery, but **does not provide distributed coordination**
+- Running multiple BeamCode instances (especially with different `--data-dir`) creates isolated session islands, not a shared cluster
+
+The `/health` endpoint exposes this explicitly under `deployment`:
+- `topology: "single-node"`
+- `session_state_scope: "process-local"`
+- `horizontal_scaling: "unsupported"`
+
 See [SECURITY.md](./SECURITY.md) for the full threat model and cryptographic details.

 ## Documentation
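
Reviewer sketch (not part of the commit): the README hunk above says `/health` reports the topology under `deployment`. Automation could read those three fields before deciding to start a second replica. The helper names `readDeployment` and `scalingUnsupported` are hypothetical, and treating a missing `deployment` block as "unsupported" is an assumption made here. Only the three field names and their values come from the README text.

```typescript
// Hypothetical client-side check; only the `deployment` field names are taken from the README.
interface DeploymentInfo {
  topology: string;
  session_state_scope: string;
  horizontal_scaling: string;
}

// Extracts the `deployment` block from an already-parsed /health body, or undefined if malformed.
function readDeployment(health: unknown): DeploymentInfo | undefined {
  if (typeof health !== "object" || health === null) return undefined;
  const deployment = (health as Record<string, unknown>).deployment;
  if (typeof deployment !== "object" || deployment === null) return undefined;
  const { topology, session_state_scope, horizontal_scaling } = deployment as Record<string, unknown>;
  if (
    typeof topology !== "string" ||
    typeof session_state_scope !== "string" ||
    typeof horizontal_scaling !== "string"
  ) {
    return undefined;
  }
  return { topology, session_state_scope, horizontal_scaling };
}

// Assumption: a missing or malformed block is treated as "unsupported" to stay on the safe side.
function scalingUnsupported(health: unknown): boolean {
  const info = readDeployment(health);
  return info === undefined || info.horizontal_scaling === "unsupported";
}

const sample = {
  deployment: {
    topology: "single-node",
    session_state_scope: "process-local",
    horizontal_scaling: "unsupported",
  },
};
console.log(scalingUnsupported(sample)); // true
```

A deploy script could call `scalingUnsupported` on the parsed response body and refuse to raise the replica count when it returns `true`.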

docs/architecture.md

Lines changed: 2 additions & 0 deletions
@@ -619,6 +619,8 @@ Policy services follow the **observe and advise** pattern: they subscribe to dom

 The SessionRepository owns the in-memory session map (`Map<string, Session>`), creates live `Session` objects, provides session/query helpers, and delegates persistence operations to `SessionStorage`.

+> **Topology constraint:** live session coordination is process-local. Persistence enables restart recovery, but does not make multi-instance BeamCode nodes share runtime state. Current topology is single-node. A lease-coordination seam (`SessionLeaseCoordinator`) now exists for future distributed ownership.
+
 **Responsibilities:**
 - **Own live sessions:** `getOrCreate()`, `get()`, `has()`, `keys()`, and `remove()` over live `Session` objects
 - **Expose query snapshots:** `getSnapshot()` and `getAllStates()` for read models
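
Reviewer sketch (not part of the commit): the topology constraint above follows directly from keeping sessions in a per-process `Map`. The class below is a simplified stand-in, not BeamCode's real `SessionRepository`. The `Session` shape and the `createdAt` field are invented for illustration, and only the method names `getOrCreate`, `has`, and `remove` echo the responsibilities listed above.

```typescript
// Simplified stand-in for illustration; the real Session type is not shown in this diff.
interface Session {
  id: string;
  createdAt: number;
}

class InMemorySessionRepository {
  // Each instance owns its own map, so state never leaves the process that created it.
  private readonly sessions = new Map<string, Session>();

  getOrCreate(id: string): Session {
    let session = this.sessions.get(id);
    if (!session) {
      session = { id, createdAt: Date.now() };
      this.sessions.set(id, session);
    }
    return session;
  }

  has(id: string): boolean {
    return this.sessions.has(id);
  }

  remove(id: string): boolean {
    return this.sessions.delete(id);
  }
}

// Two repositories model two BeamCode processes: a session created on one is invisible to the other.
const nodeA = new InMemorySessionRepository();
const nodeB = new InMemorySessionRepository();
nodeA.getOrCreate("s1");
console.log(nodeA.has("s1"), nodeB.has("s1")); // true false
```

This is the "isolated session islands" behaviour the README describes: persistence can rebuild one node's map after a restart, but nothing synchronises the maps of two running nodes.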
Lines changed: 133 additions & 0 deletions
@@ -0,0 +1,133 @@
+# Comprehensive Architecture Review
+
+Date: 2026-02-22
+Scope: Repository-wide architecture review (boundaries, runtime flow, resilience, security, operability, and testability)
+
+## Findings (Ordered by Severity)
+
+### High
+
+1. Single long-lived token reused across API and WebSocket auth
+   - Impact: Token leakage grants broad control until process restart.
+   - Evidence: `src/bin/beamcode.ts:322`, `src/http/server.ts:13`
+   - Detail: A single `consumerToken` is injected into HTML and reused for `/api/*` plus WebSocket `?token` auth.
+   - Recommendation: Split token scopes (API vs WS), add rotation/revocation, and enforce short lifetime.
+
+2. Process-local session state limits resilience and horizontal scaling
+   - Impact: Restart/failover can drop in-flight session metadata and pending control state.
+   - Evidence: `src/core/session/session-repository.ts:120`, `src/core/session-coordinator.ts:108`
+   - Detail: Session state is held in-process (`Map`) with local snapshotting, with no cluster-safe coordination.
+   - Recommendation: Introduce shared session backing (e.g., Redis/DB) or explicitly document single-node constraints.
+
+### Medium
+
+1. HTTP rename path bypasses domain/policy command flow
+   - Impact: Audit/RBAC/policy hooks can miss state mutations.
+   - Evidence: `src/http/api-sessions.ts:158`
+   - Detail: Rename mutates registry and broadcasts directly rather than going through coordinator/domain pipeline.
+   - Recommendation: Route through coordinator/bridge commands so all mutations emit uniform domain events.
+
+2. Lifecycle transition invariants are not enforced
+   - Impact: Runtime can enter invalid states, causing misleading lifecycle-driven behavior.
+   - Evidence: `src/core/session/session-runtime.ts:336`, `src/core/session/session-lifecycle.ts:17`
+   - Detail: Invalid transitions are logged but still applied.
+   - Recommendation: Reject invalid transitions (no mutation) and optionally emit explicit error events.
+
+3. Queued message state is not durably persisted
+   - Impact: Queued work can disappear after restart/crash.
+   - Evidence: `src/core/session/message-queue-handler.ts:80`, `src/core/session/session-runtime.ts:150`
+   - Detail: Queue slot updates stay in memory without corresponding persistence.
+   - Recommendation: Persist queue-slot state on change and restore it during startup.
+
+4. Root redirect can point to stale/deleted session
+   - Impact: `/` may redirect users to dead session IDs.
+   - Evidence: `src/bin/beamcode.ts:393`, `src/http/server.ts:60`
+   - Detail: `activeSessionId` is set during startup and not consistently updated during session churn.
+   - Recommendation: Sync active session ID on create/delete/close events from coordinator.
+
+5. Entrypoint behavior has limited default test coverage
+   - Impact: CLI flag/shutdown/bootstrap regressions can ship unnoticed.
+   - Evidence: `vitest.config.ts:5`, `src/bin/beamcode.ts:57`
+   - Detail: Default test configuration excludes `src/bin/**`.
+   - Recommendation: Add targeted unit tests for arg parsing, lifecycle wiring, and shutdown ordering.
+
+6. Default observability is weak without optional flags
+   - Impact: Degraded visibility into session health and failure trends.
+   - Evidence: `src/bin/beamcode.ts:300`, `src/adapters/console-metrics-collector.ts:60`, `src/http/health.ts:4`
+   - Detail: Console metrics are mostly debug-level and minimal by default.
+   - Recommendation: Promote critical signals to default output and expose key counters in health/metrics.
+
+### Low
+
+1. Tunnel restart/failure signals are not strongly surfaced
+   - Impact: Public connectivity degradation may go unnoticed.
+   - Evidence: `src/relay/cloudflared-manager.ts:183`
+   - Detail: Restart logic exists, but health/metrics surfacing is limited.
+   - Recommendation: Emit explicit tunnel health metrics/events and include them in health checks.
+
+2. Message tracing summary is global, not session-scoped
+   - Impact: Harder to diagnose per-session issues in multi-session runs.
+   - Evidence: `src/core/messaging/message-tracer.ts:353`
+   - Detail: Summary aggregates process-wide sets rather than per-session views.
+   - Recommendation: Track summary by session ID and surface session-tagged diagnostics.
+
+3. Consumer-plane composition is tightly coupled to runtime internals
+   - Impact: Higher refactor cost and weaker transport/domain separation.
+   - Evidence: `src/core/session-bridge/compose-consumer-plane.ts:56`
+   - Detail: Consumer composition reaches deeply into runtime state and helpers.
+   - Recommendation: Narrow interfaces so transport interacts via explicit domain/service contracts.
+
+## Strengths
+
+- The bounded-context architecture is clear and mostly reflected in implementation (`docs/architecture.md:51`).
+- `SessionCoordinator` and `SessionBridge` centralize lifecycle flow and context composition (`src/core/session-coordinator.ts:108`, `src/core/session-bridge.ts:24`).
+- Runtime ownership is concentrated in session runtime/repository abstractions, which provides a solid foundation for future hardening (`src/core/session/session-runtime.ts:1`, `src/core/session/session-repository.ts:1`).
+
+## Open Questions
+
+1. Is single-node operation an intentional product constraint, or should active-active/multi-instance support be a target?
+2. Is shared API/WS token behavior acceptable only for local trust mode, or intended for tunneled/remote usage?
+3. Should queued message durability be a guaranteed contract across restarts?
+
+## Recommended Next Steps
+
+1. Implement token scope separation and rotation/revocation.
+2. Enforce lifecycle transition validity in runtime state machine.
+3. Persist queued-message state and restore on startup.
+4. Sync root redirect target with live session lifecycle events.
+5. Expand test coverage for CLI entrypoint and flag interactions.
+6. Improve default observability and health surfacing for critical failures.
+
+## Remediation Plan: Process-Local Session State
+
+### Phase 1 (implemented in this branch)
+
+1. Explicitly codify single-node constraints in runtime and docs.
+2. Expose deployment topology in `/health` so operators and automation can detect unsupported horizontal scaling.
+3. Emit startup warning to reduce accidental multi-instance assumptions.
+
+Delivered changes:
+- `/health` now includes:
+  - `deployment.topology = "single-node"`
+  - `deployment.session_state_scope = "process-local"`
+  - `deployment.horizontal_scaling = "unsupported"`
+- Startup now logs an explicit process-local session-state warning.
+- CLI "already running" guidance now clarifies that separate `--data-dir` instances are isolated, not clustered.
+- Documentation updated in `README.md` and `docs/architecture.md`.
+
+### Phase 2 (in progress)
+
+1. Introduce a shared coordination contract for live session ownership/leases.
+2. Add pluggable distributed backend (Redis/DB) behind the contract.
+3. Gate session mutation on lease ownership to prevent split-brain writes.
+
+Delivered in this branch:
+- Added a `SessionLeaseCoordinator` contract with in-memory default implementation.
+- Added lease ownership checks at central runtime mutation ingress (`RuntimeApi`) and in mutating `SessionRuntime` methods.
+- Added lifecycle lease semantics: `getOrCreateSession` now acquires/validates lease ownership; `removeSession`/`closeSession` release leases.
+- Routed consumer/backend mutation paths through lease-aware runtime APIs where possible.
+
+### Phase 3 (next)
+
+1. Add multi-instance integration tests (failover, reconnect, queue durability).
+2. Add metrics for lease contention/failover and propagate into health/readiness checks.
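
Reviewer sketch (not part of the commit): Medium finding 2 in the review above recommends rejecting invalid lifecycle transitions instead of logging and applying them. One way to express "reject with no mutation" is a pure transition function over an allow-list. The state names and the `ALLOWED` table below are hypothetical; the real states live in `src/core/session/session-lifecycle.ts`, which this diff does not show.

```typescript
// Hypothetical lifecycle states; the real set is not visible in this diff.
type LifecycleState = "starting" | "active" | "idle" | "closed";

// Allow-list of legal next states for each current state (assumed for illustration).
const ALLOWED: Record<LifecycleState, readonly LifecycleState[]> = {
  starting: ["active", "closed"],
  active: ["idle", "closed"],
  idle: ["active", "closed"],
  closed: [],
};

type TransitionResult =
  | { ok: true; state: LifecycleState }
  | { ok: false; state: LifecycleState; error: string };

// Returns the next state on success; on failure returns the unchanged current state plus an error.
function transition(current: LifecycleState, next: LifecycleState): TransitionResult {
  if (!ALLOWED[current].includes(next)) {
    return { ok: false, state: current, error: `invalid transition ${current} -> ${next}` };
  }
  return { ok: true, state: next };
}
```

Because the function never mutates anything, the caller decides what to do with a rejection: keep the old state, log the `error`, or emit an explicit error event as the recommendation suggests.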

src/bin/beamcode.ts

Lines changed: 17 additions & 1 deletion
@@ -351,6 +351,9 @@ export async function runBeamcode(argv: string[] = process.argv): Promise<void>
     if (err instanceof Error && err.message.includes("already running")) {
       console.error(`Error: ${err.message}`);
       console.error("Stop the other instance first, or use a different --data-dir.");
+      console.error(
+        "Note: separate --data-dir instances do not share session state and are not horizontally coordinated.",
+      );
       process.exit(1);
     }
     throw err;
@@ -454,12 +457,24 @@ export async function runBeamcode(argv: string[] = process.argv): Promise<void>
   });

   let activeSessionId = "";
+  logger.warn(
+    "Session state is process-local to this BeamCode instance; horizontal scaling requires external coordination that is not enabled in this runtime.",
+    { component: "startup", topology: "single-node", sessionStateScope: "process-local" },
+  );

   const httpServer = createBeamcodeServer({
     sessionCoordinator,
     activeSessionId,
     apiKey: consumerToken,
-    healthContext: { version, metrics },
+    healthContext: {
+      version,
+      metrics,
+      deployment: {
+        topology: "single-node",
+        sessionStateScope: "process-local",
+        horizontalScaling: "unsupported",
+      },
+    },
     prometheusCollector,
   });

@@ -538,6 +553,7 @@ export async function runBeamcode(argv: string[] = process.argv): Promise<void>
   Local: ${localUrl}${tunnelSessionUrl ? `\n Tunnel: ${tunnelSessionUrl}` : ""}
   ${activeSessionId ? `\n Session: ${activeSessionId}` : ""}
   Adapter: ${adapter.name}${config.noAutoLaunch ? " (no auto-launch)" : ""}
+  Topology: single-node (process-local session state)
   CWD: ${config.cwd}
   API Key: ${consumerToken}
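
Reviewer sketch (not part of the commit): the `healthContext.deployment` object above uses camelCase keys, while the README documents snake_case keys in the `/health` response. The mapping between them presumably lives in `src/http/health.ts`, which is not shown here. The function below is only a guess at what such a mapping could look like. `DeploymentContext` and `toHealthDeployment` are hypothetical names.

```typescript
// Mirrors the literal values passed in `healthContext.deployment` in the hunk above.
interface DeploymentContext {
  topology: "single-node";
  sessionStateScope: "process-local";
  horizontalScaling: "unsupported";
}

// Hypothetical mapping from the camelCase context to the snake_case wire format in the README.
function toHealthDeployment(ctx: DeploymentContext) {
  return {
    topology: ctx.topology,
    session_state_scope: ctx.sessionStateScope,
    horizontal_scaling: ctx.horizontalScaling,
  };
}
```

Typing the fields as string literals means a future "multi-node" value would have to be added to the interface deliberately, which keeps the health payload and the docs from drifting apart silently.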

src/core/bridge/runtime-api.test.ts

Lines changed: 49 additions & 2 deletions
@@ -1,6 +1,11 @@
 import { describe, expect, it, vi } from "vitest";
 import type { Logger } from "../../interfaces/logger.js";
+import type { WebSocketLike } from "../../interfaces/transport.js";
 import type { PolicyCommand } from "../interfaces/runtime-commands.js";
+import {
+  InMemorySessionLeaseCoordinator,
+  type SessionLeaseCoordinator,
+} from "../session/session-lease-coordinator.js";
 import type { Session, SessionRepository } from "../session/session-repository.js";
 import { RuntimeApi } from "./runtime-api.js";
 import type { RuntimeManager } from "./runtime-manager.js";
@@ -22,10 +27,16 @@ function createRuntimeStub() {
     executeSlashCommand: vi.fn().mockResolvedValue({ content: "ok", source: "emulated" }),
     handlePolicyCommand: vi.fn(),
     sendToBackend: vi.fn(),
+    handleInboundCommand: vi.fn(),
+    handleBackendMessage: vi.fn(),
+    handleSignal: vi.fn(),
   };
 }

-function createApi() {
+function createApi(options?: {
+  leaseCoordinator?: SessionLeaseCoordinator;
+  leaseOwnerId?: string;
+}) {
   const sessions = new Map<string, Session>();
   const store = {
     get: vi.fn((sessionId: string) => sessions.get(sessionId)),
@@ -42,7 +53,9 @@ function createApi() {
     error: vi.fn(),
   };

-  const api = new RuntimeApi({ store, runtimeManager, logger });
+  const leaseCoordinator = options?.leaseCoordinator ?? new InMemorySessionLeaseCoordinator();
+  const leaseOwnerId = options?.leaseOwnerId ?? "owner-1";
+  const api = new RuntimeApi({ store, runtimeManager, logger, leaseCoordinator, leaseOwnerId });
   return { api, sessions, runtime, runtimeManager, logger };
 }

@@ -144,4 +157,38 @@ describe("RuntimeApi", () => {
       message: "ok",
     });
   });
+
+  it("delegates inbound/backend handlers when session exists", () => {
+    const { api, sessions, runtime } = createApi();
+    sessions.set("s1", stubSession("s1"));
+    const ws = {} as WebSocketLike;
+    const inbound = { type: "interrupt" } as any;
+    const backend = { type: "result", metadata: {} } as any;
+
+    api.handleInboundCommand("s1", inbound, ws);
+    api.handleBackendMessage("s1", backend);
+    api.handleLifecycleSignal("s1", "backend:connected");
+
+    expect(runtime.handleInboundCommand).toHaveBeenCalledWith(inbound, ws);
+    expect(runtime.handleBackendMessage).toHaveBeenCalledWith(backend);
+    expect(runtime.handleSignal).toHaveBeenCalledWith("backend:connected");
+  });
+
+  it("blocks mutation when lease is held by another owner", () => {
+    const leaseCoordinator = new InMemorySessionLeaseCoordinator();
+    leaseCoordinator.ensureLease("s1", "owner-other");
+    const { api, sessions, runtime, logger } = createApi({
+      leaseCoordinator,
+      leaseOwnerId: "owner-1",
+    });
+    sessions.set("s1", stubSession("s1"));
+
+    api.sendInterrupt("s1");
+
+    expect(runtime.sendInterrupt).not.toHaveBeenCalled();
+    expect(logger.warn).toHaveBeenCalledWith(
+      "Session mutation blocked: lease not owned by this runtime",
+      expect.objectContaining({ sessionId: "s1", operation: "sendInterrupt" }),
+    );
+  });
 });
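
Reviewer sketch (not part of the commit): the second test above relies on a lease gate in front of every mutation. The real `InMemorySessionLeaseCoordinator` is in `session-lease-coordinator.ts`, which this page does not show, so the class below is a guess at the minimum behaviour the test needs. `SketchLeaseCoordinator`, `releaseLease`, `guardedMutation`, and the boolean return of `ensureLease` are all assumptions. Only the `ensureLease(sessionId, ownerId)` call shape and the warning text come from the diff.

```typescript
// Hypothetical in-memory lease table: one owner per session ID, first caller wins.
class SketchLeaseCoordinator {
  private readonly owners = new Map<string, string>();

  // Grants the lease if it is free; returns true when `ownerId` holds it afterwards.
  ensureLease(sessionId: string, ownerId: string): boolean {
    const current = this.owners.get(sessionId);
    if (current === undefined) {
      this.owners.set(sessionId, ownerId);
      return true;
    }
    return current === ownerId;
  }

  // Releases the lease only if `ownerId` is the current holder.
  releaseLease(sessionId: string, ownerId: string): boolean {
    if (this.owners.get(sessionId) !== ownerId) return false;
    return this.owners.delete(sessionId);
  }
}

// Runs `mutate` only when this runtime owns the lease; otherwise warns and skips the mutation.
function guardedMutation(
  leases: SketchLeaseCoordinator,
  sessionId: string,
  ownerId: string,
  operation: string,
  mutate: () => void,
  warn: (message: string, context: Record<string, unknown>) => void,
): boolean {
  if (!leases.ensureLease(sessionId, ownerId)) {
    warn("Session mutation blocked: lease not owned by this runtime", { sessionId, operation });
    return false;
  }
  mutate();
  return true;
}

// Mirrors the test scenario: another owner already holds the lease for "s1".
const leases = new SketchLeaseCoordinator();
leases.ensureLease("s1", "owner-other");
const applied: string[] = [];
const warnings: string[] = [];
const ran = guardedMutation(
  leases,
  "s1",
  "owner-1",
  "sendInterrupt",
  () => applied.push("sendInterrupt"),
  (message) => warnings.push(message),
);
console.log(ran, applied.length, warnings.length); // false 0 1
```

With a single process and a single owner ID the gate always passes, which is why it can ship as a seam before any distributed backend exists. It only starts rejecting writes once a second owner contends for the same session.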
