Skip to content

Consolidate bundle respawn after credential change into one mutex-guarded helper#218

Open
mgoldsborough wants to merge 2 commits into
mainfrom
feat/hot-bundle-reload
Open

Consolidate bundle respawn after credential change into one mutex-guarded helper#218
mgoldsborough wants to merge 2 commits into
mainfrom
feat/hot-bundle-reload

Conversation

@mgoldsborough
Copy link
Copy Markdown
Contributor

Summary

  • BundleLifecycleManager.respawnBundle(serverName, wsId, registry) is now the single supported path to restart a running bundle after a workspace credential is written. Both credential-writing tool handlers (set_user_config via connector-tools.ts and configureBundle in system-tools.ts) delegate to it.
  • A per-(serverName, wsId) promise-chain mutex serializes concurrent respawns for the same key without blocking unrelated keys. The previous duplicated code paths could race on removeSource / startBundleSource and leave a workspace's bundle dead until manual restart.
  • Lifecycle state stays accurate on every branch: missing instance and protected-bundle paths return structured errors; startBundleSource failure transitions to dead; success transitions to running.

Why now

The lifecycle of "set credential → restart subprocess to pick up new env" was implemented twice with slightly different shapes. After diagnosing tenant-ipsdi's DATABASE_URL propagation issue (the matching manifest fix shipped in synapse-db-query #7), the platform side was the remaining hygiene gap: same intent, two code paths, no concurrency guard.

Test plan

  • bun run verify:static (format / lint / tsc strict / web tsc / cycles / bundle-transport / codegen) clean
  • bun run verify:test-unit — 246 unit + web tests pass, including 5 new tests in test/unit/lifecycle-respawn.test.ts:
    • missing-instance error
    • protected-bundle refusal with source preserved
    • startBundleSource failure → dead
    • concurrent respawns: removeSource spy confirms maxInFlight === 1
    • lock map drops the key after the last queued op resolves
  • Integration check in QA: open a workspace's Configure UI, set/clear database_url for synapse-db-query twice rapidly; verify the bundle ends up running with the latest credential applied (no need for a manual restart).

…le helper

Both `set_user_config` and `configureBundle` were duplicating the
remove-source / spawn-new-source sequence after a credential write.
That duplication is the kind of thing that drifts: one site adds a
log line or a check, the other doesn't, and the second self-service
credential write produces a different runtime state than the first.

Worse, the two paths could race against each other on the same
`(serverName, wsId)`: two concurrent writes meant two concurrent
removeSource / startBundleSource pairs with no ordering guarantee,
and the surviving McpSource depended on whichever spawn finished
last. Under load this manifested as the bundle "going dead" until
the next manual restart.

This PR introduces `BundleLifecycleManager.respawnBundle(serverName,
wsId, registry)` as the single supported path, with:

- A per-(serverName, wsId) promise-chain mutex so concurrent
  respawns serialize without blocking different keys.
- Explicit guards: missing instance and protected bundle both
  return structured errors instead of throwing.
- `transition(instance, 'starting' | 'running' | 'dead')` on every
  branch so the lifecycle state stays accurate.

Callers now delegate:
- `connector-tools.ts::respawnBundleAfterCredentialChange` becomes a
  one-liner over `lifecycle.respawnBundle`.
- `system-tools.ts::configureBundle` drops its inline remove + start
  block and calls the same helper. The `eventSink` param is gone
  from the signature (events emit from inside the helper now).

Tests (`test/unit/lifecycle-respawn.test.ts`):
- Missing instance returns an error.
- Protected bundle refuses respawn and source stays in registry.
- `startBundleSource` failure marks instance `dead` and returns
  the error.
- Two concurrent respawns for the same key serialize: a spy on
  `registry.removeSource` confirms `maxInFlight === 1`.
- The lock map drops its key after the last queued op resolves
  (no leak under churn).
Follow-ups on PR review feedback:

- Inject workDir through `BundleLifecycleManager`'s constructor
  instead of reading `process.env.NB_WORK_DIR` inside `runRespawn`.
  Runtime already has `resolveWorkDir(config)` at construction time;
  this keeps `lifecycle.ts` honest about its dependencies. The other
  pre-existing `process.env.NB_WORK_DIR` reads inside this file are
  unrelated to this PR and remain unchanged.

- Inline `respawnBundleAfterCredentialChange` into its sole caller
  (`handleSetUserConfig`). The wrapper had one caller and an unused
  `_bundleName` parameter — it added indirection without leverage
  now that the real work lives in `lifecycle.respawnBundle`.

- Improve the missing-instance error to be user-shaped rather than
  developer-shaped: "Bundle X is not installed in workspace Y. Install
  it first via manage_app install." instead of "No bundle instance
  for X in workspace Y".

- Fix a stale comment: `runRespawn` transitions to `dead`, not
  `failed` (no `failed` state exists in `BundleState`).

- Update the `protected` docstring on `BundleInstance` and its
  schema description to reflect that this PR makes `protected`
  gate respawn as well as uninstall (both are tear-down operations
  and the rationale — don't tear down mid-call — is identical).

- Update the unit-test assertion for the new missing-instance
  message.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant