Embedded→canonical pgserve upgrade (main→dev) breaks; omni doctor --fix can't complete on a clean host #722

@vasconceloscezar

Summary

Upgrading an embedded-Postgres install from @latest (main, 2.260410.1) to @next (dev, 2.260617.3) leaves omni-api in a permanent crash loop, and the documented remediation — omni doctor --fix (embedded → canonical pgserve migration) — cannot complete on a clean host. Data is not lost, but the service is down with no working automated path forward.

Found via an isolated sandbox upgrade test (rootful Podman, ubuntu:24.04, unprivileged user, real installer flow). The production VM (already on canonical autopg-server@2.6.10) is not an embedded install and is likely unaffected; this bites embedded/older installs upgrading across the backbone change.

Environment

  • Clean ubuntu:24.04, install via the published flow: bun add -g @automagik/omni@<channel> → omni install --non-interactive → omni start.
  • main 2.260410.1 installs an embedded, in-process Postgres (PG18) on :8432.
  • dev 2.260617.3 expects a standalone canonical pgserve/autopg on :5432.

Repro

  1. Install + boot main (@latest): healthy — embedded PG18 on :8432, 18 migrations, NATS connected. ✅
  2. bun add -g @automagik/omni@next then omni stop && omni start.
  3. omni-api crash-loops: ERROR api:startup "Failed to start API server" error="Database not ready after 30 attempts". omni start reports services "online" while the API silently restarts.
  4. omni doctor correctly flags it: pgserve-canonical … embedded … DEPRECATED. Run omni doctor --fix (idempotent; pg_dump → pgserve install → restore → relaunch).
  5. omni doctor --fix fails and cannot recover (details below).

Root causes (each independently blocks the migration)

1. Opaque failure on the bare upgrade

After upgrade, omni-api only logs Database not ready after 30 attempts. Nothing tells the operator the DB-backbone model changed (embedded → canonical) or to run omni doctor --fix. omni start should fail-fast on a deprecated-embedded + dev-binary combination with an actionable hint, not enter a silent restart loop.
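A minimal sketch of the fail-fast check suggested above. The function name `preflight_db_backend` and its two arguments are hypothetical; how omni actually detects the installed backend is not covered in this report, so both values are assumed to be known by the caller.

```shell
# preflight_db_backend INSTALLED EXPECTED
# Hypothetical helper: INSTALLED is the backend found on disk ("embedded" or
# "canonical"), EXPECTED is what the running binary supports. Both are assumed
# inputs; this report does not describe how omni detects them.
preflight_db_backend() {
    installed="$1"
    expected="$2"
    if [ "$installed" = "embedded" ] && [ "$expected" = "canonical" ]; then
        # Exit once with an actionable message instead of retrying the DB 30 times.
        echo "error: embedded Postgres install found, but this build expects canonical pgserve" >&2
        echo "hint: run 'omni doctor --fix' to migrate" >&2
        return 1
    fi
    return 0
}
```

Called early in startup, a non-zero return would let omni start report the failure directly rather than showing services as "online".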

2. No pg_dump ships with any bundled Postgres distribution

omni doctor --fix needs a PG18 pg_dump, but:

  • @embedded-postgres bundles only initdb / pg_ctl / postgres; no pg_dump/pg_restore/psql.
  • Canonical autopg/pgserve (~/.autopg/bin/...) — same: only initdb / pg_ctl / postgres.
  • find / -name pg_dump → empty on a fresh install.
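The check above can be narrowed to a few candidate directories. This is a sketch only: `find_pg_dump` is a hypothetical helper, and the directories to search are supplied by the caller rather than taken from omni.

```shell
# find_pg_dump DIR...
# Print the first executable pg_dump found under the given directories.
# Returns non-zero when none of them contains one.
find_pg_dump() {
    for dir in "$@"; do
        [ -d "$dir" ] || continue
        # Regular files named pg_dump that the owner may execute; keep the first hit.
        hit=$(find "$dir" -type f -name pg_dump -perm -u+x 2>/dev/null | head -n 1)
        if [ -n "$hit" ]; then
            printf '%s\n' "$hit"
            return 0
        fi
    done
    return 1
}
```

For example, `find_pg_dump "$HOME/.autopg" "$HOME/.pgserve" /usr/lib/postgresql` would, per the findings above, return non-zero on a fresh install until a client package is added.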

3. The remediation hint installs the wrong major version

doctor --fix suggests apt install postgresql-client. On Ubuntu 24.04 that installs the PG16 client, whose pg_dump refuses to dump a PG18 server (pg_dump: server version 18 … is newer). The hint is actively misleading; it should point at a client whose major version is at least the server's (e.g. PGDG postgresql-client-18) or use a bundled one.
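A sketch of the version guard the hint is missing, assuming the client and server major versions are available as strings. `pg_major` and `can_dump` are hypothetical helper names, not part of omni.

```shell
# pg_major "VERSION STRING"
# Print the first integer in the string, e.g. the major version from the
# output of `pg_dump --version`.
pg_major() {
    printf '%s\n' "$1" | sed -n 's/^[^0-9]*\([0-9][0-9]*\).*/\1/p'
}

# can_dump CLIENT_MAJOR SERVER_MAJOR
# Succeed only when the client is at least as new as the server, since
# pg_dump refuses to dump a server newer than itself.
can_dump() {
    [ "$1" -ge "$2" ]
}
```

One way to use it: take the client side from `pg_major "$(pg_dump --version)"` and the server side from the `PG_VERSION` file in the data directory, then only offer a client for which `can_dump` succeeds.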

4. Ordering chicken-and-egg, and the source server isn't running under dev

Even with a real PG18 pg_dump supplied:

  • doctor --fix tries to pg_dump the embedded DB on :8432, but the dev binary never starts the embedded server, so the source is down → connection to server … port 8432 failed: Connection refused.
  • Installing canonical pgserve does a soft-rename of ~/.pgserve → ~/.autopg (documented, reversible — leaves MIGRATED-FROM-PGSERVE.md), which removes the embedded binaries from ~/.pgserve/bin, so there's no obvious supported way to bring the embedded source up for the dump.

Net: completing the migration required (a) an externally-installed PG18 client, and (b) manually starting the embedded server that dev no longer manages — neither of which omni doctor --fix does. The cascade of manual workarounds is the real blocker.
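Roughly, the manual workaround amounts to the sequence below. It is a sketch under stated assumptions, not what omni doctor --fix does: every path, the database name, and the user are placeholders that this report does not specify, and the script only prints the commands unless DRY_RUN is set to 0.

```shell
# Placeholders (assumptions): adjust to the actual install before running.
EMBEDDED_BIN="${EMBEDDED_BIN:-/path/to/embedded/bin}"           # directory holding the embedded pg_ctl
EMBEDDED_PGDATA="${EMBEDDED_PGDATA:-/path/to/embedded/pgdata}"  # preserved embedded data directory
PG18_DUMP="${PG18_DUMP:-pg_dump}"                               # externally installed PG18 pg_dump
DB_NAME="${DB_NAME:-omni}"                                      # hypothetical database name
DB_USER="${DB_USER:-postgres}"                                  # hypothetical role
DUMP_FILE="${DUMP_FILE:-omni-embedded.dump}"

# run CMD...: print the command; execute it only when DRY_RUN is not 1.
run() {
    printf '+ %s\n' "$*"
    [ "${DRY_RUN:-1}" = "1" ] || "$@"
}

dump_embedded() {
    # Start the embedded server on its original port and wait until it is ready.
    run "$EMBEDDED_BIN/pg_ctl" -D "$EMBEDDED_PGDATA" -o "-p 8432" -w start || return 1
    # Custom-format dump, restorable later with pg_restore.
    run "$PG18_DUMP" -h 127.0.0.1 -p 8432 -U "$DB_USER" -Fc -f "$DUMP_FILE" "$DB_NAME"
    status=$?
    # Stop the embedded server again, whether or not the dump succeeded.
    run "$EMBEDDED_BIN/pg_ctl" -D "$EMBEDDED_PGDATA" -w stop
    return "$status"
}
```

Running `dump_embedded` with the defaults prints the three commands for review; `DRY_RUN=0 dump_embedded` executes them.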

Impact

  • Embedded installs upgrading main→dev get a down API with no working automated recovery.
  • Data is intact (embedded PGDATA preserved, 15M PG18) but unreachable by the dev service.

Suggested fixes

  • omni start/startup: detect deprecated-embedded under a canonical-only binary and fail fast with the omni doctor --fix hint instead of a 30-retry crash loop.
  • Make omni doctor --fix self-sufficient: locate/ship a PG≥server-version pg_dump, and start the embedded server itself (it knows the embedded binary + PGDATA) for the dump before relaunching on canonical.
  • Fix the remediation hint: never suggest a pg_dump older than the server; prefer the bundled/canonical PG18 client or PGDG postgresql-client-18.
  • Consider bundling pg_dump/pg_restore with @embedded-postgres/autopg so migrations don't depend on host tooling.

Notes

Found via sandbox upgrade test. Not observed to affect the production deployment (already canonical autopg-server@2.6.10).
