From 478e4959a50f35cdcded91646eae5bd54c823b71 Mon Sep 17 00:00:00 2001
From: Paulo Cabral Sanz
Date: Mon, 27 Apr 2026 10:54:44 -0300
Subject: [PATCH 1/6] docs: Point-in-time Recovery for Postgres

New reference page under Volumes covering the PITR feature: how it works,
the standalone vs HA enable flows (incl. the rolling-restart sequence),
restoring to a target timestamp, the no-snapshot / coverage-gap warnings
the UI surfaces, the disable flow (and bucket-retention behavior), cost,
and limitations. Sibling to /volumes/backups; sidebar entry added next to
Backups.
---
 .../docs/volumes/point-in-time-recovery.md | 95 +++++++++++++++++++
 src/data/sidebar.ts                        |  1 +
 2 files changed, 96 insertions(+)
 create mode 100644 content/docs/volumes/point-in-time-recovery.md

diff --git a/content/docs/volumes/point-in-time-recovery.md b/content/docs/volumes/point-in-time-recovery.md
new file mode 100644
index 000000000..e97c13921
--- /dev/null
+++ b/content/docs/volumes/point-in-time-recovery.md
@@ -0,0 +1,95 @@
+---
+title: Point-in-Time Recovery
+description: Recover a Railway Postgres service to any moment within the WAL retention window using continuous wal-g WAL archiving.
+---
+
+Point-in-Time Recovery (PITR) lets you restore a Postgres service to any timestamp within the WAL retention window — not just to the moment of the most recent snapshot. It's the right tool when something goes wrong between scheduled backups: an accidental `DROP TABLE`, a faulty migration, a runaway script.
+
+PITR works for both single-node Postgres and [Postgres HA](/databases/postgresql-ha) clusters, and it's available on the **Pro** plan.
+
+## How it works
+
+When PITR is enabled, your Postgres image archives every WAL segment it produces to a Railway [storage bucket](/storage-buckets) using [wal-g](https://github.com/wal-g/wal-g). The bucket retains roughly 7 days of WAL by default. To restore, you pick a target timestamp; Railway finds the most recent base [snapshot](/volumes/backups) at-or-before that timestamp, creates a new volume from it, and the Postgres image replays archived WAL up to your target before promoting.
+
+The original volume is retained, unmounted, in case you need to roll back the restore.
+
+## Enabling PITR
+
+Open the **Backups** tab on your Postgres service (or on the cluster Backups page for Postgres HA). When PITR isn't yet configured, you'll see a **Point-in-time recovery is off** banner with an **Enable PITR** button.
+
+### Single-node Postgres
+
+Click Enable, confirm, and Railway:
+
+1. Creates a Railway Bucket named **Postgres-PITR** for archived WAL.
+2. Sets the wal-g credentials as variables on the Postgres service.
+3. Redeploys the service. The image picks up the new env vars and starts archiving WAL on every commit.
+4. Adds a daily backup schedule (if not already enabled) and triggers an immediate base snapshot — PITR replay needs at least one covering snapshot.
+
+One redeploy. After the service comes back, you'll see a **PITR datetime picker** appear on the Backups tab.
+
+### Postgres HA
+
+HA enable runs a rolling restart so the cluster never loses Patroni quorum:
+
+1. **Bucket + DCS** — Bucket is created and the cluster's Patroni DCS gains `archive_mode=on`, `archive_command='wal-g wal-push %p'`, and `archive_timeout=60`. No restarts at this step.
+2. **Roll replicas** — Each replica gets the wal-g env vars one at a time. The variable change triggers a redeploy of just that node; Patroni absorbs the restart and the cluster stays available throughout. Railway waits for each replica to come back as a healthy streaming follower (`state=streaming`, `lag=0`) before moving to the next.
+3. **Switchover** — Patroni promotes one of the now-configured replicas to leader, demoting the original primary. Brief (~5s) write-unavailability blip absorbed by HAProxy + client reconnect.
+4. **Roll ex-primary** — The demoted node gets the wal-g env vars and redeploys cleanly as a replica.
+
+After this completes, every Postgres node carries the wal-g credentials, so whichever node holds Patroni leadership at any given moment can fire `archive_command`. Failovers preserve archiving with no config change.
+
+The full HA enable typically takes 2–4 minutes.
+
+## Restoring to a point in time
+
+On the Backups tab, the PITR section shows the available restore range — typically **(oldest snapshot's timestamp) → now** — and a datetime picker bounded to that window. Pick a moment, click **Restore to this moment**.
+
+Railway:
+
+1. Resolves the most recent base snapshot at-or-before your target timestamp. (The picker prevents you from picking a target outside the available range; the API also rejects out-of-range targets.)
+2. Creates a new volume from that snapshot.
+3. Sets `POSTGRES_RECOVERY_TARGET_TIME` on the restored service.
+4. Stages the changes — review them via the Project Canvas, then Deploy.
+
+When the new volume mounts, the Postgres image enters archive recovery, replays WAL from the bucket up to your target, and promotes. The old volume is retained, unmounted, under its original name.
+
+For Postgres HA, restoring to a past timestamp requires full cluster downtime: replicas are wiped to fresh empty volumes and re-bootstrap from the restored primary via `pg_basebackup`. Plan accordingly.
+
+## Coverage and warnings
+
+PITR replay needs a covering base snapshot. Two states the Backups tab will surface:
+
+- **No snapshots yet** — PITR is enabled but no backup has run. The picker is disabled. Take a manual backup or wait for the daily schedule.
+- **Coverage gap** — The most recent snapshot is older than the WAL retention window (~7 days). The picker is disabled and a warning explains why: WAL beyond the retention window has no covering snapshot to replay onto. Take a fresh backup before restoring.
+
+Both states are recoverable — just take a snapshot. Keep the daily schedule on (it's added automatically when you enable PITR) and you'll never hit them.
+
+## Disabling PITR
+
+Click **Disable PITR** on the Backups tab.
+
+For single-node, Railway reverts the postgres-pitr template — removes the wal-g env vars and redeploys the service. The image cleans its archive config out of `postgresql.auto.conf` on next start so Postgres comes back without any wal-g machinery.
+
+For HA, the same rolling-restart pattern as enable runs in reverse: revert the cluster's DCS archive config, roll replicas (clearing their wal-g vars), switchover, roll the ex-primary.
+
+The Railway Bucket holding archived WAL is **not deleted** — you can still restore from it, or remove it manually from the Buckets page once you're sure you no longer need it.
+
+If an enable run fails partway through, Railway runs a best-effort cleanup automatically: reverts the DCS edit and clears any wal-g vars that were written. The cluster goes back to its pre-enable state, and a toast surfaces the original error so you can decide whether to retry.
+
+## Cost
+
+PITR storage costs are billed at the standard [Railway storage bucket](/storage-buckets) rate. For most workloads, expect roughly:
+
+- A few GB of compressed WAL per day under steady write load (zstd-compressed; idle databases are nearly free)
+- One base snapshot per day at the volume's size, retained per your backup schedule
+
+Restore egress is free.
+
+## Limitations
+
+- Available on the **Pro** plan and above.
+- Currently amd64 only (the wal-g binary in the postgres-ssl image isn't published for arm64 yet).
+- The available restore window starts from the first post-enable base snapshot, not retroactively. If you enable PITR today, you can't restore to yesterday.
+- HA restore inherently requires full cluster downtime (replicas wiped, re-bootstrapped from restored primary).
+- After failover, the residual RPO is bounded by `archive_timeout` (60s) plus failover-detection time. Workloads that need tighter guarantees should consider a dedicated archiver (pgBackRest) — out of scope for Railway's PITR offering today.
diff --git a/src/data/sidebar.ts b/src/data/sidebar.ts
index 5d48f9d15..13285b642 100644
--- a/src/data/sidebar.ts
+++ b/src/data/sidebar.ts
@@ -356,6 +356,7 @@ export const sidebarContent: ISidebarContent = [
     subTitle: makePage("Volumes", undefined, "/volumes"),
     pages: [
       makePage("Backups", "volumes"),
+      makePage("Point-in-Time Recovery", "volumes"),
       makePage("Reference", "volumes"),
     ],
   },

From b351cf8481896b2253a2d52f3effdefb5fe2e332 Mon Sep 17 00:00:00 2001
From: Paulo Cabral Sanz
Date: Mon, 27 Apr 2026 11:46:43 -0300
Subject: [PATCH 2/6] docs(pitr): require both daily AND weekly snapshot schedules
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Daily-only is broken: snapshot retention is 6 days, WAL retention is 7 —
the oldest day of the WAL window has no base snapshot to replay onto.
Weekly snapshots (27-day retention) close the gap. Both schedules are
required and Enable PITR adds them automatically; document that removing
either breaks the feature.
---
 content/docs/volumes/point-in-time-recovery.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/content/docs/volumes/point-in-time-recovery.md b/content/docs/volumes/point-in-time-recovery.md
index e97c13921..129ca637f 100644
--- a/content/docs/volumes/point-in-time-recovery.md
+++ b/content/docs/volumes/point-in-time-recovery.md
@@ -24,7 +24,7 @@ Click Enable, confirm, and Railway:
 1. Creates a Railway Bucket named **Postgres-PITR** for archived WAL.
 2. Sets the wal-g credentials as variables on the Postgres service.
 3. Redeploys the service. The image picks up the new env vars and starts archiving WAL on every commit.
-4. Adds a daily backup schedule (if not already enabled) and triggers an immediate base snapshot — PITR replay needs at least one covering snapshot.
+4. Adds **daily and weekly** backup schedules (if not already enabled) and triggers an immediate base snapshot. Both schedules are required: WAL is retained for 7 days, but daily snapshots are only retained for 6 — without a weekly snapshot, the oldest day of your WAL window would have no base snapshot to replay onto. The weekly fills that gap.
 
 One redeploy. After the service comes back, you'll see a **PITR datetime picker** appear on the Backups tab.
 
@@ -63,7 +63,7 @@ PITR replay needs a covering base snapshot. Two states the Backups tab will surf
 - **No snapshots yet** — PITR is enabled but no backup has run. The picker is disabled. Take a manual backup or wait for the daily schedule.
 - **Coverage gap** — The most recent snapshot is older than the WAL retention window (~7 days). The picker is disabled and a warning explains why: WAL beyond the retention window has no covering snapshot to replay onto. Take a fresh backup before restoring.
 
-Both states are recoverable — just take a snapshot. Keep the daily schedule on (it's added automatically when you enable PITR) and you'll never hit them.
+Both states are recoverable — just take a snapshot. Keep the daily and weekly schedules on (both added automatically when you enable PITR) and you'll never hit them. Removing either of those schedules while PITR is enabled will reintroduce a coverage gap, so leave them in place.
 
 ## Disabling PITR
 

From 3585405115736664838db5290eb590e44e0c5b55 Mon Sep 17 00:00:00 2001
From: Paulo Cabral Sanz
Date: Mon, 4 May 2026 17:44:03 -0300
Subject: [PATCH 3/6] docs(pitr): rewrite for pgBackRest + new-service restore flow
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The previous draft documented the wal-g design. The implementation has
since pivoted to pgBackRest baked into the Postgres image (direct-to-S3,
async push) and restore now creates a new sibling service rather than
mutating the source in place.

Key updates:
- wal-g → pgBackRest, with the WAL_ARCHIVE_* tool-agnostic env contract
  replacing the wal-g credential references.
- Base backups taken by pgBackRest itself (full + incremental), not by
  Railway snapshot schedules — drops the daily/weekly schedule
  requirement that was a workaround for snapshot-as-base.
- Restore creates -restored-YYYYMMDD-HHMM as a new service; source
  service stays online and untouched. Same flow for standalone and HA —
  drops the "HA restore requires full cluster downtime" claim.
- Coverage timeline reads pgBackRest catalog directly: green band + red
  diagonal stripes for gaps; upper bound is latest archived WAL, not
  current time.
- Standalone disable stages a patch (env vars + bucket deletion) for
  user review; HA disable runs imperatively and retains the bucket.
- Drops the arm64 limitation (was wal-g specific) and the "consider
  pgBackRest" hedge (we use it now).
---
 .../docs/volumes/point-in-time-recovery.md | 79 ++++++++++---------
 1 file changed, 42 insertions(+), 37 deletions(-)

diff --git a/content/docs/volumes/point-in-time-recovery.md b/content/docs/volumes/point-in-time-recovery.md
index 129ca637f..94a14e40c 100644
--- a/content/docs/volumes/point-in-time-recovery.md
+++ b/content/docs/volumes/point-in-time-recovery.md
@@ -1,17 +1,19 @@
 ---
 title: Point-in-Time Recovery
-description: Recover a Railway Postgres service to any moment within the WAL retention window using continuous wal-g WAL archiving.
+description: Recover a Railway Postgres service to any moment within the WAL retention window using continuous pgBackRest WAL archiving.
 ---
 
-Point-in-Time Recovery (PITR) lets you restore a Postgres service to any timestamp within the WAL retention window — not just to the moment of the most recent snapshot. It's the right tool when something goes wrong between scheduled backups: an accidental `DROP TABLE`, a faulty migration, a runaway script.
+Point-in-Time Recovery (PITR) lets you restore a Postgres service to any timestamp within the archive retention window — not just to the moment of the most recent backup. It's the right tool when something goes wrong between scheduled backups: an accidental `DROP TABLE`, a faulty migration, a runaway script.
 
 PITR works for both single-node Postgres and [Postgres HA](/databases/postgresql-ha) clusters, and it's available on the **Pro** plan.
 
 ## How it works
 
-When PITR is enabled, your Postgres image archives every WAL segment it produces to a Railway [storage bucket](/storage-buckets) using [wal-g](https://github.com/wal-g/wal-g). The bucket retains roughly 7 days of WAL by default. To restore, you pick a target timestamp; Railway finds the most recent base [snapshot](/volumes/backups) at-or-before that timestamp, creates a new volume from it, and the Postgres image replays archived WAL up to your target before promoting.
+When PITR is enabled, your Postgres image archives every WAL segment it produces directly to a Railway [storage bucket](/storage-buckets) using [pgBackRest](https://pgbackrest.org/). pgBackRest also takes its own base backups (full + incremental) on a rolling schedule, so the bucket holds everything Postgres needs to rebuild a database at any point in the archive window.
 
-The original volume is retained, unmounted, in case you need to roll back the restore.
+Pushes are async: pgBackRest's worker batches WAL segments and ships them to S3 in the background, so a stalled bucket can't block writes on Postgres. Under sustained S3 outages, a 5 GiB queue cap on the leader trips and pgBackRest drops WAL to keep the database running — your PITR window truncates, but Postgres stays up.
+
+To restore, you pick a target timestamp. Railway provisions a brand-new Postgres service alongside the source, and the new service's image runs `pgbackrest restore --type=time --target=` on first boot — pulling the most recent base backup at-or-before your target, then replaying archived WAL forward until it reaches the target, then promoting. The source service is never touched.
 
 ## Enabling PITR
 
@@ -21,75 +23,78 @@ Open the **Backups** tab on your Postgres service (or on the cluster Backups pag
 
 Click Enable, confirm, and Railway:
 
-1. Creates a Railway Bucket named **Postgres-PITR** for archived WAL.
-2. Sets the wal-g credentials as variables on the Postgres service.
-3. Redeploys the service. The image picks up the new env vars and starts archiving WAL on every commit.
-4. Adds **daily and weekly** backup schedules (if not already enabled) and triggers an immediate base snapshot. Both schedules are required: WAL is retained for 7 days, but daily snapshots are only retained for 6 — without a weekly snapshot, the oldest day of your WAL window would have no base snapshot to replay onto. The weekly fills that gap.
+1. Creates a Railway Bucket named **Postgres-PITR** for archived WAL and base backups.
+2. Sets `WAL_ARCHIVE_*` env vars on the Postgres service, referencing the bucket's credentials.
+3. Redeploys the service.
+
+When the new container boots, the image detects the archive credentials, writes `archive_mode=on` / `archive_command='pgbackrest --stanza=main archive-push %p'` / `archive_timeout=60` into Postgres config, runs `pgbackrest stanza-create`, and starts pushing WAL on every commit. Once archiving is healthy, an in-container watcher takes the first pgBackRest base backup automatically — no manual snapshot step. After that, the **PITR datetime picker** appears on the Backups tab.
 
-One redeploy. After the service comes back, you'll see a **PITR datetime picker** appear on the Backups tab.
+One redeploy.
 
 ### Postgres HA
 
 HA enable runs a rolling restart so the cluster never loses Patroni quorum:
 
-1. **Bucket + DCS** — Bucket is created and the cluster's Patroni DCS gains `archive_mode=on`, `archive_command='wal-g wal-push %p'`, and `archive_timeout=60`. No restarts at this step.
-2. **Roll replicas** — Each replica gets the wal-g env vars one at a time. The variable change triggers a redeploy of just that node; Patroni absorbs the restart and the cluster stays available throughout. Railway waits for each replica to come back as a healthy streaming follower (`state=streaming`, `lag=0`) before moving to the next.
+1. **Bucket + DCS** — Bucket is created and the cluster's Patroni DCS gains `archive_mode=on`, `archive_command='pgbackrest --stanza=main archive-push %p'`, and `archive_timeout=60`. No restarts at this step.
+2. **Roll replicas** — Each replica gets the `WAL_ARCHIVE_*` env vars one at a time. The variable change triggers a redeploy of just that node; Patroni absorbs the restart and the cluster stays available throughout. Railway waits for each replica to come back as a healthy streaming follower (`state=streaming`, `lag≈0`) before moving to the next.
 3. **Switchover** — Patroni promotes one of the now-configured replicas to leader, demoting the original primary. Brief (~5s) write-unavailability blip absorbed by HAProxy + client reconnect.
-4. **Roll ex-primary** — The demoted node gets the wal-g env vars and redeploys cleanly as a replica.
+4. **Roll ex-primary** — The demoted node gets the `WAL_ARCHIVE_*` env vars and redeploys cleanly as a replica.
 
-After this completes, every Postgres node carries the wal-g credentials, so whichever node holds Patroni leadership at any given moment can fire `archive_command`. Failovers preserve archiving with no config change.
+After this completes, every Postgres node carries the archive credentials, so whichever node holds Patroni leadership at any given moment can fire `archive_command`. Failovers preserve archiving with no config change. The leader takes the first base backup once archiving is healthy.
 
-The full HA enable typically takes 2–4 minutes.
+The full HA enable typically takes 2–4 minutes. If it fails partway through, Railway runs a best-effort cleanup automatically: the DCS edit is reverted and any partial env vars are cleared, leaving the cluster in its pre-enable state.
 
 ## Restoring to a point in time
 
-On the Backups tab, the PITR section shows the available restore range — typically **(oldest snapshot's timestamp) → now** — and a datetime picker bounded to that window. Pick a moment, click **Restore to this moment**.
+On the Backups tab, the PITR section shows the available restore range — bounded below by the oldest pgBackRest base backup and above by the latest WAL successfully archived to the bucket — and a datetime picker. Pick a moment, click **Restore to this moment**.
 
 Railway:
 
-1. Resolves the most recent base snapshot at-or-before your target timestamp. (The picker prevents you from picking a target outside the available range; the API also rejects out-of-range targets.)
-2. Creates a new volume from that snapshot.
-3. Sets `POSTGRES_RECOVERY_TARGET_TIME` on the restored service.
-4. Stages the changes — review them via the Project Canvas, then Deploy.
+1. Creates a brand-new Postgres service in the project, named `-restored-YYYYMMDD-HHMM` (you can override the name).
+2. Provisions an empty volume for it, the same size as the source.
+3. Stages a patch wiring up the new service: same image as the source, the source's env vars (minus the archive credentials), `WAL_RECOVER_FROM_*` pointing read-only at the source's bucket, and `POSTGRES_RECOVERY_TARGET_TIME` set to your target.
+4. Deploys the new service.
+
+On first boot, the image runs `pgbackrest restore --type=time --target=`, populating the empty volume from the bucket — base backup first, then archived WAL replayed forward until your target. Postgres promotes when it hits the target.
+
+**The source service is never touched.** It keeps serving traffic the entire time. After the restore finishes, you have two services side by side: the original and the fork. Cut over by swapping connection strings, copying out the rows you need, or replacing the source service with the fork.
 
-When the new volume mounts, the Postgres image enters archive recovery, replays WAL from the bucket up to your target, and promotes. The old volume is retained, unmounted, under its original name.
+The restored fork runs as plain non-archiving Postgres. If you want continued PITR coverage on it, enable PITR on the new service through the same flow — it'll get its own bucket.
 
-For Postgres HA, restoring to a past timestamp requires full cluster downtime: replicas are wiped to fresh empty volumes and re-bootstrap from the restored primary via `pg_basebackup`. Plan accordingly.
+For Postgres HA, restore also produces a single-node fork (not a cluster). To restore an HA cluster as HA, restore to a single-node fork, then [convert it to HA](/databases/postgresql-ha) once you're satisfied with the data.
 
 ## Coverage and warnings
 
-PITR replay needs a covering base snapshot. Two states the Backups tab will surface:
+PITR replay needs both a covering base backup and an unbroken WAL chain to your target. The picker reads pgBackRest's catalog directly to draw a coverage timeline:
 
-- **No snapshots yet** — PITR is enabled but no backup has run. The picker is disabled. Take a manual backup or wait for the daily schedule.
-- **Coverage gap** — The most recent snapshot is older than the WAL retention window (~7 days). The picker is disabled and a warning explains why: WAL beyond the retention window has no covering snapshot to replay onto. Take a fresh backup before restoring.
+- **Green band** — restorable: covered by base backups with continuous WAL.
+- **Red diagonal stripes** — coverage gap. pgBackRest detected a missed cycle (e.g. archiving was paused, the bucket was unreachable for a long stretch, or a base backup failed). The picker won't let you target a time inside a gap, and the API rejects it as well.
+- **No backups yet** — PITR is enabled but the first base backup hasn't completed. The picker is disabled. Wait for the in-container watcher to take it (typically minutes after enable).
 
-Both states are recoverable — just take a snapshot. Keep the daily and weekly schedules on (both added automatically when you enable PITR) and you'll never hit them. Removing either of those schedules while PITR is enabled will reintroduce a coverage gap, so leave them in place.
+The upper bound of the restore window is the **latest archived WAL**, not the current time. Idle databases produce WAL segments slowly, so on a quiet system the head of the window may sit a few minutes behind real time. Targets past the archive head are rejected — there's no covering WAL to replay onto, and Postgres recovery would abort mid-replay.
 
 ## Disabling PITR
 
 Click **Disable PITR** on the Backups tab.
 
-For single-node, Railway reverts the postgres-pitr template — removes the wal-g env vars and redeploys the service. The image cleans its archive config out of `postgresql.auto.conf` on next start so Postgres comes back without any wal-g machinery.
+For **single-node**, Railway stages a single patch that removes the `WAL_ARCHIVE_*` env vars and deletes the Postgres-PITR bucket — the full inverse of enable. Nothing changes on the running service until you review the patch in the **Staged Changes** panel and click Deploy. If you want to keep the archived WAL around (e.g. to restore from it later before fully cleaning up), edit the patch to drop the bucket-deletion step before deploying.
 
-For HA, the same rolling-restart pattern as enable runs in reverse: revert the cluster's DCS archive config, roll replicas (clearing their wal-g vars), switchover, roll the ex-primary.
-
-The Railway Bucket holding archived WAL is **not deleted** — you can still restore from it, or remove it manually from the Buckets page once you're sure you no longer need it.
-
-If an enable run fails partway through, Railway runs a best-effort cleanup automatically: reverts the DCS edit and clears any wal-g vars that were written. The cluster goes back to its pre-enable state, and a toast surfaces the original error so you can decide whether to retry.
+For **HA**, the rolling-restart pattern from enable runs in reverse: revert the cluster's DCS archive config, roll replicas (clearing their archive vars), switchover, roll the ex-primary. The Railway Bucket holding archived WAL is **not deleted** — you can still restore from it, or remove it manually from the Buckets page once you're sure you no longer need it.
 
 ## Cost
 
 PITR storage costs are billed at the standard [Railway storage bucket](/storage-buckets) rate. For most workloads, expect roughly:
 
-- A few GB of compressed WAL per day under steady write load (zstd-compressed; idle databases are nearly free)
-- One base snapshot per day at the volume's size, retained per your backup schedule
+- A few GB of compressed WAL per day under steady write load (zstd-compressed; idle databases are nearly free).
+- One base backup per cycle, compressed and de-duplicated by pgBackRest.
+
+pgBackRest's `expire` runs after each backup and reclaims old base backups along with their pinned WAL, so the bucket size stabilizes at roughly retention × daily-write-volume.
 
 Restore egress is free.
 
 ## Limitations
 
 - Available on the **Pro** plan and above.
-- Currently amd64 only (the wal-g binary in the postgres-ssl image isn't published for arm64 yet).
-- The available restore window starts from the first post-enable base snapshot, not retroactively. If you enable PITR today, you can't restore to yesterday.
-- HA restore inherently requires full cluster downtime (replicas wiped, re-bootstrapped from restored primary).
-- After failover, the residual RPO is bounded by `archive_timeout` (60s) plus failover-detection time. Workloads that need tighter guarantees should consider a dedicated archiver (pgBackRest) — out of scope for Railway's PITR offering today.
+- The available restore window starts from the first post-enable base backup, not retroactively. If you enable PITR today, you can't restore to yesterday.
+- Restore creates a new sibling service. Cutting over to the restored database (renaming, swapping connection strings, decommissioning the original) is a manual step.
+- HA restore produces a single-node Postgres fork; convert to HA after restore if you want HA on the restored data.

From 73fc7983425666179d4baf0775b9104fb913797d Mon Sep 17 00:00:00 2001
From: Paulo Cabral Sanz
Date: Tue, 12 May 2026 00:40:23 -0300
Subject: [PATCH 4/6] =?UTF-8?q?docs(pitr):=20simplify=20HA=20enable=20?=
 =?UTF-8?q?=E2=80=94=20brief=20downtime,=20no=20rolling=20restart?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 .../docs/volumes/point-in-time-recovery.md | 25 +++----------------
 1 file changed, 4 insertions(+), 21 deletions(-)

diff --git a/content/docs/volumes/point-in-time-recovery.md b/content/docs/volumes/point-in-time-recovery.md
index 94a14e40c..a223f6f96 100644
--- a/content/docs/volumes/point-in-time-recovery.md
+++ b/content/docs/volumes/point-in-time-recovery.md
@@ -17,9 +17,7 @@ To restore, you pick a target timestamp. Railway provisions a brand-new Postgres
 
 ## Enabling PITR
 
-Open the **Backups** tab on your Postgres service (or on the cluster Backups page for Postgres HA). When PITR isn't yet configured, you'll see a **Point-in-time recovery is off** banner with an **Enable PITR** button.
-
-### Single-node Postgres
+Open the **Backups** tab on your Postgres service. When PITR isn't yet configured, you'll see a **Point-in-time recovery is off** banner with an **Enable PITR** button.
 
 Click Enable, confirm, and Railway:
 
@@ -29,20 +27,7 @@ Click Enable, confirm, and Railway:
 
 When the new container boots, the image detects the archive credentials, writes `archive_mode=on` / `archive_command='pgbackrest --stanza=main archive-push %p'` / `archive_timeout=60` into Postgres config, runs `pgbackrest stanza-create`, and starts pushing WAL on every commit. Once archiving is healthy, an in-container watcher takes the first pgBackRest base backup automatically — no manual snapshot step. After that, the **PITR datetime picker** appears on the Backups tab.
 
-One redeploy.
-
-### Postgres HA
-
-HA enable runs a rolling restart so the cluster never loses Patroni quorum:
-
-1. **Bucket + DCS** — Bucket is created and the cluster's Patroni DCS gains `archive_mode=on`, `archive_command='pgbackrest --stanza=main archive-push %p'`, and `archive_timeout=60`. No restarts at this step.
-2. **Roll replicas** — Each replica gets the `WAL_ARCHIVE_*` env vars one at a time. The variable change triggers a redeploy of just that node; Patroni absorbs the restart and the cluster stays available throughout. Railway waits for each replica to come back as a healthy streaming follower (`state=streaming`, `lag≈0`) before moving to the next.
-3. **Switchover** — Patroni promotes one of the now-configured replicas to leader, demoting the original primary. Brief (~5s) write-unavailability blip absorbed by HAProxy + client reconnect.
-4. **Roll ex-primary** — The demoted node gets the `WAL_ARCHIVE_*` env vars and redeploys cleanly as a replica.
-
-After this completes, every Postgres node carries the archive credentials, so whichever node holds Patroni leadership at any given moment can fire `archive_command`. Failovers preserve archiving with no config change. The leader takes the first base backup once archiving is healthy.
-
-The full HA enable typically takes 2–4 minutes. If it fails partway through, Railway runs a best-effort cleanup automatically: the DCS edit is reverted and any partial env vars are cleared, leaving the cluster in its pre-enable state.
+For Postgres HA clusters, all nodes are redeployed at once when enabling — expect brief downtime while the cluster restarts.
 
 ## Restoring to a point in time
 
@@ -75,11 +60,9 @@ The upper bound of the restore window is the **latest archived WAL**, not the cu
 
 ## Disabling PITR
 
-Click **Disable PITR** on the Backups tab.
-
-For **single-node**, Railway stages a single patch that removes the `WAL_ARCHIVE_*` env vars and deletes the Postgres-PITR bucket — the full inverse of enable. Nothing changes on the running service until you review the patch in the **Staged Changes** panel and click Deploy. If you want to keep the archived WAL around (e.g. to restore from it later before fully cleaning up), edit the patch to drop the bucket-deletion step before deploying.
+Click **Disable PITR** on the Backups tab. Railway stages a patch that removes the `WAL_ARCHIVE_*` env vars and deletes the Postgres-PITR bucket. Nothing changes on the running service until you review the patch in the **Staged Changes** panel and click Deploy.
 
-For **HA**, the rolling-restart pattern from enable runs in reverse: revert the cluster's DCS archive config, roll replicas (clearing their archive vars), switchover, roll the ex-primary. The Railway Bucket holding archived WAL is **not deleted** — you can still restore from it, or remove it manually from the Buckets page once you're sure you no longer need it.
+If you want to keep the archived WAL around (e.g. to restore from it later before fully cleaning up), edit the patch to drop the bucket-deletion step before deploying.
 ## Cost

From 2919d6192a3040d0b0c79457564a21fdd9761e44 Mon Sep 17 00:00:00 2001
From: Paulo Cabral Sanz
Date: Tue, 12 May 2026 00:41:23 -0300
Subject: [PATCH 5/6] docs(pitr): remove Coverage and warnings section

---
 content/docs/volumes/point-in-time-recovery.md | 12 +-----------
 1 file changed, 1 insertion(+), 11 deletions(-)

diff --git a/content/docs/volumes/point-in-time-recovery.md b/content/docs/volumes/point-in-time-recovery.md
index a223f6f96..3f3b24003 100644
--- a/content/docs/volumes/point-in-time-recovery.md
+++ b/content/docs/volumes/point-in-time-recovery.md
@@ -31,7 +31,7 @@ For Postgres HA clusters, all nodes are redeployed at once when enabling — exp

 ## Restoring to a point in time

-On the Backups tab, the PITR section shows the available restore range — bounded below by the oldest pgBackRest base backup and above by the latest WAL successfully archived to the bucket — and a datetime picker. Pick a moment, click **Restore to this moment**.
+On the Backups tab, the PITR section shows the available restore range and a datetime picker. Pick a moment, click **Restore to this moment**.

 Railway:

@@ -48,16 +48,6 @@ The restored fork runs as plain non-archiving Postgres. If you want continued PI

 For Postgres HA, restore also produces a single-node fork (not a cluster). To restore an HA cluster as HA, restore to a single-node fork, then [convert it to HA](/databases/postgresql-ha) once you're satisfied with the data.

-## Coverage and warnings
-
-PITR replay needs both a covering base backup and an unbroken WAL chain to your target. The picker reads pgBackRest's catalog directly to draw a coverage timeline:
-
-- **Green band** — restorable: covered by base backups with continuous WAL.
-- **Red diagonal stripes** — coverage gap. pgBackRest detected a missed cycle (e.g. archiving was paused, the bucket was unreachable for a long stretch, or a base backup failed). The picker won't let you target a time inside a gap, and the API rejects it as well.
-- **No backups yet** — PITR is enabled but the first base backup hasn't completed. The picker is disabled. Wait for the in-container watcher to take it (typically minutes after enable).
-
-The upper bound of the restore window is the **latest archived WAL**, not the current time. Idle databases produce WAL segments slowly, so on a quiet system the head of the window may sit a few minutes behind real time. Targets past the archive head are rejected — there's no covering WAL to replay onto, and Postgres recovery would abort mid-replay.
-
 ## Disabling PITR

 Click **Disable PITR** on the Backups tab. Railway stages a patch that removes the `WAL_ARCHIVE_*` env vars and deletes the Postgres-PITR bucket. Nothing changes on the running service until you review the patch in the **Staged Changes** panel and click Deploy.

From e14182684025814b38c25ec028d62bffba22220b Mon Sep 17 00:00:00 2001
From: Paulo Cabral Sanz
Date: Tue, 12 May 2026 00:42:03 -0300
Subject: [PATCH 6/6] docs(pitr): remove Pro-only restriction

---
 content/docs/volumes/point-in-time-recovery.md | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/content/docs/volumes/point-in-time-recovery.md b/content/docs/volumes/point-in-time-recovery.md
index 3f3b24003..b1c14d242 100644
--- a/content/docs/volumes/point-in-time-recovery.md
+++ b/content/docs/volumes/point-in-time-recovery.md
@@ -5,7 +5,7 @@ description: Recover a Railway Postgres service to any moment within the WAL ret

 Point-in-Time Recovery (PITR) lets you restore a Postgres service to any timestamp within the archive retention window — not just to the moment of the most recent backup. It's the right tool when something goes wrong between scheduled backups: an accidental `DROP TABLE`, a faulty migration, a runaway script.

-PITR works for both single-node Postgres and [Postgres HA](/databases/postgresql-ha) clusters, and it's available on the **Pro** plan.
+PITR works for both single-node Postgres and [Postgres HA](/databases/postgresql-ha) clusters.

 ## How it works

@@ -67,7 +67,6 @@ Restore egress is free.

 ## Limitations

-- Available on the **Pro** plan and above.
 - The available restore window starts from the first post-enable base backup, not retroactively. If you enable PITR today, you can't restore to yesterday.
 - Restore creates a new sibling service. Cutting over to the restored database (renaming, swapping connection strings, decommissioning the original) is a manual step.
 - HA restore produces a single-node Postgres fork; convert to HA after restore if you want HA on the restored data.
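
Reviewer's sketch, not part of the patch: the first limitation above says the restore window begins at the first post-enable base backup. The Python below illustrates that bound as a simple range check. The function name `is_restorable` and the parameters `first_base_backup` and `latest_archived_wal` are hypothetical, and the upper bound (the latest archived WAL) is an assumption taken from the wording that PATCH 5/6 removes. Nothing here reflects Railway's actual API.

```python
from datetime import datetime, timedelta, timezone


def is_restorable(target, first_base_backup, latest_archived_wal):
    """Return True if target lies inside the assumed restore window."""
    # Lower bound: nothing before the first post-enable base backup can be replayed.
    # Upper bound (assumed): there is no WAL past the archive head to replay onto.
    return first_base_backup <= target <= latest_archived_wal


# Hypothetical timestamps, for illustration only.
first_backup = datetime(2026, 5, 12, 9, 0, tzinfo=timezone.utc)
archive_head = first_backup + timedelta(days=2)

# "If you enable PITR today, you can't restore to yesterday."
print(is_restorable(first_backup - timedelta(days=1), first_backup, archive_head))  # False
print(is_restorable(first_backup + timedelta(hours=6), first_backup, archive_head))  # True
```

Using timezone-aware `datetime` values keeps the comparison unambiguous when the picker and the archive report times in different zones.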