[Postgres + MongoDB] Resumable replication #163

rkistner · 2024-12-11T16:53:59Z

Problem Statement

When updating sync rules, we perform initial replication from the source database with the new sync rules. This involves doing an initial snapshot, then streaming incremental changes.

For large source databases, initial replication can take a long time. Historically we did not implement resuming for the initial snapshot queries, only for the initial replication stream. So if the replication process or container crashed or restarted during the snapshot, we had to restart the snapshot from scratch.

We have already implemented partial mitigations for Postgres source databases:

[Postgres] Resumeable initial replication #150 added support for skipping tables already replicated.
[Postgres] Improve timeouts and table snapshots #152 improved table snapshots to be faster when retrying - avoiding writing the same rows again.
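
As a rough illustration of the first mitigation, the sketch below picks the tables that still need a snapshot after a restart. The `TableSnapshotState` shape and the `tablesToSnapshot` helper are hypothetical names used only for this example, not the storage API in this repository.

```typescript
// Hypothetical per-table snapshot state; the field names are illustrative only.
interface TableSnapshotState {
  table: string;
  snapshotDone: boolean;
}

// Returns the tables that still need a snapshot after a restart.
// Tables whose snapshot already completed are skipped.
function tablesToSnapshot(states: TableSnapshotState[]): string[] {
  return states.filter((s) => !s.snapshotDone).map((s) => s.table);
}

const pending = tablesToSnapshot([
  { table: "users", snapshotDone: true },
  { table: "orders", snapshotDone: false },
  { table: "items", snapshotDone: false }
]);
// "users" is not snapshotted again; "orders" and "items" are.
```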

The fix

This implements the same mitigations for MongoDB source databases, and additionally attempts to chunk initial snapshot queries using the primary key, allowing us to fully resume initial replication where we left off. This is basically repeated queries of the form `SELECT * FROM <table> WHERE id > :lastId ORDER BY id LIMIT 10000`, instead of just using a single `SELECT * FROM <table>` query.
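
A minimal sketch of the chunking loop, using an in-memory array in place of a real table so that it runs on its own. `fetchChunk`, `SnapshotProgress`, `snapshotTable` and the `maxChunks` parameter are hypothetical names for this example; the real implementation issues asynchronous database queries and persists progress in storage. The stand-in query orders by `id`, which is assumed here because resuming from `lastId` only works with a stable ordering.

```typescript
interface Row {
  id: number;
  value: string;
}

// Stand-in for: SELECT * FROM <table> WHERE id > :lastId ORDER BY id LIMIT :limit
function fetchChunk(table: Row[], lastId: number | null, limit: number): Row[] {
  return table
    .filter((r) => lastId === null || r.id > lastId)
    .sort((a, b) => a.id - b.id)
    .slice(0, limit);
}

// Hypothetical progress record, persisted after every chunk.
interface SnapshotProgress {
  lastId: number | null;
  done: boolean;
}

// Copies rows chunk by chunk, starting from the stored progress.
// maxChunks only exists to simulate an interruption in this example.
function snapshotTable(
  table: Row[],
  progress: SnapshotProgress,
  limit: number,
  onChunk: (rows: Row[]) => void,
  maxChunks: number = Infinity
): SnapshotProgress {
  let current: SnapshotProgress = { lastId: progress.lastId, done: progress.done };
  let chunks = 0;
  while (!current.done && chunks < maxChunks) {
    const rows = fetchChunk(table, current.lastId, limit);
    if (rows.length === 0) {
      // No rows beyond lastId: the table snapshot is complete.
      current = { lastId: current.lastId, done: true };
      break;
    }
    onChunk(rows);
    current = { lastId: rows[rows.length - 1].id, done: false };
    chunks++;
  }
  return current;
}

const source: Row[] = [5, 3, 1, 4, 2].map((id) => ({ id, value: `row ${id}` }));
const copied: number[] = [];
const collect = (rows: Row[]) => {
  for (const r of rows) copied.push(r.id);
};

// First run is interrupted after one chunk; the second run resumes from lastId.
const interrupted = snapshotTable(source, { lastId: null, done: false }, 2, collect, 1);
const finished = snapshotTable(source, interrupted, 2, collect);
```

The second call starts from the stored `lastId`, so rows copied before the interruption are not read again.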

For Postgres, this is limited to tables with a single primary key column of specific types (text/varchar/uuid/int2/int4/int8). We can add more supported types over time, and potentially support compound primary keys, if we do proper testing. However, these should already cover a large percentage of replicated tables. Tables not covered by this would restart replication when interrupted.
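
The eligibility rule can be sketched as a small predicate. The type list is taken from the paragraph above; `PrimaryKeyColumn` and `canChunkByPrimaryKey` are hypothetical names, and the actual check in the code may look at type OIDs rather than type names.

```typescript
// Column types listed above as supported for chunked snapshots.
const CHUNKABLE_TYPES = new Set(["text", "varchar", "uuid", "int2", "int4", "int8"]);

// Hypothetical description of one primary key column.
interface PrimaryKeyColumn {
  name: string;
  type: string;
}

// True only for a single-column primary key of a supported type.
// Anything else falls back to a non-resumable single snapshot query.
function canChunkByPrimaryKey(primaryKey: PrimaryKeyColumn[]): boolean {
  return primaryKey.length === 1 && CHUNKABLE_TYPES.has(primaryKey[0].type);
}

const single = canChunkByPrimaryKey([{ name: "id", type: "uuid" }]);
const compound = canChunkByPrimaryKey([
  { name: "tenant_id", type: "int4" },
  { name: "id", type: "int4" }
]);
const unsupported = canChunkByPrimaryKey([{ name: "id", type: "numeric" }]);
```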

For MongoDB, we support any collection.
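
For MongoDB the chunk key is `_id`. The sketch below builds the filter, sort and limit for the next chunk as plain objects that could be passed to a `find()` call. `ChunkQuery` and `nextChunkQuery` are hypothetical names. Note that this simple `$gt` form assumes all `_id` values in the collection share one BSON type, since comparison query operators only match values of the same type; how this PR handles mixed `_id` types is not shown here.

```typescript
// Plain description of one chunk query; not a driver type.
interface ChunkQuery {
  filter: Record<string, unknown>;
  sort: Record<string, 1 | -1>;
  limit: number;
}

// Builds the query for the next chunk. lastId is undefined on the first chunk.
function nextChunkQuery(limit: number, lastId?: unknown): ChunkQuery {
  return {
    filter: lastId === undefined ? {} : { _id: { $gt: lastId } },
    sort: { _id: 1 },
    limit
  };
}

const firstChunk = nextChunkQuery(10000);
const laterChunk = nextChunkQuery(10000, "doc-0042");
```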

Edge cases

Since the snapshot query is no longer performed at a single point in time, there are some edge cases we need to consider when rows are added, updated or removed while snapshotting. Most of these are handled with the proper consistency purely by resuming streaming replication after the snapshot. However, Postgres has a particular case we need to be careful with:

Replicate snapshot for chunk A.
Update a row, moving it from chunk B to chunk A.
Replicate snapshot for chunk B.

This means the row is not covered by the snapshot query. It is still covered by streaming replication, but the change event may exclude TOAST values.

To handle this edge case, we detect rows with missing TOAST columns, and re-replicate those.
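
A rough sketch of the detection step, assuming the decoder leaves a column as `undefined` when the change event omitted an unchanged TOAST value, while SQL NULL is decoded as `null`. That representation, and the names `DecodedRow`, `missingToastColumns` and `rowsToReReplicate`, are assumptions for this example only.

```typescript
type DecodedRow = Record<string, unknown>;

// Assumption: `undefined` marks a TOAST value that was not sent with the
// change event; `null` is a real SQL NULL and counts as present.
function missingToastColumns(row: DecodedRow, columns: string[]): string[] {
  return columns.filter((c) => row[c] === undefined);
}

// Rows that need to be fetched again from the source to get full values.
function rowsToReReplicate(rows: DecodedRow[], columns: string[]): DecodedRow[] {
  return rows.filter((r) => missingToastColumns(r, columns).length > 0);
}

const changed: DecodedRow[] = [
  { id: 1, body: "short text" },
  { id: 2, body: undefined }, // large value omitted from the change event
  { id: 3, body: null } // genuine NULL, nothing is missing
];
const incomplete = rowsToReReplicate(changed, ["id", "body"]);
```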

Limitations

Some notable limitations currently:

Resumable replication is not implemented for MySQL yet.
~~Progress inside a table is not implemented for Postgres storage yet.~~
~~Resuming MongoDB snapshots requires storing the snapshot LSN, which isn't implemented for Postgres storage yet.~~

Other changes

Logging

This improves replication logging a bit, by:

Pre-calculating table sizes, and logging this when we start replicating.
Using a consistent log prefix as part of the logger instance when replicating.
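
A small sketch of both points, written against a plain sink function rather than the logging library used in this repository. `PrefixedLogger`, `logSnapshotStart`, the prefix format and the message wording are all hypothetical.

```typescript
type LogSink = (line: string) => void;

// Minimal logger that carries its prefix, so call sites cannot forget it.
class PrefixedLogger {
  readonly prefix: string;
  readonly sink: LogSink;

  constructor(prefix: string, sink: LogSink) {
    this.prefix = prefix;
    this.sink = sink;
  }

  info(message: string): void {
    this.sink(`${this.prefix} ${message}`);
  }
}

// Logs the pre-calculated size of each table before its snapshot starts.
function logSnapshotStart(logger: PrefixedLogger, estimatedRows: Record<string, number>): void {
  for (const table of Object.keys(estimatedRows)) {
    logger.info(`Replicating ${table}: ~${estimatedRows[table]} rows estimated`);
  }
}

const lines: string[] = [];
const logger = new PrefixedLogger("[sync_rules_1]", (line) => {
  lines.push(line);
});
logSnapshotStart(logger, { users: 1200, orders: 450000 });
```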

In the future we can use this to expose snapshot progress via the diagnostics API, but for now the logs can be used to follow snapshot progress.

MongoDB snapshot resume token

Previously, we stored the clusterTime when starting a MongoDB snapshot, then resumed streaming from that point afterwards. Now, this is replaced by an actual resume token. This allows for better detection of replication issues after the initial snapshot, such as the oplog window being too small, or switching source databases.
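
As a sketch of how the stored token might be used when switching from snapshot to streaming: `resumeAfter` is an existing change stream option in the MongoDB driver, while `SnapshotMetadata` and `streamingOptions` are hypothetical names. Treating a missing token as "restart the snapshot" is an assumption about the intended behavior.

```typescript
// Hypothetical snapshot metadata as loaded from storage.
interface SnapshotMetadata {
  resumeToken?: unknown;
}

// Builds change stream options for streaming after the snapshot.
// Without a stored token there is no safe point to continue from.
function streamingOptions(meta: SnapshotMetadata): { resumeAfter: unknown } {
  if (meta.resumeToken === undefined) {
    throw new Error("No snapshot resume token stored; the snapshot must be restarted");
  }
  return { resumeAfter: meta.resumeToken };
}

// The token value below is a made-up placeholder, not a real resume token.
const options = streamingOptions({ resumeToken: { _data: "placeholder-token" } });
// options could then be passed to a watch() call on the source database.
```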

Migrations

For Postgres storage, this has a new migration for the additional snapshot metadata. No migration is needed for MongoDB storage.

changeset-bot · 2024-12-11T16:54:03Z

⚠️ No Changeset found

Latest commit: 5070e76

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

rkistner · 2025-06-04T11:02:09Z

packages/service-core/src/storage/bson.ts

+  useBigInt64: true,
+  // We cannot use promoteBuffers: true, since that also converts UUID to Buffer
+  // Instead, we need to handle bson.Binary when reading data
+  promoteBuffers: false


This does not actually change anything - just a note to not attempt to change that in the future.

Copilot

Pull Request Overview

This PR extends initial snapshot replication to be resumable for both Postgres (chunking by primary key) and MongoDB (chunking by _id), improves replication logging, and adds necessary storage schema migrations and tests for resume tokens and chunked snapshots.

Added custom logger prefixes and more consistent error messages in MySQL and MongoDB replication jobs.
Introduced ChunkedSnapshotQuery and storage metadata (snapshot_status / snapshot_lsn) for resumable snapshots.
Refactored tests to use a unified describeWithStorage helper, and added end-to-end slow tests for resuming and chunked snapshot behavior.

Reviewed Changes

Copilot reviewed 57 out of 57 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| modules/module-mongodb/test/src/util.ts | `describeWithStorage` helper introduced (calls undefined factory) |
| modules/module-mongodb/test/src/change_stream_utils.ts | `getBucketData` now loops chunks but drops `limit` support |
| modules/module-mongodb/src/replication/ChangeStreamReplicationJob.ts | Typo in comment |
| modules/module-mongodb/test/src/resume_token.test.ts | Invalid assertion usage (`.true` instead of `.toBe(true)`) |

Comments suppressed due to low confidence (4)

modules/module-mongodb/test/src/util.ts:28

The helper calls INITIALIZED_MONGO_STORAGE_FACTORY for Mongo storage, but that identifier isn't defined in this file. Import or define the Mongo storage factory (e.g. INITIALIZED_MONGO_STORAGE_FACTORY) or rename the variable to the correct factory.

fn(INITIALIZED_MONGO_STORAGE_FACTORY);

modules/module-mongodb/test/src/change_stream_utils.ts:161

getBucketData no longer passes through the options.limit or options.chunkLimitBytes parameters when calling getBucketDataBatch. Consider forwarding those options or documenting why they're intentionally omitted.

const batch = this.storage!.getBucketDataBatch(checkpoint, map);

modules/module-mongodb/src/replication/ChangeStreamReplicationJob.ts:18

Typo in comment: replace us with use.

// We us a custom formatter to process the prefix

modules/module-mongodb/test/src/resume_token.test.ts:48

Invalid Vitest assertion: use .toBe(true) or .toBeTruthy() instead of .true.

).true;

Base automatically changed from fix-slot-recovery to main December 12, 2024 08:51

rkistner force-pushed the resumable-replication-2 branch from fd8ade8 to 3cb3248 Compare December 12, 2024 12:01

rkistner added 3 commits May 26, 2025 16:52

Refactor pgwire types.

fc1d92c

Add SnapshotQuery support for postgres again.

76e4e2d

Add test again.

951c951

rkistner force-pushed the resumable-replication-2 branch from f1ff6db to 951c951 Compare May 26, 2025 15:21

rkistner added 16 commits May 29, 2025 14:10

Merge remote-tracking branch 'origin/main' into resumable-replication-2

95b2095

WIP: Record replication progress for Postgres.

0d248fb

Avoid promoteBuffers.

2aad7e7

MongoDB: Skip tables already replicated.

c1fcf74

Refactor MongoDB snapshot queries.

3372eec

Chunk MongoDB queries by _id.

66188ab

Fix query typo.

3928604

Get a resume token instead of clusterTime for initial snapshot.

dc02713

Replication logging improvements.

a369964

Define a log prefix on child loggers.

0460455

Make sure the change stream is always closed.

959ad11

tryNext(), not hasNext().

eaac127

Separate storage for snapshot LSN; fix snapshot resumeToken.

e084aef

Improve test.

b2bde45

Merge remote-tracking branch 'origin/main' into resumable-replication-2

1117a13

Merge remote-tracking branch 'origin/main' into resumable-replication-2

fc8fe2e

rkistner changed the title ~~[WIP] [Postgres] Resumable replication per-table~~ [WIP] [Postgres + MongoDB] Resumable replication Jun 3, 2025

rkistner added 6 commits June 3, 2025 14:38

Implement re-replication for postgres storage.

b7efb76

Some test cleanup.

321c1c2

Keepalive after table snapshot to fix schema tests.

a4958a8

Test cleanup.

8a1ef29

Add some mongodb replication tests.

5dd028b

Improve postgres replication abort logic and progress reporting.

a05ceea

rkistner added 6 commits June 4, 2025 09:46

Split out tests.

35d7c31

Add tests for resuming mongodb snapshots.

db64a37

Clear mongo storage if we're resuming replication without an LSN.

8b5d03b

Correctly persist snapshot progress in postgres storage.

ab2fd45

Fix down migrations on first run.

bb223d5

Fix migration tests.

3ce8e0e

rkistner commented Jun 4, 2025

rkistner requested a review from Copilot June 4, 2025 12:16

Copilot AI reviewed Jun 4, 2025

rkistner added 2 commits June 4, 2025 15:03

Address copilot comments.

5070e76

Add changeset.

ff396c2

rkistner changed the title ~~[WIP] [Postgres + MongoDB] Resumable replication~~ [Postgres + MongoDB] Resumable replication Jun 4, 2025

rkistner marked this pull request as ready for review June 4, 2025 13:51

rkistner requested a review from stevensJourney June 4, 2025 13:51
