-
Notifications
You must be signed in to change notification settings - Fork 104
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add live migration from postgres #2759
Conversation
Allow 10 minutes from last push for the staging site to build. If the link doesn't work, try using incognito mode instead. For internal reviewers, check web-documentation repo actions for staging build status. Link to build for this PR: http://docs-dev.timescale.com/docs-adn-logical-decoding |
--source "$SOURCE" \ | ||
--target "$TARGET" \ | ||
--fail-fast \ | ||
--plugin wal2json |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Question for my own interest): Is wal2json available in all Postgres versions? What is the minimum Postgres version that we need for live migration?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just did a quick read and it's an external extension, I think it's installed with the postgres-dev package. From the pgcopydb's docs I get that we can use test_decoding
and that one I think comes bundle with postgres and it's the default for pgcopydb.
@arajkumar is there a reason why we are using wal2json and not the other? WDYT
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a bug in pgcopydb while using with test_decoding which prevents migration of a table which has reserved Postgres keywords as column name.(e.g. time). I fixed it in upstream, but it is not yet released to the public.
We have to stick to wal2json until the new pgcopydb release is made available.
wal2json is usually distributed with most of the postgres distributions. Most of the cloud providers(AWS, GCP, Azure) has enabled wal2json in their managed offerings.
is optional. It can even be performed after all the data has been migrated, | ||
since hypertables can be created from tables with existing data. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would be cautious with the statement. It is correct, but it misses the fact that if you convert a plain table to a hypertable after-the-fact, the table can't accept writes through the duration of the conversion process, which is basically the equivalent to having application downtime.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that you have a bit more detailed description below. I would rework this section to say something along the lines of "while it is optional to set up hypertables before moving data, if you do it later it's going to be a pain in the ass and you're going to regret it".
--source "$SOURCE" \ | ||
--target "$TARGET" \ | ||
--snapshot $(cat /tmp/pgcopydb/snapshot) \ | ||
--split-tables-larger-than <same-table concurrency size threshold> \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How should this be configured?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@arajkumar can you check up on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
--split-tables-larger-than
Allow Same-table Concurrency when processing the source database. This environment variable value is expected to be a byte size, and bytes units B, kB, MB, GB, TB, PB, and EB are known.
Reference: https://pgcopydb.readthedocs.io/en/latest/ref/pgcopydb_clone.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should set this to a default like 10GB and add a note about it. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And table-jobs to 8?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, that sounds good
|
||
Before proceeding to the next step, you need to wait until the replication lag | ||
is minimized as much as possible. For instance, you can move forward when the | ||
`replay_lag` is close to just a few seconds. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if the replay lag never gets close to just a few seconds?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd say you either run out of space, or remain on a state of constant lag, or you have to stop ingest to burn down the lag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll move this part to the next section so we can tackle all the lag nuances there.
database. | ||
|
||
Since no new data is being ingested, `pgcopydb` can be instructed to stop the | ||
replication process. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that we need to be cautious here. If the replication lag doesn't go down to a few seconds, then the user must wait for it to go down here. We don't mention this, and it could lead to data loss, which we don't want.
7be1e36
to
5ad3056
Compare
16f0f0b
to
60dc15c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Getting there! I think the main feedback is the missing context/parent page which explains what Live migration actually is.
EOF | ||
``` | ||
|
||
- `replay_lag`: Time elapsed between flushing recent WAL locally and receiving |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the unit? Seconds? Milliseconds?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The type is interval so it's based on the intervalstyle variable, by default it's pretty printed and depending on how big it's there's more or less information:
https://www.postgresql.org/docs/16/datatype-datetime.html#DATATYPE-INTERVAL-OUTPUT
adn@/tmp:adn> select interval '1 second'
╒══════════╕
│ interval │
╞══════════╡
│ 0:00:01 │
╘══════════╛
SELECT 1
Time: 0.004s
adn@/tmp:adn> select interval '1 month'
╒══════════════════╕
│ interval │
╞══════════════════╡
│ 30 days, 0:00:00 │
╘══════════════════╛
- `replay_lag`: Time elapsed between flushing recent WAL locally and receiving | ||
notification that this standby server has written, flushed and applied it. | ||
|
||
## 7. Promote target database as new primary |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general I think that we're missing giving the user enough context about how the whole migration process works. IMO that would be the function of the parent page (Low-downtime: Live migration), which is currently absent.
Aside from the missing context, I think this section has all of the information that it needs, but I think that it needs to be reworked a little for clarity. I think that it would only make sense to do so once we've got the content for the context. Maybe some things will become clearer then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, we have to update /migrate
(https://docs.timescale.com/migrate/latest/) parent page with "Live migration" availability alongside dump. restore and dual writes and backfill.
1. Migrate roles and schema from source to target. | ||
1. Convert hypertables and enable Timescale features. | ||
1. Migrate initial data from source to target. | ||
1. Apply replication changes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe change this to "Apply change data capture (CDC) on target"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We never mention CDC or what does it mean. If you want to put it there as a title we'd have to explain what it is. Maybe we can change it to Apply stream of changes from source
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"stream of changes", SGTM!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, if possible adding a note about REPLICATION IDENTITY would be great!
7cc2867
to
3ebf575
Compare
@@ -35,8 +35,9 @@ migrating from PostgreSQL. | |||
|
|||
If you're using PostgreSQL, you may also have heard of logical replication | |||
being the recommended strategy for migrations with low downtime. Currently, | |||
TimescaleDB doesn't work with logical replication, so this is not a viable | |||
option, but we are actively working on making this possible. | |||
TimescaleDB supports logical replication only from sources that are using |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't true but this PR should address this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
f88cb76
to
97a1c88
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've addressed most of my own comments myself in my most recent commit.
should be actively monitored to prevent any issues that might result due to | ||
lack of space. | ||
|
||
Before starting, there's an important caveat. To replicate `DELETE` and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the REPLICA IDENTITY
instructions be part of "Prepare the source database"?
lack of space. | ||
|
||
Before starting, there's an important caveat. To replicate `DELETE` and | ||
`UPDATE` operations, the source table must either have a primary key or |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A possible extension: would be nice to have a query which returns user tables which don't have a PK, or REPLICA IDENTITY set, so that the user can easily audit their tables.
Co-authored-by: James Guthrie <[email protected]> Signed-off-by: alejandrodnm <[email protected]>
cc4571d
to
323f5c9
Compare
323f5c9
to
f31f2e4
Compare
Description
[Short summary of why you created this PR]
Links
Fixes #[insert issue link, if any]
Writing help
For information about style and word usage, see the style guide
Review checklists
Reviewers: use this section to ensure you have checked everything before approving this PR:
Subject matter expert (SME) review checklist
Documentation team review checklist
and have they been implemented?