- reusable module VCAP::BigintMigration
- implementation of step 1 (the events' primary key is not used as
foreign key, thus additions will be required when reusing this for
other tables)
- check database type (PostgreSQL only)
- check opt-out flag
- check if table is empty
- change primary key to bigint directly, if table is empty
- add bigint column and create trigger + function otherwise
- reusable shared_context for tests
- ADR adapted (PostgreSQL only)
- Rake task db:bigint_backfill to manually trigger a backfill (optional)
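
The step 1 checks listed above can be read as a small decision function. The following is only a sketch under stated assumptions: the method name `step1_action`, its keyword arguments, and the returned symbols are hypothetical and are not the actual interface of `VCAP::BigintMigration`.

```ruby
# Hypothetical helper (not the real VCAP::BigintMigration API):
# maps the three step 1 checks onto the action to perform.
def step1_action(database_type:, opt_out:, table_empty:)
  # The migration is implemented for PostgreSQL only.
  return :skip_not_postgres unless database_type == :postgres
  # Operators may opt out; the step is then a no-op.
  return :skip_opted_out if opt_out
  # An empty table can have its primary key changed directly.
  return :change_pk_to_bigint if table_empty

  # Otherwise prepare the table for a later backfill.
  :add_bigint_column_and_trigger
end

puts step1_action(database_type: :postgres, opt_out: false, table_empty: true)
puts step1_action(database_type: :postgres, opt_out: false, table_empty: false)
```

Ordering the checks as early returns keeps the no-op cases (wrong database type, opt-out flag) ahead of any schema change.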
decisions/0013-migrating-int-to-bigint-for-primary-keys.md
+35-35
@@ -1,40 +1,50 @@
1
1
# 13: Migrating `int` to `bigint` for `id` Primary Keys
2
-
3
-
Date: 2025-02-04
2
+
Date: 2025-04-04
4
3
5
4
## Status
6
-
7
5
Draft :construction:
8
6
9
7
## Context
10
-
11
8
The primary key `id` columns in all database tables use the integer type, which has a maximum value of 2,147,483,647.
12
9
As foundations grow over time, the `id` values in some of these tables (e.g., events) are approaching this limit.
13
-
If the limit is reached, the cloud controller will be unable to insert new records, leading to critical failures in the CF API.
10
+
If the limit is reached, the cloud controller will be unable to insert new records, leading to critical failures in the CF API.
14
11
E.g.:
15
12
```
16
13
PG::SequenceGeneratorLimitExceeded: ERROR: nextval: reached maximum value of sequence "events_id_seq"
17
14
```
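
To make the limit concrete, the sketch below computes the `int` and `bigint` maxima and estimates the remaining headroom. The helper names and the sample numbers (current `id`, inserts per day) are hypothetical and chosen for illustration only.

```ruby
# Maximum values of PostgreSQL's 4-byte integer and 8-byte bigint types.
INT_MAX    = 2**31 - 1 # 2_147_483_647
BIGINT_MAX = 2**63 - 1 # 9_223_372_036_854_775_807

# Number of ids still available before the sequence hits the limit.
def remaining_ids(last_id, max: INT_MAX)
  max - last_id
end

# Rough estimate (integer division) of the days left at a constant insert rate.
def days_until_overflow(last_id, ids_per_day, max: INT_MAX)
  raise ArgumentError, 'ids_per_day must be positive' unless ids_per_day.positive?

  remaining_ids(last_id, max: max) / ids_per_day
end

# Hypothetical foundation: last id is 2 billion, one million new events per day.
puts remaining_ids(2_000_000_000)                  # 147483647
puts days_until_overflow(2_000_000_000, 1_000_000) # 147
```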
18
-
The goal is to migrate these primary key `id` columns from `int` to `bigint` without causing downtime and to ensure compatibility across PostgreSQL and MySQL databases.
15
+
The goal is to migrate these primary key `id` columns from `int` to `bigint` without causing downtime.
19
16
This migration must:
20
17
- Avoid downtime since the CF API is actively used in production.
21
-
- Handle tables with millions of records efficiently.
18
+
- Handle tables with millions of records efficiently.
22
19
- Provide a safe rollback mechanism in case of issues during the migration.
23
20
- Be reusable for other tables in the future.
24
21
- Ensure that the migration is executed only when the new `id_bigint` column is fully populated.
25
22
26
23
The largest tables in a long-running foundation are `events`, `delayed_jobs`, `jobs`, and `app_usage_events`.
27
24
28
25
## Decisions
26
+
### PostgreSQL Only
27
+
We will implement the migration exclusively for PostgreSQL databases.
28
+
The reasons for this decision are:
29
+
- **Organizational Usage**: Our organization exclusively uses PostgreSQL, not MySQL. This allows us to test the migration with copies of large production databases and perform a step-wise rollout from test environments to production.
30
+
- **Support and Contribution**: Focusing on PostgreSQL enables us to identify and address any issues during the migration process. We can contribute solutions back to the migration procedure, benefiting the broader community.
31
+
- **Deployment Environments**: We operate PostgreSQL instances across various hyperscalers. Successful migration in our environments increases confidence that it will work for others using PostgreSQL.
32
+
- **Limited MySQL Exposure**: We lack access to production environments using MySQL and have limited expertise with it. Testing would be confined to community-owned test foundations, which do not reflect real-world production data. Additionally, the community uses a limited set of MySQL variants, reducing our ability to detect and resolve issues during a production rollout.
33
+
- **Community Feedback**: Feedback from other organizations operating Cloud Foundry on MySQL indicates they would opt out of this migration, as their foundations are smaller and unlikely to encounter the issues this migration addresses.
34
+
35
+
While this approach results in somewhat inconsistent schemas between PostgreSQL and MySQL — specifically regarding the data types of primary and foreign keys — the application layer does not depend on these specifics.
36
+
Therefore, no additional application logic is required to handle these differences.
37
+
38
+
By concentrating our efforts on PostgreSQL, we can ensure a robust and thoroughly tested migration process, leveraging our expertise and infrastructure to maintain the stability and scalability of the Cloud Controller database.
29
39
30
40
### Opt-Out Mechanism
31
41
Operators of smaller foundations, which are unlikely to ever encounter the integer overflow issue, may wish to avoid the risks and complexity associated with this migration.
32
-
They can optout of the migration by setting the `skip_bigint_id_migration` flag in the CAPI-Release manifest.
42
+
They can opt out of the migration by setting a flag in the CAPI-Release manifest.
33
43
When this flag is set, all migration steps will result in a no-op but will still be marked as applied in the `schema_versions` table.
34
-
*Important*: Removing the flag later will *not* re-trigger the migration. Operators must handle the migration manually if they choose to opt out.
35
44
36
-
### Scope
45
+
*Important*: Removing the flag later will *not* re-trigger the migration. Operators must handle the migration manually if they choose to opt out.
37
46
47
+
### Scope
38
48
The `events` table will be migrated first as it has the most significant growth in `id` values.
39
49
Other tables will be migrated at a later stage.
40
50
@@ -44,6 +54,7 @@ This will be implemented with migration step 1 and will be only applied, if the
44
54
45
55
### Phased Migration
46
56
The migration will be conducted in multiple steps to ensure minimal risk.
57
+
47
58
#### Step 1 - Preparation
48
59
- If the opt-out flag is set, this step will be a no-op.
49
60
- In case the target table is empty, the type of the `id` column will be set to `bigint` directly.
@@ -57,19 +68,21 @@ The migration will be conducted in multiple steps to ensure minimal risk.
57
68
- If the `id_bigint` column does not exist, backfill will be skipped or result in a no-op.
58
69
- Use a batch-processing script (e.g. a delayed job) to populate `id_bigint` for existing rows in both the primary table and, if applicable, all foreign key references.
59
70
- Table locks will be avoided by using a batch processing approach.
60
-
- In case the table has a configurable cleanup duration, the backfill job will only process records which are beyond the cleanup duration to reduce the number of records to be processed.
71
+
- In case the table has a configurable cleanup duration, the backfill job will only process records which are beyond the cleanup duration to reduce the number of records to be processed.
61
72
- Backfill will be executed outside the migration due to its potentially long runtime.
62
73
- If necessary the backfill will run for multiple weeks to ensure all records are processed.
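
A minimal sketch of the batch-processing idea described above, assuming ids are backfilled in fixed-size ranges so that each `UPDATE` touches only a bounded number of rows. The helper names, the batch size, and the exact SQL text are illustrative assumptions, not the statements used by the actual backfill job.

```ruby
# Splits the id range [min_id, max_id] into consecutive batches of at most
# batch_size ids each.
def batch_ranges(min_id, max_id, batch_size)
  raise ArgumentError, 'batch_size must be positive' unless batch_size.positive?
  return [] if max_id < min_id

  (min_id..max_id).step(batch_size).map do |start|
    start..[start + batch_size - 1, max_id].min
  end
end

# Builds one backfill statement per batch (assumed SQL; rows that are
# already populated are skipped by the IS NULL filter).
def backfill_sql(table, range)
  "UPDATE #{table} SET id_bigint = id " \
    "WHERE id BETWEEN #{range.begin} AND #{range.end} AND id_bigint IS NULL"
end

batch_ranges(1, 10, 4).each { |range| puts backfill_sql('events', range) }
```

Short per-batch statements hold row locks only briefly, which is what allows the backfill to run alongside normal API traffic.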
63
74
64
75
#### Step 3 - Migration
65
76
- The migration is divided into two parts, a pre-check and the actual migration, but both will be stored in a single migration script.
66
77
- This step will be a no-op if the opt-out flag is set or the `id` column is already of type `bigint`.
67
78
- All SQL statements will be executed in a single transaction to ensure consistency.
79
+
68
80
##### Step 3a - Migration Pre Check
69
81
- In case the `id_bigint` column does not exist the migration will fail with a clear error message.
70
82
- Add a `CHECK` constraint to verify that `id_bigint` is fully populated (`id_bigint IS NOT NULL AND id_bigint = id`).
71
83
- In case the backfill is not yet complete or the `id_bigint` column is not fully populated the migration will fail.
72
84
- If pre-check fails, operators might need to take manual actions to ensure all preconditions are met as the migration will be retried during the next deployment.
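
The pre-check can be sketched as a generated `ALTER TABLE ... ADD CONSTRAINT ... CHECK` statement. In PostgreSQL, adding a `CHECK` constraint validates all existing rows and fails if any row violates it. The helper name and the constraint naming scheme below are hypothetical. The explicit `IS NOT NULL` matters because a `CHECK` expression that evaluates to NULL is treated as satisfied, so `id_bigint = id` alone would not catch unpopulated rows.

```ruby
# Hypothetical helper: builds the pre-check statement for a table.
# The default constraint name is an illustrative assumption.
def precheck_constraint_sql(table, constraint_name: "check_#{table}_id_bigint_matches_id")
  "ALTER TABLE #{table} ADD CONSTRAINT #{constraint_name} " \
    'CHECK (id_bigint IS NOT NULL AND id_bigint = id)'
end

puts precheck_constraint_sql('events')
```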
85
+
73
86
##### Step 3b - Actual Migration
74
87
- Remove the `CHECK` constraint once verified.
75
88
- Drop the primary key constraint on `id`.
@@ -87,11 +100,6 @@ The default value of the `id` column could be either a sequence (for PostgreSQL
87
100
This depends on the version of PostgreSQL which was used when the table was initially created.
88
101
The migration script needs to handle both cases.
89
102
90
-
#### MySQL
91
-
MySQL primary key changes typically cause table rebuilds due to clustered indexing, which can be expensive and disruptive, especially with clustered replication setups like Galera.
92
-
A common approach to mitigate this involves creating a new shadow table, performing a backfill, and then swapping tables atomically.
93
-
Further details will be refined during implementation.
94
-
95
103
### Rollback Mechanism
96
104
The old `id` column is no longer retained, as the `CHECK` constraint ensures correctness during migration.
97
105
Step 3b (switch over) will be executed in a single transaction and will be rolled back if any issues occur.
@@ -103,22 +111,21 @@ Write reusable scripts for adding `id_bigint`, setting up triggers, backfilling
103
111
These scripts can be reused for other tables in the future.
104
112
105
113
### Release Strategy
106
-
107
-
Steps 1-2 will be released as a cf-deployment major release to ensure that the database is prepared for the migration.
108
-
Steps 3-4 will be released as a subsequent cf-deployment major release to complete the migration.
114
+
Steps 1-2 will be released as a cf-deployment major release to ensure that the database is prepared for the migration.
115
+
Steps 3-4 will be released as a subsequent cf-deployment major release to complete the migration.
109
116
Between these releases there should be a reasonable time to allow the backfill to complete.
110
117
111
-
For the `events` table there is a default cleanup interval of 31 days. Therefore, for the `events` table the gap between the releases should be around 60 days.
118
+
For the `events` table there is a default cleanup interval of 31 days.
119
+
Therefore, for the `events` table the gap between the releases should be around 60 days.
112
120
113
121
## Consequences
114
-
### Positive Consequences
115
122
123
+
### Positive Consequences
116
124
- Future-proofing the schema for tables with high record counts.
117
125
- Locking is limited to step 3b (actual migration), where it may briefly slow queries or cause minimal downtime.
118
126
- A standardized process for similar migrations across the database.
119
127
120
128
### Negative Consequences
121
-
122
129
- Increased complexity in the migration process.
123
130
- Potentially long runtimes for backfilling data in case tables have millions of records.
124
131
- Requires careful coordination across multiple CAPI/CF-Deployment versions.
@@ -127,31 +134,26 @@ For the `events` table there is a default cleanup interval of 31 days. Therefore
127
134
## Alternatives Considered
128
135
129
136
### Switching to `guid` Field as Primary Key
130
-
131
137
Pros: Provides globally unique identifiers and eliminates the risk of overflow.
132
138
133
139
Cons: Might decrease query and index performance, requires significant changes for foreign key constraints, and introduces non-sequential keys.
134
140
135
141
Reason Rejected: The overhead and complexity outweighed the benefits for our use case.
136
142
137
143
### Implementing Rollover for `id` Reuse
138
-
139
144
Pros: Delays the overflow issue by reusing IDs from deleted rows. Minimal schema changes.
140
145
141
146
Cons: Potential issues with foreign key constraints and increased complexity in the rollover process. Could be problematic for tables which do not have frequent deletions.
142
147
143
148
Reason Rejected: Might work well for tables like events, but not a universal solution for all tables where there is no guarantee of frequent deletions.
144
149
145
-
146
150
### Direct Migration of `id` to `bigint` via `ALTER TABLE` Statement
147
-
148
151
Pros: One-step migration process.
149
152
150
153
Cons: Requires downtime, locks the table for the duration of the migration, and can be slow for tables with millions of records.
151
154
152
155
Reason Rejected: Downtimes are unacceptable for productive foundations.
153
156
154
-
155
157
## Example Migration Scripts With PostgreSQL Syntax For `events` Table