Commit 6d05889

Bigint Migration for 'events' Table (Step 1)

- reusable module VCAP::BigintMigration
- implementation of step 1 (the events table's primary key is not used as a foreign key, thus additions will be required when reusing this for other tables)
- check database type (PostgreSQL only)
- check opt-out flag
- check if table is empty
- change primary key to bigint directly, if table is empty
- add bigint column and create trigger + function otherwise
- reusable shared_context for tests
- ADR adapted (PostgreSQL only)
- Rake task db:bigint_backfill to manually trigger a backfill (optional)

1 parent 7b84ec1 · commit 6d05889

11 files changed: +458 -37 lines

.rubocop_cc.yml (+2 -2)

@@ -136,8 +136,8 @@ Rails/DangerousColumnNames: # Disabled, in comparison to active_record we need t
 Rails/SkipsModelValidations: # We don`t want any model at all in migrations and migration specs
   Enabled: true
   Exclude:
-    - db/migrations/*
-    - spec/migrations/*
+    - db/migrations/**/*
+    - spec/migrations/**/*

 #### ENABLED SECTION
 Gemspec/DeprecatedAttributeAssignment:
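Broadening the globs to `**/*` also excludes files in subdirectories from these cops — presumably needed for the new shared context, which lives in a `helpers/` subdirectory under the migration specs.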
db/migrations/20250327142351_bigint_migration_events_step1.rb (new file, +22 lines)

```ruby
require 'database/bigint_migration'

Sequel.migration do
  up do
    if database_type == :postgres && !VCAP::BigintMigration.opt_out?
      if VCAP::BigintMigration.table_empty?(self, :events)
        VCAP::BigintMigration.change_pk_to_bigint(self, :events)
      else
        VCAP::BigintMigration.add_bigint_column(self, :events)
        VCAP::BigintMigration.create_trigger_function(self, :events)
      end
    end
  end

  down do
    if database_type == :postgres
      VCAP::BigintMigration.revert_pk_to_integer(self, :events)
      VCAP::BigintMigration.drop_trigger_function(self, :events)
      VCAP::BigintMigration.drop_bigint_column(self, :events)
    end
  end
end
```
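For a non-empty table, the `up` block takes the second branch and prepares the dual-write setup. A minimal SQL sketch of the resulting objects, assuming PostgreSQL and the naming conventions from `VCAP::BigintMigration` below (the actual statements are generated by Sequel, and the BEFORE INSERT timing is Sequel's default):

```sql
-- Sketch of the DDL produced for the non-empty events table (assumed names).
ALTER TABLE events ADD COLUMN IF NOT EXISTS id_bigint BIGINT;

-- Keep id_bigint in sync for every newly inserted row.
CREATE FUNCTION events_set_id_bigint_on_insert() RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
  NEW.id_bigint := NEW.id;
  RETURN NEW;
END;
$$;

CREATE TRIGGER trigger_events_set_id_bigint_on_insert
BEFORE INSERT ON events
FOR EACH ROW
EXECUTE FUNCTION events_set_id_bigint_on_insert();
```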

decisions/0013-migrating-int-to-bigint-for-primary-keys.md (+35 -35)

@@ -1,40 +1,50 @@
 # 13: Migrating `int` to `bigint` for `id` Primary Keys
-
-Date: 2025-02-04
+Date: 2025-04-04

 ## Status
-
 Draft :construction:

 ## Context
-
 The primary key `id` columns in all database tables use the integer type, which has a maximum value of 2,147,483,647.
 As foundations grow over time, the `id` values in some of these tables (e.g., events) are approaching this limit.
 If the limit is reached, the cloud controller will be unable to insert new records, leading to critical failures in the CF API.
 E.g.:
 ```
 PG::SequenceGeneratorLimitExceeded: ERROR: nextval: reached maximum value of sequence "events_id_seq"
 ```
-The goal is to migrate these primary key `id` columns from `int` to `bigint` without causing downtime and to ensure compatibility across PostgreSQL and MySQL databases.
+The goal is to migrate these primary key `id` columns from `int` to `bigint` without causing downtime.
 This migration must:
 - Avoid downtime since the CF API is actively used in production.
 - Handle tables with millions of records efficiently.
 - Provide a safe rollback mechanism in case of issues during the migration.
 - Be reusable for other tables in the future.
 - Ensure that the migration is only executed when the new `id_bigint` column is fully populated.

 The largest tables in a long-running foundation are `events`, `delayed_jobs`, `jobs`, and `app_usage_events`.

 ## Decisions
+### PostgreSQL Only
+We will implement the migration exclusively for PostgreSQL databases.
+The reasons for this decision are:
+- **Organizational Usage**: Our organization exclusively uses PostgreSQL, not MySQL. This allows us to test the migration with copies of large production databases and perform a step-wise rollout from test environments to production.
+- **Support and Contribution**: Focusing on PostgreSQL enables us to identify and address any issues during the migration process. We can contribute solutions back to the migration procedure, benefiting the broader community.
+- **Deployment Environments**: We operate PostgreSQL instances across various hyperscalers. Successful migration in our environments increases confidence that it will work for others using PostgreSQL.
+- **Limited MySQL Exposure**: We lack access to production environments using MySQL and have limited expertise with it. Testing would be confined to community-owned test foundations, which do not reflect real-world production data. Additionally, the community uses a limited set of MySQL variants, reducing our ability to detect and resolve issues during a production rollout.
+- **Community Feedback**: Feedback from other organizations operating Cloud Foundry on MySQL indicates they would opt out of this migration, as their foundations are smaller and unlikely to encounter the issues this migration addresses.
+
+While this approach results in somewhat inconsistent schemas between PostgreSQL and MySQL — specifically regarding the data types of primary and foreign keys — the application layer does not depend on these specifics.
+Therefore, no additional application logic is required to handle these differences.
+
+By concentrating our efforts on PostgreSQL, we can ensure a robust and thoroughly tested migration process, leveraging our expertise and infrastructure to maintain the stability and scalability of the Cloud Controller database.

 ### Opt-Out Mechanism
 Operators of smaller foundations, which are unlikely to ever encounter the integer overflow issue, may wish to avoid the risks and complexity associated with this migration.
-They can opt out of the migration by setting the `skip_bigint_id_migration` flag in the CAPI-Release manifest.
+They can opt out of the migration by setting a flag in the CAPI-Release manifest.
 When this flag is set, all migration steps will result in a no-op but will still be marked as applied in the `schema_versions` table.
-*Important*: Removing the flag later will *not* re-trigger the migration. Operators must handle the migration manually if they choose to opt out.

-### Scope
+*Important*: Removing the flag later will *not* re-trigger the migration. Operators must handle the migration manually if they choose to opt out.

+### Scope
 The `events` table will be migrated first as it has the most significant growth in `id` values.
 Other tables will be migrated at a later stage.

@@ -44,6 +54,7 @@ This will be implemented with migration step 1 and will be only applied, if the

 ### Phased Migration
 The migration will be conducted in multiple steps to ensure minimal risk.
+
 #### Step 1 - Preparation
 - If the opt-out flag is set, this step will be a no-op.
 - In case the target table is empty, the type of the `id` column will be set to `bigint` directly.

@@ -57,19 +68,21 @@ The migration will be conducted in multiple steps to ensure minimal risk.
 - If the `id_bigint` column does not exist, backfill will be skipped or result in a no-op.
 - Use a batch-processing script (e.g. a delayed job) to populate `id_bigint` for existing rows in both the primary table and, if applicable, all foreign key references.
 - Table locks will be avoided by using a batch processing approach.
 - In case the table has a configurable cleanup duration, the backfill job will only process records which are beyond the cleanup duration to reduce the number of records to be processed.
 - Backfill will be executed outside the migration due to its potentially long runtime.
 - If necessary, the backfill will run for multiple weeks to ensure all records are processed.

 #### Step 3 - Migration
 - The migration is divided into two parts, a pre-check and the actual migration, but both will be stored in a single migration script.
 - This step will be a no-op if the opt-out flag is set or the `id` column is already of type `bigint`.
 - All SQL statements will be executed in a single transaction to ensure consistency.
+
 ##### Step 3a - Migration Pre Check
 - In case the `id_bigint` column does not exist, the migration will fail with a clear error message.
 - Add a `CHECK` constraint to verify that `id_bigint` is fully populated (`id_bigint == id & id_bigint != NULL`).
 - In case the backfill is not yet complete or the `id_bigint` column is not fully populated, the migration will fail.
 - If the pre-check fails, operators might need to take manual actions to ensure all preconditions are met, as the migration will be retried during the next deployment.
+
 ##### Step 3b - Actual Migration
 - Remove the `CHECK` constraint once verified.
 - Drop the primary key constraint on `id`.

@@ -87,11 +100,6 @@ The default value of the `id` column could be either a sequence (for PostgreSQL
 This depends on the version of PostgreSQL which was used when the table was initially created.
 The migration script needs to handle both cases.

-#### MySQL
-MySQL primary key changes typically cause table rebuilds due to clustered indexing, which can be expensive and disruptive, especially with clustered replication setups like Galera.
-A common approach to mitigate this involves creating a new shadow table, performing a backfill, and then swapping tables atomically.
-Further details will be refined during implementation.
-
 ### Rollback Mechanism
 The old `id` column is no longer retained, as the `CHECK` constraint ensures correctness during migration.
 Step 3b (switch over) will be executed in a single transaction and will be rolled back if any issues occur.

@@ -103,22 +111,21 @@ Write reusable scripts for adding `id_bigint`, setting up triggers, backfilling
 These scripts can be reused for other tables in the future.

 ### Release Strategy
-
 Steps 1-2 will be released as a cf-deployment major release to ensure that the database is prepared for the migration.
 Steps 3-4 will be released as a subsequent cf-deployment major release to complete the migration.
 Between these releases there should be a reasonable time to allow the backfill to complete.

-For the `events` table there is a default cleanup interval of 31 days. Therefore, for the `events` table the gap between the releases should be around 60 days.
+For the `events` table there is a default cleanup interval of 31 days.
+Therefore, for the `events` table the gap between the releases should be around 60 days.

 ## Consequences
-### Positive Consequences

+### Positive Consequences
 - Future-proofing the schema for tables with high record counts.
 - Minimal locking during step 3b (actual migration) could result in slower queries or minimal downtime.
 - A standardized process for similar migrations across the database.

 ### Negative Consequences
-
 - Increased complexity in the migration process.
 - Potentially long runtimes for backfilling data in case tables have millions of records.
 - Requires careful coordination across multiple CAPI/CF-Deployment versions.

@@ -127,31 +134,26 @@ For the `events` table there is a default cleanup interval of 31 days. Therefore
 ## Alternatives Considered

 ### Switching to `guid` Field as Primary Key
-
 Pros: Provides globally unique identifiers and eliminates the risk of overflow.

 Cons: Might decrease query and index performance, requires significant changes for foreign key constraints, and introduces non-sequential keys.

 Reason Rejected: The overhead and complexity outweighed the benefits for our use case.

 ### Implementing Rollover for `id` Reuse
-
 Pros: Delays the overflow issue by reusing IDs from deleted rows. Minimal schema changes.

 Cons: Potential issues with foreign key constraints and increased complexity in the rollover process. Could be problematic for tables which do not have frequent deletions.

 Reason Rejected: Might work well for tables like events, but not a universal solution for all tables where there is no guarantee of frequent deletions.

-
 ### Direct Migration of `id` to `bigint` via `ALTER TABLE` Statement
-
 Pros: One-step migration process.

 Cons: Requires downtime, locks the table for the duration of the migration, and can be slow for tables with millions of records.

 Reason Rejected: Downtimes are unacceptable for productive foundations.

-
 ## Example Migration Scripts With PostgreSQL Syntax For `events` Table

 ### Step 1 - Preparation

@@ -178,7 +180,6 @@ CREATE TRIGGER trigger_events_set_id_bigint
 EXECUTE FUNCTION events_set_id_bigint_on_insert();

 COMMIT;
-
 ```

 ### Step 2 - Backfill

@@ -196,10 +197,9 @@ FROM batch
 WHERE events.id = batch.id;
 ```

-
 ### Step 3a - Migration Pre Check
 ```sql
 ALTER TABLE events ADD CONSTRAINT check_id_bigint_matches CHECK (id_bigint IS NOT NULL AND id_bigint = id);

 -- Alternative:
 SELECT COUNT(*) FROM events WHERE id_bigint IS DISTINCT FROM id;

@@ -229,7 +229,7 @@ ALTER TABLE events RENAME COLUMN id_bigint TO id;
 ALTER TABLE events ADD PRIMARY KEY (id);

 -- Set `id` as IDENTITY with correct starting value
 DO $$
 DECLARE max_id BIGINT;
 BEGIN
 SELECT COALESCE(MAX(id), 1) + 1 INTO max_id FROM events;

@@ -256,10 +256,10 @@ SELECT COUNT(*) FROM events WHERE id_bigint IS DISTINCT FROM id;
 ```

 ## References
 WIP - e.g.:
 Migration scripts in the repository.
 Backfill script documentation.
 Trigger functions for PostgreSQL and MySQL.

 ### Helpful Links
 - [Stack Overflow: Migrating int to bigint](https://stackoverflow.com/questions/33504982/postgresql-concurrently-change-column-type-from-int-to-bigint)

lib/cloud_controller/config_schemas/base/api_schema.rb (+1)

@@ -99,6 +99,7 @@ class ApiSchema < VCAP::Config
   optional(:max_migration_statement_runtime_in_seconds) => Integer,
   optional(:migration_psql_concurrent_statement_timeout_in_seconds) => Integer,
   optional(:migration_psql_worker_memory_kb) => Integer,
+  optional(:skip_bigint_id_migration) => bool,
   db: {
     optional(:database) => Hash, # db connection hash for sequel
     max_connections: Integer, # max connections in the connection pool

lib/cloud_controller/config_schemas/base/migrate_schema.rb (+1)

@@ -10,6 +10,7 @@ class MigrateSchema < VCAP::Config
   optional(:max_migration_statement_runtime_in_seconds) => Integer,
   optional(:migration_psql_concurrent_statement_timeout_in_seconds) => Integer,
   optional(:migration_psql_worker_memory_kb) => Integer,
+  optional(:skip_bigint_id_migration) => bool,

   db: {
     optional(:database) => Hash, # db connection hash for sequel
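Declaring the flag in both schemas means it can be read by the API process as well as by the migration job. Operators opt out by setting `skip_bigint_id_migration: true` in the Cloud Controller config rendered from the CAPI-Release manifest; when the key is absent, `VCAP::BigintMigration.opt_out?` (below) falls back to `false`.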

lib/database/bigint_migration.rb (new file, +80 lines)

```ruby
module VCAP::BigintMigration
  class << self
    def opt_out?
      opt_out = VCAP::CloudController::Config.config&.get(:skip_bigint_id_migration)
      opt_out.nil? ? false : opt_out
    rescue VCAP::CloudController::Config::InvalidConfigPath
      false
    end

    def table_empty?(db, table)
      db[table].empty?
    end

    def change_pk_to_bigint(db, table)
      db.set_column_type(table, :id, :Bignum) if column_type(db, table, :id) != 'bigint'
    end

    def revert_pk_to_integer(db, table)
      db.set_column_type(table, :id, :integer) if column_type(db, table, :id) == 'bigint'
    end

    def add_bigint_column(db, table)
      db.add_column(table, :id_bigint, :Bignum, if_not_exists: true)
    end

    def drop_bigint_column(db, table)
      db.drop_column(table, :id_bigint, if_exists: true)
    end

    def create_trigger_function(db, table)
      drop_trigger_function(db, table)

      function = <<~FUNC
        BEGIN
          NEW.id_bigint := NEW.id;
          RETURN NEW;
        END;
      FUNC
      db.create_function(function_name(table), function, language: :plpgsql, returns: :trigger)
      db.create_trigger(table, trigger_name(table), function_name(table), each_row: true, events: :insert)
    end

    def drop_trigger_function(db, table)
      db.drop_trigger(table, trigger_name(table), if_exists: true)
      db.drop_function(function_name(table), if_exists: true)
    end

    def backfill(db, table, batch_size: 10_000, iterations: -1)
      raise "table '#{table}' does not contain column 'id_bigint'" unless column_exists?(db, table, :id_bigint)

      loop do
        updated_rows = db.
                       from(table, :batch).
                       with(:batch, db[table].select(:id).where(id_bigint: nil).order(:id).limit(batch_size).for_update.skip_locked).
                       where(Sequel.qualify(table, :id) => :batch__id).
                       update(id_bigint: :batch__id)
        iterations -= 1 if iterations > 0
        break if updated_rows < batch_size || iterations == 0
      end
    end

    private

    def column_type(db, table, column)
      db.schema(table).find { |col, _| col == column }&.dig(1, :db_type)
    end

    def function_name(table)
      :"#{table}_set_id_bigint_on_insert"
    end

    def trigger_name(table)
      :"trigger_#{function_name(table)}"
    end

    def column_exists?(db, table, column)
      db[table].columns.include?(column)
    end
  end
end
```
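Each iteration of `backfill` updates one batch and stops once a batch comes back smaller than `batch_size` (or after the given number of iterations). For the `events` table with the default batch size, one iteration corresponds roughly to the following SQL — a sketch; the actual statement is built by Sequel from the dataset above:

```sql
-- One backfill batch: pick up to 10,000 rows that still lack id_bigint,
-- locking them and skipping rows currently locked by other transactions.
WITH batch AS (
  SELECT id
  FROM events
  WHERE id_bigint IS NULL
  ORDER BY id
  LIMIT 10000
  FOR UPDATE SKIP LOCKED
)
UPDATE events
SET id_bigint = batch.id
FROM batch
WHERE events.id = batch.id;
```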

lib/tasks/db.rake (+13)

@@ -177,6 +177,19 @@ namespace :db do
     end
   end

+  desc 'Backfill ...'
+  task :bigint_backfill, %i[table batch_size iterations] => :environment do |_t, args|
+    args.with_defaults(batch_size: 10_000, iterations: -1)
+    RakeConfig.context = :migrate
+
+    require 'database/bigint_migration'
+    logging_output
+    db_logger = Steno.logger('cc.db.bigint_backfill')
+    RakeConfig.config.load_db_encryption_key
+    db = VCAP::CloudController::DB.connect(RakeConfig.config.get(:db), db_logger)
+    VCAP::BigintMigration.backfill(db, args.table.to_sym, batch_size: args.batch_size, iterations: args.iterations)
+  end
+
   namespace :dev do
     desc 'Migrate the database set in spec/support/bootstrap/db_config'
     task migrate: :environment do
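The task takes the table name plus optional batch size and iteration count; an invocation might look like `bundle exec rake 'db:bigint_backfill[events]'` (argument quoting depends on the shell). With the default `iterations: -1` it keeps processing batches until every row has `id_bigint` set, while a positive iteration count caps the number of batches per run.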
spec/migrations/… (new file, +16 lines)

```ruby
require 'spec_helper'
require 'migrations/helpers/bigint_migration_step1_shared_context'

RSpec.describe 'bigint migration - events table - step1', isolation: :truncation, type: :migration do
  include_context 'bigint migration step1' do
    let(:migration_filename) { '20250327142351_bigint_migration_events_step1.rb' }
    let(:table) { :events }
    let(:insert) do
      lambda do |db|
        db[:events].insert(guid: SecureRandom.uuid, timestamp: Time.now.utc, type: 'type',
                           actor: 'actor', actor_type: 'actor_type',
                           actee: 'actee', actee_type: 'actee_type')
      end
    end
  end
end
```
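The shared context is parameterized via `migration_filename`, `table`, and an `insert` lambda, so step-1 specs for other tables can later reuse the same test logic by supplying just these three inputs.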
