
Store startup fixes #5926

Merged: lutter merged 13 commits into master from lutter/sharded on Apr 7, 2025
Conversation

lutter (Collaborator) commented on Apr 3, 2025

When multiple nodes start up at the same time, they could race each other and cause database errors.

I've reworked the startup code to be (a) race-proof (I hope) and (b) easier to follow. Any node that wants to run database setup now takes a lock on the primary and runs all code needed for setup while holding that lock, so nodes can't interfere with each other. In the common case, where there are no database changes to make (whether from migrations, configuration changes, or code changes that map different tables), any node holds the lock only very briefly. This could be optimized further, but let's first see how this performs in practice.
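For illustration only, here is a minimal sketch of that pattern using a Postgres session-level advisory lock. This is not the PR's actual `with_migration_lock`; the helper name, lock key, and connection URL are assumptions made for the example.

```rust
// Hedged sketch only: not the PR's actual `with_migration_lock`, just the
// general "one lock on the primary guards all of setup" idea, using a
// Postgres session-level advisory lock via the `postgres` crate.
use postgres::{Client, Error, NoTls};

// Arbitrary application-chosen advisory lock key (assumption).
const MIGRATION_LOCK_KEY: i64 = 0x6d69_6772;

fn with_migration_lock<T>(
    primary: &mut Client,
    setup: impl FnOnce(&mut Client) -> Result<T, Error>,
) -> Result<T, Error> {
    // Blocks until no other node holds the lock, so only one node at a
    // time gets to run setup.
    primary.execute("SELECT pg_advisory_lock($1)", &[&MIGRATION_LOCK_KEY])?;
    let res = setup(&mut *primary);
    // Release the lock even if setup failed so other nodes are not stuck.
    primary.execute("SELECT pg_advisory_unlock($1)", &[&MIGRATION_LOCK_KEY])?;
    res
}

fn main() -> Result<(), Error> {
    let mut primary = Client::connect("postgres://graph@localhost/primary", NoTls)?;
    with_migration_lock(&mut primary, |conn| {
        // Run migrations and other setup here; in the common case there is
        // nothing to do and the lock is held only very briefly.
        conn.batch_execute("SELECT 1")
    })
}
```

One nice property of session-level advisory locks is that they are released automatically when the connection ends, so a node that dies while holding the lock won't block the others indefinitely.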

@lutter lutter requested a review from incrypto32 April 3, 2025 17:25
@lutter lutter force-pushed the lutter/sharded branch 12 times, most recently from 6bc5d84 to fc0ae44 on April 6, 2025 18:24
Comment on lines +1654 to +1661
// Everything here happens under the migration lock. Anything called
// from here should not try to get that lock, otherwise the process
// will deadlock
debug!(self.logger, "Waiting for migration lock");
let res = with_migration_lock(&mut pconn, |_| async {
debug!(self.logger, "Migration lock acquired");

// While we were waiting for the migration lock, another thread
Member
NICE!

lutter added 13 commits April 7, 2025 09:22
It should be up to the operator whether they use it or not, and when they want to reset it.
The current database setup code was inherently racy when several nodes were starting up, as it relied on piecemeal locking of individual steps. This change completely revamps the strategy we use: setup now takes a lock on the primary, so that only one node at a time will run the setup code.
Before, PoolState was just an enum and code all over the place dealt with
its interior mutability. Now, we encapsulate that to simplify code using
the PoolState
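As an aside, a minimal sketch of that encapsulation pattern (hypothetical state names, not graph-node's actual PoolState):

```rust
// Hedged sketch of the general pattern, not graph-node's actual PoolState:
// hide the Mutex and the state transitions behind methods so callers never
// deal with the interior mutability themselves.
use std::sync::{Arc, Mutex};

// Hypothetical states, for illustration only.
enum State {
    Created,
    Ready,
}

#[derive(Clone)]
pub struct PoolState(Arc<Mutex<State>>);

impl PoolState {
    pub fn new() -> Self {
        PoolState(Arc::new(Mutex::new(State::Created)))
    }

    // Callers ask questions or request transitions; they never lock the
    // mutex directly.
    pub fn is_ready(&self) -> bool {
        matches!(*self.0.lock().unwrap(), State::Ready)
    }

    pub fn mark_ready(&self) {
        *self.0.lock().unwrap() = State::Ready;
    }
}

fn main() {
    let state = PoolState::new();
    assert!(!state.is_ready());
    state.mark_ready();
    assert!(state.is_ready());
}
```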
Instead of dealing with disabled shards (shards that have a pool size of 0 configured), filter those shards out on startup and warn about them.

The end effect is that for that configuration, users will get an error of 'unknown shard' rather than 'shard disabled'. Since configuring a shard to have no connections is somewhat pathological, and leads to an error when it is used either way, the code simplification is worth the slightly less helpful error message.

Removing the 'disabled' state from pools has ripple effects in quite a few other places, simplifying them a bit.
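A rough sketch of that filtering step (the `ShardConfig` type and the plain `eprintln!` warning are stand-ins, not the PR's actual code):

```rust
// Hedged sketch, not the PR's code: drop shards configured with a pool size
// of 0 during startup and warn about them, so later lookups simply fail
// with an unknown shard.
use std::collections::BTreeMap;

struct ShardConfig {
    pool_size: u32,
}

fn enabled_shards(shards: BTreeMap<String, ShardConfig>) -> BTreeMap<String, ShardConfig> {
    shards
        .into_iter()
        .filter(|(name, config)| {
            if config.pool_size == 0 {
                // A real implementation would use the logger; this is just a sketch.
                eprintln!("ignoring shard {name}: it is configured with a pool size of 0");
                false
            } else {
                true
            }
        })
        .collect()
}

fn main() {
    let mut shards = BTreeMap::new();
    shards.insert("primary".to_string(), ShardConfig { pool_size: 10 });
    shards.insert("unused".to_string(), ShardConfig { pool_size: 0 });

    let shards = enabled_shards(shards);
    assert!(shards.contains_key("primary"));
    assert!(!shards.contains_key("unused"));
}
```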
With the previous code, we would run setup initially when creating all
pools, but they would not be marked as set up. On the first access to the
pool we would try to run setup again, which is not needed. This change
makes it so that we remember that we ran setup successfully when pools are
created
@lutter lutter merged commit c23ee96 into master Apr 7, 2025
6 checks passed
@lutter lutter deleted the lutter/sharded branch April 7, 2025 16:39