
Store startup fixes #5926

Merged: lutter merged 13 commits into master from lutter/sharded on Apr 7, 2025
Conversation

lutter (Collaborator) commented on Apr 3, 2025

When multiple nodes start up at the same time, they could race each other and cause database errors.

I've reworked the startup code to be (a) race-proof (I hope) and (b) easier to follow. Any node that wants to run database setup now takes a lock on the primary and runs all code needed for setup while holding that lock, so nodes can't interfere with each other. In the common case, where there are no database changes to make (whether from migrations, configuration changes, or code changes that map different tables), any node holds the lock only very briefly. This could be optimized further, but let's first see how this performs in practice.
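For illustration only, here is a minimal sketch of that pattern using a Postgres session-level advisory lock. This is not the PR's actual `with_migration_lock`; the helper name, lock key, and connection URL are assumptions made for the example.

```rust
// Hedged sketch only: not the PR's actual `with_migration_lock`, just the
// general "one lock on the primary guards all of setup" idea, using a
// Postgres session-level advisory lock via the `postgres` crate.
use postgres::{Client, Error, NoTls};

// Arbitrary application-chosen advisory lock key (assumption).
const MIGRATION_LOCK_KEY: i64 = 0x6d69_6772;

fn with_migration_lock<T>(
    primary: &mut Client,
    setup: impl FnOnce(&mut Client) -> Result<T, Error>,
) -> Result<T, Error> {
    // Blocks until no other node holds the lock, so only one node at a
    // time gets to run setup.
    primary.execute("SELECT pg_advisory_lock($1)", &[&MIGRATION_LOCK_KEY])?;
    let res = setup(&mut *primary);
    // Release the lock even if setup failed so other nodes are not stuck.
    primary.execute("SELECT pg_advisory_unlock($1)", &[&MIGRATION_LOCK_KEY])?;
    res
}

fn main() -> Result<(), Error> {
    let mut primary = Client::connect("postgres://graph@localhost/primary", NoTls)?;
    with_migration_lock(&mut primary, |conn| {
        // Run migrations and other setup here; in the common case there is
        // nothing to do and the lock is held only very briefly.
        conn.batch_execute("SELECT 1")
    })
}
```

One nice property of session-level advisory locks is that they are released automatically when the connection ends, so a node that dies while holding the lock won't block the others indefinitely.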

@lutter lutter requested a review from incrypto32 April 3, 2025 17:25
@lutter lutter force-pushed the lutter/sharded branch 12 times, most recently from 6bc5d84 to fc0ae44 on April 6, 2025 18:24
Comment on lines +1654 to +1661
// Everything here happens under the migration lock. Anything called
// from here should not try to get that lock, otherwise the process
// will deadlock
debug!(self.logger, "Waiting for migration lock");
let res = with_migration_lock(&mut pconn, |_| async {
debug!(self.logger, "Migration lock acquired");

// While we were waiting for the migration lock, another thread
Member
NICE!

lutter added 13 commits April 7, 2025 09:22
It should be up to the operator whether they use it or not, and when they want to reset it.
The current database setup code was inherently racy when several nodes were starting up, as it relied on piecemeal locking of individual steps. This change completely revamps the strategy we use: setup now takes a lock on the primary, so that only one node at a time will run the setup code.
Before, PoolState was just an enum and code all over the place dealt with
its interior mutability. Now, we encapsulate that to simplify code using
the PoolState
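As an aside, a minimal sketch of that encapsulation pattern (hypothetical state names, not graph-node's actual PoolState):

```rust
// Hedged sketch of the general pattern, not graph-node's actual PoolState:
// hide the Mutex and the state transitions behind methods so callers never
// deal with the interior mutability themselves.
use std::sync::{Arc, Mutex};

// Hypothetical states, for illustration only.
enum State {
    Created,
    Ready,
}

#[derive(Clone)]
pub struct PoolState(Arc<Mutex<State>>);

impl PoolState {
    pub fn new() -> Self {
        PoolState(Arc::new(Mutex::new(State::Created)))
    }

    // Callers ask questions or request transitions; they never lock the
    // mutex directly.
    pub fn is_ready(&self) -> bool {
        matches!(*self.0.lock().unwrap(), State::Ready)
    }

    pub fn mark_ready(&self) {
        *self.0.lock().unwrap() = State::Ready;
    }
}

fn main() {
    let state = PoolState::new();
    assert!(!state.is_ready());
    state.mark_ready();
    assert!(state.is_ready());
}
```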
Instead of dealing with disabled shards (shards that have a pool size of 0 configured), filter those shards out on startup and warn about them.

The end effect is that for that configuration, users will get an error of 'unknown shard' rather than 'shard disabled'. Since configuring a shard to have no connections is somewhat pathological, and leads to an error when it is used either way, the code simplification is worth the slightly less helpful error message.

Removing the 'disabled' state from pools has ripple effects in quite a few other places, simplifying them a bit.
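A rough sketch of that filtering step (the `ShardConfig` type and the plain `eprintln!` warning are stand-ins, not the PR's actual code):

```rust
// Hedged sketch, not the PR's code: drop shards configured with a pool size
// of 0 during startup and warn about them, so later lookups simply fail
// with an unknown shard.
use std::collections::BTreeMap;

struct ShardConfig {
    pool_size: u32,
}

fn enabled_shards(shards: BTreeMap<String, ShardConfig>) -> BTreeMap<String, ShardConfig> {
    shards
        .into_iter()
        .filter(|(name, config)| {
            if config.pool_size == 0 {
                // A real implementation would use the logger; this is just a sketch.
                eprintln!("ignoring shard {name}: it is configured with a pool size of 0");
                false
            } else {
                true
            }
        })
        .collect()
}

fn main() {
    let mut shards = BTreeMap::new();
    shards.insert("primary".to_string(), ShardConfig { pool_size: 10 });
    shards.insert("unused".to_string(), ShardConfig { pool_size: 0 });

    let shards = enabled_shards(shards);
    assert!(shards.contains_key("primary"));
    assert!(!shards.contains_key("unused"));
}
```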
With the previous code, we would run setup initially when creating all
pools, but they would not be marked as set up. On the first access to the
pool we would try to run setup again, which is not needed. This change
makes it so that we remember that we ran setup successfully when pools are
created
@lutter lutter merged commit c23ee96 into master Apr 7, 2025
6 checks passed
@lutter lutter deleted the lutter/sharded branch April 7, 2025 16:39