Reconnection backoff delay for workers and schedulers #107
base: main
Conversation
src/scheduler.rs
Outdated
```rust
    shutdown_token: CancellationToken,

    // Policy for reconnection backoff when PostgreSQL connection is lost.
    reconnection_policy: RetryPolicy,
```
This is a nit, but I think it's a bit confusing that we have `RetryPolicy`, connection policy, and task policy. Perhaps there's an opportunity for better naming; if `RetryPolicy` really is applicable to both otherwise distinct types, reusing it might make better sense.
Thanks for the feedback! I agree that having `RetryPolicy` used for both task retries and connection reconnection could be confusing.

A few options I'm considering:
- Rename `RetryPolicy` to something more specific like `TaskRetryPolicy` and create a separate `ConnectionRetryPolicy` or `ReconnectionPolicy`
- Create a more generic `BackoffPolicy` that both can use, with clearer naming at the usage sites
- Keep `RetryPolicy` for tasks and introduce a `ReconnectionBackoff` type for connection retries

What's your preference? Or do you have any better suggestions?
I think I'd lean towards the second option if that reduces duplication, but I don't have a strong preference.
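For illustration only, a minimal sketch of what the second option above (a shared, generically named backoff type) might look like; the `BackoffPolicy` name, its fields, and the aliases are taken from the suggestion in this thread and are not part of the crate:

```rust
/// Hypothetical shared backoff type that both task retries and connection
/// reconnection could use, per the second option above.
#[derive(Clone, Copy, Debug)]
pub struct BackoffPolicy {
    pub initial_interval_ms: u64,
    pub max_interval_ms: u64,
    pub backoff_coefficient: f64,
    pub jitter_factor: f64,
}

// Clearer naming at the usage sites, e.g. via aliases (or newtypes).
pub type TaskRetryPolicy = BackoffPolicy;
pub type ReconnectionPolicy = BackoffPolicy;
```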
```rust
    let mut retry_count: RetryCount = 1;

    // Outer loop: handle reconnection logic
    'reconnect: loop {
```
Kind of a shame to have significant duplication.
migrations/20241105164503_2.sql
Outdated
```diff
     max_interval_ms int,
-    backoff_coefficient float
+    backoff_coefficient float,
+    jitter_factor float
```
Sorry for the delay on this. I think we probably want to incorporate these changes in a separate migration, right?
If we don't, then anyone using Underway today wouldn't have a clean upgrade path. (Maybe that's reasonable for breaking releases, but it would be nice if we could avoid that.)
Summary
Extend `task::RetryPolicy` with a configurable jitter factor and use it for reconnection backoff. Fixes #91
RetryPolicy & backoff behavior
`RetryPolicy` is extended with a `jitter_factor` field and a `jitter_factor(...)` builder method. `calculate_delay(retry_count)` now computes an exponential delay based on `initial_interval_ms`, `max_interval_ms`, and `backoff_coefficient`, then applies jitter by sampling between `target * (1.0 - jitter_factor)` and `target`. With the default values (`initial_interval_ms = 1_000`, `max_interval_ms = 60_000`, `backoff_coefficient = 2.0`, `jitter_factor = 0.5`), retries back off exponentially while being de-synchronized across processes.
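For illustration, a minimal sketch of what a delay calculation along these lines could look like; the field and method names follow the summary above, but the body is an assumption rather than the PR's actual implementation:

```rust
use rand::Rng;
use std::time::Duration;

/// Hypothetical stand-in for the extended policy described above.
pub struct RetryPolicy {
    pub initial_interval_ms: u64,
    pub max_interval_ms: u64,
    pub backoff_coefficient: f64,
    pub jitter_factor: f64,
}

impl RetryPolicy {
    /// Exponential backoff capped at `max_interval_ms`, with jitter sampled
    /// between `target * (1.0 - jitter_factor)` and `target`.
    pub fn calculate_delay(&self, retry_count: u32) -> Duration {
        let exp = self
            .backoff_coefficient
            .powi(retry_count.saturating_sub(1) as i32);
        let target = (self.initial_interval_ms as f64 * exp)
            .min(self.max_interval_ms as f64);
        let low = target * (1.0 - self.jitter_factor);
        let jittered = rand::thread_rng().gen_range(low..=target);
        Duration::from_millis(jittered as u64)
    }
}
```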
Worker & scheduler reconnection
The worker and scheduler main loops are wrapped in a shared reconnection pattern. Both components maintain a `retry_count` for connection attempts. On failure to acquire a pool connection, create a `PgListener`, or `LISTEN` on the relevant channel, they compute the next delay via `reconnection_policy.calculate_delay(retry_count)`, log the error with `backoff_secs` and `attempt`, sleep for the computed duration, increment `retry_count`, and retry. On successful connection and subscription they reset `retry_count` to 1 and then preserve the existing behavior: workers continue to process tasks and handle graceful shutdown, and schedulers continue to enforce the single-instance advisory lock and iterate the configured schedule.
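A rough sketch of the shared pattern described above, assuming a sqlx `PgListener` and a placeholder channel name; this is not the crate's actual worker or scheduler code:

```rust
use sqlx::{postgres::PgListener, PgPool};

// Hypothetical outer reconnection loop shared by the worker and scheduler;
// `RetryPolicy` is the type sketched in the previous block.
async fn run_with_reconnect(pool: PgPool, reconnection_policy: RetryPolicy) {
    let mut retry_count: u32 = 1;

    'reconnect: loop {
        // Try to establish the listener.
        let mut listener = match PgListener::connect_with(&pool).await {
            Ok(listener) => listener,
            Err(err) => {
                let delay = reconnection_policy.calculate_delay(retry_count);
                tracing::error!(
                    backoff_secs = delay.as_secs_f64(),
                    attempt = retry_count,
                    "listener connection failed: {err}"
                );
                tokio::time::sleep(delay).await;
                retry_count += 1;
                continue 'reconnect;
            }
        };

        // Subscribe to the notification channel (name is a placeholder).
        if let Err(err) = listener.listen("task_change").await {
            let delay = reconnection_policy.calculate_delay(retry_count);
            tracing::error!(
                backoff_secs = delay.as_secs_f64(),
                attempt = retry_count,
                "LISTEN failed: {err}"
            );
            tokio::time::sleep(delay).await;
            retry_count += 1;
            continue 'reconnect;
        }

        // Connected and subscribed: reset the backoff, then run the normal
        // processing loop until the connection is lost again.
        retry_count = 1;

        loop {
            match listener.recv().await {
                Ok(_notification) => { /* process work as before */ }
                Err(_) => continue 'reconnect,
            }
        }
    }
}
```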