Reconnection backoff delay for workers and schedulers #107
base: main
Conversation
src/scheduler.rs
Outdated
```rust
    shutdown_token: CancellationToken,

    // Policy for reconnection backoff when PostgreSQL connection is lost.
    reconnection_policy: RetryPolicy,
```
This is a nit, but I think it's a bit confusing that we have `RetryPolicy`, connection policy, and task policy. Perhaps there's an opportunity for better naming; if `RetryPolicy` really is applicable to both otherwise distinct types, reusing it might make better sense.
Thanks for the feedback! I agree that having `RetryPolicy` used for both task retries and connection reconnection could be confusing.

A few options I'm considering:
- Rename `RetryPolicy` to something more specific like `TaskRetryPolicy` and create a separate `ConnectionRetryPolicy` or `ReconnectionPolicy`
- Create a more generic `BackoffPolicy` that both can use, with clearer naming at the usage sites
- Keep `RetryPolicy` for tasks and introduce a `ReconnectionBackoff` type for connection retries

What's your preference? Or do you have any better suggestions?
I think I'd lean towards the second option if that reduces duplication, but I don't have a strong preference.
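For illustration only, a minimal sketch of what the second option above (a shared, generically named backoff type) might look like; the `BackoffPolicy` name, its fields, and the aliases are taken from the suggestion in this thread and are not part of the crate:

```rust
/// Hypothetical shared backoff type that both task retries and connection
/// reconnection could use, per the second option above.
#[derive(Clone, Copy, Debug)]
pub struct BackoffPolicy {
    pub initial_interval_ms: u64,
    pub max_interval_ms: u64,
    pub backoff_coefficient: f64,
    pub jitter_factor: f64,
}

// Clearer naming at the usage sites, e.g. via aliases (or newtypes).
pub type TaskRetryPolicy = BackoffPolicy;
pub type ReconnectionPolicy = BackoffPolicy;
```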
```rust
    let mut retry_count: RetryCount = 1;

    // Outer loop: handle reconnection logic
    'reconnect: loop {
```
Kind of a shame to have significant duplication.
migrations/20241105164503_2.sql
Outdated
```diff
     max_interval_ms int,
-    backoff_coefficient float
+    backoff_coefficient float,
+    jitter_factor float
```
Sorry for the delay on this. I think we probably want to incorporate these changes in a separate migration, right?
If we don't, then anyone using Underway today wouldn't have a clean upgrade path. (Maybe that's reasonable for breaking releases, but it would be nice if we could avoid that.)
Summary
Extend `task::RetryPolicy` with a configurable jitter factor and use it for reconnection backoff. Fixes #91
RetryPolicy & backoff behavior
`RetryPolicy` is extended with a `jitter_factor` field and a `jitter_factor(...)` builder method. `calculate_delay(retry_count)` now computes an exponential delay based on `initial_interval_ms`, `max_interval_ms`, and `backoff_coefficient`, then applies jitter by sampling between `target * (1.0 - jitter_factor)` and `target`. With the default values (`initial_interval_ms = 1_000`, `max_interval_ms = 60_000`, `backoff_coefficient = 2.0`, `jitter_factor = 0.5`), retries back off exponentially while being de-synchronized across processes.
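For illustration, a minimal sketch of what a delay calculation along these lines could look like; the field and method names follow the summary above, but the body is an assumption rather than the PR's actual implementation:

```rust
use rand::Rng;
use std::time::Duration;

/// Hypothetical stand-in for the extended policy described above.
pub struct RetryPolicy {
    pub initial_interval_ms: u64,
    pub max_interval_ms: u64,
    pub backoff_coefficient: f64,
    pub jitter_factor: f64,
}

impl RetryPolicy {
    /// Exponential backoff capped at `max_interval_ms`, with jitter sampled
    /// between `target * (1.0 - jitter_factor)` and `target`.
    pub fn calculate_delay(&self, retry_count: u32) -> Duration {
        let exp = self
            .backoff_coefficient
            .powi(retry_count.saturating_sub(1) as i32);
        let target = (self.initial_interval_ms as f64 * exp)
            .min(self.max_interval_ms as f64);
        let low = target * (1.0 - self.jitter_factor);
        let jittered = rand::thread_rng().gen_range(low..=target);
        Duration::from_millis(jittered as u64)
    }
}
```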
Worker & scheduler reconnection
The worker and scheduler main loops are wrapped in a shared reconnection pattern. Both components maintain a `retry_count` for connection attempts. On failure to acquire a pool connection, create a `PgListener`, or `LISTEN` on the relevant channel, they compute the next delay via `reconnection_policy.calculate_delay(retry_count)`, log the error with `backoff_secs` and `attempt`, sleep for the computed duration, increment `retry_count`, and retry. On successful connection and subscription they reset `retry_count` to 1 and then preserve the existing behavior: workers continue to process tasks and handle graceful shutdown, and schedulers continue to enforce the single-instance advisory lock and iterate the configured schedule.
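A rough sketch of the shared pattern described above, assuming a sqlx `PgListener` and a placeholder channel name; this is not the crate's actual worker or scheduler code:

```rust
use sqlx::{postgres::PgListener, PgPool};

// Hypothetical outer reconnection loop shared by the worker and scheduler;
// `RetryPolicy` is the type sketched in the previous block.
async fn run_with_reconnect(pool: PgPool, reconnection_policy: RetryPolicy) {
    let mut retry_count: u32 = 1;

    'reconnect: loop {
        // Try to establish the listener.
        let mut listener = match PgListener::connect_with(&pool).await {
            Ok(listener) => listener,
            Err(err) => {
                let delay = reconnection_policy.calculate_delay(retry_count);
                tracing::error!(
                    backoff_secs = delay.as_secs_f64(),
                    attempt = retry_count,
                    "listener connection failed: {err}"
                );
                tokio::time::sleep(delay).await;
                retry_count += 1;
                continue 'reconnect;
            }
        };

        // Subscribe to the notification channel (name is a placeholder).
        if let Err(err) = listener.listen("task_change").await {
            let delay = reconnection_policy.calculate_delay(retry_count);
            tracing::error!(
                backoff_secs = delay.as_secs_f64(),
                attempt = retry_count,
                "LISTEN failed: {err}"
            );
            tokio::time::sleep(delay).await;
            retry_count += 1;
            continue 'reconnect;
        }

        // Connected and subscribed: reset the backoff, then run the normal
        // processing loop until the connection is lost again.
        retry_count = 1;

        loop {
            match listener.recv().await {
                Ok(_notification) => { /* process work as before */ }
                Err(_) => continue 'reconnect,
            }
        }
    }
}
```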