Conversation

@Marvel-Gu

Summary

  • Extend task::RetryPolicy with a configurable jitter factor and use it for reconnection backoff.
  • Apply the same jittered exponential reconnection backoff policy to both workers and schedulers.

Fixes #91

RetryPolicy & backoff behavior

RetryPolicy is extended with a jitter_factor field and a jitter_factor(...) builder method. calculate_delay(retry_count) now computes an exponential delay based on initial_interval_ms, max_interval_ms, and backoff_coefficient, then applies jitter by sampling uniformly between target * (1.0 - jitter_factor) and target. With the default values (initial_interval_ms = 1_000, max_interval_ms = 60_000, backoff_coefficient = 2.0, jitter_factor = 0.5), retries back off exponentially while being de‑synchronized across processes.

Worker & scheduler reconnection

Their main loops are wrapped in a shared reconnection pattern. Both components maintain a retry_count for connection attempts. On failure to acquire a pool connection, create a PgListener, or LISTEN on the relevant channel, they compute the next delay via reconnection_policy.calculate_delay(retry_count), log the error with backoff_secs and attempt, sleep for the computed duration, increment retry_count, and retry. On successful connection and subscription they reset retry_count to 1 and preserve the existing behavior: workers continue to process tasks and handle graceful shutdown, and schedulers continue to enforce the single‑instance advisory lock and iterate the configured schedule.

src/scheduler.rs Outdated
shutdown_token: CancellationToken,

// Policy for reconnection backoff when PostgreSQL connection is lost.
reconnection_policy: RetryPolicy,

Owner

This is a nit, but I think it's a bit confusing that we have RetryPolicy, a connection policy, and a task policy. Perhaps there's an opportunity for better naming; if RetryPolicy really is applicable to both otherwise distinct uses, a shared, better-named type might make more sense.

Author

Thanks for the feedback! I agree that having RetryPolicy used for both task retries and connection reconnection could be confusing.

A few options I'm considering:

  1. Rename RetryPolicy to something more specific like TaskRetryPolicy and create a separate ConnectionRetryPolicy or ReconnectionPolicy
  2. Create a more generic BackoffPolicy that both can use, with clearer naming at the usage sites
  3. Keep RetryPolicy for tasks and introduce a ReconnectionBackoff type for connection retries

What's your preference? Or do you have any better suggestions?

Owner

I think I'd lean towards the second option if that reduces duplication, but I don't have a strong preference.

let mut retry_count: RetryCount = 1;

// Outer loop: handle reconnection logic
'reconnect: loop {

Owner

Kind of a shame to have significant duplication.

  max_interval_ms int,
- backoff_coefficient float
+ backoff_coefficient float,
+ jitter_factor float

Owner

Sorry for the delay on this. I think we probably want to incorporate these changes in a separate migration, right?

If we don't, then anyone using Underway today wouldn't have a clean upgrade path. (Maybe that's reasonable for breaking releases, but it would be nice if we could avoid that.)
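
For illustration only, a separate follow-up migration might look something like this, assuming the retry policy is stored as a Postgres composite type with the attributes shown in the schema snippet above. The type name and the handling of existing rows are assumptions, not taken from the PR:

```sql
-- Hypothetical follow-up migration; the type name is illustrative.
-- Existing rows would carry NULL for the new attribute, so the Rust side
-- would need to treat NULL as the default jitter_factor (0.5).
alter type underway.task_retry_policy
    add attribute jitter_factor float;
```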

Development

Successfully merging this pull request may close these issues.

Workers and schedulers need a reconnection backoff delay
