Skip to content

Conversation

@maxcountryman
Copy link
Owner

Motivation

  • Prevent races where a stale attempt finalizes or reschedules a task after a new attempt has been created by introducing explicit per-attempt fencing.
  • Avoid brittle guards that relied on the task row state or the latest attempt_number alone.
  • Make staleness detection explicit by using an expiration timestamp rather than only heartbeat timestamps.

Description

  • Add a migration that adds lease_token and lease_expires_at to underway.task and lease_token to underway.task_attempt.
  • Generate a new UUID lease on dequeue and store it on both the task row and the inserted task_attempt, and set lease_expires_at = now() + heartbeat.
  • Wire lease_token into InProgressTask and fence all attempt / task updates (mark_succeeded, mark_failed, mark_cancelled, retry_after, record_failure, and record_heartbeat) to require a matching lease and clear leases on terminal or reschedule transitions.
  • Replace staleness checks to use lease_expires_at and add a test that verifies a stale attempt cannot finalize a task (stale_attempt_cannot_finalize_task).

Testing

  • Ran cargo fmt, which completed successfully.
  • No unit or integration tests were executed as part of this change in this rollout.

Codex Task

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines 820 to 824
or (
state = $3
-- Has heartbeat stalled?
and last_heartbeat_at < now() - heartbeat
and (lease_expires_at < now() or lease_expires_at is null)
-- Are there remaining retries?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid treating NULL leases as stale on rollout

This staleness predicate treats lease_expires_at is null as immediately stale, which means any tasks that were already InProgress before this migration (and thus have NULL in the new column) will be re-dequeued right away. During a rolling deploy, that can create duplicate attempts and drop results from the original worker because the old attempt cannot match a lease token. If those tasks are not idempotent, this can corrupt results. Consider backfilling lease_expires_at for existing in-progress rows or falling back to last_heartbeat_at when the lease is NULL.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants