Skip to content

Conversation

@maxcountryman
Copy link
Owner

Motivation

  • Prevent stale/stalled task attempts from mutating task state or heartbeats once a newer attempt runs.
  • Ensure only the latest attempt can mark task attempts/tables as InProgress/Succeeded/Failed/Cancelled.
  • Avoid hard errors when a stale attempt races to finalize a task and instead make those operations no-ops with a warning.
  • Preserve correct cancellation semantics for pending tasks that may not yet have an attempt row.

Description

  • Added an attempt_number field to InProgressTask (InProgressTask.attempt_number) and populate it when dequeueing by returning the attempt_number from the inserted underway.task_attempt row.
  • Fenced all state transitions and heartbeats (mark_succeeded, mark_failed, mark_cancelled, retry_after, record_failure, record_heartbeat) to require the attempt row match self.attempt_number and that the task's max attempt equals self.attempt_number before updating the task row.
  • Changed cancel path to tolerate tasks without an attempt row by allowing the task update when no attempt exists (via OR NOT EXISTS), and updated Job::cancel query to select a default attempt_number for pending tasks.
  • Converted stale-attempt errors into tracing::warn! + no-op returns so races where a stale attempt attempts to finalize a task do not break the system, and added a regression test stale_attempt_cannot_finalize_task.

Testing

  • Ran cargo fmt, which completed successfully.
  • No unit/integration tests were executed in this run.
  • Added a SQLx test stale_attempt_cannot_finalize_task in src/queue.rs to verify that a stale attempt cannot finalize a task (test included but not executed here).

Codex Task

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +1304 to +1308
select max(attempt_number)
from underway.task_attempt
where task_id = $1
and task_queue_name = $3
) = $5

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Allow cancelling pending tasks with prior attempts

The new cancellation guard requires max(attempt_number) = self.attempt_number, but Job::cancel now hardcodes attempt_number to 0 for pending tasks. If a task is pending because a previous attempt failed and scheduled a retry, there is an attempt row with a max attempt number > 0, so this condition fails and the cancel becomes a no-op (only a warning). That means you can no longer cancel retried-but-pending jobs, which is a regression from the previous behavior. Consider selecting the latest attempt number for pending tasks or relaxing the guard for Pending cancellations.

Useful? React with 👍 / 👎.

@maxcountryman maxcountryman deleted the codex/assess-heartbeat-mechanism-for-soundness-issue branch December 26, 2025 22:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants