@moonbox3
Contributor

Motivation and Context

When resuming a workflow from a checkpoint with 0 messages (already completed), the workflow would incorrectly:

  • Emit WorkflowStartedEvent and status events
  • Potentially run executor iterations despite having no work to do

Description

Added early exit logic in two places:

  1. Runner: Check for messages before starting each superstep iteration
  2. Workflow: When resuming from checkpoint, check if checkpoint was already complete (had 0 messages) before emitting events

The fix distinguishes between:

  • Checkpoints with 0 messages → workflow already complete, return immediately
  • Checkpoints with messages that get processed → continue normal execution
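The two checks described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `SketchContext`, `SketchRunner`, `resume_from_checkpoint`, and the string events are hypothetical stand-ins, and only `has_messages()` mirrors a name that appears in the diff.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class SketchContext:
    """Hypothetical stand-in for the runner context; holds pending messages."""

    messages: list[str] = field(default_factory=list)

    async def has_messages(self) -> bool:
        return bool(self.messages)


@dataclass
class SketchRunner:
    """Hypothetical runner: each superstep drains the pending messages."""

    context: SketchContext
    max_iterations: int = 10
    iterations_run: int = 0

    async def run_until_convergence(self) -> list[str]:
        events: list[str] = []
        while self.iterations_run < self.max_iterations:
            # Runner-level check: do not start a superstep when there is no work.
            if not await self.context.has_messages():
                break
            delivered = self.context.messages.copy()
            self.context.messages.clear()
            self.iterations_run += 1
            events.append(f"superstep {self.iterations_run}: {len(delivered)} message(s)")
        return events


async def resume_from_checkpoint(checkpoint_messages: list[str]) -> list[str]:
    """Workflow-level check: an empty checkpoint returns before any event is emitted."""
    runner = SketchRunner(SketchContext(list(checkpoint_messages)))
    if not await runner.context.has_messages():
        # Already complete: no started/status events and no iterations.
        return []
    events = ["started"]
    events += await runner.run_until_convergence()
    events.append("status: idle")
    return events
```

In this sketch, `asyncio.run(resume_from_checkpoint([]))` returns an empty list, while a checkpoint holding one message produces a started event, one superstep, and a final status event.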

Changes:

Contribution Checklist

  • The code builds clean without any errors or warnings
  • The PR follows the Contribution Guidelines
  • All unit tests pass, and I have added new tests where possible
  • Is this a breaking change? If yes, add "[BREAKING]" prefix to the title of the PR.

@moonbox3 self-assigned this Oct 28, 2025
Copilot AI review requested due to automatic review settings October 28, 2025 03:15
@moonbox3 added the python, squad: workflows (Agent Framework Workflows Squad), and workflows (Related to Workflows in agent-framework) labels Oct 28, 2025
Contributor

Copilot AI left a comment

Pull Request Overview

This PR fixes a bug where resuming a workflow from an already-completed checkpoint would unnecessarily re-execute workflow logic and emit events. The fix adds early-exit checks to prevent execution when there's no work to do.

Key changes:

  • Added message-checking logic before workflow iteration loops to avoid running when no messages exist
  • Modified checkpoint resumption to detect already-completed workflows and return immediately
  • Added comprehensive unit tests validating the fix

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File | Description
test_checkpoint_resume_completed.py | New test file with comprehensive coverage of checkpoint resume scenarios
_workflow.py | Added early-exit logic when resuming from completed checkpoints; restructured checkpoint restoration
_runner.py | Added message check before each iteration to prevent unnecessary loops

# Check if workflow is already complete
# Only return early if checkpoint had NO messages to begin with
if not had_messages_before and not await self._runner.context.has_messages():
    # Return empty result - workflow was already complete

Copilot AI Oct 28, 2025

[nitpick] The empty WorkflowRunResult([], []) doesn't preserve any status events. If the workflow was previously completed, there might be value in returning the final status from the checkpoint. Consider whether the return value should include the completion status or document why an empty result is appropriate.

Suggested change
-     # Return empty result - workflow was already complete
+     # Return result with final status event if available
+     final_status = self._runner.context.get_status() if hasattr(self._runner.context, "get_status") else None
+     if final_status is not None:
+         status_event = WorkflowStatusEvent(final_status)
+         return WorkflowRunResult([], [status_event])

@markwallace-microsoft
Member

Python Test Coverage

Python Test Coverage Report
File | Stmts | Miss | Cover | Missing
packages/core/agent_framework/_workflows
   _runner.py | 221 | 39 | 82% | 120, 122–125, 162–165, 169, 205–207, 228, 234, 267–268, 302–304, 313, 317, 319, 323, 331, 337, 362, 364–367, 378, 380–381, 384–387, 410
   _workflow.py | 287 | 35 | 87% | 95, 248–250, 252–253, 271, 295, 297, 466, 470, 504, 580, 614, 622, 628, 637–641, 650–651, 655, 657, 664, 671–673, 692, 735–737, 750, 815
TOTAL | 12303 | 2012 | 83% |

Python Unit Test Overview

Tests Skipped Failures Errors Time
1408 98 💤 0 ❌ 0 🔥 26.919s ⏱️

logger.info("Skipping 'after_initial_execution' checkpoint because we resumed from a checkpoint")

while self._iteration < self._max_iterations:
    # Check if there are any messages to process before starting iteration
Contributor

Will this prevent the workflow from ever getting started because there may not be messages in the queue at the beginning?

Development

Successfully merging this pull request may close these issues.

Python: [Bug] Workflow resumes from latest checkpoint but re-runs first executor