
Align rollout error rescheduling with verifiers #2579

Open: rasdani wants to merge 7 commits into PrimeIntellect-ai:main from rasdani:fix/verifiers-error-reschedule-policy

@rasdani rasdani commented May 20, 2026

Summary

  • Add a scheduler error policy that reschedules serialized vf.RolloutOutput["error"] dictionaries only when verifiers marks them with is_retryable=true.
  • Stop treating every non-null serialized rollout error as reschedulable. A plain ModelError, including ModelError -> InternalServerError no-worker outages, now fails the batch fast with a clear terminal error instead of spawning replacement rollout groups.
  • Preserve existing bounded reschedule behavior for retryable verifier failures and empty trajectories.

Coordinated Verifiers Change

This PR depends on the serialized retryability flag proposed in PrimeIntellect-ai/verifiers#1427. verifiers owns the retry policy because it still has live exception types and can evaluate InfraError / InvalidModelResponseError subclass semantics before serializing ErrorInfo.

prime-rl intentionally does not enumerate serialized subclass names and does not special-case live vf exception classes in the scheduler. If a serialized error has no is_retryable=true flag, prime-rl treats it as terminal/no-reschedule.
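
A minimal sketch of the consumer-side rule described above, not the PR's actual implementation. The function name `is_retryable_flag_set` is hypothetical, and the strict `is True` comparison is an assumption about how "no is_retryable=true flag" is interpreted:

```python
from collections.abc import Mapping
from typing import Any, Optional


def is_retryable_flag_set(error: Optional[Mapping[str, Any]]) -> bool:
    """Return True only for a serialized error explicitly marked retryable.

    Hypothetical helper: it inspects the serialized flag and nothing else,
    so no class names are enumerated on the prime-rl side.
    """
    if not isinstance(error, Mapping):
        # None (no error) or an unexpected payload type is never rescheduled.
        return False
    # Assumption: a missing flag, None, or a non-boolean value is terminal.
    return error.get("is_retryable") is True


flagged = {"error": "InfraError", "is_retryable": True}
unflagged = {"error": "ModelError"}
```

Under these assumptions, `is_retryable_flag_set(flagged)` is true and `is_retryable_flag_set(unflagged)` is false, so an error serialized by an older verifiers version without the flag would be treated as terminal.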

Why

prime-rl receives rollout failures after verifiers serializes them into ErrorInfo dictionaries (error, error_chain_str, error_chain_repr, and, with verifiers#1427, is_retryable). The raw class hierarchy is not preserved across that boundary, so the retry decision needs to be carried in the serialized payload rather than reconstructed from class-name strings.
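
To illustrate why the flag has to be computed before serialization, here is a small sketch using stand-in classes. The exception classes, the `serialize_error` helper, the field values, and the rule "only InfraError subclasses are retryable" are all assumptions for illustration; they are not the verifiers implementation or its real serialized format:

```python
from typing import TypedDict


class InfraError(Exception):
    """Stand-in for an infrastructure failure (hypothetical)."""


class ModelError(Exception):
    """Stand-in for a model/router failure (hypothetical)."""


class SandboxTimeout(InfraError):
    """Hypothetical subclass; its name alone does not reveal its parent."""


class ErrorInfo(TypedDict, total=False):
    error: str
    error_chain_str: str
    is_retryable: bool


def serialize_error(exc: BaseException) -> ErrorInfo:
    # The isinstance check is only possible here, while the exception is live.
    # Assumption: only InfraError subclasses count as retryable.
    return ErrorInfo(
        error=type(exc).__name__,
        error_chain_str=repr(exc),
        is_retryable=isinstance(exc, InfraError),
    )


retryable = serialize_error(SandboxTimeout("sandbox did not start"))
terminal = serialize_error(ModelError("No available workers"))
```

After serialization the consumer only sees the string `"SandboxTimeout"`, which says nothing about its relationship to `InfraError`; the `is_retryable` field is the only place that relationship survives.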

This prevents model/router outages like ModelError -> InternalServerError('No available workers (all circuits open or unhealthy)') from being mistaken for sandbox/infra failures and causing replacement SWE sandbox churn.

Failure Behavior

Retryable serialized rollout errors still use the scheduler's bounded replacement-rollout path. Non-retryable serialized rollout errors drop the affected group for cleanup. The scheduler then raises a plain RuntimeError outside its broad rollout-task exception handler, so the orchestrator exits instead of silently continuing with no-signal batches.
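
The control flow above can be sketched roughly as follows. This is a simplified, synchronous illustration, not the scheduler's code: `process_completed`, the `group_id` key, and the `active_groups` set are hypothetical, and the reschedule bound is omitted for brevity:

```python
from typing import Any, Optional


def process_completed(
    rollouts: list[dict[str, Any]], active_groups: set[str]
) -> list[str]:
    """Return group ids needing replacement rollouts, or fail fast."""
    replacements: list[str] = []
    terminal_error: Optional[dict[str, Any]] = None
    for rollout in rollouts:
        try:
            error = rollout.get("error")
            if error is None:
                continue
            if error.get("is_retryable") is True:
                replacements.append(rollout["group_id"])
            else:
                # Drop the group for cleanup and remember the error;
                # raising here would be swallowed by the handler below.
                active_groups.discard(rollout["group_id"])
                terminal_error = error
                break
        except Exception:
            # Stand-in for a broad rollout-task exception handler.
            active_groups.discard(rollout.get("group_id"))
    if terminal_error is not None:
        # Raised outside the try block so it propagates to the caller.
        raise RuntimeError(
            f"Non-retryable rollout error: {terminal_error.get('error')}"
        )
    return replacements
```

The design point is the placement of the `raise`: recording the terminal error inside the loop and raising after it keeps the broad `except Exception` from turning a fatal condition into a silently dropped group.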

Test Plan

  • PYTHONPATH=/home/daniel/git/prime-rl-verifiers-retry-semantics/src:/home/daniel/git/prime-rl-verifiers-retry-semantics/packages/prime-rl-configs/src:/home/daniel/git/prime-rl-verifiers-retry-semantics/deps/verifiers /home/daniel/git/prime-rl/.venv/bin/python -m pytest tests/unit/orchestrator/test_scheduler.py
  • /home/daniel/git/prime-rl/.venv/bin/ruff check src/prime_rl/orchestrator/scheduler.py tests/unit/orchestrator/test_scheduler.py
  • /home/daniel/git/prime-rl/.venv/bin/ruff format --check src/prime_rl/orchestrator/scheduler.py tests/unit/orchestrator/test_scheduler.py

Note

Medium Risk
Changes rollout failure handling in the scheduler to fail fast on non-retryable serialized errors, which can halt training runs if upstream retry flags are missing or incorrect. Logic is localized and covered by new unit tests, but impacts core batch generation behavior.

Overview
Rollout rescheduling now follows verifiers’ retry decision. The scheduler adds is_reschedulable_rollout_error() and only replaces errored rollouts when vf.RolloutOutput["error"] contains is_retryable: true; otherwise it drops the group and raises a RuntimeError to abort batch generation.

It also tightens task completion handling by explicitly dropping groups for cancelled tasks and tasks with exceptions before processing results. Tests were expanded to cover retryable vs terminal serialized errors and to assert that non-reschedulable errors stop generate_batch and avoid updating the buffer.
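
As a rough illustration of the task completion handling described here, the sketch below classifies finished asyncio tasks before any result is read. The helper names and group ids are hypothetical and the scheduler's real bookkeeping is not shown. One detail that is standard asyncio behavior: `Task.exception()` raises `CancelledError` on a cancelled task, so `cancelled()` is checked first:

```python
import asyncio


def classify_finished(
    tasks: dict[asyncio.Task, str],
) -> tuple[list[str], list[str]]:
    """Return (group ids to process, group ids to drop) for finished tasks."""
    to_process: list[str] = []
    to_drop: list[str] = []
    for task, group_id in tasks.items():
        if task.cancelled():
            to_drop.append(group_id)
        elif task.exception() is not None:
            to_drop.append(group_id)
        else:
            to_process.append(group_id)
    return to_process, to_drop


async def demo() -> tuple[list[str], list[str]]:
    async def ok() -> str:
        return "rollout"

    async def boom() -> None:
        raise ValueError("task failed")

    async def slow() -> None:
        await asyncio.sleep(10)

    tasks = {
        asyncio.create_task(ok()): "g-ok",
        asyncio.create_task(boom()): "g-exc",
        asyncio.create_task(slow()): "g-cancelled",
    }
    await asyncio.sleep(0)  # let the tasks start running
    for task, group_id in tasks.items():
        if group_id == "g-cancelled":
            task.cancel()
    await asyncio.wait(tasks.keys())  # all three tasks are now done
    return classify_finished(tasks)


to_process, to_drop = asyncio.run(demo())
```

Only groups in `to_process` would go on to result inspection, where the serialized error flag is consulted.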

Reviewed by Cursor Bugbot for commit 609f080.

@rasdani rasdani force-pushed the fix/verifiers-error-reschedule-policy branch from bfffbb4 to f97c908 May 21, 2026 01:07
@rasdani rasdani requested a review from mikasenghaas May 21, 2026 02:06
