Align rollout error rescheduling with verifiers #2579
Open
rasdani wants to merge 7 commits into
Summary
Rollouts with serialized vf.RolloutOutput["error"] dictionaries are rescheduled only when verifiers marks them with is_retryable=true.
ModelError, including ModelError -> InternalServerError no-worker outages, now fails the batch fast with a clear terminal error instead of spawning replacement rollout groups.
Coordinated Verifiers Change
This PR depends on the serialized retryability flag proposed in PrimeIntellect-ai/verifiers#1427. verifiers owns the retry policy because it still has live exception types and can evaluate InfraError/InvalidModelResponseError subclass semantics before serializing ErrorInfo.
prime-rl intentionally does not enumerate serialized subclass names and does not special-case live vf exception classes in the scheduler. If a serialized error has no is_retryable=true flag, prime-rl treats it as terminal/no-reschedule.
Why
prime-rl receives rollout failures after verifiers serializes them into ErrorInfo dictionaries (error, error_chain_str, error_chain_repr, and with verifiers#1427, is_retryable). Raw class hierarchy is not preserved across that boundary, so the retry decision needs to be carried in the serialized payload rather than reconstructed from class-name strings.
This prevents model/router outages like ModelError -> InternalServerError('No available workers (all circuits open or unhealthy)') from being mistaken for sandbox/infra failures and causing replacement SWE sandbox churn.
Failure Behavior
Retryable serialized rollout errors still use the scheduler's bounded replacement-rollout path. Non-retryable serialized rollout errors drop the affected group for cleanup, then raise a plain RuntimeError outside the scheduler's broad rollout-task exception handler so the orchestrator exits instead of silently continuing with no-signal batches.
Test Plan
PYTHONPATH=/home/daniel/git/prime-rl-verifiers-retry-semantics/src:/home/daniel/git/prime-rl-verifiers-retry-semantics/packages/prime-rl-configs/src:/home/daniel/git/prime-rl-verifiers-retry-semantics/deps/verifiers /home/daniel/git/prime-rl/.venv/bin/python -m pytest tests/unit/orchestrator/test_scheduler.py
/home/daniel/git/prime-rl/.venv/bin/ruff check src/prime_rl/orchestrator/scheduler.py tests/unit/orchestrator/test_scheduler.py
/home/daniel/git/prime-rl/.venv/bin/ruff format --check src/prime_rl/orchestrator/scheduler.py tests/unit/orchestrator/test_scheduler.py
Note
Medium Risk
Changes rollout failure handling in the scheduler to fail fast on non-retryable serialized errors, which can halt training runs if upstream retry flags are missing or incorrect. Logic is localized and covered by new unit tests, but impacts core batch generation behavior.
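To make the risk concrete, here is a minimal sketch of a retryability check over a serialized error dictionary. The function name is_reschedulable_rollout_error comes from this PR, but the signature, the body, and the example payloads (including the placeholder error name ExampleRetryableError) are assumptions for illustration, not the actual implementation.

```python
from typing import Any, Mapping, Optional


def is_reschedulable_rollout_error(error: Optional[Mapping[str, Any]]) -> bool:
    """Return True only when the serialized error carries is_retryable=True.

    Hypothetical body: the signature and logic are assumed, not taken
    from the PR diff.
    """
    if not isinstance(error, Mapping):
        return False
    # Strict identity check: a missing flag, None, or a truthy non-bool
    # such as the string "true" is treated as terminal.
    return error.get("is_retryable") is True


# Illustrative payloads; the field names follow the ErrorInfo keys listed
# in the PR, while the values are made up.
retryable = {"error": "ExampleRetryableError", "is_retryable": True}
terminal = {"error": "ModelError", "is_retryable": False}
unflagged = {"error": "ModelError"}  # e.g. serialized without the new flag
```

Under this sketch, an unflagged payload is indistinguishable from an explicitly terminal one, which is why a missing or incorrect upstream flag would stop the run rather than trigger a replacement rollout.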
Overview
Rollout rescheduling now follows verifiers’ retry decision. The scheduler adds is_reschedulable_rollout_error() and only replaces errored rollouts when vf.RolloutOutput["error"] contains is_retryable: true; otherwise it drops the group and raises a RuntimeError to abort batch generation.
It also tightens task completion handling by explicitly dropping groups for cancelled tasks and tasks with exceptions before processing results. Tests were expanded to cover retryable vs terminal serialized errors and to assert that non-reschedulable errors stop generate_batch and avoid updating the buffer.
Reviewed by Cursor Bugbot for commit 609f080. Bugbot is set up for automated code reviews on this repo.
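The control flow described in this PR (reschedule retryable errors, drop the group and abort on terminal ones, and leave the buffer untouched) could look roughly like the sketch below. Everything here is hypothetical: handle_group, its return values, and the list-based buffer are stand-ins and do not mirror the real scheduler code.

```python
from typing import Any, Dict, List, Optional


def handle_group(
    group_id: int,
    rollouts: List[Dict[str, Any]],
    buffer: List[Dict[str, Any]],
    dropped: List[int],
) -> str:
    """Hypothetical per-group result handler (not the real scheduler API)."""
    terminal: Optional[Dict[str, Any]] = None
    try:
        errors = [r["error"] for r in rollouts if r.get("error") is not None]
        if not errors:
            buffer.extend(rollouts)  # clean group: keep its rollouts
            return "buffered"
        if all(e.get("is_retryable") is True for e in errors):
            return "reschedule"  # caller uses the bounded replacement path
        dropped.append(group_id)  # drop the affected group for cleanup
        terminal = next(e for e in errors if e.get("is_retryable") is not True)
    except Exception:
        # Stand-in for a broad rollout-task handler that drops the group
        # and lets the run continue.
        dropped.append(group_id)
        return "dropped"
    # Raised after the try block so the broad handler cannot swallow it.
    raise RuntimeError(
        f"non-retryable rollout error in group {group_id}: {terminal.get('error')}"
    )
```

Raising after the try block, rather than inside it, is the design point in this sketch: the terminal error propagates to the caller, and the buffer is never touched for the failing group.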