Serialize retryable rollout error policy#1427
Conversation
| error_chain = ErrorChain(error) | ||
| return any(error_type in error_chain for error_type in error_types) | ||
|
|
||
| return error.get("is_retryable") is True |
There was a problem hiding this comment.
Serialized path silently ignores error_types parameter
Medium Severity
is_retryable_error accepts an error_types parameter but silently ignores it for serialized ErrorInfo inputs, only checking the is_retryable flag. In maybe_retry, error_types is a customizable parameter — if a caller narrows it (e.g., only InfraError), serialized InvalidModelResponseError entries with is_retryable=True would still be retried. Conversely, if a caller adds a new retryable type, serialized errors of that type won't be retried because error_info() only uses DEFAULT_RETRYABLE_ERROR_TYPES. The parameter creates a false expectation of symmetry between live and serialized error handling.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 64b1382. Configure here.
| error_chain = ErrorChain(error) | ||
| return any(error_type in error_chain for error_type in error_types) | ||
|
|
||
| return error.get("is_retryable") is True |
There was a problem hiding this comment.
Missing type guard crashes on unexpected error types
Low Severity
is_retryable_error falls through to error.get("is_retryable") for any non-BaseException input. If state["error"] is set to an unexpected type (e.g., a string from apply_internal_state_patch in sandbox programs), this crashes with AttributeError. The old code guarded with isinstance(err, Mapping) before attempting dict access. The PR description mentions adding an is_error_info runtime guard, but no such guard was implemented.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 64b1382. Configure here.
64b1382 to
e710e01
Compare
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e710e01e62
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| error_chain = ErrorChain(error) | ||
| return any(error_type in error_chain for error_type in error_types) | ||
|
|
||
| return error.get("is_retryable") is True |
There was a problem hiding this comment.
Honor caller retry types for serialized errors
Serialized errors now ignore the error_types argument and only check is_retryable, so maybe_retry(..., error_types=...) can retry the wrong failures or skip requested ones once errors cross the serialization boundary. For example, an InfraError serialized with is_retryable=true will still be retried even if the caller only configured InvalidModelResponseError, while custom retry classes that are not in the default policy will never retry after serialization.
Useful? React with 👍 / 👎.
| if isinstance(err, BaseException): | ||
| raise err | ||
| detail = str(err.get("error_chain_repr") or err.get("error") or "") | ||
| raise error_types[0](detail) |
There was a problem hiding this comment.
Guard empty retry type tuple before re-raising
When a serialized error is marked retryable, reraise_error always raises error_types[0]; if a caller passes error_types=() (a valid tuple value) this path throws IndexError instead of returning the rollout state. That turns a recoverable rollout-error result into an unexpected crash in retry handling.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 3 total unresolved issues (including 2 from previous reviews).
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit e710e01. Configure here.
| """Return whether a live or serialized error matches verifiers' retry policy.""" | ||
| if isinstance(error, BaseException): | ||
| error_chain = ErrorChain(error) | ||
| return any(error_type in error_chain for error_type in error_types) |
There was a problem hiding this comment.
ErrorChain check mismatches tenacity's top-level-only retry guard
Medium Severity
is_retryable_error for live BaseException traverses the full __cause__ chain via ErrorChain, but reraise_error then re-raises the original top-level exception. Tenacity's retry_if_exception_type(error_types) only checks isinstance on the raised exception itself. If the top-level exception is not a retryable type but a chained cause is, is_retryable_error returns True, the non-retryable exception is raised, tenacity doesn't match it, and the exception propagates uncaught — crashing the caller instead of returning the result with error in state. The old code used a flat isinstance(err, err_type) check that was consistent with tenacity's guard.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit e710e01. Configure here.


Summary
is_retryableto serializedErrorInfopayloads.InfraErrororInvalidModelResponseError, including subclasses such as sandbox/tunnel/browser sandbox errors and empty model responses.is_retryablewhen reserializing existingErrorInfomappings throughstate_to_output.maybe_retryto useis_retryable_error(error: BaseException | ErrorInfo, ...)for both live exceptions and serializedErrorInfo.is_retryableistrue.Coordinated Consumer
This is the producer-side change for PrimeIntellect-ai/prime-rl#2579. prime-rl receives rollout failures after verifiers serializes them into
ErrorInfodictionaries, so it cannot safely recover Python subclass relationships from error-name strings.With this field, verifiers remains the source of truth for retry semantics and prime-rl can consume the serialized retry decision directly. prime-rl intentionally does not enumerate subclass names or fall back to string parsing; missing
is_retryable=trueis treated as terminal/no-reschedule.Why
Class-name strings preserve logging detail, but they do not preserve Python subclass relationships like
SandboxError <: InfraError. Serializing the retryability decision keeps downstream consumers aligned with verifiersmaybe_retrysemantics and prevents plain model/provider errors from being mistaken for retryable infra failures.Existing serialized
ErrorInforecords remain valid becauseis_retryableis optional. The compatibility behavior is intentionally conservative: older serialized errors without the flag do not retry across the serialized boundary.Test Plan
uv run pytest tests/test_error_chain.pyuv run ty check verifiers/utils/error_utils.py verifiers/utils/async_utils.pyuv run ruff check verifiers/utils/error_utils.py verifiers/utils/async_utils.py tests/test_error_chain.pyuv run ruff format --check verifiers/utils/error_utils.py verifiers/utils/async_utils.py tests/test_error_chain.pyNote
Medium Risk
Changes retry behavior across the serialized-error boundary: previously some serialized errors could retry via class-name string parsing; now retries for serialized errors only happen when
ErrorInfo.is_retryableis set, which could alter rollout scheduling/resilience if producers/consumers are mismatched.Overview
Adds an optional
is_retryableflag to serializedErrorInfoand populates it using verifiers’ default retry policy (error chain containsInfraErrororInvalidModelResponseError).Updates
maybe_retryto use a newis_retryable_error()helper for both live exceptions and serializedErrorInfo, and removes the prior string-based reconstruction path for serialized errors (missingis_retryable=Trueis treated as non-retryable).state_to_outputnow preservesis_retryablewhen re-serializing existingErrorInfomappings and otherwise emits errors via the sharederror_info()serializer.Extends tests to cover default retryability marking and the stricter behavior for serialized errors without the flag.
Reviewed by Cursor Bugbot for commit e710e01. Bugbot is set up for automated code reviews on this repo. Configure here.