Skip to content

Serialize retryable rollout error policy#1427

Open
rasdani wants to merge 7 commits into
PrimeIntellect-ai:mainfrom
rasdani:fix/retryable-error-info
Open

Serialize retryable rollout error policy#1427
rasdani wants to merge 7 commits into
PrimeIntellect-ai:mainfrom
rasdani:fix/retryable-error-info

Conversation

@rasdani
Copy link
Copy Markdown
Contributor

@rasdani rasdani commented May 20, 2026

Summary

  • Add optional is_retryable to serialized ErrorInfo payloads.
  • Populate it from verifiers' default retry semantics: retry when the causal error chain contains InfraError or InvalidModelResponseError, including subclasses such as sandbox/tunnel/browser sandbox errors and empty model responses.
  • Preserve is_retryable when reserializing existing ErrorInfo mappings through state_to_output.
  • Update maybe_retry to use is_retryable_error(error: BaseException | ErrorInfo, ...) for both live exceptions and serialized ErrorInfo.
  • Remove the old serialized-error string reconstruction path. Serialized errors now retry only when is_retryable is true.

Coordinated Consumer

This is the producer-side change for PrimeIntellect-ai/prime-rl#2579. prime-rl receives rollout failures after verifiers serializes them into ErrorInfo dictionaries, so it cannot safely recover Python subclass relationships from error-name strings.

With this field, verifiers remains the source of truth for retry semantics and prime-rl can consume the serialized retry decision directly. prime-rl intentionally does not enumerate subclass names or fall back to string parsing; missing is_retryable=true is treated as terminal/no-reschedule.

Why

Class-name strings preserve logging detail, but they do not preserve Python subclass relationships like SandboxError <: InfraError. Serializing the retryability decision keeps downstream consumers aligned with verifiers maybe_retry semantics and prevents plain model/provider errors from being mistaken for retryable infra failures.

Existing serialized ErrorInfo records remain valid because is_retryable is optional. The compatibility behavior is intentionally conservative: older serialized errors without the flag do not retry across the serialized boundary.

Test Plan

  • uv run pytest tests/test_error_chain.py
  • uv run ty check verifiers/utils/error_utils.py verifiers/utils/async_utils.py
  • uv run ruff check verifiers/utils/error_utils.py verifiers/utils/async_utils.py tests/test_error_chain.py
  • uv run ruff format --check verifiers/utils/error_utils.py verifiers/utils/async_utils.py tests/test_error_chain.py

Note

Medium Risk
Changes retry behavior across the serialized-error boundary: previously some serialized errors could retry via class-name string parsing; now retries for serialized errors only happen when ErrorInfo.is_retryable is set, which could alter rollout scheduling/resilience if producers/consumers are mismatched.

Overview
Adds an optional is_retryable flag to serialized ErrorInfo and populates it using verifiers’ default retry policy (error chain contains InfraError or InvalidModelResponseError).

Updates maybe_retry to use a new is_retryable_error() helper for both live exceptions and serialized ErrorInfo, and removes the prior string-based reconstruction path for serialized errors (missing is_retryable=True is treated as non-retryable). state_to_output now preserves is_retryable when re-serializing existing ErrorInfo mappings and otherwise emits errors via the shared error_info() serializer.

Extends tests to cover default retryability marking and the stricter behavior for serialized errors without the flag.

Reviewed by Cursor Bugbot for commit e710e01. Bugbot is set up for automated code reviews on this repo. Configure here.

Comment thread verifiers/utils/error_utils.py Outdated
Comment thread verifiers/utils/async_utils.py Outdated
error_chain = ErrorChain(error)
return any(error_type in error_chain for error_type in error_types)

return error.get("is_retryable") is True
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Serialized path silently ignores error_types parameter

Medium Severity

is_retryable_error accepts an error_types parameter but silently ignores it for serialized ErrorInfo inputs, only checking the is_retryable flag. In maybe_retry, error_types is a customizable parameter — if a caller narrows it (e.g., only InfraError), serialized InvalidModelResponseError entries with is_retryable=True would still be retried. Conversely, if a caller adds a new retryable type, serialized errors of that type won't be retried because error_info() only uses DEFAULT_RETRYABLE_ERROR_TYPES. The parameter creates a false expectation of symmetry between live and serialized error handling.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 64b1382. Configure here.

error_chain = ErrorChain(error)
return any(error_type in error_chain for error_type in error_types)

return error.get("is_retryable") is True
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing type guard crashes on unexpected error types

Low Severity

is_retryable_error falls through to error.get("is_retryable") for any non-BaseException input. If state["error"] is set to an unexpected type (e.g., a string from apply_internal_state_patch in sandbox programs), this crashes with AttributeError. The old code guarded with isinstance(err, Mapping) before attempting dict access. The PR description mentions adding an is_error_info runtime guard, but no such guard was implemented.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 64b1382. Configure here.

@rasdani rasdani force-pushed the fix/retryable-error-info branch from 64b1382 to e710e01 Compare May 21, 2026 01:07
@rasdani
Copy link
Copy Markdown
Contributor Author

rasdani commented May 21, 2026

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e710e01e62

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

error_chain = ErrorChain(error)
return any(error_type in error_chain for error_type in error_types)

return error.get("is_retryable") is True
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Honor caller retry types for serialized errors

Serialized errors now ignore the error_types argument and only check is_retryable, so maybe_retry(..., error_types=...) can retry the wrong failures or skip requested ones once errors cross the serialization boundary. For example, an InfraError serialized with is_retryable=true will still be retried even if the caller only configured InvalidModelResponseError, while custom retry classes that are not in the default policy will never retry after serialization.

Useful? React with 👍 / 👎.

if isinstance(err, BaseException):
raise err
detail = str(err.get("error_chain_repr") or err.get("error") or "")
raise error_types[0](detail)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Guard empty retry type tuple before re-raising

When a serialized error is marked retryable, reraise_error always raises error_types[0]; if a caller passes error_types=() (a valid tuple value) this path throws IndexError instead of returning the rollout state. That turns a recoverable rollout-error result into an unexpected crash in retry handling.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 3 total unresolved issues (including 2 from previous reviews).

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit e710e01. Configure here.

"""Return whether a live or serialized error matches verifiers' retry policy."""
if isinstance(error, BaseException):
error_chain = ErrorChain(error)
return any(error_type in error_chain for error_type in error_types)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ErrorChain check mismatches tenacity's top-level-only retry guard

Medium Severity

is_retryable_error for live BaseException traverses the full __cause__ chain via ErrorChain, but reraise_error then re-raises the original top-level exception. Tenacity's retry_if_exception_type(error_types) only checks isinstance on the raised exception itself. If the top-level exception is not a retryable type but a chained cause is, is_retryable_error returns True, the non-retryable exception is raised, tenacity doesn't match it, and the exception propagates uncaught — crashing the caller instead of returning the result with error in state. The old code used a flat isinstance(err, err_type) check that was consistent with tenacity's guard.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit e710e01. Configure here.

@rasdani rasdani requested a review from mikasenghaas May 21, 2026 02:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant