Serialize retryable rollout error policy by rasdani · Pull Request #1427 · PrimeIntellect-ai/verifiers

rasdani · 2026-05-20T23:51:11Z

Summary

Add optional is_retryable to serialized ErrorInfo payloads.
Populate it from verifiers' default retry semantics: retry when the causal error chain contains InfraError or InvalidModelResponseError, including subclasses such as sandbox/tunnel/browser sandbox errors and empty model responses.
Preserve is_retryable when reserializing existing ErrorInfo mappings through state_to_output.
Update maybe_retry to use is_retryable_error(error: BaseException | ErrorInfo, ...) for both live exceptions and serialized ErrorInfo.
Remove the old serialized-error string reconstruction path. Serialized errors now retry only when is_retryable is true.
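
The summary above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the exception classes are local stand-ins for verifiers' real ones, `causal_chain` is a hypothetical helper standing in for `ErrorChain`, and the payload keys other than `is_retryable`, `error`, and `error_chain_repr` are not implied.

```python
from typing import Any


class InfraError(Exception):
    """Stand-in for verifiers' InfraError (assumed shape)."""


class InvalidModelResponseError(Exception):
    """Stand-in for verifiers' InvalidModelResponseError (assumed shape)."""


class SandboxError(InfraError):
    """Stand-in subclass, mirroring SandboxError <: InfraError."""


DEFAULT_RETRYABLE_ERROR_TYPES = (InfraError, InvalidModelResponseError)


def causal_chain(error: BaseException) -> list[BaseException]:
    """Hypothetical helper: walk __cause__ links, guarding against cycles."""
    chain: list[BaseException] = []
    seen: set[int] = set()
    current = error
    while current is not None and id(current) not in seen:
        seen.add(id(current))
        chain.append(current)
        current = current.__cause__
    return chain


def error_info(error: BaseException) -> dict[str, Any]:
    """Serialize an error and record the default retry decision alongside it."""
    chain = causal_chain(error)
    return {
        "error": type(error).__name__,
        "error_chain_repr": " -> ".join(type(e).__name__ for e in chain),
        "is_retryable": any(
            isinstance(e, DEFAULT_RETRYABLE_ERROR_TYPES) for e in chain
        ),
    }


def is_retryable_error(error, error_types=DEFAULT_RETRYABLE_ERROR_TYPES) -> bool:
    """Live exceptions are classified by type; serialized ones by the flag only."""
    if isinstance(error, BaseException):
        return any(isinstance(e, tuple(error_types)) for e in causal_chain(error))
    return error.get("is_retryable") is True


# A non-retryable wrapper whose cause is a SandboxError.
try:
    try:
        raise SandboxError("tunnel closed")
    except SandboxError as inner:
        raise RuntimeError("rollout failed") from inner
except RuntimeError as exc:
    wrapped = exc

info = error_info(wrapped)
```

In this sketch `info["is_retryable"]` is true because the causal chain contains a `SandboxError`, while a payload written before the flag existed (no `is_retryable` key) is classified as non-retryable.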

Coordinated Consumer

This is the producer-side change for PrimeIntellect-ai/prime-rl#2579. prime-rl receives rollout failures after verifiers serializes them into ErrorInfo dictionaries, so it cannot safely recover Python subclass relationships from error-name strings.

With this field, verifiers remains the source of truth for retry semantics and prime-rl can consume the serialized retry decision directly. prime-rl intentionally does not enumerate subclass names or fall back to string parsing; an error whose is_retryable is missing or not true is treated as terminal and is not rescheduled.

Why

Class-name strings preserve logging detail, but they do not preserve Python subclass relationships like SandboxError <: InfraError. Serializing the retryability decision keeps downstream consumers aligned with verifiers maybe_retry semantics and prevents plain model/provider errors from being mistaken for retryable infra failures.
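
A small illustration of the point above, using local stand-in classes rather than verifiers' real ones: once only the class name crosses the boundary, the subclass relationship is no longer available to the consumer.

```python
class InfraError(Exception):
    """Stand-in base class (assumed shape)."""


class SandboxError(InfraError):
    """Stand-in subclass, mirroring SandboxError <: InfraError."""


err = SandboxError("sandbox died")

# What a consumer holding the live object can check.
live_match = isinstance(err, InfraError)

# What a consumer holding only the serialized class name can check.
serialized_name = type(err).__name__
name_match = serialized_name == "InfraError"
```

Here `live_match` is true and `name_match` is false, so a consumer would have to enumerate every subclass name to recover the same decision from strings.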

Existing serialized ErrorInfo records remain valid because is_retryable is optional. The compatibility behavior is intentionally conservative: older serialized errors without the flag do not retry across the serialized boundary.

Test Plan

uv run pytest tests/test_error_chain.py
uv run ty check verifiers/utils/error_utils.py verifiers/utils/async_utils.py
uv run ruff check verifiers/utils/error_utils.py verifiers/utils/async_utils.py tests/test_error_chain.py
uv run ruff format --check verifiers/utils/error_utils.py verifiers/utils/async_utils.py tests/test_error_chain.py

Note

Medium Risk
Changes retry behavior across the serialized-error boundary: previously some serialized errors could retry via class-name string parsing; now retries for serialized errors only happen when ErrorInfo.is_retryable is set, which could alter rollout scheduling/resilience if producers/consumers are mismatched.

Overview
Adds an optional is_retryable flag to serialized ErrorInfo and populates it using verifiers’ default retry policy (error chain contains InfraError or InvalidModelResponseError).

Updates maybe_retry to use a new is_retryable_error() helper for both live exceptions and serialized ErrorInfo, and removes the prior string-based reconstruction path for serialized errors (a serialized error without is_retryable=True is treated as non-retryable). state_to_output now preserves is_retryable when re-serializing existing ErrorInfo mappings and otherwise emits errors via the shared error_info() serializer.

Extends tests to cover default retryability marking and the stricter behavior for serialized errors without the flag.

^{Reviewed by Cursor Bugbot for commit e710e01.}

cursor · 2026-05-21T01:04:47Z

+        error_chain = ErrorChain(error)
+        return any(error_type in error_chain for error_type in error_types)
+
+    return error.get("is_retryable") is True


Serialized path silently ignores error_types parameter

Medium Severity

is_retryable_error accepts an error_types parameter but silently ignores it for serialized ErrorInfo inputs, only checking the is_retryable flag. In maybe_retry, error_types is a customizable parameter — if a caller narrows it (e.g., only InfraError), serialized InvalidModelResponseError entries with is_retryable=True would still be retried. Conversely, if a caller adds a new retryable type, serialized errors of that type won't be retried because error_info() only uses DEFAULT_RETRYABLE_ERROR_TYPES. The parameter creates a false expectation of symmetry between live and serialized error handling.

Additional Locations (1)

verifiers/utils/async_utils.py#L171-L178

^{Reviewed by Cursor Bugbot for commit 64b1382.}
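
The asymmetry described in this comment can be reproduced with a reduced model. The sketch below is an assumption-laden simplification: the classes are stand-ins, `RateLimitError` is a hypothetical custom type, and the live branch uses a flat `isinstance` instead of `ErrorChain` to keep the example short.

```python
class InfraError(Exception):
    """Stand-in for verifiers' InfraError."""


class InvalidModelResponseError(Exception):
    """Stand-in for verifiers' InvalidModelResponseError."""


class RateLimitError(Exception):
    """Hypothetical custom type a caller might add to error_types."""


DEFAULT_RETRYABLE_ERROR_TYPES = (InfraError, InvalidModelResponseError)


def error_info(error: BaseException) -> dict:
    # The flag is always computed from the default policy.
    return {
        "error": type(error).__name__,
        "is_retryable": isinstance(error, DEFAULT_RETRYABLE_ERROR_TYPES),
    }


def is_retryable_error(error, error_types=DEFAULT_RETRYABLE_ERROR_TYPES) -> bool:
    if isinstance(error, BaseException):
        return isinstance(error, tuple(error_types))
    # error_types is not consulted on the serialized path.
    return error.get("is_retryable") is True


# Case 1: the caller narrows the policy to InfraError only.
narrowed = (InfraError,)
empty_response = InvalidModelResponseError("empty response")
narrowed_live = is_retryable_error(empty_response, narrowed)
narrowed_serialized = is_retryable_error(error_info(empty_response), narrowed)

# Case 2: the caller widens the policy with a custom type.
widened = DEFAULT_RETRYABLE_ERROR_TYPES + (RateLimitError,)
rate_limited = RateLimitError("429")
widened_live = is_retryable_error(rate_limited, widened)
widened_serialized = is_retryable_error(error_info(rate_limited), widened)
```

In both cases the live and serialized answers disagree: the narrowed policy still retries the serialized error, and the widened policy never retries the serialized custom error.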

cursor · 2026-05-21T01:04:47Z

+        error_chain = ErrorChain(error)
+        return any(error_type in error_chain for error_type in error_types)
+
+    return error.get("is_retryable") is True


Missing type guard crashes on unexpected error types

Low Severity

is_retryable_error falls through to error.get("is_retryable") for any non-BaseException input. If state["error"] is set to an unexpected type (e.g., a string from apply_internal_state_patch in sandbox programs), this crashes with AttributeError. The old code guarded with isinstance(err, Mapping) before attempting dict access. The PR description mentions adding an is_error_info runtime guard, but no such guard was implemented.

Additional Locations (1)

verifiers/utils/async_utils.py#L171-L174

^{Reviewed by Cursor Bugbot for commit 64b1382.}
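
One way to address this, sketched under the assumption that any non-mapping value should simply count as non-retryable, is to check for `Mapping` before calling `.get`. The function name `is_retryable_serialized` is hypothetical and not part of this PR.

```python
from collections.abc import Mapping


def is_retryable_serialized(error: object) -> bool:
    """Return True only for mapping payloads whose is_retryable is exactly True."""
    if not isinstance(error, Mapping):
        # Strings, None, and other unexpected values are non-retryable
        # instead of raising AttributeError on .get().
        return False
    return error.get("is_retryable") is True


from_string = is_retryable_serialized("SandboxError: tunnel closed")
from_none = is_retryable_serialized(None)
from_flagged = is_retryable_serialized({"error": "SandboxError", "is_retryable": True})
from_truthy = is_retryable_serialized({"error": "SandboxError", "is_retryable": "true"})
```

Only the payload with the boolean flag is retryable; the string `"true"` is rejected because the check uses identity with `True`, matching the `is True` comparison in the diff.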

rasdani · 2026-05-21T01:09:07Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e710e01e62


chatgpt-codex-connector · 2026-05-21T01:12:28Z

+        error_chain = ErrorChain(error)
+        return any(error_type in error_chain for error_type in error_types)
+
+    return error.get("is_retryable") is True


Honor caller retry types for serialized errors

Serialized errors now ignore the error_types argument and only check is_retryable, so maybe_retry(..., error_types=...) can retry the wrong failures or skip requested ones once errors cross the serialization boundary. For example, an InfraError serialized with is_retryable=true will still be retried even if the caller only configured InvalidModelResponseError, while custom retry classes that are not in the default policy will never retry after serialization.


chatgpt-codex-connector · 2026-05-21T01:12:29Z

+        if isinstance(err, BaseException):
+            raise err
+        detail = str(err.get("error_chain_repr") or err.get("error") or "")
+        raise error_types[0](detail)


Guard empty retry type tuple before re-raising

When a serialized error is marked retryable, reraise_error always raises error_types[0]; if a caller passes error_types=() (a valid tuple value) this path throws IndexError instead of returning the rollout state. That turns a recoverable rollout-error result into an unexpected crash in retry handling.
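
A possible guard, shown as a sketch: `reraise_error_guarded` is a hypothetical variant of the function in the diff, and returning `None` to mean "nothing to raise, hand the rollout state back" is an assumption about what the caller would want.

```python
class InfraError(Exception):
    """Stand-in for verifiers' InfraError."""


def reraise_error_guarded(err, error_types: tuple) -> None:
    """Re-raise a live or serialized error, tolerating an empty error_types tuple."""
    if isinstance(err, BaseException):
        raise err
    if not error_types:
        # No exception class to instantiate, so avoid error_types[0].
        return None
    detail = str(err.get("error_chain_repr") or err.get("error") or "")
    raise error_types[0](detail)


# With an empty tuple the call returns instead of raising IndexError.
empty_result = reraise_error_guarded({"error": "InfraError"}, ())

# With a non-empty tuple the first type is raised with the serialized detail.
try:
    reraise_error_guarded({"error_chain_repr": "RuntimeError -> InfraError"}, (InfraError,))
    raised_detail = None
except InfraError as exc:
    raised_detail = str(exc)
```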


cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 3 total unresolved issues (including 2 from previous reviews).


^{Reviewed by Cursor Bugbot for commit e710e01.}

cursor · 2026-05-21T01:15:23Z

+    """Return whether a live or serialized error matches verifiers' retry policy."""
+    if isinstance(error, BaseException):
+        error_chain = ErrorChain(error)
+        return any(error_type in error_chain for error_type in error_types)


ErrorChain check mismatches tenacity's top-level-only retry guard

Medium Severity

is_retryable_error for live BaseException traverses the full __cause__ chain via ErrorChain, but reraise_error then re-raises the original top-level exception. Tenacity's retry_if_exception_type(error_types) only checks isinstance on the raised exception itself. If the top-level exception is not a retryable type but a chained cause is, is_retryable_error returns True, the non-retryable exception is raised, tenacity doesn't match it, and the exception propagates uncaught — crashing the caller instead of returning the result with error in state. The old code used a flat isinstance(err, err_type) check that was consistent with tenacity's guard.

Additional Locations (1)

verifiers/utils/async_utils.py#L171-L178

^{Reviewed by Cursor Bugbot for commit e710e01.}
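
The mismatch can be seen without running tenacity at all. In the sketch below, `top_level_match` is a stand-in for the flat `isinstance` check the comment attributes to `retry_if_exception_type`, and `in_cause_chain` is a hypothetical stand-in for the `ErrorChain` membership test; neither is the real implementation.

```python
class InfraError(Exception):
    """Stand-in for verifiers' InfraError."""


def in_cause_chain(error: BaseException, error_types: tuple) -> bool:
    """Chain-style check: match the error or anything reachable via __cause__."""
    seen: set[int] = set()
    current = error
    while current is not None and id(current) not in seen:
        if isinstance(current, error_types):
            return True
        seen.add(id(current))
        current = current.__cause__
    return False


def top_level_match(error: BaseException, error_types: tuple) -> bool:
    """Flat check: only the raised exception itself is inspected."""
    return isinstance(error, error_types)


# A non-retryable top-level exception caused by a retryable one.
try:
    try:
        raise InfraError("sandbox died")
    except InfraError as inner:
        raise RuntimeError("rollout failed") from inner
except RuntimeError as exc:
    wrapped = exc

chain_says_retry = in_cause_chain(wrapped, (InfraError,))
top_level_says_retry = top_level_match(wrapped, (InfraError,))
```

The chain check says "retry" while the flat check says "do not retry", which is the window in which the re-raised `RuntimeError` would escape the retry wrapper.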

fix: serialize retryable error policy

69d5eed

rasdani mentioned this pull request May 20, 2026

Align rollout error rescheduling with verifiers PrimeIntellect-ai/prime-rl#2579

Open

cursor Bot reviewed May 20, 2026

Comment thread verifiers/utils/error_utils.py Outdated

cursor Bot reviewed May 21, 2026

Comment thread verifiers/utils/async_utils.py Outdated

rasdani added 6 commits May 21, 2026 02:42

refactor: classify serialized retryable errors

2e9919d

refactor: make serialized error guard explicit

cc2ba1d

refactor: avoid TypeGuard for error info guard

4c19dc8

refactor: fold serialized retry policy into is_retryable_error

a8ddd2e

refactor: simplify serialized retry check

d1e04f4

refactor: inline serialized retry exception

e710e01

cursor Bot reviewed May 21, 2026

rasdani force-pushed the fix/retryable-error-info branch from 64b1382 to e710e01 May 21, 2026 01:07

chatgpt-codex-connector Bot reviewed May 21, 2026

cursor Bot reviewed May 21, 2026

rasdani requested a review from mikasenghaas May 21, 2026 02:07

rasdani wants to merge 7 commits into PrimeIntellect-ai:main from rasdani:fix/retryable-error-info
