Skip SWE sandbox scoring for errored rollouts by rasdani · Pull Request #1412 · PrimeIntellect-ai/verifiers

rasdani · 2026-05-19T01:02:51Z

Summary

Skip SWE taskset sandbox scoring when a rollout state already has any error, not only vf.InfraError.
Apply the guard consistently across R2E, SWE-bench, MultiSWE, OpenSWE, SWE-Lego, SWE-rebench, and SWESmith tasksets.

Why

SWE-style tasksets use keep_sandbox_for_scoring=True, so successful rollouts keep the sandbox around for deferred test-based scoring. When the rollout already failed, running the sandbox test suite is not useful and can keep sandboxes live longer than necessary.

The previous guard only skipped vf.InfraError. That meant non-infra rollout errors such as model/provider failures could still enter sandbox scoring. In particular, a vf.ModelError("No available workers...") should not run SWE tests after the rollout has already errored.
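To make the difference concrete, here is a minimal sketch of the two guard conditions side by side. It is not the code from the tasksets: `InfraError` and `ModelError` are local stand-ins for `vf.InfraError` and `vf.ModelError`, the helper names are hypothetical, the "sandbox died" message is invented for illustration, and reading the error with `state.get("error")` is an assumption about how the state is accessed.

```python
# Stand-in exception types. The real ones are vf.InfraError and vf.ModelError
# in the verifiers package; their base classes are not assumed here.
class InfraError(Exception):
    pass


class ModelError(Exception):
    pass


def skips_scoring_before(state: dict) -> bool:
    """Previous guard (hypothetical helper): skip only for InfraError."""
    return isinstance(state.get("error"), InfraError)


def skips_scoring_after(state: dict) -> bool:
    """New guard (hypothetical helper): skip whenever any error is set."""
    return state.get("error") is not None


states = {
    "clean": {"error": None},
    "infra": {"error": InfraError("sandbox died")},  # illustrative message
    "model": {"error": ModelError("No available workers...")},
}

for name, state in states.items():
    print(name, skips_scoring_before(state), skips_scoring_after(state))
```

Only the `model` case changes: under the previous condition it would still have entered sandbox scoring, while under the new one it is skipped.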

This PR is intentionally limited to the rubric error guards. Broader cleanup lifecycle hardening is left out of this change.

Tests

uv run ruff check verifiers/envs/experimental/composable/tasksets/swe/multi_swe.py verifiers/envs/experimental/composable/tasksets/swe/openswe.py verifiers/envs/experimental/composable/tasksets/swe/r2e_gym.py verifiers/envs/experimental/composable/tasksets/swe/swe_bench.py verifiers/envs/experimental/composable/tasksets/swe/swe_lego.py verifiers/envs/experimental/composable/tasksets/swe/swe_rebench_v2.py verifiers/envs/experimental/composable/tasksets/swe/swe_smith.py
uv run ruff format --check verifiers/envs/experimental/composable/tasksets/swe/multi_swe.py verifiers/envs/experimental/composable/tasksets/swe/openswe.py verifiers/envs/experimental/composable/tasksets/swe/r2e_gym.py verifiers/envs/experimental/composable/tasksets/swe/swe_bench.py verifiers/envs/experimental/composable/tasksets/swe/swe_lego.py verifiers/envs/experimental/composable/tasksets/swe/swe_rebench_v2.py verifiers/envs/experimental/composable/tasksets/swe/swe_smith.py

Push hooks also passed: ruff check, ruff format, Semgrep v1 policy, AGENTS sync, and ty ci parity.

Note

Skip SWE sandbox scoring for any errored rollout, not just `vf.InfraError`

All seven SWE rubric solved methods previously returned 0.0 early only when state['error'] was an instance of vf.InfraError. This change broadens the guard to return 0.0 for any non-None error, skipping test execution for all error types. Behavioral Change: rollouts with non-InfraError errors now score 0.0 instead of proceeding to test evaluation.

Macroscope summarized 472fb00.

Note

Low Risk
Low risk guard change: expands the existing early-return condition so SWE taskset rubrics skip running sandbox tests whenever the rollout state contains any error, reducing wasted sandbox work without altering test execution logic for successful rollouts.

Overview
Skips deferred SWE sandbox scoring for errored rollouts. Across the SWE-style rubrics (MultiSWE, OpenSWE, R2E-Gym, SWE-bench, SWE-Lego, SWE-rebench-V2, and SWE-Smith), the solved guard now returns 0.0 whenever state["error"] is set (instead of only when it is a vf.InfraError). This prevents unnecessary test runs and avoids keeping sandboxes alive longer than necessary after already-failed rollouts.

Reviewed by Cursor Bugbot for commit 472fb00.
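The early return described in the notes above can be sketched as follows. This is an illustration under stated assumptions, not the rubric code from the repository: `FakeSandboxTests` is a hypothetical stand-in for the deferred sandbox test run, the `solved` signature is simplified, and returning 1.0 for a passing test run is assumed for the example.

```python
import asyncio


class FakeSandboxTests:
    """Hypothetical stand-in for the deferred sandbox test run."""

    def __init__(self, passed: bool) -> None:
        self.passed = passed
        self.calls = 0  # counts how often the test suite was actually run

    async def run(self) -> bool:
        self.calls += 1
        return self.passed


async def solved(state: dict, tests: FakeSandboxTests) -> float:
    # Guard from this PR: any error on the state short-circuits scoring,
    # so the sandbox test suite is never started for a failed rollout.
    if state.get("error") is not None:
        return 0.0
    # Assumed scoring for the sketch: 1.0 if the tests pass, else 0.0.
    return 1.0 if await tests.run() else 0.0


async def demo() -> tuple:
    tests = FakeSandboxTests(passed=True)
    errored_score = await solved({"error": RuntimeError("rollout failed")}, tests)
    calls_after_errored = tests.calls
    clean_score = await solved({"error": None}, tests)
    return errored_score, calls_after_errored, clean_score, tests.calls


result = asyncio.run(demo())
print(result)
```

In this sketch the errored rollout scores 0.0 without the test runner being called, while the clean rollout still goes through test evaluation as before.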

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit ed56ed2.

macroscopeapp · 2026-05-19T01:15:33Z

Approvability

Verdict: Needs human review

This PR modifies runtime behavior by broadening error conditions that skip SWE sandbox scoring. Additionally, there are unresolved review comments questioning design decisions on files not visible in the diff, suggesting outstanding concerns that need human attention.

No code changes detected at 472fb00. Prior analysis still applies.


rasdani · 2026-05-19T16:49:26Z

@codex review

chatgpt-codex-connector · 2026-05-19T16:55:04Z

Codex Review: Didn't find any major issues. Another round soon, please!


rasdani · 2026-05-21T17:05:33Z

Narrowed this PR down to only the SWE taskset error guards.

The earlier version also changed rollout/group cleanup and cleanup error-handling semantics. Those broader lifecycle changes are intentionally removed from this PR so the diff only prevents deferred SWE sandbox scoring when a rollout state already has error set.

Fix sandbox cleanup on failed rollouts

ed56ed2

cursor Bot reviewed May 19, 2026


Comment thread verifiers/envs/experimental/composable/composable_env.py Outdated

rasdani marked this pull request as draft May 19, 2026 01:24

rasdani marked this pull request as ready for review May 19, 2026 16:28

mikasenghaas reviewed May 19, 2026


Comment thread verifiers/envs/environment.py Outdated

Comment thread verifiers/envs/environment.py Outdated

Comment thread verifiers/envs/environment.py Outdated

Comment thread verifiers/envs/experimental/composable/composable_env.py

Narrow SWE error guard changes

472fb00

rasdani changed the title ~~Fix sandbox cleanup on failed rollout groups~~ Skip SWE sandbox scoring for errored rollouts May 21, 2026

rasdani force-pushed the fix/sandbox-rollout-cleanup branch from 291566e to 472fb00 May 21, 2026 17:07

rasdani requested review from mikasenghaas and samsja May 21, 2026 17:13

samsja approved these changes May 21, 2026


rasdani merged commit a3b5733 into main May 21, 2026
15 checks passed

rasdani merged 2 commits into main from fix/sandbox-rollout-cleanup

3 participants
