
Skip SWE sandbox scoring for errored rollouts #1412

Merged
rasdani merged 2 commits into main from fix/sandbox-rollout-cleanup
May 21, 2026

Contributor

@rasdani rasdani commented May 19, 2026

Summary

  • Skip SWE taskset sandbox scoring when a rollout state already has any error, not only vf.InfraError.
  • Apply the guard consistently across R2E, SWE-bench, MultiSWE, OpenSWE, SWE-Lego, SWE-rebench, and SWESmith tasksets.

Why

SWE-style tasksets use keep_sandbox_for_scoring=True, so successful rollouts keep the sandbox around for deferred test-based scoring. When the rollout already failed, running the sandbox test suite is not useful and can keep sandboxes live longer than necessary.

The previous guard only skipped vf.InfraError. That meant non-infra rollout errors such as model/provider failures could still enter sandbox scoring. In particular, a vf.ModelError("No available workers...") should not run SWE tests after the rollout has already errored.

This PR is intentionally limited to the rubric error guards. Broader cleanup lifecycle hardening is left out of this change.
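
A minimal sketch of the guard change described above, not the actual rubric code. The exception classes are hypothetical stand-ins for vf.InfraError and vf.ModelError so the example runs without verifiers installed, and `skip_scoring_before`, `skip_scoring_after`, and `run_tests` are illustrative names that do not appear in the repository.

```python
from typing import Any, Callable


# Hypothetical stand-ins for vf.InfraError and vf.ModelError (assumption:
# both are ordinary exception types stored under state["error"]).
class InfraError(Exception):
    pass


class ModelError(Exception):
    pass


State = dict[str, Any]


def skip_scoring_before(state: State) -> bool:
    """Previous guard: skip sandbox scoring only for infra errors."""
    return isinstance(state.get("error"), InfraError)


def skip_scoring_after(state: State) -> bool:
    """New guard: skip sandbox scoring whenever any error is set."""
    return state.get("error") is not None


def solved(state: State, run_tests: Callable[[State], float]) -> float:
    """Return 0.0 for errored rollouts; otherwise defer to the test run.

    `run_tests` is a placeholder for the deferred sandbox test suite.
    """
    if skip_scoring_after(state):
        return 0.0
    return run_tests(state)
```

With a state such as `{"error": ModelError("No available workers...")}`, the previous guard would let the rollout proceed to the sandbox tests, while the new guard returns 0.0 without calling `run_tests`.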

Tests

uv run ruff check verifiers/envs/experimental/composable/tasksets/swe/multi_swe.py verifiers/envs/experimental/composable/tasksets/swe/openswe.py verifiers/envs/experimental/composable/tasksets/swe/r2e_gym.py verifiers/envs/experimental/composable/tasksets/swe/swe_bench.py verifiers/envs/experimental/composable/tasksets/swe/swe_lego.py verifiers/envs/experimental/composable/tasksets/swe/swe_rebench_v2.py verifiers/envs/experimental/composable/tasksets/swe/swe_smith.py
uv run ruff format --check verifiers/envs/experimental/composable/tasksets/swe/multi_swe.py verifiers/envs/experimental/composable/tasksets/swe/openswe.py verifiers/envs/experimental/composable/tasksets/swe/r2e_gym.py verifiers/envs/experimental/composable/tasksets/swe/swe_bench.py verifiers/envs/experimental/composable/tasksets/swe/swe_lego.py verifiers/envs/experimental/composable/tasksets/swe/swe_rebench_v2.py verifiers/envs/experimental/composable/tasksets/swe/swe_smith.py

Push hooks also passed: ruff check, ruff format, Semgrep v1 policy, AGENTS sync, and ty ci parity.

Note

Skip SWE sandbox scoring for any errored rollout, not just vf.InfraError

All seven SWE rubric solved methods previously returned 0.0 early only when state['error'] was an instance of vf.InfraError. This change broadens the guard to return 0.0 for any non-None error, skipping test execution for all error types. Behavioral Change: rollouts with non-InfraError errors now score 0.0 instead of proceeding to test evaluation.

Macroscope summarized 472fb00.


Note

Low Risk
Low risk guard change: expands the existing early-return condition so SWE taskset rubrics skip running sandbox tests whenever the rollout state contains any error, reducing wasted sandbox work without altering test execution logic for successful rollouts.

Overview
Skips deferred SWE sandbox scoring for errored rollouts. Across the SWE-style rubrics (MultiSWE, OpenSWE, R2E-Gym, SWE-bench, SWE-Lego, SWE-rebench-V2, and SWE-Smith), the solved guard now returns 0.0 whenever state["error"] is set (instead of only when it is a vf.InfraError). This prevents unnecessary test runs and avoids keeping sandboxes alive longer than needed after already-failed rollouts.

Reviewed by Cursor Bugbot for commit 472fb00. Bugbot is set up for automated code reviews on this repo. Configure here.


@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit ed56ed2. Configure here.

Comment thread verifiers/envs/experimental/composable/composable_env.py Outdated

macroscopeapp Bot commented May 19, 2026

Approvability

Verdict: Needs human review

This PR modifies runtime behavior by broadening error conditions that skip SWE sandbox scoring. Additionally, there are unresolved review comments questioning design decisions on files not visible in the diff, suggesting outstanding concerns that need human attention.

No code changes detected at 472fb00. Prior analysis still applies.


@rasdani rasdani marked this pull request as draft May 19, 2026 01:24
@rasdani rasdani marked this pull request as ready for review May 19, 2026 16:28
Contributor Author

rasdani commented May 19, 2026

@codex review

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Another round soon, please!


Comment thread verifiers/envs/environment.py Outdated
Comment thread verifiers/envs/environment.py Outdated
Comment thread verifiers/envs/environment.py Outdated
Comment thread verifiers/envs/experimental/composable/composable_env.py
@rasdani rasdani changed the title from "Fix sandbox cleanup on failed rollout groups" to "Skip SWE sandbox scoring for errored rollouts" May 21, 2026
Contributor Author

rasdani commented May 21, 2026

Narrowed this PR down to only the SWE taskset error guards.

The earlier version also changed rollout/group cleanup and cleanup error-handling semantics. Those broader lifecycle changes are intentionally removed from this PR so the diff only prevents deferred SWE sandbox scoring when a rollout state already has error set.

@rasdani rasdani force-pushed the fix/sandbox-rollout-cleanup branch from 291566e to 472fb00 May 21, 2026 17:07
@rasdani rasdani requested review from mikasenghaas and samsja May 21, 2026 17:13
@rasdani rasdani merged commit a3b5733 into main May 21, 2026
15 checks passed

3 participants