Skip SWE sandbox scoring for errored rollouts #1412
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit ed56ed2.
Approvability
Verdict: Needs human review
This PR modifies runtime behavior by broadening error conditions that skip SWE sandbox scoring. Additionally, there are unresolved review comments questioning design decisions on files not visible in the diff, suggesting outstanding concerns that need human attention.
@codex review
Codex Review: Didn't find any major issues. Another round soon, please!
Narrowed this PR down to only the SWE taskset error guards. The earlier version also changed rollout/group cleanup and cleanup error-handling semantics. Those broader lifecycle changes are intentionally removed from this PR so the diff only prevents deferred SWE sandbox scoring when a rollout state already has
Force-pushed from 291566e to 472fb00.

Summary
error, not only vf.InfraError.
Why
SWE-style tasksets use keep_sandbox_for_scoring=True, so successful rollouts keep the sandbox around for deferred test-based scoring. When the rollout already failed, running the sandbox test suite is not useful and can keep sandboxes live longer than necessary.
The previous guard only skipped vf.InfraError. That meant non-infra rollout errors such as model/provider failures could still enter sandbox scoring. In particular, a vf.ModelError("No available workers...") should not run SWE tests after the rollout has already errored.
This PR is intentionally limited to the rubric error guards. Broader cleanup lifecycle hardening is left out of this change.
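For readers skimming the thread, the difference between the two guards can be sketched as below. This is a minimal illustration, not the code in the diff: InfraError and ModelError are local stand-ins for vf.InfraError and vf.ModelError, the helper names are hypothetical, and the state is assumed to be a plain dict with an optional "error" entry.

```python
from typing import Any


class InfraError(Exception):
    """Local stand-in for vf.InfraError (assumption, not the real class)."""


class ModelError(Exception):
    """Local stand-in for vf.ModelError (assumption, not the real class)."""


def skips_scoring_old(state: dict[str, Any]) -> bool:
    # Previous behavior as described in the PR: only infra errors skip scoring.
    return isinstance(state.get("error"), InfraError)


def skips_scoring_new(state: dict[str, Any]) -> bool:
    # Broadened behavior: any recorded error skips deferred sandbox scoring.
    return state.get("error") is not None


model_failure = {"error": ModelError("No available workers...")}
infra_failure = {"error": InfraError("sandbox unreachable")}
clean_rollout = {"error": None}
```

Under these assumptions, the model failure is the only case where the two guards disagree: the old guard lets it through to sandbox scoring, while the new guard skips it.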
Tests
Push hooks also passed: ruff check, ruff format, Semgrep v1 policy, AGENTS sync, and ty ci parity.
Note
Skip SWE sandbox scoring for any errored rollout, not just vf.InfraError
All seven SWE rubric solved methods previously returned 0.0 early only when state['error'] was an instance of vf.InfraError. This change broadens the guard to return 0.0 for any non-None error, skipping test execution for all error types.
Behavioral Change: rollouts with non-InfraError errors now score 0.0 instead of proceeding to test evaluation.
Macroscope summarized 472fb00.
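The behavioral change described in this note could be checked with a small fake test runner, roughly as follows. This is a sketch under stated assumptions: make_solved and fake_run_tests are hypothetical names, the real rubric methods may be async and take different arguments, and RuntimeError stands in for whatever error type the rollout recorded.

```python
from typing import Any, Callable


def make_solved(run_tests: Callable[[dict[str, Any]], bool]) -> Callable[[dict[str, Any]], float]:
    """Build a hypothetical solved() that wraps a sandbox test runner."""

    def solved(state: dict[str, Any]) -> float:
        # Broadened guard: any recorded error short-circuits to 0.0.
        if state.get("error") is not None:
            return 0.0
        # Only error-free rollouts reach the (expensive) sandbox test run.
        return 1.0 if run_tests(state) else 0.0

    return solved


test_runs: list[str] = []


def fake_run_tests(state: dict[str, Any]) -> bool:
    # Record which rollouts actually triggered a sandbox test run.
    test_runs.append(state["id"])
    return True


solved = make_solved(fake_run_tests)
ok_score = solved({"id": "ok", "error": None})
errored_score = solved({"id": "errored", "error": RuntimeError("No available workers...")})
```

In this sketch only the "ok" rollout appears in test_runs, which is the property the PR is after: the errored rollout scores 0.0 without the test suite ever being invoked.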
Note
Low Risk
Low risk guard change: expands the existing early-return condition so SWE taskset rubrics skip running sandbox tests whenever the rollout state contains any error, reducing wasted sandbox work without altering test execution logic for successful rollouts.
Overview
Skips deferred SWE sandbox scoring for errored rollouts. Across the SWE-style rubrics (MultiSWE, OpenSWE, R2E-Gym, SWE-bench, SWE-Lego, SWE-rebench-V2, and SWE-Smith), the solved guard now returns 0.0 whenever state["error"] is set (instead of only when it is a vf.InfraError). This prevents unnecessary test runs and avoids keeping sandboxes alive longer than needed after already-failed rollouts.
Reviewed by Cursor Bugbot for commit 472fb00. Bugbot is set up for automated code reviews on this repo.