APR-262 Add Harbor verifiers support for v1 #1401

Open
xeophon wants to merge 1 commit into main from apr-262-harbor-v1-verifier-modes

@xeophon xeophon commented May 17, 2026

Summary

  • add Harbor v1 support for resolving [verifier].environment_mode and [verifier.environment], including shared/separate validation and separate-mode sandbox config
  • run separate Harbor verifiers in a fresh sandbox, reusing the shared test runner/archive upload helpers and uploading task tests only for the environment_mode = "separate" fallback without [verifier.environment]
  • transfer /logs/artifacts plus configured Harbor artifacts into verifier sandboxes, including explicit /logs/artifacts entries with exclude patterns and source-path handling for artifact destination mappings
  • align reward parsing and docs with current Harbor behavior by reading reward.json before reward.txt and documenting the v1 Harbor verifier lifecycle

Verification

  • uv run --frozen pytest tests/test_v1_harbor_cli.py tests/test_opencode_harbor.py -q
  • uv run --frozen ruff format .
  • uv run --frozen ruff check --fix .
  • uv run --frozen ty check verifiers
  • uv run --frozen --group policy semgrep --metrics=off --disable-version-check --config .semgrep/verifiers.yml --error --quiet
  • GitHub Actions passed: Ruff, Semgrep, Ty, CodeQL/analyze, Environments, Verifiers 3.10/3.11/3.12/3.13

Note

Medium Risk
Adds a new execution path that spins up and tears down separate verifier sandboxes, including cross-sandbox file transfer and reward parsing changes, which could affect Harbor scoring and sandbox lifecycle behavior. Risk is mitigated by added unit tests covering mode resolution and separate-sandbox execution.

Overview
Adds v1 Harbor support for running verifiers in shared vs separate sandbox modes based on task.toml ([verifier].environment_mode / [verifier.environment]), including validation and per-mode sandbox configuration in HarborTaskset.
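
The mode resolution described above could be sketched roughly as follows. This is an illustration only, not the code in this PR: `resolve_verifier_mode` and `VERIFIER_MODE_SHARED` are hypothetical names (only `VERIFIER_MODE_SEPARATE` appears in the diff excerpts), and treating `environment_mode = "shared"` combined with `[verifier.environment]` as the rejected combination is an assumption.

```python
from typing import Any, Mapping

VERIFIER_MODE_SHARED = "shared"  # assumed constant name
VERIFIER_MODE_SEPARATE = "separate"


def resolve_verifier_mode(task_config: Mapping[str, Any]) -> str:
    """Hypothetical helper: pick the verifier mode from a parsed task.toml."""
    verifier = task_config.get("verifier") or {}
    declared = verifier.get("environment_mode")
    has_environment = verifier.get("environment") is not None

    if declared is None:
        # No explicit mode: a [verifier.environment] table implies separate.
        return VERIFIER_MODE_SEPARATE if has_environment else VERIFIER_MODE_SHARED
    if declared not in (VERIFIER_MODE_SHARED, VERIFIER_MODE_SEPARATE):
        raise ValueError(f"unknown verifier environment_mode: {declared!r}")
    if declared == VERIFIER_MODE_SHARED and has_environment:
        # Assumption: this is the incompatible combination that gets rejected.
        raise ValueError("[verifier.environment] requires environment_mode = 'separate'")
    return declared
```

Under these assumptions, a task with no `[verifier]` table resolves to `shared`, and `environment_mode = "separate"` without `[verifier.environment]` stays valid, which matches the fallback discussed later in this thread.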

Updates harbor_reward to optionally create a fresh verifier sandbox, transfer /logs/artifacts plus configured artifacts into it (with exclude handling), optionally upload task tests/, run /tests/test.sh, and then delete the verifier sandbox. Reward parsing is aligned to prefer /logs/verifier/reward.json over reward.txt, and docs/tests are updated accordingly.

Reviewed by Cursor Bugbot for commit a063a69. Bugbot is set up for automated code reviews on this repo.

Note

Add Harbor verifier support for separate sandbox execution in v1 taskset

  • Adds verifier_mode support to HarborTaskset.task_row: defaults to shared; switches to separate when [verifier.environment] is present in task config; raises ValueError for incompatible combinations.
  • Adds run_separate_harbor_verifier to provision a fresh isolated sandbox, transfer task artifacts and inputs via transfer_harbor_verifier_inputs, run tests, and delete the sandbox on completion.
  • Adds transfer_harbor_verifier_inputs which packages specified artifacts plus /logs/artifacts from the agent sandbox into a gzipped tarball and extracts it inside the verifier sandbox.
  • Updates HARBOR_REWARD_COMMAND to prefer /logs/verifier/reward.json over reward.txt when both exist.
  • Behavioral Change: harbor_reward now routes execution through run_separate_harbor_verifier when verifier_mode is separate, stores harbor_verifier_sandbox_id in state, and passes [verifier].env to the test process.
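
As a rough local stand-in for the tarball transfer summarized above, the sketch below packs selected paths into an in-memory gzipped tarball, skips entries matching exclude patterns, and extracts the result elsewhere. `pack_artifacts` and `unpack_artifacts` are hypothetical names; the real `transfer_harbor_verifier_inputs` moves data between sandboxes, and its signature and matching rules are not shown in this thread.

```python
import fnmatch
import io
import tarfile
from pathlib import Path


def pack_artifacts(root: Path, sources: list[str], exclude: list[str]) -> bytes:
    """Pack files under `root` into an in-memory .tar.gz, skipping excluded paths."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for source in sources:
            base = root / source.lstrip("/")
            if not base.exists():
                continue  # this sketch silently skips missing artifacts
            if base.is_file():
                paths = [base]
            else:
                paths = sorted(p for p in base.rglob("*") if p.is_file())
            for path in paths:
                arcname = path.relative_to(root).as_posix()
                if any(fnmatch.fnmatch(arcname, pattern) for pattern in exclude):
                    continue
                archive.add(path, arcname=arcname)
    return buffer.getvalue()


def unpack_artifacts(payload: bytes, destination: Path) -> None:
    """Extract a payload produced by pack_artifacts into `destination`."""
    with tarfile.open(fileobj=io.BytesIO(payload), mode="r:gz") as archive:
        archive.extractall(destination)
```

Here `root` plays the role of the agent sandbox filesystem and `destination` the verifier sandbox; glob-style matching via `fnmatch` is an assumption about how exclude patterns behave.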

Macroscope summarized a063a69.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a91f41ae52

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +251 to +252
if verifier_mode == VERIFIER_MODE_SEPARATE and verifier_environment is None:
    verifier_sandbox = {**sandbox, "command_timeout": int(test_timeout)}

P2: Upload task tests when reusing the agent image

When a task sets [verifier] environment_mode = "separate" but does not provide [verifier.environment], this branch advertises support by creating a fresh verifier sandbox from the agent sandbox config. That fresh sandbox never receives the task's tests/ directory: run_separate_harbor_verifier only creates an empty /tests directory and transfers /logs/artifacts plus configured artifacts before running bash /tests/test.sh. In this supported mode the script is absent unless the base agent image happens to contain it, so otherwise-valid Harbor tasks always get reward 0.

xeophon (Member, Author) replied:

Harbor PR #1655 does support environment_mode = "separate" without [verifier.environment]; upstream resolves it to a fresh copy of the top-level environment. The difference is that Harbor can build the verifier image from tests/, while this v1 sandbox path cannot. I fixed the fallback by marking that case with verifier_upload_tests and uploading the task tests into the fresh verifier sandbox before running /tests/test.sh, while explicit [verifier.environment] continues to expect a pre-baked verifier image.
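
The fallback described in this reply might look something like the sketch below. It is a guess at the shape, not the PR's code: `build_verifier_sandbox` is a hypothetical helper, returning the upload flag alongside the config is an assumption about where `verifier_upload_tests` lives, and merging `[verifier.environment]` over the agent sandbox config is likewise assumed.

```python
from typing import Any, Optional


def build_verifier_sandbox(
    sandbox: dict[str, Any],
    verifier_environment: Optional[dict[str, Any]],
    test_timeout: float,
) -> tuple[dict[str, Any], bool]:
    """Hypothetical helper: return (verifier sandbox config, upload_tests flag)."""
    if verifier_environment is None:
        # Fallback: fresh copy of the agent sandbox config. The agent image has
        # no tests baked in, so tests/ must be uploaded before /tests/test.sh runs.
        return {**sandbox, "command_timeout": int(test_timeout)}, True
    # Explicit [verifier.environment]: expected to be a pre-baked verifier image,
    # so nothing is uploaded. The merge order here is an assumption.
    config = {**sandbox, **verifier_environment, "command_timeout": int(test_timeout)}
    return config, False
```

Keeping the flag separate from the image choice makes the two cases easy to test independently: only the no-`[verifier.environment]` branch should ever trigger an upload.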

macroscopeapp Bot commented May 17, 2026

Approvability

Verdict: Needs human review

This PR introduces a substantial new feature (Harbor verifier support) with 350+ lines of new runtime logic including sandbox management, artifact transfer, and changed reward file precedence. Multiple unresolved P1 review comments identify potential bugs affecting evaluation correctness.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f02768b6d6

Comment thread verifiers/v1/packages/tasksets/harbor.py
Comment thread verifiers/v1/packages/tasksets/harbor.py
@xeophon xeophon force-pushed the apr-262-harbor-v1-verifier-modes branch from 8ae240a to 05d4739 on May 20, 2026 09:22
Comment thread verifiers/v1/packages/tasksets/harbor.py
@xeophon xeophon force-pushed the apr-262-harbor-v1-verifier-modes branch 2 times, most recently from a953660 to f23a006, on May 20, 2026 09:41

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f23a0066c7

Comment on lines +29 to +32
"if [ -s /logs/verifier/reward.json ]; then "
"cat /logs/verifier/reward.json; "
"elif [ -s /logs/verifier/reward.txt ]; then "
"cat /logs/verifier/reward.txt; fi"

P2: Restore reward.txt precedence in Harbor reward lookup

When both /logs/verifier/reward.txt and /logs/verifier/reward.json are present, this command now reads JSON first, which can score from stale/diagnostic JSON or return 0 for invalid JSON even if reward.txt contains the valid final score. That changes previously working Harbor tasks and diverges from the existing Harbor implementations in this repo that prioritize reward.txt first. Fresh evidence in this commit: HARBOR_REWARD_COMMAND checks reward.json before reward.txt, and the new test test_harbor_reward_command_prefers_json_reward explicitly locks in that order.
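
To make the ordering concern concrete, here is a small local sketch of the "first non-empty file wins" lookup that the shell command performs. `read_reward_output`, `JSON_FIRST`, and `TEXT_FIRST` are hypothetical names; the sketch only selects which file's contents are returned and does not attempt to reproduce how the repository parses a reward from them.

```python
from pathlib import Path
from typing import Optional, Sequence

JSON_FIRST = ("reward.json", "reward.txt")  # order introduced by this PR
TEXT_FIRST = ("reward.txt", "reward.json")  # order this review asks for


def read_reward_output(verifier_dir: Path, order: Sequence[str]) -> Optional[str]:
    """Return the contents of the first non-empty reward file in `order`.

    Roughly mirrors `if [ -s a ]; then cat a; elif [ -s b ]; then cat b; fi`.
    """
    for name in order:
        path = verifier_dir / name
        if path.is_file() and path.stat().st_size > 0:
            return path.read_text()
    return None
```

When only one of the two files exists, both orders return the same contents; the result differs only when both files are present and non-empty, which is the case this comment is about.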

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit f23a006.

Comment thread verifiers/v1/packages/tasksets/harbor.py Outdated
@xeophon xeophon force-pushed the apr-262-harbor-v1-verifier-modes branch from f23a006 to 9daf4b2 on May 20, 2026 13:01

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9daf4b2aee

lease.id,
state,
command="bash /tests/test.sh",
working_dir=None,

P1: Run separate Harbor tests from /tests working directory

In separate verifier mode, the test runner now invokes bash /tests/test.sh with working_dir=None, which changes execution semantics versus the existing Harbor path (bash test.sh in /tests). Any verifier script that uses relative paths (for example ./fixtures/... or python helper.py) will resolve against the sandbox default workdir (often /app or image WORKDIR) instead of /tests, causing false failures and reward 0 in otherwise valid tasks. This is specific to verifier_mode = "separate" and is introduced by the new separate-sandbox path.
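
The relative-path concern can be reproduced locally with a small sketch. Everything here is illustrative: `run_test_script` is a hypothetical helper, temporary directories stand in for `/tests` and the image WORKDIR, and `subprocess` stands in for the sandbox command runner.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def run_test_script(tests_dir: Path, cwd: Path) -> int:
    """Run tests_dir/test.sh by absolute path from `cwd`; return its exit code."""
    shell = shutil.which("bash") or "sh"
    return subprocess.run([shell, str(tests_dir / "test.sh")], cwd=cwd).returncode


with tempfile.TemporaryDirectory() as tmp:
    tests_dir = Path(tmp) / "tests"
    workdir = Path(tmp) / "app"  # stands in for the sandbox default workdir
    (tests_dir / "fixtures").mkdir(parents=True)
    workdir.mkdir()
    (tests_dir / "fixtures" / "expected.txt").write_text("ok\n")
    # The script reads a fixture through a path relative to the current directory.
    (tests_dir / "test.sh").write_text("cat ./fixtures/expected.txt >/dev/null 2>&1\n")

    from_tests = run_test_script(tests_dir, cwd=tests_dir)
    from_workdir = run_test_script(tests_dir, cwd=workdir)
```

Invoking the script by absolute path does not change where its relative paths resolve: `from_tests` is 0 because the fixture is found, while `from_workdir` is non-zero because `./fixtures/expected.txt` is looked up under the other directory.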

@xeophon xeophon force-pushed the apr-262-harbor-v1-verifier-modes branch from 9daf4b2 to a063a69 on May 20, 2026 13:09

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a063a69bff

Comment on lines +29 to +32
"if [ -s /logs/verifier/reward.json ]; then "
"cat /logs/verifier/reward.json; "
"elif [ -s /logs/verifier/reward.txt ]; then "
"cat /logs/verifier/reward.txt; fi"

P1: Read reward.txt before reward.json for Harbor rewards

HARBOR_REWARD_COMMAND now reads /logs/verifier/reward.json before /logs/verifier/reward.txt, which can mis-score rollouts when both files exist (for example, a stale/diagnostic JSON alongside the final scalar reward in reward.txt). In that case valid tasks can return the wrong reward or 0 if JSON is malformed, despite a correct text reward being present, so this ordering can silently corrupt evaluation results.
