Summary
BenchFlow does not robustly handle Daytona sandbox startup failures in which image build/export or sandbox creation gets stuck or errors out before the task reaches sandbox user setup. In this case the local bench eval create process can remain stuck at Starting environment and never emit a job result/error for the evaluation-level retry loop to classify.
This should be treated as a retryable sandbox/environment startup failure, distinct from verifier timeout or agent timeout.
Repro / Evidence
From a SkillsBench oracle rerun on benchflow 0.4.0:
uv run bench eval create \
--tasks-dir tasks/reserves-at-risk-calc \
--agent oracle \
--sandbox daytona \
--jobs-dir /tmp/sb845-rar-rerun2-reserves-at-risk-calc-jobs \
--concurrency 1
Fresh rerun evidence:
- Job dir: /tmp/sb845-rar-rerun2-reserves-at-risk-calc-jobs/2026-05-21__21-51-37/reserves-at-risk-calc__3c96b223/
- Files emitted after more than 600 seconds: only config.json
- Matching Daytona sandbox: e7d8ab0f-47da-40b1-b179-46e1363fe014
- Daytona state after more than 600 seconds: creating; created_at, updated_at, and last_activity_at remained unchanged from 2026-05-22T01:51:39.295Z
- BenchFlow local process remained stuck at Starting environment
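The evidence above (state creating, with activity timestamps frozen for more than 600 seconds) suggests a simple stall heuristic. The following is a minimal sketch, not BenchFlow or Daytona SDK code: the function names and the idea of passing the state and last_activity_at as plain strings are assumptions made for illustration.

```python
from datetime import datetime, timedelta

def parse_ts(ts: str) -> datetime:
    """Parse a UTC timestamp such as 2026-05-22T01:51:39.295Z."""
    # Replace the trailing Z so older Python versions can parse it too.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

def is_stalled(state: str, last_activity_at: str, now: datetime,
               stall_after: timedelta) -> bool:
    """Hypothetical check: still 'creating' with no activity for stall_after."""
    if state.lower() != "creating":
        return False
    return now - parse_ts(last_activity_at) >= stall_after

# Values taken from the fresh rerun, checked a little over 600 seconds later.
now = parse_ts("2026-05-22T02:01:40.000Z")
stalled = is_stalled("creating", "2026-05-22T01:51:39.295Z", now,
                     timedelta(seconds=600))
```

A real check would read the state and timestamps from the sandbox provider rather than from literals.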
Earlier observed Daytona sandboxes for the same task:
- 3e0f7a6c-9961-4cde-9c9a-6f50ea789598
  - Daytona state: ERROR
  - Error: timeout of 1200000ms exceeded
  - Job dir only had config.json
- ebbf17da-f24e-4143-a544-22e0d1388176
  - Daytona state: ERROR
  - Error: timeout of 1200000ms exceeded
  - Job dir only had config.json
Daytona build logs for the earlier failed runs show that apt/libreoffice and pip dependency installation completed. The last observed phase was Docker/image export.
BenchFlow never reached sandbox user setup, oracle execution, or verifier execution. This was not a task-level verifier timeout and not an agent timeout.
Current Behavior
In benchflow 0.4.0:
- RetryConfig retries install_failure, pipe_closed, and acp_error, while excluding timeout by default.
- Evaluation-level retry only happens after a RunResult with a retryable result.error is returned.
- Rollout._start_env_and_upload() waits inside env.start(force_build=False) before environment setup artifacts are emitted.
- Daytona sandbox create has a small internal retry, but the failed/stuck sandbox can still leave the outer rollout waiting at Starting environment.
- The task-level build_timeout_sec = 600 was crossed in the fresh rerun without a structured BenchFlow failure/result.
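To make the gap concrete, here is a minimal sketch of the classification step described above. It is not BenchFlow's actual RetryConfig: the class name, the should_retry method, the sandbox_start_failed category, and the assumption that result.error is a short category string are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class RetryPolicySketch:
    """Hypothetical stand-in for the retry categories listed in this issue."""
    retry_on: frozenset = frozenset({"install_failure", "pipe_closed", "acp_error"})

    def should_retry(self, error: Optional[str]) -> bool:
        # With no returned result there is nothing to classify, so a hang
        # inside env.start() never reaches this check at all.
        return error is not None and error in self.retry_on

default_policy = RetryPolicySketch()

# One possible retry knob: opt a new startup-failure category into retries.
with_startup = RetryPolicySketch(
    retry_on=default_policy.retry_on | {"sandbox_start_failed"}
)
```

Under these assumptions, timeout stays excluded by default, and a startup failure becomes retryable only once it is surfaced as a structured error and its category is added to the retry set.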
Expected Behavior
BenchFlow should:
- Add an environment/sandbox startup watchdog around env.start(...).
- Detect Daytona sandbox state ERROR or stuck creating during create/start and raise a structured error such as SandboxStartFailed / daytona_build_timeout.
- Mark Daytona startup/build/export failures retryable by default, or expose a clear retry config knob.
- Clean up failed/creating sandbox IDs before retrying.
- Emit a failed job result/error instead of leaving only config.json.
- Keep this category separate from verifier timeout and agent timeout in logs/metrics.
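One possible shape for the watchdog is sketched below with asyncio. It is an illustration under stated assumptions, not the BenchFlow or Daytona API: start and get_state are hypothetical callables supplied by the caller, the sandbox_error category is invented here, and only the names SandboxStartFailed and daytona_build_timeout come from this issue.

```python
import asyncio
from typing import Awaitable, Callable

class SandboxStartFailed(Exception):
    """Hypothetical structured error for sandbox startup failures."""

    def __init__(self, category: str, detail: str) -> None:
        super().__init__(f"{category}: {detail}")
        self.category = category
        self.detail = detail

async def start_with_watchdog(
    start: Callable[[], Awaitable[None]],
    get_state: Callable[[], Awaitable[str]],
    build_timeout_sec: float,
    poll_interval_sec: float = 5.0,
) -> None:
    """Await start(), failing on an ERROR state or on crossing the deadline."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + build_timeout_sec
    start_task = asyncio.ensure_future(start())
    try:
        while True:
            remaining = deadline - loop.time()
            if remaining <= 0:
                raise SandboxStartFailed(
                    "daytona_build_timeout",
                    f"no successful start within {build_timeout_sec}s",
                )
            # Wait for start() to finish, but wake up regularly to poll state.
            done, _ = await asyncio.wait(
                {start_task}, timeout=min(poll_interval_sec, remaining)
            )
            if done:
                start_task.result()  # re-raises any exception from start()
                return
            state = await get_state()
            if state.lower() == "error":
                raise SandboxStartFailed(
                    "sandbox_error", "sandbox entered ERROR state"
                )
    finally:
        # Do not leave a hung start() running after the watchdog gives up.
        if not start_task.done():
            start_task.cancel()
```

A caller could catch SandboxStartFailed, delete the failed sandbox, write a failed job result carrying the category, and let the evaluation-level retry loop decide whether to retry. The sketch assumes get_state itself returns promptly; a real implementation would bound that call as well.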
Impact
Large or slow task images can fail due to transient Daytona image export/create timeouts. Without SDK-level retry and clear result emission, batch evaluations can hang or under-report the actual infra failure mode.