Retry Daytona sandbox startup/export timeouts instead of hanging at environment start #337

@xdotli

Summary

BenchFlow does not currently handle Daytona sandbox startup failures robustly when image build/export or sandbox creation hangs or errors out before the task reaches sandbox user setup. In that case the local bench eval create process can remain stuck at "Starting environment" and never emit a job result or error for the evaluation-level retry loop to classify.

This should be treated as a retryable sandbox/environment startup failure, distinct from verifier timeout or agent timeout.

Repro / Evidence

From a SkillsBench oracle rerun on benchflow 0.4.0:

uv run bench eval create \
  --tasks-dir tasks/reserves-at-risk-calc \
  --agent oracle \
  --sandbox daytona \
  --jobs-dir /tmp/sb845-rar-rerun2-reserves-at-risk-calc-jobs \
  --concurrency 1

Fresh rerun evidence:

  • Job dir: /tmp/sb845-rar-rerun2-reserves-at-risk-calc-jobs/2026-05-21__21-51-37/reserves-at-risk-calc__3c96b223/
  • Files emitted after more than 600 seconds: only config.json
  • Matching Daytona sandbox: e7d8ab0f-47da-40b1-b179-46e1363fe014
  • Daytona state after more than 600 seconds: creating
  • created_at, updated_at, and last_activity_at remained unchanged from 2026-05-22T01:51:39.295Z
  • BenchFlow local process remained stuck at Starting environment

Earlier observed Daytona sandboxes for the same task:

  • 3e0f7a6c-9961-4cde-9c9a-6f50ea789598
    • Daytona state: ERROR
    • Error: timeout of 1200000ms exceeded
    • Job dir only had config.json
  • ebbf17da-f24e-4143-a544-22e0d1388176
    • Daytona state: ERROR
    • Error: timeout of 1200000ms exceeded
    • Job dir only had config.json

Daytona build logs for the earlier failed runs completed apt/libreoffice and pip dependency installation. The last observed phase was Docker/image export:

#11 exporting to image

BenchFlow never reached sandbox user setup, oracle execution, or verifier execution. This was not a task-level verifier timeout and not an agent timeout.

Current Behavior

In benchflow 0.4.0:

  • RetryConfig retries install_failure, pipe_closed, and acp_error, while excluding timeout by default.
  • Evaluation-level retry only happens after a RunResult with a retryable result.error is returned.
  • Rollout._start_env_and_upload() waits inside env.start(force_build=False) before environment setup artifacts are emitted.
  • Daytona sandbox creation has a small internal retry, but a failed or stuck sandbox can still leave the outer rollout waiting at "Starting environment".
  • In the fresh rerun, the task-level build_timeout_sec = 600 was exceeded without BenchFlow emitting a structured failure or result.
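
To make the gap concrete, here is a minimal model of the retry decision described in the bullets above. It is a sketch, not BenchFlow's code: `RetryPolicySketch` and `should_retry` are hypothetical names, and only the three retryable categories and the exclusion of `timeout` come from this issue. The point it illustrates is that a retry decision needs a returned error to classify, which a hung env.start(...) never produces.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class RetryPolicySketch:
    """Hypothetical stand-in for a retry config; NOT BenchFlow's RetryConfig."""

    # Categories this issue reports as retryable by default in 0.4.0.
    retryable: FrozenSet[str] = field(
        default_factory=lambda: frozenset(
            {"install_failure", "pipe_closed", "acp_error"}
        )
    )

    def should_retry(self, error: Optional[str]) -> bool:
        # A decision is only possible once a result carrying an error exists.
        # A rollout that hangs inside env.start() never gets this far.
        return error is not None and error in self.retryable


default_policy = RetryPolicySketch()

# Hypothetical opt-in: treat a startup failure category as retryable too.
extended_policy = RetryPolicySketch(
    retryable=default_policy.retryable | {"daytona_build_timeout"}
)
```

With the default set, `should_retry("timeout")` is false and `should_retry("pipe_closed")` is true; a startup category such as `daytona_build_timeout` only becomes retryable once it is both raised as a structured error and added to the set.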

Expected Behavior

BenchFlow should:

  1. Add an environment/sandbox startup watchdog around env.start(...).
  2. Detect a Daytona sandbox that is in state ERROR, or stuck in creating, during create/start and raise a structured error such as SandboxStartFailed / daytona_build_timeout.
  3. Mark Daytona startup/build/export failures retryable by default, or expose a clear retry config knob.
  4. Clean up failed/creating sandbox IDs before retrying.
  5. Emit a failed job result/error instead of leaving only config.json.
  6. Keep this category separate from verifier timeout and agent timeout in logs/metrics.
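
Items 1 and 2 could be sketched roughly as follows. This is an illustration under assumptions, not a proposed patch: `start_with_watchdog`, the `start` and `get_state` callables, and the `daytona_sandbox_error` category are hypothetical, and the sketch assumes the start call is an awaitable that yields to the event loop (a blocking call would first need to be moved to a thread). Only the SandboxStartFailed and daytona_build_timeout names are taken from this issue.

```python
import asyncio
from typing import Awaitable, Callable


class SandboxStartFailed(Exception):
    """Structured startup failure; `category` is meant for logs and retry."""

    def __init__(self, category: str, detail: str) -> None:
        super().__init__(f"{category}: {detail}")
        self.category = category
        self.detail = detail


async def start_with_watchdog(
    start: Callable[[], Awaitable[None]],
    get_state: Callable[[], Awaitable[str]],
    timeout_sec: float,
    poll_sec: float = 5.0,
) -> None:
    """Run start() under a deadline while polling the sandbox state."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout_sec
    task = asyncio.ensure_future(start())
    try:
        while True:
            remaining = deadline - loop.time()
            if remaining <= 0:
                raise SandboxStartFailed(
                    "daytona_build_timeout",
                    f"environment did not start within {timeout_sec}s",
                )
            # Wait for start() to finish, but wake up to poll the state.
            done, _ = await asyncio.wait({task}, timeout=min(poll_sec, remaining))
            if done:
                task.result()  # re-raises any exception from start()
                return
            state = await get_state()
            if state.upper() == "ERROR":
                raise SandboxStartFailed(
                    "daytona_sandbox_error", "sandbox entered ERROR state"
                )
    finally:
        # Do not leave a hung start() running in the background.
        if not task.done():
            task.cancel()
            await asyncio.gather(task, return_exceptions=True)
```

A caller would pass the task's build timeout (600 seconds in the rerun above) as `timeout_sec`, so that a sandbox stuck in creating surfaces as a classified error instead of an indefinite wait. A production version would likely also bound the `get_state` call itself.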

Impact

Large or slow task images can fail due to transient Daytona image export/create timeouts. Without SDK-level retry and clear result emission, batch evaluations can hang or under-report the actual infra failure mode.
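
One possible shape for retry with cleanup and guaranteed result emission is sketched below. Everything here is hypothetical: `StartupFailure`, `run_with_startup_retry`, the `start` and `cleanup` callables, the result fields, and the result.json file name are placeholders for illustration and are not BenchFlow or Daytona APIs.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Optional


class StartupFailure(Exception):
    """Hypothetical startup error that carries the failed sandbox ID."""

    def __init__(self, category: str, sandbox_id: Optional[str] = None) -> None:
        super().__init__(category)
        self.category = category
        self.sandbox_id = sandbox_id


def run_with_startup_retry(
    start: Callable[[], str],
    cleanup: Callable[[str], None],
    job_dir: Path,
    max_attempts: int = 3,
) -> dict:
    """Retry startup, clean up failed sandboxes, and always write a result."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    failures = []
    sandbox_id = None
    for attempt in range(1, max_attempts + 1):
        try:
            sandbox_id = start()
            break
        except StartupFailure as exc:
            failures.append(
                {"attempt": attempt, "category": exc.category,
                 "sandbox_id": exc.sandbox_id}
            )
            # Remove the failed or stuck sandbox before trying again.
            if exc.sandbox_id is not None:
                cleanup(exc.sandbox_id)
    result = {
        "status": "started" if sandbox_id is not None else "failed",
        "sandbox_id": sandbox_id,
        # Startup categories stay distinct from verifier/agent timeouts.
        "error": None if sandbox_id is not None else failures[-1]["category"],
        "startup_failures": failures,
    }
    # The job dir never ends up holding only config.json.
    (job_dir / "result.json").write_text(json.dumps(result, indent=2))
    return result


def always_times_out() -> str:
    # Simulated start call that fails the way the sandboxes above did.
    raise StartupFailure("daytona_build_timeout", sandbox_id="example-id")


with tempfile.TemporaryDirectory() as tmp:
    outcome = run_with_startup_retry(
        always_times_out, lambda sid: None, Path(tmp), max_attempts=2
    )
    print(outcome["status"], outcome["error"])
```

Recording each failed attempt alongside the final status keeps the infra failure visible in batch reports even when a later attempt succeeds.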

Labels: bug (Something isn't working), fixed (Verified fixed by running the patched code)
