Retry Daytona sandbox startup/export timeouts instead of hanging at environment start #337

## Summary

BenchFlow does not currently handle Daytona sandbox startup failures in which image build/export or sandbox creation hangs or errors before the task reaches sandbox user setup. In that case the local `bench eval create` process can stay at `Starting environment` indefinitely and never emits a job result or error for the evaluation-level retry loop to classify.

This should be treated as a retryable sandbox/environment startup failure, distinct from verifier timeout or agent timeout.

## Repro / Evidence

From a SkillsBench oracle rerun on `benchflow 0.4.0`:

```bash
uv run bench eval create \
  --tasks-dir tasks/reserves-at-risk-calc \
  --agent oracle \
  --sandbox daytona \
  --jobs-dir /tmp/sb845-rar-rerun2-reserves-at-risk-calc-jobs \
  --concurrency 1
```

Fresh rerun evidence:

- Job dir: `/tmp/sb845-rar-rerun2-reserves-at-risk-calc-jobs/2026-05-21__21-51-37/reserves-at-risk-calc__3c96b223/`
- Files emitted after more than 600 seconds: only `config.json`
- Matching Daytona sandbox: `e7d8ab0f-47da-40b1-b179-46e1363fe014`
- Daytona state after more than 600 seconds: `creating`
- `created_at`, `updated_at`, and `last_activity_at` remained unchanged from `2026-05-22T01:51:39.295Z`
- BenchFlow local process remained stuck at `Starting environment`

Earlier observed Daytona sandboxes for the same task:

- `3e0f7a6c-9961-4cde-9c9a-6f50ea789598`
  - Daytona state: `ERROR`
  - Error: `timeout of 1200000ms exceeded`
  - Job dir only had `config.json`
- `ebbf17da-f24e-4143-a544-22e0d1388176`
  - Daytona state: `ERROR`
  - Error: `timeout of 1200000ms exceeded`
  - Job dir only had `config.json`

Daytona build logs for the earlier failed runs show that the apt (LibreOffice) and pip dependency installation steps completed. The last observed phase was Docker image export:

```text
#11 exporting to image
```

BenchFlow never reached sandbox user setup, oracle execution, or verifier execution. This was not a task-level verifier timeout and not an agent timeout.
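The two failure shapes in the evidence above (state `ERROR`, or state `creating` with unchanged timestamps) could be distinguished with a small check like the one below. This is only a sketch: `SandboxSnapshot`, `classify_startup`, and the labels `sandbox_error` and `sandbox_stalled` are hypothetical names, not part of BenchFlow or the Daytona SDK, and the 600-second threshold simply mirrors the observation window reported here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class SandboxSnapshot:
    """Hypothetical view of the sandbox fields quoted in the evidence."""
    state: str
    updated_at: datetime
    error: Optional[str] = None


def classify_startup(snapshot: SandboxSnapshot, now: datetime,
                     stall_after: timedelta) -> Optional[str]:
    """Return a failure label, or None if startup still looks healthy."""
    state = snapshot.state.lower()
    if state == "error":
        return "sandbox_error"
    # A sandbox that sits in `creating` with no timestamp movement is stalled.
    if state == "creating" and now - snapshot.updated_at >= stall_after:
        return "sandbox_stalled"
    return None


# Values taken from the fresh rerun: state `creating`, timestamps frozen.
stuck = SandboxSnapshot(
    state="creating",
    updated_at=datetime(2026, 5, 22, 1, 51, 39, 295000, tzinfo=timezone.utc),
)
now = stuck.updated_at + timedelta(seconds=601)
label = classify_startup(stuck, now, stall_after=timedelta(seconds=600))
print(label)
```

Keeping the check a pure function of a snapshot and a clock makes it straightforward to unit test without a live sandbox.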

## Current Behavior

In `benchflow 0.4.0`:

- `RetryConfig` retries `install_failure`, `pipe_closed`, and `acp_error`, while excluding `timeout` by default.
- Evaluation-level retry only happens after a `RunResult` with a retryable `result.error` is returned.
- `Rollout._start_env_and_upload()` waits inside `env.start(force_build=False)` before environment setup artifacts are emitted.
- Daytona sandbox create has a small internal retry, but the failed/stuck sandbox can still leave the outer rollout waiting at `Starting environment`.
- The task-level `build_timeout_sec = 600` was exceeded in the fresh rerun without BenchFlow emitting a structured failure or result.
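To make the gap concrete, here is a minimal sketch of a retry predicate shaped like the behavior described above. `RetryPolicy`, `retry_on`, and `should_retry` are hypothetical names and do not reflect the actual `RetryConfig` fields; the category `sandbox_start_failed` is likewise an assumption used for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class RetryPolicy:
    """Hypothetical stand-in for a retry config keyed on error category."""
    retry_on: FrozenSet[str] = frozenset(
        {"install_failure", "pipe_closed", "acp_error"}
    )

    def should_retry(self, error: Optional[str]) -> bool:
        # A predicate like this only runs once a result with an error
        # category exists; a hung start never produces one.
        return error is not None and error in self.retry_on


default = RetryPolicy()
opt_in = RetryPolicy(retry_on=default.retry_on | {"sandbox_start_failed"})

print(default.should_retry("timeout"))
print(default.should_retry("sandbox_start_failed"))
print(opt_in.should_retry("sandbox_start_failed"))
```

Under these assumptions, a startup failure is retried only if it is both surfaced as a result and given its own category that the policy includes.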

## Expected Behavior

BenchFlow should:

1. Add an environment/sandbox startup watchdog around `env.start(...)`.
2. Detect Daytona sandbox state `ERROR` or stuck `creating` during create/start and raise a structured error such as `SandboxStartFailed` / `daytona_build_timeout`.
3. Mark Daytona startup/build/export failures retryable by default, or expose a clear retry config knob.
4. Clean up failed/creating sandbox IDs before retrying.
5. Emit a failed job result/error instead of leaving only `config.json`.
6. Keep this category separate from verifier timeout and agent timeout in logs/metrics.
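Items 1, 2, and 4 could be combined roughly as follows. This is a sketch under stated assumptions, not a proposed patch: `start_with_watchdog`, the `cleanup` callback, and the `sandbox_start_timeout` category are hypothetical, and `start` stands in for whatever coroutine wraps `env.start(...)`. Only `asyncio.wait_for` is a real library call.

```python
import asyncio
from typing import Awaitable, Callable


class SandboxStartFailed(Exception):
    """Structured startup failure carrying a category for retry logic."""

    def __init__(self, category: str, attempts: int):
        super().__init__(f"{category} after {attempts} attempt(s)")
        self.category = category
        self.attempts = attempts


async def start_with_watchdog(
    start: Callable[[], Awaitable[None]],
    cleanup: Callable[[], Awaitable[None]],
    timeout_sec: float,
    max_attempts: int = 2,
) -> int:
    """Run start() under a deadline; clean up and retry on timeout.

    Returns the attempt number that succeeded.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            await asyncio.wait_for(start(), timeout=timeout_sec)
            return attempt
        except asyncio.TimeoutError:
            # Remove the half-created sandbox before the next attempt.
            await cleanup()
    raise SandboxStartFailed("sandbox_start_timeout", max_attempts)


async def demo() -> dict:
    calls = {"start": 0, "cleanup": 0}

    async def flaky_start() -> None:
        calls["start"] += 1
        if calls["start"] == 1:
            await asyncio.sleep(10)  # simulate a first attempt that hangs

    async def cleanup() -> None:
        calls["cleanup"] += 1

    calls["succeeded_on"] = await start_with_watchdog(
        flaky_start, cleanup, timeout_sec=0.05
    )
    return calls


result = asyncio.run(demo())
print(result)
```

When every attempt times out, the raised `SandboxStartFailed` gives the caller something to serialize into a failed job result (item 5) under a category that stays distinct from verifier and agent timeouts (item 6).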

## Impact

Large or slow task images can fail due to transient Daytona image export/create timeouts. Without SDK-level retry and clear result emission, batch evaluations can hang or misreport the underlying infrastructure failure mode.

