Sandbox filesystem snapshots and rewind (Docker) #20

Merged: akrentsel merged 8 commits into main from sandbox-snapshots-filesystem on Jun 1, 2026
@akrentsel (Collaborator) commented May 25, 2026

Fills in the snapshot/rewind plumbing that was previously metadata-only. Captures and restores a sandboxed container's filesystem state. A separate PR (#12) stacks on top of this one for an experimental CRIU full-state path — that one is intentionally a draft because it's blocked by upstream docker bugs.

What this does

Inside a chat REPL (exo chat repl <agent> <conv>), three new slash commands:

| Command | What it does |
| --- | --- |
| /snapshot | Capture the sandbox's filesystem state as it is right now: every file created, modified, or deleted since the base image. Prints a snapshot UUID. On disk this is a docker save tarball under conversations/<conv>/snapshots/<id>/payload.bin plus a manifest.json sidecar. |
| /snapshots | List every snapshot taken in this conversation. |
| /rewind <uuid> | Stop the current container, bring up a fresh one whose filesystem matches the snapshot. Subsequent tool calls see the rolled-back state. |

What this does NOT do — be clear about the limit

The snapshot is filesystem only. Specifically:

  • Files persisted to disk inside the container are captured and restored — files you created, packages you installed, configs you wrote.
  • Running processes are not preserved. If you /snapshot while nohup sleep 9999 & is running and then /rewind, the process is gone and its /proc/<pid> entry does not come back. The new container boots fresh.
  • In-memory state is gone. An interactive REPL's variables, an open TCP connection, a buffered write that hadn't been flushed — none of these survive a rewind.
  • Conversation history is not rewound. Your chat messages and event log stay where they were. Use conversation fork if you want to rewind the conversation itself; snapshots only operate on the sandbox filesystem underneath.

For agent workflows where "state worth preserving" = "files written to disk + tools/packages installed", filesystem snapshots cover the case. For pause-and-resume of long-running in-memory processes, you'd need full-state (CRIU), which is #12's beat and currently impractical to ship due to upstream docker bugs.

Demo

[gif: snapshot-rewind]

# Bootstrap a state dir + agent + conversation, drop into the REPL.
# (--networking enabled because the demo needs the LLM to reach OpenAI)
STATE=/tmp/exo-snapshot-demo && rm -rf $STATE && mkdir -p $STATE && \
EXO="./target/debug/exo --root $STATE --secret-backend file --sandbox-backend docker --master-key-path $STATE/master.key" && \
$EXO secret set OPENAI_KEY --env OPENAI_API_KEY && \
$EXO model register gpt-4o-mini --secret OPENAI_KEY && \
$EXO agent create demo --model gpt-4o-mini --networking enabled && \
$EXO conversation create demo first --repl

Then inside first>:

create /tmp/demo.txt with the content "version 1" and cat it back

Sandbox now has /tmp/demo.txt with "version 1".

/snapshot

snapshot 019e5782-7c6b-72a2-b4fa-a81bf56eb37e

Behind the scenes: docker commit -p → docker save → tarball (~47MB for the ubuntu:24.04 base) → written to conversations/<conv>/snapshots/<id>/payload.bin + manifest.json. The local docker image is then dropped (docker image rm); exoharness storage is the canonical home.

overwrite /tmp/demo.txt with "version 2" and cat it back

Sandbox file now reads "version 2".

/snapshots

SNAPSHOT                              SANDBOX
019e5782-7c6b-72a2-b4fa-a81bf56eb37e  sandbox-019e5782-2a46-7970-a5bf-62900a2233e8

/rewind 019e5782-7c6b-72a2-b4fa-a81bf56eb37e

rewound to snapshot 019e5782-7c6b-72a2-b4fa-a81bf56eb37e

Behind the scenes: docker load < payload.bin → fresh container booted from the restored image → swapped into the warm pool, keyed identically. Mounts / network policy / lifecycle preserved from the original sandbox request.

cat /tmp/demo.txt

version 1 ← the rewind worked

You can take many snapshots in a conversation and rewind to any of them.

Six commits

| # | Commit | What |
| --- | --- | --- |
| 1 | exoharness: sandbox snapshot/restore trait surface and Docker implementation | Adds SnapshotPayload { kind, bytes } + SnapshotKind::DockerImageTar. Extends ManagedSandboxHandle::snapshot() and ManagedSandboxBackend::acquire_from_snapshot(req, payload). Stubs with explicit "not supported" errors for OneShot / LocalProcess / AppleContainer with clear "where the real impl goes" comments. |
| 2 | exoharness: persist sandbox snapshots and restore via start_sandbox | Wires snapshot_sandbox to actually capture and persist. Wires start_sandbox to load the payload by manifest kind and call acquire_from_snapshot. |
| 3 | cli: /snapshot, /snapshots, /rewind slash commands in chat repl | The user-facing surface. Lives in the REPL because the sandbox handle is per-process; a top-level exo conversation snapshot subcommand needs cross-invocation container adoption (separate follow-up). |
| 4 | docs: sandbox snapshot/rewind design | docs/sandbox-snapshots.md covering data flow, on-disk layout, backend extension story, known limits. |
| 5 | exoharness: default sandbox image to ubuntu:24.04 | debian:bookworm doesn't ship procps; even basic "list running processes" tool calls hit command not found. Ubuntu 24.04 has procps + coreutils in the base. |
| 6 | ci: end-to-end snapshot + rewind round-trip test (docker, linux) | New crates/cli/tests/snapshot_round_trip.rs that drives the harness library directly against real Docker. Mirrors the demo's lifecycle and asserts on two independent rewind signals (file content rolls back AND a post-snapshot file disappears). Wired into the integration matrix workflow. |

On-disk shape

agents/<agent_id>/conversations/<conv_id>/snapshots/<snapshot_id>/
├── manifest.json   { snapshot_id, sandbox_id, kind, created_at, payload_size_bytes }
└── payload.bin     docker save tarball for SnapshotKind::DockerImageTar

The snapshot's existence is also recorded in the conversation event log as SandboxSnapshotted { sandbox_id, snapshot_id }, which is what /snapshots walks to render the listing.
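For reference, a minimal sketch of how that layout could be assembled in code. The helper names `snapshot_dir` and `snapshot_files` and all field types are assumptions made for illustration; only the directory shape, the two file names, and the manifest field names are taken from the layout above.

```rust
use std::path::{Path, PathBuf};

/// Fields listed for the manifest.json sidecar.
/// Field types are assumptions; the real struct may use typed ids and timestamps.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct StoredSnapshotManifest {
    pub snapshot_id: String,
    pub sandbox_id: String,
    pub kind: String,
    pub created_at: String,
    pub payload_size_bytes: u64,
}

/// Hypothetical helper: build the per-snapshot directory under a state root.
pub fn snapshot_dir(root: &Path, agent_id: &str, conv_id: &str, snapshot_id: &str) -> PathBuf {
    root.join("agents")
        .join(agent_id)
        .join("conversations")
        .join(conv_id)
        .join("snapshots")
        .join(snapshot_id)
}

/// Hypothetical helper: the sidecar pair stored inside a snapshot directory,
/// returned as (manifest path, payload path).
pub fn snapshot_files(dir: &Path) -> (PathBuf, PathBuf) {
    (dir.join("manifest.json"), dir.join("payload.bin"))
}
```

Keeping the path logic in one place like this is one way to make a later move to streamed or chunked storage touch a small surface.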

Adding more backends

Anyone adding snapshot support for a new sandbox backend follows this recipe:

  1. Add a new SnapshotKind variant naming the on-disk format (e.g. AppleContainerImageTar).
  2. Implement ManagedSandboxHandle::snapshot to produce that kind. The Docker version is the template — three CLI calls and a Bytes capture.
  3. Implement ManagedSandboxBackend::acquire_from_snapshot to consume the same kind, with an explicit kind-mismatch error.
  4. Backends that genuinely can't snapshot (local-process today) keep returning an explicit error.

No other layer changes. The conversation orchestration, on-disk layout, and CLI surface are all backend-agnostic.
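The recipe above can be sketched roughly as follows. This is a simplified, synchronous stand-in and not the real trait surface: `HypotheticalAppleBackend`, `SnapshotError`, and `local_process_snapshot` are invented for illustration, and `bytes` is a plain `Vec<u8>` where the real payload type may differ.

```rust
use std::fmt;

/// Tag naming the on-disk format of a snapshot payload.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SnapshotKind {
    DockerImageTar,
    /// Step 1: a new variant for the new backend's format.
    AppleContainerImageTar,
}

/// Opaque snapshot artifact: a format tag plus the raw bytes.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SnapshotPayload {
    pub kind: SnapshotKind,
    pub bytes: Vec<u8>,
}

/// Hypothetical error type covering the two explicit failure modes.
#[derive(Debug, PartialEq, Eq)]
pub enum SnapshotError {
    NotSupported(&'static str),
    KindMismatch { expected: SnapshotKind, got: SnapshotKind },
}

impl fmt::Display for SnapshotError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SnapshotError::NotSupported(why) => write!(f, "snapshot not supported: {}", why),
            SnapshotError::KindMismatch { expected, got } => {
                write!(f, "snapshot kind mismatch: expected {:?}, got {:?}", expected, got)
            }
        }
    }
}

/// Stand-in backend whose "filesystem" is just a byte buffer.
#[derive(Debug)]
pub struct HypotheticalAppleBackend {
    pub rootfs: Vec<u8>,
}

impl HypotheticalAppleBackend {
    /// Step 2: produce a payload tagged with this backend's kind.
    pub fn snapshot(&self) -> Result<SnapshotPayload, SnapshotError> {
        Ok(SnapshotPayload {
            kind: SnapshotKind::AppleContainerImageTar,
            bytes: self.rootfs.clone(),
        })
    }

    /// Step 3: consume the same kind, rejecting anything else explicitly.
    pub fn acquire_from_snapshot(payload: SnapshotPayload) -> Result<Self, SnapshotError> {
        if payload.kind != SnapshotKind::AppleContainerImageTar {
            return Err(SnapshotError::KindMismatch {
                expected: SnapshotKind::AppleContainerImageTar,
                got: payload.kind,
            });
        }
        Ok(Self { rootfs: payload.bytes })
    }
}

/// Step 4: a backend that cannot snapshot keeps returning an explicit error.
pub fn local_process_snapshot() -> Result<SnapshotPayload, SnapshotError> {
    Err(SnapshotError::NotSupported("local-process has no container filesystem to capture"))
}
```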

Test plan

  • cargo test --workspace (51 unit tests pass)
  • End-to-end CI test added: crates/cli/tests/snapshot_round_trip.rs — drives the harness library directly against real Docker on Linux, asserts the file content rolls back AND a post-snapshot file disappears after /rewind. Passes locally in ~1.8s; self-skips on non-docker matrix cells. Wired into the existing integration.yml workflow alongside the existing integration_chat test.
  • Manual REPL verification (the demo above): /snapshot → file content modified → /rewind → file content rolled back. Verified visually before the automated test was written.
  • On-disk layout (manifest.json + payload.bin) verified by the round-trip test using read_dir().

Comment thread crates/exoharness/src/sandbox.rs
Comment thread docs/sandbox-snapshots.md
@akrentsel akrentsel requested a review from ankrgyl May 25, 2026 07:37
akrentsel added a commit that referenced this pull request May 25, 2026
Three new integration tests, one per tier of the 3-tier fallback chain
in ensure_shell_sandbox:

  tier_1_stopped_container_is_resumed_same_id
    Drop the harness (PR #21's Drop stops, doesn't rm). Container
    survives on the host in Exited state. Second harness's try_resume
    finds it by label, docker-starts it, attaches. Same container ID,
    same sandbox_id, marker file persists across the stop/start cycle.

  tier_2_gone_container_with_snapshot_restores
    First harness takes a snapshot of the live sandbox (PR #20 API).
    Drop the harness; `docker rm -f` the container (simulates idle-TTL
    expiry / external cleanup). Second harness's try_resume misses,
    falls through to Tier 2, finds the snapshot in the event log, and
    calls start_sandbox -> acquire_from_snapshot. A NEW container id is
    materialised, but the sandbox_id is reused and the marker is
    restored from the snapshot — proving the snapshot path actually
    fires, not just resume.

  tier_3_gone_container_without_snapshot_creates_fresh
    Same setup as tier 2 minus the snapshot. Second harness misses
    Tier 1 (no container) and Tier 2 (no snapshot), so falls through
    to create_sandbox. A new sandbox_id is generated; the conversation
    log now has two SandboxCreated events; the previous marker is gone
    from the fresh container.

Each test simulates the "two exo processes" boundary by dropping the
first BasicExoHarness and constructing a new one from the same root
dir. Library-API driven (no LLM mock, no binary spawn) — the harness's
3-tier behaviour is the only thing under test here.

Wired into integration.yml as a third --test target alongside
integration_chat and snapshot_round_trip. Self-skips on non-docker
matrix cells via preflight().

All three pass locally in ~3s against real Docker; self-skip path
runs in 50ms.
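The decision order the three tests exercise can be sketched as a pure function. This is an illustrative model only: `Resolution` and `resolve` are hypothetical names, and the real ensure_shell_sandbox talks to Docker and the event log rather than taking booleans.

```rust
/// Outcome of the 3-tier fallback, as described in the commit message.
#[derive(Debug, PartialEq, Eq)]
pub enum Resolution {
    /// Tier 1: the stopped container was found and restarted.
    Resumed { sandbox_id: String },
    /// Tier 2: container gone, but a snapshot exists; sandbox_id is reused.
    RestoredFromSnapshot { sandbox_id: String, snapshot_id: String },
    /// Tier 3: nothing to reuse; a new sandbox (and new sandbox_id) is created.
    CreatedFresh,
}

/// Hypothetical pure model of the fallback chain.
/// `previous_sandbox` is the sandbox id recorded for the conversation, if any.
pub fn resolve(
    previous_sandbox: Option<&str>,
    container_exists: bool,
    latest_snapshot: Option<&str>,
) -> Resolution {
    let sandbox_id = match previous_sandbox {
        Some(id) => id,
        None => return Resolution::CreatedFresh,
    };
    if container_exists {
        return Resolution::Resumed { sandbox_id: sandbox_id.to_string() };
    }
    match latest_snapshot {
        Some(snapshot_id) => Resolution::RestoredFromSnapshot {
            sandbox_id: sandbox_id.to_string(),
            snapshot_id: snapshot_id.to_string(),
        },
        None => Resolution::CreatedFresh,
    }
}
```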
Comment thread .github/workflows/integration.yml Outdated
Comment thread crates/cli/src/tui.rs Outdated
Comment thread crates/cli/src/tui.rs Outdated
Comment thread crates/cli/src/tui.rs Outdated
Comment thread crates/cli/src/tui.rs Outdated
Comment thread crates/exoharness/src/basic.rs Outdated
Comment thread crates/exoharness/src/basic.rs Outdated
Comment thread crates/exoharness/src/sandbox.rs
Comment thread crates/exoharness/src/sandbox.rs
Comment thread docs/sandbox-snapshots.md
@akrentsel (Collaborator, Author)

Comments addressed. A handful of followup tasks came up, filed in their own issues.

akrentsel added a commit that referenced this pull request May 28, 2026
CLI / chat REPL (crates/cli/src/tui.rs):
  - Refactor the slash-command if-chain into `match trimmed`. Each arm
    is a single block; the `/rewind` and `/snapshot <id>` prefix forms
    fit as `other if let Some(arg) = ... => ...` guard arms; the `_`
    arm is the LLM-send default.
  - `/rewind <id>` and `/snapshot <id>` now reject args containing
    whitespace ("takes exactly one snapshot id; got: \"id1 id2\"")
    instead of feeding multi-word input to the downstream parser.
  - New `/snapshot <id>` form for picking which sandbox to snapshot
    when a conversation has more than one. `/snapshot` with no arg
    still defaults to the latest. Helper renamed
    `snapshot_current_sandbox()` -> `snapshot_sandbox(Option<SandboxId>)`.
  - Help text updated to show `/snapshot [<id>]` with the default-to-
    latest note.
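A minimal sketch of the dispatch shape described in the bullets above, assuming a hypothetical `ReplInput` enum and `parse_line` function (neither name is from tui.rs). It handles the prefix forms with `strip_prefix` inside the fallback arm rather than with guard arms, and the error wording is only approximate.

```rust
/// What a single line typed at the REPL prompt resolves to.
#[derive(Debug, PartialEq, Eq)]
pub enum ReplInput<'a> {
    Snapshot(Option<&'a str>),
    Snapshots,
    Rewind(&'a str),
    Help,
    SendToLlm(&'a str),
}

/// Accept exactly one whitespace-free id, otherwise return an error string.
fn single_id<'a>(cmd: &str, arg: &'a str) -> Result<&'a str, String> {
    let arg = arg.trim();
    if arg.is_empty() || arg.contains(char::is_whitespace) {
        Err(format!("{} takes exactly one id; got: {:?}", cmd, arg))
    } else {
        Ok(arg)
    }
}

/// Classify one line of input. Anything that is not a known slash
/// command falls through to the LLM-send default.
pub fn parse_line(line: &str) -> Result<ReplInput<'_>, String> {
    match line.trim() {
        "/snapshot" => Ok(ReplInput::Snapshot(None)),
        "/snapshots" => Ok(ReplInput::Snapshots),
        "/help" => Ok(ReplInput::Help),
        // Bare /rewind has no id, so it is rejected like any malformed arg.
        "/rewind" => single_id("/rewind", "").map(ReplInput::Rewind),
        other => {
            if let Some(arg) = other.strip_prefix("/rewind ") {
                single_id("/rewind", arg).map(ReplInput::Rewind)
            } else if let Some(arg) = other.strip_prefix("/snapshot ") {
                single_id("/snapshot", arg).map(|id| ReplInput::Snapshot(Some(id)))
            } else {
                Ok(ReplInput::SendToLlm(other))
            }
        }
    }
}
```

Rejecting multi-word arguments at this layer keeps the id parser downstream from ever seeing input like "id1 id2".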

Concurrency (crates/exoharness/src/basic.rs):
  - `snapshot_sandbox`: payload (multi-MB) and manifest writes now
    fan out via `tokio::try_join!`. The sandbox-metadata write stays
    sequential since it advertises the artifact's existence.
  - `start_sandbox`: manifest + payload reads run concurrently via
    `tokio::join!`. Per-read `with_context` preserved so the
    "have you taken a snapshot?" hint still surfaces.

Typed event-kind filter (crates/exoharness/src/types.rs +
8 call sites across exoharness/executor/cli):
  - New `EventKind` newtype with 13 named constants
    (`SANDBOX_CREATED`, `SANDBOX_SNAPSHOTTED`, etc.) plus a
    `custom(name)` escape hatch for `EventData::Custom`. Wire format
    unchanged (`#[serde(transparent)]`).
  - `EventQuery::types` is now `Option<Vec<EventKind>>` instead of
    `Option<Vec<String>>`. Typos like `"sandbox_creatd"` are
    compile errors at every known call site.
  - `EventData::kind()` is the new source of truth for variant -> tag
    mapping. The manual `event_type(&EventData) -> String` helper is
    gone; its duplicated match was the original drift hazard.
  - Updated 8 call sites (tui, harness_tool, harness_basic_tests x3,
    executor/basic, harness_helpers, cli/main). User-supplied
    `--type` CLI strings go through `EventKind::custom(...)`, which
    Cow-equality lets match either known kinds or true Custom events.
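To illustrate the newtype idea, here is a minimal standard-library-only sketch. It is an approximation of the change described above, not a copy of types.rs: the tag strings, the `as_str` and `matches` helpers, and the reduced `EventQuery` are assumptions, only two of the 13 constants are shown, and the serde attribute is left out so the block has no external dependencies.

```rust
use std::borrow::Cow;

/// Typed wrapper around an event tag. Known kinds borrow a static string;
/// custom kinds own theirs. Equality compares the string contents.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct EventKind(Cow<'static, str>);

impl EventKind {
    // Tag strings are assumed for this sketch.
    pub const SANDBOX_CREATED: EventKind = EventKind(Cow::Borrowed("sandbox_created"));
    pub const SANDBOX_SNAPSHOTTED: EventKind = EventKind(Cow::Borrowed("sandbox_snapshotted"));

    /// Escape hatch for user-supplied or custom event names.
    pub fn custom(name: impl Into<String>) -> Self {
        EventKind(Cow::Owned(name.into()))
    }

    pub fn as_str(&self) -> &str {
        &self.0
    }
}

/// Reduced query type: only the kind filter is modelled here.
pub struct EventQuery {
    pub types: Option<Vec<EventKind>>,
}

impl EventQuery {
    /// No filter matches everything; otherwise the kind must be listed.
    pub fn matches(&self, kind: &EventKind) -> bool {
        match &self.types {
            None => true,
            Some(kinds) => kinds.contains(kind),
        }
    }
}
```

Because a borrowed and an owned `Cow` with the same contents compare equal, a string passed through `custom` can still match a named constant.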

Error on unexpected event variant (tui.rs + harness_tool.rs):
  - `latest_sandbox_id`, `list_snapshots`, and `latest_shell_sandbox`
    all queried events with a type filter and then did `if let
    EventData::FooBar { .. } = event.data { ... }`, silently
    dropping anything that didn't match. By construction the filter
    should have made this impossible, so non-match is a storage-
    layer drift indicator. Promoted to a hard error.
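A small sketch of the difference between dropping and erroring, using a cut-down `EventData` with two variants and a free function standing in for `list_snapshots`. Field types and the error type (a plain `String`) are assumptions.

```rust
/// Cut-down stand-in for the real event payload enum.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum EventData {
    SandboxCreated { sandbox_id: String },
    SandboxSnapshotted { sandbox_id: String, snapshot_id: String },
}

/// Turn already-filtered events into (snapshot_id, sandbox_id) rows.
/// A variant the filter should have excluded is reported as an error
/// instead of being skipped silently.
pub fn list_snapshots(filtered: &[EventData]) -> Result<Vec<(String, String)>, String> {
    filtered
        .iter()
        .map(|event| match event {
            EventData::SandboxSnapshotted { sandbox_id, snapshot_id } => {
                Ok((snapshot_id.clone(), sandbox_id.clone()))
            }
            other => Err(format!("event filter returned unexpected variant: {:?}", other)),
        })
        .collect()
}
```

Collecting an iterator of `Result` values stops at the first error, so one stray variant fails the whole listing rather than producing a partial one.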

CI (.github/workflows/integration.yml):
  - Drop the explicit `--test integration_chat --test snapshot_round_trip`
    list; use `--tests -- --ignored` so new test files in
    `crates/cli/tests/` are picked up automatically.

`latest_sandbox_id` query: limit dropped from 50 to 1. The query is
type-filtered + descending; the first match is what we want, asking
for 50 was waste.

Follow-up tracking issues filed for the comments deferred from this
PR:
  - #32 Switch chat REPL slash commands to a CLI library
  - #33 Add proper logging across crates
  - #34 Reduce reliance on docker CLI shell-out in sandbox backend
akrentsel added 2 commits May 28, 2026 06:40
…ntation

Adds the backend-level plumbing for capturing a running sandbox's state
as an opaque blob and reconstituting a sandbox from that blob.

Two new types in the public sandbox API:

  SnapshotPayload { kind, bytes }   - opaque snapshot artifact
  SnapshotKind                       - tag identifying the on-disk format
                                       (DockerImageTar today; new variants
                                       for other backends as they grow)

Trait extensions:

  ManagedSandboxHandle::snapshot()
    Capture this sandbox's state as a SnapshotPayload. Backends that
    can't snapshot return an explicit error.

  ManagedSandboxBackend::acquire_from_snapshot(request, payload)
    Acquire a sandbox whose filesystem is sourced from the supplied
    payload instead of request.spec.image. Mounts, network, lifecycle
    are honoured from the request.

Docker implementation:

  snapshot         docker commit -p <container> exo-snap-<uuid>
                   docker save exo-snap-<uuid>          (to bytes)
                   docker image rm exo-snap-<uuid>      (canonical store
                                                         lives in exoharness)

  restore          docker load < payload.bytes          (parse loaded ref)
                   evict any pre-existing warm container for this key
                   create a fresh warm container off the loaded image
                   (mounts/network/etc. preserved from the request)
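As an illustration of the capture side, the three CLI calls can be written out as argument vectors. This is a sketch under assumptions: `snapshot_commands`, `restore_command`, and `to_command` are hypothetical helpers, not the functions in sandbox.rs, and the container-creation step of restore is omitted because its flags depend on the original request.

```rust
use std::process::Command;

/// Argument vectors for the capture pipeline, in execution order:
/// commit (pausing the container), save to stdout, then remove the temp image.
pub fn snapshot_commands(container: &str, snapshot_uuid: &str) -> Vec<Vec<String>> {
    let image = format!("exo-snap-{}", snapshot_uuid);
    let argvs: Vec<Vec<&str>> = vec![
        vec!["docker", "commit", "-p", container, image.as_str()],
        vec!["docker", "save", image.as_str()],
        vec!["docker", "image", "rm", image.as_str()],
    ];
    argvs
        .into_iter()
        .map(|argv| argv.into_iter().map(String::from).collect())
        .collect()
}

/// First restore step: `docker load` reads the tarball from stdin.
pub fn restore_command() -> Vec<String> {
    vec!["docker".to_string(), "load".to_string()]
}

/// Turn an argument vector into a std::process::Command without running it.
/// Panics if `argv` is empty.
pub fn to_command(argv: &[String]) -> Command {
    let mut cmd = Command::new(&argv[0]);
    cmd.args(&argv[1..]);
    cmd
}
```

A caller would run the save step with stdout piped to collect the tarball bytes, and feed those bytes to the load step's stdin on restore.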

The other implementations are deliberate stubs with clear "where the
real implementation goes" comments:

  - OneShotSandboxHandle::snapshot: snapshots require a warm sandbox
    (positive idle_ttl). One-shot is point-in-time-only by design.
  - LocalProcessSandboxBackend / -Handle: no container filesystem to
    capture or restore on the host.
  - CliContainerSandboxBackend with ContainerCliFlavor::AppleContainer:
    apple's `container` CLI doesn't yet ship the commit/save flow we
    need. When it lands, mirror docker_snapshot_container with a new
    AppleContainerImageTar SnapshotKind variant.

Wires snapshot_sandbox and start_sandbox to the trait methods added in
the previous commit. Until now these two API methods only updated
metadata; now they actually capture and restore container state.

snapshot_sandbox
  - looks up the live ManagedSandboxHandle for the supplied id
  - calls handle.snapshot() to capture a SnapshotPayload (slow:
    docker commit + docker save, kept outside the write lock)
  - persists the payload + a StoredSnapshotManifest sidecar under
      conversations/<conv_id>/snapshots/<snapshot_id>/
        manifest.json   - kind, sandbox_id, created_at, payload_size
        payload.bin     - raw blob (docker save tarball for now)
  - then continues with the existing sandbox-metadata + event updates

start_sandbox
  - loads the snapshot manifest + payload from storage (before the
    write lock, in case the payload is large)
  - calls sandbox_backend.acquire_from_snapshot(request, payload)
    instead of acquire(request) — so the new container's filesystem
    comes from the snapshot rather than request.spec.image

Together these complete the round-trip: take a snapshot at state S,
make changes -> S', call start_sandbox with the snapshot_id, and the
container's filesystem is back at S.

Storage layout follows the existing artifact pattern (sidecar JSON +
.bin blob in a per-id directory), so a future migration to streamed
or chunked storage would touch a small surface.

The sandbox must be running (i.e. in this process's running_sandboxes
map) to be snapshotted — snapshots are of live state. Cross-process
container re-discovery (so a sandbox started by an earlier `exo`
invocation can be snapshotted from a later one) is a worthwhile
follow-up but out of scope here.
@akrentsel akrentsel requested a review from ankrgyl May 28, 2026 06:40
akrentsel added 6 commits May 28, 2026 06:44
Lets a user exercise the snapshot/rewind round-trip without leaving
the conversation:

  /snapshot          capture the conversation's currently-running
                     sandbox; prints the new snapshot id
  /snapshots         list snapshots taken in this conversation
                     (walks SandboxSnapshotted events)
  /rewind <id>       stop the current sandbox, start a fresh one from
                     the named snapshot — subsequent shell tool calls
                     hit the restored filesystem
  /help              show the command list

Lives in the chat repl rather than as a top-level CLI subcommand
because the sandbox running_sandboxes map is per-process; the
container created by an earlier `exo` invocation isn't reachable from
a later one. Inside the repl the same process holds the sandbox for
the duration, so capture + rewind both have a live handle to operate
on.

(When cross-invocation container adoption lands, `exo conversation
snapshot/rewind` subcommands become trivial to add — they just call
the same ConversationHandle methods this repl path uses.)

Verified live against `--sandbox-backend docker`: create file with
contents "v1", /snapshot, overwrite to "v2", /rewind <id>, read file
back -> "v1". On-disk payload is the docker save tarball
(~47 MB for the debian:bookworm base image) plus a manifest.json
recording kind/sandbox_id/created_at/size.

Pulls the design rationale out of the commit messages and into a doc
alongside the other architecture notes. Covers:

  - the producer/consumer model (SnapshotKind is the contract between
    backend snapshot/restore)
  - data flow across ConversationHandle, ManagedSandboxHandle, and
    ManagedSandboxBackend
  - the Docker pipeline (commit -p -> save -> rmi; load -> swap image
    -> evict-and-recreate-warm-container)
  - on-disk layout (manifest.json + payload.bin sidecar pair, mirroring
    the existing artifact pattern)
  - REPL slash-command surface
  - extension recipe for adding a new backend (new SnapshotKind variant
    + matching arm in acquire_from_snapshot)
  - known limits: cross-invocation container adoption, in-memory
    payload size, no GC, no running-process checkpointing

debian:bookworm doesn't ship procps (no ps/pgrep), which makes
sandbox introspection during chat sessions painful — even basic
"list running processes" tool calls hit `command not found`. Swap
the default to ubuntu:24.04, which has procps + coreutils in the
base image. ~50MB smaller bottle, same overlay2 storage, same
network behaviour. Agents that want a different image continue to
override via `agent create --sandbox-image`.

Adds crates/cli/tests/snapshot_round_trip.rs — the canonical
executable reference for using the filesystem snapshot APIs. The
test drives the harness library directly (no LLM mock, no binary
spawn) against a real docker container and walks the same lifecycle
the manual REPL demo does:

  1. create a sandbox
  2. write "version 1" to /tmp/demo.txt via run_in_sandbox
  3. snapshot_sandbox — capture filesystem state
  4. overwrite to "version 2", create a sibling /tmp/post-snapshot.txt
  5. start_sandbox with the captured snapshot_id — rewind
  6. assert /tmp/demo.txt reads "version 1" AND the sibling file is gone

The two assertions on rewind cover both directions of the
correctness claim: a file modified after the snapshot rolls back,
and a file *created* after the snapshot disappears.
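To show why both assertions are needed, here is a toy in-memory model of the round trip. It is not the harness API: `ToySandbox` is a hypothetical stand-in where the "filesystem" is a map and a snapshot is a copy of it.

```rust
use std::collections::BTreeMap;

/// Toy sandbox: paths mapped to file contents.
#[derive(Debug, Clone, Default, PartialEq, Eq)]
pub struct ToySandbox {
    files: BTreeMap<String, String>,
}

impl ToySandbox {
    /// Create or overwrite a file.
    pub fn write(&mut self, path: &str, content: &str) {
        self.files.insert(path.to_string(), content.to_string());
    }

    /// Read a file, or None if it does not exist.
    pub fn read(&self, path: &str) -> Option<&str> {
        self.files.get(path).map(String::as_str)
    }

    /// Capture the full filesystem state.
    pub fn snapshot(&self) -> ToySandbox {
        self.clone()
    }

    /// Replace the current state wholesale with the snapshot's state.
    pub fn rewind(&mut self, snapshot: &ToySandbox) {
        *self = snapshot.clone();
    }
}
```

A rewind that merely copied snapshot files back on top of the current state would pass the first assertion (content rolls back) but fail the second (the post-snapshot file would still exist), which is why the test checks both.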

Mocking philosophy: we don't mock the LLM here at all. The thing
under test is the harness's snapshot/restore primitives; an LLM
mock that emits the same shell tool calls we'd issue directly adds
noise without adding coverage. The shell commands are real and
their effects in the sandbox are observed via run_in_sandbox.

Wiring:

- Linux + docker only (`#[cfg(target_os = "linux")]`, runtime check
  for docker availability, self-skip when
  EXO_TEST_SANDBOX_BACKEND != "docker" so non-docker matrix cells
  pass cleanly).
- `#[ignore]`d so regular `cargo test` skips; CI runs with
  `-- --ignored`.
- integration.yml workflow now lists each test target explicitly
  (`--test integration_chat --test snapshot_round_trip`). Adding a
  new scenario in a future PR is one extra `--test <name>` flag.
- exoharness moved into the cli's [dev-dependencies] with the
  basic-backend feature so the test can use BasicExoHarness
  directly. futures crate added for AsyncReadExt to drive the
  SandboxProcess streams.
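The self-skip logic could look roughly like the following. This is a sketch, not the test's actual preflight code: the function names are hypothetical, treating an unset variable as "skip" is an assumption, and `docker version` is used here as one possible availability probe.

```rust
use std::process::{Command, Stdio};

/// Pure decision: only run when the matrix cell selects the docker backend.
/// An unset variable is treated as "skip" in this sketch.
pub fn should_run_docker_test(backend: Option<&str>) -> bool {
    backend == Some("docker")
}

/// Runtime probe: true when the docker CLI can reach a daemon.
pub fn docker_available() -> bool {
    Command::new("docker")
        .arg("version")
        .stdout(Stdio::null())
        .stderr(Stdio::null())
        .status()
        .map(|status| status.success())
        .unwrap_or(false)
}

/// Combined gate a test body could call first, returning early when false.
pub fn preflight() -> bool {
    let backend = std::env::var("EXO_TEST_SANDBOX_BACKEND").ok();
    should_run_docker_test(backend.as_deref()) && docker_available()
}
```

A test would start with something like `if !preflight() { return; }` so non-docker matrix cells pass without touching Docker.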

Verified locally: passes in ~1.8s against the dev-box docker
daemon. Self-skip path verified by setting
EXO_TEST_SANDBOX_BACKEND=local-process (skips cleanly in 30ms).

docs/sandbox-snapshots.md gains an "Executable demo" section
pointing at the test file as the runnable spec.

I had `#![cfg(target_os = "linux")]` on the new test by mistake —
docker works fine on macOS (Docker Desktop / Colima), and the
integration workflow already has a `macos-15-intel / docker` matrix
cell that runs integration_chat without any OS gate. There's no
reason snapshot_round_trip needs one either: the `docker commit` /
`docker save` calls behave identically on macOS, and the runtime
EXO_TEST_SANDBOX_BACKEND check already handles non-docker cells.

Dropping the gate means the test now also exercises the macos/docker
cell in CI on push to main. Verified locally that build + run + the
local-process self-skip path all still behave correctly.
@akrentsel akrentsel force-pushed the sandbox-snapshots-filesystem branch from 483a3bc to 36dfb8a Compare June 1, 2026 02:09
@akrentsel akrentsel merged commit a082e3e into main Jun 1, 2026
3 checks passed
akrentsel added a commit that referenced this pull request Jun 1, 2026
@akrentsel akrentsel deleted the sandbox-snapshots-filesystem branch June 1, 2026 02:18