Sandbox filesystem snapshots and rewind (Docker)#20
Merged
This was referenced May 25, 2026
akrentsel
commented
May 25, 2026
akrentsel
commented
May 25, 2026
akrentsel
added a commit
that referenced
this pull request
May 25, 2026
Three new integration tests, one per tier of the 3-tier fallback chain
in ensure_shell_sandbox:
tier_1_stopped_container_is_resumed_same_id
Drop the harness (PR #21's Drop stops, doesn't rm). Container
survives on the host in Exited state. Second harness's try_resume
finds it by label, docker-starts it, attaches. Same container ID,
same sandbox_id, marker file persists across the stop/start cycle.
tier_2_gone_container_with_snapshot_restores
First harness takes a snapshot of the live sandbox (PR #20 API).
Drop the harness; `docker rm -f` the container (simulates idle-TTL
expiry / external cleanup). Second harness's try_resume misses,
falls through to Tier 2, finds the snapshot in the event log, and
calls start_sandbox -> acquire_from_snapshot. A NEW container id is
materialised, but the sandbox_id is reused and the marker is
restored from the snapshot — proving the snapshot path actually
fires, not just resume.
tier_3_gone_container_without_snapshot_creates_fresh
Same setup as tier 2 minus the snapshot. Second harness misses
Tier 1 (no container) and Tier 2 (no snapshot), so falls through
to create_sandbox. A new sandbox_id is generated; the conversation
log now has two SandboxCreated events; the previous marker is gone
from the fresh container.
Each test simulates the "two exo processes" boundary by dropping the
first BasicExoHarness and constructing a new one from the same root
dir. Library-API driven (no LLM mock, no binary spawn) — the harness's
3-tier behaviour is the only thing under test here.
Wired into integration.yml as a third --test target alongside
integration_chat and snapshot_round_trip. Self-skips on non-docker
matrix cells via preflight().
All three pass locally in ~3s against real Docker; self-skip path
runs in 50ms.
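The fallback order these tests exercise can be summarised as a small decision function. This is only a sketch of the ordering described above; `Recovery` and `choose_recovery` are hypothetical names, not the actual code in ensure_shell_sandbox.

```rust
/// Which recovery path the harness would take (illustrative type, not the real API).
#[derive(Debug, PartialEq, Eq)]
pub enum Recovery {
    /// Tier 1: a stopped container still exists; start it again and reattach.
    Resume { container_id: String },
    /// Tier 2: the container is gone, but the event log records a snapshot.
    RestoreFromSnapshot { snapshot_id: String },
    /// Tier 3: nothing to recover from; create a brand-new sandbox.
    CreateFresh,
}

/// Pick a tier from what was found on the host and in the event log.
pub fn choose_recovery(
    stopped_container: Option<&str>,
    latest_snapshot: Option<&str>,
) -> Recovery {
    match (stopped_container, latest_snapshot) {
        // A surviving container always wins, even if a snapshot also exists.
        (Some(id), _) => Recovery::Resume { container_id: id.to_string() },
        (None, Some(snap)) => Recovery::RestoreFromSnapshot { snapshot_id: snap.to_string() },
        (None, None) => Recovery::CreateFresh,
    }
}
```

Each of the three tests corresponds to one arm of the match.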
ankrgyl
reviewed
May 25, 2026
This was referenced May 28, 2026
Collaborator
Author
Comments addressed. A handful of follow-up tasks came up, filed in their own issues.
akrentsel
added a commit
that referenced
this pull request
May 28, 2026
CLI / chat REPL (crates/cli/src/tui.rs):
- Refactor the slash-command if-chain into `match trimmed`. Each arm
is a single block; the `/rewind` and `/snapshot <id>` prefix forms
fit as `other if let Some(arg) = ... => ...` guard arms; the `_`
arm is the LLM-send default.
- `/rewind <id>` and `/snapshot <id>` now reject args containing
whitespace ("takes exactly one snapshot id; got: \"id1 id2\"")
instead of feeding multi-word input to the downstream parser.
- New `/snapshot <id>` form for picking which sandbox to snapshot
when a conversation has more than one. `/snapshot` with no arg
still defaults to the latest. Helper renamed
`snapshot_current_sandbox()` -> `snapshot_sandbox(Option<SandboxId>)`.
- Help text updated to show `/snapshot [<id>]` with the default-to-
latest note.
Concurrency (crates/exoharness/src/basic.rs):
- `snapshot_sandbox`: payload (multi-MB) and manifest writes now
fan out via `tokio::try_join!`. The sandbox-metadata write stays
sequential since it advertises the artifact's existence.
- `start_sandbox`: manifest + payload reads run concurrently via
`tokio::join!`. Per-read `with_context` preserved so the
"have you taken a snapshot?" hint still surfaces.
Typed event-kind filter (crates/exoharness/src/types.rs +
8 call sites across exoharness/executor/cli):
- New `EventKind` newtype with 13 named constants
(`SANDBOX_CREATED`, `SANDBOX_SNAPSHOTTED`, etc.) plus a
`custom(name)` escape hatch for `EventData::Custom`. Wire format
unchanged (`#[serde(transparent)]`).
- `EventQuery::types` is now `Option<Vec<EventKind>>` instead of
`Option<Vec<String>>`. Typos like `"sandbox_creatd"` are
compile errors at every known call site.
- `EventData::kind()` is the new source of truth for variant -> tag
mapping. The manual `event_type(&EventData) -> String` helper is
gone; its duplicated match was the original drift hazard.
- Updated 8 call sites (tui, harness_tool, harness_basic_tests x3,
executor/basic, harness_helpers, cli/main). User-supplied
`--type` CLI strings go through `EventKind::custom(...)`; because
equality compares the underlying `Cow` string, a custom kind matches
either a known kind or a true Custom event.
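A sketch of how such a newtype could look, assuming it wraps a `Cow<'static, str>`. The tag strings are guesses inferred from the constant names, `kind_allowed` is a hypothetical helper, and serde is left out so the example stays std-only.

```rust
use std::borrow::Cow;

/// Newtype over the wire tag. The real type is described as
/// `#[serde(transparent)]`; that derive is omitted in this sketch.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct EventKind(Cow<'static, str>);

impl EventKind {
    // Tag strings below are assumptions, not copied from the crate.
    pub const SANDBOX_CREATED: EventKind = EventKind(Cow::Borrowed("sandbox_created"));
    pub const SANDBOX_SNAPSHOTTED: EventKind = EventKind(Cow::Borrowed("sandbox_snapshotted"));

    /// Escape hatch for user-supplied or custom tags.
    pub fn custom(name: impl Into<String>) -> Self {
        EventKind(Cow::Owned(name.into()))
    }

    pub fn as_str(&self) -> &str {
        &self.0
    }
}

/// `None` means "no filter"; otherwise the kind must be in the list.
pub fn kind_allowed(filter: Option<&[EventKind]>, kind: &EventKind) -> bool {
    filter.map_or(true, |kinds| kinds.contains(kind))
}
```

Because `Cow` compares by string content, `EventKind::custom("sandbox_created")` is equal to `EventKind::SANDBOX_CREATED`, which is how a `--type` string can match a known kind.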
Error on unexpected event variant (tui.rs + harness_tool.rs):
- `latest_sandbox_id`, `list_snapshots`, and `latest_shell_sandbox`
all queried events with a type filter and then did `if let
EventData::FooBar { .. } = event.data { ... }`, silently
dropping anything that didn't match. By construction the filter
should have made this impossible, so non-match is a storage-
layer drift indicator. Promoted to a hard error.
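The change from a silent `if let` to a hard error might look roughly like this. The `EventData` enum is a cut-down stand-in and the function takes an already-filtered slice, both simplifications for illustration.

```rust
/// Cut-down stand-in for the real event payload enum.
#[derive(Debug, Clone, PartialEq)]
pub enum EventData {
    SandboxCreated { sandbox_id: String },
    SandboxSnapshotted { sandbox_id: String, snapshot_id: String },
    Custom { name: String },
}

/// `events` is assumed to be the result of a query filtered to
/// SandboxCreated, newest first. Any other variant means the filter and the
/// stored data disagree, so it is reported instead of skipped.
pub fn latest_sandbox_id(events: &[EventData]) -> Result<Option<String>, String> {
    match events.first() {
        None => Ok(None),
        Some(EventData::SandboxCreated { sandbox_id }) => Ok(Some(sandbox_id.clone())),
        Some(other) => Err(format!(
            "query filtered to SandboxCreated returned unexpected variant: {other:?}"
        )),
    }
}
```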
CI (.github/workflows/integration.yml):
- Drop the explicit `--test integration_chat --test snapshot_round_trip`
list; use `--tests -- --ignored` so new test files in
`crates/cli/tests/` are picked up automatically.
`latest_sandbox_id` query: limit dropped from 50 to 1. The query is
type-filtered + descending; the first match is what we want, asking
for 50 was waste.
Follow-up tracking issues filed for the comments deferred from this
PR:
- #32 Switch chat REPL slash commands to a CLI library
- #33 Add proper logging across crates
- #34 Reduce reliance on docker CLI shell-out in sandbox backend
…ntation
Adds the backend-level plumbing for capturing a running sandbox's state
as an opaque blob and reconstituting a sandbox from that blob.
Two new types in the public sandbox API:
- SnapshotPayload { kind, bytes }: opaque snapshot artifact
- SnapshotKind: tag identifying the on-disk format (DockerImageTar
today; new variants for other backends as they grow)
Trait extensions:
ManagedSandboxHandle::snapshot()
Capture this sandbox's state as a SnapshotPayload. Backends that
can't snapshot return an explicit error.
ManagedSandboxBackend::acquire_from_snapshot(request, payload)
Acquire a sandbox whose filesystem is sourced from the supplied
payload instead of request.spec.image. Mounts, network, lifecycle
are honoured from the request.
Docker implementation:
snapshot:
- docker commit -p <container> exo-snap-<uuid>
- docker save exo-snap-<uuid> (to bytes)
- docker image rm exo-snap-<uuid> (canonical store lives in exoharness)
restore:
- docker load < payload.bytes (parse loaded ref)
- evict any pre-existing warm container for this key
- create a fresh warm container off the loaded image
(mounts/network/etc. preserved from the request)
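The snapshot half of the pipeline, written out as the argument vectors a backend could pass to `std::process::Command`. `snapshot_commands` is a hypothetical helper; it only builds the command lines listed above and does not run docker.

```rust
/// Turn string slices into an owned argv.
fn argv(parts: &[&str]) -> Vec<String> {
    parts.iter().map(|p| p.to_string()).collect()
}

/// Build the argv for each step of the snapshot pipeline.
/// `tag` stands in for the `exo-snap-<uuid>` image name.
pub fn snapshot_commands(container: &str, tag: &str) -> Vec<Vec<String>> {
    vec![
        // Pause the container and commit its filesystem to a temporary image.
        argv(&["docker", "commit", "-p", container, tag]),
        // Stream the image as a tarball on stdout; the caller captures the bytes.
        argv(&["docker", "save", tag]),
        // Drop the temporary image; the saved bytes are the canonical copy.
        argv(&["docker", "image", "rm", tag]),
    ]
}
```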
The other implementations are deliberate stubs with clear "where the
real implementation goes" comments:
- OneShotSandboxHandle::snapshot: snapshots require a warm sandbox
(positive idle_ttl). One-shot is point-in-time-only by design.
- LocalProcessSandboxBackend / -Handle: no container filesystem to
capture or restore on the host.
- CliContainerSandboxBackend with ContainerCliFlavor::AppleContainer:
apple's `container` CLI doesn't yet ship the commit/save flow we
need. When it lands, mirror docker_snapshot_container with a new
AppleContainerImageTar SnapshotKind variant.
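A simplified sketch of the handle-side surface and one of the stubs. The type and method names follow the commit message, but the trait here is synchronous with `String` errors, and `LocalProcessHandle` and `FakeDockerHandle` are illustrative stand-ins; the real signatures are likely async and use a richer error type.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SnapshotKind {
    DockerImageTar,
}

/// Opaque snapshot artifact: a format tag plus the raw bytes.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SnapshotPayload {
    pub kind: SnapshotKind,
    pub bytes: Vec<u8>,
}

/// Synchronous stand-in for the handle trait.
pub trait ManagedSandboxHandle {
    fn snapshot(&self) -> Result<SnapshotPayload, String>;
}

/// A backend with no container filesystem refuses explicitly
/// instead of returning an empty payload.
pub struct LocalProcessHandle;

impl ManagedSandboxHandle for LocalProcessHandle {
    fn snapshot(&self) -> Result<SnapshotPayload, String> {
        Err("local-process sandboxes have no container filesystem to snapshot".to_string())
    }
}

/// Toy handle that returns canned bytes where the Docker handle
/// would return the captured `docker save` output.
pub struct FakeDockerHandle {
    pub image_tar: Vec<u8>,
}

impl ManagedSandboxHandle for FakeDockerHandle {
    fn snapshot(&self) -> Result<SnapshotPayload, String> {
        Ok(SnapshotPayload {
            kind: SnapshotKind::DockerImageTar,
            bytes: self.image_tar.clone(),
        })
    }
}
```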
Wires snapshot_sandbox and start_sandbox to the trait methods added in
the previous commit. Previously these two API methods only updated metadata;
now they actually capture and restore container state.
snapshot_sandbox
- looks up the live ManagedSandboxHandle for the supplied id
- calls handle.snapshot() to capture a SnapshotPayload (slow:
docker commit + docker save, kept outside the write lock)
- persists the payload + a StoredSnapshotManifest sidecar under
conversations/<conv_id>/snapshots/<snapshot_id>/
manifest.json - kind, sandbox_id, created_at, payload_size
payload.bin - raw blob (docker save tarball for now)
- then continues with the existing sandbox-metadata + event updates
start_sandbox
- loads the snapshot manifest + payload from storage (before the
write lock, in case the payload is large)
- calls sandbox_backend.acquire_from_snapshot(request, payload)
instead of acquire(request) — so the new container's filesystem
comes from the snapshot rather than request.spec.image
Together these complete the round-trip: take a snapshot at state S,
make changes -> S', call start_sandbox with the snapshot_id, and the
container's filesystem is back at S.
Storage layout follows the existing artifact pattern (sidecar JSON +
.bin blob in a per-id directory), so a future migration to streamed
or chunked storage would touch a small surface.
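The sidecar layout above can be expressed as a small path helper. `SnapshotPaths` and `snapshot_paths` are hypothetical names, and `root` is assumed to be the harness's storage root.

```rust
use std::path::{Path, PathBuf};

/// The two files that make up one stored snapshot.
#[derive(Debug, PartialEq, Eq)]
pub struct SnapshotPaths {
    pub manifest: PathBuf,
    pub payload: PathBuf,
}

/// Resolve conversations/<conv_id>/snapshots/<snapshot_id>/{manifest.json,payload.bin}.
pub fn snapshot_paths(root: &Path, conv_id: &str, snapshot_id: &str) -> SnapshotPaths {
    let dir = root
        .join("conversations")
        .join(conv_id)
        .join("snapshots")
        .join(snapshot_id);
    SnapshotPaths {
        manifest: dir.join("manifest.json"),
        payload: dir.join("payload.bin"),
    }
}
```

Keeping both paths behind one helper is one way to keep the surface small if the payload later moves to streamed or chunked storage.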
The sandbox must be running (i.e. in this process's running_sandboxes
map) to be snapshotted — snapshots are of live state. Cross-process
container re-discovery (so a sandbox started by an earlier `exo`
invocation can be snapshotted from a later one) is a worthwhile
follow-up but out of scope here.
Lets a user exercise the snapshot/rewind round-trip without leaving
the conversation:
- /snapshot: capture the conversation's currently-running sandbox;
prints the new snapshot id
- /snapshots: list snapshots taken in this conversation (walks
SandboxSnapshotted events)
- /rewind <id>: stop the current sandbox and start a fresh one from the
named snapshot; subsequent shell tool calls hit the restored filesystem
- /help: show the command list
Lives in the chat repl rather than as a top-level CLI subcommand
because the sandbox running_sandboxes map is per-process; the
container created by an earlier `exo` invocation isn't reachable from
a later one. Inside the repl the same process holds the sandbox for
the duration, so capture + rewind both have a live handle to operate
on.
(When cross-invocation container adoption lands, `exo conversation
snapshot/rewind` subcommands become trivial to add — they just call
the same ConversationHandle methods this repl path uses.)
Verified live against `--sandbox-backend docker`: create file with
contents "v1", /snapshot, overwrite to "v2", /rewind <id>, read file
back -> "v1". On-disk payload is the docker save tarball
(~47 MB for the debian:bookworm base image) plus a manifest.json
recording kind/sandbox_id/created_at/size.
Pulls the design rationale out of the commit messages and into a doc
alongside the other architecture notes. Covers:
- the producer/consumer model (SnapshotKind is the contract between
backend snapshot/restore)
- data flow across ConversationHandle, ManagedSandboxHandle, and
ManagedSandboxBackend
- the Docker pipeline (commit -p -> save -> rmi; load -> swap image
-> evict-and-recreate-warm-container)
- on-disk layout (manifest.json + payload.bin sidecar pair, mirroring
the existing artifact pattern)
- REPL slash-command surface
- extension recipe for adding a new backend (new SnapshotKind variant
+ matching arm in acquire_from_snapshot)
- known limits: cross-invocation container adoption, in-memory
payload size, no GC, no running-process checkpointing
debian:bookworm doesn't ship procps (no ps/pgrep), which makes sandbox introspection during chat sessions painful — even basic "list running processes" tool calls hit `command not found`. Swap the default to ubuntu:24.04, which has procps + coreutils in the base image. ~50MB smaller bottle, same overlay2 storage, same network behaviour. Agents that want a different image continue to override via `agent create --sandbox-image`.
Adds crates/cli/tests/snapshot_round_trip.rs, the canonical executable
reference for using the filesystem snapshot APIs.
The test drives the harness library directly (no LLM mock, no binary
spawn) against a real docker container and walks the same lifecycle the
manual REPL demo does:
1. create a sandbox
2. write "version 1" to /tmp/demo.txt via run_in_sandbox
3. snapshot_sandbox: capture filesystem state
4. overwrite to "version 2", create a sibling /tmp/post-snapshot.txt
5. start_sandbox with the captured snapshot_id (the rewind)
6. assert /tmp/demo.txt reads "version 1" AND the sibling file is gone
The two assertions on rewind cover both directions of the correctness
claim: a file modified after the snapshot rolls back, and a file
*created* after the snapshot disappears.
Mocking philosophy: we don't mock the LLM here at all. The thing under
test is the harness's snapshot/restore primitives; an LLM mock that emits
the same shell tool calls we'd issue directly adds noise without adding
coverage. The shell commands are real and their effects in the sandbox
are observed via run_in_sandbox.
Wiring:
- Linux + docker only (`#[cfg(target_os = "linux")]`, runtime check for
docker availability, self-skip when EXO_TEST_SANDBOX_BACKEND != "docker"
so non-docker matrix cells pass cleanly).
- `#[ignore]`d so regular `cargo test` skips; CI runs with `-- --ignored`.
- integration.yml workflow now lists each test target explicitly
(`--test integration_chat --test snapshot_round_trip`). Adding a new
scenario in a future PR is one extra `--test <name>` flag.
- exoharness moved into the cli's [dev-dependencies] with the
basic-backend feature so the test can use BasicExoHarness directly.
futures crate added for AsyncReadExt to drive the SandboxProcess streams.
Verified locally: passes in ~1.8s against the dev-box docker daemon.
Self-skip path verified by setting EXO_TEST_SANDBOX_BACKEND=local-process
(skips cleanly in 30ms).
docs/sandbox-snapshots.md gains an "Executable demo" section pointing at the test file as the runnable spec.
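The two rewind assertions can be illustrated without Docker by modelling the sandbox filesystem as an in-memory map. This is a toy stand-in, not the test itself: `FakeSandbox` and `round_trip` are hypothetical, and a snapshot here is just a clone of the map.

```rust
use std::collections::BTreeMap;

/// In-memory stand-in for a sandbox filesystem (path -> contents).
#[derive(Debug, Clone, Default, PartialEq, Eq)]
pub struct FakeSandbox {
    files: BTreeMap<String, String>,
}

impl FakeSandbox {
    pub fn write(&mut self, path: &str, contents: &str) {
        self.files.insert(path.to_string(), contents.to_string());
    }

    pub fn read(&self, path: &str) -> Option<&str> {
        self.files.get(path).map(String::as_str)
    }

    /// Capture the filesystem state at this instant.
    pub fn snapshot(&self) -> FakeSandbox {
        self.clone()
    }

    /// Start a fresh sandbox whose filesystem comes from the snapshot.
    pub fn start_from(snapshot: &FakeSandbox) -> FakeSandbox {
        snapshot.clone()
    }
}

/// Walk the same lifecycle as the numbered steps and return the rewound sandbox.
pub fn round_trip() -> FakeSandbox {
    let mut sandbox = FakeSandbox::default();
    sandbox.write("/tmp/demo.txt", "version 1");
    let snap = sandbox.snapshot();
    // Changes made after the snapshot must not survive the rewind.
    sandbox.write("/tmp/demo.txt", "version 2");
    sandbox.write("/tmp/post-snapshot.txt", "created later");
    FakeSandbox::start_from(&snap)
}
```

After `round_trip()`, reading `/tmp/demo.txt` yields "version 1" and `/tmp/post-snapshot.txt` is absent, matching the two signals the real test checks.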
I had `#![cfg(target_os = "linux")]` on the new test by mistake — docker works fine on macOS (Docker Desktop / Colima), and the integration workflow already has a `macos-15-intel / docker` matrix cell that runs integration_chat without any OS gate. There's no reason snapshot_round_trip needs one either: the `docker commit` / `docker save` calls behave identically on macOS, and the runtime EXO_TEST_SANDBOX_BACKEND check already handles non-docker cells. Dropping the gate means the test now also exercises the macos/docker cell in CI on push to main. Verified locally that build + run + the local-process self-skip path all still behave correctly.
ankrgyl
approved these changes
May 29, 2026
483a3bc to
36dfb8a
Compare
akrentsel
added a commit
that referenced
this pull request
Jun 1, 2026
Fills in the snapshot/rewind plumbing that was previously metadata-only. Captures and restores a sandboxed container's filesystem state. A separate PR (#12) stacks on top of this one for an experimental CRIU full-state path — that one is intentionally a draft because it's blocked by upstream docker bugs.
What this does
Inside a chat REPL (`exo chat repl <agent> <conv>`), three new slash commands:
- `/snapshot`: captures the currently-running sandbox as a `docker save` tarball under `conversations/<conv>/snapshots/<id>/payload.bin` plus a `manifest.json` sidecar.
- `/snapshots`: lists the snapshots taken in this conversation.
- `/rewind <uuid>`: stops the current sandbox and starts a fresh one from the named snapshot.
What this does NOT do: be clear about the limit
The snapshot is filesystem only. Specifically:
- If you `/snapshot` while `nohup sleep 9999 &` is running and then `/rewind`, the file `/proc/<pid>` is not coming back; the process is gone. The new container boots fresh.
- Use `conversation fork` if you want to rewind the conversation itself; snapshots only operate on the sandbox filesystem underneath.
For agent workflows where "state worth preserving" = "files written to disk + tools/packages installed", filesystem snapshots cover the case. For pause-and-resume of long-running in-memory processes, you'd need full-state (CRIU), which is #12's beat and currently impractical to ship due to upstream docker bugs.
Demo
gif:

Then inside `first>`:
Sandbox now has `/tmp/demo.txt` with "version 1".
Sandbox file now reads "version 2".
You can take many snapshots in a conversation and rewind to any of them.
Six commits
1. exoharness: sandbox snapshot/restore trait surface and Docker implementation
Adds `SnapshotPayload { kind, bytes }` + `SnapshotKind::DockerImageTar`. Extends the traits with `ManagedSandboxHandle::snapshot()` and `ManagedSandboxBackend::acquire_from_snapshot(req, payload)`. Stubs with explicit "not supported" errors for OneShot / LocalProcess / AppleContainer with clear "where the real impl goes" comments.
2. exoharness: persist sandbox snapshots and restore via start_sandbox
Wires `snapshot_sandbox` to actually capture and persist. Wires `start_sandbox` to load the payload by manifest kind and call `acquire_from_snapshot`.
3. cli: /snapshot, /snapshots, /rewind slash commands in chat repl
A top-level `exo conversation snapshot` subcommand needs cross-invocation container adoption (separate follow-up).
4. docs: sandbox snapshot/rewind design
`docs/sandbox-snapshots.md` covering data flow, on-disk layout, backend extension story, known limits.
5. exoharness: default sandbox image to ubuntu:24.04
debian:bookworm doesn't ship `procps`; even basic "list running processes" tool calls hit `command not found`. Ubuntu 24.04 has procps + coreutils in the base.
6. ci: end-to-end snapshot + rewind round-trip test (docker, linux)
Adds `crates/cli/tests/snapshot_round_trip.rs` that drives the harness library directly against real Docker. Mirrors the demo's lifecycle and asserts on two independent rewind signals (file content rolls back AND a post-snapshot file disappears). Wired into the integration matrix workflow.
On-disk shape
The snapshot's existence is also recorded in the conversation event log as `SandboxSnapshotted { sandbox_id, snapshot_id }`, which is what `/snapshots` walks to render the listing.
Adding more backends
Anyone adding snapshot support for a new sandbox backend follows this recipe:
1. Add a `SnapshotKind` variant naming the on-disk format (e.g. `AppleContainerImageTar`).
2. Implement `ManagedSandboxHandle::snapshot` to produce that kind. The Docker version is the template: three CLI calls and a `Bytes` capture.
3. Implement `ManagedSandboxBackend::acquire_from_snapshot` to consume the same kind, with an explicit kind-mismatch error.
No other layer changes. The conversation orchestration, on-disk layout, and CLI surface are all backend-agnostic.
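A sketch of the consuming side of that recipe, showing the explicit kind-mismatch arm. The variant names come from this PR; `AppleContainerBackend`, `RestoredSandbox`, and the simplified synchronous signature without a request argument are assumptions made for illustration.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SnapshotKind {
    DockerImageTar,
    AppleContainerImageTar,
}

pub struct SnapshotPayload {
    pub kind: SnapshotKind,
    pub bytes: Vec<u8>,
}

/// Placeholder for whatever handle a real backend would return.
#[derive(Debug, PartialEq, Eq)]
pub struct RestoredSandbox {
    pub image_bytes: usize,
}

/// Hypothetical backend that only understands its own snapshot format.
pub struct AppleContainerBackend;

impl AppleContainerBackend {
    pub fn acquire_from_snapshot(
        &self,
        payload: &SnapshotPayload,
    ) -> Result<RestoredSandbox, String> {
        match payload.kind {
            // The matching kind is the only one this backend can load.
            SnapshotKind::AppleContainerImageTar => Ok(RestoredSandbox {
                image_bytes: payload.bytes.len(),
            }),
            // Any other kind was produced by a different backend: fail loudly.
            other => Err(format!(
                "snapshot kind mismatch: expected AppleContainerImageTar, got {other:?}"
            )),
        }
    }
}
```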
Test plan
- `cargo test --workspace` (51 unit tests pass)
- `crates/cli/tests/snapshot_round_trip.rs`: drives the harness library directly against real Docker on Linux, asserts the file content rolls back AND a post-snapshot file disappears after `/rewind`. Passes locally in ~1.8s; self-skips on non-docker matrix cells. Wired into the existing `integration.yml` workflow alongside the existing `integration_chat` test.
- Manual REPL run: `/snapshot` → file content modified → `/rewind` → file content rolled back. Verified visually before the automated test was written.
- `read_dir()`.