Sandbox full-state (CRIU) snapshots — blocked by docker bugs#12
Draft
akrentsel wants to merge 8 commits into
This was referenced May 24, 2026
…ntation
Adds the backend-level plumbing for capturing a running sandbox's state
as an opaque blob and reconstituting a sandbox from that blob.
Two new types in the public sandbox API:
  SnapshotPayload { kind, bytes }  - opaque snapshot artifact
  SnapshotKind                     - tag identifying the on-disk format
                                     (DockerImageTar today; new variants
                                     for other backends as they grow)
Trait extensions:
  ManagedSandboxHandle::snapshot()
    Capture this sandbox's state as a SnapshotPayload. Backends that
    can't snapshot return an explicit error.
  ManagedSandboxBackend::acquire_from_snapshot(request, payload)
    Acquire a sandbox whose filesystem is sourced from the supplied
    payload instead of request.spec.image. Mounts, network, lifecycle
    are honoured from the request.
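As a rough illustration of the shape described above, here is a minimal sketch of the payload type and the "explicit error" contract. `SnapshotPayload` and `SnapshotKind` are named in this commit; `SnapshotError`, `SnapshotCapable`, `LocalProcessHandle`, and `FakeDockerHandle` are hypothetical stand-ins, and the real `ManagedSandboxHandle` trait almost certainly has more methods and a different error type.

```rust
/// Tag identifying the on-disk format of a snapshot payload.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SnapshotKind {
    /// Bytes produced by `docker save` on a committed container image.
    DockerImageTar,
}

/// Opaque snapshot artifact: `bytes` only mean something to a backend
/// that understands `kind`.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SnapshotPayload {
    pub kind: SnapshotKind,
    pub bytes: Vec<u8>,
}

/// Hypothetical error type for this sketch only.
#[derive(Debug, PartialEq, Eq)]
pub enum SnapshotError {
    Unsupported(&'static str),
}

/// Simplified stand-in for the snapshot part of ManagedSandboxHandle.
pub trait SnapshotCapable {
    fn snapshot(&self) -> Result<SnapshotPayload, SnapshotError>;
}

/// A backend that cannot snapshot says so explicitly instead of
/// returning an empty payload.
pub struct LocalProcessHandle;

impl SnapshotCapable for LocalProcessHandle {
    fn snapshot(&self) -> Result<SnapshotPayload, SnapshotError> {
        Err(SnapshotError::Unsupported(
            "local-process sandboxes have no container filesystem to capture",
        ))
    }
}

/// Fake handle that pretends `image_tar` came from `docker save`.
pub struct FakeDockerHandle {
    pub image_tar: Vec<u8>,
}

impl SnapshotCapable for FakeDockerHandle {
    fn snapshot(&self) -> Result<SnapshotPayload, SnapshotError> {
        Ok(SnapshotPayload {
            kind: SnapshotKind::DockerImageTar,
            bytes: self.image_tar.clone(),
        })
    }
}
```

Callers only ever see a `SnapshotPayload` or an error, so the orchestration layer never needs to know which backend produced the bytes.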
Docker implementation:
  snapshot  docker commit -p <container> exo-snap-<uuid>
            docker save exo-snap-<uuid>        (to bytes)
            docker image rm exo-snap-<uuid>    (canonical store lives in
                                               exoharness)
  restore   docker load < payload.bytes        (parse loaded ref)
            evict any pre-existing warm container for this key
            create a fresh warm container off the loaded image
            (mounts/network/etc. preserved from the request)
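The capture half of this pipeline can be sketched as plain argv construction, plus a parser for the reference that `docker load` prints. The function names `snapshot_commands` and `parse_loaded_ref` are hypothetical and not taken from the PR; the sketch only builds the command lines and does not execute docker.

```rust
/// Builds the three docker invocations used to capture a container,
/// in the order listed above. `snap_id` stands in for the <uuid>.
pub fn snapshot_commands(container: &str, snap_id: &str) -> Vec<Vec<String>> {
    let image = format!("exo-snap-{snap_id}");
    vec![
        // Pause the container while committing so the image is consistent.
        argv(&["docker", "commit", "-p", container, &image]),
        // `docker save` writes the image tarball to stdout.
        argv(&["docker", "save", &image]),
        // Drop the local image; the persisted payload is the canonical copy.
        argv(&["docker", "image", "rm", &image]),
    ]
}

fn argv(parts: &[&str]) -> Vec<String> {
    parts.iter().map(|s| s.to_string()).collect()
}

/// Extracts the image reference from `docker load` output, which reports
/// either "Loaded image: <name:tag>" or "Loaded image ID: <sha256:...>".
pub fn parse_loaded_ref(stdout: &str) -> Option<&str> {
    stdout.lines().find_map(|line| {
        line.strip_prefix("Loaded image: ")
            .or_else(|| line.strip_prefix("Loaded image ID: "))
            .map(str::trim)
    })
}
```

Keeping argv construction separate from process spawning makes the pipeline unit-testable without a docker daemon.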
The other implementations are deliberate stubs with clear "where the
real implementation goes" comments:
- OneShotSandboxHandle::snapshot: snapshots require a warm sandbox
(positive idle_ttl). One-shot is point-in-time-only by design.
- LocalProcessSandboxBackend / -Handle: no container filesystem to
capture or restore on the host.
- CliContainerSandboxBackend with ContainerCliFlavor::AppleContainer:
apple's `container` CLI doesn't yet ship the commit/save flow we
need. When it lands, mirror docker_snapshot_container with a new
AppleContainerImageTar SnapshotKind variant.
Wires snapshot_sandbox and start_sandbox to the trait methods added in
the previous commit. Today these two API methods only updated metadata;
now they actually capture and restore container state.
snapshot_sandbox
- looks up the live ManagedSandboxHandle for the supplied id
- calls handle.snapshot() to capture a SnapshotPayload (slow:
docker commit + docker save, kept outside the write lock)
- persists the payload + a StoredSnapshotManifest sidecar under
conversations/<conv_id>/snapshots/<snapshot_id>/
manifest.json - kind, sandbox_id, created_at, payload_size
payload.bin - raw blob (docker save tarball for now)
- then continues with the existing sandbox-metadata + event updates
start_sandbox
- loads the snapshot manifest + payload from storage (before the
write lock, in case the payload is large)
- calls sandbox_backend.acquire_from_snapshot(request, payload)
instead of acquire(request) — so the new container's filesystem
comes from the snapshot rather than request.spec.image
Together these complete the round-trip: take a snapshot at state S,
make changes -> S', call start_sandbox with the snapshot_id, and the
container's filesystem is back at S.
Storage layout follows the existing artifact pattern (sidecar JSON +
.bin blob in a per-id directory), so a future migration to streamed
or chunked storage would touch a small surface.
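A minimal sketch of the sidecar layout, assuming a storage root that already points at the right agent directory. `snapshot_dir` and `persist_snapshot` are hypothetical names, and the manifest is passed in as a pre-serialized string because the real `StoredSnapshotManifest` serialization is not shown here.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// conversations/<conv_id>/snapshots/<snapshot_id>/ under `root`.
pub fn snapshot_dir(root: &Path, conv_id: &str, snapshot_id: &str) -> PathBuf {
    root.join("conversations")
        .join(conv_id)
        .join("snapshots")
        .join(snapshot_id)
}

/// Writes payload.bin and manifest.json side by side and returns the
/// directory. The payload is written first so that, in this sketch, a
/// manifest on disk implies the blob next to it was fully written.
pub fn persist_snapshot(
    root: &Path,
    conv_id: &str,
    snapshot_id: &str,
    manifest_json: &str,
    payload: &[u8],
) -> io::Result<PathBuf> {
    let dir = snapshot_dir(root, conv_id, snapshot_id);
    fs::create_dir_all(&dir)?;
    fs::write(dir.join("payload.bin"), payload)?;
    fs::write(dir.join("manifest.json"), manifest_json)?;
    Ok(dir)
}
```

Because callers only go through these two helpers, a later move to streamed or chunked payloads would change `persist_snapshot` and its read-side counterpart without touching the directory scheme.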
The sandbox must be running (i.e. in this process's running_sandboxes
map) to be snapshotted — snapshots are of live state. Cross-process
container re-discovery (so a sandbox started by an earlier `exo`
invocation can be snapshotted from a later one) is a worthwhile
follow-up but out of scope here.
Lets a user exercise the snapshot/rewind round-trip without leaving
the conversation:
  /snapshot     capture the conversation's currently-running sandbox;
                prints the new snapshot id
  /snapshots    list snapshots taken in this conversation
                (walks SandboxSnapshotted events)
  /rewind <id>  stop the current sandbox, start a fresh one from the
                named snapshot; subsequent shell tool calls hit the
                restored filesystem
  /help         show the command list
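The command surface above could be parsed along these lines. This is a sketch under stated assumptions: `SlashCommand` and `parse_slash_command` are hypothetical names, and the real REPL may treat extra arguments and error messages differently.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum SlashCommand {
    Snapshot,
    Snapshots,
    Rewind { snapshot_id: String },
    Help,
}

/// Returns `None` for ordinary chat input (no leading '/'),
/// `Some(Ok(..))` for a recognized command, and `Some(Err(..))` with a
/// user-facing message otherwise.
pub fn parse_slash_command(line: &str) -> Option<Result<SlashCommand, String>> {
    let rest = line.trim().strip_prefix('/')?;
    let mut words = rest.split_whitespace();
    let name = words.next().unwrap_or("");
    let arg = words.next();
    Some(match (name, arg) {
        ("snapshot", None) => Ok(SlashCommand::Snapshot),
        ("snapshots", None) => Ok(SlashCommand::Snapshots),
        ("help", None) => Ok(SlashCommand::Help),
        // Only the first argument is used; anything after it is ignored here.
        ("rewind", Some(id)) => Ok(SlashCommand::Rewind {
            snapshot_id: id.to_string(),
        }),
        ("rewind", None) => Err("usage: /rewind <id>".to_string()),
        (other, _) => Err(format!("unrecognized command or arguments: /{other}")),
    })
}
```

Returning `None` for non-command input lets the REPL loop forward everything else to the model unchanged.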
Lives in the chat repl rather than as a top-level CLI subcommand
because the sandbox running_sandboxes map is per-process; the
container created by an earlier `exo` invocation isn't reachable from
a later one. Inside the repl the same process holds the sandbox for
the duration, so capture + rewind both have a live handle to operate
on.
(When cross-invocation container adoption lands, `exo conversation
snapshot/rewind` subcommands become trivial to add — they just call
the same ConversationHandle methods this repl path uses.)
Verified live against `--sandbox-backend docker`: create file with
contents "v1", /snapshot, overwrite to "v2", /rewind <id>, read file
back -> "v1". On-disk payload is the docker save tarball
(~47 MB for the debian:bookworm base image) plus a manifest.json
recording kind/sandbox_id/created_at/size.
Pulls the design rationale out of the commit messages and into a doc
alongside the other architecture notes. Covers:
- the producer/consumer model (SnapshotKind is the contract between
backend snapshot/restore)
- data flow across ConversationHandle, ManagedSandboxHandle, and
ManagedSandboxBackend
- the Docker pipeline (commit -p -> save -> rmi; load -> swap image
-> evict-and-recreate-warm-container)
- on-disk layout (manifest.json + payload.bin sidecar pair, mirroring
the existing artifact pattern)
- REPL slash-command surface
- extension recipe for adding a new backend (new SnapshotKind variant
+ matching arm in acquire_from_snapshot)
- known limits: cross-invocation container adoption, in-memory
payload size, no GC, no running-process checkpointing
debian:bookworm doesn't ship procps (no ps/pgrep), which makes sandbox introspection during chat sessions painful — even basic "list running processes" tool calls hit `command not found`. Swap the default to ubuntu:24.04, which has procps + coreutils in the base image. ~50MB smaller bottle, same overlay2 storage, same network behaviour. Agents that want a different image continue to override via `agent create --sandbox-image`.
Adds SnapshotMode { Filesystem, FullState } as a caller-facing knob on
the snapshot path. Filesystem is the existing docker commit/save flow
(SnapshotKind::DockerImageTar). FullState is new: a CRIU-backed
checkpoint of the live process tree (memory pages, open FDs, sockets,
filesystem diff), tagged SnapshotKind::DockerCheckpointTar.
Pipeline for the new mode:
  capture  docker checkpoint create --checkpoint-dir=<tmp> <c> exo-snap
           tar -cf - -C <tmp>/exo-snap .
  restore  tar -xf - -C <tmp>/exo-snap
           docker create <fresh container, same image/mounts/network>
           docker start --checkpoint exo-snap --checkpoint-dir=<tmp> <new>
Mode and kind are intentionally separate:
- mode = what the caller wants (filesystem vs full state)
- kind = the on-disk format the backend chose to produce
The restore path dispatches on kind alone, so a future backend that
honours the FullState mode but produces a different format (e.g.
apple-container's VZ.framework save state) just adds a new kind variant
and matching arm in acquire_from_snapshot.
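The mode/kind split can be illustrated with two small functions: one maps the caller's requested mode to the format the Docker backend produces, the other picks a restore strategy from the kind alone. `SnapshotMode` and `SnapshotKind` are the names from this commit; `docker_kind_for`, `docker_restore_plan`, and `RestorePlan` are hypothetical helpers for this sketch.

```rust
/// What the caller asks for.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SnapshotMode {
    Filesystem,
    FullState,
}

/// What the backend actually wrote to disk.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SnapshotKind {
    DockerImageTar,
    DockerCheckpointTar,
}

/// Hypothetical summary of the two Docker restore strategies.
#[derive(Debug, PartialEq, Eq)]
pub enum RestorePlan {
    /// docker load, then create a fresh warm container off the image.
    LoadImage,
    /// docker create, then docker start --checkpoint.
    StartFromCheckpoint,
}

/// Capture side: the Docker backend chooses a format for each mode.
pub fn docker_kind_for(mode: SnapshotMode) -> SnapshotKind {
    match mode {
        SnapshotMode::Filesystem => SnapshotKind::DockerImageTar,
        SnapshotMode::FullState => SnapshotKind::DockerCheckpointTar,
    }
}

/// Restore side: dispatch on kind only; the mode is never consulted.
pub fn docker_restore_plan(kind: SnapshotKind) -> RestorePlan {
    match kind {
        SnapshotKind::DockerImageTar => RestorePlan::LoadImage,
        SnapshotKind::DockerCheckpointTar => RestorePlan::StartFromCheckpoint,
    }
}
```

Because both matches are exhaustive, adding a new `SnapshotKind` variant for another backend forces the compiler to point at every restore arm that needs updating.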
Surface preflight + error handling:
- `docker info --format {{.ExperimentalBuild}}` is probed before any
checkpoint operation; if false, surfaces an actionable message
pointing at docs/requirements.md rather than letting docker's raw
"Unknown command: checkpoint" bubble up.
- same on restore — failure to start --checkpoint cleans up the
half-created container and reports cause.
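A sketch of the preflight, assuming the probe is the `docker info --format {{.ExperimentalBuild}}` call named above. `check_experimental` and `probe_docker_experimental` are hypothetical names, and the error wording is illustrative rather than the message the PR actually emits.

```rust
use std::process::Command;

/// Interprets the stdout of `docker info --format {{.ExperimentalBuild}}`.
/// Anything other than "true" is treated as "checkpoint unavailable".
pub fn check_experimental(docker_info_stdout: &str) -> Result<(), String> {
    match docker_info_stdout.trim() {
        "true" => Ok(()),
        other => Err(format!(
            "full-state snapshots need docker's experimental mode \
             (ExperimentalBuild reported {other:?}); see docs/requirements.md"
        )),
    }
}

/// Runs the probe against the local docker CLI. Not exercised in tests
/// because it needs a docker installation.
pub fn probe_docker_experimental() -> Result<(), String> {
    let out = Command::new("docker")
        .args(["info", "--format", "{{.ExperimentalBuild}}"])
        .output()
        .map_err(|e| format!("failed to run docker info: {e}"))?;
    if !out.status.success() {
        return Err(format!(
            "docker info failed: {}",
            String::from_utf8_lossy(&out.stderr).trim()
        ));
    }
    check_experimental(&String::from_utf8_lossy(&out.stdout))
}
```

Splitting the parse from the process call keeps the decision logic testable on runners without docker.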
CLI: new `/checkpoint` slash command in the chat REPL alongside the
existing `/snapshot`. /rewind handles either kind transparently
(dispatch happens at the backend layer based on manifest.json).
Trait API change: ConversationHandle::snapshot_sandbox grows a `mode`
parameter; ManagedSandboxHandle::snapshot likewise. Single internal
caller updated. Mock impls in tests updated.
Verified locally that filesystem-mode snapshots still work end-to-end
(the regression path) and that the experimental-flag preflight
correctly returns the actionable error on this runner where docker
experimental is disabled. End-to-end CRIU verification requires a host
with CRIU installed and docker experimental enabled — see
docs/requirements.md.
Adds docs/requirements.md as the central place for runtime requirements per feature/backend (sandbox + secret backend matrices, CRIU setup for full-state snapshots, CI matrix expectations). Updates docs/sandbox-snapshots.md to cover the SnapshotMode addition: filesystem vs full-state semantics, the mode/kind separation, the new docker checkpoint pipeline, and the /checkpoint slash command. The existing snapshot doc's "fundamental reason" line for processes-not-captured is no longer accurate: full-state IS captured under SnapshotMode::FullState. Reworked that section.
The earlier full-state restore path passed --checkpoint-dir to both
`docker checkpoint create` and `docker start --checkpoint`. The
latter actually rejects custom checkpoint dirs at runtime ("custom
checkpointdir is not supported"), so the restore step always failed
with that error before any of our error handling kicked in.
Switch both ends to docker's default location
(/var/lib/docker/containers/<id>/checkpoints/<name>/). The
trade-off is that those paths are root-owned, so the tar/untar
steps need sudo:
- snapshot: `docker checkpoint create <c> <name>`, then
  `sudo tar -cf - -C /var/lib/docker/containers/<id>/checkpoints/<name> .`
  to read the dump bytes, then `docker checkpoint rm` to drop the local
  copy (exoharness storage is the canonical home).
- restore: `docker create` a fresh container, `sudo mkdir -p` its
  checkpoint dir, `sudo tar -xf -` the payload bytes into it, then
  `docker start --checkpoint <name>` (no --checkpoint-dir).
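The two sequences above can be sketched as command-line builders around docker's default checkpoint location. The helper names are hypothetical, and the commands are rendered as display strings for illustration (real code would pass arguments separately rather than through a shell). The trailing `.` on the capture-side tar follows the earlier pipeline in this PR.

```rust
use std::path::PathBuf;

/// Docker's default checkpoint location for a container. `container_id`
/// is assumed to be the full container ID, since that is how the
/// directory under /var/lib/docker/containers is named.
pub fn default_checkpoint_dir(container_id: &str, name: &str) -> PathBuf {
    PathBuf::from("/var/lib/docker/containers")
        .join(container_id)
        .join("checkpoints")
        .join(name)
}

/// Capture: checkpoint, tar the root-owned dump to stdout, drop the local copy.
pub fn capture_steps(container_id: &str, name: &str) -> Vec<String> {
    let dir = default_checkpoint_dir(container_id, name);
    vec![
        format!("docker checkpoint create {container_id} {name}"),
        format!("sudo tar -cf - -C {} .", dir.display()),
        format!("docker checkpoint rm {container_id} {name}"),
    ]
}

/// Restore into an already-created fresh container: unpack the payload
/// (fed on stdin) into the default dir, then start from the checkpoint
/// without passing --checkpoint-dir.
pub fn restore_steps(container_id: &str, name: &str) -> Vec<String> {
    let dir = default_checkpoint_dir(container_id, name);
    vec![
        format!("sudo mkdir -p {}", dir.display()),
        format!("sudo tar -xf - -C {}", dir.display()),
        format!("docker start --checkpoint {name} {container_id}"),
    ]
}
```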
Adds the passwordless-sudo requirement to docs/requirements.md with
an example sudoers fragment for machines that don't have it by
default.
Note: this commit unblocks the code path but does NOT make
end-to-end /checkpoint work on docker 29.x. After fixing the dir
issue we then hit two further docker bugs (`/proc/0/ns/net`
netns-restore bug and a containerd "already exists" regression in
the 29.x line) that are known upstream issues and beyond our
control. The capture half of the pipeline works fully; the restore
half is blocked by docker until a 28.x-style flow returns or we
swap to a different runtime. See the next commit's design doc
update for the full investigation.
akrentsel added a commit that referenced this pull request on Jun 1, 2026
Fills in the snapshot/rewind plumbing that was previously metadata-only. **Captures and restores a sandboxed container's filesystem state.** A separate PR (#12) stacks on top of this one for an experimental CRIU full-state path; that one is intentionally a draft because it's blocked by upstream docker bugs.

## What this does

Inside a chat REPL (`exo chat repl <agent> <conv>`), three new slash commands:

| Command | What it does |
|---|---|
| `/snapshot` | Capture the sandbox's **filesystem state** as it is right now: every file created, modified, or deleted since the base image. Prints a snapshot UUID. On disk this is a `docker save` tarball under `conversations/<conv>/snapshots/<id>/payload.bin` plus a `manifest.json` sidecar. |
| `/snapshots` | List every snapshot taken in this conversation. |
| `/rewind <uuid>` | Stop the current container, bring up a fresh one whose filesystem matches the snapshot. Subsequent tool calls see the rolled-back state. |

## What this does NOT do: be clear about the limit

The snapshot is **filesystem only**. Specifically:

- ✅ **Files persisted to disk inside the container** are captured and restored: files you created, packages you installed, configs you wrote.
- ❌ **Running processes are not preserved.** If you `/snapshot` while `nohup sleep 9999 &` is running and then `/rewind`, the file `/proc/<pid>` is not coming back; the process is gone. The new container boots fresh.
- ❌ **In-memory state is gone.** An interactive REPL's variables, an open TCP connection, a buffered write that hadn't been flushed: none of these survive a rewind.
- ❌ **Conversation history is not rewound.** Your chat messages and event log stay where they were. Use `conversation fork` if you want to rewind the conversation itself; snapshots only operate on the sandbox filesystem underneath.

For agent workflows where "state worth preserving" = "files written to disk + tools/packages installed", filesystem snapshots cover the case. For pause-and-resume of long-running in-memory processes, you'd need full-state (CRIU), which is #12's beat and currently impractical to ship due to upstream docker bugs.

## Demo

gif: <img width="800" height="515" alt="snapshot-rewind" src="https://github.com/user-attachments/assets/55223d38-aae8-42ce-9884-1106b04bcc55" />

```bash
# Bootstrap a state dir + agent + conversation, drop into the REPL.
# (--networking enabled because the demo needs the LLM to reach OpenAI)
STATE=/tmp/exo-snapshot-demo && rm -rf $STATE && mkdir -p $STATE && \
EXO="./target/debug/exo --root $STATE --secret-backend file --sandbox-backend docker --master-key-path $STATE/master.key" && \
$EXO secret set OPENAI_KEY --env OPENAI_API_KEY && \
$EXO model register gpt-4o-mini --secret OPENAI_KEY && \
$EXO agent create demo --model gpt-4o-mini --networking enabled && \
$EXO conversation create demo first --repl
```

Then inside `first>`:

```
create /tmp/demo.txt with the content "version 1" and cat it back
```

*Sandbox now has `/tmp/demo.txt` with "version 1".*

```
/snapshot
```

> `snapshot 019e5782-7c6b-72a2-b4fa-a81bf56eb37e`
>
> Behind the scenes: `docker commit -p` → `docker save` → tarball (~47MB for the ubuntu:24.04 base) → written to `conversations/<conv>/snapshots/<id>/payload.bin` + `manifest.json`. The local docker image is dropped (`docker image rm`); exoharness storage is the canonical home.

```
overwrite /tmp/demo.txt with "version 2" and cat it back
```

*Sandbox file now reads "version 2".*

```
/snapshots
```

> ```
> SNAPSHOT                              SANDBOX
> 019e5782-7c6b-72a2-b4fa-a81bf56eb37e  sandbox-019e5782-2a46-7970-a5bf-62900a2233e8
> ```

```
/rewind 019e5782-7c6b-72a2-b4fa-a81bf56eb37e
```

> `rewound to snapshot 019e5782-7c6b-72a2-b4fa-a81bf56eb37e`
>
> Behind the scenes: `docker load < payload.bin` → fresh container booted from the restored image → swapped into the warm pool, keyed identically. Mounts / network policy / lifecycle preserved from the original sandbox request.

```
cat /tmp/demo.txt
```

> `version 1` ← the rewind worked

You can take many snapshots in a conversation and rewind to any of them.

## Six commits

| # | Commit | What |
|---|---|---|
| 1 | `exoharness: sandbox snapshot/restore trait surface and Docker implementation` | Adds `SnapshotPayload { kind, bytes }` + `SnapshotKind::DockerImageTar`. Extends `ManagedSandboxHandle::snapshot()` and `ManagedSandboxBackend::acquire_from_snapshot(req, payload)`. Stubs with explicit "not supported" errors for OneShot / LocalProcess / AppleContainer with clear "where the real impl goes" comments. |
| 2 | `exoharness: persist sandbox snapshots and restore via start_sandbox` | Wires `snapshot_sandbox` to actually capture and persist. Wires `start_sandbox` to load the payload by manifest kind and call `acquire_from_snapshot`. |
| 3 | `cli: /snapshot, /snapshots, /rewind slash commands in chat repl` | The user-facing surface. Lives in the REPL because the sandbox handle is per-process; a top-level `exo conversation snapshot` subcommand needs cross-invocation container adoption (separate follow-up). |
| 4 | `docs: sandbox snapshot/rewind design` | `docs/sandbox-snapshots.md` covering data flow, on-disk layout, backend extension story, known limits. |
| 5 | `exoharness: default sandbox image to ubuntu:24.04` | debian:bookworm doesn't ship `procps`; even basic "list running processes" tool calls hit `command not found`. Ubuntu 24.04 has procps + coreutils in the base. |
| 6 | `ci: end-to-end snapshot + rewind round-trip test (docker, linux)` | New `crates/cli/tests/snapshot_round_trip.rs` that drives the harness library directly against real Docker. Mirrors the demo's lifecycle and asserts on two independent rewind signals (file content rolls back AND a post-snapshot file disappears). Wired into the integration matrix workflow. |

## On-disk shape

```
agents/<agent_id>/conversations/<conv_id>/snapshots/<snapshot_id>/
├── manifest.json   { snapshot_id, sandbox_id, kind, created_at, payload_size_bytes }
└── payload.bin     docker save tarball for SnapshotKind::DockerImageTar
```

The snapshot's existence is also recorded in the conversation event log as `SandboxSnapshotted { sandbox_id, snapshot_id }`, which is what `/snapshots` walks to render the listing.

## Adding more backends

Anyone adding snapshot support for a new sandbox backend follows this recipe:

1. Add a new `SnapshotKind` variant naming the on-disk format (e.g. `AppleContainerImageTar`).
2. Implement `ManagedSandboxHandle::snapshot` to produce that kind. The Docker version is the template: three CLI calls and a `Bytes` capture.
3. Implement `ManagedSandboxBackend::acquire_from_snapshot` to consume the same kind, with an explicit kind-mismatch error.
4. Backends that genuinely can't snapshot (local-process today) keep returning an explicit error.

No other layer changes. The conversation orchestration, on-disk layout, and CLI surface are all backend-agnostic.

## Test plan

- [x] `cargo test --workspace` (51 unit tests pass)
- [x] **End-to-end CI test added: `crates/cli/tests/snapshot_round_trip.rs`**. Drives the harness library directly against real Docker on Linux, asserts the file content rolls back AND a post-snapshot file disappears after `/rewind`. Passes locally in ~1.8s; self-skips on non-docker matrix cells. Wired into the existing `integration.yml` workflow alongside the existing `integration_chat` test.
- [x] Manual REPL verification (the demo above): `/snapshot` → file content modified → `/rewind` → file content rolled back. Verified visually before the automated test was written.
- [x] On-disk layout (manifest.json + payload.bin) verified by the round-trip test using `read_dir()`.

Stacked on #20 (`sandbox-snapshots-filesystem`).

## Summary

Adds a second snapshot mode on top of the filesystem-only path in #20:

- `SnapshotMode::FullState`: captures filesystem + running processes + memory + open FDs via `docker checkpoint create` (CRIU under the hood).
- `SnapshotKind::DockerCheckpointTar`: the tagged on-disk format alongside `DockerImageTar`.
- `/checkpoint` slash command in the REPL alongside `/snapshot`.
- `/rewind` dispatches on the persisted manifest kind, so it transparently handles either mode.

Three commits:

| Commit | What |
|---|---|
| `exoharness: full-state (CRIU) snapshot mode for docker sandboxes` | Trait API change (`snapshot(mode)`), Docker capture+restore pipelines, `/checkpoint` REPL command, requirements preflight (`docker info` `ExperimentalBuild`) with an actionable error. |
| `docs: requirements doc + cover full-state snapshot mode` | `docs/requirements.md` (general home for runtime requirements per backend) + design doc updates covering both modes and the mode/kind separation. |
| `exoharness: rewrite CRIU restore to use docker's default checkpoint dir` | The earlier restore path passed `--checkpoint-dir`; docker rejects custom dirs on `start --checkpoint` ("custom checkpointdir is not supported"). Switch to docker's default `/var/lib/docker/containers/<id>/checkpoints/`, which requires `sudo tar`/`sudo mkdir` since the dir is root-owned. |

## What's verified

- `docker checkpoint create` succeeds, the CRIU dump bytes are tarred and persisted to exoharness storage, and the manifest is written correctly.
- When `docker info --format '{{.ExperimentalBuild}}'` is `false`, `/checkpoint` returns the actionable error pointing at `docs/requirements.md` instead of letting docker fail cryptically.

## What's NOT working: known docker bugs

The restore path correctly invokes docker (verified via raw-docker reproduction without any exo code involved). Docker itself then fails:

1. `Error response from daemon: bind-mount /proc/0/ns/net -> /var/run/docker/netns/...: no such file or directory` (`--network=host` as workaround).
2. `failed to upload checkpoint to containerd: commit failed: content sha256:... already exists`

Both reproduce on docker 29.4.2 (dev box), and the second also reproduces on docker 29.1.3 (test VM with full kernel modules). The forum thread reports that docker 28.3.3 + CRIU + `--network=host` does work; we couldn't validate that combination because neither of the boxes we have access to runs that exact combo (one is on 29.x, the other has a stripped kernel without the modules CRIU needs).

## Why keep the code around (maybe)

- … `container` CLI exposes it, etc.).

## Why drop the code (maybe)

## Test plan

- [x] `cargo test --workspace` (51 pass)
- [x] `/checkpoint` capture succeeds; payload bytes written; manifest correct
- [ ] `/checkpoint` → `/rewind` on docker 28.3.3 with `--network=host` (not verified: no matching host available)