Sandbox full-state (CRIU) snapshots — blocked by docker bugs by akrentsel · Pull Request #12 · ankrgyl/exo

akrentsel · 2026-05-24T17:43:48Z

⚠️ Draft. Not recommended for merge as-is. Captures work; restore is blocked by known docker bugs across 28.x and 29.x. Keeping the branch so the code and design are reviewable, but probably either lands behind a feature flag with documented broken-restore or doesn't land at all until upstream docker fixes its checkpoint feature.

Stacked on #20 (sandbox-snapshots-filesystem).

Summary

Adds a second snapshot mode on top of the filesystem-only path in #20:

- `SnapshotMode::FullState`: captures filesystem + running processes + memory + open FDs via `docker checkpoint create` (CRIU under the hood).
- `SnapshotKind::DockerCheckpointTar`: the tagged on-disk format alongside `DockerImageTar`.
- `/checkpoint` slash command in the REPL alongside `/snapshot`.
- `/rewind` dispatches on the persisted manifest kind, so it transparently handles either mode.

Three commits:

| # | Commit | What |
|---|---|---|
| 1 | `exoharness: full-state (CRIU) snapshot mode for docker sandboxes` | Trait extensions (`snapshot(mode)`), Docker capture+restore pipelines, `/checkpoint` REPL command, requirements preflight (`docker info ExperimentalBuild`) with an actionable error. |
| 2 | `docs: requirements doc + cover full-state snapshot mode` | New `docs/requirements.md` (general home for runtime requirements per backend) + design doc updates covering both modes and the mode/kind separation. |
| 3 | `exoharness: rewrite CRIU restore to use docker's default checkpoint dir` | Initial implementation tried `--checkpoint-dir`; docker rejects custom dirs on `start --checkpoint` ("custom checkpointdir is not supported"). Switch to docker's default `/var/lib/docker/containers/<id>/checkpoints/`, which requires `sudo tar`/`sudo mkdir` since the dir is root-owned. |

What's verified

- Capture works end-to-end on docker 29.x: `docker checkpoint create` succeeds, the CRIU dump bytes are tarred and persisted to exoharness storage, and the manifest is written correctly.
- Preflight: when `docker info --format '{{.ExperimentalBuild}}'` is false, `/checkpoint` returns the actionable error pointing at `docs/requirements.md` instead of letting docker fail cryptically.
- 51 unit tests pass.
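The preflight described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the `Preflight` enum, the function names, and the error wording are hypothetical; only the `docker info --format '{{.ExperimentalBuild}}'` probe and the pointer to `docs/requirements.md` come from the description.

```rust
use std::io;
use std::process::Command;

/// Result of the experimental-flag preflight (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum Preflight {
    Ready,
    ExperimentalDisabled,
}

/// Interprets the stdout of `docker info --format '{{.ExperimentalBuild}}'`.
/// Anything other than a literal "true" is treated as disabled.
pub fn parse_experimental(stdout: &str) -> Preflight {
    if stdout.trim() == "true" {
        Preflight::Ready
    } else {
        Preflight::ExperimentalDisabled
    }
}

/// Builds the user-facing message. The wording is an assumption; the PR only
/// states that the error is actionable and points at docs/requirements.md.
pub fn preflight_message(state: &Preflight) -> Option<String> {
    match state {
        Preflight::Ready => None,
        Preflight::ExperimentalDisabled => Some(String::from(
            "full-state snapshots need docker experimental features; see docs/requirements.md",
        )),
    }
}

/// Runs the probe against the local docker CLI.
pub fn probe_docker() -> io::Result<Preflight> {
    let output = Command::new("docker")
        .args(&["info", "--format", "{{.ExperimentalBuild}}"])
        .output()?;
    Ok(parse_experimental(&String::from_utf8_lossy(&output.stdout)))
}
```

Keeping the parsing separate from the process spawn means the decision logic can be unit-tested on hosts without docker installed.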

What's NOT working — known docker bugs

The restore path correctly invokes docker (verified via raw-docker reproduction without any exo code involved). Docker itself then fails:

| Path | Error | Tracking |
|---|---|---|
| Restore into fresh container, default networking | `Error response from daemon: bind-mount /proc/0/ns/net -> /var/run/docker/netns/...: no such file or directory` | Acknowledged docker bug: the daemon writes invalid netns paths into the checkpoint. Docker forum maintainer recommends `--network=host` as workaround. |
| Restore in-place, any networking | `failed to upload checkpoint to containerd: commit failed: content sha256:... already exists` | moby/moby#42900, claimed fixed in 2021 (PR 47456) but regressed somewhere in docker 29.x |

Both reproduce on docker 29.4.2 (dev box); the second also reproduces on docker 29.1.3 (test VM with full kernel modules). The forum thread reports that docker 28.3.3 + CRIU + `--network=host` does work. We couldn't validate that combination because neither of the boxes we have access to runs it: one is on 29.x, the other has a stripped kernel without the modules CRIU needs.
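If the restore path wanted to tell users which known upstream failure they hit, it could match on the daemon's error text. The sketch below is an assumption about how that might look, not something in this PR; the enum and function names are hypothetical, and the matched substrings are taken from the errors quoted in this description.

```rust
/// Known docker restore failures documented in this PR (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum RestoreFailure {
    /// `docker start --checkpoint` was given `--checkpoint-dir`.
    CustomCheckpointDir,
    /// Daemon wrote an invalid netns path into the checkpoint.
    NetnsBindMount,
    /// containerd content-store "already exists" regression (moby/moby#42900).
    ContainerdContentExists,
    /// Anything else: surface the raw error unchanged.
    Unknown,
}

/// Classifies docker's stderr by substring. Matching on message text is
/// brittle by nature, so unrecognised output falls through to `Unknown`.
pub fn classify_restore_error(stderr: &str) -> RestoreFailure {
    if stderr.contains("custom checkpointdir is not supported") {
        RestoreFailure::CustomCheckpointDir
    } else if stderr.contains("bind-mount /proc/0/ns/net") {
        RestoreFailure::NetnsBindMount
    } else if stderr.contains("failed to upload checkpoint to containerd")
        && stderr.contains("already exists")
    {
        RestoreFailure::ContainerdContentExists
    } else {
        RestoreFailure::Unknown
    }
}
```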

Why keep the code around (maybe)

- Capture is sound and the trait surface is right; when docker fixes its restore bugs or we add a different runtime (podman has substantially better CRIU integration, runc could be driven directly), the restore arm is a small addition.
- The mode/kind separation lets us cleanly add more kinds later (Apple VZ pause+save when container CLI exposes it, etc.).
- Documents the failure modes so future-us doesn't redo this investigation.

Why drop the code (maybe)

- It's a meaningful diff for a feature that doesn't actually work.
- "It works on docker 28.3.3 with the right network setup" is a fragile claim we can't verify on our infrastructure.
- Filesystem snapshots cover ~99% of practical agent-rewind use cases (state lives on disk, processes are short-lived shell calls).

Test plan

- `cargo test --workspace` (51 pass)
- `/checkpoint` capture succeeds; payload bytes written; manifest correct
- Preflight error path verified when experimental is off
- Raw-docker reproduction confirms restore failure is upstream, not in exo code
- End-to-end `/checkpoint` → `/rewind` on docker 28.3.3 with `--network=host` (not verified: no matching host available)

akrentsel · 2026-05-24T23:33:22Z

Woo, looks like this is working!!

successful snapshotting

…ntation

Adds the backend-level plumbing for capturing a running sandbox's state as an opaque blob and reconstituting a sandbox from that blob.

Two new types in the public sandbox API:

- SnapshotPayload { kind, bytes }: opaque snapshot artifact
- SnapshotKind: tag identifying the on-disk format (DockerImageTar today; new variants for other backends as they grow)

Trait extensions:

- ManagedSandboxHandle::snapshot(): capture this sandbox's state as a SnapshotPayload. Backends that can't snapshot return an explicit error.
- ManagedSandboxBackend::acquire_from_snapshot(request, payload): acquire a sandbox whose filesystem is sourced from the supplied payload instead of request.spec.image. Mounts, network, lifecycle are honoured from the request.

Docker implementation:

- snapshot
  - docker commit -p <container> exo-snap-<uuid>
  - docker save exo-snap-<uuid> (to bytes)
  - docker image rm exo-snap-<uuid> (canonical store lives in exoharness)
- restore
  - docker load < payload.bytes (parse loaded ref)
  - evict any pre-existing warm container for this key
  - create a fresh warm container off the loaded image (mounts/network/etc. preserved from the request)

The other implementations are deliberate stubs with clear "where the real implementation goes" comments:

- OneShotSandboxHandle::snapshot: snapshots require a warm sandbox (positive idle_ttl). One-shot is point-in-time-only by design.
- LocalProcessSandboxBackend / -Handle: no container filesystem to capture or restore on the host.
- CliContainerSandboxBackend with ContainerCliFlavor::AppleContainer: apple's `container` CLI doesn't yet ship the commit/save flow we need. When it lands, mirror docker_snapshot_container with a new AppleContainerImageTar SnapshotKind variant.
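As a rough illustration of the Docker snapshot half of this commit, the three CLI calls can be expressed as argv lists. The helper names are hypothetical and the real implementation presumably also spawns the processes and captures the `docker save` output; this sketch only builds the command lines named in the commit message.

```rust
/// Converts string slices into an owned argv (hypothetical helper).
fn argv(parts: &[&str]) -> Vec<String> {
    parts.iter().map(|p| p.to_string()).collect()
}

/// Builds the three docker invocations for a filesystem snapshot:
/// commit the paused container to a temporary image, save that image
/// as a tarball on stdout, then remove the temporary image again.
pub fn filesystem_snapshot_commands(container: &str, snapshot_id: &str) -> Vec<Vec<String>> {
    let tag = format!("exo-snap-{}", snapshot_id);
    vec![
        argv(&["docker", "commit", "-p", container, tag.as_str()]),
        argv(&["docker", "save", tag.as_str()]),
        argv(&["docker", "image", "rm", tag.as_str()]),
    ]
}
```

Removing the temporary image at the end matches the stated design: the tarball in exoharness storage is the canonical copy, so nothing needs to linger in the local docker image store.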

Wires snapshot_sandbox and start_sandbox to the trait methods added in the previous commit. Until now these two API methods only updated metadata; now they actually capture and restore container state.

snapshot_sandbox:

- looks up the live ManagedSandboxHandle for the supplied id
- calls handle.snapshot() to capture a SnapshotPayload (slow: docker commit + docker save, kept outside the write lock)
- persists the payload + a StoredSnapshotManifest sidecar under conversations/<conv_id>/snapshots/<snapshot_id>/
  - manifest.json: kind, sandbox_id, created_at, payload_size
  - payload.bin: raw blob (docker save tarball for now)
- then continues with the existing sandbox-metadata + event updates

start_sandbox:

- loads the snapshot manifest + payload from storage (before the write lock, in case the payload is large)
- calls sandbox_backend.acquire_from_snapshot(request, payload) instead of acquire(request), so the new container's filesystem comes from the snapshot rather than request.spec.image

Together these complete the round-trip: take a snapshot at state S, make changes -> S', call start_sandbox with the snapshot_id, and the container's filesystem is back at S.

Storage layout follows the existing artifact pattern (sidecar JSON + .bin blob in a per-id directory), so a future migration to streamed or chunked storage would touch a small surface.

The sandbox must be running (i.e. in this process's running_sandboxes map) to be snapshotted, because snapshots are of live state. Cross-process container re-discovery (so a sandbox started by an earlier `exo` invocation can be snapshotted from a later one) is a worthwhile follow-up but out of scope here.
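The sidecar layout can be sketched with the standard library alone. Everything here is illustrative: the `ManifestFields` struct and function names are hypothetical, the JSON is hand-formatted (assuming the field values contain no characters that need escaping) where the real code presumably uses a serializer, and the directory shape follows the path given in this commit message.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Fields the commit message lists for manifest.json (hypothetical struct).
pub struct ManifestFields<'a> {
    pub kind: &'a str,
    pub sandbox_id: &'a str,
    pub created_at: &'a str,
}

/// conversations/<conv_id>/snapshots/<snapshot_id>/ under a storage root.
pub fn snapshot_dir(root: &Path, conv_id: &str, snapshot_id: &str) -> PathBuf {
    root.join("conversations")
        .join(conv_id)
        .join("snapshots")
        .join(snapshot_id)
}

/// Writes payload.bin and its manifest.json sidecar, returning the directory.
pub fn persist_snapshot(
    root: &Path,
    conv_id: &str,
    snapshot_id: &str,
    fields: &ManifestFields,
    payload: &[u8],
) -> io::Result<PathBuf> {
    let dir = snapshot_dir(root, conv_id, snapshot_id);
    fs::create_dir_all(&dir)?;
    // Raw blob first, so a manifest never exists without its payload.
    fs::write(dir.join("payload.bin"), payload)?;
    let manifest = format!(
        "{{\"kind\":\"{}\",\"sandbox_id\":\"{}\",\"created_at\":\"{}\",\"payload_size\":{}}}\n",
        fields.kind,
        fields.sandbox_id,
        fields.created_at,
        payload.len()
    );
    fs::write(dir.join("manifest.json"), manifest)?;
    Ok(dir)
}
```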

Lets a user exercise the snapshot/rewind round-trip without leaving the conversation:

- /snapshot: capture the conversation's currently-running sandbox; prints the new snapshot id
- /snapshots: list snapshots taken in this conversation (walks SandboxSnapshotted events)
- /rewind <id>: stop the current sandbox, start a fresh one from the named snapshot; subsequent shell tool calls hit the restored filesystem
- /help: show the command list

Lives in the chat repl rather than as a top-level CLI subcommand because the sandbox running_sandboxes map is per-process; the container created by an earlier `exo` invocation isn't reachable from a later one. Inside the repl the same process holds the sandbox for the duration, so capture + rewind both have a live handle to operate on. (When cross-invocation container adoption lands, `exo conversation snapshot/rewind` subcommands become trivial to add; they just call the same ConversationHandle methods this repl path uses.)

Verified live against `--sandbox-backend docker`: create file with contents "v1", /snapshot, overwrite to "v2", /rewind <id>, read file back -> "v1". On-disk payload is the docker save tarball (~47 MB for the debian:bookworm base image) plus a manifest.json recording kind/sandbox_id/created_at/size.
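A minimal sketch of how the REPL might recognise these four commands is shown below. The `SlashCommand` enum, the `parse_slash` function, and the error strings are hypothetical; the commit message only specifies the command names and the `<id>` argument to `/rewind`.

```rust
/// Slash commands added by this commit (hypothetical representation).
#[derive(Debug, PartialEq, Eq)]
pub enum SlashCommand {
    Snapshot,
    Snapshots,
    Rewind(String),
    Help,
}

/// Parses one line of REPL input.
/// Returns Ok(None) for ordinary chat input, Ok(Some(..)) for a recognised
/// command, and Err(..) for a malformed or unknown slash command.
pub fn parse_slash(line: &str) -> Result<Option<SlashCommand>, String> {
    let line = line.trim();
    if !line.starts_with('/') {
        return Ok(None);
    }
    let mut parts = line.split_whitespace();
    let command = parts.next().unwrap_or("");
    match command {
        "/snapshot" => Ok(Some(SlashCommand::Snapshot)),
        "/snapshots" => Ok(Some(SlashCommand::Snapshots)),
        "/help" => Ok(Some(SlashCommand::Help)),
        "/rewind" => match parts.next() {
            Some(id) => Ok(Some(SlashCommand::Rewind(id.to_string()))),
            None => Err(String::from("usage: /rewind <id>")),
        },
        other => Err(format!("unknown command: {}", other)),
    }
}
```

Matching on the whole first token keeps `/snapshot` and `/snapshots` distinct, which a prefix check would not.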

Pulls the design rationale out of the commit messages and into a doc alongside the other architecture notes. Covers:

- the producer/consumer model (SnapshotKind is the contract between backend snapshot/restore)
- data flow across ConversationHandle, ManagedSandboxHandle, and ManagedSandboxBackend
- the Docker pipeline (commit -p -> save -> rmi; load -> swap image -> evict-and-recreate-warm-container)
- on-disk layout (manifest.json + payload.bin sidecar pair, mirroring the existing artifact pattern)
- REPL slash-command surface
- extension recipe for adding a new backend (new SnapshotKind variant + matching arm in acquire_from_snapshot)
- known limits: cross-invocation container adoption, in-memory payload size, no GC, no running-process checkpointing

debian:bookworm doesn't ship procps (no ps/pgrep), which makes sandbox introspection during chat sessions painful — even basic "list running processes" tool calls hit `command not found`. Swap the default to ubuntu:24.04, which has procps + coreutils in the base image. ~50MB smaller bottle, same overlay2 storage, same network behaviour. Agents that want a different image continue to override via `agent create --sandbox-image`.

Adds SnapshotMode { Filesystem, FullState } as a caller-facing knob on the snapshot path. Filesystem is the existing docker commit/save flow (SnapshotKind::DockerImageTar). FullState is new: a CRIU-backed checkpoint of the live process tree (memory pages, open FDs, sockets, filesystem diff), tagged SnapshotKind::DockerCheckpointTar.

Pipeline for the new mode:

- capture
  - docker checkpoint create --checkpoint-dir=<tmp> <c> exo-snap
  - tar -cf - -C <tmp>/exo-snap .
- restore
  - tar -xf - -C <tmp>/exo-snap
  - docker create <fresh container, same image/mounts/network>
  - docker start --checkpoint exo-snap --checkpoint-dir=<tmp> <new>

Mode and kind are intentionally separate:

- mode = what the caller wants (filesystem vs full state)
- kind = the on-disk format the backend chose to produce

The restore path dispatches on kind alone, so a future backend that honours the FullState mode but produces a different format (e.g. apple-container's VZ.framework save state) just adds a new kind variant and matching arm in acquire_from_snapshot.

Surface preflight + error handling:

- `docker info --format {{.ExperimentalBuild}}` is probed before any checkpoint operation; if false, surfaces an actionable message pointing at docs/requirements.md rather than letting docker's raw "Unknown command: checkpoint" bubble up.
- same on restore: failure to start --checkpoint cleans up the half-created container and reports cause.

CLI: new `/checkpoint` slash command in the chat REPL alongside the existing `/snapshot`. /rewind handles either kind transparently (dispatch happens at the backend layer based on manifest.json).

Trait API change: ConversationHandle::snapshot_sandbox grows a `mode` parameter; ManagedSandboxHandle::snapshot likewise. Single internal caller updated. Mock impls in tests updated.

Verified locally that filesystem-mode snapshots still work end-to-end (the regression path) and that the experimental-flag preflight correctly returns the actionable error on this runner where docker experimental is disabled. End-to-end CRIU verification requires a host with CRIU installed and docker experimental enabled; see docs/requirements.md.
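The mode/kind separation can be illustrated with two small enums and two mapping functions. The enum and variant names for `SnapshotMode` and `SnapshotKind` come from this commit message; `RestorePlan` and both function names are hypothetical stand-ins for whatever the backend actually does in each arm.

```rust
/// What the caller asks for.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SnapshotMode {
    Filesystem,
    FullState,
}

/// The on-disk format the backend chose to produce.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SnapshotKind {
    DockerImageTar,
    DockerCheckpointTar,
}

/// Hypothetical label for the restore pipeline a kind selects.
#[derive(Debug, PartialEq, Eq)]
pub enum RestorePlan {
    /// docker load, then create a fresh container from the loaded image.
    LoadImage,
    /// docker create, then docker start --checkpoint.
    StartFromCheckpoint,
}

/// Capture side: the Docker backend maps the requested mode to a kind.
pub fn docker_kind_for(mode: SnapshotMode) -> SnapshotKind {
    match mode {
        SnapshotMode::Filesystem => SnapshotKind::DockerImageTar,
        SnapshotMode::FullState => SnapshotKind::DockerCheckpointTar,
    }
}

/// Restore side: dispatch looks only at the persisted kind, never the mode.
pub fn restore_plan(kind: SnapshotKind) -> RestorePlan {
    match kind {
        SnapshotKind::DockerImageTar => RestorePlan::LoadImage,
        SnapshotKind::DockerCheckpointTar => RestorePlan::StartFromCheckpoint,
    }
}
```

Because `restore_plan` takes only the kind, a new backend format needs a new variant and a new match arm, and the compiler flags every place that must handle it.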

Adds docs/requirements.md as the central place for runtime requirements per feature/backend (sandbox + secret backend matrices, CRIU setup for full-state snapshots, CI matrix expectations).

Updates docs/sandbox-snapshots.md to cover the SnapshotMode addition: filesystem vs full-state semantics, the mode/kind separation, the new docker checkpoint pipeline, and the /checkpoint slash command.

The existing snapshot doc's "fundamental reason" line for processes-not-captured is no longer accurate: full-state IS captured under SnapshotMode::FullState. Reworked that section.

The earlier full-state restore path passed --checkpoint-dir to both `docker checkpoint create` and `docker start --checkpoint`. The latter actually rejects custom checkpoint dirs at runtime ("custom checkpointdir is not supported"), so the restore step always failed with that error before any of our error handling kicked in.

Switch both ends to docker's default location (/var/lib/docker/containers/<id>/checkpoints/<name>/). The trade-off is that those paths are root-owned, so the tar/untar steps need sudo:

- snapshot: `docker checkpoint create <c> <name>` then `sudo tar -cf - -C /var/lib/docker/containers/<id>/checkpoints/<name>` to read the dump bytes, then `docker checkpoint rm` to drop the local copy (exoharness storage is the canonical home).
- restore: `docker create` a fresh container, `sudo mkdir -p` its checkpoint dir, `sudo tar -xf -` the payload bytes into it, then `docker start --checkpoint <name>` (no --checkpoint-dir).

Adds the passwordless-sudo requirement to docs/requirements.md with an example sudoers fragment for machines that don't have it by default.

Note: this commit unblocks the code path but does NOT make end-to-end /checkpoint work on docker 29.x. After fixing the dir issue we then hit two further docker bugs (`/proc/0/ns/net` netns-restore bug and a containerd "already exists" regression in the 29.x line) that are known upstream issues and beyond our control. The capture half of the pipeline works fully; the restore half is blocked by docker until a 28.x-style flow returns or we swap to a different runtime. See the next commit's design doc update for the full investigation.
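The restore half of this commit, after the fresh container has been created, can be sketched as argv lists. Function names are hypothetical, and the sketch assumes `container_id` is the full container id, since that is what docker uses for the directory under `/var/lib/docker/containers/`. The `docker create` step is omitted because its arguments depend on the original sandbox request.

```rust
/// Converts string slices into an owned argv (hypothetical helper).
fn argv(parts: &[&str]) -> Vec<String> {
    parts.iter().map(|p| p.to_string()).collect()
}

/// Docker's default checkpoint location for a container and checkpoint name.
pub fn checkpoint_dir(container_id: &str, name: &str) -> String {
    format!(
        "/var/lib/docker/containers/{}/checkpoints/{}",
        container_id, name
    )
}

/// Commands run after `docker create`: make the root-owned checkpoint dir,
/// unpack the payload (fed on stdin) into it, then start from the checkpoint.
/// Note there is no --checkpoint-dir on the final command.
pub fn restore_commands(container_id: &str, name: &str) -> Vec<Vec<String>> {
    let dir = checkpoint_dir(container_id, name);
    vec![
        argv(&["sudo", "mkdir", "-p", dir.as_str()]),
        argv(&["sudo", "tar", "-xf", "-", "-C", dir.as_str()]),
        argv(&["docker", "start", "--checkpoint", name, container_id]),
    ]
}
```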

Fills in the snapshot/rewind plumbing that was previously metadata-only. **Captures and restores a sandboxed container's filesystem state.**

A separate PR (#12) stacks on top of this one for an experimental CRIU full-state path; that one is intentionally a draft because it's blocked by upstream docker bugs.

## What this does

Inside a chat REPL (`exo chat repl <agent> <conv>`), three new slash commands:

| Command | What it does |
|---|---|
| `/snapshot` | Capture the sandbox's **filesystem state** as it is right now: every file created, modified, or deleted since the base image. Prints a snapshot UUID. On disk this is a `docker save` tarball under `conversations/<conv>/snapshots/<id>/payload.bin` plus a `manifest.json` sidecar. |
| `/snapshots` | List every snapshot taken in this conversation. |
| `/rewind <uuid>` | Stop the current container, bring up a fresh one whose filesystem matches the snapshot. Subsequent tool calls see the rolled-back state. |

## What this does NOT do (be clear about the limit)

The snapshot is **filesystem only**. Specifically:

- ✅ **Files persisted to disk inside the container** are captured and restored: files you created, packages you installed, configs you wrote.
- ❌ **Running processes are not preserved.** If you `/snapshot` while `nohup sleep 9999 &` is running and then `/rewind`, the file `/proc/<pid>` is not coming back; the process is gone. The new container boots fresh.
- ❌ **In-memory state is gone.** An interactive REPL's variables, an open TCP connection, a buffered write that hadn't been flushed: none of these survive a rewind.
- ❌ **Conversation history is not rewound.** Your chat messages and event log stay where they were. Use `conversation fork` if you want to rewind the conversation itself; snapshots only operate on the sandbox filesystem underneath.

For agent workflows where "state worth preserving" = "files written to disk + tools/packages installed", filesystem snapshots cover the case. For pause-and-resume of long-running in-memory processes, you'd need full-state (CRIU), which is #12's beat and currently impractical to ship due to upstream docker bugs.

## Demo

gif:

<img width="800" height="515" alt="snapshot-rewind" src="https://github.com/user-attachments/assets/55223d38-aae8-42ce-9884-1106b04bcc55" />

```bash
# Bootstrap a state dir + agent + conversation, drop into the REPL.
# (--networking enabled because the demo needs the LLM to reach OpenAI)
STATE=/tmp/exo-snapshot-demo && rm -rf $STATE && mkdir -p $STATE && \
EXO="./target/debug/exo --root $STATE --secret-backend file --sandbox-backend docker --master-key-path $STATE/master.key" && \
$EXO secret set OPENAI_KEY --env OPENAI_API_KEY && \
$EXO model register gpt-4o-mini --secret OPENAI_KEY && \
$EXO agent create demo --model gpt-4o-mini --networking enabled && \
$EXO conversation create demo first --repl
```

Then inside `first>`:

```
create /tmp/demo.txt with the content "version 1" and cat it back
```

*Sandbox now has `/tmp/demo.txt` with "version 1".*

```
/snapshot
```

> `snapshot 019e5782-7c6b-72a2-b4fa-a81bf56eb37e`
>
> Behind the scenes: `docker commit -p` → `docker save` → tarball (~47MB for the ubuntu:24.04 base) → written to `conversations/<conv>/snapshots/<id>/payload.bin` + `manifest.json`. The local docker image is dropped (`docker image rm`); exoharness storage is the canonical home.

```
overwrite /tmp/demo.txt with "version 2" and cat it back
```

*Sandbox file now reads "version 2".*

```
/snapshots
```

> ```
> SNAPSHOT                              SANDBOX
> 019e5782-7c6b-72a2-b4fa-a81bf56eb37e  sandbox-019e5782-2a46-7970-a5bf-62900a2233e8
> ```

```
/rewind 019e5782-7c6b-72a2-b4fa-a81bf56eb37e
```

> `rewound to snapshot 019e5782-7c6b-72a2-b4fa-a81bf56eb37e`
>
> Behind the scenes: `docker load < payload.bin` → fresh container booted from the restored image → swapped into the warm pool, keyed identically. Mounts / network policy / lifecycle preserved from the original sandbox request.

```
cat /tmp/demo.txt
```

> `version 1` ← the rewind worked

You can take many snapshots in a conversation and rewind to any of them.

## Six commits

| # | Commit | What |
|---|---|---|
| 1 | `exoharness: sandbox snapshot/restore trait surface and Docker implementation` | Adds `SnapshotPayload { kind, bytes }` + `SnapshotKind::DockerImageTar`. Extends `ManagedSandboxHandle::snapshot()` and `ManagedSandboxBackend::acquire_from_snapshot(req, payload)`. Stubs with explicit "not supported" errors for OneShot / LocalProcess / AppleContainer with clear "where the real impl goes" comments. |
| 2 | `exoharness: persist sandbox snapshots and restore via start_sandbox` | Wires `snapshot_sandbox` to actually capture and persist. Wires `start_sandbox` to load the payload by manifest kind and call `acquire_from_snapshot`. |
| 3 | `cli: /snapshot, /snapshots, /rewind slash commands in chat repl` | The user-facing surface. Lives in the REPL because the sandbox handle is per-process; a top-level `exo conversation snapshot` subcommand needs cross-invocation container adoption (separate follow-up). |
| 4 | `docs: sandbox snapshot/rewind design` | `docs/sandbox-snapshots.md` covering data flow, on-disk layout, backend extension story, known limits. |
| 5 | `exoharness: default sandbox image to ubuntu:24.04` | debian:bookworm doesn't ship `procps`; even basic "list running processes" tool calls hit `command not found`. Ubuntu 24.04 has procps + coreutils in the base. |
| 6 | `ci: end-to-end snapshot + rewind round-trip test (docker, linux)` | New `crates/cli/tests/snapshot_round_trip.rs` that drives the harness library directly against real Docker. Mirrors the demo's lifecycle and asserts on two independent rewind signals (file content rolls back AND a post-snapshot file disappears). Wired into the integration matrix workflow. |

## On-disk shape

```
agents/<agent_id>/conversations/<conv_id>/snapshots/<snapshot_id>/
├── manifest.json   { snapshot_id, sandbox_id, kind, created_at, payload_size_bytes }
└── payload.bin     docker save tarball for SnapshotKind::DockerImageTar
```

The snapshot's existence is also recorded in the conversation event log as `SandboxSnapshotted { sandbox_id, snapshot_id }`, which is what `/snapshots` walks to render the listing.

## Adding more backends

Anyone adding snapshot support for a new sandbox backend follows this recipe:

1. Add a new `SnapshotKind` variant naming the on-disk format (e.g. `AppleContainerImageTar`).
2. Implement `ManagedSandboxHandle::snapshot` to produce that kind. The Docker version is the template: three CLI calls and a `Bytes` capture.
3. Implement `ManagedSandboxBackend::acquire_from_snapshot` to consume the same kind, with an explicit kind-mismatch error.
4. Backends that genuinely can't snapshot (local-process today) keep returning an explicit error.

No other layer changes. The conversation orchestration, on-disk layout, and CLI surface are all backend-agnostic.

## Test plan

- [x] `cargo test --workspace` (51 unit tests pass)
- [x] **End-to-end CI test added: `crates/cli/tests/snapshot_round_trip.rs`**, which drives the harness library directly against real Docker on Linux and asserts the file content rolls back AND a post-snapshot file disappears after `/rewind`. Passes locally in ~1.8s; self-skips on non-docker matrix cells. Wired into the existing `integration.yml` workflow alongside the existing `integration_chat` test.
- [x] Manual REPL verification (the demo above): `/snapshot` → file content modified → `/rewind` → file content rolled back. Verified visually before the automated test was written.
- [x] On-disk layout (manifest.json + payload.bin) verified by the round-trip test using `read_dir()`.
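Step 3 of the backend recipe, the explicit kind-mismatch error, might look roughly like the sketch below. It is an illustration under assumptions: `check_kind` and the error text are hypothetical, `Vec<u8>` stands in for the `Bytes` type the PR mentions, and `AppleContainerImageTar` is the example variant named in the recipe rather than something that exists today.

```rust
/// On-disk format tag. AppleContainerImageTar is the recipe's example of a
/// future variant, included here only to show a mismatch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SnapshotKind {
    DockerImageTar,
    AppleContainerImageTar,
}

/// Opaque snapshot artifact; Vec<u8> is a std-only stand-in for Bytes.
pub struct SnapshotPayload {
    pub kind: SnapshotKind,
    pub bytes: Vec<u8>,
}

/// Guard a backend could run at the top of acquire_from_snapshot: refuse any
/// payload whose kind this backend does not know how to consume.
pub fn check_kind(expected: SnapshotKind, payload: &SnapshotPayload) -> Result<(), String> {
    if payload.kind == expected {
        Ok(())
    } else {
        Err(format!(
            "snapshot kind mismatch: backend consumes {:?}, payload is {:?}",
            expected, payload.kind
        ))
    }
}
```

Failing early on the tag avoids handing a tarball in one format to a CLI that expects another and then having to interpret whatever error that CLI produces.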

Base automatically changed from selectable-backends to main May 24, 2026 17:58

akrentsel force-pushed the sandbox-snapshots branch from 7b146ee to e14464b Compare May 24, 2026 18:01

This was referenced May 24, 2026

Remote container sandbox backends (Daytona, and an abstraction for additional providers) #18

Open

Cross-provider sandbox snapshot migration (local ↔ Daytona) #19

Open

akrentsel added 8 commits May 25, 2026 00:17

akrentsel changed the base branch from main to sandbox-snapshots-filesystem May 25, 2026 00:19

akrentsel force-pushed the sandbox-snapshots branch from e14464b to 5cccd55 Compare May 25, 2026 00:19

akrentsel mentioned this pull request May 25, 2026

Sandbox filesystem snapshots and rewind (Docker) #20

Merged

4 tasks

akrentsel changed the title ~~Sandbox snapshots (filesystem + experimental CRIU full-state)~~ Sandbox full-state (CRIU) snapshots — blocked by docker bugs May 25, 2026

akrentsel force-pushed the sandbox-snapshots-filesystem branch from 483a3bc to 36dfb8a Compare June 1, 2026 02:09

Base automatically changed from sandbox-snapshots-filesystem to main June 1, 2026 02:18

akrentsel wants to merge 8 commits into main from sandbox-snapshots
