
Sandbox full-state (CRIU) snapshots — blocked by docker bugs#12

Draft
akrentsel wants to merge 8 commits into main from sandbox-snapshots

Conversation

@akrentsel
Collaborator

@akrentsel akrentsel commented May 24, 2026

⚠️ Draft. Not recommended for merge as-is. Capture works; restore is blocked by known docker bugs across 28.x and 29.x. Keeping the branch so the code and design are reviewable, but this probably either lands behind a feature flag with restore documented as broken, or doesn't land at all until upstream docker fixes its checkpoint feature.

Stacked on #20 (sandbox-snapshots-filesystem).

Summary

Adds a second snapshot mode on top of the filesystem-only path in #20:

  • SnapshotMode::FullState — captures filesystem + running processes + memory + open FDs via docker checkpoint create (CRIU under the hood).
  • SnapshotKind::DockerCheckpointTar — the tagged on-disk format alongside DockerImageTar.
  • /checkpoint slash command in the REPL alongside /snapshot.
  • /rewind dispatches on the persisted manifest kind, so it transparently handles either mode.

Three commits:

| # | Commit | What |
|---|---|---|
| 1 | exoharness: full-state (CRIU) snapshot mode for docker sandboxes | Trait extensions (`snapshot(mode)`), Docker capture+restore pipelines, `/checkpoint` REPL command, requirements preflight (`docker info` ExperimentalBuild) with an actionable error. |
| 2 | docs: requirements doc + cover full-state snapshot mode | New docs/requirements.md (general home for runtime requirements per backend) + design doc updates covering both modes and the mode/kind separation. |
| 3 | exoharness: rewrite CRIU restore to use docker's default checkpoint dir | Initial implementation tried `--checkpoint-dir`; docker rejects custom dirs on `start --checkpoint` ("custom checkpointdir is not supported"). Switches to docker's default `/var/lib/docker/containers/<id>/checkpoints/`, which requires `sudo tar`/`sudo mkdir` since the dir is root-owned. |

What's verified

  • Capture works end-to-end on docker 29.x: docker checkpoint create succeeds, the CRIU dump bytes are tarred and persisted to exoharness storage, the manifest is written correctly.
  • Preflight: when docker info --format '{{.ExperimentalBuild}}' is false, /checkpoint returns the actionable error pointing at docs/requirements.md instead of letting docker fail cryptically.
  • 51 unit tests pass.
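
The preflight can be pictured as a small pure function over the probe's stdout. The sketch below is an illustration under stated assumptions, not the code in this PR: `check_experimental`, `preflight`, and the wording of `EXPERIMENTAL_HINT` are hypothetical names; only the `docker info --format '{{.ExperimentalBuild}}'` probe and the pointer to docs/requirements.md come from the description above.

```rust
use std::process::Command;

/// Hypothetical wording; the actual message in the PR is not shown.
const EXPERIMENTAL_HINT: &str =
    "full-state snapshots need docker experimental features; see docs/requirements.md";

/// Interpret the stdout of `docker info --format '{{.ExperimentalBuild}}'`.
fn check_experimental(stdout: &str) -> Result<(), String> {
    match stdout.trim() {
        "true" => Ok(()),
        "false" => Err(EXPERIMENTAL_HINT.to_string()),
        other => Err(format!("unexpected docker info output: {other:?}")),
    }
}

/// Run the probe against the local docker CLI and interpret the result.
fn preflight() -> Result<(), String> {
    let out = Command::new("docker")
        .args(["info", "--format", "{{.ExperimentalBuild}}"])
        .output()
        .map_err(|e| format!("failed to run docker: {e}"))?;
    if !out.status.success() {
        return Err(format!(
            "docker info failed: {}",
            String::from_utf8_lossy(&out.stderr)
        ));
    }
    check_experimental(&String::from_utf8_lossy(&out.stdout))
}
```

Keeping the parsing separate from the process call means the error path can be unit-tested on machines without docker.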

What's NOT working — known docker bugs

The restore path correctly invokes docker (verified via raw-docker reproduction without any exo code involved). Docker itself then fails:

| Path | Error | Tracking |
|---|---|---|
| Restore into fresh container, default networking | `Error response from daemon: bind-mount /proc/0/ns/net -> /var/run/docker/netns/...: no such file or directory` | Acknowledged docker bug: the daemon writes invalid netns paths into the checkpoint. Docker forum maintainer recommends `--network=host` as workaround. |
| Restore in-place, any networking | `failed to upload checkpoint to containerd: commit failed: content sha256:... already exists` | moby/moby#42900: claimed fixed in 2021 (PR 47456) but regressed somewhere in docker 29.x |

Both reproduce on docker 29.4.2 (dev box); the second also reproduces on docker 29.1.3 (test VM with full kernel modules). The forum thread reports docker 28.3.3 + CRIU + --network=host does work; we couldn't validate that combination because neither of the boxes we have access to runs that exact combo (one is on 29.x, the other has a stripped kernel without the modules CRIU needs).

Why keep the code around (maybe)

  • Capture is sound and the trait surface is right; when docker fixes its restore bugs or we add a different runtime (podman has substantially better CRIU integration, runc could be driven directly), the restore arm is a small addition.
  • The mode/kind separation lets us cleanly add more kinds later (Apple VZ pause+save when container CLI exposes it, etc.).
  • Documents the failure modes so future-us doesn't redo this investigation.

Why drop the code (maybe)

  • It's a meaningful diff for a feature that doesn't actually work.
  • "It works on docker 28.3.3 with the right network setup" is a fragile claim we can't verify on our infrastructure.
  • Filesystem snapshots cover ~99% of practical agent-rewind use cases (state lives on disk, processes are short-lived shell calls).

Test plan

  • cargo test --workspace (51 pass)
  • /checkpoint capture succeeds; payload bytes written; manifest correct
  • Preflight error path verified when experimental is off
  • Raw-docker reproduction confirms restore failure is upstream, not in exo code
  • End-to-end /checkpoint + /rewind on docker 28.3.3 with --network=host (not verified: no matching host available)

@akrentsel
Collaborator Author

Woo, looks like this is working!!

[Image: successful snapshotting]

akrentsel added 8 commits May 25, 2026 00:17
…ntation

Adds the backend-level plumbing for capturing a running sandbox's state
as an opaque blob and reconstituting a sandbox from that blob.

Two new types in the public sandbox API:

  SnapshotPayload { kind, bytes }   - opaque snapshot artifact
  SnapshotKind                       - tag identifying the on-disk format
                                       (DockerImageTar today; new variants
                                       for other backends as they grow)

Trait extensions:

  ManagedSandboxHandle::snapshot()
    Capture this sandbox's state as a SnapshotPayload. Backends that
    can't snapshot return an explicit error.

  ManagedSandboxBackend::acquire_from_snapshot(request, payload)
    Acquire a sandbox whose filesystem is sourced from the supplied
    payload instead of request.spec.image. Mounts, network, lifecycle
    are honoured from the request.

Docker implementation:

  snapshot         docker commit -p <container> exo-snap-<uuid>
                   docker save exo-snap-<uuid>          (to bytes)
                   docker image rm exo-snap-<uuid>      (canonical store
                                                         lives in exoharness)

  restore          docker load < payload.bytes          (parse loaded ref)
                   evict any pre-existing warm container for this key
                   create a fresh warm container off the loaded image
                   (mounts/network/etc. preserved from the request)
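
As a rough illustration of the pipeline above, the sketch below builds the argument vectors for the capture half and parses the image reference out of `docker load` output. The helper names are hypothetical, and the `Loaded image: ` / `Loaded image ID: ` prefixes are assumed to be what the docker CLI prints; verify them against the docker version in use.

```rust
/// Turn string slices into an owned argument vector.
fn argv(parts: &[&str]) -> Vec<String> {
    parts.iter().map(|s| s.to_string()).collect()
}

/// Commands for the capture half of the filesystem pipeline, in order.
/// `tag` is a temporary image name such as "exo-snap-<uuid>".
fn capture_argv(container: &str, tag: &str) -> Vec<Vec<String>> {
    vec![
        // Pause the container while committing so the image is consistent.
        argv(&["docker", "commit", "-p", container, tag]),
        // `docker save` writes the image tarball to stdout.
        argv(&["docker", "save", tag]),
        // Drop the local image; the stored payload is the canonical copy.
        argv(&["docker", "image", "rm", tag]),
    ]
}

/// Extract the loaded reference from `docker load` stdout.
/// Assumed line formats: "Loaded image: <ref>" or "Loaded image ID: <id>".
fn parse_loaded_ref(stdout: &str) -> Option<&str> {
    stdout.lines().find_map(|line| {
        line.strip_prefix("Loaded image: ")
            .or_else(|| line.strip_prefix("Loaded image ID: "))
            .map(str::trim)
    })
}
```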

The other implementations are deliberate stubs with clear "where the
real implementation goes" comments:

  - OneShotSandboxHandle::snapshot: snapshots require a warm sandbox
    (positive idle_ttl). One-shot is point-in-time-only by design.
  - LocalProcessSandboxBackend / -Handle: no container filesystem to
    capture or restore on the host.
  - CliContainerSandboxBackend with ContainerCliFlavor::AppleContainer:
    apple's `container` CLI doesn't yet ship the commit/save flow we
    need. When it lands, mirror docker_snapshot_container with a new
    AppleContainerImageTar SnapshotKind variant.
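
A simplified picture of the trait surface and the stub behaviour described above. This is a sketch under assumptions: it is synchronous, uses `Vec<u8>` and `String` errors, and `LocalProcessHandle` / `FakeDockerHandle` are hypothetical stand-ins; the real signatures in the crate may differ.

```rust
/// Tag identifying the on-disk format of a snapshot.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SnapshotKind {
    DockerImageTar,
}

/// Opaque snapshot artifact: a format tag plus the raw bytes.
#[derive(Debug, Clone, PartialEq, Eq)]
struct SnapshotPayload {
    kind: SnapshotKind,
    bytes: Vec<u8>,
}

/// Simplified stand-in for the handle trait named in the commit message.
trait ManagedSandboxHandle {
    fn snapshot(&self) -> Result<SnapshotPayload, String>;
}

/// Backend that cannot snapshot: returns an explicit error, never a fake payload.
struct LocalProcessHandle;

impl ManagedSandboxHandle for LocalProcessHandle {
    fn snapshot(&self) -> Result<SnapshotPayload, String> {
        Err("local-process sandboxes have no container filesystem to snapshot".to_string())
    }
}

/// Test double standing in for the Docker handle; holds pre-made tar bytes
/// instead of shelling out to docker.
struct FakeDockerHandle {
    image_tar: Vec<u8>,
}

impl ManagedSandboxHandle for FakeDockerHandle {
    fn snapshot(&self) -> Result<SnapshotPayload, String> {
        Ok(SnapshotPayload {
            kind: SnapshotKind::DockerImageTar,
            bytes: self.image_tar.clone(),
        })
    }
}
```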

Wires snapshot_sandbox and start_sandbox to the trait methods added in
the previous commit. Previously these two API methods only updated
metadata; now they actually capture and restore container state.

snapshot_sandbox
  - looks up the live ManagedSandboxHandle for the supplied id
  - calls handle.snapshot() to capture a SnapshotPayload (slow:
    docker commit + docker save, kept outside the write lock)
  - persists the payload + a StoredSnapshotManifest sidecar under
      conversations/<conv_id>/snapshots/<snapshot_id>/
        manifest.json   - kind, sandbox_id, created_at, payload_size
        payload.bin     - raw blob (docker save tarball for now)
  - then continues with the existing sandbox-metadata + event updates

start_sandbox
  - loads the snapshot manifest + payload from storage (before the
    write lock, in case the payload is large)
  - calls sandbox_backend.acquire_from_snapshot(request, payload)
    instead of acquire(request) — so the new container's filesystem
    comes from the snapshot rather than request.spec.image

Together these complete the round-trip: take a snapshot at state S,
make changes -> S', call start_sandbox with the snapshot_id, and the
container's filesystem is back at S.

Storage layout follows the existing artifact pattern (sidecar JSON +
.bin blob in a per-id directory), so a future migration to streamed
or chunked storage would touch a small surface.

The sandbox must be running (i.e. in this process's running_sandboxes
map) to be snapshotted — snapshots are of live state. Cross-process
container re-discovery (so a sandbox started by an earlier `exo`
invocation can be snapshotted from a later one) is a worthwhile
follow-up but out of scope here.
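
The storage layout above can be sketched with the standard library alone. Everything here is illustrative: `snapshot_dir` and `persist_snapshot` are hypothetical helpers, and the manifest is hand-formatted JSON with no escaping, where the real code presumably uses a serializer.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Sidecar fields named in the commit message.
struct StoredSnapshotManifest<'a> {
    kind: &'a str,
    sandbox_id: &'a str,
    created_at: &'a str,
    payload_size: usize,
}

impl StoredSnapshotManifest<'_> {
    /// Hand-rolled JSON for illustration only; values are not escaped.
    fn to_json(&self) -> String {
        format!(
            "{{\"kind\":\"{}\",\"sandbox_id\":\"{}\",\"created_at\":\"{}\",\"payload_size\":{}}}",
            self.kind, self.sandbox_id, self.created_at, self.payload_size
        )
    }
}

/// conversations/<conv_id>/snapshots/<snapshot_id>/ under the storage root.
fn snapshot_dir(root: &Path, conv_id: &str, snapshot_id: &str) -> PathBuf {
    root.join("conversations")
        .join(conv_id)
        .join("snapshots")
        .join(snapshot_id)
}

/// Write payload.bin and manifest.json into the snapshot directory.
fn persist_snapshot(
    dir: &Path,
    manifest: &StoredSnapshotManifest,
    payload: &[u8],
) -> io::Result<()> {
    fs::create_dir_all(dir)?;
    fs::write(dir.join("payload.bin"), payload)?;
    fs::write(dir.join("manifest.json"), manifest.to_json())?;
    Ok(())
}
```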

Lets a user exercise the snapshot/rewind round-trip without leaving
the conversation:

  /snapshot          capture the conversation's currently-running
                     sandbox; prints the new snapshot id
  /snapshots         list snapshots taken in this conversation
                     (walks SandboxSnapshotted events)
  /rewind <id>       stop the current sandbox, start a fresh one from
                     the named snapshot — subsequent shell tool calls
                     hit the restored filesystem
  /help              show the command list

Lives in the chat repl rather than as a top-level CLI subcommand
because the sandbox running_sandboxes map is per-process; the
container created by an earlier `exo` invocation isn't reachable from
a later one. Inside the repl the same process holds the sandbox for
the duration, so capture + rewind both have a live handle to operate
on.

(When cross-invocation container adoption lands, `exo conversation
snapshot/rewind` subcommands become trivial to add — they just call
the same ConversationHandle methods this repl path uses.)

Verified live against `--sandbox-backend docker`: create file with
contents "v1", /snapshot, overwrite to "v2", /rewind <id>, read file
back -> "v1". On-disk payload is the docker save tarball
(~47 MB for the debian:bookworm base image) plus a manifest.json
recording kind/sandbox_id/created_at/size.
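
One way the slash-command surface could be parsed is sketched below. The enum and function names are hypothetical, and the real REPL may handle malformed input differently; the sketch only covers the four commands listed above.

```rust
/// The REPL commands listed in the commit message.
#[derive(Debug, PartialEq, Eq)]
enum SlashCommand {
    Snapshot,
    Snapshots,
    Rewind(String),
    Help,
}

/// Returns Ok(None) for ordinary chat input, Ok(Some(_)) for a recognised
/// command, and Err(_) for an unknown or malformed slash command.
fn parse_slash_command(line: &str) -> Result<Option<SlashCommand>, String> {
    let line = line.trim();
    if !line.starts_with('/') {
        return Ok(None);
    }
    let mut parts = line.split_whitespace();
    let cmd = parts.next().unwrap_or("");
    let arg = parts.next();
    match (cmd, arg) {
        ("/snapshot", None) => Ok(Some(SlashCommand::Snapshot)),
        ("/snapshots", None) => Ok(Some(SlashCommand::Snapshots)),
        ("/rewind", Some(id)) => Ok(Some(SlashCommand::Rewind(id.to_string()))),
        ("/rewind", None) => Err("usage: /rewind <id>".to_string()),
        ("/help", None) => Ok(Some(SlashCommand::Help)),
        _ => Err(format!("unknown or malformed command: {line}")),
    }
}
```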

Pulls the design rationale out of the commit messages and into a doc
alongside the other architecture notes. Covers:

  - the producer/consumer model (SnapshotKind is the contract between
    backend snapshot/restore)
  - data flow across ConversationHandle, ManagedSandboxHandle, and
    ManagedSandboxBackend
  - the Docker pipeline (commit -p -> save -> rmi; load -> swap image
    -> evict-and-recreate-warm-container)
  - on-disk layout (manifest.json + payload.bin sidecar pair, mirroring
    the existing artifact pattern)
  - REPL slash-command surface
  - extension recipe for adding a new backend (new SnapshotKind variant
    + matching arm in acquire_from_snapshot)
  - known limits: cross-invocation container adoption, in-memory
    payload size, no GC, no running-process checkpointing

debian:bookworm doesn't ship procps (no ps/pgrep), which makes
sandbox introspection during chat sessions painful — even basic
"list running processes" tool calls hit `command not found`. Swap
the default to ubuntu:24.04, which has procps + coreutils in the
base image. ~50MB smaller bottle, same overlay2 storage, same
network behaviour. Agents that want a different image continue to
override via `agent create --sandbox-image`.

Adds SnapshotMode { Filesystem, FullState } as a caller-facing knob on
the snapshot path. Filesystem is the existing docker commit/save flow
(SnapshotKind::DockerImageTar). FullState is new: a CRIU-backed
checkpoint of the live process tree (memory pages, open FDs, sockets,
filesystem diff), tagged SnapshotKind::DockerCheckpointTar.

Pipeline for the new mode:

  capture   docker checkpoint create --checkpoint-dir=<tmp> <c> exo-snap
            tar -cf - -C <tmp>/exo-snap .

  restore   tar -xf - -C <tmp>/exo-snap
            docker create <fresh container, same image/mounts/network>
            docker start --checkpoint exo-snap --checkpoint-dir=<tmp> <new>

Mode and kind are intentionally separate:
  - mode  = what the caller wants (filesystem vs full state)
  - kind  = the on-disk format the backend chose to produce
The restore path dispatches on kind alone, so a future backend that
honours the FullState mode but produces a different format (e.g.
apple-container's VZ.framework save state) just adds a new kind variant
and matching arm in acquire_from_snapshot.
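
The mode/kind split can be sketched as two small functions: the backend maps the caller's mode to a format on capture, and restore looks only at the persisted kind. The function names and the returned pipeline labels are hypothetical; the enum variants are the ones named in this commit.

```rust
/// What the caller asks for.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SnapshotMode {
    Filesystem,
    FullState,
}

/// What the backend actually wrote to disk.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SnapshotKind {
    DockerImageTar,
    DockerCheckpointTar,
}

/// Capture side: the Docker backend picks a format for the requested mode.
fn docker_kind_for(mode: SnapshotMode) -> SnapshotKind {
    match mode {
        SnapshotMode::Filesystem => SnapshotKind::DockerImageTar,
        SnapshotMode::FullState => SnapshotKind::DockerCheckpointTar,
    }
}

/// Restore side: dispatch on the persisted kind alone; the original mode is
/// never consulted. The returned label just names the pipeline to run.
fn restore_pipeline(kind: SnapshotKind) -> &'static str {
    match kind {
        SnapshotKind::DockerImageTar => "docker load",
        SnapshotKind::DockerCheckpointTar => "docker start --checkpoint",
    }
}
```

A backend that honours `FullState` with a different format would add a variant to `SnapshotKind` and one arm to the restore match, leaving `SnapshotMode` untouched.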

Surface preflight + error handling:
  - `docker info --format {{.ExperimentalBuild}}` is probed before any
    checkpoint operation; if false, surfaces an actionable message
    pointing at docs/requirements.md rather than letting docker's raw
    "Unknown command: checkpoint" bubble up.
  - same on restore — failure to start --checkpoint cleans up the
    half-created container and reports cause.

CLI: new `/checkpoint` slash command in the chat REPL alongside the
existing `/snapshot`. /rewind handles either kind transparently
(dispatch happens at the backend layer based on manifest.json).

Trait API change: ConversationHandle::snapshot_sandbox grows a `mode`
parameter; ManagedSandboxHandle::snapshot likewise. Single internal
caller updated. Mock impls in tests updated.

Verified locally that filesystem-mode snapshots still work end-to-end
(the regression path) and that the experimental-flag preflight
correctly returns the actionable error on this runner where docker
experimental is disabled. End-to-end CRIU verification requires a host
with CRIU installed and docker experimental enabled — see
docs/requirements.md.

Adds docs/requirements.md as the central place for runtime requirements
per feature/backend (sandbox + secret backend matrices, CRIU setup for
full-state snapshots, CI matrix expectations).

Updates docs/sandbox-snapshots.md to cover the SnapshotMode addition:
filesystem vs full-state semantics, the mode/kind separation, the new
docker checkpoint pipeline, and the /checkpoint slash command.

The existing snapshot doc's "fundamental reason" line for processes-not-
captured is no longer accurate — full-state IS captured under
SnapshotMode::FullState. Reworked that section.

The earlier full-state restore path passed --checkpoint-dir to both
`docker checkpoint create` and `docker start --checkpoint`. The
latter actually rejects custom checkpoint dirs at runtime ("custom
checkpointdir is not supported"), so the restore step always failed
with that error before any of our error handling kicked in.

Switch both ends to docker's default location
(/var/lib/docker/containers/<id>/checkpoints/<name>/). The
trade-off is that those paths are root-owned, so the tar/untar
steps need sudo:

- snapshot: `docker checkpoint create <c> <name>`, then
  `sudo tar -cf - -C /var/lib/docker/containers/<id>/checkpoints/<name> .`
  to read the dump bytes, then `docker checkpoint rm` to drop the local
  copy (exoharness storage is the canonical home).
- restore: `docker create` a fresh container, `sudo mkdir -p` its
  checkpoint dir, `sudo tar -xf -` the payload bytes into it, then
  `docker start --checkpoint <name>` (no --checkpoint-dir).

Adds the passwordless-sudo requirement to docs/requirements.md with
an example sudoers fragment for machines that don't have it by
default.
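
The restore half after this change can be sketched as a path helper plus the ordered commands. Helper names are hypothetical, the `docker create` step is omitted because its arguments depend on the sandbox request, and `container_id` is assumed to be the full container ID, since that is what docker uses as the directory name.

```rust
/// Turn string slices into an owned argument vector.
fn argv(parts: &[&str]) -> Vec<String> {
    parts.iter().map(|s| s.to_string()).collect()
}

/// Docker's default (root-owned) checkpoint location for a container.
fn checkpoint_dir(container_id: &str, name: &str) -> String {
    format!("/var/lib/docker/containers/{container_id}/checkpoints/{name}")
}

/// Commands run after `docker create` has produced `container_id`.
/// The tar step reads the payload bytes from stdin.
fn restore_steps(container_id: &str, name: &str) -> Vec<Vec<String>> {
    let dir = checkpoint_dir(container_id, name);
    vec![
        argv(&["sudo", "mkdir", "-p", &dir]),
        argv(&["sudo", "tar", "-xf", "-", "-C", &dir]),
        // No --checkpoint-dir: docker rejects custom dirs on start.
        argv(&["docker", "start", "--checkpoint", name, container_id]),
    ]
}
```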

Note: this commit unblocks the code path but does NOT make
end-to-end /checkpoint work on docker 29.x. After fixing the dir
issue we then hit two further docker bugs (`/proc/0/ns/net`
netns-restore bug and a containerd "already exists" regression in
the 29.x line) that are known upstream issues and beyond our
control. The capture half of the pipeline works fully; the restore
half is blocked by docker until a 28.x-style flow returns or we
swap to a different runtime. See the next commit's design doc
update for the full investigation.

@akrentsel akrentsel changed the base branch from main to sandbox-snapshots-filesystem May 25, 2026 00:19
@akrentsel akrentsel force-pushed the sandbox-snapshots branch from e14464b to 5cccd55 Compare May 25, 2026 00:19
@akrentsel akrentsel changed the title Sandbox snapshots (filesystem + experimental CRIU full-state) Sandbox full-state (CRIU) snapshots — blocked by docker bugs May 25, 2026
@akrentsel akrentsel force-pushed the sandbox-snapshots-filesystem branch from 483a3bc to 36dfb8a Compare June 1, 2026 02:09
akrentsel added a commit that referenced this pull request Jun 1, 2026
Fills in the snapshot/rewind plumbing that was previously metadata-only.
**Captures and restores a sandboxed container's filesystem state.** A
separate PR (#12) stacks on top of this one for an experimental CRIU
full-state path — that one is intentionally a draft because it's blocked
by upstream docker bugs.

## What this does

Inside a chat REPL (`exo chat repl <agent> <conv>`), three new slash
commands:

| Command | What it does |
|---|---|
| `/snapshot` | Capture the sandbox's **filesystem state** as it is right now: every file created, modified, or deleted since the base image. Prints a snapshot UUID. On disk this is a `docker save` tarball under `conversations/<conv>/snapshots/<id>/payload.bin` plus a `manifest.json` sidecar. |
| `/snapshots` | List every snapshot taken in this conversation. |
| `/rewind <uuid>` | Stop the current container, bring up a fresh one whose filesystem matches the snapshot. Subsequent tool calls see the rolled-back state. |

## What this does NOT do — be clear about the limit

The snapshot is **filesystem only**. Specifically:

- ✅ **Files persisted to disk inside the container** are captured and
restored — files you created, packages you installed, configs you wrote.
- ❌ **Running processes are not preserved.** If you `/snapshot` while
`nohup sleep 9999 &` is running and then `/rewind`, the file
`/proc/<pid>` is not coming back; the process is gone. The new container
boots fresh.
- ❌ **In-memory state is gone.** An interactive REPL's variables, an
open TCP connection, a buffered write that hadn't been flushed — none of
these survive a rewind.
- ❌ **Conversation history is not rewound.** Your chat messages and
event log stay where they were. Use `conversation fork` if you want to
rewind the conversation itself; snapshots only operate on the sandbox
filesystem underneath.

For agent workflows where "state worth preserving" = "files written to
disk + tools/packages installed", filesystem snapshots cover the case.
For pause-and-resume of long-running in-memory processes, you'd need
full-state (CRIU), which is #12's beat and currently impractical to ship
due to upstream docker bugs.

## Demo

gif:
<img width="800" height="515" alt="snapshot-rewind"
src="https://github.com/user-attachments/assets/55223d38-aae8-42ce-9884-1106b04bcc55"
/>


```bash
# Bootstrap a state dir + agent + conversation, drop into the REPL.
# (--networking enabled because the demo needs the LLM to reach OpenAI)
STATE=/tmp/exo-snapshot-demo && rm -rf $STATE && mkdir -p $STATE && \
EXO="./target/debug/exo --root $STATE --secret-backend file --sandbox-backend docker --master-key-path $STATE/master.key" && \
$EXO secret set OPENAI_KEY --env OPENAI_API_KEY && \
$EXO model register gpt-4o-mini --secret OPENAI_KEY && \
$EXO agent create demo --model gpt-4o-mini --networking enabled && \
$EXO conversation create demo first --repl
```

Then inside `first>`:

```
create /tmp/demo.txt with the content "version 1" and cat it back
```
*Sandbox now has `/tmp/demo.txt` with "version 1".*

```
/snapshot
```
> `snapshot 019e5782-7c6b-72a2-b4fa-a81bf56eb37e`
>
> Behind the scenes: `docker commit -p` → `docker save` → tarball (~47MB
for the ubuntu:24.04 base) → written to
`conversations/<conv>/snapshots/<id>/payload.bin` + `manifest.json`. The
local docker image is dropped (`docker image rm`) — exoharness storage
is the canonical home.

```
overwrite /tmp/demo.txt with "version 2" and cat it back
```
*Sandbox file now reads "version 2".*

```
/snapshots
```
> ```
> SNAPSHOT                              SANDBOX
> 019e5782-7c6b-72a2-b4fa-a81bf56eb37e  sandbox-019e5782-2a46-7970-a5bf-62900a2233e8
> ```

```
/rewind 019e5782-7c6b-72a2-b4fa-a81bf56eb37e
```
> `rewound to snapshot 019e5782-7c6b-72a2-b4fa-a81bf56eb37e`
>
> Behind the scenes: `docker load < payload.bin` → fresh container
booted from the restored image → swapped into the warm pool, keyed
identically. Mounts / network policy / lifecycle preserved from the
original sandbox request.

```
cat /tmp/demo.txt
```
> `version 1`   ← the rewind worked

You can take many snapshots in a conversation and rewind to any of them.

## Six commits

| # | Commit | What |
|---|---|---|
| 1 | `exoharness: sandbox snapshot/restore trait surface and Docker implementation` | Adds `SnapshotPayload { kind, bytes }` + `SnapshotKind::DockerImageTar`. Extends `ManagedSandboxHandle::snapshot()` and `ManagedSandboxBackend::acquire_from_snapshot(req, payload)`. Stubs with explicit "not supported" errors for OneShot / LocalProcess / AppleContainer with clear "where the real impl goes" comments. |
| 2 | `exoharness: persist sandbox snapshots and restore via start_sandbox` | Wires `snapshot_sandbox` to actually capture and persist. Wires `start_sandbox` to load the payload by manifest kind and call `acquire_from_snapshot`. |
| 3 | `cli: /snapshot, /snapshots, /rewind slash commands in chat repl` | The user-facing surface. Lives in the REPL because the sandbox handle is per-process; a top-level `exo conversation snapshot` subcommand needs cross-invocation container adoption (separate follow-up). |
| 4 | `docs: sandbox snapshot/rewind design` | `docs/sandbox-snapshots.md` covering data flow, on-disk layout, backend extension story, known limits. |
| 5 | `exoharness: default sandbox image to ubuntu:24.04` | debian:bookworm doesn't ship `procps`; even basic "list running processes" tool calls hit `command not found`. Ubuntu 24.04 has procps + coreutils in the base. |
| 6 | `ci: end-to-end snapshot + rewind round-trip test (docker, linux)` | New `crates/cli/tests/snapshot_round_trip.rs` that drives the harness library directly against real Docker. Mirrors the demo's lifecycle and asserts on two independent rewind signals (file content rolls back AND a post-snapshot file disappears). Wired into the integration matrix workflow. |

## On-disk shape

```
agents/<agent_id>/conversations/<conv_id>/snapshots/<snapshot_id>/
├── manifest.json   { snapshot_id, sandbox_id, kind, created_at, payload_size_bytes }
└── payload.bin     docker save tarball for SnapshotKind::DockerImageTar
```

The snapshot's existence is also recorded in the conversation event log
as `SandboxSnapshotted { sandbox_id, snapshot_id }`, which is what
`/snapshots` walks to render the listing.
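
A minimal sketch of that walk, assuming a simplified event type: only `SandboxSnapshotted { sandbox_id, snapshot_id }` comes from the event log described above, while `Event::Other` and `list_snapshots` are hypothetical stand-ins.

```rust
/// Simplified conversation event log entry.
#[derive(Debug, Clone, PartialEq, Eq)]
enum Event {
    SandboxSnapshotted { sandbox_id: String, snapshot_id: String },
    /// Hypothetical placeholder for every other event type.
    Other(String),
}

/// Collect (snapshot_id, sandbox_id) rows in log order, skipping
/// everything that is not a snapshot event.
fn list_snapshots(events: &[Event]) -> Vec<(&str, &str)> {
    events
        .iter()
        .filter_map(|event| match event {
            Event::SandboxSnapshotted { sandbox_id, snapshot_id } => {
                Some((snapshot_id.as_str(), sandbox_id.as_str()))
            }
            _ => None,
        })
        .collect()
}
```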

## Adding more backends

Anyone adding snapshot support for a new sandbox backend follows this
recipe:

1. Add a new `SnapshotKind` variant naming the on-disk format (e.g.
`AppleContainerImageTar`).
2. Implement `ManagedSandboxHandle::snapshot` to produce that kind. The
Docker version is the template — three CLI calls and a `Bytes` capture.
3. Implement `ManagedSandboxBackend::acquire_from_snapshot` to consume
the same kind, with an explicit kind-mismatch error.
4. Backends that genuinely can't snapshot (local-process today) keep
returning an explicit error.

No other layer changes. The conversation orchestration, on-disk layout,
and CLI surface are all backend-agnostic.
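
Step 3 of the recipe, the explicit kind-mismatch error, might look like the sketch below. It is a free function with simplified types for illustration; the real `acquire_from_snapshot` is a trait method taking a request, and `AppleContainerImageTar` is the example variant name from step 1, not something that exists today.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SnapshotKind {
    DockerImageTar,
    /// Example variant from step 1 of the recipe.
    AppleContainerImageTar,
}

struct SnapshotPayload {
    kind: SnapshotKind,
    bytes: Vec<u8>,
}

/// A backend consumes exactly the kind it produces and rejects anything
/// else up front. Returns the number of payload bytes it would restore.
fn acquire_from_snapshot(
    expected: SnapshotKind,
    payload: &SnapshotPayload,
) -> Result<usize, String> {
    if payload.kind != expected {
        return Err(format!(
            "snapshot kind mismatch: backend consumes {expected:?}, payload is {:?}",
            payload.kind
        ));
    }
    Ok(payload.bytes.len())
}
```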

## Test plan

- [x] `cargo test --workspace` (51 unit tests pass)
- [x] **End-to-end CI test added:
`crates/cli/tests/snapshot_round_trip.rs`** — drives the harness library
directly against real Docker on Linux, asserts the file content rolls
back AND a post-snapshot file disappears after `/rewind`. Passes locally
in ~1.8s; self-skips on non-docker matrix cells. Wired into the existing
`integration.yml` workflow alongside the existing `integration_chat`
test.
- [x] Manual REPL verification (the demo above): `/snapshot` → file
content modified → `/rewind` → file content rolled back. Verified
visually before the automated test was written.
- [x] On-disk layout (manifest.json + payload.bin) verified by the
round-trip test using `read_dir()`.
Base automatically changed from sandbox-snapshots-filesystem to main June 1, 2026 02:18