Replace Firecracker with Flint VMM#13

Open
joshuaisaact wants to merge 13 commits into main from feat/migrate-flint

Conversation


@joshuaisaact joshuaisaact commented Mar 22, 2026

Summary

  • Replace Firecracker (downloaded binary) with Flint, a custom Zig KVM microVMM built from source
  • Snapshot restore working — Sandbox.create() ~145ms (was ~135ms with Firecracker)
  • Delete dm-thin provisioning (~360 lines), always use cp --reflink=auto
  • Setup downloads pre-built 5.10.245 bzImage kernel from GitHub releases
  • Code review fixes: seccomp allowlist, virtio descriptor validation, path traversal checks, pause deadlock fix

Key changes

| Area | Before | After |
| --- | --- | --- |
| VMM | Firecracker v1.15.0 (downloaded) | Flint (Zig, built from source) |
| Kernel | ELF vmlinux (downloaded) | bzImage (pre-built, hosted) |
| Sandbox.create() | ~135ms snapshot restore | ~145ms snapshot restore |
| Rootfs cloning | dm-thin or reflink | reflink only |
| Setup deps | curl | Zig 0.16+ |

Known issue

Multiple sequential exec() calls on the same sandbox hang — first exec works, second blocks. This is a pre-existing hearth-agent protocol bug (single request per connection), not a regression from this migration.

42 files changed, +7133 / -653

Merge the Flint VMM (custom Zig microVMM) into the repo at vmm/,
replacing Firecracker as the underlying VM engine.

TypeScript SDK changes:
- Rename FirecrackerApi → FlintApi with Flint-compatible payloads
  (two-phase snapshot/load, flat mem_file_path, no resume_vm)
- Sandbox restore uses Flint CLI restore mode (--restore flags)
  instead of HTTP roundtrips for snapshot loading
- Replace getFirecrackerPath() → getVmmPath() pointing to ~/.hearth/bin/flint
- Setup builds Flint from source (zig build -Doptimize=ReleaseSafe)
  instead of downloading Firecracker binary
- Delete dm-thin provisioning (src/vm/thin.ts, ~360 lines) — use
  cp --reflink=auto (COPYFILE_FICLONE) for all rootfs copies
- Remove pool CLI command and thin pool from status
- Socket renamed to flint.sock, vcpu_count set to 1
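The reflink copy mentioned above maps directly onto Node's built-in clone flag. A minimal sketch (the function name is illustrative, not the actual sandbox.ts helper):

```typescript
import { copyFileSync, constants } from "node:fs";

// COPYFILE_FICLONE asks the kernel for a copy-on-write clone (reflink)
// and silently falls back to a regular byte copy when the filesystem
// cannot clone, matching the semantics of `cp --reflink=auto`.
// (COPYFILE_FICLONE_FORCE would instead fail on non-CoW filesystems.)
export function cloneRootfs(src: string, dest: string): void {
  copyFileSync(src, dest, constants.COPYFILE_FICLONE);
}
```

On btrfs or XFS the clone is near-instant regardless of rootfs size, which is what makes dropping dm-thin viable.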

VMM changes (vmm/):
- Copy Flint source, remove redundant files (agent, pool, pool_api, sandbox)
- Strip pool mode CLI args, agent listener/accept, sandbox API endpoints
- Add ELF vmlinux loading support alongside bzImage in boot/loader.zig
- Remove flint-agent build target

Known issue: ELF vmlinux boot (Firecracker CI kernels) stalls during
init — needs further debugging of boot_params setup for ELF path.
bzImage kernels work correctly.
@coderabbitai

coderabbitai bot commented Mar 22, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Replaces Firecracker with an in-tree Zig VMM named Flint: docs and CLI updated, thin-pool/dm-thin support removed, TypeScript VM API renamed/adjusted, setup now builds Flint from vmm/, and ~4k+ lines of Zig code added implementing KVM, VirtIO devices, boot loader, snapshot orchestration, jail/seccomp, and tests.

Changes

| Cohort | File(s) | Summary |
| --- | --- | --- |
| Docs & metadata | .gitignore, ARCHITECTURE.md, CLAUDE.md, README.md, package.json, docs/exec-plans/active/merge-flint.md, examples/claude-in-sandbox.ts | Replaced Firecracker wording with Flint/KVM, added migration plan and vmm/ docs, updated package metadata and .gitignore to ignore Zig build outputs. |
| CLI (commands & setup) | src/cli/hearth.ts, src/cli/pool.ts, src/cli/status.ts, src/cli/setup.ts, src/cli/setup.test.ts | Removed pool CLI and thin-pool setup/status; setup now builds/installs flint from vmm/ (Zig); tests now check for flint presence; thin-pool lifecycle code deleted. |
| VM client & paths | src/vm/api.ts, src/vm/binary.ts, src/vm/snapshot.ts | Renamed FirecrackerApi to FlintApi, simplified the snapshot load payload, updated socket/name constants (firecracker.sock to flint.sock) and the VMM binary lookup (getVmmPath resolves to ~/.hearth/bin/flint). |
| Sandbox & restore flow | src/sandbox/sandbox.ts | Sandbox now uses FlintApi and spawns the Flint CLI with --restore and direct file args; removed dm-thin handling and thin-device state; always copies artifacts with reflink; adjusted vsock wait and error contexts. |
| Thin-pool removal | src/vm/thin.ts | Entire thin-pool module deleted; all dm-thin APIs and helpers removed. |
| Network docs | src/network/proxy.ts | Doc comment updated to reference Flint proxies. |
| Zig build & package | vmm/build.zig, vmm/build.zig.zon | Added Zig build script and package manifest for Flint (build/test/integration steps, Zig version constraint). |
| VMM core (Zig): KVM & VM | vmm/src/kvm/..., vmm/src/main.zig, vmm/src/jail.zig, vmm/src/seccomp.zig | Added KVM ABI wrappers, system/vm/vcpu modules, the main VMM entrypoint and run loop, jail setup, a seccomp installer, and related APIs. |
| Boot & memory (Zig) | vmm/src/boot/..., vmm/src/memory.zig, vmm/src/boot/params.zig | Added the ELF/bzImage loader, boot-params structs/constants and e820 setup, and guest memory allocation and mmap-from-file helpers. |
| VirtIO & devices (Zig) | vmm/src/devices/virtio.zig, vmm/src/devices/virtio/*, vmm/src/devices/serial.zig, vmm/src/devices/virtio/mmio.zig, vmm/src/devices/virtio/queue.zig | Implemented the Virtio MMIO layout, queue handling, block/net/vsock backends, vsock connection semantics, and 16550 serial emulation with snapshot support. |
| Snapshot orchestration & tests (Zig) | vmm/src/snapshot.zig, vmm/src/tests.zig, vmm/src/integration_tests.zig | Added the snapshot save/load format and device serialization, unit tests, and integration tests exercising boot/pause/create/restore flows. |

Sequence Diagram(s)

sequenceDiagram
  participant TS as TypeScript host client
  participant Flint as flint (Zig VMM)
  participant KVM as /dev/kvm (kernel)
  participant Devices as VirtIO backends

  TS->>Flint: PUT /boot-source, /drives, /vsock, /machine-config
  Flint-->>TS: 200 OK (config stored)
  TS->>Flint: PUT /actions {"ActionType":"InstanceStart"}
  Flint->>KVM: open /dev/kvm, create VM, set memory, create vCPU
  Flint->>Devices: init blk/net/vsock/serial backends
  Flint->>KVM: KVM_RUN (vCPU loop)
  loop device events
    Devices->>Flint: signal interrupt / queue work
    Flint->>KVM: inject IRQ / handle MMIO IO
  end
sequenceDiagram
  participant TS as TypeScript host client
  participant FlintCLI as flint (CLI)
  participant KVM as /dev/kvm

  TS->>FlintCLI: spawn flint --restore --vmstate-path ... --mem-path ...
  FlintCLI->>KVM: open /dev/kvm, create VM, mmap mem file
  FlintCLI->>KVM: restore vCPU, irq/timer, device state (snapshot order)
  FlintCLI-->>TS: API available / VM running

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Poem

🐰 I hopped from Firecracker to Flint's new den,
Built kernels, vsock bridges, and jailed roots again.
Snapshots tucked in vmstate, memory mapped clear,
Devices buzz, KVM hums — a rabbit steered the gear. 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The pull request title clearly and concisely summarizes the main change: replacing the Firecracker VMM with Flint throughout the codebase. |
| Description check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
examples/claude-in-sandbox.ts (1)

5-16: ⚠️ Potential issue | 🟡 Minor

Update the macOS prerequisite text to the new remote-daemon flow.

The block around Line 14 still tells macOS users to run hearth lima setup, which conflicts with the new README guidance that Flint requires KVM and macOS should connect to a remote Linux daemon instead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/claude-in-sandbox.ts` around lines 5 - 16, Update the macOS
prerequisite line in the examples/claude-in-sandbox.ts README block: replace the
old "hearth lima setup (macOS M3+)" instruction with the new remote-daemon
guidance (e.g., mention connecting to a remote Linux daemon or running "npx
hearth setup" on a remote Linux host) so it matches README and Flint KVM
requirements; edit the Prerequisites bullet where "2." and the macOS note appear
to reflect the remote-daemon flow and remove the lima instruction.
README.md (1)

573-573: ⚠️ Potential issue | 🟡 Minor

Synchronize the earlier checkpoint section with this new behavior.

This line says sandbox.checkpoint() keeps the sandbox running, but the ## Checkpoint and Rollback section around Line 182 still says the active session is terminated on checkpoint. Please update that section too so the README does not document two different semantics for the same API.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@README.md` at line 573, Update the "## Checkpoint and Rollback" section to
match the current behavior described in v0.6: change any text that says
checkpointing terminates or destroys the active session to state that
sandbox.checkpoint() saves the VM state while keeping the sandbox/session
running, allowing rollback and branching without terminating the active sandbox;
ensure you mention sandbox.checkpoint() by name and remove or replace phrases
like "terminates the active session" or "destroys the sandbox" so the README
documents a single, consistent API behavior.
🟠 Major comments (22)
src/cli/setup.ts-77-80 (1)

77-80: ⚠️ Potential issue | 🟠 Major

Don't treat any existing Flint binary as current.

Once ~/.hearth/bin/flint exists, setup never rebuilds it. After an upgrade, users can keep running an older VMM binary that no longer matches the shipped SDK/API or snapshot behavior. Please key this cache on a Flint version/fingerprint, not bare file existence.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cli/setup.ts` around lines 77 - 80, The current check treats any file at
flintPath (join(BIN_DIR, "flint")) as up-to-date; change it to validate a
version/fingerprint before skipping install: compute a stable fingerprint or
query the binary (e.g., run "flint --version" or compute SHA256 of flintPath)
and compare it to the expected FLINT_FINGERPRINT/FLINT_VERSION constant shipped
with the build; only log "flint: already installed" and return if the
fingerprint/version matches, otherwise remove/overwrite the stale binary and
proceed with building/installing; also handle failures to read or execute the
existing binary by falling back to reinstall.
src/cli/setup.ts-123-125 (1)

123-125: ⚠️ Potential issue | 🟠 Major

Use HTTPS for the kernel index request.

The final kernel download is HTTPS, but the object listing that chooses latestKey is fetched over plaintext HTTP. That allows on-path tampering or downgrade selection before the secure download step.

🔒 Suggested change
-  const listUrl = `http://spec.ccfc.min.s3.amazonaws.com/?prefix=firecracker-ci/v1.15/${architecture}/vmlinux-&list-type=2`;
+  const listUrl = `https://spec.ccfc.min.s3.amazonaws.com/?prefix=firecracker-ci/v1.15/${architecture}/vmlinux-&list-type=2`;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cli/setup.ts` around lines 123 - 125, Change the kernel index request to
use HTTPS: update the construction of listUrl (where it's built using
architecture and passed to fetchText) to use "https://" instead of "http://", so
the object listing that determines latestKey is fetched over TLS; ensure
fetchText is still used to fetch the HTTPS URL and that any logic that derives
latestKey from xml remains unchanged.
src/vm/api.ts-31-32 (1)

31-32: 🛠️ Refactor suggestion | 🟠 Major

Create a typed FlintApiError class for HTTP errors from the Flint API.

The module violates the coding guideline requiring each module to define its own typed error types. Using a generic Error forces callers to parse the message string to distinguish Flint HTTP status failures from other VM boot/configuration errors. Define the error class at the module level with structured fields for method, path, statusCode, and response body.

♻️ Suggested change
+export class FlintApiError extends Error {
+  constructor(
+    readonly method: string,
+    readonly path: string,
+    readonly statusCode: number,
+    readonly responseBody: string,
+  ) {
+    super(`Flint ${method} ${path}: ${statusCode} ${responseBody}`);
+    this.name = "FlintApiError";
+  }
+}
+
 /** Thin client for the Flint VMM REST API over Unix socket. */
 export class FlintApi {
@@
-            reject(new Error(`Flint ${method} ${path}: ${res.statusCode} ${data}`));
+            reject(new FlintApiError(method, path, res.statusCode, data));
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/vm/api.ts` around lines 31 - 32, Add a module-level typed FlintApiError
class (exported) that extends Error and exposes structured fields: method:
string, path: string, statusCode: number, and body: string (and set name =
"FlintApiError" and preserve stack). Replace the generic reject(new Error(...))
in the request/response path (the branch inside the code that currently checks
res.statusCode >= 300 in the function that sends HTTP requests) with reject(new
FlintApiError(method, path, res.statusCode, data)) so callers can
instanceof-check and access structured fields instead of parsing the message.
src/cli/setup.ts-100-108 (1)

100-108: ⚠️ Potential issue | 🟠 Major

Use a typed error instead of generic Error for setup failures.

The findVmmDir() function currently throws a generic Error, which violates the coding guideline that "each module must define its own typed error types instead of using generic errors." Create a SetupError class (or similar) extending HearthError and use it here. This allows callers to distinguish setup failures from other error types.

Note: The vmm/ directory will be included in the published npm package by default (no .npmignore exists and .gitignore only excludes build artifacts), so the lookup will work from distributed tarballs.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cli/setup.ts` around lines 100 - 108, Define a typed setup error and
throw it instead of a generic Error: add a SetupError class (or similarly named)
that extends HearthError (import HearthError where needed) in the same module or
a nearby errors module, and replace the throw in findVmmDir() with throw new
SetupError("Could not find vmm/ directory. Run from the hearth repo root.");
ensure the new SetupError is exported if other modules need to catch it so
callers can distinguish setup failures from other error types.
src/cli/setup.ts-83-95 (1)

83-95: ⚠️ Potential issue | 🟠 Major

Validate Zig version against the manifest's minimum requirement.

The preflight only checks that zig exists via execSync("zig version", { stdio: "pipe" }) but does not validate the version number. The vmm/build.zig.zon manifest declares .minimum_zig_version = "0.16.0-dev.1484+d0ba6642b", which is never checked. When an incompatible Zig version is installed, the build fails later with an opaque error from zig build since stdio: "pipe" suppresses stderr output to the user.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cli/setup.ts` around lines 83 - 95, Read and parse the manifest's
minimum_zig_version (from the manifest file in the vmm dir found by
findVmmDir()), then replace the current naive execSync("zig version") check with
logic that execSyncs "zig version", captures and parses the installed Zig
semantic version, compares it against the manifest's minimum_zig_version string
(handle pre-release/metadata as needed), and if the installed version is lower
print a clear error mentioning both the installed version and the required
minimum and exit; additionally, when invoking execSync("zig build
-Doptimize=ReleaseSafe", { cwd: vmmDir, ... }), ensure stderr is not suppressed
(either forward stdio or capture and log stderr on failure) so build errors are
visible to the user.
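Gating on the manifest's minimum version needs a comparator for Zig's `major.minor.patch-dev.N+hash` scheme. A sketch, under the assumption that only the numeric dev counter matters among pre-release parts:

```typescript
// Compares Zig-style versions ("major.minor.patch" with an optional
// "-dev.N+hash" pre-release). Returns negative if a < b, zero if
// equal, positive if a > b. Build metadata after "+" is ignored, which
// is enough for a minimum-version gate like build.zig.zon's.
function compareZigVersions(a: string, b: string): number {
  const parse = (v: string): number[] => {
    const [core, pre] = v.split("-", 2);
    const [major, minor, patch] = core.split(".").map(Number);
    // A release (no "-dev") sorts above any dev build of the same core.
    const dev = pre ? Number(pre.replace(/^dev\./, "").split("+")[0]) : Infinity;
    return [major, minor, patch, dev];
  };
  const pa = parse(a);
  const pb = parse(b);
  for (let i = 0; i < 4; i++) {
    if (pa[i] !== pb[i]) return pa[i] < pb[i] ? -1 : 1;
  }
  return 0;
}
```

setup.ts could then fail fast when `compareZigVersions(installed, minimum) < 0`, printing both versions in the error.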
vmm/src/api.zig-109-117 (1)

109-117: ⚠️ Potential issue | 🟠 Major

Don't return 204 for InstanceStart until the boot config is valid.

handleAction() acknowledges InstanceStart immediately, but serve() can still discover that neither kernel_path nor snapshot_path was configured and just keep waiting. Clients will see a successful start request even though the VM never transitions. Validate before responding, or return a 4xx from the action handler.

Also applies to: 184-186, 367-386

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@vmm/src/api.zig` around lines 109 - 117, handleAction() currently
acknowledges InstanceStart (returns 204) before validating the boot
configuration, causing clients to see success even when neither
config.kernel_path nor config.snapshot_path is set; update handleAction() (and
the InstanceStart handling branches around the started/config checks) to
validate that either config.snapshot_path != null or config.kernel_path != null
before sending a 2xx response, and if validation fails return a 4xx error
instead; mirror the same validation logic used in serve() (the code paths around
the started flag and snapshot/kernel checks) so the action handler only
acknowledges when the boot config is actually valid.
vmm/src/devices/virtio/blk.zig-125-146 (1)

125-146: ⚠️ Potential issue | 🟠 Major

Keep used_len equal to the bytes the device actually produced.

The T_IN path unconditionally adds total_data_len even if a pread() fails partway through, and T_GET_ID adds id_desc.len even though the device writes at most 20 bytes. vmm/src/devices/virtio/queue.zig, Lines 76-92 copy this value straight into the used ring, so over-reporting here can make the guest treat unwritten bytes as valid data.

Also applies to: 175-184, 195-197

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@vmm/src/devices/virtio/blk.zig` around lines 125 - 146, Ensure
used_len/device_written reflects actual bytes written rather than assumed
totals: in the T_IN path (loop using linux.pread and variables file_offset, rc,
bytes_read) accumulate the real bytes written per descriptor (bytes_read or 0 on
short/failure) into used_len/device_written and do not add total_data_len when a
pread fails partway; similarly in the T_GET_ID handling cap the increment to at
most the actual number of bytes written (the min of id_desc.len and the
device-written ID length, e.g., 20). Update the logic around variables used_len,
device_written, and any additions of total_data_len/id_desc.len so they sum the
actual bytes written (using rc/bytes_read) and stop adding on errors/breaks;
also mirror this fix at the other affected ranges noted (around lines 175-184
and 195-197) so queue.zig receives correct used lengths.
vmm/src/api.zig-726-739 (1)

726-739: ⚠️ Potential issue | 🟠 Major

SendCtrlAltDel is a no-op right now.

This branch never injects a key event, triggers ACPI shutdown, or otherwise tells the guest to power off—it only waits up to 5 seconds for runtime.exited to change on its own. That means the endpoint will mostly time out while still advertising shutdown semantics. Return 501/400 until there is an actual shutdown path, or wire it to a real device/runtime action.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@vmm/src/api.zig` around lines 726 - 739, The SendCtrlAltDel branch currently
does nothing to initiate a shutdown and merely waits on runtime.exited; update
it so that until you implement a real shutdown/inject path the handler returns a
clear not-implemented response instead of pretending to shut down: detect the
parsed.value.action_type == "SendCtrlAltDel" branch and call
respondError(request, .not_implemented, "SendCtrlAltDel not supported") (or
.bad_request as preferred) rather than spinning on runtime.exited/respondOk;
alternatively, if you wire it to a device/runtime action, invoke the appropriate
shutdown/inject function on the runtime/API and then wait on runtime.exited
before calling respondOk/respondError.
vmm/src/integration_tests.zig-32-43 (1)

32-43: ⚠️ Potential issue | 🟠 Major

Stop embedding init_script inside a raw single-quoted shell string.

Several callers pass echo -n '', which closes the shell quote early and turns the rest of the script into shell syntax. That makes buildInitrd() fail for the API/snapshot tests before Flint even starts. Write the file with Zig I/O or escape ' before interpolating.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@vmm/src/integration_tests.zig` around lines 32 - 43, In buildInitrd, stop
embedding init_script inside a raw single-quoted shell string passed to
process.run (the current printf '...'), because unescaped single quotes in
init_script break the shell; instead either (preferred) write the init file
using Zig I/O (open a temp file in the TMPDIR and write init_script bytes
directly before running bsdcpio) or (if keeping the shell approach) escape
single quotes in init_script (replace each ' with '\'' or '"'"' in the string
used to build the shell command) before concatenation; update the process.run
invocation accordingly and keep references to init_script and process.run in the
revised buildInitrd implementation.
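The escape-before-interpolating option is standard POSIX single-quote escaping, shown here in TypeScript since that is the SDK's language (helper name is illustrative):

```typescript
// Safely embed an arbitrary string in a POSIX sh command: close the
// quote, emit an escaped literal quote, then reopen. ' becomes '\''
function shellQuote(s: string): string {
  return "'" + s.replace(/'/g, "'\\''") + "'";
}
```

Writing the init file directly with file I/O, as the review prefers, sidesteps the quoting problem entirely.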
vmm/src/integration_tests.zig-200-219 (1)

200-219: ⚠️ Potential issue | 🟠 Major

Replace fixed sleeps with a real readiness check.

sleep_ms(500/1000/2000) does not guarantee that the Unix socket is listening or that the VM/restore has reached the expected state, so these tests will be flaky on slower CI or under load. Poll connect() or GET /vm until the API becomes reachable instead of sleeping a guessed duration.

Also applies to: 245-257, 299-318, 346-401

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@vmm/src/integration_tests.zig` around lines 200 - 219, Replace the brittle
sleep_ms(...) calls with a polling readiness helper that attempts to connect to
the Firecracker API socket (using connect() on sock_path) or repeatedly calls
httpRequest(sock_path, "GET", "/vm", null) until a success/expected state (e.g.,
HTTP success and "Running" in the body) or a short timeout; remove the fixed
sleeps in the boot/restore/start sequences around boot_cmd/httpRequest and VM
status checks (locations using sleep_ms, httpRequest, sock_path, and the status
check that searches for "Running") and use the helper with backoff/retry and a
clear timeout, freeing responses via allocator.free on each attempt and
returning an error if the socket/API never becomes ready.
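The same poll-until-ready pattern, sketched on the TypeScript side; the Zig tests would use the connect()/poll() syscall equivalents, and the names here are illustrative:

```typescript
import { createConnection } from "node:net";
import { setTimeout as sleep } from "node:timers/promises";

// Polls a Unix socket until a connect() succeeds or the deadline
// passes, rather than sleeping a guessed duration.
async function waitForSocket(path: string, timeoutMs = 10_000): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const ok = await new Promise<boolean>((resolve) => {
      const sock = createConnection(path);
      sock.once("connect", () => {
        sock.destroy();
        resolve(true);
      });
      sock.once("error", () => resolve(false));
    });
    if (ok) return;
    await sleep(50); // short backoff between attempts
  }
  throw new Error(`socket ${path} not ready within ${timeoutMs}ms`);
}
```

On a fast machine this resolves on the first attempt, while a loaded CI runner gets the full timeout instead of a flake.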
vmm/src/devices/virtio/blk.zig-154-168 (1)

154-168: ⚠️ Potential issue | 🟠 Major

Treat short pwrite results as I/O failure, not success.

A positive short write is legal on Linux. The current loop advances file_offset by the full descriptor length and can still return S_OK, which silently truncates guest writes under ENOSPC or signal interruption. Loop until each descriptor is fully written, or fail the request when the full payload cannot be persisted.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@vmm/src/devices/virtio/blk.zig` around lines 154 - 168, The loop treating
pwrite short writes as success must be changed so each descriptor is fully
written: inside the for over descs use a loop that calls linux.pwrite(self.fd,
buf.ptr + written, buf.len - written, file_offset + written) until written ==
buf.len; on each pwrite check rc < 0 => set status = S_IOERR and break, treat rc
== 0 as failure and set S_IOERR, otherwise advance written by rc and only after
the descriptor finishes advance file_offset by the descriptor's written bytes
(cast rc to u64 for arithmetic), handling overflow the same way you currently
do; reference functions/vars: pwrite, file_offset, descs, desc.len, status,
S_IOERR.
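The full-write loop this comment asks for has the same shape in any language; a TypeScript rendering of the pattern (the real fix belongs in the Zig pwrite loop):

```typescript
import { writeSync } from "node:fs";

// POSIX write()/pwrite() may persist fewer bytes than requested (a
// short write). Loop until the whole buffer lands at the given offset,
// and treat a zero-byte write as a hard failure, not silent success.
function pwriteAll(fd: number, buf: Uint8Array, position: number): void {
  let written = 0;
  while (written < buf.length) {
    const n = writeSync(fd, buf, written, buf.length - written, position + written);
    if (n === 0) throw new Error(`short write at offset ${position + written}`);
    written += n;
  }
}
```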
vmm/src/integration_tests.zig-157-172 (1)

157-172: ⚠️ Potential issue | 🟠 Major

The 10s boot timeout never fires while read() is blocked.

linux.read() on the stdout pipe is blocking, so if the guest stalls before printing FLINT_BOOT_OK, this loop can hang indefinitely inside the first read and never re-check the monotonic clock. Use poll/nonblocking I/O around stdout so the timeout is enforceable.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@vmm/src/integration_tests.zig` around lines 157 - 172, The loop that reads
from the child stdout uses blocking linux.read(), so the timeout based on
start_ts never triggers; change the I/O to use poll or non-blocking reads: call
linux.poll (or set stdout.handle to O_NONBLOCK and use nonblocking reads) to
wait for readability with a timeout derived from start_ts (or pass a computed
timeout to poll), then only call linux.read when poll indicates data available
and handle EAGAIN/EWOULDBLOCK; update the loop around buf/total/FLINT_BOOT_OK to
break on poll timeout and keep existing logic for accumulating data and
searching for "FLINT_BOOT_OK".
src/sandbox/sandbox.ts-111-122 (1)

111-122: ⚠️ Potential issue | 🟠 Major

Clean up the child if restore initialization fails before new Sandbox(...).

Right now copy failures before spawn() leave a stale runDir, and failures after spawn() but before the constructor returns can also leave a live Flint process behind. Because activeSandboxes.add(this) only happens in the constructor, nothing reaps that partial restore later. Put the whole restore sequence under one cleanup guard.

Also applies to: 156-173

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/sandbox/sandbox.ts` around lines 111 - 122, The restoreFromDir flow can
leave a stale runDir or a live child process if any step fails before the
Sandbox constructor adds the instance to activeSandboxes; wrap the entire
restore sequence in a cleanup guard (e.g., local variables for runDir and
spawned child) so that on any error you remove the runDir and, if spawn()
already created a child process, kill/cleanup that child; ensure the guard runs
both for copyFile failures and for failures after spawn() but before the new
Sandbox(...) returns, and keep activeSandboxes.add(this) only in the constructor
as-is while the guard handles pre-constructor cleanup.
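One shape for that guard is a small cleanup stack: each acquired resource registers an undo before the next step runs, and any failure unwinds them in reverse. A generic sketch, not the actual sandbox.ts code:

```typescript
// Generic cleanup stack: every acquired resource registers an undo
// action before the next step runs; a failure anywhere runs the undos
// in reverse order and rethrows the original error.
async function withCleanup<T>(
  fn: (onFail: (undo: () => void | Promise<void>) => void) => Promise<T>,
): Promise<T> {
  const undos: Array<() => void | Promise<void>> = [];
  try {
    return await fn((undo) => undos.push(undo));
  } catch (err) {
    for (const undo of undos.reverse()) {
      try {
        await undo();
      } catch {
        // best-effort cleanup; the original error still propagates
      }
    }
    throw err;
  }
}
```

Applied to restore, the first registered undo would remove runDir and a later one would kill the spawned Flint child, so no step can leak a partial restore.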
vmm/src/api.zig-488-499 (1)

488-499: ⚠️ Potential issue | 🟠 Major

Make the post-boot accept loop interruptible on guest exit.

The loop condition is checked before accept(), so once runtime.exited flips, the main thread can still block forever waiting for another client. In vmm/src/main.zig, Lines 426-457 call servePostBoot() before thread.join(), so this can keep Flint alive after the guest has already exited. Use nonblocking accept/poll or close the listener when the VM exits.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@vmm/src/api.zig` around lines 488 - 499, The accept loop in servePostBoot
(the while loop that calls server.accept) can block after runtime.exited flips;
make the accept interruptible by either switching the listener to non-blocking
and polling/selecting on both the server socket and runtime.exited, or by
arranging for the listener to be closed when runtime.exited is set so accept()
returns. Locate the loop that references runtime.exited, server.accept, and
handlePostBootConnection and modify it to (1) avoid a blocking accept when
runtime.exited is true by checking/using a nonblocking accept or poll, or (2)
ensure the listener is closed from the VM-exit path so accept returns with an
error and the loop can exit; update error handling for accept to treat the
listener-closed case as loop termination rather than retry.
vmm/src/main.zig-854-860 (1)

854-860: ⚠️ Potential issue | 🟠 Major

The pause handshake can miss a resume→pause transition.

vmm/src/api.zig:606-620 clears ack_paused and flips paused back to false before returning. If another pause request sets paused = true again before this loop observes that false edge, the vCPU keeps spinning here, never publishes a fresh ack, and the second pause waits forever. A level-triggered bool pair is too weak here; this needs an epoch/futex/condition-style handshake.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@vmm/src/main.zig` around lines 854 - 860, The current level-triggered boolean
handshake using rt.ack_paused and rt.paused (observed in the loop in main.zig
where vcpu spins and then clears vcpu.kvm_run.immediate_exit) can miss a
resume→pause race; replace it with an epoch-based handshake: add an atomic
pause_epoch counter (e.g., rt.pause_epoch) that the requestor increments to
request a pause and an atomic ack_epoch (e.g., rt.ack_epoch) that the vCPU sets
to match the requester’s epoch when it has acknowledged the pause; use release
stores when updating epochs and acquire loads when reading them, and block/wake
using a futex/condvar (or Zig's Event/Cond API) rather than a spin loop so the
vCPU can publish a fresh ack for every new epoch and the requester can wait for
ack_epoch == pause_epoch safely; update references to rt.paused/rt.ack_paused
and vcpu.kvm_run.immediate_exit handling to use the new epoch/futex handshake
(wake the vCPU when pause_epoch increments, and requester clears by incrementing
resume or moving to next epoch).
vmm/src/devices/virtio/queue.zig-159-165 (1)

159-165: ⚠️ Potential issue | 🟠 Major

Reject avail.idx values that outrun the queue.

popAvail() accepts any forward jump and just mods the ring index. A guest can advance avail.idx by more than size, which makes the VMM reconsume ring entries and duplicate blk/net work.

Suggested fix
 pub fn popAvail(self: *Self, mem: *Memory) !?u16 {
     const avail_idx = try self.getAvailIdx(mem);
     if (avail_idx == self.last_avail_idx) return null;
+    const pending = avail_idx -% self.last_avail_idx;
+    if (pending > self.size) return error.AvailRingCorrupt;

     const head = try self.getAvailRing(mem, self.last_avail_idx);
     self.last_avail_idx +%= 1;
     return head;
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@vmm/src/devices/virtio/queue.zig` around lines 159-165: popAvail currently
accepts arbitrarily large forward jumps of avail.idx and just mods the ring
index; fix it by validating the advance before consuming: after calling
getAvailIdx (symbol: avail_idx) compute the forward distance from
self.last_avail_idx to avail_idx using ring-size arithmetic (use self.size/the
queue size) and reject (return an error) if that distance is greater than the
queue size (or >= self.size, whichever matches your semantics). Only if the
distance is in-range proceed to call getAvailRing(self.last_avail_idx) and
increment last_avail_idx; reference functions/fields to change: popAvail,
getAvailIdx, getAvailRing, last_avail_idx, and the queue size field (self.size).
vmm/src/devices/virtio/net.zig-171-178 (1)

171-178: ⚠️ Potential issue | 🟠 Major

Don't consume an RX buffer on WouldBlock.

The frame was never delivered, but this path still moves the descriptor to the used ring with len=0. A transient EAGAIN race drains the guest RX ring and can stall receive entirely.

Suggested fix
         const bytes_written = self.receiveFrame(mem, queue, head) catch |err| {
             // EAGAIN/EWOULDBLOCK means no more frames
             if (err == error.WouldBlock) {
-                // Already popped descriptor, push back as unused
-                queue.pushUsed(mem, head, 0) catch |e| log.warn("RX pushUsed failed: {}", .{e});
+                queue.last_avail_idx -%= 1;
                 break;
             }
             log.err("RX failed: {}", .{err});
             queue.pushUsed(mem, head, 0) catch |e| log.warn("RX pushUsed failed: {}", .{e});
             break;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@vmm/src/devices/virtio/net.zig` around lines 171-178: when receiveFrame
returns error.WouldBlock after popping a descriptor (head) you must not move
that descriptor to the used ring with queue.pushUsed(..., 0); instead restore
the descriptor to the guest RX available ring (i.e., requeue head) or otherwise
avoid consuming it so the guest can retry. Update the error handler in the
popAvail/receiveFrame block (where bytes_written is assigned) to call the
appropriate requeue method (e.g., queue.pushAvail(mem, head) or the queue API
that returns the descriptor to available) and skip calling queue.pushUsed for
WouldBlock, then break/return as before.
vmm/src/devices/virtio/net.zig-75-80 (1)

75-80: ⚠️ Potential issue | 🟠 Major

TX packets are silently lost on writev() failures under host backpressure.

init() sets O_NONBLOCK on the TAP fd (lines 77-80), but transmitChain() does not handle write failures. When writev() returns EAGAIN due to host backpressure (lines 151-155), the code only logs a warning and continues to queue.pushUsed() (line 158), marking the descriptor complete. The packet data is lost without retry or deferral, while the guest believes transmission succeeded.

The RX path handles this correctly (lines 174-177) by re-queuing the descriptor on WouldBlock. The TX path needs equivalent handling.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@vmm/src/devices/virtio/net.zig` around lines 75-80: init() sets O_NONBLOCK
on the TAP fd but transmitChain() currently treats writev() EAGAIN as a
successful transmission and calls queue.pushUsed(), losing the packet; modify
transmitChain() (the writev() handling) to detect EAGAIN/WouldBlock and NOT call
queue.pushUsed() for that descriptor, instead re-queue or stash the descriptor
for retry (e.g., push into a pending TX queue or re-add to the virtqueue) and
schedule a retry when the TAP fd becomes writable (via poll/epoll), and also
handle partial writes by tracking bytes written and adjusting iovecs so
remaining bytes are sent on retry; ensure queue.pushUsed() is only called after
the entire packet has actually been written.
vmm/src/devices/virtio/mmio.zig-209-217 (1)

209-217: ⚠️ Potential issue | 🟠 Major

Device reset needs a backend reset hook.

reset() clears only transport and queue state. For Vsock, established host sockets, pending control packets, and buffered writes survive a guest reset, so a reinitialized device can inherit stale connections and leaked FDs. Please clear backend-local state when status is written back to 0.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@vmm/src/devices/virtio/mmio.zig` around lines 209-217: the device reset
currently only clears transport/queue fields in fn reset(self: *Self) but does
not clear backend-local state (e.g., Vsock's host sockets, pending control
packets, buffered writes), so add and call a backend reset hook from this code
path: define a reset method on the backend interface (e.g., Backend.reset() or
reset_on_device_reset()) and invoke it from the mmio reset implementation (fn
reset and/or wherever status is written back to 0) so that when status is
cleared the backend cleans up sockets, pending packets, and FDs; reference the
existing fn reset(self: *Self), Self.device_features_sel, Self.queue_sel, and
the Vsock backend implementation to implement the cleanup.
vmm/src/snapshot.zig-251-268 (1)

251-268: ⚠️ Potential issue | 🟠 Major

Reject truncated or mismatched device records during restore.

This loop only checks dev_len >= 16, then restores into whatever backend happens to be in slot i. A malformed snapshot can therefore trip bounds checks inside VirtioMmio.snapshotRestore(), and a reordered or missing CLI device can have queue state applied to the wrong transport. Since the first 16 bytes already carry device_id/mmio_base/irq, restore should fail unless the backend exists, the identity matches, and the consumed size matches dev_len.

vmm/src/snapshot.zig-286-292 (1)

286-292: ⚠️ Potential issue | 🟠 Major

Snapshot files should not be world-readable.

openCreate() is used for both vmstate and raw RAM dumps, and 0o644 exposes guest memory/register state to other local users. Please tighten this to 0o600 unless shared snapshots are an explicit requirement.

Suggested change
-    }, 0o644));
+    }, 0o600));
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@vmm/src/snapshot.zig` around lines 286-292: the mode used when creating
snapshot files in openCreate currently sets permissions to 0o644; update the
file creation mode in openCreate (in snapshot.zig) from 0o644 to 0o600 so
created vmstate and raw RAM dump files are owner-read/write only, ensuring
snapshots are not world-readable; keep the rest of the flags (.WRONLY, .CREAT,
.TRUNC, .CLOEXEC) unchanged.
vmm/src/kvm/vcpu.zig-192-202 (1)

192-202: ⚠️ Potential issue | 🟠 Major

Use the counted ioctl path for MSRs instead of ioctlVoid.

KVM_GET_MSRS and KVM_SET_MSRS return the number of MSR entries successfully transferred, but the current code drops this via ioctlVoid. If the kernel handles fewer entries than requested, the code won't detect it—the uninitialized tail of MsrBuffer gets serialized as snapshot state, silently corrupting restore. Replace abi.ioctlVoid() with abi.ioctl() and validate the returned count equals the expected entry count.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@vmm/src/kvm/vcpu.zig` around lines 192-202: in getMsrs and setMsrs, replace
the use of abi.ioctlVoid with the counted path abi.ioctl and check the returned
number of MSRs transferred against the expected count (derived from
snapshot_msr_indices.len / buf.nmsrs or buf.nmsrs) to detect partial transfers;
after calling abi.ioctl(self.fd, c.KVM_GET_MSRS, `@intFromPtr`(buf)) /
abi.ioctl(self.fd, c.KVM_SET_MSRS, `@intFromPtr`(buf)) verify the returned int
equals the expected entry count and return an error (or handle it) if it is
smaller so you never serialize uninitialized tail entries from MsrBuffer.
🟡 Minor comments (2)
vmm/src/seccomp.zig-171-227 (1)

171-227: ⚠️ Potential issue | 🟡 Minor

Audit mode still hard-kills on argument violations; log message is misleading.

The default_action parameter only controls the fallthrough for unlisted syscalls at line 172. The argument-validation branches for clone, socket, and mprotect at lines 179, 185, and 192 hardcode SECCOMP_RET_KILL_PROCESS. This means in audit mode (enabled via --seccomp-audit), argument violations still terminate the process despite the log message at line 225 claiming "violations logged, not killed." Either route those reject branches through default_action so audit mode behaves consistently, or clarify both the log message and CLI help text to specify that only unlisted syscalls are logged.

Suggested fix if audit mode should cover these cases too
-    f[CLONE_BLK + 3] = bpf_stmt(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS);
+    f[CLONE_BLK + 3] = bpf_stmt(BPF_RET | BPF_K, default_action);

-    f[SOCKET_BLK + 2] = bpf_stmt(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS);
+    f[SOCKET_BLK + 2] = bpf_stmt(BPF_RET | BPF_K, default_action);

-    f[MPROT_BLK + 3] = bpf_stmt(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS);
+    f[MPROT_BLK + 3] = bpf_stmt(BPF_RET | BPF_K, default_action);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@vmm/src/seccomp.zig` around lines 171-227: the argument-validation branches
for CLONE_BLK, SOCKET_BLK, and MPROT_BLK currently hardcode
SECCOMP_RET_KILL_PROCESS so audit mode still kills; change the return statements
at CLONE_BLK+3, SOCKET_BLK+2, and MPROT_BLK+3 to return the buildFilter's
default_action instead of SECCOMP_RET_KILL_PROCESS so that when buildFilter is
invoked with SECCOMP_RET_LOG (log_filter) those argument violations are logged
rather than killing the process; keep kill_filter/log_filter and buildFilter
usage as-is so behavior is driven by the passed default_action.
vmm/src/devices/serial.zig-151-158 (1)

151-158: ⚠️ Potential issue | 🟡 Minor

Clear the pending IRQ latch when THRE interrupts are disabled.

updateIir() switches iir back to IIR_NO_INT, but it leaves irq_pending set. If the guest clears IER_THRE before hasPendingIrq() runs, Flint still injects one stale interrupt.

Suggested fix
     if (self.ier & IER_THRE != 0) {
         self.iir = fifo_bits | IIR_THR_EMPTY;
         self.irq_pending = true;
     } else {
         self.iir = fifo_bits | IIR_NO_INT;
+        self.irq_pending = false;
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@vmm/src/devices/serial.zig` around lines 151-158: the updateIir function
leaves the irq_pending latch set when THRE is disabled, causing stale
interrupts; modify updateIir (in serial.zig) so that in the branch where
(self.ier & IER_THRE) == 0 you not only set self.iir = fifo_bits | IIR_NO_INT
but also clear the pending latch (self.irq_pending = false) so hasPendingIrq()
won't report a stale IRQ; keep the existing behavior where setting IER_THRE sets
IIR_THR_EMPTY and irq_pending = true.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/cli/setup.ts`:
- Around line 120-125: The setup flow in src/cli/setup.ts currently constructs a
listUrl that points to ELF vmlinux artifacts; change it to fetch a bzImage
artifact instead (e.g., use the Firecracker CI prefix that contains bzImage
builds) or add a flag to prevent downloading ELF by default; update the code
around the fetchText/listUrl construction in setup.ts so it requests the bzImage
file for the chosen architecture, and ensure this aligns with the loader used by
src/vm/snapshot.ts (getKernelPath) so base snapshot creation uses a bzImage
rather than an ELF until the ELF loader bug is fixed.

In `@vmm/build.zig`:
- Around line 4-6: The build currently forces an x86_64 Linux default by setting
.default_target in the target = b.standardTargetOptions(...) block; remove that
hardcoding so Zig's default/CLI-provided target is used instead (e.g., call
b.standardTargetOptions(.{}) or omit the .default_target field) and ensure the
same `target` symbol (used in the test and integration-test step definitions)
remains referenced so tests and the flint binary are built for the
host/CLI-provided target rather than always x86_64.

In `@vmm/src/boot/loader.zig`:
- Around line 238-249: The PT_LOAD virtual-to-physical translation uses the
wrong high-half base: change the subtraction base from 0xffffffff80000000 to the
documented 0xffffffff81000000 so that addresses like 0xffffffff81000000 map to
0x01000000 (PHYS_OFFSET) instead of being shifted by an extra 0x01000000; update
the same logic wherever you translate the kernel entry/RIP (the other block
mentioned) so both places that compute phys from p_paddr use the corrected
high-half base while keeping PHYS_OFFSET and the existing bounds checks.
- Around line 296-305: The setup_header fields are being written at the wrong
offsets (using params.OFF_SETUP_HEADER + 6 and +10); update the writes so the
HdrS magic and boot-protocol version are written at the start of the synthetic
setup_header. Replace the writeInt calls that target params.OFF_SETUP_HEADER + 6
and +10 with writes at params.OFF_SETUP_HEADER (for the 4-byte HDRS_MAGIC) and
params.OFF_SETUP_HEADER + 4 (for the 2-byte 0x020F version), keeping the same bp
buffer, std.mem.writeInt calls, and little-endian encoding.

In `@vmm/src/devices/virtio/vsock.zig`:
- Around line 403-447: The code currently consumes the TX descriptor chain
before ensuring data is actually kept: both the initial flush-block path and the
.AGAIN branch subtract full descriptor lengths and restart stashing from
descs[1..], which can drop or duplicate bytes. Fix by treating stashWrite() as
authoritative: track per-descriptor progress (an index and an offset into the
current descriptor) and only decrement remaining and advance to the next
descriptor by the actual number of bytes stashed/ written; in the .AGAIN branch,
stash the unsent suffix of the current descriptor and then continue stashing
subsequent descriptors starting at the next descriptor index (not re-iterating
from descs[1..]). Update conn.rx_cnt by the actual stashed count returned from
stashWrite() and use that to decide when to stop, and ensure any mem.slice()
failure breaks the descriptor loop rather than assuming the full length was
handled.
- Around line 479-556: deliverRxData currently pops an RX head (queue.popAvail)
before knowing if there's TX window or any readable socket data, which lets
polling drain the guest RX ring; change the logic in deliverRxData to first
check conn.availableForTx() and confirm the host socket is readable (e.g., via
non-blocking poll or a recv/peek) and only then call queue.popAvail to consume
descriptors, or if you must pop early ensure you restore the head (not
queue.pushUsed with len 0) when you hit tx_window==0 or EAGAIN; update the
branches around conn.availableForTx(), linux.readv handling of EAGAIN, and
queue.pushUsed usage so descriptors are not consumed unless header+data will
actually be written (refer to deliverRxData, queue.popAvail,
conn.availableForTx, linux.readv, and queue.pushUsed).

In `@vmm/src/jail.zig`:
- Around line 65-75: The device nodes are created as root:root and supplementary
groups remain after setuid/setgid; before calling linux.setgid(config.gid) and
linux.setuid(config.uid) you must change ownership of created nodes and clear
supplementary groups so the target user can access them. After creating dev/kvm
and (if config.need_tun) dev/net/tun, call chown/chownat (or the equivalent
syscall wrapper) to set owner to config.uid and group to config.gid for
linux.mknod-created files, and call the function to clear supplementary groups
(setgroups(0, NULL) or the Zig wrapper) before invoking linux.setgid and
linux.setuid; ensure these operations occur after linux.mknod and before
linux.setgid/linux.setuid so the jailed VMM can open the devices.

---

Outside diff comments:
In `@examples/claude-in-sandbox.ts`:
- Around line 5-16: Update the macOS prerequisite line in the
examples/claude-in-sandbox.ts README block: replace the old "hearth lima setup
(macOS M3+)" instruction with the new remote-daemon guidance (e.g., mention
connecting to a remote Linux daemon or running "npx hearth setup" on a remote
Linux host) so it matches README and Flint KVM requirements; edit the
Prerequisites bullet where "2." and the macOS note appear to reflect the
remote-daemon flow and remove the lima instruction.

In `@README.md`:
- Line 573: Update the "## Checkpoint and Rollback" section to match the current
behavior described in v0.6: change any text that says checkpointing terminates
or destroys the active session to state that sandbox.checkpoint() saves the VM
state while keeping the sandbox/session running, allowing rollback and branching
without terminating the active sandbox; ensure you mention sandbox.checkpoint()
by name and remove or replace phrases like "terminates the active session" or
"destroys the sandbox" so the README documents a single, consistent API
behavior.

---

Major comments:
In `@src/cli/setup.ts`:
- Around line 77-80: The current check treats any file at flintPath
(join(BIN_DIR, "flint")) as up-to-date; change it to validate a
version/fingerprint before skipping install: compute a stable fingerprint or
query the binary (e.g., run "flint --version" or compute SHA256 of flintPath)
and compare it to the expected FLINT_FINGERPRINT/FLINT_VERSION constant shipped
with the build; only log "flint: already installed" and return if the
fingerprint/version matches, otherwise remove/overwrite the stale binary and
proceed with building/installing; also handle failures to read or execute the
existing binary by falling back to reinstall.
- Around line 123-125: Change the kernel index request to use HTTPS: update the
construction of listUrl (where it's built using architecture and passed to
fetchText) to use "https://" instead of "http://", so the object listing that
determines latestKey is fetched over TLS; ensure fetchText is still used to
fetch the HTTPS URL and that any logic that derives latestKey from xml remains
unchanged.
- Around line 100-108: Define a typed setup error and throw it instead of a
generic Error: add a SetupError class (or similarly named) that extends
HearthError (import HearthError where needed) in the same module or a nearby
errors module, and replace the throw in findVmmDir() with throw new
SetupError("Could not find vmm/ directory. Run from the hearth repo root.");
ensure the new SetupError is exported if other modules need to catch it so
callers can distinguish setup failures from other error types.
- Around line 83-95: Read and parse the manifest's minimum_zig_version (from the
manifest file in the vmm dir found by findVmmDir()), then replace the current
naive execSync("zig version") check with logic that execSyncs "zig version",
captures and parses the installed Zig semantic version, compares it against the
manifest's minimum_zig_version string (handle pre-release/metadata as needed),
and if the installed version is lower print a clear error mentioning both the
installed version and the required minimum and exit; additionally, when invoking
execSync("zig build -Doptimize=ReleaseSafe", { cwd: vmmDir, ... }), ensure
stderr is not suppressed (either forward stdio or capture and log stderr on
failure) so build errors are visible to the user.

In `@src/sandbox/sandbox.ts`:
- Around line 111-122: The restoreFromDir flow can leave a stale runDir or a
live child process if any step fails before the Sandbox constructor adds the
instance to activeSandboxes; wrap the entire restore sequence in a cleanup guard
(e.g., local variables for runDir and spawned child) so that on any error you
remove the runDir and, if spawn() already created a child process, kill/cleanup
that child; ensure the guard runs both for copyFile failures and for failures
after spawn() but before the new Sandbox(...) returns, and keep
activeSandboxes.add(this) only in the constructor as-is while the guard handles
pre-constructor cleanup.

In `@src/vm/api.ts`:
- Around line 31-32: Add a module-level typed FlintApiError class (exported)
that extends Error and exposes structured fields: method: string, path: string,
statusCode: number, and body: string (and set name = "FlintApiError" and
preserve stack). Replace the generic reject(new Error(...)) in the
request/response path (the branch inside the code that currently checks
res.statusCode >= 300 in the function that sends HTTP requests) with reject(new
FlintApiError(method, path, res.statusCode, data)) so callers can
instanceof-check and access structured fields instead of parsing the message.

In `@vmm/src/api.zig`:
- Around line 109-117: handleAction() currently acknowledges InstanceStart
(returns 204) before validating the boot configuration, causing clients to see
success even when neither config.kernel_path nor config.snapshot_path is set;
update handleAction() (and the InstanceStart handling branches around the
started/config checks) to validate that either config.snapshot_path != null or
config.kernel_path != null before sending a 2xx response, and if validation
fails return a 4xx error instead; mirror the same validation logic used in
serve() (the code paths around the started flag and snapshot/kernel checks) so
the action handler only acknowledges when the boot config is actually valid.
- Around line 726-739: The SendCtrlAltDel branch currently does nothing to
initiate a shutdown and merely waits on runtime.exited; update it so that until
you implement a real shutdown/inject path the handler returns a clear
not-implemented response instead of pretending to shut down: detect the
parsed.value.action_type == "SendCtrlAltDel" branch and call
respondError(request, .not_implemented, "SendCtrlAltDel not supported") (or
.bad_request as preferred) rather than spinning on runtime.exited/respondOk;
alternatively, if you wire it to a device/runtime action, invoke the appropriate
shutdown/inject function on the runtime/API and then wait on runtime.exited
before calling respondOk/respondError.
- Around line 488-499: The accept loop in servePostBoot (the while loop that
calls server.accept) can block after runtime.exited flips; make the accept
interruptible by either switching the listener to non-blocking and
polling/selecting on both the server socket and runtime.exited, or by arranging
for the listener to be closed when runtime.exited is set so accept() returns.
Locate the loop that references runtime.exited, server.accept, and
handlePostBootConnection and modify it to (1) avoid a blocking accept when
runtime.exited is true by checking/using a nonblocking accept or poll, or (2)
ensure the listener is closed from the VM-exit path so accept returns with an
error and the loop can exit; update error handling for accept to treat the
listener-closed case as loop termination rather than retry.

In `@vmm/src/devices/virtio/blk.zig`:
- Around line 125-146: Ensure used_len/device_written reflects actual bytes
written rather than assumed totals: in the T_IN path (loop using linux.pread and
variables file_offset, rc, bytes_read) accumulate the real bytes written per
descriptor (bytes_read or 0 on short/failure) into used_len/device_written and
do not add total_data_len when a pread fails partway; similarly in the T_GET_ID
handling cap the increment to at most the actual number of bytes written (the
min of id_desc.len and the device-written ID length, e.g., 20). Update the logic
around variables used_len, device_written, and any additions of
total_data_len/id_desc.len so they sum the actual bytes written (using
rc/bytes_read) and stop adding on errors/breaks; also mirror this fix at the
other affected ranges noted (around lines 175-184 and 195-197) so queue.zig
receives correct used lengths.
- Around line 154-168: The loop treating pwrite short writes as success must be
changed so each descriptor is fully written: inside the for over descs use a
loop that calls linux.pwrite(self.fd, buf.ptr + written, buf.len - written,
file_offset + written) until written == buf.len; on each pwrite check rc < 0 =>
set status = S_IOERR and break, treat rc == 0 as failure and set S_IOERR,
otherwise advance written by rc and only after the descriptor finishes advance
file_offset by the descriptor's written bytes (cast rc to u64 for arithmetic),
handling overflow the same way you currently do; reference functions/vars:
pwrite, file_offset, descs, desc.len, status, S_IOERR.

In `@vmm/src/integration_tests.zig`:
- Around line 32-43: In buildInitrd, stop embedding init_script inside a raw
single-quoted shell string passed to process.run (the current printf '...'),
because unescaped single quotes in init_script break the shell; instead either
(preferred) write the init file using Zig I/O (open a temp file in the TMPDIR
and write init_script bytes directly before running bsdcpio) or (if keeping the
shell approach) escape single quotes in init_script (replace each ' with '\'' or
'"'"' in the string used to build the shell command) before concatenation;
update the process.run invocation accordingly and keep references to init_script
and process.run in the revised buildInitrd implementation.
- Around line 200-219: Replace the brittle sleep_ms(...) calls with a polling
readiness helper that attempts to connect to the Firecracker API socket (using
connect() on sock_path) or repeatedly calls httpRequest(sock_path, "GET", "/vm",
null) until a success/expected state (e.g., HTTP success and "Running" in the
body) or a short timeout; remove the fixed sleeps in the boot/restore/start
sequences around boot_cmd/httpRequest and VM status checks (locations using
sleep_ms, httpRequest, sock_path, and the status check that searches for
"Running") and use the helper with backoff/retry and a clear timeout, freeing
responses via allocator.free on each attempt and returning an error if the
socket/API never becomes ready.
- Around line 157-172: The loop that reads from the child stdout uses blocking
linux.read(), so the timeout based on start_ts never triggers; change the I/O to
use poll or non-blocking reads: call linux.poll (or set stdout.handle to
O_NONBLOCK and use nonblocking reads) to wait for readability with a timeout
derived from start_ts (or pass a computed timeout to poll), then only call
linux.read when poll indicates data available and handle EAGAIN/EWOULDBLOCK;
update the loop around buf/total/FLINT_BOOT_OK to break on poll timeout and keep
existing logic for accumulating data and searching for "FLINT_BOOT_OK".

---

Minor comments:
In `@vmm/src/devices/serial.zig`:
- Around line 151-158: The updateIir function leaves the irq_pending latch set
when THRE is disabled, causing stale interrupts; modify updateIir (in
serial.zig) so that in the branch where (self.ier & IER_THRE) == 0 you not only
set self.iir = fifo_bits | IIR_NO_INT but also clear the pending latch
(self.irq_pending = false) so hasPendingIrq() won't report a stale IRQ; keep the
existing behavior where setting IER_THRE sets IIR_THR_EMPTY and irq_pending =
true.

In `@vmm/src/seccomp.zig`:
- Around line 171-227: The argument-validation branches for CLONE_BLK,
SOCKET_BLK, and MPROT_BLK currently hardcode SECCOMP_RET_KILL_PROCESS so audit
mode still kills; change the return statements at CLONE_BLK+3, SOCKET_BLK+2, and
MPROT_BLK+3 to return the buildFilter's default_action instead of
SECCOMP_RET_KILL_PROCESS so that when buildFilter is invoked with
SECCOMP_RET_LOG (log_filter) those argument violations are logged rather than
killing the process; keep kill_filter/log_filter and buildFilter usage as-is so
behavior is driven by the passed default_action.
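
A toy Python model of the intended control flow (the `SECCOMP_RET_*` values are the real ones from `linux/seccomp.h`, but `build_filter` and the clone check are simplified stand-ins for the BPF program):

```python
# Action values from linux/seccomp.h
SECCOMP_RET_KILL_PROCESS = 0x80000000
SECCOMP_RET_LOG = 0x7FFC0000
SECCOMP_RET_ALLOW = 0x7FFF0000

def build_filter(default_action: int):
    """Stand-in for buildFilter: argument violations fall through to
    default_action instead of a hardcoded SECCOMP_RET_KILL_PROCESS."""
    def check_clone(flags_allowed: bool) -> int:
        if flags_allowed:
            return SECCOMP_RET_ALLOW
        return default_action  # was: SECCOMP_RET_KILL_PROCESS
    return check_clone

kill_filter = build_filter(SECCOMP_RET_KILL_PROCESS)
log_filter = build_filter(SECCOMP_RET_LOG)
print(hex(kill_filter(False)), hex(log_filter(False)))
```

With this shape, audit mode (`log_filter`) logs argument violations instead of killing, while the enforcing filter is unchanged.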

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5807f445-edf3-4944-b154-178ed15f036a

📥 Commits

Reviewing files that changed from the base of the PR and between ad51c11 and f013bd8.

📒 Files selected for processing (41)
  • .gitignore
  • ARCHITECTURE.md
  • CLAUDE.md
  • README.md
  • docs/exec-plans/active/merge-flint.md
  • examples/claude-in-sandbox.ts
  • package.json
  • src/cli/hearth.ts
  • src/cli/pool.ts
  • src/cli/setup.test.ts
  • src/cli/setup.ts
  • src/cli/status.ts
  • src/network/proxy.ts
  • src/sandbox/sandbox.ts
  • src/vm/api.ts
  • src/vm/binary.ts
  • src/vm/snapshot.ts
  • src/vm/thin.ts
  • vmm/build.zig
  • vmm/build.zig.zon
  • vmm/src/api.zig
  • vmm/src/boot/loader.zig
  • vmm/src/boot/params.zig
  • vmm/src/devices/serial.zig
  • vmm/src/devices/virtio.zig
  • vmm/src/devices/virtio/blk.zig
  • vmm/src/devices/virtio/mmio.zig
  • vmm/src/devices/virtio/net.zig
  • vmm/src/devices/virtio/queue.zig
  • vmm/src/devices/virtio/vsock.zig
  • vmm/src/integration_tests.zig
  • vmm/src/jail.zig
  • vmm/src/kvm/abi.zig
  • vmm/src/kvm/system.zig
  • vmm/src/kvm/vcpu.zig
  • vmm/src/kvm/vm.zig
  • vmm/src/main.zig
  • vmm/src/memory.zig
  • vmm/src/seccomp.zig
  • vmm/src/snapshot.zig
  • vmm/src/tests.zig
💤 Files with no reviewable changes (3)
  • src/cli/status.ts
  • src/cli/pool.ts
  • src/vm/thin.ts

Comment on lines +4 to +6
const target = b.standardTargetOptions(.{
.default_target = .{ .cpu_arch = .x86_64, .os_tag = .linux },
});

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail
echo "vmm/build.zig target configuration:"
sed -n '3,8p' vmm/build.zig
echo
echo "src/cli/setup.ts Flint build invocation:"
sed -n '76,96p' src/cli/setup.ts

Repository: joshuaisaact/hearth

Length of output: 1046


🏁 Script executed:

# Check for any comments explaining the x86_64 hardcoding in build.zig
echo "=== Checking for comments around hardcoded target ==="
head -20 vmm/build.zig

echo -e "\n=== Checking if build.zig defines any test steps ==="
grep -i "test\|zig build test" vmm/build.zig || echo "(no test references found)"

echo -e "\n=== Checking for any -Dtarget usage in the entire codebase ==="
rg "\-Dtarget" --type ts --type zig --type sh --type json 2>/dev/null || echo "(no -Dtarget found)"

echo -e "\n=== Checking if there's architecture documentation in README ==="
rg "x86_64|arm64|architecture|support" README.md 2>/dev/null | head -20 || echo "(no architecture info in README)"

Repository: joshuaisaact/hearth

Length of output: 1886


🏁 Script executed:

# Verify Zig standardTargetOptions behavior with and without cpu_arch
# Check if there are any similar patterns in the repo or comments about target selection
echo "=== Checking README for build/target guidance ==="
grep -A5 -B5 "cross-compile\|aarch64\|arm64" README.md | head -30

echo -e "\n=== Checking full build.zig for any conditional target logic ==="
wc -l vmm/build.zig

Repository: joshuaisaact/hearth

Length of output: 665


🏁 Script executed:

# Read full vmm/build.zig to see complete context
echo "=== Full vmm/build.zig ===" 
cat -n vmm/build.zig

echo -e "\n=== Check agent/build.zig for multi-arch pattern ==="
head -20 agent/build.zig

Repository: joshuaisaact/hearth

Length of output: 3086


Don't hardcode the default build target to x86_64.

Lines 4–6 hardcode .cpu_arch = .x86_64, .os_tag = .linux, but src/cli/setup.ts:86 invokes zig build without a -Dtarget override. This causes the setup to install an x86_64 flint binary on arm64 Linux hosts. Additionally, the test steps defined at lines 30–41 and 46–52 inherit this hardcoded target, causing zig build test and zig build integration-test to attempt execution of cross-compiled x86_64 artifacts on non-x86_64 machines, resulting in failures.

♻️ Suggested change
     const target = b.standardTargetOptions(.{
-        .default_target = .{ .cpu_arch = .x86_64, .os_tag = .linux },
+        .default_target = .{ .os_tag = .linux },
     });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@vmm/build.zig` around lines 4 - 6, The build currently forces an x86_64 Linux
default by setting .default_target in the target = b.standardTargetOptions(...)
block; remove that hardcoding so Zig's default/CLI-provided target is used
instead (e.g., call b.standardTargetOptions(.{}) or omit the .default_target
field) and ensure the same `target` symbol (used in the test and
integration-test step definitions) remains referenced so tests and the flint
binary are built for the host/CLI-provided target rather than always x86_64.

Comment on lines +238 to +249
// Convert virtual address to physical if needed (Linux kernels
// use virtual addresses starting at 0xffffffff81000000, physical at 0x1000000)
const paddr: usize = blk: {
if (p_paddr < mem.size()) break :blk @intCast(p_paddr);
// Try using a standard kernel text offset (16MB)
const PHYS_OFFSET: u64 = 0x1000000;
if (p_paddr >= 0xffffffff80000000) {
const phys = p_paddr - 0xffffffff80000000 + PHYS_OFFSET;
if (phys < mem.size()) break :blk @intCast(phys);
}
log.err("PT_LOAD segment paddr 0x{x} out of guest memory bounds", .{p_paddr});
return error.InvalidKernel;

⚠️ Potential issue | 🔴 Critical

Use the documented high-half base when translating ELF addresses.

The comment says 0xffffffff81000000 should map to 0x1000000, but subtracting 0xffffffff80000000 maps that same address to 0x2000000. That shifts both PT_LOAD segments and the final entry RIP by 16 MiB.

Suggested fix
-            const PHYS_OFFSET: u64 = 0x1000000;
-            if (p_paddr >= 0xffffffff80000000) {
-                const phys = p_paddr - 0xffffffff80000000 + PHYS_OFFSET;
+            const PHYS_OFFSET: u64 = 0x1000000;
+            const KERNEL_VIRT_BASE: u64 = 0xffffffff81000000;
+            if (p_paddr >= KERNEL_VIRT_BASE) {
+                const phys = p_paddr - KERNEL_VIRT_BASE + PHYS_OFFSET;
                 if (phys < mem.size()) break :blk `@intCast`(phys);
             }
@@
-        const PHYS_OFFSET: u64 = 0x1000000;
-        if (entry >= 0xffffffff80000000) {
-            break :blk entry - 0xffffffff80000000 + PHYS_OFFSET;
+        const PHYS_OFFSET: u64 = 0x1000000;
+        const KERNEL_VIRT_BASE: u64 = 0xffffffff81000000;
+        if (entry >= KERNEL_VIRT_BASE) {
+            break :blk entry - KERNEL_VIRT_BASE + PHYS_OFFSET;
         }

Also applies to: 273-279

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@vmm/src/boot/loader.zig` around lines 238 - 249, The PT_LOAD
virtual-to-physical translation uses the wrong high-half base: change the
subtraction base from 0xffffffff80000000 to the documented 0xffffffff81000000 so
that addresses like 0xffffffff81000000 map to 0x01000000 (PHYS_OFFSET) instead
of being shifted by an extra 0x01000000; update the same logic wherever you
translate the kernel entry/RIP (the other block mentioned) so both places that
compute phys from p_paddr use the corrected high-half base while keeping
PHYS_OFFSET and the existing bounds checks.
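
The 16 MiB shift the comment describes can be checked with plain arithmetic (Python here only for the big-integer math; the constants come from the snippet above):

```python
PHYS_OFFSET = 0x1000000  # 16 MiB kernel text offset

p_paddr = 0xFFFFFFFF81000000  # the address the comment says should map to 0x1000000

# Current code subtracts the generic high-half base...
current = p_paddr - 0xFFFFFFFF80000000 + PHYS_OFFSET
# ...while the suggested fix subtracts the documented text base itself.
suggested = p_paddr - 0xFFFFFFFF81000000 + PHYS_OFFSET

print(hex(current), hex(suggested))  # the two differ by exactly 16 MiB
```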

Comment on lines +479 to +556
fn deliverRxData(self: *Self, mem: *Memory, queue: *Queue, conn: *Connection) !bool {
// Pop an RX descriptor from the guest
const head = (try queue.popAvail(mem)) orelse return error.NoBuffers;

var descs: [16]Queue.Desc = undefined;
const desc_count = queue.collectChain(mem, head, &descs) catch |err| {
queue.pushUsed(mem, head, 0) catch |e| log.warn("pushUsed failed: {}", .{e});
return err;
};

if (desc_count == 0 or descs[0].len < HDR_SIZE) {
queue.pushUsed(mem, head, 0) catch |e| log.warn("pushUsed failed: {}", .{e});
return false;
}

// Calculate available data buffer space
var data_space: u32 = 0;
for (descs[1..desc_count]) |desc| {
data_space += desc.len;
}

// Also respect flow control
const tx_window = conn.availableForTx();
if (tx_window == 0 and data_space > 0) {
// No TX window — push descriptor back unused
queue.pushUsed(mem, head, 0) catch |e| log.warn("pushUsed failed: {}", .{e});
return false;
}
const max_read = @min(data_space, tx_window);

// Try to read from the host socket into data descriptors using readv
var total_read: u32 = 0;
const data_descs = descs[1..desc_count];
if (max_read > 0 and data_descs.len > 0) {
var iov: [16]std.posix.iovec = undefined;
var iov_count: usize = 0;
var remaining = max_read;
for (data_descs) |desc| {
if (remaining == 0) break;
const read_len = @min(desc.len, remaining);
const buf = mem.slice(@intCast(desc.addr), read_len) catch break;
iov[iov_count] = .{ .base = buf.ptr, .len = buf.len };
iov_count += 1;
remaining -= read_len;
}

if (iov_count == 0) {
queue.pushUsed(mem, head, 0) catch |e| log.warn("RX pushUsed failed: {}", .{e});
return false;
}

const rc: isize = @bitCast(linux.readv(conn.fd, @ptrCast(&iov), @intCast(iov_count)));
if (rc < 0) {
const errno: linux.E = @enumFromInt(@as(u16, @intCast(-rc)));
if (errno == .AGAIN) {
// No data available
queue.pushUsed(mem, head, 0) catch |e| log.warn("RX pushUsed failed: {}", .{e});
return false;
}
// Socket error — close connection, send RST to guest
log.warn("read from host socket failed: {}", .{errno});
self.sendRstToGuest(mem, queue, head, &descs, conn);
self.closeConnectionPtr(conn);
return true;
}
if (rc == 0) {
// EOF — host closed connection, send RST to guest
self.sendRstToGuest(mem, queue, head, &descs, conn);
self.closeConnectionPtr(conn);
return true;
}
total_read = @intCast(rc);
}

if (total_read == 0 and data_space > 0) {
// No data to deliver
queue.pushUsed(mem, head, 0) catch |e| log.warn("pushUsed failed: {}", .{e});
return false;

⚠️ Potential issue | 🔴 Critical

Avoid consuming RX descriptors when no packet can be produced.

deliverRxData() pops an RX head before it knows whether the socket is readable. Both the tx_window == 0 path and the readv(...)=EAGAIN path complete that descriptor with len 0 and return false, so polling an idle connection drains the guest RX ring without delivering data or raising an interrupt. Please gate popAvail() on readiness/window or keep the head until a header/data packet is actually written.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@vmm/src/devices/virtio/vsock.zig` around lines 479 - 556, deliverRxData
currently pops an RX head (queue.popAvail) before knowing if there's TX window
or any readable socket data, which lets polling drain the guest RX ring; change
the logic in deliverRxData to first check conn.availableForTx() and confirm the
host socket is readable (e.g., via non-blocking poll or a recv/peek) and only
then call queue.popAvail to consume descriptors, or if you must pop early ensure
you restore the head (not queue.pushUsed with len 0) when you hit tx_window==0
or EAGAIN; update the branches around conn.availableForTx(), linux.readv
handling of EAGAIN, and queue.pushUsed usage so descriptors are not consumed
unless header+data will actually be written (refer to deliverRxData,
queue.popAvail, conn.availableForTx, linux.readv, and queue.pushUsed).
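
The gating idea in miniature (Python sockets stand in for the vsock connection; `rx_ready` is an illustrative helper, not Flint code):

```python
import select
import socket

def rx_ready(sock: socket.socket) -> bool:
    """Only consume an RX descriptor once the host socket is actually
    readable; a zero-timeout select() is the poll-style readiness check."""
    readable, _, _ = select.select([sock], [], [], 0)
    return bool(readable)

host_side, guest_side = socket.socketpair()

idle = rx_ready(host_side)     # False: leave the guest RX ring untouched
guest_side.sendall(b"payload")
pending = rx_ready(host_side)  # True: now popAvail() + readv() can proceed
print(idle, pending)
```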

Comment on lines +65 to +75
try check(linux.mkdir("dev", 0o755), "mkdir(/dev)");
try check(linux.mknod("dev/kvm", S_IFCHR | 0o660, DEV_KVM), "mknod(/dev/kvm)");

if (config.need_tun) {
try check(linux.mkdir("dev/net", 0o755), "mkdir(/dev/net)");
try check(linux.mknod("dev/net/tun", S_IFCHR | 0o660, DEV_NET_TUN), "mknod(/dev/net/tun)");
}

// Drop privileges — last step requiring root
try check(linux.setgid(config.gid), "setgid");
try check(linux.setuid(config.uid), "setuid");

⚠️ Potential issue | 🔴 Critical

Fix device ownership and supplementary groups before dropping root.

/dev/kvm and /dev/net/tun are created as 0660 root:root, then the process switches to the non-zero UID/GID required by vmm/src/main.zig, Lines 146-226. After that, the jailed VMM can no longer open those devices unless it happens to retain a privileged group, and setuid/setgid here also leave any inherited supplementary groups intact. Chown the device nodes (or otherwise grant the target UID/GID access) and clear supplementary groups before the UID/GID drop.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@vmm/src/jail.zig` around lines 65 - 75, The device nodes are created as
root:root and supplementary groups remain after setuid/setgid; before calling
linux.setgid(config.gid) and linux.setuid(config.uid) you must change ownership
of created nodes and clear supplementary groups so the target user can access
them. After creating dev/kvm and (if config.need_tun) dev/net/tun, call
chown/chownat (or the equivalent syscall wrapper) to set owner to config.uid and
group to config.gid for linux.mknod-created files, and call the function to
clear supplementary groups (setgroups(0, NULL) or the Zig wrapper) before
invoking linux.setgid and linux.setuid; ensure these operations occur after
linux.mknod and before linux.setgid/linux.setuid so the jailed VMM can open the
devices.
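
The required ordering, sketched with injected syscalls so it can run without root (`drop_privileges` and `Recorder` are hypothetical names; only the call order reflects the fix):

```python
def drop_privileges(ops, uid: int, gid: int, device_nodes: list[str]) -> None:
    # Still root here: hand the freshly mknod'ed 0660 nodes to the target user...
    for node in device_nodes:
        ops.chown(node, uid, gid)
    # ...and shed inherited supplementary groups before any ID changes.
    ops.setgroups([])
    ops.setgid(gid)  # gid first: setgid would fail after setuid drops root
    ops.setuid(uid)  # point of no return

class Recorder:
    """Records call order instead of invoking os.chown/os.setuid for real."""
    def __init__(self) -> None:
        self.calls = []
    def chown(self, node, uid, gid): self.calls.append(("chown", node))
    def setgroups(self, groups): self.calls.append(("setgroups", tuple(groups)))
    def setgid(self, gid): self.calls.append(("setgid", gid))
    def setuid(self, uid): self.calls.append(("setuid", uid))

rec = Recorder()
drop_privileges(rec, uid=1000, gid=1000, device_nodes=["dev/kvm", "dev/net/tun"])
print(rec.calls)
```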

Three fixes for the ELF kernel boot path:

1. Set initial MSRs (IA32_MISC_ENABLE, IA32_APICBASE, TSC) before
   vCPU starts. Without these, the kernel hangs after LSM init because
   perf_event_init and APIC setup fail silently.

2. Return 0xFF for unhandled IO port reads. Without this, the kernel's
   8250 serial driver detects phantom UARTs at COM2/COM3/COM4 and spins
   trying to initialize them.

3. Switch to 5.10.x kernel from Firecracker CI. The 6.1.x kernel
   dropped CONFIG_VIRTIO_MMIO_CMDLINE_DEVICES which Flint needs for
   virtio device discovery via kernel cmdline.

Also: add boot_params fields (boot_flag, cmdline_size, kernel_alignment,
init_size, xloadflags) matching what Firecracker sets for ELF kernels.
Use sequential API calls in snapshot creation (Promise.all caused race).
Add waitForFile for vsock listener before VM spawn.

Fresh boot works e2e: setup → base snapshot creation succeeds.
Snapshot restore still needs debugging (guest doesn't resume).
@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

♻️ Duplicate comments (2)
src/cli/setup.ts (1)

120-126: ⚠️ Potential issue | 🟠 Major

Setup still downloads ELF vmlinux which stalls during kernel init.

The PR objectives explicitly state "ELF vmlinux boot stalls during kernel init" and "bzImage kernels boot correctly." Yet setup fetches vmlinux-5.10 (ELF format). This will cause the default hearth setup flow to produce a non-functional base snapshot.

Switch to fetching a bzImage kernel until the ELF loader is fixed, or add a clear warning that ELF loading is experimental.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cli/setup.ts` around lines 120 - 126, The setup currently downloads an
ELF vmlinux (see listUrl and the fetchText import in src/cli/setup.ts) which
stalls during kernel init; change the download target to the corresponding
bzImage build (update the prefix/path in listUrl to point to bzImage for the
5.10 series) or, if you prefer a temporary opt-in, add an explicit warning
message before download (using console.warn) that ELF vmlinux is experimental
and may stall and provide a flag to force ELF. Update references to listUrl and
any downstream variable names so the code fetches and uses the bzImage artifact
(or emits the warning and respects the new flag) instead of vmlinux-5.10.
vmm/src/boot/loader.zig (1)

238-246: ⚠️ Potential issue | 🔴 Critical

Use 0xffffffff81000000 as the ELF high-half translation base.

This is still subtracting 0xffffffff80000000, so 0xffffffff81000000 maps to 0x02000000 instead of 0x01000000. That shifts both the loaded PT_LOAD segments and the final RIP by an extra 16 MiB.

🧭 Suggested fix
             const PHYS_OFFSET: u64 = 0x1000000;
-            if (p_paddr >= 0xffffffff80000000) {
-                const phys = p_paddr - 0xffffffff80000000 + PHYS_OFFSET;
+            const KERNEL_VIRT_BASE: u64 = 0xffffffff81000000;
+            if (p_paddr >= KERNEL_VIRT_BASE) {
+                const phys = p_paddr - KERNEL_VIRT_BASE + PHYS_OFFSET;
                 if (phys < mem.size()) break :blk `@intCast`(phys);
             }
@@
         if (entry < mem.size()) break :blk entry;
         const PHYS_OFFSET: u64 = 0x1000000;
-        if (entry >= 0xffffffff80000000) {
-            break :blk entry - 0xffffffff80000000 + PHYS_OFFSET;
+        const KERNEL_VIRT_BASE: u64 = 0xffffffff81000000;
+        if (entry >= KERNEL_VIRT_BASE) {
+            break :blk entry - KERNEL_VIRT_BASE + PHYS_OFFSET;
         }

Also applies to: 273-279

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@vmm/src/boot/loader.zig` around lines 238 - 246, The virtual-to-physical
translation in the paddr calculation uses the wrong high-half base constant:
change the subtraction constant 0xffffffff80000000 to 0xffffffff81000000 so that
the computed phys = p_paddr - 0xffffffff81000000 + PHYS_OFFSET yields the
intended 0x01000000 offset; update the same constant in the later identical
block (the second occurrence around the other mapping) so both PT_LOAD segment
and final RIP mapping are shifted correctly; verify this in the code around the
paddr calculation (symbols: p_paddr, paddr, PHYS_OFFSET, mem).
🧹 Nitpick comments (3)
src/cli/setup.ts (3)

76-98: setupFlint() swallows build errors without diagnostic output.

When zig build fails, execSync throws but stdio: "pipe" suppresses the actual compiler errors. The user sees only "Setup failed" with no indication of what went wrong.

Consider capturing and logging stderr on failure:

Proposed fix
   const vmmDir = findVmmDir();
   console.log("  flint: building with Zig...");
-  execSync("zig build -Doptimize=ReleaseSafe", { cwd: vmmDir, stdio: "pipe" });
+  try {
+    execSync("zig build -Doptimize=ReleaseSafe", { cwd: vmmDir, stdio: "pipe" });
+  } catch (err) {
+    const stderr = err instanceof Error && "stderr" in err ? (err as { stderr: Buffer }).stderr?.toString() : "";
+    console.error("ERROR: Flint build failed.");
+    if (stderr) console.error(stderr);
+    process.exit(1);
+  }
   copyFileSync(join(vmmDir, "zig-out", "bin", "flint"), flintPath);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cli/setup.ts` around lines 76 - 98, The setupFlint() function currently
calls execSync("zig build -Doptimize=ReleaseSafe", { cwd: vmmDir, stdio: "pipe"
}) and if that throws the build stderr is hidden; update setupFlint to wrap the
zig build execSync in a try/catch, capture the thrown error (from execSync) and
log the error's stdout/stderr or message via console.error before exiting, so
users see the Zig compiler output; keep existing behavior of copying and chmod
on success and only perform those after a successful build.

125-126: Dynamic import of fetchText is unnecessary.

fetchText is already exported from ./download.js which is statically imported on line 17 for download. The dynamic import adds complexity without benefit.

Proposed fix
-import { download } from "./download.js";
+import { download, fetchText } from "./download.js";

Then remove the dynamic import:

   console.log("  kernel: finding latest from Firecracker CI...");
-  const { fetchText } = await import("./download.js");
   const listUrl = `http://spec.ccfc.min.s3.amazonaws.com/?prefix=firecracker-ci/v1.15/${architecture}/vmlinux-5.10&list-type=2`;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cli/setup.ts` around lines 125 - 126, The dynamic import of fetchText is
unnecessary; remove the dynamic import statement that destructures fetchText and
instead use the already statically imported export from the existing import
named download (i.e., reference the fetchText export directly where used).
Update any code that referenced the dynamically-imported fetchText to use the
static import, and delete the now-unused dynamic import line to simplify
setup.ts.

100-109: findVmmDir() throws a generic Error instead of a typed error.

Per coding guidelines, each module should define its own typed error types instead of using generic errors. This pattern also appears in findAgentDir().

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cli/setup.ts` around lines 100 - 109, findVmmDir() currently throws a
generic Error; change this to throw a module-specific typed error (e.g., create
and export a VmmNotFoundError class) and update findAgentDir() to use a
corresponding AgentNotFoundError if not already typed. Specifically, define a
subclass of Error (named VmmNotFoundError) in the same module, update
findVmmDir() to throw new VmmNotFoundError("Could not find vmm/ directory. Run
from the hearth repo root."), and ensure any callers or tests that expect the
generic Error are adjusted to handle the new typed error (or exported type) so
type checks and error handling remain correct.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/cli/setup.ts`:
- Line 126: The S3 listing URL stored in the variable listUrl uses "http" and
should be switched to "https" to prevent MITM manipulation; update the template
string assigned to listUrl in src/cli/setup.ts (the declaration of listUrl) to
use "https://spec.ccfc.min.s3.amazonaws.com/..." and ensure any other code that
constructs or uses this S3 listing URL follows the same HTTPS scheme.

In `@src/sandbox/sandbox.ts`:
- Around line 131-134: The code races because waitForFile(`${vsockPath}_1024`,
...) is invoked before the server/listener started by
agent.waitForConnection(...) has been awaited; either await the promise returned
by agent.waitForConnection (e.g. capture agentConnected =
agent.waitForConnection(...) and await it before calling waitForFile) so the
listener is guaranteed to be active before polling the socket file, or
alternatively change waitForConnection() so it only resolves after the socket
file exists; update references to waitForConnection, agentConnected, and
waitForFile accordingly.

In `@vmm/src/boot/loader.zig`:
- Around line 164-188: loadInitrd currently only protects the fixed range
[params.KERNEL_ADDR, params.KERNEL_ADDR + kernel_size) which is fine for bzImage
but can overlap ELF PT_LOAD segments that are relocated around 0x01000000;
update loadInitrd to reserve the actual ELF load window instead of using
params.KERNEL_ADDR+kernel_size: compute (or accept as parameters) the ELF
PT_LOAD minimum and maximum virtual addresses (the ELF load window) and use that
range when checking and computing initrd_addr (replace uses of
params.KERNEL_ADDR and kernel_size in the overlap check and size calculations
with elf_load_min/elf_load_max), and apply the same change for the other
occurrence referenced in lines 315-317 so the initrd placement never collides
with any ELF PT_LOAD segments.
- Around line 252-269: The ELF segment handling currently writes file-backed
bytes and skips zeroing when paddr + p_memsz exceeds mem.size(), leaving a
partial load and kernel_end beyond guest RAM; update the logic to compute
seg_end = paddr + `@as`(usize, `@intCast`(p_memsz)) and if seg_end > mem.size()
return error.InvalidKernel before updating kernel_end or returning success so
that any segment that extends past guest RAM causes the loader to fail (adjust
the flow around the existing p_filesz write, the p_memsz zeroing branch, and the
kernel_end assignment to enforce this check using paddr, p_memsz, mem.size(),
and kernel_end).
- Around line 215-228: The loop currently slices program header bytes into ph
before validating phentsize, which can bounds-trap; add a guard before creating
ph that (1) validates phentsize is at least the minimum required size (use 56 as
suggested) and (2) ensures ph_start + phentsize <= kernel_data.len, returning
error.InvalidKernel if either check fails; update the code around phentsize,
ph_start, and the creation of ph (the variables phentsize, ph_start,
kernel_data, ph in the loop) so you reject undersized or out-of-bounds program
headers before slicing.

---

Duplicate comments:
In `@src/cli/setup.ts`:
- Around line 120-126: The setup currently downloads an ELF vmlinux (see listUrl
and the fetchText import in src/cli/setup.ts) which stalls during kernel init;
change the download target to the corresponding bzImage build (update the
prefix/path in listUrl to point to bzImage for the 5.10 series) or, if you
prefer a temporary opt-in, add an explicit warning message before download
(using console.warn) that ELF vmlinux is experimental and may stall and provide
a flag to force ELF. Update references to listUrl and any downstream variable
names so the code fetches and uses the bzImage artifact (or emits the warning
and respects the new flag) instead of vmlinux-5.10.

In `@vmm/src/boot/loader.zig`:
- Around line 238-246: The virtual-to-physical translation in the paddr
calculation uses the wrong high-half base constant: change the subtraction
constant 0xffffffff80000000 to 0xffffffff81000000 so that the computed phys =
p_paddr - 0xffffffff81000000 + PHYS_OFFSET yields the intended 0x01000000
offset; update the same constant in the later identical block (the second
occurrence around the other mapping) so both PT_LOAD segment and final RIP
mapping are shifted correctly; verify this in the code around the paddr
calculation (symbols: p_paddr, paddr, PHYS_OFFSET, mem).

---

Nitpick comments:
In `@src/cli/setup.ts`:
- Around line 76-98: The setupFlint() function currently calls execSync("zig
build -Doptimize=ReleaseSafe", { cwd: vmmDir, stdio: "pipe" }) and if that
throws the build stderr is hidden; update setupFlint to wrap the zig build
execSync in a try/catch, capture the thrown error (from execSync) and log the
error's stdout/stderr or message via console.error before exiting, so users see
the Zig compiler output; keep existing behavior of copying and chmod on success
and only perform those after a successful build.
- Around line 125-126: The dynamic import of fetchText is unnecessary; remove
the dynamic import statement that destructures fetchText and instead use the
already statically imported export from the existing import named download
(i.e., reference the fetchText export directly where used). Update any code that
referenced the dynamically-imported fetchText to use the static import, and
delete the now-unused dynamic import line to simplify setup.ts.
- Around line 100-109: findVmmDir() currently throws a generic Error; change
this to throw a module-specific typed error (e.g., create and export a
VmmNotFoundError class) and update findAgentDir() to use a corresponding
AgentNotFoundError if not already typed. Specifically, define a subclass of
Error (named VmmNotFoundError) in the same module, update findVmmDir() to throw
new VmmNotFoundError("Could not find vmm/ directory. Run from the hearth repo
root."), and ensure any callers or tests that expect the generic Error are
adjusted to handle the new typed error (or exported type) so type checks and
error handling remain correct.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e105c085-4a07-451e-8b5e-ce35c5b7557f

📥 Commits

Reviewing files that changed from the base of the PR and between f013bd8 and 57f3374.

📒 Files selected for processing (6)
  • src/cli/setup.ts
  • src/sandbox/sandbox.ts
  • src/vm/snapshot.ts
  • vmm/src/boot/loader.zig
  • vmm/src/boot/params.zig
  • vmm/src/main.zig
✅ Files skipped from review due to trivial changes (2)
  • vmm/src/boot/params.zig
  • vmm/src/main.zig
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/vm/snapshot.ts

Comment on lines +164 to +188
fn loadInitrd(bp: []u8, mem: *Memory, path: [*:0]const u8, kernel_size: usize) !void {
const initrd_data = try readFile(path);
defer std.heap.page_allocator.free(initrd_data);

if (initrd_data.len == 0) return error.EmptyInitrd;

// Place initrd at end of RAM, page-aligned down.
// Respect initrd_addr_max from the kernel header.
const initrd_max = std.mem.readInt(u32, bp[params.OFF_INITRD_ADDR_MAX..][0..4], .little);
const mem_top = mem.size();
const top = if (initrd_max > 0 and initrd_max < mem_top) initrd_max else mem_top;

if (initrd_data.len > top) {
log.err("initrd ({} bytes) larger than available memory ({})", .{ initrd_data.len, top });
return error.InitrdTooLarge;
}

const initrd_addr = (top - initrd_data.len) & ~@as(usize, 0xFFF); // page-align down

if (initrd_addr < params.KERNEL_ADDR + kernel_size) {
log.err("initrd too large: needs 0x{x} but only 0x{x} available", .{
initrd_data.len, top - (params.KERNEL_ADDR + kernel_size),
});
return error.InitrdTooLarge;
}

⚠️ Potential issue | 🔴 Critical

Reserve the actual ELF load window when placing the initrd.

loadInitrd only protects [params.KERNEL_ADDR, params.KERNEL_ADDR + kernel_size). That is correct for bzImage, but not for the ELF path here, where PT_LOAD segments are translated around 0x01000000. With a large initrd and modest guest RAM, this can place the initrd on top of an ELF segment. params.KERNEL_ADDR is fixed at 0x100000 in vmm/src/boot/params.zig:79-95, so this helper needs the ELF segment min/max addresses, not just a size.

Also applies to: 315-317

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@vmm/src/boot/loader.zig` around lines 164 - 188, loadInitrd currently only
protects the fixed range [params.KERNEL_ADDR, params.KERNEL_ADDR + kernel_size)
which is fine for bzImage but can overlap ELF PT_LOAD segments that are
relocated around 0x01000000; update loadInitrd to reserve the actual ELF load
window instead of using params.KERNEL_ADDR+kernel_size: compute (or accept as
parameters) the ELF PT_LOAD minimum and maximum virtual addresses (the ELF load
window) and use that range when checking and computing initrd_addr (replace uses
of params.KERNEL_ADDR and kernel_size in the overlap check and size calculations
with elf_load_min/elf_load_max), and apply the same change for the other
occurrence referenced in lines 315-317 so the initrd placement never collides
with any ELF PT_LOAD segments.
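The fix the comment describes can be sketched outside the loader. This is a minimal stand-alone version of the placement rule, assuming illustrative names (`place_initrd`, `elf_lo`/`elf_hi` are not the real loader API): place the initrd as high as possible below `top`, page-aligned down, and reject any placement that overlaps the actual ELF load window rather than a fixed kernel range.

```c
#include <stdint.h>

/* Sketch of the suggested fix: reserve the real ELF load window
 * [elf_lo, elf_hi) when placing the initrd. Returns the load address,
 * or 0 if the initrd cannot be placed. Names are illustrative. */
uint64_t place_initrd(uint64_t mem_top, uint32_t initrd_addr_max,
                      uint64_t elf_lo, uint64_t elf_hi,
                      uint64_t initrd_len)
{
    /* respect initrd_addr_max from the kernel header, as above */
    uint64_t top = (initrd_addr_max > 0 && initrd_addr_max < mem_top)
                       ? initrd_addr_max : mem_top;
    if (initrd_len == 0 || initrd_len > top)
        return 0; /* doesn't fit below top at all */

    uint64_t addr = (top - initrd_len) & ~(uint64_t)0xFFF; /* page-align down */

    /* reject overlap with any ELF PT_LOAD segment, not just
     * [KERNEL_ADDR, KERNEL_ADDR + kernel_size) */
    if (addr < elf_hi && addr + initrd_len > elf_lo)
        return 0;
    return addr;
}
```

With 128 MiB of RAM and an ELF window ending at 16 MiB + 16 MiB, a 4 MiB initrd lands near the top of RAM; shrink RAM so the aligned-down address falls inside the window and the same call fails instead of silently clobbering a segment.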

Comment on lines +215 to +228
const entry = std.mem.readInt(u64, kernel_data[24..32], .little);
const phoff = std.mem.readInt(u64, kernel_data[32..40], .little);
const phentsize = std.mem.readInt(u16, kernel_data[54..56], .little);
const phnum = std.mem.readInt(u16, kernel_data[56..58], .little);

log.info("ELF kernel: entry=0x{x}, {} program headers", .{ entry, phnum });

// Load all PT_LOAD segments into guest memory
var kernel_end: u64 = 0;
var i: u16 = 0;
while (i < phnum) : (i += 1) {
const ph_start = phoff + @as(u64, i) * phentsize;
if (ph_start + phentsize > kernel_data.len) return error.InvalidKernel;
const ph = kernel_data[ph_start..][0..phentsize];

⚠️ Potential issue | 🟠 Major

Reject undersized ELF program headers before slicing them.

phentsize comes from the file, but the loop immediately reads up to ph[40..48]. If e_phentsize < 56, this will bounds-trap instead of returning error.InvalidKernel.

🛡️ Suggested guard
     const entry = std.mem.readInt(u64, kernel_data[24..32], .little);
     const phoff = std.mem.readInt(u64, kernel_data[32..40], .little);
     const phentsize = std.mem.readInt(u16, kernel_data[54..56], .little);
     const phnum = std.mem.readInt(u16, kernel_data[56..58], .little);
+    const file_len: u64 = @intCast(kernel_data.len);
+
+    if (phentsize < 56) {
+        log.err("unexpected ELF program header size: {}", .{phentsize});
+        return error.InvalidKernel;
+    }
+    const ph_table_size = try std.math.mul(u64, @as(u64, phentsize), @as(u64, phnum));
+    if (phoff > file_len or ph_table_size > file_len - phoff) {
+        return error.InvalidKernel;
+    }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@vmm/src/boot/loader.zig` around lines 215 - 228, The loop currently slices
program header bytes into ph before validating phentsize, which can bounds-trap;
add a guard before creating ph that (1) validates phentsize is at least the
minimum required size (use 56 as suggested) and (2) ensures ph_start + phentsize
<= kernel_data.len, returning error.InvalidKernel if either check fails; update
the code around phentsize, ph_start, and the creation of ph (the variables
phentsize, ph_start, kernel_data, ph in the loop) so you reject undersized or
out-of-bounds program headers before slicing.
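The guard reduces to pure arithmetic, so it can be shown in isolation. A stand-alone C sketch of the same checks (function name is illustrative): reject undersized `e_phentsize` and any program-header table that does not fit in the file, written so no intermediate can overflow.

```c
#include <stdint.h>

/* Returns 1 when the ELF64 program-header table is sane, 0 otherwise.
 * sizeof(Elf64_Phdr) is 56 bytes, and the product of two 16-bit
 * fields cannot overflow a u64. */
int ph_table_valid(uint64_t file_len, uint64_t phoff,
                   uint16_t phentsize, uint16_t phnum)
{
    if (phentsize < 56)
        return 0; /* undersized header entries: later reads would trap */
    uint64_t table_size = (uint64_t)phentsize * (uint64_t)phnum;
    if (phoff > file_len || table_size > file_len - phoff)
        return 0; /* table extends past end of file */
    return 1;
}
```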

Comment on lines +252 to +269
if (p_filesz > 0) {
const end = p_offset + p_filesz;
if (end > kernel_data.len) return error.InvalidKernel;
try mem.write(paddr, kernel_data[@intCast(p_offset)..@intCast(end)]);
}

// Zero BSS (memsz > filesz)
if (p_memsz > p_filesz) {
const bss_start = paddr + @as(usize, @intCast(p_filesz));
const bss_size: usize = @intCast(p_memsz - p_filesz);
if (bss_start + bss_size <= mem.size()) {
const bss = try mem.slice(bss_start, bss_size);
@memset(bss, 0);
}
}

const seg_end = paddr + @as(usize, @intCast(p_memsz));
if (seg_end > kernel_end) kernel_end = seg_end;

⚠️ Potential issue | 🟠 Major

Fail the load when an ELF segment extends past guest RAM.

If paddr + p_memsz exceeds mem.size(), the file-backed bytes may still be written and the @memset branch is just skipped. That leaves a partially loaded kernel and a kernel_end that points past real guest memory. Return error.InvalidKernel instead of silently continuing.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@vmm/src/boot/loader.zig` around lines 252 - 269, The ELF segment handling
currently writes file-backed bytes and skips zeroing when paddr + p_memsz
exceeds mem.size(), leaving a partial load and kernel_end beyond guest RAM;
update the logic to compute seg_end = paddr + `@as`(usize, `@intCast`(p_memsz)) and
if seg_end > mem.size() return error.InvalidKernel before updating kernel_end or
returning success so that any segment that extends past guest RAM causes the
loader to fail (adjust the flow around the existing p_filesz write, the p_memsz
zeroing branch, and the kernel_end assignment to enforce this check using paddr,
p_memsz, mem.size(), and kernel_end).
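The check the comment asks for is a single bounds predicate. A hedged sketch in C (name is illustrative): a segment fits only if `paddr + p_memsz` stays within guest RAM, phrased to avoid wrap-around instead of computing the sum directly.

```c
#include <stdint.h>

/* Returns 1 when [paddr, paddr + memsz) lies within guest RAM of
 * ram_size bytes. Written without paddr + memsz so a huge memsz
 * cannot wrap and pass the check. */
int segment_fits(uint64_t paddr, uint64_t memsz, uint64_t ram_size)
{
    if (memsz > ram_size)
        return 0;
    return paddr <= ram_size - memsz;
}
```

In the loader this would run before the `p_filesz` write, so a segment past guest RAM fails the whole load rather than leaving a partial image.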

- Register memory with KVM BEFORE restoring vCPU state (LAPIC/MSR
  state may reference guest memory addresses)
- Restore vCPU state BEFORE VM state (PIT/clock/irqchip) so
  irqchip-injected interrupts aren't overwritten
- Read file data in file order, apply in Firecracker's order
- Add KVM_KVMCLOCK_CTRL call after vCPU restore (prevents soft
  lockup watchdog false positives)
- Add KVM paravirt MSRs to snapshot save/restore list (kvm-clock,
  async PF, steal time, PV EOI)
- Add KVM_PIT_SPEAKER_DUMMY flag on PIT restore

Snapshot restore still hangs on kernel 6.19 with APICv=Y — known
KVM bug loses LAPIC timer interrupt during KVM_SET_LAPIC. Needs
either kernel patch or APICv disabled.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
vmm/src/snapshot.zig (2)

317-333: Consider handling EINTR for robustness.

The I/O helpers treat any rc <= 0 as an error, but read/write can return -EINTR if interrupted by a signal. In a controlled VMM environment this is unlikely, but could cause spurious snapshot failures if signals are delivered during I/O.

♻️ Example EINTR handling for writeAll
 fn writeAll(fd: i32, data: []const u8) !void {
     var written: usize = 0;
     while (written < data.len) {
         const rc: isize = @bitCast(linux.write(fd, data[written..].ptr, data.len - written));
-        if (rc <= 0) return error.SnapshotWriteFailed;
+        if (rc < 0) {
+            if (@as(linux.E, @enumFromInt(-rc)) == .INTR) continue;
+            return error.SnapshotWriteFailed;
+        }
+        if (rc == 0) return error.SnapshotWriteFailed;
         written += @intCast(rc);
     }
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@vmm/src/snapshot.zig` around lines 317 - 333, The helpers writeAll and
readExact should treat interrupted syscalls specially: when the syscall returns
-linux.EINTR (i.e., rc < 0 and rc == -linux.EINTR) simply continue the loop
instead of returning error.SnapshotWriteFailed/SnapshotReadFailed; keep the
existing behavior for other negative/zero returns and still advance the counters
on positive rc. Update both functions to detect EINTR and retry the read/write
rather than failing immediately.
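The same retry pattern, shown with the POSIX API for readers more familiar with C (`write_all` is an illustrative name): retry on `EINTR` and on short writes, and fail on any other error or on a zero-byte write.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Write all of buf to fd. Returns 0 on success, -1 on failure. */
int write_all(int fd, const char *buf, size_t len)
{
    size_t done = 0;
    while (done < len) {
        ssize_t rc = write(fd, buf + done, len - done);
        if (rc < 0) {
            if (errno == EINTR)
                continue; /* interrupted by a signal: just retry */
            return -1;    /* real I/O error */
        }
        if (rc == 0)
            return -1;    /* no forward progress */
        done += (size_t)rc;
    }
    return 0;
}
```

The key detail is distinguishing `rc < 0` with `EINTR` (retry) from `rc == 0` (no progress, fail), which the original `rc <= 0` test conflated.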

186-187: Duplicate and potentially misleading comment.

The comment "Restore vCPU state FIRST" appears here (line 186) and again at line 214. Line 186 is about reading state from the file, while lines 214-224 are about applying state to the vCPU. Consider clarifying the distinction or removing the duplicate.

✏️ Suggested clarification
-    // --- Restore vCPU state FIRST (matches Firecracker's order) ---
+    // --- Read vCPU state from file (in save order) ---
 
     var mp_state: c.kvm_mp_state = undefined;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@vmm/src/snapshot.zig` around lines 186 - 187, The comment "Restore vCPU state
FIRST" is duplicated and ambiguous; update the earlier occurrence (the comment
adjacent to the snapshot read logic) to explicitly say it's reading/parsing vCPU
state from the snapshot (e.g., "Read vCPU state from snapshot — do not apply
yet") and keep the later comment that precedes the vCPU apply logic as the
authoritative "Restore vCPU state FIRST (apply to vCPU)" so the distinction
between reading and applying is clear and not misleading.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@vmm/src/snapshot.zig`:
- Around line 270-282: The restore path reads dev_type from the snapshot but
never validates it against the live backend; before calling
dev.snapshotRestore(dev_buf[0..dev_len]) you must check that the saved dev_type
matches the provided device's type (e.g., call or add a device method like
getType() on the device at devices[i] or compare against a known identity field
on the device struct) and if they differ log an error/warn and skip/abort
restore for that slot rather than applying state. Update the code around
devices[i] to perform this validation (use the existing dev_type variable and
the device instance at devices[i], and avoid relying on
VirtioMmio.snapshotRestore to validate identity) and ensure the code returns or
continues appropriately when a mismatch is detected.

---

Nitpick comments:
In `@vmm/src/snapshot.zig`:
- Around line 317-333: The helpers writeAll and readExact should treat
interrupted syscalls specially: when the syscall returns -linux.EINTR (i.e., rc
< 0 and rc == -linux.EINTR) simply continue the loop instead of returning
error.SnapshotWriteFailed/SnapshotReadFailed; keep the existing behavior for
other negative/zero returns and still advance the counters on positive rc.
Update both functions to detect EINTR and retry the read/write rather than
failing immediately.
- Around line 186-187: The comment "Restore vCPU state FIRST" is duplicated and
ambiguous; update the earlier occurrence (the comment adjacent to the snapshot
read logic) to explicitly say it's reading/parsing vCPU state from the snapshot
(e.g., "Read vCPU state from snapshot — do not apply yet") and keep the later
comment that precedes the vCPU apply logic as the authoritative "Restore vCPU
state FIRST (apply to vCPU)" so the distinction between reading and applying is
clear and not misleading.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 311c12f7-98e7-4288-8360-e008d5051342

📥 Commits

Reviewing files that changed from the base of the PR and between 57f3374 and 0df9a58.

📒 Files selected for processing (3)
  • vmm/src/kvm/vcpu.zig
  • vmm/src/main.zig
  • vmm/src/snapshot.zig
✅ Files skipped from review due to trivial changes (1)
  • vmm/src/main.zig
🚧 Files skipped from review as they are similar to previous changes (1)
  • vmm/src/kvm/vcpu.zig

The snapshot was missing KVM_SET_XSAVE/KVM_GET_XSAVE — the full
FPU/SSE/AVX register state. Without this, the vCPU had zeroed FPU
state after restore, causing the guest to stall on the first FPU
instruction. This unblocked MMIO exits (guest now produces 3 exits
after restore vs 0 before).

Snapshot restore still has a timer interrupt delivery issue — the
guest processes the initial MMIO exits but then enters HLT idle and
timer interrupts don't wake it (only ~3 exits in 5 seconds). Fresh
boot and snapshot creation work end-to-end.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@vmm/src/kvm/vcpu.zig`:
- Around line 104-109: The getCpuid function currently only sets cpuid.nent and
padding before issuing KVM_GET_CPUID2, leaving the tail of
Kvm.CpuidBuffer.entries uninitialized; zero the entire cpuid buffer (all bytes
of the Kvm.CpuidBuffer instance) before setting nent/padding and calling
abi.ioctlVoid so any unused entries are deterministic and not leaked into
snapshots. Locate the getCpuid function and the Kvm.CpuidBuffer usage (symbols:
getCpuid, Kvm.CpuidBuffer, cpuid.nent, cpuid.padding, Kvm.MAX_CPUID_ENTRIES,
KVM_GET_CPUID2) and add a full zeroing of the cpuid struct (using the project’s
preferred mem zero helper) prior to the ioctl call.
- Around line 213-217: The doc comment for kvmclockCtrl currently claims the
ioctl is "Non-fatal if the guest doesn't support kvmclock" but the
implementation uses try abi.ioctlVoid(self.fd, 0xAED5, 0) which propagates all
errors; update kvmclockCtrl to match the comment by catching the ioctl error and
only propagate unexpected errors: call abi.ioctlVoid and if it returns
error.InvalidArgument or error.NoDevice (or the corresponding errno values)
swallow/ignore them, otherwise return the error; reference the kvmclockCtrl
function, its use of abi.ioctlVoid and self.fd when making the change.
- Around line 100-102: The setCpuid function currently passes the
Kvm.CpuidBuffer directly into the KVM_SET_CPUID2 ioctl; add a defensive bounds
check on the buffer's nent field before calling abi.ioctlVoid to ensure nent <=
256 (the fixed capacity) and return an error if it exceeds that limit. Locate
the setCpuid function and Kvm.CpuidBuffer usage, read cpuid.nent (or equivalent
field) and validate it against the constant max entries (256) prior to invoking
abi.ioctlVoid(self.fd, c.KVM_SET_CPUID2, ...), failing early on invalid values
to prevent the kernel from reading past the userspace buffer. Ensure the error
returned is appropriate for the function signature (!void).
- Around line 199-210: The getMsrs and setMsrs implementations currently call
abi.ioctlVoid and assume success; change them to treat KVM_GET_MSRS/KVM_SET_MSRS
as counted transfers by calling the ioctl variant that returns an integer (e.g.,
abi.ioctl returning isize), capture the returned count, verify it equals
buf.nmsrs (or the expected number) and return typed errors (e.g.,
error.PartialMsrRead / error.PartialMsrWrite / error.InvalidMsrCount) on
mismatch; also ensure the MsrBuffer is zero-initialized before use (zero buf in
getMsrs and zero any temporary buffers before save) to avoid leaking
uninitialized memory, and when persisting/saving only write the actual number of
populated entries (snapshot_msr_indices.len / buf.nmsrs) rather than the full
32-entry MsrBuffer.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d0e7f02e-91aa-435f-adcf-bbf4b68b92c9

📥 Commits

Reviewing files that changed from the base of the PR and between 0df9a58 and ca7917c.

📒 Files selected for processing (2)
  • vmm/src/kvm/vcpu.zig
  • vmm/src/snapshot.zig
🚧 Files skipped from review as they are similar to previous changes (1)
  • vmm/src/snapshot.zig

Comment on lines +100 to +102
pub fn setCpuid(self: Self, cpuid: *Kvm.CpuidBuffer) !void {
try abi.ioctlVoid(self.fd, c.KVM_SET_CPUID2, @intFromPtr(cpuid));
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's find and examine the vcpu.zig file
find . -name "vcpu.zig" -type f

Repository: joshuaisaact/hearth

Length of output: 85


🏁 Script executed:

# Let's look at the setCpuid function context
head -120 vmm/src/kvm/vcpu.zig | tail -30

Repository: joshuaisaact/hearth

Length of output: 1160


🏁 Script executed:

# Search for CpuidBuffer definition
rg "CpuidBuffer" --type zig -B 2 -A 5

Repository: joshuaisaact/hearth

Length of output: 3308


🏁 Script executed:

# Search for MAX_CPUID_ENTRIES constant
rg "MAX_CPUID_ENTRIES" --type zig

Repository: joshuaisaact/hearth

Length of output: 408


🏁 Script executed:

# Also search for the Kvm module structure
fd "\.zig" | xargs rg "pub const|struct.*Cpuid" --type zig -B 1 -A 3 | head -50

Repository: joshuaisaact/hearth

Length of output: 2451


Add bounds check for nent in setCpuid before passing to KVM.

The setCpuid function passes the CPUID buffer directly to KVM_SET_CPUID2 without validating that nent does not exceed the fixed buffer capacity (256 entries). If nent is corrupted or oversized (e.g., from a malformed snapshot), the kernel ioctl will trust the embedded count and may read past the end of the userspace buffer. The snapshot.zig restore path already validates this defensively, but the check belongs in setCpuid itself to prevent the ioctl call from being invoked with invalid data.

Suggested fix
 pub fn setCpuid(self: Self, cpuid: *Kvm.CpuidBuffer) !void {
+    if (cpuid.nent > Kvm.MAX_CPUID_ENTRIES) return error.InvalidCpuidEntryCount;
     try abi.ioctlVoid(self.fd, c.KVM_SET_CPUID2, @intFromPtr(cpuid));
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@vmm/src/kvm/vcpu.zig` around lines 100 - 102, The setCpuid function currently
passes the Kvm.CpuidBuffer directly into the KVM_SET_CPUID2 ioctl; add a
defensive bounds check on the buffer's nent field before calling abi.ioctlVoid
to ensure nent <= 256 (the fixed capacity) and return an error if it exceeds
that limit. Locate the setCpuid function and Kvm.CpuidBuffer usage, read
cpuid.nent (or equivalent field) and validate it against the constant max
entries (256) prior to invoking abi.ioctlVoid(self.fd, c.KVM_SET_CPUID2, ...),
failing early on invalid values to prevent the kernel from reading past the
userspace buffer. Ensure the error returned is appropriate for the function
signature (!void).

Comment on lines +104 to +109
/// Read back the CPUID entries currently set on this vCPU.
/// Needed for snapshot: the guest may see filtered CPUID vs host's supported set.
pub fn getCpuid(self: Self, cpuid: *Kvm.CpuidBuffer) !void {
cpuid.nent = Kvm.MAX_CPUID_ENTRIES;
cpuid.padding = 0;
try abi.ioctlVoid(self.fd, c.KVM_GET_CPUID2, @intFromPtr(cpuid));

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's find the CpuidBuffer definition
find . -type f -name "*.zig" | head -20

Repository: joshuaisaact/hearth

Length of output: 576


🏁 Script executed:

# Search for CpuidBuffer definition
rg "CpuidBuffer" -t zig -B 5 -A 10

Repository: joshuaisaact/hearth

Length of output: 5473


🏁 Script executed:

# Find and inspect the snapshot.zig file mentioned in the original scratchpad
fd -t f "snapshot.zig" -x cat {}

Repository: joshuaisaact/hearth

Length of output: 12233


🏁 Script executed:

# Look at the getCpuid function in context
sed -n '100,120p' vmm/src/kvm/vcpu.zig

Repository: joshuaisaact/hearth

Length of output: 985


🏁 Script executed:

# Check if there's a getMsrs/setMsrs to compare patterns
rg "getMsrs|setMsrs" -t zig -B 2 -A 8

Repository: joshuaisaact/hearth

Length of output: 3149


🏁 Script executed:

# Get the full vcpu.zig file structure to understand the context
wc -l vmm/src/kvm/vcpu.zig

Repository: joshuaisaact/hearth

Length of output: 87


Zero the full CPUID buffer before KVM_GET_CPUID2.

The snapshot save path writes the entire Kvm.CpuidBuffer struct to disk, but getCpuid only initializes the live prefix. When KVM returns fewer than Kvm.MAX_CPUID_ENTRIES, the uninitialized tail of the entries array persists as stack garbage into the snapshot file.

Suggested fix
 pub fn getCpuid(self: Self, cpuid: *Kvm.CpuidBuffer) !void {
+    cpuid.* = std.mem.zeroes(Kvm.CpuidBuffer);
     cpuid.nent = Kvm.MAX_CPUID_ENTRIES;
-    cpuid.padding = 0;
     try abi.ioctlVoid(self.fd, c.KVM_GET_CPUID2, @intFromPtr(cpuid));
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@vmm/src/kvm/vcpu.zig` around lines 104 - 109, The getCpuid function currently
only sets cpuid.nent and padding before issuing KVM_GET_CPUID2, leaving the tail
of Kvm.CpuidBuffer.entries uninitialized; zero the entire cpuid buffer (all
bytes of the Kvm.CpuidBuffer instance) before setting nent/padding and calling
abi.ioctlVoid so any unused entries are deterministic and not leaked into
snapshots. Locate the getCpuid function and the Kvm.CpuidBuffer usage (symbols:
getCpuid, Kvm.CpuidBuffer, cpuid.nent, cpuid.padding, Kvm.MAX_CPUID_ENTRIES,
KVM_GET_CPUID2) and add a full zeroing of the cpuid struct (using the project’s
preferred mem zero helper) prior to the ioctl call.
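The underlying pattern generalizes to any fixed-capacity counted buffer that gets serialized whole. A small C sketch (`CountedBuf` is a stand-in for `Kvm.CpuidBuffer`, not the real layout): zero the entire struct before the producer fills only a prefix, so the unused tail is deterministic when the whole struct is written to a snapshot file.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for a kernel counted buffer: the kernel fills the first
 * nent entries and leaves the rest untouched. */
typedef struct {
    uint32_t nent;
    uint32_t padding;
    uint64_t entries[8];
} CountedBuf;

/* Zero the whole buffer before handing it to the producer, so no
 * stack garbage can reach a snapshot that persists the full struct. */
void prepare_for_read(CountedBuf *b, uint32_t capacity)
{
    memset(b, 0, sizeof *b);
    b->nent = capacity;
}
```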

Comment on lines +213 to +217
/// Notify KVM that the guest was paused (prevents soft lockup watchdog
/// false positives on resume). Non-fatal if the guest doesn't support kvmclock.
pub fn kvmclockCtrl(self: Self) !void {
try abi.ioctlVoid(self.fd, 0xAED5, 0); // KVM_KVMCLOCK_CTRL
}

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

cd vmm && find . -name "vcpu.zig" -type f

Repository: joshuaisaact/hearth

Length of output: 81


🏁 Script executed:

cat -n vmm/src/kvm/vcpu.zig | sed -n '205,225p'

Repository: joshuaisaact/hearth

Length of output: 908


🏁 Script executed:

# Search for abi.ioctlVoid implementation
rg "fn ioctlVoid" vmm/

Repository: joshuaisaact/hearth

Length of output: 153


🏁 Script executed:

# Check for error handling patterns with ioctlVoid in the codebase
rg -A 3 "ioctlVoid" vmm/src/kvm/vcpu.zig | head -50

Repository: joshuaisaact/hearth

Length of output: 1210


🏁 Script executed:

cat -n vmm/src/kvm/abi.zig | grep -A 15 "pub fn ioctlVoid"

Repository: joshuaisaact/hearth

Length of output: 347


🏁 Script executed:

# Search for calls to kvmclockCtrl to see how errors are currently handled
rg "kvmclockCtrl" vmm/

Repository: joshuaisaact/hearth

Length of output: 178


🏁 Script executed:

# Check if there are any patterns in the codebase for ignoring specific errors
rg "catch|error:" vmm/src/kvm/ | head -30

Repository: joshuaisaact/hearth

Length of output: 167


🏁 Script executed:

cat -n vmm/src/kvm/abi.zig | grep -B 10 "pub fn ioctlVoid"

Repository: joshuaisaact/hearth

Length of output: 532


🏁 Script executed:

# Find and read the ioctl function
rg -A 20 "^fn ioctl\(" vmm/src/kvm/abi.zig

Repository: joshuaisaact/hearth

Length of output: 45


🏁 Script executed:

rg -B 25 "pub fn ioctlVoid" vmm/src/kvm/abi.zig | head -40

Repository: joshuaisaact/hearth

Length of output: 1081


🏁 Script executed:

# Get the full abi.zig file to see the complete error handling
cat vmm/src/kvm/abi.zig

Repository: joshuaisaact/hearth

Length of output: 1503


Fix the contract mismatch between the doc comment and implementation.

The doc comment states "Non-fatal if the guest doesn't support kvmclock", but the function propagates all ioctl errors via try. Either update the doc comment to accurately reflect that the function can error, or change the implementation to catch and ignore the specific errors that occur when the guest doesn't support kvmclock (typically error.InvalidArgument or error.NoDevice).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@vmm/src/kvm/vcpu.zig` around lines 213 - 217, The doc comment for
kvmclockCtrl currently claims the ioctl is "Non-fatal if the guest doesn't
support kvmclock" but the implementation uses try abi.ioctlVoid(self.fd, 0xAED5,
0) which propagates all errors; update kvmclockCtrl to match the comment by
catching the ioctl error and only propagate unexpected errors: call
abi.ioctlVoid and if it returns error.InvalidArgument or error.NoDevice (or the
corresponding errno values) swallow/ignore them, otherwise return the error;
reference the kvmclockCtrl function, its use of abi.ioctlVoid and self.fd when
making the change.
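One way to implement the documented contract, sketched in C (the wrapper name is illustrative, and the real request constant should come from the KVM headers rather than a hardcoded value): treat the errno values that mean "kvmclock unsupported" as success and propagate everything else.

```c
#include <errno.h>
#include <sys/ioctl.h>

/* Returns 0 on success or when kvmclock is simply unsupported,
 * -1 on any unexpected failure. */
int kvmclock_ctrl_nonfatal(int vcpu_fd, unsigned long kvmclock_ctrl_req)
{
    if (ioctl(vcpu_fd, kvmclock_ctrl_req, 0) == 0)
        return 0;
    if (errno == EINVAL || errno == ENODEV)
        return 0; /* guest lacks kvmclock: non-fatal by contract */
    return -1;    /* anything else (EBADF, EFAULT, ...): propagate */
}
```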

- Fix KVM_KVMCLOCK_CTRL: was using hardcoded 0xAED5 (wrong),
  now uses c.KVM_KVMCLOCK_CTRL from KVM headers (0xAEAD)
- Add KVM_GET_XSAVE/KVM_SET_XSAVE to snapshot save/restore
  (full FPU/SSE/AVX register state)
- Log warning if KVMCLOCK_CTRL fails (non-fatal)

Snapshot restore still blocked: guest doesn't resume after restore.
KVM_RUN produces ~3 MMIO exits then the vCPU goes idle with no
timer interrupts waking it. All KVM state restoration succeeds
(matching Firecracker's exact order). APICv=N doesn't help.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (2)
vmm/src/kvm/vcpu.zig (2)

100-109: ⚠️ Potential issue | 🟠 Major

Defensively bound and sanitize the CPUID buffer before these ioctls.

setCpuid() still trusts cpuid.nent, and getCpuid() still leaves the unused tail uninitialized even though vmm/src/snapshot.zig::save() writes the whole Kvm.CpuidBuffer to disk. That combination can both over-trust malformed input and leak stack garbage into snapshots.

🧩 Suggested fix
 pub fn setCpuid(self: Self, cpuid: *Kvm.CpuidBuffer) !void {
+    if (cpuid.nent > Kvm.MAX_CPUID_ENTRIES) return error.InvalidCpuidEntryCount;
     try abi.ioctlVoid(self.fd, c.KVM_SET_CPUID2, @intFromPtr(cpuid));
 }
@@
 pub fn getCpuid(self: Self, cpuid: *Kvm.CpuidBuffer) !void {
+    cpuid.* = std.mem.zeroes(Kvm.CpuidBuffer);
     cpuid.nent = Kvm.MAX_CPUID_ENTRIES;
-    cpuid.padding = 0;
     try abi.ioctlVoid(self.fd, c.KVM_GET_CPUID2, @intFromPtr(cpuid));
 }
Linux KVM API: what constraints does KVM_SET_CPUID2 place on struct kvm_cpuid2.nent, and does KVM_GET_CPUID2 populate only the first nent entries of the caller-provided buffer?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@vmm/src/kvm/vcpu.zig` around lines 100 - 109, Ensure the CPUID buffer is
defensively bounded and zeroed: in setCpuid(), validate and clamp cpuid.nent to
Kvm.MAX_CPUID_ENTRIES (or return an error if cpuid.nent is out of range) before
calling ioctl to avoid trusting a malicious nent; in getCpuid(), set cpuid.nent
= Kvm.MAX_CPUID_ENTRIES as you do but also zero the unused entries/tail of the
Kvm.CpuidBuffer after the ioctl so any entries beyond the populated count are
deterministic (avoid leaking stack garbage), and document/handle the case where
the kernel returns fewer entries than MAX_CPUID_ENTRIES by only trusting the
populated count the kernel reports while keeping the rest zeroed.

199-210: ⚠️ Potential issue | 🟠 Major

Treat MSR ioctls as counted transfers and zero the fixed buffer first.

KVM_GET_MSRS and KVM_SET_MSRS can succeed partially, so ioctlVoid hides incomplete transfers. getMsrs() also initializes only the live prefix of a 32-entry buffer, while vmm/src/snapshot.zig::save() persists the full struct.

🧩 Suggested fix
 pub fn getMsrs(self: Self, buf: *MsrBuffer) !void {
+    buf.* = std.mem.zeroes(MsrBuffer);
     buf.nmsrs = snapshot_msr_indices.len;
-    buf.pad = 0;
     for (snapshot_msr_indices, 0..) |idx, i| {
         buf.entries[i] = .{ .index = idx, .reserved = 0, .data = 0 };
     }
-    try abi.ioctlVoid(self.fd, c.KVM_GET_MSRS, @intFromPtr(buf));
+    const requested: usize = @intCast(buf.nmsrs);
+    const got = try abi.ioctl(self.fd, c.KVM_GET_MSRS, @intFromPtr(buf));
+    if (got != requested) return error.PartialMsrRead;
 }
 
 pub fn setMsrs(self: Self, buf: *const MsrBuffer) !void {
-    try abi.ioctlVoid(self.fd, c.KVM_SET_MSRS, @intFromPtr(buf));
+    if (buf.nmsrs > MAX_MSR_ENTRIES) return error.InvalidMsrCount;
+    const requested: usize = @intCast(buf.nmsrs);
+    const written = try abi.ioctl(self.fd, c.KVM_SET_MSRS, @intFromPtr(buf));
+    if (written != requested) return error.PartialMsrWrite;
 }
Linux KVM API: what do KVM_GET_MSRS and KVM_SET_MSRS return on success, and can either ioctl report a partial transfer instead of failing?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@vmm/src/kvm/vcpu.zig` around lines 199 - 210, getMsrs and setMsrs must treat
KVM_GET_MSRS/KVM_SET_MSRS as counted transfers and operate on a zeroed
fixed-size MsrBuffer to avoid persisting garbage; zero the entire MsrBuffer
first, populate buf.nmsrs with the intended count, initialize all buf.entries to
zero then fill the live prefix using snapshot_msr_indices in getMsrs, then call
the ioctl variant that returns the actual number of MSRs processed (don’t use
ioctlVoid), check the returned count and update buf.nmsrs to that count to
reflect a partial success; similarly, in setMsrs ensure you only send the number
of entries indicated by buf.nmsrs (zero the buffer beforehand) and call the
ioctl that returns the actual processed count and handle partial writes instead
of assuming success.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@vmm/src/snapshot.zig`:
- Around line 309-315: The snapshot file creation in openCreate currently uses
mode 0o644 which exposes sensitive guest RAM; change the file mode passed to
linux.open in openCreate from 0o644 to owner-only 0o600 so created snapshot
artifacts are readable/writable only by the owner (update the mode argument in
the `@bitCast` call that constructs the open flags).
- Around line 286-290: The restore loop currently treats a saved device without
a live backend as a warning, allowing a partially reconstructed VM; change the
branch that handles devices[i] == null to abort the restore instead: in the
function performing the restore loop (the code around devices[i] checks in
snapshot.zig that calls dev.snapshotRestore), replace the log.warn path with a
failing return (e.g., return error.RestoreMissingBackend or an appropriate
existing error variant) so load()/restore exits immediately when a saved device
slot (dev_type at slot i) has no provided backend, and include the device slot
and type in the error context for diagnostics.
- Around line 185-187: The code currently calls Memory.initFromFile and
mem.deinit() via errdefer after vm.setMemoryRegion(0, 0, mem.alignedMem()),
which can unmap RAM while KVM slot 0 still references it; add an API in the
VM/KVM layer (e.g., vm.deleteMemoryRegion or vm.unsetMemoryRegion) in
vmm/src/kvm/vm.zig to unregister a memory region/slot from KVM, and change
snapshot.zig so the cleanup path first calls that new vm method to remove slot 0
before calling mem.deinit() (or reverse the errdefer order / add an errdefer to
call the new vm.unsetMemoryRegion before mem.deinit) to ensure KVM no longer
points at the mapped memory.

---

Duplicate comments:
In `@vmm/src/kvm/vcpu.zig`:
- Around line 100-109: Ensure the CPUID buffer is defensively bounded and
zeroed: in setCpuid(), validate and clamp cpuid.nent to Kvm.MAX_CPUID_ENTRIES
(or return an error if cpuid.nent is out of range) before calling ioctl to avoid
trusting a malicious nent; in getCpuid(), set cpuid.nent = Kvm.MAX_CPUID_ENTRIES
as you do but also zero the unused entries/tail of the Kvm.CpuidBuffer after the
ioctl so any entries beyond the populated count are deterministic (avoid leaking
stack garbage), and document/handle the case where the kernel returns fewer
entries than MAX_CPUID_ENTRIES by only trusting the populated count the kernel
reports while keeping the rest zeroed.
- Around line 199-210: getMsrs and setMsrs must treat KVM_GET_MSRS/KVM_SET_MSRS
as counted transfers and operate on a zeroed fixed-size MsrBuffer to avoid
persisting garbage; zero the entire MsrBuffer first, populate buf.nmsrs with the
intended count, initialize all buf.entries to zero then fill the live prefix
using snapshot_msr_indices in getMsrs, then call the ioctl variant that returns
the actual number of MSRs processed (don’t use ioctlVoid), check the returned
count and update buf.nmsrs to that count to reflect a partial success;
similarly, in setMsrs ensure you only send the number of entries indicated by
buf.nmsrs (zero the buffer beforehand) and call the ioctl that returns the
actual processed count and handle partial writes instead of assuming success.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e78380d7-fdd7-43ff-8051-0aa0d9241619

📥 Commits

Reviewing files that changed from the base of the PR and between ca7917c and ab61952.

📒 Files selected for processing (3)
  • vmm/src/kvm/vcpu.zig
  • vmm/src/main.zig
  • vmm/src/snapshot.zig
✅ Files skipped from review due to trivial changes (1)
  • vmm/src/main.zig

Comment on lines +185 to +187
const mem = try Memory.initFromFile(mem_path, header.mem_size);
errdefer mem.deinit();
try vm.setMemoryRegion(0, 0, mem.alignedMem());

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

Linux KVM API: after KVM_SET_USER_MEMORY_REGION registers a userspace_addr, is it valid to munmap that range without first deleting or replacing the memory slot?

💡 Result:

No, it is not valid to munmap the userspace memory range registered via KVM_SET_USER_MEMORY_REGION without first deleting or replacing the memory slot. The userspace memory must remain valid and accessible at the registered userspace_addr for the lifetime of the slot. KVM relies on MMU notifiers to detect changes like munmap and invalidate its internal mappings (e.g. EPT/NPT/SPTEs), but munmapping while the slot is active can lead to use-after-free scenarios where KVM or the guest accesses unmapped memory before invalidation completes. Official docs state changes like mmap/madvise are reflected automatically with KVM_CAP_SYNC_MMU, implying the backing must stay valid. Examples in code (e.g. QEMU, tutorials) mmap once and keep mapped until slot modification/removal. To change backing, use KVM_SET_USER_MEMORY_REGION to delete (memory_size=0) or replace the slot first.


🏁 Script executed:

find . -name "*.zig" -type f | head -20

Repository: joshuaisaact/hearth

Length of output: 576


🏁 Script executed:

# Locate the snapshot, memory, and vm files
fd -e zig | grep -E "(snapshot|memory|vm\.zig)" | head -20

Repository: joshuaisaact/hearth

Length of output: 121


🏁 Script executed:

# Check the structure
ls -la vmm/src/ 2>/dev/null | head -30

Repository: joshuaisaact/hearth

Length of output: 838


🏁 Script executed:

# Check snapshot.zig around lines 185-210 to see the error handling
cat -n vmm/src/snapshot.zig | sed -n '175,220p'

Repository: joshuaisaact/hearth

Length of output: 2108


🏁 Script executed:

# Check memory.zig to see if deinit unmaps
cat -n vmm/src/memory.zig

Repository: joshuaisaact/hearth

Length of output: 4278


🏁 Script executed:

# Check vm.zig for setMemoryRegion and any clearMemoryRegion function
cat -n vmm/src/kvm/vm.zig | head -100

Repository: joshuaisaact/hearth

Length of output: 4399


🏁 Script executed:

# Search for clearMemoryRegion or unmap in vm.zig
rg -n "clear|unmap" vmm/src/kvm/vm.zig

Repository: joshuaisaact/hearth

Length of output: 45


🏁 Script executed:

# See the rest of the restore function to understand all possible error paths
cat -n vmm/src/snapshot.zig | sed -n '185,280p'

Repository: joshuaisaact/hearth

Length of output: 4532


🏁 Script executed:

# Check if there's a clearMemoryRegion or deleteMemoryRegion function anywhere
rg -n "Memory.*Region|delete|clear" vmm/src/kvm/vm.zig

Repository: joshuaisaact/hearth

Length of output: 185


🏁 Script executed:

# Look for any error handling patterns in snapshot.zig around the memory setup
rg -n -A5 "setMemoryRegion" vmm/src/snapshot.zig

Repository: joshuaisaact/hearth

Length of output: 313


Don't unmap restored RAM while slot 0 still points at it.

After vm.setMemoryRegion(0, 0, mem.alignedMem()) at line 187, any error in the subsequent restoration steps (lines 192–261) will trigger errdefer mem.deinit(), which unmaps the memory via std.posix.munmap(). Since setMemoryRegion() only registers the userspace pointer with KVM and no corresponding deletion exists in vm.zig, slot 0 remains active, pointing to freed memory. The VM will fault or access invalid memory if the caller doesn't immediately tear down the VM.

🧹 Suggested fix

Add a helper to delete memory regions in vmm/src/kvm/vm.zig:

/// Register a guest physical memory region backed by host memory.
pub fn setMemoryRegion(self: Self, slot: u32, guest_phys_addr: u64, memory: []align(std.heap.page_size_min) u8) !void {
    var region = c.kvm_userspace_memory_region{
        .slot = slot,
        .flags = 0,
        .guest_phys_addr = guest_phys_addr,
        .memory_size = memory.len,
        .userspace_addr = @intFromPtr(memory.ptr),
    };
    try abi.ioctlVoid(self.fd, c.KVM_SET_USER_MEMORY_REGION, @intFromPtr(&region));
    log.info("memory region: slot={} guest=0x{x} size=0x{x}", .{ slot, guest_phys_addr, memory.len });
}

+/// Delete a guest physical memory region by setting memory_size to 0.
+pub fn clearMemoryRegion(self: Self, slot: u32) !void {
+    var region = std.mem.zeroes(c.kvm_userspace_memory_region);
+    region.slot = slot;
+    try abi.ioctlVoid(self.fd, c.KVM_SET_USER_MEMORY_REGION, @intFromPtr(&region));
+}

Then in vmm/src/snapshot.zig lines 185–187:

    const mem = try Memory.initFromFile(mem_path, header.mem_size);
    errdefer mem.deinit();
    try vm.setMemoryRegion(0, 0, mem.alignedMem());
+   errdefer vm.clearMemoryRegion(0) catch {};
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@vmm/src/snapshot.zig` around lines 185 - 187, The code currently calls
Memory.initFromFile and mem.deinit() via errdefer after vm.setMemoryRegion(0, 0,
mem.alignedMem()), which can unmap RAM while KVM slot 0 still references it; add
an API in the VM/KVM layer (e.g., vm.deleteMemoryRegion or vm.unsetMemoryRegion)
in vmm/src/kvm/vm.zig to unregister a memory region/slot from KVM, and change
snapshot.zig so the cleanup path first calls that new vm method to remove slot 0
before calling mem.deinit() (or reverse the errdefer order / add an errdefer to
call the new vm.unsetMemoryRegion before mem.deinit) to ensure KVM no longer
points at the mapped memory.

Snapshot restore has a KVM vCPU resume issue (KVM_RUN blocks forever
after state restore, likely related to APICv and LAPIC timer state).
As a workaround, Sandbox.create() now boots a fresh VM each time
instead of restoring from a snapshot.

This is slower (~6s vs target ~135ms) but functionally correct.
Fresh boot + exec + destroy cycle works end-to-end.

Also: fix KVM_GET_MSRS to check return count, add MSR_IA32_EFER
and FS/GS_BASE to MSR list, remove identity map addr from restore
paths (Firecracker doesn't use it).

Snapshot restore remains as a TODO for optimization.
…tion

Issues fixed:
- Remove SA_RESTART from SIGUSR1 handler (deadlocked VM pause)
- Add missing seccomp syscalls (epoll, nanosleep, statx)
- Fix vsock EAGAIN recovery re-iterating from wrong descriptor index
- Add DESC_F_WRITE validation in virtio-blk and virtio-net
- Add KVM_SET_IDENTITY_MAP_ADDR to both snapshot restore paths
- Validate device type matches on snapshot restore
- SendCtrlAltDel now actually signals VM exit
- Path traversal validation on snapshot create API
- Check agent.ping() return value in freshBoot/restoreFromDir
- Remove dead imports and loadSnapshot() method
- Fix VmConfig memory leak on repeated API requests
- Add backoff to pause spin loops (nanosleep after 1000 spins)
- Fix /dev/kvm permissions inside jail (0666 in private namespace)
- Add mem_size upper bound check on snapshot restore
- Show zig build errors on setup failure
Root cause: the ELF kernel loader's synthetic boot_params broke resume
from HLT after snapshot restore. Switching to bzImage format (which
uses the kernel's own setup header) fixes restore completely.

Changes:
- Switch kernel from ELF vmlinux to bzImage (built from 5.10 source via Docker)
- getKernelPath() prefers bzImage over vmlinux
- Setup builds bzImage from Firecracker CI 5.10 kernel config
- Sandbox.create() uses snapshot restore (68ms) instead of fresh boot (6s)
- Add 200ms settle delay before snapshot capture so vCPU enters HALTED
  state (mp_state=3) instead of RUNNABLE (mp_state=0)
- Dynamic MSR discovery via KVM_GET_MSR_INDEX_LIST (60+ MSRs vs 21)
- Add debug register save/restore (KVM_GET/SET_DEBUGREGS)
- Bump snapshot format to v2
Eliminates Docker dependency from setup. The 5.10.245 bzImage is hosted
at github.com/joshuaisaact/hearth/releases/tag/kernel-5.10.245.
…tion

Fixes:
- blk.zig: DESC_F_WRITE validation now actually prevents I/O (else → else if)
- api.zig: move immediate_exit=1 after cmpxchg to avoid pause race
- api.zig: add isValidBasename() to handleSnapshotLoad (was missing)
- api.zig: remove noop SendCtrlAltDel wait loop
- seccomp.zig: add open(2), shutdown(48), fdatasync(75) to allowlist
- snapshot.zig: assert device slots non-null during save, mode 0600
- sandbox.ts: check ping() in reconnectAgent()
- setup.test.ts: accept bzImage or vmlinux

Tests added:
- snapshot header mem_size bounds validation
- isValidBasename accepts/rejects (path traversal regression guard)
- seccomp filter contains all required syscalls
… sync

  Node.js copyFile uses copy_file_range which does server-side CoW on btrfs
  regardless of flags. The VMM mmaps the memory snapshot with MAP_PRIVATE for
  demand-paging, and reflinked extents break page fault handling on restore.
  Shell out to cp --reflink=never for the memory file.

  Also includes prior review fixes: tar path traversal guards, env var key
  validation, checkpoint resume error handling, cmpxchg pause race, IO exit
  bounds check, getIoData optional return, hardcoded MSR list, snapshot file
  permissions, device min size validation, blk pwrite retry, vsock payload
  cap, and vsock queue index sync after restore.
