
[uservm] E: Linux KVM guest sampling and perf host correlation#17

Open
esaurez wants to merge 23 commits into feature/profiler-windows from feature/profiler-linux

@esaurez esaurez commented Apr 22, 2026

Owner

Replica of nanvix#2168.

Summary

Adds Linux KVM guest sampling and perf record-based host kernel stack correlation, completing the cross-platform profiler started in PR #9 (Windows ETW). Plugs into the HostKernelSession type alias introduced there.

What this PR adds

KVM guest sampling (vmm/microvm/mod.rs)

  • Timer thread sends SIGUSR2 to vCPU thread at configurable frequency
  • On an Interrupted exit: issues KVM_GET_REGS and KVM_GET_SREGS to read EIP/EBP/CR3, then walks the frame-pointer chain
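
For reviewers unfamiliar with frame-pointer unwinding, here is a minimal sketch of the walk, assuming a 32-bit guest compiled with the standard `push ebp; mov ebp, esp` prologue. The `read_u32` accessor, the `MAX_FRAMES` bound, and the function name are hypothetical and are not taken from the PR.

```rust
/// Upper bound on frames per sample so a corrupted chain cannot loop forever
/// (hypothetical value).
const MAX_FRAMES: usize = 64;

/// Walks a 32-bit x86 frame-pointer chain and returns the sampled addresses,
/// leaf first.
///
/// `read_u32` is a hypothetical accessor that returns the 32-bit value stored
/// at a guest virtual address, or `None` if the address is not mapped.
pub fn walk_frame_pointers<F>(eip: u32, ebp: u32, read_u32: F) -> Vec<u32>
where
    F: Fn(u32) -> Option<u32>,
{
    let mut frames = vec![eip];
    let mut fp = ebp;
    while fp != 0 && frames.len() < MAX_FRAMES {
        // [ebp] holds the caller's EBP and [ebp + 4] holds the return address.
        let Some(ret_slot) = fp.checked_add(4) else { break };
        let (Some(next_fp), Some(ret)) = (read_u32(fp), read_u32(ret_slot)) else {
            break;
        };
        if ret == 0 {
            break;
        }
        frames.push(ret);
        // The stack grows down, so a caller's frame must sit at a higher
        // address; anything else indicates a corrupted or cyclic chain.
        if next_fp <= fp {
            break;
        }
        fp = next_fp;
    }
    frames
}
```

The walk stops on unmapped memory, a zero return address, a non-increasing frame pointer, or the frame cap, so a bad sample yields a short stack rather than a hang.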

Host kernel tracing (perf_linux.rs)

  • PerfSession wrapping perf record with system-wide sampling when running as root
  • Falls back to per-PID recording when unprivileged
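
The privilege-dependent choice between system-wide and per-PID recording can be sketched as an argument builder. This is an illustration only: the function name and parameters are hypothetical, and the exact flag set used by PerfSession is not shown in this description. The flags themselves (`-a`, `-p`, `-g`, `-F`, `-o`) are standard `perf record` options.

```rust
use std::path::Path;

/// Builds an argument list for `perf record`.
///
/// `is_root` is supplied by the caller; how the PR detects privileges is not
/// shown in the source, so that check is left out of this sketch.
pub fn perf_record_args(is_root: bool, pid: u32, freq_hz: u32, output: &Path) -> Vec<String> {
    let mut args = vec![
        "record".to_string(),
        "-g".to_string(), // capture call graphs
        "-F".to_string(), // sampling frequency in Hz
        freq_hz.to_string(),
        "-o".to_string(), // output file
        output.display().to_string(),
    ];
    if is_root {
        // System-wide sampling on all CPUs.
        args.push("-a".to_string());
    } else {
        // Unprivileged fallback: sample only the given process.
        args.push("-p".to_string());
        args.push(pid.to_string());
    }
    args
}
```

The result could be passed to `std::process::Command::new("perf").args(...)` to spawn the recorder.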

E2E script (full-flamegraph.sh)

  • Mirrors full-flamegraph.ps1 structure for Linux

Ungated register access (kvm/vcpu/mod.rs)

  • Removed #[cfg(feature = "gdb")] from get_regs() and get_sregs()

Stacked on

@esaurez esaurez force-pushed the feature/profiler-windows branch from 514b3db to 1dd5d0e Compare April 22, 2026 23:27
@esaurez esaurez force-pushed the feature/profiler-linux branch 3 times, most recently from 9f1eab5 to 512deb5 Compare April 23, 2026 01:42
github-actions Bot and others added 22 commits April 23, 2026 03:36
Extend the mmap kernel call to accept an npages parameter that specifies
how many contiguous pages to map in a single invocation. This replaces
the previous page-at-a-time approach with a batched allocation strategy
that reduces kernel transitions from user space.

Kernel changes:
- Extend kcall dispatcher to pass arg3 (npages) to the mmap handler.
- Add input validation in the kcall entry point: reject zero npages,
  cap to mmap region capacity, and guard against address overflow.
- Rewrite ProcessManager::mmap() with a batched loop that uses
  try_reserve_exact to guarantee Vec capacity matches the intended page
  count (alloc_upages derives its count from uframes.capacity()).
- Implement a fallback chain: try batch size first, fall back to
  single-page allocation, then return OutOfMemory.
- Add rollback_mmap() helper to unmap partially mapped pages on failure.
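
One plausible reading of the fallback chain and rollback described above, sketched against a mock allocator. `FramePool`, `mmap_batched`, and the per-iteration retry policy are hypothetical; the real ProcessManager::mmap() and rollback_mmap() are not reproduced here.

```rust
#[derive(Debug, PartialEq)]
pub enum MmapError {
    OutOfMemory,
}

/// Hypothetical stand-in for the kernel frame allocator: `free` frames remain,
/// and at most `max_contiguous` can be handed out per request.
pub struct FramePool {
    pub free: usize,
    pub max_contiguous: usize,
}

impl FramePool {
    fn alloc(&mut self, n: usize) -> Option<usize> {
        if n == 0 || n > self.free || n > self.max_contiguous {
            return None;
        }
        self.free -= n;
        Some(n)
    }

    fn release(&mut self, n: usize) {
        self.free += n;
    }
}

/// Maps `npages` pages in chunks of up to `batch` pages and returns the chunk
/// sizes. On failure, every chunk mapped so far is released.
pub fn mmap_batched(
    pool: &mut FramePool,
    npages: usize,
    batch: usize,
) -> Result<Vec<usize>, MmapError> {
    let mut mapped: Vec<usize> = Vec::new();
    let mut remaining = npages;
    while remaining > 0 {
        let want = batch.max(1).min(remaining);
        // Try the batch size first, then fall back to a single page.
        match pool.alloc(want).or_else(|| pool.alloc(1)) {
            Some(n) => {
                mapped.push(n);
                remaining -= n;
            }
            None => {
                // Roll back the partial mapping before reporting the error.
                for n in mapped.drain(..) {
                    pool.release(n);
                }
                return Err(MmapError::OutOfMemory);
            }
        }
    }
    Ok(mapped)
}
```

Keeping the rollback inside the mapping routine means callers never see a half-mapped range, which matches the user-space change that delegates rollback to the kernel.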

User-space changes:
- Update the mmap kcall wrapper to use kcall4! and accept npages, with
  u32 range validation.
- Replace the page-by-page loop in sysalloc::heap::map_range() with a
  single batch mmap call, delegating rollback to the kernel.
- Update syscall::safe::mem::segment::map_range() to pass npages=1
  (batch conversion deferred via TODO).
- Update all existing callers (stress tests, testd) to pass npages=1.

Testing:
- Add test_mmap_multi_page() in testd that maps 4 pages in one call,
  verifies zero-initialization, writes and reads back per-page markers,
  and unmaps all pages.
[kernel] Add npages parameter to mmap kcall
Add a periodic guest stack sampling profiler that runs from the host side
and produces folded stack files compatible with flamegraph tools.

The profiler works by:
1. Starting a 1kHz timer that calls WHvCancelRunVirtualProcessor
2. On each Interrupted exit, reading guest EIP/EBP/CR3 registers
3. Walking the frame-pointer chain through host-mapped guest memory
4. Resolving addresses against ELF symbol tables (.symtab or .dynsym)
5. Writing folded stacks to a file for flamegraph generation
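
Steps 4 and 5 can be illustrated with a small sketch that resolves sampled addresses against a sorted symbol list and emits folded stacks in the `root;...;leaf count` format that flamegraph tools consume. The `Symbol` type, the function names, and the `[unknown]` placeholder are hypothetical and do not come from guest_profiler.

```rust
use std::collections::BTreeMap;

/// A function symbol: start address, size in bytes, and name.
pub struct Symbol {
    pub addr: u32,
    pub size: u32,
    pub name: String,
}

/// Finds the symbol containing `addr`. `symbols` must be sorted by `addr`.
pub fn resolve(symbols: &[Symbol], addr: u32) -> Option<&str> {
    // Index of the first symbol that starts above `addr`.
    let idx = symbols.partition_point(|s| s.addr <= addr);
    let sym = symbols.get(idx.checked_sub(1)?)?;
    // Zero-sized symbols are treated as one byte long.
    let offset = addr - sym.addr;
    (offset < sym.size.max(1)).then(|| sym.name.as_str())
}

/// Folds samples (leaf-first address lists) into "root;...;leaf count" lines.
pub fn fold_stacks(symbols: &[Symbol], samples: &[Vec<u32>]) -> String {
    let mut counts: BTreeMap<String, u64> = BTreeMap::new();
    for sample in samples {
        // Folded stacks are written root first, so reverse the sample.
        let stack: Vec<&str> = sample
            .iter()
            .rev()
            .map(|&a| resolve(symbols, a).unwrap_or("[unknown]"))
            .collect();
        *counts.entry(stack.join(";")).or_insert(0) += 1;
    }
    counts
        .iter()
        .map(|(stack, n)| format!("{stack} {n}\n"))
        .collect()
}
```

A `BTreeMap` keeps the output order deterministic, which makes folded files easy to diff between runs.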

Key components:
- guest_profiler/gva.rs: Guest VA to GPA translation (2-level page walk)
- guest_profiler/samples.rs: Stack sampling and frame-pointer walking
- guest_profiler/symbols.rs: ELF32 symbol table parser with .symtab/.dynsym
- vmm/microvm/whp/mod.rs: Timer integration and sampling on Interrupted exits
- vcpu/mod.rs: get_profile_regs() for reading EIP/EBP/CR3
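
The 2-level page walk in gva.rs presumably follows classic 32-bit x86 paging without PAE. A minimal sketch under that assumption is shown below; the `read_phys_u32` accessor and the function name are hypothetical, and the 4 MiB large-page branch is included for completeness rather than because the source mentions it.

```rust
const PRESENT: u32 = 1 << 0; // entry maps a page or table
const PAGE_SIZE_4M: u32 = 1 << 7; // PDE maps a 4 MiB page directly

/// Translates a guest virtual address to a guest physical address using
/// 32-bit non-PAE paging.
///
/// `read_phys_u32` is a hypothetical accessor that reads a 32-bit value at a
/// guest physical address, or returns `None` if it is out of range.
pub fn gva_to_gpa<F>(cr3: u32, gva: u32, read_phys_u32: F) -> Option<u32>
where
    F: Fn(u32) -> Option<u32>,
{
    // Level 1: page directory, indexed by bits 31:22 of the address.
    let pd_base = cr3 & 0xFFFF_F000;
    let pde = read_phys_u32(pd_base + (gva >> 22) * 4)?;
    if (pde & PRESENT) == 0 {
        return None;
    }
    if (pde & PAGE_SIZE_4M) != 0 {
        // Large page: bits 31:22 come from the PDE, the rest from the address.
        return Some((pde & 0xFFC0_0000) | (gva & 0x003F_FFFF));
    }
    // Level 2: page table, indexed by bits 21:12 of the address.
    let pt_base = pde & 0xFFFF_F000;
    let pte = read_phys_u32(pt_base + ((gva >> 12) & 0x3FF) * 4)?;
    if (pte & PRESENT) == 0 {
        return None;
    }
    Some((pte & 0xFFFF_F000) | (gva & 0xFFF))
}
```

Returning `None` for any non-present entry lets the sampler drop a frame instead of reading stale host memory.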

Also adds a release-profiling Cargo profile for building with symbols.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
[uservm] E: Host kernel stack correlation via ETW (Windows)
Bumps [Mozilla-Actions/sccache-action](https://github.com/mozilla-actions/sccache-action) from 0.0.9 to 0.0.10.
- [Release notes](https://github.com/mozilla-actions/sccache-action/releases)
- [Commits](Mozilla-Actions/sccache-action@v0.0.9...v0.0.10)

---
updated-dependencies:
- dependency-name: Mozilla-Actions/sccache-action
  dependency-version: 0.0.10
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
…Mozilla-Actions-sccache-action-0.0.10

Bump Mozilla-Actions/sccache-action from 0.0.9 to 0.0.10
Add host-mount.md describing the snapshot-based mounting feature that
allows users to mount a host directory into a Nanvix guest VM via a
new `-mount <host-dir>` CLI flag for nanvixd.

The document covers:
- Problem statement and user interface (CLI usage)
- Goals and non-goals (e.g., not replacing linuxd)
- High-level snapshot-based design: host dir is packaged into a FAT32
  image at launch, mapped via MMIO, and extracted back at shutdown
- Multi-image binary container format (MIMG) that supports packing
  multiple sub-images (ROOTFS + MOUNTFS) into a single MMIO region
- Size constraints, edge cases, and backward compatibility strategy
- Known limitations (copy overhead, 16 MiB ceiling, no live sync)
Introduce the `-mount <host-dir>` option for standalone mode, enabling
bidirectional file sharing between the host and the guest VM. A host
directory is packaged into a FAT32 image, mapped into guest memory at
/mnt via a new multi-image container format, and copied back to the host
on VM shutdown.

Multi-image container format (multiimage crate):
- Define a binary format with a HEAD page (header + entry table) followed
  by page-aligned sub-images, each identified by an 8-byte MMIO tag.
- Support both building (std) and parsing (no_std) paths so the same
  format works on host and guest.
- Provide compute_multiimage_layout() for zero-copy VMM integration
  (no intermediate concatenated file on disk).
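
To make the container layout concrete, here is a rough sketch of how offsets for page-aligned sub-images could be computed after a single HEAD page. The 4 KiB page size, the `Entry` fields, and the `sketch_layout` name are assumptions for illustration; the actual multiimage format and the signature of compute_multiimage_layout() are not shown in this commit message.

```rust
/// Assumed page size for alignment (hypothetical; the real value lives in the
/// multiimage crate).
const PAGE_SIZE: u64 = 4096;

/// One entry in the HEAD page's table (hypothetical field set).
#[derive(Debug, PartialEq)]
pub struct Entry {
    pub tag: [u8; 8], // 8-byte MMIO tag identifying the sub-image
    pub offset: u64,  // page-aligned offset from the start of the container
    pub size: u64,    // unpadded size of the sub-image in bytes
}

/// Rounds `n` up to the next page boundary, or `None` on overflow.
fn align_up(n: u64) -> Option<u64> {
    n.checked_add(PAGE_SIZE - 1).map(|v| v & !(PAGE_SIZE - 1))
}

/// Computes the entry table and total container size for `(tag, size)` pairs.
/// The first page is reserved for the header and entry table.
pub fn sketch_layout(images: &[([u8; 8], u64)]) -> Option<(Vec<Entry>, u64)> {
    let mut offset = PAGE_SIZE;
    let mut entries = Vec::with_capacity(images.len());
    for &(tag, size) in images {
        entries.push(Entry { tag, offset, size });
        // The next sub-image starts at the following page boundary.
        offset = align_up(offset.checked_add(size)?)?;
    }
    Some((entries, offset))
}
```

Because only offsets are computed, a VMM could map each sub-image file at its offset directly, which is the zero-copy property the commit describes.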

Guest-side changes (nvx crate):
- Detect multi-image containers in the RAMFS MMIO region during VFS
  initialization.
- Mount ROOTFS at "/" and MOUNTFS at "/mnt" when sub-images are present;
  fall back to the legacy single-image path otherwise.

VMM backend changes:
- Add mount module (src/uservm/src/vmm/microvm/mount.rs) with helpers to
  build FAT images from host directories, compute unified layouts, and
  extract modified files after shutdown (copyback).
- Add MultiRamFs to the ramfs module for loading multi-image containers
  with per-file zero-copy mappings into guest memory.
- KVM: add remap_files_at() and attach_backing_files() to VirtualMemory.
- WHP: refactor placeholder splitting to support multiple file-backed
  views (MultiFileRemap), including gap re-commit and proper cleanup in
  Drop.

mkramfs library extensions:
- Export compute_image_size(), generate_image(), dir_size(), and
  copy_dir_recursive() as public APIs.
- Change mkfatfs() to return Result instead of panicking.

CLI and plumbing:
- Add -mount option to nanvixd argument parser with directory validation.
- Thread mount_directory through UserVmArgs, MicroVmArgs,
  StandaloneConfig, TerminalConfig, and all sandbox/benchmark call sites.
- Add ROOTFS_MMIO_TAG and MOUNTFS_MMIO_TAG to config::region_tags.
- Add MmioTag::as_bytes() accessor to mmio-tag crate.

Tests and benchmarks:
- Add mount-test guest binary: verifies read, nested read, create, and
  modify-then-re-read on /mnt.
- Add mount-bench-nostd guest binary: measures sequential 4 KiB read,
  write, and file creation latency on /mnt.
- Add test-standalone.toml entry for mount-test.
- Build system: register new crates, add to standalone guest binary list,
  and seed test/bench data directories in the Makefile.
[uservm] Add host directory mounting on guest
- perf_linux.rs: PerfSession with system-wide perf (-a) for kernel stacks,
  per-PID fallback. Waits for perf_event fds, detects early perf exit.
  chmod 644 on perf.data after stop for script accessibility.
- Re-export PerfSession as HostKernelSession (matches Windows type alias)
- KVM sampling: SIGUSR2 timer, Interrupted continues when profiler active
- full-flamegraph.sh: perf script -F, grep nanvixd|uservm filtering
- Safe register reads, freq_hz clamped 1-10kHz

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@esaurez esaurez force-pushed the feature/profiler-linux branch from 512deb5 to c3f64d3 Compare April 23, 2026 17:47
Add Linux KVM guest profiler with perf host correlation:

- perf_linux.rs: PerfSession with named constants,
  system-wide fallback, readiness wait, chmod
- mod.rs: SIGUSR2 profiler timer extracted to
  spawn_profiler_timer(), error capture on join
- flamegraph_host.py: Linux perf extraction with
  returncode checks for perf script and collapse
- flamegraph.py: add PERF_SESSION to stderr filter
- Remove full-flamegraph.sh (replaced by flamegraph.py)
- Downgrade EINTR log from warn to trace (profiler
  generates up to 10kHz interrupts)
- Fix perf_linux.rs docs: no timestamp correlation
- Fix start() docstring: describes actual behavior
- Fix mod.rs comment: Linux uses PerfSession not stub
- Document Relaxed ordering safety in timer shutdown
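
The timer and its shutdown flag can be sketched as follows. This is not the PR's spawn_profiler_timer(): the `kick` callback stands in for sending SIGUSR2 to the vCPU thread, the function names are hypothetical, and the clamp bounds assume "1-10kHz" means 1 Hz to 10 kHz.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

// Assumed clamp bounds (1 Hz to 10 kHz).
const MIN_FREQ_HZ: u32 = 1;
const MAX_FREQ_HZ: u32 = 10_000;

/// Converts a requested frequency into a sleep interval after clamping.
pub fn sample_interval(freq_hz: u32) -> Duration {
    let hz = freq_hz.clamp(MIN_FREQ_HZ, MAX_FREQ_HZ);
    Duration::from_nanos(1_000_000_000 / u64::from(hz))
}

/// Spawns a thread that calls `kick` once per interval until `stop` is set.
/// The thread returns the number of ticks it delivered.
pub fn spawn_timer_sketch<F>(freq_hz: u32, stop: Arc<AtomicBool>, mut kick: F) -> JoinHandle<u64>
where
    F: FnMut() + Send + 'static,
{
    let interval = sample_interval(freq_hz);
    thread::spawn(move || {
        let mut ticks = 0u64;
        // Relaxed is enough in this sketch: the flag guards no other data,
        // and join() provides the synchronization for the returned count.
        while !stop.load(Ordering::Relaxed) {
            thread::sleep(interval);
            kick();
            ticks += 1;
        }
        ticks
    })
}
```

A caller would set the flag with `stop.store(true, Ordering::Relaxed)` and then `join()` the handle; the thread exits within one interval of the store becoming visible.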

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@esaurez esaurez force-pushed the feature/profiler-linux branch from c3f64d3 to 8056cc0 Compare April 23, 2026 17:53