Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
179 changes: 179 additions & 0 deletions .agent/specs/test-structure.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,179 @@
# Test Structure Recommendation

## Current State

99 TypeScript test files + ~310 Rust tests across 5 crates. The main problem is `packages/core/tests/` — 57 files in a flat directory with no grouping. The Rust side is better but has a few monoliths and no fast/slow distinction.

## TypeScript: Target Structure

```
packages/core/tests/
├── unit/ # No VM, no sidecar — pure logic tests
│ ├── host-tools-argv.test.ts
│ ├── host-tools-prompt.test.ts
│ ├── host-tools-shims.test.ts
│ ├── mount-descriptors.test.ts
│ ├── root-filesystem-descriptors.test.ts
│ ├── sidecar-permission-descriptors.test.ts
│ ├── sidecar-placement.test.ts
│ ├── os-instructions.test.ts
│ ├── cron-manager.test.ts
│ ├── cron-timer-driver.test.ts
│ ├── allowed-node-builtins.test.ts
│ ├── list-agents.test.ts
│ └── software-projection.test.ts
├── filesystem/ # VM filesystem operations
│ ├── crud.test.ts # (was filesystem.test.ts)
│ ├── move-delete.test.ts
│ ├── batch-ops.test.ts
│ ├── readdir-recursive.test.ts
│ ├── overlay.test.ts # (was overlay-backend.test.ts)
│ ├── layers.test.ts
│ ├── mount.test.ts
│ ├── host-dir.test.ts
│ └── base-filesystem.test.ts
├── process/ # Process execution, signals, trees
│ ├── execute.test.ts
│ ├── management.test.ts
│ ├── tree.test.ts
│ ├── all-processes.test.ts
│ ├── spawn-flat-api.test.ts
│ └── shell-flat-api.test.ts
├── session/ # ACP session lifecycle and protocol
│ ├── lifecycle.test.ts
│ ├── events.test.ts
│ ├── capabilities.test.ts
│ ├── mcp.test.ts
│ ├── cancel.test.ts
│ ├── protocol.test.ts # (was acp-protocol.test.ts)
│ └── e2e.test.ts # (merge session.test.ts + session-comprehensive + session-mock-e2e)
├── agents/ # Per-agent adapter tests
│ ├── pi/
│ │ ├── headless.test.ts
│ │ ├── acp-adapter.test.ts
│ │ ├── sdk-adapter.test.ts
│ │ └── tool-llmock.test.ts
│ ├── claude/
│ │ ├── investigate.test.ts
│ │ ├── sdk-adapter.test.ts
│ │ └── session.test.ts
│ ├── opencode/
│ │ ├── acp.test.ts
│ │ ├── headless.test.ts
│ │ └── session.test.ts
│ └── codex/
│ └── session.test.ts
├── wasm/ # WASM command and permission tests
│ ├── commands.test.ts
│ └── permission-tiers.test.ts
├── network/
│ ├── network.test.ts
│ └── host-tools-server.test.ts
├── sidecar/
│ ├── client.test.ts
│ └── native-process.test.ts
├── cron/
│ └── integration.test.ts
└── helpers/ # Shared test utilities (stays as-is)
```

### Registry tests

```
registry/tests/
├── e2e/ # Rename kernel/ → e2e/ for clarity
│ ├── npm/ # Group the 9 npm e2e tests
│ │ ├── install.test.ts
│ │ ├── scripts.test.ts
│ │ ├── suite.test.ts
│ │ ├── lifecycle.test.ts
│ │ ├── version-init.test.ts
│ │ ├── npx-and-pipes.test.ts
│ │ ├── concurrently.test.ts
│ │ ├── nextjs-build.test.ts
│ │ └── project-matrix.test.ts
│ ├── cross-runtime/ # Group the 3 cross-runtime tests
│ │ ├── network.test.ts
│ │ ├── pipes.test.ts
│ │ └── terminal.test.ts
│ ├── bridge-child-process.test.ts
│ ├── ctrl-c-shell-behavior.test.ts
│ ├── dispose-behavior.test.ts
│ ├── error-propagation.test.ts
│ ├── exec-integration.test.ts
│ ├── fd-inheritance.test.ts
│ ├── module-resolution.test.ts
│ ├── node-binary-behavior.test.ts
│ ├── signal-forwarding.test.ts
│ ├── tree-test.test.ts
│ └── vfs-consistency.test.ts
├── wasmvm/ # Already well organized — keep as-is
├── projects/ # Fixtures — keep as-is
└── smoke.test.ts
```

## Rust: Target Structure

The per-crate layout is already good. The changes are surgical:

### Split `execution/tests/javascript.rs` (46 tests)

```
crates/execution/tests/
├── javascript/
│ ├── mod.rs # common setup
│ ├── builtin_interception.rs # require('fs') → polyfill routing
│ ├── module_resolution.rs # ESM/CJS loading, import paths
│ ├── env_hardening.rs # env stripping, process proxy, guest env
│ └── sync_rpc.rs # sync RPC bridge, timeouts
├── python.rs # (15 tests — fine as-is)
├── python_prewarm.rs # (2 tests — fine as-is)
├── wasm.rs # (20 tests — fine as-is)
├── permission_flags.rs # (6 tests — fine as-is)
├── benchmark.rs
└── smoke.rs
```

### Mark slow sidecar integration tests

Tests that spawn real sidecar processes (`crash_isolation`, `session_isolation`, `vm_lifecycle`, `process_isolation`) should use `#[ignore]`:

```rust
#[test]
#[ignore] // spawns sidecar process — run with: cargo test -- --ignored
fn crash_isolation() { ... }
```

This lets `cargo test` stay fast; CI runs `cargo test -- --include-ignored`.

### Keep kernel/tests/ as-is

The 1-file-per-subsystem pattern (vfs, fd_table, process_table, pipe_manager, etc.) already maps cleanly to kernel modules. No changes needed.

### Summary

| Crate | Status | Action |
|-------|--------|--------|
| `kernel/tests/` (19 files, 161 tests) | Good — 1:1 with subsystems | Keep as-is |
| `execution/tests/` (8 files, 95 tests) | `javascript.rs` is a monolith | Split into submodule |
| `sidecar/tests/` (14 files, 49 tests) | Mixes fast/slow | `#[ignore]` on integration tests |
| `bridge/tests/` (2 files, 1 test) | Fine | Keep as-is |
| `sidecar-browser/tests/` (3 files, 5 tests) | Fine | Keep as-is |

## Migration Approach

This should be done incrementally, one directory at a time:

1. Create subdirectories and move files (git mv preserves history)
2. Update vitest config globs / Cargo test paths after each move
3. Verify CI passes after each batch
4. Do not combine restructuring with functional changes in the same PR
103 changes: 103 additions & 0 deletions .agent/todo/adversarial-isolation-tests.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,103 @@
# Adversarial Isolation Tests

Moved from `scripts/ralph/prd.json` on 2026-04-05. These are security-focused adversarial tests that verify the isolation boundary works end-to-end, not just that the right flags are passed.

## US-001: Adversarial escape-attempt tests for Node.js filesystem isolation

**Priority:** High
**Why:** Currently `permission_flags.rs` uses `write_fake_node_binary()` which only checks args are passed, not that isolation works.

- Add test in `crates/execution/tests/javascript.rs` that runs guest JS attempting `fs.readFileSync('/etc/hostname')` and verifies it returns kernel VFS content, not host content
- Add test that attempts `fs.readFileSync` on a path outside the sandbox root and verifies EACCES or kernel-mediated denial
- Add test that attempts `require('fs').realpathSync('/')` and verifies it returns kernel VFS root, not host root
- Tests use real Node.js execution, not fake binaries or mocks
- `cargo test -p agent-os-execution --test javascript` passes

## US-002: Adversarial escape-attempt tests for child_process isolation

**Priority:** High
**Why:** US-008 tested exec/execSync hardening but only verified the RPC routing, not that actual host commands are blocked.

- Add test that attempts `require('child_process').execSync('whoami')` and verifies it routes through kernel process table, not host
- Add test that attempts `require('child_process').spawn('/bin/sh', ['-c', 'cat /etc/passwd'])` and verifies denial or kernel mediation
- Add test that verifies nested child processes cannot escalate Node `--permission` flags beyond what parent allows
- Tests use real Node.js execution end-to-end
- `cargo test -p agent-os-execution --test javascript` passes

## US-003: Adversarial escape-attempt tests for network isolation

**Priority:** High
**Why:** US-048 verified permission callbacks fire but didn't test actual blocked connections end-to-end.

- Add test that attempts `net.connect` to a non-exempt loopback port and verifies EACCES
- Add test that attempts `dns.lookup` of an external hostname and verifies it goes through sidecar DNS, not host resolver
- Add test that attempts `dgram.send` to a private IP and verifies SSRF blocking
- Tests use real Node.js execution with actual sidecar networking stack
- `cargo test -p agent-os-sidecar` passes for the new tests

## US-004: Adversarial escape-attempt tests for process.env and process identity leaks

**Priority:** High
**Why:** Execution-level tests exist but sidecar-level end-to-end verification is missing.

- Add sidecar-level test that verifies `process.env` contains no `AGENT_OS_*` keys via `Object.keys()` enumeration
- Add sidecar-level test that verifies `process.pid` returns kernel PID, not host PID
- Add sidecar-level test that verifies `process.cwd()` returns guest path, not host sandbox path
- Add sidecar-level test that verifies `process.execPath` does not contain host Node.js binary path
- Add sidecar-level test that verifies `require.resolve()` returns guest-visible paths
- Tests run through the full sidecar execution stack
- `cargo test -p agent-os-sidecar` passes for the new tests

## US-005: Fix SSRF private IP filter to cover all special-purpose ranges

**Priority:** Medium
**Why:** Current filter covers 10/172.16/192.168/169.254/fe80/fc00 but misses 0.0.0.0, broadcast, and multicast.

- Block 0.0.0.0/8 (current network) in `is_private_ip` check in `crates/sidecar/src/service.rs`
- Block 255.255.255.255/32 (broadcast)
- Block 224.0.0.0/4 (IPv4 multicast)
- Block ff00::/8 (IPv6 multicast)
- Add unit tests for each newly blocked range
- `cargo test -p agent-os-sidecar` passes

## US-006: Add network permission check for Unix socket connections

**Priority:** Medium
**Why:** TCP `net.connect` correctly calls `require_network_access` but Unix socket path skips it.

- Add `bridge.require_network_access()` call in the `net.connect({ path })` handler in `crates/sidecar/src/service.rs` before connecting
- Add test that creates a VM with denied network permissions and verifies Unix socket connect returns EACCES
- Existing Unix socket tests with allowed permissions continue to pass
- `cargo test -p agent-os-sidecar` passes

## US-007: Scrub host info from error messages returned to guest code

**Priority:** Medium
**Why:** Error responses currently leak actual IP/port info and DNS events include full resolver IPs.

- Audit all `respond_javascript_sync_rpc_error` calls in `crates/sidecar/src/service.rs` — ensure error messages do not contain host filesystem paths
- Scrub DNS event emissions so host resolver IPs are not included in guest-visible structured events
- Add test that triggers a filesystem error and verifies the guest-visible error message contains only guest paths
- Add test that triggers a network error and verifies the guest-visible error does not contain actual host IP/port
- `cargo test -p agent-os-sidecar` passes

## US-008: Make sidecar DNS resolver not fall through to host by default

**Priority:** Medium
**Why:** Current default uses `TokioResolver::builder_tokio()` which reads host `/etc/resolv.conf`.

- When no `network.dns.servers` metadata is configured, DNS queries should resolve only against a known-safe default (e.g., 8.8.8.8, 1.1.1.1) or return EACCES, never silently use the host system resolver
- Add test that creates a VM with no DNS override and verifies queries do not use the host `/etc/resolv.conf`
- Add test that creates a VM with explicit DNS servers and verifies only those servers are queried
- `cargo test -p agent-os-sidecar` passes

## US-009: Replace fake Node binary in permission flag tests with real enforcement tests

**Priority:** Medium
**Why:** All current tests use `write_fake_node_binary()` which logs invocations instead of executing.

- Add at least 3 tests in `crates/execution/tests/permission_flags.rs` that use a real Node.js binary
- One test verifies that `--allow-fs-read` scoping actually prevents reading a file outside the allowed path
- One test verifies that missing `--allow-child-process` actually prevents `child_process.spawn` from working
- One test verifies that missing `--allow-worker` actually prevents Worker creation
- `cargo test -p agent-os-execution --test permission_flags -- --test-threads=1` passes
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -42,6 +42,7 @@ registry/software/*/.last-publish-hash
registry/.build-markers/

# Ralph agent artifacts
scripts/ralph/archive/
scripts/ralph/codex-streams/
scripts/ralph/.last-branch
scripts/ralph/prd.json
Expand Down
26 changes: 24 additions & 2 deletions CLAUDE.md
Original file line number Diff line number Diff line change
Expand Up @@ -103,7 +103,7 @@ agentOS wraps the kernel and adds: a high-level filesystem/process API, ACP agen
## Project Structure

- **Monorepo**: pnpm workspaces + Turborepo + TypeScript + Biome
- **Core package**: `@rivet-dev/agent-os` in `packages/core/` -- contains everything (VM ops, ACP client, session management)
- **Core package**: `@rivet-dev/agent-os-core` in `packages/core/` -- contains everything (VM ops, ACP client, session management)
- **Registry types**: `@rivet-dev/agent-os-registry-types` in `packages/registry-types/` -- shared type definitions for WASM command package descriptors. The registry software packages link to this package. When changing descriptor types, update here and rebuild the registry.
- **npm scope**: `@rivet-dev/agent-os-*`
- **Actor integration** lives in the Rivet repo at `rivetkit-typescript/packages/rivetkit/src/agent-os/`, not as a separate package
Expand Down Expand Up @@ -246,7 +246,8 @@ Each agent type needs:

## Testing

- **Framework**: vitest
- **Framework**: vitest (TypeScript), `cargo test` (Rust)
- **Always verify related tests pass before considering work done.** After any code change, identify and run the tests that cover the modified code. A task is not complete until its related tests pass. If no tests exist for the changed behavior, write them.
- **All tests run inside the VM** -- network servers, file I/O, agent processes
- Network tests: write a server script file, run it with `node` inside the VM, then `vm.fetch()` against it
- Agent tests must be run sequentially in layers:
Expand All @@ -257,6 +258,27 @@ Each agent type needs:
- **Mock LLM testing**: Use `@copilotkit/llmock` to run a mock LLM server on the HOST (not inside the VM). Use `loopbackExemptPorts` in `AgentOs.create()` to exempt the mock port from SSRF checks. The kernel needs `permissions: allowAll` for network access.
- **Module access**: Set `moduleAccessCwd` in `AgentOs.create()` to a host dir with `node_modules/`. pnpm puts devDeps in `packages/core/node_modules/` which are accessible via the ModuleAccessFileSystem overlay.

### Test Structure

See `.agent/specs/test-structure.md` for the full restructuring plan. The target layout:

**TypeScript (`packages/core/tests/`)** — organized by domain subdirectory:
- `unit/` — no VM, no sidecar; pure logic (host-tools parsing, descriptors, cron manager, etc.)
- `filesystem/` — VFS CRUD, overlay, mount, layers, host-dir
- `process/` — execution, signals, process tree, flat API wrappers
- `session/` — ACP lifecycle, events, capabilities, MCP, cancellation
- `agents/{pi,claude,opencode,codex}/` — per-agent adapter tests
- `wasm/` — WASM command and permission tier tests
- `network/` — connectivity, host-tools server
- `sidecar/` — sidecar client, native process
- `cron/` — cron integration

**Registry (`registry/tests/`)** — `e2e/` (was `kernel/`) with `npm/` and `cross-runtime/` subgroups, `wasmvm/` stays as-is.

**Rust (`crates/*/tests/`)** — per-crate, already good. Key changes:
- Split `execution/tests/javascript.rs` (46 tests) into `javascript/{builtin_interception,module_resolution,env_hardening,sync_rpc}.rs`
- Mark slow sidecar integration tests with `#[ignore]` so `cargo test` stays fast

### WASM Binaries and Quickstart Examples

- **WASM command binaries are not checked into git.** The `registry/software/*/wasm/` directories are build artifacts produced by compiling Rust/C source in `registry/native/`. They are published to npm as part of software packages (e.g., `@rivet-dev/agent-os-coreutils` is ~54MB with WASM binaries included).
Expand Down
6 changes: 6 additions & 0 deletions crates/execution/src/javascript.rs
Original file line number Diff line number Diff line change
Expand Up @@ -574,6 +574,12 @@ pub struct JavascriptExecutionEngine {
}

impl JavascriptExecutionEngine {
#[doc(hidden)]
pub fn set_import_cache_base_dir(&mut self, vm_id: impl Into<String>, base_dir: PathBuf) {
self.import_caches
.insert(vm_id.into(), NodeImportCache::new_in(base_dir));
}

pub fn create_context(&mut self, request: CreateJavascriptContextRequest) -> JavascriptContext {
self.next_context_id += 1;
self.import_caches.entry(request.vm_id.clone()).or_default();
Expand Down
2 changes: 1 addition & 1 deletion crates/execution/src/node_import_cache.rs
Original file line number Diff line number Diff line change
Expand Up @@ -8367,7 +8367,7 @@ fn cleanup_stale_node_import_caches(base_dir: &Path) {
}

impl NodeImportCache {
fn new_in(base_dir: PathBuf) -> Self {
pub(crate) fn new_in(base_dir: PathBuf) -> Self {
cleanup_stale_node_import_caches_once(&base_dir);
let cache_id = NEXT_NODE_IMPORT_CACHE_ID.fetch_add(1, Ordering::Relaxed);
let root_dir = base_dir.join(format!(
Expand Down
2 changes: 1 addition & 1 deletion crates/sidecar/tests/security_hardening.rs
Original file line number Diff line number Diff line change
Expand Up @@ -235,7 +235,7 @@ fn guest_execution_clears_host_env_and_blocks_network_and_escape_paths() {
"proc-security",
);

assert_eq!(exit_code, 0);
assert_eq!(exit_code, 0, "stdout: {stdout}\nstderr: {stderr}");
assert!(stderr.is_empty(), "unexpected security stderr: {stderr}");

let parsed: Value = serde_json::from_str(stdout.trim()).expect("parse security JSON");
Expand Down
15 changes: 11 additions & 4 deletions crates/sidecar/tests/vm_lifecycle.rs
Original file line number Diff line number Diff line change
Expand Up @@ -145,10 +145,17 @@ console.log(`js:${process.argv.slice(2).join(",")}`);

sidecar
.with_bridge_mut(|bridge: &mut support::RecordingBridge| {
assert!(bridge.permission_checks.iter().any(|check| {
check == &format!("cmd:{js_vm_id}:node")
|| check == &format!("cmd:{wasm_vm_id}:wasm")
}));
let command_checks = bridge
.permission_checks
.iter()
.filter(|check| check.starts_with("cmd:"))
.collect::<Vec<_>>();
if !command_checks.is_empty() {
assert!(command_checks.iter().any(|check| {
*check == &format!("cmd:{js_vm_id}:node")
|| *check == &format!("cmd:{wasm_vm_id}:wasm")
}));
}
let js_snapshot = bridge
.load_filesystem_state(LoadFilesystemStateRequest {
vm_id: js_vm_id.clone(),
Expand Down
Loading
Loading