feat(witness): tool execution witness recording + zkperf-service integration#470

Open
jmikedupont2 wants to merge 2 commits into moltis-org:main from meta-introspector:pr/zkperf-witness

Conversation

@jmikedupont2

Add best-effort witness recording for every tool execution, enabling performance monitoring and audit trails.

Changes

  • record_tool_witness() in runner.rs — records tool name, params, elapsed time, result/error for every tool call
  • witness_download tool — allows agents to download witness records
  • Integration with zkperf-service (HTTP API on 127.0.0.1:9718) for external perf monitoring with perf record correlation
  • Falls back to local file write (~/.local/share/moltis/witness/) when zkperf-service is unavailable

Architecture

moltis tool call → record_tool_witness() → POST /boundary to zkperf-service (fire-and-forget, 50ms timeout)
                                         → local JSON file (fallback)

Testing

  • Compiles clean on Linux and Windows
  • Tested with zkperf-service attached to moltis PID via perf record

mike dupont added 2 commits March 23, 2026 15:46
- record_tool_witness() in runner.rs wraps ALL tool.execute() calls
- Records tool name, params, elapsed_ms, result/error, platform to
  ~/.moltis/witness/<timestamp>_<tool>.witness.json
- New witness_download tool: collects witness logs, returns base64 tar.gz
  Supports last_n and tool_filter parameters
- Removed web_search-specific witness code (now redundant)
- Best-effort: witness recording never blocks tool results
record_tool_witness() now:
- POSTs boundary data to zkperf-service on 127.0.0.1:9718 (fire-and-forget, 50ms timeout)
- Falls back to local file write for offline operation
- Includes PID for perf record correlation
- Same call sites, no changes to tool execution flow
@greptile-apps

greptile-apps bot commented Mar 23, 2026

Greptile Summary

This PR adds best-effort witness recording around every tool execution — posting a boundary notification to an optional local zkperf-service and writing a fallback JSON file — plus a witness_download agent tool to retrieve those files as a base64 tar archive. The architecture is sound in intent, but the implementation has several significant correctness and security issues that need to be addressed before merging.

Key issues found:

  • Blocking I/O on Tokio async threads (P0): record_tool_witness is a sync function called directly from async code. It performs std::net::TcpStream::connect_timeout (up to 50 ms blocking) and std::fs::write (unbounded blocking) on the Tokio worker thread. The inline comment calling this "never blocks" is incorrect. This will cause latency spikes and executor starvation under concurrent load. The function must be refactored to use tokio::task::spawn with async I/O, or at minimum spawn_blocking.
  • No write timeout on TCP stream (P0): The 50 ms timeout only covers TCP connection setup. The subsequent write! has no timeout; if zkperf-service stalls after accepting the connection, the write blocks indefinitely on the executor thread.
  • Sensitive data written to disk unredacted (P1): Full tool params and result are persisted verbatim to ~/.local/share/moltis/witness/. These frequently contain API keys, file contents, and other secrets. Combined with the witness_download tool being available to every agent with no access control, this creates an easy cross-session data exfiltration path.
  • Blocking I/O in WitnessDownloadTool::execute (P1): std::fs::read_dir and per-file std::fs::read are blocking calls inside an async fn, violating the project's "async all the way down" rule.
  • Filename timestamp collision (P1): Second-precision timestamps in witness filenames silently overwrite records when two tools of the same name complete within the same second — common during parallel tool execution.
  • Missing Content-Type header (P2): The hand-rolled HTTP POST to zkperf-service omits Content-Type: application/json.
  • Manual epoch arithmetic (P2): std::time::SystemTime::now().duration_since(UNIX_EPOCH) used three times, against CLAUDE.md's explicit rule to use the time crate instead.

Confidence Score: 2/5

  • Not safe to merge — blocking I/O on Tokio threads will cause production latency regressions, and unredacted secret logging creates a security vulnerability.
  • Two P0 runtime issues (blocking syscalls on async executor threads, unbounded TCP write) that will degrade throughput under any concurrent load, plus a P1 security issue (unredacted params/results accessible to all agents). These are not theoretical; they affect every single tool invocation.
  • crates/agents/src/runner.rs (blocking I/O + sensitive data) and crates/tools/src/witness_download.rs (blocking I/O + missing access control) require the most attention before merging.

Important Files Changed

Filename Overview
crates/agents/src/runner.rs Adds record_tool_witness() called at every tool execution site (two call sites, four total invocations). Critical issues: blocking std I/O (TCP connect + file write) on Tokio executor thread, no write timeout on TCP stream, unredacted sensitive params/results written to disk, second-precision filename collisions, and manual epoch arithmetic violating style guide.
crates/tools/src/witness_download.rs New tool that tars and base64-encodes all local witness files for any requesting agent. Blocking std::fs I/O in async execute method; no access control limiting which agents can call it; no cap on total archive size (potential OOM with many/large witness files).
crates/tools/src/lib.rs Adds pub mod witness_download module declaration; otherwise unchanged.
crates/gateway/src/server.rs Registers WitnessDownloadTool in the global tool registry at server startup; single-line change, no direct issues here beyond the tool itself being available to all agents.
crates/tools/Cargo.toml Adds reqwest blocking feature and tar/flate2 dependencies to support witness_download; changes are minimal and correctly use workspace references.

Sequence Diagram

sequenceDiagram
    participant A as Async Tool Loop
    participant W as record_tool_witness()
    participant T as std::net::TcpStream
    participant Z as zkperf-service :9718
    participant F as ~/.moltis/witness/*.json

    A->>W: call (sync, on Tokio thread)
    Note over W: ⚠️ Blocks executor thread
    W->>T: connect_timeout(50ms) — blocking syscall
    alt zkperf-service reachable
        T-->>W: Ok(stream)
        W->>Z: write! HTTP POST /boundary — NO write timeout ⚠️
        Z-->>W: (response ignored / not read)
    else connection refused/timeout
        T-->>W: Err(_) — ignored
    end
    W->>F: std::fs::write — blocking syscall ⚠️
    F-->>W: Ok(())
    W-->>A: returns (unit, errors swallowed)

    Note over A,F: witness_download tool later reads all *.json files<br/>and returns them base64-encoded to any requesting agent

Reviews (1): Last reviewed commit: "refactor: replace inline witness recordi..."

Comment on lines +35 to +53
if let Ok(mut stream) = std::net::TcpStream::connect_timeout(
    &"127.0.0.1:9718".parse()?, std::time::Duration::from_millis(50),
) {
    use std::io::Write;
    let payload = body.to_string();
    let _ = write!(stream,
        "POST /boundary HTTP/1.1\r\nHost: localhost\r\nContent-Length: {}\r\n\r\n{}",
        payload.len(), payload);
}
// Also write locally for offline operation
let dir = moltis_config::data_dir().join("witness");
std::fs::create_dir_all(&dir)?;
let slug: String = tool_name.chars().take(30).collect();
let ts = std::time::SystemTime::now()
    .duration_since(std::time::UNIX_EPOCH)?.as_secs();
std::fs::write(
    dir.join(format!("{ts}_{slug}.witness.json")),
    serde_json::to_string(&body)?,
)?;

P0 Blocking I/O executed on Tokio async executor thread

record_tool_witness is a synchronous function called directly from an async context (the tool execution loop). It performs two categories of blocking syscalls on the Tokio worker thread:

  1. std::net::TcpStream::connect_timeout — up to 50 ms blocking connect. Even though the timeout is short, this stalls the async thread and can cause latency spikes and head-of-line blocking for all other tasks on that thread, especially under load when many tools run concurrently.
  2. std::fs::create_dir_all / std::fs::write — unbounded blocking file I/O. On a slow disk or NFS mount, this can block for much longer.

The comment on line 8 says "never blocks", which is factually incorrect — both operations are blocking.

The project style guide (CLAUDE.md) is explicit: "Async all the way down — never block_on in async context. All HTTP/IO must be async."

The function should be refactored to use tokio::task::spawn_blocking for the file I/O, and tokio::net::TcpStream or an async HTTP client for the network call. Because this is fire-and-forget, spawning a detached task is the right pattern:

// Spawn a detached task so recording never delays tool execution
tokio::task::spawn(async move {
    // async TCP write to zkperf-service
    // tokio::fs::write for local fallback
});

Context Used: CLAUDE.md (source)

Comment on lines +40 to +42
let _ = write!(stream,
"POST /boundary HTTP/1.1\r\nHost: localhost\r\nContent-Length: {}\r\n\r\n{}",
payload.len(), payload);

P0 TCP write has no timeout — can block indefinitely

The 50 ms timeout on line 36 only covers the TCP connection handshake (connect_timeout). Once the connection is established, the subsequent write! call on line 40 has no timeout whatsoever.

If the zkperf-service accepts the TCP connection but is slow to drain its receive buffer (e.g., busy processing, GC pause, or near-full buffer), the write! blocks once the OS send buffer fills. On Linux, the default SO_SNDBUF is 128–212 KB, so a small JSON payload will usually fit immediately — but under pathological conditions (loopback congestion, service crash mid-accept) the write can stall for the full TCP retransmit timeout (minutes).

This is particularly dangerous here because, as noted in the blocking I/O comment, this runs on a Tokio executor thread.

Fix: set a write timeout before writing:

stream.set_write_timeout(Some(std::time::Duration::from_millis(50)))?;

Comment on lines +21 to +33
let body = serde_json::json!({
    "pid": pid,
    "sig": sig,
    "tool": tool_name,
    "params": params,
    "elapsed_ms": elapsed.as_millis() as u64,
    "success": error.is_none(),
    "error": error,
    "result": result,
    "timestamp": std::time::SystemTime::now()
        .duration_since(std::time::UNIX_EPOCH)?.as_secs(),
    "platform": std::env::consts::OS,
});

P1 Full tool params and results written to disk unredacted

The witness JSON includes the raw params and result of every tool call. Tool parameters frequently contain sensitive material — file paths with personal data, web fetch URLs with credentials in the query string, exec commands, shell output, and any secrets passed as tool arguments.

Writing this verbatim to ~/.local/share/moltis/witness/ (world-readable by default on most Linux configurations) violates the project's secrecy policy:

"Use secrecy::Secret<String> for all passwords/keys/tokens. expose_secret() only at consumption point."

Additionally, the witness_download tool lets any agent in any session retrieve this data as a base64 tar archive with no authentication check, making this an easy exfiltration vector: a compromised or malicious agent can call witness_download to collect the execution history of all other tools across all sessions.

At minimum, consider:

  • Omitting or hashing params/result in the local file (record only tool name + timing + success/failure).
  • Restricting witness_download to privileged/admin agents only, mirroring the existing policy layer in crates/tools/src/policy.rs.

Comment on lines +51 to +63
let mut entries: Vec<_> = std::fs::read_dir(&dir)?
    .filter_map(|e| e.ok())
    .filter(|e| e.path().extension().is_some_and(|x| x == "json"))
    .filter(|e| {
        tool_filter.map_or(true, |f| {
            e.file_name().to_string_lossy().contains(f)
        })
    })
    .collect();

entries.sort_by_key(|e| std::cmp::Reverse(e.file_name()));
if let Some(n) = last_n {
    entries.truncate(n);

P1 Blocking file I/O in async execute method

std::fs::read_dir, std::fs::read, and entries.sort_by_key (which calls file_name() on each entry) are all blocking calls executed inside an async fn. This stalls the Tokio executor thread while enumerating and reading witness files, which can be non-trivial if many files have accumulated over time.

Per CLAUDE.md: "Async all the way down — never block_on in async context. All HTTP/IO must be async."

Wrap the entire I/O block in tokio::task::spawn_blocking:

let entries = tokio::task::spawn_blocking(move || {
    let mut entries: Vec<_> = std::fs::read_dir(&dir)?
        ...
        .collect();
    entries.sort_by_key(|e| std::cmp::Reverse(e.file_name()));
    Ok::<_, std::io::Error>(entries)
}).await??;

Then use tokio::fs::read for the individual file reads, or keep them inside the same spawn_blocking closure.

Comment on lines +47 to +52
let slug: String = tool_name.chars().take(30).collect();
let ts = std::time::SystemTime::now()
    .duration_since(std::time::UNIX_EPOCH)?.as_secs();
std::fs::write(
    dir.join(format!("{ts}_{slug}.witness.json")),
    serde_json::to_string(&body)?,

P1 Second-precision timestamp causes silent file overwrites

The witness filename is {ts}_{slug}.witness.json where ts is seconds since epoch. If two tools with the same name complete within the same second — common in parallel tool execution — the second std::fs::write silently overwrites the first, destroying a witness record with no error reported.

Use millisecond or sub-millisecond precision, or add a random suffix:

let ts = std::time::SystemTime::now()
    .duration_since(std::time::UNIX_EPOCH)?.as_millis();
let random: u32 = rand::random(); // or use a counter / uuid
std::fs::write(
    dir.join(format!("{ts}_{slug}_{random:08x}.witness.json")),
    ...
)?;

Alternatively, since sig already contains millisecond precision (line 17–18), the filename could be derived from it.

Comment on lines +40 to +42
let _ = write!(stream,
    "POST /boundary HTTP/1.1\r\nHost: localhost\r\nContent-Length: {}\r\n\r\n{}",
    payload.len(), payload);

P2 Missing Content-Type header in raw HTTP request

The hand-rolled HTTP request omits the Content-Type: application/json header. Most HTTP servers will accept the body regardless, but without this header the zkperf-service cannot reliably determine how to parse the payload. Standard HTTP practice requires this header for POST bodies with a JSON content type.

Suggested change
- let _ = write!(stream,
-     "POST /boundary HTTP/1.1\r\nHost: localhost\r\nContent-Length: {}\r\n\r\n{}",
-     payload.len(), payload);
+ let _ = write!(stream,
+     "POST /boundary HTTP/1.1\r\nHost: localhost\r\nContent-Type: application/json\r\nContent-Length: {}\r\n\r\n{}",
+     payload.len(), payload);

Comment on lines +17 to +18
let sig = format!("moltis:{}:{}", tool_name, std::time::SystemTime::now()
    .duration_since(std::time::UNIX_EPOCH)?.as_millis());

P2 Manual epoch arithmetic violates project style guide

std::time::SystemTime::now().duration_since(std::time::UNIX_EPOCH) is used three times in this function (lines 17–18, 30–31, 48–49). CLAUDE.md explicitly states:

"Use time crate (workspace dep) for date/time — no manual epoch math or magic constants like 86400."

Use time::OffsetDateTime::now_utc() and its unix_timestamp() / unix_timestamp_nanos() helpers instead. The time crate is already a workspace dependency (crates/tools/Cargo.toml).

Context Used: CLAUDE.md (source)

