feat(witness): tool execution witness recording + zkperf-service integration #470

jmikedupont2 wants to merge 2 commits into moltis-org:main
Conversation
- `record_tool_witness()` in runner.rs wraps ALL `tool.execute()` calls
- Records tool name, params, elapsed_ms, result/error, platform to `~/.moltis/witness/<timestamp>_<tool>.witness.json`
- New `witness_download` tool: collects witness logs, returns base64 tar.gz; supports `last_n` and `tool_filter` parameters
- Removed web_search-specific witness code (now redundant)
- Best-effort: witness recording never blocks tool results
`record_tool_witness()` now:

- POSTs boundary data to zkperf-service on 127.0.0.1:9718 (fire-and-forget, 50 ms timeout)
- Falls back to local file write for offline operation
- Includes PID for `perf record` correlation
- Same call sites, no changes to tool execution flow
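Based on the fields assembled in the `serde_json::json!` call shown in the review below, a single witness record plausibly looks like this (all values illustrative, not taken from a real run):

```json
{
  "pid": 12345,
  "sig": "moltis:web_fetch:1717171717171",
  "tool": "web_fetch",
  "params": { "url": "https://example.com" },
  "elapsed_ms": 42,
  "success": true,
  "error": null,
  "result": "…",
  "timestamp": 1717171717,
  "platform": "linux"
}
```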
Greptile Summary

This PR adds best-effort witness recording around every tool execution — posting a boundary notification to an optional local zkperf-service, with a local file fallback.

Key issues found:
Confidence Score: 2/5
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant A as Async Tool Loop
    participant W as record_tool_witness()
    participant T as std::net::TcpStream
    participant Z as zkperf-service :9718
    participant F as ~/.moltis/witness/*.json
    A->>W: call (sync, on Tokio thread)
    Note over W: ⚠️ Blocks executor thread
    W->>T: connect_timeout(50ms) — blocking syscall
    alt zkperf-service reachable
        T-->>W: Ok(stream)
        W->>Z: write! HTTP POST /boundary — NO write timeout ⚠️
        Z-->>W: (response ignored / not read)
    else connection refused/timeout
        T-->>W: Err(_) — ignored
    end
    W->>F: std::fs::write — blocking syscall ⚠️
    F-->>W: Ok(())
    W-->>A: returns (unit, errors swallowed)
    Note over A,F: witness_download tool later reads all *.json files<br/>and returns them base64-encoded to any requesting agent
```
Reviews (1): Last reviewed commit: "refactor: replace inline witness recordi..."
```rust
if let Ok(mut stream) = std::net::TcpStream::connect_timeout(
    &"127.0.0.1:9718".parse()?, std::time::Duration::from_millis(50),
) {
    use std::io::Write;
    let payload = body.to_string();
    let _ = write!(stream,
        "POST /boundary HTTP/1.1\r\nHost: localhost\r\nContent-Length: {}\r\n\r\n{}",
        payload.len(), payload);
}
// Also write locally for offline operation
let dir = moltis_config::data_dir().join("witness");
std::fs::create_dir_all(&dir)?;
let slug: String = tool_name.chars().take(30).collect();
let ts = std::time::SystemTime::now()
    .duration_since(std::time::UNIX_EPOCH)?.as_secs();
std::fs::write(
    dir.join(format!("{ts}_{slug}.witness.json")),
    serde_json::to_string(&body)?,
)?;
```
Blocking I/O executed on Tokio async executor thread
record_tool_witness is a synchronous function called directly from an async context (the tool execution loop). It performs two categories of blocking syscalls on the Tokio worker thread:
- `std::net::TcpStream::connect_timeout` — up to 50 ms blocking connect. Even though the timeout is short, this stalls the async thread and can cause latency spikes and head-of-line blocking for all other tasks on that thread, especially under load when many tools run concurrently.
- `std::fs::create_dir_all` / `std::fs::write` — unbounded blocking file I/O. On a slow disk or NFS mount, this can block for much longer.
The comment on line 8 says "never blocks", which is factually incorrect — both operations are blocking.
The project style guide (CLAUDE.md) is explicit: "Async all the way down — never block_on in async context. All HTTP/IO must be async."
The function should be refactored to use tokio::task::spawn_blocking for the file I/O, and tokio::net::TcpStream or an async HTTP client for the network call. Because this is fire-and-forget, spawning a detached task is the right pattern:
```rust
// Spawn a detached task so recording never delays tool execution
tokio::task::spawn(async move {
    // async TCP write to zkperf-service
    // tokio::fs::write for local fallback
});
```

Context Used: CLAUDE.md (source)
```rust
let _ = write!(stream,
    "POST /boundary HTTP/1.1\r\nHost: localhost\r\nContent-Length: {}\r\n\r\n{}",
    payload.len(), payload);
```
TCP write has no timeout — can block indefinitely
The 50 ms timeout on line 36 only covers the TCP connection handshake (connect_timeout). Once the connection is established, the subsequent write! call on line 40 has no timeout whatsoever.
If the zkperf-service accepts the TCP connection but is slow to drain its receive buffer (e.g., busy processing, GC pause, or near-full buffer), the `write!` will block once the OS send buffer fills up. On Linux, the default SO_SNDBUF is 128–212 KB, so a small JSON payload will fit immediately — but under pathological conditions (loopback congestion, service crash mid-accept) the write can stall for the full TCP retransmit timeout (minutes).
This is particularly dangerous here because, as noted in the blocking I/O comment, this runs on a Tokio executor thread.
Fix: set a write timeout before writing:
```rust
stream.set_write_timeout(Some(std::time::Duration::from_millis(50)))?;
```

```rust
let body = serde_json::json!({
    "pid": pid,
    "sig": sig,
    "tool": tool_name,
    "params": params,
    "elapsed_ms": elapsed.as_millis() as u64,
    "success": error.is_none(),
    "error": error,
    "result": result,
    "timestamp": std::time::SystemTime::now()
        .duration_since(std::time::UNIX_EPOCH)?.as_secs(),
    "platform": std::env::consts::OS,
});
```
Full tool params and results written to disk unredacted
The witness JSON includes the raw params and result of every tool call. Tool parameters frequently contain sensitive material — file paths with personal data, web fetch URLs with credentials in the query string, exec commands, shell output, and any secrets passed as tool arguments.
Writing this verbatim to ~/.local/share/moltis/witness/ (world-readable by default on most Linux configurations) violates the project's secrecy policy:
> "Use `secrecy::Secret<String>` for all passwords/keys/tokens. `expose_secret()` only at consumption point."
Additionally, the witness_download tool lets any agent in any session retrieve this data as a base64 tar archive with no authentication check, making this an easy exfiltration vector: a compromised or malicious agent can call witness_download to collect the execution history of all other tools across all sessions.
At minimum, consider:

- Omitting or hashing `params`/`result` in the local file (record only tool name + timing + success/failure).
- Restricting `witness_download` to privileged/admin agents only, mirroring the existing policy layer in `crates/tools/src/policy.rs`.
```rust
let mut entries: Vec<_> = std::fs::read_dir(&dir)?
    .filter_map(|e| e.ok())
    .filter(|e| e.path().extension().is_some_and(|x| x == "json"))
    .filter(|e| {
        tool_filter.map_or(true, |f| {
            e.file_name().to_string_lossy().contains(f)
        })
    })
    .collect();

entries.sort_by_key(|e| std::cmp::Reverse(e.file_name()));
if let Some(n) = last_n {
    entries.truncate(n);
```
Blocking file I/O in async `execute` method
std::fs::read_dir, std::fs::read, and entries.sort_by_key (which calls file_name() on each entry) are all blocking calls executed inside an async fn. This stalls the Tokio executor thread while enumerating and reading witness files, which can be non-trivial if many files have accumulated over time.
Per CLAUDE.md: "Async all the way down — never block_on in async context. All HTTP/IO must be async."
Wrap the entire I/O block in tokio::task::spawn_blocking:
```rust
let entries = tokio::task::spawn_blocking(move || {
    let mut entries: Vec<_> = std::fs::read_dir(&dir)?
        ...
        .collect();
    entries.sort_by_key(|e| std::cmp::Reverse(e.file_name()));
    Ok::<_, std::io::Error>(entries)
}).await??;
```

Then use `tokio::fs::read` for the individual file reads, or keep them inside the same `spawn_blocking` closure.
```rust
let slug: String = tool_name.chars().take(30).collect();
let ts = std::time::SystemTime::now()
    .duration_since(std::time::UNIX_EPOCH)?.as_secs();
std::fs::write(
    dir.join(format!("{ts}_{slug}.witness.json")),
    serde_json::to_string(&body)?,
```
Second-precision timestamp causes silent file overwrites
The witness filename is {ts}_{slug}.witness.json where ts is seconds since epoch. If two tools with the same name complete within the same second — common in parallel tool execution — the second std::fs::write silently overwrites the first, destroying a witness record with no error reported.
Use millisecond or sub-millisecond precision, or add a random suffix:
```rust
let ts = std::time::SystemTime::now()
    .duration_since(std::time::UNIX_EPOCH)?.as_millis();
let random: u32 = rand::random(); // or use a counter / uuid
std::fs::write(
    dir.join(format!("{ts}_{slug}_{random:08x}.witness.json")),
    ...
)?;
```

Alternatively, since `sig` already contains millisecond precision (lines 17–18), the filename could be derived from it.
```rust
let _ = write!(stream,
    "POST /boundary HTTP/1.1\r\nHost: localhost\r\nContent-Length: {}\r\n\r\n{}",
    payload.len(), payload);
```
Missing `Content-Type` header in raw HTTP request
The hand-rolled HTTP request omits the Content-Type: application/json header. Most HTTP servers will accept the body regardless, but without this header the zkperf-service cannot reliably determine how to parse the payload. Standard HTTP practice requires this header for POST bodies with a JSON content type.
Suggested change:

```rust
// before
let _ = write!(stream,
    "POST /boundary HTTP/1.1\r\nHost: localhost\r\nContent-Length: {}\r\n\r\n{}",
    payload.len(), payload);

// after
let _ = write!(stream,
    "POST /boundary HTTP/1.1\r\nHost: localhost\r\nContent-Type: application/json\r\nContent-Length: {}\r\n\r\n{}",
    payload.len(), payload);
```
```rust
let sig = format!("moltis:{}:{}", tool_name, std::time::SystemTime::now()
    .duration_since(std::time::UNIX_EPOCH)?.as_millis());
```
Manual epoch arithmetic violates project style guide
std::time::SystemTime::now().duration_since(std::time::UNIX_EPOCH) is used three times in this function (lines 17–18, 30–31, 48–49). CLAUDE.md explicitly states:
> "Use `time` crate (workspace dep) for date/time — no manual epoch math or magic constants like `86400`."
Use time::OffsetDateTime::now_utc() and its unix_timestamp() / unix_timestamp_nanos() helpers instead. The time crate is already a workspace dependency (crates/tools/Cargo.toml).
Context Used: CLAUDE.md (source)
Add best-effort witness recording for every tool execution, enabling performance monitoring and audit trails.
Changes
- `record_tool_witness()` in runner.rs — records tool name, params, elapsed time, result/error for every tool call
- `witness_download` tool — allows agents to download witness records
- PID included for `perf record` correlation
- Local file fallback (`~/.local/share/moltis/witness/`) when zkperf-service is unavailable

Architecture
Testing
perf record