Summary
In fuzz testing with our internal agent-simulation harness, agents frequently call tools that are not in their available tool set: wrong names, hallucinated tools, or tools from a different agent's registry. The call fails and wastes the turn, and it often starts a retry loop.
Simulated example
available tools: [list_dir, read_file, http_get]
→ search_files(query="config") # not in the available set
← error: unknown tool "search_files"
→ search_files(query="config") # retries the same nonexistent tool
This is a common trigger for the no-progress loops tracked separately.
Expected behavior
Agents call only tools in their current available set. An invalid name is caught before dispatch and answered with a corrective signal that surfaces the available set, rather than failing silently into a retry loop.
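A minimal sketch of that early check, assuming the registry is a plain mapping from tool name to callable. `ToolResult` and `dispatch` are hypothetical names used for illustration only; they are not the harness's actual API.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional


@dataclass
class ToolResult:
    """Hypothetical structured result returned to the agent."""
    ok: bool
    value: Any = None
    error: Optional[str] = None
    available: List[str] = field(default_factory=list)


def dispatch(registry: Dict[str, Callable[..., Any]], name: str, **kwargs: Any) -> ToolResult:
    """Check the name against the live registry before invoking anything."""
    if name not in registry:
        # Corrective signal: name the problem and surface the valid choices.
        return ToolResult(
            ok=False,
            error=f'unknown tool "{name}"',
            available=sorted(registry),
        )
    return ToolResult(ok=True, value=registry[name](**kwargs))


# Stub tools mirroring the simulated example (assumed signatures).
registry = {
    "list_dir": lambda path=".": [],
    "read_file": lambda path: "",
    "http_get": lambda url: "",
}

result = dispatch(registry, "search_files", query="config")
```

Here `result.error` names the unknown tool and `result.available` lists the valid ones, so the agent's next turn has what it needs to pick a real tool instead of retrying the same name.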
Suggested fix / acceptance
- Validate tool name against the live available set before dispatch; return a structured "unknown tool — available: [...]" error.
- Keep the advertised tool list and the actual registry in sync per agent/tier.
- Constrain tool selection to the available set where possible (schema/enum).
- Harness regression: assert zero unknown-tool calls across simulated runs.
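The last two items could be sketched as follows, assuming tool calls are described with a JSON Schema style object and that the harness can export each run as `(agent, tool)` pairs. Both `tool_call_schema` and `unknown_tool_calls` are hypothetical helpers, and the transcript format is an assumption.

```python
from typing import Dict, Iterable, List, Set, Tuple


def tool_call_schema(available: Iterable[str]) -> dict:
    """Schema fragment that restricts `name` to the available tools via `enum`."""
    return {
        "type": "object",
        "properties": {
            "name": {"type": "string", "enum": sorted(available)},
            "arguments": {"type": "object"},
        },
        "required": ["name", "arguments"],
    }


def unknown_tool_calls(
    calls: Iterable[Tuple[str, str]],
    registries: Dict[str, Set[str]],
) -> List[Tuple[str, str]]:
    """Return every (agent, tool) call whose tool is not in that agent's registry."""
    return [
        (agent, tool)
        for agent, tool in calls
        if tool not in registries.get(agent, set())
    ]


# Assumed per-agent registries and a recorded run (illustrative data only).
registries = {"agent_a": {"list_dir", "read_file", "http_get"}}
calls = [("agent_a", "list_dir"), ("agent_a", "search_files")]

offenders = unknown_tool_calls(calls, registries)
```

A regression test would then assert that `offenders` is empty for every simulated run. Building the schema from the same registry that dispatch uses is one way to keep the advertised list and the actual registry from drifting apart.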
Surfaced by our internal agent-simulation harness during large-scale, aggressive fuzz testing of agent behaviors. The simulated example in this report is synthetic and contains no real data.