The Kernel's result_handler evaluates whether the sandbox executed code without raising an unhandled exception. If the code runs to completion and returns any output — including a payload explicitly reporting failure — the outer result envelope is marked "status": "success".
Observed Behavior
An agent executes code that explicitly returns a failure payload. The result envelope returned to the caller:
{
"status": "success",
"output": "{\"success\": false, \"status\": \"delegated_to_researcher\", \"reason\": \"Could not complete the operation\"}"
}
The outer status field reports success. The inner payload reports failure. Any monitoring system observing result["status"] will report 100% success rates while end-users are receiving failed responses.
Root Cause
The Kernel conflates two distinct concepts under a single field:
| Concept |
Question |
| Execution Status |
"Did the sandbox run the code without crashing?" |
| Semantic Status |
"Did the code achieve the user's intended goal?" |
Only execution status is tracked. Semantic status is never evaluated.
Proposed Solutions
Option A — Automatic semantic signal: The Kernel inspects the return value for common failure signals ("success": false, "error": "...", None, empty collections) and surfaces a separate semantic_success: bool in the result envelope without breaking the existing status contract.
Option B — Developer-defined evaluator hook:
class DataAgent(AutoAgent):
def evaluate_result(self, result: dict) -> bool:
"""Return True only if the result meaningfully achieved the task."""
return result.get("success", False) and result.get("data") is not None
The Kernel calls this post-execution and sets semantic_success accordingly. This is more accurate for domain-specific agents where "success" has business-logic meaning.
The Kernel's
result_handlerevaluates whether the sandbox executed code without raising an unhandled exception. If the code runs to completion and returns any output — including a payload explicitly reporting failure — the outer result envelope is marked"status": "success".Observed Behavior
An agent executes code that explicitly returns a failure payload. The result envelope returned to the caller:
{ "status": "success", "output": "{\"success\": false, \"status\": \"delegated_to_researcher\", \"reason\": \"Could not complete the operation\"}" }The outer
statusfield reports success. The inner payload reports failure. Any monitoring system observingresult["status"]will report 100% success rates while end-users are receiving failed responses.Root Cause
The Kernel conflates two distinct concepts under a single field:
Only execution status is tracked. Semantic status is never evaluated.
Proposed Solutions
Option A — Automatic semantic signal: The Kernel inspects the return value for common failure signals (
"success": false,"error": "...",None, empty collections) and surfaces a separatesemantic_success: boolin the result envelope without breaking the existingstatuscontract.Option B — Developer-defined evaluator hook:
The Kernel calls this post-execution and sets
semantic_successaccordingly. This is more accurate for domain-specific agents where "success" has business-logic meaning.