Kernel result_handler conflates execution status with semantic status; domain failures reported as success #37

@sangalo20

Description

The Kernel's result_handler only checks whether the sandbox executed the code without raising an unhandled exception. If the code runs to completion and returns any output, including a payload that explicitly reports failure, the outer result envelope is marked "status": "success".

Observed Behavior

An agent executes code that explicitly returns a failure payload. The result envelope returned to the caller:

{
  "status": "success",
  "output": "{\"success\": false, \"status\": \"delegated_to_researcher\", \"reason\": \"Could not complete the operation\"}"
}

The outer status field reports success, while the inner payload reports failure. Any monitoring system that reads only result["status"] will report a 100% success rate while end users receive failed responses.
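The mismatch can be reproduced without the Kernel by decoding the envelope shown above. This is a minimal sketch: the helper name outer_and_inner_status is hypothetical, and it assumes only what the report shows, namely that "output" holds a JSON-encoded string.

```python
import json
from typing import Optional, Tuple

# Envelope taken from the report above; "output" is a JSON-encoded string.
envelope = {
    "status": "success",
    "output": json.dumps({
        "success": False,
        "status": "delegated_to_researcher",
        "reason": "Could not complete the operation",
    }),
}


def outer_and_inner_status(envelope: dict) -> Tuple[Optional[str], Optional[bool]]:
    """Return (outer status, inner "success" flag, or None if it cannot be read)."""
    outer = envelope.get("status")
    try:
        inner = json.loads(envelope.get("output", ""))
    except (TypeError, ValueError):
        # Output is missing or is not valid JSON: no inner flag to report.
        return outer, None
    if not isinstance(inner, dict):
        return outer, None
    return outer, inner.get("success")


outer, inner = outer_and_inner_status(envelope)
print(outer, inner)  # success False
```

A dashboard keyed on the first value counts this call as a success; one keyed on the second counts it as a failure.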

Root Cause

The Kernel conflates two distinct concepts under a single field:

| Concept | Question |
| --- | --- |
| Execution Status | "Did the sandbox run the code without crashing?" |
| Semantic Status | "Did the code achieve the user's intended goal?" |

Only execution status is tracked. Semantic status is never evaluated.

Proposed Solutions

Option A — Automatic semantic signal: The Kernel inspects the return value for common failure signals ("success": false, "error": "...", None, empty collections) and surfaces a separate semantic_success: bool in the result envelope without breaking the existing status contract.
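One way Option A could look is sketched below. The function names infer_semantic_success and annotate are hypothetical and not part of the Kernel; the heuristic covers only the signals listed above ("success": false, a non-empty "error", None, empty collections) and assumes the sandbox output may arrive as a JSON string.

```python
import json
from typing import Any


def infer_semantic_success(value: Any) -> bool:
    """Heuristic: False when the return value carries a common failure signal."""
    # Sandbox output may be a JSON string; decode it when possible.
    if isinstance(value, str):
        try:
            value = json.loads(value)
        except ValueError:
            # Plain text: only an empty or whitespace-only string counts as failure.
            return bool(value.strip())
    if value is None:
        return False
    if isinstance(value, dict):
        if value.get("success") is False or value.get("error"):
            return False
        return len(value) > 0
    if isinstance(value, (list, tuple, set)):
        return len(value) > 0
    return True


def annotate(envelope: dict) -> dict:
    """Return a copy with semantic_success added; "status" is left untouched."""
    ran_ok = envelope.get("status") == "success"
    semantic = ran_ok and infer_semantic_success(envelope.get("output"))
    return {**envelope, "semantic_success": semantic}
```

Because the existing status key is copied through unchanged, callers that read only status keep working, and new callers can opt in to semantic_success. A heuristic like this will misclassify payloads that use other conventions, which is the gap Option B addresses.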

Option B — Developer-defined evaluator hook:

class DataAgent(AutoAgent):
    def evaluate_result(self, result: dict) -> bool:
        """Return True only if the result meaningfully achieved the task."""
        return result.get("success", False) and result.get("data") is not None

The Kernel calls this post-execution and sets semantic_success accordingly. This is more accurate for domain-specific agents where "success" has business-logic meaning.
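The post-execution call could be wired roughly as follows. This is a sketch under stated assumptions: the AutoAgent class here is a stand-in because the real base class is not shown in this issue, finalize is a hypothetical name for the Kernel step, and the output is assumed to be a JSON-encoded object.

```python
import json


class AutoAgent:
    """Stand-in for the framework's base class (assumption, not the real one)."""

    def evaluate_result(self, result: dict) -> bool:
        # Default hook: defer to the payload's own "success" flag if present.
        return bool(result.get("success", True))


class DataAgent(AutoAgent):
    def evaluate_result(self, result: dict) -> bool:
        """Return True only if the result meaningfully achieved the task."""
        return bool(result.get("success", False)) and result.get("data") is not None


def finalize(agent: AutoAgent, envelope: dict) -> dict:
    """Hypothetical Kernel step: add semantic_success, keep "status" as is."""
    semantic = False
    if envelope.get("status") == "success":
        try:
            payload = json.loads(envelope.get("output", ""))
            semantic = isinstance(payload, dict) and bool(agent.evaluate_result(payload))
        except Exception:
            # Undecodable output or a raising hook is treated as a semantic failure.
            semantic = False
    return {**envelope, "semantic_success": semantic}
```

Treating a raising hook as a semantic failure, rather than letting the exception propagate, keeps a buggy evaluator from turning a completed execution into a crash. Whether that trade-off is right for the Kernel is an open design question.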

Labels

bug, question
