Kernel result_handler conflates execution status with semantic status; domain failures reported as success

The Kernel's `result_handler` evaluates whether the sandbox executed code without raising an unhandled exception. If the code runs to completion and returns any output — including a payload explicitly reporting failure — the outer result envelope is marked `"status": "success"`.

## Observed Behavior

An agent executes code that explicitly returns a failure payload. The result envelope returned to the caller:

```json
{
  "status": "success",
  "output": "{\"success\": false, \"status\": \"delegated_to_researcher\", \"reason\": \"Could not complete the operation\"}"
}
```

The outer `status` field reports success. The inner payload reports failure. Any monitoring system observing `result["status"]` will report 100% success rates while end-users are receiving failed responses.

## Root Cause

The Kernel conflates two distinct concepts under a single field:

| Concept | Question |
|---|---|
| **Execution Status** | "Did the sandbox run the code without crashing?" |
| **Semantic Status** | "Did the code achieve the user's intended goal?" |

Only execution status is tracked. Semantic status is never evaluated.

## Proposed Solutions

**Option A — Automatic semantic signal:** The Kernel inspects the return value for common failure signals (`"success": false`, `"error": "..."`, `None`, empty collections) and surfaces a separate `semantic_success: bool` in the result envelope without breaking the existing `status` contract.

**Option B — Developer-defined evaluator hook:**

```python
class DataAgent(AutoAgent):
    def evaluate_result(self, result: dict) -> bool:
        """Return True only if the result meaningfully achieved the task."""
        return result.get("success", False) and result.get("data") is not None
```

The Kernel calls this post-execution and sets `semantic_success` accordingly. This is more accurate for domain-specific agents where "success" has business-logic meaning.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Kernel result_handler conflates execution status with semantic status; domain failures reported as success #37

Observed Behavior

Root Cause

Proposed Solutions

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Concept	Question
Execution Status	"Did the sandbox run the code without crashing?"
Semantic Status	"Did the code achieve the user's intended goal?"

Kernel result_handler conflates execution status with semantic status; domain failures reported as success #37

Description

Observed Behavior

Root Cause

Proposed Solutions

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions