genty live-stack: non-gpt-5.5 models pass orchestration but skip the file write (file-creation fails) #956

@tmuskal

Summary

After #936 was fully resolved, genty vanilla NI is GREEN on gpt-5.5 across all 3 OSes (Ubuntu, macOS, Windows).

The other 5 models (gpt-5.4-mini, claude-sonnet-4-6, gemini-3.5-flash, gemini-3.1-pro-preview, DeepSeek-V4-Pro) fail only the file-creation check, on all 3 OSes (runs 27372409793 / 27372411571 / 27372413000).

The #936 infrastructure all works for these models:

✓ model-response (agent responded, ~12k chars)
✓ proxy-communication
✗ file-creation: agent did not create .a5c-live-test/<id>-odyssey.md (output: 12092 chars)
✓ babysitter-run-completion: run exists with 6 journal events (>=5)
✓ babysitter-completion-proof: completed with processId + completionProof

Diagnosis

The model produces the full odyssey content as agent output text (~12k chars) and the run completes with a valid completion proof, but the content is never written to the expected file .a5c-live-test/<sessionId>-odyssey.md. gpt-5.5 reliably authors a process whose delegated worker writes the file; the other models author or execute a process that returns the content instead of writing it (or write it to the wrong path).

This is a model-adherence / prompt-robustness gap, not an orchestration bug. Likely fix: strengthen the authoring + delegated-worker prompts so the file write to the exact target path is mandatory and verified, independent of model strength.
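
The "mandatory and verified" part could also be enforced outside the model with a post-run check. The following is a minimal sketch, not genty's actual file-creation check: the function name `verify_odyssey_file`, the `min_chars` threshold, and the search for misplaced files are assumptions made for illustration. Only the `.a5c-live-test/<sessionId>-odyssey.md` path comes from this issue.

```python
from pathlib import Path


def verify_odyssey_file(workspace, session_id, min_chars=1):
    """Return (ok, message) for the expected odyssey file in a workspace.

    Hypothetical helper: the name, min_chars, and the stray-file search are
    illustrative assumptions, not part of genty.
    """
    root = Path(workspace)
    expected = root / ".a5c-live-test" / f"{session_id}-odyssey.md"

    if expected.is_file():
        size = len(expected.read_text(encoding="utf-8"))
        if size >= min_chars:
            return True, f"found {expected.name} ({size} chars)"
        return False, f"{expected.name} exists but has only {size} chars"

    # Distinguish "never written" from "written to the wrong path".
    strays = sorted(
        str(p.relative_to(root))
        for p in root.rglob("*odyssey*.md")
        if p.is_file()
    )
    if strays:
        return False, f"expected file missing; similar files at: {strays}"
    return False, "expected file missing; no odyssey file anywhere in workspace"
```

Reporting a misplaced file separately from a missing one would help tell the two failure modes in the diagnosis apart (content returned as text versus content written to the wrong path).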

Repro (local, fast)

Set AZURE_OPENAI_API_KEY and AZURE_OPENAI_PROJECT_NAME, then run:

node packages/genty/cli/dist/cli/main.js yolo --prompt "<odyssey...save to .a5c-live-test/x.md>" --model gpt-5.4-mini --no-interactive --workspace <tmp>

and check whether the file is created.
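
To repeat the repro across models, the command can be wrapped in a small script. This is a sketch under stated assumptions: it is run from the repository root, the target path in the prompt is resolved relative to the workspace, and `build_command` and `run_repro` are hypothetical helper names. The flags and environment variable names are only those quoted in this issue, and the prompt text is left to the caller.

```python
import os
import subprocess
import tempfile
from pathlib import Path

CLI = "packages/genty/cli/dist/cli/main.js"  # path as given in this issue
REQUIRED_ENV = ("AZURE_OPENAI_API_KEY", "AZURE_OPENAI_PROJECT_NAME")


def build_command(prompt, model, workspace):
    """Assemble the genty invocation using only the flags quoted in the repro."""
    return [
        "node", CLI, "yolo",
        "--prompt", prompt,
        "--model", model,
        "--no-interactive",
        "--workspace", str(workspace),
    ]


def run_repro(prompt, target_relpath, model="gpt-5.4-mini"):
    """Run the CLI in a fresh temp workspace; return True if the file appears.

    Assumes target_relpath (e.g. ".a5c-live-test/x.md") is relative to the
    workspace, which this issue implies but does not state outright.
    """
    missing = [name for name in REQUIRED_ENV if not os.environ.get(name)]
    if missing:
        raise RuntimeError(f"missing environment variables: {missing}")
    with tempfile.TemporaryDirectory() as workspace:
        # check=False: a non-zero exit should not hide the file check result.
        subprocess.run(build_command(prompt, model, workspace), check=False)
        return (Path(workspace) / target_relpath).is_file()
```

Calling `run_repro(prompt, ".a5c-live-test/x.md", model=name)` once per model name would give a quick local pass/fail table for the file-creation check.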
