fix(flue): switch kimi automations to k2.7-code and handle 429 capacity gracefully#1490
Conversation
…city gracefully Workers AI returns 429 when a model is over capacity, and under sustained load the binding can hold a request open indefinitely. That left the deployed review workflow hung forever on a stalled inference call: no result, no posted review, just the container's keep-alive alarm firing for minutes (the cause of reviews silently not posting). - Switch every kimi usage (review, fix, both classifiers, investigate classifier) to kimi-k2.7-code, which is currently less loaded. - Add withCapacityRetry: bounds each model call with a hard per-attempt timeout (fails loudly instead of hanging) and retries genuine 429 capacity errors with exponential backoff + full jitter. - Apply it to the flue-review skill call and to all model-bearing stages of the investigate/classify workflows.
|
Scope checkThis PR changes 553 lines across 8 files. Large PRs are harder to review and more likely to be closed without review. If this scope is intentional, no action needed. A maintainer will review it. If not, please consider splitting this into smaller PRs. See CONTRIBUTING.md for contribution guidelines. |
@emdash-cms/admin
@emdash-cms/auth
@emdash-cms/auth-atproto
@emdash-cms/blocks
@emdash-cms/cloudflare
@emdash-cms/contentful-to-portable-text
emdash
create-emdash
@emdash-cms/gutenberg-to-portable-text
@emdash-cms/plugin-cli
@emdash-cms/plugin-types
@emdash-cms/registry-client
@emdash-cms/registry-lexicons
@emdash-cms/sandbox-workerd
@emdash-cms/x402
@emdash-cms/plugin-ai-moderation
@emdash-cms/plugin-atproto
@emdash-cms/plugin-audit-log
@emdash-cms/plugin-color
@emdash-cms/plugin-embeds
@emdash-cms/plugin-field-kit
@emdash-cms/plugin-forms
@emdash-cms/plugin-webhook-notifier
commit: |
What does this PR do?
Fixes the regression where the automated PR reviewer (
emdash-flue-review) stopped posting reviews around #1484, with no deploy and no error in logs.Root cause: Workers AI returns HTTP 429 when a model is over capacity. We confirmed sustained 429s on
@cf/moonshotai/kimi-k2.6(and, to a lesser extent, 2.7). Under load the AI binding can hold a request open indefinitely, so the review workflow'ssession.skillcall never returned: no result, no posted review, just the Sandbox container's keep-alive alarm firing for minutes. There was no timeout or capacity handling anywhere, so a transient capacity spike turned into a permanent silent hang.Changes:
kimi-k2.7-code(less loaded right now): the review agent, the fix agent, both reply classifiers, and the investigate classifier. (/bonk//reviewkimi alias already moved in chore: bump flue/bonk coding agents to kimi-k2.7-code #1485.)withCapacityRetry(one copy per flue deploy unit, since.flueandinfra/flue-revieware independent workspaces): bounds each model call with a hard per-attempt timeout so a stalled call fails loudly instead of hanging, and retries genuine 429 capacity errors with exponential backoff + full jitter. Per-attempt timeouts are intentionally not retried (can't distinguish a stall from slow-but-working progress) — they fail loud and bounded, and the workflow's at-least-once restart handles re-running.Note:
ModelConfigin@flue/runtimeis just a model-id string, so there's no provider-level retry/timeout knob — this is the application-level contract.Verified by live
wrangler tailofemdash-flue-review: the pipeline is healthy through git checkout andgit diff, then goes silent at the (kimi) inference call with zero exceptions — consistent with a held-open 429.Closes #
Type of change
Checklist
pnpm typecheckpassespnpm lintpassespnpm testpasses (or targeted tests for my change)pnpm formathas been runAI-generated code disclosure
Screenshots / test output
tsc --noEmitclean ininfra/flue-review(afterwrangler types) and.flue; oxfmt clean on all changed files.