fix: Unbounded Recovery Validation Goroutines Exhaustion by ouicate · Pull Request #1277 · gonka-ai/gonka

ouicate · 2026-05-30T08:53:55Z

Summary

This fixes an unbounded-concurrency bug in the validator's missed-inference recovery path where every entry in the missed-validation backlog spawned its own long-lived goroutine, holding payloads, retry state, an HTTP connection, and a broker model-lock reservation each. The same fan-out pattern lived in the live sampled-validation path, and the underlying ML-node replay used http.Post with no timeout. After an extended outage the missed-inference backlog can reach thousands of entries; combined with three independent trigger paths invoking recovery (new-block dispatcher, startup reward recovery, and admin handler) the API process could OOM and starve live traffic of the broker's node pool. The on-chain validation path is still the active execution surface for missed-inference recovery and sampled-validation fan-out today, so this is the same code path validators run in production.

Root Cause

ExecuteRecoveryValidations spawned one goroutine per missed inference and joined them on a single sync.WaitGroup. There was no per-process cap on in-flight validations, no cross-trigger mutual exclusion (any of three callers could fire concurrently and double the live load), and no upper bound on how long an individual validation could remain in flight: each goroutine could spend up to 20 minutes in retrievePayloadsWithRetry, then enter the LockNode retry loop, and finally http.Post into the local ML node with http.DefaultClient, which has no timeout. A hung or slow ML node endpoint would pin a goroutine and the model lock it was holding indefinitely. SampleInferenceToValidate had the same per-id goroutine fan-out for real-time sampled validations.

Fix

Add maxConcurrentValidations = 10 and route both ExecuteRecoveryValidations and SampleInferenceToValidate through a fixed-size worker pool that drains a buffered channel. Concurrent broker load and resident goroutine count are now O(1) in the backlog size instead of O(N).
Add InferenceValidator.recoveryRunning atomic.Bool and gate ExecuteRecoveryValidations with CompareAndSwap so only one recovery run is in flight across all three trigger paths (new-block dispatcher, startup reward recovery, admin handler). Repeat triggers log and return without doubling up.
Preserve fire-and-forget semantics for SampleInferenceToValidate (it is called from the event-handler worker pool) by hosting its worker pool inside a single dispatcher goroutine, so the caller still does not block.
Replace http.Post in validateWithPayloads with a package-level validationHTTPClient = &http.Client{Timeout: 5 * time.Minute} and an explicit http.NewRequest, so a stalled ML node endpoint cannot pin a validation goroutine and the model lock it holds.

Why This Closes The Vulnerability

The exploit/operational failure required three conditions: any caller could spawn an unbounded number of long-lived validation goroutines, multiple trigger paths could fire those fan-outs concurrently, and individual goroutines could be pinned indefinitely on a slow ML node. This PR removes all three. Worker count is bounded at 10 regardless of backlog size, recoveryRunning collapses overlapping triggers into a single run, and the 5-minute HTTP timeout ensures each validation eventually releases its broker lock and exits even when the local ML node misbehaves. The chain-side validation surface is unchanged; this is a narrowly scoped resource-control fix to the validator process that is required as long as on-chain validation remains the execution path missed-inference recovery and sampled validation flow through.

Test plan

go test ./internal/validation/... in decentralized-api
Confirm a synthetic backlog of >10 missed inferences produces at most maxConcurrentValidations worker goroutines (e.g. inspect runtime.NumGoroutine() from a test harness or via pprof during a soak run).
Confirm two concurrent calls to ExecuteRecoveryValidations from different trigger paths result in exactly one execution (the second returns immediately with a warning log).
Confirm validateWithPayloads aborts within ~5 minutes when the ML node endpoint is artificially stalled (instead of blocking the goroutine forever on http.DefaultClient).

…plays Replaces the unbounded per-inference goroutine fan-out in ExecuteRecoveryValidations and SampleInferenceToValidate with a fixed worker pool (maxConcurrentValidations) and guards ExecuteRecoveryValidations with an atomic flag so the three independent trigger paths (new-block dispatcher, startup reward recovery, admin handler) cannot run concurrently. Also swaps the no-timeout http.Post in validateWithPayloads for a shared http.Client with a 5m timeout so a hung ML node endpoint cannot pin a validation goroutine and the model lock it holds indefinitely. Each in-flight validation retains payloads, retry state, an HTTP connection, and a model-lock reservation; after extended downtime the missed-inference backlog can reach thousands of entries, and the previous fan-out could OOM the API and starve live traffic of the node pool. The bounded pool keeps memory and concurrent broker load O(1) in the backlog size. Co-authored-by: Cursor <cursoragent@cursor.com>

tcharchian · 2026-05-30T20:47:34Z

@ouicate Are you sure this covers all cases? How did you check that there are no other places where this issue is happening?

The previous worker pools in ExecuteRecoveryValidations and SampleInferenceToValidate were per-call, so concurrent recovery and sampled-validation calls could each spawn their own pool and exceed maxConcurrentValidations process-wide. VerifyInvalidation also still fanned out one unbounded goroutine per revalidation event. Replace the per-call pools with one shared per-process semaphore (validationSlots) owned by InferenceValidator. ExecuteRecoveryValidations, SampleInferenceToValidate, and VerifyInvalidation now all acquire a slot from the same channel before launching their validation goroutines, so total live validation replays are capped at maxConcurrentValidations across every trigger path. SampleInferenceToValidate keeps its dispatcher goroutine so the event-handler worker stays fire-and-forget while slot acquisition happens off the event-worker path. ML-node replay now also uses http.NewRequestWithContext with the recorder's context, so a process shutdown actually cancels in-flight validation HTTP calls instead of just relying on the 5-minute timeout.

ouicate · 2026-05-30T22:02:50Z

@ouicate Are you sure this covers all cases? How did you check that there are no other places where this issue is happening?

Yes @tcharchian, with 6deb918 I’m confident this now covers the production validation-replay cases behind this issue. I re-checked it from the call graph rather than only patching the two obvious functions.

What I checked:

I traced every production caller of recovery execution.

ExecuteRecoveryValidations has three production callers: the new-block dispatcher, startup reward recovery, and the admin recovery handler. In production there is one InferenceValidator constructed in main.go and that same instance is passed into the event listener and admin server, so recoveryRunning is a real process-local guard across those trigger paths. If two triggers reach execution at the same time, one wins the CAS and the other returns without starting another recovery fan-out.

I traced every path that reaches the actual validation replay.

The only caller of retrievePayloadsWithRetry, broker.LockNode, and validateWithPayloads is validateInferenceAndSendValMessage. That function is reached from exactly three entry points:

ExecuteRecoveryValidations for missed-validation recovery.
SampleInferenceToValidate for live sampled validation.
VerifyInvalidation for revalidation.

All three now acquire from the same validationSlots channel owned by the InferenceValidator instance before entering validation replay work. That means the cap is shared process-wide across recovery, live sampled validation, and revalidation; it is not 10 per call site. At most maxConcurrentValidations active validation replays can be retrieving payloads, retrying, holding broker model locks, or calling the local ML node at once.

I re-checked the event-handler contract.

The tx event handlers run on a fixed worker pool. Blocking those workers on a saturated validation semaphore would stall unrelated event processing, so 6deb918 keeps slot waiting off the event-worker path: sampled validation uses a dispatcher goroutine, and revalidation acquires its slot inside the background validation goroutine. The bounded resource is the active replay work; event workers return immediately.

I searched for the same class of bug elsewhere in decentralized-api.

I checked the validation replay symbols (validateInferenceAndSendValMessage, retrievePayloadsWithRetry, validateWithPayloads, and broker.LockNode) and found no other production callers. I also searched for other go func fan-outs and HTTP clients. The remaining production fan-outs I found are fixed worker pools, node-list-bounded/admin checks, startup/maintenance goroutines, or unrelated request-scoped work. They do not run the missed-validation replay path and do not hold the combination that made this issue dangerous: payloads, retry state, an ML-node HTTP call, and a broker model-lock reservation.

For HTTP timeouts, payload retrieval already uses payloadRetrievalClient with a 30s timeout, the ML-node client has its own timeout, and the remaining no-timeout validation replay call was the http.Post in validateWithPayloads. This PR replaces it with validationHTTPClient with a 5m timeout and http.NewRequestWithContext(s.recorder.GetContext(), ...), so shutdown also cancels in-flight ML-node validation calls.

a-kuprin · 2026-06-03T12:57:27Z

This is valid fix for unused on-chain inference/validation path

Currently inferences come via gateways to devshards, and legacy on-chain inferences is deprecated.
devshard validation do not have this bug.

While this old path is still in the code it could be fixed and fix is valid. But this is closed path so not a real risk for the network.

ouicate · 2026-06-16T15:16:06Z

When you have a chance, please take a look @patimen

ouicate changed the title ~~Unbounded Recovery Validation Goroutines Exhaustion~~ fix: Unbounded Recovery Validation Goroutines Exhaustion May 30, 2026

tcharchian requested a review from patimen May 30, 2026 20:46

ouicate added 2 commits May 30, 2026 23:36

Merge branch 'main' into fix/bounded-recovery-validation-workers

7761060

Merge branch 'main' into fix/bounded-recovery-validation-workers

f770296

a-kuprin approved these changes Jun 3, 2026

View reviewed changes

tcharchian added this to the v0.2.14 milestone Jun 3, 2026

ouicate added 2 commits June 5, 2026 00:04

fix the test

158f28c

Merge branch 'main' into fix/bounded-recovery-validation-workers

2526bf7

tcharchian requested review from DimaOrekhovPS, GLiberman and x0152 and removed request for patimen June 16, 2026 20:48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: Unbounded Recovery Validation Goroutines Exhaustion#1277

fix: Unbounded Recovery Validation Goroutines Exhaustion#1277
ouicate wants to merge 6 commits into
gonka-ai:mainfrom
ouicate:fix/bounded-recovery-validation-workers

ouicate commented May 30, 2026 •

edited

Loading

Uh oh!

tcharchian commented May 30, 2026

Uh oh!

ouicate commented May 30, 2026

Uh oh!

a-kuprin commented Jun 3, 2026

Uh oh!

ouicate commented Jun 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

ouicate commented May 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Root Cause

Fix

Why This Closes The Vulnerability

Test plan

Uh oh!

tcharchian commented May 30, 2026

Uh oh!

ouicate commented May 30, 2026

Uh oh!

a-kuprin commented Jun 3, 2026

Uh oh!

ouicate commented Jun 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

ouicate commented May 30, 2026 •

edited

Loading