Skip to content

fix: Unbounded Recovery Validation Goroutines Exhaustion#1277

Open
ouicate wants to merge 6 commits into
gonka-ai:mainfrom
ouicate:fix/bounded-recovery-validation-workers
Open

fix: Unbounded Recovery Validation Goroutines Exhaustion#1277
ouicate wants to merge 6 commits into
gonka-ai:mainfrom
ouicate:fix/bounded-recovery-validation-workers

Conversation

@ouicate

@ouicate ouicate commented May 30, 2026

Copy link
Copy Markdown

Summary

This fixes an unbounded-concurrency bug in the validator's missed-inference recovery path where every entry in the missed-validation backlog spawned its own long-lived goroutine, holding payloads, retry state, an HTTP connection, and a broker model-lock reservation each. The same fan-out pattern lived in the live sampled-validation path, and the underlying ML-node replay used http.Post with no timeout. After an extended outage the missed-inference backlog can reach thousands of entries; combined with three independent trigger paths invoking recovery (new-block dispatcher, startup reward recovery, and admin handler) the API process could OOM and starve live traffic of the broker's node pool. The on-chain validation path is still the active execution surface for missed-inference recovery and sampled-validation fan-out today, so this is the same code path validators run in production.

Root Cause

ExecuteRecoveryValidations spawned one goroutine per missed inference and joined them on a single sync.WaitGroup. There was no per-process cap on in-flight validations, no cross-trigger mutual exclusion (any of three callers could fire concurrently and double the live load), and no upper bound on how long an individual validation could remain in flight: each goroutine could spend up to 20 minutes in retrievePayloadsWithRetry, then enter the LockNode retry loop, and finally http.Post into the local ML node with http.DefaultClient, which has no timeout. A hung or slow ML node endpoint would pin a goroutine and the model lock it was holding indefinitely. SampleInferenceToValidate had the same per-id goroutine fan-out for real-time sampled validations.

Fix

  • Add maxConcurrentValidations = 10 and route both ExecuteRecoveryValidations and SampleInferenceToValidate through a fixed-size worker pool that drains a buffered channel. Concurrent broker load and resident goroutine count are now O(1) in the backlog size instead of O(N).
  • Add InferenceValidator.recoveryRunning atomic.Bool and gate ExecuteRecoveryValidations with CompareAndSwap so only one recovery run is in flight across all three trigger paths (new-block dispatcher, startup reward recovery, admin handler). Repeat triggers log and return without doubling up.
  • Preserve fire-and-forget semantics for SampleInferenceToValidate (it is called from the event-handler worker pool) by hosting its worker pool inside a single dispatcher goroutine, so the caller still does not block.
  • Replace http.Post in validateWithPayloads with a package-level validationHTTPClient = &http.Client{Timeout: 5 * time.Minute} and an explicit http.NewRequest, so a stalled ML node endpoint cannot pin a validation goroutine and the model lock it holds.

Why This Closes The Vulnerability

The exploit/operational failure required three conditions: any caller could spawn an unbounded number of long-lived validation goroutines, multiple trigger paths could fire those fan-outs concurrently, and individual goroutines could be pinned indefinitely on a slow ML node. This PR removes all three. Worker count is bounded at 10 regardless of backlog size, recoveryRunning collapses overlapping triggers into a single run, and the 5-minute HTTP timeout ensures each validation eventually releases its broker lock and exits even when the local ML node misbehaves. The chain-side validation surface is unchanged; this is a narrowly scoped resource-control fix to the validator process that is required as long as on-chain validation remains the execution path missed-inference recovery and sampled validation flow through.

Test plan

  • go test ./internal/validation/... in decentralized-api
  • Confirm a synthetic backlog of >10 missed inferences produces at most maxConcurrentValidations worker goroutines (e.g. inspect runtime.NumGoroutine() from a test harness or via pprof during a soak run).
  • Confirm two concurrent calls to ExecuteRecoveryValidations from different trigger paths result in exactly one execution (the second returns immediately with a warning log).
  • Confirm validateWithPayloads aborts within ~5 minutes when the ML node endpoint is artificially stalled (instead of blocking the goroutine forever on http.DefaultClient).

…plays

Replaces the unbounded per-inference goroutine fan-out in
ExecuteRecoveryValidations and SampleInferenceToValidate with a fixed
worker pool (maxConcurrentValidations) and guards ExecuteRecoveryValidations
with an atomic flag so the three independent trigger paths (new-block
dispatcher, startup reward recovery, admin handler) cannot run concurrently.
Also swaps the no-timeout http.Post in validateWithPayloads for a shared
http.Client with a 5m timeout so a hung ML node endpoint cannot pin a
validation goroutine and the model lock it holds indefinitely.

Each in-flight validation retains payloads, retry state, an HTTP connection,
and a model-lock reservation; after extended downtime the missed-inference
backlog can reach thousands of entries, and the previous fan-out could OOM
the API and starve live traffic of the node pool. The bounded pool keeps
memory and concurrent broker load O(1) in the backlog size.

Co-authored-by: Cursor <cursoragent@cursor.com>
@ouicate ouicate changed the title Unbounded Recovery Validation Goroutines Exhaustion fix: Unbounded Recovery Validation Goroutines Exhaustion May 30, 2026
@tcharchian tcharchian requested a review from patimen May 30, 2026 20:46
@tcharchian

Copy link
Copy Markdown
Collaborator

@ouicate Are you sure this covers all cases? How did you check that there are no other places where this issue is happening?

ouicate added 2 commits May 30, 2026 23:36
The previous worker pools in ExecuteRecoveryValidations and
SampleInferenceToValidate were per-call, so concurrent recovery and
sampled-validation calls could each spawn their own pool and exceed
maxConcurrentValidations process-wide. VerifyInvalidation also still
fanned out one unbounded goroutine per revalidation event.

Replace the per-call pools with one shared per-process semaphore
(validationSlots) owned by InferenceValidator. ExecuteRecoveryValidations,
SampleInferenceToValidate, and VerifyInvalidation now all acquire a slot
from the same channel before launching their validation goroutines, so
total live validation replays are capped at maxConcurrentValidations
across every trigger path. SampleInferenceToValidate keeps its dispatcher
goroutine so the event-handler worker stays fire-and-forget while slot
acquisition happens off the event-worker path.

ML-node replay now also uses http.NewRequestWithContext with the
recorder's context, so a process shutdown actually cancels in-flight
validation HTTP calls instead of just relying on the 5-minute timeout.
@ouicate

ouicate commented May 30, 2026

Copy link
Copy Markdown
Author

@ouicate Are you sure this covers all cases? How did you check that there are no other places where this issue is happening?

Yes @tcharchian, with 6deb918 I’m confident this now covers the production validation-replay cases behind this issue. I re-checked it from the call graph rather than only patching the two obvious functions.

What I checked:

  1. I traced every production caller of recovery execution.

ExecuteRecoveryValidations has three production callers: the new-block dispatcher, startup reward recovery, and the admin recovery handler. In production there is one InferenceValidator constructed in main.go and that same instance is passed into the event listener and admin server, so recoveryRunning is a real process-local guard across those trigger paths. If two triggers reach execution at the same time, one wins the CAS and the other returns without starting another recovery fan-out.

  1. I traced every path that reaches the actual validation replay.

The only caller of retrievePayloadsWithRetry, broker.LockNode, and validateWithPayloads is validateInferenceAndSendValMessage. That function is reached from exactly three entry points:

  • ExecuteRecoveryValidations for missed-validation recovery.
  • SampleInferenceToValidate for live sampled validation.
  • VerifyInvalidation for revalidation.

All three now acquire from the same validationSlots channel owned by the InferenceValidator instance before entering validation replay work. That means the cap is shared process-wide across recovery, live sampled validation, and revalidation; it is not 10 per call site. At most maxConcurrentValidations active validation replays can be retrieving payloads, retrying, holding broker model locks, or calling the local ML node at once.

  1. I re-checked the event-handler contract.

The tx event handlers run on a fixed worker pool. Blocking those workers on a saturated validation semaphore would stall unrelated event processing, so 6deb918 keeps slot waiting off the event-worker path: sampled validation uses a dispatcher goroutine, and revalidation acquires its slot inside the background validation goroutine. The bounded resource is the active replay work; event workers return immediately.

  1. I searched for the same class of bug elsewhere in decentralized-api.

I checked the validation replay symbols (validateInferenceAndSendValMessage, retrievePayloadsWithRetry, validateWithPayloads, and broker.LockNode) and found no other production callers. I also searched for other go func fan-outs and HTTP clients. The remaining production fan-outs I found are fixed worker pools, node-list-bounded/admin checks, startup/maintenance goroutines, or unrelated request-scoped work. They do not run the missed-validation replay path and do not hold the combination that made this issue dangerous: payloads, retry state, an ML-node HTTP call, and a broker model-lock reservation.

For HTTP timeouts, payload retrieval already uses payloadRetrievalClient with a 30s timeout, the ML-node client has its own timeout, and the remaining no-timeout validation replay call was the http.Post in validateWithPayloads. This PR replaces it with validationHTTPClient with a 5m timeout and http.NewRequestWithContext(s.recorder.GetContext(), ...), so shutdown also cancels in-flight ML-node validation calls.

@a-kuprin

a-kuprin commented Jun 3, 2026

Copy link
Copy Markdown
Collaborator

This is valid fix for unused on-chain inference/validation path

Currently inferences come via gateways to devshards, and legacy on-chain inferences is deprecated.
devshard validation do not have this bug.

While this old path is still in the code it could be fixed and fix is valid. But this is closed path so not a real risk for the network.

@tcharchian tcharchian added this to the v0.2.14 milestone Jun 3, 2026
@ouicate

ouicate commented Jun 16, 2026

Copy link
Copy Markdown
Author

When you have a chance, please take a look @patimen

@tcharchian tcharchian requested review from DimaOrekhovPS, GLiberman and x0152 and removed request for patimen June 16, 2026 20:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants