fix: Unbounded Recovery Validation Goroutines Exhaustion #1277
…plays Replaces the unbounded per-inference goroutine fan-out in ExecuteRecoveryValidations and SampleInferenceToValidate with a fixed worker pool (maxConcurrentValidations) and guards ExecuteRecoveryValidations with an atomic flag so the three independent trigger paths (new-block dispatcher, startup reward recovery, admin handler) cannot run concurrently. Also swaps the no-timeout http.Post in validateWithPayloads for a shared http.Client with a 5m timeout so a hung ML node endpoint cannot pin a validation goroutine and the model lock it holds indefinitely. Each in-flight validation retains payloads, retry state, an HTTP connection, and a model-lock reservation; after extended downtime the missed-inference backlog can reach thousands of entries, and the previous fan-out could OOM the API and starve live traffic of the node pool. The bounded pool keeps memory and concurrent broker load O(1) in the backlog size. Co-authored-by: Cursor <cursoragent@cursor.com>
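A minimal sketch of the two mechanisms this commit message describes: a fixed worker pool draining a buffered channel, and an atomic flag that rejects overlapping recovery runs. Only `maxConcurrentValidations` and the flag name `recoveryRunning` come from the PR text; `runRecovery`, its signature, and the peak-tracking counters are hypothetical and exist purely for illustration, not as the actual implementation.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// maxConcurrentValidations uses the value stated in the PR description.
const maxConcurrentValidations = 10

// recoveryRunning is a stand-in for the validator's atomic guard flag.
var recoveryRunning atomic.Bool

// runRecovery (hypothetical) validates ids on a fixed-size worker pool and
// returns the peak number of validations in flight at once. ok is false when
// another run already holds the guard, in which case nothing is executed.
func runRecovery(ids []int, validate func(int)) (peak int64, ok bool) {
	if !recoveryRunning.CompareAndSwap(false, true) {
		return 0, false // a recovery run is already in progress
	}
	defer recoveryRunning.Store(false)

	// Queue the whole backlog up front; workers drain it until it is closed.
	jobs := make(chan int, len(ids))
	for _, id := range ids {
		jobs <- id
	}
	close(jobs)

	var inFlight, maxSeen atomic.Int64
	var wg sync.WaitGroup
	for w := 0; w < maxConcurrentValidations; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for id := range jobs {
				n := inFlight.Add(1)
				// Record the high-water mark of concurrent validations.
				for {
					cur := maxSeen.Load()
					if n <= cur || maxSeen.CompareAndSwap(cur, n) {
						break
					}
				}
				validate(id)
				inFlight.Add(-1)
			}
		}()
	}
	wg.Wait()
	return maxSeen.Load(), true
}

func main() {
	backlog := make([]int, 1000) // simulated missed-inference backlog
	peak, ok := runRecovery(backlog, func(int) {})
	fmt.Println(ok, peak <= maxConcurrentValidations) // prints "true true"
}
```

The goroutine count depends only on the pool size, so a backlog of 1000 entries and a backlog of 10 entries create the same number of workers.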
@ouicate Are you sure this covers all cases? How did you check that there are no other places where this issue is happening?
The previous worker pools in ExecuteRecoveryValidations and SampleInferenceToValidate were per-call, so concurrent recovery and sampled-validation calls could each spawn their own pool and exceed maxConcurrentValidations process-wide. VerifyInvalidation also still fanned out one unbounded goroutine per revalidation event. Replace the per-call pools with one shared per-process semaphore (validationSlots) owned by InferenceValidator. ExecuteRecoveryValidations, SampleInferenceToValidate, and VerifyInvalidation now all acquire a slot from the same channel before launching their validation goroutines, so total live validation replays are capped at maxConcurrentValidations across every trigger path. SampleInferenceToValidate keeps its dispatcher goroutine so the event-handler worker stays fire-and-forget while slot acquisition happens off the event-worker path. ML-node replay now also uses http.NewRequestWithContext with the recorder's context, so a process shutdown actually cancels in-flight validation HTTP calls instead of just relying on the 5-minute timeout.
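The shared-semaphore idea above can be sketched with a buffered channel used as a counting semaphore. The names `validationSlots` and `maxConcurrentValidations` come from the commit message; the `validator` type, `launch`, and `peakAcrossCallers` are hypothetical illustrations that assume every caller goes through the same channel, and they are not taken from the actual code.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"sync/atomic"
)

const maxConcurrentValidations = 10

// validator is a hypothetical, trimmed-down stand-in for InferenceValidator.
type validator struct {
	validationSlots chan struct{} // one token per allowed in-flight validation
	wg              sync.WaitGroup
}

func newValidator() *validator {
	return &validator{validationSlots: make(chan struct{}, maxConcurrentValidations)}
}

// launch waits for a free slot (or for ctx to end), then runs work in its own
// goroutine. The slot is released when work returns.
func (v *validator) launch(ctx context.Context, work func()) bool {
	select {
	case v.validationSlots <- struct{}{}:
	case <-ctx.Done():
		return false // shutting down: do not start new validations
	}
	v.wg.Add(1)
	go func() {
		defer v.wg.Done()
		defer func() { <-v.validationSlots }()
		work()
	}()
	return true
}

// peakAcrossCallers simulates several independent trigger paths sharing one
// validator and returns the highest number of validations seen in flight.
func peakAcrossCallers(callers, perCaller int) int64 {
	v := newValidator()
	var inFlight, peak atomic.Int64
	work := func() {
		n := inFlight.Add(1)
		for {
			cur := peak.Load()
			if n <= cur || peak.CompareAndSwap(cur, n) {
				break
			}
		}
		inFlight.Add(-1)
	}
	var callersWG sync.WaitGroup
	for c := 0; c < callers; c++ {
		callersWG.Add(1)
		go func() {
			defer callersWG.Done()
			for i := 0; i < perCaller; i++ {
				v.launch(context.Background(), work)
			}
		}()
	}
	callersWG.Wait() // every launch has returned, so all wg.Add calls are done
	v.wg.Wait()
	return peak.Load()
}

func main() {
	fmt.Println(peakAcrossCallers(3, 200) <= maxConcurrentValidations) // prints "true"
}
```

Because the slot is acquired before the goroutine starts, the cap holds across all callers rather than per call, which is the property the per-call pools lacked.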
Yes @tcharchian, with What I checked:
The only caller of
All three now acquire from the same
The tx event handlers run on a fixed worker pool. Blocking those workers on a saturated validation semaphore would stall unrelated event processing, so
I checked the validation replay symbols ( For HTTP timeouts, payload retrieval already uses
This is a valid fix for an unused on-chain inference/validation path. Currently, inferences come via gateways to devshards, and legacy on-chain inference is deprecated. While this old path is still in the code, it can be fixed, and the fix is valid. However, this is a closed path, so it is not a real risk for the network.
When you have a chance, please take a look @patimen |
Summary
This fixes an unbounded-concurrency bug in the validator's missed-inference recovery path where every entry in the missed-validation backlog spawned its own long-lived goroutine, each holding payloads, retry state, an HTTP connection, and a broker model-lock reservation. The same fan-out pattern lived in the live sampled-validation path, and the underlying ML-node replay used `http.Post` with no timeout. After an extended outage the missed-inference backlog can reach thousands of entries; combined with three independent trigger paths invoking recovery (new-block dispatcher, startup reward recovery, and admin handler), the API process could OOM and starve live traffic of the broker's node pool. The on-chain validation path is still the active execution surface for missed-inference recovery and sampled-validation fan-out today, so this is the same code path validators run in production.

Root Cause
`ExecuteRecoveryValidations` spawned one goroutine per missed inference and joined them on a single `sync.WaitGroup`. There was no per-process cap on in-flight validations, no cross-trigger mutual exclusion (any of three callers could fire concurrently and double the live load), and no upper bound on how long an individual validation could remain in flight: each goroutine could spend up to 20 minutes in `retrievePayloadsWithRetry`, then enter the `LockNode` retry loop, and finally `http.Post` into the local ML node with `http.DefaultClient`, which has no timeout. A hung or slow ML node endpoint would pin a goroutine and the model lock it was holding indefinitely. `SampleInferenceToValidate` had the same per-id goroutine fan-out for real-time sampled validations.

Fix
- Add `maxConcurrentValidations = 10` and route both `ExecuteRecoveryValidations` and `SampleInferenceToValidate` through a fixed-size worker pool that drains a buffered channel. Concurrent broker load and resident goroutine count are now O(1) in the backlog size instead of O(N).
- Add `InferenceValidator.recoveryRunning atomic.Bool` and gate `ExecuteRecoveryValidations` with `CompareAndSwap` so only one recovery run is in flight across all three trigger paths (new-block dispatcher, startup reward recovery, admin handler). Repeat triggers log and return without doubling up.
- Preserve the fire-and-forget behavior of `SampleInferenceToValidate` (it is called from the event-handler worker pool) by hosting its worker pool inside a single dispatcher goroutine, so the caller still does not block.
- Replace `http.Post` in `validateWithPayloads` with a package-level `validationHTTPClient = &http.Client{Timeout: 5 * time.Minute}` and an explicit `http.NewRequest`, so a stalled ML node endpoint cannot pin a validation goroutine and the model lock it holds.

Why This Closes The Vulnerability
The exploit/operational failure required three conditions: any caller could spawn an unbounded number of long-lived validation goroutines, multiple trigger paths could fire those fan-outs concurrently, and individual goroutines could be pinned indefinitely on a slow ML node. This PR removes all three. Worker count is bounded at 10 regardless of backlog size, `recoveryRunning` collapses overlapping triggers into a single run, and the 5-minute HTTP timeout ensures each validation eventually releases its broker lock and exits even when the local ML node misbehaves. The chain-side validation surface is unchanged; this is a narrowly scoped resource-control fix to the validator process that is required as long as on-chain validation remains the execution path that missed-inference recovery and sampled validation flow through.

Test plan
- `go test ./internal/validation/...` in `decentralized-api`
- Confirm that a large recovery backlog runs on at most `maxConcurrentValidations` worker goroutines (e.g. inspect `runtime.NumGoroutine()` from a test harness or via pprof during a soak run).
- Confirm that concurrent calls to `ExecuteRecoveryValidations` from different trigger paths result in exactly one execution (the second returns immediately with a warning log).
- Confirm that `validateWithPayloads` aborts within ~5 minutes when the ML node endpoint is artificially stalled (instead of blocking the goroutine forever on `http.DefaultClient`).
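
The stalled-endpoint check in the last item could be exercised along the following lines. This is a sketch under stated assumptions, not the project's test: `postWithTimeout`, `isTimeout`, and `stalledCallTimesOut` are hypothetical helpers, the stalled ML node is simulated with `httptest`, and a short timeout stands in for the 5-minute production value so the check finishes quickly.

```go
package main

import (
	"bytes"
	"context"
	"errors"
	"fmt"
	"net"
	"net/http"
	"net/http/httptest"
	"time"
)

// postWithTimeout (hypothetical) sends a JSON POST through a client that has a
// timeout, using a context-aware request so shutdown can also cancel it.
func postWithTimeout(ctx context.Context, client *http.Client, url string, body []byte) error {
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	return nil
}

// isTimeout reports whether err is a network-level timeout.
func isTimeout(err error) bool {
	var ne net.Error
	return errors.As(err, &ne) && ne.Timeout()
}

// stalledCallTimesOut starts a server that never answers and reports whether
// the client call fails with a timeout instead of blocking forever.
func stalledCallTimesOut(timeout time.Duration) bool {
	release := make(chan struct{})
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Simulate a hung endpoint: hold the request until released.
		select {
		case <-release:
		case <-r.Context().Done():
		}
	}))
	defer srv.Close()
	defer close(release) // runs before srv.Close so the handler can return

	client := &http.Client{Timeout: timeout}
	err := postWithTimeout(context.Background(), client, srv.URL, []byte(`{}`))
	return isTimeout(err)
}

func main() {
	fmt.Println(stalledCallTimesOut(200 * time.Millisecond))
}
```

With `http.DefaultClient` in place of the timed client, the same call would block until the server released the request, which is the behavior the fix removes.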