
fix: DKG goroutine and instance lifecycle (v3.1.0) #239

Draft
olegshmuelov wants to merge 3 commits into main from fix/dkg-goroutine-lifecycle



olegshmuelov commented Apr 23, 2026

Summary

Fixes v3.1.0 QA 2.6 "goroutine cleanup — memory leaks still persist." Threads a lifecycle context through LocalOwner + instWrapper, replaces kyber's TimePhaser with a cancel-aware version, and adds a background reaper so expired instances release heap pressure under sparse traffic.

What's fixed

  • PostReshare <-o.done deadlock on the success path: nothing ever sent to o.done, so every successful reshare leaked its WaitEnd goroutine. A sync.Once-guarded close is now used across PostDKG, PostReshare, and broadcastError.
  • Orphaned bchan senders on timeout: the broadcast closure now exits via instanceCtx.Done().
  • WaitEnd watchers outliving the instance: runWaitEnd races WaitEnd() against ctx.Done(), with a 5s grace period for late completions.
  • kyber TimePhaser residue (~30s): replaced with cancellablePhaser in pkgs/wire/phaser.go; kyber exits within one phase signal.
  • Instance lifecycle: Instance.Close() plumbed into ProcessMessage timeout, cleanInstances, validateInstances; ProcessMessages honors the lifecycle ctx.
  • Heap retention: Switch.StartReaper sweeps expired entries every 30s.

Scope

In: goroutine lifecycle, kyber cancellation, instance eviction, heap-retention reaper.

Deferred (future releases): error-code unification (1.4 / 1.5 / 2.3), Pong version (4.1), body-limit tightening (2.1), atomic MaxInstances cap, CLI, Docker/CI, streaming SSZ.

Known limitations

  • After cancellation, a broadcast can in rare cases still deliver a legitimate result to an in-flight caller (no leak; real work).
  • Under extreme concurrent admissions, MaxInstances=1024 may briefly overshoot by the number of in-flight admission goroutines.

Test plan

  • go test -race -timeout 300s ./pkgs/... — all pass
  • golangci-lint run ./pkgs/... — 0 issues
  • New: pkgs/wire/phaser_test.go, pkgs/dkg/lifecycle_test.go, pkgs/operator/lifecycle_test.go; goleak verifies kyber residue clears on cancel via TestDKGCancelReleasesKyberGoroutines
  • QA 2.8 EIP-1271 integration tests re-run
  • Full v3.1.0 QA re-pass under sustained + timeout-storm traffic
