fix(supervisor): prevent escalating duplicate reconnections in failedPodHandler #2627
Walkthrough
The FailedPodHandler class (apps/supervisor/src/services/failedPodHandler.ts) adds a private …
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (5 passed)
6b15642 to 55c45ab
Actionable comments posted: 2
🧹 Nitpick comments (2)
apps/supervisor/src/services/failedPodHandler.ts (2)
291-293: Mirror the listener wrapper for `connect` to be defensive.
Arrow already captures `this`, but the async call can still reject. Wrap and log to avoid unhandled rejections (rare here, but consistent).
Apply:
-  private makeOnConnect(informerName: string) {
-    return () => this.onConnect(informerName);
-  }
+  private makeOnConnect(informerName: string) {
+    return () => {
+      this.onConnect(informerName).catch((handlerError) => {
+        const error = handlerError instanceof Error ? handlerError : undefined;
+        this.logger.error("onConnect handler failure", {
+          informerName,
+          error: error?.message,
+          errorType: error?.name,
+          errorStack: error?.stack,
+        });
+      });
+    };
+  }
67-69: Redundant `.bind(this)` on arrow callbacks.
`makeOnConnect`/`makeOnError` return arrow functions that already lexically bind `this`. The extra `.bind(this)` is unnecessary.
Apply:
-    this.informer.on("connect", this.makeOnConnect("failed-pod-informer").bind(this));
-    this.informer.on("error", this.makeOnError("failed-pod-informer").bind(this));
+    this.informer.on("connect", this.makeOnConnect("failed-pod-informer"));
+    this.informer.on("error", this.makeOnError("failed-pod-informer"));
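As a small illustration of why the extra `.bind(this)` changes nothing, here is a minimal sketch. The `Handler` class and its `name` field are hypothetical stand-ins, not the actual `FailedPodHandler` code. An arrow function takes `this` from the scope in which it was created, so a later `.bind()` with a different object is ignored.

```typescript
// Hypothetical stand-in for FailedPodHandler; only the `this` behaviour matters here.
class Handler {
  name: string;

  constructor(name: string) {
    this.name = name;
  }

  // Returns an arrow function, which captures `this` lexically.
  makeOnConnect(informerName: string) {
    return () => `${this.name}:${informerName}`;
  }
}

const handler = new Handler("failed-pod-handler");
const listener = handler.makeOnConnect("failed-pod-informer");

// Binding an arrow function to another object does not change its `this`.
const rebound = listener.bind({ name: "someone-else" });

const direct = listener();
const viaBind = rebound();
console.log(direct === viaBind); // both calls see the original Handler instance
```

Both calls resolve `this.name` against the original instance, which is why dropping `.bind(this)` at registration should be behaviour-preserving.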
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
apps/supervisor/src/services/failedPodHandler.ts (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{ts,tsx}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.{ts,tsx}: Always prefer using isomorphic code like fetch, ReadableStream, etc. instead of Node.js specific code
For TypeScript, we usually use types over interfaces
Avoid enums
No default exports, use function declarations
Files:
apps/supervisor/src/services/failedPodHandler.ts
🧬 Code graph analysis (1)
apps/supervisor/src/services/failedPodHandler.ts (1)
apps/supervisor/src/workloadManager/kubernetes.ts (1)
err(237-253)
🔇 Additional comments (1)
apps/supervisor/src/services/failedPodHandler.ts (1)
28-28: Good addition: reconnection guard flag.
The `reconnecting` flag is the right primitive to prevent concurrent `start()` calls. LGTM.
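For readers who want to see the guard in isolation, the following is a minimal sketch of a single-flight reconnection flag. The `Reconnector` class, the `sleep` helper, the `skipped` counter, and the 10 ms delays are all assumptions made for the demo; `startInformer()` only simulates the real `informer.start()` call.

```typescript
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical stand-in for the handler; not the actual FailedPodHandler.
class Reconnector {
  private reconnecting = false;
  startCalls = 0;
  skipped = 0;

  // Simulates informer.start(); the real call talks to the Kubernetes API.
  private async startInformer(): Promise<void> {
    this.startCalls += 1;
    await sleep(10);
  }

  async onError(): Promise<void> {
    // Single-flight guard: only one reconnection may be in progress at a time.
    if (this.reconnecting) {
      this.skipped += 1;
      return;
    }
    this.reconnecting = true;
    try {
      await sleep(10); // backoff before reconnecting
      await this.startInformer();
    } finally {
      // Always clear the flag, even if the reconnect throws.
      this.reconnecting = false;
    }
  }
}

async function demo(): Promise<Reconnector> {
  const r = new Reconnector();
  // Five error events arrive in the same tick, as in a burst of watch failures.
  await Promise.all([r.onError(), r.onError(), r.onError(), r.onError(), r.onError()]);
  return r;
}

demo().then((r) => console.log({ startCalls: r.startCalls, skipped: r.skipped }));
```

Because the flag is set synchronously before the first `await`, the other four calls in the burst return early and only one simulated `start()` runs.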
|
CodeRabbit has suggested two defensive enhancements that could be added if desired:
These are good defensive measures but not critical to fixing the core bug. Happy to implement them if maintainers prefer. |
This is great, thank you! 🙏
Just the catch handler to add, I dismissed the suggestion re wrapping the entire function.
Will get this up on our test env asap
Actionable comments posted: 0
♻️ Duplicate comments (2)
apps/supervisor/src/services/failedPodHandler.ts (2)
253-255: Wrap async handler to avoid unhandled rejections; also drop redundant `.bind(this)`.
Event emitters don’t await handlers. Wrap `onError()` and catch/log rejections; keep semantics unchanged. Since `makeOnError` returns an arrow that closes over `this`, `.bind(this)` at registration is unnecessary.
-  private makeOnError(informerName: string) {
-    return (err?: unknown) => this.onError(informerName, err);
-  }
+  private makeOnError(informerName: string) {
+    return (err?: unknown) => {
+      this.onError(informerName, err).catch((handlerError) => {
+        const error = handlerError instanceof Error ? handlerError : undefined;
+        this.logger.error("onError handler failure", {
+          informerName,
+          error: error?.message,
+          errorType: error?.name,
+          errorStack: error?.stack,
+        });
+      });
+    };
+  }
Apply at registration (outside this hunk):
-    this.informer.on("error", this.makeOnError("failed-pod-informer").bind(this));
+    this.informer.on("error", this.makeOnError("failed-pod-informer"));
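To make the "emitters don't await handlers" point concrete, here is a standalone sketch using Node's `EventEmitter`. The free functions `onError` and `makeOnError` and the `failures` array are hypothetical and only mirror the shape of the suggestion above; the real handler logs through `this.logger` instead of pushing to an array.

```typescript
import { EventEmitter } from "node:events";

const failures: string[] = [];

// Hypothetical async handler that always rejects, standing in for onError().
async function onError(informerName: string, err?: unknown): Promise<void> {
  const cause = err instanceof Error ? err.message : "unknown";
  throw new Error(`reconnect failed for ${informerName}: ${cause}`);
}

// Wraps the async handler so a rejection is caught and recorded
// instead of surfacing as an unhandled rejection.
function makeOnError(informerName: string) {
  return (err?: unknown): void => {
    onError(informerName, err).catch((handlerError) => {
      const error = handlerError instanceof Error ? handlerError : undefined;
      failures.push(error?.message ?? "unknown");
    });
  };
}

const informer = new EventEmitter();
informer.on("error", makeOnError("failed-pod-informer"));

// emit() returns synchronously; it does not wait for the handler's promise.
informer.emit("error", new Error("The user aborted a request."));
const seenSynchronously = failures.length;

// By the next timer tick the rejection has been caught and recorded.
setTimeout(() => console.log({ seenSynchronously, recorded: failures.length }), 0);
```

Without the `.catch()`, the same rejection would have no handler at all, since nothing upstream of `emit()` holds on to the returned promise.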
283-286: Optional: skip reconnect if stopped during backoff.
If `stop()` occurs during the timeout, we may still call `start()`. A cheap check avoids a surprise restart.
       // Reconnect on errors
       await setTimeout(this.reconnectIntervalMs);
+      if (!this.isRunning) {
+        this.logger.debug("onError: handler stopped during backoff; skipping reconnect", {
+          informerName,
+        });
+        return;
+      }
       await this.informer.start();
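The race described here can be reproduced with a few lines. This is only a sketch under stated assumptions: `BackoffHandler`, the `wait` helper, and the 20 ms interval are made up for the demo, and incrementing `startCalls` stands in for `await this.informer.start()`.

```typescript
const wait = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical stand-in for the handler; not the actual FailedPodHandler.
class BackoffHandler {
  isRunning = true;
  startCalls = 0;
  reconnectIntervalMs = 20;

  stop(): void {
    this.isRunning = false;
  }

  async onError(): Promise<void> {
    if (!this.isRunning) return;

    await wait(this.reconnectIntervalMs);

    // Re-check after the wait: stop() may have been called in the meantime.
    if (!this.isRunning) return;

    this.startCalls += 1; // stands in for `await this.informer.start()`
  }
}

async function demoBackoff(): Promise<number> {
  const h = new BackoffHandler();
  const pending = h.onError(); // enters the backoff wait
  h.stop(); // stop arrives while the backoff timer is still pending
  await pending;
  return h.startCalls;
}

demoBackoff().then((n) => console.log("start calls after stop:", n));
```

With the second `isRunning` check in place the simulated restart never happens; removing that check would let `startCalls` reach 1 even though `stop()` was called.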
🧹 Nitpick comments (1)
apps/supervisor/src/services/failedPodHandler.ts (1)
273-296: Count every error event, even when skipping reconnection.
`informerEventsTotal` is described as “informer events”; unlike `add`/`connect`, `error` is only incremented inside the try. When skipping due to `reconnecting`, the error event isn’t counted. Move the increment before the guard and remove the later one.
   private async onError(informerName: string, err?: unknown) {
     if (!this.isRunning) {
       this.logger.warn("onError: informer not running");
       return;
     }
+    // Count every error event, even if a reconnect is skipped
+    this.informerEventsTotal.inc({ namespace: this.namespace, verb: "error" });
-
-    // Guard against multiple simultaneous reconnections
+    // Guard against multiple simultaneous reconnections
     if (this.reconnecting) {
       this.logger.debug("onError: reconnection already in progress, skipping", {
         informerName,
       });
       return;
     }
@@
       this.logger.error("error event fired", {
         informerName,
         error: error?.message,
         errorType: error?.name,
         errorStack: error?.stack,
       });
-      this.informerEventsTotal.inc({ namespace: this.namespace, verb: "error" });
Please confirm whether you want the metric to reflect “events emitted” vs “reconnections attempted.”
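The difference between the two metric semantics can be shown with a deliberately simplified, synchronous model. Everything here is hypothetical: `countErrors` and the `Counts` type are not part of the handler, and the flag is never reset because the whole burst is assumed to arrive while one reconnection is still in progress.

```typescript
type Counts = { events: number; reconnects: number };

// Simplified model of a burst of error events hitting the guarded handler.
function countErrors(burst: number): Counts {
  const counts: Counts = { events: 0, reconnects: 0 };
  let reconnecting = false;

  for (let i = 0; i < burst; i++) {
    // Incremented before the guard: counts every emitted event.
    counts.events += 1;

    // Guard: duplicates are skipped while a reconnection is in progress.
    if (reconnecting) continue;
    reconnecting = true;

    // Incremented after the guard: counts reconnection attempts only.
    counts.reconnects += 1;
  }

  return counts;
}

console.log(countErrors(5));
```

For a burst of five events this yields five "events emitted" but a single "reconnection attempted", which is the gap the reviewer is asking about.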
🔇 Additional comments (2)
apps/supervisor/src/services/failedPodHandler.ts (2)
28-28: Reconnection guard flag: LGTM.
This single‑flight gate is the right fix; the `finally` reset is in place.
263-269: Single‑flight reconnection guard: LGTM.
Prevents the escalating reconnection storm reported in #2623.
Thanks! I've added the catch handler and pushed. One thought: The
{"timestamp":"2025-10-22T15:09:02.192Z","message":"error event fired","$name":"failed-pod-handler","$level":"error","id":"9eeyr8vrx1","informerName":"failed-pod-informer","error":"The user aborted a request.","errorType":"AbortError","errorStack":"AbortError: The user aborted a request.\n at abort (/app/apps/supervisor/node_modules/.pnpm/[email protected]/node_modules/node-fetch/lib/index.js:1462:16)\n at AbortSignal.abortAndFinalize (/app/apps/supervisor/node_modules/.pnpm/[email protected]/node_modules/node-fetch/lib/index.js:1477:4)\n at [nodejs.internal.kHybridDispatch] (node:internal/event_target:827:20)\n at AbortSignal.dispatchEvent (node:internal/event_target:762:26)\n at runAbort (node:internal/abort_controller:447:10)\n at abortSignal (node:internal/abort_controller:433:3)\n at AbortController.abort (node:internal/abort_controller:466:5)\n at PassThrough.doneCallOnce (file:///app/apps/supervisor/node_modules/.pnpm/@[email protected]_patch_hash=s75bgwaoixupmywtvgoy5ruszq/node_modules/@kubernetes/client-node/dist/watch.js:32:28)\n at PassThrough.emit (node:events:530:35)\n at emitErrorNT (node:internal/streams/destroy:170:8)"}
{"timestamp":"2025-10-22T15:09:02.192Z","message":"onError: reconnection already in progress, skipping","$name":"failed-pod-handler","$level":"debug","id":"9eeyr8vrx1","informerName":"failed-pod-informer"}
Happy to adjust based on what works best for your logging setup! 👍 |
|
Thanks @NERLOE
Sure, makes sense to remove |
|
Done✅ |
Actionable comments posted: 1
♻️ Duplicate comments (1)
apps/supervisor/src/services/failedPodHandler.ts (1)
253-255: Wrap async handler to avoid unhandled rejections (and remove redundant bind).
Event emitters don’t await returned Promises; any rejection becomes an unhandled rejection. Wrap the call and log inside the returned function. Also, since you return an arrow function (lexically binds `this`), the `.bind(this)` at registration is redundant.
Apply:
-  private makeOnError(informerName: string) {
-    return (err?: unknown) => this.onError(informerName, err);
-  }
+  private makeOnError(informerName: string) {
+    return (err?: unknown): void => {
+      void this.onError(informerName, err).catch((handlerError) => {
+        const error = handlerError instanceof Error ? handlerError : undefined;
+        this.logger.error("onError handler failure", {
+          informerName,
+          error: error?.message,
+          errorType: error?.name,
+        });
+      });
+    };
+  }
And update the registration (outside this hunk) to drop the unnecessary bind:
-    this.informer.on("error", this.makeOnError("failed-pod-informer").bind(this));
+    this.informer.on("error", this.makeOnError("failed-pod-informer"));
🔇 Additional comments (1)
apps/supervisor/src/services/failedPodHandler.ts (1)
28-28: Good guard to serialize reconnects.
The `reconnecting` flag with a `finally` reset correctly prevents overlapping reconnections. Nice.
Nice one, thanks again @NERLOE 🔥
Closes #2623
✅ Checklist
Testing
Production Testing (Self-Hosted GKE Autopilot):
Results:
AbortError: The user aborted a request.
Changelog
Bug Fix:
failedPodHandler
Improvement:
Technical Details:
`reconnecting` flag to guard concurrent reconnection attempts
`onError` handler to accept error parameter and log details
`finally` block to ensure flag is always cleared
Screenshots
Before (escalating errors):