fix(supervisor): prevent escalating duplicate reconnections in failedPodHandler #2627

NERLOE · 2025-10-22T10:30:31Z

Closes #2623

✅ Checklist

I have followed every step in the contributing guide
The PR title follows the convention
I ran and tested the code works

Testing

Production Testing (Self-Hosted GKE Autopilot):

Deployed to production cluster that was experiencing catastrophic escalation
Monitored for 25+ minutes across 5 disconnect cycles

Results:

✅ Before fix: Errors escalated 2 → 3 → 5 → 9 → 17 → 33 → continuous loop
✅ After fix: Errors remain stable at 1 per cycle (no escalation)
✅ Reconnection guard activates every cycle: "reconnection already in progress, skipping"
✅ System recovers in ~1 second and continues normally
✅ Error details now logged: AbortError: The user aborted a request.

Changelog

Bug Fix:

Add reconnection guard to prevent concurrent reconnection attempts in failedPodHandler
Prevents escalating duplicate connections when Kubernetes informer disconnects
Fixes exponential growth pattern: 2 → 3 → 5 → 9 → 17 → 33+ errors/reconnects

Improvement:

Add error details logging (message, type, stack) to capture actual error information
Previously only logged "error event fired" without details, making debugging difficult

Technical Details:

Added reconnecting flag to guard concurrent reconnection attempts
Updated onError handler to accept error parameter and log details
Used finally block to ensure flag is always cleared

Screenshots

Before (escalating errors):

changeset-bot · 2025-10-22T10:30:35Z

⚠️ No Changeset found

Latest commit: 1f0a5bb

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

coderabbitai · 2025-10-22T10:30:57Z

Walkthrough

The FailedPodHandler class (apps/supervisor/src/services/failedPodHandler.ts) adds a private reconnecting: boolean flag to prevent overlapping reconnection attempts. makeOnError now returns a handler that forwards an optional err to onError, and onError accepts err?: unknown. The reconnection flow is wrapped in try/catch/finally so reconnecting is always reset; reconnection errors are logged with message, type/name, and stack, and an informer error metric is incremented. onError skips attempting reconnection when the guard is set.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Pre-merge checks and finishing touches

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title Check	✅ Passed	The pull request title "fix(supervisor): prevent escalating duplicate reconnections in failedPodHandler" directly and clearly describes the primary change in the changeset. It uses the correct commit convention format (fix with scope), is concise and readable, and accurately reflects the main objective of adding a reconnection guard to prevent concurrent/escalating reconnection attempts in the failedPodHandler service.
Linked Issues Check	✅ Passed	The code changes directly address all primary coding requirements from linked issue #2623. The implementation adds a `reconnecting` boolean flag to guard against concurrent reconnection attempts, wraps the reconnection logic in try/catch/finally to ensure flag cleanup, and introduces a guard path to skip reconnection if one is already in progress. The production testing validates the fix resolves the escalation pattern (2 → 3 → 5 → 9 → 17 → 33) to a stable 1 error per cycle, with error details now properly logged to aid debugging. The enhanced error logging (message, type, stack) represents an additional improvement aligned with the PR objectives.
Out of Scope Changes Check	✅ Passed	All code changes in the pull request are directly scoped to the requirements specified in issue #2623 and the PR objectives. The modifications—adding the `reconnecting` flag, updating the `onError` signature to accept error parameters, enhancing error logging, and wrapping reconnection logic in try/catch/finally—are all essential to implement the reconnection guard and the mentioned improvement of better error logging. No changes appear to introduce unrelated functionality or scope creep outside the stated objectives.
Description Check	✅ Passed	The pull request description follows the required template structure and includes all major sections: the issue closure reference (#2623), a completed checklist, a detailed testing section documenting production validation across multiple disconnect cycles with before/after results, and a comprehensive changelog explaining the bug fix and improvements. While the screenshots section appears incomplete, the description is substantive and provides sufficient detail to understand the changes and their validation.
Docstring Coverage	✅ Passed	No functions found in the changes. Docstring coverage check skipped.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

…PodHandler

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (2)

apps/supervisor/src/services/failedPodHandler.ts (2)

291-293: Mirror the listener wrapper for connect to be defensive.

Arrow already captures this, but the async call can still reject. Wrap and log to avoid unhandled rejections (rare here, but consistent).

Apply:

-  private makeOnConnect(informerName: string) {
-    return () => this.onConnect(informerName);
-  }
+  private makeOnConnect(informerName: string) {
+    return () => {
+      this.onConnect(informerName).catch((handlerError) => {
+        const error = handlerError instanceof Error ? handlerError : undefined;
+        this.logger.error("onConnect handler failure", {
+          informerName,
+          error: error?.message,
+          errorType: error?.name,
+          errorStack: error?.stack,
+        });
+      });
+    };
+  }

67-69: Redundant .bind(this) on arrow callbacks.

makeOnConnect/makeOnError return arrow functions that already lexically bind this. The extra .bind(this) is unnecessary.

Apply:

-    this.informer.on("connect", this.makeOnConnect("failed-pod-informer").bind(this));
-    this.informer.on("error", this.makeOnError("failed-pod-informer").bind(this));
+    this.informer.on("connect", this.makeOnConnect("failed-pod-informer"));
+    this.informer.on("error", this.makeOnError("failed-pod-informer"));

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d90da7a and 55c45ab.

📒 Files selected for processing (1)

apps/supervisor/src/services/failedPodHandler.ts (2 hunks)

🧰 Additional context used

📓 Path-based instructions (1)

**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{ts,tsx}: Always prefer using isomorphic code like fetch, ReadableStream, etc. instead of Node.js specific code
For TypeScript, we usually use types over interfaces
Avoid enums
No default exports, use function declarations

Files:

apps/supervisor/src/services/failedPodHandler.ts

🧬 Code graph analysis (1)

apps/supervisor/src/services/failedPodHandler.ts (1)

apps/supervisor/src/workloadManager/kubernetes.ts (1)

err (237-253)

🔇 Additional comments (1)

apps/supervisor/src/services/failedPodHandler.ts (1)

28-28: Good addition: reconnection guard flag.

The reconnecting flag is the right primitive to prevent concurrent start() calls. LGTM.

apps/supervisor/src/services/failedPodHandler.ts

NERLOE · 2025-10-22T11:15:10Z

CodeRabbit has suggested two defensive enhancements that could be added if desired:

Unhandled rejection handling: Wrap the async error handler to catch any exceptions
Stop() during reconnection: Re-check isRunning after backoff delay

These are good defensive measures but not critical to fixing the core bug. Happy to implement them if maintainers prefer.

nicktrn

This is great, thank you! 🙏

Just the catch handler to add, I dismissed the suggestion re wrapping the entire function.

Will get this up on our test env asap

apps/supervisor/src/services/failedPodHandler.ts

coderabbitai

Actionable comments posted: 0

♻️ Duplicate comments (2)

apps/supervisor/src/services/failedPodHandler.ts (2)

253-255: Wrap async handler to avoid unhandled rejections; also drop redundant .bind(this).

Event emitters don’t await handlers. Wrap onError() and catch/log rejections; keep semantics unchanged. Since makeOnError returns an arrow that closes over this, .bind(this) at registration is unnecessary.

-  private makeOnError(informerName: string) {
-    return (err?: unknown) => this.onError(informerName, err);
-  }
+  private makeOnError(informerName: string) {
+    return (err?: unknown) => {
+      this.onError(informerName, err).catch((handlerError) => {
+        const error = handlerError instanceof Error ? handlerError : undefined;
+        this.logger.error("onError handler failure", {
+          informerName,
+          error: error?.message,
+          errorType: error?.name,
+          errorStack: error?.stack,
+        });
+      });
+    };
+  }

Apply at registration (outside this hunk):

-    this.informer.on("error", this.makeOnError("failed-pod-informer").bind(this));
+    this.informer.on("error", this.makeOnError("failed-pod-informer"));

283-286: Optional: skip reconnect if stopped during backoff.

If stop() occurs during the timeout, we may still call start(). A cheap check avoids a surprise restart.

       // Reconnect on errors
       await setTimeout(this.reconnectIntervalMs);
+      if (!this.isRunning) {
+        this.logger.debug("onError: handler stopped during backoff; skipping reconnect", {
+          informerName,
+        });
+        return;
+      }
       await this.informer.start();

🧹 Nitpick comments (1)

apps/supervisor/src/services/failedPodHandler.ts (1)
273-296: Count every error event, even when skipping reconnection.

informerEventsTotal is described as “informer events”; unlike add/connect, error is only incremented inside the try. When skipping due to reconnecting, the error event isn’t counted. Move the increment before the guard and remove the later one.
   private async onError(informerName: string, err?: unknown) {
     if (!this.isRunning) {
       this.logger.warn("onError: informer not running");
       return;
     }
 
+    // Count every error event, even if a reconnect is skipped
+    this.informerEventsTotal.inc({ namespace: this.namespace, verb: "error" });
-
-    // Guard against multiple simultaneous reconnections
+    // Guard against multiple simultaneous reconnections
     if (this.reconnecting) {
       this.logger.debug("onError: reconnection already in progress, skipping", {
         informerName,
       });
       return;
     }
@@
       this.logger.error("error event fired", {
         informerName,
         error: error?.message,
         errorType: error?.name,
         errorStack: error?.stack,
       });
-      this.informerEventsTotal.inc({ namespace: this.namespace, verb: "error" });
Please confirm whether you want the metric to reflect “events emitted” vs “reconnections attempted.”

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 55c45ab and 47361ff.

📒 Files selected for processing (1)

apps/supervisor/src/services/failedPodHandler.ts (2 hunks)

🧰 Additional context used

📓 Path-based instructions (1)

**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{ts,tsx}: Always prefer using isomorphic code like fetch, ReadableStream, etc. instead of Node.js specific code
For TypeScript, we usually use types over interfaces
Avoid enums
No default exports, use function declarations

Files:

apps/supervisor/src/services/failedPodHandler.ts

🧬 Code graph analysis (1)

apps/supervisor/src/services/failedPodHandler.ts (1)

apps/supervisor/src/workloadManager/kubernetes.ts (1)

err (237-253)

🔇 Additional comments (2)

apps/supervisor/src/services/failedPodHandler.ts (2)

28-28: Reconnection guard flag: LGTM.

This single‑flight gate is the right fix; the finally reset is in place.

263-269: Single‑flight reconnection guard: LGTM.

Prevents the escalating reconnection storm reported in #2623.

NERLOE · 2025-10-22T16:24:57Z

This is great, thank you! 🙏

Just the catch handler to add, I dismissed the suggestion re wrapping the entire function.

Will get this up on our test env asap

Thanks! I've added the catch handler and pushed.

One thought: The errorStack field is pretty verbose (~400 chars) and gets logged every ~5 min during normal watch timeouts. Since it's always the same AbortError stack trace, would you prefer I remove it and just keep error and errorType?

{"timestamp":"2025-10-22T15:09:02.192Z","message":"error event fired","$name":"failed-pod-handler","$level":"error","id":"9eeyr8vrx1","informerName":"failed-pod-informer","error":"The user aborted a request.","errorType":"AbortError","errorStack":"AbortError: The user aborted a request.\n at abort (/app/apps/supervisor/node_modules/.pnpm/[email protected]/node_modules/node-fetch/lib/index.js:1462:16)\n at AbortSignal.abortAndFinalize (/app/apps/supervisor/node_modules/.pnpm/[email protected]/node_modules/node-fetch/lib/index.js:1477:4)\n at [nodejs.internal.kHybridDispatch] (node:internal/event_target:827:20)\n at AbortSignal.dispatchEvent (node:internal/event_target:762:26)\n at runAbort (node:internal/abort_controller:447:10)\n at abortSignal (node:internal/abort_controller:433:3)\n at AbortController.abort (node:internal/abort_controller:466:5)\n at PassThrough.doneCallOnce (file:///app/apps/supervisor/node_modules/.pnpm/@[email protected]_patch_hash=s75bgwaoixupmywtvgoy5ruszq/node_modules/@kubernetes/client-node/dist/watch.js:32:28)\n at PassThrough.emit (node:events:530:35)\n at emitErrorNT (node:internal/streams/destroy:170:8)"}
{"timestamp":"2025-10-22T15:09:02.192Z","message":"onError: reconnection already in progress, skipping","$name":"failed-pod-handler","$level":"debug","id":"9eeyr8vrx1","informerName":"failed-pod-informer"}

Happy to adjust based on what works best for your logging setup! 👍

nicktrn · 2025-10-22T16:53:36Z

Thanks @NERLOE

Since it's always the same AbortError stack trace, would you prefer I remove it and just keep error and errorType?

Sure, makes sense to remove errorStack

NERLOE · 2025-10-22T17:35:26Z

Done✅

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

apps/supervisor/src/services/failedPodHandler.ts (1)
253-255: Wrap async handler to avoid unhandled rejections (and remove redundant bind).

Event emitters don’t await returned Promises; any rejection becomes an unhandled rejection. Wrap the call and log inside the returned function. Also, since you return an arrow function (lexically binds this), the .bind(this) at registration is redundant.

Apply:
-  private makeOnError(informerName: string) {
-    return (err?: unknown) => this.onError(informerName, err);
-  }
+  private makeOnError(informerName: string) {
+    return (err?: unknown): void => {
+      void this.onError(informerName, err).catch((handlerError) => {
+        const error = handlerError instanceof Error ? handlerError : undefined;
+        this.logger.error("onError handler failure", {
+          informerName,
+          error: error?.message,
+          errorType: error?.name,
+        });
+      });
+    };
+  }
And update the registration (outside this hunk) to drop the unnecessary bind:
-    this.informer.on("error", this.makeOnError("failed-pod-informer").bind(this));
+    this.informer.on("error", this.makeOnError("failed-pod-informer"));

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 47361ff and 1f0a5bb.

📒 Files selected for processing (1)

apps/supervisor/src/services/failedPodHandler.ts (2 hunks)

🧰 Additional context used

📓 Path-based instructions (1)

**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{ts,tsx}: Always prefer using isomorphic code like fetch, ReadableStream, etc. instead of Node.js specific code
For TypeScript, we usually use types over interfaces
Avoid enums
No default exports, use function declarations

Files:

apps/supervisor/src/services/failedPodHandler.ts

🧬 Code graph analysis (1)

apps/supervisor/src/services/failedPodHandler.ts (1)

apps/supervisor/src/workloadManager/kubernetes.ts (1)

err (237-253)

🔇 Additional comments (1)

apps/supervisor/src/services/failedPodHandler.ts (1)

28-28: Good guard to serialize reconnects.

The reconnecting flag with a finally reset correctly prevents overlapping reconnections. Nice.

apps/supervisor/src/services/failedPodHandler.ts

nicktrn

Nice one, thanks again @NERLOE 🔥

fix(supervisor): prevent escalating duplicate reconnections in failed…

55c45ab

…PodHandler

NERLOE force-pushed the fix/failed-pod-handler-reconnection-guard branch from 6b15642 to 55c45ab Compare October 22, 2025 10:31

coderabbitai bot reviewed Oct 22, 2025

View reviewed changes

apps/supervisor/src/services/failedPodHandler.ts Show resolved Hide resolved

apps/supervisor/src/services/failedPodHandler.ts Show resolved Hide resolved

nicktrn requested changes Oct 22, 2025

View reviewed changes

apps/supervisor/src/services/failedPodHandler.ts Show resolved Hide resolved

fix: added catch handler for informer.start() failures

47361ff

coderabbitai bot reviewed Oct 22, 2025

View reviewed changes

fix: removed 'errorStack' from error log

1f0a5bb

coderabbitai bot reviewed Oct 22, 2025

View reviewed changes

apps/supervisor/src/services/failedPodHandler.ts Show resolved Hide resolved

nicktrn approved these changes Oct 23, 2025

View reviewed changes

nicktrn merged commit 255ea0a into triggerdotdev:main Oct 23, 2025
26 checks passed

Uh oh!

fix(supervisor): prevent escalating duplicate reconnections in failedPodHandler #2627

fix(supervisor): prevent escalating duplicate reconnections in failedPodHandler #2627

Uh oh!

Conversation

NERLOE commented Oct 22, 2025

✅ Checklist

Testing

Changelog

Screenshots

Uh oh!

changeset-bot bot commented Oct 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

⚠️ No Changeset found

Uh oh!

coderabbitai bot commented Oct 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Estimated code review effort

Pre-merge checks and finishing touches

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

NERLOE commented Oct 22, 2025

Uh oh!

nicktrn left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

NERLOE commented Oct 22, 2025

Uh oh!

nicktrn commented Oct 22, 2025

Uh oh!

NERLOE commented Oct 22, 2025

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

nicktrn left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

changeset-bot bot commented Oct 22, 2025 •

edited

Loading

coderabbitai bot commented Oct 22, 2025 •

edited

Loading