Skip to content

Conversation

@NERLOE
Copy link
Contributor

@NERLOE NERLOE commented Oct 22, 2025

Closes #2623

✅ Checklist

  • I have followed every step in the contributing guide
  • The PR title follows the convention
  • I ran and tested the code works

Testing

Production Testing (Self-Hosted GKE Autopilot):

  • Deployed to production cluster that was experiencing catastrophic escalation
  • Monitored for 25+ minutes across 5 disconnect cycles

Results:

  • ✅ Before fix: Errors escalated 2 → 3 → 5 → 9 → 17 → 33 → continuous loop
  • ✅ After fix: Errors remain stable at 1 per cycle (no escalation)
  • ✅ Reconnection guard activates every cycle: "reconnection already in progress, skipping"
  • ✅ System recovers in ~1 second and continues normally
  • ✅ Error details now logged: AbortError: The user aborted a request.

Changelog

Bug Fix:

  • Add reconnection guard to prevent concurrent reconnection attempts in failedPodHandler
  • Prevents escalating duplicate connections when Kubernetes informer disconnects
  • Fixes exponential growth pattern: 2 → 3 → 5 → 9 → 17 → 33+ errors/reconnects

Improvement:

  • Add error details logging (message, type, stack) to capture actual error information
  • Previously only logged "error event fired" without details, making debugging difficult

Technical Details:

  • Added reconnecting flag to guard concurrent reconnection attempts
  • Updated onError handler to accept error parameter and log details
  • Used finally block to ensure flag is always cleared

Screenshots

Before (escalating errors):

@changeset-bot
Copy link

changeset-bot bot commented Oct 22, 2025

⚠️ No Changeset found

Latest commit: 1f0a5bb

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Oct 22, 2025

Walkthrough

The FailedPodHandler class (apps/supervisor/src/services/failedPodHandler.ts) adds a private reconnecting: boolean flag to prevent overlapping reconnection attempts. makeOnError now returns a handler that forwards an optional err to onError, and onError accepts err?: unknown. The reconnection flow is wrapped in try/catch/finally so reconnecting is always reset; reconnection errors are logged with message, type/name, and stack, and an informer error metric is incremented. onError skips attempting reconnection when the guard is set.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Pre-merge checks and finishing touches

✅ Passed checks (5 passed)
Check name Status Explanation
Title Check ✅ Passed The pull request title "fix(supervisor): prevent escalating duplicate reconnections in failedPodHandler" directly and clearly describes the primary change in the changeset. It uses the correct commit convention format (fix with scope), is concise and readable, and accurately reflects the main objective of adding a reconnection guard to prevent concurrent/escalating reconnection attempts in the failedPodHandler service.
Linked Issues Check ✅ Passed The code changes directly address all primary coding requirements from linked issue #2623. The implementation adds a reconnecting boolean flag to guard against concurrent reconnection attempts, wraps the reconnection logic in try/catch/finally to ensure flag cleanup, and introduces a guard path to skip reconnection if one is already in progress. The production testing validates the fix resolves the escalation pattern (2 → 3 → 5 → 9 → 17 → 33) to a stable 1 error per cycle, with error details now properly logged to aid debugging. The enhanced error logging (message, type, stack) represents an additional improvement aligned with the PR objectives.
Out of Scope Changes Check ✅ Passed All code changes in the pull request are directly scoped to the requirements specified in issue #2623 and the PR objectives. The modifications—adding the reconnecting flag, updating the onError signature to accept error parameters, enhancing error logging, and wrapping reconnection logic in try/catch/finally—are all essential to implement the reconnection guard and the mentioned improvement of better error logging. No changes appear to introduce unrelated functionality or scope creep outside the stated objectives.
Description Check ✅ Passed The pull request description follows the required template structure and includes all major sections: the issue closure reference (#2623), a completed checklist, a detailed testing section documenting production validation across multiple disconnect cycles with before/after results, and a comprehensive changelog explaining the bug fix and improvements. While the screenshots section appears incomplete, the description is substantive and provides sufficient detail to understand the changes and their validation.
Docstring Coverage ✅ Passed No functions found in the changes. Docstring coverage check skipped.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@NERLOE NERLOE force-pushed the fix/failed-pod-handler-reconnection-guard branch from 6b15642 to 55c45ab Compare October 22, 2025 10:31
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (2)
apps/supervisor/src/services/failedPodHandler.ts (2)

291-293: Mirror the listener wrapper for connect to be defensive.

Arrow already captures this, but the async call can still reject. Wrap and log to avoid unhandled rejections (rare here, but consistent).

Apply:

-  private makeOnConnect(informerName: string) {
-    return () => this.onConnect(informerName);
-  }
+  private makeOnConnect(informerName: string) {
+    return () => {
+      this.onConnect(informerName).catch((handlerError) => {
+        const error = handlerError instanceof Error ? handlerError : undefined;
+        this.logger.error("onConnect handler failure", {
+          informerName,
+          error: error?.message,
+          errorType: error?.name,
+          errorStack: error?.stack,
+        });
+      });
+    };
+  }

67-69: Redundant .bind(this) on arrow callbacks.

makeOnConnect/makeOnError return arrow functions that already lexically bind this. The extra .bind(this) is unnecessary.

Apply:

-    this.informer.on("connect", this.makeOnConnect("failed-pod-informer").bind(this));
-    this.informer.on("error", this.makeOnError("failed-pod-informer").bind(this));
+    this.informer.on("connect", this.makeOnConnect("failed-pod-informer"));
+    this.informer.on("error", this.makeOnError("failed-pod-informer"));
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d90da7a and 55c45ab.

📒 Files selected for processing (1)
  • apps/supervisor/src/services/failedPodHandler.ts (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{ts,tsx}: Always prefer using isomorphic code like fetch, ReadableStream, etc. instead of Node.js specific code
For TypeScript, we usually use types over interfaces
Avoid enums
No default exports, use function declarations

Files:

  • apps/supervisor/src/services/failedPodHandler.ts
🧬 Code graph analysis (1)
apps/supervisor/src/services/failedPodHandler.ts (1)
apps/supervisor/src/workloadManager/kubernetes.ts (1)
  • err (237-253)
🔇 Additional comments (1)
apps/supervisor/src/services/failedPodHandler.ts (1)

28-28: Good addition: reconnection guard flag.

The reconnecting flag is the right primitive to prevent concurrent start() calls. LGTM.

@NERLOE
Copy link
Contributor Author

NERLOE commented Oct 22, 2025

CodeRabbit has suggested two defensive enhancements that could be added if desired:

  1. Unhandled rejection handling: Wrap the async error handler to catch any exceptions
  2. Stop() during reconnection: Re-check isRunning after backoff delay

These are good defensive measures but not critical to fixing the core bug. Happy to implement them if maintainers prefer.

Copy link
Collaborator

@nicktrn nicktrn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great, thank you! 🙏

Just the catch handler to add, I dismissed the suggestion re wrapping the entire function.

Will get this up on our test env asap

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (2)
apps/supervisor/src/services/failedPodHandler.ts (2)

253-255: Wrap async handler to avoid unhandled rejections; also drop redundant .bind(this).

Event emitters don’t await handlers. Wrap onError() and catch/log rejections; keep semantics unchanged. Since makeOnError returns an arrow that closes over this, .bind(this) at registration is unnecessary.

-  private makeOnError(informerName: string) {
-    return (err?: unknown) => this.onError(informerName, err);
-  }
+  private makeOnError(informerName: string) {
+    return (err?: unknown) => {
+      this.onError(informerName, err).catch((handlerError) => {
+        const error = handlerError instanceof Error ? handlerError : undefined;
+        this.logger.error("onError handler failure", {
+          informerName,
+          error: error?.message,
+          errorType: error?.name,
+          errorStack: error?.stack,
+        });
+      });
+    };
+  }

Apply at registration (outside this hunk):

-    this.informer.on("error", this.makeOnError("failed-pod-informer").bind(this));
+    this.informer.on("error", this.makeOnError("failed-pod-informer"));

283-286: Optional: skip reconnect if stopped during backoff.

If stop() occurs during the timeout, we may still call start(). A cheap check avoids a surprise restart.

       // Reconnect on errors
       await setTimeout(this.reconnectIntervalMs);
+      if (!this.isRunning) {
+        this.logger.debug("onError: handler stopped during backoff; skipping reconnect", {
+          informerName,
+        });
+        return;
+      }
       await this.informer.start();
🧹 Nitpick comments (1)
apps/supervisor/src/services/failedPodHandler.ts (1)

273-296: Count every error event, even when skipping reconnection.

informerEventsTotal is described as “informer events”; unlike add/connect, error is only incremented inside the try. When skipping due to reconnecting, the error event isn’t counted. Move the increment before the guard and remove the later one.

   private async onError(informerName: string, err?: unknown) {
     if (!this.isRunning) {
       this.logger.warn("onError: informer not running");
       return;
     }
 
+    // Count every error event, even if a reconnect is skipped
+    this.informerEventsTotal.inc({ namespace: this.namespace, verb: "error" });
-
-    // Guard against multiple simultaneous reconnections
+    // Guard against multiple simultaneous reconnections
     if (this.reconnecting) {
       this.logger.debug("onError: reconnection already in progress, skipping", {
         informerName,
       });
       return;
     }
@@
       this.logger.error("error event fired", {
         informerName,
         error: error?.message,
         errorType: error?.name,
         errorStack: error?.stack,
       });
-      this.informerEventsTotal.inc({ namespace: this.namespace, verb: "error" });

Please confirm whether you want the metric to reflect “events emitted” vs “reconnections attempted.”

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 55c45ab and 47361ff.

📒 Files selected for processing (1)
  • apps/supervisor/src/services/failedPodHandler.ts (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{ts,tsx}: Always prefer using isomorphic code like fetch, ReadableStream, etc. instead of Node.js specific code
For TypeScript, we usually use types over interfaces
Avoid enums
No default exports, use function declarations

Files:

  • apps/supervisor/src/services/failedPodHandler.ts
🧬 Code graph analysis (1)
apps/supervisor/src/services/failedPodHandler.ts (1)
apps/supervisor/src/workloadManager/kubernetes.ts (1)
  • err (237-253)
🔇 Additional comments (2)
apps/supervisor/src/services/failedPodHandler.ts (2)

28-28: Reconnection guard flag: LGTM.

This single‑flight gate is the right fix; the finally reset is in place.


263-269: Single‑flight reconnection guard: LGTM.

Prevents the escalating reconnection storm reported in #2623.

@NERLOE
Copy link
Contributor Author

NERLOE commented Oct 22, 2025

This is great, thank you! 🙏

Just the catch handler to add, I dismissed the suggestion re wrapping the entire function.

Will get this up on our test env asap

Thanks! I've added the catch handler and pushed.

One thought: The errorStack field is pretty verbose (~400 chars) and gets logged every ~5 min during normal watch timeouts. Since it's always the same AbortError stack trace, would you prefer I remove it and just keep error and errorType?

{"timestamp":"2025-10-22T15:09:02.192Z","message":"error event fired","$name":"failed-pod-handler","$level":"error","id":"9eeyr8vrx1","informerName":"failed-pod-informer","error":"The user aborted a request.","errorType":"AbortError","errorStack":"AbortError: The user aborted a request.\n at abort (/app/apps/supervisor/node_modules/.pnpm/[email protected]/node_modules/node-fetch/lib/index.js:1462:16)\n at AbortSignal.abortAndFinalize (/app/apps/supervisor/node_modules/.pnpm/[email protected]/node_modules/node-fetch/lib/index.js:1477:4)\n at [nodejs.internal.kHybridDispatch] (node:internal/event_target:827:20)\n at AbortSignal.dispatchEvent (node:internal/event_target:762:26)\n at runAbort (node:internal/abort_controller:447:10)\n at abortSignal (node:internal/abort_controller:433:3)\n at AbortController.abort (node:internal/abort_controller:466:5)\n at PassThrough.doneCallOnce (file:///app/apps/supervisor/node_modules/.pnpm/@[email protected]_patch_hash=s75bgwaoixupmywtvgoy5ruszq/node_modules/@kubernetes/client-node/dist/watch.js:32:28)\n at PassThrough.emit (node:events:530:35)\n at emitErrorNT (node:internal/streams/destroy:170:8)"}
{"timestamp":"2025-10-22T15:09:02.192Z","message":"onError: reconnection already in progress, skipping","$name":"failed-pod-handler","$level":"debug","id":"9eeyr8vrx1","informerName":"failed-pod-informer"}

Happy to adjust based on what works best for your logging setup! 👍

@nicktrn
Copy link
Collaborator

nicktrn commented Oct 22, 2025

Thanks @NERLOE

Since it's always the same AbortError stack trace, would you prefer I remove it and just keep error and errorType?

Sure, makes sense to remove errorStack

@NERLOE
Copy link
Contributor Author

NERLOE commented Oct 22, 2025

Done✅

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
apps/supervisor/src/services/failedPodHandler.ts (1)

253-255: Wrap async handler to avoid unhandled rejections (and remove redundant bind).

Event emitters don’t await returned Promises; any rejection becomes an unhandled rejection. Wrap the call and log inside the returned function. Also, since you return an arrow function (lexically binds this), the .bind(this) at registration is redundant.

Apply:

-  private makeOnError(informerName: string) {
-    return (err?: unknown) => this.onError(informerName, err);
-  }
+  private makeOnError(informerName: string) {
+    return (err?: unknown): void => {
+      void this.onError(informerName, err).catch((handlerError) => {
+        const error = handlerError instanceof Error ? handlerError : undefined;
+        this.logger.error("onError handler failure", {
+          informerName,
+          error: error?.message,
+          errorType: error?.name,
+        });
+      });
+    };
+  }

And update the registration (outside this hunk) to drop the unnecessary bind:

-    this.informer.on("error", this.makeOnError("failed-pod-informer").bind(this));
+    this.informer.on("error", this.makeOnError("failed-pod-informer"));
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 47361ff and 1f0a5bb.

📒 Files selected for processing (1)
  • apps/supervisor/src/services/failedPodHandler.ts (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{ts,tsx}: Always prefer using isomorphic code like fetch, ReadableStream, etc. instead of Node.js specific code
For TypeScript, we usually use types over interfaces
Avoid enums
No default exports, use function declarations

Files:

  • apps/supervisor/src/services/failedPodHandler.ts
🧬 Code graph analysis (1)
apps/supervisor/src/services/failedPodHandler.ts (1)
apps/supervisor/src/workloadManager/kubernetes.ts (1)
  • err (237-253)
🔇 Additional comments (1)
apps/supervisor/src/services/failedPodHandler.ts (1)

28-28: Good guard to serialize reconnects.

The reconnecting flag with a finally reset correctly prevents overlapping reconnections. Nice.

Copy link
Collaborator

@nicktrn nicktrn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice one, thanks again @NERLOE 🔥

@nicktrn nicktrn merged commit 255ea0a into triggerdotdev:main Oct 23, 2025
26 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: Escalating duplicate reconnections in failedPodHandler

2 participants