Conversation

@therealnb (Contributor)

Fixes #2270

Problem

When containers exit or are removed, clients (Cursor, Cline, VSCode) lose connection and tools stop working. The issue noted: "When a pod dies, we have to change the config (just touching the file doesn't work. we need to actually change the config)"

Solution

This PR implements automatic container exit handling with client config updates to force reconnection.

Key Changes

1. Automatic Container Exit Detection

  • Transport layer detects when containers exit or are removed
  • Clear logging distinguishes between crash vs intentional removal
  • Made TransparentProxy.Stop() idempotent to prevent panics
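
The idempotent-Stop change can be sketched with `sync.Once`; the `proxy` type and field names here are illustrative stand-ins for `TransparentProxy`, not the PR's actual code:

```go
package main

import (
	"fmt"
	"sync"
)

// proxy is a minimal stand-in for TransparentProxy; only the
// idempotency pattern is shown.
type proxy struct {
	stopOnce sync.Once
	done     chan struct{}
}

func newProxy() *proxy {
	return &proxy{done: make(chan struct{})}
}

// Stop is safe to call multiple times: sync.Once guarantees the done
// channel is closed exactly once, so a second Stop cannot panic with
// a double-close.
func (p *proxy) Stop() {
	p.stopOnce.Do(func() {
		close(p.done)
	})
}

func main() {
	p := newProxy()
	p.Stop()
	p.Stop() // second call is a no-op, no panic
	fmt.Println("stopped twice without panic")
}
```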

2. Smart Workload Existence Check

  • Verifies container actually exists in Docker runtime (not just status files)
  • Uses thv ls as the source of truth for whether to restart or clean up

3. Automatic Restart with Exponential Backoff

  • Retries: 5s → 10s → 20s → 40s → 60s (capped)
  • Up to 10 attempts (~10 minutes total)
  • Only retries actual container exits, not other errors

4. Client Config Management (The Core Fix!)

  • Before restart: Removes server from ~/.cursor/mcp.json
  • After restart: Re-adds server to ~/.cursor/mcp.json
  • This forces clients to see the config change and reconnect with fresh sessions
  • Solves the "we have to change the config" requirement from the issue

5. Clean Removal Handling

  • If container removed (docker rm, thv delete): permanently removes from client config
  • Tool vanishes from client as expected

Behavior

Container exits (crash):

Exit detected → Remove from ~/.cursor/mcp.json → Wait with backoff → 
Restart container → Re-add to mcp.json → Client reconnects ✅

Container removed intentionally:

Removal detected → Remove from ~/.cursor/mcp.json → Exit gracefully →
Tool vanishes from client ✅

Files Changed

  • pkg/transport/http.go - Container exit handling for HTTP transport
  • pkg/transport/stdio.go - Container exit handling for stdio transport
  • pkg/transport/proxy/transparent/transparent_proxy.go - Idempotent Stop()
  • pkg/runner/runner.go - Workload existence check and client config cleanup
  • pkg/workloads/manager.go - Automatic restart with backoff + config refresh
  • docs/fixes/container-exit-handling.md - Documentation (new)

Testing

✅ Container kill → Automatic restart → Client reconnects
✅ Container remove → Clean config removal → Tool vanishes
✅ Multiple failures → Exponential backoff works
✅ No panics on double-close

Tested with Cursor on macOS.

@therealnb therealnb requested a review from amirejaz November 18, 2025 11:38

codecov bot commented Nov 18, 2025

Codecov Report

❌ Patch coverage is 25.00000% with 117 lines in your changes missing coverage. Please review.
✅ Project coverage is 55.17%. Comparing base (e9a311b) to head (d72be34).
⚠️ Report is 1 commit behind head on main.

Files with missing lines            Patch %   Lines
pkg/runner/runner.go                  0.00%   44 Missing ⚠️
pkg/workloads/manager.go             17.39%   35 Missing and 3 partials ⚠️
pkg/container/docker/monitor.go      10.00%   17 Missing and 1 partial ⚠️
pkg/transport/http.go                40.00%   9 Missing ⚠️
pkg/transport/stdio.go               42.85%   8 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2634      +/-   ##
==========================================
- Coverage   55.21%   55.17%   -0.05%     
==========================================
  Files         315      315              
  Lines       30131    30264     +133     
==========================================
+ Hits        16637    16697      +60     
- Misses      12045    12109      +64     
- Partials     1449     1458       +9     

@therealnb therealnb enabled auto-merge (squash) November 18, 2025 13:30
Adds unit tests for:
- HTTPTransport.ShouldRestart() - restart logic for container exits vs removals
- StdioTransport.ShouldRestart() - restart logic for container exits vs removals
- TransparentProxy.Stop() - idempotent stop functionality
- DefaultManager.RunWorkload() - container exit handling

These tests cover the new automatic container exit detection and restart
functionality, improving code coverage for the PR.
@therealnb (Contributor, Author)

Fixes test failures caused by corrupted module cache in CI environment.
The 'go clean -modcache' step ensures a clean state before downloading
dependencies, preventing 'dial tcp: lookup proxy.golang.org' errors
during test compilation.

Related to #2634
Adds explicit download step for github.com/Microsoft/[email protected]
before general module download to prevent network/DNS failures during
test compilation.

Related to #2634
Disables the built-in actions/setup-go cache and implements manual
module caching with explicit control. This prevents cache corruption
issues that were causing 'go test' to attempt network downloads during
compilation.

Changes:
- Set cache: false in actions/setup-go
- Add explicit actions/cache for ~/go/pkg/mod
- Remove go clean -modcache and go get steps (no longer needed)
- Cache key based on go.sum ensures proper invalidation

Related to #2634
Fixes data race where Stop() could set p.server = nil while a goroutine
in Start() was still accessing it. Now captures server in local variable
before starting the goroutine.

Fixes race detector failure:
- Read at transparent_proxy.go:380 (Start goroutine)
- Write at transparent_proxy.go:461 (Stop method)

Related to #2634
The test failures were caused by a race condition in the code, not by
module cache issues. Reverting the workflow changes since they were
not needed.

The actual fix was in transparent_proxy.go to prevent race condition
when accessing p.server between Start() and Stop().

Related to #2634
@therealnb therealnb requested a review from amirejaz November 19, 2025 16:57
Addresses review comments in PR #2634:

1. Replace string matching with typed error checking
   - Add ErrContainerRemoved error type to docker package
   - Update container monitor to return ErrContainerRemoved when container not found
   - Replace strings.Contains() with errors.Is() in http.go and stdio.go
   - Update tests to use typed error

2. Simplify workload existence checking
   - Move DoesWorkloadExist method directly onto Runner struct
   - Remove workloadExistenceChecker type and getWorkloadManager helper

All tests pass successfully.
…ogic

The DoesWorkloadExist method was using IsWorkloadRunning, which returns false
for stopped containers. This caused the system to treat stopped containers as
removed, cleaning up the client config instead of restarting them.

Now using GetWorkloadInfo to check if the container exists (regardless of
running state), so stopped containers are correctly identified as existing
and trigger automatic restart.
Adds detection for container restarts (e.g., 'docker restart <container>').
When a container is restarted, Docker assigns a new StartedAt timestamp
while keeping the same container name. The monitor now tracks the initial
StartedAt time and compares it on each check to detect restarts.

Changes:
- Add StartedAt field to ContainerInfo to track when container last started
- Populate StartedAt from Docker State.StartedAt field
- Monitor checks for StartedAt changes to detect restarts
- Trigger automatic restart logic when restart detected
- Update tests to mock GetWorkloadInfo calls

This fixes the issue where restarting a container broke the proxy connection
without triggering automatic reconnection.
amirejaz previously approved these changes Nov 20, 2025
@therealnb therealnb merged commit 0f4b185 into main Nov 20, 2025
17 checks passed
@therealnb therealnb deleted the fix/container-exit-reconnection branch November 20, 2025 11:52
Successfully merging this pull request may close these issues:

Clients lose their connection