Conversation

@therealnb (Contributor)

Fixes #2270

Problem

When containers exit or are removed, clients (Cursor, Cline, VSCode) lose connection and tools stop working. The issue noted: "When a pod dies, we have to change the config (just touching the file doesn't work. we need to actually change the config)"

Solution

This PR implements automatic container exit handling with client config updates to force reconnection.

Key Changes

1. Automatic Container Exit Detection

  • Transport layer detects when containers exit or are removed
  • Clear logging distinguishes between crash vs intentional removal
  • Made TransparentProxy.Stop() idempotent to prevent panics
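
The idempotent-Stop change can be sketched with `sync.Once`; the `proxy` type and field names here are illustrative stand-ins for `TransparentProxy`, not the PR's actual code:

```go
package main

import (
	"fmt"
	"sync"
)

// proxy is a minimal stand-in for TransparentProxy; only the
// idempotency pattern is shown.
type proxy struct {
	stopOnce sync.Once
	done     chan struct{}
}

func newProxy() *proxy {
	return &proxy{done: make(chan struct{})}
}

// Stop is safe to call multiple times: sync.Once guarantees the done
// channel is closed exactly once, so a second Stop cannot panic with
// a double-close.
func (p *proxy) Stop() {
	p.stopOnce.Do(func() {
		close(p.done)
	})
}

func main() {
	p := newProxy()
	p.Stop()
	p.Stop() // second call is a no-op, no panic
	fmt.Println("stopped twice without panic")
}
```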

2. Smart Workload Existence Check

  • Verifies container actually exists in Docker runtime (not just status files)
  • Uses thv ls as the source of truth for whether to restart or clean up

3. Automatic Restart with Exponential Backoff

  • Retries: 5s → 10s → 20s → 40s → 60s (capped)
  • Up to 10 attempts (~10 minutes total)
  • Only retries actual container exits, not other errors

4. Client Config Management (The Core Fix!)

  • Before restart: Removes server from ~/.cursor/mcp.json
  • After restart: Re-adds server to ~/.cursor/mcp.json
  • This forces clients to see the config change and reconnect with fresh sessions
  • Solves the "we have to change the config" requirement from the issue

5. Clean Removal Handling

  • If container removed (docker rm, thv delete): permanently removes from client config
  • Tool vanishes from client as expected

Behavior

Container exits (crash):

Exit detected → Remove from ~/.cursor/mcp.json → Wait with backoff → 
Restart container → Re-add to mcp.json → Client reconnects ✅

Container removed intentionally:

Removal detected → Remove from ~/.cursor/mcp.json → Exit gracefully →
Tool vanishes from client ✅

Files Changed

  • pkg/transport/http.go - Container exit handling for HTTP transport
  • pkg/transport/stdio.go - Container exit handling for stdio transport
  • pkg/transport/proxy/transparent/transparent_proxy.go - Idempotent Stop()
  • pkg/runner/runner.go - Workload existence check and client config cleanup
  • pkg/workloads/manager.go - Automatic restart with backoff + config refresh
  • docs/fixes/container-exit-handling.md - Documentation (new)

Testing

✅ Container kill → Automatic restart → Client reconnects
✅ Container remove → Clean config removal → Tool vanishes
✅ Multiple failures → Exponential backoff works
✅ No panics on double-close

Tested with Cursor on macOS.

@therealnb therealnb requested a review from amirejaz November 18, 2025 11:38

codecov bot commented Nov 18, 2025

Codecov Report

❌ Patch coverage is 25.00000% with 117 lines in your changes missing coverage. Please review.
✅ Project coverage is 55.17%. Comparing base (e9a311b) to head (d72be34).
⚠️ Report is 1 commit behind head on main.

Files with missing lines            Patch %   Lines
pkg/runner/runner.go                  0.00%   44 Missing ⚠️
pkg/workloads/manager.go             17.39%   35 Missing and 3 partials ⚠️
pkg/container/docker/monitor.go      10.00%   17 Missing and 1 partial ⚠️
pkg/transport/http.go                40.00%   9 Missing ⚠️
pkg/transport/stdio.go               42.85%   8 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2634      +/-   ##
==========================================
- Coverage   55.21%   55.17%   -0.05%     
==========================================
  Files         315      315              
  Lines       30131    30264     +133     
==========================================
+ Hits        16637    16697      +60     
- Misses      12045    12109      +64     
- Partials     1449     1458       +9     

@therealnb therealnb enabled auto-merge (squash) November 18, 2025 13:30
Adds unit tests for:
- HTTPTransport.ShouldRestart() - restart logic for container exits vs removals
- StdioTransport.ShouldRestart() - restart logic for container exits vs removals
- TransparentProxy.Stop() - idempotent stop functionality
- DefaultManager.RunWorkload() - container exit handling

These tests cover the new automatic container exit detection and restart
functionality, improving code coverage for the PR.
@therealnb (Contributor, Author)

Fixes test failures caused by corrupted module cache in CI environment.
The 'go clean -modcache' step ensures a clean state before downloading
dependencies, preventing 'dial tcp: lookup proxy.golang.org' errors
during test compilation.

Related to #2634
Adds explicit download step for github.com/Microsoft/[email protected]
before general module download to prevent network/DNS failures during
test compilation.

Related to #2634
Disables the built-in actions/setup-go cache and implements manual
module caching with explicit control. This prevents cache corruption
issues that were causing 'go test' to attempt network downloads during
compilation.

Changes:
- Set cache: false in actions/setup-go
- Add explicit actions/cache for ~/go/pkg/mod
- Remove go clean -modcache and go get steps (no longer needed)
- Cache key based on go.sum ensures proper invalidation

Related to #2634
Fixes data race where Stop() could set p.server = nil while a goroutine
in Start() was still accessing it. Now captures server in local variable
before starting the goroutine.

Fixes race detector failure:
- Read at transparent_proxy.go:380 (Start goroutine)
- Write at transparent_proxy.go:461 (Stop method)

Related to #2634
The test failures were caused by a race condition in the code, not by
module cache issues. Reverting the workflow changes since they were
not needed.

The actual fix was in transparent_proxy.go to prevent race condition
when accessing p.server between Start() and Stop().

Related to #2634
@therealnb therealnb requested a review from amirejaz November 19, 2025 16:57
Addresses review comments in PR #2634:

1. Replace string matching with typed error checking
   - Add ErrContainerRemoved error type to docker package
   - Update container monitor to return ErrContainerRemoved when container not found
   - Replace strings.Contains() with errors.Is() in http.go and stdio.go
   - Update tests to use typed error

2. Simplify workload existence checking
   - Move DoesWorkloadExist method directly onto Runner struct
   - Remove workloadExistenceChecker type and getWorkloadManager helper

All tests pass successfully.
…ogic

The DoesWorkloadExist method was using IsWorkloadRunning, which returns false
for stopped containers. This caused the system to treat stopped containers as
removed, cleaning up the client config instead of restarting them.

Now using GetWorkloadInfo to check if the container exists (regardless of
running state), so stopped containers are correctly identified as existing
and trigger automatic restart.
Adds detection for container restarts (e.g., 'docker restart <container>').
When a container is restarted, Docker assigns a new StartedAt timestamp
while keeping the same container name. The monitor now tracks the initial
StartedAt time and compares it on each check to detect restarts.

Changes:
- Add StartedAt field to ContainerInfo to track when container last started
- Populate StartedAt from Docker State.StartedAt field
- Monitor checks for StartedAt changes to detect restarts
- Trigger automatic restart logic when restart detected
- Update tests to mock GetWorkloadInfo calls

This fixes the issue where restarting a container broke the proxy connection
without triggering automatic reconnection.
amirejaz previously approved these changes Nov 20, 2025
@therealnb therealnb merged commit 0f4b185 into main Nov 20, 2025
17 checks passed
@therealnb therealnb deleted the fix/container-exit-reconnection branch November 20, 2025 11:52
Successfully merging this pull request may close these issues:

Clients lose their connection