fix: automatically handle container exits and reconnect clients #2634
Fixes #2270

When containers exit or are removed, clients (Cursor, Cline) lose their connection and the tool stops working. This PR implements automatic handling of container exits, with client config updates to force reconnection.

## Changes

### Transport Layer

- Updated handleContainerExit() in http.go and stdio.go to log clear warnings
- Made TransparentProxy.Stop() idempotent to prevent double-close panics
- Added logging to distinguish between container exit and container removal

### Runner Layer

- Check whether the workload exists in the runtime (not just status files) to determine if the container was removed or just exited
- If the container was removed: clean up the client config and exit gracefully
- If the container exited: signal to the workload manager that a restart is needed

### Workload Manager

- Automatic restart with exponential backoff (5s, 10s, 20s, 40s, 60s)
- Up to 10 retry attempts (~10 minutes)
- Remove from client config before restart, re-add after a successful restart
- This forces clients to see the config change and reconnect with fresh sessions

## Behavior

**Container exits (crash)**:

- Detect exit → Remove from ~/.cursor/mcp.json → Wait with backoff → Restart container → Re-add to mcp.json → Client reconnects ✅

**Container removed (docker rm / thv delete)**:

- Detect removal → Remove from ~/.cursor/mcp.json → Exit gracefully → Tool vanishes from client ✅

## Testing

Tested with Cursor on macOS:

- Container kill: automatic restart and reconnection work
- Container remove: clean removal from client config
Codecov Report

```
@@            Coverage Diff             @@
##             main    #2634      +/-   ##
==========================================
- Coverage   55.21%   55.17%    -0.05%
==========================================
  Files         315      315
  Lines       30131    30264     +133
==========================================
+ Hits        16637    16697      +60
- Misses      12045    12109      +64
- Partials     1449     1458       +9
```

View full report in Codecov by Sentry.
Adds unit tests for:

- HTTPTransport.ShouldRestart() - restart logic for container exits vs removals
- StdioTransport.ShouldRestart() - restart logic for container exits vs removals
- TransparentProxy.Stop() - idempotent stop functionality
- DefaultManager.RunWorkload() - container exit handling

These tests cover the new automatic container exit detection and restart functionality, improving code coverage for the PR.
This may be related to https://discord.com/channels/1184987096302239844/1439759480069034036/1440007583145394297
Fixes test failures caused by corrupted module cache in CI environment. The 'go clean -modcache' step ensures a clean state before downloading dependencies, preventing 'dial tcp: lookup proxy.golang.org' errors during test compilation. Related to #2634
Adds explicit download step for github.com/Microsoft/[email protected] before general module download to prevent network/DNS failures during test compilation. Related to #2634
Disables the built-in actions/setup-go cache and implements manual module caching with explicit control. This prevents cache corruption issues that were causing 'go test' to attempt network downloads during compilation.

Changes:
- Set cache: false in actions/setup-go
- Add explicit actions/cache for ~/go/pkg/mod
- Remove go clean -modcache and go get steps (no longer needed)
- Cache key based on go.sum ensures proper invalidation

Related to #2634
Fixes a data race where Stop() could set p.server = nil while a goroutine in Start() was still accessing it. Now captures the server in a local variable before starting the goroutine.

Fixes race detector failure:
- Read at transparent_proxy.go:380 (Start goroutine)
- Write at transparent_proxy.go:461 (Stop method)

Related to #2634
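Capturing the server in a local variable before launching the goroutine is a standard Go race fix. A minimal sketch of the pattern, with a simplified stand-in for TransparentProxy (field and method names illustrative, not the PR's exact code):

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
)

// proxy is a minimal stand-in for TransparentProxy: Stop() may nil out
// p.server while the serve goroutine started in Start() is still running.
type proxy struct {
	mu     sync.Mutex
	server *http.Server
}

func (p *proxy) Start() {
	p.mu.Lock()
	p.server = &http.Server{Addr: "127.0.0.1:0"}
	srv := p.server // capture before the goroutine: avoids racing with Stop()
	p.mu.Unlock()

	go func() {
		// The goroutine only touches the local srv, never p.server,
		// so Stop() writing p.server = nil cannot race with this read.
		_ = srv.ListenAndServe()
	}()
}

func (p *proxy) Stop() {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.server != nil {
		_ = p.server.Close()
		p.server = nil
	}
}

func main() {
	p := &proxy{}
	p.Start()
	p.Stop()
	fmt.Println("stopped cleanly")
}
```

The nil check in Stop() also makes repeated calls safe, which is the other half of the idempotency fix.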
The test failures were caused by a race condition in the code, not by module cache issues. Reverting the workflow changes since they were not needed. The actual fix was in transparent_proxy.go to prevent race condition when accessing p.server between Start() and Stop(). Related to #2634
Addresses review comments in PR #2634:

1. Replace string matching with typed error checking
   - Add ErrContainerRemoved error type to docker package
   - Update container monitor to return ErrContainerRemoved when the container is not found
   - Replace strings.Contains() with errors.Is() in http.go and stdio.go
   - Update tests to use the typed error
2. Simplify workload existence checking
   - Move DoesWorkloadExist method directly onto the Runner struct
   - Remove workloadExistenceChecker type and getWorkloadManager helper

All tests pass successfully.
…ogic

The DoesWorkloadExist method was using IsWorkloadRunning, which returns false for stopped containers. This caused the system to treat stopped containers as removed, cleaning up the client config instead of restarting. Now using GetWorkloadInfo to check whether the container exists (regardless of running state), so stopped containers are correctly identified as existing and trigger an automatic restart.
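The corrected check boils down to "does the runtime have any record of the container, running or not". A minimal sketch with a hypothetical in-memory runtime model (the real code queries Docker via GetWorkloadInfo):

```go
package main

import "fmt"

// workloadInfo is a minimal stand-in for the runtime's container record.
type workloadInfo struct {
	Name    string
	Running bool
}

// runtime maps container name -> info: a removed container has no entry,
// while a stopped one still does. Hypothetical in-memory model.
type runtime map[string]workloadInfo

// doesWorkloadExist mirrors the fixed logic: existence means "the runtime
// has a record", regardless of running state, so stopped containers are
// restarted rather than cleaned out of the client config.
func doesWorkloadExist(rt runtime, name string) bool {
	_, ok := rt[name]
	return ok
}

func main() {
	rt := runtime{"fetch": {Name: "fetch", Running: false}} // stopped, not removed
	fmt.Println(doesWorkloadExist(rt, "fetch"))  // stopped container still exists -> restart
	fmt.Println(doesWorkloadExist(rt, "github")) // no record -> removed, clean up config
}
```

The bug was effectively using the Running field where only map membership matters.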
Adds detection for container restarts (e.g., 'docker restart <container>'). When a container is restarted, Docker assigns a new StartedAt timestamp while keeping the same container name. The monitor now tracks the initial StartedAt time and compares it on each check to detect restarts.

Changes:
- Add StartedAt field to ContainerInfo to track when the container last started
- Populate StartedAt from Docker's State.StartedAt field
- Check for StartedAt changes in the monitor to detect restarts
- Trigger the automatic restart logic when a restart is detected
- Update tests to mock GetWorkloadInfo calls

This fixes the issue where restarting a container broke the proxy connection without triggering automatic reconnection.
Signed-off-by: nigel brown <[email protected]>
Fixes #2270
Problem
When containers exit or are removed, clients (Cursor, Cline, VSCode) lose connection and tools stop working. The issue noted: "When a pod dies, we have to change the config (just touching the file doesn't work. we need to actually change the config)"
Solution
This PR implements automatic container exit handling with client config updates to force reconnection.
Key Changes
1. Automatic Container Exit Detection
Made TransparentProxy.Stop() idempotent to prevent panics
2. Smart Workload Existence Check
Uses thv ls as the source of truth for whether to restart or clean up
3. Automatic Restart with Exponential Backoff
4. Client Config Management (The Core Fix!)
Removes the server from ~/.cursor/mcp.json before restart and re-adds it to ~/.cursor/mcp.json after a successful restart
5. Clean Removal Handling
On intentional removal (docker rm, thv delete): permanently removes from client config
Behavior
Container exits (crash):
Detect exit → Remove from ~/.cursor/mcp.json → Wait with backoff → Restart container → Re-add to mcp.json → Client reconnects
Container removed intentionally:
Detect removal → Remove from ~/.cursor/mcp.json → Exit gracefully → Tool vanishes from client
Files Changed
pkg/transport/http.go - Container exit handling for HTTP transport
pkg/transport/stdio.go - Container exit handling for stdio transport
pkg/transport/proxy/transparent/transparent_proxy.go - Idempotent Stop()
pkg/runner/runner.go - Workload existence check and client config cleanup
pkg/workloads/manager.go - Automatic restart with backoff + config refresh
docs/fixes/container-exit-handling.md - Documentation (new)
Testing
✅ Container kill → Automatic restart → Client reconnects
✅ Container remove → Clean config removal → Tool vanishes
✅ Multiple failures → Exponential backoff works
✅ No panics on double-close
Tested with Cursor on macOS.
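One common way to get the "no panics on double-close" behavior verified above is sync.Once. This is a sketch of the pattern under that assumption, not necessarily the PR's exact implementation inside TransparentProxy.Stop():

```go
package main

import (
	"fmt"
	"sync"
)

// stopper sketches an idempotent Stop(): sync.Once guarantees the shutdown
// body runs exactly once, so a second Stop() (e.g. the container exit
// handler racing with an explicit shutdown) cannot double-close the channel.
type stopper struct {
	once     sync.Once
	shutdown chan struct{}
}

func newStopper() *stopper {
	return &stopper{shutdown: make(chan struct{})}
}

func (s *stopper) Stop() {
	s.once.Do(func() {
		close(s.shutdown) // would panic if it ever ran twice
	})
}

func main() {
	s := newStopper()
	s.Stop()
	s.Stop() // safe: the close body is skipped on repeat calls
	fmt.Println("no panic on double stop")
}
```

A guarded nil check under a mutex achieves the same effect; sync.Once just makes the exactly-once intent explicit.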