fix: background polling for dashboard health checks (Windows + Linux) #519
Merged
Lightheartdevs merged 2 commits into main on Mar 21, 2026
Root cause: Docker Desktop's embedded DNS takes ~4 seconds to return NXDOMAIN for non-running containers. With 19 services checked concurrently via asyncio.gather, the slow DNS lookups blocked running services from being checked in time, causing everything to show as "degraded" on the dashboard.

Fix (three-part):

1. Fresh session per poll cycle: eliminates stale connection pool issues. The global aiohttp session accumulated dead connections from non-running services, poisoning subsequent polls. Now each cycle creates a fresh session with force_close=True and use_dns_cache=False, then closes it.
2. Not-deployed cache with TTL: services that fail DNS get cached for 15 seconds. Subsequent polls skip them entirely, so the slow 4-second DNS lookups only happen once per service.
3. Two-phase polling: Phase 1 returns cached not_deployed results instantly. Phase 2 checks remaining services with a semaphore (limit=4) to prevent DNS contention. Total timeout raised to 30s so the first poll (which has no cache) can complete even with slow DNS.

Net effect: first poll takes ~4-5 seconds (DNS for non-deployed services), subsequent polls complete in <50ms. All running services show healthy with 1-5ms response times. No behavior change on native Linux Docker where DNS failures are instant.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replaces request-triggered health checks with a background polling loop. API endpoints return cached results instantly (<1ms) instead of running live checks on every request (8-16s on Docker Desktop).

Architecture:
- Background task polls get_all_services() every 10 seconds
- Results stored in module-level cache
- All endpoints read from cache, falling back to live check only on first request before the poll completes

helpers.py changes (reverted from previous PR, minimal diff):
- Restored original shared aiohttp session pattern
- Increased total timeout from 5s to 30s (no user impact since it only runs in the background poll)
- Added asyncio.TimeoutError handling in _check_host_service_health (bug fix: was raising unhandled NameError)
- Added get_cached_services() / set_services_cache() for the background poll to write and endpoints to read

main.py changes:
- Added _poll_service_health() background task (started on app startup)
- Added _get_services() async helper for cache-or-live fallback
- Updated /services, /status, _build_api_status() to read from cache

routers/features.py:
- Updated /api/features to read cached services instead of live check

Tested on:
- Windows Docker Desktop (RTX 5090): 11 healthy, 0 degraded, <350ms
- Linux native Docker (Strix Halo): 18/18 healthy (no regression)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
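A minimal sketch of the architecture described above, under stated assumptions: the function names mirror the ones listed in the commit message, but the bodies are guesses, not the merged code. `get_all_services` is passed in as a parameter so the block is self-contained, and catching `asyncio.TimeoutError` inside the loop is an assumption that the PR does not state.

```python
import asyncio

POLL_INTERVAL = 10.0  # seconds, as stated in the commit message

# Module-level cache; stays None until the first poll cycle finishes.
_services_cache = None


def get_cached_services():
    """Return the latest snapshot, or None if no poll has completed yet."""
    return _services_cache


def set_services_cache(services):
    """Replace the snapshot; called only by the background poll."""
    global _services_cache
    _services_cache = services


async def poll_service_health(get_all_services, interval=POLL_INTERVAL):
    """Background task: refresh the cache until cancelled."""
    while True:
        try:
            set_services_cache(await get_all_services())
        except asyncio.TimeoutError:
            # Assumption: a slow cycle keeps the previous snapshot
            # instead of ending the task.
            pass
        await asyncio.sleep(interval)


async def get_services(get_all_services):
    """Cache-or-live fallback for the endpoints."""
    cached = get_cached_services()
    if cached is not None:
        return cached
    # Only reached before the first poll cycle has completed.
    return await get_all_services()
```

The task would be started once from the application's startup hook, for example with `asyncio.create_task(poll_service_health(get_all_services))`, and cancelled on shutdown.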
Summary
Dashboard health checks ran on every API request. On Docker Desktop (Windows/WSL2), DNS takes ~4s per non-running service — with 8+ disabled services, each request took 8-16 seconds, making the dashboard unusable.
Fix: Move health checks to a background polling loop. API endpoints return cached results instantly.
Changes
helpers.py (reverted to near-original, minimal additions):
- Added asyncio.TimeoutError handling in _check_host_service_health (was raising unhandled exception)
- Added get_cached_services() / set_services_cache() cache interface

main.py:
- Added _poll_service_health() background task started on app startup
- Added _get_services() async helper (cache-or-fallback)
- Updated /services, /status, _build_api_status() to read from cache

routers/features.py:
- Updated /api/features to read cached services

Test results
Behavior
🤖 Generated with Claude Code