fix: background polling for dashboard health checks (Windows + Linux) #519

Merged
Lightheartdevs merged 2 commits into main from fix/dashboard-health-check-windows
Mar 21, 2026
@Lightheartdevs Lightheartdevs commented Mar 21, 2026

Summary

Dashboard health checks ran on every API request. On Docker Desktop (Windows/WSL2), a DNS lookup for a non-running service takes ~4s to fail. With 8+ disabled services, each request took 8-16 seconds, making the dashboard unusable.

Fix: Move health checks to a background polling loop. API endpoints return cached results instantly.

BEFORE:  Browser → /api/status → 19 health checks → 8-16s response
AFTER:   Background poll (every 10s) → cache
         Browser → /api/status → read cache → <350ms response

Changes

helpers.py (reverted to near-original, minimal additions):

  • Restored original shared aiohttp session (removed all caching/semaphore/heuristic workarounds from first attempt)
  • Increased timeout from 5s → 30s (no user-visible impact, since checks now run only in the background poll)
  • Added asyncio.TimeoutError handling in _check_host_service_health (it previously raised an unhandled exception)
  • Added get_cached_services() / set_services_cache() cache interface
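
A minimal sketch of what the cache interface and the timeout handling described above might look like. Only the names get_cached_services() and set_services_cache() come from this PR. Their signatures, the check_with_timeout helper, the probe callable, and the status strings are assumptions made for illustration, not the actual helpers.py code.

```python
import asyncio
from typing import Awaitable, Callable, Optional

# Module-level cache; None means "no poll has completed yet".
_services_cache: Optional[list[dict]] = None


def set_services_cache(services: list[dict]) -> None:
    """Written by the background poll after each completed cycle."""
    global _services_cache
    _services_cache = services


def get_cached_services() -> Optional[list[dict]]:
    """Read by endpoints; returns None until the first poll finishes."""
    return _services_cache


async def check_with_timeout(
    name: str,
    probe: Callable[[], Awaitable[bool]],  # hypothetical per-service probe
    timeout: float = 30.0,
) -> dict:
    """Run one probe and map a timeout to a status instead of raising."""
    try:
        ok = await asyncio.wait_for(probe(), timeout=timeout)
    except asyncio.TimeoutError:
        # Assumed status value; the real code may report this differently.
        return {"name": name, "status": "down", "error": "timeout"}
    return {"name": name, "status": "healthy" if ok else "degraded"}
```

Returning None before the first poll lets callers tell "no data yet" apart from "zero services", which is what makes a cache-or-fallback read possible.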

main.py:

  • Added _poll_service_health() background task started on app startup
  • Added _get_services() async helper (cache-or-fallback)
  • Updated /services, /status, _build_api_status() to read from cache
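
The polling loop and the cache-or-fallback helper could be sketched roughly as follows. This is an illustration under assumptions, not the code in main.py: the function names drop the leading underscore, get_all_services is passed in as a parameter so the sketch stays self-contained, and the module-level _cache stands in for the helpers.py cache.

```python
import asyncio
import logging
from typing import Awaitable, Callable, Optional

logger = logging.getLogger(__name__)

POLL_INTERVAL_SECONDS = 10.0  # poll interval stated in this PR

_cache: Optional[list[dict]] = None  # stand-in for the helpers.py cache

ServiceLoader = Callable[[], Awaitable[list[dict]]]


async def poll_service_health(
    get_all_services: ServiceLoader,
    interval: float = POLL_INTERVAL_SECONDS,
) -> None:
    """Refresh the cache forever; a failed cycle keeps the last good data."""
    global _cache
    while True:
        try:
            _cache = await get_all_services()
        except Exception:
            # Cancellation is not an Exception subclass, so task.cancel()
            # still stops the loop; other errors are logged and retried.
            logger.exception("health poll failed; keeping previous results")
        await asyncio.sleep(interval)


async def get_services(get_all_services: ServiceLoader) -> list[dict]:
    """Cache-or-fallback: read the cache, or run one live check if empty."""
    if _cache is not None:
        return _cache
    return await get_all_services()
```

In an app's startup hook the loop would be launched with something like asyncio.create_task(poll_service_health(get_all_services)), keeping a reference to the task so it can be cancelled on shutdown.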

routers/features.py:

  • Updated /api/features to read cached services

Test results

| Platform | Healthy | Degraded | Response time |
|---|---|---|---|
| Windows Docker Desktop (RTX 5090) | 11/11 | 0 | 328ms |
| Linux native Docker (Strix Halo) | 18/18 | 0 | ~10ms |

Behavior

| Scenario | What happens |
|---|---|
| First 2 seconds after startup | No cache yet; falls back to live check |
| Normal operation | Background poll every 10s, API reads cache instantly |
| Service starts/stops | Detected within 10 seconds (next poll) |
| Poll fails | Logged, retried next cycle. Last good data retained |
| Multiple browser tabs | All read the same cache; zero extra load |

🤖 Generated with Claude Code

Lightheartdevs and others added 2 commits March 20, 2026 23:05
Root cause: Docker Desktop's embedded DNS takes ~4 seconds to return
NXDOMAIN for non-running containers. With 19 services checked
concurrently via asyncio.gather, the slow DNS lookups blocked running
services from being checked in time, causing everything to show as
"degraded" on the dashboard.

Fix (three-part):

1. Fresh session per poll cycle — eliminates stale connection pool
   issues. The global aiohttp session accumulated dead connections
   from non-running services, poisoning subsequent polls. Now each
   cycle creates a fresh session with force_close=True and
   use_dns_cache=False, then closes it.

2. Not-deployed cache with TTL — services that fail DNS get cached
   for 15 seconds. Subsequent polls skip them entirely, so the slow
   4-second DNS lookups only happen once per service.

3. Two-phase polling — Phase 1 returns cached not_deployed results
   instantly. Phase 2 checks remaining services with a semaphore
   (limit=4) to prevent DNS contention. Total timeout raised to 30s
   so the first poll (which has no cache) can complete even with
   slow DNS.

Net effect: first poll takes ~4-5 seconds (DNS for non-deployed
services), subsequent polls complete in <50ms. All running services
show healthy with 1-5ms response times. No behavior change on native
Linux Docker where DNS failures are instant.
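
For reference, parts 2 and 3 of this first attempt (the not-deployed TTL cache and two-phase polling behind a semaphore) might be sketched as below. The later commit in this PR removed these workarounds in favor of background polling, so this is only an illustration of the idea described in the commit message. The names poll_once, check, and _not_deployed_until are hypothetical, and check is assumed to return a plain status string.

```python
import asyncio
import time
from typing import Awaitable, Callable

NOT_DEPLOYED_TTL = 15.0    # seconds, from the commit message
MAX_CONCURRENT_CHECKS = 4  # semaphore limit, from the commit message

# Service name -> monotonic time at which its not_deployed entry expires.
_not_deployed_until: dict[str, float] = {}


async def poll_once(
    names: list[str],
    check: Callable[[str], Awaitable[str]],  # hypothetical status probe
) -> dict[str, str]:
    """Run one two-phase poll and return a status string per service."""
    now = time.monotonic()
    results: dict[str, str] = {}

    # Phase 1: answer from the not_deployed cache without touching DNS.
    pending = []
    for name in names:
        if _not_deployed_until.get(name, 0.0) > now:
            results[name] = "not_deployed"
        else:
            pending.append(name)

    # Phase 2: check the rest, at most MAX_CONCURRENT_CHECKS at a time.
    sem = asyncio.Semaphore(MAX_CONCURRENT_CHECKS)

    async def guarded(name: str) -> None:
        async with sem:
            status = await check(name)
        if status == "not_deployed":
            _not_deployed_until[name] = time.monotonic() + NOT_DEPLOYED_TTL
        results[name] = status

    await asyncio.gather(*(guarded(n) for n in pending))
    return results
```

With this shape, a service that fails DNS is probed once and then skipped until its TTL expires, so only the first poll pays the slow-lookup cost.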

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Replaces request-triggered health checks with a background polling
loop. API endpoints return cached results instantly (<1ms) instead
of running live checks on every request (8-16s on Docker Desktop).

Architecture:
- Background task polls get_all_services() every 10 seconds
- Results stored in module-level cache
- All endpoints read from cache, falling back to live check
  only on first request before the poll completes

helpers.py changes (reverted from previous PR, minimal diff):
- Restored original shared aiohttp session pattern
- Increased total timeout from 5s to 30s (no user impact since
  it only runs in the background poll)
- Added asyncio.TimeoutError handling in _check_host_service_health
  (bug fix: was raising unhandled NameError)
- Added get_cached_services() / set_services_cache() for the
  background poll to write and endpoints to read

main.py changes:
- Added _poll_service_health() background task (started on app startup)
- Added _get_services() async helper for cache-or-live fallback
- Updated /services, /status, _build_api_status() to read from cache

routers/features.py:
- Updated /api/features to read cached services instead of live check

Tested on:
- Windows Docker Desktop (RTX 5090): 11 healthy, 0 degraded, <350ms
- Linux native Docker (Strix Halo): 18/18 healthy (no regression)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Lightheartdevs Lightheartdevs changed the title from "fix: dashboard health checks on Docker Desktop (Windows/WSL2)" to "fix: background polling for dashboard health checks (Windows + Linux)" Mar 21, 2026
@Lightheartdevs Lightheartdevs merged commit 7478e2c into main Mar 21, 2026
16 of 21 checks passed