[WIP] Fix smoke-test startup and health check configuration #19

Draft
Copilot wants to merge 1 commit into main from copilot/fix-smoke-test-health-check

Copilot AI commented May 8, 2026

Thanks for asking me to work on this. I will get started on it and keep this PR's description up to date as I form a plan and make progress.

Original prompt

The CI job "windows-packaging-validation" is failing in the "Smoke-test startup and /health endpoint" step because the server started with node src/core/master.js doesn't respond to health checks on http://127.0.0.1:3000/health. The issue is that the server binds to HOST: "0.0.0.0" from the config, which may not properly include localhost in the Windows CI environment.

Fix this by modifying the smoke-test step in .github/workflows/ci.yml to:

  • After copying the config files, update configs/config.json to set HOST to "127.0.0.1" instead of "0.0.0.0".
  • Increase the initial sleep time from 10 to 15 seconds.
  • Add a check after starting the process to throw an error if the server process has already exited (indicating a crash).
  • Implement a retry loop for the health check: try up to 3 times, with 5-second sleeps between attempts.
  • Ensure the process is properly stopped in the finally block.

This should ensure the server binds correctly to localhost and is given enough time to start before checking health.
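
The config rewrite in the first bullet can be sketched in JavaScript, the language of the server under test. This is only an illustration of the transformation, not the workflow step itself: `HOST` is the key named above, while the helper name `withLoopbackHost`, the `PORT` key, and the sample text are assumptions made for the example.

```javascript
// Sketch: rewrite the HOST field of a JSON config so the server binds to loopback.
// `withLoopbackHost` is a hypothetical helper; only the HOST key comes from the prompt.
function withLoopbackHost(configText, host = "127.0.0.1") {
  const config = JSON.parse(configText); // throws if the copied example is not valid JSON
  config.HOST = host;                    // every other key is left untouched
  return JSON.stringify(config, null, 2);
}

// Assumed sample content; the real configs/config.json.example may differ.
const sampleConfig = '{ "HOST": "0.0.0.0", "PORT": 3000 }';
console.log(withLoopbackHost(sampleConfig));
```

Parsing and re-serializing, rather than doing a text substitution, keeps the edit valid even if the example file changes its whitespace or key order.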

The following is the prior conversation context from the user's chat exploration (may be truncated):

User: Please find a solution for this failing job. Use the logs, job definition, and any referenced files where the failure occurred. Keep your response focused on the solution and include code suggestions when appropriate.

User: fix issue for me

Assistant: The failing job is due to the server not responding to the health check at http://127.0.0.1:3000/health. The server starts but doesn't bind correctly, likely because HOST: "0.0.0.0" doesn't properly include localhost in the Windows CI environment. The /health endpoint exists in the code, but the binding issue prevents access.

To fix, modify the CI workflow to:

  • Set HOST to "127.0.0.1" in configs/config.json after copying.
  • Increase the sleep time to 15 seconds.
  • Add a check to fail if the server process exits prematurely.
  • Retry the health check up to 3 times with 5-second intervals.

Update the smoke-test step in .github/workflows/ci.yml:

- name: Smoke-test startup and /health endpoint
  shell: pwsh
  run: |
    Copy-Item "configs\config.json.example" "configs\config.json" -Force
    Copy-Item "configs\provider_pools.json.example" "configs\provider_pools.json" -Force
    
    # Set HOST to 127.0.0.1 for localhost binding
    $config = Get-Content "configs\config.json" -Raw | ConvertFrom-Json
    $config.HOST = "127.0.0.1"
    $config | ConvertTo-Json -Depth 10 | Set-Content "configs\config.json"
    
    $proc = Start-Process -FilePath "node" -ArgumentList "src/core/master.js" -PassThru -NoNewWindow
    
    try {
      Start-Sleep -Seconds 15
      
      # Check if process exited
      if ($proc.HasExited) {
        throw "Server process exited prematurely with code $($proc.ExitCode)"
      }
      
      # Retry health check
      $maxRetries = 3
      $retryCount = 0
      $healthy = $false
      
      while ($retryCount -lt $maxRetries -and -not $healthy) {
        try {
          $resp = Invoke-WebRequest -Uri "http://127.0.0.1:3000/health" -UseBasicParsing -TimeoutSec 10
          if ($resp.StatusCode -eq 200) {
            $healthy = $true
          }
        } catch {
          Write-Host "Health check failed (attempt $($retryCount + 1)): $($_.Exception.Message)"
          if ($retryCount -lt $maxRetries - 1) {
            Start-Sleep -Seconds 5
          }
        }
        $retryCount++
      }
      
      if (-not $healthy) {
        throw "Health check failed after $maxRetries attempts"
      }
    } finally {
      if ($proc -and -not $proc.HasExited) {
        Stop-Process -Id $proc.Id -Force
      }
    }
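
The control flow of the retry loop above can be checked in isolation with a small synchronous model. This is a sketch under stated assumptions: `checkWithRetries`, `probe`, and `sleep` are hypothetical names, the probe is assumed to return an HTTP status code or throw, and the model pauses after any unsuccessful attempt, whereas the workflow step only pauses inside its `catch` block.

```javascript
// Sketch: synchronous model of "try up to maxRetries times, pausing between attempts".
// `probe` and `sleep` are injected callbacks so the logic can run without a real server.
function checkWithRetries(probe, sleep, maxRetries = 3, intervalSeconds = 5) {
  let attempts = 0;
  while (attempts < maxRetries) {
    attempts += 1;
    let ok = false;
    try {
      ok = probe() === 200; // a thrown error (e.g. connection refused) counts as a failure
    } catch (err) {
      ok = false;
    }
    if (ok) {
      return { healthy: true, attempts };
    }
    if (attempts < maxRetries) {
      sleep(intervalSeconds); // no pause after the final attempt
    }
  }
  return { healthy: false, attempts };
}

// Usage: two failing responses followed by a healthy one.
const statuses = [503, 503, 200];
const pauses = [];
console.log(checkWithRetries(() => statuses.shift(), (s) => pauses.push(s)), pauses);
```

Injecting the probe and the sleep function makes it easy to confirm that the loop stops on the first success and never sleeps after the last attempt.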

User: do it for me

Assistant: The pull request has been created successfully!

Pull Request Details:

  • Title: Fix failing Windows CI smoke-test by improving server startup and health check
  • URL: feat: Wave 1 observability, security regression suite, and governance foundation #1
  • Changes Made:
    • Updated the smoke-test step in .github/workflows/ci.yml for the windows-packaging-validation job
    • Added code to modify configs/config.json after copying, setting HOST to "127.0.0.1" for proper localhost binding
    • Increased initial sleep from 10 to 15 seconds
    • Added a check to detect if the server process exits prematurely (indicating a crash)
    • Implemented a retry mechanism for the health check (up to 3 attempts with 5-second intervals)
    • Ensured proper process cleanup in the finally block

The changes target the suspected cause: the server not being reachable on localhost in the Windows CI environment. Explicitly setting the host to 127.0.0.1, allowing more startup time, adding crash detection, and retrying the health check should make the CI job pass more consistently. You can merge this PR to apply the fix.
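
For a sense of how long the step can run before it fails, the timings can be added up. The sketch below assumes every request runs to its full timeout; the 15-second initial sleep, 3 attempts, and 5-second interval come from the change list, the 10-second request timeout from the `-TimeoutSec 10` in the proposed step, and the function name `worstCaseSeconds` is hypothetical.

```javascript
// Sketch: upper bound on the smoke test's wait time when every health check times out.
function worstCaseSeconds({ initialSleep, maxRetries, requestTimeout, retryInterval }) {
  // One timeout per attempt, and one pause between each pair of consecutive attempts.
  return initialSleep + maxRetries * requestTimeout + (maxRetries - 1) * retryInterval;
}

const budget = worstCaseSeconds({
  initialSleep: 15,   // Start-Sleep before the first check
  maxRetries: 3,      // health check attempts
  requestTimeout: 10, // per-request timeout
  retryInterval: 5,   // pause between attempts
});
console.log(budget); // 15 + 3*10 + 2*5 = 55
```

Under these assumptions a failing run gives up after roughly 55 seconds, which is worth comparing against any timeout configured for the job.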

This pull request was created from Copilot chat.
