Classify Copilot auth-null startup failures and add bounded scheduled retry#36663
Classify Copilot auth-null startup failures and add bounded scheduled retry#36663Copilot wants to merge 4 commits into
Conversation
Co-authored-by: gh-aw-bot <259018956+gh-aw-bot@users.noreply.github.com>
Co-authored-by: gh-aw-bot <259018956+gh-aw-bot@users.noreply.github.com>
Co-authored-by: gh-aw-bot <259018956+gh-aw-bot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
This PR improves observability and classification of Copilot “auth-null” startup failures (notably in scheduled runs), and adds a bounded scheduled retry path intended to absorb transient auth provisioning races.
Changes:
- Add mask-safe per-attempt auth preflight logging that reports presence/absence (not values) of
COPILOT_GITHUB_TOKEN,GH_TOKEN, andGITHUB_TOKEN. - Add a bounded “scheduled auth-null” retry (max 1) when Copilot reports “No authentication information found” on a fresh attempt and all auth env vars are absent, plus emit a
report_incompletesignal when it remains terminal. - Update harness tests for auth-env presence helpers and the scheduled auth-null retry policy; rename workflow step labels to clarify the Copilot SDK install is Node.js-based.
Show a summary per file
| File | Description |
|---|---|
| actions/setup/js/copilot_harness.cjs | Adds auth env presence snapshot logging and scheduled auth-null retry + classified report_incomplete emission. |
| actions/setup/js/copilot_harness.test.cjs | Adds unit coverage for env presence helpers and scheduled auth-null retry logic; refactors retry test helper signature. |
| .github/workflows/smoke-copilot-sdk.lock.yml | Renames Copilot SDK install step label to “(Node.js)”. |
| .github/workflows/q.lock.yml | Renames Copilot SDK install step label to “(Node.js)”. |
| .github/workflows/python-data-charts.lock.yml | Renames Copilot SDK install step label to “(Node.js)”. |
| .github/workflows/pr-triage-agent.lock.yml | Renames Copilot SDK install step label to “(Node.js)”. |
| .github/workflows/pr-nitpick-reviewer.lock.yml | Renames Copilot SDK install step label to “(Node.js)”. |
| .github/workflows/pr-code-quality-reviewer.lock.yml | Renames Copilot SDK install step label to “(Node.js)”. |
| .github/workflows/plan.lock.yml | Renames Copilot SDK install step label to “(Node.js)”. |
| .github/workflows/pdf-summary.lock.yml | Renames Copilot SDK install step label to “(Node.js)”. |
| .github/workflows/org-health-report.lock.yml | Renames Copilot SDK install step label to “(Node.js)”. |
| .github/workflows/metrics-collector.lock.yml | Renames Copilot SDK install step label to “(Node.js)”. |
| .github/workflows/mergefest.lock.yml | Renames Copilot SDK install step label to “(Node.js)”. |
| .github/workflows/mcp-inspector.lock.yml | Renames Copilot SDK install step label to “(Node.js)”. |
| .github/workflows/linter-miner.lock.yml | Renames Copilot SDK install step label to “(Node.js)”. |
| .github/workflows/layout-spec-maintainer.lock.yml | Renames Copilot SDK install step label to “(Node.js)”. |
| .github/workflows/jsweep.lock.yml | Renames Copilot SDK install step label to “(Node.js)”. |
| .github/workflows/firewall.lock.yml | Renames Copilot SDK install step label to “(Node.js)”. |
| .github/workflows/firewall-escape.lock.yml | Renames Copilot SDK install step label to “(Node.js)”. |
| .github/workflows/draft-pr-cleanup.lock.yml | Renames Copilot SDK install step label to “(Node.js)”. |
| .github/workflows/docs-noob-tester.lock.yml | Renames Copilot SDK install step label to “(Node.js)”. |
| .github/workflows/discussion-task-miner.lock.yml | Renames Copilot SDK install step label to “(Node.js)”. |
| .github/workflows/dictation-prompt.lock.yml | Renames Copilot SDK install step label to “(Node.js)”. |
| .github/workflows/dev-hawk.lock.yml | Renames Copilot SDK install step label to “(Node.js)”. |
| .github/workflows/deployment-incident-monitor.lock.yml | Renames Copilot SDK install step label to “(Node.js)”. |
| .github/workflows/delight.lock.yml | Renames Copilot SDK install step label to “(Node.js)”. |
| .github/workflows/dead-code-remover.lock.yml | Renames Copilot SDK install step label to “(Node.js)”. |
| .github/workflows/daily-workflow-updater.lock.yml | Renames Copilot SDK install step label to “(Node.js)”. |
| .github/workflows/daily-testify-uber-super-expert.lock.yml | Renames Copilot SDK install step label to “(Node.js)”. |
| .github/workflows/daily-syntax-error-quality.lock.yml | Renames Copilot SDK install step label to “(Node.js)”. |
| .github/workflows/daily-spdd-spec-planner.lock.yml | Renames Copilot SDK install step label to “(Node.js)”. |
| .github/workflows/daily-skill-optimizer.lock.yml | Renames Copilot SDK install step label to “(Node.js)”. |
| .github/workflows/daily-sentrux-report.lock.yml | Renames Copilot SDK install step label to “(Node.js)”. |
| .github/workflows/daily-security-observability.lock.yml | Renames Copilot SDK install step label to “(Node.js)”. |
| .github/workflows/daily-secrets-analysis.lock.yml | Renames Copilot SDK install step label to “(Node.js)”. |
| .github/workflows/daily-safe-output-integrator.lock.yml | Renames Copilot SDK install step label to “(Node.js)”. |
| .github/workflows/daily-repo-chronicle.lock.yml | Renames Copilot SDK install step label to “(Node.js)”. |
| .github/workflows/daily-performance-summary.lock.yml | Renames Copilot SDK install step label to “(Node.js)”. |
| .github/workflows/daily-model-inventory.lock.yml | Renames Copilot SDK install step label to “(Node.js)”. |
| .github/workflows/daily-mcp-concurrency-analysis.lock.yml | Renames Copilot SDK install step label to “(Node.js)”. |
| .github/workflows/daily-malicious-code-scan.lock.yml | Renames Copilot SDK install step label to “(Node.js)”. |
| .github/workflows/daily-issues-report.lock.yml | Renames Copilot SDK install step label to “(Node.js)”. |
| .github/workflows/daily-geo-optimizer.lock.yml | Renames Copilot SDK install step label to “(Node.js)”. |
| .github/workflows/daily-experiment-report.lock.yml | Renames Copilot SDK install step label to “(Node.js)”. |
| .github/workflows/daily-compiler-threat-spec-optimizer.lock.yml | Renames Copilot SDK install step label to “(Node.js)”. |
| .github/workflows/daily-compiler-quality.lock.yml | Renames Copilot SDK install step label to “(Node.js)”. |
| .github/workflows/daily-cli-performance.lock.yml | Renames Copilot SDK install step label to “(Node.js)”. |
| .github/workflows/daily-assign-issue-to-user.lock.yml | Renames Copilot SDK install step label to “(Node.js)”. |
| .github/workflows/daily-architecture-diagram.lock.yml | Renames Copilot SDK install step label to “(Node.js)”. |
| .github/workflows/daily-agent-of-the-day-blog-writer.lock.yml | Renames Copilot SDK install step label to “(Node.js)”. |
| .github/workflows/craft.lock.yml | Renames Copilot SDK install step label to “(Node.js)”. |
| .github/workflows/copilot-pr-prompt-analysis.lock.yml | Renames Copilot SDK install step label to “(Node.js)”. |
| .github/workflows/copilot-pr-nlp-analysis.lock.yml | Renames Copilot SDK install step label to “(Node.js)”. |
| .github/workflows/copilot-pr-merged-report.lock.yml | Renames Copilot SDK install step label to “(Node.js)”. |
| .github/workflows/copilot-opt.lock.yml | Renames Copilot SDK install step label to “(Node.js)”. |
| .github/workflows/copilot-cli-deep-research.lock.yml | Renames Copilot SDK install step label to “(Node.js)”. |
| .github/workflows/code-scanning-fixer.lock.yml | Renames Copilot SDK install step label to “(Node.js)”. |
| .github/workflows/cli-consistency-checker.lock.yml | Renames Copilot SDK install step label to “(Node.js)”. |
| .github/workflows/ci-coach.lock.yml | Renames Copilot SDK install step label to “(Node.js)”. |
| .github/workflows/breaking-change-checker.lock.yml | Renames Copilot SDK install step label to “(Node.js)”. |
| .github/workflows/brave.lock.yml | Renames Copilot SDK install step label to “(Node.js)”. |
| .github/workflows/artifacts-summary.lock.yml | Renames Copilot SDK install step label to “(Node.js)”. |
| .github/workflows/architecture-guardian.lock.yml | Renames Copilot SDK install step label to “(Node.js)”. |
| .github/workflows/agent-performance-analyzer.lock.yml | Renames Copilot SDK install step label to “(Node.js)”. |
Copilot's findings
Tip
Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Files reviewed: 64/64 changed files
- Comments generated: 1
| scheduledAuthNullRetries += 1; | ||
| useContinueOnRetry = false; | ||
| continueDisabledPermanently = true; | ||
| log(`attempt ${attempt + 1}: no authentication info on fresh run with all auth env vars absent — retrying once after delay (authNullRetry=${scheduledAuthNullRetries}/${MAX_SCHEDULED_AUTH_NULL_RETRIES})`); | ||
| continue; |
|
🧠 Matt Pocock Skills Reviewer has completed the skills-based review. ✅ |
|
✅ Design Decision Gate 🏗️ completed the design decision gate check. No ADR enforcement needed: PR #36663 does not have the 'implementation' label (has_implementation_label=false) and has 0 new lines of code in business logic directories (default_business_additions=0, threshold=100). No custom design-gate config is present. Neither enforcement condition is met. |
|
🧪 Test Quality Sentinel completed test quality analysis. |
|
|
🧪 Test Quality Sentinel Report✅ Test Quality Score: 97/100 — Excellent
📊 Metrics & Test Classification (12 tests analyzed)
Test Classification Details
Language SupportTests analyzed:
Verdict
📖 Understanding Test ClassificationsDesign Tests (High Value) verify what the system does:
Implementation Tests (Low Value) verify how the system does it:
Goal: Shift toward tests that describe the system's behavioral contract — the promises it makes to its users and collaborators. References: §26898522910
|
There was a problem hiding this comment.
🧠 Reviewed using Matt Pocock's skills by Matt Pocock Skills Reviewer · sonnet46 1.4M
| * @returns {boolean} | ||
| */ | ||
| function shouldRetry(result, attempt, useContinueOnRetry = false) { | ||
| function shouldRetry(result, attempt, options = {}) { |
There was a problem hiding this comment.
[/tdd] The shouldRetry helper here is a hand-rolled reimplementation of the production retry logic, not a call to any exported harness function. As the production decision tree grows (this PR added a new branch), this copy must be kept in sync manually — a silent drift risk.
💡 Suggestion
Consider extracting the retry-decision logic from main() into a pure, exported function (e.g. shouldRetryAfterAuthNullError(opts)) so tests can call the real code rather than a replica. This is the /tdd pattern: production logic is the source of truth, tests exercise it directly.
Alternatively, at minimum add a comment linking this helper to the specific harness lines it mirrors (e.g. // mirrors copilot_harness.cjs lines 960-987) so reviewers know where to look when the harness changes.
| log(`attempt ${attempt + 1}: no authentication info on fresh run with all auth env vars absent — retrying once after delay (authNullRetry=${scheduledAuthNullRetries}/${MAX_SCHEDULED_AUTH_NULL_RETRIES})`); | ||
| continue; | ||
| } | ||
| if (!authEnvPresence.anyTokenPresent) { |
There was a problem hiding this comment.
[/diagnose] emitInfrastructureIncomplete is now called when all auth env vars are absent after retries are exhausted — this is the key new diagnostic signal. But there is no test that verifies this side effect is triggered after the MAX_SCHEDULED_AUTH_NULL_RETRIES budget is spent.
💡 What to add
A test (even inline in the existing describe block) should verify that after the bounded retry fires and the follow-up attempt also fails with no tokens, emitInfrastructureIncomplete is eventually called. The current tests only assert shouldRetry → false (line 1258), not the escalation that follows.
This matters because the whole motivation for the PR is surfacing auth-null as a first-class failure — the test suite should prove that guarantee holds end-to-end.
| */ | ||
| function shouldRetry(result, attempt, useContinueOnRetry = false) { | ||
| function shouldRetry(result, attempt, options = {}) { | ||
| const { useContinueOnRetry = false, isScheduledRun = false, anyTokenPresent = true, scheduledAuthNullRetries = 0 } = options; |
There was a problem hiding this comment.
[/tdd] The anyTokenPresent default is true in the test helper's destructuring, which is the opposite of the failure scenario being tested. Any test that omits anyTokenPresent is silently testing the "tokens present" path, making it easy to write an incomplete test that passes.
💡 Suggestion
Consider defaulting to false (matching the failure case under study), or make it a required field without a default so callers are forced to be explicit. At minimum, add a comment explaining the chosen default.
| if (isScheduledRun && !authEnvPresence.anyTokenPresent && scheduledAuthNullRetries < MAX_SCHEDULED_AUTH_NULL_RETRIES && attempt < MAX_RETRIES) { | ||
| scheduledAuthNullRetries += 1; | ||
| useContinueOnRetry = false; | ||
| continueDisabledPermanently = true; |
There was a problem hiding this comment.
[/diagnose] continueDisabledPermanently = true is set here to prevent --continue from being re-armed after the auth-null retry. This is correct, but there is no test verifying this side effect — if the flag were accidentally omitted, a subsequent non-auth failure on the same run could re-enable --continue and replay a corrupt conversation.
💡 What to add
The existing null-type tool_call restarts fresh instead of --continue describe block (around line 1302) shows how to test continueDisabledPermanently state transitions. A similar micro-test for the auth-null retry path would document the invariant.
Scheduled
copilotengine runs started failing with a new signature: Copilot exits in ~1s with0 turns / 0 tokens,agent_output.jsonis empty, and logs showisAuthError=true(No authentication information found). The failure was being treated as a generic exit and not surfaced as a first-class auth provisioning failure.Auth preflight visibility (mask-safe)
copilot_harness.cjsthat reports presence/absence (never values) of:COPILOT_GITHUB_TOKENGH_TOKENGITHUB_TOKENBounded recovery path for scheduled auth-null races
MAX_SCHEDULED_AUTH_NULL_RETRIES = 1.Explicit classified failure signal
report_incompleteinfrastructure signal withcopilot_auth_token_missingdetail so this class is visible/classifiable instead of a genericexitCode=1.Focused harness test updates