Executive summary
New P0 today: PR #31363 (merged 2026-05-11 ~05:35 UTC) introduced .github/workflows/aw-portfolio-yield.md, whose imported shared/otel-observability.md carries a placeholder npm package and an empty OTel endpoint. Two distinct downstream failures result:
Agentic Maintenance: compile-workflows now fails on every scheduled run with npx package '@your-org/otel-query-mcp' not found on npm registry. This took the maintenance workflow from green (last success §25653034240 at 05:58 UTC) to two consecutive failures.
Agentic Workflow Portfolio Yield itself can't start: MCP Gateway v0.3.6 rejects the config with gateway/opentelemetry/endpoint length must be ≥ 1 (the OTLP_ENDPOINT secret is empty/unset in this repo).
One P1 recurrence — Step Name Alignment hit max_turns=30 again with the same /tmp/gh-aw/cache-memory/* Bash denial signature as #31178.
One P2 mixed — Smoke Claude PR run saw safe_outputs fail on resolve_pull_request_review_thread (Resource not accessible by integration) and also reported missing_tool: playwright-cli.
~50 agentic runs observed, 4 distinct agentic-workflow failures (excluding push-triggered CI noise on copilot agent branches).
Evidence
Cluster 1 — Agentic Maintenance compile failure (aw-portfolio-yield npm package)
audit of §25657460521: 4 of 11 jobs ran; compile-workflows failed at the Compile workflows step:
✓ Successfully compiled 218 out of 219 workflow files
✗ Compiled 219 workflow(s): 1 error(s), 52 warning(s)
✗ Failed workflows:
✗ aw-portfolio-yield.md
.github/workflows/aw-portfolio-yield.md:1:1: error: runtime package validation failed:
Validation failed for field 'runtime.packages'
Value: 1 package validation errors
Reason: runtime package validation failed
Suggestion: Fix the following package issues:
Validation failed for field 'npx.packages'
Value: 1 packages not found
Reason: npx packages not found on npm registry
Suggestion: Fix package names or verify they exist on npm:
npx package '`@your-org/otel-query-mcp`' not found on npm registry: npm error code E404
##[error]Process completed with exit code 1.
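The E404 above only surfaces once the registry lookup runs. A minimal sketch of an offline lint that could flag template package names earlier; the function name `find_placeholder_packages` and the list of placeholder scopes are hypothetical and not part of gh-aw:

```python
import re

# Hypothetical scopes that tend to appear in copy-pasted templates (assumption).
PLACEHOLDER_SCOPES = {"your-org", "your-scope", "example", "my-org"}

# Scoped npm package of the form @scope/name.
SCOPED_PACKAGE = re.compile(
    r"^@(?P<scope>[a-z0-9][a-z0-9._-]*)/(?P<name>[a-z0-9][a-z0-9._-]*)$"
)


def find_placeholder_packages(packages):
    """Return the package names whose scope looks like a template placeholder."""
    flagged = []
    for package in packages:
        match = SCOPED_PACKAGE.match(package)
        if match and match.group("scope") in PLACEHOLDER_SCOPES:
            flagged.append(package)
    return flagged


if __name__ == "__main__":
    print(find_placeholder_packages(["@your-org/otel-query-mcp", "typescript"]))
```

A check like this would not replace the registry validation; it would only shorten the feedback loop for the most obvious template leftovers.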
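The placeholder package name comes from .github/workflows/shared/otel-observability.md.
audit-diff was taken against the most recent successful Agentic Maintenance run, §25653034240 (2026-05-11 05:58 UTC, 5 minutes before #31363 was pushed).
Cluster 2 — Agentic Workflow Portfolio Yield MCP Gateway startup failure (empty OTel endpoint)
audit of §25654663141: activation succeeded (23s), agent job failed at MCP Gateway init (40s):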
[error] ERROR: Gateway process (PID: 4114) exited during initialization
config:validation_schema Schema validation failed:
jsonschema: '/gateway/opentelemetry/endpoint' does not validate with
https://docs.github.com/gh-aw/schemas/mcp-gateway-config.schema.json#/.../endpoint/minLength:
length must be >= 1, but got 0
failed to load config: Configuration validation error (MCP Gateway version: v0.3.6):
Error: length must be >= 1, but got 0
Error: does not match pattern '^((redacted)+|\${[A-Za-z_][A-Za-z0-9_]*})$'
##[error]Process completed with exit code 1.
The ${{ secrets.OTLP_ENDPOINT }} interpolation in shared/otel-observability.md resolves to an empty string when the secret is unset — and the MCP gateway schema enforces minLength: 1 on gateway/opentelemetry/endpoint, so the gateway refuses to start.
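One way to avoid this failure mode is to omit the OpenTelemetry block whenever the value is empty. A minimal sketch, assuming a hypothetical config builder; `build_gateway_config` and the dictionary layout are illustrative and do not reflect the gateway's actual code:

```python
def build_gateway_config(otlp_endpoint):
    """Build a gateway config dict, omitting the OpenTelemetry block when unset.

    An unset secret interpolates to "", which a minLength: 1 rule rejects,
    so the block is only emitted for a non-empty endpoint.
    """
    config = {"gateway": {}}
    endpoint = (otlp_endpoint or "").strip()
    if endpoint:
        config["gateway"]["opentelemetry"] = {"endpoint": endpoint}
    return config


if __name__ == "__main__":
    print(build_gateway_config(""))
    print(build_gateway_config("https://otel.example.com:4318"))
```

The alternative, relaxing the schema so that an empty endpoint means "disabled", moves the decision into the gateway rather than the workflow compiler.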
Cluster 3 — Step Name Alignment max-turns (recurrence of #31178)
audit of §25651479635: 31 turns / 1.8M tokens / $1.34 / 3.5m, terminated error_max_turns. Permission denial loop from agent-stdio.log:
ls in '/tmp/gh-aw/cache-memory' was blocked. For security, Claude Code may only list
files in the allowed working directories for this session: '/home/runner/work/gh-aw/gh-aw'.
find in '/tmp/gh-aw/cache-memory' was blocked. ...
cat in '/tmp/gh-aw/cache-memory/step-name-alignment.json' was blocked. ...
mkdir in '/tmp/gh-aw/agent' was blocked. ...
The workflow's --allowed-tools list includes Bash(ls), Bash(cat), Bash(cat /tmp/gh-aw/cache-memory/), Bash(mkdir -p /tmp/gh-aw/cache-memory/), etc., but Claude Code's working-directory restriction (/home/runner/work/gh-aw/gh-aw only) overrides those prefix allows for paths under /tmp. Same exact signature as the 3 prior occurrences tracked in #31178.
audit-diff vs baseline §25620382907 (success on 2026-05-10): turns went from 0 → 31, classification changed, reason_code turns_increase.
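The interaction described above can be modelled as two independent checks that must both pass. This is an illustrative model only, not Claude Code's actual permission logic; `is_command_permitted`, the prefix matching, and the path containment rule are all assumptions:

```python
from pathlib import PurePosixPath

# Assumed values, taken from the denial messages and allow-list quoted above.
ALLOWED_WORKDIRS = [PurePosixPath("/home/runner/work/gh-aw/gh-aw")]
ALLOWED_COMMAND_PREFIXES = ["ls", "cat", "mkdir -p /tmp/gh-aw/cache-memory/"]


def path_in_workdirs(path):
    """True when path is an allowed working directory or lies below one."""
    target = PurePosixPath(path)
    return any(target == root or root in target.parents for root in ALLOWED_WORKDIRS)


def is_command_permitted(command, target_path):
    """Both checks must pass: an allow-listed prefix AND a path inside a workdir."""
    prefix_ok = any(command.startswith(prefix) for prefix in ALLOWED_COMMAND_PREFIXES)
    return prefix_ok and path_in_workdirs(target_path)


if __name__ == "__main__":
    cache_file = "/tmp/gh-aw/cache-memory/step-name-alignment.json"
    print(is_command_permitted("cat " + cache_file, cache_file))
```

Under this model an allow-listed `cat` on a path under /tmp is still denied, which matches the observed loop and is why both proposed remedies touch the path rule rather than the command list.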
Cluster 4 — Smoke Claude safe_outputs failure
audit of §25649467832: agent succeeded (7.3m), safe_outputs job failed (27s). Two distinct errors in 3_safe_outputs.txt:
##[error]Failed to resolve review thread: Request failed due to following response errors:
- Resource not accessible by integration
##[error]✗ Message 7 (resolve_pull_request_review_thread) failed:
- Resource not accessible by integration
✓ Recorded missing tool: playwright-cli
Reason: playwright-cli is not mounted on PATH and not present in
/home/runner/work/_temp/gh-aw/mcp-cli/manifest.json;
cannot run browser_navigate/browser_snapshot
Alternatives: Add playwright MCP server to the workflow's mcp-cli mounts or remove the test step
##[error]1 safe output(s) failed
This was triggered from PR #31398. The GH_AW token used by safe_outputs lacks GraphQL resolveReviewThread permission on the PR. missing_tool is reported as a failure because the workflow sets missing-tool-report-as-failure: true.
Excluded: daily-fact.lock.yml push-triggered CI noise
The runs list shows ~25 failures on the workflow registered as .github/workflows/daily-fact.lock.yml (workflow id 210263564). All failures are event=push on copilot/investigate-failing-agent-step and copilot/fix-agent-job-failure branches — these are noise from Copilot agents iterating in their working branches. The workflow's on: only has schedule + workflow_dispatch, so these runs do not represent operational failures of the deployed workflow on main. Not tracked.
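The exclusion rule amounts to a filter on event type and branch. A minimal sketch, assuming run records shaped like dictionaries with `event`, `head_branch`, and `conclusion` keys; these field names are illustrative, since the actual runs list format is not shown in this report:

```python
def operational_failures(runs):
    """Keep failed runs that reflect the deployed workflow, dropping branch noise."""
    kept = []
    for run in runs:
        if run["conclusion"] != "failure":
            continue
        # Push events on copilot/* working branches are treated as noise.
        is_push_noise = run["event"] == "push" and run["head_branch"].startswith("copilot/")
        if not is_push_noise:
            kept.append(run)
    return kept


if __name__ == "__main__":
    sample = [
        {"event": "push", "head_branch": "copilot/fix-agent-job-failure", "conclusion": "failure"},
        {"event": "schedule", "head_branch": "main", "conclusion": "failure"},
        {"event": "schedule", "head_branch": "main", "conclusion": "success"},
    ]
    print(len(operational_failures(sample)))
```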
Existing issue correlation
[Correlation table not recoverable: issue identifiers lost; surviving status cells follow.]
success; same "no safe outputs" false-positive class as #31309
success; same false-positive class as #31287
success; false-positive class
success; false-positive class (outside 6h window)
Proposed fix roadmap
P0 (new, this window)
Fix shared/otel-observability.md to remove the placeholder npm package and tolerate an unset OTel endpoint. Replace @your-org/otel-query-mcp with the real MCP server name (or remove the otel MCP server entirely until the package is published), and either gate the gateway.opentelemetry.endpoint config block behind OTLP_ENDPOINT being non-empty or change the schema so that an empty endpoint means disabled. (Success criteria: gh aw compile .github/workflows/aw-portfolio-yield.md returns 0 errors; a fresh Agentic Workflow Portfolio Yield workflow_dispatch reaches the agent step without MCP Gateway startup failure.)
P1 (recurrence)
Step Name Alignment (#31178): /tmp/gh-aw/cache-memory/* denials persist despite an explicit allow-list. Either (a) lift the Claude Code working-directory restriction for paths in the --allowed-tools allow-list, or (b) move the workflow's cache-memory I/O entirely under /home/runner/work/gh-aw/gh-aw/... so the workdir restriction doesn't apply.
P2
Smoke Claude ([aw] Smoke Claude failed #31410): drop resolve_pull_request_review_thread from the smoke test, or grant the GH_AW safe_outputs token pull-requests:write so the GraphQL resolveReviewThread mutation is accessible. Mount the playwright MCP server (or drop the browser_navigate/browser_snapshot test step) to clear the missing_tool warning that is being treated as a failure.
Sub-issues linked
References
Related to #30961