Executive Summary
Overall health for the last 24h is good. Telemetry is flowing: the spans dataset has ~18.4k spans and gh-aw conclusion spans are well-attributed (gh-aw.workflow.name, gh-aw.run.status, gh-aw.engine.id all populated). Of 1,889 run-conclusion spans, 52 (≈2.8%) carried gh-aw.run.status:failure, corresponding to ~13 distinct failed runs (13/1,889 ≈ 0.7% when failed runs are counted against conclusion spans). Failures are mostly one-off and spread across many workflows; the only recurring patterns are PR Sous Chef (3 failed runs) and PR Code Quality Reviewer (2 failed runs), both on the copilot engine. Trace continuity was verified intact on a representative PR Sous Chef failure.
No timeouts, cancellations, or OTLP-export failures were confirmed. However, several reliability-relevant fields are not consumable from Sentry: gen_ai.response.finish_reasons returns zero results over 7d (blocking truncation / runaway-token detection), span.status is null on every span, release is null, and the errors and logs datasets are empty — so failure root-causes cannot be corroborated from log/error telemetry. These are observability gaps, not confirmed runtime failures.
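The percentages quoted in this review can be re-derived from the raw counts. The snippet below is only an arithmetic sketch: the counts are copied from the summary, `pct` is a hypothetical helper, and nothing here queries Sentry.

```javascript
// Counts copied from the summary; "~13" failed runs is approximate in the report.
const conclusionSpans = 1889;
const failureSpans = 52;
const failedRuns = 13;

// Hypothetical helper: percentage rounded to one decimal place.
function pct(part, whole) {
  return Math.round((part / whole) * 1000) / 10;
}

console.log(pct(failureSpans, conclusionSpans)); // conclusion spans with failure status
console.log(pct(failedRuns, conclusionSpans));   // failed runs counted against conclusion spans
console.log(pct(3, 17)); // PR Sous Chef: 3 failed of ~17 runs
console.log(pct(2, 19)); // PR Code Quality Reviewer: 2 failed of 19 runs
```

Keeping the span-level and run-level rates separate avoids mixing units when comparing against a later baseline.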
Top Reliability Findings
| Priority | Workflow | Problem | Evidence | Next Action |
| --- | --- | --- | --- | --- |
| P1 | PR Sous Chef | Recurring run failures at agent phase (copilot) | 3 failed runs/24h (~18% of ~17 runs); spans gh-aw.agent.conclusion + detection/safe_outputs/conclusion = failure; runs 26834737280, 26818750469, 26794554270; trace 1c17b6dd5b51b810e3ee7f84be03ab3a | Open the 3 run logs; the agent phase ran ~311 s before failing, so check for agent error vs. safe-output rejection |
| P2 | PR Code Quality Reviewer | Recurring run failures at agent phase (copilot) | 2 failed runs/24h (~10% of 19 runs); runs 26849261658, 26837142276; trace 5d9db5c5976cad556efcf783b0b57c80 | Compare the 2 failed runs against passing runs of the same workflow |
| P3 | (instrumentation) | gen_ai.response.finish_reasons not queryable in Sentry → truncation / runaway-token blind spot | has:gen_ai.response.finish_reasons → 0 results over 7d, despite emit-side always emitting it (send_otlp_span.cjs:2011, array attr via buildArrayAttr) | Verify Sentry indexes array-valued span attrs, or emit a scalar mirror (e.g. gh-aw.finish_reason) |
| P4 | (instrumentation) | span.status null on all spans; release null → no OTLP-status filtering, no regression correlation | All spans span.status:null; release:null. Emit-side sets OTLP status.code=2 on failures (send_otlp_span.cjs:1908/1944) and service.version resource attr (:322) | Confirm OTLP status.code → Sentry span.status and service.version → release mapping in the Sentry ingest config |
| P5 | (observability) | errors and logs datasets empty → failure root-cause can't be corroborated from logs/errors | errors and logs queries return 0 results / 24h (and unfiltered) | Decide whether gh-aw should ship error/log telemetry, or document that spans are the sole signal |
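For P3, one possible shape of the scalar mirror is sketched below. `buildFinishReasonMirror` is a hypothetical helper, not something this review found in send_otlp_span.cjs, and the signature of the existing `buildArrayAttr` is not assumed here. The attribute objects follow the OTLP/JSON key/value encoding, and the key `gh-aw.finish_reason` is the example name suggested in the table.

```javascript
// Hypothetical helper (an assumption, not existing gh-aw code).
// Mirrors the array-valued gen_ai.response.finish_reasons as one
// comma-joined OTLP/JSON string attribute that scalar-only indexes can query.
function buildFinishReasonMirror(finishReasons, key = "gh-aw.finish_reason") {
  const reasons = (Array.isArray(finishReasons) ? finishReasons : []).filter(
    (r) => typeof r === "string" && r.length > 0
  );
  if (reasons.length === 0) return null; // nothing to mirror
  return { key, value: { stringValue: reasons.join(",") } };
}

// The array attribute stays in place; the mirror is added next to it.
const attrs = [
  {
    key: "gen_ai.response.finish_reasons",
    value: { arrayValue: { values: [{ stringValue: "length" }] } },
  },
];
const mirror = buildFinishReasonMirror(["length"]);
if (mirror) attrs.push(mirror);
console.log(JSON.stringify(attrs, null, 2));
```

Joining with a comma keeps multi-choice responses searchable by substring, at the cost of exact-match queries on a single reason; whether that trade-off suits the Sentry query syntax would need to be checked.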
Representative Traces
PR Sous Chef failure (P1) — trace 1c17b6dd5b51b810e3ee7f84be03ab3a
- Continuity intact: gh-aw.activation.conclusion = success, then gh-aw.agent.conclusion = failure (span.duration ≈ 311,698 ms / ~5.2 min), gh-aw.detection.conclusion = failure (~49.9 s), gh-aw.safe_outputs.conclusion = failure (~6.0 s).
- Many api_proxy.copilot.request child spans on the same trace (7–15 s each) confirm copilot engine activity before the agent-phase failure.
- Run: 26818750469
PR Code Quality Reviewer failure (P2) — trace 5d9db5c5976cad556efcf783b0b57c80, run 26849261658
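The continuity check described for the PR Sous Chef trace can be expressed as a small function. This is a sketch under assumptions: the `{ name, status }` record shape and `summarizeTrace` are hypothetical simplifications, and real list_events rows will use different field names.

```javascript
// Phase order taken from the PR Sous Chef trace walk-through.
const PHASES = [
  "gh-aw.activation.conclusion",
  "gh-aw.agent.conclusion",
  "gh-aw.detection.conclusion",
  "gh-aw.safe_outputs.conclusion",
];

// Hypothetical helper: reports missing phases and the first failing phase.
function summarizeTrace(spans) {
  const byName = new Map(spans.map((s) => [s.name, s]));
  const missing = PHASES.filter((p) => !byName.has(p));
  const firstFailure =
    PHASES.find((p) => byName.get(p)?.status === "failure") ?? null;
  return { continuous: missing.length === 0, missing, firstFailure };
}

// Statuses as reported for trace 1c17b6dd5b51b810e3ee7f84be03ab3a.
const sousChef = [
  { name: "gh-aw.activation.conclusion", status: "success" },
  { name: "gh-aw.agent.conclusion", status: "failure" },
  { name: "gh-aw.detection.conclusion", status: "failure" },
  { name: "gh-aw.safe_outputs.conclusion", status: "failure" },
];
console.log(summarizeTrace(sousChef));
```

Reporting the first failing phase separately from continuity makes it clear that later failures (detection, safe_outputs) may simply be downstream of the agent-phase failure.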
Recommendations
- Triage the 3 PR Sous Chef + 2 PR Code Quality Reviewer copilot failures (smallest fix first): open the linked run logs to determine whether the agent phase is failing on an agent error or a downstream safe-output rejection. Both currently surface only as gh-aw.run.status:failure with no log corroboration.
- Close the truncation blind spot: emit a scalar finish-reason attribute alongside the array (gen_ai.response.finish_reasons), since the array form is not queryable in Sentry. Without it, finish_reasons:length / runaway-token detection is impossible.
- Fix backend field mapping: ensure OTLP status.code surfaces as Sentry span.status and service.version surfaces as release; both are emitted but null at the consumer, blocking status-based filtering and any regression-vs-baseline comparison.
- Decide on error/log telemetry: the errors and logs datasets are empty, so this review relies solely on span attributes. Either ship complementary error/log signal or document spans as the single source of truth.
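To keep the triage list reproducible from one review to the next, failed runs can be grouped by workflow and filtered by a recurrence threshold. The sketch below uses only the run IDs named in this report; the remaining one-off failures are not itemized here, so a single clearly labeled placeholder row stands in for them. `recurringFailures` and the threshold of 2 are assumptions for illustration.

```javascript
// Failed runs named in this report, plus one placeholder for the
// un-itemized one-off failures (the placeholder is NOT real data).
const failedRuns = [
  { workflow: "PR Sous Chef", runId: "26834737280" },
  { workflow: "PR Sous Chef", runId: "26818750469" },
  { workflow: "PR Sous Chef", runId: "26794554270" },
  { workflow: "PR Code Quality Reviewer", runId: "26849261658" },
  { workflow: "PR Code Quality Reviewer", runId: "26837142276" },
  { workflow: "(placeholder one-off workflow)", runId: "(not from report)" },
];

// Hypothetical helper: groups run IDs by workflow and keeps workflows at or
// above the threshold, ordered by descending failure count.
function recurringFailures(runs, minCount = 2) {
  const groups = new Map();
  for (const { workflow, runId } of runs) {
    if (!groups.has(workflow)) groups.set(workflow, []);
    groups.get(workflow).push(runId);
  }
  return [...groups.entries()]
    .filter(([, ids]) => ids.length >= minCount)
    .sort((a, b) => b[1].length - a[1].length)
    .map(([workflow, runIds]) => ({ workflow, runIds }));
}

console.log(recurringFailures(failedRuns));
```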
Notes
- Tooling: Sentry MCP build exposes list_events (used here); search_events and get_trace_details were not available, so trace continuity was validated via list_events filtered by trace:<id> per the skill fallback.
- Inconclusive vs confirmed: run-level failures are confirmed via gh-aw.run.status:failure. Root causes are inconclusive (no errors/logs data; gen_ai.response.finish_reasons/span.status not consumable). Do not read these as confirmed timeouts.
- No timeouts/cancellations observed: no cancelled status; agent.setup spans peak ~12 s and the ~311 s PR Sous Chef agent phase is a phase duration, not a hung-span timeout.
- Healthy attributes (present & well-populated): gh-aw.workflow.name, gh-aw.run.status, gh-aw.engine.id (copilot 2563, claude 888, codex 194, gemini 58, pi 40, antigravity 40 spans). Note the attribute key is gh-aw.engine.id, not gh-aw.engine.
- Missing/null at consumer: gen_ai.response.finish_reasons (0/7d), span.status (all), release (all), gen_ai.usage.total_tokens & gen_ai.response.model (null on gen_ai spans → no token-cost outlier analysis from spans).
- Regression assessment: low, distributed failure rate; treated as normal background except the two recurring copilot workflows. No clear baseline beyond 24h was used.
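The split between healthy and missing attributes in these notes amounts to a per-key coverage count. A minimal sketch follows, assuming query results have been flattened into plain objects keyed by attribute name; the row shape and `attributeCoverage` are hypothetical, and the two sample rows are illustrative rather than real query output.

```javascript
// Hypothetical helper: for each key, the fraction of rows with a non-null value.
function attributeCoverage(rows, keys) {
  const out = {};
  for (const key of keys) {
    const present = rows.filter(
      (r) => r[key] !== null && r[key] !== undefined
    ).length;
    out[key] = rows.length === 0 ? 0 : present / rows.length;
  }
  return out;
}

// Illustrative rows only: engine id populated, span.status and release null,
// matching the pattern described in the notes.
const rows = [
  { "gh-aw.engine.id": "copilot", "span.status": null, release: null },
  { "gh-aw.engine.id": "claude", "span.status": null, release: null },
];
console.log(
  attributeCoverage(rows, ["gh-aw.engine.id", "span.status", "release"])
);
```

A coverage of 0 for a key that the emitter is known to set points at the ingest or mapping layer rather than at the workflow itself.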
References: §26818750469 · §26849261658 · §26834737280
Generated by 🚨 Daily Reliability Review · opus48 1.5M · ◷