[reliability] Daily Reliability Review - 2026-06-02 #36550

### Executive Summary

Overall health for the last 24h is **good**. Telemetry is flowing: the `spans` dataset has ~18.4k spans and gh-aw conclusion spans are well-attributed (`gh-aw.workflow.name`, `gh-aw.run.status`, `gh-aw.engine.id` all populated). Of **1,889 run-conclusion spans, 52 (≈2.8%) carried `gh-aw.run.status:failure`**, corresponding to **~13 distinct failed runs** (13 / 1,889 ≈ 0.7% of the conclusion-span count). Failures are mostly one-off and spread across many workflows; the only **recurring** patterns are **PR Sous Chef (3 failed runs)** and **PR Code Quality Reviewer (2 failed runs)**, both on the **copilot** engine. Trace continuity was verified intact on a representative PR Sous Chef failure.
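
The two percentages in the summary use different numerators, so it can help to spell the arithmetic out. The sketch below only re-derives the figures quoted above; the helper name `pct` is illustrative and not part of any gh-aw tooling.

```javascript
// Counts quoted in the Executive Summary.
const conclusionSpans = 1889; // run-conclusion spans in the last 24h
const failureSpans = 52;      // spans carrying gh-aw.run.status:failure
const failedRuns = 13;        // approximate distinct failed runs

// Percentage rounded to one decimal place.
const pct = (numerator, denominator) =>
  Math.round((numerator / denominator) * 1000) / 10;

const spanFailurePct = pct(failureSpans, conclusionSpans); // share of spans marked failure
const runVsSpanPct = pct(failedRuns, conclusionSpans);     // failed runs against span count
const spansPerFailedRun = failureSpans / failedRuns;       // failure spans per failed run

console.log({ spanFailurePct, runVsSpanPct, spansPerFailedRun });
// prints { spanFailurePct: 2.8, runVsSpanPct: 0.7, spansPerFailedRun: 4 }
```

Note that the 0.7% figure divides runs by spans, so it understates the per-run failure rate; a true run-level rate would need the count of distinct runs, which this report does not state.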

No timeouts, cancellations, or OTLP-export failures were confirmed. However, several reliability-relevant fields are **not consumable from Sentry**: `gen_ai.response.finish_reasons` returns zero results over 7d (blocking truncation / runaway-token detection), `span.status` is null on every span, `release` is null, and the `errors` and `logs` datasets are **empty** — so failure root-causes cannot be corroborated from log/error telemetry. These are observability gaps, not confirmed runtime failures.

### Top Reliability Findings

| Priority | Workflow | Problem | Evidence | Next Action |
| --- | --- | --- | --- | --- |
| P1 | PR Sous Chef | Recurring run failures at agent phase (copilot) | 3 failed runs/24h (~18% of ~17 runs); spans `gh-aw.agent.conclusion` + `detection/safe_outputs/conclusion` = `failure`; runs [26834737280](https://github.com/github/gh-aw/actions/runs/26834737280), [26818750469](https://github.com/github/gh-aw/actions/runs/26818750469), [26794554270](https://github.com/github/gh-aw/actions/runs/26794554270); trace `1c17b6dd5b51b810e3ee7f84be03ab3a` | Open the 3 run logs; agent phase ran ~311s before failing — check for agent error vs. safe-output rejection |
| P2 | PR Code Quality Reviewer | Recurring run failures at agent phase (copilot) | 2 failed runs/24h (~11% of 19 runs); runs [26849261658](https://github.com/github/gh-aw/actions/runs/26849261658), [26837142276](https://github.com/github/gh-aw/actions/runs/26837142276); trace `5d9db5c5976cad556efcf783b0b57c80` | Compare the 2 failed runs against passing runs of the same workflow |
| P3 | (instrumentation) | `gen_ai.response.finish_reasons` not queryable in Sentry → truncation / runaway-token blind spot | `has:gen_ai.response.finish_reasons` → **0 results over 7d**, despite emit-side always emitting it (`send_otlp_span.cjs:2011`, array attr via `buildArrayAttr`) | Verify Sentry indexes array-valued span attrs, or emit a scalar mirror (e.g. `gh-aw.finish_reason`) |
| P4 | (instrumentation) | `span.status` null on all spans; `release` null → no OTLP-status filtering, no regression correlation | All spans `span.status:null`; `release:null`. Emit-side sets OTLP `status.code=2` on failures (`send_otlp_span.cjs:1908/1944`) and `service.version` resource attr (`:322`) | Confirm OTLP `status.code`→Sentry `span.status` and `service.version`→`release` mapping in the Sentry ingest config |
| P5 | (observability) | `errors` and `logs` datasets empty → failure root-cause can't be corroborated from logs/errors | `errors` and `logs` queries return **0 results / 24h** (and unfiltered) | Decide whether gh-aw should ship error/log telemetry, or document that spans are the sole signal |
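
The P1 and P2 rows depend on collapsing failure spans into distinct failed runs per workflow. A minimal sketch of that grouping follows, assuming each span arrives as a flat object of attributes. The `gh-aw.run.id` key is hypothetical (the report does not name the run-id attribute), and the threshold of two failed runs for "recurring" is likewise an assumption.

```javascript
// Collapse failed conclusion spans into distinct failed runs per workflow.
// Assumption: the run id lives under a hypothetical "gh-aw.run.id" key.
function failedRunsByWorkflow(spans) {
  const runs = new Map(); // workflow name -> Set of failed run ids
  for (const span of spans) {
    if (span["gh-aw.run.status"] !== "failure") continue;
    const workflow = span["gh-aw.workflow.name"];
    if (!runs.has(workflow)) runs.set(workflow, new Set());
    runs.get(workflow).add(span["gh-aw.run.id"]);
  }
  return runs;
}

// Workflows with at least `minFailedRuns` distinct failed runs, worst first.
function recurringWorkflows(spans, minFailedRuns = 2) {
  return [...failedRunsByWorkflow(spans)]
    .filter(([, ids]) => ids.size >= minFailedRuns)
    .map(([workflow, ids]) => ({ workflow, failedRuns: ids.size }))
    .sort((a, b) => b.failedRuns - a.failedRuns);
}
```

Counting distinct run ids instead of spans keeps a single failed run, which emits several conclusion spans, from looking like a recurring pattern.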

### Representative Traces
<details>
<summary>View representative traces</summary>

**PR Sous Chef failure (P1)** — trace [`1c17b6dd5b51b810e3ee7f84be03ab3a`](https://github.sentry.io/explore/traces/trace/1c17b6dd5b51b810e3ee7f84be03ab3a)
- Continuity intact: `gh-aw.activation.conclusion` = `success`, then `gh-aw.agent.conclusion` = **failure** (`span.duration` ≈ 311,698 ms / ~5.2 min), `gh-aw.detection.conclusion` = failure (~49.9 s), `gh-aw.safe_outputs.conclusion` = failure (~6.0 s).
- Many `api_proxy.copilot.request` child spans on the same trace (7–15 s each) confirm copilot engine activity before the agent-phase failure.
- Run: [26818750469](https://github.com/github/gh-aw/actions/runs/26818750469)
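
The continuity check above can be sketched as a small function over the per-phase conclusions of one trace. This is an illustration under stated assumptions: the phase order is inferred from the order in which the phases are listed above, and the input is assumed to be a plain object mapping each `gh-aw.<phase>.conclusion` key to its value. It is not the query the review actually ran.

```javascript
// Phase order is an assumption inferred from the trace listing above.
const PHASES = ["activation", "agent", "detection", "safe_outputs"];

// conclusions: e.g. { "gh-aw.activation.conclusion": "success", ... }
function summarizeTrace(conclusions) {
  const keyFor = (phase) => `gh-aw.${phase}.conclusion`;
  // A trace is treated as continuous when every expected phase reported.
  const missing = PHASES.filter((phase) => !(keyFor(phase) in conclusions));
  // The first failing phase is where triage should start.
  const firstFailure =
    PHASES.find((phase) => conclusions[keyFor(phase)] === "failure") || null;
  return { continuous: missing.length === 0, missing, firstFailure };
}

const p1Trace = {
  "gh-aw.activation.conclusion": "success",
  "gh-aw.agent.conclusion": "failure",
  "gh-aw.detection.conclusion": "failure",
  "gh-aw.safe_outputs.conclusion": "failure",
};
const p1Summary = summarizeTrace(p1Trace);
console.log(p1Summary);
// prints { continuous: true, missing: [], firstFailure: 'agent' }
```
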

**PR Code Quality Reviewer failure (P2)** — trace [`5d9db5c5976cad556efcf783b0b57c80`](https://github.sentry.io/explore/traces/trace/5d9db5c5976cad556efcf783b0b57c80), run [26849261658](https://github.com/github/gh-aw/actions/runs/26849261658)

</details>

### Recommendations

1. **Triage the 3 PR Sous Chef + 2 PR Code Quality Reviewer copilot failures** (smallest fix first): open the linked run logs to determine whether the agent phase is failing on an agent error or a downstream safe-output rejection — both currently surface only as `gh-aw.run.status:failure` with no log corroboration.
2. **Close the truncation blind spot**: emit a **scalar** finish-reason attribute alongside the array (`gen_ai.response.finish_reasons`), since the array form is not queryable in Sentry — without it, `finish_reasons:length` / runaway-token detection is impossible.
3. **Fix backend field mapping**: ensure OTLP `status.code` surfaces as Sentry `span.status` and `service.version` surfaces as `release`; both are emitted but null at the consumer, blocking status-based filtering and any regression-vs-baseline comparison.
4. **Decide on error/log telemetry**: the `errors` and `logs` datasets are empty, so this review relies solely on span attributes — either ship complementary error/log signal or document spans as the single source of truth.
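
Recommendation 2 could look roughly like the sketch below, which emits the existing array attribute plus a scalar mirror in OTLP/JSON attribute form. The helper names are hypothetical and are not the `buildArrayAttr` helper in `send_otlp_span.cjs`, whose signature this report does not show. The `gh-aw.finish_reason` key is the example name suggested in the findings table, and mirroring the first array entry is an assumption.

```javascript
// OTLP/JSON encodes attribute values as typed wrappers (stringValue, arrayValue).
function stringAttr(key, value) {
  return { key, value: { stringValue: String(value) } };
}

function stringArrayAttr(key, values) {
  return {
    key,
    value: { arrayValue: { values: values.map((v) => ({ stringValue: String(v) })) } },
  };
}

// Emit the array attribute as before, plus a scalar mirror when a reason exists.
function finishReasonAttrs(finishReasons) {
  const attrs = [stringArrayAttr("gen_ai.response.finish_reasons", finishReasons)];
  if (finishReasons.length > 0) {
    // Assumption: the first entry is representative of the response.
    attrs.push(stringAttr("gh-aw.finish_reason", finishReasons[0]));
  }
  return attrs;
}
```

If responses can carry several finish reasons, a mirror that prefers `length` over other values may serve truncation detection better than taking the first entry; that choice is left open here.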

### Notes
<details>
<summary>View notes</summary>

- **Tooling**: Sentry MCP build exposes `list_events` (used here); `search_events` and `get_trace_details` were not available, so trace continuity was validated via `list_events` filtered by `trace:<id>` per the skill fallback.
- **Inconclusive vs confirmed**: run-level failures are **confirmed** via `gh-aw.run.status:failure`. Root causes are **inconclusive** (no `errors`/`logs` data; `gen_ai.response.finish_reasons`/`span.status` not consumable). Do not read these as confirmed timeouts.
- **No timeouts/cancellations** observed: no `cancelled` status; agent.setup spans peak ~12 s and the ~311 s PR Sous Chef agent phase is a phase duration, not a hung-span timeout.
- **Healthy attributes** (present & well-populated): `gh-aw.workflow.name`, `gh-aw.run.status`, `gh-aw.engine.id` (copilot 2563, claude 888, codex 194, gemini 58, pi 40, antigravity 40 spans). Note the attribute key is `gh-aw.engine.id`, not `gh-aw.engine`.
- **Missing/null at consumer**: `gen_ai.response.finish_reasons` (0/7d), `span.status` (all), `release` (all), `gen_ai.usage.total_tokens` & `gen_ai.response.model` (null on gen_ai spans → no token-cost outlier analysis from spans).
- **Regression assessment**: low, distributed failure rate; treated as normal background except the two recurring copilot workflows. No clear baseline beyond 24h was used.
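
The healthy versus missing split in these notes amounts to a per-attribute coverage check over the returned span rows. A minimal sketch, assuming rows are flat objects in which an unavailable field is `null`, `undefined`, or an empty string (the actual shape of `list_events` results is not shown in this report):

```javascript
// Fraction of rows in which each attribute key holds a usable value.
function attributeCoverage(rows, keys) {
  const coverage = {};
  for (const key of keys) {
    const populated = rows.filter(
      (row) => row[key] !== null && row[key] !== undefined && row[key] !== ""
    ).length;
    coverage[key] = rows.length === 0 ? 0 : populated / rows.length;
  }
  return coverage;
}

// Keys that are never populated, i.e. blind spots at the consumer.
function blindSpots(rows, keys) {
  const coverage = attributeCoverage(rows, keys);
  return keys.filter((key) => coverage[key] === 0);
}
```

Running such a check daily against a fixed key list would flag a newly null attribute as an instrumentation regression without requiring a manual query per field.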

**References:** [§26818750469](https://github.com/github/gh-aw/actions/runs/26818750469) · [§26849261658](https://github.com/github/gh-aw/actions/runs/26849261658) · [§26834737280](https://github.com/github/gh-aw/actions/runs/26834737280)

</details>

> Generated by [🚨 Daily Reliability Review](https://github.com/github/gh-aw/actions/runs/26853914729) · opus48 1.5M · [◷](https://github.com/search?q=repo%3Agithub%2Fgh-aw+is%3Aissue+%22gh-aw-workflow-call-id%3A+github%2Fgh-aw%2Fdaily-reliability-review%22&type=issues)
> - [x] expires on Jun 4, 2026, 11:30 PM UTC
