
[grafana-otel-advisor] OTel improvement: report OTLP export errors on all conclusion spans, not just the conclusion job #30943


Analysis Date: 2026-05-08
Priority: Medium
Effort: Small (< 2h)

Problem

gh-aw.otlp.export_errors — the attribute that reports how many OTLP HTTP exports failed during a job — is only added to the span named gh-aw.conclusion.conclusion (the dedicated "conclusion" workflow job). For every other job (agent, activation, safe-outputs), the attribute is silently absent even when export failures occurred.

The gating condition is in actions/setup/js/send_otlp_span.cjs at lines 1326–1328:

```js
if (spanName === "gh-aw.conclusion.conclusion") {
  attributes.push(buildAttr("gh-aw.otlp.export_errors", readOTLPExportErrorCount()));
}
```

Because each workflow job runs on an independent GitHub Actions runner, the /tmp/gh-aw/otlp-export-errors.count file is runner-local. The conclusion job runs on a different runner than the agent job, so it can never observe the agent job's export failures. This means OTLP export failures within the agent, activation, and safe-outputs jobs are completely invisible in the exported trace data; they surface only as console.warn lines in the job logs.

As a result, an on-call engineer looking at a trace in Grafana cannot distinguish between:

  • "the agent job spans are missing because the collector rejected them"
  • "the agent job spans are missing because the spans were never generated"

Why This Matters (DevOps Perspective)

OTLP export failures are silent by design (sendOTLPSpan catches all errors and never re-throws). This is correct — tracing must never break the workflow. But the side effect is that span loss is invisible in the backend.

When an engineer opens a trace for a failed run and sees only 2 of the expected 6 spans, there is currently no signal in Grafana to explain the gap. The missing gh-aw.otlp.export_errors attribute means:

  • False root-cause diagnoses: engineers assume the job didn't run when actually the spans just failed to export
  • No alerting on collector degradation: you cannot write a Grafana alert on gh-aw.otlp.export_errors > 0 for agent jobs because the attribute never appears
  • Increased MTTR: diagnosing a flaky collector requires manually scanning GitHub Actions logs for console.warn lines across multiple jobs, rather than querying Grafana

The fix enables a dashboard panel showing "% of runs with export errors by job type" and alerts like "agent job had ≥ 3 export failures" — both previously unachievable.

Current Behavior

```js
// actions/setup/js/send_otlp_span.cjs  lines 1326-1328
// gh-aw.otlp.export_errors is ONLY added for the "conclusion" job's conclusion span.
// Agent, activation, and safe-outputs conclusion spans never carry this attribute.
if (spanName === "gh-aw.conclusion.conclusion") {
  attributes.push(buildAttr("gh-aw.otlp.export_errors", readOTLPExportErrorCount()));
}
```

recordOTLPExportError() (lines 1033–1039) is called whenever an OTLP HTTP request fails after all retries. It increments a counter at /tmp/gh-aw/otlp-export-errors.count on the current runner. Because each job runs on a separate runner, this counter is job-scoped — but it is only read and emitted for one specific span name.

Proposed Change

Remove the span-name gate and report the error count on every conclusion span. When there are no errors the value is 0, which is still useful as a health signal (it confirms the exporter ran without issues).

```js
// actions/setup/js/send_otlp_span.cjs  (replace lines 1326-1328)

// Before:
if (spanName === "gh-aw.conclusion.conclusion") {
  attributes.push(buildAttr("gh-aw.otlp.export_errors", readOTLPExportErrorCount()));
}

// After: surface export failures on every conclusion span so silent span loss is
// visible in the backend regardless of which job encountered collector issues.
attributes.push(buildAttr("gh-aw.otlp.export_errors", readOTLPExportErrorCount()));
```

This is a two-line diff (remove the if condition and its closing }). No other files require changes; readOTLPExportErrorCount() already handles missing files gracefully by returning 0.

Expected Outcome

After this change:

  • In Grafana / Honeycomb / Datadog: every conclusion span (gh-aw.agent.conclusion, gh-aw.activation.conclusion, gh-aw.safe-outputs.conclusion, gh-aw.conclusion.conclusion) now carries gh-aw.otlp.export_errors. Engineers can filter traces to runs where gh-aw.otlp.export_errors > 0, build a dashboard panel for exporter health by job type, and set up an alert when the agent job's export error rate spikes.
  • In the JSONL mirror: every conclusion span line in /tmp/gh-aw/otel.jsonl includes gh-aw.otlp.export_errors, making the mirror self-contained for offline post-mortem debugging — no need to cross-reference job logs.
  • For on-call engineers: when a trace shows fewer spans than expected, gh-aw.otlp.export_errors on the visible conclusion span immediately confirms or rules out a collector delivery problem, cutting the first step of MTTR from "search GitHub Actions logs" to "read one attribute."
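The offline post-mortem use case can be sketched as a small helper that scans the JSONL mirror for spans with non-zero export errors. Note the per-line span shape assumed here (a name plus a flat attributes array) is illustrative and not copied from the repository's actual schema:

```javascript
// Hypothetical helper: given the text of an OTel JSONL mirror, report spans
// whose gh-aw.otlp.export_errors attribute is non-zero. The span shape below
// is an assumption for illustration.
function spansWithExportErrors(jsonlText) {
  return jsonlText
    .split("\n")
    .filter(line => line.trim().length > 0)
    .map(line => JSON.parse(line))
    .flatMap(span => {
      const attr = (span.attributes || []).find(a => a.key === "gh-aw.otlp.export_errors");
      const errors = attr ? Number(attr.value) : 0;
      return errors > 0 ? [{ name: span.name, errors }] : [];
    });
}

// Two synthetic mirror lines: one job with export failures, one without.
const sample = [
  JSON.stringify({ name: "gh-aw.agent.conclusion", attributes: [{ key: "gh-aw.otlp.export_errors", value: 3 }] }),
  JSON.stringify({ name: "gh-aw.activation.conclusion", attributes: [{ key: "gh-aw.otlp.export_errors", value: 0 }] }),
].join("\n");

console.log(spansWithExportErrors(sample));
// [ { name: 'gh-aw.agent.conclusion', errors: 3 } ]
```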

Implementation Steps
  • Open actions/setup/js/send_otlp_span.cjs
  • Replace lines 1326–1328 with the single unconditional attributes.push(...) call shown above
  • Update actions/setup/js/send_otlp_span.test.cjs to assert that gh-aw.otlp.export_errors is present on gh-aw.agent.conclusion, gh-aw.activation.conclusion, and gh-aw.safe-outputs.conclusion spans (not just gh-aw.conclusion.conclusion)
  • Run make test-unit (or cd actions/setup/js && npx vitest run) to confirm tests pass
  • Run make fmt to ensure formatting
  • Open a PR referencing this issue

Evidence from Live Grafana Data

The Grafana MCP available in this session exposes Tempo via datasource UID grafanacloud-traces (Grafana Cloud, EU West 2), but no Tempo trace-search tool is available in the current MCP surface — only datasource discovery and deeplink generation. Direct span sampling was therefore not possible.

Static code analysis of send_otlp_span.cjs confirmed the gap: readOTLPExportErrorCount() is called only inside the if (spanName === "gh-aw.conclusion.conclusion") block (line 1326). Cross-referencing the compiled lock workflows confirms that GH_AW_INFO_VERSION is always set (e.g., "1.0.40", "2.1.126"), so service.version is present on all real spans — that potential gap does not exist in practice.

The standard OTel resource attributes (service.name, service.version, github.repository, github.run_id, github.event_name, deployment.environment) are all populated via buildGitHubActionsResourceAttributes and confirmed present in the code path for both setup and conclusion spans.

Related Files
  • actions/setup/js/send_otlp_span.cjs — contains the gated condition at line 1326 and the recordOTLPExportError / readOTLPExportErrorCount helpers
  • actions/setup/js/action_conclusion_otlp.cjs — calls sendJobConclusionSpan; no changes needed here
  • actions/setup/js/action_setup_otlp.cjs — no changes needed
  • actions/setup/js/generate_observability_summary.cjs — no changes needed

Generated by the Daily Grafana OTel Instrumentation Advisor workflow

