
[grafana-otel-advisor] OTel improvement: report OTLP export errors on all conclusion spans, not just the conclusion job #30943


Analysis Date: 2026-05-08
Priority: Medium
Effort: Small (< 2h)

Problem

gh-aw.otlp.export_errors — the attribute that reports how many OTLP HTTP exports failed during a job — is only added to the span named gh-aw.conclusion.conclusion (the dedicated "conclusion" workflow job). For every other job (agent, activation, safe-outputs), the attribute is silently absent even when export failures occurred.

The gating condition is in actions/setup/js/send_otlp_span.cjs at lines 1326–1328:

```js
if (spanName === "gh-aw.conclusion.conclusion") {
  attributes.push(buildAttr("gh-aw.otlp.export_errors", readOTLPExportErrorCount()));
}
```

Because each workflow job runs on an independent GitHub Actions runner, the /tmp/gh-aw/otlp-export-errors.count file is runner-local. The conclusion job runs on a different runner than the agent job, so it can never observe the agent job's export failures. This means OTLP export failures within the agent, activation, and safe-outputs jobs are completely invisible in the exported trace data; they surface only as console.warn lines in the job logs.

As a result, an on-call engineer looking at a trace in Grafana cannot distinguish between:

  • "the agent job spans are missing because the collector rejected them"
  • "the agent job spans are missing because the spans were never generated"

Why This Matters (DevOps Perspective)

OTLP export failures are silent by design (sendOTLPSpan catches all errors and never re-throws). This is correct — tracing must never break the workflow. But the side effect is that span loss is invisible in the backend.

When an engineer opens a trace for a failed run and sees only 2 of the expected 6 spans, there is currently no signal in Grafana to explain the gap. The missing gh-aw.otlp.export_errors attribute means:

  • False root-cause diagnoses: engineers assume the job didn't run when actually the spans just failed to export
  • No alerting on collector degradation: you cannot write a Grafana alert on gh-aw.otlp.export_errors > 0 for agent jobs because the attribute never appears
  • Increased MTTR: diagnosing a flaky collector requires manually scanning GitHub Actions logs for console.warn lines across multiple jobs, rather than querying Grafana

The fix enables a dashboard panel showing "% of runs with export errors by job type" and alerts like "agent job had ≥ 3 export failures" — both previously unachievable.

Current Behavior

```js
// actions/setup/js/send_otlp_span.cjs  lines 1326-1328
// gh-aw.otlp.export_errors is ONLY added for the "conclusion" job's conclusion span.
// Agent, activation, and safe-outputs conclusion spans never carry this attribute.
if (spanName === "gh-aw.conclusion.conclusion") {
  attributes.push(buildAttr("gh-aw.otlp.export_errors", readOTLPExportErrorCount()));
}
```

recordOTLPExportError() (lines 1033–1039) is called whenever an OTLP HTTP request fails after all retries. It increments a counter at /tmp/gh-aw/otlp-export-errors.count on the current runner. Because each job runs on a separate runner, this counter is job-scoped — but it is only read and emitted for one specific span name.

Proposed Change

Remove the span-name gate and report the error count on every conclusion span. When there are no errors the value is 0, which is still useful as a health signal (it confirms the exporter ran without issues).

```js
// actions/setup/js/send_otlp_span.cjs  (replace lines 1326-1328)

// Before:
if (spanName === "gh-aw.conclusion.conclusion") {
  attributes.push(buildAttr("gh-aw.otlp.export_errors", readOTLPExportErrorCount()));
}

// After: surface export failures on every conclusion span so silent span loss is
// visible in the backend regardless of which job encountered collector issues.
attributes.push(buildAttr("gh-aw.otlp.export_errors", readOTLPExportErrorCount()));
```

This is a two-line diff (remove the if condition and its closing }). No other files require changes; readOTLPExportErrorCount() already handles missing files gracefully by returning 0.

Expected Outcome

After this change:

  • In Grafana / Honeycomb / Datadog: every conclusion span (gh-aw.agent.conclusion, gh-aw.activation.conclusion, gh-aw.safe-outputs.conclusion, gh-aw.conclusion.conclusion) now carries gh-aw.otlp.export_errors. Engineers can filter traces to runs where gh-aw.otlp.export_errors > 0, build a dashboard panel for exporter health by job type, and set up an alert when the agent job's export error rate spikes.
  • In the JSONL mirror: every conclusion span line in /tmp/gh-aw/otel.jsonl includes gh-aw.otlp.export_errors, making the mirror self-contained for offline post-mortem debugging — no need to cross-reference job logs.
  • For on-call engineers: when a trace shows fewer spans than expected, gh-aw.otlp.export_errors on the visible conclusion span immediately confirms or rules out a collector delivery problem, cutting the first step of MTTR from "search GitHub Actions logs" to "read one attribute."
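The offline post-mortem use case can be sketched as a small helper that scans the JSONL mirror for spans with non-zero export errors. Note the per-line span shape assumed here (a name plus a flat attributes array) is illustrative and not copied from the repository's actual schema:

```javascript
// Hypothetical helper: given the text of an OTel JSONL mirror, report spans
// whose gh-aw.otlp.export_errors attribute is non-zero. The span shape below
// is an assumption for illustration.
function spansWithExportErrors(jsonlText) {
  return jsonlText
    .split("\n")
    .filter(line => line.trim().length > 0)
    .map(line => JSON.parse(line))
    .flatMap(span => {
      const attr = (span.attributes || []).find(a => a.key === "gh-aw.otlp.export_errors");
      const errors = attr ? Number(attr.value) : 0;
      return errors > 0 ? [{ name: span.name, errors }] : [];
    });
}

// Two synthetic mirror lines: one job with export failures, one without.
const sample = [
  JSON.stringify({ name: "gh-aw.agent.conclusion", attributes: [{ key: "gh-aw.otlp.export_errors", value: 3 }] }),
  JSON.stringify({ name: "gh-aw.activation.conclusion", attributes: [{ key: "gh-aw.otlp.export_errors", value: 0 }] }),
].join("\n");

console.log(spansWithExportErrors(sample));
// [ { name: 'gh-aw.agent.conclusion', errors: 3 } ]
```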

Implementation Steps
  • Open actions/setup/js/send_otlp_span.cjs
  • Replace lines 1326–1328 with the single unconditional attributes.push(...) call shown above
  • Update actions/setup/js/send_otlp_span.test.cjs to assert that gh-aw.otlp.export_errors is present on gh-aw.agent.conclusion, gh-aw.activation.conclusion, and gh-aw.safe-outputs.conclusion spans (not just gh-aw.conclusion.conclusion)
  • Run make test-unit (or cd actions/setup/js && npx vitest run) to confirm tests pass
  • Run make fmt to ensure formatting
  • Open a PR referencing this issue

Evidence from Live Grafana Data

The Grafana MCP available in this session exposes Tempo via datasource UID grafanacloud-traces (Grafana Cloud, EU West 2), but no Tempo trace-search tool is available in the current MCP surface — only datasource discovery and deeplink generation. Direct span sampling was therefore not possible.

Static code analysis of send_otlp_span.cjs confirmed the gap: readOTLPExportErrorCount() is called only inside the if (spanName === "gh-aw.conclusion.conclusion") block (line 1326). Cross-referencing the compiled lock workflows confirms that GH_AW_INFO_VERSION is always set (e.g., "1.0.40", "2.1.126"), so service.version is present on all real spans — that potential gap does not exist in practice.

The standard OTel resource attributes (service.name, service.version, github.repository, github.run_id, github.event_name, deployment.environment) are all populated via buildGitHubActionsResourceAttributes and confirmed present in the code path for both setup and conclusion spans.

Related Files
  • actions/setup/js/send_otlp_span.cjs — contains the gated condition at line 1326 and the recordOTLPExportError / readOTLPExportErrorCount helpers
  • actions/setup/js/action_conclusion_otlp.cjs — calls sendJobConclusionSpan; no changes needed here
  • actions/setup/js/action_setup_otlp.cjs — no changes needed
  • actions/setup/js/generate_observability_summary.cjs — no changes needed

Generated by the Daily Grafana OTel Instrumentation Advisor workflow

