OTel Instrumentation Improvement: report OTLP export errors on all conclusion spans, not just the "conclusion" job
Analysis Date: 2026-05-08
Priority: Medium
Effort: Small (< 2h)
Problem
gh-aw.otlp.export_errors — the attribute that reports how many OTLP HTTP exports failed during a job — is only added to the span named gh-aw.conclusion.conclusion (the dedicated "conclusion" workflow job). For every other job (agent, activation, safe-outputs), the attribute is silently absent even when export failures occurred.
The gating condition is in actions/setup/js/send_otlp_span.cjs at lines 1326–1328:
if (spanName === "gh-aw.conclusion.conclusion") {
  attributes.push(buildAttr("gh-aw.otlp.export_errors", readOTLPExportErrorCount()));
}
Because each workflow job runs on an independent GitHub Actions runner, the /tmp/gh-aw/otlp-export-errors.count file is runner-local. The conclusion job runs on a different runner than the agent job and can never observe the agent job's export failures. This means OTLP export failures within the agent, activation, and safe-outputs jobs are completely invisible in the exported trace data — they are only console.warn'd in the job logs.
As a result, an on-call engineer looking at a trace in Grafana cannot distinguish between:
- "the agent job spans are missing because the collector rejected them"
- "the agent job spans are missing because the spans were never generated"
Why This Matters (DevOps Perspective)
OTLP export failures are silent by design (sendOTLPSpan catches all errors and never re-throws). This is correct — tracing must never break the workflow. But the side effect is that span loss is invisible in the backend.
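To illustrate the pattern, a minimal sketch (not the actual send_otlp_span.cjs code; postWithRetries is a hypothetical stand-in for its real HTTP retry logic):
// Minimal sketch of the swallow-and-record pattern described above.
// postWithRetries is a hypothetical stand-in for the real retry logic.
async function sendOTLPSpan(payload) {
  try {
    await postWithRetries(payload);
  } catch (err) {
    // Swallow the error so tracing can never fail the workflow; record it so
    // the failure count can be attached to a conclusion span later.
    recordOTLPExportError();
    console.warn(`OTLP export failed: ${err.message}`);
  }
}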
When an engineer opens a trace for a failed run and sees only 2 of the expected 6 spans, there is currently no signal in Grafana to explain the gap. The missing gh-aw.otlp.export_errors attribute means:
- False root-cause diagnoses: engineers assume the job didn't run when actually the spans just failed to export
- No alerting on collector degradation: you cannot write a Grafana alert on gh-aw.otlp.export_errors > 0 for agent jobs because the attribute never appears
- Increased MTTR: diagnosing a flaky collector requires manually scanning GitHub Actions logs for console.warn lines across multiple jobs, rather than querying Grafana
The fix enables a dashboard panel showing "% of runs with export errors by job type" and alerts like "agent job had ≥ 3 export failures" — both previously unachievable.
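Assuming Tempo as the trace backend (consistent with the grafanacloud-traces datasource noted below), the panel and alert could be driven by TraceQL queries roughly like these; the span and attribute names are taken from this document, and the queries are an untested sketch:
// Hypothetical TraceQL queries this change would enable. Verify attribute and
// span names against live data before wiring these into a dashboard or alert.
const exporterHealthPanelQuery = '{ span.gh-aw.otlp.export_errors > 0 }';
const agentExportFailureAlertQuery =
  '{ name = "gh-aw.agent.conclusion" && span.gh-aw.otlp.export_errors >= 3 }';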
Current Behavior
// actions/setup/js/send_otlp_span.cjs lines 1326-1328
// gh-aw.otlp.export_errors is ONLY added for the "conclusion" job's conclusion span.
// Agent, activation, and safe-outputs conclusion spans never carry this attribute.
if (spanName === "gh-aw.conclusion.conclusion") {
  attributes.push(buildAttr("gh-aw.otlp.export_errors", readOTLPExportErrorCount()));
}
recordOTLPExportError() (lines 1033–1039) is called whenever an OTLP HTTP request fails after all retries. It increments a counter at /tmp/gh-aw/otlp-export-errors.count on the current runner. Because each job runs on a separate runner, this counter is job-scoped — but it is only read and emitted for one specific span name.
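The helpers have roughly this shape (a sketch reconstructed from the behavior described here and in the proposed change below; the authoritative code is at lines 1033–1039 of send_otlp_span.cjs, and details such as directory creation and parsing are assumptions):
// Sketch of the runner-local counter helpers. Only the observable behavior
// (increment on failure, 0 when the file is missing) is documented above.
const fs = require("fs");
const COUNT_FILE = "/tmp/gh-aw/otlp-export-errors.count";

function recordOTLPExportError() {
  fs.mkdirSync("/tmp/gh-aw", { recursive: true });
  fs.writeFileSync(COUNT_FILE, String(readOTLPExportErrorCount() + 1));
}

function readOTLPExportErrorCount() {
  try {
    return parseInt(fs.readFileSync(COUNT_FILE, "utf8"), 10) || 0;
  } catch {
    return 0; // no counter file on this runner => no recorded export errors
  }
}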
Proposed Change
Remove the span-name gate and report the error count on every conclusion span. When there are no errors, the value is 0, which is still useful as a health signal (it confirms the exporter ran without issues).
// actions/setup/js/send_otlp_span.cjs (replace lines 1326-1328)
// Before:
if (spanName === "gh-aw.conclusion.conclusion") {
  attributes.push(buildAttr("gh-aw.otlp.export_errors", readOTLPExportErrorCount()));
}
// After: surface export failures on every conclusion span so silent span loss is
// visible in the backend regardless of which job encountered collector issues.
attributes.push(buildAttr("gh-aw.otlp.export_errors", readOTLPExportErrorCount()));
This is a two-line diff (remove the if condition and its closing }). No other files require changes; readOTLPExportErrorCount() already handles missing files gracefully by returning 0.
Expected Outcome
After this change:
- In Grafana / Honeycomb / Datadog: every conclusion span (gh-aw.agent.conclusion, gh-aw.activation.conclusion, gh-aw.safe-outputs.conclusion, gh-aw.conclusion.conclusion) now carries gh-aw.otlp.export_errors. Engineers can filter traces to runs where gh-aw.otlp.export_errors > 0, build a dashboard panel for exporter health by job type, and set up an alert when the agent job's export error rate spikes.
- In the JSONL mirror: every conclusion span line in /tmp/gh-aw/otel.jsonl includes gh-aw.otlp.export_errors, making the mirror self-contained for offline post-mortem debugging with no need to cross-reference job logs (a triage sketch follows this list).
- For on-call engineers: when a trace shows fewer spans than expected, gh-aw.otlp.export_errors on the visible conclusion span immediately confirms or rules out a collector delivery problem, cutting the first step of MTTR from "search GitHub Actions logs" to "read one attribute."
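As referenced above, a hedged triage sketch for the JSONL mirror. The per-line schema assumed here (an OTLP-style span object with a flat attributes list) is not confirmed by the source; adjust the accessors to the real mirror format:
// Hypothetical offline triage: list spans in the JSONL mirror whose
// export-error count is non-zero. The line schema is an assumption.
const fs = require("fs");

for (const line of fs.readFileSync("/tmp/gh-aw/otel.jsonl", "utf8").split("\n")) {
  if (!line.trim()) continue;
  const span = JSON.parse(line);
  const attr = (span.attributes || []).find(a => a.key === "gh-aw.otlp.export_errors");
  const count = attr ? Number(attr.value?.intValue ?? attr.value) : 0;
  if (count > 0) console.log(`${span.name}: ${count} export error(s)`);
}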
Implementation Steps
1. Edit actions/setup/js/send_otlp_span.cjs: replace the gated block with the unconditional attributes.push(...) call shown above.
2. Update actions/setup/js/send_otlp_span.test.cjs to assert that gh-aw.otlp.export_errors is present on gh-aw.agent.conclusion, gh-aw.activation.conclusion, and gh-aw.safe-outputs.conclusion spans, not just gh-aw.conclusion.conclusion (a hedged sketch follows this list).
3. Run make test-unit (or cd actions/setup/js && npx vitest run) to confirm tests pass.
4. Run make fmt to ensure formatting.
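One possible shape for the assertion in step 2. This sketch assumes vitest globals are enabled and that a buildConclusionSpanAttributes helper is exported for testing; neither is confirmed by the source, so mirror whatever span-building entry point the existing tests in send_otlp_span.test.cjs actually exercise:
// Hypothetical test sketch. buildConclusionSpanAttributes is an assumed export;
// adapt to the entry point the existing tests use.
const { buildConclusionSpanAttributes } = require("./send_otlp_span.cjs");

describe("gh-aw.otlp.export_errors", () => {
  const conclusionSpans = [
    "gh-aw.agent.conclusion",
    "gh-aw.activation.conclusion",
    "gh-aw.safe-outputs.conclusion",
    "gh-aw.conclusion.conclusion",
  ];
  for (const spanName of conclusionSpans) {
    it(`is present on ${spanName}`, () => {
      const attrs = buildConclusionSpanAttributes(spanName);
      expect(attrs.map(a => a.key)).toContain("gh-aw.otlp.export_errors");
    });
  }
});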
Evidence from Live Grafana Data
The Grafana MCP available in this session exposes Tempo via datasource UID grafanacloud-traces (Grafana Cloud, EU West 2), but no Tempo trace-search tool is available in the current MCP surface — only datasource discovery and deeplink generation. Direct span sampling was therefore not possible.
Static code analysis of send_otlp_span.cjs confirmed the gap: readOTLPExportErrorCount() is called only inside the if (spanName === "gh-aw.conclusion.conclusion") block (line 1326). Cross-referencing the compiled lock workflows confirms that GH_AW_INFO_VERSION is always set (e.g., "1.0.40", "2.1.126"), so service.version is present on all real spans — that potential gap does not exist in practice.
The standard OTel resource attributes (service.name, service.version, github.repository, github.run_id, github.event_name, deployment.environment) are all populated via buildGitHubActionsResourceAttributes and confirmed present in the code path for both setup and conclusion spans.
Related Files
actions/setup/js/send_otlp_span.cjs — contains the gated condition at line 1326 and the recordOTLPExportError / readOTLPExportErrorCount helpers
actions/setup/js/action_conclusion_otlp.cjs — calls sendJobConclusionSpan; no changes needed here
actions/setup/js/action_setup_otlp.cjs — no changes needed
actions/setup/js/generate_observability_summary.cjs — no changes needed
Generated by the Daily Grafana OTel Instrumentation Advisor workflow