Skip to content

fix(csharp): emit retry telemetry from RetryHttpHandler#497

Open
eric-wang-1990 wants to merge 5 commits into
mainfrom
tracing/479-retry-telemetry
Open

fix(csharp): emit retry telemetry from RetryHttpHandler#497
eric-wang-1990 wants to merge 5 commits into
mainfrom
tracing/479-retry-telemetry

Conversation

@eric-wang-1990

@eric-wang-1990 eric-wang-1990 commented May 28, 2026

Copy link
Copy Markdown
Collaborator

Summary

Issue #479: RetryHttpHandler runs on every HTTP send and retries on Retry-After (503/429) responses and transient transport errors, but historically emitted zero per-attempt telemetry. A user reading their own driver logs asking "did my failure retry then succeed, or did it fail without any retry?" couldn't answer from the trace alone.

This PR makes each retry visible as an event on the SendAsync activity.

Event vocabulary

On the activity that wraps each RetryHttpHandler.SendAsync call — one retry.attempt event per retry (NOT emitted on first-try success):

  • attempt_number (int) — 1-indexed retry counter, sourced directly from the existing attemptCount
  • delay_ms (long) — milliseconds the handler will sleep before re-sending
  • reason (string) — one of:
    • retry_after_<status> (e.g. retry_after_503, retry_after_429)
    • retry_after_<status>_invalid_header — header present but unparseable
    • retry_after_<status>_no_header — retryable status, no Retry-After
    • transport_error_<ExceptionTypeName> — transient transport failure

What's intentionally NOT included

No retry.total_attempts / retry.exhausted summary tags. The target consumer is a user reading their own driver logs, so dashboard-shaped aggregation tags aren't worth the code complexity — the per-retry events already carry the diagnostic surface ("how many times did this retry, and why"). The pre-existing http.retry.total_attempts tag on the exhaustion-break paths (already on main before this PR) is untouched.

Files changed

  • csharp/src/RetryHttpHandler.cs — emits retry.attempt events inside the existing TraceActivityAsync block, both on the transport-error retry path and the Retry-After 503/429 retry path. Uses the existing attemptCount for the event's attempt_number (no new counters).
  • csharp/test/Unit/RetryHttpHandlerTest.cs — two new tests assert on the events emitted from the SendAsync activity:
    • RetryTelemetry_NoRetry_EmitsNoEvents_Issue479 — happy path: no events on first-try success.
    • RetryTelemetry_RetryAfter_EmitsAttemptEvents_Issue479[Theory] over 503 and 429: every retry emits an event with monotonically increasing attempt_number, delay_ms >= 1000 (Retry-After: 1), and a reason containing the status code so 429 throttles are distinguishable from 503s.

Test plan

  • dotnet test --filter "FullyQualifiedName~RetryTelemetry" — 3/3 pass (1 [Fact] + 2 [Theory] cases)
  • dotnet test --filter "FullyQualifiedName~RetryHttpHandlerTest" — full retry suite passes
  • Build clean across netstandard2.0, net472, net8.0

Closes #479

This pull request and its description were written by Isaac.

…479)

Adds four failing unit tests in RetryHttpHandlerTest.cs that capture the
Activity emitted by RetryHttpHandler.SendAsync via an ActivityListener
and assert on its events and tags:

  - RetryTelemetry_NoRetryNeeded_AlwaysSetsSummaryTags_Issue479
    On the no-retry-needed path, the SendAsync activity must still
    carry retry.total_attempts=1 and retry.exhausted=false, so
    dashboards can group/filter without detecting "tag absent."

  - RetryTelemetry_RetryAfter503_EmitsAttemptEventsAndSummaryTags_Issue479
    Each 503-driven retry emits a retry.attempt event with
    attempt_number, delay_ms, reason tags; final tags reflect
    total_attempts=2 and exhausted=false on success.

  - RetryTelemetry_RetryBudgetExhausted_SetsExhaustedTrue_Issue479
    When the retry budget is exceeded, retry.exhausted=true.

  - RetryTelemetry_RetryAfter429_EmitsAttemptEventWithRateLimitReason_Issue479
    429 retries emit retry.attempt events whose reason contains "429"
    so throttling is distinguishable from generic 503.

All four tests currently fail because RetryHttpHandler emits no
retry telemetry. The implementation fix lands in a follow-up commit.

Refs #479

Co-authored-by: Isaac
Adds per-attempt retry.attempt events (with attempt_number, delay_ms,
reason) before each Task.Delay in RetryHttpHandler.SendAsync, plus
final retry.total_attempts + retry.exhausted summary tags on the
activity that wraps each HTTP send. Both summary tags fire
unconditionally — including the no-retry-needed path — so dashboards
can group/filter without detecting "tag absent."

Previously the handler emitted zero per-attempt telemetry. Customer
debugging of "flaky vs hard failure" required server-side log access
because the driver's retry loop was invisible to traces: a 503-with-
Retry-After path that succeeded on the 3rd send and a 503-no-Retry-
After path that failed immediately produced identical client traces.

Reason vocabulary on retry.attempt events:
  retry_after_<status>                 — Retry-After header parsed
  retry_after_<status>_invalid_header  — header present but unparseable
  retry_after_<status>_no_header       — retryable status, no header
  transport_error_<ExceptionTypeName>  — transient transport failure

The new total_attempts counter reflects the number of HTTP sends
actually issued (1 if first try succeeds, N if N-1 retries happened);
retry.exhausted is true when we gave up (budget exceeded or cancelled).

The four new unit tests added in the previous commit (RED) all pass.

Closes #479

Co-authored-by: Isaac
Comment thread csharp/src/RetryHttpHandler.cs Outdated
Comment thread csharp/src/RetryHttpHandler.cs Outdated
// Issue #479: ALWAYS emit summary tags — even on the
// no-retry-needed path, even when we threw. Without
// always-set tags, dashboards have to detect "tag
// absent" to distinguish "successful first try" from

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note this will be mostly used for local user logs, so dashboard view is less important, re-evaluate if we need this PR

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed — reframed the PR for local-log readers in dc6cd7a. Kept the per-retry retry.attempt events (the actually-useful diagnostic surface for someone reading their own logs) and dropped the always-fires summary tags retry.total_attempts / retry.exhausted along with the try/finally block that hosted them, since the dashboard-aggregation justification doesn't apply. The two summary-tag-only tests were deleted; the two event-asserting tests stay (with summary-tag assertions trimmed). Net diff is 173 insertions / 291 deletions, so the PR is significantly smaller and more focused now.


This comment was generated with GitHub MCP.

…ount

Addresses PR review feedback on #479. Two changes:

- Drop the redundant httpSendCount / retryEventNumber counters and use
  the pre-existing attemptCount for the retry.attempt event's
  attempt_number. attemptCount is already incremented before each retry
  event fires, so the numbering is identical.

- Drop the always-fires summary tags retry.total_attempts and
  retry.exhausted (and the try/finally that hosted them). The intended
  consumer is a user reading their own driver logs, where the
  per-retry retry.attempt events on the SendAsync activity already
  carry the diagnostic surface. Dashboard-shaped aggregation tags are
  not the target use case.

Tests: delete the two summary-tag-only tests, retain the two
event-asserting tests with summary-tag assertions removed.

Co-authored-by: Isaac
@eric-wang-1990 eric-wang-1990 requested review from jadewang-db and msrathore-db and removed request for msrathore-db May 29, 2026 07:20
@eric-wang-1990 eric-wang-1990 marked this pull request as ready for review May 29, 2026 07:20
@eric-wang-1990 eric-wang-1990 changed the title tracing(csharp): emit retry telemetry from RetryHttpHandler fix(csharp): emit retry telemetry from RetryHttpHandler May 29, 2026
Collapses the two Retry-After tests (503 and 429) into a single Theory
parameterized by status code, since they exercise the same code path.
Trims the verbose Assert.True(tag.Key == "...", "long message") pattern
in favor of direct dictionary lookups and Assert.Equal.

Net: ~150 lines -> ~50 lines; 3 telemetry tests still pass.

Co-authored-by: Isaac
Comment thread csharp/src/RetryHttpHandler.cs
Addresses review feedback (#497): the per-retry SetTag calls duplicated the new
retry.attempt event. Make each event self-contained and drop the overwritten per-retry
span tags so the attempt is logged once.

- Transport-error path: remove http.retry.attempt / http.retry.transport_error /
  http.retry.transport_error_message tags; the event now carries error_message
  (the error type is already in reason).
- HTTP-status path: remove http.retry.attempt / http.response.status_code tags; the
  event now carries status_code.
- Terminal break-path tags (http.retry.outcome / total_attempts) are unchanged — they
  fire once on exit, not per retry.

Add a transport-error event test (asserts reason + error_message) and a status_code
assertion to the Retry-After event test. All 29 RetryHttpHandler tests pass.

Co-authored-by: Isaac

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds per-retry tracing telemetry to the C# RetryHttpHandler so users can see whether an HTTP request retried (and why) by inspecting the single SendAsync activity, addressing issue #479.

Changes:

  • Emit retry.attempt Activity events (with attempt_number, delay_ms, reason) for Retry-After-based retries (503/429) and transient transport-error retries.
  • Add unit tests that capture the SendAsync activity and assert retry events are (a) absent on first-try success and (b) present/ordered for Retry-After retries on 503/429.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
csharp/src/RetryHttpHandler.cs Adds retry.attempt event emission on both Retry-After and transport-error retry paths.
csharp/test/Unit/RetryHttpHandlerTest.cs Adds ActivityListener-based tests asserting retry.attempt events for Retry-After retries and no events on no-retry success.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +165 to 174
// span tags (attempt number, error type via reason, error message).
activity?.AddEvent("retry.attempt", new List<KeyValuePair<string, object?>>
{
new("attempt_number", attemptCount),
new("delay_ms", (long)transportWaitSeconds * 1000L),
new("reason", $"transport_error_{ex.GetType().Name}"),
new("error_message", ex.Message)
});

await Task.Delay(TimeSpan.FromSeconds(transportWaitSeconds), cancellationToken);

@msrathore-db msrathore-db left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verified: 29/29 RetryHttpHandlerTest pass, including the new retry.attempt event tests. The event vocabulary is clean. LGTM.

One note worth surfacing: this also removes the pre-existing per-attempt span tags (http.retry.attempt, http.response.status_code, http.retry.transport_error, http.retry.transport_error_message) in favor of the events. The description frames it as a pure addition, but it's effectively a tag→event migration — if any dashboard/alert keys off those removed tags it would need updating. Fine by me given they were overwritten-per-retry anyway, just flagging it.

@eric-wang-1990 eric-wang-1990 added this pull request to the merge queue Jun 1, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jun 1, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

tracing(csharp): RetryHttpHandler emits no telemetry — retries are invisible

3 participants