fix(csharp): emit retry telemetry from RetryHttpHandler#497
fix(csharp): emit retry telemetry from RetryHttpHandler#497eric-wang-1990 wants to merge 5 commits into
Conversation
…479) Adds four failing unit tests in RetryHttpHandlerTest.cs that capture the Activity emitted by RetryHttpHandler.SendAsync via an ActivityListener and assert on its events and tags: - RetryTelemetry_NoRetryNeeded_AlwaysSetsSummaryTags_Issue479 On the no-retry-needed path, the SendAsync activity must still carry retry.total_attempts=1 and retry.exhausted=false, so dashboards can group/filter without detecting "tag absent." - RetryTelemetry_RetryAfter503_EmitsAttemptEventsAndSummaryTags_Issue479 Each 503-driven retry emits a retry.attempt event with attempt_number, delay_ms, reason tags; final tags reflect total_attempts=2 and exhausted=false on success. - RetryTelemetry_RetryBudgetExhausted_SetsExhaustedTrue_Issue479 When the retry budget is exceeded, retry.exhausted=true. - RetryTelemetry_RetryAfter429_EmitsAttemptEventWithRateLimitReason_Issue479 429 retries emit retry.attempt events whose reason contains "429" so throttling is distinguishable from generic 503. All four tests currently fail because RetryHttpHandler emits no retry telemetry. The implementation fix lands in a follow-up commit. Refs #479 Co-authored-by: Isaac
Adds per-attempt retry.attempt events (with attempt_number, delay_ms, reason) before each Task.Delay in RetryHttpHandler.SendAsync, plus final retry.total_attempts + retry.exhausted summary tags on the activity that wraps each HTTP send. Both summary tags fire unconditionally — including the no-retry-needed path — so dashboards can group/filter without detecting "tag absent." Previously the handler emitted zero per-attempt telemetry. Customer debugging of "flaky vs hard failure" required server-side log access because the driver's retry loop was invisible to traces: a 503-with- Retry-After path that succeeded on the 3rd send and a 503-no-Retry- After path that failed immediately produced identical client traces. Reason vocabulary on retry.attempt events: retry_after_<status> — Retry-After header parsed retry_after_<status>_invalid_header — header present but unparseable retry_after_<status>_no_header — retryable status, no header transport_error_<ExceptionTypeName> — transient transport failure The new total_attempts counter reflects the number of HTTP sends actually issued (1 if first try succeeds, N if N-1 retries happened); retry.exhausted is true when we gave up (budget exceeded or cancelled). The four new unit tests added in the previous commit (RED) all pass. Closes #479 Co-authored-by: Isaac
| // Issue #479: ALWAYS emit summary tags — even on the | ||
| // no-retry-needed path, even when we threw. Without | ||
| // always-set tags, dashboards have to detect "tag | ||
| // absent" to distinguish "successful first try" from |
There was a problem hiding this comment.
Note this will be mostly used for local user logs, so dashboard view is less important, re-evaluate if we need this PR
There was a problem hiding this comment.
Agreed — reframed the PR for local-log readers in dc6cd7a. Kept the per-retry retry.attempt events (the actually-useful diagnostic surface for someone reading their own logs) and dropped the always-fires summary tags retry.total_attempts / retry.exhausted along with the try/finally block that hosted them, since the dashboard-aggregation justification doesn't apply. The two summary-tag-only tests were deleted; the two event-asserting tests stay (with summary-tag assertions trimmed). Net diff is 173 insertions / 291 deletions, so the PR is significantly smaller and more focused now.
This comment was generated with GitHub MCP.
…ount Addresses PR review feedback on #479. Two changes: - Drop the redundant httpSendCount / retryEventNumber counters and use the pre-existing attemptCount for the retry.attempt event's attempt_number. attemptCount is already incremented before each retry event fires, so the numbering is identical. - Drop the always-fires summary tags retry.total_attempts and retry.exhausted (and the try/finally that hosted them). The intended consumer is a user reading their own driver logs, where the per-retry retry.attempt events on the SendAsync activity already carry the diagnostic surface. Dashboard-shaped aggregation tags are not the target use case. Tests: delete the two summary-tag-only tests, retain the two event-asserting tests with summary-tag assertions removed. Co-authored-by: Isaac
Collapses the two Retry-After tests (503 and 429) into a single Theory parameterized by status code, since they exercise the same code path. Trims the verbose Assert.True(tag.Key == "...", "long message") pattern in favor of direct dictionary lookups and Assert.Equal. Net: ~150 lines -> ~50 lines; 3 telemetry tests still pass. Co-authored-by: Isaac
Addresses review feedback (#497): the per-retry SetTag calls duplicated the new retry.attempt event. Make each event self-contained and drop the overwritten per-retry span tags so the attempt is logged once. - Transport-error path: remove http.retry.attempt / http.retry.transport_error / http.retry.transport_error_message tags; the event now carries error_message (the error type is already in reason). - HTTP-status path: remove http.retry.attempt / http.response.status_code tags; the event now carries status_code. - Terminal break-path tags (http.retry.outcome / total_attempts) are unchanged — they fire once on exit, not per retry. Add a transport-error event test (asserts reason + error_message) and a status_code assertion to the Retry-After event test. All 29 RetryHttpHandler tests pass. Co-authored-by: Isaac
There was a problem hiding this comment.
Pull request overview
Adds per-retry tracing telemetry to the C# RetryHttpHandler so users can see whether an HTTP request retried (and why) by inspecting the single SendAsync activity, addressing issue #479.
Changes:
- Emit
retry.attemptActivityevents (withattempt_number,delay_ms,reason) for Retry-After-based retries (503/429) and transient transport-error retries. - Add unit tests that capture the
SendAsyncactivity and assert retry events are (a) absent on first-try success and (b) present/ordered for Retry-After retries on 503/429.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| csharp/src/RetryHttpHandler.cs | Adds retry.attempt event emission on both Retry-After and transport-error retry paths. |
| csharp/test/Unit/RetryHttpHandlerTest.cs | Adds ActivityListener-based tests asserting retry.attempt events for Retry-After retries and no events on no-retry success. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // span tags (attempt number, error type via reason, error message). | ||
| activity?.AddEvent("retry.attempt", new List<KeyValuePair<string, object?>> | ||
| { | ||
| new("attempt_number", attemptCount), | ||
| new("delay_ms", (long)transportWaitSeconds * 1000L), | ||
| new("reason", $"transport_error_{ex.GetType().Name}"), | ||
| new("error_message", ex.Message) | ||
| }); | ||
|
|
||
| await Task.Delay(TimeSpan.FromSeconds(transportWaitSeconds), cancellationToken); |
msrathore-db
left a comment
There was a problem hiding this comment.
Verified: 29/29 RetryHttpHandlerTest pass, including the new retry.attempt event tests. The event vocabulary is clean. LGTM.
One note worth surfacing: this also removes the pre-existing per-attempt span tags (http.retry.attempt, http.response.status_code, http.retry.transport_error, http.retry.transport_error_message) in favor of the events. The description frames it as a pure addition, but it's effectively a tag→event migration — if any dashboard/alert keys off those removed tags it would need updating. Fine by me given they were overwritten-per-retry anyway, just flagging it.
Summary
Issue #479:
RetryHttpHandlerruns on every HTTP send and retries on Retry-After (503/429) responses and transient transport errors, but historically emitted zero per-attempt telemetry. A user reading their own driver logs asking "did my failure retry then succeed, or did it fail without any retry?" couldn't answer from the trace alone.This PR makes each retry visible as an event on the
SendAsyncactivity.Event vocabulary
On the activity that wraps each
RetryHttpHandler.SendAsynccall — oneretry.attemptevent per retry (NOT emitted on first-try success):attempt_number(int) — 1-indexed retry counter, sourced directly from the existingattemptCountdelay_ms(long) — milliseconds the handler will sleep before re-sendingreason(string) — one of:retry_after_<status>(e.g.retry_after_503,retry_after_429)retry_after_<status>_invalid_header— header present but unparseableretry_after_<status>_no_header— retryable status, no Retry-Aftertransport_error_<ExceptionTypeName>— transient transport failureWhat's intentionally NOT included
No
retry.total_attempts/retry.exhaustedsummary tags. The target consumer is a user reading their own driver logs, so dashboard-shaped aggregation tags aren't worth the code complexity — the per-retry events already carry the diagnostic surface ("how many times did this retry, and why"). The pre-existinghttp.retry.total_attemptstag on the exhaustion-break paths (already onmainbefore this PR) is untouched.Files changed
csharp/src/RetryHttpHandler.cs— emitsretry.attemptevents inside the existingTraceActivityAsyncblock, both on the transport-error retry path and the Retry-After 503/429 retry path. Uses the existingattemptCountfor the event'sattempt_number(no new counters).csharp/test/Unit/RetryHttpHandlerTest.cs— two new tests assert on the events emitted from theSendAsyncactivity:RetryTelemetry_NoRetry_EmitsNoEvents_Issue479— happy path: no events on first-try success.RetryTelemetry_RetryAfter_EmitsAttemptEvents_Issue479—[Theory]over 503 and 429: every retry emits an event with monotonically increasingattempt_number,delay_ms >= 1000(Retry-After: 1), and areasoncontaining the status code so 429 throttles are distinguishable from 503s.Test plan
dotnet test --filter "FullyQualifiedName~RetryTelemetry"— 3/3 pass (1[Fact]+ 2[Theory]cases)dotnet test --filter "FullyQualifiedName~RetryHttpHandlerTest"— full retry suite passesCloses #479
This pull request and its description were written by Isaac.