fix(sentry): real titles, real chains, real provider tags, drop quota noise [0.7 backport]#138
Conversation
… noise The user pasted a Sentry issue from 0.7.13 demo titled "upstream server error", with 460 events in one bucket, 1896 hidden issues in the trace preview, tag provider="pdl_person_enrichment" (wrong — that's the whole tool name, the actual provider is "pdl"), operation_id="unknown", and no exception/stack-trace. Useless for triage. Four bugs, all fixed here: 1. The `tracing::error!` message was a hardcoded literal "upstream server error". sentry-tracing maps the literal to the event title and Sentry groups by title. Every failure of every shape landed in one bucket. Fix: classify by status into 5 `UpstreamErrorClass` variants (auth_error / bad_input / quota / rate_limited / server_error / transport_error) and dispatch into one `tracing::*!` per arm with a distinct literal message. Five buckets per status class instead of one. (Quota + RateLimited are Warning level and get dropped below — see #4.) 2. `split_tool_name` assumed `provider:operation_id` and labelled anything else as `("<whole-tool-name>", "unknown")`. PDL, datalastic, and several other manifests ship flat tool names like `pdl_person_enrichment`. The tag lied and operation_id was useless. Fix: new `provider_and_op(provider_name, tool_name)` that takes the registry-resolved provider name (which the call site already has — `provider.name`). It strips `{provider}:` or `{provider}_` prefixes when present; otherwise the operation IS the tool name verbatim. Never fabricates "unknown". All three call sites in handle_call updated. 3. Errors were flattened to `String` at the boundary (`Some(&e.to_string())`). Sentry's `capture_error(&dyn Error)` path that walks `source()` and attaches each link as a separate exception block was never called. The issue page had no chain. Fix: new `capture_error_with_scope` helper that does `sentry::capture_error(err)` inside the same scope that carries the tags + class-based fingerprint. All three call sites (HTTP / MCP / CLI) call it alongside the report helper. The Quota + RateLimited classes short-circuit inside the helper (don't bother building backtrace for events that before_send will drop). 4. No `before_send` hook → quota / rate-limit events spammed the issue list and burned project quota. Fix: `before_send` reads `tags.upstream_error_class` and drops events tagged `quota` or `rate_limited`. The `tracing::warn!` calls still run so breadcrumbs record context for the next real error. Plus the prerequisites that made #3 work: - Cargo sentry features expanded to include `panic`, `backtrace`, `contexts`, `debug-images`. The old four-feature subset compiled the SDK without the panic hook (so panics silently vanished) and without backtrace + device-context blocks (so even captured exceptions had no frames). - `attach_stacktrace: true` (was already set, now actually does something with the backtrace feature on). - `max_breadcrumbs: 100` (up from default 30) so the breadcrumbs around the failure aren't truncated on busy requests. - `enable_logs: true` so Sentry's Logs product receives the structured tracing fields, not just the rendered event title. - `set_jwt_sentry_scope(claims)` at handle_call entry sets `tags.sub`, `tags.sandbox_id`, `tags.job_id`, `user.id` so every event raised during the request is searchable by identity. Tests: - `provider_and_op_strips_underscore_prefix` — the exact PDL bug from the screenshot: `("pdl", "pdl_person_enrichment") → ("pdl", "person_enrichment")`. - `provider_and_op_no_prefix_keeps_tool_name` — never returns "unknown". - `provider_and_op_explicit_colon` + `_strips_colon_prefix` + `_empty_op_after_strip_keeps_tool_name` — covers all common shapes. - `UpstreamErrorClass::classify` covered for 400 / 401 / 402 / 429 / 5xx / 0. - `before_send_drops_quota_class_events`, `before_send_drops_rate_limited_class_events` — the noise filter. - `before_send_passes_<bad_input|server_error|auth_error| transport_error>_through` — pageable classes survive. - `before_send_passes_events_without_class_tag` — panics + other tracing::error! calls outside the upstream-error helper aren't caught in the filter's blast radius. All 1227 default-feature tests green, all 1241 sentry-feature tests green, clippy clean across default / sentry / all-features, fmt clean.
Greptile SummaryThis is a 0.7 backport of #137, improving Sentry observability by replacing a single "upstream server error" mega-bucket with 6 classified event buckets, fixing provider/operation tag accuracy, attaching real Rust error chains to Sentry issues, and dropping 402/429 billing noise via
Confidence Score: 4/5Safe to merge; all changes are behind the optional sentry cargo feature and the core proxy logic is unaffected The backport is well-structured, test-covered, and fixes a real user-reported issue. The main concern is that sentry::configure_scope in set_jwt_sentry_scope permanently modifies the thread-local Sentry scope and is not reset between requests sharing the same Tokio worker thread — contrary to the inline comment — which can cause JWT identity tags from one request to bleed onto Sentry events from a subsequent request on the same thread. This is an observability accuracy issue rather than a user-visible defect. A secondary note: unclassified 4xx codes (408, 409, 410, etc.) silently fall into TransportError which is semantically defined as transport-level failures with no HTTP exchange. src/core/sentry_scope.rs — the configure_scope lifetime and the 4xx classification gap Important Files Changed
Sequence DiagramsequenceDiagram
participant Client
participant handle_call
participant sentry_scope
participant Sentry
Client->>handle_call: POST /call (JWT bearer)
handle_call->>sentry_scope: set_jwt_sentry_scope(claims)
Note over sentry_scope: configure_scope: sub, sandbox_id, job_id, user.id
handle_call->>handle_call: execute tool call
alt Tool / HTTP error
handle_call->>sentry_scope: provider_and_op(provider.name, tool_name)
sentry_scope-->>handle_call: (provider, operation_id)
handle_call->>sentry_scope: report_upstream_error(provider, op, status, ...)
sentry_scope->>sentry_scope: UpstreamErrorClass::classify(status)
sentry_scope->>sentry_scope: with_upstream_scope: set tags + fingerprint
sentry_scope->>Sentry: tracing::error!(upstream class) → titled issue event
handle_call->>sentry_scope: capture_error_with_scope(err, provider, op, status, ...)
sentry_scope->>sentry_scope: skip if is_quota_class()
sentry_scope->>sentry_scope: with_upstream_scope: same tags + fingerprint
sentry_scope->>Sentry: sentry::capture_error(err) → exception event with source chain
end
Note over Sentry: before_send drops quota/rate_limited events
Note over Sentry: Both events share fingerprint → same issue
Reviews (1): Last reviewed commit: "fix(sentry): real titles, real chains, r..." | Re-trigger Greptile |
| #[allow(unused_variables)] | ||
| pub fn set_jwt_sentry_scope(claims: &TokenClaims) { | ||
| #[cfg(feature = "sentry")] | ||
| sentry::configure_scope(|scope| { | ||
| scope.set_tag("sub", claims.sub.as_str()); | ||
| if let Some(ref id) = claims.sandbox_id { | ||
| scope.set_tag("sandbox_id", id.as_str()); | ||
| } | ||
| if let Some(ref id) = claims.job_id { | ||
| scope.set_tag("job_id", id.as_str()); | ||
| } | ||
| // Sentry's `user.id` is the standard slot for the auth subject — | ||
| // populating it lets the issue page show "users affected" counts | ||
| // and gives the search UI a first-class facet. | ||
| scope.set_user(Some(sentry::User { | ||
| id: Some(claims.sub.clone()), | ||
| ..Default::default() | ||
| })); | ||
| }); | ||
| } |
There was a problem hiding this comment.
configure_scope scope not cleared between requests
sentry::configure_scope mutates the thread-local hub scope permanently — it is not reset by the sentry::with_scope blocks that the report helpers push. Those blocks push a nested child scope and pop it, but the tags set by configure_scope on the parent scope remain on that thread indefinitely. In a Tokio worker-thread pool, the next request executed on the same thread will inherit sub, sandbox_id, and job_id from the previous request, causing Sentry events to be attributed to the wrong identity. The docstring's "cleared automatically when the tokio worker reaches the next request" claim is incorrect.
The correct pattern for per-request scoping in async Rust is to create an isolated Hub: Hub::new_from_top(Hub::current()) and run the handler under it. Alternatively, explicitly clear the user identity at request end, though that's harder to guarantee in all code paths.
| impl UpstreamErrorClass { | ||
| pub fn classify(upstream_status: u16) -> Self { | ||
| match upstream_status { | ||
| 401 | 403 | 407 => Self::AuthError, | ||
| 400 | 404 | 422 => Self::BadInput, | ||
| 402 => Self::Quota, | ||
| 429 => Self::RateLimited, | ||
| 500..=599 => Self::ServerError, | ||
| _ => Self::TransportError, | ||
| } | ||
| } |
There was a problem hiding this comment.
Unclassified 4xx codes fall into
TransportError
HTTP status codes that aren't explicitly matched (e.g., 405 Method Not Allowed, 408 Request Timeout, 409 Conflict, 410 Gone, 423, 425, and any other 4xx not listed) fall through to TransportError, whose doc comment defines it as "no HTTP status (DNS, TCP, TLS, MCP transport error)". Those codes do have an HTTP status, so they will be silently mislabeled in Sentry. In practice a provider returning 408 or 409 would appear under the transport_error bucket rather than any client-error bucket, making the on-call investigation misleading.
…(P2) Greptile review on PR #137 raised two real defects: P1 (correctness + multi-tenant data leak): `set_jwt_sentry_scope` used `sentry::configure_scope` which mutates the thread-local scope permanently. For optional fields (sandbox_id, job_id) the code only called `set_tag` when the field was `Some`, but never removed the tag when it was `None`. The doc comment claimed "cleared automatically when the tokio worker reaches the next request because Sentry's scope is request-keyed by the with_scope blocks the report helpers push" — that's wrong. `with_scope` pushes a NEW layer; it does NOT clear what `configure_scope` set underneath. On a busy proxy with worker-thread reuse, request A's sandbox_id would stamp itself on request B's events whenever B's JWT lacked that field. Fix: unconditionally `scope.remove_tag("sandbox_id")` and `remove_tag("job_id")` at the top of `set_jwt_sentry_scope`, then set them only when the new claims actually have them. `sub` and `user.id` are always set (sub is required by auth middleware; set_user(Some(..)) unconditionally overwrites prior values). P2 (UI title drift): The error path emitted TWO Sentry events with the same fingerprint: the tracing-bridge event titled with the class literal (e.g. "upstream bad_input"), and a `sentry::capture_error` event whose title defaulted to the underlying error's Display. Sentry's "latest event wins for title" rule meant the operator-visible bucket name would regularly get overwritten with `HttpError: ApiError { ... }` strings, defeating the PR's headline improvement. Fix: replace the bare `sentry::capture_error(err)` with `sentry::event_from_error(err)` followed by an explicit `event.message = Some(class_literal.into())` before `capture_event(event)`. The exception block is unchanged (chain still ships), the fingerprint still groups, and the title stays pinned. Tests: Added `test` feature to the sentry crate so the new tests can use `sentry::test::with_captured_events` to inspect actual emitted events. - `set_jwt_sentry_scope_clears_stale_sandbox_id_between_requests`: Two requests on the same thread, B with no sandbox_id/job_id. Asserts request B's emitted event has NEITHER tag — proves the leak is closed. - `set_jwt_sentry_scope_replaces_user_between_requests`: Locks in that set_user(Some(..)) overwrites without needing an explicit None clear. - `captured_event_message_pins_class_literal_not_error_display`: Asserts the captured event's `message` is the class literal and the exception chain is preserved. - `captured_event_message_pins_every_class`: Locks in the class → literal mapping so future renames of the tracing-event literals don't silently desync with the capture-error path. 1249/1249 sentry-feature tests green, default + clippy default + clippy sentry + clippy all-features + fmt all clean.
Summary
0.7.x backport of #137 (the fix for the user-reported `upstream server error` Sentry mega-bucket).
The user pasted a Sentry issue tagged `release=0.7.13`, environment `demo` — same generic title, same misleading `provider=pdl_person_enrichment / operation_id=unknown` tags, no exception chain. The 0.8 fix shipped in #137; this brings the same fix to the 0.7 line so the demo environment actually gets it.
What this carries from #137
Identical logic; the only differences are conflict-resolution artifacts from a now-divergent codebase:
Conflict resolution notes
Cherry-pick of `8521978` (the 0.7-shaped commit produced from `67af709` on `main`). Three text conflicts handled:
Tests
(0.7's suite is smaller than 0.8's because 0.7 doesn't have the passthrough / persistence / control-plane test files added since the fork.)
After this lands
Sister PR on main
#137 — same fix on the 0.8 line, identical test results.
🤖 Generated with Claude Code