fix(sentry): real titles, real chains, real provider tags, drop quota noise [0.7 backport] by claudio-michel[bot] · Pull Request #138 · Parcha-ai/ati

claudio-michel · 2026-06-04T04:56:28Z

Summary

0.7.x backport of #137 (the fix for the user-reported `upstream server error` Sentry mega-bucket).

The user pasted a Sentry issue tagged `release=0.7.13`, environment `demo` — same generic title, same misleading `provider=pdl_person_enrichment / operation_id=unknown` tags, no exception chain. The 0.8 fix shipped in #137; this brings the same fix to the 0.7 line so the demo environment actually gets it.

What this carries from #137

Identical logic; the only differences are conflict-resolution artifacts from a now-divergent codebase:

6 classified Sentry buckets (`upstream auth_error` / `bad_input` / `quota` / `rate_limited` / `server_error` / `transport_error`) via per-arm literal `tracing::*!` messages — replacing the single `"upstream server error"` mega-bucket.
`provider_and_op(provider_name, tool_name)` that strips `{provider}_` / `{provider}:` prefixes correctly and never returns `"unknown"`. The exact PDL screenshot bug pinned by a test.
`capture_error_with_scope` that calls `sentry::capture_error(&dyn Error)` so the issue page shows the real `HttpError::ApiError` source chain + reqwest frames.
`before_send` drops 402 / 429 noise (they still ride along as breadcrumbs in the next real event).
Cargo sentry features expanded to include `panic`, `backtrace`, `contexts`, `debug-images` — gives us actual stack traces for the first time.
`set_jwt_sentry_scope(claims)` at `handle_call` entry → every event carries `sub` / `sandbox_id` / `job_id` tags + `user.id`.
`max_breadcrumbs(100)`, `enable_logs(true)`.

Conflict resolution notes

Cherry-pick of `8521978` (the 0.7-shaped commit produced from `67af709` on `main`). Three text conflicts handled:

`src/core/logging.rs`: 0.7 has a single-feature-gated `shutdown(_guard: Option)` instead of 0.8's `InitGuards` struct. Kept 0.7's `shutdown` and added the new `before_send` function adjacent to it.
`src/proxy/server.rs`: 0.7 doesn't have the Ati-Key path (added in feat(jwt+manifest): per-provider session token + multi-audience JWT validation (closes #121) #122). Kept 0.7's "Bearer-token path only" comment wording verbatim and inserted the new `set_jwt_sentry_scope` call.
`Cargo.lock`: regenerated against the new `sentry` features by running `cargo check` on the resolved tree.

Tests

cargo test                                                    # 0.7-default green
cargo test --features sentry                                  # 984/984 sentry-feature green
cargo clippy --features sentry --lib --tests -- -D warnings   # clean
cargo fmt --all --check                                       # clean

(0.7's suite is smaller than 0.8's because 0.7 doesn't have the passthrough / persistence / control-plane test files added since the fork.)

After this lands

Bump PR for `v0.7.14` against `release/0.7`
Tag `v0.7.14`, release workflow ships artifacts
Demo env can pull `0.7.14` and the screenshot's `pdl_person_enrichment` events start showing up as `upstream bad_input` with the right `provider=pdl` / `operation_id=person_enrichment` tags + reqwest chain

Sister PR on main

#137 — same fix on the 0.8 line, identical test results.

🤖 Generated with Claude Code

… noise The user pasted a Sentry issue from 0.7.13 demo titled "upstream server error", with 460 events in one bucket, 1896 hidden issues in the trace preview, tag provider="pdl_person_enrichment" (wrong — that's the whole tool name, the actual provider is "pdl"), operation_id="unknown", and no exception/stack-trace. Useless for triage. Four bugs, all fixed here: 1. The `tracing::error!` message was a hardcoded literal "upstream server error". sentry-tracing maps the literal to the event title and Sentry groups by title. Every failure of every shape landed in one bucket. Fix: classify by status into 5 `UpstreamErrorClass` variants (auth_error / bad_input / quota / rate_limited / server_error / transport_error) and dispatch into one `tracing::*!` per arm with a distinct literal message. Five buckets per status class instead of one. (Quota + RateLimited are Warning level and get dropped below — see #4.) 2. `split_tool_name` assumed `provider:operation_id` and labelled anything else as `("<whole-tool-name>", "unknown")`. PDL, datalastic, and several other manifests ship flat tool names like `pdl_person_enrichment`. The tag lied and operation_id was useless. Fix: new `provider_and_op(provider_name, tool_name)` that takes the registry-resolved provider name (which the call site already has — `provider.name`). It strips `{provider}:` or `{provider}_` prefixes when present; otherwise the operation IS the tool name verbatim. Never fabricates "unknown". All three call sites in handle_call updated. 3. Errors were flattened to `String` at the boundary (`Some(&e.to_string())`). Sentry's `capture_error(&dyn Error)` path that walks `source()` and attaches each link as a separate exception block was never called. The issue page had no chain. Fix: new `capture_error_with_scope` helper that does `sentry::capture_error(err)` inside the same scope that carries the tags + class-based fingerprint. All three call sites (HTTP / MCP / CLI) call it alongside the report helper. The Quota + RateLimited classes short-circuit inside the helper (don't bother building backtrace for events that before_send will drop). 4. No `before_send` hook → quota / rate-limit events spammed the issue list and burned project quota. Fix: `before_send` reads `tags.upstream_error_class` and drops events tagged `quota` or `rate_limited`. The `tracing::warn!` calls still run so breadcrumbs record context for the next real error. Plus the prerequisites that made #3 work: - Cargo sentry features expanded to include `panic`, `backtrace`, `contexts`, `debug-images`. The old four-feature subset compiled the SDK without the panic hook (so panics silently vanished) and without backtrace + device-context blocks (so even captured exceptions had no frames). - `attach_stacktrace: true` (was already set, now actually does something with the backtrace feature on). - `max_breadcrumbs: 100` (up from default 30) so the breadcrumbs around the failure aren't truncated on busy requests. - `enable_logs: true` so Sentry's Logs product receives the structured tracing fields, not just the rendered event title. - `set_jwt_sentry_scope(claims)` at handle_call entry sets `tags.sub`, `tags.sandbox_id`, `tags.job_id`, `user.id` so every event raised during the request is searchable by identity. Tests: - `provider_and_op_strips_underscore_prefix` — the exact PDL bug from the screenshot: `("pdl", "pdl_person_enrichment") → ("pdl", "person_enrichment")`. - `provider_and_op_no_prefix_keeps_tool_name` — never returns "unknown". - `provider_and_op_explicit_colon` + `_strips_colon_prefix` + `_empty_op_after_strip_keeps_tool_name` — covers all common shapes. - `UpstreamErrorClass::classify` covered for 400 / 401 / 402 / 429 / 5xx / 0. - `before_send_drops_quota_class_events`, `before_send_drops_rate_limited_class_events` — the noise filter. - `before_send_passes_<bad_input|server_error|auth_error| transport_error>_through` — pageable classes survive. - `before_send_passes_events_without_class_tag` — panics + other tracing::error! calls outside the upstream-error helper aren't caught in the filter's blast radius. All 1227 default-feature tests green, all 1241 sentry-feature tests green, clippy clean across default / sentry / all-features, fmt clean.

greptile-apps · 2026-06-04T04:59:58Z

Greptile Summary

This is a 0.7 backport of #137, improving Sentry observability by replacing a single "upstream server error" mega-bucket with 6 classified event buckets, fixing provider/operation tag accuracy, attaching real Rust error chains to Sentry issues, and dropping 402/429 billing noise via before_send.

src/core/sentry_scope.rs: Introduces UpstreamErrorClass enum, provider_and_op() (fixes the PDL flat-name tag bug), capture_error_with_scope() (attaches full source() chain as a Sentry exception), and set_jwt_sentry_scope() (stamps every request's events with JWT identity tags); report_upstream_error now emits 6 distinct literal tracing::*! messages so Sentry groups them into separate issues.
src/core/logging.rs: Adds a before_send hook that silently drops quota/rate-limited events, bumps max_breadcrumbs to 100, enables Sentry Logs, and wires in the panic, backtrace, contexts, debug-images features for real stack traces.
src/proxy/server.rs: Replaces three split_tool_name call sites with provider_and_op, adds set_jwt_sentry_scope at handle_call entry, and pairs each error path with capture_error_with_scope for the exception chain.

Confidence Score: 4/5

Safe to merge; all changes are behind the optional sentry cargo feature and the core proxy logic is unaffected

The backport is well-structured, test-covered, and fixes a real user-reported issue. The main concern is that sentry::configure_scope in set_jwt_sentry_scope permanently modifies the thread-local Sentry scope and is not reset between requests sharing the same Tokio worker thread — contrary to the inline comment — which can cause JWT identity tags from one request to bleed onto Sentry events from a subsequent request on the same thread. This is an observability accuracy issue rather than a user-visible defect. A secondary note: unclassified 4xx codes (408, 409, 410, etc.) silently fall into TransportError which is semantically defined as transport-level failures with no HTTP exchange.

src/core/sentry_scope.rs — the configure_scope lifetime and the 4xx classification gap

Important Files Changed

Filename	Overview
src/core/sentry_scope.rs	Major rewrite introducing UpstreamErrorClass enum, provider_and_op(), capture_error_with_scope(), set_jwt_sentry_scope(), and per-class literal tracing messages; well-tested, but configure_scope is not cleared between async requests and some unclassified 4xx codes fall into TransportError
src/core/logging.rs	Adds before_send hook (drops quota/rate_limited events), max_breadcrumbs:100, enable_logs:true, and expands sentry features; well-tested and clean
src/proxy/server.rs	Three split_tool_name calls replaced with provider_and_op, set_jwt_sentry_scope added at request entry, capture_error_with_scope added alongside existing report_upstream_error calls; changes are localized and correct
Cargo.toml	Adds panic, backtrace, contexts, debug-images to sentry feature set; no default-features changed, optional feature gate preserved
Cargo.lock	Lock file regenerated with new sentry sub-crate dependencies and their transitive deps; consistent with Cargo.toml changes

Sequence Diagram

sequenceDiagram
    participant Client
    participant handle_call
    participant sentry_scope
    participant Sentry

    Client->>handle_call: POST /call (JWT bearer)
    handle_call->>sentry_scope: set_jwt_sentry_scope(claims)
    Note over sentry_scope: configure_scope: sub, sandbox_id, job_id, user.id

    handle_call->>handle_call: execute tool call

    alt Tool / HTTP error
        handle_call->>sentry_scope: provider_and_op(provider.name, tool_name)
        sentry_scope-->>handle_call: (provider, operation_id)

        handle_call->>sentry_scope: report_upstream_error(provider, op, status, ...)
        sentry_scope->>sentry_scope: UpstreamErrorClass::classify(status)
        sentry_scope->>sentry_scope: with_upstream_scope: set tags + fingerprint
        sentry_scope->>Sentry: tracing::error!(upstream class) → titled issue event

        handle_call->>sentry_scope: capture_error_with_scope(err, provider, op, status, ...)
        sentry_scope->>sentry_scope: skip if is_quota_class()
        sentry_scope->>sentry_scope: with_upstream_scope: same tags + fingerprint
        sentry_scope->>Sentry: sentry::capture_error(err) → exception event with source chain
    end

    Note over Sentry: before_send drops quota/rate_limited events
    Note over Sentry: Both events share fingerprint → same issue

_{Reviews (1): Last reviewed commit: "fix(sentry): real titles, real chains, r..." | Re-trigger Greptile}

greptile-apps · 2026-06-04T05:00:01Z

+#[allow(unused_variables)]
+pub fn set_jwt_sentry_scope(claims: &TokenClaims) {
+    #[cfg(feature = "sentry")]
+    sentry::configure_scope(|scope| {
+        scope.set_tag("sub", claims.sub.as_str());
+        if let Some(ref id) = claims.sandbox_id {
+            scope.set_tag("sandbox_id", id.as_str());
+        }
+        if let Some(ref id) = claims.job_id {
+            scope.set_tag("job_id", id.as_str());
+        }
+        // Sentry's `user.id` is the standard slot for the auth subject —
+        // populating it lets the issue page show "users affected" counts
+        // and gives the search UI a first-class facet.
+        scope.set_user(Some(sentry::User {
+            id: Some(claims.sub.clone()),
+            ..Default::default()
+        }));
+    });
+}


configure_scope scope not cleared between requests

sentry::configure_scope mutates the thread-local hub scope permanently — it is not reset by the sentry::with_scope blocks that the report helpers push. Those blocks push a nested child scope and pop it, but the tags set by configure_scope on the parent scope remain on that thread indefinitely. In a Tokio worker-thread pool, the next request executed on the same thread will inherit sub, sandbox_id, and job_id from the previous request, causing Sentry events to be attributed to the wrong identity. The docstring's "cleared automatically when the tokio worker reaches the next request" claim is incorrect.

The correct pattern for per-request scoping in async Rust is to create an isolated Hub: Hub::new_from_top(Hub::current()) and run the handler under it. Alternatively, explicitly clear the user identity at request end, though that's harder to guarantee in all code paths.

greptile-apps · 2026-06-04T05:00:02Z

+impl UpstreamErrorClass {
+    pub fn classify(upstream_status: u16) -> Self {
+        match upstream_status {
+            401 | 403 | 407 => Self::AuthError,
+            400 | 404 | 422 => Self::BadInput,
+            402 => Self::Quota,
+            429 => Self::RateLimited,
+            500..=599 => Self::ServerError,
+            _ => Self::TransportError,
+        }
+    }


Unclassified 4xx codes fall into TransportError

HTTP status codes that aren't explicitly matched (e.g., 405 Method Not Allowed, 408 Request Timeout, 409 Conflict, 410 Gone, 423, 425, and any other 4xx not listed) fall through to TransportError, whose doc comment defines it as "no HTTP status (DNS, TCP, TLS, MCP transport error)". Those codes do have an HTTP status, so they will be silently mislabeled in Sentry. In practice a provider returning 408 or 409 would appear under the transport_error bucket rather than any client-error bucket, making the on-call investigation misleading.

…(P2) Greptile review on PR #137 raised two real defects: P1 (correctness + multi-tenant data leak): `set_jwt_sentry_scope` used `sentry::configure_scope` which mutates the thread-local scope permanently. For optional fields (sandbox_id, job_id) the code only called `set_tag` when the field was `Some`, but never removed the tag when it was `None`. The doc comment claimed "cleared automatically when the tokio worker reaches the next request because Sentry's scope is request-keyed by the with_scope blocks the report helpers push" — that's wrong. `with_scope` pushes a NEW layer; it does NOT clear what `configure_scope` set underneath. On a busy proxy with worker-thread reuse, request A's sandbox_id would stamp itself on request B's events whenever B's JWT lacked that field. Fix: unconditionally `scope.remove_tag("sandbox_id")` and `remove_tag("job_id")` at the top of `set_jwt_sentry_scope`, then set them only when the new claims actually have them. `sub` and `user.id` are always set (sub is required by auth middleware; set_user(Some(..)) unconditionally overwrites prior values). P2 (UI title drift): The error path emitted TWO Sentry events with the same fingerprint: the tracing-bridge event titled with the class literal (e.g. "upstream bad_input"), and a `sentry::capture_error` event whose title defaulted to the underlying error's Display. Sentry's "latest event wins for title" rule meant the operator-visible bucket name would regularly get overwritten with `HttpError: ApiError { ... }` strings, defeating the PR's headline improvement. Fix: replace the bare `sentry::capture_error(err)` with `sentry::event_from_error(err)` followed by an explicit `event.message = Some(class_literal.into())` before `capture_event(event)`. The exception block is unchanged (chain still ships), the fingerprint still groups, and the title stays pinned. Tests: Added `test` feature to the sentry crate so the new tests can use `sentry::test::with_captured_events` to inspect actual emitted events. - `set_jwt_sentry_scope_clears_stale_sandbox_id_between_requests`: Two requests on the same thread, B with no sandbox_id/job_id. Asserts request B's emitted event has NEITHER tag — proves the leak is closed. - `set_jwt_sentry_scope_replaces_user_between_requests`: Locks in that set_user(Some(..)) overwrites without needing an explicit None clear. - `captured_event_message_pins_class_literal_not_error_display`: Asserts the captured event's `message` is the class literal and the exception chain is preserved. - `captured_event_message_pins_every_class`: Locks in the class → literal mapping so future renames of the tracing-event literals don't silently desync with the capture-error path. 1249/1249 sentry-feature tests green, default + clippy default + clippy sentry + clippy all-features + fmt all clean.

greptile-apps Bot reviewed Jun 4, 2026

View reviewed changes

miguelrios merged commit 142761d into release/0.7 Jun 4, 2026
6 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(sentry): real titles, real chains, real provider tags, drop quota noise [0.7 backport]#138

fix(sentry): real titles, real chains, real provider tags, drop quota noise [0.7 backport]#138
miguelrios merged 2 commits into
release/0.7from
fix/sentry-actually-useful-0.7

claudio-michel Bot commented Jun 4, 2026

Uh oh!

greptile-apps Bot commented Jun 4, 2026

Uh oh!

greptile-apps Bot Jun 4, 2026

Uh oh!

greptile-apps Bot Jun 4, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

claudio-michel Bot commented Jun 4, 2026

Summary

What this carries from #137

Conflict resolution notes

Tests

After this lands

Sister PR on main

Uh oh!

greptile-apps Bot commented Jun 4, 2026

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Sequence Diagram

Uh oh!

greptile-apps Bot Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant