Skip to content

fix(sentry): real titles, real chains, real provider tags, drop quota noise [0.7 backport]#138

Merged
miguelrios merged 2 commits into
release/0.7from
fix/sentry-actually-useful-0.7
Jun 4, 2026
Merged

fix(sentry): real titles, real chains, real provider tags, drop quota noise [0.7 backport]#138
miguelrios merged 2 commits into
release/0.7from
fix/sentry-actually-useful-0.7

Conversation

@claudio-michel

@claudio-michel claudio-michel Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Summary

0.7.x backport of #137 (the fix for the user-reported `upstream server error` Sentry mega-bucket).

The user pasted a Sentry issue tagged `release=0.7.13`, environment `demo` — same generic title, same misleading `provider=pdl_person_enrichment / operation_id=unknown` tags, no exception chain. The 0.8 fix shipped in #137; this brings the same fix to the 0.7 line so the demo environment actually gets it.

What this carries from #137

Identical logic; the only differences are conflict-resolution artifacts from a now-divergent codebase:

  1. 6 classified Sentry buckets (`upstream auth_error` / `bad_input` / `quota` / `rate_limited` / `server_error` / `transport_error`) via per-arm literal `tracing::*!` messages — replacing the single `"upstream server error"` mega-bucket.
  2. `provider_and_op(provider_name, tool_name)` that strips `{provider}_` / `{provider}:` prefixes correctly and never returns `"unknown"`. The exact PDL screenshot bug pinned by a test.
  3. `capture_error_with_scope` that calls `sentry::capture_error(&dyn Error)` so the issue page shows the real `HttpError::ApiError` source chain + reqwest frames.
  4. `before_send` drops 402 / 429 noise (they still ride along as breadcrumbs in the next real event).
  5. Cargo sentry features expanded to include `panic`, `backtrace`, `contexts`, `debug-images` — gives us actual stack traces for the first time.
  6. `set_jwt_sentry_scope(claims)` at `handle_call` entry → every event carries `sub` / `sandbox_id` / `job_id` tags + `user.id`.
  7. `max_breadcrumbs(100)`, `enable_logs(true)`.

Conflict resolution notes

Cherry-pick of `8521978` (the 0.7-shaped commit produced from `67af709` on `main`). Three text conflicts handled:

  • `src/core/logging.rs`: 0.7 has a single-feature-gated `shutdown(_guard: Option)` instead of 0.8's `InitGuards` struct. Kept 0.7's `shutdown` and added the new `before_send` function adjacent to it.
  • `src/proxy/server.rs`: 0.7 doesn't have the Ati-Key path (added in feat(jwt+manifest): per-provider session token + multi-audience JWT validation (closes #121) #122). Kept 0.7's "Bearer-token path only" comment wording verbatim and inserted the new `set_jwt_sentry_scope` call.
  • `Cargo.lock`: regenerated against the new `sentry` features by running `cargo check` on the resolved tree.

Tests

cargo test                                                    # 0.7-default green
cargo test --features sentry                                  # 984/984 sentry-feature green
cargo clippy --features sentry --lib --tests -- -D warnings   # clean
cargo fmt --all --check                                       # clean

(0.7's suite is smaller than 0.8's because 0.7 doesn't have the passthrough / persistence / control-plane test files added since the fork.)

After this lands

  • Bump PR for `v0.7.14` against `release/0.7`
  • Tag `v0.7.14`, release workflow ships artifacts
  • Demo env can pull `0.7.14` and the screenshot's `pdl_person_enrichment` events start showing up as `upstream bad_input` with the right `provider=pdl` / `operation_id=person_enrichment` tags + reqwest chain

Sister PR on main

#137 — same fix on the 0.8 line, identical test results.

🤖 Generated with Claude Code

… noise

The user pasted a Sentry issue from 0.7.13 demo titled
"upstream server error", with 460 events in one bucket, 1896 hidden
issues in the trace preview, tag provider="pdl_person_enrichment"
(wrong — that's the whole tool name, the actual provider is "pdl"),
operation_id="unknown", and no exception/stack-trace. Useless for
triage.

Four bugs, all fixed here:

1. The `tracing::error!` message was a hardcoded literal
   "upstream server error". sentry-tracing maps the literal to the
   event title and Sentry groups by title. Every failure of every
   shape landed in one bucket.

   Fix: classify by status into 5 `UpstreamErrorClass` variants
   (auth_error / bad_input / quota / rate_limited / server_error /
   transport_error) and dispatch into one `tracing::*!` per arm with
   a distinct literal message. Five buckets per status class instead
   of one. (Quota + RateLimited are Warning level and get dropped
   below — see #4.)

2. `split_tool_name` assumed `provider:operation_id` and labelled
   anything else as `("<whole-tool-name>", "unknown")`. PDL,
   datalastic, and several other manifests ship flat tool names like
   `pdl_person_enrichment`. The tag lied and operation_id was
   useless.

   Fix: new `provider_and_op(provider_name, tool_name)` that takes
   the registry-resolved provider name (which the call site already
   has — `provider.name`). It strips `{provider}:` or `{provider}_`
   prefixes when present; otherwise the operation IS the tool name
   verbatim. Never fabricates "unknown". All three call sites in
   handle_call updated.

3. Errors were flattened to `String` at the boundary
   (`Some(&e.to_string())`). Sentry's `capture_error(&dyn Error)`
   path that walks `source()` and attaches each link as a separate
   exception block was never called. The issue page had no chain.

   Fix: new `capture_error_with_scope` helper that does
   `sentry::capture_error(err)` inside the same scope that carries
   the tags + class-based fingerprint. All three call sites (HTTP /
   MCP / CLI) call it alongside the report helper. The Quota +
   RateLimited classes short-circuit inside the helper (don't bother
   building backtrace for events that before_send will drop).

4. No `before_send` hook → quota / rate-limit events spammed the
   issue list and burned project quota.

   Fix: `before_send` reads `tags.upstream_error_class` and drops
   events tagged `quota` or `rate_limited`. The `tracing::warn!`
   calls still run so breadcrumbs record context for the next real
   error.

Plus the prerequisites that made #3 work:

- Cargo sentry features expanded to include `panic`, `backtrace`,
  `contexts`, `debug-images`. The old four-feature subset compiled
  the SDK without the panic hook (so panics silently vanished) and
  without backtrace + device-context blocks (so even captured
  exceptions had no frames).
- `attach_stacktrace: true` (was already set, now actually does
  something with the backtrace feature on).
- `max_breadcrumbs: 100` (up from default 30) so the breadcrumbs
  around the failure aren't truncated on busy requests.
- `enable_logs: true` so Sentry's Logs product receives the
  structured tracing fields, not just the rendered event title.
- `set_jwt_sentry_scope(claims)` at handle_call entry sets
  `tags.sub`, `tags.sandbox_id`, `tags.job_id`, `user.id` so every
  event raised during the request is searchable by identity.

Tests:

- `provider_and_op_strips_underscore_prefix` — the exact PDL bug
  from the screenshot: `("pdl", "pdl_person_enrichment") →
  ("pdl", "person_enrichment")`.
- `provider_and_op_no_prefix_keeps_tool_name` — never returns
  "unknown".
- `provider_and_op_explicit_colon` + `_strips_colon_prefix` +
  `_empty_op_after_strip_keeps_tool_name` — covers all common
  shapes.
- `UpstreamErrorClass::classify` covered for 400 / 401 / 402 / 429
  / 5xx / 0.
- `before_send_drops_quota_class_events`,
  `before_send_drops_rate_limited_class_events` — the noise filter.
- `before_send_passes_<bad_input|server_error|auth_error|
  transport_error>_through` — pageable classes survive.
- `before_send_passes_events_without_class_tag` — panics + other
  tracing::error! calls outside the upstream-error helper aren't
  caught in the filter's blast radius.

All 1227 default-feature tests green, all 1241 sentry-feature tests
green, clippy clean across default / sentry / all-features, fmt
clean.
@greptile-apps

greptile-apps Bot commented Jun 4, 2026

Copy link
Copy Markdown

Greptile Summary

This is a 0.7 backport of #137, improving Sentry observability by replacing a single "upstream server error" mega-bucket with 6 classified event buckets, fixing provider/operation tag accuracy, attaching real Rust error chains to Sentry issues, and dropping 402/429 billing noise via before_send.

  • src/core/sentry_scope.rs: Introduces UpstreamErrorClass enum, provider_and_op() (fixes the PDL flat-name tag bug), capture_error_with_scope() (attaches full source() chain as a Sentry exception), and set_jwt_sentry_scope() (stamps every request's events with JWT identity tags); report_upstream_error now emits 6 distinct literal tracing::*! messages so Sentry groups them into separate issues.
  • src/core/logging.rs: Adds a before_send hook that silently drops quota/rate-limited events, bumps max_breadcrumbs to 100, enables Sentry Logs, and wires in the panic, backtrace, contexts, debug-images features for real stack traces.
  • src/proxy/server.rs: Replaces three split_tool_name call sites with provider_and_op, adds set_jwt_sentry_scope at handle_call entry, and pairs each error path with capture_error_with_scope for the exception chain.

Confidence Score: 4/5

Safe to merge; all changes are behind the optional sentry cargo feature and the core proxy logic is unaffected

The backport is well-structured, test-covered, and fixes a real user-reported issue. The main concern is that sentry::configure_scope in set_jwt_sentry_scope permanently modifies the thread-local Sentry scope and is not reset between requests sharing the same Tokio worker thread — contrary to the inline comment — which can cause JWT identity tags from one request to bleed onto Sentry events from a subsequent request on the same thread. This is an observability accuracy issue rather than a user-visible defect. A secondary note: unclassified 4xx codes (408, 409, 410, etc.) silently fall into TransportError which is semantically defined as transport-level failures with no HTTP exchange.

src/core/sentry_scope.rs — the configure_scope lifetime and the 4xx classification gap

Important Files Changed

Filename Overview
src/core/sentry_scope.rs Major rewrite introducing UpstreamErrorClass enum, provider_and_op(), capture_error_with_scope(), set_jwt_sentry_scope(), and per-class literal tracing messages; well-tested, but configure_scope is not cleared between async requests and some unclassified 4xx codes fall into TransportError
src/core/logging.rs Adds before_send hook (drops quota/rate_limited events), max_breadcrumbs:100, enable_logs:true, and expands sentry features; well-tested and clean
src/proxy/server.rs Three split_tool_name calls replaced with provider_and_op, set_jwt_sentry_scope added at request entry, capture_error_with_scope added alongside existing report_upstream_error calls; changes are localized and correct
Cargo.toml Adds panic, backtrace, contexts, debug-images to sentry feature set; no default-features changed, optional feature gate preserved
Cargo.lock Lock file regenerated with new sentry sub-crate dependencies and their transitive deps; consistent with Cargo.toml changes

Sequence Diagram

sequenceDiagram
    participant Client
    participant handle_call
    participant sentry_scope
    participant Sentry

    Client->>handle_call: POST /call (JWT bearer)
    handle_call->>sentry_scope: set_jwt_sentry_scope(claims)
    Note over sentry_scope: configure_scope: sub, sandbox_id, job_id, user.id

    handle_call->>handle_call: execute tool call

    alt Tool / HTTP error
        handle_call->>sentry_scope: provider_and_op(provider.name, tool_name)
        sentry_scope-->>handle_call: (provider, operation_id)

        handle_call->>sentry_scope: report_upstream_error(provider, op, status, ...)
        sentry_scope->>sentry_scope: UpstreamErrorClass::classify(status)
        sentry_scope->>sentry_scope: with_upstream_scope: set tags + fingerprint
        sentry_scope->>Sentry: tracing::error!(upstream class) → titled issue event

        handle_call->>sentry_scope: capture_error_with_scope(err, provider, op, status, ...)
        sentry_scope->>sentry_scope: skip if is_quota_class()
        sentry_scope->>sentry_scope: with_upstream_scope: same tags + fingerprint
        sentry_scope->>Sentry: sentry::capture_error(err) → exception event with source chain
    end

    Note over Sentry: before_send drops quota/rate_limited events
    Note over Sentry: Both events share fingerprint → same issue
Loading

Reviews (1): Last reviewed commit: "fix(sentry): real titles, real chains, r..." | Re-trigger Greptile

Comment thread src/core/sentry_scope.rs
Comment on lines +609 to +628
#[allow(unused_variables)]
pub fn set_jwt_sentry_scope(claims: &TokenClaims) {
#[cfg(feature = "sentry")]
sentry::configure_scope(|scope| {
scope.set_tag("sub", claims.sub.as_str());
if let Some(ref id) = claims.sandbox_id {
scope.set_tag("sandbox_id", id.as_str());
}
if let Some(ref id) = claims.job_id {
scope.set_tag("job_id", id.as_str());
}
// Sentry's `user.id` is the standard slot for the auth subject —
// populating it lets the issue page show "users affected" counts
// and gives the search UI a first-class facet.
scope.set_user(Some(sentry::User {
id: Some(claims.sub.clone()),
..Default::default()
}));
});
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 configure_scope scope not cleared between requests

sentry::configure_scope mutates the thread-local hub scope permanently — it is not reset by the sentry::with_scope blocks that the report helpers push. Those blocks push a nested child scope and pop it, but the tags set by configure_scope on the parent scope remain on that thread indefinitely. In a Tokio worker-thread pool, the next request executed on the same thread will inherit sub, sandbox_id, and job_id from the previous request, causing Sentry events to be attributed to the wrong identity. The docstring's "cleared automatically when the tokio worker reaches the next request" claim is incorrect.

The correct pattern for per-request scoping in async Rust is to create an isolated Hub: Hub::new_from_top(Hub::current()) and run the handler under it. Alternatively, explicitly clear the user identity at request end, though that's harder to guarantee in all code paths.

Comment thread src/core/sentry_scope.rs
Comment on lines +42 to +52
impl UpstreamErrorClass {
pub fn classify(upstream_status: u16) -> Self {
match upstream_status {
401 | 403 | 407 => Self::AuthError,
400 | 404 | 422 => Self::BadInput,
402 => Self::Quota,
429 => Self::RateLimited,
500..=599 => Self::ServerError,
_ => Self::TransportError,
}
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Unclassified 4xx codes fall into TransportError

HTTP status codes that aren't explicitly matched (e.g., 405 Method Not Allowed, 408 Request Timeout, 409 Conflict, 410 Gone, 423, 425, and any other 4xx not listed) fall through to TransportError, whose doc comment defines it as "no HTTP status (DNS, TCP, TLS, MCP transport error)". Those codes do have an HTTP status, so they will be silently mislabeled in Sentry. In practice a provider returning 408 or 409 would appear under the transport_error bucket rather than any client-error bucket, making the on-call investigation misleading.

…(P2)

Greptile review on PR #137 raised two real defects:

P1 (correctness + multi-tenant data leak):
`set_jwt_sentry_scope` used `sentry::configure_scope` which mutates the
thread-local scope permanently. For optional fields (sandbox_id,
job_id) the code only called `set_tag` when the field was `Some`, but
never removed the tag when it was `None`. The doc comment claimed
"cleared automatically when the tokio worker reaches the next request
because Sentry's scope is request-keyed by the with_scope blocks the
report helpers push" — that's wrong. `with_scope` pushes a NEW layer;
it does NOT clear what `configure_scope` set underneath. On a busy
proxy with worker-thread reuse, request A's sandbox_id would stamp
itself on request B's events whenever B's JWT lacked that field.

Fix: unconditionally `scope.remove_tag("sandbox_id")` and
`remove_tag("job_id")` at the top of `set_jwt_sentry_scope`, then set
them only when the new claims actually have them. `sub` and `user.id`
are always set (sub is required by auth middleware; set_user(Some(..))
unconditionally overwrites prior values).

P2 (UI title drift):
The error path emitted TWO Sentry events with the same fingerprint:
the tracing-bridge event titled with the class literal (e.g.
"upstream bad_input"), and a `sentry::capture_error` event whose title
defaulted to the underlying error's Display. Sentry's "latest event
wins for title" rule meant the operator-visible bucket name would
regularly get overwritten with `HttpError: ApiError { ... }` strings,
defeating the PR's headline improvement.

Fix: replace the bare `sentry::capture_error(err)` with
`sentry::event_from_error(err)` followed by an explicit
`event.message = Some(class_literal.into())` before
`capture_event(event)`. The exception block is unchanged (chain still
ships), the fingerprint still groups, and the title stays pinned.

Tests:
Added `test` feature to the sentry crate so the new tests can use
`sentry::test::with_captured_events` to inspect actual emitted events.

- `set_jwt_sentry_scope_clears_stale_sandbox_id_between_requests`:
  Two requests on the same thread, B with no sandbox_id/job_id. Asserts
  request B's emitted event has NEITHER tag — proves the leak is
  closed.
- `set_jwt_sentry_scope_replaces_user_between_requests`: Locks in that
  set_user(Some(..)) overwrites without needing an explicit None
  clear.
- `captured_event_message_pins_class_literal_not_error_display`:
  Asserts the captured event's `message` is the class literal and the
  exception chain is preserved.
- `captured_event_message_pins_every_class`: Locks in the class →
  literal mapping so future renames of the tracing-event literals
  don't silently desync with the capture-error path.

1249/1249 sentry-feature tests green, default + clippy default +
clippy sentry + clippy all-features + fmt all clean.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant