
fix(csharp): emit db.error.kind tag on Status=Error spans #58

Merged
CurtHagenlocher merged 2 commits into adbc-drivers:main from eric-wang-1990:tracing/481-db-error-kind
May 29, 2026

@eric-wang-1990

Contributor

Motivation

Fixes adbc-drivers/databricks#481. When a query fails today the only error classification on the OTel span is exception.type, which is too ambiguous for automated dashboards / alerts to group on:

  • TaskCanceledException: client-side timeout? user-cancel via Cancel()? mid-flight network drop?
  • TTransportException: server-side cancel? network drop? Thrift framing error?

The driver already knows which of these it is internally (it set the timeout, it called Cancel, it received the Thrift error status, etc.); this PR surfaces that knowledge as a db.error.kind tag on the originating Activity.

What changed

A new internal helper ErrorKindClassifier (csharp/src/AdbcDrivers.HiveServer2/ErrorKindClassifier.cs) maps a caught exception to one of the OTel taxonomy values in ActivityKeys.Db.ErrorKindValues:

| Kind | Trigger |
| --- | --- |
| server_error | Thrift TStatus != SUCCESS (via HandleThriftResponse -> ThrowErrorResponse), polled operation in ERROR_STATE (in PollForResponseAsync), OpenSession failure paths, or direct-result error display messages from ExecuteStatementAsync. |
| network | Transport-level: HttpRequestException without a typed status code, SocketException, IOException, or a TTransportException with no more-specific inner exception. |
| auth_failed | HTTP 401 / 403: typed StatusCode on net5+, message scan fallback on netstandard2.0 / net472 (the same fallback the driver already uses in IsUnauthorized). |
| protocol_error | TProtocolException (Thrift framing/encoding). |
| query_timeout | Cancellation in scopes where the only CTS in scope carries a CancelAfter timer (HiveServer2Connection.OpenAsync, TryOpenSessionWithFallbackAsync, the metadata-query catch sites). |

The classifier is invoked at the catch/throw sites in HiveServer2Connection and HiveServer2Statement so the tag lives on the Activity where the failure actually originated. ErrorKindClassifier.Tag is idempotent — if an inner catch site already tagged a kind, an outer one will not overwrite — so the innermost classification wins.
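
As a rough illustration of the classify-then-tag flow described above, here is a minimal sketch. It is not the PR's ErrorKindClassifier: the class name ErrorKindSketch, the method shapes, and the omission of the Thrift-specific cases (TTransportException, TProtocolException, TStatus checks) are all assumptions made to keep the example self-contained. It assumes net5+ so that HttpRequestException.StatusCode is available.

```csharp
#nullable enable
using System;
using System.Diagnostics;
using System.IO;
using System.Net;
using System.Net.Http;
using System.Net.Sockets;

// Hypothetical stand-in for the PR's classifier; names and structure are assumptions.
internal static class ErrorKindSketch
{
    public const string TagName = "db.error.kind";

    // Returns a kind for exceptions this sketch recognizes, or null otherwise.
    // Thrift-specific cases from the table are intentionally left out.
    public static string? Classify(Exception ex)
    {
        switch (ex)
        {
            case HttpRequestException http
                when http.StatusCode == HttpStatusCode.Unauthorized
                  || http.StatusCode == HttpStatusCode.Forbidden:
                return "auth_failed";
            case HttpRequestException http when http.StatusCode == null:
                return "network";
            case SocketException _:
            case IOException _:
                return "network";
            default:
                return null;
        }
    }

    // Idempotent: a kind already set by an inner catch site is never overwritten.
    public static void Tag(Activity? activity, string? kind)
    {
        if (activity == null || kind == null) return;
        if (activity.GetTagItem(TagName) != null) return;
        activity.SetTag(TagName, kind);
    }
}
```

With this shape, an inner catch site calling `Tag(activity, "network")` followed by an outer one calling `Tag(activity, "server_error")` on the same Activity leaves the tag at "network", which is one way to get the innermost-wins behavior.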

Deferred: user_cancel

Per the parent issue this PR intentionally does NOT emit user_cancel. At the Statement-level catch sites (HiveServer2Statement.ExecuteQueryAsync etc.) the local CTS has BOTH a CancelAfter timer (from QueryTimeoutSeconds) AND a manual Cancel() path (the parent driver's Statement.Cancel()). Disambiguating those two requires cross-layer state (a flag set by Cancel() that the catch site can read) and is its own design question. Rather than risk mislabeling a user-cancel as a timeout, Statement-level cancellations are left UNTAGGED here, and user_cancel is left for a follow-up that touches the parent driver. The Connection-level cancel paths only have a timer, so they ARE tagged as query_timeout.
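
To make the ambiguity concrete, the sketch below shows one possible shape for the "flag set by Cancel()" idea mentioned above. It is purely illustrative and not part of this PR or the planned follow-up: CancellableOperationSketch and ClassifyCancellation are hypothetical names, and a real design would still need to decide where the flag lives across the driver layers.

```csharp
#nullable enable
using System;
using System.Threading;

// Hypothetical illustration: one CTS with both a timer and a manual cancel path.
internal sealed class CancellableOperationSketch : IDisposable
{
    private readonly CancellationTokenSource _cts = new CancellationTokenSource();
    private int _userCancelled; // 0 = Cancel() not called, 1 = Cancel() called

    public CancellableOperationSketch(TimeSpan timeout) => _cts.CancelAfter(timeout);

    public CancellationToken Token => _cts.Token;

    public void Cancel()
    {
        // Record the user's intent before triggering cancellation so a catch
        // site that observes the cancellation can also observe the flag.
        Interlocked.Exchange(ref _userCancelled, 1);
        _cts.Cancel();
    }

    // Without the flag, both paths look identical to the catch site:
    // the token is simply cancelled.
    public string? ClassifyCancellation()
    {
        if (!_cts.IsCancellationRequested) return null;
        return Volatile.Read(ref _userCancelled) == 1 ? "user_cancel" : "query_timeout";
    }

    public void Dispose() => _cts.Dispose();
}
```

Even this simple version has a race: if the timer fires first and the user calls Cancel() shortly afterwards, the flag reports user_cancel for what was actually a timeout. That kind of edge case is part of why the classification is treated as its own design question.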

Tests

Adds MockServerErrorKindTests (csharp/test/AdbcDrivers.Tests.HiveServer2/Hive2/MockServer/MockServerErrorKindTests.cs) following the ActivityListener-scoped-to-source pattern PR #57 established. Three failure modes are mockable through the in-process HiveServer2 server:

  • OperationError_TagsServerError: GetOperationStatus reports ERROR_STATE -> assert db.error.kind = "server_error" on the originating Activity.
  • OpenSessionError_TagsServerError: OpenSession returns ERROR_STATUS -> same.
  • Connect_RefusedTcp_TagsNetwork: connect to a closed loopback port -> assert db.error.kind = "network".
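
For readers unfamiliar with the ActivityListener-scoped-to-source pattern, the following is a minimal sketch of how such a test can capture spans and assert on the tag. It uses only System.Diagnostics APIs; the source name "Hypothetical.Source", the ListenerSketch class, and the simulated failure are assumptions for illustration and do not reproduce MockServerErrorKindTests. It assumes net6+ for Activity.SetStatus.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical helper; the real tests may structure this differently.
internal static class ListenerSketch
{
    // Subscribes to exactly one ActivitySource by name and records stopped activities.
    public static ActivityListener Listen(string sourceName, List<Activity> stopped)
    {
        var listener = new ActivityListener
        {
            ShouldListenTo = source => source.Name == sourceName,
            Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
                ActivitySamplingResult.AllDataAndRecorded,
            ActivityStopped = activity => { lock (stopped) stopped.Add(activity); },
        };
        ActivitySource.AddActivityListener(listener);
        return listener;
    }

    // Simulates a failing operation and returns the kind found on the error span.
    public static string? Demo()
    {
        const string name = "Hypothetical.Source";
        var stopped = new List<Activity>();
        using var source = new ActivitySource(name);
        using var listener = Listen(name, stopped);

        using (var activity = source.StartActivity("Connect"))
        {
            // Stand-in for the driver's catch site tagging the failure.
            activity?.SetTag("db.error.kind", "network");
            activity?.SetStatus(ActivityStatusCode.Error);
        }

        // The assertion a test would make: the span with Status=Error carries the kind.
        var failed = stopped.Find(a => a.Status == ActivityStatusCode.Error);
        return failed?.GetTagItem("db.error.kind") as string;
    }
}
```

Filtering in ShouldListenTo keeps activities from other sources in the same test process out of the captured list, which matters when tests run in parallel.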

Behavior was verified RED-then-GREEN: all three fail on the test-only commit and pass after the implementation commit. All 119 MockServer tests still pass; remaining failures in the full suite are the pre-existing integration tests gated on {HIVE,IMPALA,SPARK}_TEST_CONFIG_FILE env vars (same caveat as PR #57).

auth_failed and protocol_error are implemented in the classifier but not covered by a mock test in this PR: mocking an HTTP 401 / Thrift framing error from the in-process test server is a non-trivial change to the test infrastructure. A parent-repo E2E test against a real server can validate those once the parent bumps the submodule pointer.

Refs adbc-drivers/databricks#481

This pull request and its description were written by Isaac.

Adds three RED tests for adbc-drivers/databricks#481 covering the
classifications that are tractable to mock through the in-process
HiveServer2 test server:

- OperationError_TagsServerError: GetOperationStatus reports
  ERROR_STATE -> assert "server_error".
- OpenSessionError_TagsServerError: OpenSession returns ERROR_STATUS
  -> assert "server_error".
- Connect_RefusedTcp_TagsNetwork: connect to a closed loopback port
  -> assert "network".

Tests fail on origin/main (no db.error.kind tag exists) and will
turn GREEN once the catch sites are wired up in the follow-up
commit.

Adds OTel-compatible db.error.kind taxonomy on failure spans so
dashboards can group/filter by error category instead of relying
solely on exception.type (which is ambiguous between, e.g.,
TaskCanceledException for timeouts vs user cancels).

Implemented classifications:
- server_error: Thrift TStatus != SUCCESS, polled operation in
  ERROR_STATE, OpenSession failure paths, or direct-result error
  display messages.
- network: transport-level (HttpRequestException without status,
  SocketException, IOException, TTransportException without
  a more specific inner exception).
- auth_failed: HTTP 401 / 403 (typed StatusCode on net5+, message
  scan on netstandard2.0 / net472).
- protocol_error: TProtocolException (Thrift framing/encoding).
- query_timeout: cancellation in scopes where the only CTS in
  scope had a CancelAfter timer (OpenAsync, the metadata-query
  paths in HiveServer2Connection). NOT applied at the
  Statement-level catches because the parent driver can also call
  Statement.Cancel(), and disambiguating user_cancel from
  query_timeout there requires cross-layer state.

Deferred: user_cancel classification — left for a follow-up that
also touches the parent driver. Statement-level cancellations
remain UNTAGGED for now rather than risk mislabeling.

The tag is set at the originating catch/throw sites (innermost
wins, via a guard in ErrorKindClassifier.Tag), so the OTel
Status=Error span where the failure actually occurred carries
the classification. Outer wrapper spans inherit Status=Error
from System.Diagnostics but do not duplicate the kind.

Refs adbc-drivers/databricks#481
@@ -0,0 +1,184 @@
/*
* Copyright (c) 2025 ADBC Drivers Contributors

Collaborator

Suggested change:
- * Copyright (c) 2025 ADBC Drivers Contributors
+ * Copyright (c) 2026 ADBC Drivers Contributors

Collaborator

(This has been throwing me off, too; I have to make an effort to remember.)

@@ -0,0 +1,182 @@
/*
* Copyright (c) 2025 ADBC Drivers Contributors

Collaborator

Suggested change:
- * Copyright (c) 2025 ADBC Drivers Contributors
+ * Copyright (c) 2026 ADBC Drivers Contributors

@CurtHagenlocher CurtHagenlocher merged commit 8600ed1 into adbc-drivers:main May 29, 2026
9 checks passed

Development

Successfully merging this pull request may close these issues.

tracing(csharp): no semantic error classification (db.error.kind) on failure spans

2 participants