Skip to content

feat(csharp): wire connection-level telemetry into StatementExecutionConnection (PECO-3022)#463

Draft
jadewang-db wants to merge 14 commits into
mainfrom
stack/pr-phase4-sea-connection-telemetry
Draft

feat(csharp): wire connection-level telemetry into StatementExecutionConnection (PECO-3022)#463
jadewang-db wants to merge 14 commits into
mainfrom
stack/pr-phase4-sea-connection-telemetry

Conversation

@jadewang-db
Copy link
Copy Markdown
Collaborator

@jadewang-db jadewang-db commented May 19, 2026

🥞 Stacked PR

Use this link to review incremental changes.


Summary

Wires IConnectionTelemetry into StatementExecutionConnection so the SEA path emits the same connection-lifecycle events as Thrift (CREATE_SESSION, DELETE_SESSION), and aligns the connection-parameter fields users care about for cross-protocol comparator dashboards.

  • _telemetry: IConnectionTelemetry field — defaults to NoOpConnectionTelemetry.Instance, replaced after CreateSessionAsync succeeds. Calls ConnectionTelemetry.Create(..., mode: DriverMode.Sea, ...) and emits EmitCreateSessionTelemetry on session open / EmitDeleteSessionTelemetry on dispose. 5-second hard timeout on DisposeAsync so a stuck exporter cannot hang connection close. Test seam TelemetryForTesting mirrors the Thrift pattern.
  • socket_timeout (D1 + B11) — sourced from _httpClient.Timeout.TotalMilliseconds rather than the SEA query-wait timeout _waitTimeoutSeconds. In addition, CreateHttpClient now honors a user-supplied SparkParameters.ConnectTimeoutMilliseconds when explicitly set, falling back to CloudFetchTimeoutMinutes otherwise. End result: the telemetry record's socket_timeout reflects the actual HTTP timeout, matching Thrift exactly under default settings (900s).
  • enable_direct_results (B10) — read from the user-supplied DatabricksParameters.EnableDirectResults property and emitted faithfully in telemetry, rather than the previous hardcoded true. SEA does not implement Thrift's direct-results optimization internally, but reflecting user intent keeps the wire record honest.

Files touched

  • csharp/src/StatementExecution/StatementExecutionConnection.cs — telemetry field, OpenAsync/Dispose wiring, CreateHttpClient timeout source, observer construction in CreateStatement.
  • csharp/test/Unit/StatementExecution/StatementExecutionConnectionTelemetryTests.cs (new) — exercises CREATE_SESSION/DELETE_SESSION emission, dispose timeout, fail-open initialization.

Test plan

  • OpenAsync succeeds even if telemetry initialization throws (fail-open).
  • Dispose completes within 5 seconds even when the exporter is wedged (Dispose_FlushHangs_CompletesWithin5Seconds).
  • Observer is created in CreateStatement from _telemetry.Session; falls back to NullObserver.Instance when telemetry is disabled.
  • Production-verified in eng_lumberjack: socket_timeout = 900s (matches Thrift), enable_direct_results reflects user property.

Related

Part of PECO-3022.

Jade Wang added 14 commits May 22, 2026 20:06
…erMode\n\nTask ID: task-1.1-refactor-connection-telemetry-create
- High: preserve fail-open contract — wrap the TSessionHandle Guid byte[]
  conversion in InitializeTelemetry with try/catch. Pre-refactor the same
  conversion lived inside ConnectionTelemetry.Create's outer try/catch so a
  malformed session GUID degraded to NoOp telemetry; moving it to the
  transport boundary lost that guarantee.
- Medium: remove SafeBuildSystemConfiguration_..._FallbackPath — it passed
  string.Empty for assemblyVersion expecting a throw, but
  CreateSystemConfiguration coalesces empty string. The catch block is
  never reached. The CanonicalConstant_HasExpectedLiteralValue test
  already pins both branches by construction.
- Low: rename Create_ThrowingHttpClient_ReturnsNullConnectionTelemetry to
  ...ReturnsNoOpConnectionTelemetry — actual return is the NoOp singleton.
- Low: document the test's implicit assumption that the empty-host throws
  inside HttpClientFactory/TelemetryClientManager so future defensive
  handling there doesn't silently turn this into a non-test.
- Low: add Create_EmptySessionId_MapsToNullInContext to pin the
  string.Empty -> null SessionId mapping at ConnectionTelemetry.cs:133.

Co-authored-by: Isaac
…on\n\nTask ID: task-2.1-observer-interface-and-null-impl
…ryContext + enqueue\n\nTask ID: task-2.2-telemetry-observer-impl
…server\n\nTask ID: task-2.3-safe-observer-decorator
…\n\nTask ID: task-3.1-refactor-databricks-statement-observer
@jadewang-db jadewang-db force-pushed the stack/pr-phase4-sea-connection-telemetry branch from 25b5e6e to 44b2001 Compare May 22, 2026 20:09
CurtHagenlocher pushed a commit to CurtHagenlocher/databricks that referenced this pull request May 24, 2026
)

## 🥞 Stacked PR
Use this
[link](https://github.com/adbc-drivers/databricks/pull/455/files) to
review incremental changes.
-
[**stack/PECO-3022-sea-telemetry-design**](adbc-drivers#455)
[[Files
changed](https://github.com/adbc-drivers/databricks/pull/455/files)]
-
[stack/pr-phase1-connection-telemetry-create-refactor](adbc-drivers#460)
[[Files
changed](https://github.com/adbc-drivers/databricks/pull/460/files/c0dfbed80fd09c91b9e2e5ed8050a268435618bd..d3190aeb6f2f1c727b359d0ef40d26be2280c73e)]
-
[stack/pr-phase2-observer-interface](adbc-drivers#461)
[[Files
changed](https://github.com/adbc-drivers/databricks/pull/461/files/b2340d9d32da68d5b81756e0a77008f05aadd45b..b0d9f02236bf7f99e93132cfe6ed4dd119fc1e73)]
-
[stack/pr-phase3-databricks-statement-observer](adbc-drivers#462)
[[Files
changed](https://github.com/adbc-drivers/databricks/pull/462/files/b0d9f02236bf7f99e93132cfe6ed4dd119fc1e73..fa799fcd0cc2209982eae2890e16e26854a0649e)]
-
[stack/pr-phase4-sea-connection-telemetry](adbc-drivers#463)
[[Files
changed](https://github.com/adbc-drivers/databricks/pull/463/files/f0e344aa4d9d341302ee9b3a5217a6794d8ba189..25b5e6eaf2f2f04941d07bbbe582845254630950)]
-
[stack/pr-phase5-sea-statement-telemetry](adbc-drivers#464)
[[Files
changed](https://github.com/adbc-drivers/databricks/pull/464/files/9dc278a249433000b03e2d3a4cdf7daa151caa69..8109047005a89804f243beb3a6c689f43b539506)]

---------
## Summary of the design

This design closes the gap where the SEA (REST) transport in the C# ADBC
Databricks driver emits **zero** client telemetry — no session events,
no per-statement operation events, no error events, no chunk metrics.
SEA traffic is currently invisible to the `eng_lumberjack`
driver-telemetry pipeline.

The design:
- Introduces a small **observer interface**
(`IStatementOperationObserver`) between statement classes and the
telemetry implementation. The interface is shaped around the statement's
lifecycle (`OnExecuteStarted`, `OnPollCompleted`, …) rather than
telemetry's data model. Contract: fail-open — implementations must not
throw.
- **Refactors `ConnectionTelemetry.Create`** to accept `string
sessionId` (dropping its Thrift `TSessionHandle` coupling) and a
`DriverMode mode` parameter. Both transports use the same factory.
- Wires observer callbacks at the SEA hookpoints in
`StatementExecutionConnection` and `StatementExecutionStatement`.
- **Reuses all existing telemetry infrastructure as-is**:
`TelemetryClient`, `TelemetryClientManager`,
`CircuitBreakerTelemetryExporter`, `DatabricksTelemetryExporter`,
`FeatureFlagCache`, `TelemetrySessionContext`,
`StatementTelemetryContext`.

## Key decisions and alternatives considered

- **Observer interface over static helper or instance emitter.** The
gap-fix doc proposed a static `TelemetryHelper`. This design supersedes
that with an observer-shaped interface because: (a) it gives one-line
callsites with no parameter threading, (b) it decouples statement
classes from telemetry types so future tracing/audit observers can plug
in without touching statement code, (c) it's trivially mockable for
tests. The fail-open contract pushes all try/catch into the
implementations exactly once — callsites stay clean.
- **Composition, not inheritance.** Considered (and rejected) three
inheritance variants: SEA-inherits-Thrift (drags in entire HiveServer2
chain — semantically wrong), shared base above both (blocked
structurally — `HiveServer2Connection`'s parent is in the Apache package
we don't own), and interface + extension methods (functionally identical
to the static helper). C# single inheritance + unowned Apache base
classes make pure inheritance impractical.
- **Decorator at AdbcConnection boundary rejected.** A wrapper around
the whole connection cannot see deep telemetry signals (chunk timing,
poll count, first-batch latency) — those live inside statement classes.
Wrong granularity.
- **Refactor `Create` signature rather than add overload.** Changes the
canonical `ConnectionTelemetry.Create` to take `string sessionId`.
Thrift converts at the call boundary
(`sessionHandle.SessionId.Guid.ToString()`). Eliminates the Thrift leak
from the telemetry API permanently.
- **Scoped strictly to SEA, plus `DriverMode.Sea` setter.** The
cross-cutting Thrift gaps (workspace_id, auth_type, metadata-ops
instrumentation, retry_count) are owned by the separate gap-fix
workstream and will land independently. The only cross-cutting change
pulled in here is unhooking the hardcoded `DriverMode.Types.Type.Thrift`
in `BuildDriverConnectionParams` — without it SEA records would be
silently mislabeled as Thrift.

## Areas needing specific review focus

1. **Observer interface shape & fail-open contract** (`§5.1`) — the
observer methods, naming, and the requirement that implementations never
throw. Specifically: are the 8 methods the right cut, or is finer
granularity (e.g. split `OnChunksDownloaded` from `OnConsumed`)
preferred?
2. **`ConnectionTelemetry.Create` signature change** (`§5.2`) — replaces
`TSessionHandle?` with `string sessionId` and adds a `DriverMode mode`
parameter. This touches a stable API used by the existing Thrift path;
the Thrift call site must convert at the boundary.
3. **Result-format mapping for SEA** (`§8`) — SEA does not expose a
typed `ResultFormat`. The mapping table from wire `disposition` ×
manifest state → proto `ExecutionResult.Format` is a judgment call;
please review.
4. **Chunk-metrics dependency on gap-fix** (`§9`, `§16`) — this design
assumes the `CloudFetchDownloader → ChunkMetrics →
CloudFetchReader.GetChunkMetrics()` plumbing from the gap-fix workstream
lands first or concurrently. If gap-fix is delayed, SEA ships with
`ChunkMetrics.Empty` and backfills later. Is that acceptable, or should
we sequence differently?
5. **Open questions** (`§17`): polling-granularity semantics for
`PollCount`, `is_internal_call` semantics for SEA `USE SCHEMA`, and
whether SEA's `operation_type` should always be
`EXECUTE_STATEMENT_ASYNC` or map to sync-emulated variants.

## Related

- Builds on the architecture in
[`csharp/doc/telemetry-design.md`](../blob/stack/PECO-3022-sea-telemetry-design/csharp/doc/telemetry-design.md)
- Supersedes the `TelemetryHelper` static-helper proposal in
[`docs/designs/fix-telemetry-gaps-design.md`](../blob/stack/PECO-3022-sea-telemetry-design/docs/designs/fix-telemetry-gaps-design.md)
for the new SEA code; Thrift-side gap-fix work continues independently
- Jira: [PECO-3022](https://databricks.atlassian.net/browse/PECO-3022)

This pull request and its description were written by Isaac.

---------

Co-authored-by: Jade Wang <jade.wang+data@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant