feat(csharp): emit per-statement telemetry on the SEA path (PECO-3022)#464
Draft
jadewang-db wants to merge 29 commits into
Draft
feat(csharp): emit per-statement telemetry on the SEA path (PECO-3022)#464jadewang-db wants to merge 29 commits into
jadewang-db wants to merge 29 commits into
Conversation
This was referenced May 14, 2026
added 24 commits
May 22, 2026 20:06
…erMode\n\nTask ID: task-1.1-refactor-connection-telemetry-create
- High: preserve fail-open contract — wrap the TSessionHandle Guid byte[] conversion in InitializeTelemetry with try/catch. Pre-refactor the same conversion lived inside ConnectionTelemetry.Create's outer try/catch so a malformed session GUID degraded to NoOp telemetry; moving it to the transport boundary lost that guarantee. - Medium: remove SafeBuildSystemConfiguration_..._FallbackPath — it passed string.Empty for assemblyVersion expecting a throw, but CreateSystemConfiguration coalesces empty string. The catch block is never reached. The CanonicalConstant_HasExpectedLiteralValue test already pins both branches by construction. - Low: rename Create_ThrowingHttpClient_ReturnsNullConnectionTelemetry to ...ReturnsNoOpConnectionTelemetry — actual return is the NoOp singleton. - Low: document the test's implicit assumption that the empty-host throws inside HttpClientFactory/TelemetryClientManager so future defensive handling there doesn't silently turn this into a non-test. - Low: add Create_EmptySessionId_MapsToNullInContext to pin the string.Empty -> null SessionId mapping at ConnectionTelemetry.cs:133. Co-authored-by: Isaac
…on\n\nTask ID: task-2.1-observer-interface-and-null-impl
…ryContext + enqueue\n\nTask ID: task-2.2-telemetry-observer-impl
…server\n\nTask ID: task-2.3-safe-observer-decorator
…\n\nTask ID: task-3.1-refactor-databricks-statement-observer
…elemetryContext downcast
…D: task-4.1-wire-sea-connection-telemetry
…t _waitTimeoutSeconds) in SEA
…and inject from connection\n\nTask ID: task-5.1-sea-statement-observer-field-and-create
…EA statement\n\nTask ID: task-5.2-sea-statement-execute-and-poll-hookpoints
…SEA EXECUTE records
added 5 commits
May 22, 2026 20:09
PECO-3022 wired SEA telemetry end-to-end (gap-1 through gap-5 + D1). The 11 E2E test files that were gated off via Skip.If(Protocol == "rest", ...) now have their guards removed and execute against both protocols, turning the existing suite into a Thrift-vs-SEA comparator. Categories addressed: - 7 files marked "PECO-3010: telemetry not wired for SEA" — marker is stale, drop skip + drop accompanying TODO comments. - 4 files marked "Thrift-only" (CloudFetch chunk details, connection parameters) — drop skip experimentally; SEA paths should now populate equivalent fields via the gap-2 SeaResultFormatMapper, gap-3/4/5 reader-side hookpoints, the CLOSE_STATEMENT event wiring, and the D1 fix sourcing socket_timeout from HttpClient.Timeout. Left untouched: - RetryCountTests.cs — retry_count is not wired for SEA; tracked separately in the gap-fix workstream, not PECO-3022 scope. Co-authored-by: Isaac
Three gap reports authored during PECO-3022 implementation and verification: 1. gap-report-PECO-3022-sea-telemetry-2026-05-14.md — initial post-implementation gap audit (G1-G3 critical wiring gaps + D1-D4 divergences) 2. gap-report-PECO-3022-sea-telemetry-e2e-2026-05-15.md — E2E test coverage gap (G4) and Thrift-vs-SEA comparator strategy 3. gap-report-PECO-3022-sea-telemetry-prod-findings-2026-05-15.md — bugs surfaced by comparing v1.1.4 SEA records vs Thrift in eng_lumberjack (B1-B7), plus follow-up resolution tracking All gaps now closed except B5 (gap-fix workstream dependency) and B7 (consumer-team confirmation pending). Co-authored-by: Isaac
…ect_results, ConnectTimeoutMilliseconds B8 (uncommitted SDD output) + B8 finisher + B10 + B11. Changes: - SeaMetadataOperationMapper: maps metadata SqlQuery keywords (getcatalogs, getschemas, gettables, getcolumns, gettabletypes, getprimarykeys, getcrossreference) to OperationType.List* — mirrors Thrift's DatabricksStatement.GetMetadataOperationType. - StatementExecutionStatement._pendingMetadataOperation + SetPendingMetadataOperation: per-statement override so the sub-statement's OnExecuteStarted emits (StatementType.Metadata, OperationType.List*) instead of (Query, ExecuteStatementAsync). - StatementExecutionConnection.ExecuteMetadataSqlAsync(sql, OperationType, ct): new overload that stamps the sub-statement before execution. Connection-level IGetObjectsDataProvider methods + ExecuteShowColumnsAsync all pass the right OperationType through. - StatementExecutionConnection.GetTableTypes: emits ListTableTypes telemetry directly via EmitStatementOperationTelemetry since the method is purely local and bypasses ExecuteMetadataSqlAsync entirely (B8 finisher). - StatementExecutionConnection.OpenAsync: enableDirectResults now read from DatabricksParameters.EnableDirectResults property instead of hardcoded true, matching user intent in telemetry (B10). - StatementExecutionConnection.CreateHttpClient: honors SparkParameters.ConnectTimeoutMilliseconds when explicitly set, falling back to CloudFetchTimeoutMinutes otherwise. Combined with the D1 fix that sources socket_timeout from HttpClient.Timeout, this makes user-supplied connect-timeout values reach the telemetry wire (B11). Verified against the test workspace: - Telemetry_GetTableTypes_EmitsListTableTypes — passes - ConnectionParams_SocketTimeout_IsPopulated — passes (was: Expected 120 / Actual 900) - ConnectionParams_EnableDirectResults_IsPopulated — passes - 7/11 metadata tests pass (catalogs, columns, all statement-side); 3 remaining metadata tests (Schemas, Tables, AllDepths) hit TaskCanceledException on SHOW SCHEMAS/TABLES IN ALL CATALOGS — environmental (10s metadata budget on this warehouse), not a telemetry-wiring issue. Co-authored-by: Isaac
Reflects the SDD-round B8/B9 closures plus the manual B8 finisher (GetTableTypes), B10 (enable_direct_results), and B11 (ConnectTimeoutMilliseconds) fixes. Documents that the remaining 3 metadata-test failures (Schemas, Tables, AllDepths) are environmental — SHOW SCHEMAS/TABLES IN ALL CATALOGS times out against the 10s metadata budget on the test warehouse — not telemetry-wiring issues. Co-authored-by: Isaac
8109047 to
f7be471
Compare
CurtHagenlocher
pushed a commit
to CurtHagenlocher/databricks
that referenced
this pull request
May 24, 2026
) ## 🥞 Stacked PR Use this [link](https://github.com/adbc-drivers/databricks/pull/455/files) to review incremental changes. - [**stack/PECO-3022-sea-telemetry-design**](adbc-drivers#455) [[Files changed](https://github.com/adbc-drivers/databricks/pull/455/files)] - [stack/pr-phase1-connection-telemetry-create-refactor](adbc-drivers#460) [[Files changed](https://github.com/adbc-drivers/databricks/pull/460/files/c0dfbed80fd09c91b9e2e5ed8050a268435618bd..d3190aeb6f2f1c727b359d0ef40d26be2280c73e)] - [stack/pr-phase2-observer-interface](adbc-drivers#461) [[Files changed](https://github.com/adbc-drivers/databricks/pull/461/files/b2340d9d32da68d5b81756e0a77008f05aadd45b..b0d9f02236bf7f99e93132cfe6ed4dd119fc1e73)] - [stack/pr-phase3-databricks-statement-observer](adbc-drivers#462) [[Files changed](https://github.com/adbc-drivers/databricks/pull/462/files/b0d9f02236bf7f99e93132cfe6ed4dd119fc1e73..fa799fcd0cc2209982eae2890e16e26854a0649e)] - [stack/pr-phase4-sea-connection-telemetry](adbc-drivers#463) [[Files changed](https://github.com/adbc-drivers/databricks/pull/463/files/f0e344aa4d9d341302ee9b3a5217a6794d8ba189..25b5e6eaf2f2f04941d07bbbe582845254630950)] - [stack/pr-phase5-sea-statement-telemetry](adbc-drivers#464) [[Files changed](https://github.com/adbc-drivers/databricks/pull/464/files/9dc278a249433000b03e2d3a4cdf7daa151caa69..8109047005a89804f243beb3a6c689f43b539506)] --------- ## Summary of the design This design closes the gap where the SEA (REST) transport in the C# ADBC Databricks driver emits **zero** client telemetry — no session events, no per-statement operation events, no error events, no chunk metrics. SEA traffic is currently invisible to the `eng_lumberjack` driver-telemetry pipeline. The design: - Introduces a small **observer interface** (`IStatementOperationObserver`) between statement classes and the telemetry implementation. The interface is shaped around the statement's lifecycle (`OnExecuteStarted`, `OnPollCompleted`, …) rather than telemetry's data model. Contract: fail-open — implementations must not throw. - **Refactors `ConnectionTelemetry.Create`** to accept `string sessionId` (dropping its Thrift `TSessionHandle` coupling) and a `DriverMode mode` parameter. Both transports use the same factory. - Wires observer callbacks at the SEA hookpoints in `StatementExecutionConnection` and `StatementExecutionStatement`. - **Reuses all existing telemetry infrastructure as-is**: `TelemetryClient`, `TelemetryClientManager`, `CircuitBreakerTelemetryExporter`, `DatabricksTelemetryExporter`, `FeatureFlagCache`, `TelemetrySessionContext`, `StatementTelemetryContext`. ## Key decisions and alternatives considered - **Observer interface over static helper or instance emitter.** The gap-fix doc proposed a static `TelemetryHelper`. This design supersedes that with an observer-shaped interface because: (a) it gives one-line callsites with no parameter threading, (b) it decouples statement classes from telemetry types so future tracing/audit observers can plug in without touching statement code, (c) it's trivially mockable for tests. The fail-open contract pushes all try/catch into the implementations exactly once — callsites stay clean. - **Composition, not inheritance.** Considered (and rejected) three inheritance variants: SEA-inherits-Thrift (drags in entire HiveServer2 chain — semantically wrong), shared base above both (blocked structurally — `HiveServer2Connection`'s parent is in the Apache package we don't own), and interface + extension methods (functionally identical to the static helper). C# single inheritance + unowned Apache base classes make pure inheritance impractical. - **Decorator at AdbcConnection boundary rejected.** A wrapper around the whole connection cannot see deep telemetry signals (chunk timing, poll count, first-batch latency) — those live inside statement classes. Wrong granularity. - **Refactor `Create` signature rather than add overload.** Changes the canonical `ConnectionTelemetry.Create` to take `string sessionId`. Thrift converts at the call boundary (`sessionHandle.SessionId.Guid.ToString()`). Eliminates the Thrift leak from the telemetry API permanently. - **Scoped strictly to SEA, plus `DriverMode.Sea` setter.** The cross-cutting Thrift gaps (workspace_id, auth_type, metadata-ops instrumentation, retry_count) are owned by the separate gap-fix workstream and will land independently. The only cross-cutting change pulled in here is unhooking the hardcoded `DriverMode.Types.Type.Thrift` in `BuildDriverConnectionParams` — without it SEA records would be silently mislabeled as Thrift. ## Areas needing specific review focus 1. **Observer interface shape & fail-open contract** (`§5.1`) — the observer methods, naming, and the requirement that implementations never throw. Specifically: are the 8 methods the right cut, or is finer granularity (e.g. split `OnChunksDownloaded` from `OnConsumed`) preferred? 2. **`ConnectionTelemetry.Create` signature change** (`§5.2`) — replaces `TSessionHandle?` with `string sessionId` and adds a `DriverMode mode` parameter. This touches a stable API used by the existing Thrift path; the Thrift call site must convert at the boundary. 3. **Result-format mapping for SEA** (`§8`) — SEA does not expose a typed `ResultFormat`. The mapping table from wire `disposition` × manifest state → proto `ExecutionResult.Format` is a judgment call; please review. 4. **Chunk-metrics dependency on gap-fix** (`§9`, `§16`) — this design assumes the `CloudFetchDownloader → ChunkMetrics → CloudFetchReader.GetChunkMetrics()` plumbing from the gap-fix workstream lands first or concurrently. If gap-fix is delayed, SEA ships with `ChunkMetrics.Empty` and backfills later. Is that acceptable, or should we sequence differently? 5. **Open questions** (`§17`): polling-granularity semantics for `PollCount`, `is_internal_call` semantics for SEA `USE SCHEMA`, and whether SEA's `operation_type` should always be `EXECUTE_STATEMENT_ASYNC` or map to sync-emulated variants. ## Related - Builds on the architecture in [`csharp/doc/telemetry-design.md`](../blob/stack/PECO-3022-sea-telemetry-design/csharp/doc/telemetry-design.md) - Supersedes the `TelemetryHelper` static-helper proposal in [`docs/designs/fix-telemetry-gaps-design.md`](../blob/stack/PECO-3022-sea-telemetry-design/docs/designs/fix-telemetry-gaps-design.md) for the new SEA code; Thrift-side gap-fix work continues independently - Jira: [PECO-3022](https://databricks.atlassian.net/browse/PECO-3022) This pull request and its description were written by Isaac. --------- Co-authored-by: Jade Wang <jade.wang+data@databricks.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
🥞 Stacked PR
Use this link to review incremental changes.
Summary
Wires all eight observer hookpoints into
StatementExecutionStatementso the SEA path emits a complete per-statement telemetry record matching the Thrift path. This is the final phase of the PECO-3022 stack — after merge, SEA telemetry is at production parity with Thrift for every field the design owns.Observer callsites (all 8)
OnExecuteStartedExecuteQueryInternalAsyncbefore_client.ExecuteStatementAsyncOnExecuteSucceededSeaResultFormatMapperforExecutionResult.FormatOnPollCompletedOnFirstBatchReadyExecuteQueryInternalAsync(fires regardless of consumer iteration)OnConsumedEnsureObserverSignaledsafety net for consumers that abandon the readerOnChunksDownloadedChunkMetricsuntil the gap-fix workstream surfaces real valuesOnErrorExecuteQueryInternalAsynccatch blockOnFinalizedStatementExecutionStatement.Dispose, gated on_executeStarted, idempotent viaInterlocked.CompareExchangeAdditional pieces
SeaResultFormatMapper— pure-function helper mapping wiredisposition× manifest state to protoExecutionResult.Format(per design §8).SeaMetadataOperationMapper+SetPendingMetadataOperation— categorizes metadata operations soGetCatalogs/GetSchemas/GetTables/GetColumns/GetPrimaryKeys/GetCrossReferenceemit(StatementType.Metadata, OperationType.List*)instead of(Query, ExecuteStatementAsync).GetTableTypes(pure local, no SQL) emits its own directEmitStatementOperationTelemetrysince it bypasses the statement pipeline.OperationType.ExecuteStatementAsyncinstead ofExecuteStatement— SEA is always async on the wire (design §17 open question ci(deps): Bump actions/checkout from 4 to 5 #3).ExecuteUpdatepath — observer lifecycle now mirrors the Query path; UPDATE statements emit telemetry just like SELECTs.Disposewith idempotency gate, mirroring the Thrift path's per-statement-disposal event.Skip.If(Protocol == "rest", ...)guards incsharp/test/E2E/Telemetry/removed. The existing Thrift telemetry suite now runs against SEA, turning it into a Thrift-vs-SEA comparator.RetryCountTestsstays skipped (retry_count belongs to a separate gap-fix workstream).Files touched
csharp/src/StatementExecution/StatementExecutionStatement.cs— observer field, 8 hookpoints, metadata pending-operation, CLOSE_STATEMENT emission, ExecuteUpdate wiring, EnsureObserverSignaled safety net.csharp/src/StatementExecution/StatementExecutionConnection.cs—GetTableTypesdirect emission, metadata-taggedExecuteMetadataSqlAsyncoverload,ExecuteShowColumnsAsyncoperation-type threading.csharp/src/StatementExecution/SeaResultFormatMapper.cs(new) — disposition→proto-enum mapping.csharp/src/StatementExecution/SeaMetadataOperationMapper.cs(new) — metadata-keyword→OperationType.List* mapping.csharp/test/Unit/StatementExecution/StatementExecutionStatementObserverTests.cs+StatementExecutionStatementObserverInjectionTests.cs(new) — comprehensive unit coverage of all hookpoints, idempotency, and metadata stamping.csharp/test/E2E/Telemetry/*.cs— 11 files: PECO-3010 skip-guards + stale TODOs removed.csharp/doc/gap-report-PECO-3022-sea-telemetry-*.md(4 reports) — documents the iterative gap-fill rounds, prod-data verification, and final comparator results.Test plan
Protocol=rest): 62 / 93 pass — every divergence vs the Thrift run (84 / 93) is explained: 19 are the B5/ChunkMetrics workstream dependency, 6 are intentional SEA skips (RetryCount + LZ4 disabled), 3 are environmental SQL timeouts unrelated to telemetry wiring.main.eng_lumberjack.prod_frontend_log_sql_driver_log:operation_type = EXECUTE_STATEMENT_ASYNC,result_set_ready_latency_millis,result_set_consumption_latency_millispopulated,socket_timeout = 900(matches Thrift)execution_result = INLINE_ARROW(noFORMAT_UNSPECIFIEDrecords)Related
csharp/doc/gap-report-PECO-3022-sea-telemetry-*.md.Part of PECO-3022. Closes out the PECO-3022 stack.