Skip to content

feat(csharp): emit per-statement telemetry on the SEA path (PECO-3022)#464

Draft
jadewang-db wants to merge 29 commits into
mainfrom
stack/pr-phase5-sea-statement-telemetry
Draft

feat(csharp): emit per-statement telemetry on the SEA path (PECO-3022)#464
jadewang-db wants to merge 29 commits into
mainfrom
stack/pr-phase5-sea-statement-telemetry

Conversation

@jadewang-db
Copy link
Copy Markdown
Collaborator

@jadewang-db jadewang-db commented May 19, 2026

🥞 Stacked PR

Use this link to review incremental changes.


Summary

Wires all eight observer hookpoints into StatementExecutionStatement so the SEA path emits a complete per-statement telemetry record matching the Thrift path. This is the final phase of the PECO-3022 stack — after merge, SEA telemetry is at production parity with Thrift for every field the design owns.

Observer callsites (all 8)

Hookpoint Where it fires
OnExecuteStarted ExecuteQueryInternalAsync before _client.ExecuteStatementAsync
OnExecuteSucceeded After execute response, with SeaResultFormatMapper for ExecutionResult.Format
OnPollCompleted After polling loop terminates (accumulated count + ms)
OnFirstBatchReady At reader construction in ExecuteQueryInternalAsync (fires regardless of consumer iteration)
OnConsumed At reader Dispose, with EnsureObserverSignaled safety net for consumers that abandon the reader
OnChunksDownloaded At cloud-fetch reader Dispose; falls back to empty ChunkMetrics until the gap-fix workstream surfaces real values
OnError ExecuteQueryInternalAsync catch block
OnFinalized StatementExecutionStatement.Dispose, gated on _executeStarted, idempotent via Interlocked.CompareExchange

Additional pieces

  • SeaResultFormatMapper — pure-function helper mapping wire disposition × manifest state to proto ExecutionResult.Format (per design §8).
  • SeaMetadataOperationMapper + SetPendingMetadataOperation — categorizes metadata operations so GetCatalogs/GetSchemas/GetTables/GetColumns/GetPrimaryKeys/GetCrossReference emit (StatementType.Metadata, OperationType.List*) instead of (Query, ExecuteStatementAsync). GetTableTypes (pure local, no SQL) emits its own direct EmitStatementOperationTelemetry since it bypasses the statement pipeline.
  • OperationType.ExecuteStatementAsync instead of ExecuteStatement — SEA is always async on the wire (design §17 open question ci(deps): Bump actions/checkout from 4 to 5 #3).
  • ExecuteUpdate path — observer lifecycle now mirrors the Query path; UPDATE statements emit telemetry just like SELECTs.
  • CLOSE_STATEMENT event — emitted from Dispose with idempotency gate, mirroring the Thrift path's per-statement-disposal event.
  • E2E test skip-guards — 11 of 12 Skip.If(Protocol == "rest", ...) guards in csharp/test/E2E/Telemetry/ removed. The existing Thrift telemetry suite now runs against SEA, turning it into a Thrift-vs-SEA comparator. RetryCountTests stays skipped (retry_count belongs to a separate gap-fix workstream).

Files touched

  • csharp/src/StatementExecution/StatementExecutionStatement.cs — observer field, 8 hookpoints, metadata pending-operation, CLOSE_STATEMENT emission, ExecuteUpdate wiring, EnsureObserverSignaled safety net.
  • csharp/src/StatementExecution/StatementExecutionConnection.csGetTableTypes direct emission, metadata-tagged ExecuteMetadataSqlAsync overload, ExecuteShowColumnsAsync operation-type threading.
  • csharp/src/StatementExecution/SeaResultFormatMapper.cs (new) — disposition→proto-enum mapping.
  • csharp/src/StatementExecution/SeaMetadataOperationMapper.cs (new) — metadata-keyword→OperationType.List* mapping.
  • csharp/test/Unit/StatementExecution/StatementExecutionStatementObserverTests.cs + StatementExecutionStatementObserverInjectionTests.cs (new) — comprehensive unit coverage of all hookpoints, idempotency, and metadata stamping.
  • csharp/test/E2E/Telemetry/*.cs — 11 files: PECO-3010 skip-guards + stale TODOs removed.
  • csharp/doc/gap-report-PECO-3022-sea-telemetry-*.md (4 reports) — documents the iterative gap-fill rounds, prod-data verification, and final comparator results.

Test plan

  • Unit: all observer hookpoints fire exactly-once in the expected order; CLOSE_STATEMENT idempotent across repeated Dispose; OnFinalized exactly-once even when both error path and dispose path call it.
  • E2E telemetry suite on SEA (Protocol=rest): 62 / 93 pass — every divergence vs the Thrift run (84 / 93) is explained: 19 are the B5/ChunkMetrics workstream dependency, 6 are intentional SEA skips (RetryCount + LZ4 disabled), 3 are environmental SQL timeouts unrelated to telemetry wiring.
  • Production-verified in main.eng_lumberjack.prod_frontend_log_sql_driver_log:
    • 30/30 SEA EXECUTE records carry operation_type = EXECUTE_STATEMENT_ASYNC, result_set_ready_latency_millis, result_set_consumption_latency_millis populated, socket_timeout = 900 (matches Thrift)
    • 30 paired CLOSE_STATEMENT events emitted
    • execution_result = INLINE_ARROW (no FORMAT_UNSPECIFIED records)

Related

Part of PECO-3022. Closes out the PECO-3022 stack.

Jade Wang added 24 commits May 22, 2026 20:06
…erMode\n\nTask ID: task-1.1-refactor-connection-telemetry-create
- High: preserve fail-open contract — wrap the TSessionHandle Guid byte[]
  conversion in InitializeTelemetry with try/catch. Pre-refactor the same
  conversion lived inside ConnectionTelemetry.Create's outer try/catch so a
  malformed session GUID degraded to NoOp telemetry; moving it to the
  transport boundary lost that guarantee.
- Medium: remove SafeBuildSystemConfiguration_..._FallbackPath — it passed
  string.Empty for assemblyVersion expecting a throw, but
  CreateSystemConfiguration coalesces empty string. The catch block is
  never reached. The CanonicalConstant_HasExpectedLiteralValue test
  already pins both branches by construction.
- Low: rename Create_ThrowingHttpClient_ReturnsNullConnectionTelemetry to
  ...ReturnsNoOpConnectionTelemetry — actual return is the NoOp singleton.
- Low: document the test's implicit assumption that the empty-host throws
  inside HttpClientFactory/TelemetryClientManager so future defensive
  handling there doesn't silently turn this into a non-test.
- Low: add Create_EmptySessionId_MapsToNullInContext to pin the
  string.Empty -> null SessionId mapping at ConnectionTelemetry.cs:133.

Co-authored-by: Isaac
…on\n\nTask ID: task-2.1-observer-interface-and-null-impl
…ryContext + enqueue\n\nTask ID: task-2.2-telemetry-observer-impl
…server\n\nTask ID: task-2.3-safe-observer-decorator
…\n\nTask ID: task-3.1-refactor-databricks-statement-observer
…and inject from connection\n\nTask ID: task-5.1-sea-statement-observer-field-and-create
…EA statement\n\nTask ID: task-5.2-sea-statement-execute-and-poll-hookpoints
Jade Wang added 5 commits May 22, 2026 20:09
PECO-3022 wired SEA telemetry end-to-end (gap-1 through gap-5 + D1).
The 11 E2E test files that were gated off via Skip.If(Protocol == "rest", ...)
now have their guards removed and execute against both protocols, turning
the existing suite into a Thrift-vs-SEA comparator.

Categories addressed:
- 7 files marked "PECO-3010: telemetry not wired for SEA" — marker is stale,
  drop skip + drop accompanying TODO comments.
- 4 files marked "Thrift-only" (CloudFetch chunk details, connection parameters)
  — drop skip experimentally; SEA paths should now populate equivalent fields
  via the gap-2 SeaResultFormatMapper, gap-3/4/5 reader-side hookpoints, the
  CLOSE_STATEMENT event wiring, and the D1 fix sourcing socket_timeout from
  HttpClient.Timeout.

Left untouched:
- RetryCountTests.cs — retry_count is not wired for SEA; tracked separately
  in the gap-fix workstream, not PECO-3022 scope.

Co-authored-by: Isaac
Three gap reports authored during PECO-3022 implementation and verification:

1. gap-report-PECO-3022-sea-telemetry-2026-05-14.md — initial post-implementation
   gap audit (G1-G3 critical wiring gaps + D1-D4 divergences)
2. gap-report-PECO-3022-sea-telemetry-e2e-2026-05-15.md — E2E test coverage gap
   (G4) and Thrift-vs-SEA comparator strategy
3. gap-report-PECO-3022-sea-telemetry-prod-findings-2026-05-15.md — bugs
   surfaced by comparing v1.1.4 SEA records vs Thrift in eng_lumberjack
   (B1-B7), plus follow-up resolution tracking

All gaps now closed except B5 (gap-fix workstream dependency) and B7
(consumer-team confirmation pending).

Co-authored-by: Isaac
…ect_results, ConnectTimeoutMilliseconds

B8 (uncommitted SDD output) + B8 finisher + B10 + B11.

Changes:
- SeaMetadataOperationMapper: maps metadata SqlQuery keywords (getcatalogs,
  getschemas, gettables, getcolumns, gettabletypes, getprimarykeys,
  getcrossreference) to OperationType.List* — mirrors Thrift's
  DatabricksStatement.GetMetadataOperationType.
- StatementExecutionStatement._pendingMetadataOperation + SetPendingMetadataOperation:
  per-statement override so the sub-statement's OnExecuteStarted emits
  (StatementType.Metadata, OperationType.List*) instead of
  (Query, ExecuteStatementAsync).
- StatementExecutionConnection.ExecuteMetadataSqlAsync(sql, OperationType, ct):
  new overload that stamps the sub-statement before execution. Connection-level
  IGetObjectsDataProvider methods + ExecuteShowColumnsAsync all pass the right
  OperationType through.
- StatementExecutionConnection.GetTableTypes: emits ListTableTypes telemetry
  directly via EmitStatementOperationTelemetry since the method is purely local
  and bypasses ExecuteMetadataSqlAsync entirely (B8 finisher).
- StatementExecutionConnection.OpenAsync: enableDirectResults now read from
  DatabricksParameters.EnableDirectResults property instead of hardcoded true,
  matching user intent in telemetry (B10).
- StatementExecutionConnection.CreateHttpClient: honors SparkParameters.ConnectTimeoutMilliseconds
  when explicitly set, falling back to CloudFetchTimeoutMinutes otherwise.
  Combined with the D1 fix that sources socket_timeout from HttpClient.Timeout,
  this makes user-supplied connect-timeout values reach the telemetry wire (B11).

Verified against the test workspace:
- Telemetry_GetTableTypes_EmitsListTableTypes — passes
- ConnectionParams_SocketTimeout_IsPopulated — passes (was: Expected 120 / Actual 900)
- ConnectionParams_EnableDirectResults_IsPopulated — passes
- 7/11 metadata tests pass (catalogs, columns, all statement-side); 3 remaining
  metadata tests (Schemas, Tables, AllDepths) hit TaskCanceledException on
  SHOW SCHEMAS/TABLES IN ALL CATALOGS — environmental (10s metadata budget on
  this warehouse), not a telemetry-wiring issue.

Co-authored-by: Isaac
Reflects the SDD-round B8/B9 closures plus the manual B8 finisher (GetTableTypes),
B10 (enable_direct_results), and B11 (ConnectTimeoutMilliseconds) fixes. Documents
that the remaining 3 metadata-test failures (Schemas, Tables, AllDepths) are
environmental — SHOW SCHEMAS/TABLES IN ALL CATALOGS times out against the 10s
metadata budget on the test warehouse — not telemetry-wiring issues.

Co-authored-by: Isaac
@jadewang-db jadewang-db force-pushed the stack/pr-phase5-sea-statement-telemetry branch from 8109047 to f7be471 Compare May 22, 2026 20:11
CurtHagenlocher pushed a commit to CurtHagenlocher/databricks that referenced this pull request May 24, 2026
)

## 🥞 Stacked PR
Use this
[link](https://github.com/adbc-drivers/databricks/pull/455/files) to
review incremental changes.
-
[**stack/PECO-3022-sea-telemetry-design**](adbc-drivers#455)
[[Files
changed](https://github.com/adbc-drivers/databricks/pull/455/files)]
-
[stack/pr-phase1-connection-telemetry-create-refactor](adbc-drivers#460)
[[Files
changed](https://github.com/adbc-drivers/databricks/pull/460/files/c0dfbed80fd09c91b9e2e5ed8050a268435618bd..d3190aeb6f2f1c727b359d0ef40d26be2280c73e)]
-
[stack/pr-phase2-observer-interface](adbc-drivers#461)
[[Files
changed](https://github.com/adbc-drivers/databricks/pull/461/files/b2340d9d32da68d5b81756e0a77008f05aadd45b..b0d9f02236bf7f99e93132cfe6ed4dd119fc1e73)]
-
[stack/pr-phase3-databricks-statement-observer](adbc-drivers#462)
[[Files
changed](https://github.com/adbc-drivers/databricks/pull/462/files/b0d9f02236bf7f99e93132cfe6ed4dd119fc1e73..fa799fcd0cc2209982eae2890e16e26854a0649e)]
-
[stack/pr-phase4-sea-connection-telemetry](adbc-drivers#463)
[[Files
changed](https://github.com/adbc-drivers/databricks/pull/463/files/f0e344aa4d9d341302ee9b3a5217a6794d8ba189..25b5e6eaf2f2f04941d07bbbe582845254630950)]
-
[stack/pr-phase5-sea-statement-telemetry](adbc-drivers#464)
[[Files
changed](https://github.com/adbc-drivers/databricks/pull/464/files/9dc278a249433000b03e2d3a4cdf7daa151caa69..8109047005a89804f243beb3a6c689f43b539506)]

---------
## Summary of the design

This design closes the gap where the SEA (REST) transport in the C# ADBC
Databricks driver emits **zero** client telemetry — no session events,
no per-statement operation events, no error events, no chunk metrics.
SEA traffic is currently invisible to the `eng_lumberjack`
driver-telemetry pipeline.

The design:
- Introduces a small **observer interface**
(`IStatementOperationObserver`) between statement classes and the
telemetry implementation. The interface is shaped around the statement's
lifecycle (`OnExecuteStarted`, `OnPollCompleted`, …) rather than
telemetry's data model. Contract: fail-open — implementations must not
throw.
- **Refactors `ConnectionTelemetry.Create`** to accept `string
sessionId` (dropping its Thrift `TSessionHandle` coupling) and a
`DriverMode mode` parameter. Both transports use the same factory.
- Wires observer callbacks at the SEA hookpoints in
`StatementExecutionConnection` and `StatementExecutionStatement`.
- **Reuses all existing telemetry infrastructure as-is**:
`TelemetryClient`, `TelemetryClientManager`,
`CircuitBreakerTelemetryExporter`, `DatabricksTelemetryExporter`,
`FeatureFlagCache`, `TelemetrySessionContext`,
`StatementTelemetryContext`.

## Key decisions and alternatives considered

- **Observer interface over static helper or instance emitter.** The
gap-fix doc proposed a static `TelemetryHelper`. This design supersedes
that with an observer-shaped interface because: (a) it gives one-line
callsites with no parameter threading, (b) it decouples statement
classes from telemetry types so future tracing/audit observers can plug
in without touching statement code, (c) it's trivially mockable for
tests. The fail-open contract pushes all try/catch into the
implementations exactly once — callsites stay clean.
- **Composition, not inheritance.** Considered (and rejected) three
inheritance variants: SEA-inherits-Thrift (drags in entire HiveServer2
chain — semantically wrong), shared base above both (blocked
structurally — `HiveServer2Connection`'s parent is in the Apache package
we don't own), and interface + extension methods (functionally identical
to the static helper). C# single inheritance + unowned Apache base
classes make pure inheritance impractical.
- **Decorator at AdbcConnection boundary rejected.** A wrapper around
the whole connection cannot see deep telemetry signals (chunk timing,
poll count, first-batch latency) — those live inside statement classes.
Wrong granularity.
- **Refactor `Create` signature rather than add overload.** Changes the
canonical `ConnectionTelemetry.Create` to take `string sessionId`.
Thrift converts at the call boundary
(`sessionHandle.SessionId.Guid.ToString()`). Eliminates the Thrift leak
from the telemetry API permanently.
- **Scoped strictly to SEA, plus `DriverMode.Sea` setter.** The
cross-cutting Thrift gaps (workspace_id, auth_type, metadata-ops
instrumentation, retry_count) are owned by the separate gap-fix
workstream and will land independently. The only cross-cutting change
pulled in here is unhooking the hardcoded `DriverMode.Types.Type.Thrift`
in `BuildDriverConnectionParams` — without it SEA records would be
silently mislabeled as Thrift.

## Areas needing specific review focus

1. **Observer interface shape & fail-open contract** (`§5.1`) — the
observer methods, naming, and the requirement that implementations never
throw. Specifically: are the 8 methods the right cut, or is finer
granularity (e.g. split `OnChunksDownloaded` from `OnConsumed`)
preferred?
2. **`ConnectionTelemetry.Create` signature change** (`§5.2`) — replaces
`TSessionHandle?` with `string sessionId` and adds a `DriverMode mode`
parameter. This touches a stable API used by the existing Thrift path;
the Thrift call site must convert at the boundary.
3. **Result-format mapping for SEA** (`§8`) — SEA does not expose a
typed `ResultFormat`. The mapping table from wire `disposition` ×
manifest state → proto `ExecutionResult.Format` is a judgment call;
please review.
4. **Chunk-metrics dependency on gap-fix** (`§9`, `§16`) — this design
assumes the `CloudFetchDownloader → ChunkMetrics →
CloudFetchReader.GetChunkMetrics()` plumbing from the gap-fix workstream
lands first or concurrently. If gap-fix is delayed, SEA ships with
`ChunkMetrics.Empty` and backfills later. Is that acceptable, or should
we sequence differently?
5. **Open questions** (`§17`): polling-granularity semantics for
`PollCount`, `is_internal_call` semantics for SEA `USE SCHEMA`, and
whether SEA's `operation_type` should always be
`EXECUTE_STATEMENT_ASYNC` or map to sync-emulated variants.

## Related

- Builds on the architecture in
[`csharp/doc/telemetry-design.md`](../blob/stack/PECO-3022-sea-telemetry-design/csharp/doc/telemetry-design.md)
- Supersedes the `TelemetryHelper` static-helper proposal in
[`docs/designs/fix-telemetry-gaps-design.md`](../blob/stack/PECO-3022-sea-telemetry-design/docs/designs/fix-telemetry-gaps-design.md)
for the new SEA code; Thrift-side gap-fix work continues independently
- Jira: [PECO-3022](https://databricks.atlassian.net/browse/PECO-3022)

This pull request and its description were written by Isaac.

---------

Co-authored-by: Jade Wang <jade.wang+data@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant