Severity: High
Observation
Under load, ReadNextRecordBatchAsync dominates trace volume and TraceId cardinality. In a sweep of all 33 E2E test classes with OTEL_TRACES_EXPORTER=adbcfile:
- MemoryStressTests (3 tests against 5M-row queries): 24,632 ReadNextRecordBatchAsync spans / 18.5MB / 52.5% of bytes, plus 12,033 DatabricksReader.ProcessFetchedBatches spans / 15.5MB / 44.1%. Together these two span types account for 96.6% of bytes for an extract workload.
- 12,316 of 12,958 distinct TraceIds (95%) are single-span root ReadNextRecordBatchAsync traces, i.e. each "pull next batch" call starts a fresh top-level trace.
- Reproducible across every CloudFetch / extract test: CloudFetchE2ETest has 1,580 ReadNext spans (790 of them roots); CloudFetchStressTests has 2,692 ReadNext spans (1,346 of them roots).
This is a real APM ingest-cost issue (volume) AND a backend cardinality issue (throwaway TraceIds clutter trace-list views).
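The ratios behind these numbers can be checked with a few lines of arithmetic. The sketch below uses only the counts quoted in this report; for MemoryStressTests it treats the 12,316 single-span root traces as the root count, which is an assumption (multi-span root traces, if any, are not counted here).

```python
# Counts copied from this report: (ReadNext spans, ReadNext roots).
# For MemoryStressTests the "roots" figure is the number of single-span
# root ReadNextRecordBatchAsync traces (assumption: used as the root count).
runs = {
    "MemoryStressTests": (24632, 12316),
    "CloudFetchE2ETest": (1580, 790),
    "CloudFetchStressTests": (2692, 1346),
}

# Share of ReadNext spans that are trace roots, per test class.
root_share = {name: roots / spans for name, (spans, roots) in runs.items()}

# Share of all distinct TraceIds in MemoryStressTests that are
# single-span ReadNext root traces.
single_span_trace_share = 12316 / 12958

for name, share in root_share.items():
    print(f"{name}: {share:.1%} of ReadNext spans are roots")
print(f"single-span ReadNext traces: {single_span_trace_share:.1%} of TraceIds")
```

In all three runs the root count is exactly half the ReadNext span count, so the pattern looks systematic rather than load-dependent. The report does not establish why the other half are non-root.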
Workaround that exists today
The events on DatabricksReader.ProcessFetchedBatches (decompress_start, decompress_completed, deserialize_batch) DO carry per-batch metrics. So users can filter by event name and get per-batch detail without the span volume.
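A minimal sketch of that filtering step, assuming the spans have already been loaded into dictionaries. The `name` and `events` keys and the `rows` attribute are hypothetical placeholders; the file exporter's actual on-disk layout is not described in this report and may differ.

```python
from typing import Any, Dict, Iterable, Iterator

# Event names taken from this report.
WANTED_EVENTS = {"decompress_start", "decompress_completed", "deserialize_batch"}
PARENT_SPAN = "DatabricksReader.ProcessFetchedBatches"


def per_batch_events(spans: Iterable[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Yield the per-batch events attached to ProcessFetchedBatches spans.

    The "name" / "events" keys are hypothetical; adapt them to the real schema.
    """
    for span in spans:
        if span.get("name") != PARENT_SPAN:
            continue
        for event in span.get("events", []):
            if event.get("name") in WANTED_EVENTS:
                yield event


# Hypothetical sample data, for illustration only.
sample_spans = [
    {"name": PARENT_SPAN, "events": [
        {"name": "decompress_start"},
        {"name": "decompress_completed", "rows": 1000},
        {"name": "unrelated_event"},
    ]},
    {"name": "ReadNextRecordBatchAsync", "events": [{"name": "deserialize_batch"}]},
]

selected = list(per_batch_events(sample_spans))
print([e["name"] for e in selected])
```

Only events on the ProcessFetchedBatches span are returned, so the ReadNextRecordBatchAsync spans can be dropped from ingestion without losing the per-batch detail.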
Suggested fix
Pick one (or combine):
- Aggregate: emit one span per N batches (e.g. N=100) with summed metrics instead of one per batch.
- Demote ReadNextRecordBatchAsync to an event on the parent statement span, so it doesn't create its own activity.
- At minimum, chain ReadNextRecordBatchAsync to the parent statement TraceId so the 12K roots collapse to 1 root per statement. (Partial overlap with the "no driver-session root" issue.)
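The "aggregate" option could look roughly like the sketch below. It is a language-neutral illustration written in Python, not driver code: `BatchSpanAggregator`, `record_batch`, and the metric names are hypothetical, and `emit` stands in for whatever would create the summary span.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class BatchSpanAggregator:
    """Sum per-batch metrics and emit one summary record per `flush_every` batches."""

    flush_every: int                        # N batches per summary span (e.g. 100)
    emit: Callable[[Dict[str, int]], None]  # stand-in for "create one span"
    _count: int = 0
    _rows: int = 0
    _bytes: int = 0

    def record_batch(self, rows: int, size_bytes: int) -> None:
        """Accumulate one batch; flush once N batches have been collected."""
        self._count += 1
        self._rows += rows
        self._bytes += size_bytes
        if self._count >= self.flush_every:
            self.flush()

    def flush(self) -> None:
        """Emit the pending summary, if any. Call once more when the reader closes."""
        if self._count == 0:
            return
        self.emit({"batch_count": self._count, "rows": self._rows, "bytes": self._bytes})
        self._count = self._rows = self._bytes = 0


emitted: List[Dict[str, int]] = []
aggregator = BatchSpanAggregator(flush_every=100, emit=emitted.append)
for _ in range(250):
    aggregator.record_batch(rows=1000, size_bytes=4096)
aggregator.flush()  # emit the trailing partial group of 50 batches

print(len(emitted))  # 3 summary records instead of 250 per-batch spans
```

With N=100, 250 batches produce three records (100, 100, 50) whose summed metrics equal the per-batch totals. The final `flush()` matters: without it the trailing partial group would be lost.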
Evidence
Traces.MemoryStressTests.20260527_154049/ (35MB / 37,950 spans / 12,958 TraceIds — ReadNext + ProcessFetchedBatches = 96.6% of bytes)
Traces.CloudFetchE2ETest.20260527_144809/ (1,580 ReadNext spans, 790 as roots)
Traces.CloudFetchStressTests.20260527_154205/ (2,692 ReadNext, 1,346 roots)
Found during a tracer-output bugbash of the file exporter; full evaluation report available on request.