feat(csharp): apply adbc.spark.data_type_conv on SEA path (PECO-3060) [SEA]#469

Open
eric-wang-1990 wants to merge 1 commit into main from feat/csharp/PECO-3060-sea-data-type-conv

@eric-wang-1990

Collaborator

What's Changed

PECO-3060 — M3 remaining parameter gap. The Statement Execution (SEA/REST) path
now honours `adbc.spark.data_type_conv`, matching the Thrift path's
`HiveServer2SchemaParser.GetArrowType` semantics:

  • `scalar` (default): DATE → Date32, DECIMAL → Decimal128, TIMESTAMP → Timestamp,
    FLOAT → Float — native Arrow types, identical to today's SEA behaviour.
  • `none`: DATE / DECIMAL / TIMESTAMP → String; FLOAT → Double. The
    conversion-sensitive scalars surface as strings (FLOAT is widened to double),
    so SEA output matches Thrift output.
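The two modes above can be sketched as a small lookup. This is an illustrative sketch only, not the driver's code: `ConvMode`, `TypeMappingSketch`, and `MapPrimitive` are hypothetical names, and Arrow types are represented by their names as plain strings so the example has no Arrow dependency.

```csharp
using System;

// Hypothetical stand-in for the driver's DataTypeConversion enum.
public enum ConvMode { None, Scalar }

public static class TypeMappingSketch
{
    // Returns the name of the Arrow type that a SQL primitive surfaces as
    // under the given mode, following the mapping listed in this PR.
    public static string MapPrimitive(string sqlType, ConvMode mode)
    {
        bool native = mode == ConvMode.Scalar;
        switch (sqlType.ToUpperInvariant())
        {
            case "DATE":      return native ? "Date32"     : "String";
            case "DECIMAL":   return native ? "Decimal128" : "String";
            case "TIMESTAMP": return native ? "Timestamp"  : "String";
            case "FLOAT":     return native ? "Float"      : "Double";
            default:
                // Only the four conversion-sensitive scalars are modelled here.
                throw new ArgumentException("Not a conversion-sensitive type: " + sqlType);
        }
    }
}
```

Only the four affected scalars are modelled; every other type is assumed to map the same way in both modes.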

`StatementExecutionConnection` now parses `SparkParameters.DataTypeConv` via
`DataTypeConversionParser` (same precedence as `SparkHttpConnection`) and
exposes the parsed `DataTypeConversion` enum to the statement.
`ArrowTypeParser.MapPrimitiveType` branches on the flag for the four affected
scalars; a new `ScalarConversionStream` wrapper — layered after
`IntervalSerializingStream` / `ComplexTypeSerializingStream` only when the mode is
`none` — converts the native Date32 / Timestamp / Decimal128 / Float arrays into
matching `StringArray` / `DoubleArray` so the manifest schema and batch data agree.
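For reviewers, the per-value work that a wrapper like `ScalarConversionStream` has to do can be sketched as follows. This is a rough sketch under stated assumptions, not the code in this PR: the helper names are hypothetical, the string formats are guesses rather than the formats the Thrift path emits, and timestamps are assumed to be microseconds since the Unix epoch. Plain .NET values stand in for Arrow array elements.

```csharp
using System;
using System.Globalization;

public static class ScalarConversionSketch
{
    private static readonly DateTime Epoch =
        new DateTime(1970, 1, 1, 0, 0, 0, DateTimeKind.Utc);

    // Date32 stores days since the Unix epoch; render as an ISO date (assumed format).
    public static string Date32ToString(int daysSinceEpoch) =>
        Epoch.AddDays(daysSinceEpoch).ToString("yyyy-MM-dd", CultureInfo.InvariantCulture);

    // Assumes microsecond precision; one microsecond is 10 .NET ticks.
    public static string TimestampMicrosToString(long micros) =>
        Epoch.AddTicks(micros * 10)
             .ToString("yyyy-MM-dd HH:mm:ss.ffffff", CultureInfo.InvariantCulture);

    // Invariant culture keeps '.' as the decimal separator; System.Decimal
    // retains the scale of the value, including trailing zeros.
    public static string DecimalToString(decimal value) =>
        value.ToString(CultureInfo.InvariantCulture);

    // Widening keeps the float's exact binary value; it does not re-round
    // to the nearest double of the original decimal literal.
    public static double FloatToDouble(float value) => value;
}
```

A real implementation would apply conversions like these to each element of a record batch and rebuild the batch with the `none`-mode schema, so that the schema and the data agree.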

Why

PECO-3060 — Jun 10 cutoff per the M3 plan. Until now, the SEA path
unconditionally returned native typed columns. Users setting
`adbc.spark.data_type_conv=none` (or its alias `adbc.hive.data_type_conv=none`)
got Date32/Decimal128/Timestamp from SEA but String from Thrift — a protocol-visible
behaviour difference. This change makes the two paths behaviourally identical.

Red → Green proof

Before fix (`ExecuteQuery_DataTypeConv_None_SerializesScalarTypesToStrings`,
SEA + `data_type_conv=none`):

```
Assert.Equal() Failure: Values differ
Expected: String
Actual: Date32
```

After fix:

```
Passed AdbcDrivers.Databricks.Tests.E2E.StatementExecution.StatementExecutionDriverE2ETests.ExecuteQuery_DataTypeConv_None_SerializesScalarTypesToStrings [1 s]
Passed AdbcDrivers.Databricks.Tests.E2E.StatementExecution.StatementExecutionDriverE2ETests.ExecuteQuery_DataTypeConv_Scalar_KeepsNativeTypes [1 s]
```

Full `StatementExecutionDriverE2ETests` suite (13 tests) passes — no regressions in
adjacent SEA tests.

Files touched

  • `csharp/src/ArrowTypeParser.cs` — primitive type mapping consults the new flag.
  • `csharp/src/ScalarConversionStream.cs` — new wrapper that converts native arrays
    when `none`. Same detection pattern (`Spark:DataType:SqlName` metadata) as the
    existing Interval/ComplexType wrappers.
  • `csharp/src/StatementExecution/StatementExecutionConnection.cs` — parses and
    exposes `DataTypeConversion`.
  • `csharp/src/StatementExecution/StatementExecutionStatement.cs` — threads the flag
    through the manifest schema mapping and the reader pipeline.
  • `csharp/test/E2E/StatementExecution/StatementExecutionDriverE2ETests.cs` —
    `SkippableFact` red→green coverage for both `none` and `scalar` modes.

Manual verification

  • `dotnet build` green on `netstandard2.0` and `net8.0`.
  • New E2E tests pass against `pecotesting` warehouse.
  • `StatementExecutionDriverE2ETests` class passes (13/13).
  • Default behaviour (`scalar`) unchanged — existing SEA tests untouched.

… [SEA]

Honor data_type_conv on the Statement Execution (REST/SEA) path so it matches
the Thrift path's HiveServer2SchemaParser semantics:

- scalar (default): DATE -> Date32, DECIMAL -> Decimal128, TIMESTAMP -> Timestamp,
  FLOAT -> Float (native Arrow types, unchanged from current behaviour).
- none: DATE / DECIMAL / TIMESTAMP -> String; FLOAT -> Double (widened).

StatementExecutionConnection now parses SparkParameters.DataTypeConv via
DataTypeConversionParser (same precedence as SparkHttpConnection) and exposes
the resulting DataTypeConversion to StatementExecutionStatement. The schema
mapping in ArrowTypeParser.MapPrimitiveType branches on the flag for the four
conversion-sensitive scalars; a new ScalarConversionStream wrapper, layered
between IntervalSerializingStream and ComplexTypeSerializingStream when the
mode is none, converts the native Date32/Timestamp/Decimal128/Float arrays
into matching StringArray / DoubleArray so the schema and batch data agree.

E2E coverage: ExecuteQuery_DataTypeConv_None_SerializesScalarTypesToStrings
proves DATE/TIMESTAMP/DECIMAL columns surface as StringType under
adbc.spark.data_type_conv=none; ExecuteQuery_DataTypeConv_Scalar_KeepsNativeTypes
guards the default mode.

PECO-3060