[SPARK-51590][SQL] Disable TIME in builtin file-based datasources #50358

MaxGekk · 2025-03-23T16:24:40Z

What changes were proposed in this pull request?

In the PR, I propose to modify the supportDataType() method of the built-in file-based datasources and to disable the new data type TIME until it is fully supported in such datasources. In particular, TIME is disabled in the following V1 and V2 datasources:

Text
JSON/CSV
XML
Parquet
ORC: native and Hive
Avro
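
To illustrate the idea, here is a minimal Python sketch of such a check. It is not Spark's code: the type classes, `supports_data_type`, `check_schema`, and `UnsupportedDataTypeError` are hypothetical names for a toy model, and it assumes the check recurses into nested types so that a TIME value inside an array or struct is rejected as well. The message format mirrors the error quoted in the user-facing section of this PR.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DataType:
    """Base class of a toy type model (hypothetical, not Spark's classes)."""


@dataclass(frozen=True)
class IntType(DataType):
    pass


@dataclass(frozen=True)
class TimeType(DataType):
    precision: int = 6


@dataclass(frozen=True)
class ArrayType(DataType):
    element: DataType


@dataclass(frozen=True)
class StructType(DataType):
    fields: Tuple[Tuple[str, DataType], ...]


class UnsupportedDataTypeError(Exception):
    """Stand-in for the analysis-time error (hypothetical name)."""


def type_name(dt: DataType) -> str:
    """Render a type roughly the way it would appear in an error message."""
    if isinstance(dt, TimeType):
        return f"TIME({dt.precision})"
    if isinstance(dt, ArrayType):
        return f"ARRAY<{type_name(dt.element)}>"
    if isinstance(dt, StructType):
        inner = ", ".join(f"{n}: {type_name(t)}" for n, t in dt.fields)
        return f"STRUCT<{inner}>"
    return "INT"


def supports_data_type(dt: DataType) -> bool:
    """Return False for TIME, including TIME nested in arrays or structs."""
    if isinstance(dt, TimeType):
        return False
    if isinstance(dt, ArrayType):
        return supports_data_type(dt.element)
    if isinstance(dt, StructType):
        return all(supports_data_type(t) for _, t in dt.fields)
    return True


def check_schema(source: str, schema: StructType) -> None:
    """Fail early with a readable message instead of a library stack trace."""
    for name, dt in schema.fields:
        if not supports_data_type(dt):
            raise UnsupportedDataTypeError(
                f"[UNSUPPORTED_DATA_TYPE_FOR_DATASOURCE] The {source} datasource doesn't "
                f'support the column `{name}` of the type "{type_name(dt)}".'
            )
```

Calling `check_schema("ORC", StructType((("t", TimeType(6)),)))` raises the error before any ORC code runs, which is the point of rejecting the type at the schema-validation step.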

Why are the changes needed?

If the unsupported data type TIME is not disabled in these datasources, users might get obscure exceptions from the underlying libraries, for example:

ORC:

java.lang.IllegalArgumentException: Can't parse category at 'time^(6)'
	at org.apache.orc.impl.ParserUtils.parseCategory(ParserUtils.java:67)

Avro:

org.apache.spark.sql.avro.IncompatibleSchemaException: Unexpected type TimeType(6).
	at org.apache.spark.sql.avro.SchemaConverters$.toAvroType(SchemaConverters.scala:369)

Does this PR introduce any user-facing change?

Yes. After the changes, users get a more user-friendly AnalysisException:

[UNSUPPORTED_DATA_TYPE_FOR_DATASOURCE] The ORC datasource doesn't support the column `t` of the type "TIME(6)". SQLSTATE: 0A000

How was this patch tested?

By running the modified test suites:

$ build/sbt "test:testOnly *FileBasedDataSourceSuite"
$ build/sbt -Phive "test:testOnly *HiveOrcQuerySuite"

Was this patch authored or co-authored using generative AI tooling?

No.

HyukjinKwon · 2025-03-24T02:00:48Z

Merged to master.

Closes apache#50358 from MaxGekk/time-off-fs-ds-2.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>

### What changes were proposed in this pull request?

In the PR, I propose to support the new data type `TIME` in the Parquet datasource, and store it as the logical type:

```
TimeType(isAdjustedToUTC = false, unit = MICROS)
```

See https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#time, where

- `isAdjustedToUTC` = **false** means the `TIME` values represent a local time, regardless of the local time zone in effect.
- `unit` = **MICROS** means the values store the number of microseconds after midnight.

At the reader side, recognize `int64` values of the logical type `TimeType(isAdjustedToUTC = false, unit = MICROS)` and convert them to Spark's type `TIME(6)`.

### Why are the changes needed?

To allow Spark SQL users to read Parquet files with `TIME` values created by other systems.

### Does this PR introduce _any_ user-facing change?

Yes. Before the changes, writing a Dataset to Parquet or reading one from Parquet failed with the error `UNSUPPORTED_DATA_TYPE_FOR_DATASOURCE` because PR #50358 disabled `TIME` in the Parquet datasource.

### How was this patch tested?

By running new tests:

```
$ build/sbt "test:testOnly *ParquetSchemaSuite"
$ build/sbt "test:testOnly *ParquetIOSuite"
$ build/sbt "test:testOnly *FileBasedDataSourceSuite"
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #50389 from MaxGekk/time-parquet.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
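
The `unit = MICROS` encoding mentioned above represents a local time of day as an `int64` count of microseconds after midnight. The following Python sketch shows the conversion in both directions. It only illustrates the arithmetic: `time_to_micros` and `micros_to_time` are hypothetical helper names, not Spark or Parquet APIs, and the range check is an assumption about what a valid value looks like.

```python
from datetime import time

MICROS_PER_SECOND = 1_000_000
MICROS_PER_DAY = 24 * 60 * 60 * MICROS_PER_SECOND


def time_to_micros(t: time) -> int:
    """Encode a local time of day as microseconds after midnight."""
    if t.tzinfo is not None:
        # isAdjustedToUTC = false: the value carries no time zone.
        raise ValueError("expected a local time without a time zone")
    seconds = (t.hour * 60 + t.minute) * 60 + t.second
    return seconds * MICROS_PER_SECOND + t.microsecond


def micros_to_time(micros: int) -> time:
    """Decode microseconds after midnight back into a local time of day."""
    if not 0 <= micros < MICROS_PER_DAY:
        raise ValueError(f"value out of range for a time of day: {micros}")
    seconds, microsecond = divmod(micros, MICROS_PER_SECOND)
    minutes, second = divmod(seconds, 60)
    hour, minute = divmod(minutes, 60)
    return time(hour, minute, second, microsecond)
```

For example, `time_to_micros(time(12, 34, 56, 789012))` gives `45296789012`, and `micros_to_time` maps that value back to the same time of day.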

MaxGekk added 4 commits March 23, 2025 19:06

Disable TIME in built-in datasources

c163ee8

Check Hive ORC

16c322d

Test in Avro

4856966

SPARK-XXXXX -> SPARK-51590

43d965a

github-actions bot added SQL AVRO labels Mar 23, 2025

MaxGekk changed the title ~~[WIP][SPARK-51590][SQL] Disable TIME in builtin file-based datasources~~ [SPARK-51590][SQL] Disable TIME in builtin file-based datasources Mar 23, 2025

MaxGekk marked this pull request as ready for review March 23, 2025 18:39

MaxGekk requested review from gengliangwang, yaooqinn, dongjoon-hyun and HyukjinKwon March 23, 2025 19:46

HyukjinKwon approved these changes Mar 24, 2025

HyukjinKwon closed this in 1ce5380 Mar 24, 2025

MaxGekk mentioned this pull request Mar 26, 2025

[SPARK-51610][SQL] Support the TIME data type in the parquet datasource #50389

Closed
