[SPARK-51590][SQL] Disable TIME in builtin file-based datasources #50358

Closed

@MaxGekk MaxGekk commented Mar 23, 2025

What changes were proposed in this pull request?

In this PR, I propose to modify the supportDataType() method of the built-in file-based datasources so that the new data type TIME is disabled until it is fully supported in them. In particular, TIME is disabled in the following V1 and V2 datasources:

  • Text
  • JSON/CSV
  • XML
  • Parquet
  • ORC: native and Hive
  • Avro
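
As a rough illustration of the idea (not the actual patch), the sketch below models a `supportDataType()`-style check that rejects TIME at any nesting depth. The type classes here are toy stand-ins defined for this example, not Spark's `org.apache.spark.sql.types` classes, and `ToyOrcFormat` and `firstUnsupported` are hypothetical names.

```scala
// Toy stand-ins for a SQL type hierarchy. These are assumptions made for
// this sketch and are NOT the real org.apache.spark.sql.types classes.
sealed trait DataType
case object IntegerType extends DataType
case object StringType extends DataType
final case class TimeType(precision: Int) extends DataType
final case class ArrayType(elementType: DataType) extends DataType
final case class StructField(name: String, dataType: DataType)
final case class StructType(fields: Seq[StructField]) extends DataType

// Hypothetical datasource that does not support TIME yet.
object ToyOrcFormat {
  // Returns false for TIME, including TIME nested inside arrays or structs.
  def supportDataType(dataType: DataType): Boolean = dataType match {
    case _: TimeType            => false
    case ArrayType(elementType) => supportDataType(elementType)
    case StructType(fields)     => fields.forall(f => supportDataType(f.dataType))
    case _                      => true
  }

  // Hypothetical helper: the first top-level column the datasource rejects,
  // which is the column an analysis-time error could name.
  def firstUnsupported(schema: StructType): Option[StructField] =
    schema.fields.find(f => !supportDataType(f.dataType))
}
```

With a schema such as `StructType(Seq(StructField("id", IntegerType), StructField("t", TimeType(6))))`, `firstUnsupported` returns the field `t`, so the failure can be reported before any external library sees the schema.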

Why are the changes needed?

Without the changes, the unsupported data type TIME is passed through to the datasources, and users might get exceptions from external libraries such as:

ORC:

java.lang.IllegalArgumentException: Can't parse category at 'time^(6)'
	at org.apache.orc.impl.ParserUtils.parseCategory(ParserUtils.java:67)

Avro:

org.apache.spark.sql.avro.IncompatibleSchemaException: Unexpected type TimeType(6).
	at org.apache.spark.sql.avro.SchemaConverters$.toAvroType(SchemaConverters.scala:369)

Does this PR introduce any user-facing change?

Yes. After the changes, users get a friendlier AnalysisException:

[UNSUPPORTED_DATA_TYPE_FOR_DATASOURCE] The ORC datasource doesn't support the column `t` of the type "TIME(6)". SQLSTATE: 0A000

How was this patch tested?

By running the modified test suites:

$ build/sbt "test:testOnly *FileBasedDataSourceSuite"
$ build/sbt -Phive "test:testOnly *HiveOrcQuerySuite"

Was this patch authored or co-authored using generative AI tooling?

No.

@MaxGekk MaxGekk changed the title [WIP][SPARK-51590][SQL] Disable TIME in builtin file-based datasources [SPARK-51590][SQL] Disable TIME in builtin file-based datasources Mar 23, 2025
@MaxGekk MaxGekk marked this pull request as ready for review March 23, 2025 18:39
@HyukjinKwon
Member

Merged to master.

SauronShepherd pushed a commit to SauronShepherd/spark that referenced this pull request Mar 25, 2025

Closes apache#50358 from MaxGekk/time-off-fs-ds-2.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
MaxGekk added a commit that referenced this pull request Mar 26, 2025
### What changes were proposed in this pull request?
In the PR, I propose to support the new data type `TIME` in the Parquet datasource, and store it as the logical type:
```
TimeType(isAdjustedToUTC = false, unit = MICROS)
```
see https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#time where
- `isAdjustedToUTC` = **false** means the `TIME` values represent a local time, regardless of the local time zone in effect.
- `unit` = **MICROS** - stores the number of microseconds after midnight.

On the reader side, recognize `int64` values of the logical type `TimeType(isAdjustedToUTC = false, unit = MICROS)` and convert them to Spark's type `TIME(6)`.
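
The `int64` encoding described above (microseconds after midnight, local time) can be illustrated with plain `java.time`. This is a minimal sketch under that assumption only; `TimeMicros`, `toMicros` and `fromMicros` are hypothetical names and are not the conversion code used in Spark's Parquet reader or writer.

```scala
import java.time.LocalTime

// Hypothetical helper showing the MICROS encoding of a local time of day.
object TimeMicros {
  private val MicrosPerDay: Long = 24L * 60 * 60 * 1000 * 1000

  // Local time -> microseconds after midnight.
  // Any sub-microsecond part of the nanosecond field is truncated.
  def toMicros(t: LocalTime): Long = t.toNanoOfDay / 1000L

  // Microseconds after midnight -> local time. No time zone is involved,
  // which matches isAdjustedToUTC = false.
  def fromMicros(micros: Long): LocalTime = {
    require(micros >= 0 && micros < MicrosPerDay, s"out of range: $micros")
    LocalTime.ofNanoOfDay(micros * 1000L)
  }
}
```

For example, `TimeMicros.toMicros(LocalTime.of(12, 34, 56, 789000000))` gives `45296789000L`, and `TimeMicros.fromMicros` maps that value back to `12:34:56.789`.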

### Why are the changes needed?
To allow Spark SQL users to read Parquet files with `TIME` values created by other systems.

### Does this PR introduce _any_ user-facing change?
Yes. Before the changes, reading a Dataset from Parquet and writing a Dataset to Parquet failed with the error `UNSUPPORTED_DATA_TYPE_FOR_DATASOURCE` because PR #50358 disabled `TIME` in the Parquet datasource.

### How was this patch tested?
By running new tests:
```
$ build/sbt "test:testOnly *ParquetSchemaSuite"
$ build/sbt "test:testOnly *ParquetIOSuite"
$ build/sbt "test:testOnly *FileBasedDataSourceSuite"
```

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #50389 from MaxGekk/time-parquet.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Max Gekk <[email protected]>