[SPARK-51680][SQL] Set the logical type for TIME in the parquet writer #50476

Closed
wants to merge 2 commits into master from MaxGekk/parquet-time-nested

Conversation

MaxGekk (Member) commented Apr 1, 2025

What changes were proposed in this pull request?

In the PR, I propose to modify the Parquet schema converter for the TIME data type, and convert Catalyst's `TimeType(n)` to the Parquet physical type `INT64` annotated with the logical type:

```
TimeType(isAdjustedToUTC = false, unit = MICROS)
```

in the parquet writer.
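For context, this logical type annotation tells readers that each `INT64` value is a count of microseconds since midnight. The following is a minimal, hedged sketch of that encoding in Python (illustrative only; it is not Spark or parquet-java code, and the helper names are made up):

```python
from datetime import time

MICROS_PER_SECOND = 1_000_000

def time_to_micros(t: time) -> int:
    """Encode a time-of-day as INT64 microseconds since midnight,
    matching the layout described by TIME(isAdjustedToUTC=false, unit=MICROS)."""
    seconds = t.hour * 3600 + t.minute * 60 + t.second
    return seconds * MICROS_PER_SECOND + t.microsecond

def micros_to_time(micros: int) -> time:
    """Decode INT64 microseconds since midnight back into a time-of-day."""
    seconds, microsecond = divmod(micros, MICROS_PER_SECOND)
    minutes, second = divmod(seconds, 60)
    hour, minute = divmod(minutes, 60)
    return time(hour, minute, second, microsecond)

# Round trip: 12:34:56.789012 <-> 45,296,789,012 microseconds
m = time_to_micros(time(12, 34, 56, 789012))
assert m == 45_296_789_012
assert micros_to_time(m) == time(12, 34, 56, 789012)
```

Without the annotation, the stored column is a bare `optional int64`, which is why a reader that relies on the logical type cannot map it back to `TIME(6)`.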

Why are the changes needed?

To fix a failure of the non-vectorized reader. The code below illustrates the issue:

```scala
scala> spark.conf.set("spark.sql.parquet.enableVectorizedReader", false)
scala> spark.read.parquet("/Users/maxim.gekk/tmp/time_parquet3").show()

org.apache.spark.SparkRuntimeException: [PARQUET_CONVERSION_FAILURE.UNSUPPORTED] Unable to create a Parquet converter for the data type "TIME(6)" whose Parquet type is optional int64 col. Please modify the conversion making sure it is supported. SQLSTATE: 42846
  at org.apache.spark.sql.errors.QueryExecutionErrors$.cannotCreateParquetConverterForDataTypeError(QueryExecutionErrors.scala:2000)
```

Does this PR introduce any user-facing change?

No.

How was this patch tested?

By running the modified tests:

```
$ build/sbt "test:testOnly *ParquetFileFormatV1Suite"
$ build/sbt "test:testOnly *ParquetFileFormatV2Suite"
```

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Apr 1, 2025
@MaxGekk MaxGekk changed the title [WIP][SQL] Fix non-vectorized parquet reader for the TIME data type [WIP][SPARK-51680][SQL] Set the logical type for the TIME type in the parquet writer Apr 1, 2025
@MaxGekk MaxGekk changed the title [WIP][SPARK-51680][SQL] Set the logical type for the TIME type in the parquet writer [SPARK-51680][SQL] Set the logical type for the TIME type in the parquet writer Apr 1, 2025
@MaxGekk MaxGekk marked this pull request as ready for review April 1, 2025 14:58
@MaxGekk MaxGekk changed the title [SPARK-51680][SQL] Set the logical type for the TIME type in the parquet writer [SPARK-51680][SQL] Set the logical type for TIME in the parquet writer Apr 1, 2025
```scala
withNestedDataFrame(data).foreach { case (newDF, colName, _) =>
  withTempPath { dir =>
    newDF.write.parquet(dir.getCanonicalPath)
    Seq(false, true).foreach { vectorizedReaderEnabled =>
```
Thank you for improving this test coverage.

@dongjoon-hyun (Member) left a comment


+1, LGTM. Thank you, @MaxGekk .

MaxGekk (Member, Author) commented Apr 1, 2025

Merging to master. Thank you, @dongjoon-hyun, for the review.

@MaxGekk MaxGekk closed this in 921eba8 Apr 1, 2025
senthh pushed a commit to senthh/spark-1 that referenced this pull request Apr 2, 2025
Closes apache#50476 from MaxGekk/parquet-time-nested.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Max Gekk <[email protected]>