[SPARK-51680][SQL] Set the logical type for TIME in the parquet writer #50476

Closed
wants to merge 2 commits into master from MaxGekk/parquet-time-nested

Conversation

MaxGekk (Member) commented Apr 1, 2025

What changes were proposed in this pull request?

In the PR, I propose to modify the Parquet schema converter for the TIME data type, and convert Catalyst's `TimeType(n)` to the Parquet physical type `INT64` annotated with the logical type:

```
TimeType(isAdjustedToUTC = false, unit = MICROS)
```

in the parquet writer.
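For context, this logical type annotation tells readers that each `INT64` value is a count of microseconds since midnight. The following is a minimal, hedged sketch of that encoding in Python (illustrative only; it is not Spark or parquet-java code, and the helper names are made up):

```python
from datetime import time

MICROS_PER_SECOND = 1_000_000

def time_to_micros(t: time) -> int:
    """Encode a time-of-day as INT64 microseconds since midnight,
    matching the layout described by TIME(isAdjustedToUTC=false, unit=MICROS)."""
    seconds = t.hour * 3600 + t.minute * 60 + t.second
    return seconds * MICROS_PER_SECOND + t.microsecond

def micros_to_time(micros: int) -> time:
    """Decode INT64 microseconds since midnight back into a time-of-day."""
    seconds, microsecond = divmod(micros, MICROS_PER_SECOND)
    minutes, second = divmod(seconds, 60)
    hour, minute = divmod(minutes, 60)
    return time(hour, minute, second, microsecond)

# Round trip: 12:34:56.789012 <-> 45,296,789,012 microseconds
m = time_to_micros(time(12, 34, 56, 789012))
assert m == 45_296_789_012
assert micros_to_time(m) == time(12, 34, 56, 789012)
```

Without the annotation, the stored column is a bare `optional int64`, which is why a reader that relies on the logical type cannot map it back to `TIME(6)`.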

Why are the changes needed?

To fix a failure of the non-vectorized reader. The code below illustrates the issue:

```scala
scala> spark.conf.set("spark.sql.parquet.enableVectorizedReader", false)
scala> spark.read.parquet("/Users/maxim.gekk/tmp/time_parquet3").show()

org.apache.spark.SparkRuntimeException: [PARQUET_CONVERSION_FAILURE.UNSUPPORTED] Unable to create a Parquet converter for the data type "TIME(6)" whose Parquet type is optional int64 col. Please modify the conversion making sure it is supported. SQLSTATE: 42846
  at org.apache.spark.sql.errors.QueryExecutionErrors$.cannotCreateParquetConverterForDataTypeError(QueryExecutionErrors.scala:2000)
```

Does this PR introduce any user-facing change?

No.

How was this patch tested?

By running the modified tests:

```
$ build/sbt "test:testOnly *ParquetFileFormatV1Suite"
$ build/sbt "test:testOnly *ParquetFileFormatV2Suite"
```

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Apr 1, 2025
@MaxGekk MaxGekk changed the title [WIP][SQL] Fix non-vectorized parquet reader for the TIME data type [WIP][SPARK-51680][SQL] Set the logical type for the TIME type in the parquet writer Apr 1, 2025
@MaxGekk MaxGekk changed the title [WIP][SPARK-51680][SQL] Set the logical type for the TIME type in the parquet writer [SPARK-51680][SQL] Set the logical type for the TIME type in the parquet writer Apr 1, 2025
@MaxGekk MaxGekk marked this pull request as ready for review April 1, 2025 14:58
@MaxGekk MaxGekk changed the title [SPARK-51680][SQL] Set the logical type for the TIME type in the parquet writer [SPARK-51680][SQL] Set the logical type for TIME in the parquet writer Apr 1, 2025
```scala
withNestedDataFrame(data).foreach { case (newDF, colName, _) =>
  withTempPath { dir =>
    newDF.write.parquet(dir.getCanonicalPath)
    Seq(false, true).foreach { vectorizedReaderEnabled =>
```
Thank you for improving this test coverage.

@dongjoon-hyun (Member) left a comment


+1, LGTM. Thank you, @MaxGekk .

MaxGekk (Member, Author) commented Apr 1, 2025

Merging to master. Thank you, @dongjoon-hyun, for the review.

@MaxGekk MaxGekk closed this in 921eba8 Apr 1, 2025
senthh pushed a commit to senthh/spark-1 that referenced this pull request Apr 2, 2025
Closes apache#50476 from MaxGekk/parquet-time-nested.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Max Gekk <[email protected]>