[SPARK-54609][SQL] Disable TIME type by default #53344
Conversation
dongjoon-hyun
left a comment
Could you make CI happy, @davidm-db ?
cc @MaxGekk , @uros-db , @cloud-fan , @HyukjinKwon , @viirya , @peter-toth , @yaooqinn , @LuciferYang , @vinodkc
sql/api/src/main/scala/org/apache/spark/sql/types/TimeType.scala
    }

    override def supportDataType(dataType: DataType): Boolean = dataType match {
      case _: TimeType => SQLConf.get.isTimeTypeEnabled
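The guard pattern above can be sketched as a small self-contained example. Everything here is illustrative (`FakeConf`, the toy `DataType` hierarchy) — it is not Spark's actual `SQLConf` or `DataType` API, just the shape of the check: TIME is gated by a config flag, and container types are checked recursively.

```scala
// Minimal sketch of the guard pattern (illustrative types; not Spark's
// actual DataType hierarchy or SQLConf).
sealed trait DataType
case object IntType extends DataType
final case class TimeType(precision: Int = 6) extends DataType
final case class ArrayType(element: DataType) extends DataType

// Stand-in for SQLConf.get.isTimeTypeEnabled.
object FakeConf { var isTimeTypeEnabled: Boolean = false }

// A file format reports whether it supports a type; TIME is allowed only
// when the config is on, and nested element types are checked recursively.
def supportDataType(dt: DataType): Boolean = dt match {
  case _: TimeType  => FakeConf.isTimeTypeEnabled
  case ArrayType(e) => supportDataType(e)
  case _            => true
}
```

The recursive case matters for the later discussion in this thread: without it, a `TimeType` hidden inside an array, map, or struct would slip past the check.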
how do we block geo types for data sources?
Per offline discussion with @uros-db, blocking for Parquet should be sufficient for TIME.
Added guards for all file formats (that haven't previously been explicitly marked as not supported for TIME).
Could you answer the above comments and make this PR pass the CIs for further discussion, @davidm-db ?
TimeType is marked as Unstable. Is this short-term prohibition actually required?
Yes, we need this to protect the users from the accidental use of unfinished work, @yaooqinn .
sql/api/src/main/scala/org/apache/spark/sql/types/TimeType.scala
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala
      val specialDate = convertSpecialDate(value, zoneId).map(Literal(_, DateType))
      specialDate.getOrElse(toLiteral(stringToDate, DateType))
    - case TIME => toLiteral(stringToTime, TimeType())
    + case TIME if conf.isTimeTypeEnabled => toLiteral(stringToTime, TimeType())
since we have the check here, we don't need to complicate things by also updating SqlBaseParser.g4.
I replicated what Max did internally. I think the reasoning is:
- the code you are commenting on handles literal types (statement example: `SELECT TIME'10:00:00'`) and is done this way to fit into the existing error message format
- the `{time_type_enabled}?` guard in `SqlBaseParser.g4` guards references to TIME as a type and throws a different class of errors, i.e. datatype unsupported (statement example: `CREATE TABLE t(col TIME)`)

I don't know if we want to change this behavior or not, please share your thoughts.
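The distinction being discussed — literal construction gated in the analyzer versus a type-reference guard in the grammar — can be sketched roughly like this. All names here are hypothetical stand-ins, not Spark's parser or error classes; the point is only that an ungated TIME literal falls through to the "unsupported typed literal" error path:

```scala
// Simplified sketch of the literal-handling branch: the TIME keyword is
// turned into a literal only when the config is on; otherwise it falls
// through to an "unsupported typed literal" error. Names are illustrative.
final case class Literal(value: Any, typeName: String)

def parseTypedLiteral(
    typeName: String,
    text: String,
    timeEnabled: Boolean): Either[String, Literal] =
  typeName match {
    case "DATE"                 => Right(Literal(text, "DATE"))
    case "TIME" if timeEnabled  => Right(Literal(text, "TIME"))
    case other                  => Left(s"UNSUPPORTED_TYPED_LITERAL: $other")
  }
```

A grammar-level guard, by contrast, would reject `CREATE TABLE t(col TIME)` before this code ever runs, producing a different (datatype-unsupported) class of error.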
...re/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
can we add some test cases following the geo types blocking PRs?
uros-db
left a comment
@davidm-db Should we add some tests, e.g.
- e2e sql query tests with config turned off
- blocking data sources like Parquet, CSV
- data frames (classic and spark connect)
- also, there are Scala suites for casting
BTW, given @yaooqinn's comment, while working on this PR we need to build a consensus on this by sending an email to the dev@spark mailing list, @davidm-db , @uros-db , and @cloud-fan .
It would be enough to reply on the RC2 email about the TIME type. Maybe you could send out the decision clearly to the mailing list, @cloud-fan , because @MaxGekk is not in the loop yet?
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
      unsupportedType = ctx.literalType.getText,
      supportedTypes =
        // TODO: Remove TIME from the list.
        Seq("DATE", "TIMESTAMP_NTZ", "TIMESTAMP_LTZ", "TIMESTAMP", "INTERVAL", "X", "TIME"),
Maybe, the following style?

    - Seq("DATE", "TIMESTAMP_NTZ", "TIMESTAMP_LTZ", "TIMESTAMP", "INTERVAL", "X", "TIME"),
    + Seq("DATE", "TIMESTAMP_NTZ", "TIMESTAMP_LTZ", "TIMESTAMP", "INTERVAL", "X") ++
    +   (if (conf.isTimeTypeEnabled) Seq("TIME") else None)
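The suggested style — conditionally appending TIME to the supported-types list — can be shown as a tiny runnable sketch. The function name is illustrative; the sketch uses `Nil` rather than `None` in the else branch, which is the more conventional spelling for appending nothing to a `Seq` (the `None` version relies on the implicit Option-to-Iterable conversion):

```scala
// Sketch of the conditional-append style: TIME appears in the supported
// list only when the config is on. Function name is hypothetical.
def supportedLiteralTypes(timeEnabled: Boolean): Seq[String] =
  Seq("DATE", "TIMESTAMP_NTZ", "TIMESTAMP_LTZ", "TIMESTAMP", "INTERVAL", "X") ++
    (if (timeEnabled) Seq("TIME") else Nil)
```

This keeps the error message accurate in both modes: users with the config off never see TIME advertised as a supported literal type.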
Yeah, will do this definitely. There are a lot of dependencies and TIME needs to be guarded in a lot of places, so for now I'm just trying to figure out what's needed to make the CI pass. Afterwards, I'll sort out the TODO comment. Hope to finish everything tomorrow!
Done.
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
@dongjoon-hyun @cloud-fan I think I've resolved all of the comments. I'll make sure tonight that after my latest changes all CIs are passing.
Thank you, @davidm-db .
    override def toString: String = "XML"

    override def supportDataType(dataType: DataType): Boolean = dataType match {
      case _: TimeType => SQLConf.get.isTimeTypeEnabled
We have a narrow waist: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L531
check where we call disallowWritingIntervals
is this covering everything? isn't this only for the write path? how do we handle blocking on the read path then?
I assume the idea is to have the check in a single place instead of doing it for each FileFormat, which makes complete sense. I'm just wondering whether, with what you suggested, we cover the same scope as with the checks in the FileFormats (current state of the code)?
we can also block read path in DataSource.resolveRelation
oh DataSourceUtils.verifySchema is a better narrow waist for both read and write paths.
so I just have one additional question: if we go down this path (which makes sense at a high level), I think we might fail to completely block the type. For example, if you look at ParquetFileFormat#supportDataType, it recursively calls supportDataType when the root type is Array/Map/Struct. Off the top of my head, I think the same holds for Xml, and maybe for some others.
am I missing something, or does what I just said make sense?
DataSourceUtils.verifySchema gets the full schema and we can do whatever we want, e.g.

    if (schema.existsRecursively(_.isInstanceOf[TimeType])) fail ...
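The "narrow waist" idea — one recursive walk over the full schema that covers both read and write paths — can be sketched self-contained. The types below are simplified stand-ins for Spark's `StructType`/`DataType` (names like `DType` and `STime` are illustrative only):

```scala
// Self-contained sketch of a single schema-wide check: one recursive walk
// rejects TIME anywhere in the schema, instead of per-FileFormat checks.
// Simplified stand-in types, not Spark's StructType/DataType.
sealed trait DType {
  def existsRecursively(p: DType => Boolean): Boolean = this match {
    case SStruct(fields) => p(this) || fields.exists(_._2.existsRecursively(p))
    case SArray(elem)    => p(this) || elem.existsRecursively(p)
    case _               => p(this)
  }
}
case object SInt extends DType
case object STime extends DType
final case class SArray(element: DType) extends DType
final case class SStruct(fields: Seq[(String, DType)]) extends DType

def verifySchema(schema: DType, timeEnabled: Boolean): Unit =
  if (!timeEnabled && schema.existsRecursively(_ == STime))
    throw new IllegalArgumentException("UNSUPPORTED: TIME type is disabled")
```

Because the walk descends into structs and arrays, this also answers the nested-type concern raised above: a TIME buried inside `array<struct<t: time>>` is caught by the same single check.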
dongjoon-hyun
left a comment
Is this still missing, @davidm-db and @cloud-fan ?
Need to add tests for disabled config.
I verified manually that it's blocked properly. Given that, shall we proceed with those test cases as a follow-up, @davidm-db and @cloud-fan ?
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
I removed all client side checks as it's not very meaningful. People can use the
So, is this the final status from your side, @cloud-fan ? Then, could you give your approval?
        new GeometryConverter(g)
      case DateType if SQLConf.get.datetimeJava8ApiEnabled => LocalDateConverter
      case DateType => DateConverter
      case _: TimeType if !SQLConf.get.isTimeTypeEnabled =>
Just for the record, we don't have this for GeographyType|GeometryType.
We have it here, a few lines above:

    case _ @ (_: GeographyType | _: GeometryType) if !SQLConf.get.geospatialEnabled =>
      throw new org.apache.spark.sql.AnalysisException(
        errorClass = "UNSUPPORTED_FEATURE.GEOSPATIAL_DISABLED",
        messageParameters = scala.collection.immutable.Map.empty)
    def verifySchema(format: FileFormat, schema: StructType, readOnly: Boolean = false): Unit = {
      if (!SQLConf.get.isTimeTypeEnabled && schema.existsRecursively(_.isInstanceOf[TimeType])) {
        throw QueryCompilationErrors.unsupportedTimeTypeError()
      }
Ditto. We don't have this for Geo*Type.
no data source supports geo types yet, so it's not needed for now. But to be future-proof we should check geo here as well.
| """ | ||
| print("Enabling TIME data type") | ||
| jspark.sql("SET spark.sql.timeType.enabled = true") |
Do we have this for Geo*Type?
I'm also curious about why geo didn't fail here...
dongjoon-hyun
left a comment
Got it.
+1, LGTM. Thank you so much, @cloud-fan .
I manually verified the compilation and Scala linter, and both
Merged to master.
Never mind. I resolved the conflicts and am testing locally on
Introducing a new SQL config for TIME type: `spark.sql.timeType.enabled`. The default value is `false` and it is enabled only in tests. TIME data type support is not complete, so we need to guard it before it is completed, especially ahead of the Spark 4.1 release.

Closes #53344 from davidm-db/davidm-db/time-config.

Lead-authored-by: David Milicevic <[email protected]>
Co-authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 18a9435)
Signed-off-by: Dongjoon Hyun <[email protected]>
Merged to branch-4.1, too.
What changes were proposed in this pull request?

Introducing a new SQL config for TIME type: `spark.sql.timeType.enabled`. The default value is `false` and it is enabled only in tests.

Why are the changes needed?

TIME data type support is not complete, so we need to guard it before it is completed, especially ahead of the Spark 4.1 release.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Need to add tests for disabled config.

Was this patch authored or co-authored using generative AI tooling?

No.