-
Notifications
You must be signed in to change notification settings - Fork 208
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(datafusion): Treat timestamp conversion functions like a cast. #945
base: main
Are you sure you want to change the base?
feat(datafusion): Treat timestamp conversion functions like a cast. #945
Conversation
EDIT: copying this message to the issue itself, since discussion started there. @Fokko hope it's ok to tag :) This PR contains the first (and easiest) part of the fix for the issue. I wanted to ask, is there a way to express function within iceberg predicates? Is this even desired? The reason this could be beneficial is that sometimes you need access to the column value and then you could perform much better manifest elimination. A few examples I have in this context:
This also reveals what I think is a bug. In If I understand correctly, this could cause wrong results for example for the query would result in the predicate Would appreciate some guidance about how to tackle this issue of propagating more information, I don't think it makes sense in the scope of this PR but maybe I am missing something basic. |
Hey @omerhadari Thanks for working on this. Unfortunately, I'm afraid that this is a pretty complex task. When we do like |
crates/integrations/datafusion/src/physical_plan/expr_to_predicate.rs
Outdated
Show resolved
Hide resolved
Updated the PR with the cases that were discussed to be within the scope for now (in this comment) I think this makes sense for a single PR, though it doesn't solve the entire issue. |
4907956
to
735e7df
Compare
Does it make sense to create an issue for supporting an equivalent to UnboundTransform here? |
@Fokko will you be able to do another round of review? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @omerhadari for this pr, but I think the correct way to solve problems in this pr is to call expression simplification in datafusion rather hacking in this pr.
@@ -119,7 +122,53 @@ fn to_iceberg_predicate(expr: &Expr) -> TransformedResult { | |||
_ => TransformedResult::NotTransformed, | |||
} | |||
} | |||
Expr::Cast(c) => to_iceberg_predicate(&c.expr), | |||
Expr::Cast(c) => { | |||
if DataType::Date32 == c.data_type || DataType::Date64 == c.data_type { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have concerns handling this case in such a way here. This is a process of simplifying expression in datafusion, which should call datafusion's expression simplification api.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is less about simplification, rather than extracting the right information for Iceberg to optimize.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is less about simplification, rather than extracting the right information for Iceberg to optimize.
Maybe the word simplification is a little confusing, I think constant folding is better here. The expressions in tests should be simplified by constant folding and replaced with a literal, which could be processed by iceberg.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Constant folding is being done I think, (e.g. TO_TIMESTAMP(2+2)
=>
TO_TIMESTAMP(4)
) but I am not sure it makes sense to turn TO_TIMESTAMP(<date_str>)
to CAST(date_str AS TIMESTAMP)
as a simplification in Datafusion. Currently the implementation of CAST
and TO_TIMESTAMP
is the same with no parameters, but I don't think turning TO_TIMESTAMP
to CAST
is a "simplification" necessarily. It may even be counter intuitive and effect errors in some weird ways, with no gains to readability or performance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By simplification I don't mean converting TO_TIMESAMPT(<date_str>)
to CAST(date_str AS TIMESTAMP)
, I mean it should be converting TO_TIMESAMPT(<date_str>)
to <timestamp literal>
, which should be done in constant folding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think datafusion's expression simplification api should have done this, you could see this example:
https://github.com/apache/datafusion/blob/132b21f02763cabbab85ab99e0e0c7fd6cd3d516/datafusion-examples/examples/expr_api.rs#L153
https://github.com/apache/datafusion/blob/132b21f02763cabbab85ab99e0e0c7fd6cd3d516/datafusion-examples/examples/expr_api.rs#L180
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I don't mind it being in in DataFusion, I'll try opening an issue/PR there and see.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't seem a datafusion problem, maybe we should call the simplification api in the iceberg planner?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would be redundant I think. We should call it in the tests to mimic production behaviour.
The current simplification API in DataFusion simplifies casts, not these specific function calls (TO_TIMESTAMP, TO_DATE) as far as I could tell.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure about the details, but I guess in iceberg-datafusion planner we missed sth. It's fine for me to seek for help in datafusion community.
if DataType::Date32 == c.data_type || DataType::Date64 == c.data_type { | ||
match c.expr.as_ref() { | ||
Expr::Literal(ScalarValue::Utf8(Some(literal))) => { | ||
let date = literal.split('T').next(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would be reluctant to do any string manipulation here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead, I would add support for the datetime format in date_from_str
:
iceberg-rust/crates/iceberg/src/spec/values.rs
Lines 1459 to 1466 in 75c7e8d
pub fn date_from_str<S: AsRef<str>>(s: S) -> Result<Self> { | |
let t = s.as_ref().parse::<NaiveDate>().map_err(|e| { | |
Error::new( | |
ErrorKind::DataInvalid, | |
format!("Can't parse date from string: {}", s.as_ref()), | |
) | |
.with_source(e) | |
})?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Fokko currently date_from_str
supports datetime (i.e. 2025-01-01T11:11:11 will not be truncated).
I think this is a mistake given the name of the function (date_from_str and not datetime_from_str) and its documentation, but I don't think I can change its behaviour so drastically now right?
Do you suggest I change it, or create another one that is fix?
|
||
#[test] | ||
fn test_to_date_comparison_creates_predicate() { | ||
let sql = "ts >= CAST('2023-01-05T11:11:11' AS DATE)"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm surprised to see that this works:
spark-sql (default)> SELECT CAST('2023-01-05T11:11:11' AS DATE);
2023-01-05
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally I would have preferred an error, but nobody is asking me 😆
@liurenjie1024 @Fokko regardless of whether or not turning Talking about what you can see here |
Thanks @jonathanc-n I think it would make sense to fix the bug you mentioned, and it would be better to put it in another pr. |
|
Partially Solves #933
Note: This PR also fixes a bug that made Casts to Date which implicitly truncate the expression inside them create predicates with the raw expression, rather than with the truncated date. This could result in wrong predicates (see the relevant test in the PR).