
feat(datafusion): Treat timestamp conversion functions like a cast. #945

Open · wants to merge 5 commits into main from feat/support-to-timestamp-and-to-date
Conversation

omerhadari (Contributor) commented Feb 5, 2025

Partially Solves #933

Note: This PR also fixes a bug where casts to DATE, which implicitly truncate the inner expression, produced predicates built from the raw expression rather than the truncated date. This could result in incorrect predicates (see the relevant test in this PR).

omerhadari (Contributor, Author) commented Feb 5, 2025

EDIT: copying this message to the issue itself, since discussion started there.

@Fokko hope it's ok to tag :)

This PR contains the first (and easiest) part of the fix for the issue.

I wanted to ask: is there a way to express functions within Iceberg predicates? Is this even desired? The reason this could be beneficial is that sometimes you need access to the column value, and then you could perform much better manifest elimination. A few examples I have in this context:

  • TO_DATE essentially converts the column to a timestamp and then truncates it to the nearest day. I cannot easily do that while generating the predicate.
  • TO_TIMESTAMP accepts a format string, but I see no way to pass the format inside the predicate and use it correctly.

This also reveals what I think is a bug. In DataFusion (as well as many other engines), when you cast, for example, a string to a DATE, it truncates to the nearest day. Currently, the conversion function simply extracts the inner expression from the Cast.

See here: [screenshot of the Cast handling in the conversion function]

If I understand correctly, this could cause wrong results. For example, the query

SELECT * FROM table WHERE date_col > CAST('2025-01-01T00:10:00' AS DATE)

would result in the predicate date_col > '2025-01-01T00:10:00', which will filter out data files where 2025-01-01T00:00:00 < date_col < 2025-01-01T00:10:00, even though they are supposed to be included.
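
For illustration, here is a minimal sketch of truncating such a literal during predicate conversion. It is assembled from the pieces this PR touches (the Date32/Date64 check, the Utf8 literal pattern, and the split on 'T'), but the helper name and shape are hypothetical and not the PR's exact code:

    use datafusion::arrow::datatypes::DataType;
    use datafusion::common::ScalarValue;
    use datafusion::logical_expr::Expr;

    // Hypothetical helper: when a Utf8 literal is cast to DATE, keep only the date part
    // so the Iceberg predicate compares against the value the engine actually produces.
    fn truncate_cast_to_date_literal(cast_type: &DataType, inner: &Expr) -> Option<String> {
        match (cast_type, inner) {
            (DataType::Date32 | DataType::Date64, Expr::Literal(ScalarValue::Utf8(Some(s)))) => {
                // '2025-01-01T00:10:00' cast to DATE evaluates to 2025-01-01.
                s.split('T').next().map(str::to_string)
            }
            _ => None,
        }
    }

With something like this in place, date_col > CAST('2025-01-01T00:10:00' AS DATE) would be pushed down as date_col > '2025-01-01' instead of the raw datetime string.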

Would appreciate some guidance on how to tackle this issue of propagating more information. I don't think it fits within the scope of this PR, but maybe I am missing something basic.

Fokko (Contributor) commented Feb 6, 2025

Hey @omerhadari, thanks for working on this. Unfortunately, I'm afraid this is a pretty complex task.

When we write something like to_date(ts) in SQL, that maps to the Iceberg partition transform day(ts). What we do in Java is convert the to_date(ts) into an UnboundTransform. This transform will be bound to the schema at some point. When you write to_date(ts) >= DATE '2021-01-01', the evaluator sees that both sides are dates and can compare the value directly against the partition values.
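
For illustration only (this is not the iceberg-rust API, just the arithmetic behind that comparison): a day(ts) partition value in Iceberg is the number of days since the Unix epoch, so once the right-hand side of to_date(ts) >= DATE '2021-01-01' is converted into the same unit it can be checked directly against partition values:

    use chrono::NaiveDate;

    // Iceberg's day transform stores partition values as days since 1970-01-01.
    fn days_since_epoch(date: NaiveDate) -> i64 {
        (date - NaiveDate::from_ymd_opt(1970, 1, 1).unwrap()).num_days()
    }

    fn main() {
        let bound = NaiveDate::from_ymd_opt(2021, 1, 1).unwrap();
        // to_date(ts) >= DATE '2021-01-01'  ~>  day(ts) >= 18628
        println!("day(ts) >= {}", days_since_epoch(bound));
    }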

omerhadari (Contributor, Author):

Updated the PR with the cases that were discussed as being within scope for now (in this comment).

I think this makes sense for a single PR, though it doesn't solve the entire issue.

omerhadari marked this pull request as ready for review on February 8, 2025 19:59
omerhadari force-pushed the feat/support-to-timestamp-and-to-date branch from 4907956 to 735e7df on February 8, 2025 20:10

omerhadari (Contributor, Author) commented Feb 8, 2025

> Hey @omerhadari, thanks for working on this. Unfortunately, I'm afraid this is a pretty complex task.
>
> When we write something like to_date(ts) in SQL, that maps to the Iceberg partition transform day(ts). What we do in Java is convert the to_date(ts) into an UnboundTransform. This transform will be bound to the schema at some point. When you write to_date(ts) >= DATE '2021-01-01', the evaluator sees that both sides are dates and can compare the value directly against the partition values.

Does it make sense to create an issue for supporting an equivalent to UnboundTransform here?

omerhadari (Contributor, Author):

@Fokko will you be able to do another round of review?

liurenjie1024 (Contributor) left a comment:

Thanks @omerhadari for this PR, but I think the correct way to solve the problems addressed here is to call DataFusion's expression simplification rather than the special-casing done in this PR.

@@ -119,7 +122,53 @@ fn to_iceberg_predicate(expr: &Expr) -> TransformedResult {
             _ => TransformedResult::NotTransformed,
         }
     }
-    Expr::Cast(c) => to_iceberg_predicate(&c.expr),
+    Expr::Cast(c) => {
+        if DataType::Date32 == c.data_type || DataType::Date64 == c.data_type {

Contributor:

I have concerns about handling this case this way here. This is a matter of simplifying the expression in DataFusion, which should go through DataFusion's expression simplification API.

Contributor:

This is less about simplification and more about extracting the right information for Iceberg to optimize.

Contributor:

> This is less about simplification and more about extracting the right information for Iceberg to optimize.

Maybe the word "simplification" is a little confusing; I think "constant folding" is better here. The expressions in the tests should be simplified by constant folding and replaced with a literal, which Iceberg can then process.

omerhadari (Contributor, Author):

Constant folding is being done, I think (e.g. TO_TIMESTAMP(2+2) => TO_TIMESTAMP(4)), but I am not sure it makes sense to turn TO_TIMESTAMP(<date_str>) into CAST(date_str AS TIMESTAMP) as a simplification in DataFusion. Currently the implementations of CAST and TO_TIMESTAMP are the same when no parameters are given, but I don't think turning TO_TIMESTAMP into CAST is necessarily a "simplification". It may even be counterintuitive and affect errors in some weird ways, with no gain in readability or performance.

Contributor:

By simplification I don't mean converting TO_TIMESTAMP(<date_str>) to CAST(date_str AS TIMESTAMP); I mean it should be converting TO_TIMESTAMP(<date_str>) to <timestamp literal>, which should happen during constant folding.
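
For reference, a minimal sketch of running DataFusion's expression simplifier (constant folding) before converting to an Iceberg predicate. Module paths and exact folding behaviour depend on the DataFusion version, so treat this as an assumption-laden illustration rather than the project's code; the idea is that constant sub-expressions, including casts of literals and non-volatile scalar calls such as to_timestamp('...'), are folded into plain literals that the Iceberg conversion already understands:

    use std::sync::Arc;

    use datafusion::arrow::datatypes::{DataType, Field, Schema};
    use datafusion::common::DFSchema;
    use datafusion::error::Result;
    use datafusion::logical_expr::execution_props::ExecutionProps;
    use datafusion::optimizer::simplify_expressions::{ExprSimplifier, SimplifyContext};
    use datafusion::prelude::{cast, col, lit};

    fn main() -> Result<()> {
        // Schema with a single date column, mirroring the tests in this PR.
        let schema = Schema::new(vec![Field::new("ts", DataType::Date32, false)]);
        let df_schema = Arc::new(DFSchema::try_from(schema)?);

        let props = ExecutionProps::new();
        let simplifier = ExprSimplifier::new(SimplifyContext::new(&props).with_schema(df_schema));

        // ts >= CAST('2023-01-05T11:11:11' AS DATE)
        let expr = col("ts").gt_eq(cast(lit("2023-01-05T11:11:11"), DataType::Date32));

        // After constant folding the right-hand side should already be a Date32 literal
        // (2023-01-05), so the truncation happens before any Iceberg predicate is built.
        println!("{}", simplifier.simplify(expr)?);
        Ok(())
    }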

omerhadari (Contributor, Author):

OK, I don't mind it being in DataFusion; I'll try opening an issue/PR there and see.

Contributor:

This doesn't seem like a DataFusion problem; maybe we should call the simplification API in the Iceberg planner?

omerhadari (Contributor, Author):

This would be redundant, I think. We should call it in the tests to mimic production behaviour.

The current simplification API in DataFusion simplifies casts, but not these specific function calls (TO_TIMESTAMP, TO_DATE), as far as I could tell.

Contributor:

I'm not sure about the details, but I guess we missed something in the iceberg-datafusion planner. It's fine with me to seek help from the DataFusion community.

if DataType::Date32 == c.data_type || DataType::Date64 == c.data_type {
    match c.expr.as_ref() {
        Expr::Literal(ScalarValue::Utf8(Some(literal))) => {
            let date = literal.split('T').next();

Contributor:

I would be reluctant to do any string manipulation here.

Contributor:

Instead, I would add support for the datetime format in date_from_str:

pub fn date_from_str<S: AsRef<str>>(s: S) -> Result<Self> {
    let t = s.as_ref().parse::<NaiveDate>().map_err(|e| {
        Error::new(
            ErrorKind::DataInvalid,
            format!("Can't parse date from string: {}", s.as_ref()),
        )
        .with_source(e)
    })?;
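
For what it's worth, a minimal sketch of what that could look like, using only chrono and a hypothetical free function rather than the actual Datum::date_from_str signature: fall back to parsing a full datetime and truncating it to its date component when plain date parsing fails.

    use chrono::{NaiveDate, NaiveDateTime};

    // Hypothetical helper, not the iceberg-rust implementation: accept either
    // '2023-01-05' or '2023-01-05T11:11:11' and return the (truncated) date.
    fn parse_date_or_datetime(s: &str) -> Result<NaiveDate, chrono::ParseError> {
        s.parse::<NaiveDate>()
            .or_else(|_| s.parse::<NaiveDateTime>().map(|dt| dt.date()))
    }

    fn main() {
        assert_eq!(
            parse_date_or_datetime("2023-01-05T11:11:11").unwrap(),
            NaiveDate::from_ymd_opt(2023, 1, 5).unwrap()
        );
    }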

omerhadari (Contributor, Author):

@Fokko currently date_from_str supports datetime input (i.e. 2025-01-01T11:11:11 will not be truncated).

I think this is a mistake given the name of the function (date_from_str rather than datetime_from_str) and its documentation, but I don't think I can change its behaviour so drastically now, right?

Do you suggest I change it, or create another one with the fix?


#[test]
fn test_to_date_comparison_creates_predicate() {
    let sql = "ts >= CAST('2023-01-05T11:11:11' AS DATE)";

Contributor:

I'm surprised to see that this works:

spark-sql (default)> SELECT CAST('2023-01-05T11:11:11' AS DATE);
2023-01-05

omerhadari (Contributor, Author):

Personally I would have preferred an error, but nobody is asking me 😆

omerhadari (Contributor, Author) commented Feb 17, 2025

@liurenjie1024 @Fokko regardless of whether turning TO_TIMESTAMP/TO_DATE into casts makes sense or should be done as a simplification in DataFusion, an important part of this PR is fixing the incorrect handling of CAST(<datetime_str> AS DATE) expressions, which can cause incorrect query results. Would you like me to extract just this part as a separate PR so that it can be reviewed on its own?

Talking about what you can see here

liurenjie1024 (Contributor):

> @liurenjie1024 @Fokko regardless of whether turning TO_TIMESTAMP/TO_DATE into casts makes sense or should be done as a simplification in DataFusion, an important part of this PR is fixing the incorrect handling of CAST(<datetime_str> AS DATE) expressions, which can cause incorrect query results. Would you like me to extract just this part as a separate PR so that it can be reviewed on its own?
>
> Talking about what you can see here

Thanks @jonathanc-n I think it would make sense to fix the bug you mentioned, and it would be better to put it in another PR.

omerhadari (Contributor, Author):

> > @liurenjie1024 @Fokko regardless of whether turning TO_TIMESTAMP/TO_DATE into casts makes sense or should be done as a simplification in DataFusion, an important part of this PR is fixing the incorrect handling of CAST(<datetime_str> AS DATE) expressions, which can cause incorrect query results. Would you like me to extract just this part as a separate PR so that it can be reviewed on its own?
> > Talking about what you can see here
>
> Thanks @jonathanc-n I think it would make sense to fix the bug you mentioned, and it would be better to put it in another PR.

here
