fix(substrait): Do not add implicit groupBy expressions when building logical plans from Substrait #14553

Open
wants to merge 1 commit into
base: main

@anlinc anlinc commented Feb 8, 2025

Which issue does this PR close?

Closes #14348

Rationale for this change

Substrait plans are intended to be interpreted literally. When you see plan nodes like:

"project": {
  "common": {
    "emit": {
      "outputMapping": [0, 3]
    }
  },
...
}

The output mapping (e.g. [0, 3]) contains ordinals that index into the concatenated [input, output] list and select which expressions the relation emits. If the DataFusion LogicalPlanBuilder introduces additional input expressions, those ordinals point at the wrong columns, which violates the plan's intent and produces incorrect output mappings. Please see the issue for a concrete example.
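
To illustrate the offset problem, here is a minimal sketch in plain Rust. It does not use the Substrait or DataFusion APIs; `apply_emit` and the column names are hypothetical and only model how an output mapping indexes into the concatenated [input, output] list.

```rust
/// Hypothetical helper: resolve an emit output mapping against the
/// concatenation of a relation's input columns and its own output expressions.
fn apply_emit(input: &[&str], output: &[&str], mapping: &[usize]) -> Vec<String> {
    // Input columns come first, followed by the relation's expressions.
    let combined: Vec<&str> = input.iter().chain(output.iter()).copied().collect();
    mapping.iter().map(|&i| combined[i].to_string()).collect()
}

fn main() {
    // What the plan author intended: three input columns, one expression.
    let intended = apply_emit(&["a", "b", "c"], &["a + b"], &[0, 3]);
    // Same mapping after the consumer silently adds one extra input column.
    let shifted = apply_emit(&["a", "b", "c", "implicit"], &["a + b"], &[0, 3]);

    println!("{:?}", intended); // ["a", "a + b"]
    println!("{:?}", shifted); // ["a", "implicit"]
}
```

With the extra column present, ordinal 3 no longer refers to the projected expression, so the same mapping selects a different column.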

What changes are included in this PR?

In the Substrait path, do not add additional grouping expressions derived from functional dependencies.

Are these changes tested?

Added a multilayer aggregation Substrait example. The first aggregation produces a unique column with a functional dependency. Despite this, the second aggregation must not introduce any additional grouping expressions.

There should be no changes in the non-Substrait path.

Are there any user-facing changes?

No.

@github-actions github-actions bot added logical-expr Logical plan and expressions substrait labels Feb 8, 2025
@anlinc anlinc changed the title fix: Do not add implicit groupBy expressions when building logical plans from Substrait fix(substrait): Do not add implicit groupBy expressions when building logical plans from Substrait Feb 10, 2025
…. Do not implicitly add any expressions when building the LogicalPlan.
@anlinc anlinc force-pushed the anlinc/fix_logical_agg_substrait branch from a4030e9 to cc0fee8 Compare February 10, 2025 22:26
self._aggregate(group_expr, aggr_expr, false)
}

fn _aggregate(

@anlinc anlinc Feb 10, 2025

Super new to Rust -- is this an okay / conventional way to name private helpers?

Contributor

I don't think there's need for _ since the function is already private (by virtue of not being pub fn). Something like aggregate_inner I think is used quite a lot.

Alternatively, given that the LogicalPlanBuilder's aggregate doesn't do that much, we could also just inline it into the Substrait consumer. That way it doesn't change the LogicalPlanBuilder API, which might be easier.

Or maybe this whole add_group_by_exprs_from_dependencies thing should move from the plan builder into the analyzer/optimizer? Intuitively it feels like the constructed logical plan shouldn't do this kind of magic, but the analyzer/optimizer can if it makes things faster to execute. But that might be a bigger undertaking, so I'd be quite fine with this PR or the alternative above first.
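
As a rough sketch of the `aggregate_inner` naming pattern mentioned above: the `Builder` type, its `dependent_cols` field, and the method `aggregate_without_implicit_group_by_exprs` are all hypothetical stand-ins, not DataFusion's real types or API. The point is only that public entry points can delegate to one private helper that takes a flag.

```rust
/// Hypothetical stand-in for a plan builder; not DataFusion's real type.
struct Builder {
    /// Columns assumed (for illustration) to be functionally determined
    /// by the grouping keys.
    dependent_cols: Vec<String>,
}

impl Builder {
    /// Default path: keeps the behaviour of adding implied grouping columns.
    pub fn aggregate(&self, group_expr: Vec<String>) -> Vec<String> {
        self.aggregate_inner(group_expr, true)
    }

    /// Literal path (hypothetical name) for consumers that must not alter
    /// the grouping list, such as a Substrait consumer.
    pub fn aggregate_without_implicit_group_by_exprs(
        &self,
        group_expr: Vec<String>,
    ) -> Vec<String> {
        self.aggregate_inner(group_expr, false)
    }

    /// Private by default (no `pub`), so no leading underscore is needed.
    fn aggregate_inner(&self, group_expr: Vec<String>, include_implicit: bool) -> Vec<String> {
        if !include_implicit {
            return group_expr;
        }
        let mut extended = group_expr;
        for col in &self.dependent_cols {
            // Append each dependent column once, preserving existing order.
            if !extended.contains(col) {
                extended.push(col.clone());
            }
        }
        extended
    }
}

fn main() {
    let builder = Builder { dependent_cols: vec!["c".to_string()] };
    println!("{:?}", builder.aggregate(vec!["a".to_string()]));
    println!(
        "{:?}",
        builder.aggregate_without_implicit_group_by_exprs(vec!["a".to_string()])
    );
}
```

Inlining the logic into the consumer, as suggested above, would avoid the second public method entirely; this sketch only shows the naming and delegation style.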

@anlinc anlinc marked this pull request as ready for review February 10, 2025 22:39
@@ -300,6 +300,17 @@ async fn aggregate_grouping_rollup() -> Result<()> {
).await
}

#[tokio::test]
async fn multilayer_aggregate() -> Result<()> {
Contributor

Was this also succeeding before? (I'd guess so, since it would add the extra group-bys but take them into account while producing the plan. Is that right?)

Comment on lines +1113 to +1116
if include_implicit_group_by_exprs {
group_expr =
add_group_by_exprs_from_dependencies(group_expr, self.plan.schema())?;
}
Contributor

nit: style-wise I'd prefer keeping group_expr non-mut and doing something like:

Suggested change
- if include_implicit_group_by_exprs {
-     group_expr =
-         add_group_by_exprs_from_dependencies(group_expr, self.plan.schema())?;
- }
+ let group_expr = if include_implicit_group_by_exprs {
+     add_group_by_exprs_from_dependencies(group_expr, self.plan.schema())?
+ } else {
+     group_expr
+ };

@Blizzara
Contributor

Thanks, seems like a clear enough bug, appreciate both the report and the PR to fix it!

Labels
logical-expr Logical plan and expressions substrait
Development

Successfully merging this pull request may close these issues.

[substrait] Synthetically added grouping expressions in Aggregates can cause mismatched output columns
2 participants