Gate new ScalarSubqueryExec node behind session property (#22530)
## Which issue does this PR close?
Related to discussion on #21240 and
#21080 (comment).
PR #21240 introduced `ScalarSubqueryExec` / `ScalarSubqueryExpr` to
execute uncorrelated scalar subqueries during physical execution. The
two communicate via shared in-process state (a `slot` in
`ExecutionProps`). This breaks distributed execution frameworks, which
may place a network boundary between the producer
(`ScalarSubqueryExec`) and the consumer expression
(`ScalarSubqueryExpr`). For more details, see
datafusion-contrib/datafusion-distributed#460.
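
To illustrate the failure mode described above, here is a minimal toy model that uses only the Rust standard library. It is not DataFusion code: `Slot`, `Producer`, `Consumer`, and `ship_to_remote` are hypothetical names, and the real slot in `ExecutionProps` may be shaped quite differently. The sketch only shows why a value published through shared in-process state is not visible to a consumer that was rebuilt on the other side of a network boundary.

```rust
use std::sync::{Arc, OnceLock};

// Hypothetical stand-in for the shared slot; not the DataFusion type.
type Slot = Arc<OnceLock<i64>>;

/// Toy producer: plays the role of `ScalarSubqueryExec` by publishing the
/// subquery result into the slot.
struct Producer {
    slot: Slot,
}

impl Producer {
    fn run(&self, subquery_result: i64) {
        // `set` returns Err if the slot was already filled; ignored in this sketch.
        let _ = self.slot.set(subquery_result);
    }
}

/// Toy consumer: plays the role of `ScalarSubqueryExpr` by reading the slot.
struct Consumer {
    slot: Slot,
}

impl Consumer {
    fn evaluate(&self) -> Option<i64> {
        self.slot.get().copied()
    }
}

/// Simulates sending the consumer to another process. Only the plan shape
/// can cross the wire, so the receiving side allocates a fresh, empty slot.
fn ship_to_remote(_local: &Consumer) -> Consumer {
    Consumer {
        slot: Arc::new(OnceLock::new()),
    }
}

fn main() {
    let slot: Slot = Arc::new(OnceLock::new());
    let producer = Producer { slot: Arc::clone(&slot) };
    let local = Consumer { slot: Arc::clone(&slot) };
    let remote = ship_to_remote(&local);

    producer.run(42);

    println!("local consumer sees:  {:?}", local.evaluate());
    println!("remote consumer sees: {:?}", remote.evaluate());
}
```

In this sketch the local consumer observes `Some(42)`, while the remote consumer observes `None` because its slot was never written by the producer.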
## What changes are included in this PR?
Adds a new optimizer config option
`datafusion.optimizer.enable_physical_uncorrelated_scalar_subquery`
(default `true`). When `true`, behavior is unchanged from current main.
When `false`, all scalar subqueries are rewritten to left joins by
`ScalarSubqueryToJoin` and `ScalarSubqueryExec` is never constructed,
which restores the behavior from before #21240.
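
The gating rule can be summarized as a small decision function. This is a hedged sketch, not the optimizer's actual code: `Strategy` and `choose_strategy` are hypothetical names, and the treatment of correlated subqueries when the flag is on (always rewritten to a join) is an assumption inferred from the option's description rather than something this PR states directly.

```rust
/// Hypothetical labels for the two execution strategies named in this PR.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Strategy {
    /// Left in the logical plan and executed by `ScalarSubqueryExec`.
    PhysicalExec,
    /// Rewritten to a left join by `ScalarSubqueryToJoin`.
    LeftJoinRewrite,
}

/// Toy decision function mirroring the description of
/// `datafusion.optimizer.enable_physical_uncorrelated_scalar_subquery`.
/// Assumption: correlated scalar subqueries are rewritten to joins
/// regardless of the flag.
fn choose_strategy(enable_physical: bool, is_correlated: bool) -> Strategy {
    if enable_physical && !is_correlated {
        Strategy::PhysicalExec
    } else {
        Strategy::LeftJoinRewrite
    }
}

fn main() {
    for flag in [true, false] {
        for correlated in [false, true] {
            println!(
                "flag={flag}, correlated={correlated} -> {:?}",
                choose_strategy(flag, correlated)
            );
        }
    }
}
```

Only the combination of flag on and uncorrelated subquery selects the physical path; with the flag off, `ScalarSubqueryExec` is never chosen.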
## Are these changes tested?
Yes. All existing tests pass, and this PR adds
`uncorrelated_scalar_subquery_rewritten_when_flag_off` to cover the
negative (flag off) case.
## Are there any user-facing changes?
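Yes, a new config option
`datafusion.optimizer.enable_physical_uncorrelated_scalar_subquery`.
With the default value, neither execution nor results change. Setting it
to `false` changes how scalar subqueries are executed and is subject to
the caveats listed in the option's documentation.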
@@ -453,6 +454,7 @@ datafusion.optimizer.enable_distinct_aggregation_soft_limit true When set to tru
datafusion.optimizer.enable_dynamic_filter_pushdown true When set to true attempts to push down dynamic filters generated by operators (TopK, Join & Aggregate) into the file scan phase. For example, for a query such as `SELECT * FROM t ORDER BY timestamp DESC LIMIT 10`, the optimizer will attempt to push down the current top 10 timestamps that the TopK operator references into the file scans. This means that if we already have 10 timestamps in the year 2025 any files that only have timestamps in the year 2024 can be skipped / pruned at various stages in the scan. The config will suppress `enable_join_dynamic_filter_pushdown`, `enable_topk_dynamic_filter_pushdown` & `enable_aggregate_dynamic_filter_pushdown` So if you disable `enable_topk_dynamic_filter_pushdown`, then enable `enable_dynamic_filter_pushdown`, the `enable_topk_dynamic_filter_pushdown` will be overridden.
datafusion.optimizer.enable_join_dynamic_filter_pushdown true When set to true, the optimizer will attempt to push down Join dynamic filters into the file scan phase.
datafusion.optimizer.enable_leaf_expression_pushdown true When set to true, the optimizer will extract leaf expressions (such as `get_field`) from filter/sort/join nodes into projections closer to the leaf table scans, and push those projections down towards the leaf nodes.
+datafusion.optimizer.enable_physical_uncorrelated_scalar_subquery true When set to true, uncorrelated scalar subqueries are left in the logical plan and executed by `ScalarSubqueryExec` during physical execution. When set to false, all scalar subqueries (including uncorrelated ones) are rewritten to left joins by the `ScalarSubqueryToJoin` optimizer rule. Note disabling this option is not recommended. It restores pre <https://github.com/apache/datafusion/pull/21240> behavior, which silently produces incorrect results for multi-row subqueries and does not support scalar subqueries in ORDER BY / JOIN ON / aggregate-function arguments. This option is intended as a temporary escape hatch for distributed execution frameworks and is planned to be removed in a future DataFusion release.
datafusion.optimizer.enable_piecewise_merge_join false When set to true, piecewise merge join is enabled. PiecewiseMergeJoin is currently experimental. Physical planner will opt for PiecewiseMergeJoin when there is only one range filter.
datafusion.optimizer.enable_round_robin_repartition true When set to true, the physical plan optimizer will try to add round robin repartitioning to increase parallelism to leverage more CPU cores
datafusion.optimizer.enable_sort_pushdown true Enable sort pushdown optimization. When enabled, attempts to push sort requirements down to data sources that can natively handle them (e.g., by reversing file/row group read order). Returns **inexact ordering**: Sort operator is kept for correctness, but optimized input enables early termination for TopK queries (ORDER BY ... LIMIT N), providing significant speedup. Memory: No additional overhead (only changes read order). Future: Will add option to detect perfectly sorted data and eliminate Sort completely. Default: true