[CORE] Preserve fallback tag for nodes without logicalLink in RemoveFallbackTagRule#12028
Conversation
|
Run Gluten Clickhouse CI on x86 |
380cc92 to
a58dac1
Compare
|
Run Gluten Clickhouse CI on x86 |
a58dac1 to
ef62ac0
Compare
|
Run Gluten Clickhouse CI on x86 |
ef62ac0 to
c63c3ab
Compare
|
Run Gluten Clickhouse CI on x86 |
1 similar comment
|
Run Gluten Clickhouse CI on x86 |
c63c3ab to
c510184
Compare
|
Run Gluten Clickhouse CI on x86 |
|
Hi @zhztheplayer @Zouxxyy. After #11853, GlutenQueryExecutionListener.onSQLExecutionEnd re-walks qe.executedPlan and posts a fresh GlutenPlanFallbackEvent, which overwrites the per-stage events in kvstore (last write wins). By that time every AQE stage has already gone through RemoveFallbackTagRule, so for nodes without a logicalLink (e.g. SortExec / BroadcastExchange from EnsureRequirements) the FallbackTag is gone. End result: the regression #11295 tried to fix is back, just hidden by the test being relaxed from forall to exists in #11853. This PR takes the simplest fix: don't untag physical nodes that have no logicalLink in RemoveFallbackTagRule. Any downside you can think of to leaving the FallbackTag on the physical node in this case? |
Just curious, does this behavior differ across Spark versions? I checked latest Spark code and the rule adopts |
|
@zhztheplayer You're right that In |
…allbackTagRule For SparkPlan nodes that lack a logicalLink (e.g., SortExec / BroadcastExchange generated by EnsureRequirements), GlutenFallbackReporter cannot copy the fallback reason onto a logical plan, so removing the FallbackTag on the physical node makes GlutenQueryExecutionListener fall back to the generic 'Gluten does not touch it or does not support it' reason at SQL execution end. Fix: in RemoveFallbackTagRule, skip untagging for such nodes. Their tag stays on the physical node and is read by GlutenExplainUtils.handleVanillaSparkPlan via FallbackTags.getOption(p) in both the per-stage GlutenFallbackReporter event and the end-of-SQL plan walk. Note: an earlier attempt synthesized a dummy LeafNode and attached it via SparkPlan.LOGICAL_PLAN_TAG. That poisoned AdaptiveSparkPlanExec.setLogicalLinkFor- NewQueryStage, which relies on EnsureRequirements-generated nodes having no logicalLink in order to walk down to the real logical node, and broke AQE-driven join/empty-relation rewrites (SPARK-34533, SPARK-34781, SPARK-39551). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
c510184 to
a09699e
Compare
|
Run Gluten Clickhouse CI on x86 |
Summary
For
SparkPlannodes that lack alogicalLink(e.g.,SortExec/BroadcastExchangegenerated byEnsureRequirements),GlutenFallbackReportercannot copy theirFallbackTagonto a logical plan, so the subsequentRemoveFallbackTagRuleremoved their tag without persisting it anywhere. This caused the fallback reason to be lost whenGlutenQueryExecutionListenerre-walks the finalized plan at SQL execution end, resulting in a generic "Gluten does not touch it or does not support it" reason instead of the actual fallback reason.In #11853, the test assertion for "get correct fallback reason on nodes without logicalLink" was changed from
foralltoexists, which hid this issue — among thefallbackReasons, one event's reason had degraded to the generic message because the physical plan tag was removed byRemoveFallbackTagRule, butexistsonly required some reasons to be correct, masking the fact that others were wrong.Fix: In
RemoveFallbackTagRule, skip untagging for nodes that have aFallbackTagbut nologicalLink. Their tag stays on the physical node and is read byGlutenExplainUtils.handleVanillaSparkPlanviaFallbackTags.getOption(p)(the existing fallback branch) in both the per-stageGlutenFallbackReporterevent and the end-of-SQL plan walk. The tag is regenerated by Gluten validation on every AQE stage'spostStageCreationRules, so it survives across AQE re-planning iterations as well.With this fix, the test assertion is changed back to
forall, verifying that all fallback events now report the correct reason.Why not synthesize a logicalLink?
An earlier attempt synthesized a dummy
LeafNode(FallbackLogicalPlan) and attached it viaSparkPlan.LOGICAL_PLAN_TAGso the existingp.logicalLink.flatMap(FallbackTags.getOption)path would find it. That approach poisonedAdaptiveSparkPlanExec.setLogicalLinkForNewQueryStage, which intentionally relies onEnsureRequirements-generated nodes having nologicalLinkin order to walk down to the real logical node. With a synthetic link in place, the stage'slogicalLinkwas set to the dummy, which never appeared in the original logical plan; AQE'sreplaceWithQueryStagesthen could not insert theLogicalQueryStagewrapper, breaking AQE rules that matchLogicalQueryStage(_, stage: ...QueryStageExec)such asAQEPropagateEmptyRelation.Keeping the tag on the physical node sidesteps the AQE contract entirely.
Test plan
foralland verifies all fallback reasons are correct