Commit a8cb80a

[SPARK-56482][SQL][FOLLOWUP] Simplify UnionExec codegen and narrow partition-index gate
### What changes were proposed in this pull request?

Followup to SPARK-56482 (#55425). Two groups of changes to `UnionExec`'s whole-stage codegen path.

**Code cleanup:**

- Hoist `metricTerm("numOutputRows")` to `doProduce` and store it on the instance. `doConsume` runs once per child during emission, so the previous code registered the same metric N times in `references[]` for an N-child Union; now it is registered once.
- Drop the dead `assert` in `perChildProjections` and the duplicate `allChildOutputDataTypesMatch` lazy val. The dataType comparison now has a single source of truth in the `type-mismatch` branch of the gate.
- Inline the one-shot `hasAnyPartitionIndexDependentDescendant` lazy val.
- Drop the unreachable `case other` in the `UnionPartition` match and replace it with `asInstanceOf`. `unionedInputRDD` is built as `new UnionRDD(...)` two lines up, and its `getPartitions` only ever returns `UnionPartition[_]`.
- Factor out an `isPlainUnion` helper used by both the gate and `doExecute`, so the invariant "the codegen path matches `sparkContext.union` semantics" lives in one place.
- Bind `currentPartitionIndexVar` to the array-deref expression `((int[]) refs[K])[partitionIndex]` directly. An earlier revision hoisted this into a `childLocalIdx` local at helper entry, but `SampleExec.doConsume` reads `currentPartitionIndexVar` from inside an `addMutableState` initializer, which is emitted into the state-init function (outside the per-child helper), so the local was not in scope and the generated code failed to compile. The expression form resolves in any emission scope (helper parameter or `BufferedRowIterator` field).
- Drop the `try/finally` around codegen state restoration. Codegen failure aborts the whole stage, so the restoration is unreachable.

**Gate narrowing:**

- Narrow `hasPartitionIndexDependentCodegen` to exclude `InputFileName`, `InputFileBlockStart`, and `InputFileBlockLength`. These are `Nondeterministic` but read from `InputFileBlockHolder` (a per-task thread-local) and do not embed `partitionIndex`, so they are safe under fusion. Queries like `SELECT input_file_name() FROM a UNION ALL SELECT input_file_name() FROM b` now fuse.

### Why are the changes needed?

The cleanups remove accidental complexity in the fused code path: an N-fold metric reference, two duplicated dataType comparisons, an unreachable defensive guard, and a `try/finally` that protects against an unreachable case. The gate narrowing turns a missed optimization (file-scan unions) into a fused plan.

### Does this PR introduce _any_ user-facing change?

No. `spark.sql.codegen.wholeStage.union.enabled` remains off by default; when on, the new behavior fuses additional plans (file-scan unions with `input_file_name()`) that the previous gate over-rejected.

### How was this patch tested?

`UnionCodegenSuite`, `UnionCodegenAnsiSuite`, `UnionCodegenAqeSuite`, and the relevant `SQLMetricsSuite` test all pass. Three tests added:

- `partitioning-aware union falls back to non-codegen`: covers a `supportCodegenFailureReason` branch that lacked explicit coverage.
- `input_file_name child fuses (Nondeterministic but partition-index-free)`: validates the gate narrowing.
- `union with sample children fuses (or falls back) without crashing`: regression test for the `currentPartitionIndexVar` binding (caught by LuciferYang in review).

The `columnar` fallback branch is not covered by a new test: reliably constructing a plan where `Union.supportsColumnar` is true via the user-facing API turned out to be brittle, since `ApplyColumnarRulesAndInsertTransitions` aggressively rebalances columnar/row transitions.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code

Closes #55719 from cloud-fan/SPARK-56482-followup.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit d905e73)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
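The `UnionPartition` bullet above rests on a small mapping: every partition of the unioned RDD records which parent RDD it came from and its index there, and `doProduce` flattens that into two lookup arrays. A minimal Scala sketch of that derivation, using a hypothetical `UnionPartitionStub` stand-in rather than Spark's real `UnionPartition` class:

```scala
// Hypothetical stand-in for Spark's UnionPartition: each partition of the
// unioned RDD records its parent RDD and its index within that parent.
final case class UnionPartitionStub(parentRddIndex: Int, parentPartitionIndex: Int)

// A union of two children with 3 and 2 partitions yields 5 global partitions.
val partitions = Seq(
  UnionPartitionStub(0, 0), UnionPartitionStub(0, 1), UnionPartitionStub(0, 2),
  UnionPartitionStub(1, 0), UnionPartitionStub(1, 1))

// The same map-and-unzip shape the cleaned-up doProduce uses: which child
// owns each global partition, and the child-local index the fused code
// must hand to that child.
val (partitionToChild, partitionToLocalIdx) =
  partitions.map(p => (p.parentRddIndex, p.parentPartitionIndex)).unzip
```

Because the fields are read directly off each partition object, the arrays stay correct even if the partitions were not laid out in child order.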
1 parent fd05859 commit a8cb80a

2 files changed

Lines changed: 128 additions & 82 deletions

File tree

sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala

Lines changed: 77 additions & 82 deletions
```diff
@@ -901,61 +901,47 @@ case class UnionExec(children: Seq[SparkPlan]) extends SparkPlan with CodegenSup
     }
   }
 
-  // `WidenSetOperationTypes` inserts a `Project(Cast)` above each child whose
-  // dataType differs from the widened set type, so on the codegen path
-  // `src.dataType == tgt.dataType` holds. The Alias only remaps each child
-  // attribute onto the union's output exprId/name/metadata. Mismatched cases
-  // are gated upstream by `allChildOutputDataTypesMatch`, so the assert is a
-  // defensive guard.
+  // True when the codegen path applies: `outputPartitioning` is `UnknownPartitioning`,
+  // and `unionedInputRDD` matches the semantics of `sparkContext.union(...)` in `doExecute`.
+  private def isPlainUnion: Boolean = outputPartitioning.isInstanceOf[UnknownPartitioning]
+
+  // Per-child projection from the child's output to the union's output. The wrapped
+  // child is always the source `Attribute` (deterministic by construction); the Alias
+  // only remaps the exprId/name/metadata. `WidenSetOperationTypes` aligns top-level
+  // dataTypes, but nested nullability differences bypass it; those cases are caught
+  // by the `type-mismatch` gate below, which is the single source of truth for the
+  // `src.dataType == tgt.dataType` invariant `doConsume` relies on.
   @transient private lazy val perChildProjections: IndexedSeq[Seq[NamedExpression]] =
     children.toIndexedSeq.map { child =>
       child.output.zip(output).map { case (src, tgt) =>
-        assert(src.dataType == tgt.dataType,
-          s"UnionExec child output dataType ${src.dataType} does not match " +
-          s"union output dataType ${tgt.dataType}; supportCodegen should " +
-          "have returned false via the 'type-mismatch' reason.")
         Alias(src, tgt.name)(
           exprId = tgt.exprId,
           qualifier = tgt.qualifier,
           explicitMetadata = Some(tgt.metadata))
       }
     }
 
-  // True iff every child output dataType matches the corresponding union
-  // output dataType, including all nested nullabilities.
-  // `Union.allChildrenCompatible` ignores nested nullability, so children
-  // differing only there bypass `WidenSetOperationTypes`; `UnionExec.output`
-  // then merges those flags via `StructType.unionLikeMerge`, leaving src/tgt
-  // mismatched.
-  @transient private lazy val allChildOutputDataTypesMatch: Boolean =
-    children.forall { c =>
-      c.output.zip(output).forall { case (src, tgt) => src.dataType == tgt.dataType }
-    }
-
-  // Memoized: `supportCodegen` is called multiple times during planning.
-  @transient private lazy val hasAnyPartitionIndexDependentDescendant: Boolean =
-    children.exists(UnionExec.hasPartitionIndexDependentCodegen)
-
   // Memoized: consulted by `supportCodegen` (called multiple times by
   // `CollapseCodegenStages`) and by `metrics`. Conf and children are stable
   // for a given UnionExec instance; cross-plan staleness is impossible since
   // UnionExec is a case class and `withNewChildren` produces a fresh instance.
   @transient private lazy val supportCodegenFailureReason: Option[String] = {
     if (!conf.getConf(SQLConf.WHOLESTAGE_UNION_CODEGEN_ENABLED)) {
       Some("union-codegen-disabled")
-    } else if (!outputPartitioning.isInstanceOf[UnknownPartitioning]) {
+    } else if (!isPlainUnion) {
       Some("partitioning-aware")
     } else if (children.exists(_.exists(_.isInstanceOf[UnionExec]))) {
       Some("nested-union")
     } else if (children.exists(_.exists(UnionExec.isKnownMultiInputRDDCodegen))) {
       Some("multi-rdd-child")
-    } else if (hasAnyPartitionIndexDependentDescendant) {
+    } else if (children.exists(UnionExec.hasPartitionIndexDependentCodegen)) {
       Some("partition-index-dependent-child")
     } else if (children.size > conf.getConf(SQLConf.WHOLESTAGE_UNION_MAX_CHILDREN)) {
       Some("max-children-exceeded")
     } else if (supportsColumnar) {
       Some("columnar")
-    } else if (!allChildOutputDataTypesMatch) {
+    } else if (children.exists(c =>
+        c.output.zip(output).exists { case (src, tgt) => src.dataType != tgt.dataType })) {
       Some("type-mismatch")
     } else {
       None
@@ -1002,61 +988,66 @@ case class UnionExec(children: Seq[SparkPlan]) extends SparkPlan with CodegenSup
 
   override def inputRDDs(): Seq[RDD[InternalRow]] = Seq(unionedInputRDD)
 
-  // Driver-side cursor written by `doProduce` and read by `doConsume` during
-  // single-threaded code emission; resets to -1 once emission completes.
+  // Set in `doProduce`, read in `doConsume` during single-threaded code
+  // emission. `numOutputRowsTerm` is registered once per stage so the
+  // metric appears in `references[]` exactly once instead of once per
+  // child. `currentEmittingChild` tells `doConsume` which child's
+  // projection to bind.
+  @transient private var numOutputRowsTerm: String = _
   @transient private var currentEmittingChild: Int = -1
 
   override protected def doProduce(ctx: CodegenContext): String = {
+    numOutputRowsTerm = metricTerm(ctx, "numOutputRows")
+
     // For each partition of the unioned RDD, record its owning child and its
     // index within that child's RDD. Read both fields directly off the
     // `UnionPartition` so the lookup arrays do not assume `UnionRDD` lays
     // partitions out in child order.
-    val (partitionToChild, partitionToLocalIdx) =
-      unionedInputRDD.partitions.map {
-        case up: UnionPartition[_] => (up.parentRddIndex, up.parentPartition.index)
-        case other =>
-          throw SparkException.internalError(
-            s"UnionExec: Unexpected partition type ${other.getClass.getName}")
-      }.unzip
+    val (partitionToChild, partitionToLocalIdx) = unionedInputRDD.partitions.map { p =>
+      val up = p.asInstanceOf[UnionPartition[_]]
+      (up.parentRddIndex, up.parentPartition.index)
+    }.unzip
     val p2cRef = ctx.addReferenceObj("partitionToChild", partitionToChild)
     val p2lRef = ctx.addReferenceObj("partitionToLocalIdx", partitionToLocalIdx)
     val childIndexVar = ctx.freshName("unionChildIdx")
 
-    // Each child's produce output is wrapped in its own helper method. The
-    // outer `switch` in `doProduce`'s return value dispatches to the helper.
+    // Each child's produced code is wrapped in its own helper method.
     // Without this, the fused method's bytecode grows linearly with the
     // number of children and quickly exceeds HotSpot's per-method limit,
     // forcing the whole stage to run interpreted.
     //
-    // `partitionIndex` is passed as a parameter (shadowing the superclass
-    // field) rather than read from the enclosing scope. `addNewFunction` may
-    // spill helpers into a nested class when the outer class fills up, and a
-    // nested class cannot access the protected
-    // `BufferedRowIterator.partitionIndex` field. Using the parameter name
-    // `partitionIndex` keeps any child-emitted reference to that identifier
-    // resolving locally.
+    // The helper takes `int partitionIndex` as a parameter; `addNewFunction`
+    // may spill helpers into a nested class once the outer class fills up,
+    // and a nested class cannot access the protected
+    // `BufferedRowIterator.partitionIndex` field.
+    //
+    // `currentPartitionIndexVar` is rebound to an array-deref expression
+    // (rather than a local) so leaf operators (`RangeExec`, `SampleExec`)
+    // see the child-local index regardless of where their code is emitted.
+    // `SampleExec.doConsume` uses `addMutableState`, whose initializer is
+    // emitted into the state-init function, not the helper - a local in
+    // the helper would not be in scope there. The expression resolves
+    // against `partitionIndex` (the helper parameter inside the helper,
+    // and the `BufferedRowIterator` field elsewhere) in every context.
     val savedPartIdxVar = ctx.currentPartitionIndexVar
-    val cases = try {
-      children.zipWithIndex.map { case (c, i) =>
-        currentEmittingChild = i
-        ctx.currentPartitionIndexVar = s"((int[]) $p2lRef)[partitionIndex]"
-        val producedCode = c.asInstanceOf[CodegenSupport].produce(ctx, this)
-        val helper = ctx.freshName("unionChildProcess")
-        val qualifiedHelper = ctx.addNewFunction(helper,
-          s"""
-             |private void $helper(int partitionIndex) throws java.io.IOException {
-             |  $producedCode
-             |}
-           """.stripMargin)
-        s"""case $i: {
-           |  $qualifiedHelper(partitionIndex);
-           |  break;
-           |}""".stripMargin
-      }
-    } finally {
-      currentEmittingChild = -1
-      ctx.currentPartitionIndexVar = savedPartIdxVar
+    ctx.currentPartitionIndexVar = s"((int[]) $p2lRef)[partitionIndex]"
+    val cases = children.zipWithIndex.map { case (c, i) =>
+      currentEmittingChild = i
+      val producedCode = c.asInstanceOf[CodegenSupport].produce(ctx, this)
+      val helper = ctx.freshName("unionChildProcess")
+      val qualifiedHelper = ctx.addNewFunction(helper,
+        s"""
+           |private void $helper(int partitionIndex) throws java.io.IOException {
+           |  $producedCode
+           |}
+         """.stripMargin)
+      s"""case $i: {
+         |  $qualifiedHelper(partitionIndex);
+         |  break;
+         |}""".stripMargin
     }
+    currentEmittingChild = -1
+    ctx.currentPartitionIndexVar = savedPartIdxVar
 
     s"""
       |int $childIndexVar = ((int[]) $p2cRef)[partitionIndex];
@@ -1071,24 +1062,17 @@ case class UnionExec(children: Seq[SparkPlan]) extends SparkPlan with CodegenSup
 
   override def doConsume(
       ctx: CodegenContext, input: Seq[ExprCode], row: ExprCode): String = {
-    require(currentEmittingChild >= 0,
-      "UnionExec.doConsume invoked outside doProduce emission window")
     val i = currentEmittingChild
-    // The wrapped child in each `perChildProjections(i)` element is always an
-    // `Attribute`, which is deterministic by definition; no
-    // `evaluateRequiredVariables` call is needed to force single-evaluation
-    // of non-deterministic expressions.
-    val bound = BindReferences.bindReferences(perChildProjections(i), children(i).output)
-
+    require(i >= 0, "UnionExec.doConsume invoked outside doProduce emission window")
     // Route BoundReference reads through `currentVars` (the incoming row is
    // delivered as variables under WSCG, not via ctx.INPUT_ROW).
+    val bound = BindReferences.bindReferences(perChildProjections(i), children(i).output)
     ctx.currentVars = input
     ctx.INPUT_ROW = null
     val projectedExprCodes = bound.map(_.genCode(ctx))
 
-    val numOutput = metricTerm(ctx, "numOutputRows")
     s"""
-       |$numOutput.add(1L);
+       |$numOutputRowsTerm.add(1L);
       |${consume(ctx, projectedExprCodes)}
     """.stripMargin
   }
@@ -1103,7 +1087,7 @@ case class UnionExec(children: Seq[SparkPlan]) extends SparkPlan with CodegenSup
   override def usedInputs: AttributeSet = AttributeSet.empty
 
   protected override def doExecute(): RDD[InternalRow] = {
-    if (outputPartitioning.isInstanceOf[UnknownPartitioning]) {
+    if (isPlainUnion) {
       sparkContext.union(children.map(_.execute()))
     } else {
       // This union has a known partitioning, i.e., its children have the same partitioning
@@ -1138,13 +1122,24 @@ object UnionExec {
   }
 
   /**
-   * True if any expression in the subtree is [[Nondeterministic]]. Such
-   * expressions may embed the raw `partitionIndex` field via
-   * `addPartitionInitializationStatement`, which would read the global
+   * True if any expression in the subtree embeds the raw `partitionIndex` field
+   * via `addPartitionInitializationStatement`, which would read the global
    * UnionRDD index instead of the child-local one under fusion.
+   *
+   * The check uses [[Nondeterministic]] as the proxy: every catalyst expression
+   * that calls `addPartitionInitializationStatement` referencing `partitionIndex`
+   * is `Nondeterministic`. The `InputFile*` expressions are `Nondeterministic`
+   * but read from `InputFileBlockHolder` (a per-task thread-local) and do not
+   * embed `partitionIndex`, so they are safe under fusion.
   */
-  def hasPartitionIndexDependentCodegen(p: SparkPlan): Boolean = p.exists {
-    plan => plan.expressions.exists(_.exists(_.isInstanceOf[Nondeterministic]))
+  def hasPartitionIndexDependentCodegen(p: SparkPlan): Boolean = p.exists { plan =>
+    plan.expressions.exists(_.exists {
+      case _: InputFileName => false
+      case _: InputFileBlockStart => false
+      case _: InputFileBlockLength => false
+      case _: Nondeterministic => true
+      case _ => false
+    })
   }
 }
```
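The fused dispatch that `doProduce` emits can be modeled in a few lines of plain Scala: the generated `switch` picks the owning child via one lookup array and hands it the child-local index via the other. In this sketch, Scala functions stand in for the generated per-child helper methods; the names (`helpers`, `processPartition`) are illustrative, not Spark's.

```scala
// Lookup arrays for a union of children with 3 and 2 partitions, as
// doProduce would derive them from the UnionRDD's partitions.
val partitionToChild = Array(0, 0, 0, 1, 1)
val partitionToLocalIdx = Array(0, 1, 2, 0, 1)

// One "helper" per child, analogous to the generated unionChildProcess
// methods; each sees the child-local index, mirroring the
// ((int[]) p2lRef)[partitionIndex] expression in the generated Java.
val helpers: Array[Int => String] = Array(
  localIdx => s"child0:partition$localIdx",
  localIdx => s"child1:partition$localIdx")

// The generated switch: route the global partitionIndex to the owning
// child's helper, passing along its child-local index.
def processPartition(partitionIndex: Int): String =
  helpers(partitionToChild(partitionIndex))(partitionToLocalIdx(partitionIndex))
```

Binding the child-local index as an expression over `partitionIndex` (rather than a local captured at helper entry) is what lets code emitted outside the helper, such as `addMutableState` initializers, still resolve it.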

sql/core/src/test/scala/org/apache/spark/sql/execution/UnionCodegenSuite.scala

Lines changed: 51 additions & 0 deletions
```diff
@@ -602,6 +602,57 @@ class UnionCodegenSuite extends QueryTest with SharedSparkSession {
         "numOutputRows should be 0 for all-empty union")
     }
   }
+
+  test("SPARK-56482: partitioning-aware union falls back to non-codegen") {
+    // After repartition, both children expose a `HashPartitioning` on the same key,
+    // so `UnionExec.outputPartitioning` is non-Unknown and the codegen path is denied.
+    // AQE is disabled here so the executedPlan exposes the UnionExec directly
+    // (under AQE the plan is wrapped in `AdaptiveSparkPlanExec`, which does not
+    // surface its inputPlan via `children`).
+    withSQLConf(
+      SQLConf.UNION_OUTPUT_PARTITIONING.key -> "true",
+      SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "false") {
+      val a = rangeDF(100).repartition(4, col("id"))
+      val b = rangeDF(100, 200).repartition(4, col("id"))
+      val df = a.union(b)
+      assert(!unionInsideWSCG(df),
+        "Partitioning-aware union must not fuse into WSCG")
+      val unionExec = df.queryExecution.executedPlan.collectFirst {
+        case u: UnionExec => u
+      }.get
+      assert(!unionExec.metrics.contains("numOutputRows"),
+        "numOutputRows metric must not be registered on the partitioning-aware path")
+      assertFlagParity(() => a.union(b).orderBy("id"))
+    }
+  }
+
+  test("SPARK-56482: input_file_name child fuses (Nondeterministic but partition-index-free)") {
+    // `InputFileName` is `Nondeterministic` but reads from `InputFileBlockHolder`
+    // (a per-task thread-local) and does not embed `partitionIndex`. The gate's
+    // narrow check should let this fuse.
+    withTempPath { dir =>
+      val path = dir.getCanonicalPath
+      rangeDF(20).write.parquet(path)
+      val a = spark.read.parquet(path).select(col("id"), input_file_name().as("f"))
+      val b = spark.read.parquet(path).select(col("id"), input_file_name().as("f"))
+      val df = a.union(b).filter(col("id") > 0)
+      assert(unionInsideWSCG(df),
+        "Union with input_file_name child should fuse into WSCG")
+      assertFlagParity(() => a.union(b).orderBy("id", "f"))
+    }
+  }
+
+  test("SPARK-56482: union with sample children fuses (or falls back) without crashing") {
+    // `SampleExec.doConsume` reads `currentPartitionIndexVar` from inside an
+    // `addMutableState` initializer, which is emitted into the state-init
+    // function rather than the per-child helper. The bound expression must
+    // therefore resolve in any emission scope, not just inside the helper.
+    val a = rangeDF(20).sample(false, 0.5, 1L)
+    val b = rangeDF(20).sample(false, 0.5, 1L)
+    val df = a.union(b).filter(col("id") > 0)
+    df.collect()
+    assertFlagParity(() => a.union(b).orderBy("id"))
+  }
 }
 
 /** Runs [[UnionCodegenSuite]] with ANSI mode enabled. */
```
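The `input_file_name` test above exercises the narrowed gate, whose essential property is pattern-match ordering: the `InputFile*` exclusions must precede the `Nondeterministic` catch-all. A self-contained sketch with stand-in expression classes (`Expr`, `Rand`, `Add` are illustrative, not catalyst's real hierarchy):

```scala
// Minimal stand-ins for the expression classes involved in the gate.
sealed trait Expr { def children: Seq[Expr] = Nil }
trait Nondeterministic extends Expr
case class InputFileName() extends Nondeterministic
case class Rand(seed: Long) extends Nondeterministic
case class Add(left: Expr, right: Expr) extends Expr {
  override def children: Seq[Expr] = Seq(left, right)
}

// True if the tree contains a Nondeterministic node other than the
// partition-index-free InputFile* readers. The exclusion case must come
// before the Nondeterministic catch-all, exactly as in the patched gate.
def partitionIndexDependent(e: Expr): Boolean = e match {
  case _: InputFileName => false    // safe: reads a per-task thread-local
  case _: Nondeterministic => true  // may embed partitionIndex
  case other => other.children.exists(partitionIndexDependent)
}
```

Under this gate, a union whose children only use `input_file_name()` fuses, while one containing `rand()` still falls back.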
