
[SPARK-52166] [SDP] Add support for PipelineEvents #50906


Closed
jonmio wants to merge 12 commits

Conversation


@jonmio jonmio commented May 15, 2025

What changes were proposed in this pull request?

The execution of an SDP is a complex, multi-stage computation. To observe progress and monitor state transitions of a pipeline execution, we will maintain a stream of pipeline events (e.g. flow started, flow completed, pipeline run started, pipeline run completed). This PR introduces the data model for PipelineEvents.
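
For illustration, a minimal sketch of what such an event data model might look like (all names here are illustrative, not necessarily the exact ones in this PR):

  import java.sql.Timestamp

  // Hypothetical structured payloads for the lifecycle transitions listed above.
  sealed trait EventDetails
  case class FlowProgress(flowName: String, status: String) extends EventDetails
  case class RunProgress(runId: String, status: String) extends EventDetails

  // Hypothetical shape of an internal pipeline event record.
  case class PipelineEvent(
      id: String,            // globally unique event id
      timestamp: Timestamp,  // when the event was emitted
      message: String,       // human-readable description
      details: EventDetails) // structured payload for the transition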

Additionally, since this is the first PR to add tests to the pipelines module, we update the GitHub Actions CI job to run tests in that module.

Why are the changes needed?

See description above

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added unit tests for event construction as well as unit tests for the helper functions used in ConstructPipelineEvent.

Was this patch authored or co-authored using generative AI tooling?

No

@jonmio jonmio changed the title [WIP] [SPARK-52166] [SDP] Add support for PipelineEvents [SPARK-52166] [SDP] Add support for PipelineEvents May 19, 2025
@github-actions github-actions bot added the INFRA label May 19, 2025

import org.apache.spark.internal.Logging

case class UnresolvedFlowFailureException(name: String, cause: Throwable)
Member

Let's move this to QueryCompilationErrors and create an error condition for it in error-conditions.json.

Author

I saw that this is actually unused, so I'm just going to remove it for now.

case _ =>
  val className = t.getClass.getName
  t match {
    case _: UnresolvedFlowFailureException =>
Member

Why convert UnresolvedFlowFailureException to SerializedException?

Author @jonmio, May 19, 2025

SerializedException is a slightly more structured format for recording errors that happen during pipeline execution.

Note that SerializedException is an internal exception type, so I don't think we need to create an error class for it.
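
For context, a minimal sketch of the kind of conversion being discussed, assuming a simple shape for SerializedException (the fields and the from helper are hypothetical):

  // Hypothetical structured record for errors hit during pipeline execution.
  case class SerializedException(className: String, message: String, stack: Seq[String])

  object SerializedException {
    // Flatten any Throwable into a serializable record of its class, message, and stack.
    def from(t: Throwable): SerializedException =
      SerializedException(
        className = t.getClass.getName,
        message = Option(t.getMessage).getOrElse(""),
        stack = t.getStackTrace.toSeq.map(_.toString))
  }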


/**
* An internal event that is emitted during the run of a pipeline.
* @param id A time based, globally unique id
Member

In this PR, the id is UUID.randomUUID(). Why is it time-based?

Author

Fixed the comment

* @param flowName The name of the flow
* @param sourceCodeLocation The location of the source code
*/
case class Origin(
Member @gengliangwang, May 19, 2025

Given there is an existing Origin in Spark catalyst: https://github.com/apache/spark/blob/master/sql/api/src/main/scala/org/apache/spark/sql/catalyst/trees/origin.scala, shall we rename this one as PipelineOrigin?

Author

Updated to PipelineEventOrigin, since the origin contains information not just about the pipeline but also about the dataset.
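
Based on the scaladoc fragment above, the renamed class might look roughly like this (a sketch; the exact fields and their types may differ in the PR):

  // Hypothetical shape after the rename to PipelineEventOrigin.
  case class PipelineEventOrigin(
      flowName: Option[String],           // the name of the flow, if the event is flow-scoped
      datasetName: Option[String],        // the dataset the event relates to, if any
      sourceCodeLocation: Option[String]) // where the relevant source code is defined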

/** A format string that defines how timestamps are serialized in a [[PipelineEvent]]. */
private val timestampFormat: String = "yyyy-MM-dd'T'HH:mm:ss.SSSXX"
private val tz: TimeZone = TimeZone.getTimeZone("UTC")
private val df = new SimpleDateFormat(timestampFormat)
Member

Shall we use DateTimeFormatter (thread-safe)?
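
For reference, a thread-safe equivalent using java.time (a sketch under the same UTC assumption, not the code merged in this PR):

  import java.time.{Instant, ZoneOffset}
  import java.time.format.DateTimeFormatter

  object EventTimestamps {
    // DateTimeFormatter is immutable and thread-safe, unlike SimpleDateFormat.
    private val timestampFormatter: DateTimeFormatter =
      DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSSXX").withZone(ZoneOffset.UTC)

    def format(instant: Instant): String = timestampFormatter.format(instant)
  }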


/** A format string that defines how timestamps are serialized in a [[PipelineEvent]]. */
private val timestampFormat: String = "yyyy-MM-dd'T'HH:mm:ss.SSSXX"
private val tz: TimeZone = TimeZone.getTimeZone("UTC")
Member

So the time zone is always UTC? The default timestamp uses the session-local time zone, which can be set via:

  val SESSION_LOCAL_TIMEZONE = buildConf(SqlApiConfHelper.SESSION_LOCAL_TIMEZONE_KEY)
    .doc("The ID of session local timezone in the format of either region-based zone IDs or " +
      "zone offsets. Region IDs must have the form 'area/city', such as 'America/Los_Angeles'. " +
      "Zone offsets must be in the format '(+|-)HH', '(+|-)HH:mm' or '(+|-)HH:mm:ss', e.g '-08', " +
      "'+01:00' or '-13:33:33'. Also 'UTC' and 'Z' are supported as aliases of '+00:00'. Other " +
      "short names are not recommended to use because they can be ambiguous.")
    .version("2.2.0")
    .stringConf
    .checkValue(isValidTimezone, errorClass = "TIME_ZONE", parameters = tz => Map.empty)
    .createWithDefaultFunction(() => TimeZone.getDefault.getID)

Author @jonmio, May 20, 2025

Can we address this in a follow-up? I think we need to discuss what the right interface for setting this conf is for a pipeline. It seems like this should be a pipeline-level setting that cannot be changed during the run of the pipeline, and it should be set as a Spark conf in the pipeline settings.

Author

The pipeline events are also not user-facing; at the moment they are only used by the system.

Member

Let's add a comment to explain the intention.

@jonmio jonmio requested a review from gengliangwang May 20, 2025 00:56
* automatically filled in. Developers should always use this factory rather than construct
* an event directly from an empty proto.
*/
object ConstructPipelineEvent extends Logging {
Member

Is the Logging trait used?

Author

Nope, removed.
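
For context, a factory like the one documented above typically stamps the common fields so call sites never build an event by hand. A hedged sketch, reusing the hypothetical PipelineEvent shape from earlier:

  import java.sql.Timestamp
  import java.util.UUID

  object ConstructPipelineEvent {
    // Fill in id and timestamp automatically; callers supply only what varies.
    def apply(message: String, details: EventDetails): PipelineEvent =
      PipelineEvent(
        id = UUID.randomUUID().toString,
        timestamp = new Timestamp(System.currentTimeMillis()),
        message = message,
        details = details)
  }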

<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_${scala.binary.version}</artifactId>
Member

Is this required in this PR?

Author

removed

@gengliangwang
Member

Thanks, merging to master

@LuciferYang
Contributor

@jonmio The compilation of all Maven daily tests failed for the following reason:

scaladoc error: fatal error: object scala in compiler mirror not found.
Error:  Failed to execute goal net.alchim31.maven:scala-maven-plugin:4.9.2:doc-jar (attach-scaladocs) on project spark-pipelines_2.13: MavenReportException: Error while creating archive: wrap: Process exited with an error: 1 (Exit value: 1) -> [Help 1]
Error:  
Error:  To see the full stack trace of the errors, re-run Maven with the -e switch.
Error:  Re-run Maven using the -X switch to enable full debug logging.
Error:  
Error:  For more information about the errors and possible solutions, please read the following articles:
Error:  [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
Error:  
Error:  After correcting the problems, you can resume the build with the command
Error:    mvn <args> -rf :spark-pipelines_2.13
Error: Process completed with exit code 1.

Would you have a moment to take a look? Thanks!


also cc @gengliangwang

@LuciferYang
Contributor

Made a quick fix: #51008

@sryza sryza self-assigned this May 25, 2025
@sryza
Contributor

sryza commented May 30, 2025

Thanks for jumping in and fixing this, @LuciferYang!
