[SPARK-52166] [SDP] Add support for PipelineEvents #50906
Conversation
```scala
import org.apache.spark.internal.Logging

case class UnresolvedFlowFailureException(name: String, cause: Throwable)
```
Reviewer: Let's move this to QueryCompilationErrors and create an error condition for it in error-conditions.json.
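For context, a minimal sketch of the suggested pattern: an entry is added to error-conditions.json, and a factory method raises an AnalysisException carrying that error condition. The condition name UNRESOLVED_FLOW and the method below are hypothetical, not the PR's code.

```scala
import org.apache.spark.sql.AnalysisException

// Hypothetical factory following the QueryCompilationErrors pattern; the
// condition "UNRESOLVED_FLOW" would need a matching entry in
// error-conditions.json.
object PipelinesErrors {
  def unresolvedFlowFailureError(name: String, cause: Throwable): Throwable =
    new AnalysisException(
      errorClass = "UNRESOLVED_FLOW",
      messageParameters = Map("name" -> name),
      cause = Option(cause))
}
```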
Author: I saw that this is actually unused, so I am just going to remove it for now.
```scala
case _ =>
  val className = t.getClass.getName
  t match {
    case _: UnresolvedFlowFailureException =>
```
Reviewer: Why convert UnresolvedFlowFailureException to SerializedException?
Author: SerializedException is a slightly more structured format for recording errors that happen during pipeline execution. Note that SerializedException is an internal exception type, so I don't think we need to create an error class for it.
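SerializedException itself is internal to the pipelines module; the sketch below is only a hypothetical illustration of what "slightly more structured" means here: capturing the class name and message chain instead of carrying a raw Throwable.

```scala
// Hypothetical stand-in for the internal SerializedException type; the real
// definition in the PR may differ.
case class SerializedExceptionSketch(
    className: String,
    message: String,
    cause: Option[SerializedExceptionSketch])

object SerializedExceptionSketch {
  // Recursively records the error chain as plain, loggable data.
  def from(t: Throwable): SerializedExceptionSketch =
    SerializedExceptionSketch(
      className = t.getClass.getName,
      message = Option(t.getMessage).getOrElse(""),
      cause = Option(t.getCause).map(from))
}
```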
```scala
/**
 * An internal event that is emitted during the run of a pipeline.
 * @param id A time based, globally unique id
```
Reviewer: In this PR, the id is UUID.randomUUID(). Why is it time-based?
Author: Fixed the comment.
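The distinction the reviewer is drawing: UUID.randomUUID() produces a version-4 (random) UUID, which encodes no timestamp, whereas a "time based" id would be a version-1 UUID. A quick check:

```scala
import java.util.UUID

// randomUUID() yields a version-4 (random) UUID; a time-based UUID
// would report version 1.
val id = UUID.randomUUID()
assert(id.version() == 4)
```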
```scala
 * @param flowName The name of the flow
 * @param sourceCodeLocation The location of the source code
 */
case class Origin(
```
Reviewer: Given there is an existing Origin in Spark catalyst (https://github.com/apache/spark/blob/master/sql/api/src/main/scala/org/apache/spark/sql/catalyst/trees/origin.scala), shall we rename this one to PipelineOrigin?
Author: Updated to PipelineEventOrigin, since the origin contains information not just about the pipeline but also about the dataset.
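A hypothetical sketch of the renamed type, inferred only from the fields visible in this thread (flow name, dataset, source location); the PR's actual definition may differ:

```scala
// Fields are Options since an event may originate from the pipeline itself,
// a dataset, or a flow; this shape is an assumption, not the PR's code.
case class PipelineEventOrigin(
    flowName: Option[String],
    datasetName: Option[String],
    sourceCodeLocation: Option[String]) // simplified; likely a richer type in the PR
```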
```scala
/** A format string that defines how timestamps are serialized in a [[PipelineEvent]]. */
private val timestampFormat: String = "yyyy-MM-dd'T'HH:mm:ss.SSSXX"
private val tz: TimeZone = TimeZone.getTimeZone("UTC")
private val df = new SimpleDateFormat(timestampFormat)
```
Reviewer: Shall we use DateTimeFormatter (thread-safe)?
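A minimal sketch of that suggestion: DateTimeFormatter is immutable and thread-safe, so one shared instance can replace SimpleDateFormat, which must not be shared across threads without synchronization.

```scala
import java.time.{Instant, ZoneOffset}
import java.time.format.DateTimeFormatter

object Timestamps {
  // One shared, thread-safe formatter with the same pattern and UTC zone
  // as the SimpleDateFormat above.
  val formatter: DateTimeFormatter =
    DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSSXX").withZone(ZoneOffset.UTC)
}

// Usage: format an Instant instead of a java.util.Date.
Timestamps.formatter.format(Instant.now()) // e.g. 2025-01-01T00:00:00.000Z
```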
```scala
/** A format string that defines how timestamps are serialized in a [[PipelineEvent]]. */
private val timestampFormat: String = "yyyy-MM-dd'T'HH:mm:ss.SSSXX"
private val tz: TimeZone = TimeZone.getTimeZone("UTC")
```
Reviewer: So the timezone is always UTC? The default timestamp uses the session-local time zone, which can be set via:

```scala
val SESSION_LOCAL_TIMEZONE = buildConf(SqlApiConfHelper.SESSION_LOCAL_TIMEZONE_KEY)
  .doc("The ID of session local timezone in the format of either region-based zone IDs or " +
    "zone offsets. Region IDs must have the form 'area/city', such as 'America/Los_Angeles'. " +
    "Zone offsets must be in the format '(+|-)HH', '(+|-)HH:mm' or '(+|-)HH:mm:ss', e.g '-08', " +
    "'+01:00' or '-13:33:33'. Also 'UTC' and 'Z' are supported as aliases of '+00:00'. Other " +
    "short names are not recommended to use because they can be ambiguous.")
  .version("2.2.0")
  .stringConf
  .checkValue(isValidTimezone, errorClass = "TIME_ZONE", parameters = tz => Map.empty)
  .createWithDefaultFunction(() => TimeZone.getDefault.getID)
```
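For reference, this conf surfaces as spark.sql.session.timeZone and can be read or overridden per session; a minimal sketch:

```scala
import org.apache.spark.sql.SparkSession

// Reading and overriding the session-local time zone discussed above.
val spark = SparkSession.builder().master("local[1]").getOrCreate()
spark.conf.set("spark.sql.session.timeZone", "UTC")
println(spark.conf.get("spark.sql.session.timeZone")) // UTC
```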
Author: Can we address this in a follow-up? I think we need to discuss what the right interface for setting this conf is for a pipeline. It seems like this should be a pipeline-level setting that cannot be changed throughout the run of the pipeline, and it should be set as a Spark conf in the pipeline settings.
Author: The pipeline events are also not user-facing and are just used by the system at the moment.
Reviewer: Let's add a comment to explain the intention.
```scala
 * automatically filled in. Developers should always use this factory rather than construct
 * an event directly from an empty proto.
 */
object ConstructPipelineEvent extends Logging {
```
Reviewer: Is the Logging trait used?
Author: Nope, removed.
```xml
<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_${scala.binary.version}</artifactId>
```
Reviewer: Is this required in this PR?
Author: Removed.
Thanks, merging to master.
@jonmio The compilation of all Maven daily tests failed due to the following reasons: […] Would you have a moment to look? Thanks.
Also cc @gengliangwang.
Gave a quick fix: #51008.
Thanks for jumping in and fixing this, @LuciferYang!
What changes were proposed in this pull request?
The execution of an SDP (Spark Declarative Pipeline) is a complex, multi-stage computation. To observe progress and monitor state transitions of a pipeline execution, we will maintain a stream of pipeline events (e.g. flow started, flow completed, pipeline run started, pipeline run completed). This PR introduces the data model for PipelineEvents.
Additionally, since this is the first PR to add tests to the pipelines module, we update the GitHub Actions CI job to run tests in the pipelines module.
Why are the changes needed?
See description above
Does this PR introduce any user-facing change?
No
How was this patch tested?
Added unit tests for event construction, as well as unit tests for the helper functions used in ConstructPipelineEvent.
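As a flavor of the helper-level tests described, here is a self-contained sketch built around the timestamp pattern quoted in the review threads above; it is not the PR's actual suite.

```scala
import java.text.SimpleDateFormat
import java.util.{Date, TimeZone}

import org.scalatest.funsuite.AnyFunSuite

// Sketch only: checks that the "yyyy-MM-dd'T'HH:mm:ss.SSSXX" pattern with a
// UTC time zone renders the epoch as an ISO-8601 string ending in "Z".
class TimestampFormatSketchSuite extends AnyFunSuite {
  test("timestamps serialize as UTC ISO-8601 with millisecond precision") {
    val df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSXX")
    df.setTimeZone(TimeZone.getTimeZone("UTC"))
    assert(df.format(new Date(0L)) == "1970-01-01T00:00:00.000Z")
  }
}
```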
Was this patch authored or co-authored using generative AI tooling?
No