[SPARK-51537][CONNECT][CORE] construct classpath using both global jars and session specific jars on executor #50334

Open · wants to merge 1 commit into master from connect-executor-classpath
Conversation


wbo4958 (Contributor) commented on Mar 20, 2025

What changes were proposed in this pull request?

This PR adds global jars (e.g., those added via --jars) to the classpath on the executor side in Spark Connect mode.

Why are the changes needed?

In Spark Connect mode, when connecting to a non-local (e.g., standalone) cluster, the executor creates an isolated session state that includes a session-specific classloader for each task. However, this session-specific classloader does not include the global JARs specified via the --jars option, which can lead to deserialization exceptions such as:

Caused by: java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.rdd.MapPartitionsRDD.f of type scala.Function3 in instance of org.apache.spark.rdd.MapPartitionsRDD
        at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2096)
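
A minimal Scala sketch of the idea, not the actual Spark code (the object, method, and parameter names here are illustrative): the executor-side classloader for an isolated Connect session should be constructed from both jar sets, not the session jars alone.

```scala
import java.net.{URL, URLClassLoader}

object SessionClasspath {
  // Hypothetical helper illustrating the fix: the executor-side session
  // classloader must see both the global jars (--jars) and the session jars.
  def buildSessionClassLoader(
      globalJarUrls: Seq[URL],   // jars distributed via --jars, shared by all sessions
      sessionJarUrls: Seq[URL],  // jars added through this Connect session
      parent: ClassLoader): ClassLoader = {
    // With only sessionJarUrls on the path, classes shipped via --jars cannot
    // be resolved while deserializing a task, producing the exception above.
    new URLClassLoader((globalJarUrls ++ sessionJarUrls).toArray, parent)
  }
}
```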

Does this PR introduce any user-facing change?

No

How was this patch tested?

Manual test:

1. Clone the minimal project that reproduces this issue:
   git clone git@github.com:wbo4958/ConnectMLIssue.git
2. Compile the project:
   mvn clean package
3. Start a standalone cluster:
   $SPARK_HOME/sbin/start-master.sh -h localhost
   $SPARK_HOME/sbin/start-worker.sh spark://localhost:7077
4. Start a Connect server that connects to the standalone cluster (a sketch of what this script might contain follows these steps):
   ./standalone.sh
5. Run the demo from a PySpark client environment:
   python repro-issue.py
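
The standalone.sh script itself lives in the linked repro repo; a plausible equivalent is sketched below. The jar path is an assumption, and this relies on start-connect-server.sh forwarding standard spark-submit options such as --master and --jars.

```sh
# Hypothetical sketch of standalone.sh: launch a Connect server against the
# standalone master, shipping the demo jar via --jars so it becomes a global jar.
$SPARK_HOME/sbin/start-connect-server.sh \
  --master spark://localhost:7077 \
  --jars target/connect-ml-issue-1.0.jar
```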

Without this PR, you will see the following exception:

Caused by: java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.rdd.MapPartitionsRDD.f of type scala.Function3 in instance of org.apache.spark.rdd.MapPartitionsRDD
	at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2096)
	at java.io.ObjectStreamClass$FieldReflector.checkObjectFieldValueTypes(ObjectStreamClass.java:2060)
	at java.io.ObjectStreamClass.checkObjFieldValueTypes(ObjectStreamClass.java:1347)
	at java.io.ObjectInputStream$FieldValues.defaultCheckFieldValues(ObjectInputStream.java:2679)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2486)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2257)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1733)
	at java.io.ObjectInputStream$FieldValues.<init>(ObjectInputStream.java:2606)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2457)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2257)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1733)
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:509)
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:467)
	at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:88)
	at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:136)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:86)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:171)
	at org.apache.spark.scheduler.Task.run(Task.scala:147)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$5(Executor.scala:645)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:80)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:77)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:100)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:648)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.lang.Thread.run(Thread.java:840)

Was this patch authored or co-authored using generative AI tooling?

No

github-actions bot added the CORE label on Mar 20, 2025
wbo4958 changed the title to [SPARK-51537][CONNECT][CORE] [constructed classpath using both global jars and session specific jars in executor on Mar 20, 2025
wbo4958 force-pushed the connect-executor-classpath branch from 7b963dc to bbe2a94 on Mar 21, 2025 02:17
wbo4958 changed the title to [SPARK-51537][CONNECT][CORE] construct classpath using both global jars and session specific jars in executor on Mar 21, 2025
wbo4958 marked this pull request as ready for review on Mar 21, 2025 02:43
wbo4958 changed the title to [SPARK-51537][CONNECT][CORE] construct classpath using both global jars and session specific jars on executor on Mar 21, 2025
wbo4958 (Contributor, Author) commented on Mar 21, 2025

Hi @hvanhovell @zhenlineo @HyukjinKwon @vicennial, could you help review this PR? Thanks very much.
