Spark Job dependencies set using spark.submit.pyFiles cannot be loaded from HDFS #419
Comments
Attached is an example Spark job configured to mount the HDFS and Kerberos config and load resources from HDFS.
I can confirm this is still a problem with Spark 3.5.1 and the 24.7.0 release. It works as expected if HDFS is not kerberized. Attempts to use the IP address of the active namenode (or the service name) also fail.
Update: The problem was not upstream. The operator was not setting the environment variables properly. This is now fixed by #451. A working example can be found in the reproduce-spark-bug branch of the
Hi all, small update: I gave this a quick try based on the e2e security demo and indeed setting I pushed the results to https://github.com/stackabletech/demos/tree/chore/reproduce-spark-bug
Fixed by #451, closing.
Could you please add a short sentence as a comment that could go into the release notes?
Apache Spark Operator: Environment variables can now be overridden with the role group’s envOverrides property.
Affected Stackable version
24.3
Affected Apache Spark-on-Kubernetes version
3.5.0
Current and expected behavior
With the correct configuration in place for Kerberos and HDFS, Spark jobs can be started successfully using a resource loaded from Kerberos-enabled HDFS by setting `mainApplicationFile` to an HDFS URL, e.g. `mainApplicationFile: hdfs://poc-hdfs/user/stackable/pi.py`. The same Spark job fails if the property `spark.submit.pyFiles` is configured to point to a resource stored on the same HDFS cluster, e.g. `hdfs://poc-hdfs/user/stackable/mybanner.py`. The log then shows:
2024-06-25T10:21:14,754 WARN [main] org.apache.hadoop.fs.FileSystem - Failed to initialize fileystem hdfs://poc-hdfs/user/stackable/mybanner.py: java.lang.IllegalArgumentException: java.net.UnknownHostException: poc-hdfs
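For orientation, the failing setup might look roughly like the manifest below. This is a minimal sketch, not the reporter's actual manifest. The resource name and the overall layout are assumptions based on the Stackable `SparkApplication` custom resource. Only `mainApplicationFile`, `spark.submit.pyFiles`, their HDFS URLs and the Spark version come from this report. The volumes that mount the HDFS and Kerberos configuration are omitted.

```yaml
# Hypothetical sketch; field layout assumed from the Stackable SparkApplication CRD.
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: pyspark-hdfs-pyfiles  # hypothetical name
spec:
  mode: cluster
  sparkImage:
    productVersion: "3.5.0"  # affected Spark version from this report
  # Loading the main application file from kerberized HDFS works.
  mainApplicationFile: hdfs://poc-hdfs/user/stackable/pi.py
  sparkConf:
    # With this property set, the job fails with UnknownHostException: poc-hdfs.
    spark.submit.pyFiles: hdfs://poc-hdfs/user/stackable/mybanner.py
```

Both URLs use the same HDFS nameservice (`poc-hdfs`), so the difference in behaviour comes from how the dependency is resolved and not from the location of the file.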
Possible solution
No response
Additional context
No response
Environment
No response
Would you like to work on fixing this bug?
None