Spark Job dependencies set using spark.submit.pyFiles cannot be loaded from HDFS #419


Closed
Jimvin opened this issue Jun 25, 2024 · 7 comments
Labels
release/24.11.0 · release-note · type/bug

Comments


Jimvin commented Jun 25, 2024

Affected Stackable version

24.3

Affected Apache Spark-on-Kubernetes version

3.5.0

Current and expected behavior

With the correct configuration in place for Kerberos and HDFS, Spark jobs can be started successfully using a resource loaded from Kerberos-enabled HDFS by setting mainApplicationFile to an HDFS URL, e.g. mainApplicationFile: hdfs://poc-hdfs/user/stackable/pi.py. The same Spark job fails if the property spark.submit.pyFiles points to a resource stored on the same HDFS cluster, e.g. hdfs://poc-hdfs/user/stackable/mybanner.py:
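For context, a minimal sketch of a job spec that triggers this behaviour (only mainApplicationFile and spark.submit.pyFiles are taken from the report; every other field and name is an assumption):

```yaml
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: spark-pi-test   # hypothetical name
spec:
  mode: cluster
  mainApplicationFile: hdfs://poc-hdfs/user/stackable/pi.py   # works
  sparkConf:
    # Fails: the hdfs://poc-hdfs nameservice cannot be resolved
    # when the pyFiles dependency is fetched
    spark.submit.pyFiles: hdfs://poc-hdfs/user/stackable/mybanner.py
```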

2024-06-25T10:21:14,754 WARN [main] org.apache.hadoop.fs.FileSystem - Failed to initialize fileystem hdfs://poc-hdfs/user/stackable/mybanner.py: java.lang.IllegalArgumentException: java.net.UnknownHostException: poc-hdfs

Possible solution

No response

Additional context

No response

Environment

No response

Would you like to work on fixing this bug?

None


Jimvin commented Jun 25, 2024

Attached is an example spark job configured to mount the HDFS and Kerberos config and load resources from HDFS.
spark-pi-test.zip

@razvan razvan self-assigned this Aug 15, 2024

razvan commented Aug 16, 2024

I can confirm this is still a problem with Spark 3.5.1 and the 24.7.0 release.

It works as expected if hdfs is not kerberized.

Attempts to use the IP address of the active namenode (or the service name) also fail with:

Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): SIMPLE authentication is not enabled.  Available:[TOKEN, KERBEROS]
    at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1612)
    at org.apache.hadoop.ipc.Client.call(Client.java:1558)
    at org.apache.hadoop.ipc.Client.call(Client.java:1455)
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:242)
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:129)
    at jdk.proxy2/jdk.proxy2.$Proxy27.getFileInfo(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:965)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:569)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
    at jdk.proxy2/jdk.proxy2.$Proxy28.getFileInfo(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1739)

I'm sure this is a problem upstream (either in Spark or in the HDFS client).
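As background, this AccessControlException is the typical symptom of an HDFS client that has not loaded the cluster's core-site.xml: without it, hadoop.security.authentication falls back to its default of simple, so the client attempts SIMPLE RPC against a Kerberized namenode. For reference, a Kerberized cluster's core-site.xml enables Kerberos roughly like this (values illustrative):

```xml
<!-- Illustrative core-site.xml fragment for a Kerberized HDFS cluster.
     Property names are standard Hadoop; the values are examples only. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://poc-hdfs</value>
  </property>
  <property>
    <!-- Defaults to "simple" when this file is not on the client's path -->
    <name>hadoop.security.authentication</name>
    <value>kerberos</value>
  </property>
</configuration>
```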

@razvan razvan moved this to Development: Waiting for Review in Stackable Engineering Aug 16, 2024

razvan commented Aug 19, 2024

Update

The problem was not upstream but in the operator not setting the environment variables properly. This is now fixed by #451.

A working example can be found in the reproduce-spark-bug branch of the demos repository.


sbernauer commented Aug 19, 2024

Hi all,

Small update: I gave this a quick try based on the end-to-end security demo, and setting HADOOP_CONF_DIR does indeed fix this (which also makes sense, as we only add the configs to the driver and executor class paths).
With that in place I was also able to use the defaultFS, e.g. mainApplicationFile: hdfs:/lakehouse/test.py

I pushed the results to https://github.com/stackabletech/demos/tree/chore/reproduce-spark-bug
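Putting the comments above together, a hedged sketch of the workaround: set HADOOP_CONF_DIR via envOverrides so the Hadoop client used by spark-submit finds the mounted HDFS/Kerberos config. The mount path and the exact role names are assumptions for illustration, not taken verbatim from the fix:

```yaml
# Sketch only: point the Hadoop client at the mounted HDFS config so
# hdfs://poc-hdfs/... URLs in spark.submit.pyFiles can be resolved.
# /stackable/hdfs-config is a placeholder for wherever the HDFS
# discovery ConfigMap is mounted in the job spec.
spec:
  job:
    envOverrides:
      HADOOP_CONF_DIR: /stackable/hdfs-config
  driver:
    envOverrides:
      HADOOP_CONF_DIR: /stackable/hdfs-config
  executor:
    envOverrides:
      HADOOP_CONF_DIR: /stackable/hdfs-config
```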

@sbernauer sbernauer moved this from Development: In Review to Development: Done in Stackable Engineering Sep 9, 2024

Jimvin commented Sep 9, 2024

Fixed by #451, closing.

@Jimvin Jimvin closed this as completed Sep 9, 2024
lfrancke commented:

Could you please add a short sentence as a comment that could go into the release notes?


Jimvin commented Sep 11, 2024

Apache Spark Operator: Environment variables can now be overridden with the role group’s envOverrides property.

@lfrancke lfrancke added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Sep 11, 2024
@lfrancke lfrancke moved this from Acceptance: In Progress to Done in Stackable Engineering Sep 11, 2024