Spark Job dependencies set using spark.submit.pyFiles cannot be loaded from HDFS #419


Closed
Jimvin opened this issue Jun 25, 2024 · 7 comments
Labels
release/24.11.0 · release-note · type/bug

Comments


Jimvin commented Jun 25, 2024

Affected Stackable version

24.3

Affected Apache Spark-on-Kubernetes version

3.5.0

Current and expected behavior

With the correct configuration in place for Kerberos and HDFS, Spark jobs can be started successfully using a resource loaded from Kerberos-enabled HDFS by setting mainApplicationFile to an HDFS URL, e.g. mainApplicationFile: hdfs://poc-hdfs/user/stackable/pi.py. The same Spark job fails if the property spark.submit.pyFiles points to a resource stored on the same HDFS cluster, e.g. hdfs://poc-hdfs/user/stackable/mybanner.py:
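For context, a minimal sketch of a job spec that triggers this behaviour (only mainApplicationFile and spark.submit.pyFiles are taken from the report; every other field and name is an assumption):

```yaml
apiVersion: spark.stackable.tech/v1alpha1
kind: SparkApplication
metadata:
  name: spark-pi-test   # hypothetical name
spec:
  mode: cluster
  mainApplicationFile: hdfs://poc-hdfs/user/stackable/pi.py   # works
  sparkConf:
    # Fails: the hdfs://poc-hdfs nameservice cannot be resolved
    # when the pyFiles dependency is fetched
    spark.submit.pyFiles: hdfs://poc-hdfs/user/stackable/mybanner.py
```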

2024-06-25T10:21:14,754 WARN [main] org.apache.hadoop.fs.FileSystem - Failed to initialize fileystem hdfs://poc-hdfs/user/stackable/mybanner.py: java.lang.IllegalArgumentException: java.net.UnknownHostException: poc-hdfs

Possible solution

No response

Additional context

No response

Environment

No response

Would you like to work on fixing this bug?

None


Jimvin commented Jun 25, 2024

Attached is an example spark job configured to mount the HDFS and Kerberos config and load resources from HDFS.
spark-pi-test.zip

@razvan razvan self-assigned this Aug 15, 2024

razvan commented Aug 16, 2024

I can confirm this is still a problem with Spark 3.5.1 and the 24.7.0 release.

It works as expected if hdfs is not kerberized.

Attempts to use the IP address of the active namenode (or the service name) also fail with:

Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): SIMPLE authentication is not enabled.  Available:[TOKEN, KERBEROS]
    at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1612)
    at org.apache.hadoop.ipc.Client.call(Client.java:1558)
    at org.apache.hadoop.ipc.Client.call(Client.java:1455)
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:242)
    at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:129)
    at jdk.proxy2/jdk.proxy2.$Proxy27.getFileInfo(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:965)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:569)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
    at jdk.proxy2/jdk.proxy2.$Proxy28.getFileInfo(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1739)

I'm sure this is a problem upstream (either in Spark or in the HDFS client).
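As background, this AccessControlException is the typical symptom of an HDFS client that has not loaded the cluster's core-site.xml: without it, hadoop.security.authentication falls back to its default of simple, so the client attempts SIMPLE RPC against a Kerberized namenode. For reference, a Kerberized cluster's core-site.xml enables Kerberos roughly like this (values illustrative):

```xml
<!-- Illustrative core-site.xml fragment for a Kerberized HDFS cluster.
     Property names are standard Hadoop; the values are examples only. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://poc-hdfs</value>
  </property>
  <property>
    <!-- Defaults to "simple" when this file is not on the client's path -->
    <name>hadoop.security.authentication</name>
    <value>kerberos</value>
  </property>
</configuration>
```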

@razvan razvan moved this to Development: Waiting for Review in Stackable Engineering Aug 16, 2024

razvan commented Aug 19, 2024

Update

The problem was not upstream but in the operator not setting the environment variables properly. This is now fixed by #451.

A working example can be found in the reproduce-spark-bug branch of the demos repository.


sbernauer commented Aug 19, 2024

Hi all,

Small update: I gave this a quick try based on the end-to-end security demo, and setting HADOOP_CONF_DIR does indeed fix this (which also makes sense, as we only add the configs to the driver and executor class paths).
With that in place I was also able to use the defaultFS, e.g. mainApplicationFile: hdfs:/lakehouse/test.py

I pushed the results to https://github.com/stackabletech/demos/tree/chore/reproduce-spark-bug
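Putting the comments above together, a hedged sketch of the workaround: set HADOOP_CONF_DIR via envOverrides so the Hadoop client used by spark-submit finds the mounted HDFS/Kerberos config. The mount path and the exact role names are assumptions for illustration, not taken verbatim from the fix:

```yaml
# Sketch only: point the Hadoop client at the mounted HDFS config so
# hdfs://poc-hdfs/... URLs in spark.submit.pyFiles can be resolved.
# /stackable/hdfs-config is a placeholder for wherever the HDFS
# discovery ConfigMap is mounted in the job spec.
spec:
  job:
    envOverrides:
      HADOOP_CONF_DIR: /stackable/hdfs-config
  driver:
    envOverrides:
      HADOOP_CONF_DIR: /stackable/hdfs-config
  executor:
    envOverrides:
      HADOOP_CONF_DIR: /stackable/hdfs-config
```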

@sbernauer sbernauer moved this from Development: In Review to Development: Done in Stackable Engineering Sep 9, 2024

Jimvin commented Sep 9, 2024

Fixed by #451, closing.

@Jimvin Jimvin closed this as completed Sep 9, 2024
lfrancke commented:

Could you please add a short sentence as a comment that could go into the release notes?


Jimvin commented Sep 11, 2024

Apache Spark Operator: Environment variables can now be overridden with the role group’s envOverrides property.

@lfrancke lfrancke added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Sep 11, 2024
@lfrancke lfrancke moved this from Acceptance: In Progress to Done in Stackable Engineering Sep 11, 2024