Building on CUDA 12.6 likely has issues on driver versions older than 565 #417
Comments
Can you try installing from conda-forge's channel and see whether you can reproduce this? Do you need 2.18, or does 2.17 reproduce the error as well? We just can't keep troubleshooting pip packages without upstream support. Otherwise, please build the 2.18 package locally and tell us what error you get.
2.17 cannot be used to reproduce this because none of its variants were built with CUDA 12.6.
P100 GPU (capability 6.0), NVIDIA 550.144, tensorflow 2.18.0 (
P100 (compute capability 6.0), NVIDIA 550.144, TensorFlow 2.17.0, CUDA 12.6, conda: Crashes without any notable logs. The P100 GPU is known for not supporting modern CUDA versions so this should be normal (i.e. no fault of anyone).
A100 GPU (capability 8.0), NVIDIA 550.127.05, tensorflow 2.18.0 ( Nothing concerning.
A100 GPU (capability 8.0), NVIDIA 550.127.05, tensorflow 2.17.0, CUDA 12.6, Conda: Nothing concerning.
I am happy with how things are in this state. That message about the CUDA driver in the pip installed version probably exists because of some build configuration that was only enabled there and not in conda-forge. I just needed |
Solution to issue cannot be found in the documentation.
Issue
W external/local_xla/xla/service/gpu/nvptx_compiler.cc:930] The NVIDIA driver's CUDA version is 12.4 which is older than the PTX compiler version 12.5.82. Because the driver is older than the PTX compiler version, XLA is disabling parallel compilation, which may slow down compilation. You should update your NVIDIA driver or use the NVIDIA-provided CUDA forward compatibility packages.
I external/local_xla/xla/service/gpu/autotuning/conv_algorithm_picker.cc:557] Omitted potentially buggy algorithm eng14{} for conv (f32[32,128,64,3]{3,2,1,0}, u8[0]{0}) custom-call(f32[32,5,64,3]{3,2,1,0}, f32[128,5,3,3]{3,2,1,0}, f32[128]{0}), window={size=3x3 pad=1_1x1_1}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convBiasActivationForward", backend_config={"cudnn_conv_backend_config":{"activation_mode":"kNone","conv_result_scale":1,"leakyrelu_alpha":0,"side_input_scale":0},"force_earliest_schedule":false,"operation_queue_id":"0","wait_on_operation_queues":[]}
I observe the above issue while using NVIDIA 550.144 drivers together with the pip package tensorflow[and-cuda]==2.18.0, which was built with and depends on CUDA 12.5 (but not with JAX). This is an example of the unsatisfied condition described in the PTX section of https://pypackaging-native.github.io/key-issues/gpus/#additional-notes-on-cuda-compatibility.
(Can't test TensorFlow 2.17.0 on conda-forge because it is built with CUDA 12.0 right now.)
Therefore, a build based on CUDA 12.6 may have the same issue on older drivers. Models can still be trained, but XLA is limited: as the warning states, parallel compilation is disabled, which may slow down compilation.
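To check whether a given log shows this condition, the two versions can be pulled out of the warning line and compared. The following is only a minimal sketch: the helper names are hypothetical, the regular expression is fitted to the exact wording of the warning quoted above, and comparing only major.minor is an assumption, not XLA's actual implementation.

```python
import re

# Pattern fitted to the wording of the XLA warning quoted in this issue.
# Assumption: the message format is the same in the log being checked.
_WARNING_RE = re.compile(
    r"driver's CUDA version is (?P<driver>\d+(?:\.\d+)*) "
    r"which is older than the PTX compiler version (?P<ptx>\d+(?:\.\d+)*)"
)


def parse_version(text):
    """Turn a dotted version string such as '12.5.82' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def versions_from_warning(line):
    """Return (driver_cuda, ptx_compiler) as strings, or None if no match."""
    match = _WARNING_RE.search(line)
    if match is None:
        return None
    return match.group("driver"), match.group("ptx")


def driver_older_than_ptx(driver_cuda, ptx_compiler):
    """Compare major.minor only (an assumption made for this sketch)."""
    return parse_version(driver_cuda)[:2] < parse_version(ptx_compiler)[:2]


if __name__ == "__main__":
    line = (
        "The NVIDIA driver's CUDA version is 12.4 which is older than "
        "the PTX compiler version 12.5.82."
    )
    found = versions_from_warning(line)
    if found is not None:
        driver, ptx = found
        print(driver, ptx, driver_older_than_ptx(driver, ptx))
```

With the versions from the log above (driver CUDA 12.4, PTX compiler 12.5.82), the comparison reports the driver as older, which matches the condition the warning describes.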
If this issue somehow does not occur with conda-forge builds, please demonstrate that on an older driver version.
JAX and PyTorch are unlikely to be affected, so this is a different issue from conda-forge/pytorch-cpu-feedstock#337 (although the extensive discussion there has helped in understanding the CUDA dependency landscape).
Installed packages
Environment info