
Building on CUDA 12.6 likely has issues on driver versions older than 565 #417

Closed

ehfd opened this issue Feb 4, 2025 · 5 comments

@ehfd
Member

ehfd commented Feb 4, 2025

Solution to issue cannot be found in the documentation.

  • I checked the documentation.

Issue

W external/local_xla/xla/service/gpu/nvptx_compiler.cc:930] The NVIDIA driver's CUDA version is 12.4 which is older than the PTX compiler version 12.5.82. Because the driver is older than the PTX compiler version, XLA is disabling parallel compilation, which may slow down compilation. You should update your NVIDIA driver or use the NVIDIA-provided CUDA forward compatibility packages.
I external/local_xla/xla/service/gpu/autotuning/conv_algorithm_picker.cc:557] Omitted potentially buggy algorithm eng14{} for conv (f32[32,128,64,3]{3,2,1,0}, u8[0]{0}) custom-call(f32[32,5,64,3]{3,2,1,0}, f32[128,5,3,3]{3,2,1,0}, f32[128]{0}), window={size=3x3 pad=1_1x1_1}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convBiasActivationForward", backend_config={"cudnn_conv_backend_config":{"activation_mode":"kNone","conv_result_scale":1,"leakyrelu_alpha":0,"side_input_scale":0},"force_earliest_schedule":false,"operation_queue_id":"0","wait_on_operation_queues":[]}

I observe the above issue while using NVIDIA 550.144 drivers together with pip tensorflow[and-cuda]==2.18.0, which was built with and depends on CUDA 12.5. (The issue does not occur with JAX.)

This is an example of failing to satisfy the compatibility condition specified in the PTX section of https://pypackaging-native.github.io/key-issues/gpus/#additional-notes-on-cuda-compatibility (the driver's CUDA version being older than the PTX compiler version).

(Can't test TensorFlow 2.17.0 on conda-forge because it is built with CUDA 12.0 right now.)
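
For anyone cross-checking their own setup, here is a rough sketch (my own, not from the linked page) that compares the CUDA version the driver advertises against the CUDA version a TensorFlow build was compiled with. It assumes nvidia-smi is on PATH and a GPU build of TensorFlow:

import subprocess

import tensorflow as tf

# The nvidia-smi banner advertises the highest CUDA version the driver supports.
smi = subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout
print([line for line in smi.splitlines() if "CUDA Version" in line])

# Build info records the CUDA/cuDNN versions TensorFlow was compiled against.
info = tf.sysconfig.get_build_info()
print("built with CUDA:", info.get("cuda_version"))
print("built with cuDNN:", info.get("cudnn_version"))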

Therefore, it is possible that a build based on CUDA 12.6 has issues on driver versions older than 565. The code still trains, but XLA is limited: parallel compilation is disabled, which may slow down compilation.

If this issue (somehow) doesn't exist in conda-forge builds, please prove otherwise by testing on an older driver version; a minimal reproducer sketch follows.
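
For reference, a minimal sketch of the kind of XLA-compiled cuDNN convolution that surfaces these messages (shapes are illustrative, not the exact workload logged above):

import tensorflow as tf

@tf.function(jit_compile=True)  # force XLA compilation of this cluster
def conv_relu(x, w, b):
    # conv + bias + relu can fuse into the __cudnn$convBiasActivationForward
    # custom call seen in the logs above.
    y = tf.nn.conv2d(x, w, strides=1, padding="SAME") + b
    return tf.nn.relu(y)

x = tf.random.normal([32, 64, 64, 3])  # NHWC input batch
w = tf.random.normal([3, 3, 3, 128])   # HWIO filters
b = tf.zeros([128])
print(conv_relu(x, w, b).shape)        # (32, 64, 64, 128)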

This is not likely to cause issues with JAX and PyTorch, so it is a different issue from conda-forge/pytorch-cpu-feedstock#337 (though the extensive discussion there has helped in understanding the CUDA dependency landscape).

Installed packages

TensorFlow 2.18.0 is unreleased on conda-forge.

Environment info

TensorFlow 2.18.0 is unreleased on conda-forge.
@hmaarrfk
Contributor

hmaarrfk commented Feb 4, 2025

Can you try installing from conda-forge's channel and see if you can reproduce? Do you need 2.18, or could 2.17 reproduce the error?

We just can't keep troubleshooting pip packages without upstream support.

Otherwise, please build the 2.18 package locally and tell us what error you get.

@ehfd
Member Author

ehfd commented Feb 4, 2025

2.17 cannot reproduce the error because none of its variants were built with CUDA 12.6.
Sure, I will test. @hmaarrfk

@ehfd
Member Author

ehfd commented Feb 5, 2025

P100 GPU (capability 6.0), NVIDIA 550.144, tensorflow 2.18.0 (https://github.com/conda-forge/tensorflow-feedstock/pull/414/commits/44113375e85bfefe77b64dd0df2a8a195415bc10), CUDA 12.6, conda:

mamba create -n tf tensorflow-gpu==2.18.0=cuda126* keras-tuner pandas numpy seaborn scikit-learn scipy numba pydot jupyter ipykernel cuda-version=12.6 python=3.12 -c ./build_artifacts -c conda-forge

  + tensorflow-base                        2.18.0  cuda126py312hfb0ba9c_200  /home/jovyan/tmp/build_artifacts     Cached
  + tensorflow-estimator                   2.18.0  cuda126py312hd49ae37_200  /home/jovyan/tmp/build_artifacts     Cached
  + tensorflow                             2.18.0  cuda126py312h5379a72_200  /home/jovyan/tmp/build_artifacts     Cached
  + tensorflow-gpu                         2.18.0  cuda126py312h418687c_200  /home/jovyan/tmp/build_artifacts     Cached
I0000 00:00:1738757650.727012    2875 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 15513 MB memory:  -> device: 0, name: Tesla P100-SXM2-16GB, pci bus id: 0000:04:00.0, compute capability: 6.0
I0000 00:00:1738757653.559803    3294 service.cc:148] XLA service 0x763cd0004120 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1738757653.559849    3294 service.cc:156]   StreamExecutor device (0): Tesla P100-SXM2-16GB, Compute Capability 6.0
I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:268] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
I0000 00:00:1738757653.796604    3294 cuda_dnn.cc:529] Loaded cuDNN version 90300
I external/local_xla/xla/service/gpu/autotuning/conv_algorithm_picker.cc:557] Omitted potentially buggy algorithm eng14{} for conv (f32[32,128,64,3]{3,2,1,0}, u8[0]{0}) custom-call(f32[32,5,64,3]{3,2,1,0}, f32[128,5,3,3]{3,2,1,0}, f32[128]{0}), window={size=3x3 pad=1_1x1_1}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convBiasActivationForward", backend_config={"cudnn_conv_backend_config":{"activation_mode":"kNone","conv_result_scale":1,"leakyrelu_alpha":0,"side_input_scale":0},"force_earliest_schedule":false,"operation_queue_id":"0","wait_on_operation_queues":[]}
I0000 00:00:1738757655.396762    3294 device_compiler.h:188] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.
I external/local_xla/xla/service/gpu/autotuning/conv_algorithm_picker.cc:557] Omitted potentially buggy algorithm eng14{} for conv (f32[3,128,64,3]{3,2,1,0}, u8[0]{0}) custom-call(f32[3,5,64,3]{3,2,1,0}, f32[128,5,3,3]{3,2,1,0}, f32[128]{0}), window={size=3x3 pad=1_1x1_1}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convBiasActivationForward", backend_config={"cudnn_conv_backend_config":{"activation_mode":"kNone","conv_result_scale":1,"leakyrelu_alpha":0,"side_input_scale":0},"force_earliest_schedule":false,"operation_queue_id":"0","wait_on_operation_queues":[]}
I external/local_xla/xla/service/gpu/autotuning/conv_algorithm_picker.cc:557] Omitted potentially buggy algorithm eng14{} for conv (f32[32,128,64,3]{3,2,1,0}, u8[0]{0}) custom-call(f32[32,5,64,3]{3,2,1,0}, f32[128,5,3,3]{3,2,1,0}, f32[128]{0}), window={size=3x3 pad=1_1x1_1}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convBiasActivationForward", backend_config={"cudnn_conv_backend_config":{"activation_mode":"kRelu","conv_result_scale":1,"leakyrelu_alpha":0,"side_input_scale":0},"force_earliest_schedule":false,"operation_queue_id":"0","wait_on_operation_queues":[]}
I external/local_xla/xla/service/gpu/autotuning/conv_algorithm_picker.cc:557] Omitted potentially buggy algorithm eng14{} for conv (f32[29,128,64,3]{3,2,1,0}, u8[0]{0}) custom-call(f32[29,5,64,3]{3,2,1,0}, f32[128,5,3,3]{3,2,1,0}, f32[128]{0}), window={size=3x3 pad=1_1x1_1}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convBiasActivationForward", backend_config={"cudnn_conv_backend_config":{"activation_mode":"kRelu","conv_result_scale":1,"leakyrelu_alpha":0,"side_input_scale":0},"force_earliest_schedule":false,"operation_queue_id":"0","wait_on_operation_queues":[]}

P100 (compute capability 6.0), NVIDIA 550.144, TensorFlow 2.17.0, CUDA 12.6, conda:

Crashes without any notable logs.

The P100 GPU is known for poor support of modern CUDA versions, so this is expected (i.e., no fault of anyone).
https://www.reddit.com/r/StableDiffusion/comments/1au8dol/is_the_nvidia_p100_a_hidden_gem_or_hidden_trap/

mamba create -n tf tensorflow-gpu==2.17.0=cuda126* keras-tuner pandas numpy seaborn scikit-learn scipy numba pydot jupyter ipykernel cuda-version=12.6 python=3.12 -c ./build_artifacts -c conda-forge

  + tensorflow-base                        2.17.0  cuda126py312hcfc9039_203  /home/jovyan/tmp/build_artifacts      402MB
  + tensorflow-estimator                   2.17.0  cuda126py312he8d4543_203  /home/jovyan/tmp/build_artifacts      696kB
  + tensorflow                             2.17.0  cuda126py312h5379a72_203  /home/jovyan/tmp/build_artifacts       43kB
  + tensorflow-gpu                         2.17.0  cuda126py312h418687c_203  /home/jovyan/tmp/build_artifacts       43kB

@ehfd
Member Author

ehfd commented Feb 5, 2025

A100 GPU (capability 8.0), NVIDIA 550.127.05, tensorflow 2.18.0 (https://github.com/conda-forge/tensorflow-feedstock/pull/414/commits/44113375e85bfefe77b64dd0df2a8a195415bc10), CUDA 12.6, Conda:

Nothing concerning.

mamba create -n tf tensorflow-gpu==2.18.0=cuda126* keras-tuner pandas numpy seaborn scikit-learn scipy numba pydot jupyter ipykernel cuda-version=12.6 python=3.12 -c ./build_artifacts -c conda-forge

  + tensorflow-base                        2.18.0  cuda126py312hfb0ba9c_200  /home/jovyan/tmp/build_artifacts     Cached
  + tensorflow-estimator                   2.18.0  cuda126py312hd49ae37_200  /home/jovyan/tmp/build_artifacts     Cached
  + tensorflow                             2.18.0  cuda126py312h5379a72_200  /home/jovyan/tmp/build_artifacts     Cached
  + tensorflow-gpu                         2.18.0  cuda126py312h418687c_200  /home/jovyan/tmp/build_artifacts     Cached
I0000 00:00:1738756967.311148  194083 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 79078 MB memory:  -> device: 0, name: NVIDIA A100 80GB PCIe, pci bus id: 0000:b0:00.0, compute capability: 8.0
Epoch 1/5000
I0000 00:00:1738756969.157232  194479 service.cc:148] XLA service 0x7f1c5400bcd0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1738756969.157288  194479 service.cc:156]   StreamExecutor device (0): NVIDIA A100 80GB PCIe, Compute Capability 8.0
I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:268] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
I0000 00:00:1738756969.292815  194479 cuda_dnn.cc:529] Loaded cuDNN version 90300
I0000 00:00:1738756970.318971  194479 device_compiler.h:188] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.

A100 GPU (capability 8.0), NVIDIA 550.127.05, tensorflow 2.17.0, CUDA 12.6, Conda:

Nothing concerning.

mamba create -n tf tensorflow-gpu==2.17.0=cuda126* keras-tuner pandas numpy seaborn scikit-learn scipy numba pydot jupyter ipykernel cuda-version=12.6 python=3.12 -c ./build_artifacts -c conda-forge

  + tensorflow-base                        2.17.0  cuda126py312hcfc9039_203  /home/jovyan/tmp/build_artifacts      402MB
  + tensorflow-estimator                   2.17.0  cuda126py312he8d4543_203  /home/jovyan/tmp/build_artifacts      696kB
  + tensorflow                             2.17.0  cuda126py312h5379a72_203  /home/jovyan/tmp/build_artifacts       43kB
  + tensorflow-gpu                         2.17.0  cuda126py312h418687c_203  /home/jovyan/tmp/build_artifacts       43kB
2025-02-05 11:47:26.700973: I tensorflow/core/common_runtime/gpu/gpu_device.cc:2021] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 79078 MB memory:  -> device: 0, name: NVIDIA A100 80GB PCIe, pci bus id: 0000:b0:00.0, compute capability: 8.0
'+ptx86' is not a recognized feature for this target (ignoring feature)
'+ptx86' is not a recognized feature for this target (ignoring feature)
'+ptx86' is not a recognized feature for this target (ignoring feature)
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1738756048.539361  182261 service.cc:146] XLA service 0x557a5e629440 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1738756048.539419  182261 service.cc:154]   StreamExecutor device (0): NVIDIA A100 80GB PCIe, Compute Capability 8.0
I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:268] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:531] Loaded cuDNN version 90300
'+ptx86' is not a recognized feature for this target (ignoring feature)
I0000 00:00:1738756049.918985  182261 device_compiler.h:188] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.
'+ptx86' is not a recognized feature for this target (ignoring feature)

@ehfd
Member Author

ehfd commented Feb 5, 2025

@hmaarrfk

I am happy with how things are in this state. That message about the CUDA driver in the pip-installed version probably exists because of some build configuration that was enabled only there and not in conda-forge.
Otherwise, things seem pretty normal.

I just needed os.environ["XLA_FLAGS"] = "--xla_gpu_cuda_data_dir=" + os.environ["CONDA_PREFIX"].
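
In full, the workaround looks something like this; XLA reads XLA_FLAGS when it initializes the GPU backend, so the simplest safe placement is before the TensorFlow import:

import os

# Point XLA at the conda environment's CUDA data directory (libdevice, ptxas).
os.environ["XLA_FLAGS"] = "--xla_gpu_cuda_data_dir=" + os.environ["CONDA_PREFIX"]

import tensorflow as tf  # import only after XLA_FLAGS is set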

ehfd closed this as completed Feb 5, 2025