
Building on CUDA 12.6 likely has issues on driver versions older than 565 #417

Closed

ehfd opened this issue Feb 4, 2025 · 5 comments

@ehfd
Member

ehfd commented Feb 4, 2025

Solution to issue cannot be found in the documentation.

  • I checked the documentation.

Issue

W external/local_xla/xla/service/gpu/nvptx_compiler.cc:930] The NVIDIA driver's CUDA version is 12.4 which is older than the PTX compiler version 12.5.82. Because the driver is older than the PTX compiler version, XLA is disabling parallel compilation, which may slow down compilation. You should update your NVIDIA driver or use the NVIDIA-provided CUDA forward compatibility packages.
I external/local_xla/xla/service/gpu/autotuning/conv_algorithm_picker.cc:557] Omitted potentially buggy algorithm eng14{} for conv (f32[32,128,64,3]{3,2,1,0}, u8[0]{0}) custom-call(f32[32,5,64,3]{3,2,1,0}, f32[128,5,3,3]{3,2,1,0}, f32[128]{0}), window={size=3x3 pad=1_1x1_1}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convBiasActivationForward", backend_config={"cudnn_conv_backend_config":{"activation_mode":"kNone","conv_result_scale":1,"leakyrelu_alpha":0,"side_input_scale":0},"force_earliest_schedule":false,"operation_queue_id":"0","wait_on_operation_queues":[]}

I observe the above issue while using NVIDIA 550.144 drivers together with pip tensorflow[and-cuda]==2.18.0, which was built with and depends on CUDA 12.5. (The issue does not occur with JAX.)

This is an example of failing to satisfy the compatibility condition specified in the PTX section of https://pypackaging-native.github.io/key-issues/gpus/#additional-notes-on-cuda-compatibility (the driver's CUDA version being older than the PTX compiler version).

(Can't test TensorFlow 2.17.0 on conda-forge because it is built with CUDA 12.0 right now.)
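
For anyone cross-checking their own setup, here is a rough sketch (my own, not from the linked page) that compares the CUDA version the driver advertises against the CUDA version a TensorFlow build was compiled with. It assumes nvidia-smi is on PATH and a GPU build of TensorFlow:

import subprocess

import tensorflow as tf

# The nvidia-smi banner advertises the highest CUDA version the driver supports.
smi = subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout
print([line for line in smi.splitlines() if "CUDA Version" in line])

# Build info records the CUDA/cuDNN versions TensorFlow was compiled against.
info = tf.sysconfig.get_build_info()
print("built with CUDA:", info.get("cuda_version"))
print("built with cuDNN:", info.get("cudnn_version"))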

Therefore, it is possible that a build based on CUDA 12.6 has issues on driver versions older than 565. The code still trains, but XLA is limited: parallel compilation is disabled, which may slow down compilation.

If this issue (somehow) doesn't exist in conda-forge builds, please prove otherwise by testing on an older driver version; a minimal reproducer sketch follows.
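
For reference, a minimal sketch of the kind of XLA-compiled cuDNN convolution that surfaces these messages (shapes are illustrative, not the exact workload logged above):

import tensorflow as tf

@tf.function(jit_compile=True)  # force XLA compilation of this cluster
def conv_relu(x, w, b):
    # conv + bias + relu can fuse into the __cudnn$convBiasActivationForward
    # custom call seen in the logs above.
    y = tf.nn.conv2d(x, w, strides=1, padding="SAME") + b
    return tf.nn.relu(y)

x = tf.random.normal([32, 64, 64, 3])  # NHWC input batch
w = tf.random.normal([3, 3, 3, 128])   # HWIO filters
b = tf.zeros([128])
print(conv_relu(x, w, b).shape)        # (32, 64, 64, 128)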

This is not likely to cause issues with JAX and PyTorch, so it is a different issue from conda-forge/pytorch-cpu-feedstock#337 (though the extensive discussion there has helped in understanding the CUDA dependency landscape).

Installed packages

TensorFlow 2.18.0 is unreleased on conda-forge.

Environment info

TensorFlow 2.18.0 is unreleased on conda-forge.
@hmaarrfk
Contributor

hmaarrfk commented Feb 4, 2025

Can you try installing from conda-forge's channel and see if you can reproduce? Do you need 2.18, or could 2.17 reproduce the error?

We just can't keep troubleshooting pip packages without upstream support.

Otherwise, please build the 2.18 package locally and tell us what error you get.

@ehfd
Member Author

ehfd commented Feb 4, 2025

2.17 cannot reproduce the error because none of its variants were built with CUDA 12.6.
Sure, I will test. @hmaarrfk

@ehfd
Member Author

ehfd commented Feb 5, 2025

P100 GPU (capability 6.0), NVIDIA 550.144, tensorflow 2.18.0 (https://github.com/conda-forge/tensorflow-feedstock/pull/414/commits/44113375e85bfefe77b64dd0df2a8a195415bc10), CUDA 12.6, conda:

mamba create -n tf tensorflow-gpu==2.18.0=cuda126* keras-tuner pandas numpy seaborn scikit-learn scipy numba pydot jupyter ipykernel cuda-version=12.6 python=3.12 -c ./build_artifacts -c conda-forge

  + tensorflow-base                        2.18.0  cuda126py312hfb0ba9c_200  /home/jovyan/tmp/build_artifacts     Cached
  + tensorflow-estimator                   2.18.0  cuda126py312hd49ae37_200  /home/jovyan/tmp/build_artifacts     Cached
  + tensorflow                             2.18.0  cuda126py312h5379a72_200  /home/jovyan/tmp/build_artifacts     Cached
  + tensorflow-gpu                         2.18.0  cuda126py312h418687c_200  /home/jovyan/tmp/build_artifacts     Cached
I0000 00:00:1738757650.727012    2875 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 15513 MB memory:  -> device: 0, name: Tesla P100-SXM2-16GB, pci bus id: 0000:04:00.0, compute capability: 6.0
I0000 00:00:1738757653.559803    3294 service.cc:148] XLA service 0x763cd0004120 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1738757653.559849    3294 service.cc:156]   StreamExecutor device (0): Tesla P100-SXM2-16GB, Compute Capability 6.0
I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:268] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
I0000 00:00:1738757653.796604    3294 cuda_dnn.cc:529] Loaded cuDNN version 90300
I external/local_xla/xla/service/gpu/autotuning/conv_algorithm_picker.cc:557] Omitted potentially buggy algorithm eng14{} for conv (f32[32,128,64,3]{3,2,1,0}, u8[0]{0}) custom-call(f32[32,5,64,3]{3,2,1,0}, f32[128,5,3,3]{3,2,1,0}, f32[128]{0}), window={size=3x3 pad=1_1x1_1}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convBiasActivationForward", backend_config={"cudnn_conv_backend_config":{"activation_mode":"kNone","conv_result_scale":1,"leakyrelu_alpha":0,"side_input_scale":0},"force_earliest_schedule":false,"operation_queue_id":"0","wait_on_operation_queues":[]}
I0000 00:00:1738757655.396762    3294 device_compiler.h:188] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.
I external/local_xla/xla/service/gpu/autotuning/conv_algorithm_picker.cc:557] Omitted potentially buggy algorithm eng14{} for conv (f32[3,128,64,3]{3,2,1,0}, u8[0]{0}) custom-call(f32[3,5,64,3]{3,2,1,0}, f32[128,5,3,3]{3,2,1,0}, f32[128]{0}), window={size=3x3 pad=1_1x1_1}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convBiasActivationForward", backend_config={"cudnn_conv_backend_config":{"activation_mode":"kNone","conv_result_scale":1,"leakyrelu_alpha":0,"side_input_scale":0},"force_earliest_schedule":false,"operation_queue_id":"0","wait_on_operation_queues":[]}
I external/local_xla/xla/service/gpu/autotuning/conv_algorithm_picker.cc:557] Omitted potentially buggy algorithm eng14{} for conv (f32[32,128,64,3]{3,2,1,0}, u8[0]{0}) custom-call(f32[32,5,64,3]{3,2,1,0}, f32[128,5,3,3]{3,2,1,0}, f32[128]{0}), window={size=3x3 pad=1_1x1_1}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convBiasActivationForward", backend_config={"cudnn_conv_backend_config":{"activation_mode":"kRelu","conv_result_scale":1,"leakyrelu_alpha":0,"side_input_scale":0},"force_earliest_schedule":false,"operation_queue_id":"0","wait_on_operation_queues":[]}
I external/local_xla/xla/service/gpu/autotuning/conv_algorithm_picker.cc:557] Omitted potentially buggy algorithm eng14{} for conv (f32[29,128,64,3]{3,2,1,0}, u8[0]{0}) custom-call(f32[29,5,64,3]{3,2,1,0}, f32[128,5,3,3]{3,2,1,0}, f32[128]{0}), window={size=3x3 pad=1_1x1_1}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convBiasActivationForward", backend_config={"cudnn_conv_backend_config":{"activation_mode":"kRelu","conv_result_scale":1,"leakyrelu_alpha":0,"side_input_scale":0},"force_earliest_schedule":false,"operation_queue_id":"0","wait_on_operation_queues":[]}

P100 (compute capability 6.0), NVIDIA 550.144, TensorFlow 2.17.0, CUDA 12.6, conda:

Crashes without any notable logs.

The P100 GPU is known for poor support of modern CUDA versions, so this is expected (i.e., no fault of anyone).
https://www.reddit.com/r/StableDiffusion/comments/1au8dol/is_the_nvidia_p100_a_hidden_gem_or_hidden_trap/

mamba create -n tf tensorflow-gpu==2.17.0=cuda126* keras-tuner pandas numpy seaborn scikit-learn scipy numba pydot jupyter ipykernel cuda-version=12.6 python=3.12 -c ./build_artifacts -c conda-forge

  + tensorflow-base                        2.17.0  cuda126py312hcfc9039_203  /home/jovyan/tmp/build_artifacts      402MB
  + tensorflow-estimator                   2.17.0  cuda126py312he8d4543_203  /home/jovyan/tmp/build_artifacts      696kB
  + tensorflow                             2.17.0  cuda126py312h5379a72_203  /home/jovyan/tmp/build_artifacts       43kB
  + tensorflow-gpu                         2.17.0  cuda126py312h418687c_203  /home/jovyan/tmp/build_artifacts       43kB

@ehfd
Member Author

ehfd commented Feb 5, 2025

A100 GPU (capability 8.0), NVIDIA 550.127.05, tensorflow 2.18.0 (https://github.com/conda-forge/tensorflow-feedstock/pull/414/commits/44113375e85bfefe77b64dd0df2a8a195415bc10), CUDA 12.6, Conda:

Nothing concerning.

mamba create -n tf tensorflow-gpu==2.18.0=cuda126* keras-tuner pandas numpy seaborn scikit-learn scipy numba pydot jupyter ipykernel cuda-version=12.6 python=3.12 -c ./build_artifacts -c conda-forge

  + tensorflow-base                        2.18.0  cuda126py312hfb0ba9c_200  /home/jovyan/tmp/build_artifacts     Cached
  + tensorflow-estimator                   2.18.0  cuda126py312hd49ae37_200  /home/jovyan/tmp/build_artifacts     Cached
  + tensorflow                             2.18.0  cuda126py312h5379a72_200  /home/jovyan/tmp/build_artifacts     Cached
  + tensorflow-gpu                         2.18.0  cuda126py312h418687c_200  /home/jovyan/tmp/build_artifacts     Cached
I0000 00:00:1738756967.311148  194083 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 79078 MB memory:  -> device: 0, name: NVIDIA A100 80GB PCIe, pci bus id: 0000:b0:00.0, compute capability: 8.0
Epoch 1/5000
I0000 00:00:1738756969.157232  194479 service.cc:148] XLA service 0x7f1c5400bcd0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1738756969.157288  194479 service.cc:156]   StreamExecutor device (0): NVIDIA A100 80GB PCIe, Compute Capability 8.0
I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:268] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
I0000 00:00:1738756969.292815  194479 cuda_dnn.cc:529] Loaded cuDNN version 90300
I0000 00:00:1738756970.318971  194479 device_compiler.h:188] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.

A100 GPU (capability 8.0), NVIDIA 550.127.05, tensorflow 2.17.0, CUDA 12.6, Conda:

Nothing concerning.

mamba create -n tf tensorflow-gpu==2.17.0=cuda126* keras-tuner pandas numpy seaborn scikit-learn scipy numba pydot jupyter ipykernel cuda-version=12.6 python=3.12 -c ./build_artifacts -c conda-forge

  + tensorflow-base                        2.17.0  cuda126py312hcfc9039_203  /home/jovyan/tmp/build_artifacts      402MB
  + tensorflow-estimator                   2.17.0  cuda126py312he8d4543_203  /home/jovyan/tmp/build_artifacts      696kB
  + tensorflow                             2.17.0  cuda126py312h5379a72_203  /home/jovyan/tmp/build_artifacts       43kB
  + tensorflow-gpu                         2.17.0  cuda126py312h418687c_203  /home/jovyan/tmp/build_artifacts       43kB
2025-02-05 11:47:26.700973: I tensorflow/core/common_runtime/gpu/gpu_device.cc:2021] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 79078 MB memory:  -> device: 0, name: NVIDIA A100 80GB PCIe, pci bus id: 0000:b0:00.0, compute capability: 8.0
'+ptx86' is not a recognized feature for this target (ignoring feature)
'+ptx86' is not a recognized feature for this target (ignoring feature)
'+ptx86' is not a recognized feature for this target (ignoring feature)
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1738756048.539361  182261 service.cc:146] XLA service 0x557a5e629440 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1738756048.539419  182261 service.cc:154]   StreamExecutor device (0): NVIDIA A100 80GB PCIe, Compute Capability 8.0
I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:268] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:531] Loaded cuDNN version 90300
'+ptx86' is not a recognized feature for this target (ignoring feature)
I0000 00:00:1738756049.918985  182261 device_compiler.h:188] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.
'+ptx86' is not a recognized feature for this target (ignoring feature)

@ehfd
Member Author

ehfd commented Feb 5, 2025

@hmaarrfk

I am happy with how things are in this state. That message about the CUDA driver in the pip-installed version probably exists because of some build configuration that was enabled only there and not in conda-forge.
Otherwise, things seem pretty normal.

I just needed os.environ["XLA_FLAGS"] = "--xla_gpu_cuda_data_dir=" + os.environ["CONDA_PREFIX"].
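
In full, the workaround looks something like this; XLA reads XLA_FLAGS when it initializes the GPU backend, so the simplest safe placement is before the TensorFlow import:

import os

# Point XLA at the conda environment's CUDA data directory (libdevice, ptxas).
os.environ["XLA_FLAGS"] = "--xla_gpu_cuda_data_dir=" + os.environ["CONDA_PREFIX"]

import tensorflow as tf  # import only after XLA_FLAGS is set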

ehfd closed this as completed Feb 5, 2025