
Conversation

terjekv (Member) commented on Oct 8, 2025:

I ran into this because one of our build nodes had a driver/library mismatch:

$ nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
NVML library version: 580.95

Sadly, this resulted in eessi_archdetect.sh setting EESSI_ACCEL_SUBDIR to an empty value, which in turn caused building CUDA software with EESSI-extend to fail with:

== FAILED: Installation ended unsuccessfully: It seems you are trying to install an accelerator package TensorFlow into a non-accelerator location
/cvmfs/software.eessi.io/host_injections/2023.06/software/linux/x86_64/intel/cascadelake/software/TensorFlow/2.15.1-foss-2023a-CUDA-12.1.1. You need to reconfigure your installation to target the correct location. (took 0 secs)
== Results of the build can be found in the log file(s) /tmp/eb-ypzupdcc/easybuild-TensorFlow-2.15.1-20251008.110243.Fkxxm.log
== Running post-easyblock hook...
== Summary:
   * [FAILED]  TensorFlow/2.15.1-foss-2023a-CUDA-12.1.1

ERROR: Installation of TensorFlow-2.15.1-foss-2023a-CUDA-12.1.1.eb failed: 'It seems you are trying to install an accelerator package TensorFlow into a non-accelerator location /cvmfs/software.eessi.io/host_injections/2023.06/software/linux/x86_64/intel/cascadelake/software/TensorFlow/2.15.1-foss-2023a-CUDA-12.1.1. You need to reconfigure your installation to target the correct location.'
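For context, this is roughly the kind of pre-flight check that would have caught the problem before the build started. It is a sketch only; the choice of query field and the messages are illustrative and not part of eessi_archdetect.sh or this PR:

#!/usr/bin/env bash
# Hypothetical pre-flight check for a build node: fail fast with a clear
# message if nvidia-smi cannot talk to the driver, instead of letting the
# CUDA build fail later with a confusing "non-accelerator location" error.
out=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>&1)
if [ $? -ne 0 ]; then
    if echo "${out}" | grep -q "Driver/library version mismatch"; then
        echo "ERROR: NVIDIA driver/library version mismatch on $(hostname)" >&2
        echo "Hint: reload the NVIDIA kernel modules (or reboot) so they match the installed driver" >&2
    else
        echo "ERROR: nvidia-smi failed: ${out}" >&2
    fi
    exit 1
fi
echo "nvidia-smi OK, driver version(s): ${out}"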

terjekv (Member, Author) commented on Oct 8, 2025:

Note: this will make archdetect fail if nvidia-smi reports a driver/library mismatch. As a consequence, loading the EESSI module will now also fail on such a node, even if the user loading it has no intention of using the GPUs.

This may not be ideal.
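One softer alternative, sketched here purely for discussion (the log helper is stubbed, and this is not what the PR currently implements), would be to warn and treat the node as CPU-only so that loading the EESSI module keeps working:

# Sketch of a softer fallback: warn and ignore the GPUs instead of failing,
# so loading the EESSI module still works on a node with a mismatch.
# 'log' is stubbed here; eessi_archdetect.sh has its own logging helper.
log() { echo "[$1] $2" >&2; }

nvidia_smi_out=$(mktemp)
nvidia-smi --query-gpu=gpu_name,count,driver_version,compute_cap --format=csv,noheader > "${nvidia_smi_out}" 2>&1
if grep -q "Failed to initialize NVML: Driver/library version mismatch" "${nvidia_smi_out}"; then
    log "WARNING" "accelpath: driver/library version mismatch, ignoring GPUs on this node"
    # behave as a CPU-only node: leave the accelerator subdirectory unset
    unset EESSI_ACCEL_SUBDIR
fi

The trade-off, of course, is that this reintroduces the silently empty EESSI_ACCEL_SUBDIR that caused the confusing build failure above.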

trz42 (Contributor) left a review comment:


Looks good overall. One minor suggestion to improve the information shown to the user.

nvidia-smi --query-gpu=gpu_name,count,driver_version,compute_cap --format=csv,noheader 2>&1 > $nvidia_smi_out
if [[ $? -eq 0 ]]; then
    if grep -q "Failed to initialize NVML: Driver/library version mismatch" $nvidia_smi_out; then
        log "ERROR" "accelpath: nvidia-smi command failed with 'Failed to initialize NVML: Driver/library version mismatch'"
Contributor commented on the snippet above:

Thanks for this PR @terjekv !

We discussed this at a support meeting and suggest the following:

  • print the unaltered error message (e.g., just cat $nvidia_smi_out)
  • print a hint on how this could be fixed (you mentioned that you fixed this); see the sketch below
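A minimal sketch of what that could look like, assuming the nvidia_smi_out variable and log function from eessi_archdetect.sh; the exact hint text is only a suggestion:

if grep -q "Failed to initialize NVML: Driver/library version mismatch" "${nvidia_smi_out}"; then
    log "ERROR" "accelpath: nvidia-smi failed, full output follows"
    # print the unaltered error message
    cat "${nvidia_smi_out}" >&2
    # print a hint on how to fix it
    echo "Hint: the loaded NVIDIA kernel module does not match the installed driver; reloading the kernel modules (or rebooting the node) usually fixes this" >&2
fi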

Does this sound ok?

