Merged
12 changes: 12 additions & 0 deletions README.md
@@ -137,6 +137,18 @@ Default: `true`

Type: `bool`

### hpc_azure_disable_predictable_net_names

Whether to disable predictable network interface names by adding `net.ifnames=0`
to the kernel command line (via the bootloader system role).

This keeps kernel names such as `ib0`, `ib1`, ... instead of `ibP...` on IPoIB,
but it also affects Ethernet naming (e.g. `eth0` instead of `enP...`).

Default: `true`

Type: `bool`
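
As a minimal sketch of opting out, a playbook could override the default as shown here. The role name `fedora.linux_system_roles.hpc` and the `hosts: all` target are assumptions for illustration; neither appears in this PR.

```yaml
# Sketch only: the role name and host pattern are assumed, not taken from the PR.
- name: Configure HPC nodes but keep predictable interface names
  hosts: all
  vars:
    # Do not add net.ifnames=0; names such as enP... and ibP... stay in place.
    hpc_azure_disable_predictable_net_names: false
  roles:
    - fedora.linux_system_roles.hpc
```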

### hpc_install_system_openmpi

Whether to install OpenMPI that comes from AppStream repositories and does not have Nvidia GPU support.
1 change: 1 addition & 0 deletions defaults/main.yml
@@ -23,6 +23,7 @@ hpc_install_nvidia_fabric_manager: true
hpc_install_nvidia_imex: true
hpc_install_rdma: true
hpc_enable_azure_persistent_rdma_naming: true
hpc_azure_disable_predictable_net_names: true
hpc_install_system_openmpi: true
hpc_build_openmpi_w_nvidia_gpu_support: true
hpc_install_moneo: true
15 changes: 15 additions & 0 deletions tasks/main.yml
@@ -578,6 +578,21 @@
name: nvidia-imex.service
enabled: true

- name: Disable predictable network interface names on Azure (net.ifnames=0)
  when:
    - hpc_azure_disable_predictable_net_names
  block:
    - name: Configure net.ifnames via bootloader role
      include_role:
        name: fedora.linux_system_roles.bootloader
      vars:
        bootloader_settings:
          - kernel: ALL
            options:
              - name: net.ifnames
                value: "0"
                state: present
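
A reviewer who wants to confirm the setting took effect could check the running kernel command line. The tasks below are a hedged sketch, not part of this PR: they assume the host has been rebooted since the role ran, and `__cmdline` is a hypothetical variable name.

```yaml
# Sketch only: assumes a reboot has happened after the bootloader change.
- name: Read the running kernel command line
  command: cat /proc/cmdline
  register: __cmdline  # hypothetical variable name
  changed_when: false

- name: Assert that net.ifnames=0 is active
  assert:
    that:
      - "'net.ifnames=0' in __cmdline.stdout.split()"
```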

- name: Install RDMA packages
  when: hpc_install_rdma
  block:
4 changes: 2 additions & 2 deletions templates/test-azure-health-checks.sh.j2
@@ -206,10 +206,10 @@ test_health_log() {
pass "health.log file exists"

echo "Checking: health.log for errors"
-if grep -Ei "fail|fault|error" "$log_file" | grep -Eiv "success" > /dev/null 2>&1; then
+if grep -Eiu '^(fail|fault|error)\b' "$log_file" | grep -Eiv "success" > /dev/null 2>&1; then
echo ""
echo "--- Error excerpts from health.log ---"
-grep -Ei "fail|fault|error" "$log_file" | grep -Eiv "success" | head -20
+grep -Eiu '^(fail|fault|error)\b' "$log_file" | grep -Eiv "success" | head -20
echo "--- End of excerpts ---"
echo ""
fail "FAIL/FAULT/ERROR found in health.log"
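
To illustrate what the tightened pattern changes, here is a small sketch comparing the old and new expressions on a made-up log. The sample lines are hypothetical, since the real `health.log` format is not shown in this PR. The trailing `grep -Eiv "success"` filter is left out for brevity, and so is `-u`, which as far as I know does not affect matching in GNU grep.

```shell
#!/usr/bin/env bash
# Hypothetical sample log; the real health.log contents may look different.
log_file="$(mktemp)"
cat > "$log_file" <<'EOF'
GPU check: 0 errors detected
failover path configured
Error: IB link down on mlx5_0
FAULT detected on GPU 3
EOF

# Old pattern: matches the keyword anywhere in the line.
old_pattern='fail|fault|error'
# New pattern: keyword must start the line and end at a word boundary
# (\b is a GNU grep extension).
new_pattern='^(fail|fault|error)\b'

# -c counts matching lines, -i ignores case, -E enables extended regex.
old_count="$(grep -Eic "$old_pattern" "$log_file")"
new_count="$(grep -Eic "$new_pattern" "$log_file")"

echo "old=$old_count new=$new_count"   # prints "old=4 new=2"
rm -f "$log_file"
```

With this sample, the old pattern flags all four lines, including the benign "0 errors detected" and "failover" lines. The new pattern flags only the two lines that begin with a standalone `Error` or `FAULT`. The trade-off is that an error reported mid-line, for example after a timestamp prefix, would no longer be caught.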
1 change: 1 addition & 0 deletions tests/tests_default.yml
@@ -20,6 +20,7 @@
hpc_install_docker: "{{ hpc_install_nvidia_container_toolkit }}"
hpc_install_azurehpc_health_checks: "{{ hpc_install_nvidia_container_toolkit }}"
hpc_install_diagnostics: false
hpc_azure_disable_predictable_net_names: false
tasks:
- name: Skip unsupported architectures
include_tasks: tasks/skip_unsupported_archs.yml
1 change: 1 addition & 0 deletions tests/tests_include_vars_from_parent.yml
@@ -65,6 +65,7 @@
hpc_install_docker: "{{ hpc_install_nvidia_container_toolkit }}"
hpc_install_azurehpc_health_checks: "{{ hpc_install_nvidia_container_toolkit }}"
hpc_install_diagnostics: false
hpc_azure_disable_predictable_net_names: false

- name: Cleanup
file:
1 change: 1 addition & 0 deletions tests/tests_skip_toolkit.yml
@@ -23,6 +23,7 @@
hpc_install_docker: "{{ hpc_install_nvidia_container_toolkit }}"
hpc_install_azurehpc_health_checks: "{{ hpc_install_nvidia_container_toolkit }}"
hpc_install_diagnostics: false
hpc_azure_disable_predictable_net_names: false
tags:
- tests::reboot
tasks: