Commit aa1ce79

fix: add opt-in net.ifnames=0 for Azure images
Add hpc_azure_disable_predictable_net_names to disable predictable network interface naming on Azure by persisting net.ifnames=0 on the kernel command line. The fedora.linux_system_roles.bootloader role is used to modify the kernel parameter. This prevents IPoIB interfaces from being renamed to ibP* and keeps kernel-style names (e.g. ib0, ib1).

Also update the check in the test script to match only lines that start with FAIL, FAULT, or ERROR, which should stop WARNING lines like “failure threshold” from triggering a failure while still catching real error lines.

Signed-off-by: Gaurav Goklani <ggoklani@redhat.com>
1 parent da62b42 commit aa1ce79

7 files changed, +35 -2 lines changed

README.md

Lines changed: 12 additions & 0 deletions
@@ -137,6 +137,18 @@ Default: `true`

 Type: `bool`

+### hpc_azure_disable_predictable_net_names
+
+Whether to disable predictable network interface names by adding `net.ifnames=0`
+to the kernel command line (via the bootloader system role).
+
+This keeps kernel names such as `ib0`, `ib1`, ... instead of `ibP...` on IPoIB,
+but it also affects Ethernet naming (e.g. `eth0` instead of `enP...`).
+
+Default: `true`
+
+Type: `bool`
+
 ### hpc_install_system_openmpi

 Whether to install OpenMPI that comes from AppStream repositories and does not have Nvidia GPU support.

defaults/main.yml

Lines changed: 1 addition & 0 deletions
@@ -23,6 +23,7 @@ hpc_install_nvidia_fabric_manager: true
 hpc_install_nvidia_imex: true
 hpc_install_rdma: true
 hpc_enable_azure_persistent_rdma_naming: true
+hpc_azure_disable_predictable_net_names: true
 hpc_install_system_openmpi: true
 hpc_build_openmpi_w_nvidia_gpu_support: true
 hpc_install_moneo: true

tasks/main.yml

Lines changed: 17 additions & 0 deletions
@@ -578,6 +578,23 @@
         name: nvidia-imex.service
         enabled: true

+- name: Disable predictable network interface names on Azure (net.ifnames=0)
+  when:
+    - hpc_azure_disable_predictable_net_names
+  block:
+    - name: Configure net.ifnames via bootloader role
+      include_role:
+        name: fedora.linux_system_roles.bootloader
+      vars:
+        bootloader_settings:
+          - kernel: ALL
+            options:
+              - name: net.ifnames
+                state: absent
+              - name: net.ifnames
+                value: "0"
+                state: present
+
 - name: Install RDMA packages
   when: hpc_install_rdma
   block:
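
Once the bootloader role has persisted the setting and the host has been rebooted, the kernel command line should contain `net.ifnames=0`. The following is a minimal sketch of such a check. The helper name `has_ifnames_disabled` and the example command line are hypothetical and are not part of this commit.

```shell
#!/bin/sh
# Hypothetical helper (not part of the role): succeeds if the given kernel
# command line contains the exact argument net.ifnames=0.
has_ifnames_disabled() {
    for arg in $1; do
        [ "$arg" = "net.ifnames=0" ] && return 0
    done
    return 1
}

# Made-up command line used only for illustration.
example_cmdline="BOOT_IMAGE=/vmlinuz root=/dev/sda2 ro net.ifnames=0"

if has_ifnames_disabled "$example_cmdline"; then
    result="disabled"
else
    result="enabled"
fi
echo "predictable interface names: $result"
```

On a live host, the same helper could be called as `has_ifnames_disabled "$(cat /proc/cmdline)"`.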

templates/test-azure-health-checks.sh.j2

Lines changed: 2 additions & 2 deletions
@@ -206,10 +206,10 @@ test_health_log() {
     pass "health.log file exists"

     echo "Checking: health.log for errors"
-    if grep -Ei "fail|fault|error" "$log_file" | grep -Eiv "success" > /dev/null 2>&1; then
+    if grep -Eiu '^(fail|fault|error)\b' "$log_file" | grep -Eiv "success" > /dev/null 2>&1; then
         echo ""
         echo "--- Error excerpts from health.log ---"
-        grep -Ei "fail|fault|error" "$log_file" | grep -Eiv "success" | head -20
+        grep -Eiu '^(fail|fault|error)\b' "$log_file" | grep -Eiv "success" | head -20
         echo "--- End of excerpts ---"
         echo ""
         fail "FAIL/FAULT/ERROR found in health.log"
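
To illustrate the effect of anchoring the pattern, the sketch below runs the old and new expressions against a small made-up log. The sample lines and the `count_matches` helper are hypothetical and are not taken from real health.log output. The `-u` flag from the diff is left out here on the assumption that it does not change which lines match. The `\b` word boundary is a GNU grep extension.

```shell
#!/bin/sh
# Hypothetical sample log; these lines are illustrative only.
sample_log=$(mktemp)
cat > "$sample_log" <<'EOF'
WARNING: failure threshold set to 3
INFO: no errors detected
ERROR: GPU 2 not responding
FAIL: bandwidth test below minimum
EOF

# Count lines matched by a pattern after dropping lines that mention "success".
count_matches() {
    grep -Ei "$1" "$sample_log" | grep -Eiv "success" | wc -l | tr -d ' '
}

# Old expression: matches the keywords anywhere in a line.
old_count=$(count_matches 'fail|fault|error')
# New expression: matches only when a line starts with one of the keywords.
new_count=$(count_matches '^(fail|fault|error)\b')

echo "old pattern matched: $old_count"
echo "new pattern matched: $new_count"
rm -f "$sample_log"
```

With this sample, the old expression also flags the WARNING and INFO lines, while the anchored one flags only the lines that begin with ERROR or FAIL. Because of the trailing `\b`, a line that begins with a longer word such as `FAILED` would not match the new expression.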

tests/tests_default.yml

Lines changed: 1 addition & 0 deletions
@@ -20,6 +20,7 @@
     hpc_install_docker: "{{ hpc_install_nvidia_container_toolkit }}"
     hpc_install_azurehpc_health_checks: "{{ hpc_install_nvidia_container_toolkit }}"
     hpc_install_diagnostics: false
+    hpc_azure_disable_predictable_net_names: false
   tasks:
     - name: Skip unsupported architectures
       include_tasks: tasks/skip_unsupported_archs.yml

tests/tests_include_vars_from_parent.yml

Lines changed: 1 addition & 0 deletions
@@ -65,6 +65,7 @@
         hpc_install_docker: "{{ hpc_install_nvidia_container_toolkit }}"
         hpc_install_azurehpc_health_checks: "{{ hpc_install_nvidia_container_toolkit }}"
         hpc_install_diagnostics: false
+        hpc_azure_disable_predictable_net_names: false

     - name: Cleanup
       file:

tests/tests_skip_toolkit.yml

Lines changed: 1 addition & 0 deletions
@@ -23,6 +23,7 @@
     hpc_install_docker: "{{ hpc_install_nvidia_container_toolkit }}"
     hpc_install_azurehpc_health_checks: "{{ hpc_install_nvidia_container_toolkit }}"
     hpc_install_diagnostics: false
+    hpc_azure_disable_predictable_net_names: false
   tags:
     - tests::reboot
   tasks:
