fix: add opt-in net.ifnames=0 for Azure images #101
ggoklani merged 1 commit into linux-system-roles:main
Conversation
Reviewer's Guide

Adds an opt-in (default-enabled) Azure-specific path to disable predictable network interface names via net.ifnames=0, updates defaults and documentation for the new variable, and tightens health log error matching in the Azure test script to ignore benign warnings.

File-Level Changes
Hey - I've left some high level feedback:

- The `lineinfile` regex for `/etc/kernel/cmdline` (`'(^.*)(\snet\.ifnames=0)?$'` with `line: '\1 net.ifnames=0'`) will duplicate `net.ifnames=0` if it already exists because group 1 is greedy; consider a pattern that excludes the argument from group 1 (e.g. using a negative lookahead or anchoring on `(^.*?)(?:\snet\.ifnames=0)?$`) or a simpler `regexp: '\bnet\.ifnames=0\b'` with `insertafter: EOF`.
- The `grubby --info=DEFAULT` / `--update-kernel=ALL` commands will hard-fail the play if `grubby` is missing or unsupported on a given Azure image; consider adding `failed_when: false` to the info call and guarding the update with a `when: __hpc_grubby_default.rc == 0` (or similar) so the role degrades gracefully where grubby isn't available.
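The idempotency concern can be sketched in plain shell. This is a hypothetical equivalent of the suggested `lineinfile` fix, not the role's actual task; the file path argument and the function name are stand-ins for illustration.

```shell
#!/bin/sh
# Hypothetical shell equivalent of the reviewer's suggestion: only append
# net.ifnames=0 when it is not already present, so repeated runs stay
# idempotent. The path argument is a stand-in for /etc/kernel/cmdline.
append_ifnames0() {
    grep -q 'net\.ifnames=0' "$1" || sed -i 's/$/ net.ifnames=0/' "$1"
}

# Degrade-gracefully guard for images without grubby; the update command
# is echoed here rather than executed, since it needs root and a real
# bootloader config.
if command -v grubby >/dev/null 2>&1; then
    echo "grubby present; safe to run grubby --update-kernel=ALL"
else
    echo "grubby not available; skipping bootloader update"
fi
```

Running `append_ifnames0` twice on the same file leaves exactly one `net.ifnames=0`, which is the behaviour the greedy capture-group regex fails to guarantee.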
@dgchinner @lixuemin2016 Please check the PR. I selected this approach due to issues with the other options, as explained below.

- Renaming via systemd .link files
- Renaming via a oneshot systemd service: when run early, it caused boot/provisioning issues on GPU nodes (cloud-init DHCP failures / ordering cycles).
- Renaming via udev + systemd service: udev did trigger the service, but the rename often still happened too late, and legacy rules/templates also conflicted (saw leftover 99-... rules).

Why net.ifnames=0 is the right approach here

Tradeoff (why it's opt-in)
That doesn't sound right. systemd-udevd should use the .link files if they are defined and ignore everything else, i.e. they are supposed to run instead of any other naming/udev rule that might be defined for that device. The persistent slot-based naming that udev does (ibP....) is well down the list of naming options, so it doesn't make sense to me that this didn't work. Did the interface end up with an altname of "ib0", or did it not get applied at all? Also, what was the output of 'udevadm test-builtin net_id /sys/class/net/ib....' when it was failing to rename the devices? FWIW, I think that the .link file matching should be done based on the interface MAC address (defined by the hardware, so always consistent across reboots). There should be no need to try to match against anything else (like Path, ID_NET_NAME_PATH, etc) that might be inconsistent across devices and drivers.
That doesn't sound right, either. We can rename devices without taking them down. E.g. on my laptop I just renamed the ethernet device from enp73s0 to eth0 with: $ sudo ip link change dev enp73s0 name eth0. Nothing was interrupted, and the interface was not bounced by the rename. i.e. we should always be able to manually rename IPoIB interfaces without any disruption to network services at all.
@dgchinner

Why MAC matching is unreliable for IPoIB ("this port's MAC → ib0")

What is stable for ordering

That's why net.ifnames=0 is attractive here: it avoids all per-port matching and races by disabling predictable naming globally.

Why the oneshot service rename caused trouble (even if ip link change can be non-disruptive): cloud-init / NetworkManager / provisioning agents often bind config to an interface name.

net.ifnames=0 avoids all of the above by removing the entire predictable-naming stack from the equation: no .link matching problem.
Only if you take the 20-byte address as a whole. There is a fixed portion of the address that comes from the hardware: the last 8 bytes are the port GUID, and that is burned into the firmware of the IB HCA at the factory. i.e. the last 8 bytes of the MAC address are guaranteed to be persistent, stable and unique across all the IB devices in the machine.
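The byte layout being discussed can be illustrated in shell. The sample 20-byte IPoIB hardware address below is made up; only the structure (port GUID in the last 8 bytes) follows the comment above.

```shell
#!/bin/sh
# Illustration of the point above: in a 20-byte IPoIB hardware address,
# the trailing 8 bytes are the port GUID. The sample address is invented.
ADDR='80:00:00:48:fe:80:00:00:00:00:00:00:f4:52:14:03:00:7b:cb:a1'

# Fields 13-20 of the colon-separated address are the last 8 bytes.
GUID=$(echo "$ADDR" | cut -d: -f13-20)
echo "$GUID"
```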
@dgchinner You're right about the last 8 bytes: for IPoIB the tail is the port GUID, and that portion is stable/unique for a given HCA port. The practical reasons we still didn't use "GUID-only matching" as the primary production solution here were:

- We still need deterministic GUID → ibX assignment across multi-port machines. The GUID alone doesn't tell you "this is ib0 vs ib1" unless we also encode ordering logic (e.g., sort GUIDs, or map GUIDs to port order). Sorting GUIDs gives a deterministic order, but it's not guaranteed to match the physical/port enumeration order you wanted (mlx5_0 -> ib0, mlx5_1 -> ib1) across different SKUs/firmware layouts.
- Implementing a robust mask-based udev rule is fiddly across distros/udev versions. You need a rule that extracts exactly the GUID portion from the 20-byte address and matches it reliably for IPoIB netdevs. That's doable, but it's easy to get wrong (byte offsets, formatting differences, lower/upper hex, separators), and then we get silent mis-matches or collisions.
- Generic image problem: a pure udev rule based on GUID requires either hardcoding GUIDs (not possible in a reusable image), or

I tried these methods, but they ended up leading to more issues.
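The "sort GUIDs for a deterministic order" idea mentioned above can be sketched as a few lines of shell. The GUID values and the helper name are invented for illustration; this only demonstrates that sorting yields a stable mapping, not that it matches physical port order.

```shell
#!/bin/sh
# Sketch of deterministic GUID -> ibX assignment by sorting: the same set
# of GUIDs always yields the same index mapping, regardless of the order
# they are discovered in. GUID values are made up.
assign_indices() {
    # reads one GUID per line on stdin, prints "GUID -> ibN" in sorted order
    sort | awk '{ printf "%s -> ib%d\n", $1, NR - 1 }'
}

MAPPING=$(printf 'f452:1403:007b:cba2\nf452:1403:007b:cba1\n' | assign_indices)
echo "$MAPPING"
```

Note that the mapping is stable but arbitrary: nothing guarantees ib0 lands on the port a given SKU enumerates first, which is exactly the objection raised above.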
I think there has been a bit of misunderstanding here, probably because I didn't explain it well. That is, the mlx5_X -> ibX mapping example I gave that you repeat above was just that: an example of what consistent mapping behaviour might look like. i.e. if we map mlx5_0 -> ib1, we should -always- map it this way. Using the last 8 bytes of the MAC address to create indexes is one way we can obtain a consistent persistent ibX mapping for each device.

I'm not trying to be pedantic or difficult here - my understanding of how kernel network device naming works tells me that using net.ifnames=0 doesn't actually fix the persistent naming problem. It gives the -impression- that it is doing what we need, but it does not actually guarantee device naming is consistent or persistent. That is, net.ifnames=0 results in the kernel device naming being retained and exposed to userspace. Kernel device naming is dependent on device discovery order. This order can change from boot to boot because device discovery is asynchronous. Hence while we might always have ib0 and ib1 devices when net.ifnames=0 is set, there is no guarantee that ib0 always points to the same IB HCA. They can -and do- swap around randomly on each boot. Indeed, the inability of the kernel to reliably name devices is the primary reason that net.ifnames=0 is deprecated in favour of using systemd/udev to rename devices in userspace after boot.

Another example: why do we have the persistent RDMA device renaming infrastructure for the IB HCAs? Why is that needed, and how does it work to create consistent persistent RDMA device names? RDMA device renaming uses a sysfs device iteration method for generating consistent persistent mappings for the rdma device names. azure_persistent_rdma_naming.sh iterates the IB devices in sysfs, sorts them numerically, then renames the rdma device to mlx5_ibX where X is incremented by 1 for each device that is found.
There is no reason why we can't do exactly the same thing to iterate all the IB devices and rename the IPoIB interfaces appropriately. e.g. 'ls /sys/class/infiniband/DEV/device/net' should return the current network device name for the DEV IB device. We can then rename the network device using the 'ip link change dev <old> name <new>' command. This is yet another method that we can use to achieve consistent persistent device naming for IPoIB device names, and it has the advantage of being consistent with the RDMA device name for the same IB HCA. Indeed, doing it this way also gets around all the udev naming startup mess because we already have a RDMA naming monitor service that periodically checks and corrects RDMA device naming issues. Using this infrastructure would also address the IPoIB startup naming race problems as well.
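The sysfs-iteration approach described above can be sketched roughly as follows. This is a hypothetical dry-run that prints the rename commands instead of executing them; the function name is invented, and a real implementation would run the `ip link change` commands (as root) rather than echoing them.

```shell
#!/bin/sh
# Hypothetical dry-run sketch of the sysfs-iteration approach: walk the
# IB devices under the infiniband sysfs root in sorted order, find each
# device's current IPoIB netdev name, and print the rename it would get.
rename_plan() {
    # $1: infiniband sysfs root (normally /sys/class/infiniband)
    idx=0
    for dev in $(ls "$1" 2>/dev/null | sort); do
        # device/net/ holds the current network device name(s) for $dev
        for netdev in $(ls "$1/$dev/device/net" 2>/dev/null); do
            echo "ip link change dev $netdev name ib$idx"
            idx=$((idx + 1))
        done
    done
}

# Prints nothing on machines without IB hardware.
rename_plan /sys/class/infiniband
```

Because the iteration order is fixed by the sorted device names, the same machine produces the same plan on every boot, which is the consistency property being argued for.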
As discussed, we will go with net.ifnames=0 for now, but for a future release we will check for alternative approaches.
lgtm - @spetrosi ?
Add hpc_azure_disable_predictable_net_names to disable predictable interface naming on Azure by persisting net.ifnames=0; used the fedora.linux_system_roles.bootloader role to modify the kernel param. This prevents IPoIB interfaces from being renamed to ibP* and keeps kernel-style names (e.g. ib0, ib1).

Also updated the test script check to only match lines that start with FAIL, FAULT, or ERROR, which should stop WARNING lines like "failure threshold" from triggering a failure while still catching real error lines.

Signed-off-by: Gaurav Goklani <ggoklani@redhat.com>
Enhancement:
Add hpc_azure_disable_predictable_net_names to disable predictable interface naming on Azure by persisting net.ifnames=0 into the bootloader configuration. This prevents IPoIB interfaces from being renamed to ibP* and keeps kernel-style names (e.g. ib0, ib1).
Also updated the test script check to only match lines that start with FAIL, FAULT, or ERROR, which should stop WARNING lines like "failure threshold" from triggering a failure while still catching real error lines.
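The tightened check described above amounts to anchoring the match at the start of the line. A minimal sketch (the function name and sample log lines are illustrative, not the script's actual code):

```shell
#!/bin/sh
# Minimal sketch of the tightened log check: only lines that *start* with
# FAIL, FAULT, or ERROR count as errors, so WARNING lines that merely
# mention a "failure threshold" are ignored.
check_health_log() {
    # $1: path to health.log; returns non-zero if real errors are present
    ! grep -Eq '^(FAIL|FAULT|ERROR)' "$1"
}
```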
Reason:
Some Azure SKUs rename IPoIB interfaces from ib0/ib1 to ibP*, which breaks expectations/tests that require stable ibX names.
Result:
When hpc_azure_disable_predictable_net_names: true, VMs keep kernel-style interface names (e.g. ib0, ib1) instead of ibP* on IPoIB. Note that this also affects Ethernet naming (e.g. eth0 instead of enP*).

Test output:
[azureuser@gaurav-hpc-gpu-testnew1 tests]$ sudo ./test-azure-health-checks.sh
[2026-03-19 08:33:29] ========================================
[2026-03-19 08:33:29] Azure HPC Health Checks Wrapper
[2026-03-19 08:33:29] ========================================
[2026-03-19 08:33:29] Test: Setup checks...
Checking: azurehpc-health-checks directory exists
[PASS] azurehpc-health-checks directory exists
Checking: run-health-checks.sh script exists
[PASS] run-health-checks.sh script exists
Checking: Docker service is running
[PASS] Docker service is running
Checking: AZNHC Docker image is available
[PASS] AZNHC Docker image is available
[2026-03-19 08:33:31] Test: Running Azure HPC health checks...
[2026-03-19 08:33:31] Working directory: /opt/hpc/azure/tests/azurehpc-health-checks
No custom conf file specified, detecting VM SKU...
Running health checks for Standard_nd40rs_v2 SKU...
Running health checks using /opt/hpc/azure/tests/azurehpc-health-checks/conf/nd40rs_v2.conf and outputting to /opt/hpc/azure/tests/azurehpc-health-checks/health.log
==========
== CUDA ==
CUDA Version 12.4.1
Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
SUCCESS: nhc: Health check passed: check_gpu_ecc: ECC checks passed
SUCCESS: nhc: Health check passed: check_gpu_count: Expected 8 and found 8
SUCCESS: nhc: Health check passed: check_gpu_xid: GPU XID error check passed.
SUCCESS: nhc: Health check passed: check_nvsmi_healthmon: nvidia-smi completed successfully
WARNING: nhc: Nearing a failure threshold: check_gpu_bw: H2D test on GPU 0 has a bandwidth of 10.42 GB/s. Expected threshold 10 GB/s.
WARNING: nhc: Nearing a failure threshold: check_gpu_bw: H2D test on GPU 1 has a bandwidth of 10.35 GB/s. Expected threshold 10 GB/s.
WARNING: nhc: Nearing a failure threshold: check_gpu_bw: H2D test on GPU 2 has a bandwidth of 10.17 GB/s. Expected threshold 10 GB/s.
WARNING: nhc: Nearing a failure threshold: check_gpu_bw: H2D test on GPU 3 has a bandwidth of 10.27 GB/s. Expected threshold 10 GB/s.
SUCCESS: nhc: Health check passed: check_nvBW_gpu_bw: GPU bandwidth Tests with NVBandwidth passed
SUCCESS: nhc: Health check passed: check_gpu_bw: GPU Bandwidth Tests Passed
SUCCESS: nhc: Health check passed: check_nvlink_status: GPU 0 has all nvlinks active.
SUCCESS: nhc: Health check passed: check_nvlink_status: GPU 1 has all nvlinks active.
SUCCESS: nhc: Health check passed: check_nvlink_status: GPU 2 has all nvlinks active.
SUCCESS: nhc: Health check passed: check_nvlink_status: GPU 3 has all nvlinks active.
SUCCESS: nhc: Health check passed: check_nvlink_status: GPU 4 has all nvlinks active.
SUCCESS: nhc: Health check passed: check_nvlink_status: GPU 5 has all nvlinks active.
SUCCESS: nhc: Health check passed: check_nvlink_status: GPU 6 has all nvlinks active.
SUCCESS: nhc: Health check passed: check_nvlink_status: GPU 7 has all nvlinks active.
SUCCESS: nhc: Health check passed: check_nccl_allreduce: NCCL all reduce bandwidth test passed, 128.724 GB/s
SUCCESS: nhc: Health check passed: check_ib_bw_non_gdr: IB write bandwidth non gdr test passed for IB=mlx5_ib0, IB BW=103.58 Gbps
SUCCESS: nhc: Health check passed: check_ib_link_flapping: No IB link flapping found
Health checks completed with exit code: 0.
[PASS] Azure HPC health checks completed successfully
[2026-03-19 08:35:03] Test: Validating health.log...
Checking: health.log file exists
[PASS] health.log file exists
Checking: health.log for errors
[PASS] No FAIL/FAULT/ERROR in health.log
[2026-03-19 08:35:03] ========================================
[2026-03-19 08:35:03] All tests passed (7)
[2026-03-19 08:35:03] ========================================
[azureuser@gaurav-hpc-gpu-testnew1 tests]$
Issue Tracker Tickets (Jira or BZ if any): https://redhat.atlassian.net/browse/RHELHPC-177
Summary by Sourcery
Add an opt-in mechanism to disable predictable network interface names on Azure images and tighten Azure health log error detection.