
fix: add opt-in net.ifnames=0 for Azure images #101

Merged
ggoklani merged 1 commit into linux-system-roles:main from ggoklani:fix_ib_device_rename
Mar 20, 2026

Conversation

@ggoklani
Collaborator

@ggoklani ggoklani commented Mar 17, 2026

Enhancement:
Add hpc_azure_disable_predictable_net_names to disable predictable interface naming on Azure by persisting net.ifnames=0 into the bootloader configuration. This prevents IPoIB interfaces from being renamed to ibP* and keeps kernel-style names (e.g. ib0, ib1).

Also updated the test script check to only match lines that start with FAIL, FAULT, or ERROR, which should stop WARNING lines like "failure threshold" from triggering a failure while still catching real error lines.
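A minimal sketch of the narrowed check (illustrative only; the real pattern lives in templates/test-azure-health-checks.sh.j2). Anchoring FAIL/FAULT/ERROR at the start of the line keeps WARNING lines from tripping the check:

```shell
# Illustrative sketch, not the merged template: anchor the error keywords
# at line start so a WARNING containing the word "failure" does not match.
log=$(mktemp)
cat > "$log" <<'EOF'
WARNING: nhc: Nearing a failure threshold: check_gpu_bw: H2D test ...
SUCCESS: nhc: Health check passed: check_gpu_count: Expected 8 and found 8
EOF

if grep -Ei '^(FAIL|FAULT|ERROR)\b' "$log" | grep -ivq success; then
    echo "errors found in health.log"
else
    echo "no FAIL/FAULT/ERROR in health.log"
fi
rm -f "$log"
```

With this input the check reports a clean log, since the WARNING line is not anchored at FAIL/FAULT/ERROR.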

Reason:
Some Azure SKUs rename IPoIB interfaces from ib0/ib1 to ibP*, which breaks expectations/tests that require stable ibX names.

Result: When hpc_azure_disable_predictable_net_names: true, VMs keep kernel-style interface names (e.g. ib0, ib1) instead of ibP* on IPoIB. Note that this also affects Ethernet naming (e.g. eth0 instead of enP*).

Test run output:
[azureuser@gaurav-hpc-gpu-testnew1 tests]$ sudo ./test-azure-health-checks.sh
[2026-03-19 08:33:29] ========================================
[2026-03-19 08:33:29] Azure HPC Health Checks Wrapper
[2026-03-19 08:33:29] ========================================

[2026-03-19 08:33:29] Test: Setup checks...

Checking: azurehpc-health-checks directory exists
[PASS] azurehpc-health-checks directory exists
Checking: run-health-checks.sh script exists
[PASS] run-health-checks.sh script exists
Checking: Docker service is running
[PASS] Docker service is running
Checking: AZNHC Docker image is available
[PASS] AZNHC Docker image is available

[2026-03-19 08:33:31] Test: Running Azure HPC health checks...
[2026-03-19 08:33:31] Working directory: /opt/hpc/azure/tests/azurehpc-health-checks

No custom conf file specified, detecting VM SKU...
Running health checks for Standard_nd40rs_v2 SKU...
Running health checks using /opt/hpc/azure/tests/azurehpc-health-checks/conf/nd40rs_v2.conf and outputting to /opt/hpc/azure/tests/azurehpc-health-checks/health.log

==========
== CUDA ==

CUDA Version 12.4.1

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

SUCCESS: nhc: Health check passed: check_gpu_ecc: ECC checks passed
SUCCESS: nhc: Health check passed: check_gpu_count: Expected 8 and found 8
SUCCESS: nhc: Health check passed: check_gpu_xid: GPU XID error check passed.
SUCCESS: nhc: Health check passed: check_nvsmi_healthmon: nvidia-smi completed successfully
WARNING: nhc: Nearing a failure threshold: check_gpu_bw: H2D test on GPU 0 has a bandwidth of 10.42 GB/s. Expected threshold 10 GB/s.
WARNING: nhc: Nearing a failure threshold: check_gpu_bw: H2D test on GPU 1 has a bandwidth of 10.35 GB/s. Expected threshold 10 GB/s.
WARNING: nhc: Nearing a failure threshold: check_gpu_bw: H2D test on GPU 2 has a bandwidth of 10.17 GB/s. Expected threshold 10 GB/s.
WARNING: nhc: Nearing a failure threshold: check_gpu_bw: H2D test on GPU 3 has a bandwidth of 10.27 GB/s. Expected threshold 10 GB/s.
SUCCESS: nhc: Health check passed: check_nvBW_gpu_bw: GPU bandwidth Tests with NVBandwidth passed
SUCCESS: nhc: Health check passed: check_gpu_bw: GPU Bandwidth Tests Passed
SUCCESS: nhc: Health check passed: check_nvlink_status: GPU 0 has all nvlinks active.
SUCCESS: nhc: Health check passed: check_nvlink_status: GPU 1 has all nvlinks active.
SUCCESS: nhc: Health check passed: check_nvlink_status: GPU 2 has all nvlinks active.
SUCCESS: nhc: Health check passed: check_nvlink_status: GPU 3 has all nvlinks active.
SUCCESS: nhc: Health check passed: check_nvlink_status: GPU 4 has all nvlinks active.
SUCCESS: nhc: Health check passed: check_nvlink_status: GPU 5 has all nvlinks active.
SUCCESS: nhc: Health check passed: check_nvlink_status: GPU 6 has all nvlinks active.
SUCCESS: nhc: Health check passed: check_nvlink_status: GPU 7 has all nvlinks active.
SUCCESS: nhc: Health check passed: check_nccl_allreduce: NCCL all reduce bandwidth test passed, 128.724 GB/s
SUCCESS: nhc: Health check passed: check_ib_bw_non_gdr: IB write bandwidth non gdr test passed for IB=mlx5_ib0, IB BW=103.58 Gbps
SUCCESS: nhc: Health check passed: check_ib_link_flapping: No IB link flapping found
Health checks completed with exit code: 0.

[PASS] Azure HPC health checks completed successfully
[2026-03-19 08:35:03] Test: Validating health.log...

Checking: health.log file exists
[PASS] health.log file exists
Checking: health.log for errors
[PASS] No FAIL/FAULT/ERROR in health.log

[2026-03-19 08:35:03] ========================================
[2026-03-19 08:35:03] All tests passed (7)
[2026-03-19 08:35:03] ========================================
[azureuser@gaurav-hpc-gpu-testnew1 tests]$

Issue Tracker Tickets (Jira or BZ if any): https://redhat.atlassian.net/browse/RHELHPC-177

Summary by Sourcery

Add an opt-in mechanism to disable predictable network interface names on Azure images and tighten Azure health log error detection.

New Features:

  • Introduce the hpc_azure_disable_predictable_net_names flag to control adding net.ifnames=0 to the kernel command line on Azure systems.

Bug Fixes:

  • Adjust Azure health check log scanning to only treat lines starting with FAIL, FAULT, or ERROR as errors, avoiding false positives from WARNING lines.

Enhancements:

  • Persist net.ifnames=0 into both /etc/kernel/cmdline and all grubby-managed boot entries when the Azure disable-predictable-names option is enabled.
  • Document the new hpc_azure_disable_predictable_net_names option and its impact on IPoIB and Ethernet interface naming.

Tests:

  • Refine the Azure health checks test script to narrow the pattern used when extracting error lines from health.log.

@ggoklani ggoklani requested review from richm and spetrosi as code owners March 17, 2026 12:45
@sourcery-ai

sourcery-ai bot commented Mar 17, 2026

Reviewer's Guide

Adds an opt-in (default-enabled) Azure-specific path to disable predictable network interface names via net.ifnames=0, updates defaults and documentation for the new variable, and tightens health log error matching in the Azure test script to ignore benign warnings.

File-Level Changes

Change Details Files
Add Azure-specific task block to disable predictable network interface names via net.ifnames=0 and persist it across boots.
  • Gate the behavior on a new hpc_azure_disable_predictable_net_names variable and Microsoft system vendor fact.
  • Check for /etc/kernel/cmdline and, if present, ensure net.ifnames=0 is appended using a backreferenced regexp-friendly lineinfile task.
  • Query the current default kernel args with grubby --info=DEFAULT and, if net.ifnames=0 is missing, run grubby --update-kernel=ALL --args=net.ifnames=0 to update all boot entries.
tasks/main.yml
Introduce and document the new hpc_azure_disable_predictable_net_names variable, defaulting it to true.
  • Add hpc_azure_disable_predictable_net_names: true to the role defaults so the behavior is enabled by default.
  • Document the variable semantics, side effects on IPoIB and Ethernet naming, default value, and type in the README.
defaults/main.yml
README.md
Narrow Azure health check log parsing so only lines starting with FAIL/FAULT/ERROR are treated as errors, reducing false positives from warnings.
  • Change the grep pattern to match FAIL/FAULT/ERROR only at the start of a line using a case-insensitive anchored regex with word boundary.
  • Apply the same anchored pattern to both the detection and excerpt-printing greps while still excluding lines containing 'success'.
templates/test-azure-health-checks.sh.j2



@sourcery-ai sourcery-ai bot left a comment


Hey - I've left some high level feedback:

  • The lineinfile regex for /etc/kernel/cmdline ('(^.*)(\snet\.ifnames=0)?$' with line: '\1 net.ifnames=0') will duplicate net.ifnames=0 if it already exists because group 1 is greedy; consider a pattern that excludes the argument from group 1 (e.g. using a negative lookahead or anchoring on (^.*?)(?:\snet\.ifnames=0)?$) or a simpler regexp: '\bnet\.ifnames=0\b' with insertafter: EOF.
  • The grubby --info=DEFAULT / --update-kernel=ALL commands will hard-fail the play if grubby is missing or unsupported on a given Azure image; consider adding failed_when: false to the info call and guarding the update with a when: __hpc_grubby_default.rc == 0 (or similar) so the role degrades gracefully where grubby isn’t available.
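The suggested graceful-degradation guard might look something like the following sketch (task names and the register variable are illustrative, not the merged code):

```yaml
- name: Query default kernel args with grubby
  ansible.builtin.command: grubby --info=DEFAULT
  register: __hpc_grubby_default
  changed_when: false
  failed_when: false  # degrade gracefully where grubby is missing/unsupported

- name: Add net.ifnames=0 to all boot entries
  ansible.builtin.command: grubby --update-kernel=ALL --args=net.ifnames=0
  when:
    - __hpc_grubby_default.rc == 0
    - "'net.ifnames=0' not in __hpc_grubby_default.stdout"
```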

@ggoklani
Collaborator Author

@dgchinner @lixuemin2016 Please review the PR. I selected this approach because of issues with the alternatives, as explained below.

Renaming via systemd .link files
I tried to force ib0/ib1 by matching properties and setting Name=ib0.
On these Azure GPU SKUs, the interface was already named ibP… by the time link setup ran, so .link policy (keep) just “kept” ibP… and never converted it back to ib0.
Matching was also inconsistent (Path= vs ID_PATH, driver strings like mlx5_core[ib_ipoib], etc.).

Renaming via a oneshot systemd service

When run early, it caused boot/provisioning issues on GPU nodes (cloud-init DHCP failures / ordering cycles).
When run late (after cloud-init/network), the IPoIB interface was already UP and configured, so ip link set name … typically fails or would require bouncing the interface (which is disruptive/risky).

Renaming via udev + systemd service

Udev did trigger the service, but the rename often still happened too late, and legacy rules/templates also conflicted (we saw leftover 99-... rules).
Even when the naming program was correct, the rename wasn’t reliably “winning” against systemd predictable naming across all SKUs.

Why net.ifnames=0 is the right approach here
It fixes the problem at the source: disables predictable interface naming globally, so systemd won’t rename ib0 -> ibP… in the first place.
It’s consistent across VM types (GPU and non-GPU) because it’s a kernel boot policy, not a race between udev, systemd, cloud-init, and NetworkManager.
It works on first boot after deployment (no extra reboot) because the kernel cmdline is already baked into the image.

Tradeoff (why it’s opt-in)
net.ifnames=0 affects all interfaces, not just IPoIB (e.g., Ethernet becomes eth0 instead of enP…). That’s why we implemented it as an opt-in variable (hpc_azure_disable_predictable_net_names).
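Enabling the behaviour from a playbook could look like this (the role name below is a placeholder; only the variable name comes from this PR):

```yaml
# Hypothetical playbook snippet; substitute the actual role name
# used in your inventory.
- hosts: azure_hpc_nodes
  vars:
    hpc_azure_disable_predictable_net_names: true
  roles:
    - linux_system_roles.hpc
```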

@dgchinner
Collaborator

Renaming via systemd .link files I tried to force ib0/ib1 by matching properties and setting Name=ib0. On these
Azure GPU SKUs, the interface was already named ibP… by the time link setup ran, so .link policy (keep) just “kept”
ibP… and never converted it back to ib0. Matching was also inconsistent (Path= vs ID_PATH, driver strings like
mlx5_core[ib_ipoib], etc.).

That doesn't sound right. systemd-udevd should use the .link files if they are defined and ignore everything else, i.e. they are supposed to run instead of any other naming/udev rule that might be defined for that device. The persistent slot-based naming that udev does (ibP....) is well down the list of naming options, so it doesn't make sense to me that this didn't work. Did the interface end up with an altname of "ib0", or did it not get applied at all?

Also What was the output of 'udevadm test-builtin net_id /sys/class/net/ib....' when it was failing to rename the devices?

FWIW, I think that the .link file matching should be done based on the interface MAC address (defined by the hardware, so always consistent across reboots). There should be no need to try to match against anything else (like Path, ID_NET_NAME_PATH, etc) that might be inconsistent across devices and drivers.

Renaming via a oneshot systemd service

When run early, it caused boot/provisioning issues on GPU nodes (cloud-init DHCP failures / ordering cycles).
When run late (after cloud-init/network), the IPoIB interface was already UP and configured, so ip link set name … typically fails or would require bouncing the interface (which is disruptive/risky).

That doesn't sound right, either. We can rename devices without taking them down. E.g on my laptop I just renamed the ethernet device from enp73s0 to eth0 with:

$ sudo ip link change dev enp73s0 name eth0

Nothing was interrupted, and the interface was not bounced by the rename. ie. We should always be able to manually rename IPoIB interfaces without any disruption to network services at all.

@ggoklani
Collaborator Author

@dgchinner
The main problem is that for IPoIB the “MAC address” you see in ip link is not a simple hardware MAC like Ethernet, and it’s not a great stable identifier for matching.

Why MAC matching is unreliable for IPoIB
IPoIB uses a 20‑byte hardware address (link/infiniband …), not a 6‑byte Ethernet MAC.
That 20‑byte address typically includes dynamic fields (e.g., the QPN) in addition to the GUID portion. Those dynamic fields can change across boots / device re-create, so matching the full address can fail even on the same port.
In other words, the “MAC is defined by hardware” assumption is true for Ethernet, but not strictly true for IPoIB netdev addresses as exposed to Linux.
Why MAC matching also doesn’t solve ordering by itself
Even if MAC were stable, you still need a mapping like:

“this port’s MAC → ib0”
“that port’s MAC → ib1”
On a generic image you don’t know those MAC/GUID values ahead of time, and on multi-interface machines you’d still need per-host discovery (which is exactly what led us into timing/race issues earlier).

What is stable for ordering
For consistent mlx5_0 -> ib0, mlx5_1 -> ib1 across SKUs, the most robust keys tend to be device path + port index (sysfs devpath + dev_port/port number) or udev’s ID_NET_NAME_PATH/SLOT—but those require careful matching and differ across platforms.

That’s why net.ifnames=0 is attractive here: it avoids all per-port matching and races by disabling predictable naming globally.

Why the oneshot service rename caused trouble (even if ip link change can be non-disruptive)
It’s true that sometimes you can rename an UP interface without an obvious traffic drop (your laptop example). But that doesn’t mean it’s safe during automated provisioning on Azure images:

Cloud-init / NetworkManager / provisioning agents often bind config to an interface name.
Even if the kernel rename succeeds, higher-level tooling may still be working with the old name mid-flight. That’s how we end up with “DHCP lease failed” style provisioning errors.
We actually observed GPU-node provisioning failures when we tried early-boot renaming approaches (cloud-init errors about DHCP lease failures and systemd ordering-cycle issues). So in this environment, it wasn’t theoretical.
Also, for many device types (and especially when enslaved/managed), rename can fail with “busy” semantics unless the interface is down. So “always possible without disruption” isn’t a guarantee across all netdev types and states.

net.ifnames=0 avoids all of the above by removing the entire predictable-naming stack from the equation:

no .link matching problem
no per-host discovery needed
no race with cloud-init/NetworkManager during first boot
consistent result across GPU and non-GPU SKUs: you keep kernel-style names (ib0, ib1, …) instead of ibP…

@dgchinner
Collaborator

IPoIB uses a 20‑byte hardware address (link/infiniband …), not a 6‑byte Ethernet MAC.
That 20‑byte address typically includes dynamic fields (e.g., the QPN) in addition to the GUID portion. Those dynamic fields can change across boots / device re-create, so matching the full address can fail even on the same port.
In other words, the “MAC is defined by hardware” assumption is true for Ethernet, but not strictly true for IPoIB netdev addresses as exposed to Linux.

Only if you take the 20-byte address as a whole.

There is a fixed portion of the address that comes from the hardware: the last 8 bytes are the port GUID, and that is burned into the firmware of the IB HCA at the factory. i.e. the last 8 bytes of the MAC address are guaranteed to be persistent, stable and unique across all the IB devices in the machine.
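To make the layout concrete, here is a sketch of pulling the port GUID (the last 8 of the 20 colon-separated bytes) out of an IPoIB hardware address. The address below is a made-up placeholder, not taken from any machine in this thread:

```shell
# Placeholder IPoIB 20-byte address as shown by 'ip link'; only the last
# 8 bytes (the port GUID) are stable across boots.
addr="80:00:02:08:fe:80:00:00:00:00:00:00:00:15:5d:ff:fd:33:ff:2a"

# Keep fields 13..20 of the colon-separated address.
guid=$(printf '%s\n' "$addr" |
    awk -F: '{ out = $13; for (i = 14; i <= 20; i++) out = out ":" $i; print out }')
echo "$guid"
```

For the placeholder above this prints 00:15:5d:ff:fd:33:ff:2a, the stable tail that could anchor any GUID-based matching scheme.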

@ggoklani
Collaborator Author

ggoklani commented Mar 18, 2026

@dgchinner You’re right about the last 8 bytes: for IPoIB the tail is the port GUID, and that portion is stable/unique for a given HCA port.

The practical reasons we still didn’t use “GUID-only matching” as the primary production solution here were:

we still need deterministic GUID → ibX assignment across multi-port machines. The GUID alone doesn’t tell you “this is ib0 vs ib1” unless we also encode ordering logic (e.g., sort GUIDs, or map GUIDs to port order). Sorting GUIDs gives a deterministic order, but it’s not guaranteed to match the physical/port enumeration order you wanted (mlx5_0 -> ib0, mlx5_1 -> ib1) across different SKUs/firmware layouts.

Implementing a robust mask-based udev rule is fiddly across distros/udev versions. You need a rule that extracts exactly the GUID portion from the 20‑byte address and matches it reliably for IPoIB netdevs. That’s doable, but it’s easy to get wrong (byte offsets, formatting differences, lower/upper hex, separators), and then we get silent mis-matches or collisions.

Generic image problem: a pure udev rule based on GUID requires either:

hardcoding GUIDs (not possible in a reusable image), or
generating per-host rules at first boot (which puts you back into timing/race issues we saw earlier on GPU SKUs).
So: GUID is a solid stable identifier, but using it in a generic, no-first-boot, no-extra-reboot image to produce ib0/ib1 consistently across Azure SKUs still requires additional machinery and assumptions about ordering. That’s why net.ifnames=0 ended up being the simplest reliable “always ibX” lever.

I already tried these methods, but they ended up causing more issues.


@dgchinner
Collaborator

dgchinner commented Mar 18, 2026

Sorting GUIDs gives a deterministic order, but it’s not guaranteed to match the physical/port enumeration order you wanted (mlx5_0 -> ib0, mlx5_1 -> ib1) across different SKUs/firmware layouts.

I think there has been a bit of misunderstanding here, probably because I didn't explain it well. That is, the mlx5_X -> ibX mapping example I gave that you repeat above was just that: an example of what consistent mapping behaviour might look like. i.e. if we map mlx5_0 -> ib1, we should -always- map it this way. Using the last 8 bytes of the MAC address to create indexes is one way we can obtain a consistent persistent ibX mapping for each device.

I'm not trying to be pedantic or difficult here - my understanding of how kernel network device naming works tells me that using net.ifnames = 0 doesn't actually fix the persistent naming problem. It gives the -impression- that it is doing what we need, but it does not actually guarantee device naming is consistent or persistent.

That is, net.ifnames = 0 results in the kernel device naming being retained and exposed to userspace. Kernel device naming is dependent on device discovery order. This order can change from boot to boot because device discovery is asynchronous. Hence while we might always have ib0 and ib1 devices when net.ifnames = 0 is set, there is no guarantee that ib0 always points to the same IB HCA. They can - and do - swap around randomly on each boot.

Indeed, the inability of the kernel to reliably name devices is the primary reason that net.ifnames = 0 is deprecated in favour of using systemd/udev to rename devices in userspace after boot.

Another example: why do we have the persistent RDMA device renaming infrastructure for the IB HCAs? Why is that needed, and how does it work to create consistent persistent RDMA device names?

RDMA device renaming uses a sysfs device iteration method for generating consistent persistent mappings for the rdma device names. azure_persistent_rdma_naming.sh iterates the IB devices in sysfs, sorts them numerically, then renames the rdma device to mlx5_ibX where X is incremented by 1 for each device that is found.

There is no reason why we can't do exactly the same thing to iterate all the IB devices and rename the IPoIB interfaces appropriately. e.g. 'ls /sys/class/infiniband/<DEV>/device/net' should return the current network device name for the <DEV> IB device. We can then rename the network device using the 'ip link change dev <old> name <new>' command.

This is yet another method that we can use to achieve consistent persistent device naming for IPoIB device names, and this has the advantage of being consistent with the RDMA device name for the same IB HCA. Indeed, doing it this way also gets around all the udev naming startup mess because we already have a RDMA naming monitor service that periodically checks and corrects RDMA device naming issues. Using this infrastructure would also address the IPoIB startup naming race problems as well.
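The sysfs-iteration approach described above could be sketched roughly as follows. This is a hypothetical helper, not part of this PR or of azure_persistent_rdma_naming.sh; it mirrors the described sort-and-rename scheme:

```shell
# Hypothetical sketch: assign ib0, ib1, ... to IPoIB netdevs by iterating
# the IB devices in sysfs in sorted order, mirroring the RDMA renaming
# infrastructure discussed above.
rename_ipoib() {
    sys_root=${1:-/sys/class/infiniband}
    i=0
    for dev in $(ls "$sys_root" 2>/dev/null | sort); do
        # Each IB device exposes its current netdev name under device/net.
        cur=$(ls "$sys_root/$dev/device/net" 2>/dev/null | head -n 1)
        [ -n "$cur" ] || continue
        # Rename in place; 'ip link change' does not bounce the interface.
        [ "$cur" = "ib$i" ] || ip link change dev "$cur" name "ib$i"
        i=$((i + 1))
    done
}
# Usage on a real host (requires root): rename_ipoib
```

Because the iteration order comes from the sorted sysfs device names rather than discovery order, the ibX assignment stays consistent with the mlx5_ibX RDMA names across reboots.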

@ggoklani ggoklani force-pushed the fix_ib_device_rename branch 2 times, most recently from 06ae2e2 to 2d548af Compare March 19, 2026 08:09
@ggoklani
Collaborator Author


As discussed, we will go with net.ifnames=0 for now, and we will evaluate alternative approaches for a future release.

@ggoklani ggoklani requested review from lixuemin2016 and richm March 19, 2026 08:47
@ggoklani ggoklani force-pushed the fix_ib_device_rename branch 2 times, most recently from 5fb7173 to 769344b Compare March 19, 2026 09:42
@ggoklani ggoklani requested review from dgchinner and spetrosi March 19, 2026 09:43
Collaborator

@dgchinner dgchinner left a comment


looks ok to me

@ggoklani ggoklani force-pushed the fix_ib_device_rename branch 2 times, most recently from 04c0360 to aa1ce79 Compare March 19, 2026 11:18
@ggoklani ggoklani force-pushed the fix_ib_device_rename branch 2 times, most recently from 10173f3 to b0cb8c2 Compare March 19, 2026 16:21
@ggoklani ggoklani force-pushed the fix_ib_device_rename branch from 4af89e1 to 7ecce7a Compare March 19, 2026 16:28
@richm
Contributor

richm commented Mar 19, 2026

lgtm - @spetrosi ?

Add hpc_azure_disable_predictable_net_names to disable predictable interface naming on Azure by persisting net.ifnames=0
used fedora.linux_system_roles.bootloader role to modify kernel param
This prevents IPoIB interfaces from being renamed to ibP* and
keeps kernel-style names (e.g. ib0, ib1).

also for the test script updated the check to only match lines that start with FAIL, FAULT, or ERROR, which should stop WARNING lines
like “failure threshold” from triggering a failure while still catching real error lines.

Signed-off-by: Gaurav Goklani <ggoklani@redhat.com>
@ggoklani ggoklani force-pushed the fix_ib_device_rename branch from 8d571c1 to 8f881d3 Compare March 20, 2026 05:20
@ggoklani ggoklani merged commit 385a966 into linux-system-roles:main Mar 20, 2026
21 of 22 checks passed