fix: non-zero exit code in RDMA naming script #110

Closed
ggoklani wants to merge 1 commit into linux-system-roles:main from ggoklani:fix_rdma_naming_script_new

Conversation

@ggoklani
Collaborator

@ggoklani ggoklani commented Mar 24, 2026

Problem
The azure_persistent_rdma_naming.service fails on some HPC SKUs with status=1/FAILURE on startup.

Root Cause:
The script /usr/sbin/azure_persistent_rdma_naming.sh uses set -e (exit on error). In the final iteration of the for loop, the last command executed (an increment, an assignment, or a sub-command within the loop logic) returns a non-zero status. Because the loop is the last statement in the script, that status becomes the script's exit code: the process exits with 1 and systemd reports a service failure even if the naming logic was successful.
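
One way an increment can return a non-zero status is Bash's arithmetic command, which returns 1 whenever the expression evaluates to 0. The sketch below is only an illustration of that behavior under set -e. The counter name `count` is hypothetical, and nothing here confirms that the real script uses this construct.

```shell
#!/usr/bin/env bash
# Hypothetical reproduction; `count` is an illustrative name and is not
# taken from azure_persistent_rdma_naming.sh.

# Post-increment from 0: the expression evaluates to 0, so (( )) returns 1
# and set -e aborts the child shell before the echo runs.
post_status=0
bash -c 'set -e; count=0; (( count++ )); echo "not reached"' || post_status=$?

# Pre-increment from 0: the expression evaluates to 1, so (( )) returns 0
# and the child shell runs to completion.
pre_status=0
bash -c 'set -e; count=0; (( ++count )); echo "reached"' || pre_status=$?

echo "post-increment exit status: ${post_status}"
echo "pre-increment exit status: ${pre_status}"
```

If the real script does contain such a post-increment starting from 0, it would abort on the first iteration rather than the last, so this pattern is worth ruling out early when debugging.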

Solution
Added an explicit exit 0 at the end of the script. This ensures that if the script reaches the end of its logic without encountering a genuine fatal error, it returns a success code to systemd.
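
Whether a trailing exit 0 helps depends on how the non-zero status arises. The following sketch, using hypothetical stand-ins for the script body, contrasts two cases: a failure that set -e ignores (the left side of an && list), where exit 0 changes the result, and a plain failing command, where set -e aborts before exit 0 is ever reached.

```shell
#!/usr/bin/env bash
# Hypothetical stand-ins for the script body; none of this is the real template.

# Case A: the failing test is the left side of an && list, so set -e ignores
# it, but the loop (and hence the script) still ends with status 1.
ignored_failure='set -e; for dev in a b; do [ "$dev" = a ] && echo "match $dev"; done'

# Case B: a plain failing command; set -e aborts the script on the spot.
fatal_failure='set -e; for dev in a b; do false; done'

# Run a script body in a child bash and print its exit status.
run() { local rc=0; bash -c "$1" >/dev/null || rc=$?; echo "$rc"; }

a_without=$(run "$ignored_failure")
a_with=$(run "$ignored_failure; exit 0")
b_with=$(run "$fatal_failure; exit 0")

echo "case A without exit 0: $a_without"
echo "case A with exit 0:    $a_with"
echo "case B with exit 0:    $b_with"
```

In case B the child shell still exits with 1, because the explicit exit 0 only takes effect when execution actually reaches the last line.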

Validation
Manual Execution: Running sudo bash -x /usr/sbin/azure_persistent_rdma_naming.sh now completes with an exit code of 0.

Systemd Status: systemctl status azure_persistent_rdma_naming.service now reports active (exited) instead of failed.

Functionality: Verified that mlx5_ib devices are still correctly identified and processed by the script.

Summary by Sourcery

Bug Fixes:

  • Prevent azure_persistent_rdma_naming.service from reporting a failure due to a non-zero exit code at the end of the RDMA naming script.


Signed-off-by: Gaurav Goklani <ggoklani@redhat.com>
@ggoklani ggoklani requested review from richm and spetrosi as code owners March 24, 2026 09:58
@sourcery-ai

sourcery-ai bot commented Mar 24, 2026

Reviewer's Guide

Ensures the RDMA naming script always exits successfully when its logic completes without fatal errors by adding an explicit exit 0 at the end of the script, preventing spurious systemd service failures.

Sequence diagram for systemd invocation of RDMA naming script with explicit exit 0

sequenceDiagram
    participant systemd
    participant azure_persistent_rdma_naming_service
    participant shell
    participant azure_persistent_rdma_naming_sh

    systemd->>azure_persistent_rdma_naming_service: start
    azure_persistent_rdma_naming_service->>shell: exec /usr/sbin/azure_persistent_rdma_naming.sh
    shell->>azure_persistent_rdma_naming_sh: run script (set -e)
    loop for_each_infiniband_device
        azure_persistent_rdma_naming_sh->>azure_persistent_rdma_naming_sh: detect device type
        azure_persistent_rdma_naming_sh->>azure_persistent_rdma_naming_sh: apply naming logic
    end
    azure_persistent_rdma_naming_sh->>shell: exit 0 (explicit at end)
    shell->>azure_persistent_rdma_naming_service: process exits with code 0
    azure_persistent_rdma_naming_service->>systemd: notify success
    systemd-->>azure_persistent_rdma_naming_service: state active_exited

File-Level Changes

Change: Guarantee a zero exit status for the RDMA naming script when it completes normally to avoid systemd reporting failures.
Details:
  • Append an explicit exit 0 at the end of the RDMA naming shell script template so that a successful run returns status code 0 even with set -e enabled
  • Rely on the existing set -e behavior to still abort early and propagate non-zero exit codes when genuine errors occur within the loop or earlier in the script
Files: templates/rdma/azure_persistent_rdma_naming.sh.j2


@sourcery-ai sourcery-ai bot left a comment

Hey - I've left some high level feedback:

  • Unconditionally forcing exit 0 at the end may mask genuine failures detected by set -e; consider tracking a success/failure flag (or last significant command status) and exiting with that instead of always returning 0.
  • It might be cleaner to address the specific non-zero-returning operation inside the loop (e.g., by adjusting that increment/assignment or appending || true) rather than overriding the script’s final exit status globally.
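
The two suggestions above could be combined roughly as follows. This is a minimal sketch, assuming a loop over device names. The names `devices`, `count`, `failed`, and `final_status` are hypothetical, as are the `mlx5_ib0` and `mlx5_ib1` device names, and none of them are taken from the actual template.

```shell
#!/usr/bin/env bash
set -e
# Hypothetical loop; all variable and device names are illustrative.
devices=(mlx5_ib0 mlx5_ib1)
count=0
failed=0

for dev in "${devices[@]}"; do
    # An assignment with arithmetic expansion returns 0 here, unlike
    # (( count++ )), which returns 1 when count is 0.
    count=$((count + 1))
    # Record a genuine failure without aborting, so the final status reflects it.
    [ -n "$dev" ] || failed=1
done

final_status=$failed
echo "processed ${count} devices, final status ${final_status}"
```

A real script would presumably end with `exit "$final_status"`, so that systemd sees a failure only when one was actually recorded.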

@ggoklani ggoklani marked this pull request as draft March 24, 2026 12:59
@richm
Contributor

richm commented Mar 24, 2026

@ggoklani lgtm - is this still a draft?

@ggoklani
Collaborator Author

@ggoklani lgtm - is this still a draft?

Yes, this fix isn't working, so I am currently testing it further.

@richm
Contributor

richm commented Mar 24, 2026

@ggoklani lgtm - is this still a draft?

Yes, this fix isn't working, so I am currently testing it further.

Does it work if you do not use set -e? What is the output?
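
One way to gather the output asked for here, while keeping set -e in place, is an ERR trap that reports the failing command before the shell exits. The sketch below is an assumption about how that could be done, not something tried in this PR. The `false` command is a hypothetical stand-in for whichever command actually fails in the real script.

```shell
#!/usr/bin/env bash
# Hypothetical script body; `false` stands in for the real failing command.
script=$(cat <<'EOF'
set -e
trap 'echo "line ${LINENO}: ${BASH_COMMAND} exited with status $?" >&2' ERR
true
false
echo "not reached"
EOF
)

# Run the body in a child bash and capture stdout and stderr together.
report=$(bash -c "$script" 2>&1) || true
echo "$report"
```

The captured report names the command that triggered the exit, which narrows down where a localized fix would have to go.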

@ggoklani ggoklani closed this Mar 25, 2026