fix: non-zero exit code in RDMA naming script#110
Closed
ggoklani wants to merge 1 commit intolinux-system-roles:mainfrom
Closed
fix: non-zero exit code in RDMA naming script#110ggoklani wants to merge 1 commit intolinux-system-roles:mainfrom
ggoklani wants to merge 1 commit intolinux-system-roles:mainfrom
Conversation
Problem The azure_persistent_rdma_naming.service consistently fails with status=1/FAILURE on startup. Root Cause: The script /usr/sbin/azure_persistent_rdma_naming.sh uses set -e (exit on error). In the final iteration of the for loop, the last command executed is an increment or an assignment that returns a non-zero status (specifically when the loop finishes or if a sub-command within the loop logic returns 1). Because this is the last line of the script, the entire process exits with 1, causing systemd to report a service failure even if the naming logic was successful. Solution Added an explicit exit 0 at the end of the script. This ensures that if the script reaches the end of its logic without encountering a genuine fatal error, it returns a success code to systemd. Validation Manual Execution: Running sudo bash -x /usr/sbin/azure_persistent_rdma_naming.sh now completes with an exit code of 0. Systemd Status: systemctl status azure_persistent_rdma_naming.service now reports active (exited) instead of failed. Functionality: Verified that mlx5_ib devices are still correctly identified and processed by the script. Signed-off-by: Gaurav Goklani <ggoklani@redhat.com>
Reviewer's guide (collapsed on small PRs)Reviewer's GuideEnsures the RDMA naming script always exits successfully when its logic completes without fatal errors by adding an explicit exit 0 at the end of the script, preventing spurious systemd service failures. Sequence diagram for systemd invocation of RDMA naming script with explicit exit 0sequenceDiagram
participant systemd
participant azure_persistent_rdma_naming_service
participant shell
participant azure_persistent_rdma_naming_sh
systemd->>azure_persistent_rdma_naming_service: start
azure_persistent_rdma_naming_service->>shell: exec /usr/sbin/azure_persistent_rdma_naming.sh
shell->>azure_persistent_rdma_naming_sh: run script (set -e)
loop for_each_infiniband_device
azure_persistent_rdma_naming_sh->>azure_persistent_rdma_naming_sh: detect device type
azure_persistent_rdma_naming_sh->>azure_persistent_rdma_naming_sh: apply naming logic
end
azure_persistent_rdma_naming_sh->>shell: exit 0 (explicit at end)
shell->>azure_persistent_rdma_naming_service: process exits with code 0
azure_persistent_rdma_naming_service->>systemd: notify success
systemd-->>azure_persistent_rdma_naming_service: state active_exited
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Hey - I've left some high level feedback:
- Unconditionally forcing
exit 0at the end may mask genuine failures detected byset -e; consider tracking a success/failure flag (or last significant command status) and exiting with that instead of always returning 0. - It might be cleaner to address the specific non-zero-returning operation inside the loop (e.g., by adjusting that increment/assignment or appending
|| true) rather than overriding the script’s final exit status globally.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Unconditionally forcing `exit 0` at the end may mask genuine failures detected by `set -e`; consider tracking a success/failure flag (or last significant command status) and exiting with that instead of always returning 0.
- It might be cleaner to address the specific non-zero-returning operation inside the loop (e.g., by adjusting that increment/assignment or appending `|| true`) rather than overriding the script’s final exit status globally.Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
Contributor
|
@ggoklani lgtm - is this still a draft? |
Collaborator
Author
Yes , this fix isn't working so currently I am testing more on this. |
Contributor
Does it work if you do not use |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
The azure_persistent_rdma_naming.service fails on some hpc sku with status=1/FAILURE on startup.
Root Cause:
The script /usr/sbin/azure_persistent_rdma_naming.sh uses set -e (exit on error). In the final iteration of the for loop, the last command executed is an increment or an assignment that returns a non-zero status (specifically when the loop finishes or if a sub-command within the loop logic returns 1). Because this is the last line of the script, the entire process exits with 1, causing systemd to report a service failure even if the naming logic was successful.
Solution
Added an explicit exit 0 at the end of the script. This ensures that if the script reaches the end of its logic without encountering a genuine fatal error, it returns a success code to systemd.
Validation
Manual Execution: Running sudo bash -x /usr/sbin/azure_persistent_rdma_naming.sh now completes with an exit code of 0.
Systemd Status: systemctl status azure_persistent_rdma_naming.service now reports active (exited) instead of failed.
Functionality: Verified that mlx5_ib devices are still correctly identified and processed by the script.
Summary by Sourcery
Bug Fixes: