[nvidia] Capture more nvidia commands #3777

Merged
merged 1 commit into from
Oct 8, 2024

jcastill
Member

Capture commands related to nvidia container toolkit.

Related: RHEL-58172


Please place an 'X' inside each '[]' to confirm you adhere to our Contributor Guidelines

  • Is the commit message split over multiple lines and hard-wrapped at 72 characters?
  • Is the subject and message clear and concise?
  • Does the subject start with [plugin_name] if submitting a plugin patch or a [section_name] if part of the core sosreport code?
  • Does the commit contain a Signed-off-by: First Lastname [email protected]?
  • Are any related Issues or existing PRs properly referenced via a Closes (Issue) or Resolved (PR) line?
  • Are all passwords or private data gathered by this PR obfuscated?

Congratulations! One of the builds has completed. 🍾

You can install the built RPMs by following these steps:

  • sudo yum install -y dnf-plugins-core on RHEL 8
  • sudo dnf install -y dnf-plugins-core on Fedora
  • dnf copr enable packit/sosreport-sos-3777
  • And now you can install the packages.

Please note that the RPMs should be used only in a testing environment.

Contributor

@pmoravec pmoravec left a comment

Sounds good to me. I restarted the failed tests that had timed out.

self.add_service_status("nvidia-persistenced")
self.add_service_status("nvidia-fabricmanager")
self.add_service_status("nvidia-toolkit-firstboot")
Member

Curious: would it be good to add the journal for all of these too, for instance in case one only wants to enable the nvidia plugin and nothing else?

Member Author

Good idea. I'll add the journals as well.

Member

If these are added to the services tuple, it can be used for plugin enablement as well as automatically getting the journal and service status.
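
A minimal sketch of that idea, using a simplified stand-in base class rather than the real sos `Plugin` class. `MiniPlugin`, its `installed_services` argument, and the recorded tuples are hypothetical and only illustrate how a class-level `services` tuple could drive enablement, service status, and journal collection together. The method names `add_service_status` and `add_journal` mirror the calls quoted in this review.

```python
class MiniPlugin:
    """Simplified stand-in for a plugin base class (not the real sos code)."""

    # Subclasses list the services they care about here.
    services = ()

    def __init__(self, installed_services):
        # Hypothetical input: names of services present on the host.
        self.installed_services = set(installed_services)
        self.collections = []

    def is_service(self, name):
        return name in self.installed_services

    def check_enabled(self):
        # The plugin is enabled if any listed service exists on the host.
        return any(self.is_service(s) for s in self.services)

    def add_service_status(self, service):
        self.collections.append(("status", service))

    def add_journal(self, units=None, boot=None, identifier=None):
        self.collections.append(("journal", units, boot, identifier))

    def add_default_collections(self):
        # For every listed service that exists, record status and journal.
        for service in self.services:
            if self.is_service(service):
                self.add_service_status(service)
                self.add_journal(units=service)


class NvidiaSketch(MiniPlugin):
    services = (
        "nvidia-persistenced",
        "nvidia-fabricmanager",
        "nvidia-toolkit-firstboot",
    )


plugin = NvidiaSketch(installed_services=["nvidia-persistenced"])
if plugin.check_enabled():
    plugin.add_default_collections()
for item in plugin.collections:
    print(item)
```

In this sketch only the service that exists on the host gets a status and journal entry, so no per-service `add_service_status()` calls are needed in `setup()`.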

Member Author

Good idea. Shall I add the three services to the plugin enablement? We currently capture the journal for one of them a bit differently:

self.add_journal(boot=0, identifier='nvidia-persistenced')

But a quick test I did in a RHEL AI image showed that the nvidia-persistenced service is captured correctly. My only doubt is the 'boot=0' option.

Member

@arif-ali arif-ali left a comment

lgtm

@@ -44,4 +53,5 @@ def setup(self):
)
self.add_journal(boot=0, identifier='nvidia-persistenced')


Member

nit, do we need the extra line here? not overly precious about it though

Member Author

Good point, I think we can remove it.
My only doubt then is what to do with the add_journal() call before it. The service is now in the services tuple, but the boot=0 argument makes me unsure whether it's safe to remove the call. Any thoughts on this?

Member

@TurboTurtle TurboTurtle left a comment

As it stands, this will result in duplicate journal collections for nvidia-persistenced: once for the whole journal, and then again for only the current boot. Since that service is now in the services tuple, we automatically get the journal.
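
To illustrate the overlap, here is a rough sketch that builds two `journalctl` command lines: one selecting the unit's journal across all boots, and one restricted to the current boot by syslog identifier. `journal_cmd` is a hypothetical helper, not sos code, and the exact commands sos generates may differ. `--no-pager`, `--unit`, `--boot`, and `--identifier` are standard `journalctl` options.

```python
def journal_cmd(unit=None, boot=None, identifier=None):
    """Hypothetical helper: build a journalctl command line as a list."""
    cmd = ["journalctl", "--no-pager"]
    if unit is not None:
        cmd.append(f"--unit={unit}")
    if boot is not None:
        # Boot offset 0 selects the current boot.
        cmd.append(f"--boot={boot}")
    if identifier is not None:
        cmd.append(f"--identifier={identifier}")
    return cmd


# Assumed equivalent of what listing the service in the services tuple yields.
from_services = journal_cmd(unit="nvidia-persistenced")
# Assumed equivalent of the explicit add_journal(boot=0, identifier=...) call.
explicit = journal_cmd(boot=0, identifier="nvidia-persistenced")

print(" ".join(from_services))
print(" ".join(explicit))
```

Under these assumptions the second command is limited to the current boot, so most of its output would already be present in the first.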

Member Author

jcastill commented Oct 3, 2024

Thank you for clarifying, Jake. I'll remove the extra capture and push again.

Capture commands related to nvidia container toolkit.

Related: RHEL-58172

Signed-off-by: Jose Castillo <[email protected]>
@TurboTurtle TurboTurtle merged commit 757d2b3 into sosreport:main Oct 8, 2024
36 checks passed
4 participants