@elezar elezar commented Oct 7, 2025

This change lets users explicitly mark specific XIDs as fatal errors that cause a device to be marked as unhealthy. Combined with the existing ability to ignore all XIDs, this lets users define exactly which XIDs are treated as fatal.

To do this, the change adds a DP_ENABLE_HEALTHCHECKS environment variable that mirrors the existing DP_DISABLE_HEALTHCHECKS envvar.

Explicitly enabling an XID takes precedence over disabling it, including when all XIDs are disabled.
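
The precedence rule can be illustrated with a minimal Go sketch. This is not the code from this PR: the `xidSet` type and the `parseXIDs` and `isFatal` helpers are hypothetical names, and the value format (a comma-separated list of XID numbers, with `all` matching every XID) is an assumption made for the example. The sketch also treats every XID as fatal unless it is disabled, which may differ from the plugin's actual defaults.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// xidSet is a hypothetical helper type for this sketch. It holds either
// an explicit set of XIDs or a wildcard that matches every XID.
type xidSet struct {
	all bool
	ids map[uint64]bool
}

// parseXIDs parses an assumed comma-separated value such as "all" or
// "48, 79". Entries that are not numbers are skipped.
func parseXIDs(value string) xidSet {
	set := xidSet{ids: make(map[uint64]bool)}
	for _, field := range strings.Split(value, ",") {
		field = strings.ToLower(strings.TrimSpace(field))
		if field == "" {
			continue
		}
		if field == "all" {
			set.all = true
			continue
		}
		id, err := strconv.ParseUint(field, 10, 64)
		if err != nil {
			continue
		}
		set.ids[id] = true
	}
	return set
}

// contains reports whether the set matches the given XID.
func (s xidSet) contains(xid uint64) bool {
	return s.all || s.ids[xid]
}

// isFatal applies the precedence rule: an explicitly enabled XID is
// always fatal; otherwise an XID is fatal only if it is not disabled.
func isFatal(xid uint64, enabled, disabled xidSet) bool {
	if enabled.contains(xid) {
		return true
	}
	return !disabled.contains(xid)
}

func main() {
	// Stand-ins for the values of DP_DISABLE_HEALTHCHECKS and
	// DP_ENABLE_HEALTHCHECKS (format assumed, see above).
	disabled := parseXIDs("all")
	enabled := parseXIDs("48, 79")

	for _, xid := range []uint64{13, 48, 79} {
		fmt.Printf("XID %d fatal: %v\n", xid, isFatal(xid, enabled, disabled))
	}
}
```

With these inputs, XIDs 48 and 79 remain fatal even though all XIDs are disabled, while XID 13 is ignored.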

This builds on the change from #1335.

Signed-off-by: Robert Smith <[email protected]>
Co-Authored-by: Evan Lezar <[email protected]>
@elezar elezar force-pushed the robertdavidsmith/main branch from 268ec6e to 0900fac Compare October 7, 2025 12:05
@elezar elezar merged commit 7bcab30 into NVIDIA:main Oct 9, 2025
9 checks passed
@elezar elezar deleted the robertdavidsmith/main branch October 9, 2025 12:01
3 participants