Skip to content

Conversation

@robertdavidsmith
Copy link
Contributor

@robertdavidsmith robertdavidsmith commented Jul 28, 2025

Make it possible to enable health checks for xids which are disabled by default via an env var DP_ENABLE_HEALTHCHECKS.

This is occasionally useful, e.g. we have seen examples where xid 45 is very correlated with genuine GPU issues.

@copy-pr-bot
Copy link

copy-pr-bot bot commented Jul 28, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@elezar
Copy link
Member

elezar commented Aug 1, 2025

Thanks @robertdavidsmith. We have had some internal discussions around more flexibility here and I had the idea for #1340 created against an internal issue tracker. I think your proposal here aligns with this to some extent and as such your input there would be appreciated.

// disabled entirely. If set, the envvar is treated as a comma-separated list of Xids to ignore. Note that
// this is in addition to the Application errors that are already ignored.
envDisableHealthChecks = "DP_DISABLE_HEALTHCHECKS"
envEnableHealthChecks = "DP_ENABLE_HEALTHCHECKS"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we rename this DP_FORCE_HEALTHCHECKS or CRITICAL_XIDS or FATAL_XIDS instead? (I would prefer the latter, but I could hear arguments for alternatives.

Copy link
Contributor Author

@robertdavidsmith robertdavidsmith Aug 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DP_DISABLE_HEALTHCHECKS is the exact opposite of DP_ENABLE_HEALTHCHECKS and IMO the naming should reflect this. So I'd favour renaming DP_ENABLE_HEALTHCHECKS to DP_FORCE_ENABLE_HEALTHCHECKS, that way you get the "force" but it's still very clear it's the exact opposite of DP_DISABLE_HEALTHCHECKS.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One question: What is the order of precedence of these? If DP_DISABLE_HEALTHCHECKS=all|xids does adding a specific XID to DP_FORCE_ENABLE_HEALTHCHECKS take precedence over this?

Then a comment on the naming. I don't think this envvar is the oposite of the existing one. While the existing one removes the predefined XIDs that we treat as fatal, this one adds ADDITIONAL ones.

Copy link
Contributor Author

@robertdavidsmith robertdavidsmith Aug 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On balance I'd make DP_ADDITIONAL_HEALTHCHECKS /DP_FORCE_ENABLE_HEALTHCHECKS support all, and have it override DP_DISABLE_HEALTHCHECKS. This matters more if we choose a name with FORCE in it, but is probably worth doing regardless of name. This would complicate the code a little, but would make the externally visible behavior to be what a reasonable user would expect.

return nil
}

enableHealthChecks := strings.ToLower(os.Getenv(envEnableHealthChecks))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's move this to wehre we actually use the value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair comment, though if we add support for "all" this all changes anyway.

skippedXids[additionalXid] = true
}

for _, additionalXid := range getAdditionalXids(enableHealthChecks) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to allow something like all to be specified?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's reasonable, I'm happy to do this

@robertdavidsmith
Copy link
Contributor Author

Thanks @robertdavidsmith. We have had some internal discussions around more flexibility here and I had the idea for #1340 created against an internal issue tracker. I think your proposal here aligns with this to some extent and as such your input there would be appreciated.

Noted, I've commented on #1340

Do you see that issue as a replacement for this PR (in which case I'll close it), or in addition to it (in which case I'll update the PR to incorporate your feedback)?

@elezar
Copy link
Member

elezar commented Aug 13, 2025

Thanks @robertdavidsmith. We have had some internal discussions around more flexibility here and I had the idea for #1340 created against an internal issue tracker. I think your proposal here aligns with this to some extent and as such your input there would be appreciated.

Noted, I've commented on #1340

Do you see that issue as a replacement for this PR (in which case I'll close it), or in addition to it (in which case I'll update the PR to incorporate your feedback)?

I think the core logic of what we're trying to achieve here is covered by #1340, but this PR calls out an omission in that issue which is that the set of "Critical" / "Forced" XIDs also need to be settable by an environment variable and not only in a config file. With that in mind, I think we can continue with your PR and then all that remains for #1340 would be to define the actual config format and map these to the relevant envvars using CLI flags like we do for other options.

Using our CLI handling to specify envvars also give us more flexibility to rename these envvars while falling back to the original. For the longest time, the DP_ENABLE_HEALTHCHECKS envvar has been considered an internal-only / undocumented features and if we pull this up to a more public configuration option (with support in helm etc.) a clearer name would be useful.

@robertdavidsmith
Copy link
Contributor Author

Ok shall I make a PR to add DP_ADDITIONAL_HEALTHCHECKS that

  • Will support DP_ADDITIONAL_HEALTHCHECKS=all
  • DP_ADDITIONAL_HEALTHCHECKS always beats DP_DISABLE_HEALTHCHECKS
  • Given increased complexity, I'll separate the enable/disable logic from the rest of checkHealth() so it can be properly unit-tested

You can then do #1340

Thoughts?

@robertdavidsmith
Copy link
Contributor Author

robertdavidsmith commented Aug 15, 2025

Just seen https://github.com/NVIDIA/k8s-device-plugin/pull/1360/files, that PR is largely complete and should be merged first.

@elezar
Copy link
Member

elezar commented Aug 28, 2025

Just seen https://github.com/NVIDIA/k8s-device-plugin/pull/1360/files, that PR is largely complete and should be merged first.

While #1360 does have some overlap with what you're doing here, I think that it could be seen as a follow up. I have created a branch https://github.com/elezar/k8s-device-plugin/tree/robertdavidsmith/main with a minor refactoring on top of your changes which should be compatible with what we want to do in 1360. What do you think?

@robertdavidsmith
Copy link
Contributor Author

robertdavidsmith commented Aug 29, 2025

Just seen https://github.com/NVIDIA/k8s-device-plugin/pull/1360/files, that PR is largely complete and should be merged first.

While #1360 does have some overlap with what you're doing here, I think that it could be seen as a follow up. I have created a branch https://github.com/elezar/k8s-device-plugin/tree/robertdavidsmith/main with a minor refactoring on top of your changes which should be compatible with what we want to do in 1360. What do you think?

Thank you @elezar

I've commented on your commit here, basically I think this makes a lot of sense, but see comments.

@robertdavidsmith
Copy link
Contributor Author

Hi @elezar,

How do you want to proceed from here?

Shall I copy your commit here onto my branch, make the changes I suggested (see comments on that commit), and then send you this PR for final review?

Thanks,

Rob

@elezar
Copy link
Member

elezar commented Sep 16, 2025

Sorry for the delay here. If you think the changes proposed in the commit are reasonable, feel free to pull them in here so that we can continue reviewing your PR. It would be good for you to get the credit.

@robertdavidsmith
Copy link
Contributor Author

Sorry for the delay here. If you think the changes proposed in the commit are reasonable, feel free to pull them in here so that we can continue reviewing your PR. It would be good for you to get the credit.

Ok, I'm happy to do so

@elezar
Copy link
Member

elezar commented Oct 7, 2025

In order to close this out I have created #1443. Please take a look.

Signed-off-by: Robert Smith <[email protected]>
@robertdavidsmith robertdavidsmith marked this pull request as draft October 7, 2025 15:36
@robertdavidsmith
Copy link
Contributor Author

@elezar PR #1443 LGTM, thank you. I will close this PR

@elezar
Copy link
Member

elezar commented Oct 9, 2025

@elezar PR #1443 LGTM, thank you. I will close this PR

Thanks again for your work here!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants