-
Notifications
You must be signed in to change notification settings - Fork 745
Add env var DP_ENABLE_HEALTHCHECKS #1335
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: Robert Smith <[email protected]>
|
Thanks @robertdavidsmith. We have had some internal discussions around more flexibility here and I had the idea for #1340 created against an internal issue tracker. I think your proposal here aligns with this to some extent and as such your input there would be appreciated. |
internal/rm/health.go
Outdated
| // disabled entirely. If set, the envvar is treated as a comma-separated list of Xids to ignore. Note that | ||
| // this is in addition to the Application errors that are already ignored. | ||
| envDisableHealthChecks = "DP_DISABLE_HEALTHCHECKS" | ||
| envEnableHealthChecks = "DP_ENABLE_HEALTHCHECKS" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we rename this DP_FORCE_HEALTHCHECKS or CRITICAL_XIDS or FATAL_XIDS instead? (I would prefer the latter, but I could hear arguments for alternatives.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DP_DISABLE_HEALTHCHECKS is the exact opposite of DP_ENABLE_HEALTHCHECKS and IMO the naming should reflect this. So I'd favour renaming DP_ENABLE_HEALTHCHECKS to DP_FORCE_ENABLE_HEALTHCHECKS, that way you get the "force" but it's still very clear it's the exact opposite of DP_DISABLE_HEALTHCHECKS.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One question: What is the order of precedence of these? If DP_DISABLE_HEALTHCHECKS=all|xids does adding a specific XID to DP_FORCE_ENABLE_HEALTHCHECKS take precedence over this?
Then a comment on the naming. I don't think this envvar is the oposite of the existing one. While the existing one removes the predefined XIDs that we treat as fatal, this one adds ADDITIONAL ones.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On balance I'd make DP_ADDITIONAL_HEALTHCHECKS /DP_FORCE_ENABLE_HEALTHCHECKS support all, and have it override DP_DISABLE_HEALTHCHECKS. This matters more if we choose a name with FORCE in it, but is probably worth doing regardless of name. This would complicate the code a little, but would make the externally visible behavior to be what a reasonable user would expect.
internal/rm/health.go
Outdated
| return nil | ||
| } | ||
|
|
||
| enableHealthChecks := strings.ToLower(os.Getenv(envEnableHealthChecks)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's move this to wehre we actually use the value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair comment, though if we add support for "all" this all changes anyway.
internal/rm/health.go
Outdated
| skippedXids[additionalXid] = true | ||
| } | ||
|
|
||
| for _, additionalXid := range getAdditionalXids(enableHealthChecks) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it make sense to allow something like all to be specified?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's reasonable, I'm happy to do this
Noted, I've commented on #1340 Do you see that issue as a replacement for this PR (in which case I'll close it), or in addition to it (in which case I'll update the PR to incorporate your feedback)? |
I think the core logic of what we're trying to achieve here is covered by #1340, but this PR calls out an omission in that issue which is that the set of "Critical" / "Forced" XIDs also need to be settable by an environment variable and not only in a config file. With that in mind, I think we can continue with your PR and then all that remains for #1340 would be to define the actual config format and map these to the relevant envvars using CLI flags like we do for other options. Using our CLI handling to specify envvars also give us more flexibility to rename these envvars while falling back to the original. For the longest time, the |
|
Ok shall I make a PR to add
You can then do #1340 Thoughts? |
|
Just seen https://github.com/NVIDIA/k8s-device-plugin/pull/1360/files, that PR is largely complete and should be merged first. |
While #1360 does have some overlap with what you're doing here, I think that it could be seen as a follow up. I have created a branch https://github.com/elezar/k8s-device-plugin/tree/robertdavidsmith/main with a minor refactoring on top of your changes which should be compatible with what we want to do in 1360. What do you think? |
Thank you @elezar I've commented on your commit here, basically I think this makes a lot of sense, but see comments. |
|
Sorry for the delay here. If you think the changes proposed in the commit are reasonable, feel free to pull them in here so that we can continue reviewing your PR. It would be good for you to get the credit. |
Ok, I'm happy to do so |
|
In order to close this out I have created #1443. Please take a look. |
Signed-off-by: Robert Smith <[email protected]>
Make it possible to enable health checks for xids which are disabled by default via an env var
DP_ENABLE_HEALTHCHECKS.This is occasionally useful, e.g. we have seen examples where xid 45 is very correlated with genuine GPU issues.