Mark device as healthy if checkHealth function does not receive unhealthy event #1211
base: main
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -36,7 +36,7 @@ const ( | |
| ) | ||
|
|
||
| // CheckHealth performs health checks on a set of devices, writing to the 'unhealthy' channel with any unhealthy devices | ||
| - func (r *nvmlResourceManager) checkHealth(stop <-chan interface{}, devices Devices, unhealthy chan<- *Device) error { | ||
| + func (r *nvmlResourceManager) checkHealth(stop <-chan interface{}, devices Devices, healthy chan<- *Device, unhealthy chan<- *Device) error { | ||
| disableHealthChecks := strings.ToLower(os.Getenv(envDisableHealthChecks)) | ||
| if disableHealthChecks == "all" { | ||
| disableHealthChecks = allHealthChecks | ||
|
|
@@ -147,16 +147,6 @@ func (r *nvmlResourceManager) checkHealth(stop <-chan interface{}, devices Devic | |
| continue | ||
| } | ||
|
|
||
| - if e.EventType != nvml.EventTypeXidCriticalError { | ||
| - klog.Infof("Skipping non-nvmlEventTypeXidCriticalError event: %+v", e) | ||
| - continue | ||
| - } | ||
|
|
||
| - if skippedXids[e.EventData] { | ||
| - klog.Infof("Skipping event %+v", e) | ||
| - continue | ||
| - } | ||
|
|
||
| klog.Infof("Processing event %+v", e) | ||
| eventUUID, ret := e.Device.GetUUID() | ||
| if ret != nvml.SUCCESS { | ||
|
|
@@ -174,6 +164,18 @@ func (r *nvmlResourceManager) checkHealth(stop <-chan interface{}, devices Devic | |
| continue | ||
| } | ||
|
|
||
| + if e.EventType != nvml.EventTypeXidCriticalError { | ||
| + klog.Infof("Skipping non-nvmlEventTypeXidCriticalError event: %+v", e) | ||
| + healthy <- d | ||
|
Review comment: Are Xid events mutually exclusive? Why are we treating non-critical (or skipped) events as indicators of device health? |
||
| + continue | ||
| + } | ||
|
|
||
| + if skippedXids[e.EventData] { | ||
|
Review comment: Suppose the device was marked unhealthy because of event A, and event B is then detected (that is, it satisfies `e.EventType != nvml.EventTypeXidCriticalError`), which restores the device to healthy. Could this erroneously restore the device to health? I think a more reasonable approach would be to mark the device healthy only once it has recovered from the same error state that made it unhealthy. |
||
| + klog.Infof("Skipping event %+v", e) | ||
| + healthy <- d | ||
| + continue | ||
| + } | ||
|
|
||
| if d.IsMigDevice() && e.GpuInstanceId != 0xFFFFFFFF && e.ComputeInstanceId != 0xFFFFFFFF { | ||
| gi := deviceIDToGiMap[d.ID] | ||
| ci := deviceIDToCiMap[d.ID] | ||
|
|
||
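To make the concern raised on the `skippedXids` check concrete, here is a minimal sketch, assuming a hypothetical `healthTracker` that remembers which Xid marked each device unhealthy so that an unrelated event cannot clear that state. None of these names exist in the plugin, the Xid values are arbitrary examples, and how recovery from a given Xid would actually be detected is left open.

```go
package main

import "fmt"

// healthTracker is a hypothetical helper (not part of the plugin). It records
// which Xid caused each device to be marked unhealthy.
type healthTracker struct {
	cause map[string]uint64 // device ID -> Xid that marked it unhealthy
}

func newHealthTracker() *healthTracker {
	return &healthTracker{cause: make(map[string]uint64)}
}

// markUnhealthy records the Xid that made the device unhealthy.
// The first recorded cause is kept if the device is already unhealthy.
func (t *healthTracker) markUnhealthy(deviceID string, xid uint64) {
	if _, exists := t.cause[deviceID]; !exists {
		t.cause[deviceID] = xid
	}
}

// tryRecover clears the unhealthy state only when the recovery is reported
// for the same Xid that caused it. It returns true if the device was restored.
func (t *healthTracker) tryRecover(deviceID string, xid uint64) bool {
	recorded, exists := t.cause[deviceID]
	if !exists || recorded != xid {
		return false
	}
	delete(t.cause, deviceID)
	return true
}

func main() {
	tracker := newHealthTracker()
	tracker.markUnhealthy("GPU-0", 79)

	// An unrelated event must not restore the device.
	fmt.Println(tracker.tryRecover("GPU-0", 13)) // false
	// Recovery from the recorded cause restores it.
	fmt.Println(tracker.tryRecover("GPU-0", 79)) // true
}
```

With this shape, a non-critical or skipped event on its own would never send a device back to healthy; only a recovery tied to the recorded cause would.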
Review comment: Instead of introducing a new channel, is there any benefit in having a single channel that accepts a device and the desired status?
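A rough sketch of what the single-channel variant might look like, assuming a hypothetical `healthUpdate` message type. The `Device` struct and the `"Healthy"`/`"Unhealthy"` strings below are simplified stand-ins for the plugin's own types and constants, not the real definitions.

```go
package main

import "fmt"

// Device is a minimal stand-in for the plugin's Device type.
type Device struct {
	ID     string
	Health string
}

// healthUpdate is a hypothetical message pairing a device with its desired status.
type healthUpdate struct {
	device  *Device
	healthy bool
}

// applyUpdates drains the channel until it is closed and sets each device's
// Health field according to the requested status.
func applyUpdates(updates <-chan healthUpdate) {
	for u := range updates {
		if u.healthy {
			u.device.Health = "Healthy"
		} else {
			u.device.Health = "Unhealthy"
		}
	}
}

func main() {
	d := &Device{ID: "GPU-0", Health: "Healthy"}

	updates := make(chan healthUpdate, 2)
	updates <- healthUpdate{device: d, healthy: false}
	updates <- healthUpdate{device: d, healthy: true}
	close(updates)

	applyUpdates(updates)
	fmt.Println(d.ID, d.Health) // GPU-0 Healthy
}
```

One channel keeps the updates for a device in the order they were sent, whereas two separate channels leave the relative order of a healthy and an unhealthy notification up to whichever one the receiver happens to read first.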
Review comment: We could even mark the device as healthy or unhealthy at the point where we send the device on the channel and then keep the health channel as is.
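As an illustration of that last suggestion, a minimal sketch assuming a hypothetical `sendWithHealth` helper: the sender sets the status on the device and then sends it on the one existing channel, so no second channel is needed. The `Device` struct and status strings are simplified stand-ins, not the plugin's definitions.

```go
package main

import "fmt"

// Device is a minimal stand-in for the plugin's Device type.
type Device struct {
	ID     string
	Health string
}

// sendWithHealth is a hypothetical helper: it records the new status on the
// device first, then notifies the receiver over the single health channel.
func sendWithHealth(health chan<- *Device, d *Device, status string) {
	d.Health = status
	health <- d
}

func main() {
	health := make(chan *Device, 1)
	d := &Device{ID: "GPU-0", Health: "Healthy"}

	sendWithHealth(health, d, "Unhealthy")

	got := <-health
	fmt.Println(got.ID, got.Health) // GPU-0 Unhealthy
}
```

The receiver then only needs to republish the device list. One caveat with this approach is that the device is mutated on the sender's goroutine, so any other goroutine reading `Health` concurrently would need synchronization.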