Enhanced Error-handling config
Current State
See https://docs.nvidia.com/deploy/xid-errors/index.html#xid-error-listing
The NVIDIA GPU Device Plugin
We register for NVML Events of type nvml.EventTypeXidCriticalError | nvml.EventTypeDoubleBitEccError | nvml.EventTypeSingleBitEccError
We treat the following XIDs as non-fatal errors:
| XID | Description |
|---|---|
| 13 | Graphics Engine Exception |
| 31 | GPU memory page fault |
| 43 | GPU stopped processing |
| 45 | Preemptive cleanup, due to previous errors |
| 68 | Video processor exception |
| 109 | Context Switch Timeout Error |
We allow additional Xids to be specified in the DP_DISABLE_HEALTHCHECKS envvar with the following logic:
- If the value is `xids` or `all`, we disable healthchecks entirely.
- A comma-separated list of numeric XIDs to ignore, e.g. `109,68`.
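
The envvar logic described above could be sketched roughly as follows. This is a minimal illustration, not the plugin's actual code: `parseDisableHealthChecks` and `defaultIgnoredXIDs` are hypothetical names, and the case-insensitive matching and the skipping of non-numeric entries are assumptions.

```go
package main

import (
	"strconv"
	"strings"
)

// defaultIgnoredXIDs mirrors the non-fatal XIDs listed in the table above.
var defaultIgnoredXIDs = []uint64{13, 31, 43, 45, 68, 109}

// parseDisableHealthChecks (hypothetical helper) interprets the value of
// DP_DISABLE_HEALTHCHECKS. It returns disabled=true when health checks are
// switched off entirely; otherwise it returns the set of XIDs to ignore,
// i.e. the defaults plus any numeric XIDs listed in the value.
func parseDisableHealthChecks(value string) (disabled bool, ignored map[uint64]bool) {
	// Assumption: the keywords are matched case-insensitively.
	v := strings.ToLower(strings.TrimSpace(value))
	if v == "all" || v == "xids" {
		return true, nil
	}

	ignored = make(map[uint64]bool)
	for _, xid := range defaultIgnoredXIDs {
		ignored[xid] = true
	}
	for _, field := range strings.Split(v, ",") {
		field = strings.TrimSpace(field)
		if field == "" {
			continue
		}
		xid, err := strconv.ParseUint(field, 10, 64)
		if err != nil {
			// Assumption: entries that are not numeric are skipped.
			continue
		}
		ignored[xid] = true
	}
	return false, ignored
}
```

With this sketch, a value of `all` disables the checks, while `48, 79` keeps them enabled and ignores XIDs 48 and 79 in addition to the defaults.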
The GKE Device Plugin
By default the following error is checked:
| XID | Description |
|---|---|
| 48 | Double-bit ECC Error |
The XID_CONFIG envvar is used to specify a comma-separated list of additional XIDs to treat as critical.
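
For comparison, the GKE behaviour described above might look something like the sketch below. `criticalXIDsFromConfig` is a hypothetical helper, not the GKE plugin's real function, and rejecting malformed entries with an error is an assumption.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// criticalXIDsFromConfig (hypothetical helper) starts from the default
// critical XID 48 and adds every numeric XID found in the XID_CONFIG value.
func criticalXIDsFromConfig(value string) (map[uint64]bool, error) {
	critical := map[uint64]bool{48: true}
	for _, field := range strings.Split(value, ",") {
		field = strings.TrimSpace(field)
		if field == "" {
			continue
		}
		xid, err := strconv.ParseUint(field, 10, 64)
		if err != nil {
			// Assumption: a malformed entry is reported rather than skipped.
			return nil, fmt.Errorf("invalid XID %q in XID_CONFIG: %w", field, err)
		}
		critical[xid] = true
	}
	return critical, nil
}
```

Note the inverted semantics relative to DP_DISABLE_HEALTHCHECKS: here the list is additive towards "critical", whereas the NVIDIA plugin's list is additive towards "ignored".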
Proposal
Add the following config section:

    version: v1
    health:
      disabled: false
      eventTypes: [EventTypeXidCriticalError, EventTypeDoubleBitEccError, EventTypeSingleBitEccError]
      ignoredXIDs: [13, 31, 43, 45, 68]
      criticalXIDs: all

GKE defaults:

    version: v1
    health:
      disabled: false
      eventTypes: [EventTypeXidCriticalError]
      ignoredXIDs: []
      criticalXIDs: [48]
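
To make the intended semantics of the proposed `health` section concrete, here is a rough Go sketch of how the two configurations could be modelled and evaluated. Everything in it is hypothetical: the `HealthConfig` type, the `AllCritical` field standing in for `criticalXIDs: all`, and in particular the assumption that `ignoredXIDs` takes precedence over `criticalXIDs`, which the proposal does not state.

```go
package main

// HealthConfig is a hypothetical in-memory form of the proposed `health` section.
type HealthConfig struct {
	Disabled     bool
	EventTypes   []string
	IgnoredXIDs  []uint64
	AllCritical  bool     // models `criticalXIDs: all`
	CriticalXIDs []uint64 // explicit list, consulted when AllCritical is false
}

// contains reports whether xid is present in xids.
func contains(xids []uint64, xid uint64) bool {
	for _, x := range xids {
		if x == xid {
			return true
		}
	}
	return false
}

// IsCritical reports whether an XID should mark the device unhealthy.
// Assumption: ignored XIDs win over critical ones, and a disabled
// health check never reports anything as critical.
func (c HealthConfig) IsCritical(xid uint64) bool {
	if c.Disabled || contains(c.IgnoredXIDs, xid) {
		return false
	}
	return c.AllCritical || contains(c.CriticalXIDs, xid)
}

// firstProposedConfig mirrors the first config section in the proposal.
var firstProposedConfig = HealthConfig{
	EventTypes:  []string{"EventTypeXidCriticalError", "EventTypeDoubleBitEccError", "EventTypeSingleBitEccError"},
	IgnoredXIDs: []uint64{13, 31, 43, 45, 68},
	AllCritical: true,
}

// gkeDefaults mirrors the "GKE defaults" section in the proposal.
var gkeDefaults = HealthConfig{
	EventTypes:   []string{"EventTypeXidCriticalError"},
	IgnoredXIDs:  []uint64{},
	CriticalXIDs: []uint64{48},
}
```

Under these assumptions, `firstProposedConfig.IsCritical(48)` is true and `firstProposedConfig.IsCritical(13)` is false, while `gkeDefaults` treats only XID 48 as critical.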