DPU fw upgrade/reboot caused Host crash due to PCIe DPC events #180
Comments
@glimchb @ballle98 @jainvipin Here's an initial thought on DPU/host DPC behavior for host-reset, DPU-reset, and DPU OS install events. The scenarios to cover are:
- Host OS reset or crash
- DPU OS reset or crash
- DPU OS install mode
Thanks @tedstreete. I was hoping we could use this issue to understand and debug why a FW upgrade/reboot causes the host to crash completely.
Why is DPC not working? Do we have kernel dumps to attach here that show what happens when the DPU reboots and causes the host to crash?
@glimchb The primary issue is that neither of the two host OSes properly manages PCIe surprise-removal events. The historical expectation is that a failure of a PCIe device will always result in a host OS crash. The introduction of independently functional devices, like DPUs, breaks that expectation.
OPI will need to determine what behaviors we want the host OS to offer in the event of a DPU crash, reset, or graceful restart, and then make the necessary changes to the Linux kernel/PCIe subsystem and the host BIOS/BMC (iDRAC for Dell, iLO for HP, etc.).
Just as a data point, Fedora, CentOS, and RHEL all enable DPC support by default. For example, from a RHEL 8.6 host:
And from the tip of rawhide:
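For anyone who wants to run a similar check on their own host, here is a minimal sketch. It assumes that "DPC support enabled" refers to the kernel build option `CONFIG_PCIE_DPC`, and that the distribution ships the running kernel's config at `/boot/config-<release>` (the usual Fedora/RHEL layout). The function names are hypothetical, and this is not necessarily the command used for the outputs referenced above.

```python
import os
import platform
import re


def parse_kernel_config(text):
    """Return {option: value} for every CONFIG_* entry in kernel config text."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        # Enabled options look like CONFIG_FOO=y, CONFIG_FOO=m or CONFIG_FOO="str".
        m = re.match(r"^(CONFIG_\w+)=(.*)$", line)
        if m:
            options[m.group(1)] = m.group(2)
            continue
        # Disabled options are recorded as comments: "# CONFIG_FOO is not set".
        m = re.match(r"^# (CONFIG_\w+) is not set$", line)
        if m:
            options[m.group(1)] = "n"
    return options


def dpc_enabled(text):
    """True if the config text has CONFIG_PCIE_DPC built in (=y)."""
    return parse_kernel_config(text).get("CONFIG_PCIE_DPC") == "y"


def read_running_kernel_config():
    """Read the running kernel's config file, or return None if it is absent.

    Assumption: the distro installs it as /boot/config-<kernel release>.
    """
    path = os.path.join("/boot", "config-" + platform.release())
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return f.read()
```

Calling `dpc_enabled(read_running_kernel_config() or "")` on a host reports whether the option is built in. Note that a kernel built with DPC support does not by itself mean DPC is active on a given port; that also depends on the hardware and on firmware/OS control negotiation.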
@tedstreete can you please elaborate? I know Intel is making a lot of improvements in this area in its next generation. Do we have data from AMD as well?
@seroyer @glimchb The primary issue is that the default behavior when a surprise removal occurs is to crash the OS. OPI needs to determine what other behaviors we want the kernel to exhibit and then ensure that the kernel/PCIe subsystem/BIOS/BMC offer those options. Additionally, while it is not mandatory if DPC events are managed gracefully, I'd argue that the ability to disable the host/DPU PCIe link during a DPU OS install/upgrade is a benefit we should explore.
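As a rough illustration of what detaching the DPU from the host before an install could look like with interfaces that exist today, the sketch below uses the Linux sysfs PCI hot-remove and rescan files (`/sys/bus/pci/devices/<BDF>/remove` and `/sys/bus/pci/rescan`). This is an assumption-laden stand-in, not the link-disable mechanism proposed above: it only removes the functions from the host kernel's view, and whether that is enough to avoid a crash when the link later drops is platform-dependent and untested here. The helper names and the example address are hypothetical.

```python
import os
import re

SYSFS_PCI = "/sys/bus/pci"
# Full PCI address: domain:bus:device.function, e.g. 0000:3b:00.0
BDF_RE = re.compile(r"^[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]$")


def remove_path(bdf):
    """sysfs file that hot-removes one PCI function when '1' is written."""
    if not BDF_RE.match(bdf):
        raise ValueError("not a full PCI address: %r" % (bdf,))
    return os.path.join(SYSFS_PCI, "devices", bdf, "remove")


def rescan_path():
    """sysfs file that triggers a rescan of all PCI buses when '1' is written."""
    return os.path.join(SYSFS_PCI, "rescan")


def plan_detach(bdfs):
    """Ordered (path, value) writes that detach the given DPU functions."""
    return [(remove_path(bdf), "1") for bdf in bdfs]


def plan_reattach():
    """Single (path, value) write that asks the kernel to rediscover devices."""
    return [(rescan_path(), "1")]


def apply_writes(writes, dry_run=True):
    """Perform the writes, or only print them when dry_run is True.

    Real writes need root and will unbind drivers from the device.
    """
    for path, value in writes:
        if dry_run:
            print("would write %r to %s" % (value, path))
        else:
            with open(path, "w") as f:
                f.write(value)
```

A dry run such as `apply_writes(plan_detach(["0000:3b:00.0"]))` only prints the intended writes. After the DPU install completes, `apply_writes(plan_reattach(), dry_run=False)` would ask the host to rediscover the device.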
@ballle98 @tedstreete @jainvipin can you please add all the details, thoughts, and debug info that you have on this? We can then start bringing more people in, and we don't want them to have to read the entire Slack history to understand the issue.