[reconfigurator] Test MGS driven RoT updates #8295

karencfv · 2025-06-09T08:46:37Z

This commit adapts the existing test code so it can exercise any component the MGS updater can update, not just SPs. It also adds tests for RoT updates.

karencfv · 2025-06-10T06:48:15Z

nexus/mgs-updates/src/rot_updater.rs

+            // If the active slot does not match the expected active slot, it is possible
+            // another update is happening. Bail out.
+            if expected_active_slot.slot() != active {
+                return Err(PrecheckError::WrongActiveSlot {
+                    expected: expected_active_slot.slot, found: *active
+                })
+            }
+


The tests helped me find a bug! This precondition should only be checked after we've determined whether there is nothing left to do and, if so, returned a PrecheckStatus::UpdateComplete. When an update is done, a new active slot is set, so with the check in its old position the precheck would always return this error after an update completed, instead of reporting that the update had completed.
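The ordering issue described above can be illustrated with a minimal sketch. The types below are simplified, hypothetical stand-ins rather than the real definitions in nexus/mgs-updates, and the completion test (active version equals target version) is an assumption made for illustration only.

```rust
// Hypothetical, simplified stand-ins for the types in nexus/mgs-updates.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RotSlot {
    A,
    B,
}

#[derive(Debug, PartialEq, Eq)]
enum PrecheckStatus {
    UpdateComplete,
    ReadyForUpdate,
}

#[derive(Debug, PartialEq, Eq)]
enum PrecheckError {
    WrongActiveSlot { expected: RotSlot, found: RotSlot },
}

/// What the device reports right now (assumed shape).
struct Found {
    active: RotSlot,
    active_version: &'static str,
}

/// What the update request expects (assumed shape).
struct Expected {
    active_slot: RotSlot,
    target_version: &'static str,
}

fn precheck(
    found: &Found,
    expected: &Expected,
) -> Result<PrecheckStatus, PrecheckError> {
    // 1. Completion check first. A finished update has already switched the
    //    active slot, so checking the slot before this would turn every
    //    completed update into a WrongActiveSlot error.
    if found.active_version == expected.target_version {
        return Ok(PrecheckStatus::UpdateComplete);
    }

    // 2. Only now is a slot mismatch suspicious: it may mean another
    //    update is happening.
    if found.active != expected.active_slot {
        return Err(PrecheckError::WrongActiveSlot {
            expected: expected.active_slot,
            found: found.active,
        });
    }

    Ok(PrecheckStatus::ReadyForUpdate)
}
```

With this ordering, a device that has already booted the target version from the other slot reports completion, while a slot mismatch on a device still running an old version is reported as an error.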

We'll also need to check the transient boot selection and the pending persistent boot selection. Also, the active slot should match the persistent boot selection.
These checks can't be done, and the conflicts won't be seen, until Hubris PR #2050 is merged.
Any deviation would probably indicate a previous failed update. An ignition power-cycle is the big hammer that can be used, but we don't want bugs where we are continually power-cycling equipment. So, RoT and SP resets should be considered if transient and pending persistent are != None. Active != Persistent also needs to be considered.
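As a rough sketch of the three conditions listed in this comment, the function below reports which of them hold for a given boot state. `RotBootState`, `Deviation`, and `deviations` are hypothetical names introduced for illustration; they are not part of omicron or Hubris, and the sketch says nothing about which reset, if any, should follow.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RotSlot {
    A,
    B,
}

/// Simplified, hypothetical view of the RoT boot state discussed here.
struct RotBootState {
    active: RotSlot,
    persistent_boot_preference: RotSlot,
    pending_persistent_boot_preference: Option<RotSlot>,
    transient_boot_preference: Option<RotSlot>,
}

/// One entry per condition named in the comment above.
#[derive(Debug, PartialEq, Eq)]
enum Deviation {
    TransientSet(RotSlot),
    PendingPersistentSet(RotSlot),
    ActiveNotPersistent { active: RotSlot, persistent: RotSlot },
}

/// Lists every deviation from the "clean" state: no transient selection,
/// no pending persistent selection, and active == persistent.
fn deviations(state: &RotBootState) -> Vec<Deviation> {
    let mut found = Vec::new();
    if let Some(slot) = state.transient_boot_preference {
        found.push(Deviation::TransientSet(slot));
    }
    if let Some(slot) = state.pending_persistent_boot_preference {
        found.push(Deviation::PendingPersistentSet(slot));
    }
    if state.active != state.persistent_boot_preference {
        found.push(Deviation::ActiveNotPersistent {
            active: state.active,
            persistent: state.persistent_boot_preference,
        });
    }
    found
}
```

An empty result would correspond to a state with nothing to recover from. Anything else would be a candidate for the resets mentioned above, with the power-cycle kept as a last resort.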

> We'll also need to check the transient boot selection and the pending persistent boot selection. Also, the active slot should match the persistent boot selection.

These are the checks I currently have: https://github.com/oxidecomputer/omicron/blob/main/nexus/mgs-updates/src/rot_updater.rs#L370-L386

    // If transient boot is being used, the persistent preference is not going to match
    // the active slot. At the moment, this mismatch can also mean one of the partitions
    // had a bad signature check. We don't have a way to tell this apart yet.
    // https://github.com/oxidecomputer/hubris/issues/2066
    //
    // For now, this discrepancy will mean a bad signature check. That's ok, we can continue.
    // The logic here should change when transient boot preference is implemented.
    if expected_persistent_boot_preference != active {
        info!(log, "expected_persistent_boot_preference does not match active slot. \
        This could mean a previous broken update attempt.");
    };

    // If pending_persistent_boot_preference or transient_boot_preference is/are some,
    // then we need to wait, an update is happening.
    if transient_boot_preference.is_some() || pending_persistent_boot_preference.is_some() {
        return Err(PrecheckError::EphemeralRotBootPreferenceSet);
    }

Do they make sense for now?

Could the discrepancies you mention mean two things?

- There might be an ongoing update
- It might be a previous failed update

If so, how can I know the difference?

These tests are ok for now, but they can be improved once Hubris PR #2050 gets merged. With your proposal, if an update fails, there isn't any code to clean it up and retry. The easiest recovery will be to reset the RoT or power-cycle the device, and that will be done manually.

Post-2050, in the case of there being a bad signature check on the alternate RoT image, either the pending-persistent or just the persistent boot preference will need to be set to the good image before proceeding.

Also, you could use pending-persistent or transient boot preference settings as part of a heuristic to determine that there is an update in progress or there was a failed update. But in the case of a failed update, these need to be tolerated and fixed so that a successful update can proceed.
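One way to picture "tolerated and fixed" is a function that turns a leftover boot state into a list of repair steps instead of an error. This is only a sketch under stated assumptions: it presumes the caller has already decided that the state comes from a failed update rather than one in progress, `alternate_signature_bad` is a hypothetical field standing in for whatever the RoT reports after Hubris PR #2050, and how each step is actually carried out (for example via an RoT reset) is deliberately left open.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RotSlot {
    A,
    B,
}

/// Simplified, hypothetical view of the RoT boot state after a failed update.
struct RotBootState {
    active: RotSlot,
    persistent_boot_preference: RotSlot,
    pending_persistent_boot_preference: Option<RotSlot>,
    transient_boot_preference: Option<RotSlot>,
    /// Assumption: true if the image in the non-active (alternate) slot
    /// failed its signature check.
    alternate_signature_bad: bool,
}

#[derive(Debug, PartialEq, Eq)]
enum RepairStep {
    /// Clear leftover transient / pending-persistent boot preferences.
    ClearEphemeralPreferences,
    /// Point the persistent boot preference at the slot holding the good image.
    SetPersistentPreference(RotSlot),
}

/// Returns the steps needed before a fresh update attempt can proceed.
/// An empty plan means there is nothing to repair.
fn repair_plan(state: &RotBootState) -> Vec<RepairStep> {
    let mut steps = Vec::new();

    if state.transient_boot_preference.is_some()
        || state.pending_persistent_boot_preference.is_some()
    {
        steps.push(RepairStep::ClearEphemeralPreferences);
    }

    // If the alternate image is bad, the active slot holds the good image,
    // so the persistent preference should point at it.
    if state.alternate_signature_bad
        && state.persistent_boot_preference != state.active
    {
        steps.push(RepairStep::SetPersistentPreference(state.active));
    }

    steps
}
```

Returning a plan rather than an error keeps the precheck from blocking forever on a state that only a manual reset would otherwise clear.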

Thanks @lzrd! I've updated #8349 to include this information.

jgallagher

This looks good to me. I'll defer to @lzrd on the specific RoT checks.

karencfv · 2025-06-16T21:35:57Z

Thanks both for taking time to review my PR 🙇‍♀️

> We'll also need to check the transient boot selection and the pending persistent boot selection. Also, the active slot should match the persistent boot selection.
> These checks can't be done, and the conflicts won't be seen, until Hubris PR oxidecomputer/hubris#2050 is merged.

It appears I won't be able to do anything with these new checks until oxidecomputer/hubris#2050 is merged 🤔. I'd like to get these tests and the bugfix merged now, and I've opened #8349 to track any further changes to the RoT update pre-checks.

Any objections? @lzrd @jgallagher

jgallagher · 2025-06-17T15:52:05Z

Sounds reasonable to me!

karencfv added 6 commits June 9, 2025 18:28

[reconfigurator] Test MGS driven RoT updates

ae2b1cb

start adding test code for RoT

1d6c034

retrieve info from boot info

d06b804

Add rot artifacts

c7e62d5

Fix bug and add RoT test helpers

4ef66a9

Add more tests

5564cd3

karencfv commented Jun 10, 2025

karencfv marked this pull request as ready for review June 10, 2025 06:51

karencfv requested review from davepacheco, jgallagher and lzrd June 10, 2025 06:52

jgallagher reviewed Jun 16, 2025

karencfv mentioned this pull request Jun 16, 2025

Review RoT update pre-checks #8349

Open

karencfv mentioned this pull request Jun 17, 2025

[reconfigurator] Pre-checks and post_update actions for RoT bootloader update #8325

Open

jgallagher approved these changes Jun 17, 2025

karencfv merged commit 5b05d4a into oxidecomputer:main Jun 17, 2025
16 checks passed

karencfv deleted the test-rot-update branch June 17, 2025 20:29

karencfv mentioned this pull request Jun 24, 2025

MGS driven SP components left in invalid state should have a way to recover from failed updates #8414

Open

karencfv Jun 10, 2025
lzrd Jun 13, 2025
karencfv Jun 13, 2025
lzrd Jun 18, 2025
lzrd Jun 18, 2025
karencfv Jun 18, 2025
