Skip to content

[reconfigurator] Test MGS driven RoT updates #8295

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Jun 17, 2025

Conversation

karencfv
Copy link
Contributor

@karencfv karencfv commented Jun 9, 2025

This commit adapts the current code to have the capability to test any component that is update-able by the MGS updater, not just SPs. Additionally, I have added tests for RoT updates

Comment on lines +295 to +302
// If the active slot does not match the expected active slot, it is possible
// another update is happening. Bail out.
if expected_active_slot.slot() != active {
return Err(PrecheckError::WrongActiveSlot {
expected: expected_active_slot.slot, found: *active
})
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests helped me find a bug! This precondition should be checked after we've checked there is nothing to do and returned a PrecheckStatus::UpdateComplete. When an update is done, a new active slot is set, so this precondition error would always return after an update was completed instead of reporting the update had completed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll also need to check the transient boot selection and the pending persistent boot selection. Also, the active slot should match the persistent boot selection.
These checks can't be done, and the conflicts won't be seen, until Hubris PR #2050 is merged.
Any deviation would probably indicate a previous failed update. An ignition power-cycle is the big hammer that can be used, but we don't want bugs where we are continually power-cycling equipment. So, RoT and SP resets should be considered if transient and pending persistent are != None. Active != Persistent also needs to be considered.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll also need to check the transient boot selection and the pending persistent boot selection. Also, the active slot should match the persistent boot selection.

These are the checks I currently have https://github.com/oxidecomputer/omicron/blob/main/nexus/mgs-updates/src/rot_updater.rs#L370-L386

// If transient boot is being used, the persistent preference is not going to match 
// the active slot. At the moment, this mismatch can also mean one of the partitions
// had a bad signature check. We don't have a way to tell this appart yet.
// https://github.com/oxidecomputer/hubris/issues/2066
//
// For now, this discrepancy will mean a bad signature check. That's ok, we can continue.
// The logic here should change when transient boot preference is implemented.
if expected_persistent_boot_preference != active {
    info!(log, "expected_persistent_boot_preference does not match active slot. \
    This could mean a previous broken update attempt.");
};

// If pending_persistent_boot_preference or transient_boot_preference is/are some,
// then we need to wait, an update is happening.
if transient_boot_preference.is_some() || pending_persistent_boot_preference.is_some() {
    return Err(PrecheckError::EphemeralRotBootPreferenceSet);
}

Do they make sense for now?

Could the discrepancies you mention mean two things?

  1. There might be an ongoing update
  2. It might be a previous failed update

If so, how can I know the difference?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests are ok for now, but when Hubris 2050 gets merged they can be improved. With your proposal, if an update fails, then there isn't any code to clean it up and retry. The easiest recovery will be to reset the RoT or power-cycle the device and that will be done manually.

Post-2050, in the case of there being a bad signature check on the alternate RoT image, either the pending-persistent or just the persistent boot preference will need to be set to the good image before proceeding.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, you could use pending-persistent or transient boot preference settings as part of a heuristic to determine that there is an update in progress or there was a failed update. But in the case of a failed update, these need to be tolerated and fixed so that a successful update can proceed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @lzrd! I've updated #8349 to include this information

@karencfv karencfv marked this pull request as ready for review June 10, 2025 06:51
Copy link
Contributor

@jgallagher jgallagher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. I'll defer to @lzrd on the specific RoT checks.

@karencfv
Copy link
Contributor Author

Thanks both for taking time to review my PR 🙇‍♀️

We'll also need to check the transient boot selection and the pending persistent boot selection. Also, the active slot should match the persistent boot selection.
These checks can't be done, and the conflicts won't be seen, until Hubris PR oxidecomputer/hubris#2050 is merged.

It appears I won't be able to do anything with these new checks until oxidecomputer/hubris#2050 is merged 🤔 . I'd like to get these tests and bugfix merged now, and I've opened #8349 to track any further changes to the RoT update pre-checks.

Any objections? @lzrd @jgallagher

@jgallagher
Copy link
Contributor

Sounds reasonable to me!

@karencfv karencfv merged commit 5b05d4a into oxidecomputer:main Jun 17, 2025
16 checks passed
@karencfv karencfv deleted the test-rot-update branch June 17, 2025 20:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants