-
Notifications
You must be signed in to change notification settings - Fork 45
[reconfigurator] Test MGS driven RoT updates #8295
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
// If the active slot does not match the expected active slot, it is possible | ||
// another update is happening. Bail out. | ||
if expected_active_slot.slot() != active { | ||
return Err(PrecheckError::WrongActiveSlot { | ||
expected: expected_active_slot.slot, found: *active | ||
}) | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tests helped me find a bug! This precondition should be checked after we've checked there is nothing to do and returned a PrecheckStatus::UpdateComplete
. When an update is done, a new active slot is set, so this precondition error would always return after an update was completed instead of reporting the update had completed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll also need to check the transient boot selection and the pending persistent boot selection. Also, the active slot should match the persistent boot selection.
These checks can't be done, and the conflicts won't be seen, until Hubris PR #2050 is merged.
Any deviation would probably indicate a previous failed update. An ignition power-cycle is the big hammer that can be used, but we don't want bugs where we are continually power-cycling equipment. So, RoT and SP resets should be considered if transient and pending persistent are != None. Active != Persistent also needs to be considered.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll also need to check the transient boot selection and the pending persistent boot selection. Also, the active slot should match the persistent boot selection.
These are the checks I currently have https://github.com/oxidecomputer/omicron/blob/main/nexus/mgs-updates/src/rot_updater.rs#L370-L386
// If transient boot is being used, the persistent preference is not going to match
// the active slot. At the moment, this mismatch can also mean one of the partitions
// had a bad signature check. We don't have a way to tell this appart yet.
// https://github.com/oxidecomputer/hubris/issues/2066
//
// For now, this discrepancy will mean a bad signature check. That's ok, we can continue.
// The logic here should change when transient boot preference is implemented.
if expected_persistent_boot_preference != active {
info!(log, "expected_persistent_boot_preference does not match active slot. \
This could mean a previous broken update attempt.");
};
// If pending_persistent_boot_preference or transient_boot_preference is/are some,
// then we need to wait, an update is happening.
if transient_boot_preference.is_some() || pending_persistent_boot_preference.is_some() {
return Err(PrecheckError::EphemeralRotBootPreferenceSet);
}
Do they make sense for now?
Could the discrepancies you mention mean two things?
- There might be an ongoing update
- It might be a previous failed update
If so, how can I know the difference?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These tests are ok for now, but when Hubris 2050 gets merged they can be improved. With your proposal, if an update fails, then there isn't any code to clean it up and retry. The easiest recovery will be to reset the RoT or power-cycle the device and that will be done manually.
Post-2050, in the case of there being a bad signature check on the alternate RoT image, either the pending-persistent or just the persistent boot preference will need to be set to the good image before proceeding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, you could use pending-persistent or transient boot preference settings as part of a heuristic to determine that there is an update in progress or there was a failed update. But in the case of a failed update, these need to be tolerated and fixed so that a successful update can proceed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me. I'll defer to @lzrd on the specific RoT checks.
Thanks both for taking time to review my PR 🙇♀️
It appears I won't be able to do anything with these new checks until oxidecomputer/hubris#2050 is merged 🤔 . I'd like to get these tests and bugfix merged now, and I've opened #8349 to track any further changes to the RoT update pre-checks. Any objections? @lzrd @jgallagher |
Sounds reasonable to me! |
This commit adapts the current code to have the capability to test any component that is update-able by the MGS updater, not just SPs. Additionally, I have added tests for RoT updates