Transient boot selection #2050

lzrd · 2025-04-28T23:51:03Z

Automated update should take advantage of transient boot selection. This is the last piece to make that feature available.
See RFD 374 for a description and bootleby source for the logic being driven.

labbott

We need to update fn boot_preferences to read our transient preference (GitHub won't let me comment on code not in the PR)

chips/lpc55/memory.toml

drv/lpc55-update-server/src/main.rs

chips/lpc55/memory.toml

drv/lpc55-update-server/src/main.rs

lzrd · 2025-06-17T17:18:03Z

We need to update fn boot_preferences to read our transient preference (GitHub won't let me comment on code not in the PR)

done

labbott

Once the MGS side is ready I will hit approve

Cargo.lock

See bootleby:src/lib.rs for the receiving end of the override selection.

Given the current API for setting the transient boot preference there is an implementation choice. While setting the transient selection to the alternate image (currently non-active) is the common use case, we also need to define what happens when the transient preference is specified to be the currently active RoT Hubris image. The simple case is that setting transient to active will clear the transient selection. That is the choice this PR makes. The other case, setting an actual transient preference to the active gives rise to some more complex behavior that is just annoying. Assume that A is active, B is the pending persistent preference, and A is also the transient preference. On the first reset, the ROM will promote the CFPA scratch page to active, thus making B the persistent preference. However, Bootleby will use the transient preference of A to select the active image. The RoT will then be running Hubris image A. Bootleby has cleared the transient selection. On the second reset, there is no transient selection or pending persistent. Bootleby will select Hubris image B. These scenarios have been tested and analyzed and there is no motivation found for enabling this sequence. If, after merging this PR, it is found that we do want to enable this sequence of two resets in a row running different images, we can change the API to an enum of `Select(SlotId)` or `ClearSelection`. --- Hubris issue 2093 was found in testing where setting the persistent preference to different values in succession leads to the inability to clear a pending persistent boot preference. This may be needed to recover from a previous failed/abandoned RoT Hubris update where one wants to avoid additional reset operations. This is a corner case with a workaround, but it is also against the intent of the API.

Fixes from review. Fix memory alias for RoT Hubris 'B' image's access to transient boot override setting. `component_update_prep_image` was using `bootstate` to determine active image. The `same_image` function is a more direct test and does not depend on `bootstate` being valid.

mkeeter · 2025-06-24T13:56:05Z

drv/lpc55-update-server/src/main.rs

+                    stage0_handoff::RotSlot::B => SlotId::B,
+                };
+
+                // If the requested slot is the same as the active slot,


These semantics seem confusing / non-obvious. I think it would make more sense to explicitly return an error (InvalidPreferredSlotId?) if someone tries to set the transient override to be the same as the active slot.

This is for the case where a previous update attempt has been abandoned or failed and left a transient preference that the new session does not want. We want to enable resetting an existing transient preference.

This semantic came out of a fault insertion test where this type of abandoned session was simulated.

This still seems non ideal for the developer user case. I brought up not clearing the transient setting without some indication before #2050 (comment) albeit in a slightly different place. I'd rather have an explicit way to clear the transient setting if we want that behavior. This would require MGS changes but I'd argue this is a gap in how the original API was designed.

drv/lpc55-update-server/src/main.rs

mkeeter · 2025-06-24T14:42:54Z

The use of MaybeUninit here seems sketchy; it's only defined in terms of Rust behavior, so using assume_init to grab data that happens to be in RAM could be fragile.

(This is admittedly not a new problem introduced in the PR, but the PR adds another case of it)

I think the right way to handle this is using an extern static, although it's still a subject of active discussion (rust-lang/unsafe-code-guidelines#397 (comment))

Here's an (untested) patch which does just that:

diff --git a/app/lpc55xpresso/app.toml b/app/lpc55xpresso/app.toml
index cc03b0de53..48b479bacb 100644
--- a/app/lpc55xpresso/app.toml
+++ b/app/lpc55xpresso/app.toml
@@ -51,7 +51,8 @@
 max-sizes = {flash = 26720, ram = 16704}
 stacksize = 8192
 start = true
-sections = {bootstate = "usbsram", transient_override = "override"}
+sections = {bootstate = "usbsram"}
+extern-regions = ["transient_override"]
 uses = ["flash_controller", "hash_crypt"]
 notifications = ["flash-irq", "hashcrypt-irq"]
 interrupts = {"flash_controller.irq" = "flash-irq", "hash_crypt.irq" = "hashcrypt-irq"}
diff --git a/app/oxide-rot-1/app-dev.toml b/app/oxide-rot-1/app-dev.toml
index 50a945c016..9d45ad973f 100644
--- a/app/oxide-rot-1/app-dev.toml
+++ b/app/oxide-rot-1/app-dev.toml
@@ -55,7 +55,8 @@
 priority = 3
 stacksize = 8192
 start = true
-sections = {bootstate = "usbsram", transient_override = "override"}
+sections = {bootstate = "usbsram"}
+extern-regions = ["transient_override"]
 uses = ["flash_controller", "hash_crypt"]
 notifications = ["flash-irq", "hashcrypt-irq"]
 interrupts = {"flash_controller.irq" = "flash-irq", "hash_crypt.irq" = "hashcrypt-irq"}
diff --git a/app/oxide-rot-1/app.toml b/app/oxide-rot-1/app.toml
index 9e12953bad..0c45eb254d 100644
--- a/app/oxide-rot-1/app.toml
+++ b/app/oxide-rot-1/app.toml
@@ -43,7 +43,8 @@
 priority = 3
 stacksize = 8192
 start = true
-sections = {bootstate = "usbsram", transient_override = "override"}
+sections = {bootstate = "usbsram"}
+extern-regions = ["transient_override"]
 uses = ["flash_controller", "hash_crypt"]
 notifications = ["flash-irq", "hashcrypt-irq"]
 interrupts = {"flash_controller.irq" = "flash-irq", "hash_crypt.irq" = "hashcrypt-irq"}
diff --git a/app/rot-carrier/app.toml b/app/rot-carrier/app.toml
index 3072a96e1c..74c20d7d64 100644
--- a/app/rot-carrier/app.toml
+++ b/app/rot-carrier/app.toml
@@ -43,7 +43,8 @@
 # TODO size this appropriately
 stacksize = 8192
 start = true
-sections = {bootstate = "usbsram", transient_override = "override"}
+sections = {bootstate = "usbsram"}
+extern-regions = ["transient_override"]
 uses = ["flash_controller", "hash_crypt"]
 notifications = ["flash-irq", "hashcrypt-irq"]
 interrupts = {"flash_controller.irq" = "flash-irq", "hash_crypt.irq" = "hashcrypt-irq"}
diff --git a/chips/lpc55/memory.toml b/chips/lpc55/memory.toml
index 0125777412..c6322ef47d 100644
--- a/chips/lpc55/memory.toml
+++ b/chips/lpc55/memory.toml
@@ -93,7 +93,7 @@
 execute = false
 dma = true
 
-[[override]]
+[[transient_override]]
 name = "a"
 address = 0x2003ffe0
 size = 32
@@ -101,7 +101,7 @@
 write = true
 execute = false
 
-[[override]]
+[[transient_override]]
 name = "b"
 address = 0x2003ffe0
 size = 32
diff --git a/drv/lpc55-update-server/src/main.rs b/drv/lpc55-update-server/src/main.rs
index 8f78b91c42..3cc2898d46 100644
--- a/drv/lpc55-update-server/src/main.rs
+++ b/drv/lpc55-update-server/src/main.rs
@@ -41,10 +41,6 @@
 #[link_section = ".bootstate"]
 static BOOTSTATE: MaybeUninit<[u8; 0x1000]> = MaybeUninit::uninit();
 
-#[used]
-#[link_section = ".transient_override"]
-static mut TRANSIENT_OVERRIDE: MaybeUninit<[u8; 32]> = MaybeUninit::uninit();
-
 #[derive(Copy, Clone, PartialEq)]
 enum UpdateState {
     NoUpdate,
@@ -1416,22 +1412,39 @@
     RotBootStateV2::load_from_addr(addr)
 }
 
+extern "C" {
+    // Symbols injected by the linker.
+    //
+    // This requires adding `extern-regions = ["transient_override"]` to the
+    // task config
+    pub static mut __REGION_TRANSIENT_OVERRIDE_BASE: [u32; 0];
+}
+
 fn set_transient_override(preference: [u8; 32]) {
-    // Safety: Data is consumed by Bootleby on next boot.
-    // There are no concurrent writers possible.
-    // Calling this function multiple times is ok.
-    // Bootleby is careful to vet contents before acting.
-    unsafe {
-        TRANSIENT_OVERRIDE.write(preference);
+    // SAFETY: populated by the linker, getting the address is fine
+    let override_addr =
+        unsafe { __REGION_TRANSIENT_OVERRIDE_BASE.as_ptr() } as *mut u8;
+    for (i, p) in preference.iter().enumerate() {
+        // SAFETY: this points to a valid region of RAM that is otherwise unused
+        // by Rust, so we can write to it.
+        unsafe {
+            core::ptr::write_volatile(override_addr.wrapping_add(i), *p);
+        }
     }
 }
 
 fn get_transient_override() -> [u8; 32] {
-    // Safety: Data is consumed by Bootleby on next boot.
-    // There are no concurrent writers possible.
-    // Bootleby consumes and resets TRANSIENT_OVERRIDE.
-    // The client may be verifying state set during update flows.
-    unsafe { TRANSIENT_OVERRIDE.assume_init() }
+    // SAFETY: populated by the linker, getting the address is fine
+    let override_addr =
+        unsafe { __REGION_TRANSIENT_OVERRIDE_BASE.as_ptr() } as *const u8;
+    let mut out = [0u8; 32];
+    for (i, p) in out.iter_mut().enumerate() {
+        // SAFETY: values are guaranteed to be initialized by stage0
+        unsafe {
+            *p = core::ptr::read_volatile(override_addr.wrapping_add(i));
+        }
+    }
+    out
 }
 
 // Preference constants are taken from bootleby:src/lib.rs

mkeeter

(marking this as "Request changes" to discuss returning an error if the persistent slot matches the active slot)

Co-authored-by: Matt Keeter <[email protected]>

labbott · 2025-06-24T17:38:34Z

drv/lpc55-update-server/src/main.rs

+                    stage0_handoff::RotSlot::B => SlotId::B,
+                };
+
+                // If the requested slot is the same as the active slot,


This still seems non ideal for the developer user case. I brought up not clearing the transient setting without some indication before #2050 (comment) albeit in a slightly different place. I'd rather have an explicit way to clear the transient setting if we want that behavior. This would require MGS changes but I'd argue this is a gap in how the original API was designed.

labbott · 2025-06-24T17:57:04Z

drv/lpc55-update-server/src/main.rs

+                // If we did allow setting the active slot as the transient
+                // preference, then we get this confusing 2-boot scenario when
+                // pending persistent preference is also used:
+                // the next reset returns us to the original image and the 2nd
+                // reset has us running the new/alternate image.


This statement seems strange in contrast to "This is for the case where a previous update attempt has been abandoned or failed and left a transient preference that the new session does not want" from above. If we are running active slot A and alternate slot is B, an incomplete/aborted update will always be targeting B. Why do we have a persistent setting for B if the update was aborted?

lzrd requested review from cbiffle, flihp and labbott as code owners April 28, 2025 23:51

labbott reviewed Apr 29, 2025

View reviewed changes

chips/lpc55/memory.toml Outdated Show resolved Hide resolved

drv/lpc55-update-server/src/main.rs Outdated Show resolved Hide resolved

lzrd force-pushed the transient-boot-selection branch from 4864f92 to 66acc49 Compare June 9, 2025 19:58

This was referenced Jun 9, 2025

Add InvalidPreferredSlotId error for use in rot prep_image_update oxidecomputer/management-gateway-service#398

Merged

[reconfigurator] Test MGS driven RoT updates oxidecomputer/omicron#8295

Merged

lzrd force-pushed the transient-boot-selection branch from 66acc49 to 6e9da78 Compare June 13, 2025 18:01

lzrd requested a review from labbott June 13, 2025 21:09

labbott reviewed Jun 16, 2025

View reviewed changes

chips/lpc55/memory.toml Outdated Show resolved Hide resolved

drv/lpc55-update-server/src/main.rs Outdated Show resolved Hide resolved

drv/lpc55-update-server/src/main.rs Outdated Show resolved Hide resolved

labbott reviewed Jun 16, 2025

View reviewed changes

drv/lpc55-update-server/src/main.rs Outdated Show resolved Hide resolved

labbott reviewed Jun 16, 2025

View reviewed changes

drv/lpc55-update-server/src/main.rs Show resolved Hide resolved

karencfv mentioned this pull request Jun 16, 2025

Review RoT update pre-checks oxidecomputer/omicron#8349

Open

lzrd force-pushed the transient-boot-selection branch from 83fc549 to d94e656 Compare June 16, 2025 23:50

lzrd closed this Jun 17, 2025

lzrd reopened this Jun 17, 2025

lzrd force-pushed the transient-boot-selection branch from d94e656 to f1e94f5 Compare June 17, 2025 23:16

labbott reviewed Jun 18, 2025

View reviewed changes

Cargo.lock Outdated Show resolved Hide resolved

lzrd force-pushed the transient-boot-selection branch 2 times, most recently from 8d68a85 to 87a91ed Compare June 18, 2025 17:32

karencfv mentioned this pull request Jun 23, 2025

MGS driven SP components left in invalid state should have a way to recover from failed updates oxidecomputer/omicron#8414

Open

lzrd added 7 commits June 23, 2025 12:16

Implement transient override for RoT Hubris boot selection.

0a3cd9f

See bootleby:src/lib.rs for the receiving end of the override selection.

update tasks.update_server in other LPC55 app.toml files

731667f

Use non-priviledged transient boot preference token address

d890ecc

include current state of transient override in RotBootInfo

ac39219

wip - TODO: remove patch. added InvalidPreferredSlotId

c1dfea3

update bitflags

8816f13

lzrd force-pushed the transient-boot-selection branch from 87a91ed to a780177 Compare June 23, 2025 21:47

lzrd requested a review from labbott June 23, 2025 21:50

lzrd force-pushed the transient-boot-selection branch from a780177 to e74fe24 Compare June 23, 2025 22:05

lzrd enabled auto-merge (squash) June 23, 2025 22:41

mkeeter reviewed Jun 24, 2025

View reviewed changes

drv/lpc55-update-server/src/main.rs Outdated Show resolved Hide resolved

mkeeter reviewed Jun 24, 2025

View reviewed changes

drv/lpc55-update-server/src/main.rs Outdated Show resolved Hide resolved

mkeeter requested changes Jun 24, 2025

View reviewed changes

lzrd and others added 2 commits June 24, 2025 09:26

Update drv/lpc55-update-server/src/main.rs

578f393

Co-authored-by: Matt Keeter <[email protected]>

Update drv/lpc55-update-server/src/main.rs

43ec488

Co-authored-by: Matt Keeter <[email protected]>

labbott reviewed Jun 24, 2025

View reviewed changes

Transient boot selection #2050

Are you sure you want to change the base?

Transient boot selection #2050

Uh oh!

Conversation

lzrd commented Apr 28, 2025

Uh oh!

labbott left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

lzrd commented Jun 17, 2025

Uh oh!

labbott left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

mkeeter Jun 24, 2025

Choose a reason for hiding this comment

Uh oh!

lzrd Jun 24, 2025

Choose a reason for hiding this comment

Uh oh!

lzrd Jun 24, 2025

Choose a reason for hiding this comment

Uh oh!

labbott Jun 24, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

mkeeter commented Jun 24, 2025

Uh oh!

mkeeter left a comment

Choose a reason for hiding this comment

Uh oh!

labbott Jun 24, 2025

Choose a reason for hiding this comment

Uh oh!

labbott Jun 24, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!