Skip to content

Transient boot selection #2050

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 10 commits into
base: master
Choose a base branch
from
Open

Transient boot selection #2050

wants to merge 10 commits into from

Conversation

lzrd
Copy link
Contributor

@lzrd lzrd commented Apr 28, 2025

Automated update should take advantage of transient boot selection. This is the last piece to make that feature available.
See RFD 374 for a description and bootleby source for the logic being driven.

@lzrd lzrd requested review from cbiffle, flihp and labbott as code owners April 28, 2025 23:51
Copy link
Collaborator

@labbott labbott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to update fn boot_preferences to read our transient preference (GitHub won't let me comment on code not in the PR)

@lzrd
Copy link
Contributor Author

lzrd commented Jun 17, 2025

We need to update fn boot_preferences to read our transient preference (GitHub won't let me comment on code not in the PR)

done

@lzrd lzrd closed this Jun 17, 2025
@lzrd lzrd reopened this Jun 17, 2025
@lzrd lzrd force-pushed the transient-boot-selection branch from d94e656 to f1e94f5 Compare June 17, 2025 23:16
Copy link
Collaborator

@labbott labbott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once the MGS side is ready I will hit approve

lzrd added 7 commits June 23, 2025 12:16
See bootleby:src/lib.rs for the receiving end of the override selection.
Given the current API for setting the transient boot preference there is
an implementation choice. While setting the transient selection to the
alternate image (currently non-active) is the common use case, we also
need to define what happens when the transient preference is specified
to be the currently active RoT Hubris image.

The simple case is that setting transient to active will clear the
transient selection. That is the choice this PR makes.

The other case, setting an actual transient preference to the active
gives rise to some more complex behavior that is just annoying.

Assume that A is active, B is the pending persistent preference, and A
is also the transient preference.

On the first reset, the ROM will promote the CFPA scratch page to
active, thus making B the persistent preference. However, Bootleby will
use the transient preference of A to select the active image. The RoT
will then be running Hubris image A. Bootleby has cleared the transient
selection.

On the second reset, there is no transient selection or pending
persistent. Bootleby will select Hubris image B.

These scenarios have been tested and analyzed and there is no motivation
found for enabling this sequence. If, after merging this PR, it is found
that we do want to enable this sequence of two resets in a row running
different images, we can change the API to an enum of `Select(SlotId)`
or `ClearSelection`.

---

Hubris issue 2093 was found in testing where setting the persistent
preference to different values in succession leads to the inability to
clear a pending persistent boot preference. This may be needed to
recover from a previous failed/abandoned RoT Hubris update where one
wants to avoid additional reset operations.
This is a corner case with a workaround, but it is also against the
intent of the API.
@lzrd lzrd force-pushed the transient-boot-selection branch from 87a91ed to a780177 Compare June 23, 2025 21:47
@lzrd lzrd requested a review from labbott June 23, 2025 21:50
Fixes from review.

Fix memory alias for RoT Hubris 'B' image's access to transient boot override setting.

`component_update_prep_image` was using `bootstate` to determine active
image. The `same_image` function is a more direct test and does not
depend on `bootstate` being valid.
@lzrd lzrd force-pushed the transient-boot-selection branch from a780177 to e74fe24 Compare June 23, 2025 22:05
@lzrd lzrd enabled auto-merge (squash) June 23, 2025 22:41
stage0_handoff::RotSlot::B => SlotId::B,
};

// If the requested slot is the same as the active slot,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These semantics seem confusing / non-obvious. I think it would make more sense to explicitly return an error (InvalidPreferredSlotId?) if someone tries to set the transient override to be the same as the active slot.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for the case where a previous update attempt has been abandoned or failed and left a transient preference that the new session does not want. We want to enable resetting an existing transient preference.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This semantic came out of a fault insertion test where this type of abandoned session was simulated.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still seems non ideal for the developer user case. I brought up not clearing the transient setting without some indication before #2050 (comment) albeit in a slightly different place. I'd rather have an explicit way to clear the transient setting if we want that behavior. This would require MGS changes but I'd argue this is a gap in how the original API was designed.

@mkeeter
Copy link
Collaborator

mkeeter commented Jun 24, 2025

The use of MaybeUninit here seems sketchy; it's only defined in terms of Rust behavior, so using assume_init to grab data that happens to be in RAM could be fragile.

(This is admittedly not a new problem introduced in the PR, but the PR adds another case of it)

I think the right way to handle this is using an extern static, although it's still a subject of active discussion (rust-lang/unsafe-code-guidelines#397 (comment))

Here's an (untested) patch which does just that:

diff --git a/app/lpc55xpresso/app.toml b/app/lpc55xpresso/app.toml
index cc03b0de53..48b479bacb 100644
--- a/app/lpc55xpresso/app.toml
+++ b/app/lpc55xpresso/app.toml
@@ -51,7 +51,8 @@
 max-sizes = {flash = 26720, ram = 16704}
 stacksize = 8192
 start = true
-sections = {bootstate = "usbsram", transient_override = "override"}
+sections = {bootstate = "usbsram"}
+extern-regions = ["transient_override"]
 uses = ["flash_controller", "hash_crypt"]
 notifications = ["flash-irq", "hashcrypt-irq"]
 interrupts = {"flash_controller.irq" = "flash-irq", "hash_crypt.irq" = "hashcrypt-irq"}
diff --git a/app/oxide-rot-1/app-dev.toml b/app/oxide-rot-1/app-dev.toml
index 50a945c016..9d45ad973f 100644
--- a/app/oxide-rot-1/app-dev.toml
+++ b/app/oxide-rot-1/app-dev.toml
@@ -55,7 +55,8 @@
 priority = 3
 stacksize = 8192
 start = true
-sections = {bootstate = "usbsram", transient_override = "override"}
+sections = {bootstate = "usbsram"}
+extern-regions = ["transient_override"]
 uses = ["flash_controller", "hash_crypt"]
 notifications = ["flash-irq", "hashcrypt-irq"]
 interrupts = {"flash_controller.irq" = "flash-irq", "hash_crypt.irq" = "hashcrypt-irq"}
diff --git a/app/oxide-rot-1/app.toml b/app/oxide-rot-1/app.toml
index 9e12953bad..0c45eb254d 100644
--- a/app/oxide-rot-1/app.toml
+++ b/app/oxide-rot-1/app.toml
@@ -43,7 +43,8 @@
 priority = 3
 stacksize = 8192
 start = true
-sections = {bootstate = "usbsram", transient_override = "override"}
+sections = {bootstate = "usbsram"}
+extern-regions = ["transient_override"]
 uses = ["flash_controller", "hash_crypt"]
 notifications = ["flash-irq", "hashcrypt-irq"]
 interrupts = {"flash_controller.irq" = "flash-irq", "hash_crypt.irq" = "hashcrypt-irq"}
diff --git a/app/rot-carrier/app.toml b/app/rot-carrier/app.toml
index 3072a96e1c..74c20d7d64 100644
--- a/app/rot-carrier/app.toml
+++ b/app/rot-carrier/app.toml
@@ -43,7 +43,8 @@
 # TODO size this appropriately
 stacksize = 8192
 start = true
-sections = {bootstate = "usbsram", transient_override = "override"}
+sections = {bootstate = "usbsram"}
+extern-regions = ["transient_override"]
 uses = ["flash_controller", "hash_crypt"]
 notifications = ["flash-irq", "hashcrypt-irq"]
 interrupts = {"flash_controller.irq" = "flash-irq", "hash_crypt.irq" = "hashcrypt-irq"}
diff --git a/chips/lpc55/memory.toml b/chips/lpc55/memory.toml
index 0125777412..c6322ef47d 100644
--- a/chips/lpc55/memory.toml
+++ b/chips/lpc55/memory.toml
@@ -93,7 +93,7 @@
 execute = false
 dma = true
 
-[[override]]
+[[transient_override]]
 name = "a"
 address = 0x2003ffe0
 size = 32
@@ -101,7 +101,7 @@
 write = true
 execute = false
 
-[[override]]
+[[transient_override]]
 name = "b"
 address = 0x2003ffe0
 size = 32
diff --git a/drv/lpc55-update-server/src/main.rs b/drv/lpc55-update-server/src/main.rs
index 8f78b91c42..3cc2898d46 100644
--- a/drv/lpc55-update-server/src/main.rs
+++ b/drv/lpc55-update-server/src/main.rs
@@ -41,10 +41,6 @@
 #[link_section = ".bootstate"]
 static BOOTSTATE: MaybeUninit<[u8; 0x1000]> = MaybeUninit::uninit();
 
-#[used]
-#[link_section = ".transient_override"]
-static mut TRANSIENT_OVERRIDE: MaybeUninit<[u8; 32]> = MaybeUninit::uninit();
-
 #[derive(Copy, Clone, PartialEq)]
 enum UpdateState {
     NoUpdate,
@@ -1416,22 +1412,39 @@
     RotBootStateV2::load_from_addr(addr)
 }
 
+extern "C" {
+    // Symbols injected by the linker.
+    //
+    // This requires adding `extern-regions = ["transient_override"]` to the
+    // task config
+    pub static mut __REGION_TRANSIENT_OVERRIDE_BASE: [u32; 0];
+}
+
 fn set_transient_override(preference: [u8; 32]) {
-    // Safety: Data is consumed by Bootleby on next boot.
-    // There are no concurrent writers possible.
-    // Calling this function multiple times is ok.
-    // Bootleby is careful to vet contents before acting.
-    unsafe {
-        TRANSIENT_OVERRIDE.write(preference);
+    // SAFETY: populated by the linker, getting the address is fine
+    let override_addr =
+        unsafe { __REGION_TRANSIENT_OVERRIDE_BASE.as_ptr() } as *mut u8;
+    for (i, p) in preference.iter().enumerate() {
+        // SAFETY: this points to a valid region of RAM that is otherwise unused
+        // by Rust, so we can write to it.
+        unsafe {
+            core::ptr::write_volatile(override_addr.wrapping_add(i), *p);
+        }
     }
 }
 
 fn get_transient_override() -> [u8; 32] {
-    // Safety: Data is consumed by Bootleby on next boot.
-    // There are no concurrent writers possible.
-    // Bootleby consumes and resets TRANSIENT_OVERRIDE.
-    // The client may be verifying state set during update flows.
-    unsafe { TRANSIENT_OVERRIDE.assume_init() }
+    // SAFETY: populated by the linker, getting the address is fine
+    let override_addr =
+        unsafe { __REGION_TRANSIENT_OVERRIDE_BASE.as_ptr() } as *const u8;
+    let mut out = [0u8; 32];
+    for (i, p) in out.iter_mut().enumerate() {
+        // SAFETY: values are guaranteed to be initialized by stage0
+        unsafe {
+            *p = core::ptr::read_volatile(override_addr.wrapping_add(i));
+        }
+    }
+    out
 }
 
 // Preference constants are taken from bootleby:src/lib.rs

Copy link
Collaborator

@mkeeter mkeeter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(marking this as "Request changes" to discuss returning an error if the persistent slot matches the active slot)

stage0_handoff::RotSlot::B => SlotId::B,
};

// If the requested slot is the same as the active slot,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still seems non ideal for the developer user case. I brought up not clearing the transient setting without some indication before #2050 (comment) albeit in a slightly different place. I'd rather have an explicit way to clear the transient setting if we want that behavior. This would require MGS changes but I'd argue this is a gap in how the original API was designed.

Comment on lines +963 to +967
// If we did allow setting the active slot as the transient
// preference, then we get this confusing 2-boot scenario when
// pending persistent preference is also used:
// the next reset returns us to the original image and the 2nd
// reset has us running the new/alternate image.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This statement seems strange in contrast to "This is for the case where a previous update attempt has been abandoned or failed and left a transient preference that the new session does not want" from above. If we are running active slot A and alternate slot is B, an incomplete/aborted update will always be targeting B. Why do we have a persistent setting for B if the update was aborted?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants