Skip to content

Conversation

karencfv
Copy link
Contributor

Leaving this as draft for now, but I want address something first. The issue for this PR suggests that this check should only be implemented for SP and Host OS updates. Is there any harm in implementing it for all MGS driven updates? The implementation seems cleaner and much less complex this way. Thoughts? @davepacheco @jgallagher

Closes: #8482

@davepacheco
Copy link
Collaborator

That seems like a worthwhile simplification to me. We don't need to do this for RoT/bootloader updates because those won't take down the zone, but I don't think there's much harm.

@karencfv
Copy link
Contributor Author

Thanks! I'll finish up the PR with the current implementation then

@karencfv karencfv marked this pull request as ready for review September 29, 2025 05:04
Copy link
Contributor Author

@karencfv karencfv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is ready for review :) The actual functionality is probably about 100 lines of code. The rest is mostly testing output

// pending updates for components earlier in the update ordering
// than zones: RoT bootloader / RoT / SP / Host OS.
mgs_updates.is_empty()
mgs_updates.pending_mgs_updates.is_empty()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we only checking this inner field now? I would expect we still care about blocked MGS updates?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aha! I wrote this before merging the blocked updates PR. Thanks for catching this!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I was unclear: can we go back to checking mgs_updates.is_empty()? As written, if we add any new fields to mgs_updates, we'd have to remember to come back to this spot and decide whether we need to consider that field here too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I think I understand, but this feels like we have something backwards. Prior to this PR, mgs_updates.is_empty() meant "all MGS updates are complete", and after this PR, it means "all MGS updates are complete and there are no sleds with zones that are unsafe to shut down", so this spot changes to go back to only checking the subfields that mean "all MGS updates are complete". That feels kinda fragile. Is there a way to write the checks such that we only populate mgs_updates.unsafe_zones if the sled with an unsafe zone actually needed an MGS-driven update?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, maybe another way to phrase it: it seems like it would be nice to be able to say "we have an MGS update blocked on a zone that is unsafe to shut down"? At least conceptually, these checks could be a part of blocked_mgs_updates?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I can't really put the unsafe zones check within blocked_mg_updates because that check will block the planner from going any further. We only want to skip the sled boards with unsafe zones, not mark them as blocked. Or at least that's what I understood from the issue.

I'll take a closer look and see how I can modify the checks or the structure to make it clearer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only want to skip the sled boards with unsafe zones, not mark them as blocked. Or at least that's what I understood from the issue.

I think this is the disconnect that's bothering me. On this PR, we skip sled boards with unsafe zones, but in doing so we accumulate "stuff" into the MGS step report about those boards. That means we can no longer just call mgs_updates.is_empty() here, because we don't want to block updates just because we skipped some boards.

But: We only need to skip boards if it would have been possible to do an MGS update to them. If instead of skipping boards first (and accumulating all those skipped boards into our step report), we did something like:

  • build the list of boards to skip, but proceed with the rest of MGS planning as normal
  • if we find a board that actually needs an MGS update but is present in the "list to skip", then skip it and instead insert something into blocked_mgs_updates

then we'd be back in the state where mgs_updates.is_empty() still means "it's okay to proceed", right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense.

Also, looking at the code again, the unsafe_zones field in the report isn't used for absolutely anything other than printing out some information that isn't super necessary. We can do without that field altogether

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah no wait! I put that field in because of this comment! #9044 (comment) 😄

Let's see how I can fit all of this in together

Comment on lines +29 to +34
# The above blueprint includes a pending MGS update, which we should delete
# (we want to start from a fresh state).
blueprint-edit latest delete-sp-update serial0
# Also set the Omicron config for all sleds to reflect the
# corresponding image sources.
sled-set serial0 omicron-config latest
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we have a similar command to easily update all the hubris images of a given sled (or baseboard?) to match the latest blueprint like this one that we have for Omicron configs? Having to manually delete the staged SP update is kinda confusing.

Copy link
Contributor Author

@karencfv karencfv Oct 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps it's my unfamiliarity with this part of the codebase, but I couldn't find an easy quick way to create this command as part of this PR. Do you mind if this is done as a follow-up PR? I want to get this merged soon so it can be tested before the release

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, totally fine.


# Attempt to upgrade one RoT bootloader. We should see two zones skipped and
# a safe one to update
blueprint-plan latest latest
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why doesn't this planning run also attempt to restore internal DNS redundancy? Is it because we're trying to do MGS updates, and therefore we don't do any zone additions?

I'm a little worried we might have a sort of planning deadlock here:

  • We can't update 2 of the sleds' hubris images because they're hosting internal DNS, and we're at 2 of 3 required internal DNS zones
  • We can't add a third internal DNS zone because 2 of the sleds have out-of-date hubris images

Copy link
Contributor Author

@karencfv karencfv Oct 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not as familiar with the code to expunge and restore zones. That said, as far as I know the only thing gated by unfinished MGS driven updates are zone updates. Not restoring zone redundancy.

Looking at the cmds-expunge-newly-added-internal-dns.txt test I believe that to get reconfigurator-cli to attempt to restore the zone I need to run blueprint-edit latest mark-for-cleanup {EXPUNGED_ZONE_ID} after expunging the zone.

I added this command locally to the test here and this is the output I get

> # Mark for cleanup
> blueprint-edit latest mark-for-cleanup 99e2f30b-3174-40bf-a78a-90da8abba8ca
blueprint df06bb57-ad42-4431-9206-abff322896c7 created from latest blueprint (af934083-59b5-4bf6-8966-6fb5292c29e1): marked zone 99e2f30b-3174-40bf-a78a-90da8abba8ca ready for cleanup


> blueprint-plan latest latest
INFO performed noop zone image source checks on sled, sled_id: 2b8f0cb3-0295-4b3c-bc58-4fe88b57112c, num_total: 8, num_already_artifact: 8, num_eligible: 0, num_ineligible: 0
INFO BootPartitionDetails inventory hash not found in TUF repo, ignoring for noop checks, sled_id: 2b8f0cb3-0295-4b3c-bc58-4fe88b57112c, slot: a, expected_hash: 0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a
INFO BootPartitionDetails inventory hash not found in TUF repo, ignoring for noop checks, sled_id: 2b8f0cb3-0295-4b3c-bc58-4fe88b57112c, slot: b, expected_hash: 0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b
INFO performed noop zone image source checks on sled, sled_id: 98e6b7c2-2efa-41ca-b20a-0a4d61102fe6, num_total: 8, num_already_artifact: 8, num_eligible: 0, num_ineligible: 0
INFO BootPartitionDetails inventory hash not found in TUF repo, ignoring for noop checks, sled_id: 98e6b7c2-2efa-41ca-b20a-0a4d61102fe6, slot: a, expected_hash: 0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a
INFO BootPartitionDetails inventory hash not found in TUF repo, ignoring for noop checks, sled_id: 98e6b7c2-2efa-41ca-b20a-0a4d61102fe6, slot: b, expected_hash: 0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b
INFO performed noop zone image source checks on sled, sled_id: d81c6a84-79b8-4958-ae41-ea46c9b19763, num_total: 8, num_already_artifact: 8, num_eligible: 0, num_ineligible: 0
INFO BootPartitionDetails inventory hash not found in TUF repo, ignoring for noop checks, sled_id: d81c6a84-79b8-4958-ae41-ea46c9b19763, slot: a, expected_hash: 0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a0a
INFO BootPartitionDetails inventory hash not found in TUF repo, ignoring for noop checks, sled_id: d81c6a84-79b8-4958-ae41-ea46c9b19763, slot: b, expected_hash: 0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b
INFO skipping board for MGS-driven update, serial_number: serial0, part_number: model0 due to unsafe zones: internal_dns
INFO skipping board for MGS-driven update, serial_number: serial2, part_number: model2 due to unsafe zones: internal_dns
INFO configuring MGS-driven update, artifact_version: 1.0.0, artifact_hash: 5b0f601b1fbb8674db9c751a02f8b14f8e6d4e8470f4f7b686fecb2c49ec11f9, expected_stage0_next_version: NoValidVersion, expected_stage0_version: 0.0.1, component: rot_bootloader, sp_slot: 1, sp_type: Sled, serial_number: serial1, part_number: model1
INFO ran out of boards for MGS-driven update
generated blueprint 7f976e0d-d2a5-4eeb-9e82-c82bc2824aba based on parent blueprint df06bb57-ad42-4431-9206-abff322896c7
blueprint source: planner with report:
planning report:
* skipping noop zone image source check on sled 2b8f0cb3-0295-4b3c-bc58-4fe88b57112c: all 8 zones are already from artifacts
* skipping noop zone image source check on sled 98e6b7c2-2efa-41ca-b20a-0a4d61102fe6: all 8 zones are already from artifacts
* skipping noop zone image source check on sled d81c6a84-79b8-4958-ae41-ea46c9b19763: all 8 zones are already from artifacts
* 1 pending MGS update:
  * model1:serial1: RotBootloader(PendingMgsUpdateRotBootloaderDetails { expected_stage0_version: ArtifactVersion("0.0.1"), expected_stage0_next_version: NoValidVersion })
* 2 zones not ready to shut down safely for an MGS driven update:
  * zone 427ec88f-f467-42fa-9bbb-66a91a36103c: only 2/2 internal DNS zones are synchronized; require at least 3
  * zone ea5b4030-b52f-44b2-8d70-45f15f987d01: only 2/2 internal DNS zones are synchronized; require at least 3
* discretionary zones placed:
  * internal_dns zone on sled 2b8f0cb3-0295-4b3c-bc58-4fe88b57112c from source install dataset
* zone updates waiting on discretionary zones
* waiting to update top-level nexus_generation: some non-Nexus zone are not yet updated



> blueprint-diff latest
from: blueprint df06bb57-ad42-4431-9206-abff322896c7
to:   blueprint 7f976e0d-d2a5-4eeb-9e82-c82bc2824aba

 MODIFIED SLEDS:

  sled 2b8f0cb3-0295-4b3c-bc58-4fe88b57112c (active, config generation 4 -> 5):

    host phase 2 contents:
    ------------------------
    slot   boot image source
    ------------------------
    A      current contents 
    B      current contents 


    physical disks:
    ------------------------------------------------------------------------------------
    vendor        model        serial                                        disposition
    ------------------------------------------------------------------------------------
    fake-vendor   fake-model   serial-727522a7-934f-494d-b5b3-160968e74463   in service 
    fake-vendor   fake-model   serial-72c59873-31ff-4e36-8d76-ff834009349a   in service 
    fake-vendor   fake-model   serial-b5fd5bc1-099e-4e77-8028-a9793c11f43b   in service 


    datasets:
    -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    dataset name                                                                                                   dataset id                             disposition   quota     reservation   compression
    -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    oxp_727522a7-934f-494d-b5b3-160968e74463/crucible                                                              2f204c50-a327-479c-8852-f53ec7a19c1f   in service    none      none          off        
    oxp_72c59873-31ff-4e36-8d76-ff834009349a/crucible                                                              78f34ce7-42f1-41da-995f-318f32054ad2   in service    none      none          off        
    oxp_b5fd5bc1-099e-4e77-8028-a9793c11f43b/crucible                                                              1640adb6-70bf-44cf-b05c-bff6dd300cf3   in service    none      none          off        
    oxp_727522a7-934f-494d-b5b3-160968e74463/crypt/clickhouse                                                      841d5648-05f0-47b0-b446-92f6b60fe9a6   in service    none      none          off        
    oxp_727522a7-934f-494d-b5b3-160968e74463/crypt/external_dns                                                    8e0bd2bd-23b7-4bc6-9e73-c4d4ebc0bc8c   in service    none      none          off        
    oxp_727522a7-934f-494d-b5b3-160968e74463/crypt/internal_dns                                                    2ad1875a-92ac-472f-8c26-593309f0e4da   expunged      none      none          off        
    oxp_727522a7-934f-494d-b5b3-160968e74463/crypt/zone                                                            4829f422-aa31-41a8-ab73-95684ff1ef48   in service    none      none          off        
    oxp_72c59873-31ff-4e36-8d76-ff834009349a/crypt/zone                                                            775f9207-c42d-4af2-9186-27ffef67735e   in service    none      none          off        
    oxp_b5fd5bc1-099e-4e77-8028-a9793c11f43b/crypt/zone                                                            3b66453b-7148-4c1b-84a9-499e43290ab4   in service    none      none          off        
    oxp_727522a7-934f-494d-b5b3-160968e74463/crypt/zone/oxz_clickhouse_353b3b65-20f7-48c3-88f7-495bd5d31545        b46de15d-33e7-4cd0-aa7c-e7be2a61e71b   in service    none      none          off        
    oxp_b5fd5bc1-099e-4e77-8028-a9793c11f43b/crypt/zone/oxz_crucible_86a22a56-0168-453d-9df1-cb2a7c64b5d3          3e0d6188-c503-49cf-a441-fa7df40ceb43   in service    none      none          off        
    oxp_727522a7-934f-494d-b5b3-160968e74463/crypt/zone/oxz_crucible_bd354eef-d8a6-4165-9124-283fb5e46d77          5ae11c7e-08fa-4d78-a4ea-14b4a9a10241   in service    none      none          off        
    oxp_72c59873-31ff-4e36-8d76-ff834009349a/crypt/zone/oxz_crucible_e2fdefe7-95b2-4fd2-ae37-56929a06d58c          b8f2a09f-8bd2-4418-872b-a4457a3f958c   in service    none      none          off        
    oxp_727522a7-934f-494d-b5b3-160968e74463/crypt/zone/oxz_crucible_pantry_ad6a3a03-8d0f-4504-99a4-cbf73d69b973   49f8fbb6-5bac-4609-907f-6e3dfc206059   in service    none      none          off        
    oxp_727522a7-934f-494d-b5b3-160968e74463/crypt/zone/oxz_external_dns_6c3ae381-04f7-41ea-b0ac-74db387dbc3a      8c4fa711-1d5d-4e93-85f0-d17bff47b063   in service    none      none          off        
    oxp_727522a7-934f-494d-b5b3-160968e74463/crypt/zone/oxz_internal_dns_99e2f30b-3174-40bf-a78a-90da8abba8ca      c31623de-c19b-4615-9f1d-5e1daa5d3bda   expunged      none      none          off        
    oxp_727522a7-934f-494d-b5b3-160968e74463/crypt/zone/oxz_nexus_466a9f29-62bf-4e63-924a-b9efdb86afec             3560dd69-3b23-4c69-807d-d673104cfc68   in service    none      none          off        
    oxp_727522a7-934f-494d-b5b3-160968e74463/crypt/zone/oxz_ntp_62620961-fc4a-481e-968b-f5acbac0dc63               09b9cc9b-3426-470b-a7bc-538f82dede03   in service    none      none          off        
    oxp_727522a7-934f-494d-b5b3-160968e74463/crypt/debug                                                           93957ca0-9ed1-4e7b-8c34-2ce07a69541c   in service    100 GiB   none          gzip-9     
    oxp_72c59873-31ff-4e36-8d76-ff834009349a/crypt/debug                                                           2db6b7c1-0f46-4ced-a3ad-48872793360e   in service    100 GiB   none          gzip-9     
    oxp_b5fd5bc1-099e-4e77-8028-a9793c11f43b/crypt/debug                                                           318fae85-abcb-4259-b1b6-ac96d193f7b7   in service    100 GiB   none          gzip-9     
+   oxp_727522a7-934f-494d-b5b3-160968e74463/crypt/internal_dns                                                    0e04c3b5-47fc-412c-b352-dbf09a6bfac8   in service    none      none          off        
+   oxp_727522a7-934f-494d-b5b3-160968e74463/crypt/zone/oxz_internal_dns_ff6dceb4-84c5-4e0e-afe1-3c884cb656cb      71d27157-3ff7-4806-be2e-1ad6d6b5bfa2   in service    none      none          off        


    omicron zones:
    ------------------------------------------------------------------------------------------------------------------------
    zone type         zone id                                image source              disposition    underlay IP           
    ------------------------------------------------------------------------------------------------------------------------
    clickhouse        353b3b65-20f7-48c3-88f7-495bd5d31545   artifact: version 0.0.1   in service     fd00:1122:3344:102::23
    crucible          86a22a56-0168-453d-9df1-cb2a7c64b5d3   artifact: version 0.0.1   in service     fd00:1122:3344:102::28
    crucible          bd354eef-d8a6-4165-9124-283fb5e46d77   artifact: version 0.0.1   in service     fd00:1122:3344:102::26
    crucible          e2fdefe7-95b2-4fd2-ae37-56929a06d58c   artifact: version 0.0.1   in service     fd00:1122:3344:102::27
    crucible_pantry   ad6a3a03-8d0f-4504-99a4-cbf73d69b973   artifact: version 0.0.1   in service     fd00:1122:3344:102::25
    external_dns      6c3ae381-04f7-41ea-b0ac-74db387dbc3a   artifact: version 0.0.1   in service     fd00:1122:3344:102::24
    internal_dns      99e2f30b-3174-40bf-a78a-90da8abba8ca   artifact: version 0.0.1   expunged ✓     fd00:1122:3344:1::1   
    internal_ntp      62620961-fc4a-481e-968b-f5acbac0dc63   artifact: version 0.0.1   in service     fd00:1122:3344:102::21
    nexus             466a9f29-62bf-4e63-924a-b9efdb86afec   artifact: version 0.0.1   in service     fd00:1122:3344:102::22
+   internal_dns      ff6dceb4-84c5-4e0e-afe1-3c884cb656cb   install dataset           in service     fd00:1122:3344:1::1   


 COCKROACHDB SETTINGS:
    state fingerprint:::::::::::::::::   (none) (unchanged)
    cluster.preserve_downgrade_option:   (do not modify) (unchanged)

 METADATA:
    internal DNS version:::   1 (unchanged)
    external DNS version:::   1 (unchanged)
    target release min gen:   1 (unchanged)
    nexus gen::::::::::::::   1 (unchanged)

 OXIMETER SETTINGS:
    generation:   1 (unchanged)
    read from::   SingleNode (unchanged)

 PENDING MGS UPDATES:

    Pending MGS-managed updates (all baseboards):
    --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    sp_type   slot   part_number   serial_number   artifact_hash                                                      artifact_version   details                                                                                                                                                
    --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+   sled      1      model1        serial1         5b0f601b1fbb8674db9c751a02f8b14f8e6d4e8470f4f7b686fecb2c49ec11f9   1.0.0              RotBootloader(PendingMgsUpdateRotBootloaderDetails { expected_stage0_version: ArtifactVersion("0.0.1"), expected_stage0_next_version: NoValidVersion })


internal DNS:
* DNS zone: "control-plane.oxide.internal": 
*   name: @                                                  (records: 2 -> 3)
-       NS   ns1.control-plane.oxide.internal
-       NS   ns2.control-plane.oxide.internal
+       NS   ns1.control-plane.oxide.internal
+       NS   ns2.control-plane.oxide.internal
+       NS   ns3.control-plane.oxide.internal
*   name: _nameservice._tcp                                  (records: 2 -> 3)
-       SRV  port  5353 427ec88f-f467-42fa-9bbb-66a91a36103c.host.control-plane.oxide.internal
-       SRV  port  5353 ea5b4030-b52f-44b2-8d70-45f15f987d01.host.control-plane.oxide.internal
+       SRV  port  5353 427ec88f-f467-42fa-9bbb-66a91a36103c.host.control-plane.oxide.internal
+       SRV  port  5353 ea5b4030-b52f-44b2-8d70-45f15f987d01.host.control-plane.oxide.internal
+       SRV  port  5353 ff6dceb4-84c5-4e0e-afe1-3c884cb656cb.host.control-plane.oxide.internal
+   name: ff6dceb4-84c5-4e0e-afe1-3c884cb656cb.host          (records: 1)
+       AAAA fd00:1122:3344:1::1
*   name: ns1                                                (records: 1 -> 1)
-       AAAA fd00:1122:3344:2::1
+       AAAA fd00:1122:3344:1::1
*   name: ns2                                                (records: 1 -> 1)
-       AAAA fd00:1122:3344:3::1
+       AAAA fd00:1122:3344:2::1
+   name: ns3                                                (records: 1)
+       AAAA fd00:1122:3344:3::1
    unchanged names: 46 (records: 58)

It seems to me like everything is working as expected. I don't think I want to add the mark-for-cleanup command here because I want the zone to stay busted for my testing purposes. But I can add a note to the tests to explain why the zone isn't fixing itself if that helps

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahhh yeah I forgot about needing to see it cleaned up. Thanks, the comment is great.

Copy link
Contributor Author

@karencfv karencfv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking the time to review @jgallagher! I think I've addressed all of your comments

Comment on lines +29 to +34
# The above blueprint includes a pending MGS update, which we should delete
# (we want to start from a fresh state).
blueprint-edit latest delete-sp-update serial0
# Also set the Omicron config for all sleds to reflect the
# corresponding image sources.
sled-set serial0 omicron-config latest
Copy link
Contributor Author

@karencfv karencfv Oct 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps it's my unfamiliarity with this part of the codebase, but I couldn't find an easy quick way to create this command as part of this PR. Do you mind if this is done as a follow-up PR? I want to get this merged soon so it can be tested before the release

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SP, host OS updates should use per-zone safety checks
3 participants