[sled-agent-config-reconciler] Flesh out main reconciliation task #8162

Merged
merged 3 commits into main from john/sled-agent-config-reconciler-main-task on May 16, 2025

Conversation

jgallagher
Contributor

The primary change here is to implement ReconcilerTask::run(). There are quite a few ancillary changes (new helper methods, etc.).

I didn't add any tests of this task explicitly; it's mostly straight-line code that calls into subsystems that themselves have tests. Open to feedback on whether you think this is okay or whether it needs some explicit tests of its own.

@@ -109,64 +130,55 @@ impl OmicronZones {
// Filter desired zones down to just those that we need to stop. See
// [`ZoneState`] for more discussion of why we're willing (or unwilling)
// to stop zones in various current states.
let mut zones_to_shut_down = self
Contributor Author

This change (which is mostly whitespace) and the almost-identical change in start_zones_if_needed are both to work around a rustc bug in the rust-lang/rust#110338 family. peek() seems to trigger it similarly to flatten().

Comment on lines +355 to +365
// If we can't contact the dataset task, reuse the result from
// our previous attempt. This should still be correct (until we
// start deleting datasets, at which point we'll need a more
// holistic tracker for dataset status like we already have for
// disks and zones).
self.reconciler_result_tx
.borrow()
.latest_result
.as_ref()
.map(|inner| inner.datasets.clone())
.unwrap_or_else(DatasetEnsureResult::default)
Collaborator

Having an "uncommon, split pathway" is kind of a yellow flag to me. Why would we be unable to contact the dataset task? What would be wrong with "throwing an error, trying later" in that case?

Contributor Author

Yep, fair enough, and I wasn't super thrilled with this when I wrote it. I'm not sure there's a better alternative though. Largely thinking out loud:

We can only fail to contact the dataset task if try_send'ing this request to it fails because the channel is full. We only send at most one request at a time, so that would mean the channel is full of other callers' requests (at the moment, other callers include anything related to support bundles and any request for inventory). I think that could only happen with one of:

  • The dataset task is still processing requests but has just gotten behind; we have too many concurrent callers bursting at once, so need to increase the channel size.
  • The dataset task is unhealthy - it's either stuck on some request, or it can't keep up with the sustained rate of requests it's receiving (neither of which can be addressed by increasing the channel size). A minimal sketch of this try_send path is shown just below.
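
As a rough illustration of that failure mode (not the actual sled-agent code; the channel and the DatasetRequest type here are hypothetical stand-ins for the real request plumbing), try_send on a bounded tokio mpsc channel only fails when the channel is full or the receiving task has exited:

use tokio::sync::mpsc;

// Hypothetical stand-in for the dataset task's request type.
struct DatasetRequest;

fn try_send_to_dataset_task(tx: &mpsc::Sender<DatasetRequest>) {
    match tx.try_send(DatasetRequest) {
        Ok(()) => {
            // Request queued; the dataset task will service it, and the
            // caller would await its reply.
        }
        Err(mpsc::error::TrySendError::Full(_)) => {
            // The bounded channel is full of other callers' requests: either
            // a transient burst (a larger channel would help) or the task is
            // stuck / falling behind (a larger channel would not help).
        }
        Err(mpsc::error::TrySendError::Closed(_)) => {
            // The dataset task has exited entirely.
        }
    }
}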

The problem the reconciler has is that the result of this operation will be used by the rest of sled-agent in various ways, including "what zone root datasets are mounted" (e.g., for instance placement) and "what debug datasets are mounted" (e.g., for zone bundle placement). If we hit this error condition with the dataset task, I think we have two options w.r.t. what to tell sled-agent about the results of this reconciliation attempt:

  • Do what this PR does, and assume whatever datasets were mounted the last time we ran are still mounted. If we can't make this call, we're still using the most recent information we did get from the dataset task. That might be out of date if (e.g.) a disk has been removed, but that's an inherent race that's always around for anyone asking "what datasets are available for me to use".
  • Report that no datasets are available until we're able to talk to the dataset task again. This seems overly conservative to me, but I could be talked out of it!

The driving motivator for the separate dataset task is that the reconciler is not the only one doing ZFS operations, and it seems bad to let different ZFS operations proceed concurrently (e.g., taking inventory simultaneously with a datasets_ensure() running seems like it would lead to weird results, and similarly for the various support bundle operations). This would be a bigger change, but I wonder if we could do something like:

  1. Change support bundles to not need to interact with datasets at the level they do today. (E.g., the thing we talked about last week where there's a DatasetConfig for a DatasetKind::SupportBundles that lives underneath the debug dataset, and all support bundles are stored there. Then we can drop all of the nested dataset operations that support bundles currently use, I think?)
  2. Change inventory to report dataset information that's collected by the reconciler itself instead of by issuing ZFS commands directly. This seems a little dicier, since we'd be reporting an in-memory view that might be out of date, but I guess that's true of many of the inventory fields anyway.

I think if we did both of those, then the only ZFS operations remaining would be issued by the reconciler itself, so we could drop the dataset task entirely and let the reconciler do those operations directly. That would eliminate this error case and avoid all the questions above.

Collaborator

I'm not opposed to the current structure, where we have a separate "Dataset task" responsible for fielding these requests, and which has a bounded channel to be used for communication.

I think my main opposition here is the decision to proceed in this case where we have explicitly failed to call "ensure_datasets". Yes, we grab the old value of datasets, but as an example, we aren't actually passing that into start_zones_if_needed! So, from my read of the callstack:

  • If we fail to ensure datasets...
  • we'll use the "old value" of datasets...
  • ... and we'll call start_zones_if_needed anyway
  • ... which calls start_omicron_zone

I don't think this ever checks that the datasets actually exist! I think it just tries to proceed with starting the zone.

So, as an example:

  • Suppose you asked for a transient zone filesystem dataset with an explicit quota
  • We fail to create it
  • We call "start_omicron_zone" anyway
  • We'll now start the zone without previously creating the dataset, which (I believe) creates a new dataset at zone install time without any quota

Report that no datasets are available until we're able to talk to the dataset task again. This seems overly conservative to me, but I could be talked out of it!

Until we have some way of saying "ENOUGH datasets exist to start the zones we want to start", I don't think this is too conservative!

Contributor Author

I agree that you're describing a real problem that needs to be fixed, but disagree that changing how we handle failing to call ensure_datasets makes a material difference. Two reasons:

  1. I think the crux of the problem you're describing is

we aren't actually passing that into start_zones_if_needed

If we succeed in calling ensure_datasets, but every single dataset we requested failed to be ensured for some reason, we'd still proceed into start_zones_if_needed and never check that the datasets zones depend on were actually ensured. This is definitely wrong! But can happen on a successful call to ensure_datasets just as easily as a "failed to call ensure_datasets and reused an old result".

  2. Everything you're describing is the way sled-agent works on main today. On a cold boot, we never even attempt to read the ledgered datasets config, and therefore never ensure any of the datasets needed by zones, yet we still proceed to start zones.

I definitely want to fix this, but it requires more invasive changes in ServiceManager. I left this comment as a breadcrumb:

// TODO-cleanup This is implemented by
// `ServiceManager::start_omicron_zone()`, which does too much; we should
// absorb some of its functionality and shrink this interface. We definitely
// should not need to pass the full list of U2 zpools.

I think what ought to happen is:

  • start_zones_if_needed takes the current set of dataset ensure results (or the most recent one, if we failed to contact the dataset task; I think this is still okay)
  • start_zones_if_needed checks the result of ensuring the zone's filesystem_pool and (if relevant) durable dataset, and refuses to start the zone if the dataset failed to ensure
  • if the dataset(s) did ensure successfully, it calls out to ServiceManager, but no longer passes all_u2_pools; instead, ServiceManager should use the specific datasets the zone config specifies, rather than choosing on its own and/or creating them with no quota if they don't exist

but that seemed like a change that could come after all this work landed, since it'd be new behavior / fixing preexisting bugs.
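
A rough sketch of the kind of check described above (all names here are hypothetical and simplified; the real sled-agent config and ensure-result types are richer): before starting a zone, verify that the datasets it depends on were actually ensured.

use std::collections::BTreeMap;

// Hypothetical, simplified types standing in for the real ones.
#[derive(Clone, Copy, PartialEq)]
enum DatasetStatus {
    Ensured,
    Failed,
}

struct ZoneDatasetDeps {
    filesystem_pool_dataset: String,
    durable_dataset: Option<String>,
}

// Refuse to start a zone unless every dataset it depends on was ensured in
// the most recent dataset ensure result (possibly a reused one, per the
// discussion above).
fn zone_datasets_ready(
    deps: &ZoneDatasetDeps,
    ensured: &BTreeMap<String, DatasetStatus>,
) -> bool {
    let filesystem_ok = ensured.get(&deps.filesystem_pool_dataset)
        == Some(&DatasetStatus::Ensured);
    let durable_ok = match &deps.durable_dataset {
        Some(name) => ensured.get(name) == Some(&DatasetStatus::Ensured),
        None => true,
    };
    filesystem_ok && durable_ok
}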


A final thought on the specific bit here in the reconciler: we could change this datasets_ensure() call to use send instead of try_send, and just block until the dataset task can service the request. I'm not sure if that's better or worse in the "dataset task gets wedged" case. It would get rid of this error branch entirely, but would mean the dataset task getting wedged also causes the reconciler to get wedged.
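
For contrast, a minimal sketch of that blocking alternative (again with a hypothetical DatasetRequest type, not the real API): the await only completes once the channel has capacity, so a wedged dataset task would wedge the caller too.

use tokio::sync::mpsc;

// Hypothetical stand-in for the real request type.
struct DatasetRequest;

// Wait for channel capacity instead of failing fast with try_send. If the
// dataset task never drains the channel, this await never completes.
async fn send_and_wait(
    tx: &mpsc::Sender<DatasetRequest>,
) -> Result<(), mpsc::error::SendError<DatasetRequest>> {
    tx.send(DatasetRequest).await
}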

Contributor

Based on the following, I think that the code is doing the right thing as is.

If we succeed in calling ensure_datasets, but every single dataset we requested failed to be ensured for some reason, we'd still proceed into start_zones_if_needed and never check that the datasets zones depend on were actually ensured. This is definitely wrong! But can happen on a successful call to ensure_datasets just as easily as a "failed to call ensure_datasets and reused an old result".

It should always be fine to operate on stale data, and it seems that start_zones_if_needed should be corrected so that it does not do the wrong thing. That can come in a follow-up PR, and I think @jgallagher's suggestions make sense.

I'm largely against using a blocking send, as this can have unintended effects such as deadlock. However, it would be nice if we had a proper alert mechanism, so that if we failed to do something like talk to the dataset task for a few attempts we could inform an operator rather than just logging.

Contributor Author

The return value of datasets_ensure is a DatasetEnsureResult. 😅 My question was from the context of "what if we can't call datasets_ensure at all", in which case we don't get a DatasetEnsureResult (unless we're willing to clone the previous one, as the code currently does).

The return value of datasets_ensure is Result<DatasetEnsureResult, DatasetTaskError>. If the datasets task is not responding, we cannot access that result, and we'll see a DatasetTaskError. That's what I meant by "maybe we have an Arc that allows a client to access the data without accessing the task".

Isn't this exactly equivalent to what I'm doing on the PR as written? If we get a DatasetTaskError, we couldn't talk to the task, so we clone the last result we got from it, which will be identical to the DatasetEnsureResult it's currently holding (since we're the only ones that can call datasets_ensure).

Collaborator

Also, fwiw, I'm okay with merging this PR as-is, as long as we have an issue to track this behavior. I think we're on the same page on the goals in the limit, anyway.

Contributor Author

👍 Filed as #8173 (not sure that has the best title, but I wasn't sure how to summarize the issue succinctly - feel free to edit it!)

Collaborator

Isn't this exactly equivalent to what I'm doing on the PR as written? If we get a DatasetTaskError, we couldn't talk to the task, so we clone the last result we got from it, which will be identical to the DatasetEnsureResult it's currently holding (since we're the only ones that can call datasets_ensure).

If we're making the assumption that no one else is calling datasets_ensure, and that we have seen a result, -- the .unwrap_or_else(DatasetEnsureResult::default) line implies we'd claim this is "no datasets" if we haven't been able to get a call in successfully -- then yeah, this seems true.

I think this just means the case you mentioned earlier -- e.g., "a bunch of other requests go to the dataset task, it's full, and we fail to try_send to it", can mean "we'll launch zones anyway, even though we failed to initialize any datasets for them".

But as you say, we don't care about this case for now. Sorry for the distraction. Feel free to mark this resolved.

Contributor Author

If we're making the assumption that no one else is calling datasets_ensure, and that we have seen a result, -- the .unwrap_or_else(DatasetEnsureResult::default) line implies we'd claim this is "no datasets" if we haven't been able to get a call in successfully -- then yeah, this seems true.

I think it's okay to make that assumption; datasets_ensure is not public. I think the behavior is still the same even in the unwrap_or_else case: if we've never called datasets_ensure, then the DatasetEnsureResult held by the dataset task is itself still the empty default (because it's never ensured any datasets!):

datasets: DatasetEnsureResult::default(),

jgallagher merged commit d24ba55 into main on May 16, 2025
17 checks passed
jgallagher deleted the john/sled-agent-config-reconciler-main-task branch on May 16, 2025 at 14:28