[sled-agent-config-reconciler] Flesh out main reconciliation task #8162
```diff
@@ -109,64 +130,55 @@ impl OmicronZones {
         // Filter desired zones down to just those that we need to stop. See
         // [`ZoneState`] for more discussion of why we're willing (or unwilling)
         // to stop zones in various current states.
         let mut zones_to_shut_down = self
```
This change (which is mostly whitespace) and the almost-identical change in `start_zones_if_needed` are both to work around a rustc bug in the rust-lang/rust#110338 family. `peek()` seems to trigger it similarly to `flatten()`.
```rust
// If we can't contact the dataset task, reuse the result from
// our previous attempt. This should still be correct (until we
// start deleting datasets, at which point we'll need a more
// holistic tracker for dataset status like we already have for
// disks and zones).
self.reconciler_result_tx
    .borrow()
    .latest_result
    .as_ref()
    .map(|inner| inner.datasets.clone())
    .unwrap_or_else(DatasetEnsureResult::default)
```
Having an "uncommon, split pathway" is kinda a yellow-flag to me. Why would we be unable to contact the dataset task? What would be wrong with "throwing an error, trying later" in that case?
Yep, fair enough, and I wasn't super thrilled with this when I wrote it. I'm not sure there's a better alternative though. Largely thinking out loud:
We can only fail to contact the dataset task if `try_send`'ing this request to it fails because the channel is full. We only send at most one request at a time, so that would mean the channel is full of other callers' requests (at the moment, other callers include anything related to support bundles and any request for inventory). I think that could only happen with one of:
- The dataset task is still processing requests but has just gotten behind; we have too many concurrent callers bursting at once, so need to increase the channel size.
- The dataset task is unhealthy - it's either stuck on some request, or it can't keep up with the sustained rate of requests it's receiving (neither of which can be addressed by increasing the channel size).
The problem the reconciler has is that the result of this operation will be used by the rest of sled-agent in various ways, including "what zone root datasets are mounted" (e.g., for instance placement) and "what debug datasets are mounted" (e.g., for zone bundle placement). If we hit this error condition with the dataset task, I think we have two options w.r.t. what to tell sled-agent about the results of this reconciliation attempt:
- Do what this PR does, and assume whatever datasets were mounted the last time we ran are still mounted. If we can't make this call, we're still using the most recent information we did get from the dataset task. That might be out of date if (e.g.) a disk has been removed, but that's an inherent race that's always around for anyone asking "what datasets are available for me to use". (A sketch of this fallback pattern follows this list.)
- Report that no datasets are available until we're able to talk to the dataset task again. This seems overly conservative to me, but I could be talked out of it!
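To make the first option concrete, here is a minimal sketch of that fallback pattern. The request/result types and channel wiring below are hypothetical stand-ins, not the PR's actual types; it assumes a bounded tokio `mpsc` channel to the dataset task and a `watch` channel carrying the last published reconciler result:

```rust
use tokio::sync::{mpsc, oneshot, watch};

// Hypothetical stand-ins for the real reconciler/dataset-task types.
#[derive(Clone, Default, Debug)]
struct DatasetEnsureResult(Vec<String>);

struct DatasetRequest {
    reply: oneshot::Sender<DatasetEnsureResult>,
}

#[derive(Default)]
struct LatestReconcilerResult {
    datasets: DatasetEnsureResult,
}

/// Ask the dataset task to ensure datasets. If its bounded channel is full
/// (other callers bursting, or the task is wedged), fall back to the most
/// recently published result instead of reporting an error.
async fn datasets_ensure_or_last(
    req_tx: &mpsc::Sender<DatasetRequest>,
    latest_rx: &watch::Receiver<LatestReconcilerResult>,
) -> DatasetEnsureResult {
    let (reply_tx, reply_rx) = oneshot::channel();
    match req_tx.try_send(DatasetRequest { reply: reply_tx }) {
        Ok(()) => match reply_rx.await {
            Ok(result) => result,
            // The task dropped our request; reuse the previous result.
            Err(_) => latest_rx.borrow().datasets.clone(),
        },
        // Channel full (or closed): reuse the previous result.
        Err(_) => latest_rx.borrow().datasets.clone(),
    }
}
```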
The driving motivator for the separate dataset task is that the reconciler is not the only one doing ZFS operations, and it seems bad to let different ZFS operations proceed concurrently (e.g., taking inventory simultaneously with a `datasets_ensure()` running seems like it would lead to weird results, and similarly for the various support bundle operations). This would be a bigger change, but I wonder if we could do something like:

- Change support bundles to not need to interact with datasets at the level they do today. (E.g., the thing we talked about last week where there's a `DatasetConfig` for a `DatasetKind::SupportBundles` that lives underneath the debug dataset, and all support bundles are stored there. Then we can drop all of the nested dataset operations that support bundles currently use, I think?)
- Change inventory to report dataset information that's collected by the reconciler itself instead of by issuing ZFS commands directly. This seems a little dicier, since we'd be reporting an in-memory view that might be out of date, but I guess that's true of many of the inventory fields anyway.
I think if we did both of those, then the only ZFS operations remaining would be issued by the reconciler itself, so we could drop the dataset task entirely and let the reconciler do those operations directly. That would eliminate this error case and avoid all the questions above.
I'm not opposed to the current structure, where we have a separate "Dataset task" responsible for fielding these requests, and which has a bounded channel to be used for communication.
I think my main opposition here is the decision to proceed in this case where we have explicitly failed to call `ensure_datasets`. Yes, we grab the old value of datasets, but as an example, we aren't actually passing that into `start_zones_if_needed`! So, from my read of the callstack:

- If we fail to ensure datasets...
- we'll use the "old value" of datasets...
- ... and we'll call `start_zones_if_needed` anyway
- ... which calls `start_omicron_zone`

I don't think this ever checks that the datasets actually exist! I think it just tries to proceed with starting the zone.
So, as an example:
- Suppose you asked for a transient zone filesystem dataset with an explicit quota
- We fail to create it
- We call "start_omicron_zone" anyway
- We'll now start the zone without previously creating the dataset, which (I believe) creates a new dataset at zone install time without any quota
> Report that no datasets are available until we're able to talk to the dataset task again. This seems overly conservative to me, but I could be talked out of it!
Until we have some way of saying "ENOUGH datasets exist to start the zones we want to start", I don't think this is too conservative!
I agree that you're describing a real problem that needs to be fixed, but disagree that changing how we handle failing to call `ensure_datasets` makes a material difference. Two reasons:

- I think the crux of the problem you're describing is

  > we aren't actually passing that into `start_zones_if_needed`

  If we succeed in calling `ensure_datasets`, but every single dataset we requested failed to be ensured for some reason, we'd still proceed into `start_zones_if_needed` and never check that the datasets zones depend on were actually ensured. This is definitely wrong! But it can happen on a successful call to `ensure_datasets` just as easily as a "failed to call `ensure_datasets` and reused an old result".
- Everything you're describing is the way sled-agent works on `main` today. On a cold boot, we never even attempt to read the ledgered datasets config, and therefore never ensure any of the datasets needed by zones, and yet still proceed to starting zones up.
I definitely want to fix this, but it requires more invasive changes in `ServiceManager`. I left this comment as a breadcrumb (omicron/sled-agent/config-reconciler/src/sled_agent_facilities.rs, lines 34 to 37 in 03081e1):

```rust
// TODO-cleanup This is implemented by
// `ServiceManager::start_omicron_zone()`, which does too much; we should
// absorb some of its functionality and shrink this interface. We definitely
// should not need to pass the full list of U2 zpools.
```
I think what ought to happen is:

- `start_zones_if_needed` takes the current set of dataset ensure results (or the most recent one, if we failed to contact the dataset task; I think this is still okay)
- `start_zones_if_needed` checks the result of ensuring the zone's `filesystem_pool` and (if relevant) durable dataset, and refuses to start the zone if the dataset failed to ensure (roughly the check sketched below)
- if the dataset(s) did ensure successfully, it calls out to `ServiceManager`, but no longer passes `all_u2_pools`; instead, `ServiceManager` should use the specific datasets the zone config specifies instead of choosing on its own and/or creating them with no quota if they don't exist
but that seemed like a change that could come after all this work landed, since it'd be new behavior / fixing preexisting bugs.
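A rough sketch of the per-zone check described in the second bullet above. Everything here (`DatasetEnsureOutcomes`, `ZoneConfig`, `zone_may_start`) is an invented stand-in for illustration, not the actual omicron types; the point is only that the zone-start path consumes the dataset ensure results instead of assuming the datasets exist:

```rust
use std::collections::BTreeMap;

// Hypothetical: keyed by dataset name, value is the outcome of the most
// recent ensure attempt for that dataset.
type DatasetEnsureOutcomes = BTreeMap<String, Result<(), String>>;

struct ZoneConfig {
    filesystem_pool_dataset: String,
    durable_dataset: Option<String>,
}

/// Decide whether a zone may be started, given the latest dataset ensure
/// outcomes (possibly the cached set, if the dataset task was unreachable).
fn zone_may_start(zone: &ZoneConfig, outcomes: &DatasetEnsureOutcomes) -> bool {
    let mut required = vec![&zone.filesystem_pool_dataset];
    if let Some(durable) = &zone.durable_dataset {
        required.push(durable);
    }
    // Refuse to start unless every dataset the zone depends on was ensured.
    required
        .iter()
        .all(|dataset| matches!(outcomes.get(dataset.as_str()), Some(Ok(()))))
}
```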
A final thought on the specific bit here in the reconciler: we could change this `datasets_ensure()` call to use `send` instead of `try_send`, and just block until the dataset task can service the request. I'm not sure if that's better or worse in the "dataset task gets wedged" case. It would get rid of this error branch entirely, but would mean the dataset task getting wedged also causes the reconciler to get wedged.
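For reference, the two channel behaviors being weighed look roughly like this (a sketch with a placeholder request type; both methods are part of tokio's bounded `mpsc::Sender`):

```rust
use tokio::sync::mpsc;

struct DatasetRequest;

async fn submit(req_tx: &mpsc::Sender<DatasetRequest>) {
    // Fail-fast variant (what the PR currently does): if the bounded channel
    // is full, this returns an error immediately and the caller can fall
    // back to its cached result.
    if req_tx.try_send(DatasetRequest).is_err() {
        // channel full or closed; reuse the previous result
    }

    // Blocking variant: waits until the dataset task has capacity. There is
    // no "full" error branch, but if the dataset task wedges, this await
    // wedges the caller too.
    if req_tx.send(DatasetRequest).await.is_err() {
        // channel closed; the dataset task has exited
    }
}
```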
Based on the following, I think that the code is doing the right thing as is.
> If we succeed in calling `ensure_datasets`, but every single dataset we requested failed to be ensured for some reason, we'd still proceed into `start_zones_if_needed` and never check that the datasets zones depend on were actually ensured. This is definitely wrong! But it can happen on a successful call to `ensure_datasets` just as easily as a "failed to call `ensure_datasets` and reused an old result".

It should always be fine to operate on stale data, and it seems that `start_zones_if_needed` should be corrected to not do the wrong thing. That can come in a follow-up PR, and I think @jgallagher's suggestions make sense.

I'm largely against using a blocking send, as this can have unintended effects such as deadlock. However, it would be nice if we had a proper alert mechanism, so that if we failed to do something like talk to the dataset task for a few attempts we could inform an operator and not just log.
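As a sketch of the kind of operator-alerting the last sentence asks for (entirely illustrative: `FaultCounter`, the threshold, and how an alert would actually be surfaced are all made up, not anything sled-agent has today):

```rust
/// Tracks consecutive failures to reach the dataset task so we can escalate
/// beyond a log line once the problem looks persistent. Purely illustrative.
struct FaultCounter {
    consecutive_failures: u32,
    alert_threshold: u32,
}

impl FaultCounter {
    fn new(alert_threshold: u32) -> Self {
        Self { consecutive_failures: 0, alert_threshold }
    }

    /// Record the outcome of one reconciliation attempt; returns true when
    /// the failure streak crosses the threshold and an operator-visible
    /// alert (not just a log entry) should be raised.
    fn record(&mut self, attempt_succeeded: bool) -> bool {
        if attempt_succeeded {
            self.consecutive_failures = 0;
            false
        } else {
            self.consecutive_failures += 1;
            self.consecutive_failures == self.alert_threshold
        }
    }
}
```

The reconciler could feed each attempt's outcome into something like this and raise a fault once failures look persistent, rather than only logging.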
The return value of `datasets_ensure` is a `DatasetEnsureResult`. 😅 My question was from the context of "what if we can't call `datasets_ensure` at all", in which case we don't get a `DatasetEnsureResult` (unless we're willing to clone the previous one, as the code currently does).

The return value of `datasets_ensure` is `Result<DatasetEnsureResult, DatasetTaskError>`. If the datasets task is not responding, we cannot access that result, and we'll see a `DatasetTaskError`. That's what I meant by "maybe we have an Arc that allows a client to access the data without accessing the task".

Isn't this exactly equivalent to what I'm doing on the PR as written? If we get a `DatasetTaskError`, we couldn't talk to the task, so we clone the last result we got from it, which will be identical to the `DatasetEnsureResult` it's currently holding (since we're the only ones that can call `datasets_ensure`).
Also, fwiw, I'm okay with merging this PR as-is, as long as we have an issue to track this behavior. I think we're on the same page on the goals in the limit, anyway.
👍 Filed as #8173 (not sure that has the best title, but I wasn't sure how to summarize the issue succinctly - feel free to edit it!)
> Isn't this exactly equivalent to what I'm doing on the PR as written? If we get a `DatasetTaskError`, we couldn't talk to the task, so we clone the last result we got from it, which will be identical to the `DatasetEnsureResult` it's currently holding (since we're the only ones that can call `datasets_ensure`).

If we're making the assumption that no one else is calling `datasets_ensure`, and that we have seen a result -- the `.unwrap_or_else(DatasetEnsureResult::default)` line implies we'd claim this is "no datasets" if we haven't been able to get a call in successfully -- then yeah, this seems true.
I think this just means the case you mentioned earlier -- e.g., "a bunch of other requests go to the dataset task, it's full, and we fail to `try_send` to it", can mean "we'll launch zones anyway, even though we failed to initialize any datasets for them".
But as you say, we don't care about this case for now. Sorry for the distraction. Feel free to mark this resolved.
> If we're making the assumption that no one else is calling `datasets_ensure`, and that we have seen a result -- the `.unwrap_or_else(DatasetEnsureResult::default)` line implies we'd claim this is "no datasets" if we haven't been able to get a call in successfully -- then yeah, this seems true.

I think it's okay to make that assumption; `datasets_ensure` is not public. I think the behavior is still the same even in the `unwrap_or_else` case: if we've never called `datasets_ensure`, then the `DatasetEnsureResult` held by the dataset task is itself still the empty default (because it's never ensured any datasets!):
```rust
datasets: DatasetEnsureResult::default(),
```
The primary change here is to implement `ReconcilerTask::run()`. There are quite a few ancillary changes (new helper methods, etc.). I didn't add any tests of this task explicitly; it's mostly straightline code that calls into subsystems that themselves have tests. Open to feedback on whether you think this is okay or it needs some explicit tests of its own.
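For readers who haven't looked at the code, the overall shape of a reconciliation loop like `ReconcilerTask::run()` is roughly the following. All names and types below are illustrative stand-ins rather than the actual omicron types, assuming a `watch` channel for incoming config and another for publishing results:

```rust
use tokio::sync::watch;

// Illustrative stand-ins only; not the real sled-agent types.
#[derive(Clone, Default)]
struct SledConfig;
#[derive(Clone, Default)]
struct DiskResult;
#[derive(Clone, Default)]
struct DatasetResult;
#[derive(Clone, Default)]
struct ReconcilerResult {
    disks: DiskResult,
    datasets: DatasetResult,
}

async fn ensure_disks(_c: &SledConfig) -> DiskResult { DiskResult }
async fn ensure_datasets(_c: &SledConfig) -> DatasetResult { DatasetResult }
async fn shut_down_zones_no_longer_wanted(_c: &SledConfig) {}
async fn start_zones_if_needed(_c: &SledConfig, _d: &DatasetResult) {}

/// Very rough skeleton of a "reconcile on every config change" loop. The
/// real task also reconciles at startup, retries on failure, and interacts
/// with the dataset task discussed above; none of that is shown here.
async fn run(
    mut config_rx: watch::Receiver<SledConfig>,
    result_tx: watch::Sender<ReconcilerResult>,
) {
    while config_rx.changed().await.is_ok() {
        let config = config_rx.borrow_and_update().clone();

        // Steps run in dependency order; later steps can consult the
        // results of earlier ones (e.g., zones consult dataset results).
        let disks = ensure_disks(&config).await;
        let datasets = ensure_datasets(&config).await;
        shut_down_zones_no_longer_wanted(&config).await;
        start_zones_if_needed(&config, &datasets).await;

        // Publish the latest outcome for the rest of sled-agent to read.
        result_tx.send_replace(ReconcilerResult { disks, datasets });
    }
}
```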