Retrying compartmentless containers #1608
base: experimental-windows-ambient
Conversation
src/inpod/windows/statemanager.rs
Outdated
@@ -339,6 +362,98 @@ impl WorkloadProxyManagerState {
        }
    }

    pub async fn retry_comparmentless(&mut self, poddata: &WorkloadData) -> Result<(), Error> {
compartmentless*
@@ -34,8 +34,7 @@ mod workloadmanager;
#[cfg(any(test, feature = "testing"))]
pub mod test_helpers;

#[derive(Debug)]
#[derive(Debug, Clone)]
Are we sure we want to clone this?
Reason here is just to facilitate persisting this information across requests in our message loop (as now we can retry workloads later). We can make an effort to simplify that struct, or use an Rc if you think that's desirable.
Ack, let's optimize later if we feel like we need it. This doesn't affect Linux so I feel good coming back to it
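For reference, a minimal sketch of the Rc/Arc alternative mentioned above; the struct and its fields are hypothetical stand-ins, not the actual types in this PR:

```rust
use std::sync::Arc;

// Hypothetical stand-in for the workload metadata being cloned; the real
// struct in this PR has different fields.
#[derive(Debug)]
struct WorkloadData {
    uid: String,
    namespace_guid: String,
}

fn main() {
    // Wrapping the data in an Arc lets the message loop keep it around for
    // later retries while handing out cheap reference-counted clones
    // instead of deep-copying the struct on every request.
    let data = Arc::new(WorkloadData {
        uid: "example-uid".into(),
        namespace_guid: "00000000-0000-0000-0000-000000000000".into(),
    });
    let kept_for_retry = Arc::clone(&data); // bumps a refcount, no deep copy
    println!("{:?} {:?}", data, kept_for_retry);
}
```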
src/inpod/windows/admin.rs
Outdated
pub fn proxy_pending(&self, uid: &crate::inpod::WorkloadUid, workload_info: &WorkloadInfo) {
    let mut state = self.state.write().unwrap();
Does it make sense to merge these?
Good point, they're pretty much the same.
src/inpod/windows/config.rs
Outdated
@@ -36,14 +34,14 @@ impl InPodConfig {
    ..cfg.socket_config
};
Ok(InPodConfig {
    cur_namespace: InpodNamespace::current()?,
    cur_namespace: NetworkNamespace::current()?,
Why the rename?
+1, the current naming makes no distinction between the specific namespaces. Does this rename add value?
I think it was more because I removed one layer there, and this was the original "inner" struct. I'll change it back to InpodNamespace
👍
@keithmattix @MikeZappa87 thanks for the reviews, I've addressed your comments. Could you please give me a hand with a second pass?
Some error handling questions, but the bones look good to me
src/inpod/windows/statemanager.rs
Outdated
"network compartment ID not yet available for namespace {}", | ||
netns.namespace_guid | ||
); | ||
self.compartmentless_workloads |
Do we ever pop this off the list? Could this create duplicates?
src/inpod/windows/statemanager.rs
Outdated
    compartment_id, netns.namespace_guid
);
let new_netns = InpodNamespace::new(netns.namespace_guid.clone()).map_err(|e| {
    self.compartmentless_workloads
Why potentially do this again?
src/inpod/windows/statemanager.rs
Outdated
match self.add_workload(&uid, info.clone(), new_netns).await {
    Ok(()) => {}
    Err(e) => {
        self.compartmentless_workloads
I might be missing something, but isn't there a possibility that we add this guid to the list 3 times?
For compartmentless_workloads? It's a Vec<(WorkloadUid, WorkloadInfo, InpodNamespace)>; nothing would prevent it from having duplicate entries.
That vector shouldn't be there anymore. It should've been removed after the queue of events was created.
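As an aside, if a flat list of pending workloads were ever kept around instead of the event queue, one way to rule out duplicate entries would be keying them by workload UID. A minimal sketch with hypothetical, simplified types (not the actual ones from this PR):

```rust
use std::collections::HashMap;

// Hypothetical simplified stand-ins for WorkloadUid / the per-workload state.
type WorkloadUid = String;

#[derive(Debug, Clone)]
struct PendingWorkload {
    namespace_guid: String,
    attempts: u32,
}

#[derive(Default)]
struct PendingWorkloads {
    // Keying by UID means re-marking the same workload as pending replaces
    // the old entry instead of growing a Vec with duplicates.
    by_uid: HashMap<WorkloadUid, PendingWorkload>,
}

impl PendingWorkloads {
    fn mark_pending(&mut self, uid: WorkloadUid, namespace_guid: String) {
        self.by_uid
            .entry(uid)
            .and_modify(|p| p.attempts += 1)
            .or_insert(PendingWorkload {
                namespace_guid,
                attempts: 1,
            });
    }
}

fn main() {
    let mut pending = PendingWorkloads::default();
    // Marking the same UID twice keeps exactly one entry.
    pending.mark_pending("pod-a".into(), "guid-1".into());
    pending.mark_pending("pod-a".into(), "guid-1".into());
    assert_eq!(pending.by_uid.len(), 1);
    assert_eq!(pending.by_uid["pod-a"].attempts, 2);
}
```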
/cc @Stevenjin8 @howardjohn for a closer look
pinging @Stevenjin8 and @howardjohn for a closer look
looking at this now... Going to build and maybe run on an AKS cluster
Cargo.toml
Outdated
@@ -112,6 +112,7 @@ tracing-core = "0.1"
tracing-appender = "0.2"
tokio-util = { version = "0.7", features = ["io-util"] }
educe = "0.6"
uuid = "1.17.0"
this is windows only right?
Actually I think we don't need that anymore, as we're using windows::core::UUID now.
};
match self
    .event_queue
    .binary_search_by_key(&expiration, |event| event.expiration)
we can use partition_point instead: https://doc.rust-lang.org/std/primitive.slice.html#method.partition_point
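For reference, a minimal sketch of what the partition_point variant could look like, assuming a simplified event type (the real queue entries in this PR carry more data):

```rust
use std::time::{Duration, Instant};

// Hypothetical simplified event; the real queue entries carry more data.
struct QueuedEvent {
    expiration: Instant,
}

// Insert while keeping the queue sorted by expiration. partition_point
// returns the index of the first element whose expiration is >= the new
// one, so there is no Ok/Err destructuring as with binary_search_by_key.
fn enqueue_sorted(queue: &mut Vec<QueuedEvent>, event: QueuedEvent) {
    let idx = queue.partition_point(|e| e.expiration < event.expiration);
    queue.insert(idx, event);
}

fn main() {
    let now = Instant::now();
    let mut queue = Vec::new();
    enqueue_sorted(&mut queue, QueuedEvent { expiration: now + Duration::from_secs(5) });
    enqueue_sorted(&mut queue, QueuedEvent { expiration: now + Duration::from_secs(1) });
    assert!(queue[0].expiration <= queue[1].expiration);
}
```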
// We can't tell the CNI that a node needs to be removed due
// to an unretriable error. So at the moment we always retry,
// always increasing the the timeout by a factor on each attempt.
warn!("error while retyring workload: {}", e);
warn!("error while retyring workload: {}", e); | |
warn!("error while retrying workload: {}", e); |
    poddata,
    new_timeout.as_secs()
);
self.enqueue_local_event(
After some thought, this doesn't matter, but it is a bit sketchy to modify the event queue as we are iterating over it
I think it works out, but it feels a bit fragile (e.g. enqueue_local_event will always enqueue an event whose expiration is > now, otherwise the retry event would get immediately deleted). It would be nice to at least have some comments for this.
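One way to make that less fragile, sketched here with hypothetical simplified types: split off the already-due events before handling them, so anything enqueued while processing lands in the live queue rather than in the batch being iterated.

```rust
use std::time::Instant;

// Hypothetical simplified event type.
struct QueuedEvent {
    expiration: Instant,
}

// Pop everything that is already due into a separate batch, then process
// the batch. Events enqueued while handling a retry go into `queue`, not
// into the batch we are iterating, so the loop never observes its own
// mutations.
fn process_due(queue: &mut Vec<QueuedEvent>, now: Instant) {
    let due_count = queue.partition_point(|e| e.expiration <= now);
    let due: Vec<QueuedEvent> = queue.drain(..due_count).collect();
    for event in due {
        // handle_event(event) could enqueue follow-up retries into `queue`.
        let _ = event;
    }
}

fn main() {
    let now = Instant::now();
    let mut queue = vec![QueuedEvent { expiration: now }];
    process_due(&mut queue, now);
    assert!(queue.is_empty());
}
```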
// to an unretriable error. So at the moment we always retry,
// always increasing the the timeout by a factor on each attempt.
warn!("error while retyring workload: {}", e);
let new_timeout = previous_timeout.mul(2);
we should put a cap on this.
and add jitter
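For reference, a minimal sketch of a capped, jittered backoff; the cap, factor, and jitter range are placeholder values rather than ones chosen in this PR, and a real implementation would probably use the rand crate for the jitter:

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

// Placeholder cap; the actual value would be a tuning decision.
const MAX_RETRY_TIMEOUT: Duration = Duration::from_secs(300);

fn next_timeout(previous: Duration) -> Duration {
    // Double the previous timeout, but never exceed the cap.
    let doubled = previous.saturating_mul(2).min(MAX_RETRY_TIMEOUT);
    // Cheap std-only jitter of up to 250ms so retries for many workloads
    // don't all fire at the same instant.
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap_or_default()
        .subsec_nanos() as u64;
    let jitter = Duration::from_millis(nanos % 250);
    (doubled + jitter).min(MAX_RETRY_TIMEOUT)
}

fn main() {
    let t = next_timeout(Duration::from_secs(8));
    assert!(t >= Duration::from_secs(16) && t <= MAX_RETRY_TIMEOUT);
    // Once the cap is hit, the timeout stops growing.
    assert_eq!(next_timeout(MAX_RETRY_TIMEOUT), MAX_RETRY_TIMEOUT);
}
```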
Problem
We have an issue with containerd and Windows' HCS where a pod's namespace doesn't yet have a compartment ID when the CNI signals ztunnel about a workload that's just been created. Without that ID we can't create ztunnel proxy sockets inside the pod's network compartment. The compartment ID only becomes available after ztunnel signals the CNI that the workload has been assimilated by ztunnel (which can't happen without a compartment ID 🫤).
What this PR does
This PR mitigates the stated problem by having ztunnel reply with an ACK to the CNI during the ADD operation, even though the workload proxies haven't yet been instantiated inside ztunnel. After a timeout, ztunnel tries again to add the workload; by then the HCS API returns a valid compartment ID, which allows the proxy to be created without any problems.
A more detailed flow looks like:
1. The CNI signals ztunnel (ZDS ADD) for the newly created workload.
2. ztunnel looks up the compartment ID for the pod's network namespace:
   2.1 If the compartment ID is available, we create a proxy for the workload and ACK the CNI.
   2.2 If the compartment ID is not available, we mark the workload as pending inside ztunnel and ACK the CNI.
How the PR does it
It introduces a queue of events that runs in the same thread where ZDS commands are processed. These events are managed in a fairly simple way at the moment, but the mechanism can be expanded later for more advanced handling of other "internal events" beyond the current retries of pending workloads. A rough sketch of the shape of that loop follows.
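The sketch below uses hypothetical simplified types (the real implementation processes actual ZDS commands and stores richer event data) and assumes tokio with the time, sync, and macros features:

```rust
use std::time::Duration;
use tokio::time::{sleep_until, Instant};

// Hypothetical local event carried on the internal queue; in this PR the
// events drive retries of compartmentless workloads.
struct LocalEvent {
    expiration: Instant,
    workload_uid: String,
}

struct MessageLoop {
    // Kept sorted by expiration, so the head is always the next deadline.
    event_queue: Vec<LocalEvent>,
    // Stand-in for the ZDS command stream.
    commands: tokio::sync::mpsc::Receiver<String>,
}

impl MessageLoop {
    async fn run(&mut self) {
        loop {
            // Sleep until the earliest expiration, or far in the future if
            // nothing is queued.
            let deadline = self
                .event_queue
                .first()
                .map(|e| e.expiration)
                .unwrap_or_else(|| Instant::now() + Duration::from_secs(3600));

            tokio::select! {
                // ZDS commands keep being handled on the same task...
                Some(cmd) = self.commands.recv() => {
                    let _ = cmd; // handle ADD / DEL / snapshot here
                }
                // ...and internal events (e.g. workload retries) fire when due.
                _ = sleep_until(deadline) => {
                    while self
                        .event_queue
                        .first()
                        .is_some_and(|e| e.expiration <= Instant::now())
                    {
                        let event = self.event_queue.remove(0);
                        let _ = event.workload_uid; // retry this workload
                    }
                }
            }
        }
    }
}

fn main() {} // compile-only sketch
```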
Caveats
We'll keep retrying compartmentless workloads even if they're failing in an unretriable way (to be fixed in a different PR). Once ztunnel has ACKed the ADD command, there's currently no way to signal the CNI that the workload actually failed to be assimilated by ztunnel.