
Conversation

grnmeira
Contributor

@grnmeira grnmeira commented Aug 7, 2025

Problem

We have an issue with containerd and Windows' HCS where pod network namespaces don't yet have a compartment ID when the CNI signals ztunnel about a workload that's just been created. Without that ID we can't create ztunnel proxy sockets inside the pod's network compartment. The compartment ID only becomes available after ztunnel signals the CNI that the workload has been assimilated by ztunnel (which can't happen without a compartment ID 🫤).

What this PR does

This PR mitigates the stated problem by having ztunnel reply with an ACK to the CNI during the ADD operation, even though the workload proxies haven't yet been instantiated inside ztunnel. After a timeout, ztunnel tries again to add the workload; by then the HCS API returns a valid compartment ID, which allows the proxy to be created without any problems.

A more detailed flow looks like this (a rough sketch of step 2 in code follows the list):

  1. CNI sends an ADD command to ztunnel
  2. ztunnel queries HCS and checks whether a compartment ID is available for that pod
    2.1 If the compartment ID is available, we create a proxy for the workload and ACK the CNI.
    2.2 If the compartment ID is not available, we mark the workload as pending inside ztunnel and ACK the CNI.
  3. The CNI receives an ACK and the pod creation proceeds.
  4. After a timeout, ztunnel tries to add the pending workload again.
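
A minimal sketch of the decision in step 2, with assumed names (decide_add, AddOutcome) rather than this PR's exact types:

// Sketch only: names and types here are assumptions, not the PR's exact API.
use std::time::Duration;

enum AddOutcome {
    ProxyCreated,      // 2.1: compartment ID available, proxy created immediately
    Pending(Duration), // 2.2: no compartment ID yet, retry after this delay
}

// The CNI is ACKed in either case (step 3); a Pending workload is retried
// by ztunnel once the timeout expires (step 4).
fn decide_add(compartment_id: Option<u32>) -> AddOutcome {
    match compartment_id {
        Some(_) => AddOutcome::ProxyCreated,
        None => AddOutcome::Pending(Duration::from_secs(1)),
    }
}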

How the PR does it

It introduces a queue of events that runs in the same thread where ZDS commands are processed. These events are handled in a fairly simple way at the moment (currently just the retries of pending workloads), but the mechanism can be expanded in the future for more advanced handling of other internal events. A rough sketch of the idea follows.
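
The sketch below assumes a tokio-style loop; the channel type, the event tuple, and the printed actions are assumptions, not this PR's types:

// Sketch only: internal events share the loop that processes ZDS commands.
use std::time::Duration;
use tokio::time::{sleep_until, Instant};

async fn run_loop(
    mut zds_commands: tokio::sync::mpsc::Receiver<String>,
    mut event_queue: Vec<(Instant, String)>, // (expiration, pending workload uid), kept sorted
) {
    loop {
        // Sleep until the earliest pending event, or "forever" if there is none.
        let next_deadline = event_queue
            .first()
            .map(|(exp, _)| *exp)
            .unwrap_or_else(|| Instant::now() + Duration::from_secs(3600));

        tokio::select! {
            maybe_cmd = zds_commands.recv() => {
                match maybe_cmd {
                    Some(cmd) => println!("handle ZDS command: {cmd}"),
                    None => break, // sender closed, shut the loop down
                }
            }
            _ = sleep_until(next_deadline), if !event_queue.is_empty() => {
                let (_, uid) = event_queue.remove(0);
                println!("retry pending workload: {uid}");
                // On failure the workload is re-enqueued with a larger timeout.
            }
        }
    }
}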

Caveats

We'll keep retrying compartmentless workloads even if they're failing in an unretriable way (to be fixed in a different PR). Once ztunnel has ACKed the ADD command, there's currently no way to signal the CNI that the workload actually failed to be assimilated.
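
One possible shape for that follow-up (purely illustrative, not part of this PR): classify failures so unretriable ones stop being re-enqueued.

// Sketch only: an assumed error classification, not part of this PR.
enum AssimilationError {
    Retriable(String),   // e.g. compartment ID not yet available
    Unretriable(String), // e.g. the namespace no longer exists
}

fn should_retry(err: &AssimilationError) -> bool {
    matches!(err, AssimilationError::Retriable(_))
}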

@grnmeira grnmeira requested a review from a team as a code owner August 7, 2025 14:15
@istio-testing istio-testing added the size/L label (Denotes a PR that changes 100-499 lines, ignoring generated files) Aug 7, 2025
@grnmeira grnmeira added the windows label (Experimental Windows support) Aug 8, 2025
@@ -339,6 +362,98 @@ impl WorkloadProxyManagerState {
}
}

pub async fn retry_comparmentless(&mut self, poddata: &WorkloadData) -> Result<(), Error> {


compartmentless*

@@ -34,8 +34,7 @@ mod workloadmanager;
#[cfg(any(test, feature = "testing"))]
pub mod test_helpers;


#[derive(Debug)]
#[derive(Debug, Clone)]
Contributor


Are we sure we want to clone this?

Contributor Author


The reason here is just to facilitate persisting this information across requests in our message loop (since we can now retry workloads later). We can make an effort to simplify that struct, or use an Rc if you think that's desirable.

Contributor


Ack, let's optimize later if we feel like we need it. This doesn't affect Linux so I feel good coming back to it
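
For reference, the Rc alternative mentioned above could look roughly like this (Arc appears in the sketch only because the data might cross thread boundaries; that and the placeholder fields are assumptions, not the PR's choice):

// Sketch only: share one WorkloadInfo across retries instead of cloning it.
// Placeholder fields; the real WorkloadInfo differs.
use std::sync::Arc;

#[derive(Debug)]
struct WorkloadInfo {
    name: String,
    namespace: String,
}

fn keep_for_retry(info: &Arc<WorkloadInfo>) -> Arc<WorkloadInfo> {
    // A cheap reference-count bump instead of a deep clone of the Strings.
    Arc::clone(info)
}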

Comment on lines 65 to 66
pub fn proxy_pending(&self, uid: &crate::inpod::WorkloadUid, workload_info: &WorkloadInfo) {
let mut state = self.state.write().unwrap();
Contributor


Does it make sense to merge these?

Contributor Author


Good point, they're pretty much the same.

@@ -36,14 +34,14 @@ impl InPodConfig {
..cfg.socket_config
};
Ok(InPodConfig {
cur_namespace: InpodNamespace::current()?,
cur_namespace: NetworkNamespace::current()?,
Contributor


Why the rename?


+1, the current naming makes no distinction between the specific namespaces. Does this rename add value?

Contributor Author


I think it was more because I removed one layer there, and this was the original "inner" struct. I'll change it back to InpodNamespace 👍

@grnmeira
Contributor Author

grnmeira commented Sep 2, 2025

@keithmattix @MikeZappa87 thanks for the reviews, I've addressed your comments. Could you please give me a hand with a second pass?

Contributor

@keithmattix keithmattix left a comment


Some error handling questions, but the bones look good to me

"network compartment ID not yet available for namespace {}",
netns.namespace_guid
);
self.compartmentless_workloads
Contributor


Do we ever pop this off the list? Could this create duplicates?

compartment_id, netns.namespace_guid
);
let new_netns = InpodNamespace::new(netns.namespace_guid.clone()).map_err(|e| {
self.compartmentless_workloads
Contributor


Why potentially do this again?

match self.add_workload(&uid, info.clone(), new_netns).await {
Ok(()) => {}
Err(e) => {
self.compartmentless_workloads
Contributor


I might be missing something, but isn't there a possibility that we add this guid to the list 3 times?


For compartmentless_workloads? It's a Vec<(WorkloadUid, WorkloadInfo, InpodNamespace)>; nothing would prevent it from having duplicate entries.

Contributor Author


That vector shouldn't be there anymore. It should've been removed after the queue of events was created.

@keithmattix
Contributor

/cc @Stevenjin8 @howardjohn for a closer look

@grnmeira
Contributor Author

grnmeira commented Sep 8, 2025

pinging @Stevenjin8 and @howardjohn for a closer look

@Stevenjin8
Contributor

looking at this now... Going to build and maybe run on an AKS cluster

Cargo.toml Outdated
@@ -112,6 +112,7 @@ tracing-core = "0.1"
tracing-appender = "0.2"
tokio-util = { version = "0.7", features = ["io-util"] }
educe = "0.6"
uuid = "1.17.0"
Contributor


this is Windows only, right?

Contributor Author


Actually I think we don't need that anymore, as we're using windows::core::UUID now.

@Stevenjin8
Contributor

Stevenjin8 commented Sep 11, 2025

istio/istio#57303

};
match self
.event_queue
.binary_search_by_key(&expiration, |event| event.expiration)
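
For context on this pattern: binary_search_by_key over a queue kept sorted by expiration returns Err(idx) with the insertion index when no event has that exact expiration, so an ordered insert looks roughly like the sketch below (the Event shape is assumed).

// Sketch only: ordered insert keyed on expiration; the Event shape is assumed.
use std::time::Instant;

struct Event {
    expiration: Instant,
}

fn insert_sorted(queue: &mut Vec<Event>, event: Event) {
    // Ok(i): some event already has this exact expiration; insert next to it.
    // Err(i): no event with this expiration; i keeps the Vec sorted.
    let idx = match queue.binary_search_by_key(&event.expiration, |e| e.expiration) {
        Ok(i) | Err(i) => i,
    };
    queue.insert(idx, event);
}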

// We can't tell the CNI that a node needs to be removed due
// to an unretriable error. So at the moment we always retry,
// always increasing the the timeout by a factor on each attempt.
warn!("error while retyring workload: {}", e);
Contributor


Suggested change
warn!("error while retyring workload: {}", e);
warn!("error while retrying workload: {}", e);

poddata,
new_timeout.as_secs()
);
self.enqueue_local_event(
Contributor


After some thought, this doesn't matter, but it is a bit sketchy to modify the event queue as we are iterating over it

Contributor


I think it works out, but it feels a bit fragile (e.g. the fact that enqueue_local_event will always enqueue an event with an expiration > now, otherwise the retry event would get deleted immediately). It would be nice to at least have some comments for this.
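
One way to make that invariant less load-bearing (a sketch, not what this PR does) is to drain the due events into a separate list first and only then process them, so a re-enqueue during processing never touches a queue that is being iterated:

// Sketch only: drain due events before processing them; names are assumed.
use std::time::Instant;

struct Event {
    expiration: Instant,
    workload_uid: String,
}

fn process_due_events(queue: &mut Vec<Event>, now: Instant) {
    // The queue is kept sorted by expiration, so due events form a prefix.
    let due_count = queue.partition_point(|e| e.expiration <= now);
    let due: Vec<Event> = queue.drain(..due_count).collect();

    for event in due {
        // A failed retry can safely push a new event (expiration > now)
        // onto `queue` here, because iteration is over the drained `due` list.
        println!("retrying workload {}", event.workload_uid);
    }
}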

// to an unretriable error. So at the moment we always retry,
// always increasing the the timeout by a factor on each attempt.
warn!("error while retyring workload: {}", e);
let new_timeout = previous_timeout.mul(2);
Contributor


we should put a cap on this.

Contributor


and add jitter
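
A rough sketch of a cap plus jitter (the constant and the rand 0.8-style API are assumptions, not part of this PR):

// Sketch only: capped exponential backoff with jitter.
use rand::Rng;
use std::time::Duration;

const MAX_RETRY_TIMEOUT: Duration = Duration::from_secs(60);

fn next_retry_timeout(previous: Duration) -> Duration {
    // Double the previous timeout, but never let it grow past the cap.
    let doubled = previous.saturating_mul(2).min(MAX_RETRY_TIMEOUT);
    // Add up to 10% random jitter so retries across many pods don't line up.
    let jitter_ms = rand::thread_rng().gen_range(0..=doubled.as_millis() as u64 / 10);
    doubled + Duration::from_millis(jitter_ms)
}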
