📖 Add designs/multi-cluster.md #2746

Open · wants to merge 7 commits into base: main

Conversation

@sttts commented Mar 31, 2024

Controller-runtime today allows writing controllers against a single cluster only.
Multi-cluster use cases require the creation of multiple managers and/or cluster
objects. This proposal is about adding native support for multi-cluster use cases
to controller-runtime.

The proposed changes are prototyped in #3019.

Signed-off-by: Dr. Stefan Schimanski <[email protected]>
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Mar 31, 2024
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Mar 31, 2024
}

// pkg/handler
type DeepCopyableEventHandler interface {
Member:

The event handlers are stateless; why do we need the deepcopy for them?

@sbueringer (Member), Apr 11, 2024:

Looking at the prototype, I think this is because the EventHandler would then store the Cluster (it uses that info to set the ClusterName field in the request).

Author:

This is gone now in #2726.

Will update the design here.

@embik (Member), Dec 4, 2024:

With the BYO request/eventhandler changes in #3019, I brought this back after remembering it was mentioned in the proposal. The previous version of my prototype had a weird second layer of event handler that wrapped the actual event handler and used the event object to communicate the cluster name. That felt all kinds of weird.

Because we now have BYO EventHandlers, it's possible that they are not entirely stateless (as @sbueringer pointed out, some event handlers might have to store the cluster name in the absence of any information on the event object itself). So I think this approach is the cleanest, to be honest. It's entirely optional in #3019; existing EventHandlers don't need to be changed.
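For illustration, a minimal, self-contained sketch (not taken from the PR or #3019) of why such a handler stops being stateless once it has to carry the cluster name itself. All types below are simplified stand-ins for the proposed `TypedDeepCopyableEventHandler` and cluster-aware request; the names are made up.

```go
package main

import "fmt"

// Simplified stand-ins for the proposal's types; not the real controller-runtime API.
type Cluster interface{ Name() string }

// Request mirrors the proposed cluster-aware request: a normal request plus
// the name of the cluster the event originated from.
type Request struct {
	Namespace, Name, ClusterName string
}

// enqueueForCluster is no longer stateless: it must remember which cluster it
// was engaged for, because the event object itself carries no cluster info.
type enqueueForCluster struct {
	clusterName string
}

// DeepCopyFor mirrors the proposed TypedDeepCopyableEventHandler.DeepCopyFor:
// it returns a copy bound to the given cluster.
func (e *enqueueForCluster) DeepCopyFor(cl Cluster) *enqueueForCluster {
	return &enqueueForCluster{clusterName: cl.Name()}
}

// OnCreate shows where the stored cluster name flows into the request.
func (e *enqueueForCluster) OnCreate(namespace, name string, enqueue func(Request)) {
	enqueue(Request{Namespace: namespace, Name: name, ClusterName: e.clusterName})
}

type namedCluster string

func (n namedCluster) Name() string { return string(n) }

func main() {
	perCluster := (&enqueueForCluster{}).DeepCopyFor(namedCluster("cluster-a"))
	perCluster.OnCreate("default", "my-object", func(r Request) { fmt.Printf("%+v\n", r) })
}
```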

Disengage(context.Context, Cluster) error
}
```
In particular, controllers implement the `AwareRunnable` interface. They react
Member:

Rather than changing the controller type directly and requiring all its dependencies to know how to deepcopy themselves, how about having something like a controllerconstructor (name tbd) in between that is filled with a []watchConstructor{source func(Cluster) source.Source, handler func(Cluster) handler.Handler, predicate func(Cluster) []predicate.Predicate}?

Member:

I think this would require more invasive changes to our public API (the Controller interface)

Member:

No, you can call Watch on an existing controller. The idea is to not let the Controller or its dependencies have any knowledge about this, but instead have a thing on top of the Controller that is configured with constructors that take a cluster.Cluster and return a source/predicate/handler, and then uses those to call Watch when a new cluster appears.

When one disappears, it would cancel the context on the Source.

The idea really is the opposite: I do not want the Controller to know how to extend itself like this; IMHO this is a higher-level abstraction.

Author:

Compare #2726 after the latest push. I have implemented @alvaroaleman's idea via a MultiClusterController wrapper that implements cluster.AwareRunnable and just calls Watch on the actual controller. All the deepcopy'ing is gone 🎉 Much nicer IMO. @alvaroaleman, great intuition!
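For illustration, a rough sketch of the shape of such a wrapper. `Cluster`, `Source` and the `Watch` method are simplified stand-ins rather than the real controller-runtime types; the actual implementation is in #2726/#3019.

```go
package sketch

import (
	"context"
	"fmt"
)

// Simplified stand-ins for the controller-runtime types involved.
type Cluster interface{ Name() string }
type Source interface{ Start(ctx context.Context) error }

// watchTarget is the only part of the wrapped controller the wrapper needs.
type watchTarget interface{ Watch(src Source) error }

// MultiClusterController engages an ordinary controller with clusters as they
// come and go, without the controller itself knowing anything about clusters.
type MultiClusterController struct {
	ctrl      watchTarget
	newSource func(Cluster) Source // per-cluster source constructor, as suggested above
}

// Engage is called when a cluster appears: build the per-cluster source and
// register it via the controller's ordinary Watch call. In the real prototype
// the source is wrapped so that it is started with ctx, i.e. it stops again
// when the cluster's lifecycle context is cancelled.
func (m *MultiClusterController) Engage(ctx context.Context, cl Cluster) error {
	src := m.newSource(cl)
	if err := m.ctrl.Watch(src); err != nil {
		return fmt.Errorf("engaging cluster %q: %w", cl.Name(), err)
	}
	return nil
}
```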

// pkg/cluster
type Provider interface {
Get(ctx context.Context, clusterName string, opts ...Option) (Cluster, error)
List(ctx context.Context) ([]string, error)
Member:

Why return []string here rather than []Cluster?

Member:

+1 Would be good for consistency with the Get func

Author:

There is a misunderstanding of the interface: the getter is actually the constructor. The life-cycle of the returned clusters is owned by the manager (they are added as runnables). Hence, List returns names, not clusters. We should rather rename Get to Create or Connect.
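A hedged sketch of that reading of the interface: `List` hands out names, `Get` acts as the constructor, and the manager owns the lifecycle of whatever it gets back. The interfaces here are simplified stand-ins; only the Get/List shape follows the excerpt above.

```go
package sketch

import (
	"context"
	"fmt"
)

// Simplified stand-ins; only the Get/List shape follows the excerpt above.
type Cluster interface{ Start(ctx context.Context) error }

type Provider interface {
	// Get acts as the constructor ("Create"/"Connect" as suggested above).
	Get(ctx context.Context, clusterName string) (Cluster, error)
	// List returns names only; the manager owns the Cluster objects it creates.
	List(ctx context.Context) ([]string, error)
}

// syncClusters is roughly what a manager could do on every resync: construct
// clusters for new names and own their lifecycle by running them.
func syncClusters(ctx context.Context, p Provider, running map[string]context.CancelFunc) error {
	names, err := p.List(ctx)
	if err != nil {
		return err
	}
	for _, name := range names {
		if _, ok := running[name]; ok {
			continue // already running
		}
		cl, err := p.Get(ctx, name) // construct; from here on the manager owns it
		if err != nil {
			return fmt.Errorf("connecting to cluster %q: %w", name, err)
		}
		clusterCtx, cancel := context.WithCancel(ctx)
		running[name] = cancel
		go func() { _ = cl.Start(clusterCtx) }() // added as a runnable in the real manager
	}
	return nil
}
```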

}
```

The `ctrl.Manager` will use the provider to watch clusters coming and going, and
Member:

I'll have to think about if and how this is doable, but ideally the "thing that comes and goes" wouldn't be typed to cluster.Cluster but can be anything, so this mechanism can also be used if folks have sources that are not kube watches

Member:

Would this be mostly about a more generic name? (can't think of much that would work, maybe something like scope)

@elmiko left a comment:

I think this is an interesting idea and I could see using it; I just have a question about some of the mechanics.

For context, I am investigating a cluster-api provider for Karpenter, and it would be nice to have the controllers discriminate between objects in the management cluster and objects in the workload clusters.

### Examples

- Run a controller-runtime controller against a kubeconfig with arbitrarily many contexts, all being reconciled.
- Run a controller-runtime controller against cluster managers like kind, Cluster API, Open-Cluster-Manager or Hypershift.
Comment:

Given the cluster-api example here, is the intention that controllers will be able to reconcile CRDs that may only exist in a subset of the clusters they know about (e.g. Machine objects in the management cluster but not in the workload clusters)?

@sbueringer (Member), Apr 10, 2024:

Good point. I think that has to be possible. Otherwise we would need all resources that we watch to exist in all clusters.

(especially a good point because today a controller crashes if a resource doesn't exist)

EDIT: Further down:

For example, it can well be that every cluster has a different REST mapping because the installed CRDs are different. Without a context, we cannot return the right REST mapper.

Author:

Good point. The question is whether one would rather group them into managers such that every manager has a uniform set of clusters.

Author:

See my updated PR #2726. You can now opt into the provider and/or the default cluster per controller via options:

```go
// EngageWithDefaultCluster indicates whether the controller should engage
// with the default cluster of a manager. This defaults to false through the
// global controller options of the manager if a cluster provider is set,
// and to true otherwise. Here it can be overridden.
EngageWithDefaultCluster *bool
// EngageWithProviderClusters indicates whether the controller should engage
// with the provided clusters of a manager. This defaults to true through the
// global controller options of the manager if a cluster provider is set,
// and to false otherwise. Here it can be overridden.
EngageWithProviderClusters *bool
```

There is no logic yet for a controller to decide whether to engage with a provider cluster or not; right now it engages with all of them. If the setup is more diverse, we might want such functionality, e.g. some kind of pre-check: `ctrl.WantsToEngage(ctx, cluster) bool`.

Comment:

I'm still understanding the changes in #2726, but I think what you are saying here makes sense to me and would solve the issue.

some kind of pre-check: `ctrl.WantsToEngage(ctx, cluster) bool`

+1, I think we definitely need some way for the client user to specify when it should check a specific cluster for a resource.

Comment:

I somehow think it should be the author's and manager's responsibility (for now) to group them into groups that work with the pattern. At this point, we don't know what we don't know. Once this is released, we can gather some feedback on edge cases and take it from there. I suspect the majority of use cases will still be single-cluster reconcile loops.

Maybe document this edge case and mark the feature overall as experimental? That way we are not committing to full production-level stability and can still gather more feedback.

@sttts (Author) commented May 28, 2024:

For those reading, this is currently a little outdated. #2726 has a changed design proposed by @alvaroaleman. Will come back soon to both PRs.

@k8s-triage-robot:

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 26, 2024
@k8s-triage-robot:

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 25, 2024
@k8s-triage-robot:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot (Contributor):

@k8s-triage-robot: Closed this PR.

In response to this: /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@embik (Member) commented Oct 28, 2024:

/reopen

We'd like to continue working on this, time is simply a bit scarce at the moment.

@k8s-ci-robot k8s-ci-robot reopened this Oct 28, 2024
@k8s-ci-robot (Contributor):

@embik: Reopened this PR.

In response to this: /reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@Gomaya commented Nov 13, 2024:

Hi, could you please share the future plans for this feature? Thank you! @embik

@embik (Member) commented Nov 13, 2024:

@Gomaya I'm working on a prototype that attempts to address the review comments in #2726. Once everyone is back from KubeCon, I plan to run this by everyone involved and try to move the feature forward.

@embik (Member) commented Dec 3, 2024:

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Dec 3, 2024
📖  Update multi-cluster proposal with new implementation details
@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sttts
Once this PR has been reviewed and has the lgtm label, please assign sbueringer for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

sttts added 2 commits January 7, 2025 15:35
Signed-off-by: Dr. Stefan Schimanski <[email protected]>
…iform

Add notes about uniform controllers
@embik (Member) commented Jan 7, 2025:

Hi @alvaroaleman @sbueringer, we (@sttts and I) finally got around to refreshing this document. Could you please take a look and let us know what you think? The (new) implementation PR at #3019 is functional if you want to look at implementation details, but it's of course not finalised yet (e.g. missing tests).

// pkg/cluster
type Aware interface {
// Engage gets called when the component should start operations for the given Cluster.
// The given context is tied to the Cluster's lifecycle and will be cancelled when the
Member:

how would a context be selected from the kubeconfig passed to the controller?

@embik (Member), Jan 30, 2025:

Just to avoid any confusion, this description is talking about a context.Context, not a kubeconfig context.

But in general: this would be something to implement in a Provider; a kubeconfig provider could be a very simple implementation (although the focus is on dynamic providers, and a kubeconfig would probably be fairly static).

How that provider translates the clusterName parameter in the Get method (see the Provider interface above) to a kubeconfig context would be up to the implementation, but I could see the context name being the identifier here (since Get returns a cluster.Cluster, we need credentials for the embedded client, so a context makes a lot of sense here).

Does that make sense? 🤔
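As a rough illustration of that idea (not part of this PR), a kubeconfig-context provider could look roughly like the sketch below, using today's `clientcmd` and `cluster.New`. The `Provider` interface shape follows the excerpt above; everything else is an assumption.

```go
package kubeconfigprovider

import (
	"context"
	"fmt"

	"k8s.io/client-go/tools/clientcmd"
	"sigs.k8s.io/controller-runtime/pkg/cluster"
)

// Provider mirrors the Get/List shape from the proposal.
type Provider interface {
	Get(ctx context.Context, clusterName string) (cluster.Cluster, error)
	List(ctx context.Context) ([]string, error)
}

// kubeconfigProvider treats every context in a kubeconfig file as a "cluster".
// It is intentionally static and minimal; real providers are expected to be dynamic.
type kubeconfigProvider struct {
	path string
}

func New(path string) Provider { return &kubeconfigProvider{path: path} }

// List returns the kubeconfig context names as cluster names.
func (p *kubeconfigProvider) List(_ context.Context) ([]string, error) {
	cfg, err := clientcmd.LoadFromFile(p.path)
	if err != nil {
		return nil, err
	}
	names := make([]string, 0, len(cfg.Contexts))
	for name := range cfg.Contexts {
		names = append(names, name)
	}
	return names, nil
}

// Get constructs a cluster.Cluster for the given kubeconfig context.
func (p *kubeconfigProvider) Get(_ context.Context, clusterName string) (cluster.Cluster, error) {
	apiCfg, err := clientcmd.LoadFromFile(p.path)
	if err != nil {
		return nil, err
	}
	if _, ok := apiCfg.Contexts[clusterName]; !ok {
		return nil, fmt.Errorf("context %q not found in %s", clusterName, p.path)
	}
	restCfg, err := clientcmd.NewNonInteractiveClientConfig(*apiCfg, clusterName, &clientcmd.ConfigOverrides{}, nil).ClientConfig()
	if err != nil {
		return nil, err
	}
	return cluster.New(restCfg)
}
```

A dynamic provider would additionally watch for changes and engage/disengage clusters with the manager as they appear and disappear.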

@sttts (Author) commented Feb 3, 2025:

Linked the updated implementation #3019.

@maximilianbraun commented Feb 3, 2025:

+1, highly appreciated in the name of @SAP, and needed in the context of our open source efforts with our European partners.

@mirzakopic:

+1 We are very excited about this and are looking forward to using it in our open source efforts with https://apeirora.eu/ @SAP

@alvaroaleman (Member) left a comment:

There are actually two subcases in here that we don't seem to be differentiating well:
a) Controller whose request will always contain the cluster the event originated from
b) Controller whose request may contain a cluster that is not the one the event originated from

I think you are mostly thinking about a) with this proposal. You write:

We could deepcopy the builder instead of the sources and handlers. This would
lead to one controller and one workqueue per cluster. For the reason outlined
in the previous alternative, this is not desirable.

I am having trouble finding the "reasons outlined", and I do think instantiating a controller per cluster rather than adding/removing eventsources is preferable whenever possible, because:

  1. Otherwise workqueue metrics get diluted; this is a big issue for ops, as workqueue_depth is usually one of the main metrics people alert on
  2. It requires zero changes in existing reconcilers

I'd love to hear your thoughts on this and I do think we should clearly differentiate the two cases in this doc. What also seems to get lost a bit is that the main issue today is with doing all of this dynamically; pkg/cluster already exists, and if the cluster set is static, this design adds very little.

Consequently, there is no multi-cluster controller ecosystem, but there could and
should be.
- kcp maintains a [controller-runtime fork with multi-cluster support](https://github.com/kcp-dev/controller-runtime)
because adding support on-top leads to an inefficient controller design, and
Member:

leads to an inefficient controller design

Could you elaborate on this point?

@sttts (Author), Feb 10, 2025:

We explicitly don't want one controller (with its own workqueue) per cluster.

Example: forget about workspaces. Imagine controller-runtime only supported controllers per one (!) namespace, i.e. another controller (with its own workqueue) for every namespace you want to serve. Same argument here, just a level higher.

Author:

And independently, you could imagine cases where the same is true, e.g. cluster-api setups where the workqueue should be shared. That's what this enhancement is enabling.

Member:

This is a decision that shouldn't be forced onto customers. I can see the case where a workqueue per cluster is desired, as it provides some built-in fairness.

Member:

We explicitly don't want one controller (with its own workqueue) per cluster.

Others might disagree. This question needs a proper evaluation with pros and cons of both approaches rather than jumping to conclusions.

Author:

I agree that both topologies can have their place. Am not even sure pro/con is helpful. We shouldn't be opinionated, but give the primitives for the developer to decide.

Member:

Copying my comment from here since this is the bigger thread:

Would you be comfortable with a way to tell this design to either use a shared queue (e.g. start sources) or to start controllers with a config switch [in the manager] or similar?

Member:

I agree that both topologies can have their place. Am not even sure pro/con is helpful. We shouldn't be opinionated, but give the primitives for the developer to decide.

A stated goal of this doc is to avoid divergence in the ecosystem; writing that down while at the same time handwaving away comments about this not actually being a good approach, and saying we shouldn't be opinionated, is not particularly convincing.

Our goal is to make the majority use case simple and other use cases possible. This is not possible if we refuse to even look into the question of what the majority use case is, and default to assuming that the use case of the design's author must be the majority use case.

Author:

Fair enough. I didn't mean we shouldn't think about which workqueue topology is useful when. I meant that there are good reasons for a joint workqueue in some situations (like when I want to throttle all reconciles in a process because that throughput is limited), and for independent ones in other situations (like when writes to a cluster are the limiting factor).

I played with a TypedFair queue:

```go
// Fair is a queue that ensures items are dequeued fairly across different
// fairness keys while maintaining FIFO order within each key.
type Fair TypedFair[any]

// FairnessKeyFunc is a function that returns a string key for a given item.
// Items with different keys are dequeued fairly.
type FairnessKeyFunc[T comparable] func(T) string

// NewFair creates a new Fair instance.
func NewFair(keyFunc FairnessKeyFunc[any]) *Fair {
	return (*Fair)(NewTypedFair[any](keyFunc))
}
```

that could be plugged in here, wrapped by throttling and delays.
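To connect this with the cluster-aware request type proposed further down: a fairness key could simply be the originating cluster name. A tiny self-contained sketch, where the request type is a simplified stand-in for the proposed one:

```go
package main

import "fmt"

// ClusterAwareRequest is a simplified stand-in for the request type proposed
// further down: a normal reconcile request plus the originating cluster name.
type ClusterAwareRequest struct {
	Namespace, Name, ClusterName string
}

// clusterFairnessKey groups queue items by the cluster a request came from,
// so a single busy or broken cluster cannot starve the others in a shared queue.
func clusterFairnessKey(item any) string {
	if req, ok := item.(ClusterAwareRequest); ok {
		return req.ClusterName
	}
	return "" // empty name means the default/hub cluster
}

func main() {
	// With the sketched queue above this would be: q := NewFair(clusterFairnessKey)
	fmt.Println(clusterFairnessKey(ClusterAwareRequest{Namespace: "default", Name: "foo", ClusterName: "cluster-a"}))
}
```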

object to these kind of changes.

Here we call a controller multi-cluster-compatible if the reconcilers get
reconcile requests in cluster `X` and do all reconciliation in cluster `X`. This
Member:

So "Start/Stop a controller for each cluster" is out of scope, this is purely about "Add/Remove sources to/from controller on cluster arrival/departure"?

Author:

Related to "one workqueue" I guess. Start/Stop means another workqueue, which we don't want.

```

`ctrl.Manager` will implement `cluster.Aware`. As specified in the `Provider` interface,
it is the cluster provider's responsibility to call `Engage` and `Disengage` on a `ctrl.Manager`
Member:

Doesn't that mean the current interface specification for cluster.Provider is insufficient, as this entails that the cluster.Provider needs a reference to the manager?

Author:

That's out of scope of the goal of the interface. Wiring the manager in will happen when starting the provider:

```go
prov := NewProvider()
mgr := NewManager(..., Options: {provider: prov})
go mgr.Start(ctx)
go prov.Start(ctx, mgr)
```
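For illustration, a non-authoritative sketch of the provider side of that wiring: a discovery loop that engages new clusters with the manager and relies on context cancellation when they disappear. The types are simplified stand-ins (the real code uses `cluster.Cluster`), and the discovery mechanism is left abstract.

```go
package sketch

import (
	"context"
	"time"
)

// Simplified stand-ins for the proposal's types; the real code uses cluster.Cluster.
type Cluster interface{ Name() string }

// Aware is the slice of the manager a provider needs: Engage only, relying on
// context cancellation instead of an explicit Disengage (see the discussion below).
type Aware interface {
	Engage(ctx context.Context, cl Cluster) error
}

// Provider discovers clusters and engages them with the manager it is started with.
type Provider struct {
	discover func(ctx context.Context) ([]Cluster, error) // e.g. watch a CRD, scan a kubeconfig, ...
	engaged  map[string]context.CancelFunc
}

// Start runs the discovery loop; the manager is handed in here, matching the
// wiring in the snippet above (go prov.Start(ctx, mgr)).
func (p *Provider) Start(ctx context.Context, mgr Aware) error {
	if p.engaged == nil {
		p.engaged = map[string]context.CancelFunc{}
	}
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	for {
		clusters, err := p.discover(ctx)
		if err != nil {
			return err
		}
		seen := map[string]bool{}
		for _, cl := range clusters {
			seen[cl.Name()] = true
			if _, ok := p.engaged[cl.Name()]; ok {
				continue // already engaged
			}
			clusterCtx, cancel := context.WithCancel(ctx) // tied to this cluster's lifecycle
			if err := mgr.Engage(clusterCtx, cl); err != nil {
				cancel()
				return err
			}
			p.engaged[cl.Name()] = cancel
		}
		// A cluster that disappeared is "disengaged" by cancelling its context.
		for name, cancel := range p.engaged {
			if !seen[name] {
				cancel()
				delete(p.engaged, name)
			}
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}
```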

// The given context is tied to the Cluster's lifecycle and will be cancelled when the
// Cluster is removed or an error occurs.
//
// Implementers should return an error if they cannot start operations for the given Cluster,
Member:

If it's non-blocking, how is the controller supposed to surface errors here?

Member:

The idea is that anything that needs to be done for starting operations on a cluster is "blocking" (and thus would return an error), but the operations on the engaged cluster themselves are not blocking.

Member:

Can you give an example?

I have some trouble understanding the difference between "anything that needs to be done for starting operations on a cluster" vs "operations on the engaged cluster"; both seem to be done or asynchronously triggered (maybe?) in the Engage func.

Member:

One of the things that happens in the prototype implementation (#3019) for the typedMultiClusterController is starting the new watches on the newly engaged cluster. So if that fails, Engage returns an error, but it is not blocking for processing items.


The multi-cluster controller implementation reacts to engaged clusters by starting
a new `TypedSyncingSource` that also wraps the context passed down from the call to `Engage`,
which _MUST_ be canceled by the cluster provider at the end of a cluster's lifecycle.
Member:

So you are saying the ctx passed by cluster.Provider when calling Engage on the manager needs to be stored by the manager and re-used when calling Engage on any multi-cluster runnable which in turn needs to use it to control the lifecycle of the source? What is the point of having Disengage then?

Member:

I think you are right, and we can do without Disengage. I will try to change the prototype implementation to eliminate it.

Member:

This seems to imply that the context used to call Start() on a Source will be used to stop the Source?

I think stopping the Source is not possible by just canceling this context for our current source implementations (or at least some of them)

IIRC the only similar functionality that we have today is that cancelling the context passed into Cache.Start() will stop all informers.

I also wonder how this works when multiple controllers are sharing the same underlying informer. I think if a controller doesn't exclusively own an informer it also shouldn't just shut it down. Or is my assumption wrong that they usually would share informers? (like it works today when multiple controllers are sharing the same cache & underlying informers).

For additional context. If I apply what we currently do in Cluster API to this proposal, it would be the cluster provider that shuts down the cache and all underlying informers.

Author:

I think stopping the Source is not possible by just canceling this context for our current source implementations (or at least some of them)

Am curious which you have in mind.

Member:

At least source.Kind. The ctx passed into Start can be used to cancel the start process (e.g. WaitForCacheSync), but not the informer.

Comment on lines +241 to +243
type TypedDeepCopyableEventHandler[object any, request comparable] interface {
TypedEventHandler[object, request]
DeepCopyFor(c cluster.Cluster) TypedDeepCopyableEventHandler[object, request]
Member:

Why teach handlers to copy themselves rather than just layering this:

type HandlerConstructor[object any, request comparable] func(cluster.Cluster) TypedHandler[object, request]

?
(name likely needs improvement but you get the idea)

Member:

The reason for this was the attempt to keep existing function signatures stable while enabling them to be multi-cluster aware. A HandlerConstructor would probably need a new separate builder function to be passed as an argument, so e.g. something like Watches vs ClusterAwareWatches (or whatever).

I'm totally open to changing this if you prefer it.

Comment on lines +284 to +290
The builder will choose the correct `EventHandler` implementation for both `For` and `Owns`
depending on the `request` type used.

With the described changes (use `GetCluster(ctx, req.ClusterName)`, making `reconciler`
a `TypedFunc[reconcile.ClusterAwareRequest]`) an existing controller will automatically act as
a *uniform multi-cluster controller* if a cluster provider is configured.
It will reconcile resources from cluster `X` in cluster `X`.
Member:

It is worth pointing out that this can also be achieved by instantiating a controller per target cluster rather than adding/removing sources to/from an existing controller.

IMHO if you ever actually want to operate the resulting component, you likely want "create/remove controller" rather than "create/remove source", because otherwise a single problematic cluster can completely mess up the workqueue metrics, and on-calls can't tell if one cluster has an issue or all of them, which is going to be a big difference in terms of severity.

Member:

Would you be comfortable with a way to tell this design to either use a shared queue (e.g. start sources) or to start controllers with a config switch or similar?

Owns(&v1.ReplicaSet{}).
Complete(reconciler)
```

Member:

This doc is completely missing info on:

  • The actual implementation of a multi-cluster controller (i.e. Engage/Disengage in the Controller); we are not expecting users to do that, right?
  • The same for source.Source, but arguably a subtask of the above

Author:

See implementation proposal #3019.

Comment on lines +303 to +304
EngageWithDefaultCluster: ptr.To(true),
EngageWithProviderClusters: ptr.To(false),
Member:

This double binary distinction is pretty much guaranteed to be too little. It would be better if we somehow tell the builder if this is a multi-cluster controller or not, and then the cluster.Provider calls Engage for all clusters that should be engaged, and it's up to the implementer of the provider if they want to include the cluster the manager has a kubeconfig for or not.

If this is insufficient, we need a way for the cluster.Provider to decide if a given Aware runnable should be engaged or not.

@sbueringer (Member), Feb 12, 2025:

it's up to the implementer of the provider

Hm, not sure if it should be up to the provider. Let's say I implement a cluster provider for Cluster API. I think it shouldn't be my call whether all or none of the controllers that are used with this provider also watch the hub cluster.

I could make this a configuration option of the cluster provider, but that doesn't work because we only have one cluster provider per manager, and I think it's valid that only some controllers of a manager watch the hub cluster and others do not.

So I think it should be a per-controller decision.

Member:

Wondering if it would make sense to just always call Engage, and then the Engage function can simply do nothing if it doesn't want to engage a Cluster. This seems like the most flexible option.

(If necessary, Engage could have a bool return parameter signalling whether the controller actually engaged a cluster or not.)
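A small sketch of that "just do nothing" variant, with simplified stand-ins for `cluster.Aware` and `cluster.Cluster`; the filtering rule is made up and purely illustrative.

```go
package sketch

import (
	"context"
	"fmt"
	"strings"
)

// Simplified stand-ins for cluster.Aware and cluster.Cluster.
type Cluster interface{ Name() string }

// filteringRunnable engages only clusters whose name matches a prefix.
type filteringRunnable struct {
	prefix string
}

// Engage either starts per-cluster operations or does nothing; returning nil
// still counts as a successful engagement, so the manager needs no special case.
func (f *filteringRunnable) Engage(ctx context.Context, cl Cluster) error {
	if !strings.HasPrefix(cl.Name(), f.prefix) {
		return nil // opted out; nothing to tear down later either
	}
	fmt.Printf("starting watches for cluster %q\n", cl.Name())
	return nil
}
```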

## Alternatives

- Multi-cluster support could be built outside of core controller-runtime. This would
likely lead to a design with one manager per cluster. This has a number of problems:
Member:

No, this is precisely why pkg/cluster exists.

@sbueringer (Member), Feb 12, 2025:

I'm curious about the details of why it is not possible to implement this outside of CR. (I get the points around adoption across the ecosystem, just wondering about the technical reasons.)

We have implemented a component (called ClusterCache) in Cluster API that seems to come very close to what this design is trying to achieve (apart from that it is Cluster API specific of course). Especially since the generic support was added to CR.

Basically ClusterCache in CAPI:

  • discovers Clusters
  • maintains a Cache per Cluster
  • allows retrieving Clients for a Cluster
  • allows adding Watches (kind Sources) for a Cluster
    • this also allows mapping events that we get from these sources back to the one controller with the one shared work queue

xref: https://github.com/kubernetes-sigs/cluster-api/tree/main/controllers/clustercache

P.S. we are not creating multiple Cluster objects, instead we have our own simplified version that only contains what we need (https://github.com/kubernetes-sigs/cluster-api/blob/main/controllers/clustercache/cluster_accessor.go#L85-L89)

P.S.2. I don't understand the last two points in this list

Author:

The main "blocker" for this is that the builder is pretty common in 3rdparty controller code. If we do all of this outside of CR, this will very likely mean a fork of pkg/builder. I think everything else in the implementation PR could be done outside of CR.

Consequently, there is no multi-cluster controller ecosystem, but there could and
should be.
- kcp maintains a [controller-runtime fork with multi-cluster support](https://github.com/kcp-dev/controller-runtime)
because adding support on-top leads to an inefficient controller design, and
Member:

This is a decision that shouldn't be forced onto customers. I can see the case where a workqueue per cluster is desired, as it provides some built-in fairness.

- writing controllers for upper systems in a **portable** way is hard today.
Consequently, there is no multi-cluster controller ecosystem, but there could and
should be.
- kcp maintains a [controller-runtime fork with multi-cluster support](https://github.com/kcp-dev/controller-runtime)
Member:

On a general note, for the purpose of this proposal we should focus on general controller-runtime users, while we can keep kcp as a reference alongside other implementations. I'd rephrase the motivation at a high level: "set up controllers and watches across multiple Kubernetes clusters in a transparent way".

Member:

I think we tried to cover the high-level part with the first bullet point. The kcp controller-runtime fork is just mentioned to give personal motivation, but I don't think we have to mention it here if that is preferred.

Comment on lines +88 to +90
- Run a controller-runtime controller against a kubeconfig with arbitrarily many contexts, all being reconciled.
- Run a controller-runtime controller against cluster managers like kind, Cluster API, Open-Cluster-Manager or Hypershift.
- Run a controller-runtime controller against a kcp shard with a wildcard watch.
Member:

Would it be possible to focus the first iteration of this proposal on how Kubernetes works today? Adding uncommon use cases at this point in time increases the overall complexity of the implementation. Other use cases should be pluggable IMO.

Member:

We agree things should be highly pluggable; that's why this is just an (incomplete) list of things that you could eventually plug in. Agreed that kcp is an uncommon use case, but so far we've made (recent) design decisions with Kubernetes clusters in general in mind. The "kcp" provider we'd like to build is itself far from ready yet.

Author:

That wouldn't be helpful for us, though. I don't think the design is influenced much by the kcp requirements now, maybe with the exception of the shared workqueue. Other than that, the fleet-namespace example (which kind of reflects the kcp requirements) shows that the kcp use case can be covered by a pretty generic design.

out-of-tree subprojects that can individually evolve and be vendor'ed by controller authors.
- Make controller-runtime controllers "binary pluggable".
- Manage one manager per cluster.
- Manage one controller per cluster with dedicated workqueues.
Member:

Suggested change (drop this non-goal):
- Manage one controller per cluster with dedicated workqueues.

This should be a goal.

type Provider interface {
// Get returns a cluster for the given identifying cluster name. Get
// returns an existing cluster if it has been created before.
Get(ctx context.Context, clusterName string) (Cluster, error)
Member:

The second argument should probably be a typed reference, like we have for ObjectReference; even if it contains a single Name field, it would help with expanding it later. WDYT?

Member:

Are you thinking of logical.Name here, or more of a struct?

Comment on lines +196 to +199
The embedded `cluster.Cluster` corresponds to `GetCluster(ctx, "")`. We call the
clusters with non-empty name "provider clusters" or "engaged clusters", while
the embedded cluster of the manager is called the "default cluster" or "hub
cluster".
Member:

Let's be explicit here; one of the main issues is that an empty string is also a default value. We can set a cluster.Reference to a very specific value, which in turn is used across the entire codebase.
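For illustration only, the kind of explicit reference type this hints at; every name and value below is made up.

```go
package main

import "fmt"

// Reference identifies a cluster within a manager; a typed value instead of a
// bare string.
type Reference string

// DefaultCluster is an explicit, unmistakable value for the manager's own
// ("hub") cluster, so the empty string no longer doubles as the default.
const DefaultCluster Reference = "#default"

func (r Reference) IsDefault() bool { return r == DefaultCluster }

func main() {
	fmt.Println(DefaultCluster.IsDefault())         // true
	fmt.Println(Reference("cluster-a").IsDefault()) // false
}
```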

// pkg/reconcile
type ClusterAwareRequest struct {
Request
ClusterName string
Member:

Suggested change
ClusterName string
Cluster cluster.Reference

can be used as `request` type even for controllers that do not have an active cluster provider.
The cluster name will simply be an empty string, which is compatible with calls to `mgr.GetCluster`.

**Note:** controller-runtime must provide this cluster-aware request type to
Member:

Nit: we're saying "must" here but "SHOULD" in the text right underneath the Cluster-Aware Request title.

Complete(reconciler)

// new
builder.TypedControllerManagedBy[reconcile.ClusterAwareRequest](mgr).
Member:

This is a bit too easy to mess up. What's stopping a reconcile.ClusterAwareRequest from being used in the wrong place, or vice versa?

Member:

In general: reconcile.ClusterAwareRequest could totally be used in a non-multi-cluster setup, and it wouldn't change a thing since it embeds a reconcile.Request. If the ClusterName field is empty, req.ClusterName would imply the "default cluster", which is the single-cluster use case today (e.g. we changed the fleet example in #3019 to have a flag that lets you toggle multi-cluster vs single-cluster usage).

If you end up using reconcile.Request, you would quickly notice that you don't have the cluster name to pass to mgr.GetCluster.
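To make that concrete, a sketch of a reconciler written against the proposed request type. `ClusterAwareRequest` and the `GetCluster` method are mirrored from the proposal (they do not exist in controller-runtime today); the rest uses existing controller-runtime types.

```go
package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/cluster"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// ClusterAwareRequest mirrors the proposed request type: a plain
// reconcile.Request plus the name of the cluster the event came from.
type ClusterAwareRequest struct {
	reconcile.Request
	ClusterName string
}

// clusterGetter mirrors the proposed manager method; it does not exist in
// controller-runtime today.
type clusterGetter interface {
	GetCluster(ctx context.Context, name string) (cluster.Cluster, error)
}

// reconciler works the same in single- and multi-cluster setups: an empty
// ClusterName resolves to the default ("hub") cluster.
type reconciler struct {
	mgr clusterGetter
}

func (r *reconciler) Reconcile(ctx context.Context, req ClusterAwareRequest) (reconcile.Result, error) {
	cl, err := r.mgr.GetCluster(ctx, req.ClusterName)
	if err != nil {
		return reconcile.Result{}, err
	}
	// From here on this is an ordinary reconciler, just scoped to cl.
	cm := &corev1.ConfigMap{}
	if err := cl.GetClient().Get(ctx, req.NamespacedName, cm); err != nil {
		return reconcile.Result{}, err
	}
	return reconcile.Result{}, nil
}
```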

Comment on lines +292 to +294
For a manager with `cluster.Provider`, the builder _SHOULD_ create a controller
that sources events **ONLY** from the provider clusters that got engaged with
the controller.
Member:

Does this mean that we won't get events for the "default" cluster?

Author:

It can be configured per controller. We default to writing a uniform multi-cluster controller, i.e. one that only reacts to provider clusters. It is not common, AFAIK, to have the same semantics for both a local cluster (usually the hub) and provider clusters.


- Ship integration for different multi-cluster setups. This should become
out-of-tree subprojects that can individually evolve and be vendor'ed by controller authors.
- Make controller-runtime controllers "binary pluggable".
Member:

What does "binary pluggable" mean in this context?

Author:

Something like https://pkg.go.dev/plugin to dynamically load providers.

### Controller Author without self-interest in multi-cluster, but open for adoption in multi-cluster setups

- Replace `mgr.GetClient()` and `mgr.GetCache` with `mgr.GetCluster(req.ClusterName).GetClient()` and `mgr.GetCluster(req.ClusterName).GetCache()`.
- Make manager and controller plumbing vendor'able to allow plugging in multi-cluster provider and BYO request type.
@sbueringer (Member), Feb 12, 2025:

Is the idea that something like cert-manager would decide on startup which cluster provider should be used and can then only work with one cluster provider at a time?

Phrased differently: do we also want to support using multiple cluster providers at the same time?

@sttts (Author), Feb 12, 2025:

  • One provider per manager at a time.
  • Low friction to make reconcilers uniform-multi-cluster capable (this basically means using the cluster-enabled request and calling mgr.GetCluster(name) instead of accessing cluster methods directly).
  • If a controller project wants to add support for a number of providers in their repository, this is fine, but not necessarily the goal.
  • Instead it should be easy to instantiate the controllers from an alternative main.go with a provider of your choice.

@sttts (Author) commented Feb 16, 2025:

I already discussed this with @sbueringer and others:

I went ahead earlier this week to implement the design outside of controller-runtime: https://github.com/multicluster-runtime/multicluster-runtime.

As written in the upper comment, everything but the builder is easy to implement by wrapping controller-runtime concepts. Thanks to the extensive work to get Go generics into CR, those wrappers are easy and natural.

The builder consists of roughly 500 lines of code with mostly mechanical changes, which should be feasible to regularly rebase onto the latest state in controller-runtime.

Our plan for the moment is to use https://github.com/multicluster-runtime/multicluster-runtime to prove the design with non-trivial projects. Last but not least, it will help to iterate fast and learn on the way. It is not ruled out that parts of it should flow back into controller-runtime at some point. Definitely it can and should influence the design.
