Client-side QPS appears to share a bucket with the leader election client #3092

Open · jonathan-innis opened this issue Jan 28, 2025 · 2 comments

jonathan-innis (Member) commented Jan 28, 2025

While scale testing kubernetes-sigs/karpenter using the controller-runtime client with client-side QPS enabled, we would try to scale up thousands of objects at once. While this scale-up was occurring, we saw logs indicating that the client was being client-side throttled (which was expected); what wasn't expected was that, during that same window, we would also fail to update our lease and lose leader election.

I ran this on a large AWS instance, so I think it's highly unlikely that we were being CPU throttled. To me, this looked like a case where the client-side QPS used by the regular client was sharing a rate-limit bucket with the lease client's QPS, causing both to be throttled at the same rate.

Do we know whether the lease QPS and the generic object QPS share the same bucket? If so, does it make sense to split them into separate buckets, since retaining the lease should generally be prioritized over creating or updating other objects at the apiserver?
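
For reference, here's a rough sketch of my understanding of how client-go picks the client-side rate limiter when a REST client is built from a rest.Config (a simplified model, not the actual client-go source): if RateLimiter is set explicitly on the config, every client built from that config, including copies made with rest.CopyConfig, ends up pointing at that one token bucket; if only QPS/Burst are set, each client builds its own bucket.

// Simplified sketch (not the actual client-go source) of how the per-client
// rate limiter is chosen when a REST client is built from a rest.Config.
package main

import (
	"fmt"

	"k8s.io/client-go/rest"
	"k8s.io/client-go/util/flowcontrol"
)

// limiterFor mirrors my understanding of rest.RESTClientFor: an explicitly set
// RateLimiter wins; otherwise a fresh token bucket is built from QPS/Burst.
func limiterFor(cfg *rest.Config) flowcontrol.RateLimiter {
	if cfg.RateLimiter != nil {
		// Shared: rest.CopyConfig copies the interface value, so clients
		// built from a copy still point at this same bucket.
		return cfg.RateLimiter
	}
	if cfg.QPS > 0 {
		// Per-client: every client built this way gets its own bucket.
		return flowcontrol.NewTokenBucketRateLimiter(cfg.QPS, cfg.Burst)
	}
	// No client-side limiting configured.
	return flowcontrol.NewFakeAlwaysRateLimiter()
}

func main() {
	cfg := &rest.Config{QPS: 10, Burst: 20}
	cfg.RateLimiter = flowcontrol.NewTokenBucketRateLimiter(cfg.QPS, cfg.Burst)

	leaderCfg := rest.CopyConfig(cfg)
	// With an explicit RateLimiter, both configs hand out the same bucket.
	fmt.Println("shared bucket:", limiterFor(cfg) == limiterFor(leaderCfg))
}

If that model is right, then explicitly setting RateLimiter on the config we pass to the manager would be enough to put the lease client in the same bucket as everything else.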

Error Log

{"level":"INFO","time":"2024-11-20T07:31:09.079Z","logger":"controller","caller":"rest/request.go:634","message":"Waited for 17.884562833s due to client-side throttling, not priority and fairness, request: PATCH:https://10.100.0.1:443/apis/karpenter.sh/v1/nodeclaims/default-50-9trgn","commit":"0b107d2-dirty"}
panic: leader election lost

goroutine 197 [running]:
github.com/samber/lo.must({0x1d02ec0, 0xc024b31160}, {0x0, 0x0, 0x0})
        github.com/samber/[email protected]/errors.go:53 +0x1df
github.com/samber/lo.Must0(...)
        github.com/samber/[email protected]/errors.go:72
sigs.k8s.io/karpenter/pkg/operator.(*Operator).Start.func1()
        sigs.k8s.io/karpenter/pkg/operator/operator.go:258 +0x77
created by sigs.k8s.io/karpenter/pkg/operator.(*Operator).Start in goroutine 1
        sigs.k8s.io/karpenter/pkg/operator/operator.go:256 +0xe5
alvaroaleman (Member) commented:

The explanation seems unlikely to me; we copy the config we use for leader election:

leaderConfig = rest.CopyConfig(config)

Did you see logs about requests being throttled from client-go? It logs when it does that.

An easy way to test your theory would be to set the LeaderElectionConfig to a copy of your config in the manager opts.
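
Something along these lines (a rough sketch only; the QPS/Burst numbers are made up, and the ID/namespace are taken from your logs, so adjust to whatever you actually use):

// Rough sketch: give leader election its own copy of the rest.Config, and
// therefore its own client-side rate-limit bucket, so lease renewals aren't
// starved by bulk object traffic. Numbers below are illustrative only.
package main

import (
	"k8s.io/client-go/rest"
	ctrl "sigs.k8s.io/controller-runtime"
)

func newManager(cfg *rest.Config) (ctrl.Manager, error) {
	leaderCfg := rest.CopyConfig(cfg)
	leaderCfg.RateLimiter = nil // don't inherit a shared limiter instance
	leaderCfg.QPS = 5           // small, dedicated budget for the lease client
	leaderCfg.Burst = 10

	return ctrl.NewManager(cfg, ctrl.Options{
		LeaderElection:          true,
		LeaderElectionNamespace: "kube-system",               // example value
		LeaderElectionID:        "karpenter-leader-election", // from your logs
		LeaderElectionConfig:    leaderCfg,
	})
}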

jonathan-innis (Member, Author) commented:

Here's a longer log example of the client-side rate limiting that we were seeing. I'll definitely try the LeaderElectionConfig suggestion that you called out!

{"level":"ERROR","time":"2025-01-29T05:26:25.921Z","logger":"controller","caller":"leaderelection/leaderelection.go:285","message":"Failed to update lock optimistically: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline, falling back to slow path","commit":"2a09110-dirty"}
{"level":"ERROR","time":"2025-01-29T05:26:25.921Z","logger":"controller","caller":"leaderelection/leaderelection.go:285","message":"error retrieving resource lock kube-system/karpenter-leader-election: client rate limiter Wait returned an error: context deadline exceeded","commit":"2a09110-dirty"}
{"level":"INFO","time":"2025-01-29T05:26:25.921Z","logger":"controller","caller":"wait/backoff.go:226","message":"failed to renew lease kube-system/karpenter-leader-election: context deadline exceeded","commit":"2a09110-dirty"}
{"level":"ERROR","time":"2025-01-29T05:26:25.921Z","logger":"controller","caller":"leaderelection/leaderelection.go:303","message":"Failed to release lock: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline","commit":"2a09110-dirty"}
panic: leader election lost

goroutine 96 [running]:
github.com/samber/lo.must({0x1cecc40, 0xc7c3b19730}, {0x0, 0x0, 0x0})
        github.com/samber/[email protected]/errors.go:53 +0x1df
github.com/samber/lo.Must0(...)
        github.com/samber/[email protected]/errors.go:72
sigs.k8s.io/karpenter/pkg/operator.(*Operator).Start.func1()
        sigs.k8s.io/karpenter/pkg/operator/operator.go:220 +0x75
created by sigs.k8s.io/karpenter/pkg/operator.(*Operator).Start in goroutine 1
        sigs.k8s.io/karpenter/pkg/operator/operator.go:218 +0xa5
