Client-side QPS appears to share a bucket with the leader election client #3092

Open · jonathan-innis opened this issue Jan 28, 2025 · 2 comments

jonathan-innis (Member) commented Jan 28, 2025

While scale testing kubernetes-sigs/karpenter using the controller-runtime client with client-side QPS enabled, we would try to scale up thousands of objects at once. While this scale-up was occurring, we saw logs indicating that the client was being client-side throttled (which was expected); what wasn't expected was that, during that same window, we would also fail to update our lease and lose leader election.

I ran this on a large AWS instance, so I think it's highly unlikely that we were being CPU throttled. To me, this looked like a case where the client-side QPS used by the regular client was sharing a rate-limit bucket with the lease client's QPS, causing both to be throttled at the same rate.

Do we know whether the lease QPS and the generic object QPS share the same bucket? If so, does it make sense to split them into separate buckets, since retaining the lease should generally be prioritized over creating or updating other objects at the apiserver?
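
For reference, here's a rough sketch of my understanding of how client-go picks the client-side rate limiter when a REST client is built from a rest.Config (a simplified model, not the actual client-go source): if RateLimiter is set explicitly on the config, every client built from that config, including copies made with rest.CopyConfig, ends up pointing at that one token bucket; if only QPS/Burst are set, each client builds its own bucket.

// Simplified sketch (not the actual client-go source) of how the per-client
// rate limiter is chosen when a REST client is built from a rest.Config.
package main

import (
	"fmt"

	"k8s.io/client-go/rest"
	"k8s.io/client-go/util/flowcontrol"
)

// limiterFor mirrors my understanding of rest.RESTClientFor: an explicitly set
// RateLimiter wins; otherwise a fresh token bucket is built from QPS/Burst.
func limiterFor(cfg *rest.Config) flowcontrol.RateLimiter {
	if cfg.RateLimiter != nil {
		// Shared: rest.CopyConfig copies the interface value, so clients
		// built from a copy still point at this same bucket.
		return cfg.RateLimiter
	}
	if cfg.QPS > 0 {
		// Per-client: every client built this way gets its own bucket.
		return flowcontrol.NewTokenBucketRateLimiter(cfg.QPS, cfg.Burst)
	}
	// No client-side limiting configured.
	return flowcontrol.NewFakeAlwaysRateLimiter()
}

func main() {
	cfg := &rest.Config{QPS: 10, Burst: 20}
	cfg.RateLimiter = flowcontrol.NewTokenBucketRateLimiter(cfg.QPS, cfg.Burst)

	leaderCfg := rest.CopyConfig(cfg)
	// With an explicit RateLimiter, both configs hand out the same bucket.
	fmt.Println("shared bucket:", limiterFor(cfg) == limiterFor(leaderCfg))
}

If that model is right, then explicitly setting RateLimiter on the config we pass to the manager would be enough to put the lease client in the same bucket as everything else.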

Error Log

{"level":"INFO","time":"2024-11-20T07:31:09.079Z","logger":"controller","caller":"rest/request.go:634","message":"Waited for 17.884562833s due to client-side throttling, not priority and fairness, request: PATCH:https://10.100.0.1:443/apis/karpenter.sh/v1/nodeclaims/default-50-9trgn","commit":"0b107d2-dirty"}
panic: leader election lost

goroutine 197 [running]:
github.com/samber/lo.must({0x1d02ec0, 0xc024b31160}, {0x0, 0x0, 0x0})
        github.com/samber/[email protected]/errors.go:53 +0x1df
github.com/samber/lo.Must0(...)
        github.com/samber/[email protected]/errors.go:72
sigs.k8s.io/karpenter/pkg/operator.(*Operator).Start.func1()
        sigs.k8s.io/karpenter/pkg/operator/operator.go:258 +0x77
created by sigs.k8s.io/karpenter/pkg/operator.(*Operator).Start in goroutine 1
        sigs.k8s.io/karpenter/pkg/operator/operator.go:256 +0xe5
alvaroaleman (Member) commented:

The explanation seems unlikely to me; we copy the config we use for leader election:

leaderConfig = rest.CopyConfig(config)

Did you see logs about requests being throttled from client-go? It logs when it does that.

An easy way to test your theory would be to set the LeaderElectionConfig to a copy of your config in the manager opts.
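
Something along these lines (a rough sketch only; the QPS/Burst numbers are made up, and the ID/namespace are taken from your logs, so adjust to whatever you actually use):

// Rough sketch: give leader election its own copy of the rest.Config, and
// therefore its own client-side rate-limit bucket, so lease renewals aren't
// starved by bulk object traffic. Numbers below are illustrative only.
package main

import (
	"k8s.io/client-go/rest"
	ctrl "sigs.k8s.io/controller-runtime"
)

func newManager(cfg *rest.Config) (ctrl.Manager, error) {
	leaderCfg := rest.CopyConfig(cfg)
	leaderCfg.RateLimiter = nil // don't inherit a shared limiter instance
	leaderCfg.QPS = 5           // small, dedicated budget for the lease client
	leaderCfg.Burst = 10

	return ctrl.NewManager(cfg, ctrl.Options{
		LeaderElection:          true,
		LeaderElectionNamespace: "kube-system",               // example value
		LeaderElectionID:        "karpenter-leader-election", // from your logs
		LeaderElectionConfig:    leaderCfg,
	})
}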

jonathan-innis (Member, Author) commented:

Here's a longer log example of the client-side rate limiting that we were seeing. I'll definitely try the LeaderElectionConfig suggestion that you called out!

{"level":"ERROR","time":"2025-01-29T05:26:25.921Z","logger":"controller","caller":"leaderelection/leaderelection.go:285","message":"Failed to update lock optimistically: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline, falling back to slow path","commit":"2a09110-dirty"}
{"level":"ERROR","time":"2025-01-29T05:26:25.921Z","logger":"controller","caller":"leaderelection/leaderelection.go:285","message":"error retrieving resource lock kube-system/karpenter-leader-election: client rate limiter Wait returned an error: context deadline exceeded","commit":"2a09110-dirty"}
{"level":"INFO","time":"2025-01-29T05:26:25.921Z","logger":"controller","caller":"wait/backoff.go:226","message":"failed to renew lease kube-system/karpenter-leader-election: context deadline exceeded","commit":"2a09110-dirty"}
{"level":"ERROR","time":"2025-01-29T05:26:25.921Z","logger":"controller","caller":"leaderelection/leaderelection.go:303","message":"Failed to release lock: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline","commit":"2a09110-dirty"}
panic: leader election lost

goroutine 96 [running]:
github.com/samber/lo.must({0x1cecc40, 0xc7c3b19730}, {0x0, 0x0, 0x0})
        github.com/samber/[email protected]/errors.go:53 +0x1df
github.com/samber/lo.Must0(...)
        github.com/samber/[email protected]/errors.go:72
sigs.k8s.io/karpenter/pkg/operator.(*Operator).Start.func1()
        sigs.k8s.io/karpenter/pkg/operator/operator.go:220 +0x75
created by sigs.k8s.io/karpenter/pkg/operator.(*Operator).Start in goroutine 1
        sigs.k8s.io/karpenter/pkg/operator/operator.go:218 +0xa5
