fix(cluster): make ClusterManager concurrency-safe with sync.RWMutex #436
Open
DioCrafts wants to merge 1 commit into kite-org:main from
Conversation
Finding 2.1: ClusterManager.clusters and .errors maps were read from
every HTTP request (via GetClientSet, GetClusters, GetClusterList) and
written by the background syncClusters goroutine without any
synchronization. This is a data race that causes 'concurrent map read
and map write' panics in production.
Solution A — sync.RWMutex:
- Added sync.RWMutex to ClusterManager struct
- GetClientSet() now holds RLock while reading maps
- Readers (HTTP requests) can execute concurrently with each other
- Only syncClusters takes an exclusive Lock, and only briefly
Solution D — Encapsulated access methods:
- Snapshot() returns shallow copies of both maps + defaultContext
under RLock so callers can iterate safely without holding the lock
- ClusterVersion(name) and ClusterError(name) provide single-key
lookups under RLock for GetClusterList
- GetClusters and GetClusterList no longer touch cm.clusters/errors
directly — impossible to forget the lock in future changes
Solution E — Minimal write-lock duration in syncClusters:
- Phase 1: RLock — snapshot current state (microseconds)
- Phase 2: No lock — build all new ClientSets (slow I/O, seconds)
- Phase 3: Lock — swap 3 pointers (microseconds)
- Phase 4: No lock — stop old clients
- The exclusive Lock is held for only ~microseconds instead of the
entire duration of building new K8s clients (potentially seconds)
Cleanup:
- Replaced map[string]interface{} with map[string]struct{} for
dbClusterMap (no value needed, just set membership)
- Eliminated delete-during-iterate pattern that was unsafe with
concurrent readers
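The `map[string]struct{}` change is plain set membership with zero-byte values. A tiny illustrative sketch (the values are made up; only the name `dbClusterMap` comes from the diff):

```go
package main

import "fmt"

func main() {
	// Set membership with zero-byte values: map[string]struct{} stores
	// nothing per key, unlike map[string]interface{} which allocates
	// an interface value for every entry.
	dbClusterMap := make(map[string]struct{})
	for _, name := range []string{"prod", "staging"} {
		dbClusterMap[name] = struct{}{}
	}
	_, inDB := dbClusterMap["prod"]
	fmt.Println(len(dbClusterMap), inDB) // 2 true
}
```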
🔒 fix(cluster): Make ClusterManager concurrency-safe with sync.RWMutex
Summary

The `ClusterManager` — the central hub that every single HTTP request passes through to get a Kubernetes client — has a data race that can crash the entire process in production. The `clusters` and `errors` maps are read by every HTTP request and written by a background goroutine without any synchronization.

This PR fixes the data race with a `sync.RWMutex` while simultaneously improving sync performance: `syncClusters()` is restructured to hold the exclusive write-lock for only microseconds instead of seconds.

The Problem
Data race = random crashes in production

`ClusterManager` stores its state in three unprotected fields: `clusters`, `errors`, and `defaultContext`. These fields are accessed from two concurrent contexts:

- `ClusterMiddleware` — reads the `clusters` map
- `GetClusters()` — reads the `clusters` + `errors` maps
- `GetClusterList()` — reads the `clusters` + `errors` maps
- `syncClusters()` — writes `clusters`, `errors`, and `defaultContext`

In Go, a concurrent read+write to a map is undefined behavior. The runtime deliberately detects this and panics with `fatal error: concurrent map read and map write`.

This is not theoretical — it will happen in production whenever a sync runs (on its periodic schedule, or triggered via the `syncNow` channel) while other users are reading. With Kite serving a team of developers, there are almost always concurrent HTTP requests, making this crash a matter of when, not if.
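The shape of the race, and the RWMutex fix, in miniature. This is a hedged sketch with stand-in types (not Kite's real structs): readers take `RLock`, the single writer takes `Lock`. Deleting the lock calls makes `go run -race` report exactly this class of data race.

```go
package main

import (
	"fmt"
	"sync"
)

// Simplified stand-in for ClusterManager: one guarded map.
type ClusterManager struct {
	mu       sync.RWMutex
	clusters map[string]string
}

// Readers (the HTTP-request side) share the lock.
func (cm *ClusterManager) read(name string) string {
	cm.mu.RLock()
	defer cm.mu.RUnlock()
	return cm.clusters[name]
}

// The writer (the background sync) takes it exclusively.
func (cm *ClusterManager) swap(next map[string]string) {
	cm.mu.Lock()
	cm.clusters = next // exclusive section: a single pointer assignment
	cm.mu.Unlock()
}

func main() {
	cm := &ClusterManager{clusters: map[string]string{"prod": "v1"}}
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ { // concurrent readers (the HTTP requests)
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = cm.read("prod")
		}()
	}
	cm.swap(map[string]string{"prod": "v2"}) // the background sync
	wg.Wait()
	fmt.Println(cm.read("prod")) // v2
}
```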
The original `syncClusters()` was also structurally unsafe

Beyond the missing mutex, the original code had additional issues:

- It deleted entries from `cm.clusters` while iterating it in the same function, with no protection against concurrent readers seeing a partially-modified map
- The slow work of building `ClientSet` objects (which involves TCP connections to Kubernetes API servers) happened while the maps were being modified — a window of seconds in which any concurrent reader could crash
- `map[string]interface{}` was used where `map[string]struct{}` suffices (unnecessary allocations)

The Solution
Three complementary strategies working together

1. `sync.RWMutex` — correct concurrent access (Solution A)

- Readers (`GetClientSet`, `GetClusters`, `GetClusterList`) acquire `RLock` — they run concurrently with each other, with zero contention between HTTP requests
- The single writer (`syncClusters`) acquires the exclusive `Lock` — but only for microseconds (see Solution E below)
- An uncontended `RLock`/`RUnlock` pair costs on the order of 10 nanoseconds — completely negligible next to the milliseconds spent on Kubernetes API calls

2. Encapsulated access methods — future-proof safety (Solution D)
Instead of having handlers directly access `cm.clusters[name]`, we provide safe methods. Why this matters:

- `Snapshot()` returns shallow copies, so `GetClusters()` can iterate without holding the lock (the lock is held only for the copy operation — microseconds)
- Handlers can no longer reach into `cm.clusters` directly, so future changes cannot forget the lock

Before, HTTP handlers read the shared maps directly — unsafe. After, they work with independent copies — safe.
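A sketch of the encapsulated accessors under assumed simplified types (the method names `Snapshot` and `ClusterError` are from the PR; the struct fields are illustrative):

```go
package main

import (
	"fmt"
	"sync"
)

type ClientSet struct{ Version string }

type ClusterManager struct {
	mu             sync.RWMutex
	clusters       map[string]*ClientSet
	errors         map[string]string
	defaultContext string
}

// Snapshot copies both maps under RLock; callers iterate the copies
// freely while syncClusters may swap the real maps at any time.
func (cm *ClusterManager) Snapshot() (map[string]*ClientSet, map[string]string, string) {
	cm.mu.RLock()
	defer cm.mu.RUnlock()
	clusters := make(map[string]*ClientSet, len(cm.clusters))
	for k, v := range cm.clusters {
		clusters[k] = v
	}
	errs := make(map[string]string, len(cm.errors))
	for k, v := range cm.errors {
		errs[k] = v
	}
	return clusters, errs, cm.defaultContext
}

// ClusterError is a single-key lookup under RLock, for GetClusterList.
func (cm *ClusterManager) ClusterError(name string) string {
	cm.mu.RLock()
	defer cm.mu.RUnlock()
	return cm.errors[name]
}

func main() {
	cm := &ClusterManager{
		clusters: map[string]*ClientSet{"prod": {Version: "v1.29"}},
		errors:   map[string]string{"staging": "unreachable"},
	}
	clusters, errs, _ := cm.Snapshot()
	fmt.Println(len(clusters), errs["staging"], cm.ClusterError("staging"))
}
```

The copies are shallow on purpose: only the map headers are duplicated, so the cost is proportional to the number of clusters, not the size of each client.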
3. Minimal write-lock duration in `syncClusters()` (Solution E)

The original `syncClusters()` called `buildClientSet()` (which does TCP connections and TLS handshakes to Kubernetes API servers — seconds of I/O) while actively modifying the shared maps. The rewrite separates this into 4 phases: snapshot the current state under RLock, build all new ClientSets with no lock held, swap the three pointers under Lock, then stop the old clients with no lock held.

The exclusive write-lock is held for exactly 3 pointer assignments — microseconds, regardless of how many clusters exist or how slow the Kubernetes API servers are.
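The four phases can be sketched as follows. The types are simplified, and `buildClientSet` and `Stop` are stand-ins for the real Kite functions:

```go
package main

import (
	"fmt"
	"sync"
)

type ClientSet struct {
	Version string
	stopped bool
}

func (c *ClientSet) Stop() { c.stopped = true }

type ClusterManager struct {
	mu             sync.RWMutex
	clusters       map[string]*ClientSet
	errors         map[string]string
	defaultContext string
}

func buildClientSet(name string) (*ClientSet, error) {
	// In the real code this dials a Kubernetes API server: seconds of I/O.
	return &ClientSet{Version: "v1.29"}, nil
}

func (cm *ClusterManager) syncClusters(names []string, defaultContext string) {
	// Phase 1: RLock — snapshot the old state (microseconds).
	cm.mu.RLock()
	old := make(map[string]*ClientSet, len(cm.clusters))
	for k, v := range cm.clusters {
		old[k] = v
	}
	cm.mu.RUnlock()

	// Phase 2: no lock — slow I/O building the new clients.
	newClusters := make(map[string]*ClientSet, len(names))
	newErrors := make(map[string]string)
	for _, name := range names {
		cs, err := buildClientSet(name)
		if err != nil {
			newErrors[name] = err.Error()
			continue
		}
		newClusters[name] = cs
	}

	// Phase 3: Lock — swap three pointers (microseconds).
	cm.mu.Lock()
	cm.clusters = newClusters
	cm.errors = newErrors
	cm.defaultContext = defaultContext
	cm.mu.Unlock()

	// Phase 4: no lock — stop the clients that were replaced.
	for _, cs := range old {
		cs.Stop()
	}
}

func main() {
	cm := &ClusterManager{clusters: map[string]*ClientSet{"old": {}}}
	cm.syncClusters([]string{"prod"}, "prod")
	fmt.Println(len(cm.clusters), cm.defaultContext) // 1 prod
}
```

Note that readers never observe a half-built state: between Phases 1 and 3 they keep seeing the old maps, and the swap in Phase 3 is atomic with respect to anyone holding `RLock`.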
Performance Impact

Latency — zero regression for readers: `GetClientSet()` (on every request) and `GetClusters()` pay only the cost of an uncontended `RLock`, and the `syncClusters()` write-lock is held too briefly to block them measurably.

Throughput — better under load: readers no longer race with the writer, and `RLock` lets concurrent requests proceed in parallel.

Memory — negligible overhead: one `sync.RWMutex` per manager, plus the `Snapshot()` copies made per call. The `dbClusterMap` change from `map[string]interface{}` (16-byte values) to `map[string]struct{}` (0-byte values) actually saves memory.

Correctness Verification
All direct map accesses are now protected:

- `GetClientSet()` — reads `clusters`, `defaultContext` — under `RLock` (delegates to `getClientSetLocked()`)
- `Snapshot()` — copies `clusters`, `errors`, `defaultContext` — under `RLock`
- `ClusterVersion()` — reads `clusters[name]` — under `RLock`
- `ClusterError()` — reads `errors[name]` — under `RLock`
- `syncClusters()` Phase 1 — copies the current state — under `RLock`
- `syncClusters()` Phase 3 — swaps the maps — under `Lock`
- `NewClusterManager()` — initializes the maps — before any goroutine starts
- `GetClusters()` — iterates clusters + errors — via `Snapshot()` copies
- `GetClusterList()` — reads version + error — via `ClusterVersion()` / `ClusterError()` under `RLock`
- `ClusterMiddleware()` — reads a cluster — via `GetClientSet()` under `RLock`

Zero unprotected accesses remain. Verified with `grep -n "cm\.clusters\|cm\.errors\|cm\.defaultContext"`.

Tests

- `go build ./...` — compiles cleanly
- `go vet ./pkg/cluster/...` — no issues
- `go test ./pkg/cluster/ -v -count=1` — 9/9 tests pass (`shouldUpdateClusters` suite + mockey tests)

What Changed
Added

- `sync.RWMutex` field on `ClusterManager`
- `Snapshot()` method — returns shallow copies for safe iteration
- `ClusterVersion(name)` method — single-key lookup under RLock
- `ClusterError(name)` method — single-key lookup under RLock
- `getClientSetLocked()` — internal method called while RLock is held (replaces the recursive `GetClientSet()` call that would deadlock with a mutex)
- 4-phase `syncClusters()` with separated I/O and locking

Changed

- `GetClientSet()` — acquires RLock, delegates to `getClientSetLocked()`
- `GetClusters()` — uses `Snapshot()` instead of direct map access
- `GetClusterList()` — uses `ClusterVersion()` / `ClusterError()` instead of direct map access
- `syncClusters()` — rewritten with the minimal-lock-duration pattern
- `dbClusterMap` type: `map[string]interface{}` → `map[string]struct{}` (zero-size values)

Removed

- Direct `cm.clusters` / `cm.errors` / `cm.defaultContext` access from `cluster_handler.go`
- The delete-during-iterate in `syncClusters()` (replaced with a full map swap)
- `interface{}` allocations in `dbClusterMap`
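The `getClientSetLocked()` split listed above exists because `sync.RWMutex` is not reentrant: a method that already holds the lock must not call the public locking entry point again. A sketch with simplified stand-in types (only the two method names come from the PR):

```go
package main

import (
	"fmt"
	"sync"
)

type ClusterManager struct {
	mu       sync.RWMutex
	clusters map[string]string
}

// Public entry point: takes the read lock, then delegates.
func (cm *ClusterManager) GetClientSet(name string) string {
	cm.mu.RLock()
	defer cm.mu.RUnlock()
	return cm.getClientSetLocked(name)
}

// Internal: the caller must already hold cm.mu (read or write).
// Calling cm.GetClientSet here instead could deadlock if a writer
// is queued between the outer RLock and the nested one.
func (cm *ClusterManager) getClientSetLocked(name string) string {
	return cm.clusters[name]
}

func main() {
	cm := &ClusterManager{clusters: map[string]string{"prod": "v1.29"}}
	fmt.Println(cm.GetClientSet("prod")) // v1.29
}
```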
Why This Matters

This isn't a performance optimization — it's a correctness fix for a crash bug. Every Kite installation running with more than one concurrent user is vulnerable to random `fatal error: concurrent map read and map write` panics. The fact that it also improves sync performance (the write-lock is held ~1,000,000x shorter) is a bonus.

The fix is minimal (2 files, 104 insertions), well-encapsulated (all map access goes through safe methods), and fully backward-compatible (zero API changes, zero behavior changes from the user's perspective).