fix: resolve check-then-act data races in datastore #1197
shivansh-gohem wants to merge 2 commits
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR updates the ModelServer CRD to support routing to external cloud LLM providers, and refactors router datastore updates to rely on atomic LoadOrStore patterns.
Changes:
- Add `externalProvider` support to `ModelServerSpec` with CRD-level XValidations enforcing mutual exclusivity vs `workloadSelector`.
- Refactor datastore `AddOrUpdateModelServer`/`AddOrUpdatePod` to use `LoadOrStore` instead of `Load` + `Store`.
- Update generated CRD docs/manifests and golden YAML expectations.
Reviewed changes
Copilot reviewed 6 out of 10 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| pkg/model-booster-controller/convert/testdata/expected/pd-model-server.yaml | Updates expected revision annotation value in golden YAML. |
| pkg/model-booster-controller/convert/testdata/expected/model-server.yaml | Updates expected revision annotation value in golden YAML. |
| pkg/kthena-router/datastore/store.go | Uses LoadOrStore to reduce races and avoid redundant Store calls. |
| pkg/apis/networking/v1alpha1/modelserver_types.go | Adds ExternalProvider API + XValidations; makes some fields optional depending on mode. |
| docs/kthena/docs/reference/crd/networking.serving.volcano.sh.md | Adds documentation for ExternalProvider and updates requiredness of fields. |
| charts/kthena/charts/networking/crds/networking.serving.volcano.sh_modelservers.yaml | Updates CRD schema and validations to include externalProvider. |
Files not reviewed (4)
- client-go/applyconfiguration/networking/v1alpha1/externalprovider.go: Generated file
- client-go/applyconfiguration/networking/v1alpha1/modelserverspec.go: Generated file
- client-go/applyconfiguration/utils.go: Generated file
- pkg/apis/networking/v1alpha1/zz_generated.deepcopy.go: Generated file
    workload.serving.volcano.sh/model-name: test-model
    workload.serving.volcano.sh/model-uid: randomUID
    workload.serving.volcano.sh/revision: 587ff8c655
    workload.serving.volcano.sh/revision: 695b798d9

    newObj := newModelServer(ms)
    if len(pods) != 0 {
        newObj.pods = pods
    }

    actual, loaded := s.modelServer.LoadOrStore(name, newObj)
    modelServerObj := actual.(*modelServer)

    if loaded {

    // ExternalProvider specifies an external cloud LLM provider to route requests to.
    // When this is set, WorkloadSelector is ignored and requests are proxied to the external endpoint.
    // +optional
    ExternalProvider *ExternalProvider `json:"externalProvider,omitempty"`

    // Reference to a Kubernetes Secret containing the API credentials/keys for the external provider.
    // The Secret should contain the token/key under the expected key name (e.g., 'api-key' or 'token').
    // +optional
    CredentialsRef *corev1.LocalObjectReference `json:"credentialsRef,omitempty"`
Code Review
This pull request introduces support for external cloud LLM providers in the ModelServer CRD by adding an externalProvider field to ModelServerSpec. It implements validation rules ensuring that exactly one of workloadSelector or externalProvider is specified, and that inferenceEngine is defined when workloadSelector is set. The changes also include updates to generated client-go apply configurations, deepcopy functions, documentation, and a refactoring of datastore operations in pkg/kthena-router/datastore/store.go to use LoadOrStore for safer concurrent map access. There are no review comments, and I have no feedback to provide.
70cf34f to 9823433
kube-gopher left a comment
- The ExternalProvider work touches modelserver_types.go, the generated CRD YAML, deepcopy, the applyconfiguration client code, and the CRD docs, none of which matches the PR/issue description. Please either extend the PR/issue description to cover these changes or remove them so the PR is easier to review.
- Please add a unit test for regression protection and run it under the race detector: `go test -race ./pkg/kthena-router/datastore/...`
hzxuzhonghu left a comment
If this is fixing a data race, please do not change the API.
This commit fixes TOCTOU race conditions in AddOrUpdateModelServer and AddOrUpdatePod by replacing the Load-then-Store pattern with atomic LoadOrStore operations.

Signed-off-by: Shivansh Sahu <sahushivansh142@gmail.com>
9823433 to 9c67336
    go func() {
        defer wg.Done()
        ms := &aiv1alpha1.ModelServer{
            ObjectMeta: metav1.ObjectMeta{
                Namespace: "default",
                Name:      "concurrent-model1",
            },
        }
        pods := sets.New[types.NamespacedName](types.NamespacedName{Namespace: "default", Name: "pod1"})
        err := s.AddOrUpdateModelServer(ms, pods)
        assert.NoError(t, err)
    }()

    go func() {
        defer wg.Done()
        pod := &corev1.Pod{
            ObjectMeta: metav1.ObjectMeta{
                Namespace: "default",
                Name:      "concurrent-pod1",
            },
        }
        ms := &aiv1alpha1.ModelServer{
            ObjectMeta: metav1.ObjectMeta{
                Namespace: "default",
                Name:      "model1",
            },
        }
        err := s.AddOrUpdatePod(pod, []*aiv1alpha1.ModelServer{ms})
        assert.NoError(t, err)
    }()
    newObj := newModelServer(ms)
    if len(pods) != 0 {
        modelServerObj.pods = pods
        newObj.pods = pods
    }
    } else {
    modelServerObj = value.(*modelServer)
    actual, loaded = s.modelServer.LoadOrStore(name, newObj)
@hzxuzhonghu @kube-gopher Thanks for the review! I've just force-pushed an update to clean up the PR based on your feedback:

Please let me know if everything looks good now!
[APPROVALNOTIFIER] This PR is NOT APPROVED. It still needs approval from an approver for each of the affected files.
hzxuzhonghu left a comment
There is no real race. AddOrUpdateModelServer is currently called only from syncModelServerHandler, which runs serially.
Hi @hzxuzhonghu, @kube-gopher, I've pushed an update addressing the latest feedback:

@hzxuzhonghu regarding your point ('There is no real race. It can call AddOrUpdateModelServer only in syncModelServerHandler now, which is run in serial'): that's a very fair point! The primary motivation here is defense-in-depth against future refactoring (or against concurrent metrics scraping that reads while an update is in progress). The

Does everything look good to merge now?
82b69ad to 04214eb
@shivansh-gohem #1196 Let me ask again: is this actually observed, or is it derived?
Hi @kube-gopher, to answer your question: this is entirely derived from code inspection, not actually observed in a live cluster crash. As @hzxuzhonghu pointed out, since I noticed the check-then-act pattern on

But I totally understand if we want to avoid merging theoretical fixes! If you folks feel this isn't strictly necessary since the current execution path is safe, I'm completely fine with closing this PR. Let me know what you prefer!
Fixes #1196
What type of PR is this?
/kind bug
What this PR does / why we need it:
This PR fixes the TOCTOU check-then-act data races in `AddOrUpdateModelServer` and `AddOrUpdatePod` within the router datastore (`store.go`).
It replaces the non-atomic `Load()` followed by `Store()` pattern with atomic `LoadOrStore()` operations, preventing concurrent map insertions from overwriting each other.
Which issue(s) this PR fixes:
Fixes #1196
Special notes for reviewer:
This applies the same fix pattern that was applied to `handleErrorPod` in PR #1157, completing the concurrency fix that was partially addressed in PR #781.
Does this PR introduce a user-facing change?: