Skip to content

fix: resolve check-then-act data races in datastore#1197

Open
shivansh-gohem wants to merge 2 commits into
volcano-sh:mainfrom
shivansh-gohem:fix/datastore-race
Open

fix: resolve check-then-act data races in datastore#1197
shivansh-gohem wants to merge 2 commits into
volcano-sh:mainfrom
shivansh-gohem:fix/datastore-race

Conversation

@shivansh-gohem

@shivansh-gohem shivansh-gohem commented Jun 14, 2026

Copy link
Copy Markdown

Fixes #1196

What type of PR is this?

/kind bug

What this PR does / why we need it:

This PR fixes the TOCTOU check-then-act data races in AddOrUpdateModelServer and AddOrUpdatePod within the router datastore (store.go).
It replaces the non-atomic Load() followed by Store() pattern with atomic LoadOrStore() operations, preventing concurrent map insertions from overwriting each other.

Which issue(s) this PR fixes:

Fixes #1196

Special notes for reviewer:

This applies the same fix pattern that was applied to handleErrorPod in PR #1157, completing the concurrency fix that was partially addressed in PR #781.

Does this PR introduce a user-facing change?:

NONE

Copilot AI review requested due to automatic review settings June 14, 2026 20:27

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR updates the ModelServer CRD to support routing to external cloud LLM providers, and refactors router datastore updates to rely on atomic LoadOrStore patterns.

Changes:

  • Add externalProvider support to ModelServerSpec with CRD-level XValidations enforcing mutual exclusivity vs workloadSelector.
  • Refactor datastore AddOrUpdateModelServer / AddOrUpdatePod to use LoadOrStore instead of Load + Store.
  • Update generated CRD docs/manifests and golden YAML expectations.

Reviewed changes

Copilot reviewed 6 out of 10 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
pkg/model-booster-controller/convert/testdata/expected/pd-model-server.yaml Updates expected revision annotation value in golden YAML.
pkg/model-booster-controller/convert/testdata/expected/model-server.yaml Updates expected revision annotation value in golden YAML.
pkg/kthena-router/datastore/store.go Uses LoadOrStore to reduce races and avoid redundant Store calls.
pkg/apis/networking/v1alpha1/modelserver_types.go Adds ExternalProvider API + XValidations; makes some fields optional depending on mode.
docs/kthena/docs/reference/crd/networking.serving.volcano.sh.md Adds documentation for ExternalProvider and updates requiredness of fields.
charts/kthena/charts/networking/crds/networking.serving.volcano.sh_modelservers.yaml Updates CRD schema and validations to include externalProvider.
Files not reviewed (4)
  • client-go/applyconfiguration/networking/v1alpha1/externalprovider.go: Generated file
  • client-go/applyconfiguration/networking/v1alpha1/modelserverspec.go: Generated file
  • client-go/applyconfiguration/utils.go: Generated file
  • pkg/apis/networking/v1alpha1/zz_generated.deepcopy.go: Generated file

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

workload.serving.volcano.sh/model-name: test-model
workload.serving.volcano.sh/model-uid: randomUID
workload.serving.volcano.sh/revision: 587ff8c655
workload.serving.volcano.sh/revision: 695b798d9
Comment thread pkg/kthena-router/datastore/store.go Outdated
Comment on lines +666 to +674
newObj := newModelServer(ms)
if len(pods) != 0 {
newObj.pods = pods
}

actual, loaded := s.modelServer.LoadOrStore(name, newObj)
modelServerObj := actual.(*modelServer)

if loaded {
Comment on lines +55 to +58
// ExternalProvider specifies an external cloud LLM provider to route requests to.
// When this is set, WorkloadSelector is ignored and requests are proxied to the external endpoint.
// +optional
ExternalProvider *ExternalProvider `json:"externalProvider,omitempty"`
Comment on lines +98 to +101
// Reference to a Kubernetes Secret containing the API credentials/keys for the external provider.
// The Secret should contain the token/key under the expected key name (e.g., 'api-key' or 'token').
// +optional
CredentialsRef *corev1.LocalObjectReference `json:"credentialsRef,omitempty"`

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces support for external cloud LLM providers in the ModelServer CRD by adding an externalProvider field to ModelServerSpec. It implements validation rules ensuring that exactly one of workloadSelector or externalProvider is specified, and that inferenceEngine is defined when workloadSelector is set. The changes also include updates to generated client-go apply configurations, deepcopy functions, documentation, and a refactoring of datastore operations in pkg/kthena-router/datastore/store.go to use LoadOrStore for safer concurrent map access. There are no review comments, and I have no feedback to provide.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

@kube-gopher kube-gopher left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • The ExternalProvider work touches modelserver_types.go, the generated CRD YAML, deepcopy, applyconfiguration client code, and CRD docs; these items do not match the PR/issue description. It is recommended to enhance the PR/issue description or remove them for better review.
  • It is recommended to write a unit test to provide regression protection. go test -race ./pkg/kthena-router/datastore/...

@hzxuzhonghu hzxuzhonghu left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if it is fixing a data race , please donot change api

This commit fixes TOCTOU race conditions in AddOrUpdateModelServer and AddOrUpdatePod by replacing the Load-then-Store pattern with atomic LoadOrStore operations.

Signed-off-by: Shivansh Sahu <sahushivansh142@gmail.com>
Copilot AI review requested due to automatic review settings June 15, 2026 17:13

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Comment on lines +2255 to +2266
go func() {
defer wg.Done()
ms := &aiv1alpha1.ModelServer{
ObjectMeta: metav1.ObjectMeta{
Namespace: "default",
Name: "concurrent-model1",
},
}
pods := sets.New[types.NamespacedName](types.NamespacedName{Namespace: "default", Name: "pod1"})
err := s.AddOrUpdateModelServer(ms, pods)
assert.NoError(t, err)
}()
Comment on lines +2288 to +2304
go func() {
defer wg.Done()
pod := &corev1.Pod{
ObjectMeta: metav1.ObjectMeta{
Namespace: "default",
Name: "concurrent-pod1",
},
}
ms := &aiv1alpha1.ModelServer{
ObjectMeta: metav1.ObjectMeta{
Namespace: "default",
Name: "model1",
},
}
err := s.AddOrUpdatePod(pod, []*aiv1alpha1.ModelServer{ms})
assert.NoError(t, err)
}()
Comment on lines +668 to +672
newObj := newModelServer(ms)
if len(pods) != 0 {
modelServerObj.pods = pods
newObj.pods = pods
}
} else {
modelServerObj = value.(*modelServer)
actual, loaded = s.modelServer.LoadOrStore(name, newObj)
@shivansh-gohem

Copy link
Copy Markdown
Author

@hzxuzhonghu @kube-gopher Thanks for the review!

I've just pushed a force update to clean up the PR based on your feedback:

  1. API Changes Removed: I completely stripped out the accidental ExternalProvider modifications. The branch is now strictly isolated to the datastore TOCTOU fix against main.
  2. Regression Protection: Added concurrency unit tests (TestStoreAddOrUpdateModelServer_ConcurrentAccess and TestStoreAddOrUpdatePod_ConcurrentAccess) in store_test.go to protect against future regressions as recommended.
  3. Hot-Path Optimization: Addressed the AI review comment by doing a fast-path Load before allocating the new objects, avoiding unnecessary allocations while keeping the LoadOrStore atomic insertion on the slow path.

Please let me know if everything looks good now!

@volcano-sh-bot

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign hzxuzhonghu for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kube-gopher kube-gopher left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A small question.
Image

This is the result after running make lint, caused by blank lines. It is recommended to run make fmt first, then make lint, to ensure no errors are displayed.

@hzxuzhonghu hzxuzhonghu left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no real race. It can call AddOrUpdateModelServer only in syncModelServerHandler now, which is run in searial

@shivansh-gohem

Copy link
Copy Markdown
Author

Hi @hzxuzhonghu, @kube-gopher,

I've pushed an update addressing the latest feedback:

  1. Linter & Formatting: Ran make fmt && make lint and verified no blank line violations.
  2. Datastore Defensive Copying: Switched to pods.Copy() before storing it into the ModelServer datastore to prevent aliasing caller-owned sets (as pointed out by the AI review).
  3. Test Races: Updated store_test.go to collect concurrency errors into a buffered channel errCh <- s.AddOrUpdate... instead of invoking assert.NoError(t, err) directly inside the goroutines, fixing the test races.

@hzxuzhonghu regarding your point: 'There is no real race. It can call AddOrUpdateModelServer only in syncModelServerHandler now, which is run in serial' — that's a very fair point! The primary motivation here is defense-in-depth against future refactoring (or concurrent metrics-scraping reading while updating). The sync.Map Load() followed by Store() creates a TOCTOU surface that is easily avoided by LoadOrStore. Keeping the datastore operations intrinsically thread-safe prevents regression if controllers/handlers ever become multi-threaded.

Does everything look good to merge now?

Signed-off-by: Shivansh Sahu <sahushivansh142@gmail.com>
@kube-gopher

Copy link
Copy Markdown
Contributor

@shivansh-gohem #1196 Let me ask again: is this actually observed, or is it derived?

@shivansh-gohem

Copy link
Copy Markdown
Author

Hi @kube-gopher, to answer your question: this is entirely derived from code inspection, not actually observed in a live cluster crash. As @hzxuzhonghu pointed out, since syncModelServerHandler currently runs serially, the race isn't practically exploitable at runtime right now.

I noticed the check-then-act pattern on sync.Map while reading through the datastore code, and I thought replacing it with LoadOrStore would be a good defensive measure to future-proof it against potential multi-threaded execution (like concurrent metrics scrapers) down the line.

But I totally understand if we want to avoid merging theoretical fixes! If you folks feel this isn't strictly necessary since the current execution path is safe, I'm completely fine with closing this PR. Let me know what you prefer!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Data race in AddOrUpdateModelServer and AddOrUpdatePod due to check-then-act sync.Map pattern

5 participants