[RayJob] Sidecar Mode #3971
Merged: kevin85421 merged 55 commits into ray-project:master from Future-Outlier:rayjob-sidecar-mode-v2 on Sep 1, 2025
Changes from 33 commits
3b5b865 [RayJob] Sidecar Mode to avoid cross pod communication (Future-Outlier)
4530cb5 make generate (Future-Outlier)
ef92c89 Remove logs (Future-Outlier)
5016896 # shutdownAfterJobFinishes: false (Future-Outlier)
a3ce5fc Add a new field submitterContainer in rayjob_types.go (Future-Outlier)
9600d8e make sync (Future-Outlier)
dad2e43 update (Future-Outlier)
f4b8908 update (Future-Outlier)
9803619 add validation logic when submission mode is sidecar and SubmitterPod… (Future-Outlier)
4bf6c68 Remove sidecar error handling logic, need to re-use k8sjob mode's cod… (Future-Outlier)
8f5c8e8 add function checkSidecarContainerAndUpdateStatusIfNeeded to handle s… (Future-Outlier)
b94f3ad polish checkSidecarContainerAndUpdateStatusIfNeeded for rayJob.Status… (Future-Outlier)
89ca572 refactor GetDefaultSubmitterTemplate and GetDefaultSubmitterContainer… (Future-Outlier)
69d0715 refactor GetSidecarJobCommand/GetK8sJobCommand (Future-Outlier)
204515b mode -> submissionMode (Future-Outlier)
4d21705 remove GetSidecarJobCommand (Future-Outlier)
9edd606 use configureSubmitterContainer to aboid repeated code (Future-Outlier)
d91789a Remove sidecarmode logging when JobDeploymentStatusInitializing (Future-Outlier)
0c6effa nit (Future-Outlier)
7d9a852 nit (Future-Outlier)
8f378d0 nit (Future-Outlier)
125f0ae sidecar name string matching (Future-Outlier)
bdbaed5 handle initialize status failed scenario (Future-Outlier)
b3c1d20 add comments (Future-Outlier)
f68ff65 Update Kai-Hsun's advice (Future-Outlier)
14ae58b update api doc (Future-Outlier)
a936880 fix validation and container status problem (Future-Outlier)
ddda86f Merge branch 'master' into rayjob-sidecar-mode-v2 (Future-Outlier)
e35a2a6 Use checkSubmitterAndUpdateStatusIfNeeded (Future-Outlier)
e8fe3c2 add comments in sidecar mode yaml! (Future-Outlier)
cd01094 Merge branch 'master' into rayjob-sidecar-mode-v2 (Future-Outlier)
1383ec5 Update Andrew's advice (Future-Outlier)
f4ae235 update andrew's api doc advice (Future-Outlier)
a1e71d4 update andrew's sidecar mode command advice (Future-Outlier)
6e9185c Merge branch 'ray-project:master' into rayjob-sidecar-mode-v2 (Future-Outlier)
94b2ddb update kai hsun's SOLID advices (Future-Outlier)
563a09d add submitterConfig (Future-Outlier)
35f0d0d RayJob checkSidecarContainerAndUpdateStatusIfNeeded (Future-Outlier)
aefc08e Update most minor fix (Future-Outlier)
ed6e025 use getDashboardPortFromRayJobSpec (Future-Outlier)
107066e use rayDashboardGCSHealthCommand in BuildJobSubmitCommand (Future-Outlier)
0fdc468 update (Future-Outlier)
7e7e931 Apply Kai-Hsun and Jun Hao's advices (Future-Outlier)
bef4401 update (Future-Outlier)
f0de681 port = utils.FindContainerPort (Future-Outlier)
35f1383 integration test, most are copied from http mode (Future-Outlier)
7c586e2 Add other tests (Future-Outlier)
4a6e714 remove unnecessary test (Future-Outlier)
a793386 Simpler and clearer yaml description. (Future-Outlier)
538a439 add tests for case whether the sidecar container is injected in the h… (Future-Outlier)
a0012af merge (Future-Outlier)
948c362 Verify sidecar container injection (Future-Outlier)
e3c9ae2 update (Future-Outlier)
bf81123 update (Future-Outlier)
d2df229 We don't need to use Eventually (Future-Outlier)
@@ -0,0 +1,140 @@
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sidecar-mode-shutdown
spec:
  # submissionMode specifies how RayJob submits the Ray job to the RayCluster.
  # The default value is "K8sJobMode", meaning RayJob will submit the Ray job via a submitter Kubernetes Job.
  # The alternative value is "HTTPMode", indicating that KubeRay will submit the Ray job by sending an HTTP request to the RayCluster.
  # In SidecarMode:
  # - In "SidecarMode", the KubeRay operator injects a container into the Ray head Pod that acts as the job submitter to submit the Ray job.
  # - If SidecarMode is enabled, retries are managed at the Ray job level using a backoff limit.
  # - A retry is triggered if the Ray job fails or if the submitter container fails to submit the Ray job.
  submissionMode: "SidecarMode"

  entrypoint: python /home/ray/samples/sample_code.py
  # shutdownAfterJobFinishes specifies whether the RayCluster should be deleted after the RayJob finishes. Default is false.
  # shutdownAfterJobFinishes: false

  # ttlSecondsAfterFinished specifies the number of seconds after which the RayCluster will be deleted after the RayJob finishes.
  # ttlSecondsAfterFinished: 10

  # activeDeadlineSeconds is the duration in seconds that the RayJob may be active before
  # KubeRay actively tries to terminate the RayJob; value must be positive integer.
  # activeDeadlineSeconds: 120

  # RuntimeEnvYAML represents the runtime environment configuration provided as a multi-line YAML string.
  # See https://docs.ray.io/en/latest/ray-core/handling-dependencies.html for details.
  # (New in KubeRay version 1.0.)
  runtimeEnvYAML: |
    pip:
      - requests==2.26.0
      - pendulum==2.1.2
    env_vars:
      counter_name: "test_counter"

  # Suspend specifies whether the RayJob controller should create a RayCluster instance.
  # If a job is applied with the suspend field set to true, the RayCluster will not be created and we will wait for the transition to false.
  # If the RayCluster is already created, it will be deleted. In the case of transition to false, a new RayCluster will be created.
  # suspend: false

  # rayClusterSpec specifies the RayCluster instance to be created by the RayJob controller.
  rayClusterSpec:
    rayVersion: '2.46.0' # should match the Ray version in the image of the containers
    # Ray head pod template
    headGroupSpec:
      # The `rayStartParams` are used to configure the `ray start` command.
      # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
      # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
      rayStartParams: {}
      #pod template
      template:
        spec:
          containers:
          - name: ray-head
            image: rayproject/ray:2.46.0
            ports:
            - containerPort: 6379
              name: gcs-server
            - containerPort: 8265 # Ray dashboard
              name: dashboard
            - containerPort: 10001
              name: client
            resources:
              limits:
                cpu: "1"
              requests:
                cpu: "200m"
            volumeMounts:
            - mountPath: /home/ray/samples
              name: code-sample
          volumes:
          # You set volumes at the Pod level, then mount them into containers inside that Pod
          - name: code-sample
            configMap:
              # Provide the name of the ConfigMap you want to mount.
              name: ray-job-code-sample
              # An array of keys from the ConfigMap to create as files
              items:
              - key: sample_code.py
                path: sample_code.py
    workerGroupSpecs:
    # the pod replicas in this group typed worker
    - replicas: 1
      minReplicas: 1
      maxReplicas: 5
      # logical group name, for this called small-group, also can be functional
      groupName: small-group
      # The `rayStartParams` are used to configure the `ray start` command.
      # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
      # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
      rayStartParams: {}
      #pod template
      template:
        spec:
          containers:
          - name: ray-worker # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name', or '123-abc'
            image: rayproject/ray:2.46.0
            resources:
              limits:
                cpu: "1"
              requests:
                cpu: "200m"

######################Ray code sample#################################
# this sample is from https://docs.ray.io/en/latest/cluster/job-submission.html#quick-start-example
# it is mounted into the container and executed to show the Ray job at work
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ray-job-code-sample
data:
  sample_code.py: |
    import ray
    import os
    import requests

    ray.init()

    @ray.remote
    class Counter:
        def __init__(self):
            # Used to verify runtimeEnv
            self.name = os.getenv("counter_name")
            assert self.name == "test_counter"
            self.counter = 0

        def inc(self):
            self.counter += 1

        def get_counter(self):
            return "{} got {}".format(self.name, self.counter)

    counter = Counter.remote()

    for _ in range(5):
        ray.get(counter.inc.remote())
        print(ray.get(counter.get_counter.remote()))

    # Verify that the correct runtime env was used for the job.
    assert requests.__version__ == "2.26.0"
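
The comments on `submissionMode` in the sample above say that, in SidecarMode, retries are handled at the Ray job level with a backoff limit, and that a retry is triggered either by a failed Ray job or by a submitter container that fails to submit. A minimal Go sketch of that decision rule follows. It is not the operator's code: `attemptOutcome`, `shouldRetry`, and the `retriesUsed < backoffLimit` comparison are assumptions made for illustration only.

```go
package main

import "fmt"

// attemptOutcome is a hypothetical summary of one submission attempt.
// It is not a KubeRay type.
type attemptOutcome struct {
	SubmitFailed bool // the submitter container failed to submit the Ray job
	JobFailed    bool // the Ray job was submitted but ended in failure
}

// shouldRetry reports whether another attempt should be made.
// Assumption: backoffLimit is the maximum number of retries, and
// retriesUsed counts the retries already consumed.
func shouldRetry(o attemptOutcome, retriesUsed, backoffLimit int) bool {
	failed := o.SubmitFailed || o.JobFailed
	return failed && retriesUsed < backoffLimit
}

func main() {
	const backoffLimit = 2

	// Either kind of failure triggers a retry while budget remains.
	fmt.Println(shouldRetry(attemptOutcome{JobFailed: true}, 0, backoffLimit))    // true
	fmt.Println(shouldRetry(attemptOutcome{SubmitFailed: true}, 1, backoffLimit)) // true

	// No retry once the budget is used up, or when nothing failed.
	fmt.Println(shouldRetry(attemptOutcome{JobFailed: true}, 2, backoffLimit)) // false
	fmt.Println(shouldRetry(attemptOutcome{}, 0, backoffLimit))                // false
}
```

The sketch treats both failure kinds identically, which is all the sample's comments state; how the operator counts attempts or resets state between retries is not shown in this diff.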
@@ -71,7 +71,7 @@ func TestGetMetadataJson(t *testing.T) {
 	assert.JSONEq(t, expected, metadataJson)
 }

-func TestGetK8sJobCommand(t *testing.T) {
+func TestBuildJobSubmitCommandWithK8sJobMode(t *testing.T) {
 	expected := []string{
 		"if",
 		"!", "ray", "job", "status", "--address", "http://127.0.0.1:8265", "testJobId", ">/dev/null", "2>&1",
@@ -88,12 +88,12 @@ func TestGetK8sJobCommand(t *testing.T) {
 		";", "fi", ";",
 		"ray", "job", "logs", "--address", "http://127.0.0.1:8265", "--follow", "testJobId",
 	}
-	command, err := GetK8sJobCommand(testRayJob)
+	command, err := BuildJobSubmitCommand(testRayJob, rayv1.K8sJobMode)

Review comment: Could we have a command test for sidecar mode?
Reply: After confirming with Kai-Hsun, I will add tests once the non-test code is finalized. Thank you!

 	require.NoError(t, err)
 	assert.Equal(t, expected, command)
 }

-func TestGetK8sJobCommandWithYAML(t *testing.T) {
+func TestBuildJobSubmitCommandWithK8sJobModeAndYAML(t *testing.T) {
 	rayJobWithYAML := &rayv1.RayJob{
 		Spec: rayv1.RayJobSpec{
 			RuntimeEnvYAML: `
@@ -126,7 +126,7 @@ pip: ["python-multipart==0.0.6"]
 		";", "fi", ";",
 		"ray", "job", "logs", "--address", "http://127.0.0.1:8265", "--follow", "testJobId",
 	}
-	command, err := BuildJobSubmitCommand(rayJobWithYAML, rayv1.K8sJobMode)
+	command, err := BuildJobSubmitCommand(rayJobWithYAML, rayv1.K8sJobMode)
 	require.NoError(t, err)

 	// Ensure the slices are the same length.