Releases: ray-project/kuberay
v1.3.0
Highlights
RayCluster Conditions API
The RayCluster conditions API is graduating to Beta status in v1.3. The new API provides more details about the RayCluster's observable state that were not possible to express in the old API. The following conditions are supported in v1.3: `AllPodRunningAndReadyFirstTime`, `RayClusterPodsProvisioning`, `HeadPodNotFound`, and `HeadPodRunningAndReady`. We will be adding more conditions in future releases.
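As a sketch of how these conditions surface in the status (field layout follows the standard Kubernetes conditions convention; the reason and message values shown here are illustrative), a RayCluster status might look like:

```yaml
# Hypothetical status snippet for a RayCluster with the conditions API enabled.
status:
  conditions:
    - type: HeadPodRunningAndReady
      status: "True"
      lastTransitionTime: "2025-01-30T12:00:00Z"
      reason: HeadPodRunningAndReady       # illustrative
      message: Head Pod is running and ready
    - type: AllPodRunningAndReadyFirstTime
      status: "True"
      lastTransitionTime: "2025-01-30T12:01:00Z"
```

You can inspect these with `kubectl get raycluster <name> -o yaml` or filter on them in automation instead of polling Pod state directly.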
Ray Kubectl Plugin
The Ray Kubectl Plugin is graduating to Beta status. The following commands are supported with KubeRay v1.3:
- `kubectl ray logs <cluster-name>`: download Ray logs to a local directory
- `kubectl ray session <cluster-name>`: initiate a port-forwarding session to the Ray head
- `kubectl ray create <cluster>`: create a Ray cluster
- `kubectl ray job submit`: create a RayJob and submit a job using a local working directory
See the Ray Kubectl Plugin docs for more details.
RayJob Stability Improvements
Several improvements have been made to enhance the stability of long-running RayJobs. In particular, when using `submissionMode=K8sJobMode`, job submissions no longer fail due to duplicate submission IDs. Now, if a submission ID already exists, KubeRay retrieves the logs of the existing job instead.
RayService API Improvements
RayService strives to deliver zero-downtime serving. When changes in the RayService spec cannot be applied in place, it attempts to migrate traffic to a new RayCluster in the background. However, users might not always have sufficient resources for a new RayCluster. Beginning with KubeRay 1.3, users can customize this behavior using the new UpgradeStrategy option within the RayServiceSpec.
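As an illustration of the new option (a minimal sketch; only the fields relevant to the upgrade strategy are shown, and the rest of the spec is elided), a RayService that opts out of creating a new RayCluster on spec changes might look like:

```yaml
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: example-rayservice
spec:
  # NewCluster: perform a zero-downtime upgrade by preparing a new RayCluster.
  # None: never create a new RayCluster for in-place-incompatible spec changes,
  # useful when there are not enough resources for two clusters at once.
  upgradeStrategy:
    type: None
  # serveConfigV2, rayClusterConfig, etc. omitted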
Previously, the `serviceStatus` field in RayService was inconsistent and did not accurately represent the actual state. Starting with KubeRay v1.3.0, we have introduced two conditions, `Ready` and `UpgradeInProgress`, to RayService. Following the approach taken with RayCluster, we have decided to deprecate `serviceStatus`. In the future, `serviceStatus` will be removed, and conditions will serve as the definitive source of truth. For now, `serviceStatus` remains available but is limited to two possible values: "Running" or an empty string.
GCS Fault Tolerance API Improvements
The new GcsFaultToleranceOptions field in the RayCluster now provides a streamlined way for users to enable GCS Fault Tolerance on a RayCluster. This eliminates the previous need to distribute related settings across Pod annotations, container environment variables, and the RayStartParams. Furthermore, users can now specify their Redis username in the newly introduced field (requires Ray 2.4.1 or later). To see the impact of this change on a YAML configuration, please refer to the example manifest.
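A sketch of the consolidated configuration (field names under `gcsFaultToleranceOptions` follow the example manifest referenced above; the Secret name and Redis address are placeholders):

```yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-gcsft
spec:
  # Replaces the old mix of Pod annotations, env vars, and rayStartParams.
  gcsFaultToleranceOptions:
    redisAddress: "redis:6379"
    redisUsername:                # new in v1.3; requires Ray 2.4.1 or later
      value: default
    redisPassword:
      valueFrom:
        secretKeyRef:
          name: redis-password-secret
          key: password
  # headGroupSpec, workerGroupSpecs, etc. omitted
```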
Breaking Changes
RayService API
Starting from KubeRay v1.3.0, we have removed all possible values of RayService.Status.ServiceStatus except Running, so the only valid values for ServiceStatus are Running and empty. If ServiceStatus is Running, it means that RayService is ready to serve requests. In other words, ServiceStatus is equivalent to the Ready condition. It is strongly recommended to use the Ready condition instead of ServiceStatus going forward.
Features
- RayCluster Conditions API is graduating to Beta status. The feature gate RayClusterStatusConditions is now enabled by default.
- New events were added for RayCluster, RayJob and RayService for improved observability
- Various improvements to Ray autoscaler v2
- Introduce a new API in RayService: `spec.upgradeStrategy`. The upgrade strategy type can be set to `NewCluster` or `None` to modify the behavior of zero-downtime upgrades for RayService.
- Add RayCluster controller expectations to mitigate stale informer caches
- RayJob now supports submission mode InteractiveMode. Use this submission mode when you want to submit jobs from a local working directory on your laptop.
- RayJob now supports the `spec.deletionPolicy` API. This feature requires the `RayJobDeletionPolicy` feature gate to be enabled. Initial deletion policies are `DeleteCluster`, `DeleteWorkers`, `DeleteSelf`, and `DeleteNone`.
- KubeRay now detects TPU and Neuron Core resources and specifies them as custom resources in the ray start parameters
- Introduce `RayClusterSuspending` and `RayClusterSuspended` conditions
- Container CPU requests are now used for the Ray `--num-cpus` parameter if CPU limits are not specified
- Various example manifests for using TPU v6 with KubeRay
- Add ManagedBy field in RayJob and RayCluster. This is required for Multi-Kueue support.
- Add support for the `kubectl ray create cluster` command
- Add support for the `kubectl ray create workergroup` command
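To make the deletion policy feature above concrete, here is a hedged RayJob sketch (entrypoint and cluster spec are placeholders) that keeps the RayCluster's head but removes workers once the job finishes; it assumes the `RayJobDeletionPolicy` feature gate is enabled on the operator:

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-deletion-example
spec:
  # One of: DeleteCluster | DeleteWorkers | DeleteSelf | DeleteNone
  deletionPolicy: DeleteWorkers
  entrypoint: python my_script.py   # placeholder
  # rayClusterSpec, shutdownAfterJobFinishes, etc. omitted
```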
Guides & Tutorials
- Use Ray Kubectl Plugin
- New sample manifests using TPU v6e chips
- Tuning Redis for a Persistent Fault Tolerant GCS
- Reducing image pull latency on Kubernetes
- Configure Ray clusters with authentication and access control using KubeRay
- RayService + vLLM examples updated to use vLLM v0.6.2
- All YAML samples in the KubeRay repo have been updated to use Ray v2.41.0
Changelog
- [Fix][RayCluster] fix missing pod name in CreatedWorkerPod and Failed… (#3057, @rueian)
- [Refactor] Use constants for image tag, image repo, and versions in golang to avoid hard-coded strings (#2978, @400Ping)
- Update TPU Ray CR manifests to use Ray 2.41.0 (#2965, @ryanaoleary)
- Update samples to use Ray 2.41.0 images (#2964, @andrewsykim)
- [Test] Use GcsFaultToleranceOptions in test and backward compatibility (#2972, @fscnick)
- [chore][docs] enable Markdownlint rule MD004 (#2973, @davidxia)
- [release] Update Volcano YAML files to Ray 2.41 (#2976, @win5923)
- [release] Update Yunikorn YAML file to Ray 2.41 (#2969, @kenchung285)
- [CI] Change Pre-commit-shellcheck-to-shellcheck-py (#2974, @owenowenisme)
- [chore][docs] enable Markdownlint rule MD010 (#2975, @davidxia)
- [Release] Upgrade ray-job.batch-inference.yaml image to 2.41 (#2971, @MortalHappiness)
- [RayService] adapter vllm 0.6.1.post2 (#2823, @pxp531)
- [release][9/N] Update text summarizer RayService to Ray 2.41 (#2961, @kevin85421)
- [RayService] Deflaky RayService envtest (#2962, @kevin85421)
- [RayJob] Deflaky RayJob e2e tests (#2963, @kevin85421)
- [fix][kubectl-plugin] set worker group CPU limit (#2958, @davidxia)
- [docs][kubectl-plugin] fix incorrect example commands (#2951, @davidxia)
- [release][8/N] Upgrade Stable Diffusion RayService to Ray 2.41 (#2960, @kevin85421)
- [kubectl-plugin] Fix panic when GPU resource is not set (#2954, @win5923)
- [docs][kubectl-plugin] improve help messages (#2952, @davidxia)
- [CI] Enable `testifylint` `len` rule (#2945, @LeoLiao123)
- [release][7/N] Update RayService YAMLs (#2956, @kevin85421)
- [Fix][RayJob] Invalid quote for RayJob submitter (#2949, @MortalHappiness)
- [chore][kubectl-plugin] use consistent capitalization (#2950, @davidxia)
- [chore] add Markdown linting pre-commit hook (#2953, @davidxia)
- [chore][kubectl-plugin] use better test assertions (#2955, @davidxia)
- [CI] Add shellcheck and fix error of it (#2933, @owenowenisme)
- [docs][kubectl-plugin] add dev docs (#2912, @davidxia)
- [release][6/N] Remove unnecessary YAMLs (#2946, @kevin85421)
- [release][5/N] Update some RayJob YAMLs from Ray 2.9 to Ray 2.41 (#2941, @kevin85421)
- [release][4/N] Update Ray images / versions in kubectl plugin (#2938, @kevin85421)
- [release][3/N] Update RayService e2e tests YAML files from Ray 2.9 to Ray 2.41 ([#2937](https://github.com...
v1.2.2
Highlights
- (alpha) Ray kubectl plugin: `get`, `session`, `log`, `job submit`
- (alpha) Kubernetes events: create Kubernetes events for important information about the interactions between KubeRay and the Kubernetes API server
- (alpha) Apache YuniKorn integration
Changelog
- [release] Update Ray image to 2.34.0 (#2303, @kevin85421)
- Revert "[release] Update Ray image to 2.34.0 (#2303)" (#2413, @kevin85421)
- Revert "[release] Update Ray image to 2.34.0 (#2303)" (#2413) (#2415, @kevin85421)
- [Build][kubectl-plugin] Add release script for kubectl plugin (#2407, @MortalHappiness)
- [Feat][kubectl-plugin] Add Long, Example, shell completion for kubectl ray log (#2405, @MortalHappiness)
- Support gang scheduling with Apache YuniKorn (#2396, @yangwwei)
- [Feat][Kubectl-Plugin]Implement kubectl ray job submit (#2394, @chiayi)
- Add 1K, 5K and 10K RayCluster/RayJob scalability test results (#2218, @andrewsykim)
- [Feat][kubectl-plugin] Add dynamic shell completion for kubectl ray session (#2390, @MortalHappiness)
- [Feature][RayJob]: Generate submitter and RayCluster creation/deletion events (#2389, @rueian)
- [RayJob] Add Failure Feedback (log and event) for Failed k8s Creation Task (#2306, @tinaxfwu)
- [Feat][Kubectl-Plugin] Implement kubectl session for RayJob and RayService (#2379, @MortalHappiness)
- [Feat][kubectl-plugin] Add instructions for static shell completion (#2384, @MortalHappiness)
- [Feat][RayJob] UserMode SubmissionMode (#2364, @MortalHappiness)
- [Feature] Add Kubernetes manifest validation in pre-commit. (#2380, @LeoLiao123)
- [Feature][RayCluster]: Generate GCS FT Redis Cleanup Job creation events (#2382, @rueian)
- [Chore][Minor] Add .gitignore to kubectl-plugin (#2383, @MortalHappiness)
- Remove default option for batch scheduler name (#2371, @yangwwei)
- RayCluster Headless Worker Service Should PublishNotReadyAddresses (#2375, @ryanaoleary)
- [CI][GitHub-Actions] Upgrade actions/upload-artifact to v4 (#2373, @MortalHappiness)
- add support for pipeline-parallel-size in vLLM example (#2370, @andrewsykim)
- Add kubectl ray cluster log command (#2296, @chiayi)
- [Chore] Fix lint errors caused by casting int to int32 (#2368, @kevin85421)
- [Feature][kubectl-plugin] Implement kubectl ray session (#2298, @MortalHappiness)
- Use longer exec probe timeouts for Head pods (#2353, @andrewsykim)
- Remove redundant log line that is failing golangci-lint (#2366, @andrewsykim)
- [Chore][Linter] Upgrade golangci-lint to 1.60.3 (#2362, @MortalHappiness)
- Add batch-scheduler option, deprecate enable-batch-scheduler option (#2300, @yangwwei)
- [Feature] Display reconcile failures as events (ServiceAccount) (#2290, @cchen777)
- [Feature][RayCluster]: Deprecate the RayCluster .Status.State field (#2288, @rueian)
- Don't print redundant time unit in the log message (#2335, @tczekajlo)
- [Refactor][sample-yaml-test] Create sampleyaml package and run tests in CI (#2312, @MortalHappiness)
- [Refactor] Fix CreatedWorkerPod for worker Pod deletion event and refactor logs (#2346, @kevin85421)
- raycluster_controller: generate events for failed pod creation (#2286, @MadhavJivrajani)
- [Refactor][kubectl-plugin] Rename filenames and variables based on kubectl repo (#2295, @MortalHappiness)
v1.2.1 release
Compared to KubeRay v1.2.0, KubeRay v1.2.1 includes an additional commit (#2243). This commit fixes the issue where a RayService created by a KubeRay version older than v1.2.0 does not support zero-downtime upgrades after upgrading to KubeRay v1.2.0.
- [RayService] Use original ClusterIP for new head service (#2343, @kevin85421)
v1.2.0 release
Highlights
- RayCluster CRD status observability improvement: design doc
- Support retry in RayJob: #2192
- Coding style improvement
RayCluster
- [RayCluster][Fix] evicted head-pod can be recreated or restarted (#2217, @JasonChen86899)
- [Test][RayCluster] Add tests for RestartPolicyOnFailure for eviction (#2302, @MortalHappiness)
- kuberay autoscaler pod use same command and args as ray head container (#2268, @cswangzheng)
- Updated default timeout seconds for probes (#2265, @HarshAgarwal11)
- Buildkite autoscaler e2e (#2199, @rueian)
- [Test][Autoscaler][2/n] Add Ray Autoscaler e2e tests for GPU workers (#2181, @rueian)
- [Test][Autoscaler][1/n] Add Ray Autoscaler e2e tests (#2168, @kevin85421)
- [Bug] Fix RayCluster with an overridden app.kubernetes.io/name (#2147) (#2166, @rueian)
- [Feat][RayCluster] Make the Head service headless (#2117, @rueian)
- [Refactor][RayCluster] Make ray.io/group=headgroup be constant (#1970, @rueian)
- [Feature][autoscaler v2] Set RAY_NODE_TYPE_NAME when starting ray node (#1973, @kevin85421)
- feat: add `RayCluster.status.readyWorkerReplicas` (#1930, @davidxia)
- [Chore][Samples] Rename ray-cluster.mini.yaml and add workerGroupSpecs (#2100, @MortalHappiness)
- [Chore] Delete redundant pod existance checking (#2113, @MortalHappiness)
- [Autoscaler V2] Polish Autoscaler V2 YAML (#2064, @kevin85421)
- [Refactor] Use RayClusterHeadPodsAssociationOptions to replace MatchingLabels (#2056, @evalaiyc98)
- [Sample][autoscaler v2] Add sample yaml for autosclaer v2 (#1974, @rickyyx)
- Allow configuration of restartPolicy (#2197, @c0dearm)
- [Chore][Log] Delete error loggings right before returned errors (#2103, @MortalHappiness)
- [Refactor] Follow-up for PR 1930 (#2124, @MortalHappiness)
- [Test] Move StateTransitionTimes envtest to a better place (#2111, @kevin85421)
- support using proxy subresources when connecting to Ray head node (#1980, @andrewsykim)
- [Bug] All worker Pods are deleted if using KubeRay v1.0.0 CRD with KubeRay operator v1.1.0 image (#2087, @kevin85421)
- [Bug] KubeRay operator failed to watch endpoint (#2080, @kevin85421)
- [Refactor] Remove `cleanupInvalidVolumeMounts` (#2104, @kevin85421)
- [Chore] Run operator outside the cluster (#2090, @MortalHappiness)
- [Feat] Deprecate ForcedClusterUpgrade (#2075, @MortalHappiness)
- [Bug] Ray operator crashes when specifying RayCluster with resources.limits but no resources.requests (#2077, @kevin85421)
RayCluster CRD status improvement
- RayClusterProvisioned status should be set while cluster is being provisioned for the first time (#2304, @andrewsykim)
- Add RayClusterProvisioned Condition Type (#2301, @Yicheng-Lu-llll)
- [Test][RayCluster] Add envtests for RayCluster conditions (#2283, @MortalHappiness)
- [Fix][RayCluster] Make the RayClusterReplicaFailureReason to capture the correct reason (#2282, @rueian)
- Add RayClusterReady Condition Type (#2271, @Yicheng-Lu-llll)
- [Feature][RayCluster]: Implement the HeadReady condition (#2261, @cchen777)
- [Feature] REP 54: Add PodName to the HeadInfo (#2266, @rueian)
- [Feat][RayCluster] Use a new RayClusterReplicaFailure condition to reflect the result of reconcilePods (#2259, @rueian)
- Don’t assign the rayv1.Failed to the State field (#2258, @Yicheng-Lu-llll)
- [Refactor][RayCluster] Unify status update to single place (#2249, @MortalHappiness)
- [Feat][RayCluster] Introduce the RayClusterStatus.Conditions field (#2214, @rueian)
- [Test][Autoscaling] Add custom resource test (#2193, @MortalHappiness)
- feat: record last state transition times (#2053, @davidxia)
- [RayCluster] Add serviceName to status.headInfo (#2089, @andrewsykim)
- [RayCluster][Status][1/n] Remove ClusterState Unhealthy (#2068, @kevin85421)
Coding style improvement
- [Style] Fix golangci-lint rule: govet (#2144, @MortalHappiness)
- [Chore] Fix golangci-lint rule: gosec (#2163, @MortalHappiness)
- [Style] Fix golangci-lint rule: nolintlint (#2196, @MortalHappiness)
- [Style] Fix golangci-lint rule: unparam (#2195, @MortalHappiness)
- [Fix][CI] Fix revive error (#2183, @MortalHappiness)
- [Style] Fix golangci-lint rule: revive (#2167, @MortalHappiness)
- [Style] Fix golangci-lint rule: ginkgolinter (#2164, @MortalHappiness)
- [Style] Fix golangci-lint rule: errorlint (#2141, @MortalHappiness)
- [Chore] Use new golangci-lint rules only for ray-operator (#2152, @MortalHappiness)
- [Docs][Development] Delete linting docs (#2145, @MortalHappiness)
- [Style] Fix golangci-lint rule: unconvert (#2143, @MortalHappiness)
- [Style] Fix golangci-lint rule: noctx (#2142, @MortalHappiness)
- [Fix][precommit] Fix pre-commit golangci-lint always succeed (#2140, @MortalHappiness)
- [N/N][Chore] Add golangci-lint rules (#2128, @MortalHappiness)
- [Chore] Turn off no-commit-to-branch rule (#2139, @MortalHappiness)
- [5/N][Refactor] Run golangci-lint for all files (only autofix rules) (#2133, @MortalHappiness)
- [4/N][Chore] Turn off golangci-lint rules except ray-operator (#2138, @MortalHappiness)
- [3/N][CI] Replace lint CI with pre-commit (#2129, @MortalHappiness)
- [2/N][Refactor] Run pre-commit for all files (without golangci-lint) (#2130, @MortalHappiness)
- [1/N][Chore] Add pre-commit hooks (#2127, @MortalHappiness)
RayJob
- [RayJob] allow create verb for services/proxy, which is required for HTTPMode (#2321, @andrewsykim)
- [Fix][Sample-Yaml] Increase ray head CPU resource for pytorch minst (#2330, @MortalHappiness)
- Support Apache YuniKorn as one batch scheduler option (#2184, @yangwwei)
- [RayJob] add RayJob pass Deadline e2e-test with retry (#2241, @karta1502545)
- add feature gate mechanism to ray-operator (#2219, @andrewsykim)
- [RayJob] add Failing RayJob in HTTPMode e2e test for rayjob with retry (#2242, @tinaxfwu)
- [Feat][RayJob] Delete RayJob CR after job termination (#2225, @MortalHappiness)
- reconcile concurrency flag should apply for RayJob and RayService controllers (#2228, @andrewsykim)
- [RayJob] add Failing submitter K8s Job e2e ...
v1.1.1 release
Compared to KubeRay v1.1.0, KubeRay v1.1.1 includes four cherry-picked commits.
- [Bug] Ray operator crashes when specifying RayCluster with resources.limits but no resources.requests (#2077, @kevin85421)
- [CI] Pin kustomize to v5.3.0 (#2067, @kevin85421)
- [Bug] All worker Pods are deleted if using KubeRay v1.0.0 CRD with KubeRay operator v1.1.0 image (#2087, @kevin85421)
- [Hotfix][CI] Pin setup-envtest dep (#2038, @kevin85421)
v1.1.0 release
Highlights
- RayJob improvements
  - Gang / Priority scheduling with Kueue
  - ActiveDeadlineSeconds (new field): A feature to control the lifecycle of a RayJob. See this doc and #1933 for more details.
  - submissionMode (new field): Users can specify "K8sJobMode" or "HTTPMode". The default value is "K8sJobMode". In HTTPMode, the submitter K8s Job will not be created. Instead, KubeRay sends an HTTP request to the Ray head Pod to create a Ray job. See this doc and #1893 for more details.
  - Many stability issues fixed.
- Structured logging
  - In KubeRay v1.1.0, we have changed the KubeRay logs to JSON format, and each log message includes context information such as the custom resource's name and reconcileID. Hence, users can filter out logs associated with a RayCluster, RayJob, or RayService CR by its name.
- RayService improvements
  - Refactor the health check mechanism to improve stability.
  - Deprecate `deploymentUnhealthySecondThreshold` and `serviceUnhealthySecondThreshold` to avoid unintentional preparation of a new RayCluster custom resource.
- TPU multi-host PodSlice support
  - The KubeRay team is actively working with the Google GKE and TPU teams on integration. The required changes in KubeRay have already been completed. The GKE team will complete some tasks on their side this week or next. Then, users should be able to use multi-host TPU PodSlice with a static RayCluster (without autoscaling).
- Stop publishing images on DockerHub; instead, we will only publish on Quay.
  - https://quay.io/repository/kuberay/operator?tab=tags
  - Users should use `docker pull quay.io/kuberay/operator:v1.1.0` instead of `docker pull kuberay/operator:v1.1.0`.
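To show what the new submission mode looks like in practice (a minimal sketch; the entrypoint is a placeholder and the cluster spec is elided), an HTTPMode RayJob that skips the submitter Kubernetes Job:

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-httpmode
spec:
  submissionMode: HTTPMode          # default is K8sJobMode
  entrypoint: python my_script.py   # placeholder
  # rayClusterSpec omitted
```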
RayJob
RayJob state machine refactor
- [RayJob][Status][1/n] Redefine the definition of JobDeploymentStatusComplete (#1719, @kevin85421)
- [RayJob][Status][2/n] Redefine `ready` for RayCluster to avoid using HTTP requests to check dashboard status (#1733, @kevin85421)
- [RayJob][Status][3/n] Define JobDeploymentStatusInitializing (#1737, @kevin85421)
- [RayJob][Status][4/n] Remove some JobDeploymentStatus and updateState function calls (#1743, @kevin85421)
- [RayJob][Status][5/n] Refactor getOrCreateK8sJob (#1750, @kevin85421)
- [RayJob][Status][6/n] Redefine JobDeploymentStatusComplete and clean up K8s Job after TTL (#1762, @kevin85421)
- [RayJob][Status][7/n] Define JobDeploymentStatusNew explicitly (#1772, @kevin85421)
- [RayJob][Status][8/n] Only a RayJob with the status Running can transition to Complete at this moment (#1774, @kevin85421)
- [RayJob][Status][9/n] RayJob should not pass any changes to RayCluster (#1776, @kevin85421)
- [RayJob][10/n] Add finalizer to the RayJob when the RayJob status is JobDeploymentStatusNew (#1780, @kevin85421)
- [RayJob][Status][11/n] Refactor the suspend operation (#1782, @kevin85421)
- [RayJob][Status][12/n] Resume suspended RayJob (#1783, @kevin85421)
- [RayJob][Status][13/n] Make suspend operation atomic by introducing the new status `Suspending` (#1798, @kevin85421)
- [RayJob][Status][14/n] Decouple the Initializing status and Running status (#1801, @kevin85421)
- [RayJob][Status][15/n] Unify the codepath for the status transition to `Suspended` (#1805, @kevin85421)
- [RayJob][Status][16/n] Refactor `Running` status (#1807, @kevin85421)
- [RayJob][Status][17/n] Unify the codepath for status updates (#1814, @kevin85421)
- [RayJob][Status][18/n] Control the entire lifecycle of the Kubernetes submitter Job using KubeRay (#1831, @kevin85421)
- [RayJob][Status][19/n] Transition to `Complete` if the K8s Job fails (#1833, @kevin85421)
Others
- [Refactor] Remove global utils.GetRayXXXClientFuncs (#1727, @rueian)
- [Feature] Warn Users When Updating the RayClusterSpec in RayJob CR (#1778, @Yicheng-Lu-llll)
- Add apply configurations to generated client (#1818, @astefanutti)
- RayJob: inject RAY_DASHBOARD_ADDRESS envariable variable for user provided submiter templates (#1852, @andrewsykim)
- [Bug] Submitter K8s Job fails even though the RayJob has a JobDeploymentStatus `Complete` and a JobStatus `SUCCEEDED` (#1919, @kevin85421)
- add toleration for GPUs in sample pytorch RayJob (#1914, @andrewsykim)
- Add a sample RayJob to fine-tune a PyTorch lightning text classifier with Ray Data (#1891, @andrewsykim)
- rayjob controller: refactor environment variable check in unit tests (#1870, @andrewsykim)
- RayJob: don't delete submitter job when ShutdownAfterJobFinishes=true (#1881, @andrewsykim)
- rayjob controller: update EndTime to always be the time when the job deployment transitions to Complete status (#1872, @andrewsykim)
- chore: remove ConfigMap from ray-job.kueue-toy-sample.yaml (#1976, @kevin85421)
- [Kueue] Add a sample YAML for Kueue toy sample (#1956, @kevin85421)
- [RayJob] Support ActiveDeadlineSeconds (#1933, @kevin85421)
- [Feature][RayJob] Support light-weight job submission (#1893, @kevin85421)
- [RayJob] Add JobDeploymentStatusFailed Status and Reason Field to Enhance Observability for Flyte/RayJob Integration (#1942, @Yicheng-Lu-llll)
- [RayJob] Refactor Rayjob E2E Tests to Use Server-Side Apply (#1927, @Yicheng-Lu-llll)
- [RayJob] Rewrite RayJob envtest (#1916, @kevin85421)
- [Chore][RayJob] Remove the TODO of verifying the schema of RayJobInfo because it is already correct (#1911, @rueian)
- [RayJob] Set missing CPU limit (#1899, @kevin85421)
- [RayJob] Set the timeout of the HTTP client from 2 mins to 2 seconds (#1910, @kevin85421)
- [Feature][RayJob] Support light-weight job submission with entrypoint_num_cpus, entrypoint_num_gpus and entrypoint_resources (#1904, @rueian)
- [RayJob] Improve dashboard client log (#1903, @kevin85421)
- [RayJob] Validate whether runtimeEnvYAML is a valid YAML string (#1898, @kevin85421)
- [RayJob] Add additional print columns for RayJob (#1895, @andrewsykim)
- [Test][RayJob] Transition to `Complete` if the JobStatus is STOPPED (#1871, @kevin85421)
- [RayJob] Inject RAY_SUBMISSION_ID env variable for user provided submitter template (#1868, @kevin85421)
- [RayJob] Transition to `Complete` if the JobStatus is STOPPED (#1855, @kevin85421)
- [RayJob][Kueue] Move limitation check to validateRayJobSpec (#1854, @kevin85421)
- [RayJob] Validate RayJob spec (#1813, @kevin85421)
- [Test][RayJob] Kueue happy-path scenario (#1809, @kevin85421)
- [RayJob] Delete the Kubernetes Job and its Pods immediately when suspending (#1791, @rueian)
- [Feature][RayJob] Remove the deprecated RuntimeEnv from CRD. Use RuntimeEnvYAML instead. (#1792, @rueian)
- [Bug][RayJob] Avoid nil pointer dereference ([#1756](https://github.c...
v1.0.0 release
KubeRay is officially in General Availability!
- Bump the CRD version from v1alpha1 to v1.
- Relocate almost all documentation to the Ray website.
- Improve RayJob UX.
- Improve GCS fault tolerance.
GCS fault tolerance
- [GCS FT] Improve GCS FT cleanup UX (#1592, @kevin85421)
- [Bug][RayCluster] Fix RAY_REDIS_ADDRESS parsing with redis scheme and… (#1556, @rueian)
- [Bug] RayService with GCS FT HA issue (#1551, @kevin85421)
- [Test][GCS FT] End-to-end test for cleanup_redis_storage (#1422)(#1459) (#1466, @rueian)
- [Feature][GCS FT] Clean up Redis once a GCS FT-Enabled RayCluster is deleted (#1412, @kevin85421)
- Update GCS fault tolerance YAML (#1404, @kevin85421)
- [GCS FT] Consider the case of sidecar containers (#1386, @kevin85421)
- [GCS FT] Give readiness / liveness probes good default values (#1364, @kevin85421)
- [GCS FT][Refactor] Redefine the behavior for deleting Pods and stop listening to Kubernetes events (#1341, @kevin85421)
CRD versioning
- [CRD] Inject CRD version to the Autoscaler sidecar container (#1496, @kevin85421)
- [CRD][2/n] Update from CRD v1alpha1 to v1 (#1482, @kevin85421)
- [CRD][1/n] Create v1 CRDs (#1481, @kevin85421)
- [CRD] Set maxDescLen to 0 (#1449, @kevin85421)
RayService
- [Hotfix][Bug] Avoid unnecessary zero-downtime upgrade (#1581, @kevin85421)
- [Feature] Add an example for RayService high availability (#1566, @kevin85421)
- [Feature] Add a flag to make zero downtime upgrades optional (#1564, @kevin85421)
- [Bug][RayService] KubeRay does not recreate Serve applications if a head Pod without GCS FT recovers from a failure. (#1420, @kevin85421)
- [Bug] Fix the filename of text summarizer YAML (#1415, @kevin85421)
- [serve] Change text ml yaml to use french in user config (#1403, @zcin)
- [services] Add text ml rayservice yaml (#1402, @zcin)
- [Bug] Fix flakiness of RayService e2e tests (#1385, @kevin85421)
- Add RayService sample test (#1377, @Darren221)
- [RayService] Revisit the conditions under which a RayService is considered unhealthy and the default threshold (#1293, @kevin85421)
- [RayService][Observability] Add more loggings about networking issues (#1282, @kevin85421)
RayJob
- [Feature] Improve observability for flaky RayJob test (#1587, @kevin85421)
- [Bug][RayJob] Fix FailedToGetJobStatus by allowing transition to Running (#1583, @architkulkarni)
- [RayJob] Fix RayJob status reconciliation (#1539, @astefanutti)
- [RayJob]: Always use target RayCluster image as default RayJob submitter image (#1548, @astefanutti)
- [RayJob] Add default CPU and memory for job submitter pod (#1319, @architkulkarni)
- [Bug][RayJob] Check dashboard readiness before creating job pod (#1381) (#1429, @rueian)
- [Feature][RayJob] Use RayContainerIndex instead of 0 (#1397) (#1427, @rueian)
- [RayJob] Enable job log streaming by setting `PYTHONUNBUFFERED` in job container (#1375, @architkulkarni)
- Add field to expose entrypoint num cpus in rayjob (#1359, @shubhscoder)
- [RayJob] Add runtime env YAML field (#1338, @architkulkarni)
- [Bug][RayJob] RayJob with custom head service name (#1332, @kevin85421)
- [RayJob] Add e2e sample yaml test for shutdownAfterJobFinishes (#1269, @architkulkarni)
RayCluster
- [Enhancement] Remove unused variables in constant.go (#1474, @evalaiyc98)
- [Enhancement] GPU RayCluster doesn't work on GKE Autopilot (#1470, @kevin85421)
- [Refactor] Parameterize TestGetAndCheckServeStatus (#1450, @evalaiyc98)
- [Feature] Make replicas optional for WorkerGroupSpec (#1443, @kevin85421)
- use raycluster app's name as podgroup name key word (#1446, @lowang-bh)
- [Refactor] Make port name variables consistent and meaningful (#1389, @evalaiyc98)
- [Feature] Use image of Ray head container as the default Ray Autoscaler container (#1401, @kevin85421)
- Update Autoscaler YAML for the Autoscaler tutorial (#1400, @kevin85421)
- [Feature] Ray container must be the first application container (#1379, @kevin85421)
- [release blocker][Feature] Only Autoscaler can make decisions to delete Pods (#1253, @kevin85421)
- [release blocker][Autoscaler] Randomly delete Pods when scaling down the cluster (#1251, @kevin85421)
Helm charts
- Remove miniReplicas in raycluster-cluster.yaml (#1473, @evalaiyc98)
- Helm chart ray-cluster template reference fix (#1469, @chrisxstyles)
- fix: Issue #1391 - Custom labels not being pulled in (#1398, @rxraghu)
- Remove unnecessary kustomize in make helm (#1370, @shubhscoder)
- [Feature] Allow RayCluster Helm chart to specify different images for different worker groups (#1352, @Darren221)
- Allow manually creating init containers in Kuberay helm charts (#1287, @richardsliu)
KubeRay API Server
- Added Python API server client (#1561, @blublinsky)
- updating url use v1 (#1577, @blublinsky)
- Fixed processing of job submitter (#1562, @blublinsky)
- extended job APIs (#1537, @blublinsky)
- fixed volumes test in cluster test (#1498, @blublinsky)
- Add documentation for API Server monitoring (#1479, @blublinsky)
- created HA example for API server (#1461, @blublinsky)
- Numerous fixes to the API server to make RayJob APIs working (#1447, @blublinsky)
- Updated API server documentation (#1435, @z103cb)
- servev2 support for API server (#1419, @blublinsky)
- replacement for #1312 (#1409, @blublinsky)
- Updates to the apiserver swagger-ui (#1410, @z103cb)
- implemented liveness/readyness probe for the API server (#1369, @blublinsky)
- Operator support for openShift (#1371, @blublinsky)
- Removed use of the of BUILD_FLAGS in apiserver makefile (#1336, @z103cb)
- Api server makefile (#1301, @z103cb)
Documentation
- [Doc] Update release docs (#1621, @kevin85421)
- [Doc] Fix release doc format (#1578, @kevin85421)
- Update kuberay mcad integration doc (#1373, @tedhtchang)
- [Release][Doc] Add instructions to release Go modules. (#1546, @kevin85421)
- [Post v1.0.0-rc.1] Reenable sample YAML tests for latest release and update some docs (#1544, @kevin85421)
- Update operator development instruction ([#1458](https://g...
v0.6.0 release
Highlights
-
RayService
- RayService starts to support Ray Serve multi-app API (#1136, #1156)
- RayService stability improvements (#1231, #1207, #1173)
- RayService observability (#1230)
- RayService examples
- [RayService] Stable Diffusion example (#1181, @kevin85421)
- MobileNet example (#1175, @kevin85421)
- RayService troubleshooting handbook (#1221)
-
RayJob refactoring (#1177)
RayService
- [RayService][Observability] Add more logging for RayService troubleshooting (#1230, @kevin85421)
- [Bug] Long image pull time will trigger blue-green upgrade after the head is ready (#1231, @kevin85421)
- [RayService] Stable Diffusion example (#1181, @kevin85421)
- [RayService] Update docs to use multi-app (#1179, @zcin)
- [RayService] Change runtime env for e2e autoscaling test (#1178, @zcin)
- [RayService] Add e2e tests (#1167, @zcin)
- [RayService][docs] Improve explanation for config file and in-place updates (#1229, @zcin)
- [RayService][Doc] RayService troubleshooting handbook (#1221, @kevin85421)
- [Doc] Improve RayService doc (#1235, @kevin85421)
- [Doc] Improve FAQ page and RayService troubleshooting guide (#1225, @kevin85421)
- [RayService] Add RayService alb ingress CR (#1169, @sihanwang41)
- [RayService] Add support for multi-app config in yaml-string format (#1156, @zcin)
- [rayservice] Add support for getting multi-app status (#1136, @zcin)
- [Refactor] Remove Dashboard Agent service (#1207, @kevin85421)
- [Bug] KubeRay operator fails to get serve deployment status due to 500 Internal Server Error (#1173, @kevin85421)
- MobileNet example (#1175, @kevin85421)
- [Bug] fix RayActorOptionSpec.items.spec.serveConfig.deployments.rayActorOptions.memory int32 data type (#1220, @kevin85421)
RayJob
- [RayJob] Submit job using K8s job instead of checking Status and using DashboardHTTPClient (#1177, @architkulkarni)
- [Doc] [RayJob] Add documentation for submitterPodTemplate (#1228, @architkulkarni)
Autoscaler
- [release blocker][Feature] Only Autoscaler can make decisions to delete Pods (#1253, @kevin85421)
- [release blocker][Autoscaler] Randomly delete Pods when scaling down the cluster (#1251, @kevin85421)
Helm
- [Helm][RBAC] Introduce the option crNamespacedRbacEnable to enable or disable the creation of Role/RoleBinding for RayCluster preparation (#1162, @kevin85421)
- [Bug] Allow zero replica for workers for Helm (#968, @ducviet00)
- [Bug] KubeRay tries to create ClusterRoleBinding when singleNamespaceInstall and rbacEnable are set to true (#1190, @kevin85421)
KubeRay API Server
- Add support for openshift routes (#1183, @blublinsky)
- Adding API server support for service account (#1148, @blublinsky)
Documentation
- [release v0.6.0] Update tags and versions (#1270, @kevin85421)
- [release v0.6.0-rc.1] Update tags and versions (#1264, @kevin85421)
- [release v0.6.0-rc.0] Update tags and versions (#1237, @kevin85421)
- [Doc] Develop Ray Serve Python script on KubeRay (#1250, @kevin85421)
- [Doc] Fix the order of comments in sample Job YAML file (#1242, @architkulkarni)
- [Doc] Upload a screenshot for the Serve page in Ray dashboard (#1236, @kevin85421)
- [Doc] GKE GPU cluster setup (#1223, @kevin85421)
- [Doc][Website] Add complete document link (#1224, @yuxiaoba)
- Add FAQ page (#1150, @Yicheng-Lu-llll)
- [Doc] Add gofumpt lint instructions (#1180, @architkulkarni)
- [Doc] Add `helm update` command to chart validation step in release process (#1165, @architkulkarni)
- [Doc] Add `git fetch --tags` command to release instructions (#1164, @architkulkarni)
- Add KubeRay related blogs (#1147, @tedhtchang)
- [2.5.0 Release] Change version numbers 2.4.0 -> 2.5.0 (#1151, @ArturNiederfahrenhorst)
- [Sample YAML] Bump ray version in pod security YAML to 2.4.0 (#1160, @architkulkarni)
- Add instruction to skip unit tests in DEVELOPMENT.md (#1171, @architkulkarni)
- Fix typo (#1241, @mmourafiq)
- Fix typo (#1232, @mmourafiq)
CI
- [CI] Add `kind`-in-Docker test to Buildkite CI (#1243, @architkulkarni)
- [CI] Remove unnecessary release.yaml workflow (#1168, @architkulkarni)
Others
- Pin operator version in single namespace installation (#1193) (#1210, @wjzhou)
- RayCluster updates status frequently (#1211, @kevin85421)
- Improve the observability of the init container (#1149, @Yicheng-Lu-llll)
- [Ray Observability] Disk usage in Dashboard (#1152, @kevin85421)
v0.5.2 release
Changelog for v0.5.2
Highlights
The KubeRay 0.5.2 patch release includes the following improvements.
- Allow specifying the entire headService and serveService YAML spec. Previously, only certain special fields such as `labels` and `annotations` were exposed to the user.
- Expose entire head pod Service to the user (#1040, @architkulkarni)
- Exposing Serve Service (#1117, @kodwanis)
- RayService stability improvements
- RayService object’s Status is being updated due to frequent reconciliation (#1065, @kevin85421)
- [RayService] Submit requests to the Dashboard after the head Pod is running and ready (#1074, @kevin85421)
- Fix in HeadPod Service Generation logic which was causing frequent reconciliation (#1056, @msumitjain)
- Allow watching multiple namespaces
- [Feature] Watch CR in multiple namespaces with namespaced RBAC resources (#1106, @kevin85421)
- Autoscaler stability improvements
- [Bug] RayService restarts repeatedly with Autoscaler (#1037, @kevin85421)
- [Bug] autoscaler not working properly in rayjob (#1064, @Yicheng-Lu-llll)
- [Bug][Autoscaler] Operator does not remove workers (#1139, @kevin85421)
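To illustrate the first highlight above (#1040, #1117): the full head Service can now be specified inline rather than only its labels and annotations. The sketch below is illustrative only; the `headService` placement under `headGroupSpec` and the annotation values are assumptions based on the CRD shape at this release, not an authoritative example.

```yaml
# Sketch: overriding the entire head Pod Service (#1040).
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: raycluster-sample
spec:
  headGroupSpec:
    # Full Service spec for the head Pod, not just labels/annotations.
    headService:
      metadata:
        name: custom-head-svc
        annotations:
          prometheus.io/scrape: "true"   # example annotation, not required
      spec:
        type: ClusterIP
    rayStartParams: {}
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.4.0
```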
Contributors
We'd like to thank the following contributors for their contributions to this release:
@ByronHsu, @Yicheng-Lu-llll, @anishasthana, @architkulkarni, @blublinsky, @chrisxstyles, @dirtyValera, @ecurtin, @jasoonn, @jjyao, @kevin85421, @kodwanis, @msumitjain, @oginskis, @psschwei, @scarlet25151, @sihanwang41, @tedhtchang, @varungup90, @xubo245
Features
- Add a flag to enable/disable worker init container injection (#1069, @ByronHsu)
- Add a warning to discourage users from launching a KubeRay-incompatible autoscaler. (#1102, @kevin85421)
- Add consistency check for deepcopy generated files (#1127, @varungup90)
- Add kubernetes dependency in python client library (#998, @jasoonn)
- Add support for pvcs to apiserver (#1118, @psschwei)
- Add support for tolerations, env, annotations and labels (#1070, @blublinsky)
- Align Init Container's ImagePullPolicy with Ray Container's ImagePullPolicy (#1080, @Yicheng-Lu-llll)
- Connect Ray client with TLS using Nginx Ingress on Kind cluster (#729) (#1051, @tedhtchang)
- Expose entire head pod Service to the user (#1040, @architkulkarni)
- Exposing Serve Service (#1117, @kodwanis)
- [Test] Add e2e test for sample RayJob yaml on kind (#935, @architkulkarni)
- Parametrize ray-operator makefile (#1121, @anishasthana)
- RayService object's Status is being updated due to frequent reconciliation (#1065, @kevin85421)
- [Feature] Support suspend in RayJob (#926, @oginskis)
- [Feature] Watch CR in multiple namespaces with namespaced RBAC resources (#1106, @kevin85421)
- [RayService] Submit requests to the Dashboard after the head Pod is running and ready (#1074, @kevin85421)
- feat: Rename instances of rayiov1alpha1 to rayv1alpha1 (#1112, @anishasthana)
- ray-operator: Reuse contexts across ray operator reconcilers (#1126, @anishasthana)
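One feature in the list above, RayJob `suspend` support (#926), mirrors the `suspend` semantics of a Kubernetes batch/v1 Job. A minimal sketch, assuming the v1alpha1 field layout; the entrypoint path is a placeholder:

```yaml
# Sketch: suspending a RayJob (#926). While suspend is true, the
# operator does not run the job; set it to false to resume.
apiVersion: ray.io/v1alpha1
kind: RayJob
metadata:
  name: rayjob-sample
spec:
  suspend: true
  entrypoint: python sample_code.py   # placeholder entrypoint
  rayClusterSpec:
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.4.0
```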
Fixes
- Fix CI (#1145, @kevin85421)
- Fix config frequent update (#1014, @sihanwang41)
- Fix for Sample YAML Config Test - 2.4.0 Failure due to 'suspend' Field (#1096, @Yicheng-Lu-llll)
- Fix in HeadPod Service Generation logic which was causing frequent reconciliation (#1056, @msumitjain)
- [Bug] Autoscaler doesn't support TLS (#1119, @chrisxstyles)
- [Bug] Enable ResourceQuota by adding Resources for the health-check init container (#1043, @kevin85421)
- [Bug] Fix null map handling in `BuildServiceForHeadPod` function (#1095, @architkulkarni)
- [Bug] RayService restarts repeatedly with Autoscaler (#1037, @kevin85421)
- [Bug] Service (Serve) changing port from 8000 to 9000 doesn't work (#1081, @kevin85421)
- [Bug] autoscaler not working properly in rayjob (#1064, @Yicheng-Lu-llll)
- [Bug] compatibility test for the nightly Ray image fails (#1055, @kevin85421)
- [Bug] rayStartParams is required at this moment. (#1031, @kevin85421)
- [Bug][Autoscaler] Operator does not remove workers (#1139, @kevin85421)
- [Bug][Doc] fix the link error of operator document (#1046, @xubo245)
- [Bug][GCS FT] Worker pods crash unexpectedly when gcs_server on head pod is killed (#1036, @kevin85421)
- [Bug][breaking change] Unauthorized 401 error on fetching Ray Custom Resources from K8s API server (#1128, @kevin85421)
- [Bug][k8s compatibility] k8s v1.20.7 ClusterIP svc do not updated under RayService (#1110, @kevin85421)
- [Helm][ray-cluster] Fix parsing envFrom field in additionalWorkerGroups (#1039, @dirtyValera)
Documentation
- [Doc] Copyedit dev guide (#1012, @architkulkarni)
- [Doc] Update nav to include missing files and reorganize nav (#1011, @architkulkarni)
- [Doc] Update version from 0.4.0 to 0.5.0 on remaining kuberay docs files (#1018, @architkulkarni)
- [Doc][Website] Update KubeRay introduction and fix layout issues (#1042, @kevin85421)
- [Docs][Website] One word typo fix in docs and README (#1068, @ecurtin)
- Add a document to outline the default settings for `rayStartParams` in KubeRay (#1057, @Yicheng-Lu-llll)
- Example Pod to connect Ray client to a remote Ray cluster with TLS enabled (#994, @tedhtchang)
- [Post release v0.5.0] Update CHANGELOG.md (#1026, @kevin85421)
- [Post release v0.5.0] Update release doc (#1028, @kevin85421)
- [Post Ray 2.4 Release] Update Ray versions to Ray 2.4.0 (#1049, @jjyao)
- [Post release v0.5.0] Remove block from rayStartParams (#1015, @kevin85421)
- [Post release v0.5.0] Remove block from rayStartParams for python client and KubeRay operator tests (#1050, @Yicheng-Lu-llll)
- [Post release v0.5.0] Remove serviceType (#1013, @kevin85421)
- [Post v0.5.0] Remove init containers from YAML files (#1010, @kevin85421)
- [Sample YAML] Bump ray version in pod security YAML to 2.4.0 (#1160) (#1161, @architkulkarni)
- Kuberay 0.5.0 docs validation update docs for GCS FT (#1004, @scarlet25151)
- Release v0.5.0 doc validation (#997, @kevin85421)
- Release v0.5.0 doc validation part 2 (#999, @architkulkarni)
- Release v0.5.0 python client library validation (#1006, @jasoonn)
- [release v0.5.2] Update tags and versions to 0.5.2 (#1159, @architkulkarni)
v0.5.0 release
Highlights
The KubeRay 0.5.0 release includes the following improvements.
- Interact with KubeRay via a Python client
- Integrate KubeRay with Kubeflow to provide an interactive development environment (link).
- Integrate KubeRay with Ray TLS authentication
- Improve the user experience for KubeRay on AWS EKS (link)
- Fix some Kubernetes networking issues
- Fix some stability bugs in RayJob and RayService
Contributors
The following individuals contributed to KubeRay 0.5.0. This list is alphabetical and incomplete.
@akanso @alex-treebeard @architkulkarni @cadedaniel @cskornel-doordash @davidxia @DmitriGekhtman @ducviet00 @gvspraveen @harryge00 @jasoonn @Jeffwan @kevin85421 @psschwei @scarlet25151 @sihanwang41 @wilsonwang371 @Yicheng-Lu-llll
Python client (alpha) (New!)
Kubeflow (New!)
- [Feature][Doc] Kubeflow integration (#937, @kevin85421)
- [Feature] Ray restricted podsecuritystandards for enterprise security and Kubeflow integration (#750, @kevin85421)
TLS authentication (New!)
- [Feature] TLS authentication (#989, @kevin85421)
AWS EKS (New!)
- [Feature][Doc] Access S3 bucket from Pods in EKS (#958, @kevin85421)
Kubernetes networking (New!)
- Read cluster domain from resolv.conf or env (#951, @harryge00)
- [Feature] Replace service name with Fully Qualified Domain Name (#938, @kevin85421)
- [Feature] Add default init container in workers to wait for GCS to be ready (#973, @kevin85421)
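The default worker init container added in #973 blocks worker startup until the head's GCS is reachable at its fully qualified domain name (#938). A hand-rolled equivalent might look like the following; the operator injects its own variant, so the exact image, env var name, and health-check command here are illustrative assumptions:

```yaml
# Sketch: wait-for-GCS init container on a worker Pod (#973).
initContainers:
  - name: wait-gcs-ready
    image: rayproject/ray:2.4.0
    command: ["/bin/bash", "-c"]
    args:
      # Poll the head's GCS endpoint until it responds; FQ_RAY_IP is
      # assumed to hold the head Service FQDN (#938), port 6379 the GCS port.
      - >
        until ray health-check --address $FQ_RAY_IP:6379 > /dev/null 2>&1;
        do echo "waiting for GCS"; sleep 5; done
```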
Observability
- Fix issue with head pod not monitored by Prometheus under certain condition (#963, @Yicheng-Lu-llll)
- [Feature] Improve and fix Prometheus & Grafana integrations (#895, @kevin85421)
- Add example and tutorial to explain how to create custom metrics for Prometheus (#914, @Yicheng-Lu-llll)
- feat: enrich `kubectl get` output (#878, @davidxia)
RayCluster
- Fix issue with operator OOM restart (#946, @wilsonwang371)
- [Feature][Hotfix] Add observedGeneration to the status of CRDs (#979, @kevin85421)
- Customize the Prometheus export port (#954, @Yicheng-Lu-llll)
- [Feature] The default ImagePullPolicy should be IfNotPresent (#947, @kevin85421)
- Inject the --block option to ray start command automatically (#932, @Yicheng-Lu-llll)
- Inject cluster name as an environment variable into head and worker pods (#934, @Yicheng-Lu-llll)
- Ensure container ports without names are also included in the head node service (#891, @Yicheng-Lu-llll)
- fix: `.status.availableWorkerReplicas` (#887, @davidxia)
- fix: only filter RayCluster events for reconciliation (#882, @davidxia)
- refactor: remove redundant import in `raycluster_controller.go` (#884, @davidxia)
- refactor: use equivalent, shorter `Builder.Owns()` method (#881, @davidxia)
- [RayCluster controller] [Bug] Unconditionally reconcile RayCluster every 60s instead of only upon change (#850, @architkulkarni)
- [Feature] Make head serviceType optional (#851, @kevin85421)
- [RayCluster controller] Add headServiceAnnotations field to RayCluster CR (#841, @cskornel-doordash)
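The `headServiceAnnotations` field added in #841 copies annotations onto the generated head Service. A short sketch; the annotation keys below are arbitrary examples, not values the operator requires:

```yaml
# Sketch: annotating the head node Service via the CR (#841).
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: raycluster-sample
spec:
  headServiceAnnotations:
    prometheus.io/scrape: "true"        # example only
    example.com/owner: ml-platform      # example only
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.3.0
```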
RayJob (alpha)
- [Hotfix][release blocker][RayJob] HTTP client from submitting jobs before dashboard initialization completes (#1000, @kevin85421)
- [RayJob] Propagate error traceback string when GetJobInfo doesn't return valid JSON (#943, @architkulkarni)
- [RayJob][Doc] Fix RayJob sample config. (#807, @DmitriGekhtman)
RayService (alpha)
- [RayService] Skip update events without change (#811, @sihanwang41)
Helm
- Add rayVersion in the RayCluster chart (#975, @Yicheng-Lu-llll)
- [Feature] Support environment variables for KubeRay operator chart (#978, @kevin85421)
- [Feature] Add service account section in helm chart (#969, @ducviet00)
- Update apiserver chart location in readme (#896, @psschwei)
- add sidecar container option (#920, @akihikokuroda)
- match selector of service to pod labels (#918, @akihikokuroda)
- [Feature] Nodeselector/Affinity/Tolerations value to kuberay-apiserver chart (#879, @alex-treebeard)
- [Feature] Enable namespaced installs via helm chart (#860, @alex-treebeard)
- Remove unused fields from KubeRay operator and RayCluster charts (#839, @kevin85421)
- [Bug] Remove an unused field (ingress.enabled) from KubeRay operator chart (#812, @kevin85421)
- [helm] Add memory limits and resource documentation. (#789, @DmitriGekhtman)
CI
- [Feature] Add python client test to action (#993, @jasoonn)
- [CI][Buildkite] Fix the PATH issue (#952, @kevin85421)
- [CI][Buildkite] An example test for Buildkite (#919, @kevin85421)
- refactor: Fix flaky tests by using RetryOnConflict (#904, @Yicheng-Lu-llll)
- Use k8sClient from client.New in controller test (#898, @Yicheng-Lu-llll)
- [Bug] Fix flaky test: should be able to update all Pods to Running (#893, @kevin85421)
- Enable test framework to install operator with custom config and put operator in a namespace with enforced PSS in security testing (#876, @Yicheng-Lu-llll)
- Ensure all temp files are deleted after the compatibility test (#886, @Yicheng-Lu-llll)
- Adding a test for the document for the Pod security standard (#866, @Yicheng-Lu-llll)
- [Feature] Run config tests with the latest release of KubeRay operator (#858, @kevin85421)
- [Feature] Define a general-purpose cleanup method for CREvent (#849, @kevin85421)
- [Feature] Remove Docker container and NodePort from compatibility test (#844, @kevin85421)
- Remove Docker from BasicRayTestCase (#840, @kevin85421)
- [Feature] Move some functions from prototype test framework to a new utils file (#837, @kevin85421)
- [CI] Add workflow to manually trigger release image push (#801, @DmitriGekhtman)
- [CI] Pin go version in CRD consistency check (#794, @DmitriGekhtman)
- [Feature] Improve the observability of integration tests (#775, @jasoonn)
Sample YAML files
- Improve ray-cluster.external-redis.yaml (#986, @Yicheng-Lu-llll)
- remove ray-cluster.getting-started.yaml (#987, @Yicheng-Lu-llll)
- [Feature] Read Redis password from Kubernetes Secret (#950, @kevin85421)
- [Ray 2.3.0] Update --redis-password for RayCluster (#929, @kevin85421)
- [Bug] KubeRay does not work on M1 macs. (#869, @kevin85421)
- [Post Ray 2.3 Release] Update Ray versions to Ray 2.3.0 (#925, @cadedaniel)
- [Post Ray...