Releases: ray-project/kuberay

v1.3.0

19 Feb 00:39
8ba2b33

Highlights

RayCluster Conditions API

The RayCluster conditions API is graduating to Beta status in v1.3. The new API surfaces details about the RayCluster's observable state that could not be expressed in the old API. The following conditions are supported in v1.3: AllPodRunningAndReadyFirstTime, RayClusterPodsProvisioning, HeadPodNotFound and HeadPodRunningAndReady. We will add more conditions in future releases.
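
For illustration, once all Pods are running and ready for the first time, the RayCluster status might contain conditions like the following (a minimal sketch; the timestamps and exact type/reason layout are illustrative):

```yaml
# Hypothetical RayCluster status excerpt; values are illustrative.
status:
  conditions:
  - type: HeadPodRunningAndReady
    status: "True"
    lastTransitionTime: "2025-02-19T00:00:00Z"
  - type: AllPodRunningAndReadyFirstTime
    status: "True"
    lastTransitionTime: "2025-02-19T00:05:00Z"
```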

Ray Kubectl Plugin

The Ray Kubectl Plugin is graduating to Beta status. The following commands are supported with KubeRay v1.3:

  • kubectl ray logs <cluster-name>: download Ray logs to a local directory
  • kubectl ray session <cluster-name>: initiate a port-forwarding session to the Ray head
  • kubectl ray create cluster <cluster-name>: create a Ray cluster
  • kubectl ray job submit: create a RayJob and submit a job using a local working directory

See the Ray Kubectl Plugin docs for more details.

RayJob Stability Improvements

Several improvements have been made to enhance the stability of long-running RayJobs. In particular, when using submissionMode=K8sJobMode, job submissions no longer fail due to duplicate submission IDs. Now, if a submission ID already exists, the logs of the existing job are retrieved instead.
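
For reference, the submission mode is selected on the RayJob spec; a minimal sketch (the entrypoint is illustrative and the cluster spec is elided):

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: sample-rayjob
spec:
  submissionMode: K8sJobMode  # submit the job via a Kubernetes submitter Job
  entrypoint: python -c "import ray; ray.init()"
  # rayClusterSpec elided for brevity
```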

RayService API Improvements

RayService strives to deliver zero-downtime serving. When changes to the RayService spec cannot be applied in place, it attempts to migrate traffic to a new RayCluster in the background. However, users might not always have sufficient resources for a new RayCluster. Beginning with KubeRay v1.3, users can customize this behavior using the new UpgradeStrategy option within the RayServiceSpec.
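
A minimal sketch of the new option, assuming the strategy is set via a type field under spec.upgradeStrategy (see the Features list below; NewCluster opts into zero-downtime upgrades, None disables them):

```yaml
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: sample-rayservice
spec:
  upgradeStrategy:
    type: None  # set to NewCluster to allow migrating traffic to a new RayCluster
  # serveConfigV2 and rayClusterConfig elided for brevity
```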

Previously, the serviceStatus field in RayService was inconsistent and did not accurately represent the actual state. Starting with KubeRay v1.3.0, we have introduced two conditions, Ready and UpgradeInProgress, to RayService. Following the approach taken with RayCluster, we have decided to deprecate serviceStatus. In the future, serviceStatus will be removed, and conditions will serve as the definitive source of truth. For now, serviceStatus remains available but is limited to two possible values: "Running" or an empty string.
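
Under these rules, the status of a healthy RayService might look like the following (a hypothetical excerpt; values are illustrative):

```yaml
# Hypothetical RayService status excerpt; values are illustrative.
status:
  serviceStatus: Running   # deprecated; now only "Running" or ""
  conditions:
  - type: Ready
    status: "True"
  - type: UpgradeInProgress
    status: "False"
```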

GCS Fault Tolerance API Improvements

The new GcsFaultToleranceOptions field in the RayCluster CRD provides a streamlined way for users to enable GCS fault tolerance on a RayCluster. It eliminates the previous need to scatter related settings across Pod annotations, container environment variables, and rayStartParams. Furthermore, users can now specify their Redis username in the newly introduced field (requires Ray 2.41.0 or later). To see the impact of this change on a YAML configuration, please refer to the example manifest.
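
A minimal sketch of the consolidated configuration, assuming field names as in the example manifest (the Redis address and Secret references are illustrative):

```yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-gcs-ft
spec:
  gcsFaultToleranceOptions:
    redisAddress: "redis:6379"
    redisUsername:              # new in v1.3
      valueFrom:
        secretKeyRef:
          name: redis-secret
          key: username
    redisPassword:
      valueFrom:
        secretKeyRef:
          name: redis-secret
          key: password
  # headGroupSpec and workerGroupSpecs elided for brevity
```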

Breaking Changes

RayService API

Starting from KubeRay v1.3.0, we have removed all possible values of RayService.Status.ServiceStatus except Running, so the only valid values for ServiceStatus are Running and empty. If ServiceStatus is Running, it means that RayService is ready to serve requests. In other words, ServiceStatus is equivalent to the Ready condition. It is strongly recommended to use the Ready condition instead of ServiceStatus going forward.

Features

  • RayCluster Conditions API is graduating to Beta status. The feature gate RayClusterStatusConditions is now enabled by default.
  • New events were added for RayCluster, RayJob and RayService for improved observability
  • Various improvements to Ray autoscaler v2
  • Introduce a new API in RayService spec.upgradeStrategy. The upgrade strategy type can be set to NewCluster or None to modify the behavior of zero-downtime upgrades for RayService.
  • Add RayCluster controller expectations to mitigate stale informer caches
  • RayJob now supports submission mode InteractiveMode. Use this submission mode when you want to submit jobs from a local working directory on your laptop.
  • RayJob now supports the spec.deletionPolicy API; this feature requires the RayJobDeletionPolicy feature gate to be enabled. The initial deletion policies are DeleteCluster, DeleteWorkers, DeleteSelf and DeleteNone (see the sketch after this list).
  • KubeRay now detects TPU and Neuron Core resources and passes them as custom resources in the ray start parameters
  • Introduce RayClusterSuspending and RayClusterSuspended conditions
  • Container CPU requests are now used for the Ray --num-cpus argument if a CPU limit is not specified
  • Various example manifests for using TPU v6 with KubeRay
  • Add the ManagedBy field to RayJob and RayCluster. This is required for MultiKueue support.
  • Add support for kubectl ray create cluster command
  • Add support for kubectl ray create workergroup command
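
As referenced in the deletion policy item above, a minimal sketch of the new API (assuming the RayJobDeletionPolicy feature gate is enabled; the entrypoint is illustrative):

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: sample-rayjob
spec:
  deletionPolicy: DeleteCluster  # or DeleteWorkers, DeleteSelf, DeleteNone
  entrypoint: python -c "import ray; ray.init()"
  # rayClusterSpec elided for brevity
```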

Guides & Tutorials

Changelog

v1.2.2

29 Sep 08:40
0ea404b

Highlights

  • (alpha) Ray kubectl plugin
    • get, session, log, job submit
  • (alpha) Kubernetes events: create Kubernetes events for important information about the interactions between KubeRay and the Kubernetes API server
  • (alpha) Apache YuniKorn integration

Changelog

v1.2.1 release

31 Aug 06:43
fa3d8ee

Compared to KubeRay v1.2.0, KubeRay v1.2.1 includes an additional commit (#2243). This commit fixes the issue where a RayService created by a KubeRay version older than v1.2.0 does not support zero-downtime upgrades after upgrading to KubeRay v1.2.0.

v1.2.0 release

29 Aug 21:44
58ba733

Highlights

  • RayCluster CRD status observability improvement: design doc
  • Support retry in RayJob: #2192
  • Coding style improvement

RayCluster

RayCluster CRD status improvement

Coding style improvement

RayJob

  • [RayJob] allow create verb for services/proxy, which is required for HTTPMode (#2321, @andrewsykim)
  • [Fix][Sample-Yaml] Increase ray head CPU resource for pytorch minst (#2330, @MortalHappiness)
  • Support Apache YuniKorn as one batch scheduler option (#2184, @yangwwei)
  • [RayJob] add RayJob pass Deadline e2e-test with retry (#2241, @karta1502545)
  • add feature gate mechanism to ray-operator (#2219, @andrewsykim)
  • [RayJob] add Failing RayJob in HTTPMode e2e test for rayjob with retry (#2242, @tinaxfwu)
  • [Feat][RayJob] Delete RayJob CR after job termination (#2225, @MortalHappiness)
  • reconcile concurrency flag should apply for RayJob and RayService controllers (#2228, @andrewsykim)
  • [RayJob] add Failing submitter K8s Job e2e ...

v1.1.1 release

08 May 20:14
f460fda

Compared to KubeRay v1.1.0, KubeRay v1.1.1 includes four cherry-picked commits.

  • [Bug] Ray operator crashes when specifying RayCluster with resources.limits but no resources.requests (#2077, @kevin85421)
  • [CI] Pin kustomize to v5.3.0 (#2067, @kevin85421)
  • [Bug] All worker Pods are deleted if using KubeRay v1.0.0 CRD with KubeRay operator v1.1.0 image (#2087, @kevin85421)
  • [Hotfix][CI] Pin setup-envtest dep (#2038, @kevin85421)

v1.1.0 release

23 Mar 04:05
8adc538

Highlights

  • RayJob improvements

  • Structured logging

    • In KubeRay v1.1.0, we changed the KubeRay logs to JSON format, and each log message includes context information such as the custom resource's name and reconcileID. Hence, users can filter the logs associated with a RayCluster, RayJob, or RayService CR by its name (see the sample log entry after this list).
  • RayService improvements

    • Refactor the health check mechanism to improve stability.
    • Deprecate deploymentUnhealthySecondThreshold and serviceUnhealthySecondThreshold to avoid unintentional preparation of a new RayCluster custom resource.
  • TPU multi-host PodSlice support

    • The KubeRay team is actively working with the Google GKE and TPU teams on integration. The required changes in KubeRay have already been completed. The GKE team will complete some tasks on their side this week or next. Then, users should be able to use multi-host TPU PodSlice with a static RayCluster (without autoscaling).
  • Stop publishing images on DockerHub; instead, we will only publish on Quay.
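
As referenced in the structured logging item above, a filtered log entry might look like this (a hypothetical sample; the exact field names depend on the logger configuration):

```json
{"level":"info","ts":"2024-03-23T04:05:00Z","logger":"controllers.RayCluster","msg":"Reconciling RayCluster","RayCluster":{"name":"sample-cluster","namespace":"default"},"reconcileID":"4f2d0b6e-0000-0000-0000-000000000000"}
```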

RayJob

RayJob state machine refactor

  • [RayJob][Status][1/n] Redefine the definition of JobDeploymentStatusComplete (#1719, @kevin85421)
  • [RayJob][Status][2/n] Redefine ready for RayCluster to avoid using HTTP requests to check dashboard status (#1733, @kevin85421)
  • [RayJob][Status][3/n] Define JobDeploymentStatusInitializing (#1737, @kevin85421)
  • [RayJob][Status][4/n] Remove some JobDeploymentStatus and updateState function calls (#1743, @kevin85421)
  • [RayJob][Status][5/n] Refactor getOrCreateK8sJob (#1750, @kevin85421)
  • [RayJob][Status][6/n] Redefine JobDeploymentStatusComplete and clean up K8s Job after TTL (#1762, @kevin85421)
  • [RayJob][Status][7/n] Define JobDeploymentStatusNew explicitly (#1772, @kevin85421)
  • [RayJob][Status][8/n] Only a RayJob with the status Running can transition to Complete at this moment (#1774, @kevin85421)
  • [RayJob][Status][9/n] RayJob should not pass any changes to RayCluster (#1776, @kevin85421)
  • [RayJob][10/n] Add finalizer to the RayJob when the RayJob status is JobDeploymentStatusNew (#1780, @kevin85421)
  • [RayJob][Status][11/n] Refactor the suspend operation (#1782, @kevin85421)
  • [RayJob][Status][12/n] Resume suspended RayJob (#1783, @kevin85421)
  • [RayJob][Status][13/n] Make suspend operation atomic by introducing the new status Suspending (#1798, @kevin85421)
  • [RayJob][Status][14/n] Decouple the Initializing status and Running status (#1801, @kevin85421)
  • [RayJob][Status][15/n] Unify the codepath for the status transition to Suspended (#1805, @kevin85421)
  • [RayJob][Status][16/n] Refactor Running status (#1807, @kevin85421)
  • [RayJob][Status][17/n] Unify the codepath for status updates (#1814, @kevin85421)
  • [RayJob][Status][18/n] Control the entire lifecycle of the Kubernetes submitter Job using KubeRay (#1831, @kevin85421)
  • [RayJob][Status][19/n] Transition to Complete if the K8s Job fails (#1833, @kevin85421)

Others

  • [Refactor] Remove global utils.GetRayXXXClientFuncs (#1727, @rueian)
  • [Feature] Warn Users When Updating the RayClusterSpec in RayJob CR (#1778, @Yicheng-Lu-llll)
  • Add apply configurations to generated client (#1818, @astefanutti)
  • RayJob: inject RAY_DASHBOARD_ADDRESS environment variable for user-provided submitter templates (#1852, @andrewsykim)
  • [Bug] Submitter K8s Job fails even though the RayJob has a JobDeploymentStatus Complete and a JobStatus SUCCEEDED (#1919, @kevin85421)
  • add toleration for GPUs in sample pytorch RayJob (#1914, @andrewsykim)
  • Add a sample RayJob to fine-tune a PyTorch lightning text classifier with Ray Data (#1891, @andrewsykim)
  • rayjob controller: refactor environment variable check in unit tests (#1870, @andrewsykim)
  • RayJob: don't delete submitter job when ShutdownAfterJobFinishes=true (#1881, @andrewsykim)
  • rayjob controller: update EndTime to always be the time when the job deployment transitions to Complete status (#1872, @andrewsykim)
  • chore: remove ConfigMap from ray-job.kueue-toy-sample.yaml (#1976, @kevin85421)
  • [Kueue] Add a sample YAML for Kueue toy sample (#1956, @kevin85421)
  • [RayJob] Support ActiveDeadlineSeconds (#1933, @kevin85421)
  • [Feature][RayJob] Support light-weight job submission (#1893, @kevin85421)
  • [RayJob] Add JobDeploymentStatusFailed Status and Reason Field to Enhance Observability for Flyte/RayJob Integration (#1942, @Yicheng-Lu-llll)
  • [RayJob] Refactor Rayjob E2E Tests to Use Server-Side Apply (#1927, @Yicheng-Lu-llll)
  • [RayJob] Rewrite RayJob envtest (#1916, @kevin85421)
  • [Chore][RayJob] Remove the TODO of verifying the schema of RayJobInfo because it is already correct (#1911, @rueian)
  • [RayJob] Set missing CPU limit (#1899, @kevin85421)
  • [RayJob] Set the timeout of the HTTP client from 2 mins to 2 seconds (#1910, @kevin85421)
  • [Feature][RayJob] Support light-weight job submission with entrypoint_num_cpus, entrypoint_num_gpus and entrypoint_resources (#1904, @rueian)
  • [RayJob] Improve dashboard client log (#1903, @kevin85421)
  • [RayJob] Validate whether runtimeEnvYAML is a valid YAML string (#1898, @kevin85421)
  • [RayJob] Add additional print columns for RayJob (#1895, @andrewsykim)
  • [Test][RayJob] Transition to Complete if the JobStatus is STOPPED (#1871, @kevin85421)
  • [RayJob] Inject RAY_SUBMISSION_ID env variable for user provided submitter template (#1868, @kevin85421)
  • [RayJob] Transition to Complete if the JobStatus is STOPPED (#1855, @kevin85421)
  • [RayJob][Kueue] Move limitation check to validateRayJobSpec (#1854, @kevin85421)
  • [RayJob] Validate RayJob spec (#1813, @kevin85421)
  • [Test][RayJob] Kueue happy-path scenario (#1809, @kevin85421)
  • [RayJob] Delete the Kubernetes Job and its Pods immediately when suspending (#1791, @rueian)
  • [Feature][RayJob] Remove the deprecated RuntimeEnv from CRD. Use RuntimeEnvYAML instead. (#1792, @rueian)
  • [Bug][RayJob] Avoid nil pointer dereference ([#1756](https://github.c...

v1.0.0 release

07 Nov 06:12
1add258

KubeRay is officially in General Availability!

  • Bump the CRD version from v1alpha1 to v1.
  • Relocate almost all documentation to the Ray website.
  • Improve RayJob UX.
  • Improve GCS fault tolerance.

GCS fault tolerance

CRD versioning

RayService

  • [Hotfix][Bug] Avoid unnecessary zero-downtime upgrade (#1581, @kevin85421)
  • [Feature] Add an example for RayService high availability (#1566, @kevin85421)
  • [Feature] Add a flag to make zero downtime upgrades optional (#1564, @kevin85421)
  • [Bug][RayService] KubeRay does not recreate Serve applications if a head Pod without GCS FT recovers from a failure. (#1420, @kevin85421)
  • [Bug] Fix the filename of text summarizer YAML (#1415, @kevin85421)
  • [serve] Change text ml yaml to use french in user config (#1403, @zcin)
  • [services] Add text ml rayservice yaml (#1402, @zcin)
  • [Bug] Fix flakiness of RayService e2e tests (#1385, @kevin85421)
  • Add RayService sample test (#1377, @Darren221)
  • [RayService] Revisit the conditions under which a RayService is considered unhealthy and the default threshold (#1293, @kevin85421)
  • [RayService][Observability] Add more loggings about networking issues (#1282, @kevin85421)

RayJob

RayCluster

  • [Enhancement] Remove unused variables in constant.go (#1474, @evalaiyc98)
  • [Enhancement] GPU RayCluster doesn't work on GKE Autopilot (#1470, @kevin85421)
  • [Refactor] Parameterize TestGetAndCheckServeStatus (#1450, @evalaiyc98)
  • [Feature] Make replicas optional for WorkerGroupSpec (#1443, @kevin85421)
  • use raycluster app's name as podgroup name key word (#1446, @lowang-bh)
  • [Refactor] Make port name variables consistent and meaningful (#1389, @evalaiyc98)
  • [Feature] Use image of Ray head container as the default Ray Autoscaler container (#1401, @kevin85421)
  • Update Autoscaler YAML for the Autoscaler tutorial (#1400, @kevin85421)
  • [Feature] Ray container must be the first application container (#1379, @kevin85421)
  • [release blocker][Feature] Only Autoscaler can make decisions to delete Pods (#1253, @kevin85421)
  • [release blocker][Autoscaler] Randomly delete Pods when scaling down the cluster (#1251, @kevin85421)

Helm charts

KubeRay API Server

Documentation


v0.6.0 release

26 Jul 22:35
9b21af9

Highlights

RayService

  • [RayService][Observability] Add more logging for RayService troubleshooting (#1230, @kevin85421)
  • [Bug] Long image pull time will trigger blue-green upgrade after the head is ready (#1231, @kevin85421)
  • [RayService] Stable Diffusion example (#1181, @kevin85421)
  • [RayService] Update docs to use multi-app (#1179, @zcin)
  • [RayService] Change runtime env for e2e autoscaling test (#1178, @zcin)
  • [RayService] Add e2e tests (#1167, @zcin)
  • [RayService][docs] Improve explanation for config file and in-place updates (#1229, @zcin)
  • [RayService][Doc] RayService troubleshooting handbook (#1221, @kevin85421)
  • [Doc] Improve RayService doc (#1235, @kevin85421)
  • [Doc] Improve FAQ page and RayService troubleshooting guide (#1225, @kevin85421)
  • [RayService] Add RayService alb ingress CR (#1169, @sihanwang41)
  • [RayService] Add support for multi-app config in yaml-string format (#1156, @zcin)
  • [rayservice] Add support for getting multi-app status (#1136, @zcin)
  • [Refactor] Remove Dashboard Agent service (#1207, @kevin85421)
  • [Bug] KubeRay operator fails to get serve deployment status due to 500 Internal Server Error (#1173, @kevin85421)
  • MobileNet example (#1175, @kevin85421)
  • [Bug] fix RayActorOptionSpec.items.spec.serveConfig.deployments.rayActorOptions.memory int32 data type (#1220, @kevin85421)

RayJob

  • [RayJob] Submit job using K8s job instead of checking Status and using DashboardHTTPClient (#1177, @architkulkarni)
  • [Doc] [RayJob] Add documentation for submitterPodTemplate (#1228, @architkulkarni)

Autoscaler

  • [release blocker][Feature] Only Autoscaler can make decisions to delete Pods (#1253, @kevin85421)
  • [release blocker][Autoscaler] Randomly delete Pods when scaling down the cluster (#1251, @kevin85421)

Helm

  • [Helm][RBAC] Introduce the option crNamespacedRbacEnable to enable or disable the creation of Role/RoleBinding for RayCluster preparation (#1162, @kevin85421)
  • [Bug] Allow zero replica for workers for Helm (#968, @ducviet00)
  • [Bug] KubeRay tries to create ClusterRoleBinding when singleNamespaceInstall and rbacEnable are set to true (#1190, @kevin85421)

KubeRay API Server

Documentation

CI

Others

v0.5.2 release

14 Jun 21:10
aeed3cd

Changelog for v0.5.2

Highlights

The KubeRay 0.5.2 patch release includes the following improvements.

  • Allow specifying the entire headService and serveService YAML spec (see the sketch after this list). Previously, only certain special fields such as labels and annotations were exposed to the user.
  • RayService stability improvements
    • RayService object’s Status is being updated due to frequent reconciliation (#1065, @kevin85421)
    • [RayService] Submit requests to the Dashboard after the head Pod is running and ready (#1074, @kevin85421)
    • Fix in HeadPod Service Generation logic which was causing frequent reconciliation (#1056, @msumitjain)
  • Allow watching multiple namespaces
    • [Feature] Watch CR in multiple namespaces with namespaced RBAC resources (#1106, @kevin85421)
  • Autoscaler stability improvements
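
As referenced in the first item above, a sketch of overriding the entire head Service spec (assuming a headService field under headGroupSpec; the Service name, annotation, and type are illustrative):

```yaml
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: raycluster-custom-svc
spec:
  headGroupSpec:
    headService:  # a full Service spec, not just labels and annotations
      metadata:
        name: custom-head-svc
        annotations:
          example.com/owner: ml-team  # illustrative annotation
      spec:
        type: LoadBalancer
    # rayStartParams and template elided for brevity
```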

Contributors

We'd like to thank the following contributors for their contributions to this release:

@ByronHsu, @Yicheng-Lu-llll, @anishasthana, @architkulkarni, @blublinsky, @chrisxstyles, @dirtyValera, @ecurtin, @jasoonn, @jjyao, @kevin85421, @kodwanis, @msumitjain, @oginskis, @psschwei, @scarlet25151, @sihanwang41, @tedhtchang, @varungup90, @xubo245

Features

Fixes

  • Fix CI (#1145, @kevin85421)
  • Fix config frequent update (#1014, @sihanwang41)
  • Fix for Sample YAML Config Test - 2.4.0 Failure due to 'suspend' Field (#1096, @Yicheng-Lu-llll)
  • Fix in HeadPod Service Generation logic which was causing frequent reconciliation (#1056, @msumitjain)
  • [Bug] Autoscaler doesn't support TLS (#1119, @chrisxstyles)
  • [Bug] Enable ResourceQuota by adding Resources for the health-check init container (#1043, @kevin85421)
  • [Bug] Fix null map handling in BuildServiceForHeadPod function (#1095, @architkulkarni)
  • [Bug] RayService restarts repeatedly with Autoscaler (#1037, @kevin85421)
  • [Bug] Service (Serve) changing port from 8000 to 9000 doesn't work (#1081, @kevin85421)
  • [Bug] autoscaler not working properly in rayjob (#1064, @Yicheng-Lu-llll)
  • [Bug] compatibility test for the nightly Ray image fails (#1055, @kevin85421)
  • [Bug] rayStartParams is required at this moment. (#1031, @kevin85421)
  • [Bug][Autoscaler] Operator does not remove workers (#1139, @kevin85421)
  • [Bug][Doc] fix the link error of operator document (#1046, @xubo245)
  • [Bug][GCS FT] Worker pods crash unexpectedly when gcs_server on head pod is killed (#1036, @kevin85421)
  • [Bug][breaking change] Unauthorized 401 error on fetching Ray Custom Resources from K8s API server (#1128, @kevin85421)
  • [Bug][k8s compatibility] k8s v1.20.7 ClusterIP svc do not updated under RayService (#1110, @kevin85421)
  • [Helm][ray-cluster] Fix parsing envFrom field in additionalWorkerGroups (#1039, @dirtyValera)

Documentation

v0.5.0 release

11 Apr 07:52
ee982a3

Highlights

The KubeRay 0.5.0 release includes the following improvements.

  • Interact with KubeRay via a Python client
  • Integrate KubeRay with Kubeflow to provide an interactive development environment
  • Integrate KubeRay with Ray TLS authentication
  • Improve the user experience for KubeRay on AWS EKS
  • Fix some Kubernetes networking issues
  • Fix some stability bugs in RayJob and RayService

Contributors

The following individuals contributed to KubeRay 0.5.0. This list is alphabetical and incomplete.

@akanso @alex-treebeard @architkulkarni @cadedaniel @cskornel-doordash @davidxia @DmitriGekhtman @ducviet00 @gvspraveen @harryge00 @jasoonn @Jeffwan @kevin85421 @psschwei @scarlet25151 @sihanwang41 @wilsonwang371 @Yicheng-Lu-llll

Python client (alpha) (New!)

Kubeflow (New!)

  • [Feature][Doc] Kubeflow integration (#937, @kevin85421)
  • [Feature] Ray restricted podsecuritystandards for enterprise security and Kubeflow integration (#750, @kevin85421)

TLS authentication (New!)

AWS EKS (New!)

Kubernetes networking (New!)

  • Read cluster domain from resolv.conf or env (#951, @harryge00)
  • [Feature] Replace service name with Fully Qualified Domain Name (#938, @kevin85421)
  • [Feature] Add default init container in workers to wait for GCS to be ready (#973, @kevin85421)

Observability

RayCluster

RayJob (alpha)

  • [Hotfix][release blocker][RayJob] HTTP client from submitting jobs before dashboard initialization completes (#1000, @kevin85421)
  • [RayJob] Propagate error traceback string when GetJobInfo doesn't return valid JSON (#943, @architkulkarni)
  • [RayJob][Doc] Fix RayJob sample config. (#807, @DmitriGekhtman)

RayService (alpha)

Helm

CI

Sample YAML files
