feat(RHOAIENG-26487): Cluster lifecycling via RayJob #873

@chipspeak chipspeak commented Jul 31, 2025

Issue link

Jira

What changes have been made

Support has been added for submitting a RayJob that creates its own cluster and manages that cluster's lifecycle.

Verification steps

Prerequisites

  • Build a CodeFlare SDK wheel (.whl) from this branch by running poetry build.
  • Disable Kueue in your RHOAI cluster.
  • Log in to OpenShift on your local machine via oc login.
  • Create a local Jupyter notebook before pasting the cells below.
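
Since the wheel filename depends on the version Poetry builds, here is a minimal sketch for locating the newest wheel instead of hard-coding its path. It assumes Poetry's default dist/ output directory and a filename starting with codeflare_sdk-; the helper name find_latest_wheel is hypothetical and not part of the SDK.

```python
from pathlib import Path
from typing import Optional


def find_latest_wheel(
    dist_dir: str = "dist", prefix: str = "codeflare_sdk-"
) -> Optional[Path]:
    """Return the most recently modified matching wheel, or None if there is none.

    Assumptions: wheels live in `dist_dir` (Poetry's default output directory)
    and their filenames start with `prefix`.
    """
    directory = Path(dist_dir)
    if not directory.is_dir():
        return None
    # Sort by modification time so the last element is the newest build.
    wheels = sorted(
        directory.glob(f"{prefix}*.whl"),
        key=lambda path: path.stat().st_mtime,
    )
    return wheels[-1] if wheels else None


if __name__ == "__main__":
    wheel = find_latest_wheel()
    if wheel is None:
        print("No wheel found; run `poetry build` first.")
    else:
        # Print the notebook cell to paste, using the wheel that was found.
        print(f"%pip install {wheel} --force-reinstall")
```

Run it from the repository root after poetry build and paste the printed line into a notebook cell.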

Steps

  1. Run poetry build.
  2. Copy the below as a cell into the jupyter notebook and execute it:
# This assumes the wheel is in dist/ under the current directory. Adjust the path as needed.
%pip install dist/codeflare_sdk-0.0.0.dev0-py3-none-any.whl --force-reinstall
  3. Once the install is complete, restart your notebook's kernel.
  4. Next, copy the below into a cell and execute it (ensuring the names are within the 63 character limit):
from codeflare_sdk import RayJob, RayJobClusterConfig

# Create cluster configuration for auto-creation
cluster_config = RayJobClusterConfig(
    head_cpu_requests='1',
    head_cpu_limits='2',
    head_memory_requests=4,
    head_memory_limits=5,
    # head_accelerators={'nvidia.com/gpu':0}, # not needed anymore if no GPUs - may change with kueue stuff in future
    # worker_accelerators={'nvidia.com/gpu':0},
    num_workers=1,
    worker_cpu_requests='1',
    worker_cpu_limits='2',
    worker_memory_requests=3,
    worker_memory_limits=4,
)

# Create RayJob with embedded cluster - will auto-create and manage cluster lifecycle
job = RayJob(
    job_name="test-lifecycle",
    cluster_config=cluster_config,  # This triggers auto-cluster creation
    namespace="rhods-notebooks",
    entrypoint="python -c 'import time; print(\"Job starting...\"); time.sleep(15); print(\"Job completed!\")'",
    shutdown_after_job_finishes=True,  # Auto-cleanup cluster after job finishes
    ttl_seconds_after_finished=30,     # Wait 30s after job completion before cleanup
)

job.submit()

# Use the `job` variable defined above (the original cell referenced an undefined `ray_job`)
print(f"RayJob '{job.name}' configured to create cluster '{job.cluster_name}'")
  5. Open your namespace and check the pods; you should observe both the job and the cluster being created. You can verify the job status by running:
job.status()
  6. Once the job has completed, you should see the cluster pods terminate after the 30 seconds set by ttl_seconds_after_finished on the RayJob.
  7. Check the logs of the RayJob; you should see the messages printed by the entrypoint ("Job starting..." and "Job completed!").
  8. Delete the RayJob CR manually (not implemented via SDK yet).
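
Two of the steps above are manual: keeping names within the 63 character limit, and deleting the RayJob CR afterwards. The sketch below illustrates both under stated assumptions. It assumes names follow the Kubernetes DNS label rules (lowercase alphanumerics and '-', at most 63 characters) and that the RayJob CRD is served as group ray.io, version v1, plural rayjobs; confirm both against your cluster. The helpers is_valid_name and delete_rayjob are hypothetical and not part of the CodeFlare SDK.

```python
import re

# Assumption: names are checked as DNS-1123 labels, the usual Kubernetes rule
# behind the 63 character limit mentioned in the steps.
MAX_NAME_LENGTH = 63
_DNS_LABEL = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")


def is_valid_name(name: str) -> bool:
    """Return True if `name` is at most 63 characters and a valid DNS label."""
    return len(name) <= MAX_NAME_LENGTH and _DNS_LABEL.fullmatch(name) is not None


def delete_rayjob(api, name: str, namespace: str) -> None:
    """Delete a RayJob custom resource.

    `api` is expected to be a kubernetes.client.CustomObjectsApi instance.
    The group, version, and plural below are assumptions about the installed
    RayJob CRD; adjust them if your cluster serves a different version.
    """
    api.delete_namespaced_custom_object(
        group="ray.io",
        version="v1",
        namespace=namespace,
        plural="rayjobs",
        name=name,
    )


# Example usage against a real cluster (requires the `kubernetes` package and
# a kubeconfig, e.g. one written by `oc login`):
#
#   from kubernetes import client, config
#   config.load_kube_config()
#   delete_rayjob(client.CustomObjectsApi(), "test-lifecycle", "rhods-notebooks")
```

Passing the API object in as a parameter keeps the helper easy to exercise with a stub, without needing a live cluster.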

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • Testing is not required for this change

@openshift-ci openshift-ci bot requested review from dimakis and pawelpaszki July 31, 2025 16:26
@chipspeak
Contributor Author

Supporting screenshots below:
[Screenshot: 2025-07-31 at 17:13:42]
[Screenshot: 2025-07-31 at 17:14:04]
[Screenshot: 2025-07-31 at 17:15:03]

codecov bot commented Jul 31, 2025

Codecov Report

❌ Patch coverage is 96.87500% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 93.67%. Comparing base (e2fc98b) to head (6585567).
⚠️ Report is 2 commits behind head on ray-jobs-feature.

Files with missing lines                   Patch %   Lines
src/codeflare_sdk/ray/rayjobs/config.py    96.55%    5 Missing ⚠️
src/codeflare_sdk/ray/rayjobs/rayjob.py    96.07%    2 Missing ⚠️
Additional details and impacted files
@@                 Coverage Diff                  @@
##           ray-jobs-feature     #873      +/-   ##
====================================================
+ Coverage             93.06%   93.67%   +0.61%     
====================================================
  Files                    28       31       +3     
  Lines                  1513     1724     +211     
====================================================
+ Hits                   1408     1615     +207     
- Misses                  105      109       +4     

@chipspeak chipspeak force-pushed the RHOAIENG-26487 branch 2 times, most recently from 486a6bd to 0add890 Compare July 31, 2025 16:44
@chipspeak
Contributor Author

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 31, 2025
@chipspeak
Contributor Author

/retest

@laurafitzgerald
Contributor

laurafitzgerald commented Aug 1, 2025

I've verified that this change works as described.
I still need to do a code review, but if someone else has time to do that, please work away.

Contributor

@kryanbeane kryanbeane left a comment

Some requested changes for reference, @chipspeak. I also left a few reminders for myself for when I finish this PR off next week.

cc: @LilyLinh

@kryanbeane kryanbeane force-pushed the RHOAIENG-26487 branch 2 times, most recently from 0a32a85 to df435d8 Compare August 12, 2025 11:14
Contributor

@kryanbeane kryanbeane left a comment

FYI: Pat's changes were tested by me and @laurafitzgerald and reviewed by me; my changes were tested and reviewed by @chipspeak.

/lgtm
/approve

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 13, 2025
Contributor

openshift-ci bot commented Aug 13, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kryanbeane

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 13, 2025
@kryanbeane
Contributor

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 13, 2025
@openshift-merge-bot openshift-merge-bot bot merged commit 3228739 into project-codeflare:ray-jobs-feature Aug 13, 2025
10 checks passed