Onboard vLLM upstream CI project to MOC OpenShift Clusters #779
Comments
TO DO: Make the bullet points in the list above into dedicated smaller issues.
cc: @hpdempsey
The Buildkite operator is working on the test cluster. A few quirks came up while deploying.
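A quick way to confirm the operator's agents actually registered is to query the Buildkite REST API. This is only a minimal sketch, not part of the deployment itself; the org slug and the token environment variable are assumed placeholders, not values from this project:

```python
# Hypothetical smoke check: list the agents registered with Buildkite to
# confirm the operator's pods on the test cluster have connected.
import os
import requests

ORG = os.environ.get("BUILDKITE_ORG", "vllm")  # assumed org slug, not confirmed
TOKEN = os.environ["BUILDKITE_API_TOKEN"]      # token needs the read_agents scope

resp = requests.get(
    f"https://api.buildkite.com/v2/organizations/{ORG}/agents",
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
resp.raise_for_status()
for agent in resp.json():
    print(agent["name"], agent["connection_state"])
```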
@tssala23 or @dystewart please provide an estimate for this issue.
/CC @schwesig |
@schwesig @dystewart there hasn't been an update in a couple of weeks. What's the current status? Any blockers? Thanks!
Hey all, sorry for the delay in updates. Here is our progress so far:
Next steps:
Motivation
The objective of this project is to install and test in the MOC the upstream CI pipeline that Anyscale has already deployed to Google (using free credits) and other places. We would like to become a continuing provider of the vLLM code builds from Berkeley as a first step toward collaborating with them.
Completion Criteria
The definition of done is that this CI build kicks off automatically in production once a night after the project has been integrated and tested, and that any errors or issues are reported both to a local engineer (MOC or Red Hat) and to the regular CI pipeline owners, using their existing methods (or revisions of them if needed to comply with MOC production rules). The project should run in production, but we can start building in a test cluster if necessary.
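As a rough illustration of the nightly kickoff and failure alerting described above (not the pipeline owners' actual tooling), a script like the following could run from a scheduled job in the cluster; the org slug, pipeline slug, and alert webhook URL are all assumed placeholders:

```python
# Sketch of a nightly trigger: create a Buildkite build on the newest commit
# and ping a local engineer's webhook if the build fails to start.
import os
import requests

ORG = os.environ.get("BUILDKITE_ORG", "vllm")       # assumed org slug
PIPELINE = os.environ.get("PIPELINE_SLUG", "ci")    # assumed pipeline slug
TOKEN = os.environ["BUILDKITE_API_TOKEN"]
ALERT_URL = os.environ.get("ALERT_WEBHOOK_URL")     # e.g. a Slack incoming webhook

def trigger_nightly_build() -> None:
    resp = requests.post(
        f"https://api.buildkite.com/v2/organizations/{ORG}/pipelines/{PIPELINE}/builds",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={
            "commit": "HEAD",   # build whatever is the newest release
            "branch": "main",
            "message": "Nightly MOC build",
        },
        timeout=30,
    )
    if not resp.ok and ALERT_URL:
        # Notify the local engineer; the pipeline owners keep their own alerting.
        requests.post(
            ALERT_URL,
            json={"text": f"Nightly vLLM build failed to start: HTTP {resp.status_code}"},
            timeout=30,
        )
    resp.raise_for_status()

if __name__ == "__main__":
    trigger_nightly_build()
```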
Description
Heidi has created this project in ColdFront and added Taj as a manager. We haven't requested any resources yet, because we should be able to accommodate this project in one of the existing test clusters (nerc-ocp-test or rhoai-test) before moving into nerc-ocp-prod.
(Note that Red Hat already uses vLLM in RHOAI, we think, but that would be an older release; this project is for building whatever the newest release is, nightly.) Here's the project repo: https://github.com/vllm-project/vllm
They said that they can do this with any GPUs, so let's try the A100s first, since they seem to be relatively unused in production NERC. We can try V100s and others later if the A100s work out. Please note the resource usage during builds once you are running in production, so that we can estimate the ongoing usage charges for the project.
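One possible way to snapshot GPU usage during builds, sketched with the Kubernetes Python client; it assumes the NVIDIA GPU operator exposes the nvidia.com/gpu resource and that A100 nodes carry the GPU Feature Discovery product label, neither of which is confirmed for these clusters:

```python
# Sketch: sum the GPUs requested by running pods on the cluster's A100 nodes,
# to be sampled during nightly builds for cost estimation.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when run inside the cluster
v1 = client.CoreV1Api()

# Assumed GPU Feature Discovery label/value; adjust to the actual node labels.
a100_nodes = {
    node.metadata.name
    for node in v1.list_node(
        label_selector="nvidia.com/gpu.product=NVIDIA-A100-SXM4-40GB"
    ).items
}

requested = 0
for pod in v1.list_pod_for_all_namespaces(field_selector="status.phase=Running").items:
    if pod.spec.node_name in a100_nodes:
        for c in pod.spec.containers:
            reqs = (c.resources and c.resources.requests) or {}
            requested += int(reqs.get("nvidia.com/gpu", 0))

print(f"A100 GPUs currently requested: {requested}")
```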
The RHOAI team is paying for resource usage on the MOC for this project (Jen is setting this up, but we can start by collecting project usage with Heidi as the PI).