
Onboard vLLM upstream CI project to MOC OpenShift Clusters #779

Open · 1 of 3 tasks

dystewart opened this issue Oct 22, 2024 · 7 comments
@dystewart commented Oct 22, 2024

Motivation

The objective of the project is to install and test in the MOC the upstream CI pipeline that Anyscale has already deployed to Google (using free credits) and other places. We would like to be a continuing provider of the vLLM code builds from Berkeley as a first step to collaborating with them.

Completion Criteria

The definition of done is that this CI build kicks off automatically in production once a night after the project has been integrated and tested. Any errors or issues are reported both to a local engineer (MOC or Red Hat) and to the regular CI pipeline owners, using their existing methods, with revisions if needed to comply with MOC production rules. The project should run in production, but we can start building in a test cluster if necessary.
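As a sketch of what the nightly kickoff could look like on OpenShift, assuming the build ends up as a BuildConfig (the names, namespace, and schedule here are all hypothetical):

```yaml
# Hypothetical nightly trigger: a CronJob that starts an OpenShift build.
# The BuildConfig name, namespace, and service account are placeholders.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: vllm-nightly-build
  namespace: vllm-ci
spec:
  schedule: "0 4 * * *"   # once a night, 04:00 UTC
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: build-trigger   # needs RBAC to start builds
          containers:
            - name: start-build
              image: registry.redhat.io/openshift4/ose-cli:latest
              command: ["oc", "start-build", "vllm-nightly", "-n", "vllm-ci", "--follow"]
          restartPolicy: Never
```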

Description

Heidi has created this project in ColdFront and added Taj as manager. We haven't requested any resources yet, because we should be able to accommodate this project in one of the existing test clusters (nerc-ocp-test or rhoai-test) before moving into nerc-ocp-prod.

(Note that Red Hat already uses vLLM in RHOAI, we think, but that would be an older release. This project is for building whatever is the newest release, nightly.) Here's the project repo: https://github.com/vllm-project/vllm

They said that they can do this with any GPUs, so let's try the A100s first, since they seem to be relatively unused in production NERC. We can try V100s etc. later if we succeed with the A100s. Please note the resource usage during builds once you are running in production, so that we will be able to estimate the ongoing usage charges for the project.
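As a sketch, a build pod requesting a single A100 might look like the following; the nodeSelector value depends on how the cluster labels its A100 nodes, so treat every name here as an assumption:

```yaml
# Hypothetical GPU smoke-test pod. nvidia.com/gpu.product comes from the
# NVIDIA GPU Operator's feature discovery; the exact value is a guess.
apiVersion: v1
kind: Pod
metadata:
  name: a100-smoke-test
  namespace: vllm-ci
spec:
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB
  restartPolicy: Never
  containers:
    - name: cuda-check
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: "1"   # one A100 per build pod
```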

The RHOAI team is paying for resource usage on the MOC for this project (Jen is setting this up, but we can start by collecting the project usage with Heidi as the PI).

  • Install and verify the Buildkite operator in the test cluster
  • Install and verify the vLLM nightly build in RHOAI in the cluster
  • Monitor the resource usage in the project namespace(s) so that we will be able to estimate the ongoing usage charges for the project (see the spot-check sketch after this list)
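For the monitoring item, a quick spot check during builds is possible with the cluster metrics API; the namespace here is a placeholder:

```sh
# Point-in-time CPU and memory usage per container in the project namespace;
# sampled while builds run, this gives a rough basis for estimating charges.
oc adm top pods -n vllm-ci --containers
```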
dystewart self-assigned this Oct 22, 2024

@dystewart (Author) commented:

TO DO: Make the bullet points in the list above into dedicated smaller issues

@dystewart (Author) commented:

cc: @hpdempsey

@tssala23 commented:

The Buildkite operator is working on the test cluster in the buildkite namespace.
Since none of the resources are cluster-scoped, I have not made a PR to add the manifests to the config repo, but instead just applied them to that namespace on the test cluster. Here is a link to a repo containing the deployed manifests: https://github.com/tssala23/buildkite
Helm template, along with this repo, https://github.com/dtrifiro/buildkite-on-openshift, was used to obtain the manifests.
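For the record, rendering the manifests looked roughly like the following; the chart path and release name are assumptions, and only the --set config.org= flag is taken from the quirks noted below:

```sh
# Hypothetical render step: produce plain manifests from the agent chart
# and apply them to the buildkite namespace (no cluster-scoped resources).
helm template buildkite-agent ./buildkite-on-openshift/chart \
  --namespace buildkite \
  --set config.org=<org-slug> \
  > manifests.yaml
oc apply -n buildkite -f manifests.yaml
```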

A few quirks while deploying:

  • There's an organization name and an organization slug; the slug is what should be used for --set config.org= in the helm command.
  • Recreated the Secret resource using stringData, as values were not getting encoded properly (see the sketch after this list).
  • The agent never shows up in the web UI despite our being able to deploy to it.
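A minimal sketch of the stringData fix: the API server base64-encodes plain-text values itself, which avoids the double-encoding problem. The secret and key names here are assumptions:

```yaml
# Hypothetical agent-token Secret. With stringData the value is supplied in
# plain text and the cluster encodes it, unlike the base64 `data` field.
apiVersion: v1
kind: Secret
metadata:
  name: buildkite-agent-token
  namespace: buildkite
type: Opaque
stringData:
  token: <agent-token-from-buildkite>
```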

@joachimweyl (Contributor) commented:

@tssala23 or @dystewart, please provide an estimate for this issue.

tssala23 self-assigned this Oct 22, 2024
@schwesig (Member) commented Oct 24, 2024

/CC @schwesig
now on test
then on prod
buildkite namespace

@maxamillion commented Nov 4, 2024

@schwesig @dystewart there hasn't been an update in a couple weeks. What's the current status? Any blockers? Thanks!

@dystewart (Author) commented:

Hey all, sorry for the delay in updates. Here is our progress so far:

  1. We have installed a Buildkite agent.
  2. Had some discussion of how and where we would build and store container images (we went with BuildConfigs to actually build the images and output to ImageStreams; see the sketch after this list).
  3. We built the Dockerfile.cpu image successfully, though I did have to change the base image reference from docker.io to quay.io to sidestep pull-rate errors.
  4. I created a fork of vllm to make sure our BuildConfigs were triggered automatically on push events, which worked.
  5. We also attempted to build Dockerfile but ran into a gcc error (digging into this in the morning).
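A rough sketch of the BuildConfig shape described in points 2–4; the names, the webhook secret, and the fork URL are placeholders, not our actual config:

```yaml
# Hypothetical BuildConfig: builds Dockerfile.cpu from a vllm fork, pushes
# to an ImageStream, and rebuilds on GitHub push events via a webhook.
apiVersion: build.openshift.io/v1
kind: BuildConfig
metadata:
  name: vllm-cpu-nightly
  namespace: vllm-ci
spec:
  source:
    type: Git
    git:
      uri: https://github.com/<fork-owner>/vllm.git   # placeholder fork
  strategy:
    type: Docker
    dockerStrategy:
      dockerfilePath: Dockerfile.cpu
  output:
    to:
      kind: ImageStreamTag
      name: vllm-cpu:nightly
  triggers:
    - type: GitHub
      github:
        secretReference:
          name: github-webhook-secret   # placeholder webhook secret
```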

Next steps:

  1. Figure out what is causing the main Dockerfile build to fail.
  2. Create the test-pipeline.yaml as a sanity check on our config so far (a minimal sketch follows).
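For orientation, the kind of minimal pipeline test-pipeline.yaml could start from; the step and queue tag are placeholders, not vLLM's actual pipeline:

```yaml
# Hypothetical sanity-check pipeline: confirms the agent picks up jobs.
steps:
  - label: "sanity check"
    command: "echo agent is alive"
    agents:
      queue: default   # assumed queue tag on the OpenShift agent
```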
