
Onboard vLLM upstream CI project to MOC OpenShift Clusters #779

Open · 1 of 3 tasks

dystewart opened this issue Oct 22, 2024 · 7 comments
@dystewart commented Oct 22, 2024

Motivation

The objective of the project is to install and test in the MOC the upstream CI pipeline that Anyscale has already deployed to Google (using free credits) and other places. We would like to be a continuing provider of the vLLM code builds from Berkeley as a first step to collaborating with them.

Completion Criteria

The definition of done is that this CI build kicks off automatically in production once a night after the project has been integrated and tested. Any errors or issues are reported both to a local engineer (MOC or Red Hat) and to the regular CI pipeline owners, using their existing methods, with revisions if needed to comply with MOC production rules. The project should run in production, but we can start building in a test cluster if necessary.
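As a sketch of what the nightly kickoff could look like on OpenShift, assuming the build ends up as a BuildConfig (the names, namespace, and schedule here are all hypothetical):

```yaml
# Hypothetical nightly trigger: a CronJob that starts an OpenShift build.
# The BuildConfig name, namespace, and service account are placeholders.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: vllm-nightly-build
  namespace: vllm-ci
spec:
  schedule: "0 4 * * *"   # once a night, 04:00 UTC
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: build-trigger   # needs RBAC to start builds
          containers:
            - name: start-build
              image: registry.redhat.io/openshift4/ose-cli:latest
              command: ["oc", "start-build", "vllm-nightly", "-n", "vllm-ci", "--follow"]
          restartPolicy: Never
```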

Description

Heidi has created this project in ColdFront and added Taj as manager. We haven't requested any resources yet, because we should be able to accommodate this project in one of the existing test clusters (nerc-ocp-test or rhoai-test) before moving into nerc-ocp-prod.

(Note that Red Hat already uses vLLM in RHOAI, we think, but that would be an older release. This project is for building whatever is the newest release, nightly.) Here's the project repo: https://github.com/vllm-project/vllm

They said that they can do this with any GPUs, so let's try the A100s first, since they seem to be relatively unused in production NERC. We can try V100s etc. later if we succeed with the A100s. Please note the resource usage during builds once you are running in production, so that we will be able to estimate the ongoing usage charges for the project.
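As a sketch, a build pod requesting a single A100 might look like the following; the nodeSelector value depends on how the cluster labels its A100 nodes, so treat every name here as an assumption:

```yaml
# Hypothetical GPU smoke-test pod. nvidia.com/gpu.product comes from the
# NVIDIA GPU Operator's feature discovery; the exact value is a guess.
apiVersion: v1
kind: Pod
metadata:
  name: a100-smoke-test
  namespace: vllm-ci
spec:
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB
  restartPolicy: Never
  containers:
    - name: cuda-check
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: "1"   # one A100 per build pod
```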

The RHOAI team is paying for resource usage on the MOC for this project (Jen is setting this up, but we can start by collecting the project usage with Heidi as the PI).

  • Install and verify the Buildkite operator in the test cluster
  • Install and verify the vLLM nightly build in RHOAI in the cluster
  • Monitor the resource usage in the project namespace(s) so that we will be able to estimate the ongoing usage charges for the project (see the spot-check sketch after this list)
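For the monitoring item, a quick spot check during builds is possible with the cluster metrics API; the namespace here is a placeholder:

```sh
# Point-in-time CPU and memory usage per container in the project namespace;
# sampled while builds run, this gives a rough basis for estimating charges.
oc adm top pods -n vllm-ci --containers
```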
dystewart self-assigned this Oct 22, 2024

@dystewart (Author) commented:

TO DO: Make the bullet points in the list above into dedicated smaller issues

@dystewart (Author) commented:

cc: @hpdempsey

@tssala23 commented:

The Buildkite operator is working on the test cluster in the buildkite namespace.
Since none of the resources are cluster-scoped, I have not made a PR to add the manifests to the config repo, but instead just applied them to that namespace on the test cluster. Here is a link to a repo containing the deployed manifests: https://github.com/tssala23/buildkite
Helm template, along with this repo, https://github.com/dtrifiro/buildkite-on-openshift, was used to obtain the manifests.
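For the record, rendering the manifests looked roughly like the following; the chart path and release name are assumptions, and only the --set config.org= flag is taken from the quirks noted below:

```sh
# Hypothetical render step: produce plain manifests from the agent chart
# and apply them to the buildkite namespace (no cluster-scoped resources).
helm template buildkite-agent ./buildkite-on-openshift/chart \
  --namespace buildkite \
  --set config.org=<org-slug> \
  > manifests.yaml
oc apply -n buildkite -f manifests.yaml
```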

A few quirks while deploying:

  • There's an organization name and an organization slug; the slug is what should be used for --set config.org= in the helm command.
  • Recreated the Secret resource using stringData, as values were not getting encoded properly (see the sketch after this list).
  • The agent never shows up in the web UI despite our being able to deploy to it.
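A minimal sketch of the stringData fix: the API server base64-encodes plain-text values itself, which avoids the double-encoding problem. The secret and key names here are assumptions:

```yaml
# Hypothetical agent-token Secret. With stringData the value is supplied in
# plain text and the cluster encodes it, unlike the base64 `data` field.
apiVersion: v1
kind: Secret
metadata:
  name: buildkite-agent-token
  namespace: buildkite
type: Opaque
stringData:
  token: <agent-token-from-buildkite>
```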

@joachimweyl (Contributor) commented:

@tssala23 or @dystewart, please provide an estimate for this issue.

tssala23 self-assigned this Oct 22, 2024
@schwesig (Member) commented Oct 24, 2024

/CC @schwesig
now on test
then on prod
buildkite namespace

@maxamillion commented Nov 4, 2024

@schwesig @dystewart there hasn't been an update in a couple weeks. What's the current status? Any blockers? Thanks!

@dystewart (Author) commented:

Hey all, sorry for the delay in updates. Here is our progress so far:

  1. We have installed a Buildkite agent.
  2. Had some discussion of how and where we would build and store container images (we went with BuildConfigs to actually build the images and output to ImageStreams; see the sketch after this list).
  3. We built the Dockerfile.cpu image successfully, though I did have to change the base image reference from docker.io to quay.io to sidestep pull-rate errors.
  4. I created a fork of vllm to make sure our BuildConfigs were triggered automatically on push events, which worked.
  5. We also attempted to build Dockerfile but ran into a gcc error (digging into this in the morning).
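A rough sketch of the BuildConfig shape described in points 2–4; the names, the webhook secret, and the fork URL are placeholders, not our actual config:

```yaml
# Hypothetical BuildConfig: builds Dockerfile.cpu from a vllm fork, pushes
# to an ImageStream, and rebuilds on GitHub push events via a webhook.
apiVersion: build.openshift.io/v1
kind: BuildConfig
metadata:
  name: vllm-cpu-nightly
  namespace: vllm-ci
spec:
  source:
    type: Git
    git:
      uri: https://github.com/<fork-owner>/vllm.git   # placeholder fork
  strategy:
    type: Docker
    dockerStrategy:
      dockerfilePath: Dockerfile.cpu
  output:
    to:
      kind: ImageStreamTag
      name: vllm-cpu:nightly
  triggers:
    - type: GitHub
      github:
        secretReference:
          name: github-webhook-secret   # placeholder webhook secret
```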

Next steps:

  1. Figure out what is causing the main Dockerfile build to fail.
  2. Create the test-pipeline.yaml as a sanity check on our config so far (a minimal sketch follows).
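For orientation, the kind of minimal pipeline test-pipeline.yaml could start from; the step and queue tag are placeholders, not vLLM's actual pipeline:

```yaml
# Hypothetical sanity-check pipeline: confirms the agent picks up jobs.
steps:
  - label: "sanity check"
    command: "echo agent is alive"
    agents:
      queue: default   # assumed queue tag on the OpenShift agent
```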
