Added self-hosted bcm install #1456

Open
wants to merge 18 commits into base `v2.19`
29 changes: 29 additions & 0 deletions docs/admin/runai-setup/self-hosted/bcm/files/metallb.txt
@@ -0,0 +1,29 @@

---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: l2-ingress
  namespace: metallb-system
spec:
  ipAddressPools:
    - ingress-pool
  nodeSelectors:
    - matchLabels:
        node-role.kubernetes.io/runai-system: "true"

---
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: ingress-pool
  namespace: metallb-system
spec:
  addresses:
    - 192.168.0.250-192.168.0.251 # example range of two IP addresses
  autoAssign: false
  serviceAllocation:
    priority: 50
    namespaces:
      - ingress-nginx
      - knative-serving
24 changes: 24 additions & 0 deletions docs/admin/runai-setup/self-hosted/bcm/files/networkoperator.txt
@@ -0,0 +1,24 @@

deployCR: true
nfd:
  enabled: true
ofedDriver:
  deploy: false
psp:
  enabled: false
rdmaSharedDevicePlugin:
  deploy: false
secondaryNetwork:
  cniPlugins:
    deploy: true
  deploy: true
  ipamPlugin:
    deploy: false
  multus:
    deploy: true
  nvIpam:
    deploy: true
  sriovDevicePlugin:
    deploy: false
sriovNetworkOperator:
  enabled: true
80 changes: 80 additions & 0 deletions docs/admin/runai-setup/self-hosted/bcm/install-cluster.md
@@ -0,0 +1,80 @@
# Install the Cluster


## System and Network Requirements
Before installing the NVIDIA Run:ai cluster, validate that the [system requirements](./system-requirements.md) and [network requirements](./network-requirements.md) are met. Make sure you have the [software artifacts](./preparations.md) prepared.

Once all the requirements are met, it is highly recommended to use the NVIDIA Run:ai cluster preinstall diagnostics tool to:

* Test the requirements above, as well as common failure points related to Kubernetes, NVIDIA, storage, and networking
* Inspect additional installed components and analyze their relevance to a successful installation

For more information, see [preinstall diagnostics](https://github.com/run-ai/preinstall-diagnostics). To run the preinstall diagnostics tool, [download](https://runai.jfrog.io/ui/native/pd-cli-prod/preinstall-diagnostics-cli/) the latest version and run the following. The `--image-pull-secret` and `--image` flags are needed only if the diagnostics image is hosted in a private registry:

```bash
chmod +x ./preinstall-diagnostics-<platform> && \
./preinstall-diagnostics-<platform> \
  --domain ${CONTROL_PLANE_FQDN} \
  --cluster-domain ${CLUSTER_FQDN} \
  --image-pull-secret ${IMAGE_PULL_SECRET_NAME} \
  --image ${PRIVATE_REGISTRY_IMAGE_URL}
```

## Helm

NVIDIA Run:ai requires [Helm](https://helm.sh/) 3.14 or later. To install Helm, see [Installing Helm](https://helm.sh/docs/intro/install/).
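To confirm the installed Helm client meets this requirement, you can check its version:

```bash
# Prints the Helm client version, e.g. v3.14.x
helm version --short
```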

## Permissions

A Kubernetes user with the `cluster-admin` role is required to ensure a successful installation. For more information, see [Using RBAC authorization](https://kubernetes.io/docs/reference/access-authn-authz/rbac/).
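As a quick sanity check, you can verify that the current user has unrestricted access (a sketch using `kubectl auth can-i`; expected output is `yes`):

```bash
# Checks whether the current user can perform any verb on any resource cluster-wide
kubectl auth can-i '*' '*' --all-namespaces
```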

## Installation

Follow the steps below to add a new cluster.

!!! Note
    When adding a cluster for the first time, the New Cluster form automatically opens when you log in to the NVIDIA Run:ai platform. Other actions are prevented until the cluster is created.

If this is your first cluster and you have completed the New Cluster form, start at step 3. Otherwise, start at step 1.

1. In the NVIDIA Run:ai platform, go to **Resources**
2. Click **+NEW CLUSTER**
3. Enter a unique name for your cluster
4. Choose the NVIDIA Run:ai cluster version (latest by default)
5. Select **Same as control plane**
6. Click **Continue**

**Installing NVIDIA Run:ai Cluster**

The following steps install the NVIDIA Run:ai cluster.

1. Follow the installation instructions and run the commands provided on your Kubernetes cluster
2. Append `--set global.customCA.enabled=true` to the Helm installation command, as illustrated in the sketch below
3. Click **DONE**
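For illustration only, the resulting command typically resembles the following sketch. Copy the real command from the platform UI; the chart reference, version, and generated `--set` values here are placeholders:

```bash
# Illustrative sketch — the actual command is provided by the platform UI.
helm upgrade -i runai-cluster <REPO/CHART> -n runai --create-namespace \
  --version "<VERSION>" \
  <GENERATED --set FLAGS FROM THE UI> \
  --set global.customCA.enabled=true
```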

The cluster is displayed in the table with the status **Waiting to connect**. Once installation is complete, the cluster status changes to **Connected**.

!!! Tip
    Use the `--dry-run` flag to gain an understanding of what is being installed before the actual installation. For more details, see [Understanding cluster access roles](https://docs.run.ai/v2.19/admin/config/access-roles/).


!!! Note
    To customize the installation based on your environment, see [Customize cluster installation](../../cluster-setup/customize-cluster-install.md).

## Troubleshooting

If you encounter an issue with the installation, try the troubleshooting scenarios below.

### Installation

If the NVIDIA Run:ai cluster installation failed, check the installation logs to identify the issue. Run the following to fetch and execute the log-collection script:

```bash
curl -fsSL https://raw.githubusercontent.com/run-ai/public/main/installation/get-installation-logs.sh | bash
```

### Cluster Status

If the NVIDIA Run:ai cluster installation completed but the cluster status did not change to **Connected**, check the cluster [troubleshooting scenarios](../../troubleshooting/troubleshooting.md#cluster-health).

43 changes: 43 additions & 0 deletions docs/admin/runai-setup/self-hosted/bcm/install-control-plane.md
@@ -0,0 +1,43 @@
# Install the Control Plane

Installing the NVIDIA Run:ai control plane requires Internet connectivity.


## System and Network Requirements
Before installing the NVIDIA Run:ai control plane, validate that the [system requirements](./system-requirements.md) and [network requirements](./network-requirements.md) are met. Make sure you have the [software artifacts](./preparations.md) prepared.

## Permissions

As part of the installation, you will be required to install the NVIDIA Run:ai control plane [Helm chart](https://helm.sh/). Installing the Helm charts requires Kubernetes administrator permissions. You can review the exact objects created by the charts using the `--dry-run` flag on both Helm charts.
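For example, the control plane chart can be rendered without installing anything (a sketch reusing the chart reference from the installation step below):

```bash
# Renders the manifests so you can review the objects before installing
helm upgrade -i runai-backend -n runai-backend runai-backend/control-plane --dry-run
```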

## Installation

Run the following command. Replace `<DOMAIN>` with the Fully Qualified Domain Name (FQDN) obtained in the [system requirements](./system-requirements.md#fully-qualified-domain-name-fqdn).

```bash
helm upgrade -i runai-backend -n runai-backend runai-backend/control-plane \
  --version "<VERSION>" \
  --set global.customCA.enabled=true \
  --set global.domain=<DOMAIN>

# Expected output:
Release "runai-backend" does not exist. Installing it now.
NAME: runai-backend
LAST DEPLOYED: Mon Dec 30 17:30:19 2024
NAMESPACE: runai-backend
STATUS: deployed
REVISION: 1
```

!!! Note
    To install a specific version, add `--version <VERSION>` to the install command. You can find available versions by running `helm search repo -l runai-backend`.

## Connect to NVIDIA Run:ai User Interface

1. Open your browser and go to: `https://<DOMAIN>`.
2. Log in using the default credentials:

* User: `test@run.ai`
* Password: `Abcd!234`

You will be prompted to change the password.

64 changes: 64 additions & 0 deletions docs/admin/runai-setup/self-hosted/bcm/network-requirements.md
@@ -0,0 +1,64 @@
# Network requirements

The following network requirements apply to the installation and usage of the NVIDIA Run:ai components.

## Installation

### Inbound rules

| Name | Description | Source | Destination | Port |
| --------------------------- | ---------------- | ------- | -------------------------- | ---- |
| Installation via BCM | SSH Access | Installer Machine | NVIDIA Base Command Manager headnodes | 22 |

### Outbound rules
| Name | Description | Source | Destination | Port |
| --------------------------- | ---------------- | ------- | -------------------------- | ---- |
| Container Registry | Pull NVIDIA Run:ai images | All Kubernetes nodes | runai.jfrog.io | 443 |
| Helm repository | NVIDIA Run:ai Helm repository for installation | Installer machine | runai.jfrog.io | 443 |

The NVIDIA Run:ai installation has [software requirements](system-requirements.md) that require additional components to be installed on the cluster. This article includes optional, simple installation examples, which require the following cluster outbound ports to be open (a quick connectivity check is sketched after the table):

| Name | Description | Source | Destination | Port |
| -------------------------- | ------------------------------------------ | -------------------- | --------------- | ---- |
| Kubernetes Registry | Ingress Nginx image repository | All Kubernetes nodes | registry.k8s.io | 443 |
| Google Container Registry | GPU Operator and Knative image repository | All Kubernetes nodes | gcr.io | 443 |
| Red Hat Container Registry | Prometheus Operator image repository | All Kubernetes nodes | quay.io | 443 |
| Docker Hub Registry | Training Operator image repository | All Kubernetes nodes | docker.io | 443 |
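A minimal sketch for verifying that these endpoints are reachable from a cluster node (assumes `curl` is available on the node):

```bash
# Probes each registry over HTTPS; any TLS/HTTP response means port 443 is open.
for host in runai.jfrog.io registry.k8s.io gcr.io quay.io docker.io; do
  curl -sS -o /dev/null --connect-timeout 5 "https://${host}" \
    && echo "${host}: reachable" || echo "${host}: NOT reachable"
done
```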



## External access

Set out below are the domains to whitelist and ports to open for installation, upgrade, and usage of the application and its management.


!!! Note
    Ensure the inbound and outbound rules are correctly applied to your firewall.

### Inbound rules

To allow your organization’s NVIDIA Run:ai users to interact with the cluster using the [NVIDIA Run:ai command-line interface](../../reference/cli/runai/) or access specific UI features, certain inbound ports need to be open:

| Name | Description | Source | Destination | Port |
| --------------------------- | ---------------- | --------------------------------------------------------------------- | -------------------------- | ---- |
| NVIDIA Run:ai control plane | HTTPS entrypoint | 0.0.0.0 | NVIDIA Run:ai system nodes | 443 |
| NVIDIA Run:ai cluster | HTTPS entrypoint | RFC1918 private IP ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16) | NVIDIA Run:ai system nodes | 443 |


### Outbound rules

!!! Note
    Outbound rules apply to the NVIDIA Run:ai cluster component only. If the NVIDIA Run:ai cluster is installed together with the NVIDIA Run:ai control plane, the NVIDIA Run:ai cluster FQDN refers to the NVIDIA Run:ai control plane FQDN.

For the NVIDIA Run:ai cluster installation and usage, certain **outbound** ports must be open:

| Name | Description | Source | Destination | Port |
| ------------------ | -------------------------------------------------------------------------------- | -------------------------- | -------------------------------- | ---- |
| Cluster sync | Sync NVIDIA Run:ai cluster with NVIDIA Run:ai control plane | NVIDIA Run:ai system nodes | NVIDIA Run:ai control plane FQDN | 443 |
| Metric store | Push NVIDIA Run:ai cluster metrics to NVIDIA Run:ai control plane's metric store | NVIDIA Run:ai system nodes | NVIDIA Run:ai control plane FQDN | 443 |

## Internal network

Ensure that all Kubernetes nodes can communicate with each other across all necessary ports. Kubernetes assumes full interconnectivity between nodes, so you must configure your network to allow this seamless communication. Specific port requirements may vary depending on your network setup.
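As a quick sanity check for a specific port between two nodes (a sketch assuming `nc` is installed; flag syntax varies slightly between `nc` variants), run a listener on one node and probe it from another:

```bash
# On node A: listen on an arbitrary test port
nc -l 30000

# On node B: probe node A (replace <NODE_A_IP>); -z scans without sending data
nc -zv <NODE_A_IP> 30000
```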
11 changes: 11 additions & 0 deletions docs/admin/runai-setup/self-hosted/bcm/next-steps.md
@@ -0,0 +1,11 @@
# Next Steps

## Restrict System Node Scheduling (Post-Installation)

After installation, you can configure NVIDIA Run:ai to enforce stricter scheduling rules that ensure system components and workloads are assigned to the correct nodes. The following flags are set using the `runaiconfig` custom resource; a combined patch is sketched after the list. See [Advanced Cluster Configurations](../../../config/advanced-cluster-config.md) for more details.

1. Set `global.nodeAffinity.restrictRunaiSystem=true`. This ensures that NVIDIA Run:ai system components are scheduled only on nodes labeled as system nodes.

2. Set `global.nodeAffinity.restrictScheduling=true`. This prevents pure CPU workloads from being scheduled on GPU nodes.
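A minimal sketch, assuming the `runaiconfig` resource is named `runai` and lives in the `runai` namespace (verify both in your environment):

```bash
# Sets both node-affinity flags in a single merge patch
kubectl patch runaiconfig runai -n runai --type merge -p \
  '{"spec": {"global": {"nodeAffinity": {"restrictRunaiSystem": true, "restrictScheduling": true}}}}'
```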


13 changes: 13 additions & 0 deletions docs/admin/runai-setup/self-hosted/bcm/preparations.md
@@ -0,0 +1,13 @@
# Preparations

You should receive a token from NVIDIA Run:ai customer support. The following command uses the token to create a secret that grants access to the NVIDIA Run:ai container registry:

```bash
kubectl create secret docker-registry runai-reg-creds \
  --docker-server=https://runai.jfrog.io \
  --docker-username=self-hosted-image-puller-prod \
  --docker-password=<TOKEN> \
  --docker-email=<EMAIL> \
  --namespace=runai-backend
```
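The secret is created in the `runai-backend` namespace, which must exist first. If it does not, create it:

```bash
kubectl create namespace runai-backend
```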
