Add steps for creating an Instant Cluster & running a PyTorch process #214

Merged 8 commits on Mar 31, 2025
194 changes: 147 additions & 47 deletions in `docs/instant-clusters/index.md`
---
title: Instant Clusters
sidebar_position: 1
description: Instant Clusters enable high-performance computing across multiple GPUs with high-speed networking capabilities.
---

Instant Clusters enable high-performance computing across multiple GPU [Pods](/pods/overview) with high-speed networking capabilities.

Instant Clusters provide:

- Fast local networking between Pods, with bandwidths from 100 Gbps to 3200 Gbps within a single data center.
- Static IP assignment for each Pod in the cluster.
- Automatic assignment of [environment variables](#environment-variables) for seamless coordination between Pods.

## Use cases for Instant Clusters

Instant Clusters provide powerful computing capabilities that benefit a wide range of applications:

### Deep learning & AI

- **Large Language Model training**: Distribute training of models across multiple GPUs for significantly faster convergence.
- **Federated Learning**: Train models across distributed systems while preserving data privacy and security.

### High-performance computing

- **Scientific simulations**: Use multi-GPU acceleration to run complex simulations for weather forecasting, molecular dynamics, and climate modeling.
- **Computational physics**: Solve large-scale physics problems requiring massive parallel computing power.
- **Fluid dynamics & engineering**: Perform fluid dynamics computations for use in aerospace, automotive, and energy sectors.

### Graphics computing & rendering

- **Large-scale rendering**: Generate high-fidelity images and animations for film, gaming, and visualization.
- **Real-time graphics processing**: Power complex visual effects and simulations requiring multiple GPUs.
- **Game development & testing**: Render game environments, test AI-driven behaviors, and generate procedural content.
- **Virtual reality & augmented reality**: Deliver real-time multi-view rendering for immersive AR/VR experiences.

### Large-scale data analytics

- **Big data processing**: Analyze large-scale datasets with distributed computing frameworks.
- **Social media analysis**: Detect real-time trends, analyze sentiment, and identify misinformation.

## Network interfaces

High-bandwidth interfaces (`eth1`, `eth2`, etc.) handle communication between Pods, while the management interface (`eth0`) manages external traffic. The [NCCL](https://developer.nvidia.com/nccl) environment variable `NCCL_SOCKET_IFNAME` uses all available interfaces by default. The `PRIMARY_ADDR` corresponds to `eth1` to enable launching and bootstrapping distributed processes.

Instant Clusters support up to 8 interfaces per Pod. Each interface (`eth1` - `eth8`) provides a private network connection for inter-node communication, made available to distributed backends such as NCCL or GLOO.
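
If you want to restrict NCCL to specific interfaces rather than letting it use all of them, you can set `NCCL_SOCKET_IFNAME` before initializing the process group. Here is a minimal sketch (the exact interface names depend on your Cluster):

```python
import os

# Pin NCCL to the first high-bandwidth interface; set this before
# calling dist.init_process_group(). NCCL also accepts a
# comma-separated list of interfaces.
os.environ["NCCL_SOCKET_IFNAME"] = "eth1"
# os.environ["NCCL_SOCKET_IFNAME"] = "eth1,eth2,eth3,eth4"
```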

## Environment variables

The following environment variables are available in all Pods:

| Environment Variable | Description |
| ------------------------------ | ------------------------------------------------------------ |
| `PRIMARY_ADDR` / `MASTER_ADDR` | The address of the primary Pod. |
| `PRIMARY_PORT` / `MASTER_PORT` | The port of the primary Pod (all ports are available). |
| `NODE_ADDR` | The static IP of this Pod within the cluster network. |
| `NODE_RANK` | The Cluster (i.e., global) rank assigned to this Pod (0 for the primary Pod). |
| `NUM_NODES` | The number of Pods in the Cluster. |
| `NUM_TRAINERS` | The number of GPUs per Pod. |
| `HOST_NODE_ADDR` | Defined as `PRIMARY_ADDR:PRIMARY_PORT` for convenience. |
| `WORLD_SIZE` | The total number of GPUs in the Cluster (`NUM_NODES` * `NUM_TRAINERS`). |

Each Pod receives a static IP (`NODE_ADDR`) on the overlay network. When a Cluster is deployed, the system designates one Pod as the primary node by setting the `PRIMARY_ADDR` and `PRIMARY_PORT` environment variables. This simplifies working with multiprocessing libraries that require a primary node.

The variables `MASTER_ADDR`/`PRIMARY_ADDR` and `MASTER_PORT`/`PRIMARY_PORT` are equivalent. The `MASTER_*` variables provide compatibility with tools that expect these legacy names.
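
As a quick sanity check, you can read these variables from any Pod. For example, this short Python snippet (ours, not part of the Cluster tooling) prints the rendezvous address and cluster size:

```python
import os

# Read the coordination variables that RunPod sets in every Pod.
primary_addr = os.environ["PRIMARY_ADDR"]       # same value as MASTER_ADDR
primary_port = os.environ["PRIMARY_PORT"]       # same value as MASTER_PORT
node_rank = int(os.environ["NODE_RANK"])        # 0 on the primary Pod
num_nodes = int(os.environ["NUM_NODES"])
num_trainers = int(os.environ["NUM_TRAINERS"])  # GPUs per Pod

print(f"Node {node_rank} of {num_nodes}: rendezvous at {primary_addr}:{primary_port}, "
      f"{num_nodes * num_trainers} GPUs in total")
```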

## Deploy your first Instant Cluster

Follow these steps to deploy an Instant Cluster and run a multi-node process using PyTorch.

:::note

All accounts have a default spending limit. To deploy a larger cluster, submit a support ticket at [email protected]

:::

### Step 1: Deploy an Instant Cluster using the web interface

1. Open the [Instant Clusters page](https://www.runpod.io/console/cluster) on the RunPod web interface.
2. Click **Create Cluster**.
3. Use the UI to name and configure your Cluster. For this walkthrough, keep **Pod Count** at **2** and select the option for **16x H100 SXM** GPUs. Keep the **Pod Template** at its default setting (RunPod PyTorch).
4. Click **Deploy Cluster**. You should be redirected to the Instant Clusters page after a few seconds.

### Step 2: Clone the PyTorch demo into each Pod

1. Click your Cluster to expand the list of Pods.
2. Click your first Pod, for example `CLUSTERNAME-pod-0`, to expand the Pod.
3. Click **Connect**, then click **web terminal**.
4. Run the following command to clone a repository containing a basic `main.py` file into the Pod's main directory:

   ```bash
   git clone https://github.com/murat-runpod/torch-demo.git
   ```

Repeat these steps for **each Pod** in your cluster.

### Step 3: Examine the main.py file

Let's look at the code in our `main.py` file:

```python
import os
import torch
import torch.distributed as dist

def init_distributed():
    """Initialize the distributed training environment"""
    # Initialize the process group
    dist.init_process_group(backend="nccl")

    # Get local rank and global rank
    local_rank = int(os.environ["LOCAL_RANK"])
    global_rank = dist.get_rank()
    world_size = dist.get_world_size()

    # Set device for this process
    device = torch.device(f"cuda:{local_rank}")
    torch.cuda.set_device(device)

    return local_rank, global_rank, world_size, device

def cleanup_distributed():
    """Clean up the distributed environment"""
    dist.destroy_process_group()

def main():
    # Initialize distributed environment
    local_rank, global_rank, world_size, device = init_distributed()

    print(f"Running on rank {global_rank}/{world_size-1} (local rank: {local_rank}), device: {device}")

    # Your code here

    # Clean up distributed environment when done
    cleanup_distributed()

if __name__ == "__main__":
    main()
```

This is the minimal code necessary for initializing a distributed environment. The `main()` function prints the local and global rank for each GPU process (this is also where you can add your own code). `LOCAL_RANK` is assigned dynamically to each process by PyTorch. All other environment variables are set automatically by RunPod during deployment.
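
If you want to verify that the Pods can communicate over NCCL before adding real training logic, one option (a suggestion of ours, not part of the demo repo) is to drop a small all-reduce check into the `# Your code here` section of `main()`:

```python
    # Inside main(), after init_distributed(): sum one tensor per GPU
    # across the whole cluster as a connectivity check.
    tensor = torch.ones(1, device=device)
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
    # Every rank should report the total number of GPUs in the cluster.
    print(f"Rank {global_rank}: all_reduce sum = {tensor.item()} (expected {world_size})")
```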

### Step 4: Start the PyTorch process on each Pod

Run the following command in the web terminal of **each Pod**:

```bash
export NCCL_DEBUG=WARN
torchrun \
  --nproc_per_node=$NUM_TRAINERS \
  --nnodes=$NUM_NODES \
  --node_rank=$NODE_RANK \
  --master_addr=$MASTER_ADDR \
  --master_port=$MASTER_PORT \
  torch-demo/main.py
```

This command launches eight `main.py` processes per node (one per GPU in the Pod).

After running the command on the last Pod, you should see output similar to this:

```bash
Running on rank 8/15 (local rank: 0), device: cuda:0
Running on rank 15/15 (local rank: 7), device: cuda:7
Running on rank 9/15 (local rank: 1), device: cuda:1
Running on rank 12/15 (local rank: 4), device: cuda:4
Running on rank 13/15 (local rank: 5), device: cuda:5
Running on rank 11/15 (local rank: 3), device: cuda:3
Running on rank 14/15 (local rank: 6), device: cuda:6
Running on rank 10/15 (local rank: 2), device: cuda:2
```

The first number refers to the global rank of the process, spanning from `0` to `WORLD_SIZE-1` (`WORLD_SIZE` = the total number of GPUs in the Cluster). In our example there are two Pods of eight GPUs each, so the global rank spans from 0 to 15. The second number is the local rank, which defines the order of GPUs within a single Pod (0 to 7 in this example).

The specific number and order of ranks may be different in your terminal, and the global ranks listed will be different for each Pod.
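
Put differently, with this launch configuration each process's global rank follows directly from its Pod and local GPU index:

```python
# global_rank = NODE_RANK * NUM_TRAINERS + local_rank
# For example, local rank 2 on the second Pod (NODE_RANK = 1, 8 GPUs per Pod):
print(1 * 8 + 2)  # 10, matching "rank 10/15 (local rank: 2)" in the output above
```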

The following diagram illustrates how local and global ranks are distributed across multiple Pods:

<img src="/img/docs/instant-clusters-rank-diagram.png" alt="Instant Cluster rank diagram" width="750"/>

### Step 5: Clean up

If you no longer need your Cluster, make sure you return to the [Instant Clusters page](https://www.runpod.io/console/cluster) and delete your Cluster to avoid incurring extra charges.

:::note

You can monitor your cluster usage and spending using the **Billing Explorer** at the bottom of the [Billing page](https://www.runpod.io/console/user/billing), under the **Cluster** tab.

:::