diff --git a/cloud/file-shares/configure-file-shares.mdx b/cloud/file-shares/configure-file-shares.mdx index 100f4059f..d1ddc2ea0 100644 --- a/cloud/file-shares/configure-file-shares.mdx +++ b/cloud/file-shares/configure-file-shares.mdx @@ -15,7 +15,7 @@ Two types of file shares are available: - The best practice is to create VAST shares **when** creating [GPU clusters](/edge-ai/ai-infrastructure/create-an-ai-cluster) or **before** provisioning the corresponding [compute resources](/cloud/virtual-instances/types-of-virtual-machines) (such as VMs). + The best practice is to create VAST shares **when** creating [GPU clusters](/edge-ai/ai-infrastructure/create-a-bare-metal-gpu-cluster) or **before** provisioning the corresponding [compute resources](/cloud/virtual-instances/types-of-virtual-machines) (such as VMs). The creation flow for both types starts the same. Use the steps below and follow the instructions for the selected file share type. @@ -107,7 +107,7 @@ Replace `/mount/path` with the absolute local directory path where the file shar - The best practice is to create VAST shares **when** creating [GPU clusters](/edge-ai/ai-infrastructure/create-an-ai-cluster) or **before** provisioning the corresponding [compute resources](/cloud/virtual-instances/types-of-virtual-machines) (such as VMs). + The best practice is to create VAST shares **when** creating [GPU clusters](/edge-ai/ai-infrastructure/create-a-bare-metal-gpu-cluster) or **before** provisioning the corresponding [compute resources](/cloud/virtual-instances/types-of-virtual-machines) (such as VMs). ### Step 1. Create a VAST share @@ -146,12 +146,12 @@ The VAST network only becomes available after the file share has been created. I While the VAST interface can be attached to an already-provisioned GPU cluster or compute resource, this requires additional manual network configuration and is not the standard workflow. - The best practice is to create VAST shares **when** creating [GPU clusters](/edge-ai/ai-infrastructure/create-an-ai-cluster) or **before** provisioning the corresponding [compute resources](/cloud/virtual-instances/types-of-virtual-machines) (such as VMs). + The best practice is to create VAST shares **when** creating [GPU clusters](/edge-ai/ai-infrastructure/create-a-bare-metal-gpu-cluster) or **before** provisioning the corresponding [compute resources](/cloud/virtual-instances/types-of-virtual-machines) (such as VMs). **Attach VAST network interface** -1. Go to server **Resource** settings ([VM](/cloud/virtual-instances/create-an-instance), [Bare Metal](/cloud/bare-metal-servers/create-a-bare-metal-server), or [GPU cluster](/edge-ai/ai-infrastructure/create-an-ai-cluster)). +1. Go to server **Resource** settings ([VM](/cloud/virtual-instances/create-an-instance), [Bare Metal](/cloud/bare-metal-servers/create-a-bare-metal-server), or [GPU cluster](/edge-ai/ai-infrastructure/create-a-bare-metal-gpu-cluster)). 2. Select the **Networking** tab and click **Add interface**. 3. Click the **Network** drop-down and select the **VAST network**, then click **Add**. 4. 
Once the interface is added, **note the following details** for use in subsequent steps: diff --git a/docs.json b/docs.json index 7dfdd414e..0f081de12 100644 --- a/docs.json +++ b/docs.json @@ -642,8 +642,10 @@ { "group": "GPU cloud", "pages": [ - "edge-ai/ai-infrastructure/about-our-ai-infrastructure", - "edge-ai/ai-infrastructure/create-an-ai-cluster" + "edge-ai/ai-infrastructure/about-gpu-cloud", + "edge-ai/ai-infrastructure/create-a-bare-metal-gpu-cluster", + "edge-ai/ai-infrastructure/spot-bare-metal-gpu", + "edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster" ] }, { @@ -1623,15 +1625,19 @@ }, { "source": "/docs/cloud/ai-Infrustructure/about-our-ai-infrastructure", - "destination": "/docs/edge-ai/ai-infrastructure/about-our-ai-infrastructure" + "destination": "/docs/edge-ai/ai-infrastructure/about-gpu-cloud" }, { "source": "/docs/cloud/ai-Infrustructure/create-an-ai-cluster", - "destination": "/docs/edge-ai/ai-infrastructure/create-an-ai-cluster" + "destination": "/docs/edge-ai/ai-infrastructure/create-a-bare-metal-gpu-cluster" }, { "source": "/docs/cloud/ai-Infrustructure/about-virtual-vpod", - "destination": "/docs/edge-ai/ai-infrastructure/about-our-ai-infrastructure" + "destination": "/docs/edge-ai/ai-infrastructure/about-gpu-cloud" + }, + { + "source": "/docs/edge-ai/ai-infrastructure/about-our-ai-infrastructure", + "destination": "/docs/edge-ai/ai-infrastructure/about-gpu-cloud" }, { "source": "/docs/edge-ai/inference-at-the-edge/:slug*", diff --git a/edge-ai/ai-infrastructure/about-gpu-cloud.mdx b/edge-ai/ai-infrastructure/about-gpu-cloud.mdx new file mode 100644 index 000000000..aa3b33993 --- /dev/null +++ b/edge-ai/ai-infrastructure/about-gpu-cloud.mdx @@ -0,0 +1,92 @@ +--- +title: About GPU Cloud +sidebarTitle: About GPU Cloud +--- + +GPU Cloud provides dedicated compute infrastructure for machine learning workloads. Use GPU clusters to train models, run inference, and process large-scale AI tasks. + +## What is a GPU cluster + +A GPU cluster is a group of interconnected servers, each equipped with multiple high-performance GPUs. Clusters are designed for workloads that require massive parallel processing power, such as training large language models (LLMs), fine-tuning foundation models, running inference at scale, and high-performance computing (HPC) tasks. + + + GPU Cloud create cluster page showing region selection, cluster type, and GPU configuration options + + +All nodes in a cluster share the same configuration: operating system image, network settings, and storage mounts. This ensures consistent behavior across the cluster. + +## Cluster types + +Gcore offers two types of GPU clusters: + +| Type | Description | Best for | +|------|-------------|----------| +| **Bare Metal GPU** | Dedicated physical servers with guaranteed resources. No virtualization overhead | Production workloads, long-running training jobs, and latency-sensitive inference | +| **Spot Bare Metal GPU** | Same hardware as Bare Metal, but at a reduced price (up to 50% discount). Instances can be preempted with a 24-hour notice when capacity is needed | Fault-tolerant training with checkpointing, batch processing, development, and testing | + + +Spot instances are ideal for workloads that can handle interruptions. When a Spot cluster is reclaimed, you receive an email notification 24 hours before deletion. Use this time to save critical data to file shares or object storage. + + +Clusters can scale to hundreds of nodes. 
Production deployments with 250+ nodes in a single cluster are supported, limited only by regional stock availability. + +## Available configurations + +Select a configuration based on your workload requirements: + +| Configuration | GPUs | Interconnect | RAM | Storage | Use case | +|--------------|------|--------------|-----|---------|----------| +| H100 with InfiniBand | 8x NVIDIA H100 80GB | 3.2 Tbit/s InfiniBand | 2TB | 8x 3.84TB NVMe | Distributed LLM training requiring high-speed inter-node communication | +| H100 (bm3-ai-ndp) | 8x NVIDIA H100 80GB | 3.2 Tbit/s InfiniBand | 2TB | 6x 3.84TB NVMe | Distributed training and latency-sensitive inference at scale | +| A100 with InfiniBand | 8x NVIDIA A100 80GB | 800 Gbit/s InfiniBand | 2TB | 8x 3.84TB NVMe | Multi-node ML training and HPC workloads | +| A100 without InfiniBand | 8x NVIDIA A100 80GB | 2x 100 Gbit/s Ethernet | 2TB | 8x 3.84TB NVMe | Single-node training, inference for large models requiring more than 48GB VRAM | +| L40S | 8x NVIDIA L40S | 2x 25 Gbit/s Ethernet | 2TB | 4x 7.68TB NVMe | Inference, fine-tuning small to medium models requiring less than 48GB VRAM | + +Outbound data transfer (egress) from GPU clusters is free. For pricing details, see [GPU Cloud billing](/edge-ai/billing). + +## InfiniBand networking + +InfiniBand is a high-bandwidth, low-latency interconnect technology used for communication between nodes in a cluster. + +InfiniBand is configured automatically when you create a cluster. If the selected configuration includes InfiniBand network cards, all nodes are placed in the same InfiniBand domain with no manual setup required. + +H100 configurations typically have 8 InfiniBand ports per node, each creating a dedicated network interface. + +InfiniBand matters most for distributed training, where models that don't fit on a single node require frequent gradient synchronization between GPUs. The same applies to multi-node inference when large models are split across servers. In these cases, InfiniBand reduces communication overhead significantly compared to Ethernet. + +For single-node workloads or independent batch jobs that don't require node-to-node communication, InfiniBand provides no benefit. Standard Ethernet configurations work equally well and may be more cost-effective. + +## Storage options + +GPU clusters support two storage types: + +| Storage type | Persistence | Performance | Use case | +|-------------|-------------|-------------|----------| +| Local NVMe | Temporary (deleted with cluster) | Highest IOPS, lowest latency | Training data cache, checkpoints during training | +| File shares | Persistent (independent of cluster) | Network-attached, lower latency than object storage | Datasets, model weights, shared checkpoints | + +Learn more about [configuring file shares](/cloud/file-shares/configure-file-shares) for persistent storage and sharing data between nodes. + +## Cluster lifecycle + +``` +Create --> Configure --> Run workloads --> Resize (optional) --> Delete +``` + +1. **Create**: Select region, GPU type, number of nodes, image, and network settings when [creating a Bare Metal GPU cluster](/edge-ai/ai-infrastructure/create-a-bare-metal-gpu-cluster). + +2. **Configure**: Connect via SSH to each node, install required dependencies, and mount file shares to prepare the environment for workloads. + +3. **Run workloads**: Execute training jobs, run inference services, process data. + +4. **Resize**: Add or remove nodes based on demand. 
New nodes inherit the cluster configuration, which you can manage in the [Bare Metal GPU cluster details](/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster). + +5. **Delete**: Remove the cluster when no longer needed. Local storage is erased; file shares remain. + + + +GPU clusters may take 15–40 minutes to provision, and their configuration (image, network, and storage) is fixed at creation. Local NVMe storage is temporary, so critical data should be saved to persistent file shares. Spot clusters can be interrupted with a 24-hour notice, and cluster size is limited by available regional stock. + + +Hardware firewall support is available on servers equipped with BlueField network cards, enhancing network security for GPU clusters. + diff --git a/edge-ai/ai-infrastructure/about-our-ai-infrastructure.mdx b/edge-ai/ai-infrastructure/about-our-ai-infrastructure.mdx deleted file mode 100644 index 84a7254d0..000000000 --- a/edge-ai/ai-infrastructure/about-our-ai-infrastructure.mdx +++ /dev/null @@ -1,33 +0,0 @@ ---- -title: GPU cloud infrastructure -sidebarTitle: About GPU cloud ---- - -Gcore [GPU Cloud](https://gcore.com/cloud/ai-gpu) provides high-performance compute clusters designed for machine learning tasks. - -## AI GPU infrastructure - -Train your ML models with the latest [NVIDIA GPUs](https://www.nvidia.com/en-us/data-center/data-center-gpus/). We offer a wide range of Bare Metal servers and Virtual Machines powered by NVIDIA A100, H100, and L40S GPUs. - -Pick the configuration and reservation plan that best fits your computing requirements. - -| **Specification** | **Characteristics** | **Use case** | **Performance** | -|----------------------------|------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------| -| H100 with Infiniband | - 8x Nvidia H100 80GB
- 2 Intel Xeon 8480+
- 2TB RAM
- 2x 960GB
- 8x 3.84 TB NVMe
- 3.2 Tbit/s Infiniband
- 2x100Gbit/s Ethernet | Optimized for distributed training of Large Language Models. | Ultimate performance for compute-intensive tasks that require a significant exchange of data by the network. | -| bm3‑ai‑ndp‑1xlarge‑h100‑80‑8 | - 8× Nvidia H100 80 GB
- 2× Intel Xeon 8480+
- 2 TB RAM; 2× 1.92 TB NVMe SSD
- 6× 3.84 TB NVMe SSD
- 3.2 Tbit/s Infiniband
- 2× 25 Gbit/s Ethernet | Distributed training of large language models and latency‑sensitive inference at scale. | Peak throughput for high‑speed multi‑node workloads. | -| A100 with Infiniband | - 8x Nvidia A100 80GB
- 2 Intel Xeon 8468
- 2TB RAM
- 2x 960GB SSD
- 8x 3.84 TB NVMe
- 800Gbit/s Infiniband | Distributed training for ML models and a broad range of HPC workloads. | Well-balanced in performance and price. | -| A100 without Infiniband | - 8x Nvidia A100 80GB
- 2 Intel Xeon 8468
- 2TB RAM
- 2x 960GB SSD
- 8x 3.84 TB NVMe
- 2x100Gbit/s Ethernet | Training and fine-tuning of models on single nodes.

Inference for large models.
Multi-user HPC cluster. | The best solution for inference models that require more than 48GB vRAM. | -| L40 | - 8x Nvidia L40S
- 2x Intel Xeon 8468
- 2TB RAM
- 4x 7.68TB NVMe SSD
- 2x25Gbit/s Ethernet | Model inference.

Fine-tuning for small and medium-size models. | The best solution for inference models that require less than 48GB vRAM. | - - -Explore our competitive pricing on the [AI GPU Cloud infrastructure pricing page](https://gcore.com/cloud/ai-gpu). - -## Tools supported by GCore GPU cloud - -**Tool class** | **List of tools** | **Explanation** ----|---|--- -Framework | TensorFlow, Keras, PyTorch, Paddle Paddle, ONNX, Hugging Face | Your model is supposed to use one of these frameworks for correct work. -Data platforms | PostgreSQL, Hadoop, Spark, Vertika | You can set up a connection between our cluster and your data platforms of these types to make them work together. -Programming languages | JavaScript, R, Swift, Python | Your model is supposed to be written in one of these languages for correct work. -Resources for receiving and processing data | Storm, Spark, Kafka, PySpark, MS SQL, Oracle, MongoDB | You can set up a connection between our cluster and your resources of these types to make them work together. -Exploration and visualization tools | Seaborn, Matplotlib, TensorBoard | You can connect our cluster to these tools to visualize your model. \ No newline at end of file diff --git a/edge-ai/ai-infrastructure/create-a-bare-metal-gpu-cluster.mdx b/edge-ai/ai-infrastructure/create-a-bare-metal-gpu-cluster.mdx new file mode 100644 index 000000000..abc2d286e --- /dev/null +++ b/edge-ai/ai-infrastructure/create-a-bare-metal-gpu-cluster.mdx @@ -0,0 +1,238 @@ +--- +title: "Create a Bare Metal GPU cluster" +sidebarTitle: "Create a Bare Metal GPU cluster" +--- + +GPU clusters are high-performance computing resources designed for AI/ML workloads, inference, and large-scale data processing. Each cluster consists of one or more GPU servers connected via high-speed networking. + +GPU clusters come in two types: + +- **Bare Metal GPU**: Dedicated physical servers without virtualization, offering maximum performance and full hardware control. +- **Spot Bare Metal GPU**: Discounted servers suitable for batch processing, experiments, and testing. [Spot clusters](/edge-ai/ai-infrastructure/spot-bare-metal-gpu) provide the same hardware access as standard Bare Metal GPUs and may be reclaimed with 24 hours' notice. + +Cluster type and GPU model availability vary by region. The creation form displays only the options available in the selected region. + +## Cluster architecture + +Each cluster consists of one or more dedicated bare-metal GPU servers. When creating a multi-node cluster, all servers are placed in the same private network and share an identical configuration, including the image, network settings, and file shares. + +For flavors with InfiniBand cards, high-speed inter-node networking is configured automatically. No manual network configuration is required for distributed training. + +The platform provides the infrastructure layer: GPU servers, networking, storage options, and secure access. This allows installing and running preferred frameworks for distributed training, job scheduling, or container orchestration. + +For multi-node workloads, configure SSH trust between nodes to enable distributed training frameworks. File shares provide shared storage for datasets and checkpoints across all nodes. + +## Create a GPU cluster + +To create a Bare Metal GPU cluster, complete the following steps in the Gcore Customer Portal. + +1. In the [Gcore Customer Portal](https://portal.gcore.com), navigate to **GPU Cloud**. +2. In the sidebar, expand **GPU Clusters** and select **Bare Metal GPU Clusters**. 
+3. Click **Create Cluster**. + +### Step 1. Select region + +In the **Region** section, select the data center location for the cluster. + + + Region selection section showing available regions grouped by geography + + +Regions are grouped by geography (Asia-Pacific, EMEA). Each region card shows its availability status. Some features (such as file share integration or firewall settings) are available only in select regions. + + + GPU model availability and pricing vary by region. If a specific GPU model is required, check multiple regions for stock availability. + + +### Step 2. Configure cluster capacity + +Cluster capacity determines the hardware specifications for each node in the cluster. The available options depend on the selected region. + +1. In the **Cluster capacity** section, select the **GPU Cluster type**: + - **Bare Metal GPU** for dedicated physical servers + - **Spot Bare Metal GPU** for discounted, interruptible instances (available in select regions) + +2. Select the **GPU Model**. Available models (such as A100, H100, or H200) depend on the region. + +3. Enable or disable **Show out of stock** to filter available flavors. + +4. Select a flavor. Each flavor card displays GPU configuration, CPU type, RAM capacity, storage, network connectivity, pricing, and stock availability. + + + Cluster capacity section showing GPU Cluster type, GPU Model selector, and flavor card with specifications + + +### Step 3. Set the number of instances + +In the **Number of Instances** section, specify how many servers to provision in the cluster. + + + Number of Instances section with instance counter + + +Each instance is a separate physical server with the selected flavor configuration. For single-node workloads, one instance is sufficient. For distributed training, provision multiple instances. + +The maximum number of instances is limited by the current stock availability in the region. There is no fixed per-cluster limit—clusters can scale to hundreds of nodes if capacity is available. + + +After creation, the cluster can be resized. Scaling up adds nodes with the same configuration used at creation. Scaling down removes a random node—to delete a specific node, use the per-node delete action in the cluster details. Deleting the last node in a cluster deletes the entire cluster. + + +### Step 4. Select image + +The image defines the operating system and pre-installed software for cluster nodes. + + + Image section with Public and Custom tabs and image selector + + +1. In the **Image** section, choose the operating system: + - **Public**: Pre-configured images with NVIDIA drivers and CUDA toolkit (recommended) + - **Custom**: Custom images uploaded to the account + +The default Ubuntu images include pre-installed NVIDIA drivers and CUDA toolkit. Check the image name for specific driver version details. + +2. Note the default login credentials displayed below the image selector: username `ubuntu`, SSH port `22`. These credentials are used to connect to the cluster after creation. + +### Step 5. Configure file share integration + +File shares provide shared storage accessible from all cluster nodes simultaneously, allowing access to shared datasets, checkpoints, and outputs even if a cluster is deleted. They use NFS with a minimum size of 100 GiB, and the creation form displays this option only in regions where file shares are available. 
Full configuration details, including manual mounting procedures, are described in the [file share documentation](/cloud/file-shares/configure-file-shares). + + + File share integration section with Enable File Share checkbox + + +To configure a file share: + +1. Enable the **File Share integration** checkbox. + +2. Select an existing file share, or create a new one by specifying its name, size (minimum 100 GiB), and optional settings such as Root squash or Slurm compatibility. + + + Create VAST File Share dialog with basic settings and additional options + + +3. Specify the mount path for the file share on cluster nodes (default: `/home/ubuntu/mnt/nfs`). Additional file shares can be attached by clicking **Add File Share**. + + +If **User data** is enabled in Additional options, mounting commands are automatically included in the user data script. Do not modify or delete these commands, as this breaks automatic mounting. + + +### Step 6. Configure network settings + +Network settings define how the cluster communicates with external services and other resources. At least one interface is required. + + + Network settings section showing interface configuration + + +1. In the **Network settings** section, configure the network interface: + +| Type | Access | Use case | +|------|--------|----------| +| **Public** | Direct internet access with dynamic public IP | Development, testing, quick access to cluster | +| **Private** | Internal network only, no external access | Production workloads, security-sensitive environments | +| **Dedicated public** | Reserved static public IP | Production APIs, services requiring stable endpoints | + + For multi-node clusters, a private interface keeps internal traffic separate from internet-facing traffic. Inter-node training communication uses the automatically configured InfiniBand network when available. + +To add or configure interfaces, expand the interface card and adjust settings as needed. Additional interfaces can be attached by clicking **Add Interface**. + +All public interfaces include Basic DDoS Protection at no additional cost. + +For detailed networking configuration, see [Create and manage a network.](/cloud/networking/create-and-manage-a-network) + +### Step 7. Configure firewall settings (conditional) + + +Firewall settings appear only in regions where the hardware supports this feature (servers with Bluefield network cards). If this section does not appear, proceed to the next step. + + +In the **Firewall settings** section, configure firewall rules to control inbound and outbound traffic. + + + Firewall settings section with firewall selector + + +Select an existing firewall from the dropdown or use the default. Additional firewalls can be attached if needed. + +For detailed firewall configuration, see [Create and configure firewalls.](/cloud/networking/add-and-configure-a-firewall) + +### Step 8. Configure SSH key + +In the **SSH key** section, select an existing key from the dropdown or create a new one. Keys can be uploaded or generated directly in the portal. If generating a new key pair, save the private key immediately as it cannot be retrieved later. + + + SSH key section with dropdown and options to add or generate keys + + +### Step 9. Set additional options + +The **Additional options** section provides optional settings: user data scripts for automated configuration and metadata tags for resource organization. + + + Additional options section with User data and Add tags checkboxes + + +### Step 10. 
Name and create the cluster + +The final step assigns a name to the cluster and initiates provisioning. + + + GPU Cluster Name section with name input field + + +1. In the **GPU Cluster Name** section, enter a name or use the auto-generated one. + +2. Review the **Estimated cost** panel on the right. + +3. Click **Create Cluster**. + +Once all instances reach **Power on** status, the cluster is ready for use. + + +Cluster-level settings (image, file share integration, default networks) cannot be changed after creation. New nodes added via scaling inherit the original configuration. To change these settings, create a new cluster. + + +## Connect to the cluster + +After the cluster is created, use SSH to access the nodes. The default username is `ubuntu`. + +```bash +ssh ubuntu@<server-ip> +``` + +Replace `<server-ip>` with the public or floating IP shown in the cluster details. + +For instances with only private interfaces, connect through a bastion host or VPN, or use the [Gcore Customer Portal console.](/cloud/virtual-instances/connect/connect-to-your-instance-via-control-panel) + +## Verify cluster status + +After connecting, verify that GPUs are available and drivers are loaded: + +```bash +nvidia-smi +``` + +The output displays all available GPUs, driver version, and CUDA version. If no GPUs appear, check that the image includes the correct NVIDIA drivers for the GPU model. + +If file share integration was enabled during cluster creation, verify the mount is accessible: + +```bash +ls /home/ubuntu/mnt/nfs +``` + +The directory should be empty initially. Files saved here are accessible from all nodes in the cluster. + +## Automating cluster management + +The Customer Portal is suitable for creating and managing individual clusters. For automated workflows—such as CI/CD pipelines, infrastructure-as-code, or batch provisioning—use the GPU Bare Metal API. + +The API allows: + +- Creating and deleting clusters programmatically +- Scaling the number of instances in a cluster +- Querying available GPU flavors and regions +- Checking quota and capacity before provisioning + +For authentication, request formats, and code examples, see the [GPU Bare Metal API reference.](/api-reference/cloud/gpu-bare-metal/create-bare-metal-gpu-cluster) diff --git a/edge-ai/ai-infrastructure/create-an-ai-cluster.mdx b/edge-ai/ai-infrastructure/create-an-ai-cluster.mdx deleted file mode 100644 index ce3791960..000000000 --- a/edge-ai/ai-infrastructure/create-an-ai-cluster.mdx +++ /dev/null @@ -1,78 +0,0 @@ ---- -title: "Create an AI cluster" -sidebarTitle: "Create an AI Cluster" ---- - -Let's create a GPU-powered cluster for your AI workloads. - - 1. Open [the GPU Cloud page](https://cloud.gcore.com/gpu-cloud/) in the **Gcore Customer Portal**. You'll be taken to the **Create GPU Cluster** page. - - - ![Create an AI Cluster](/images/create-ai-cluster-1.png) - - 2. Select a region, which is a physical location of the data center. For example, if you choose Helsinki, your cluster will be deployed on servers in Helsinki. - 3. Select a **GPU Cluster type**: - - **Virtual GPU:** Uses virtual instances with dedicated GPUs. - - **Bare metal GPU:** Uses dedicated instances with dedicated GPUs. - - **Spot bare metal GPU:** Uses dedicated instances without guaranteed availability. - 4. Select a **GPU Model** and instance type. - 5. Select the **Number of Instances**. - 6. Select the OS [image](/cloud/images/about-images) on which your model will be running. - - - Select an OS image - - 7. 
(For bare metal clusters) Set up a [network interface](ps://gcore.com/docs/cloud/networking/create-and-manage-a-network). You can choose a public or private one: - - **Public**: Attach this interface if you plan to use the GPU Cloud with **servers hosted outside Gcore Cloud**. Your cluster will be accessible from external networks. - - **Private**: If you want to use the service **with Gcore servers only**, your cluster will be available only to internal networks. - - Select one of the existing networks or create a new one to attach to your server. - - - Network config - - 8. (Optional) Turn on the **Use floating IP** toggle if you want to use a floating IP address. It'll make your server accessible from outside networks even if they have only a private interface. Create a new IP address or choose an existing one. For more details, check out the article [Create and configure a floating IP address](/cloud/networking/ip-address/create-and-configure-a-floating-ip-address). - 9. (Optional) If you need several network interfaces, click **Add Interface** and repeat the instructions from Step 6. -10. Select one of your SSH keys from the list, add a new key, or generate a key pair. You'll use this SSH key to connect to your cluster. - - - Select SSH key - -11. (Optional) To add userdata to your cluster, enable the **User data** toggle and add cloud config information. - - - Add cloud config - -12. (Optional) To add metadata to your cluster, enable the **Add tags** toggle and add tags as key-value pairs. - - - Add metadata via tags - -13. Name your cluster and click **Create Cluster**. - -You've successfully created the cluster. To connect to your server, use the IP address of your AI Cluster and the SSH key from Step 8. - -User login: `ubuntu` - -Connection port: `22` \ No newline at end of file diff --git a/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster.mdx b/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster.mdx new file mode 100644 index 000000000..509045274 --- /dev/null +++ b/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster.mdx @@ -0,0 +1,206 @@ +--- +title: "Manage a Bare Metal GPU cluster" +sidebarTitle: "Manage a Bare Metal GPU cluster" +--- + +After creating a GPU cluster, use the cluster details page to monitor nodes, resize the cluster, manage power state, configure network interfaces, and delete cluster resources. + +## Access cluster details + +To view and manage an existing cluster, open the cluster details page. + +1. In the [Gcore Customer Portal](https://portal.gcore.com), navigate to **GPU Cloud**. +2. In the sidebar, expand **GPU Clusters** and select **Bare Metal GPU Clusters**. +3. Click on a cluster name to open the details page. + +The cluster details page displays summary information in the header panel: + +| Field | Description | +|-------|-------------| +| Cluster ID | Unique identifier for the cluster | +| Pkey ID | InfiniBand Partition Key ID. Displayed as "-" if InfiniBand is not configured or during cluster provisioning | +| OS Distro | Operating system image installed on all nodes | +| Status | Current cluster state | +| Region | Data center location | +| Plan | Monthly pricing plan for the cluster | + +The page is organized into tabs: **Overview**, **Power**, **Networking**, **Tags**, **User Actions**, and **Delete**. 
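+The **Pkey ID** refers to the InfiniBand partition the cluster's nodes are placed in. To list the partition keys visible on a node's InfiniBand ports, a minimal sketch (assuming a Linux image with the InfiniBand drivers loaded; device names such as `mlx5_0` vary by flavor and are shown only as an example):
+
+```bash
+# List each InfiniBand device and the partition keys visible on its first port
+for dev in /sys/class/infiniband/*; do
+  [ -e "$dev" ] || continue   # no InfiniBand devices on this flavor
+  echo "== $(basename "$dev") =="
+  cat "$dev"/ports/1/pkeys/* 2>/dev/null | sort -u
+done
+```
+
+If the flavor has no InfiniBand cards, `/sys/class/infiniband` is absent and the loop prints nothing.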
+ +![Cluster overview page showing header panel and tabs](/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/cluster-overview.png) + +## View cluster nodes + +The **Overview** tab lists all nodes (servers) in the cluster. Each node entry shows the node name, flavor, assigned IP addresses, status, timestamps, cost, and other metadata. The table supports filtering by name, date range, and status. + + +All nodes in a cluster share the same configuration (image, network settings, file shares) defined at cluster creation. + + +## Resize a cluster + +Cluster size can be adjusted after creation by adding or removing nodes. + +To resize a cluster: + +1. On the **Overview** tab, click **Resize Cluster**. +2. Adjust the instance count using the **+** and **-** buttons. +3. Click **Resize**. + +![Resize cluster dialog with instance count controls](/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/cluster-resize-dialog.png) + +New nodes inherit the cluster's original configuration defined at creation time. The maximum number of nodes is limited by the current stock availability in the selected region. + + +While a cluster is in the Resizing state, most management actions are unavailable. + + + +When scaling down, the system removes a random node from the cluster. To delete a specific node, use the per-node delete action instead of resize. + + +### Delete a specific node + +To remove a specific node without random selection: + +1. Locate the node in the cluster list. +2. Click the actions menu (three dots) on the node row. +3. Select **Delete**. +4. Confirm the deletion. + + +Deleting the last node in a cluster deletes the entire cluster. All cluster-level metadata is removed. + + +## Power actions + +Power actions control the running state of cluster nodes. Actions can be applied to individual nodes or in bulk. + +### Individual node actions + +To control a single node: + +1. Locate the node in the cluster list. +2. Click the actions menu (three dots) on the node row. +3. Select the desired action: + - **Power on**: Start the node + - **Power off**: Shut down the node + - **Soft reboot**: Graceful restart of the operating system + - **Hard reboot**: If soft reboot fails, force restart the node + - **Rebuild**: Reinstall the original operating system image used at cluster creation. All data on local storage is deleted. + +![Node actions menu showing power and management options](/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/node-actions-menu.png) + +### Bulk actions + +To apply actions to multiple nodes simultaneously: + +1. Select nodes using the checkboxes in the nodes table. +2. Click **Group actions** in the toolbar. +3. Select the action to apply to all selected nodes. + +Alternatively, use the **Power** tab to perform soft or hard reboot on all cluster nodes at once. + +![Power tab with soft and hard reboot options](/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/cluster-power-tab.png) + +## Network interfaces + +The **Networking** tab displays network interfaces for each node. + +Click on a node name to expand its interface details. 
+ +Interface types include: + +| Type | Description | +|------|-------------| +| Public | External IP address for internet access | +| Private | Internal network for communication with other cloud resources | +| InfiniBand | High-speed, low-latency inter-node network for GPU-to-GPU communication | + +![Networking tab showing node and interface list](/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/cluster-networking-tab.png) + +For flavors with InfiniBand, multiple InfiniBand interfaces are created automatically (by default, 8 for H100 configurations). These appear as "GPU-cluster ib-subnet" entries in the interface list. + +Click on an interface to expand its details, including IP address, network configuration, and other network information. + +![Node with expanded interface showing network details](/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/cluster-network-interfaces.png) + +### Modify network interfaces + +To add network interfaces on a specific node: + +1. Navigate to the **Networking** tab. +2. Click on a node to expand its details. +3. Click **Add Interface** or **Add Sub-Interface**. +4. Configure the interface type (public or private) and IP allocation settings as supported by the service. + + +InfiniBand interfaces are managed automatically and cannot be modified or deleted. This prevents accidental disruption of inter-node communication. + + +## Console access + +For troubleshooting or when SSH access is unavailable, use the browser-based console. + +1. In the **Overview** tab, locate the node you want to access. +2. Click **Open Console** in the node row. +3. The console opens in a new browser tab using noVNC. +4. Log in using the same credentials as for SSH access. + +![Browser-based noVNC console with login prompt](/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/cluster-console.png) + +The console provides direct terminal access to the node, useful when network connectivity issues prevent SSH access. + +## Tags + +Tags are key-value pairs used to organize and categorize clusters. Tags are applied at the cluster level and inherited by all nodes. + +To manage tags: + +1. Navigate to the **Tags** tab on the cluster details page. +2. Enable the **Add custom tags** checkbox. +3. Enter the key and value for each tag. +4. Click **Save changes**. + +![Tags tab with custom tags toggle](/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/cluster-tags-tab.png) + +Use tags for billing allocation, environment identification, or organizational purposes. + + +Tags can be modified even while the cluster is in the Resizing state, unlike most other management actions. + + +## User actions + +The **User Actions** tab displays a log of all operations performed on the cluster, including creation, deletion, resize, power, and network actions. + +Use the date and action type filters to narrow results. + +This log is useful for auditing and troubleshooting. + +![User Actions tab showing audit log of cluster operations](/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/cluster-user-actions-tab.png) + +## Delete a cluster + +When a cluster is deleted, local NVMe storage is permanently erased; file shares and object storage remain intact. + +To delete a cluster: + +1. Navigate to the **Delete** tab on the cluster details page. +2. Click **Delete Cluster**. +3. Confirm the deletion. 
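+Before confirming the deletion, copy anything you still need from local NVMe to persistent storage. A minimal sketch, assuming the default file share mount path `/home/ubuntu/mnt/nfs` (the source directory here is only an example):
+
+```bash
+# Copy checkpoints and outputs from local NVMe to the mounted file share
+rsync -a --progress /home/ubuntu/outputs/ /home/ubuntu/mnt/nfs/outputs-backup/
+```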
+ +![Delete tab with warning message and delete button](/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/cluster-delete-tab.png) + + +Cluster deletion is irreversible. Data on local NVMe storage cannot be recovered. + + +## Limitations + +Current limitations for cluster management: + +- **Provisioning time**: Bare Metal GPU servers, especially H100 configurations, may take 15-40 minutes to provision due to hardware preparation. +- **Configuration immutability**: Cluster-level configuration is fixed at creation time. Image, file share integration, and network settings cannot be changed after cluster creation. +- **Node-level settings**: Only network interfaces can be modified on individual nodes. Image and storage cannot be changed. +- **Resize behavior**: Scaling down removes a random node. Use per-node delete for targeted removal. +- **State-based restrictions**: Management actions are restricted while the cluster is in transitional states (for example, Resizing). diff --git a/edge-ai/ai-infrastructure/spot-bare-metal-gpu.mdx b/edge-ai/ai-infrastructure/spot-bare-metal-gpu.mdx new file mode 100644 index 000000000..0122b5e63 --- /dev/null +++ b/edge-ai/ai-infrastructure/spot-bare-metal-gpu.mdx @@ -0,0 +1,83 @@ +--- +title: "Spot Bare Metal GPU" +sidebarTitle: "Spot Bare Metal GPU" +--- + +Spot Bare Metal GPU clusters are discounted GPU servers that utilize unused capacity at reduced pricing. They provide the same hardware specifications and functionality as standard Bare Metal GPU clusters, with one key difference: they can be reclaimed with 24 hours' notice. + +## Spot vs On-demand + +| Aspect | On-demand (Bare Metal GPU) | Spot (Spot Bare Metal GPU) | +|--------|---------------------------|---------------------------| +| Pricing | Standard rates | Discounted rates | +| Availability | Guaranteed until deleted | Can be reclaimed with 24 hours' notice | +| Use case | Production workloads, critical applications | Cost-sensitive workloads that tolerate interruption | +| Capacity source | Dedicated capacity | Unused/excess capacity | + +## When to use Spot clusters + +Spot clusters are ideal for interruptible workloads, such as batch processing, experiments, testing, and development. They should not be used for production inference, time-critical tasks, long-running jobs without checkpoints, or any workload where unexpected reclamation could have serious consequences. + +## Availability + +Spot Bare Metal GPU availability depends on region and current stock. When available, a **Spot Bare Metal GPU** option appears alongside the standard **Bare Metal GPU** in the cluster type selector: + + + GPU Cluster type selector + + +If only **Bare Metal GPU** appears in the selector, Spot is not currently available in that region. In some regions, Spot may appear but show "Out of Stock"—this indicates the option exists, but no capacity is currently available. + +## Reclamation process + +Spot clusters can be reclaimed when Gcore needs the capacity for on-demand workloads or other operational requirements. The reclamation process follows a fixed timeline: + +1. An email notification is sent to the account owner. +2. A 24-hour window begins to save data, transfer workloads, and prepare for cluster deletion. +3. The cluster is deleted. Data on local storage is erased immediately. + + +The notice period is fixed and starts when the email is sent. After 24 hours, the cluster is deleted automatically. 
+ + +## Data preservation + +When a Spot cluster is deleted, data is handled as follows: + +| Resource | What happens | +|----------|--------------| +| Local NVMe storage | Erased immediately | +| File shares | Not affected (independent resource) | +| Object storage | Not affected (independent resource) | + +To protect critical data, use [file shares](/cloud/file-shares/configure-file-shares) for datasets, checkpoints, and model weights. Save outputs and backups to [object storage](/storage/manage-object-storage/manage-buckets-via-the-control-panel). Implement regular checkpointing in training scripts every 1-4 hours. When a reclamation notice is received, prioritize transferring any data not already on persistent storage. + +## Pricing and billing + +Spot clusters are billed at a discounted rate compared to standard Bare Metal GPU. The exact discount varies by region and GPU model. The flavor selection card displays both hourly and monthly rates: + + + Spot flavor pricing + + +Billing is per entire node (all GPUs on the server), calculated per minute, and aggregated hourly. Billing stops when the cluster is deleted, whether by user action or reclamation. + +A minimum account balance is required before provisioning. If the balance is insufficient, provisioning will fail. For details, see [GPU Cloud billing.](/edge-ai/billing) + +## Creating a Spot cluster + +The creation process is identical to standard Bare Metal GPU clusters, with one additional step: acknowledging the Spot terms. + +1. Navigate to **GPU Cloud** > **GPU Clusters** > **Bare Metal GPU Clusters**. +2. Click **Create Cluster**. +3. Select a region where Spot is available. + + + Region selector + + +4. In **GPU Cluster type**, select **Spot Bare Metal GPU**. A warning banner displays the terms and conditions. +5. Select a GPU flavor and configure network, SSH key, and cluster name. +6. Click **Create Cluster**. + +For detailed configuration, see [Create a Bare Metal GPU cluster.](/edge-ai/ai-infrastructure/create-a-bare-metal-gpu-cluster) \ No newline at end of file diff --git a/edge-ai/getting-started.mdx b/edge-ai/getting-started.mdx index 710028dfc..542bfdb04 100644 --- a/edge-ai/getting-started.mdx +++ b/edge-ai/getting-started.mdx @@ -1,26 +1,26 @@ ---- -title: Getting started -sidebarTitle: Getting started ---- - -The development of machine learning involves two main stages: training and inference. - -In the first stage, an AI model is trained on big data, like an image catalog, to recognize and label objects. This results in a trained model. - -**If you want to train AI models**, check out or guide on [creating an AI cluster](/edge-ai/ai-infrastructure/create-an-ai-cluster) to set up an AI cluster with the [Gcore GPU Cloud](https://gcore.com/cloud/ai-gpu) via the Gcore Customer Portal. - - -**Tip** - -Check out [our API docs](https://api.gcore.com/docs/cloud#tag/GPU-Cloud) if you want to control your GPU resources programmatically. - - -The second stage of AI is [model inference](https://gcore.com/learning/what-is-ai-inference/), where the model makes predictions based on user requests. In this stage, it's crucial that the AI model can respond promptly to users regardless of network delays, latency, and distance from data centers. 
- -**If you need inference** for open-source models or models you trained yourself, [our guide on deploying AI models](/edge-ai/everywhere-inference/ai-models/deploy-an-ai-model) explains how to set up [Everywhere Inference](https://gcore.com/everywhere-inference) via the Gcore Customer Portal. - - -**Tip** - -Check out [our API docs](https://api.gcore.com/docs/cloud#tag/Inference-Instances) if you want to control your inference resources programmatically. +--- +title: Getting started +sidebarTitle: Getting started +--- + +The development of machine learning involves two main stages: training and inference. + +In the first stage, an AI model is trained on big data, like an image catalog, to recognize and label objects. This results in a trained model. + +**If you want to train AI models**, check out our guide on [creating an AI cluster](/edge-ai/ai-infrastructure/create-a-bare-metal-gpu-cluster) to set up an AI cluster with the [Gcore GPU Cloud](https://gcore.com/cloud/ai-gpu) via the Gcore Customer Portal. + + +**Tip** + +Check out [our API docs](https://api.gcore.com/docs/cloud#tag/GPU-Cloud) if you want to control your GPU resources programmatically. + + +The second stage of AI is [model inference](https://gcore.com/learning/what-is-ai-inference/), where the model makes predictions based on user requests. In this stage, it's crucial that the AI model can respond promptly to users regardless of network delays, latency, and distance from data centers. + +**If you need inference** for open-source models or models you trained yourself, [our guide on deploying AI models](/edge-ai/everywhere-inference/ai-models/deploy-an-ai-model) explains how to set up [Everywhere Inference](https://gcore.com/everywhere-inference) via the Gcore Customer Portal. + + +**Tip** + +Check out [our API docs](https://api.gcore.com/docs/cloud#tag/Inference-Instances) if you want to control your inference resources programmatically. 
\ No newline at end of file diff --git a/images/docs/edge-ai/ai-infrastructure/about-gpu-cloud/create-cluster-page.png b/images/docs/edge-ai/ai-infrastructure/about-gpu-cloud/create-cluster-page.png new file mode 100644 index 000000000..e0272078f Binary files /dev/null and b/images/docs/edge-ai/ai-infrastructure/about-gpu-cloud/create-cluster-page.png differ diff --git a/images/docs/edge-ai/ai-infrastructure/create-a-bare-metal-gpu-cluster/gpu-cluster-additional-options.png b/images/docs/edge-ai/ai-infrastructure/create-a-bare-metal-gpu-cluster/gpu-cluster-additional-options.png new file mode 100644 index 000000000..6f39d96d9 Binary files /dev/null and b/images/docs/edge-ai/ai-infrastructure/create-a-bare-metal-gpu-cluster/gpu-cluster-additional-options.png differ diff --git a/images/docs/edge-ai/ai-infrastructure/create-a-bare-metal-gpu-cluster/gpu-cluster-capacity.png b/images/docs/edge-ai/ai-infrastructure/create-a-bare-metal-gpu-cluster/gpu-cluster-capacity.png new file mode 100644 index 000000000..eb901854c Binary files /dev/null and b/images/docs/edge-ai/ai-infrastructure/create-a-bare-metal-gpu-cluster/gpu-cluster-capacity.png differ diff --git a/images/docs/edge-ai/ai-infrastructure/create-a-bare-metal-gpu-cluster/gpu-cluster-create-file-share-modal.png b/images/docs/edge-ai/ai-infrastructure/create-a-bare-metal-gpu-cluster/gpu-cluster-create-file-share-modal.png new file mode 100644 index 000000000..21a481804 Binary files /dev/null and b/images/docs/edge-ai/ai-infrastructure/create-a-bare-metal-gpu-cluster/gpu-cluster-create-file-share-modal.png differ diff --git a/images/docs/edge-ai/ai-infrastructure/create-a-bare-metal-gpu-cluster/gpu-cluster-file-share.png b/images/docs/edge-ai/ai-infrastructure/create-a-bare-metal-gpu-cluster/gpu-cluster-file-share.png new file mode 100644 index 000000000..9675c1c88 Binary files /dev/null and b/images/docs/edge-ai/ai-infrastructure/create-a-bare-metal-gpu-cluster/gpu-cluster-file-share.png differ diff --git a/images/docs/edge-ai/ai-infrastructure/create-a-bare-metal-gpu-cluster/gpu-cluster-firewall.png b/images/docs/edge-ai/ai-infrastructure/create-a-bare-metal-gpu-cluster/gpu-cluster-firewall.png new file mode 100644 index 000000000..037691d4d Binary files /dev/null and b/images/docs/edge-ai/ai-infrastructure/create-a-bare-metal-gpu-cluster/gpu-cluster-firewall.png differ diff --git a/images/docs/edge-ai/ai-infrastructure/create-a-bare-metal-gpu-cluster/gpu-cluster-image.png b/images/docs/edge-ai/ai-infrastructure/create-a-bare-metal-gpu-cluster/gpu-cluster-image.png new file mode 100644 index 000000000..261489699 Binary files /dev/null and b/images/docs/edge-ai/ai-infrastructure/create-a-bare-metal-gpu-cluster/gpu-cluster-image.png differ diff --git a/images/docs/edge-ai/ai-infrastructure/create-a-bare-metal-gpu-cluster/gpu-cluster-instances.png b/images/docs/edge-ai/ai-infrastructure/create-a-bare-metal-gpu-cluster/gpu-cluster-instances.png new file mode 100644 index 000000000..a687f3c62 Binary files /dev/null and b/images/docs/edge-ai/ai-infrastructure/create-a-bare-metal-gpu-cluster/gpu-cluster-instances.png differ diff --git a/images/docs/edge-ai/ai-infrastructure/create-a-bare-metal-gpu-cluster/gpu-cluster-name.png b/images/docs/edge-ai/ai-infrastructure/create-a-bare-metal-gpu-cluster/gpu-cluster-name.png new file mode 100644 index 000000000..169bf5435 Binary files /dev/null and b/images/docs/edge-ai/ai-infrastructure/create-a-bare-metal-gpu-cluster/gpu-cluster-name.png differ diff --git 
a/images/docs/edge-ai/ai-infrastructure/create-a-bare-metal-gpu-cluster/gpu-cluster-network-settings.png b/images/docs/edge-ai/ai-infrastructure/create-a-bare-metal-gpu-cluster/gpu-cluster-network-settings.png new file mode 100644 index 000000000..10bf5fcc0 Binary files /dev/null and b/images/docs/edge-ai/ai-infrastructure/create-a-bare-metal-gpu-cluster/gpu-cluster-network-settings.png differ diff --git a/images/docs/edge-ai/ai-infrastructure/create-a-bare-metal-gpu-cluster/gpu-cluster-network.png b/images/docs/edge-ai/ai-infrastructure/create-a-bare-metal-gpu-cluster/gpu-cluster-network.png new file mode 100644 index 000000000..8894cb7d0 Binary files /dev/null and b/images/docs/edge-ai/ai-infrastructure/create-a-bare-metal-gpu-cluster/gpu-cluster-network.png differ diff --git a/images/docs/edge-ai/ai-infrastructure/create-a-bare-metal-gpu-cluster/gpu-cluster-region.png b/images/docs/edge-ai/ai-infrastructure/create-a-bare-metal-gpu-cluster/gpu-cluster-region.png new file mode 100644 index 000000000..9d02b70c7 Binary files /dev/null and b/images/docs/edge-ai/ai-infrastructure/create-a-bare-metal-gpu-cluster/gpu-cluster-region.png differ diff --git a/images/docs/edge-ai/ai-infrastructure/create-a-bare-metal-gpu-cluster/gpu-cluster-ssh-key.png b/images/docs/edge-ai/ai-infrastructure/create-a-bare-metal-gpu-cluster/gpu-cluster-ssh-key.png new file mode 100644 index 000000000..d89ddfde3 Binary files /dev/null and b/images/docs/edge-ai/ai-infrastructure/create-a-bare-metal-gpu-cluster/gpu-cluster-ssh-key.png differ diff --git a/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/cluster-actions-disabled.png b/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/cluster-actions-disabled.png new file mode 100644 index 000000000..7d04c9b5e Binary files /dev/null and b/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/cluster-actions-disabled.png differ diff --git a/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/cluster-console.png b/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/cluster-console.png new file mode 100644 index 000000000..5309a7b95 Binary files /dev/null and b/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/cluster-console.png differ diff --git a/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/cluster-delete-tab.png b/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/cluster-delete-tab.png new file mode 100644 index 000000000..8b5a4a2c0 Binary files /dev/null and b/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/cluster-delete-tab.png differ diff --git a/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/cluster-header-info.png b/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/cluster-header-info.png new file mode 100644 index 000000000..d1b31b96d Binary files /dev/null and b/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/cluster-header-info.png differ diff --git a/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/cluster-instances-table.png b/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/cluster-instances-table.png new file mode 100644 index 000000000..8bd703caa Binary files /dev/null and b/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/cluster-instances-table.png differ diff --git a/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/cluster-list.png 
b/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/cluster-list.png new file mode 100644 index 000000000..a711690ed Binary files /dev/null and b/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/cluster-list.png differ diff --git a/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/cluster-network-interface-details.png b/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/cluster-network-interface-details.png new file mode 100644 index 000000000..f8810fdda Binary files /dev/null and b/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/cluster-network-interface-details.png differ diff --git a/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/cluster-network-interfaces.png b/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/cluster-network-interfaces.png new file mode 100644 index 000000000..787ed8d55 Binary files /dev/null and b/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/cluster-network-interfaces.png differ diff --git a/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/cluster-networking-tab.png b/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/cluster-networking-tab.png new file mode 100644 index 000000000..ea4974e39 Binary files /dev/null and b/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/cluster-networking-tab.png differ diff --git a/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/cluster-overview.png b/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/cluster-overview.png new file mode 100644 index 000000000..8312d5d21 Binary files /dev/null and b/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/cluster-overview.png differ diff --git a/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/cluster-power-tab.png b/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/cluster-power-tab.png new file mode 100644 index 000000000..32f6880a7 Binary files /dev/null and b/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/cluster-power-tab.png differ diff --git a/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/cluster-resize-dialog.png b/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/cluster-resize-dialog.png new file mode 100644 index 000000000..37cebe95a Binary files /dev/null and b/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/cluster-resize-dialog.png differ diff --git a/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/cluster-tags-form.png b/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/cluster-tags-form.png new file mode 100644 index 000000000..510763e6b Binary files /dev/null and b/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/cluster-tags-form.png differ diff --git a/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/cluster-tags-tab.png b/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/cluster-tags-tab.png new file mode 100644 index 000000000..d3bbdc22d Binary files /dev/null and b/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/cluster-tags-tab.png differ diff --git a/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/cluster-user-actions-tab.png 
b/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/cluster-user-actions-tab.png new file mode 100644 index 000000000..d899ae15f Binary files /dev/null and b/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/cluster-user-actions-tab.png differ diff --git a/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/gpu-cluster-create-page.png b/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/gpu-cluster-create-page.png new file mode 100644 index 000000000..64f31ab17 Binary files /dev/null and b/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/gpu-cluster-create-page.png differ diff --git a/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/network-public-interface.png b/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/network-public-interface.png new file mode 100644 index 000000000..8fef57cb1 Binary files /dev/null and b/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/network-public-interface.png differ diff --git a/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/node-actions-menu.png b/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/node-actions-menu.png new file mode 100644 index 000000000..a6fcd3779 Binary files /dev/null and b/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/node-actions-menu.png differ diff --git a/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/node-delete-confirmation.png b/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/node-delete-confirmation.png new file mode 100644 index 000000000..43a96089c Binary files /dev/null and b/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/node-delete-confirmation.png differ diff --git a/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/user-actions-filters.png b/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/user-actions-filters.png new file mode 100644 index 000000000..45ff64063 Binary files /dev/null and b/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/user-actions-filters.png differ diff --git a/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/user-actions-log-table.png b/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/user-actions-log-table.png new file mode 100644 index 000000000..c04a522b4 Binary files /dev/null and b/images/docs/edge-ai/ai-infrastructure/manage-a-bare-metal-gpu-cluster/user-actions-log-table.png differ diff --git a/images/docs/edge-ai/ai-infrastructure/spot-bare-metal-gpu/spot-with-price.png b/images/docs/edge-ai/ai-infrastructure/spot-bare-metal-gpu/spot-with-price.png new file mode 100644 index 000000000..818c0fd25 Binary files /dev/null and b/images/docs/edge-ai/ai-infrastructure/spot-bare-metal-gpu/spot-with-price.png differ diff --git a/images/docs/edge-ai/ai-infrastructure/spot-bare-metal-gpu/step-gpu-cluster-type.png b/images/docs/edge-ai/ai-infrastructure/spot-bare-metal-gpu/step-gpu-cluster-type.png new file mode 100644 index 000000000..30a7dbda8 Binary files /dev/null and b/images/docs/edge-ai/ai-infrastructure/spot-bare-metal-gpu/step-gpu-cluster-type.png differ diff --git a/images/docs/edge-ai/ai-infrastructure/spot-bare-metal-gpu/step-region.png b/images/docs/edge-ai/ai-infrastructure/spot-bare-metal-gpu/step-region.png new file mode 100644 index 000000000..e188d3abd Binary files /dev/null and 
b/images/docs/edge-ai/ai-infrastructure/spot-bare-metal-gpu/step-region.png differ