Skip to content

gpu-cluster: Remove step to create gpu-operator ns #173

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: main
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
25 changes: 6 additions & 19 deletions articles/aks/gpu-cluster.md
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,7 @@ ms.author: schaffererin

# Use GPUs for compute-intensive workloads on Azure Kubernetes Service (AKS)

Graphical processing units (GPUs) are often used for compute-intensive workloads, such as graphics and visualization workloads. AKS supports GPU-enabled Linux node pools to run compute-intensive Kubernetes workloads.
Graphical processing units (GPUs) are often used for compute-intensive workloads, such as graphics and visualization workloads. AKS supports GPU-enabled Linux node pools to run compute-intensive Kubernetes workloads.

This article helps you provision nodes with schedulable GPUs on new and existing AKS clusters.

Expand Down Expand Up @@ -93,7 +93,7 @@ To use the default OS SKU, you create the node pool without specifying an OS SKU
* `--max-count`: Configures the cluster autoscaler to maintain a maximum of three nodes in the node pool.

> [!NOTE]
> Taints and VM sizes can only be set for node pools during node pool creation, but you can update autoscaler settings at any time.
> Taints and VM sizes can only be set for node pools during node pool creation, but you can update autoscaler settings at any time.

##### [Azure Linux node pool](#tab/add-azure-linux-gpu-node-pool)

Expand Down Expand Up @@ -128,13 +128,7 @@ To use Azure Linux, you specify the OS SKU by setting `os-sku` to `AzureLinux` d

---

1. Create a namespace using the [`kubectl create namespace`][kubectl-create] command.

```bash
kubectl create namespace gpu-operator
```

2. Create a file named *nvidia-device-plugin-ds.yaml* and paste the following YAML manifest provided as part of the [NVIDIA device plugin for Kubernetes project][nvidia-github]:
1. Create a file named *nvidia-device-plugin-ds.yaml* and paste the following YAML manifest provided as part of the [NVIDIA device plugin for Kubernetes project][nvidia-github]:

```yaml
apiVersion: apps/v1
Expand Down Expand Up @@ -182,13 +176,13 @@ To use Azure Linux, you specify the OS SKU by setting `os-sku` to `AzureLinux` d
path: /var/lib/kubelet/device-plugins
```

3. Create the DaemonSet and confirm the NVIDIA device plugin is created successfully using the [`kubectl apply`][kubectl-apply] command.
2. Create the DaemonSet and confirm the NVIDIA device plugin is created successfully using the [`kubectl apply`][kubectl-apply] command.

```bash
kubectl apply -f nvidia-device-plugin-ds.yaml
```

4. Now that you successfully installed the NVIDIA device plugin, you can check that your [GPUs are schedulable](#confirm-that-gpus-are-schedulable) and [run a GPU workload](#run-a-gpu-enabled-workload).
3. Now that you successfully installed the NVIDIA device plugin, you can check that your [GPUs are schedulable](#confirm-that-gpus-are-schedulable) and [run a GPU workload](#run-a-gpu-enabled-workload).


### Skip GPU driver installation (preview)
Expand Down Expand Up @@ -354,7 +348,7 @@ To see the GPU in action, you can schedule a GPU-enabled workload with the appro

```console
2019-05-16 16:08:31.258328: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2019-05-16 16:08:31.396846: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
2019-05-16 16:08:31.396846: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 2fd7:00:00.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
Expand Down Expand Up @@ -430,14 +424,11 @@ To see the GPU in action, you can schedule a GPU-enabled workload with the appro
[kubectl-describe]: https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#describe
[kubectl-logs]: https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#logs
[kubectl delete]: https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#delete
[kubectl-create]: https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#create
[azure-pricing]: https://azure.microsoft.com/pricing/
[azure-availability]: https://azure.microsoft.com/global-infrastructure/services/
[nvidia-github]: https://github.com/NVIDIA/k8s-device-plugin/blob/4b3d6b0a6613a3672f71ea4719fd8633eaafb4f3/deployments/static/nvidia-device-plugin.yml

<!-- LINKS - internal -->
[az-aks-create]: /cli/azure/aks#az_aks_create
[az-aks-nodepool-update]: /cli/azure/aks/nodepool#az_aks_nodepool_update
[az-aks-nodepool-add]: /cli/azure/aks/nodepool#az_aks_nodepool_add
[az-aks-get-credentials]: /cli/azure/aks#az_aks_get_credentials
[aks-quickstart-cli]: ./learn/quick-kubernetes-deploy-cli.md
Expand All @@ -451,10 +442,6 @@ To see the GPU in action, you can schedule a GPU-enabled workload with the appro
[azureml-triton]: /azure/machine-learning/how-to-deploy-with-triton
[aks-container-insights]: monitor-aks.md#integrations
[advanced-scheduler-aks]: operator-best-practices-advanced-scheduler.md
[az-provider-register]: /cli/azure/provider#az-provider-register
[az-feature-register]: /cli/azure/feature#az-feature-register
[az-feature-show]: /cli/azure/feature#az-feature-show
[az-extension-add]: /cli/azure/extension#az-extension-add
[az-extension-update]: /cli/azure/extension#az-extension-update
[NVadsA10]: /azure/virtual-machines/nva10v5-series