Environmental Info:
k3s version v1.24.8+k3s1 (k3s-io/k3s@648004e)
go version go1.18.8
Node(s) CPU architecture, OS, and Version:
node202: x86_64, Ubuntu 18.04.6 LTS, kernel version 5.4.0-136-generic
node203: x86_64, Ubuntu 18.04.6 LTS, kernel version 5.4.0-136-generic
node204: aarch64, Ubuntu 20.04.5 LTS, kernel version 5.10.104-tegra
Cluster Configuration:
1 server, 2 agents.
server: node202 (192.168.1.202)
agents: node203 (192.168.1.203) and node204 (192.168.1.204)
Describe the bug:
Pod gpushare-schd-extender-xxx runs fine on node202, but pod gpushare-device-plugin-ds-xxx fails to run on node204 and keeps restarting.
Steps To Reproduce:
Follow the installation steps at https://github.com/AliyunContainerService/gpushare-scheduler-extender/blob/master/docs/install.md
Configuration Info:
daemon.json in /etc/docker on node204
{ "default-runtime": "nvidia", "runtimes": { "nvidia": { "path": "/usr/bin/nvidia-container-runtime", "runtimeArgs": [] } }, "insecure-registries": ["192.168.1.229:5000"] }
kube-scheduler.yaml in /etc/kubernetes/manifest on node202
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    component: kube-scheduler
    tier: control-plane
  name: kube-scheduler
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-scheduler
    - --authentication-kubeconfig=/etc/kubernetes/scheduler.conf
    - --authorization-kubeconfig=/etc/kubernetes/scheduler.conf
    - --bind-address=127.0.0.1
    - --kubeconfig=/etc/kubernetes/scheduler.conf
    - --leader-elect=true
    - --config=/etc/kubernetes/scheduler-policy-config.yaml
    - --policy-config-file=/home/u18/k3sgpushare/scheduler-policy-config.yaml
    image: k8s.gcr.io/kube-scheduler:v1.23.3
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 8
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10259
        scheme: HTTPS
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    name: kube-scheduler
    resources:
      requests:
        cpu: 100m
    startupProbe:
      failureThreshold: 24
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10259
        scheme: HTTPS
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
    volumeMounts:
    - mountPath: /etc/kubernetes/scheduler.conf
      name: kubeconfig
      readOnly: true
    - mountPath: /etc/kubernetes/scheduler-policy-config.yaml
      name: scheduler-policy-config
      readOnly: true
  hostNetwork: true
  priorityClassName: system-node-critical
  securityContext:
    seccompProfile:
      type: RuntimeDefault
  volumes:
  - hostPath:
      path: /etc/kubernetes/scheduler.conf
      type: FileOrCreate
    name: kubeconfig
  - hostPath:
      path: /etc/kubernetes/scheduler-policy-config.yaml
      type: FileOrCreate
    name: scheduler-policy-config
status: {}
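Note that a k3s server does not read kubelet static-pod manifests from /etc/kubernetes/manifest; the scheduler runs embedded in the k3s binary, so this manifest (and its k8s.gcr.io/kube-scheduler:v1.23.3 image, which also does not match the v1.24.8 cluster) is most likely never applied. A sketch of passing a scheduler configuration file to the embedded scheduler instead, assuming the default k3s config file location /etc/rancher/k3s/config.yaml:

# Sketch of /etc/rancher/k3s/config.yaml on node202; restart the k3s
# service afterwards. The path below is an assumption and should point at
# the scheduler configuration file actually used on the server node.
kube-scheduler-arg:
  - "config=/etc/kubernetes/scheduler-policy-config.yaml"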
scheduler-policy-config.yaml in /etc/kubernetes on node202
{ "kind": "Policy", "apiVersion": "v1", "extenders": [ { "urlPrefix": "http://127.0.0.1:32766/gpushare-scheduler", "filterVerb": "filter", "bindVerb": "bind", "enableHttps": false, "nodeCacheCapable": true, "managedResources": [ { "name": "aliyun.com/gpu-mem", "ignoredByScheduler": false } ], "ignorable": false } ] }
kubectl-inspect-gpushare in /usr/bin on node202
u18@node202:/usr/bin$ ls -al /usr/bin/ | grep gpushare
-rwxrw-r-- 1 u18 u18 37310113 Dec  7  2021 kubectl-inspect-gpushare
The default GPU device plugin has been removed
u18@node202:/usr/bin$ kubectl get pod nvidia -n=kube-system
Error from server (NotFound): pods "nvidia" not found
"gpushare=true" be labeled on node204
u18@node202:/usr/bin$ kubectl get node --show-labels=true | grep gpushare=true
node204   Ready   <none>   42d   v1.24.8+k3s1   beta.kubernetes.io/arch=arm64,beta.kubernetes.io/instance-type=k3s,beta.kubernetes.io/os=linux,egress.k3s.io/cluster=true,gpushare=true,kubernetes.io/arch=arm64,kubernetes.io/hostname=node204,kubernetes.io/os=linux,node.kubernetes.io/instance-type=k3s,nodeShareGPU=true,nvidia.com/node=true
Expected behavior:
Both pod gpushare-schd-extender-xxx and pod gpushare-device-plugin-ds-xxxxx run normally.
Actual behavior:
gpushare-schd-extender runs on node202, but gpushare-device-plugin-ds keeps crashing (CrashLoopBackOff) on node204:

u18@node202:~/k3sgpushare$ k get pod -n=kube-system -owide | grep gpushare
gpushare-schd-extender-865f956968-vvlnr   1/1   Running            0              13m   192.168.1.202   node202   <none>   <none>
gpushare-device-plugin-ds-hgslb           0/1   CrashLoopBackOff   15 (65s ago)   52m   192.168.1.204   node204   <none>   <none>
Additional context / logs:

u18@node202:~/k3sgpushare$ k describe pod gpushare-device-plugin-ds -n=kube-system
Name:                 gpushare-device-plugin-ds-fr49d
Namespace:            kube-system
Priority:             0
Node:                 node204/192.168.1.204
Start Time:           Thu, 25 May 2023 17:16:48 +0800
Labels:               app=gpushare
                      component=gpushare-device-plugin
                      controller-revision-hash=7d7d6b77dd
                      name=gpushare-device-plugin-ds
                      pod-template-generation=1
Annotations:          scheduler.alpha.kubernetes.io/priorityClassName: system-cluster-critical
Status:               Running
IP:                   192.168.1.204
IPs:
  IP:                 192.168.1.204
Controlled By:        DaemonSet/gpushare-device-plugin-ds
Containers:
  gpushare:
    Container ID:  containerd://38eca65174bae8b74878074509ab3e69359558b326a8cfb61bbdcb4f59c66a73
    Image:         registry.cn-hangzhou.aliyuncs.com/acs/k8s-gpushare-plugin:v2-1.11-aff8a23
    Image ID:      registry.cn-hangzhou.aliyuncs.com/acs/k8s-gpushare-plugin@sha256:76769d69f5a5b24cbe117f8ac83a0ff7409fda6108ca982c8f3b8f763e016100
    Port:          <none>
    Host Port:     <none>
    Command:
      gpushare-device-plugin-v2
      -logtostderr
      --v=5
      --memory-unit=GiB//
    State:          Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Thu, 25 May 2023 17:17:04 +0800
      Finished:     Thu, 25 May 2023 17:17:04 +0800
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Thu, 25 May 2023 17:16:49 +0800
      Finished:     Thu, 25 May 2023 17:16:49 +0800
    Ready:          False
    Restart Count:  2
    Limits:
      cpu:     1
      memory:  300Mi
    Requests:
      cpu:     1
      memory:  300Mi
    Environment:
      KUBECONFIG:  /etc/kubernetes/kubelet.conf
      NODE_NAME:    (v1:spec.nodeName)
    Mounts:
      /var/lib/kubelet/device-plugins from device-plugin (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-wdc25 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  device-plugin:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/device-plugins
    HostPathType:
  kube-api-access-wdc25:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Guaranteed
Node-Selectors:              gpushare=true
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/network-unavailable:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason     Age               From               Message
  ----     ------     ----              ----               -------
  Normal   Scheduled  17s               default-scheduler  Successfully assigned kube-system/gpushare-device-plugin-ds-fr49d to node204
  Normal   Pulled     2s (x3 over 17s)  kubelet            Container image "registry.cn-hangzhou.aliyuncs.com/acs/k8s-gpushare-plugin:v2-1.11-aff8a23" already present on machine
  Normal   Created    2s (x3 over 17s)  kubelet            Created container gpushare
  Normal   Started    2s (x3 over 17s)  kubelet            Started container gpushare
  Warning  BackOff    1s (x3 over 16s)  kubelet            Back-off restarting failed container