Pods requesting GPUs are not scheduled correctly #661

Open
Radical-3 opened this issue Nov 30, 2024 · 6 comments

@Radical-3

Please provide an in-depth description of the question you have:
When I create a pod with the following YAML:

apiVersion: v1  # API version of the Kubernetes object defined in this file
kind: Pod       # kind of the object defined in this file
metadata:
  name: gpu-pod1
spec:
  containers:
    - name: ubuntu-container
      image: ubuntu:18.04
      command: ["bash", "-c", "sleep 86400"]  # sleep for 24 hours so the container keeps running
      # while idle it consumes no compute; it only uses GPU resources when it receives work
      resources:
        limits:
          nvidia.com/gpu: 1  # request 1 vGPU

the pod fails with an UnexpectedAdmissionError, as shown in the screenshot below.
[screenshot: pod in UnexpectedAdmissionError state]
When I instead create a pod with the following YAML:

apiVersion: v1  # API version of the Kubernetes object defined in this file
kind: Pod       # kind of the object defined in this file
metadata:
  name: gpu-pod2
spec:
  containers:
    - name: ubuntu-container
      image: ubuntu:18.04
      command: ["bash", "-c", "sleep 86400"]  # sleep for 24 hours so the container keeps running
      resources:
        limits:
          nvidia.com/gpu: 1        # request 1 vGPU
          nvidia.com/gpumem: 100   # device memory per vGPU, in MiB (optional, integer)
          nvidia.com/gpucores: 5   # percentage of the GPU's compute per vGPU (optional, integer)

the pod stays in Pending indefinitely, as shown in the screenshots below.
[screenshots: pod stuck in Pending]

What do you think about this question?:

Environment:

  1. NVIDIA driver version: 550.120
  2. NVIDIA Container Runtime version:
    NVIDIA Container Runtime version 1.17.2
    commit: fa66e4cd562804509055e44a88f666673e6d27c0
    spec: 1.2.0

runc version 1.1.12-0ubuntu2~22.04.1
spec: 1.0.2-dev
go: go1.21.1
libseccomp: 2.5.3

  • HAMi version: 2.4.1
  • Kubernetes version: 1.30.1
  • Others:
    1. containerd config.toml:
    disabled_plugins = []
    imports = ["/etc/containerd/config.toml"]
    oom_score = 0
    plugin_dir = ""
    required_plugins = []
    root = "/var/lib/containerd"
    state = "/run/containerd"
    temp = ""
    version = 2

[cgroup]
path = ""

[debug]
address = ""
format = ""
gid = 0
level = ""
uid = 0

[grpc]
address = "/run/containerd/containerd.sock"
gid = 0
max_recv_message_size = 16777216
max_send_message_size = 16777216
tcp_address = ""
tcp_tls_ca = ""
tcp_tls_cert = ""
tcp_tls_key = ""
uid = 0

[metrics]
address = ""
grpc_histogram = false

[plugins]

[plugins."io.containerd.gc.v1.scheduler"]
deletion_threshold = 0
mutation_threshold = 100
pause_threshold = 0.02
schedule_delay = "0s"
startup_delay = "100ms"

[plugins."io.containerd.grpc.v1.cri"]
cdi_spec_dirs = ["/etc/cdi", "/var/run/cdi"]
device_ownership_from_security_context = false
disable_apparmor = false
disable_cgroup = false
disable_hugetlb_controller = true
disable_proc_mount = false
disable_tcp_service = true
drain_exec_sync_io_timeout = "0s"
enable_cdi = false
enable_selinux = false
enable_tls_streaming = false
enable_unprivileged_icmp = false
enable_unprivileged_ports = false
ignore_deprecation_warnings = []
ignore_image_defined_volumes = false
image_pull_progress_timeout = "5m0s"
image_pull_with_sync_fs = false
max_concurrent_downloads = 3
max_container_log_line_size = 16384
netns_mounts_under_state_dir = false
restrict_oom_score_adj = false
sandbox_image = "registry.aliyuncs.com/google_containers/pause:3.9"
selinux_category_range = 1024
stats_collect_period = 10
stream_idle_timeout = "4h0m0s"
stream_server_address = "127.0.0.1"
stream_server_port = "0"
systemd_cgroup = false
tolerate_missing_hugetlb_controller = true
unset_seccomp_profile = ""

[plugins."io.containerd.grpc.v1.cri".cni]
  bin_dir = "/opt/cni/bin"
  conf_dir = "/etc/cni/net.d"
  conf_template = ""
  ip_pref = ""
  max_conf_num = 1
  setup_serially = false

[plugins."io.containerd.grpc.v1.cri".containerd]
  disable_snapshot_annotations = true
  discard_unpacked_layers = false
  ignore_blockio_not_enabled_errors = false
  ignore_rdt_not_enabled_errors = false
  no_pivot = false
  snapshotter = "overlayfs"
  default_runtime_name = "nvidia"

  [plugins."io.containerd.grpc.v1.cri".containerd.default_runtime]
    base_runtime_spec = ""
    cni_conf_dir = ""
    cni_max_conf_num = 0
    container_annotations = []
    pod_annotations = []
    privileged_without_host_devices = false
    privileged_without_host_devices_all_devices_allowed = false
    runtime_engine = ""
    runtime_path = ""
    runtime_root = ""
    runtime_type = ""
    sandbox_mode = ""
    snapshotter = ""

    [plugins."io.containerd.grpc.v1.cri".containerd.default_runtime.options]

  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]

    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
      privileged_without_host_devices = false
      runtime_engine = ""
      runtime_root = ""
      runtime_type = "io.containerd.runc.v2"
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
        BinaryName = "/usr/bin/nvidia-container-runtime"

    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
      base_runtime_spec = ""
      cni_conf_dir = ""
      cni_max_conf_num = 0
      container_annotations = []
      pod_annotations = []
      privileged_without_host_devices = false
      privileged_without_host_devices_all_devices_allowed = false
      runtime_engine = ""
      runtime_path = ""
      runtime_root = ""
      runtime_type = "io.containerd.runc.v2"
      sandbox_mode = "podsandbox"
      snapshotter = ""

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
        BinaryName = ""
        CriuImagePath = ""
        CriuPath = ""
        CriuWorkPath = ""
        IoGid = 0
        IoUid = 0
        NoNewKeyring = false
        NoPivotRoot = false
        Root = ""
        ShimCgroup = ""
        SystemdCgroup = true

  [plugins."io.containerd.grpc.v1.cri".containerd.untrusted_workload_runtime]
    base_runtime_spec = ""
    cni_conf_dir = ""
    cni_max_conf_num = 0
    container_annotations = []
    pod_annotations = []
    privileged_without_host_devices = false
    privileged_without_host_devices_all_devices_allowed = false
    runtime_engine = ""
    runtime_path = ""
    runtime_root = ""
    runtime_type = ""
    sandbox_mode = ""
    snapshotter = ""

    [plugins."io.containerd.grpc.v1.cri".containerd.untrusted_workload_runtime.options]

[plugins."io.containerd.grpc.v1.cri".image_decryption]
  key_model = "node"

[plugins."io.containerd.grpc.v1.cri".registry]
  config_path = ""

  [plugins."io.containerd.grpc.v1.cri".registry.auths]

  [plugins."io.containerd.grpc.v1.cri".registry.configs]

  [plugins."io.containerd.grpc.v1.cri".registry.headers]

  [plugins."io.containerd.grpc.v1.cri".registry.mirrors]

    [plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.elastic.co"]
      endpoint = ["https://elastic.m.daocloud.io"]

    [plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
      endpoint = ["https://docker.m.daocloud.io", "https://dockerproxy.com/"]

    [plugins."io.containerd.grpc.v1.cri".registry.mirrors."gcr.io"]
      endpoint = ["https://gcr.m.daocloud.io", "https://gcr.nju.edu.cn", "https://gcr.dockerproxy.com"]

    [plugins."io.containerd.grpc.v1.cri".registry.mirrors."ghcr.io"]
      endpoint = ["https://ghcr.m.daocloud.io", "https://ghcr.nju.edu.cn"]

    [plugins."io.containerd.grpc.v1.cri".registry.mirrors."k8s.gcr.io"]
      endpoint = ["https://k8s-gcr.m.daocloud.io", "https://gcr.nju.edu.cn/google-containers/", "https://k8s.dockerproxy.com/"]

    [plugins."io.containerd.grpc.v1.cri".registry.mirrors."quay.io"]
      endpoint = ["https://quay.m.daocloud.io", "https://quay.nju.edu.cn"]

    [plugins."io.containerd.grpc.v1.cri".registry.mirrors."registry.k8s.io"]
      endpoint = ["https://k8s.m.daocloud.io", "https://k8s.nju.edu.cn"]

[plugins."io.containerd.grpc.v1.cri".x509_key_pair_streaming]
  tls_cert_file = ""
  tls_key_file = ""

[plugins."io.containerd.internal.v1.opt"]
path = "/opt/containerd"

[plugins."io.containerd.internal.v1.restart"]
interval = "10s"

[plugins."io.containerd.internal.v1.tracing"]

[plugins."io.containerd.metadata.v1.bolt"]
content_sharing_policy = "shared"

[plugins."io.containerd.monitor.v1.cgroups"]
no_prometheus = false

[plugins."io.containerd.nri.v1.nri"]
disable = true
disable_connections = false
plugin_config_path = "/etc/nri/conf.d"
plugin_path = "/opt/nri/plugins"
plugin_registration_timeout = "5s"
plugin_request_timeout = "2s"
socket_path = "/var/run/nri/nri.sock"

[plugins."io.containerd.runtime.v1.linux"]
no_shim = false
runtime = "runc"
runtime_root = ""
shim = "containerd-shim"
shim_debug = false

[plugins."io.containerd.runtime.v2.task"]
platforms = ["linux/amd64"]
sched_core = false

[plugins."io.containerd.service.v1.diff-service"]
default = ["walking"]

[plugins."io.containerd.service.v1.tasks-service"]
blockio_config_file = ""
rdt_config_file = ""

[plugins."io.containerd.snapshotter.v1.aufs"]
root_path = ""

[plugins."io.containerd.snapshotter.v1.blockfile"]
fs_type = ""
mount_options = []
root_path = ""
scratch_file = ""

[plugins."io.containerd.snapshotter.v1.btrfs"]
root_path = ""

[plugins."io.containerd.snapshotter.v1.devmapper"]
async_remove = false
base_image_size = ""
discard_blocks = false
fs_options = ""
fs_type = ""
pool_name = ""
root_path = ""

[plugins."io.containerd.snapshotter.v1.native"]
root_path = ""

[plugins."io.containerd.snapshotter.v1.overlayfs"]
mount_options = []
root_path = ""
sync_remove = false
upperdir_label = false

[plugins."io.containerd.snapshotter.v1.zfs"]
root_path = ""

[plugins."io.containerd.tracing.processor.v1.otlp"]

[plugins."io.containerd.transfer.v1.local"]
config_path = ""
max_concurrent_downloads = 3
max_concurrent_uploaded_layers = 3

[[plugins."io.containerd.transfer.v1.local".unpack_config]]
  differ = ""
  platform = "linux/amd64"
  snapshotter = "overlayfs"

[proxy_plugins]

[stream_processors]

[stream_processors."io.containerd.ocicrypt.decoder.v1.tar"]
accepts = ["application/vnd.oci.image.layer.v1.tar+encrypted"]
args = ["--decryption-keys-path", "/etc/containerd/ocicrypt/keys"]
env = ["OCICRYPT_KEYPROVIDER_CONFIG=/etc/containerd/ocicrypt/ocicrypt_keyprovider.conf"]
path = "ctd-decoder"
returns = "application/vnd.oci.image.layer.v1.tar"

[stream_processors."io.containerd.ocicrypt.decoder.v1.tar.gzip"]
accepts = ["application/vnd.oci.image.layer.v1.tar+gzip+encrypted"]
args = ["--decryption-keys-path", "/etc/containerd/ocicrypt/keys"]
env = ["OCICRYPT_KEYPROVIDER_CONFIG=/etc/containerd/ocicrypt/ocicrypt_keyprovider.conf"]
path = "ctd-decoder"
returns = "application/vnd.oci.image.layer.v1.tar+gzip"

[timeouts]
"io.containerd.timeout.bolt.open" = "0s"
"io.containerd.timeout.metrics.shimstats" = "2s"
"io.containerd.timeout.shim.cleanup" = "5s"
"io.containerd.timeout.shim.load" = "5s"
"io.containerd.timeout.shim.shutdown" = "3s"
"io.containerd.timeout.task.state" = "2s"

[ttrpc]
address = ""
gid = 0
uid = 0
2. Status of the pods deployed by the HAMi Helm chart:
[screenshot: HAMi pod status]

@Radical-3
Author

Additional information:
1. The cluster has three nodes: one control-plane node, one worker node with a GPU, and one worker node without a GPU. The GPU worker node has a single card with 16 GB of memory; its details are shown below.
[screenshot: GPU worker node details]
2. Information for the GPU worker node after deploying HAMi:
Name: lhj-ubuntu
Roles:
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
gpu=on
kubernetes.io/arch=amd64
kubernetes.io/hostname=lhj-ubuntu
kubernetes.io/os=linux
nvidia.com/gpu=true
nvidia.com/gpu.deploy.container-toolkit=true
nvidia.com/gpu.deploy.dcgm=true
nvidia.com/gpu.deploy.dcgm-exporter=true
nvidia.com/gpu.deploy.device-plugin=true
nvidia.com/gpu.deploy.driver=true
nvidia.com/gpu.deploy.gpu-feature-discovery=true
nvidia.com/gpu.deploy.node-status-exporter=true
nvidia.com/gpu.deploy.operator-validator=true
nvidia.com/gpu.present=true
Annotations: csi.volume.kubernetes.io/nodeid: {"csi.tigera.io":"lhj-ubuntu"}
hami.io/node-handshake: Requesting_2024.11.30 10:41:01
hami.io/node-handshake-dcu: Deleted_2024.11.29 07:59:34
hami.io/node-nvidia-register: GPU-6aee528c-0c77-702a-118d-4712762feacf,10,16380,100,NVIDIA-NVIDIA GeForce RTX 4060 Ti,0,true:
kubeadm.alpha.kubernetes.io/cri-socket: unix:///var/run/containerd/containerd.sock
node.alpha.kubernetes.io/ttl: 0
nvidia.com/gpu-driver-upgrade-enabled: true
projectcalico.org/IPv4Address: 10.195.12.47/16
projectcalico.org/IPv4VXLANTunnelAddr: 10.60.88.192
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Thu, 28 Nov 2024 21:43:46 +0800
Taints:
Unschedulable: false
Lease:
HolderIdentity: lhj-ubuntu
AcquireTime:
RenewTime: Sat, 30 Nov 2024 18:41:00 +0800
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message


NetworkUnavailable False Sat, 30 Nov 2024 13:19:10 +0800 Sat, 30 Nov 2024 13:19:10 +0800 CalicoIsUp Calico is running on this node
MemoryPressure False Sat, 30 Nov 2024 18:40:45 +0800 Sat, 30 Nov 2024 13:19:07 +0800 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Sat, 30 Nov 2024 18:40:45 +0800 Sat, 30 Nov 2024 13:19:07 +0800 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Sat, 30 Nov 2024 18:40:45 +0800 Sat, 30 Nov 2024 13:19:07 +0800 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Sat, 30 Nov 2024 18:40:45 +0800 Sat, 30 Nov 2024 13:19:07 +0800 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 10.195.12.47
Hostname: lhj-ubuntu
Capacity:
cpu: 12
ephemeral-storage: 702161192Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 32056376Ki
nvidia.com/gpu: 10
pods: 110
Allocatable:
cpu: 12
ephemeral-storage: 647111753476
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 31953976Ki
nvidia.com/gpu: 10
pods: 110
System Info:
Machine ID: 8c7ba8a4853e4150b3832c1111435756
System UUID: fd6b44d5-dbf5-2129-7bb4-107c6162e23b
Boot ID: 632f9723-2665-4215-a29e-c00c3d3c8867
Kernel Version: 6.8.0-49-generic
OS Image: Ubuntu 22.04.5 LTS
Operating System: linux
Architecture: amd64
Container Runtime Version: containerd://1.7.17
Kubelet Version: v1.30.7
Kube-Proxy Version: v1.30.7
PodCIDR: 10.60.2.0/24
PodCIDRs: 10.60.2.0/24
Non-terminated Pods: (5 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age


calico-system calico-node-wj2q4 0 (0%) 0 (0%) 0 (0%) 0 (0%) 44h
calico-system csi-node-driver-xf4sq 0 (0%) 0 (0%) 0 (0%) 0 (0%) 44h
kube-system hami-device-plugin-n9k6g 0 (0%) 0 (0%) 0 (0%) 0 (0%) 31m
kube-system hami-scheduler-76599c4c74-fljjr 0 (0%) 0 (0%) 0 (0%) 0 (0%) 31m
kube-system kube-proxy-7cz8x 0 (0%) 0 (0%) 0 (0%) 0 (0%) 44h
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits


cpu 0 (0%) 0 (0%)
memory 0 (0%) 0 (0%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
nvidia.com/gpu 0 0
Events:

@lixd

lixd commented Dec 2, 2024

Judging from the error, the Pod is still being scheduled by the default-scheduler, while in theory it should be handled by hami-scheduler. The webhook's default failurePolicy is Ignore, so if the kube-apiserver's call to the webhook fails, the Pod silently falls back to the default scheduler; check the kube-apiserver and hami-scheduler logs. Another possibility is that the ResourceName configured in hami-scheduler does not match the ResourceName requested by the Pod.
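
One way to narrow this down is to pin a test Pod to the HAMi scheduler explicitly via spec.schedulerName. This is only a minimal sketch, not something from this thread, and it assumes the scheduler is registered under the name hami-scheduler (verify the actual name in your Helm values). If this Pod schedules correctly while the original one does not, the failing piece is the kube-apiserver → webhook redirection rather than the scheduler itself.

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod-sched-test          # hypothetical test Pod name
spec:
  schedulerName: hami-scheduler     # assumed scheduler name; bypasses the webhook-based redirection
  containers:
    - name: ubuntu-container
      image: ubuntu:18.04
      command: ["bash", "-c", "sleep 86400"]
      resources:
        limits:
          nvidia.com/gpu: 1         # same resource name the webhook is expected to match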

@lixd

lixd commented Dec 2, 2024

This looks similar to #653; you could refer to that issue.

@Radical-3
Author

OK, thanks a lot. I'll take a look.

@xiaoyao

xiaoyao commented Dec 2, 2024

See:
#590 (comment)

@Radical-3
Author

Found the problem: it seems I had installed nvidia-device-plugin before installing HAMi, so the two conflict with each other at scheduling time.
