Running HAMi 2.4.1, pod fails with UnexpectedAdmissionError #653

Open
Jackson8888 opened this issue Nov 28, 2024 · 8 comments

Comments

@Jackson8888

Jackson8888 commented Nov 28, 2024

Please provide an in-depth description of the question you have:
I am currently running HAMi 2.4.1. The HAMi pods are running normally, but calls to the webhook fail:
E1128 00:14:10.824408 1 dispatcher.go:170] failed calling webhook "vgpu.hami.io": Post https://hami-scheduler.kube-system.svc:443/webhook?timeout=10s: dial tcp 10.111.53.147:443: connect: no route to host

Pod status error:
Warning UnexpectedAdmissionError 12m kubelet, k8s-master Allocate failed due to rpc error: code = Unknown desc = no binding pod found on node k8s-master, which is unexpected

Error message

Environment:

  • HAMi version: 2.4.1
  • Kubernetes version: 1.18
  • Others:
  • docker/containerd/cri-o: nvidia is configured as the default runtime
  • glibc = 2.17
  • kernel version = 3.10
@archlitchi
Collaborator

Are hami-scheduler and hami-device-plugin both in the Running state?

@Jackson8888
Author

image
hami-scheduler and hami-device-plugin are both in the Running state.

@Nimbus318
Contributor

The situation you described might be similar to Issue #590. I’ve provided some troubleshooting ideas there, which you can refer to for guidance.

@Jackson8888
Author

It is not quite the same. Checking HAMi's status through helm shows no errors.
image

@Jackson8888
Author

Adding the Services (svc) in kube-system:
Uploading image.png…

Error:
E1128 05:48:09.368619 1 dispatcher.go:170] failed calling webhook "vgpu.hami.io": Post https://hami-scheduler.kube-system.svc:443/webhook?timeout=10s: dial tcp 10.100.73.42:443: connect: connection refused
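
The two dispatcher errors posted in this thread differ in both the target IP and the failure reason. The sketch below pulls those fields out of the log lines so they can be compared side by side. The function name `parse_dial_error` and the regular expression are illustrative assumptions, not part of HAMi or Kubernetes.

```python
import re

# Matches the "dial tcp <ip>:<port>: connect: <reason>" tail of the log line.
DIAL_RE = re.compile(
    r"dial tcp (?P<ip>\d+\.\d+\.\d+\.\d+):(?P<port>\d+): connect: (?P<reason>.+)$"
)


def parse_dial_error(line):
    """Return (ip, port, reason) for a webhook dial failure, or None if absent."""
    match = DIAL_RE.search(line.strip())
    if match is None:
        return None
    return match.group("ip"), int(match.group("port")), match.group("reason")


# The two log lines quoted in this issue.
first = (
    'E1128 00:14:10.824408 1 dispatcher.go:170] failed calling webhook '
    '"vgpu.hami.io": Post https://hami-scheduler.kube-system.svc:443/webhook'
    '?timeout=10s: dial tcp 10.111.53.147:443: connect: no route to host'
)
second = (
    'E1128 05:48:09.368619 1 dispatcher.go:170] failed calling webhook '
    '"vgpu.hami.io": Post https://hami-scheduler.kube-system.svc:443/webhook'
    '?timeout=10s: dial tcp 10.100.73.42:443: connect: connection refused'
)

for line in (first, second):
    print(parse_dial_error(line))
```

In general, "no route to host" means the packet could not be delivered at the network level, while "connection refused" means the address was reached but nothing accepted the connection on that port. The different IPs suggest the Service was recreated between the two attempts, though the thread does not confirm that.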

@Nimbus318
Contributor

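From the information so far, all we can tell is that the APIServer cannot reach the webhook served by the scheduler, even though the scheduler itself looks healthy (you could also check the hami-scheduler logs for anything suspicious). A similar case I debugged before turned out to be a network problem, but there are many possible directions and branches to investigate. As a first step, try creating a Pod that has the curl command and run curl -v https://hami-scheduler.kube-system.svc:443/webhook from it.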
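
If no image with curl is at hand, a plain TCP connect from any Pod with Python gives a rougher version of the same check. This is a minimal sketch: `probe_tcp` and its outcome labels are hypothetical names chosen for illustration, and it only tests whether the port accepts a connection, not the TLS handshake or the webhook response that `curl -v` would show.

```python
import errno
import socket


def probe_tcp(host, port, timeout=3.0):
    """Attempt a plain TCP connect and return a short label for the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Host was reached, but nothing accepted the connection on this port.
        return "connection refused"
    except socket.timeout:
        return "timeout"
    except socket.gaierror:
        # The name could not be resolved (for example, cluster DNS problems).
        return "dns failure"
    except OSError as exc:
        if exc.errno in (errno.EHOSTUNREACH, errno.ENETUNREACH):
            return "no route to host"
        return f"error: {exc}"


# Example call, using the Service name and port from the error message:
# print(probe_tcp("hami-scheduler.kube-system.svc", 443))
```

Running it once from a normal Pod and once from the node that hosts the APIServer can help narrow down whether the failure is specific to the control-plane network path.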

@Jackson8888
Author

Adding the Services (svc) in kube-system:
image

Error:
E1128 05:48:09.368619 1 dispatcher.go:170] failed calling webhook "vgpu.hami.io": Post https://hami-scheduler.kube-system.svc:443/webhook?timeout=10s: dial tcp 10.100.73.42:443: connect: connection refused

@autherlj

I ran into this before as well, on v2.3.12.
My workaround was to restart the scheduler and device-plugin; sometimes I also had to restart docker first and then kubelet.
