Does HAMi have a low-probability anomaly? GPU memory cannot be obtained properly #648

Open
autherlj opened this issue Nov 27, 2024 · 3 comments
Comments

@autherlj

Please provide an in-depth description of the question you have:

What do you think about this question?:

Environment: K8s 1.23.9 with HAMi installed via Helm

  • HAMi version: v2.3.12
  • Kubernetes version: v1.23.9
  • Others: CUDA 12.2, NVIDIA driver 535.154.05
  • Docker version: 18.09.0

Problem description:
There are two K8S environments with exactly the same versions as listed above.
The model runs fine in environment A, but in environment B it gets stuck during the load phase with no errors or logs at all. I restarted the hami-vgpu-device-plugin pod, but the Helm release and hami-vgpu-scheduler have never been restarted.
Running state of environment A:
[screenshot]
Running state of environment B:
[screenshot]
With the same T4 card, I also tested directly on the T4 host in environment B using docker run --runtime=nvidia, and the model runs fine that way (roughly the kind of host-level check sketched below).
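For reference, this is roughly the host-level check I mean; the CUDA image tag is just an example, not the actual model image:

```bash
# Sanity-check that the NVIDIA runtime and driver work outside Kubernetes.
# Docker 18.09 predates the --gpus flag, so --runtime=nvidia is used here.
# The image tag below is an example; any CUDA 12.2-compatible image will do.
docker run --rm --runtime=nvidia \
  -e NVIDIA_VISIBLE_DEVICES=all \
  nvidia/cuda:12.2.0-base-ubuntu22.04 \
  nvidia-smi
```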
I now suspect that HAMi in environment B is misbehaving, but the logs give no clues. Please help!
hami-vgpu-device-plugin logs from environment B:
[screenshot]
[screenshot]
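For context, this is roughly where I have been looking so far; the namespace, pod names, and node name are placeholders and assume a default Helm install:

```bash
# Locate the HAMi pods (a default Helm install typically puts them in kube-system).
kubectl get pods -n kube-system -o wide | grep hami

# Device-plugin logs on the problem node.
kubectl logs -n kube-system <hami-device-plugin-pod> --all-containers --tail=200

# Scheduler logs, to see whether the stuck pod was filtered/scored at all.
kubectl logs -n kube-system <hami-scheduler-pod> --all-containers --tail=200

# GPU registration info that the device plugin writes onto the node object.
kubectl describe node <node-name> | grep -i -A2 hami
```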

@Nimbus318
Contributor

Looking at the screenshots, I'm a bit confused. Your description says the two environments are identical, both with T4s.

But the screenshot from environment A does show a T4, while the screenshot from environment B shows a V100, and the DevicePlugin logs from environment B look like four A2 cards, so I haven't quite figured out what's going on.

@autherlj
Author

> Looking at the screenshots, I'm a bit confused. Your description says the two environments are identical, both with T4s.
>
> But the screenshot from environment A does show a T4, while the screenshot from environment B shows a V100, and the DevicePlugin logs from environment B look like four A2 cards, so I haven't quite figured out what's going on.

Sorry, the screenshots are hard to get out, but the log content is the same. In environment B we tested A2, T4, and V100, and none of them work. Environment A, by contrast, runs T4 and V100.

@archlitchi
Collaborator

Please share the task YAML. For comparison, a minimal vGPU test pod like the sketch below would also help isolate whether allocation works at all on environment B.
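This is only a sketch; the resource names are HAMi's Helm-chart defaults and may differ if they were overridden in your values, and the image/command are placeholders rather than your actual workload:

```bash
# Minimal HAMi vGPU test pod (placeholder image and command, not the real model).
# nvidia.com/gpu and nvidia.com/gpumem are HAMi's default resource names;
# adjust them if your Helm values rename the resources.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: vgpu-test
spec:
  containers:
  - name: cuda
    image: nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: "1"        # number of vGPUs requested
        nvidia.com/gpumem: "4000"  # device memory in MB per vGPU
EOF
```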
