Please provide an in-depth description of the question you have:
What do you think about this question?:
Environment:
- Kubernetes 1.23.9 with the HAMi Helm chart
- Docker 18.09.0
Problem description: There are two identical K8S environments on the versions above. Environment A can run the model; in environment B the model hangs during the load phase, with no errors and nothing in the logs. The pod for hami-vgpu-device-plugin has been restarted, but the Helm release and hami-vgpu-scheduler have not.
Environment A runs as follows:
Environment B runs as follows:
Both environments use the same T4 GPUs. When testing directly on the T4 host in environment B with `docker run --runtime=nvidia`, the model runs fine. We now suspect something is wrong with HAMi in environment B, but the logs give no clues. Please help!
hami-vgpu-device-plugin logs from environment B:
Looking at the screenshots, I'm a bit confused. You describe two identical environments, both with T4s,
and the screenshot from environment A does indeed show a T4. But the screenshot from environment B shows a V100, and the DevicePlugin log from environment B looks like four A2 cards, so I'm not sure what's going on.
Sorry, the screenshots are hard to share, but the log contents are the same. In environment B we tested on A2, T4, and V100, and none of them work. Environment A runs on T4 and V100.
Please share the task YAML.
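For reference, a minimal task YAML that requests a HAMi vGPU usually looks something like the sketch below. This is only an illustration: the pod name, image, and memory value are placeholders, and the resource names assume HAMi's default `nvidia.com/gpu` / `nvidia.com/gpumem` keys, which can differ if the Helm chart was installed with custom resource names.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: vgpu-test            # hypothetical name
spec:
  containers:
    - name: cuda
      image: nvidia/cuda:11.6.2-base-ubuntu20.04   # placeholder image
      command: ["sleep", "infinity"]
      resources:
        limits:
          nvidia.com/gpu: 1          # number of vGPUs requested
          nvidia.com/gpumem: 3000    # device memory limit in MB (optional)
```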