vGPU installs successfully but CUDA cannot be used #662
Comments
Were you able to successfully create a pod that requests a GPU? After I deployed HAMi, creating the pod fails with an error, so I'd like to ask how you deployed it.
I can request a GPU pod normally; it's only when I use the GPU inside the pod that a CUDA error is reported. I installed from the master branch, so you can follow the master branch installation guide.
@su161021 Your program is trying to allocate 3171158016 bytes, but this pod is capped at 3145728000 bytes (3000 MiB). HAMi-core enforces that limit and returns OOM, so try increasing nvidia.com/gpumem a bit in your pod declaration.
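Not part of the thread, but a minimal pod-spec sketch of where that limit lives, using HAMi's resource names as discussed above; the pod name, image, command, and the 4000 MiB value are placeholders to adapt:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-test                  # placeholder name
spec:
  containers:
  - name: test
    image: pytorch/pytorch        # placeholder image
    command: ["python", "test_pytorch.py"]
    resources:
      limits:
        nvidia.com/gpu: 1         # one vGPU slice
        nvidia.com/gpumem: 4000   # device memory in MiB; raised above the 3000 MiB that triggered the OOM
        nvidia.com/gpucores: 30   # percent of SM compute, as used later in the thread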
Could I add you on WeChat? I'd like to ask you about the details @su161021. My WeChat ID is a2973051203
Thanks for the advice; after increasing it, it does work now. However, the compute limit does not seem to be enforced: with only one pod running and nvidia.com/gpucores: 30, the program inside the pod can still use 100% of the GPU's compute. Have you run into this?
I've added you.
python test_pytorch.py
[HAMI-core Msg(30:140387636202368:libvgpu.c:836)]: Initializing.....
[HAMI-core Msg(30:140387636202368:libvgpu.c:855)]: Initialized
[HAMI-core ERROR (pid:30 thread=140382094939712 allocator.c:53)]: Device 0 OOM 3171158016 / 3145728000
Traceback (most recent call last):
File "/data/test_pytorch.py", line 69, in
run_gpu_stress_test() # kick off the continuous GPU computation task
^^^^^^^^^^^^^^^^^^^^^
File "/data/test_pytorch.py", line 61, in run_gpu_stress_test
loss = train_one_epoch() # executed on every training pass
^^^^^^^^^^^^^^^^^
File "/data/test_pytorch.py", line 51, in train_one_epoch
loss.backward() # backpropagation
^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/_tensor.py", line 581, in backward
torch.autograd.backward(
File "/opt/conda/lib/python3.11/site-packages/torch/autograd/init.py", line 347, in backward
_engine_run_backward(
File "/opt/conda/lib/python3.11/site-packages/torch/autograd/graph.py", line 825, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: unrecognized error code
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Has anyone run into this problem? The plugins are all working normally and the pod starts fine; the same program runs without issues under docker run, but once it runs on vGPU it throws this error.
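As a quick sanity check (not from the original report), one can query how much device memory the vGPU pod actually exposes before allocating. This is a sketch assuming HAMi-core reports the capped total through the standard CUDA memory query that PyTorch surfaces as torch.cuda.mem_get_info():

import torch

def report_vgpu_memory(device: int = 0) -> None:
    # Free/total device memory in bytes, as seen from inside the pod.
    free_bytes, total_bytes = torch.cuda.mem_get_info(device)
    print(f"device {device}: free={free_bytes / 2**20:.0f} MiB, "
          f"total={total_bytes / 2**20:.0f} MiB")
    # With nvidia.com/gpumem: 3000, total should be close to 3145728000 bytes,
    # which is why the 3171158016-byte allocation in the traceback above fails.

if __name__ == "__main__":
    report_vgpu_memory()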