Problems currently encountered when using this project #219

Open
freelizhun opened this issue Nov 16, 2023 · 0 comments
@freelizhun
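  • A node has 2 GPUs, one with 10G of memory and the other with 20G. For the 20G GPU, gpushare-device-plugin may virtualize only 10 device IDs, so that GPU's gpu-mem is reported as only 10G and the node's total gpu-mem is only 20G.
  • In the gpushare-device-plugin Allocate operation, kubelet only ever sends the device IDs requested by a single container, while gpushare-device-plugin matches against the device IDs of all containers in the pod. The plugin may therefore fail to find the pod, and the container fails to start, for example when the pod has 2 containers.
  • The annotation pod.annotations: ALIYUN_COM_GPU_MEM_IDX: 0 can only ever hold a single GPU id. If a pod requests gpu-mem: 12, which exceeds the 10G of a single GPU, the pod cannot be scheduled at all.
  • gpushare-scheduler-extender is invoked by default-scheduler over HTTP, which hurts scheduler performance.
  • Although a pod can declare gpu-mem: 2, there is no actual limit at the GPU level on how much memory the pod uses. The value only logically limits how many pods can run on a GPU, similar to time-slicing in the NVIDIA k8s-device-plugin.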
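
The arithmetic behind the first and third points can be sketched in a few lines. This is a hypothetical model only: the function names are invented for illustration, and the assumption that every GPU is sized like the first one is a guess at how the reported 20G total could arise, not something taken from the gpushare-device-plugin source.

```python
# Hypothetical model of the accounting described in the list above.
# None of these names come from gpushare-device-plugin.

def advertised_uniform(gpu_mem):
    """Advertise the same number of memory units for every GPU,
    sized by the first GPU (ASSUMED behavior, for illustration)."""
    return [gpu_mem[0]] * len(gpu_mem)

def advertised_per_gpu(gpu_mem):
    """Advertise each GPU with its own memory size."""
    return list(gpu_mem)

def pick_single_gpu(free_units, request):
    """Return the index of the first GPU that can hold the whole
    request, or None. Models an annotation that stores one GPU index."""
    for idx, free in enumerate(free_units):
        if free >= request:
            return idx
    return None

node = [10, 20]                      # one 10G GPU and one 20G GPU
uniform = advertised_uniform(node)   # [10, 10]
actual = advertised_per_gpu(node)    # [10, 20]

print(sum(uniform), sum(actual))     # 20 30
print(pick_single_gpu(uniform, 12))  # None
print(pick_single_gpu(actual, 12))   # 1
print(pick_single_gpu(actual, 25))   # None
```

Under this model, a gpu-mem: 12 request fits nowhere when both GPUs are treated as 10G, and a request larger than the biggest single GPU never fits because only one GPU index can be recorded, even when the node as a whole has enough memory.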
@freelizhun changed the title from "The problems found so far while using this project are as follows" to "Problems currently encountered when using this project" Nov 16, 2023