Problems currently encountered when using this project #219

freelizhun · 2023-11-16T01:12:55Z

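A node has 2 GPUs, one with 10G of memory and the other with 20G. gpushare-device-plugin may create only 10 virtual device IDs for the 20G GPU, so that GPU's gpu-mem is only 10G and the node's total gpu-mem is only 20G.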
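In the gpushare-device-plugin Allocate operation, kubelet only ever sends the device IDs requested by a single container, but gpushare-device-plugin looks up the pod by comparing against the device IDs of all containers in the pod. The pod may therefore not be found and the container fails to start, for example when the pod has 2 containers.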
pod.annotations: ALIYUN_COM_GPU_MEM_IDX: 0 can only ever hold a single GPU ID. If a pod's gpu-mem: 12 exceeds a single GPU's 10G, the pod cannot be scheduled at all.
gpushare-scheduler-extender is invoked by default-scheduler over HTTP, which has a negative impact on scheduler performance.
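Although a pod can declare gpu-mem: 2, there is no actual limit at the GPU level on how much memory the pod uses. The setting only logically limits the number of pods that can run on a GPU, similar to time-slicing in nvidia k8s-device-plugin.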
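
To make the first and third points concrete, here is a minimal Python sketch of the arithmetic. It is not the plugin's code: `advertised_units` and `pick_gpu` are hypothetical names, and the sketch assumes the plugin sizes every GPU by the memory of the first GPU it sees, which is consistent with the numbers above but is only a guess at the cause.

```python
from typing import List, Optional


def advertised_units(gpu_mem: List[int]) -> List[int]:
    """Hypothetical model: every GPU is sized by the first GPU's memory.

    ASSUMPTION: this is a guess at the behavior described in the issue,
    not the plugin's actual implementation.
    """
    if not gpu_mem:
        return []
    per_gpu = gpu_mem[0]
    return [per_gpu for _ in gpu_mem]


def pick_gpu(request: int, free: List[int]) -> Optional[int]:
    """Return the index of the first single GPU that can hold the whole
    request, or None. Models an annotation that stores only one GPU index."""
    for idx, available in enumerate(free):
        if available >= request:
            return idx
    return None


if __name__ == "__main__":
    real = [10, 20]                  # actual memory per GPU, in G
    seen = advertised_units(real)    # what the node would advertise
    print(seen, sum(seen))           # [10, 10] 20
    print(pick_gpu(12, seen))        # None: no single GPU appears to have 12G
    print(pick_gpu(12, real))        # 1: the 20G GPU would fit it
```

Under this assumption, a gpu-mem: 12 pod is rejected even though the 20G GPU could hold it, because placement is limited to one GPU index and that GPU is advertised as 10G.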

freelizhun changed the title ~~The following problems have been found while using this project~~ Problems currently encountered when using this project Nov 16, 2023
