Pods could be created normally on the first deployment; after deleting and re-creating the same Pod several times, scheduling starts to fail.
Command used:
kubectl -n test-testgpu get event
LAST SEEN TYPE REASON OBJECT MESSAGE
3m4s Warning FailedScheduling pod/binpack-3-7b8684575d-cqntk 0/1 nodes are available: 1 Insufficient GPU Memory in one device.
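A quick way to confirm the scheduler-side view is to count these failures in the event stream (the gpushare project also ships a `kubectl inspect gpushare` plugin for a per-device allocation summary). A minimal sketch of the event filter, run here against a captured sample line so the pipeline itself can be checked offline:

```shell
# Count FailedScheduling events caused by gpushare memory exhaustion.
# Against a live cluster, replace the printf with:
#   kubectl -n test-testgpu get event
sample='3m4s  Warning  FailedScheduling  pod/binpack-3-7b8684575d-cqntk  0/1 nodes are available: 1 Insufficient GPU Memory in one device.'
printf '%s\n' "$sample" | grep -c 'Insufficient GPU Memory'   # prints 1
```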
Command used:
nvidia-smi
Wed Feb 15 14:57:18 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.85.02 Driver Version: 510.85.02 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A40-4Q On | 00000000:02:00.0 Off | 0 |
| N/A N/A P0 N/A / N/A | 0MiB / 4096MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
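Note that nvidia-smi reports actual device usage (0MiB here), while the extender does its own bookkeeping against the node's aliyun.com/gpu-mem extended resource at scheduling time, so the two can disagree when the extender's cache holds stale allocations. A sketch of the comparison; the node name is a placeholder and the numeric values are illustrative, chosen to match the 4GiB A40-4Q profile shown above:

```shell
# On a live cluster (node name is a placeholder):
#   allocatable=$(kubectl get node <gpu-node> \
#     -o jsonpath='{.status.allocatable.aliyun\.com/gpu-mem}')
#   requested = sum of aliyun.com/gpu-mem requests of pods the extender
#   still counts against the node (including any stale entries).
# Demonstrated here with illustrative values: a 4GiB card where the
# extender still counts all 4 units as allocated:
allocatable=4
requested=4
echo $((allocatable - requested))   # prints 0 -> "Insufficient GPU Memory"
```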
Command used:
kubectl -n kube-system logs -f gpushare-schd-extender-594b9bc6d6-lh8w9
[ debug ] 2023/02/15 06:58:39 routes.go:162: /gpushare-scheduler/filter response=&{0xc42047e1e0 0xc420548300 0xc420355b80 0x565b70 true false false false 0xc4200aa580 {0xc42037a1c0 map[Content-Type:[application/json]] false false} map[Content-Type:[application/json]] true 111 -1 200 false false [] 0 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [0 0 0 0 0 0 0 0 0 0] [0 0 0] 0xc420348070 0}
[ debug ] 2023/02/15 06:58:58 controller.go:295: No need to update pod name binpack-3-7b8684575d-9ksk4 in ns test-testgpu and old status is Pending, new status is Pending; its old annotation map[ovn.kubernetes.io/logical_switch:ovn-default ovn.kubernetes.io/mac_address:00:00:00:B9:31:56 ovn.kubernetes.io/network_type:geneve ovn.kubernetes.io/pod_nic_type:veth-pair ovn.kubernetes.io/gateway:10.183.0.1 ovn.kubernetes.io/logical_router:ovn-cluster kubernetes.io/psp:20-user-restricted ovn.kubernetes.io/allocated:true ovn.kubernetes.io/cidr:10.183.0.0/16 ovn.kubernetes.io/ip_address:10.183.0.113 cpaas.io/creator:admin kubernetes.io/limit-ranger:LimitRanger plugin set: cpu, memory request for container binpack-3; cpu, memory limit for container binpack-3] and new annotation map[ovn.kubernetes.io/logical_router:ovn-cluster ovn.kubernetes.io/logical_switch:ovn-default ovn.kubernetes.io/mac_address:00:00:00:B9:31:56 ovn.kubernetes.io/network_type:geneve ovn.kubernetes.io/pod_nic_type:veth-pair ovn.kubernetes.io/gateway:10.183.0.1 kubernetes.io/limit-ranger:LimitRanger plugin set: cpu, memory request for container binpack-3; cpu, memory limit for container binpack-3 kubernetes.io/psp:20-user-restricted ovn.kubernetes.io/allocated:true ovn.kubernetes.io/cidr:10.183.0.0/16 ovn.kubernetes.io/ip_address:10.183.0.113 cpaas.io/creator:admin]
[ debug ] 2023/02/15 06:59:28 controller.go:295: No need to update pod name binpack-3-7b8684575d-9ksk4 in ns test-testgpu and old status is Pending, new status is Pending; its old annotation map[ovn.kubernetes.io/logical_router:ovn-cluster ovn.kubernetes.io/logical_switch:ovn-default ovn.kubernetes.io/mac_address:00:00:00:B9:31:56 ovn.kubernetes.io/network_type:geneve ovn.kubernetes.io/pod_nic_type:veth-pair ovn.kubernetes.io/gateway:10.183.0.1 kubernetes.io/limit-ranger:LimitRanger plugin set: cpu, memory request for container binpack-3; cpu, memory limit for container binpack-3 kubernetes.io/psp:20-user-restricted ovn.kubernetes.io/allocated:true ovn.kubernetes.io/cidr:10.183.0.0/16 ovn.kubernetes.io/ip_address:10.183.0.113 cpaas.io/creator:admin] and new annotation map[ovn.kubernetes.io/gateway:10.183.0.1 ovn.kubernetes.io/logical_router:ovn-cluster ovn.kubernetes.io/logical_switch:ovn-default ovn.kubernetes.io/mac_address:00:00:00:B9:31:56 ovn.kubernetes.io/network_type:geneve ovn.kubernetes.io/pod_nic_type:veth-pair cpaas.io/creator:admin kubernetes.io/limit-ranger:LimitRanger plugin set: cpu, memory request for container binpack-3; cpu, memory limit for container binpack-3 kubernetes.io/psp:20-user-restricted ovn.kubernetes.io/allocated:true ovn.kubernetes.io/cidr:10.183.0.0/16 ovn.kubernetes.io/ip_address:10.183.0.113]
After re-creating the gpushare-schd-extender pod, Pods could be scheduled normally again, but after a few more delete/create cycles the Pod goes back to Pending.
I have not yet found the root cause.
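Before restarting the extender, it may be worth checking whether a deleted Pod's GPU-memory annotation is still visible to it: the extender rebuilds its allocation cache from pod annotations such as ALIYUN_COM_GPU_MEM_ASSIGNED (annotation name taken from the gpushare-scheduler-extender source; treat it as an assumption). A sketch, run here against a captured pod listing so the filter itself is checkable offline:

```shell
# Print namespace/name of pods that still carry a gpushare assignment.
# Against a live cluster, replace the printf with:
#   kubectl get pods -A -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,ASSIGNED:.metadata.annotations.ALIYUN_COM_GPU_MEM_ASSIGNED'
sample='test-testgpu  binpack-3-7b8684575d-cqntk  true
kube-system   coredns-abc                 <none>'
printf '%s\n' "$sample" | awk '$3 == "true" { print $1 "/" $2 }'
```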