bug: kruize: ImagePullBackOff error on test-2-nerc #804

Closed
kusumachalasani opened this issue Nov 7, 2024 · 27 comments

Assignees: schwesig
Labels: bug (Something isn't working), gpu, observability, openshift (This issue pertains to NERC OpenShift)

kusumachalasani commented Nov 7, 2024

  • kruize: nerc-ocp-test-2.nerc.mghpcc.org
  • wrk-5 (GPU node, NVIDIA-A100-SXM4-40GB) was restarted this morning
  • its previous IP was 192.168.50.98
  • it is now 192.168.50.93
  • but the cluster still expects it to be on .98
  • the coredns pod is in a CrashLoopBackOff error state
  • e.g. Error from server: Get "https://192.168.50.98:10250/containerLogs/openshift-kni-infra/coredns-wrk-5/coredns": dial tcp 192.168.50.98:10250: connect: no route to host
  • this leads, for example, to failures when starting a notebook in RHOAI:
2024-11-07T11:38:02.000Z [Normal] Back-off pulling image "image-registry.openshift-image-registry.svc:5000/redhat-ods-applications/minimal-gpu:2024.1"
2024-11-07T11:37:48.000Z [Warning] Error: ErrImagePull

Description

An ImagePullBackOff error is observed on test-2-nerc.

Different applications were tried, but the same error appears in each case; the affected workloads are all scheduled on the wrk-5 node.
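
A quick way to confirm the error and see where the pod landed (pod and namespace names below are placeholders, not from the original report):

oc get pod <failing-pod> -n <namespace> -o wide        # NODE column should show wrk-5
oc describe pod <failing-pod> -n <namespace>           # Events section shows the ErrImagePull / ImagePullBackOff details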


@bharathappali

I suspect the issue is with the network, as I cannot see logs of the pods running on the wrk-5 node:

[abharath@abharath-thinkpadt14sgen2i ~]$ oc logs -f nvidia-mig-manager-22t4c -n nvidia-gpu-operator
Defaulted container "nvidia-mig-manager" out of: nvidia-mig-manager, toolkit-validation (init)
Error from server: Get "https://192.168.50.98:10250/containerLogs/nvidia-gpu-operator/nvidia-mig-manager-22t4c/nvidia-mig-manager?follow=true": dial tcp 192.168.50.98:10250: connect: no route to host

schwesig self-assigned this Nov 7, 2024
schwesig added the bug (Something isn't working), openshift (This issue pertains to NERC OpenShift), observability, and gpu labels Nov 7, 2024

schwesig commented Nov 7, 2024

/CC @tssala23 @dystewart


schwesig commented Nov 7, 2024

  • Node Affected: wrk-5 on the test-2 Kruize cluster.

  • Issue: coredns pod on wrk-5 is in CrashLoopBackOff, causing DNS failures.

  • Current Status: coredns-wrk-5 shows 1/2 readiness, with 2155 restarts within 19 hours (a quick check for this is sketched after this list).

  • Network problems on wrk-5 prevent pulling container images and connecting to prometheus-k8s for metrics.

  • Prometheus connectivity error shows a Connection timed out on prometheus-k8s.openshift-monitoring.svc.cluster.local:9091.

  • Prior Issues: GPU allocation issues were noted on wrk-5 before the DNS problem.

  • Last Successful Run: The node last worked successfully before a recent restart (earlier today).

  • Possible Trigger: GPU allocation issues were initially reported, followed by a node restart, after which network and DNS issues began affecting image pulls and internal connectivity.
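
A minimal way to check the readiness and restart count mentioned above (namespace taken from the coredns events later in this thread):

oc get pod coredns-wrk-5 -n openshift-kni-infra -o wide     # READY and RESTARTS columns
oc get events -n openshift-kni-infra --field-selector involvedObject.name=coredns-wrk-5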


schwesig commented Nov 7, 2024

2024-11-07T11:38:15.000Z [Normal] Pulling image "image-registry.openshift-image-registry.svc:5000/redhat-ods-applications/minimal-gpu:2024.1"
2024-11-07T11:38:02.000Z [Warning] Error: ImagePullBackOff
2024-11-07T11:38:02.000Z [Normal] Back-off pulling image "image-registry.openshift-image-registry.svc:5000/redhat-ods-applications/minimal-gpu:2024.1"
2024-11-07T11:37:48.000Z [Warning] Error: ErrImagePull
2024-11-07T11:37:48.000Z [Warning] Failed to pull image "image-registry.openshift-image-registry.svc:5000/redhat-ods-applications/minimal-gpu:2024.1": rpc error: code = DeadlineExceeded desc = pinging container registry image-registry.openshift-image-registry.svc:5000: Get "https://image-registry.openshift-image-registry.svc:5000/v2/": dial tcp 172.30.21.204:5000: i/o timeout
2024-11-07T11:36:47.000Z [Normal] Started container oauth-proxy
2024-11-07T11:36:47.000Z [Normal] Created container oauth-proxy
2024-11-07T11:36:46.000Z [Normal] Successfully pulled image "registry.redhat.io/openshift4/ose-oauth-proxy@sha256:4bef31eb993feb6f1096b51b4876c65a6fb1f4401fee97fa4f4542b6b7c9bc46" in 16.491s (16.491s including waiting)
2024-11-07T11:36:29.000Z [Normal] Pulling image "registry.redhat.io/openshift4/ose-oauth-proxy@sha256:4bef31eb993feb6f1096b51b4876c65a6fb1f4401fee97fa4f4542b6b7c9bc46"
2024-11-07T11:35:29.000Z [Normal] Add eth0 [10.128.4.86/23] from ovn-kubernetes
2024-11-07T11:34:38.000Z [Normal] AttachVolume.Attach succeeded for volume "pvc-abea051c-53a6-43e8-8792-b8330bc9ea6d"
2024-11-07T11:34:37.806Z [Normal] Successfully assigned rhods-notebooks/jupyter-nb-schwesig-0 to wrk-5
Server requested


schwesig commented Nov 7, 2024

❯ oc debug node/wrk-5
Starting pod/wrk-5-debug ...
To use host binaries, run `chroot /host`
Pod IP: 192.168.50.98
If you don't see a command prompt, try pressing enter.

Removing debug pod ...
Error from server: error dialing backend: dial tcp 192.168.50.98:10250: connect: no route to host


schwesig commented Nov 7, 2024

❯ oc logs coredns-wrk-5
Defaulted container "coredns" out of: coredns, coredns-monitor, render-config-coredns (init)
Error from server: Get "https://192.168.50.98:10250/containerLogs/openshift-kni-infra/coredns-wrk-5/coredns": dial tcp 192.168.50.98:10250: connect: no route to host


schwesig commented Nov 7, 2024

❯ oc events -w
LAST SEEN                 TYPE      REASON       OBJECT              MESSAGE
84s (x12799 over 7d19h)   Warning   ProbeError   Pod/coredns-wrk-5   Liveness probe error: Get "http://192.168.50.98:18080/health": dial tcp 192.168.50.98:18080: connect: no route to host
body:
6m42s (x31518 over 7d19h)   Warning   BackOff      Pod/coredns-wrk-5   Back-off restarting failed container coredns in pod coredns-wrk-5_openshift-kni-infra(3377e52796270e7a373902b6aa0d1e78)
5m24s (x2233 over 7d19h)    Warning   Unhealthy    Pod/keepalived-wrk-5   Liveness probe failed:
155m (x14 over 7d19h)       Warning   Unhealthy    Pod/keepalived-wrk-5   Liveness probe failed: command timed out
^N1s (x31548 over 7d19h)      Warning   BackOff      Pod/coredns-wrk-5      Back-off restarting failed container coredns in pod coredns-wrk-5_openshift-kni-infra(3377e52796270e7a373902b6aa0d1e78)
0s                          Normal    Pulled       Pod/wrk-5-debug        Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9d6201c776053346ebce8f90c34797a7a7c05898008e17f3ba9673f5f14507b0" already present on machine
0s                          Normal    Created      Pod/wrk-5-debug        Created container container-00
0s                          Normal    Started      Pod/wrk-5-debug        Started container container-00
0s                          Normal    Killing      Pod/wrk-5-debug        Stopping container container-00
0s (x2234 over 7d19h)       Warning   Unhealthy    Pod/keepalived-wrk-5   Liveness probe failed:
0s (x2235 over 7d19h)       Warning   Unhealthy    Pod/keepalived-wrk-5   Liveness probe failed:


schwesig commented Nov 7, 2024

Trying to reach 192.168.50.98 from wrk-4 is not possible:

sh-5.1# arping 192.168.50.98
ARPING 192.168.50.98 from 192.168.50.149 br-ex
^CSent 4 probes (4 broadcast(s))
Received 0 response(s)
sh-5.1# arping 192.168.50.98
ARPING 192.168.50.98 from 192.168.50.149 br-ex
^CSent 4 probes (4 broadcast(s))
Received 0 response(s)
sh-5.1# ssh -v <user>@192.168.50.98
OpenSSH_8.7p1, OpenSSL 3.0.7 1 Nov 2022
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Reading configuration data /etc/ssh/ssh_config.d/50-redhat.conf
debug1: Reading configuration data /etc/crypto-policies/back-ends/openssh.config
debug1: configuration requests final Match pass
debug1: re-parsing configuration
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Reading configuration data /etc/ssh/ssh_config.d/50-redhat.conf
debug1: Reading configuration data /etc/crypto-policies/back-ends/openssh.config
debug1: Connecting to 192.168.50.98 [192.168.50.98] port 22.
debug1: connect to address 192.168.50.98 port 22: No route to host
ssh: connect to host 192.168.50.98 port 22: No route to host


schwesig commented Nov 7, 2024

sh-5.1# ip route
default via 192.168.50.1 dev br-ex proto dhcp src 192.168.50.149 metric 48 
default via 10.0.120.1 dev eno2 proto dhcp src 10.0.123.127 metric 102 
default via 10.85.0.10 dev vlan2076 proto dhcp src 10.85.2.117 metric 400 
10.0.120.0/22 dev eno2 proto kernel scope link src 10.0.123.127 metric 102 
10.30.9.0/24 via 10.85.0.1 dev vlan2076 proto dhcp src 10.85.2.117 metric 400 
10.85.0.0/22 dev vlan2076 proto kernel scope link src 10.85.2.117 metric 400 
10.128.0.0/14 via 10.129.2.1 dev ovn-k8s-mp0 
10.129.2.0/23 dev ovn-k8s-mp0 proto kernel scope link src 10.129.2.2 
10.247.236.0/25 via 10.0.120.1 dev eno2 proto dhcp src 10.0.123.127 metric 102 
10.255.116.0/23 via 10.0.120.1 dev eno2 proto dhcp src 10.0.123.127 metric 102 
140.247.236.0/25 via 10.0.120.1 dev eno2 proto dhcp src 10.0.123.127 metric 102 
169.254.169.0/29 dev br-ex proto kernel scope link src 169.254.169.2 
169.254.169.1 dev br-ex src 192.168.50.149 
169.254.169.3 via 10.129.2.1 dev ovn-k8s-mp0 
169.254.169.254 via 192.168.50.11 dev br-ex proto dhcp src 192.168.50.149 metric 48 
169.254.169.254 via 10.0.121.2 dev eno2 proto dhcp src 10.0.123.127 metric 102 
172.30.0.0/16 via 169.254.169.4 dev br-ex src 169.254.169.2 mtu 1400 
192.168.50.0/24 dev br-ex proto kernel scope link src 192.168.50.149 metric 48 


schwesig commented Nov 7, 2024

❯ oc debug node/ctl-1
Starting pod/ctl-1-debug ...
To use host binaries, run `chroot /host`
Pod IP: 192.168.50.114
If you don't see a command prompt, try pressing enter.
sh-5.1# chroot /host
sh-5.1# ping 192.168.50.98 
PING 192.168.50.98 (192.168.50.98) 56(84) bytes of data.
From 192.168.50.114 icmp_seq=1 Destination Host Unreachable
From 192.168.50.114 icmp_seq=2 Destination Host Unreachable
From 192.168.50.114 icmp_seq=3 Destination Host Unreachable
From 192.168.50.114 icmp_seq=4 Destination Host Unreachable
From 192.168.50.114 icmp_seq=5 Destination Host Unreachable
From 192.168.50.114 icmp_seq=6 Destination Host Unreachable
From 192.168.50.114 icmp_seq=7 Destination Host Unreachable
From 192.168.50.114 icmp_seq=8 Destination Host Unreachable
From 192.168.50.114 icmp_seq=12 Destination Host Unreachable
From 192.168.50.114 icmp_seq=15 Destination Host Unreachable
From 192.168.50.114 icmp_seq=16 Destination Host Unreachable
From 192.168.50.114 icmp_seq=18 Destination Host Unreachable
From 192.168.50.114 icmp_seq=21 Destination Host Unreachable
From 192.168.50.114 icmp_seq=22 Destination Host Unreachable
From 192.168.50.114 icmp_seq=24 Destination Host Unreachable
From 192.168.50.114 icmp_seq=25 Destination Host Unreachable
From 192.168.50.114 icmp_seq=26 Destination Host Unreachable
From 192.168.50.114 icmp_seq=27 Destination Host Unreachable
From 192.168.50.114 icmp_seq=28 Destination Host Unreachable
From 192.168.50.114 icmp_seq=29 Destination Host Unreachable
From 192.168.50.114 icmp_seq=30 Destination Host Unreachable
From 192.168.50.114 icmp_seq=31 Destination Host Unreachable
^C
--- 192.168.50.98 ping statistics ---
32 packets transmitted, 0 received, +22 errors, 100% packet loss, time 31749ms
pipe 4


schwesig commented Nov 7, 2024

The node object reports wrk-5 at 192.168.50.98, but its actual address is 192.168.50.93:


schwesig commented Nov 7, 2024

 oc describe node/wrk-5 | grep .93
                    k8s.ovn.org/host-cidrs: ["10.0.120.39/22","10.85.3.145/22","192.168.50.93/24"]
                      {"default":{"mode":"local","interface-id":"br-ex_wrk-5","mac-address":"08:8f:c3:a6:03:8e","ip-addresses":["192.168.50.93/24"],"ip-address"...
                    k8s.ovn.org/node-primary-ifaddr: {"ipv4":"192.168.50.93/24"}
  System UUID:                                 0d3eb5fe-aba0-11ee-baa4-0a8fc3a60393

vs

❯ oc get node -o wide wrk-5
NAME    STATUS   ROLES    AGE   VERSION           INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                                                KERNEL-VERSION                 CONTAINER-RUNTIME
wrk-5   Ready    worker   8d    v1.29.5+29c95f3   192.168.50.98   <none>        Red Hat Enterprise Linux CoreOS 416.94.202406172220-0   5.14.0-427.22.1.el9_4.x86_64   cri-o://1.29.5-5.rhaos4.16.git7032128.el9


schwesig commented Nov 8, 2024

Just tagging for easier finding
/CC @jtriley @larsks @tzumainn


tzumainn commented Nov 8, 2024

I'm actually not sure where the 192.168.50.98 came from in the first place. The Neutron port associated with MOC-R8PAC23U39 has 192.168.50.93 as its IP address; it was created on October 29th and hasn't been updated since, so I don't think it's been modified. And I don't see any port in the inventory that has the 192.168.50.98 IP. So I'm guessing that configuration came from outside of ESI?

In any case - I'm not that familiar with OpenShift configuration, but is it possible to just update the worker IP? Failing that, I could update the IP address of the port in ESI (and maybe reboot the machine). Let me know!
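
For reference, the port assignment can be confirmed on the ESI/OpenStack side with the standard CLI (the port identifier below is a placeholder, not from the original comment):

openstack port show <port-id-for-MOC-R8PAC23U39> -c fixed_ips -c updated_at
openstack port list --fixed-ip ip-address=192.168.50.98     # should return nothing if no port owns .98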


larsks commented Nov 8, 2024

@schwesig I rebooted wrk-5 this morning, which caused it to re-register with the cluster. This resulted in a small number of pending certificate signing requests:

$ k get csr
NAME        AGE    SIGNERNAME                            REQUESTOR               REQUESTEDDURATION   CONDITION
csr-9zbkw   45m    kubernetes.io/kube-apiserver-client   system:multus:ctl-0     24h                 Approved,Issued
csr-btxhn   72m    kubernetes.io/kube-apiserver-client   system:multus:wrk-3     24h                 Approved,Issued
csr-dhxvc   102m   kubernetes.io/kube-apiserver-client   system:ovn-node:wrk-1   24h                 Approved,Issued
csr-l6l2g   40m    kubernetes.io/kube-apiserver-client   system:ovn-node:ctl-2   24h                 Approved,Issued
csr-q6jxs   8s     kubernetes.io/kubelet-serving         system:node:wrk-5       <none>              Pending
csr-rp8wg   15m    kubernetes.io/kubelet-serving         system:node:wrk-5       <none>              Pending
csr-tplw4   16m    kubernetes.io/kube-apiserver-client   system:multus:wrk-0     24h                 Approved,Issued
csr-v592h   30m    kubernetes.io/kubelet-serving         system:node:wrk-5       <none>              Pending

After approving these requests (oc adm certificate approve ...), the node seems healthy and I am able to successfully schedule pods on the node.
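
For anyone hitting the same state: the pending CSRs can be approved in one pass with the standard pattern from the OpenShift docs (adjust the filter if you only want specific nodes):

oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs --no-run-if-empty oc adm certificate approve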


larsks commented Nov 8, 2024

There appears to be some issue with PV access on wrk-5. While the pods start and volumes mount successfully, actually writing to those volumes seems to block indefinitely.

@hpdempsey

Should we replace wrk-5 with a different server?


larsks commented Nov 8, 2024

@hpdempsey I don't think we have a hardware problem, but if someone else wants to give that a shot they should feel free to have at it. Simply removing the node from the cluster and re-adding it should accomplish the same thing. Since this is a test environment, it seems like a good opportunity to figure out what's going on so that we understand better next time.

From my perspective it looks more like a networking issue.
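
A rough sketch of the remove/re-add flow for a node like this (exact steps depend on how the node was originally joined; treat it as a starting point rather than a recipe):

oc adm cordon wrk-5
oc adm drain wrk-5 --ignore-daemonsets --delete-emptydir-data --force
oc delete node wrk-5
# once the host's kubelet re-registers with the cluster, approve its new CSRs
oc get csr | grep Pending
oc adm certificate approve <csr-name>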


larsks commented Nov 8, 2024

The node is logging this every few seconds:

Nov 08 23:08:56 wrk-5 kernel: rbd: rbd0: encountered watch error: -107
Nov 08 23:10:59 wrk-5 kernel: rbd: rbd1: encountered watch error: -107
Nov 08 23:11:02 wrk-5 kernel: rbd: rbd0: encountered watch error: -107
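
Error -107 is ENOTCONN, i.e. the RBD client's watch connection to the Ceph cluster was dropped, which matches writes to the mounted volumes blocking. One way to inspect the image watchers, assuming an ODF/rook-ceph deployment with the toolbox pod enabled (pool and image names are illustrative):

oc rsh -n openshift-storage deploy/rook-ceph-tools
rbd status <pool>/<image-name>     # lists current watchers for the RBD image
ceph health detail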

@tssala23

@larsks reading through the thread, it seems like the easiest solution would be to remove the node from the cluster and re-add it. @schwesig, am I clear to do that now?

schwesig changed the title from "ImagePullBackOff error on test-2-nerc" to "bug: ImagePullBackOff error on test-2-nerc" Nov 11, 2024
schwesig changed the title from "bug: ImagePullBackOff error on test-2-nerc" to "bug: kruize: ImagePullBackOff error on test-2-nerc" Nov 11, 2024
@schwesig

@tssala23, yes please, proceed.

@tssala23

@schwesig I have removed the node and added it back; you can check whether the problem is still happening.
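
A quick post-fix check, reusing the commands from earlier in this thread (the expected IP is the one reported by ESI):

oc get node wrk-5 -o wide                        # INTERNAL-IP should now be 192.168.50.93
oc get pod coredns-wrk-5 -n openshift-kni-infra  # should show 2/2 Running without new restarts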

@schwesig

@tssala23 thanks, will check.

@schwesig

I was able to start a new notebook.
kruize is informed and will do their tests.

@schwesig

I can close this issue. The coredns problem is solved and the node has the correct IP again.
