node-installer job does not terminate properly #140
The install-pods of kwasm do not terminate with status `Completed`. In the case of systemd, this means that containerd receives a termination signal.
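For illustration, a minimal sketch of the "find the containerd PID and signal it" step discussed here, assuming the host's /proc is visible to the pod; the helper name and the choice of SIGTERM are my assumptions, not the project's exact code:

```shell
#!/bin/sh
# Hypothetical sketch of the current approach: locate containerd by scanning
# /proc (pgrep may be missing in minimal images) and send it a signal.
pid_of_comm() {
  for d in /proc/[0-9]*; do
    [ "$(cat "$d/comm" 2>/dev/null)" = "$1" ] && { basename "$d"; return 0; }
  done
  return 1
}

# pid="$(pid_of_comm containerd)" && kill -TERM "$pid"
# The problem described in this issue: restarting containerd this way also
# tears down the installer pod itself, so it exits 255 with status Unknown.
```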
Seeing similar behavior in the uninstall jobs (when deleting a shim). The first pod deletes the shim and restarts containerd, but ends with status `Unknown`.
Update here after testing the current rcm node-installer behavior on different distros.

| distro | exit code | pod status |
| --- | --- | --- |
| k3d | 255 | Unknown |
| k3s | 255 | Unknown |
| k0s | 255 | Unknown |
| rke2 | 255 | Unknown |
| kind | 255 | Unknown |
| microk8s | 0 | Completed |
| minikube | 255 | Unknown |
| aks | 255 | Unknown |

I first played around with different signals and slightly varied logic for the current "get containerd pid, send syscall to terminate" approach. I didn't land on a combo that solved the issue for the k3d/k3s/rke2 distros. I then took inspiration from the current version of the node installer script in the containerd-shim-spin project and went that route instead. This did solve the issue (the container exit code is 0 and thus the pod's status is `Completed`). What do we think of going this route? Regardless of the revised approach, I do like the idea of moving away from the current one-size-fits-all "get containerd pid, send syscall to terminate" and moving to distro-specific restarters.
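To make "distro-specific restarters" concrete, here is a hedged sketch of a dispatch table. The per-distro commands are my assumptions about how containerd is supervised on each distro (a snap service on microk8s, the k3s/rke2 systemd units elsewhere), not the project's actual design, and they would have to run in the host's namespaces (e.g. via nsenter):

```shell
#!/bin/sh
# Hypothetical distro-specific restarter dispatch. Distro names mirror the
# test matrix above; the restart commands are assumptions and would need to
# be executed on the host (e.g. via nsenter into PID 1's mount namespace).
restart_command_for() {
  case "$1" in
    microk8s) echo "snap restart microk8s.daemon-containerd" ;;
    k3s)      echo "systemctl restart k3s" ;;
    rke2)     echo "systemctl restart rke2-agent" ;;
    k3d|kind) echo "kill containerd; the container supervisor respawns it" ;;
    *)        echo "systemctl restart containerd" ;;
  esac
}
```

A table like this keeps the restart logic per distro in one place, so adding support for a new distro is a one-line change rather than another special case in signal-handling logic.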
+1. Moving to distro-specific restarters sounds like the most flexible approach to me.
It seems to be an acceptable workaround. I only wonder what
Let's do this!
As part of #68 I investigated an issue in the containerd restart routine. When the node-installer installs a runtime and restarts containerd, the corresponding pod terminates with status `Unknown`.
Overview:
Logs of Pod with status `Unknown`:

```
$ kubectl logs kwasm-worker-spin-v2-install-n82d9 -c downloader
2024-05-20T20:49:40 INFO start downloading shim from https://github.com/spinkube/containerd-shim-spin/releases/download/v0.14.1/containerd-shim-spin-v2-linux-aarch64.tar.gz...
2024-05-20T20:49:42 INFO download successful:
total 40M
drwxrwxrwx 1 root root    46 May 20 20:49 .
drwxr-xr-x 1 root root    48 May 20 20:49 ..
-rwxr-xr-x 1 1001  127 39.6M May  8 17:13 containerd-shim-spin-v2
```
Logs of Pod with status `Completed`:

```
$ kubectl logs kwasm-worker-spin-v2-install-rq78d -c downloader
2024-05-20T20:49:57 INFO start downloading shim from https://github.com/spinkube/containerd-shim-spin/releases/download/v0.14.1/containerd-shim-spin-v2-linux-aarch64.tar.gz...
2024-05-20T20:49:59 INFO download successful:
total 40M
drwxrwxrwx 1 root root    46 May 20 20:49 .
drwxr-xr-x 1 root root    48 May 20 20:49 ..
-rwxr-xr-x 1 1001  127 39.6M May  8 17:13 containerd-shim-spin-v2
```

```
$ kubectl logs kwasm-worker-spin-v2-install-rq78d -c provisioner
2024/05/20 20:50:00 INFO shim installed shim=spin-v2 path=/opt/kwasm/bin/containerd-shim-spin-v2 new-version=false
2024/05/20 20:50:00 INFO runtime config already exists, skipping runtime=spin-v2
2024/05/20 20:50:00 INFO shim configured shim=spin-v2 path=/etc/containerd/config.toml
2024/05/20 20:50:00 INFO nothing changed, nothing more to do
```
The `Completed` pod only gets scheduled in the first place because the first one did not terminate successfully, even though the actual job (rewriting the containerd config and removing the binary) is done. As a result, the second run of the job has nothing left to do.

Description of Pod with status `Unknown`:

```
$ kubectl describe po kwasm-worker-spin-v2-install-n82d9
```
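As the provisioner log above ("nothing changed, nothing more to do") indicates, the job is idempotent, which is why the rescheduled pod can complete cleanly. A minimal sketch of such an idempotency check; the function name and paths are illustrative assumptions, not the project's actual code:

```shell
#!/bin/sh
# Hedged sketch of an idempotent install step: only copy the shim binary if
# the staged file differs from what is already on the node.
install_if_changed() {
  staged="$1"; target="$2"
  if [ -f "$target" ] && cmp -s "$staged" "$target"; then
    echo "nothing changed, nothing more to do"
    return 0
  fi
  install -m 0755 "$staged" "$target"
  echo "shim installed"
}
```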
Entire resource of the Job (e.g. for reproduction of the bug):
While the goal of installing/uninstalling the shim is achieved, this is not desired behavior and calls for a solution.