
"exec format error" with cilium set as default CNI in all.yml if you are installing to an ARM cluster #130

Open
kcalmond opened this issue Sep 11, 2021 · 7 comments
Labels: bug

Comments

@kcalmond

kcalmond commented Sep 11, 2021


What steps did you take and what happened:

Built a cluster per the instructions on Raspberry Pi 4 nodes, using boot images based on the latest Ubuntu 20.04.

What did you expect to happen:

The CNI should start up using images built for the ARM architecture.

Anything else you would like to add:

cilium.yml runs a kubectl apply against a manifest that references an image incompatible with ARM.
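For anyone hitting this, a rough way to confirm the architecture mismatch before applying a manifest (a hedged sketch; the image name/tag below is only illustrative, not taken from the repo's cilium.yml):

❯ docker manifest inspect cilium/cilium:v1.9.9 | grep '"architecture"'
# If no arm64 entry appears in the manifest list, containerd on a Pi will fail
# to run the image with "exec format error".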


kcalmond added the bug label on Sep 11, 2021
@kcalmond
Author

Looks like Cilium 1.10 supports ARM.
Also, the quick installation process changed from 1.9 to 1.10: the 1.10 docs no longer provide a quick-install.yml file, and the quick install now goes through a new installer.
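For reference, a minimal sketch of what the 1.10-style quick install looks like with the new cilium CLI, assuming the CLI is available on the machine that runs kubectl (exact flags/version pinning may differ):

❯ cilium install
❯ cilium status --wait
# Cilium 1.10 images are published multi-arch, so arm64 nodes should pull a
# compatible variant instead of hitting "exec format error".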

@xunholy
Member

xunholy commented Sep 11, 2021

Yep, thanks for raising this. I did run into this issue myself and had it on my list of TODOs.

@kcalmond
Author

kcalmond commented Sep 12, 2021

@xunholy - re ^^ I'm trying to figure out how to successfully build a cluster with calico as the CNI instead of cilium (cilium appears to be the default, which as of right now makes the defaults broken for Pi clusters).

After fishing around I changed the value from cilium to calico in each of these locations (see the search sketch after the list):

  • group_vars/all.yml: cni_plugin: 'cilium'
  • group_vars/controlplane.yml: cni_plugin: cilium
  • roles/cni/defaults/main.yml: cni_plugin: cilium
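A quick way to find every place the default is set, assuming you run it from the repo's ansible directory (paths taken from the list above):

❯ grep -rn "cni_plugin" group_vars roles/cni/defaults
# Change each hit from cilium to calico before re-running the playbook.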

I ran nuke.yml, then changed cilium to calico in the above locations, then ran all.yml again. Everything looked good until the end, where I got a fatal FAIL message:

<lots of output...>

TASK [cni : setup calico container network interface (cni)] ****************************************************************************************************************************************************
Saturday 11 September 2021  16:48:57 -0700 (0:00:00.218)       0:07:06.331 ****
skipping: [strawberry]
skipping: [blueberry]
skipping: [blackberry]
skipping: [gooseberry]
included: /Users/almondch/GH/raspbernetes/k8s-cluster-installation-1/ansible/roles/cni/tasks/calico.yml for hackberry

TASK [cni : applying calico] ***********************************************************************************************************************************************************************************
Saturday 11 September 2021  16:48:57 -0700 (0:00:00.269)       0:07:06.601 ****
fatal: [hackberry]: FAILED! => changed=true
  cmd:
  - kubectl
  - apply
  - -f
  - https://docs.projectcalico.org/manifests/calico.yaml
  delta: '0:00:11.712726'
  end: '2021-09-11 16:49:12.059033'
  msg: non-zero return code
  rc: 1
  start: '2021-09-11 16:49:00.346307'
  stderr: 'Error from server (InternalError): an error on the server ("") has prevented the request from succeeding'
  stderr_lines: <omitted>
  stdout: ''
  stdout_lines: <omitted>

PLAY RECAP *****************************************************************************************************************************************************************************************************
127.0.0.1                  : ok=1    changed=0    unreachable=0    failed=0    skipped=1    rescued=0    ignored=0
blackberry                 : ok=64   changed=16   unreachable=0    failed=0    skipped=26   rescued=0    ignored=0
blueberry                  : ok=85   changed=17   unreachable=0    failed=0    skipped=26   rescued=0    ignored=0
gooseberry                 : ok=64   changed=16   unreachable=0    failed=0    skipped=26   rescued=0    ignored=0
hackberry                  : ok=121  changed=21   unreachable=0    failed=1    skipped=28   rescued=0    ignored=0
strawberry                 : ok=87   changed=17   unreachable=0    failed=0    skipped=28   rescued=0    ignored=0

@kcalmond
Author

kcalmond commented Sep 12, 2021

re ^^ I suspected that some cruft at the OS level, which the ansible playbooks weren't able to clean up, was causing problems. So I reimaged all nodes and started fresh. After a complete reset with a virgin image install on all nodes I get to the state below. Not sure what to do next???

❯ kubectl get nodes -o wide --kubeconfig ~/GH/raspbernetes/k8s-cluster-installation-1/ansible/playbooks/output/k8s-config.yaml
NAME         STATUS   ROLES                  AGE     VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
blackberry   Ready    <none>                 10m     v1.21.3   192.168.0.54   <none>        Ubuntu 20.04.3 LTS   5.4.0-1042-raspi   containerd://1.5.2
blueberry    Ready    control-plane,master   8m52s   v1.21.3   192.168.0.53   <none>        Ubuntu 20.04.3 LTS   5.4.0-1042-raspi   containerd://1.5.2
gooseberry   Ready    <none>                 10m     v1.21.3   192.168.0.55   <none>        Ubuntu 20.04.3 LTS   5.4.0-1042-raspi   containerd://1.5.2
hackberry    Ready    control-plane,master   10m     v1.21.3   192.168.0.51   <none>        Ubuntu 20.04.3 LTS   5.4.0-1042-raspi   containerd://1.5.2
strawberry   Ready    control-plane,master   9m19s   v1.21.3   192.168.0.52   <none>        Ubuntu 20.04.3 LTS   5.4.0-1042-raspi   containerd://1.5.2

❯ kubectl get pods --namespace kube-system -o wide --kubeconfig ~/GH/raspbernetes/k8s-cluster-installation-1/ansible/playbooks/output/k8s-config.yaml
NAME                                       READY   STATUS              RESTARTS   AGE     IP             NODE         NOMINATED NODE   READINESS GATES
calico-kube-controllers-58497c65d5-vhgvh   0/1     ContainerCreating   0          8m23s   <none>         blackberry   <none>           <none>
calico-node-b94x9                          0/1     CrashLoopBackOff    5          8m23s   192.168.0.51   hackberry    <none>           <none>
calico-node-dwmjt                          0/1     CrashLoopBackOff    5          8m23s   192.168.0.55   gooseberry   <none>           <none>
calico-node-fqxd2                          0/1     CrashLoopBackOff    5          8m23s   192.168.0.54   blackberry   <none>           <none>
calico-node-rftbc                          0/1     Running             6          8m23s   192.168.0.53   blueberry    <none>           <none>
calico-node-vpjfq                          0/1     Running             5          8m23s   192.168.0.52   strawberry   <none>           <none>
coredns-558bd4d5db-4lk9d                   0/1     ContainerCreating   0          10m     <none>         blackberry   <none>           <none>
coredns-558bd4d5db-6fqt5                   0/1     ContainerCreating   0          10m     <none>         blackberry   <none>           <none>
etcd-blueberry                             1/1     Running             0          9m3s    192.168.0.53   blueberry    <none>           <none>
etcd-hackberry                             1/1     Running             0          10m     192.168.0.51   hackberry    <none>           <none>
etcd-strawberry                            1/1     Running             0          7m48s   192.168.0.52   strawberry   <none>           <none>
kube-apiserver-blueberry                   1/1     Running             1          8m59s   192.168.0.53   blueberry    <none>           <none>
kube-apiserver-hackberry                   1/1     Running             0          10m     192.168.0.51   hackberry    <none>           <none>
kube-apiserver-strawberry                  1/1     Running             0          7m59s   192.168.0.52   strawberry   <none>           <none>
kube-controller-manager-blueberry          1/1     Running             0          8m15s   192.168.0.53   blueberry    <none>           <none>
kube-controller-manager-hackberry          1/1     Running             2          10m     192.168.0.51   hackberry    <none>           <none>
kube-controller-manager-strawberry         1/1     Running             0          7m30s   192.168.0.52   strawberry   <none>           <none>
kube-scheduler-blueberry                   1/1     Running             0          8m4s    192.168.0.53   blueberry    <none>           <none>
kube-scheduler-hackberry                   1/1     Running             2          10m     192.168.0.51   hackberry    <none>           <none>
kube-scheduler-strawberry                  1/1     Running             0          7m35s   192.168.0.52   strawberry   <none>           <none>

❯ kubectl logs calico-node-fqxd2 --namespace kube-system --kubeconfig ~/GH/raspbernetes/k8s-cluster-installation-1/ansible/playbooks/output/k8s-config.yaml
2021-09-12 01:33:56.002 [INFO][8] startup/startup.go 396: Early log level set to info
2021-09-12 01:33:56.002 [INFO][8] startup/utils.go 126: Using NODENAME environment for node name blackberry
2021-09-12 01:33:56.003 [INFO][8] startup/utils.go 138: Determined node name: blackberry
2021-09-12 01:33:56.003 [INFO][8] startup/startup.go 98: Starting node blackberry with version v3.20.0
2021-09-12 01:33:56.008 [INFO][8] startup/startup.go 401: Checking datastore connection
2021-09-12 01:34:26.009 [INFO][8] startup/startup.go 416: Hit error connecting to datastore - retry error=Get "https://10.96.0.1:443/api/v1/nodes/foo": dial tcp 10.96.0.1:443: i/o timeout
2021-09-12 01:34:57.012 [INFO][8] startup/startup.go 416: Hit error connecting to datastore - retry error=Get "https://10.96.0.1:443/api/v1/nodes/foo": dial tcp 10.96.0.1:443: i/o timeout

@kcalmond
Author

re ^^ now I'm noticing kube-proxy is not running. Guessing there is something in the playbooks that disabled kube-proxy, expecting the cilium/eBPF replacement to be running instead...?
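A couple of checks that would confirm that theory (the 10.96.0.1:443 address in the calico-node logs is the in-cluster kubernetes Service VIP, which kube-proxy normally makes reachable from pods):

❯ kubectl get daemonset kube-proxy --namespace kube-system --kubeconfig ~/GH/raspbernetes/k8s-cluster-installation-1/ansible/playbooks/output/k8s-config.yaml
❯ kubectl get endpoints kubernetes --kubeconfig ~/GH/raspbernetes/k8s-cluster-installation-1/ansible/playbooks/output/k8s-config.yaml
# If the kube-proxy DaemonSet doesn't exist, nothing programs the Service VIP,
# so calico-node times out dialing 10.96.0.1:443 exactly as in the logs above.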

@xunholy
Member

xunholy commented Sep 12, 2021

This final issue will be caused by this configuration:

cluster_kube_proxy_enabled: false

You've done well to troubleshoot these issues. If you want to push a fix for the cilium default, feel free; otherwise I'll try to get a fix out for it sometime this week.

@kcalmond
Author

kcalmond commented Sep 12, 2021

^^ it was right in front of me inside controlplane.yml :-P
FTR, here are the changes I made to get a calico-based build to work across a virgin set of nodes (a quick sanity check follows the list):

  • group_vars/all.yml
    --> cni_plugin: 'calico' (was 'cilium')
  • group_vars/controlplane.yml
    --> cni_plugin: calico (was cilium)
    --> cluster_kube_proxy_enabled: true (was false)
  • roles/cni/defaults/main.yml
    --> cni_plugin: calico (was cilium)
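A quick sanity check after re-running all.yml with those values (the calico-node label selector below is assumed from the upstream calico.yaml manifest, not taken from this repo):

❯ kubectl get daemonset kube-proxy --namespace kube-system --kubeconfig ~/GH/raspbernetes/k8s-cluster-installation-1/ansible/playbooks/output/k8s-config.yaml
❯ kubectl get pods --namespace kube-system -l k8s-app=calico-node -o wide --kubeconfig ~/GH/raspbernetes/k8s-cluster-installation-1/ansible/playbooks/output/k8s-config.yaml
# With kube-proxy back, the calico-node pods should reach 10.96.0.1 and settle
# into Running instead of CrashLoopBackOff.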
