
"exec format error" with cilium set as default CNI in all.yml if you are installing to an ARM cluster #130

Open
kcalmond opened this issue Sep 11, 2021 · 7 comments
Labels: bug

Comments

@kcalmond

kcalmond commented Sep 11, 2021


What steps did you take and what happened:

Built a cluster per the instructions on Raspberry Pi 4 nodes, using boot images based on the latest Ubuntu 20.04.

What did you expect to happen:

The CNI should start up using images built for the ARM architecture.

Anything else you would like to add:

cilium.yml runs a kubectl apply against a manifest that references an image incompatible with ARM.
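For anyone hitting this, a rough way to confirm the architecture mismatch before applying a manifest (a hedged sketch; the image name/tag below is only illustrative, not taken from the repo's cilium.yml):

❯ docker manifest inspect cilium/cilium:v1.9.9 | grep '"architecture"'
# If no arm64 entry appears in the manifest list, containerd on a Pi will fail
# to run the image with "exec format error".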


kcalmond added the bug label on Sep 11, 2021
@kcalmond
Author

Looks like Cilium 1.10 supports ARM.
Also, the quick installation process changed from 1.9 to 1.10: the 1.10 docs no longer provide a quick-install.yml file, and the quick install now goes through a new installer.
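For reference, a minimal sketch of what the 1.10-style quick install looks like with the new cilium CLI, assuming the CLI is available on the machine that runs kubectl (exact flags/version pinning may differ):

❯ cilium install
❯ cilium status --wait
# Cilium 1.10 images are published multi-arch, so arm64 nodes should pull a
# compatible variant instead of hitting "exec format error".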

@xunholy
Member

xunholy commented Sep 11, 2021

Yep, thanks for raising this. I did run into this issue myself and had it on my list of TODOs.

@kcalmond
Author

kcalmond commented Sep 12, 2021

@xunholy - re ^^ I'm trying to figure out how to successfully build a cluster with calico as the CNI instead of cilium (cilium appears to be the default, which as of right now makes the defaults broken for Pi clusters).

After fishing around I changed the value from cilium to calico in each of these locations (see the search sketch after the list):

  • group_vars/all.yml: cni_plugin: 'cilium'
  • group_vars/controlplane.yml: cni_plugin: cilium
  • roles/cni/defaults/main.yml: cni_plugin: cilium
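A quick way to find every place the default is set, assuming you run it from the repo's ansible directory (paths taken from the list above):

❯ grep -rn "cni_plugin" group_vars roles/cni/defaults
# Change each hit from cilium to calico before re-running the playbook.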

I ran nuke.yml, then changed cilium to calico in the above locations, then ran all.yml again. Everything looked good until the end, where I got a fatal FAIL message:

<lots of output...>

TASK [cni : setup calico container network interface (cni)] ****************************************************************************************************************************************************
Saturday 11 September 2021  16:48:57 -0700 (0:00:00.218)       0:07:06.331 ****
skipping: [strawberry]
skipping: [blueberry]
skipping: [blackberry]
skipping: [gooseberry]
included: /Users/almondch/GH/raspbernetes/k8s-cluster-installation-1/ansible/roles/cni/tasks/calico.yml for hackberry

TASK [cni : applying calico] ***********************************************************************************************************************************************************************************
Saturday 11 September 2021  16:48:57 -0700 (0:00:00.269)       0:07:06.601 ****
fatal: [hackberry]: FAILED! => changed=true
  cmd:
  - kubectl
  - apply
  - -f
  - https://docs.projectcalico.org/manifests/calico.yaml
  delta: '0:00:11.712726'
  end: '2021-09-11 16:49:12.059033'
  msg: non-zero return code
  rc: 1
  start: '2021-09-11 16:49:00.346307'
  stderr: 'Error from server (InternalError): an error on the server ("") has prevented the request from succeeding'
  stderr_lines: <omitted>
  stdout: ''
  stdout_lines: <omitted>

PLAY RECAP *****************************************************************************************************************************************************************************************************
127.0.0.1                  : ok=1    changed=0    unreachable=0    failed=0    skipped=1    rescued=0    ignored=0
blackberry                 : ok=64   changed=16   unreachable=0    failed=0    skipped=26   rescued=0    ignored=0
blueberry                  : ok=85   changed=17   unreachable=0    failed=0    skipped=26   rescued=0    ignored=0
gooseberry                 : ok=64   changed=16   unreachable=0    failed=0    skipped=26   rescued=0    ignored=0
hackberry                  : ok=121  changed=21   unreachable=0    failed=1    skipped=28   rescued=0    ignored=0
strawberry                 : ok=87   changed=17   unreachable=0    failed=0    skipped=28   rescued=0    ignored=0

@kcalmond
Author

kcalmond commented Sep 12, 2021

re ^^ I suspected that some cruft at the OS level, which the ansible playbooks weren't able to clean up, was causing problems. So I reimaged all nodes and started fresh. After a complete reset with a virgin image install on all nodes I get to the state below. Not sure what to do next???

❯ kubectl get nodes -o wide --kubeconfig ~/GH/raspbernetes/k8s-cluster-installation-1/ansible/playbooks/output/k8s-config.yaml
NAME         STATUS   ROLES                  AGE     VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
blackberry   Ready    <none>                 10m     v1.21.3   192.168.0.54   <none>        Ubuntu 20.04.3 LTS   5.4.0-1042-raspi   containerd://1.5.2
blueberry    Ready    control-plane,master   8m52s   v1.21.3   192.168.0.53   <none>        Ubuntu 20.04.3 LTS   5.4.0-1042-raspi   containerd://1.5.2
gooseberry   Ready    <none>                 10m     v1.21.3   192.168.0.55   <none>        Ubuntu 20.04.3 LTS   5.4.0-1042-raspi   containerd://1.5.2
hackberry    Ready    control-plane,master   10m     v1.21.3   192.168.0.51   <none>        Ubuntu 20.04.3 LTS   5.4.0-1042-raspi   containerd://1.5.2
strawberry   Ready    control-plane,master   9m19s   v1.21.3   192.168.0.52   <none>        Ubuntu 20.04.3 LTS   5.4.0-1042-raspi   containerd://1.5.2

❯ kubectl get pods --namespace kube-system -o wide --kubeconfig ~/GH/raspbernetes/k8s-cluster-installation-1/ansible/playbooks/output/k8s-config.yaml
NAME                                       READY   STATUS              RESTARTS   AGE     IP             NODE         NOMINATED NODE   READINESS GATES
calico-kube-controllers-58497c65d5-vhgvh   0/1     ContainerCreating   0          8m23s   <none>         blackberry   <none>           <none>
calico-node-b94x9                          0/1     CrashLoopBackOff    5          8m23s   192.168.0.51   hackberry    <none>           <none>
calico-node-dwmjt                          0/1     CrashLoopBackOff    5          8m23s   192.168.0.55   gooseberry   <none>           <none>
calico-node-fqxd2                          0/1     CrashLoopBackOff    5          8m23s   192.168.0.54   blackberry   <none>           <none>
calico-node-rftbc                          0/1     Running             6          8m23s   192.168.0.53   blueberry    <none>           <none>
calico-node-vpjfq                          0/1     Running             5          8m23s   192.168.0.52   strawberry   <none>           <none>
coredns-558bd4d5db-4lk9d                   0/1     ContainerCreating   0          10m     <none>         blackberry   <none>           <none>
coredns-558bd4d5db-6fqt5                   0/1     ContainerCreating   0          10m     <none>         blackberry   <none>           <none>
etcd-blueberry                             1/1     Running             0          9m3s    192.168.0.53   blueberry    <none>           <none>
etcd-hackberry                             1/1     Running             0          10m     192.168.0.51   hackberry    <none>           <none>
etcd-strawberry                            1/1     Running             0          7m48s   192.168.0.52   strawberry   <none>           <none>
kube-apiserver-blueberry                   1/1     Running             1          8m59s   192.168.0.53   blueberry    <none>           <none>
kube-apiserver-hackberry                   1/1     Running             0          10m     192.168.0.51   hackberry    <none>           <none>
kube-apiserver-strawberry                  1/1     Running             0          7m59s   192.168.0.52   strawberry   <none>           <none>
kube-controller-manager-blueberry          1/1     Running             0          8m15s   192.168.0.53   blueberry    <none>           <none>
kube-controller-manager-hackberry          1/1     Running             2          10m     192.168.0.51   hackberry    <none>           <none>
kube-controller-manager-strawberry         1/1     Running             0          7m30s   192.168.0.52   strawberry   <none>           <none>
kube-scheduler-blueberry                   1/1     Running             0          8m4s    192.168.0.53   blueberry    <none>           <none>
kube-scheduler-hackberry                   1/1     Running             2          10m     192.168.0.51   hackberry    <none>           <none>
kube-scheduler-strawberry                  1/1     Running             0          7m35s   192.168.0.52   strawberry   <none>           <none>

❯ kubectl logs calico-node-fqxd2 --namespace kube-system --kubeconfig ~/GH/raspbernetes/k8s-cluster-installation-1/ansible/playbooks/output/k8s-config.yaml
2021-09-12 01:33:56.002 [INFO][8] startup/startup.go 396: Early log level set to info
2021-09-12 01:33:56.002 [INFO][8] startup/utils.go 126: Using NODENAME environment for node name blackberry
2021-09-12 01:33:56.003 [INFO][8] startup/utils.go 138: Determined node name: blackberry
2021-09-12 01:33:56.003 [INFO][8] startup/startup.go 98: Starting node blackberry with version v3.20.0
2021-09-12 01:33:56.008 [INFO][8] startup/startup.go 401: Checking datastore connection
2021-09-12 01:34:26.009 [INFO][8] startup/startup.go 416: Hit error connecting to datastore - retry error=Get "https://10.96.0.1:443/api/v1/nodes/foo": dial tcp 10.96.0.1:443: i/o timeout
2021-09-12 01:34:57.012 [INFO][8] startup/startup.go 416: Hit error connecting to datastore - retry error=Get "https://10.96.0.1:443/api/v1/nodes/foo": dial tcp 10.96.0.1:443: i/o timeout

@kcalmond
Author

re ^^ now I'm noticing kube-proxy is not running. Guessing there is something in the playbooks that disabled kube-proxy, expecting the cilium/eBPF replacement to be running instead...?
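A couple of checks that would confirm that theory (the 10.96.0.1:443 address in the calico-node logs is the in-cluster kubernetes Service VIP, which kube-proxy normally makes reachable from pods):

❯ kubectl get daemonset kube-proxy --namespace kube-system --kubeconfig ~/GH/raspbernetes/k8s-cluster-installation-1/ansible/playbooks/output/k8s-config.yaml
❯ kubectl get endpoints kubernetes --kubeconfig ~/GH/raspbernetes/k8s-cluster-installation-1/ansible/playbooks/output/k8s-config.yaml
# If the kube-proxy DaemonSet doesn't exist, nothing programs the Service VIP,
# so calico-node times out dialing 10.96.0.1:443 exactly as in the logs above.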

@xunholy
Member

xunholy commented Sep 12, 2021

This final issue will be caused by this configuration:

cluster_kube_proxy_enabled: false

You've done well to troubleshoot these issues. If you want to push a fix for the cilium default, feel free; otherwise I'll try to get a fix out for it sometime this week.

@kcalmond
Author

kcalmond commented Sep 12, 2021

^^ it was right in front of me inside controlplane.yml :-P
FTR, here are the changes I made to get a calico-based build to work across a virgin set of nodes (a quick sanity check follows the list):

  • group_vars/all.yml
    --> cni_plugin: 'calico' (was 'cilium')
  • group_vars/controlplane.yml
    --> cni_plugin: calico (was cilium)
    --> cluster_kube_proxy_enabled: true (was false)
  • roles/cni/defaults/main.yml
    --> cni_plugin: calico (was cilium)
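A quick sanity check after re-running all.yml with those values (the calico-node label selector below is assumed from the upstream calico.yaml manifest, not taken from this repo):

❯ kubectl get daemonset kube-proxy --namespace kube-system --kubeconfig ~/GH/raspbernetes/k8s-cluster-installation-1/ansible/playbooks/output/k8s-config.yaml
❯ kubectl get pods --namespace kube-system -l k8s-app=calico-node -o wide --kubeconfig ~/GH/raspbernetes/k8s-cluster-installation-1/ansible/playbooks/output/k8s-config.yaml
# With kube-proxy back, the calico-node pods should reach 10.96.0.1 and settle
# into Running instead of CrashLoopBackOff.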
