CoreDNS daemonset #6247
We use coredns as a daemonset with a dnsmasq cache in front of the coredns container. We measured that running with this and a local conntrack bypass on the node is more reliable. It's a very important service for us, so adding a little resource overhead is fine compared with an incident. The fix in coredns was reverted (I reported it, and I also reported the issue in the Go issue tracker). So yes, to sum it up: this setup has worked well for us.
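For illustration, here is a minimal sketch of what such a two-container node-local DNS pod could look like. All names, images, ports, and the 169.254.20.10 address are assumptions for the example, not the actual Zalando manifest (their real one is linked further down in this thread):

```yaml
# Hypothetical sketch: dnsmasq caching in front of coredns in one DaemonSet pod.
# Assumes a link-local address (169.254.20.10) is configured on a dummy
# interface on every node, as nodelocaldns-style setups usually do.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: coredns-local
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: coredns-local
  template:
    metadata:
      labels:
        app: coredns-local
    spec:
      hostNetwork: true                   # serve DNS directly on the node
      containers:
      - name: dnsmasq
        image: example.com/dnsmasq:latest # assumed image
        args:
        - --keep-in-foreground
        - --listen-address=169.254.20.10  # node-local address pods resolve against
        - --cache-size=10000
        - --no-resolv                     # don't read the node's /etc/resolv.conf
        - --server=127.0.0.1#9254         # forward cache misses to coredns below
      - name: coredns
        image: example.com/coredns:latest # assumed image
        args: ["-conf", "/etc/coredns/Corefile"]  # Corefile binding 127.0.0.1:9254
```

The local conntrack bypass mentioned above is typically implemented with NOTRACK rules in the raw iptables table (e.g. `iptables -t raw -A OUTPUT -d 169.254.20.10 -p udp --dport 53 -j NOTRACK`), so DNS packets to the local cache skip connection tracking and sidestep the well-known conntrack races with UDP DNS.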
Thanks for the info @szuecs! Regarding the concern raised by some people that running coredns as a daemonset could cause high load on the API server due to the additional watches: is this something you've experienced? (If you can share your scale of nodes and service endpoints, that would help to understand the scale we're talking about.) Another possible issue I raised as a concern: if the dnsmasq container starts before coredns, you could get DNS resolution errors. Is this something you've experienced?
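One conceivable mitigation for that start-order race, sketched here as an assumption rather than what this setup actually does: gate pod readiness on coredns itself, so the pod only counts as ready once coredns answers, regardless of which container came up first (Kubernetes does not guarantee start order between regular containers in a pod).

```yaml
# Fragment of a pod spec like the one above (hypothetical). Requires the
# "ready" plugin in the Corefile, which serves GET /ready on port 8181
# once coredns is able to serve.
containers:
- name: coredns
  readinessProbe:
    httpGet:
      path: /ready
      port: 8181
    initialDelaySeconds: 2
    periodSeconds: 5
```

Combined with startup gating on daemonset readiness (as described in the next comment), this would keep workloads off the node until local DNS actually resolves.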
This setup is used for peak loads of up to 1000-1500 nodes in the biggest clusters. A more typical load is up to 500 nodes and about 6,000-7,000 pods.
We have a special node startup setup where nodes start with a taint so that pods won't be scheduled on them; only when all daemonset pods are running and ready is the node marked ready by removing this taint. This way we ensure that things are up and running before workloads depend on them. Of course it doesn't help if dnsmasq or coredns crashes in the middle, but that is not an issue we have experienced.

We have a lot of details in internal design documents which we unfortunately can't share in their current state because of many internal references, and I don't have time at the moment to clean them up to be shareable. Some things to highlight:
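As an illustration of the startup-taint pattern described above, here is a hypothetical sketch; the taint key is a made-up placeholder, and the controller that removes the taint is part of the internal setup, so both are assumptions:

```yaml
# kubelet registers new nodes with a taint so regular pods are not scheduled:
#   kubelet --register-with-taints=node.example.com/startup=:NoSchedule
#
# System daemonsets (including the DNS one) tolerate the taint so they start first:
tolerations:
- key: node.example.com/startup
  operator: Exists
  effect: NoSchedule
# A node controller removes the taint only once all daemonset pods on the node
# are running and ready, e.g.:
#   kubectl taint nodes <node> node.example.com/startup:NoSchedule-
```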
Thanks for the detailed information @mikkeloscar. Regarding the load on the API server: how many instances are you running, and how many resources are you dedicating to the API server?
@dudicoco we use dedicated node pools for control plane nodes and a separate etcd cluster. We normally run 2 control plane nodes (3 during updates) and a 5-node etcd cluster (m5.large; sometimes we have to increase it). Control plane node size depends on the cluster, but one of the big clusters has c6g.2xlarge nodes, and one of the clusters heavy on postgres (~250 postgres clusters created by postgres-operator in one cluster) has c6g.16xlarge control plane nodes.
I guess we can close the issue.
Hi,
I see that you are deploying coredns as a daemonset: https://github.com/zalando-incubator/kubernetes-on-aws/blob/25e3ac810c8cc80e0a4b91fe6d0d732ed9aa2dd7/cluster/manifests/coredns-local/daemonset-coredns.yaml
There are ongoing discussions on the benefits of using a coredns deployment + nodelocal dns architecture vs using a coredns daemonset:
kubernetes/dns#594
coredns/helm#86 (comment)
I would like to hear about your experience with using coredns as a daemonset, and whether it is something you would recommend over a coredns deployment + nodelocal dns architecture.
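For concreteness, both architectures hinge on pods resolving against a resolver address served on every node. A minimal sketch of that kubelet wiring, assuming the conventional nodelocaldns link-local address (169.254.20.10):

```yaml
# Kubelet config fragment: point pods' /etc/resolv.conf at a node-local
# resolver instead of the cluster DNS Service VIP. With nodelocaldns this is
# a dummy link-local address; a daemonset setup can serve the same address.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
clusterDNS:
- 169.254.20.10
```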
Thanks