Many Kubernetes Clusters
In reply to Zalando's "Running 80+ clusters in production" post, someone asked whether running 80+ clusters would not defeat "the purpose". My answer would not fit in a Tweet, so here it is as a blog post.
RnL writes on Twitter:
"I don’t know any reasoning behind running 80 clusters I am sure smarter people than me thought about it. However, wasn’t the purpose of schedulers to scale nodes and use resources efficiently?"
This was not the first question of this kind I got. Here is a LinkedIn comment by Daniel D'Abate regarding our Zalando Kubernetes story from 4 months ago:
"I have a question and your insight will be very helpful for our next steps. Why are you using 100 clusters instead of 1 huge cluster? Is it related with better cost isolation or are any other reasons involved?"
Zalando runs 100+ Kubernetes clusters on AWS. Each cluster runs in its own AWS account. We always create a pair of prod/non-prod clusters per "product community", so only half of our clusters (50+) are marked as "production" and have full 24x7 on-call support.
We decided to go with "many" (that's relative) clusters for various reasons:
- Kubernetes has no strong story for multi-tenancy; having "smaller" clusters mitigates part of this problem
- some infrastructure is shared per cluster, e.g. Prometheus and the Ingress proxy (Skipper); these components have to be scaled (vertically) with the cluster, and smaller clusters make this easier to handle
- the blast radius is limited: anything going wrong in one cluster (outage, security incident, ...) does not necessarily affect the whole organization
- cost attribution is easier (every cluster belongs to a cost center) 
- the cluster (and its AWS account) serves as a natural trust boundary for access control (you can either deploy via CI/CD to a cluster or not)
Overall, this setup simply fits our world view better. Even smaller clusters would also be possible, but they would produce too much overhead and would not leverage the advantages of larger clusters (in-cluster communication, better utilization).
You can find some Kubernetes Failure Stories mentioning cluster size, e.g.:
- On Infrastructure at Scale: A Cascading Failure of Distributed Systems - Target
- Breaking Kubernetes: How We Broke and Fixed our K8s Cluster - Civis Analytics
My colleague Sandor reminded me that AWS rate limits are another reason against "too large" clusters. All AWS integrations call AWS APIs and count towards these rate limits: Controller-Manager, kube2iam, External DNS, kube-ingress-aws-controller, kube-static-egress-controller, and others.
In short, cluster size is a trade-off:
- you want big clusters for utilization, in-cluster communication, and reduced overhead
- you want small clusters for isolation, reduced blast radius, and fewer challenges around scaling
We have 1200+ developers at Zalando, should they all share the same huge cluster? Probably not! Should every team (~4-8 people) get its own cluster? Probably not! So the truth is somewhere in the middle and the story might be different for your organization.
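To make "somewhere in the middle" concrete, here is a rough back-of-the-envelope sketch. It only uses the lower-bound figures quoted in this post (1200 developers, 50 prod/non-prod cluster pairs, teams of 4 to 8 people); the function names are made up for illustration, and the real numbers are higher and less evenly distributed.

```python
import math

# Lower-bound figures quoted in the post; real values are "1200+" and "50+".
DEVELOPERS = 1200
CLUSTER_PAIRS = 50          # one prod/non-prod pair per product community
TEAM_SIZES = (4, 8)         # "~4-8 people" per team


def developers_per_pair(developers: int, cluster_pairs: int) -> float:
    """Average number of developers sharing one prod/non-prod cluster pair."""
    return developers / cluster_pairs


def pairs_if_one_per_team(developers: int, team_size: int) -> int:
    """Cluster pairs needed if every team of the given size got its own pair."""
    return math.ceil(developers / team_size)


if __name__ == "__main__":
    shared = developers_per_pair(DEVELOPERS, CLUSTER_PAIRS)
    print(f"developers per cluster pair: {shared:.0f}")
    for size in TEAM_SIZES:
        pairs = pairs_if_one_per_team(DEVELOPERS, size)
        print(
            f"team size {size}: {shared / size:.0f} teams share a pair today; "
            f"one pair per team would need {pairs} pairs ({pairs * 2} clusters)"
        )
```

With these inputs, roughly 24 developers (3 to 6 teams) share each cluster pair, while a pair per team would mean 150 to 300 pairs, i.e. 300 to 600 clusters.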
Update: I clarified that we always create a pair of clusters and added Sandor's point about AWS rate limits.
Update: We just hit our configured CIDR limits in one cluster. The cluster reached 250+ nodes, and we had configured 10.2.0.0/16 for --cluster-cidr together with --node-cidr-mask-size=24, which leaves room for at most 256 per-node /24 ranges. The Kubernetes controller failed with the error message "Error while processing Node Add/Delete: failed to allocate cidr: CIDR allocation failed; there are no remaining CIDRs left to allocate in the accepted range". To mitigate hitting this CIDR/node limit, we made the node CIDR mask size configurable. Setting it to /25 would allow around 510 nodes at the cost of limiting the number of pods per node to ~62. See also the GKE docs on flexible Pod CIDR and the Kubernetes docs for large clusters (not so helpful). In short: tuning the CIDR ranges is another topic to take care of when designing your cluster size(s).
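The CIDR arithmetic can be checked with a few lines of Python. This is a minimal sketch using only the standard library `ipaddress` module; the helper name `node_capacity` is hypothetical, and the numbers are raw address counts, not what any particular Kubernetes version or network plugin will actually let you schedule.

```python
import ipaddress


def node_capacity(cluster_cidr: str, node_mask_size: int) -> tuple:
    """Return (per-node subnets available, addresses in each per-node subnet).

    cluster_cidr corresponds to --cluster-cidr and node_mask_size to
    --node-cidr-mask-size; both counts are theoretical upper bounds.
    """
    network = ipaddress.ip_network(cluster_cidr)
    if not network.prefixlen <= node_mask_size <= network.max_prefixlen:
        raise ValueError("node mask size must lie within the cluster CIDR")
    node_subnets = 2 ** (node_mask_size - network.prefixlen)
    addresses_per_node = 2 ** (network.max_prefixlen - node_mask_size)
    return node_subnets, addresses_per_node


if __name__ == "__main__":
    for mask in (24, 25, 26):
        nodes, addresses = node_capacity("10.2.0.0/16", mask)
        print(f"/{mask}: up to {nodes} nodes, {addresses} addresses per node")
```

For 10.2.0.0/16 this gives 256 node ranges at /24 and 512 at /25. The address count per node (128 for a /25) is only an upper bound on pods per node. The ~62 pods quoted above is lower, which would be consistent with keeping about half of each range as headroom for address reuse, as the GKE docs on flexible Pod CIDR describe; that reading is an assumption, not something the post states.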
A "product community" is not completely well defined, but usually describes a certain domain, tech department, and cost center, e.g. "Logistics" (in reality it's a bit more complicated as there is more than one "Logistics" aspect in Zalando).
We also use Kubernetes Resource Report to show costs per application and team: https://github.com/hjacobs/kube-resource-report/
See also my recent talk about cost efficiency where I compare the control plane overhead of GKE, EKS, and Zalando: https://www.youtube.com/watch?v=4QyecOoPsGU