Many Kubernetes Clusters

Posted: 2019-04-28 12:26

In reply to Zalando's "Running 80+ clusters in production" post, someone asked whether this (80+ clusters) would not defeat "the purpose". My answer would not fit in a Tweet, so here it is as a blog post.

RnL writes on Twitter:

"I don’t know any reasoning behind running 80 clusters I am sure smarter people than me thought about it. However, wasn’t the purpose of schedulers to scale nodes and use resources efficiently?"

This was not the first such question I got. Here is a LinkedIn comment by Daniel D'Abate regarding our Zalando Kubernetes story from 4 months ago:

"I have a question and your insight will be very helpful for our next steps. Why are you using 100 clusters instead of 1 huge cluster? Is it related with better cost isolation or are any other reasons involved?"

Zalando runs 100+ Kubernetes clusters on AWS. Each cluster runs in its own AWS account. We always create a pair of prod/non-prod clusters per "product community" [1], i.e. only half of our clusters (50+) are marked as "production" and have full 24x7 on-call support.

We decided to go with "many" (that's relative) clusters for various reasons:

Kubernetes has no strong story for multi-tenancy; having "smaller" clusters mitigates part of this problem
some infrastructure is shared per cluster, e.g. Prometheus and the Ingress proxy (Skipper); these components need appropriate (vertical) scaling, which is easier to handle with smaller clusters
the blast radius is limited: anything going wrong in one cluster (outage, security incident, ..) does not necessarily affect the whole organization
cost attribution is easier (every cluster belongs to a cost center) [2]
the cluster (and its AWS account) serves as a natural trust boundary for access control (you can either deploy via CI/CD to a cluster or not)

Overall, this setup simply fits our world view better. Even smaller clusters would also be possible, but they produce too much overhead [3] and do not leverage the advantages of larger clusters (in-cluster communication, better utilization).

You can find some Kubernetes Failure Stories mentioning cluster size, e.g.:

My colleague Sandor reminded me that AWS rate limits are another reason against "too large" clusters. All AWS integrations call AWS APIs and count towards these rate limits: Controller-Manager, kube2iam, External DNS, kube-ingress-aws-controller, kube-static-egress-controller, ..

To summarize:

you want big clusters for utilization, in-cluster communication, and reduced overhead
you want small clusters for isolation, reduced blast radius, and fewer challenges around scaling

We have 1200+ developers at Zalando. Should they all share the same huge cluster? Probably not! Should every team (~4-8 people) get its own cluster? Probably not! So the truth is somewhere in the middle, and the story might be different for your organization.

UPDATE 2019-04-29

I clarified that we always create a pair of clusters and added Sandor's point about AWS rate limits.

UPDATE 2019-05-03

We just hit our configured CIDR limits in one cluster: the cluster reached 250+ nodes, and we had configured --cluster-cidr=10.2.0.0/16 and --node-cidr-mask-size=24, which leaves room for at most 256 node CIDRs. The Kubernetes controller failed with the error message "Error while processing Node Add/Delete: failed to allocate cidr: CIDR allocation failed; there are no remaining CIDRs left to allocate in the accepted range". To avoid hitting this CIDR/node limit, we made the node CIDR mask size configurable. Setting it to /25 would allow roughly 510 nodes at the cost of limiting the number of pods per node to ~62. See also the GKE docs on flexible Pod CIDR and the Kubernetes docs for large clusters (not so helpful). In short: tuning the CIDR ranges is another topic to take care of when designing your cluster size(s).
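The arithmetic behind this limit can be sketched in a few lines of Python. This is only an illustration: the helper name node_cidr_capacity is made up for this example, and it computes raw upper bounds from the prefix lengths alone. The estimates quoted above (~510 nodes, ~62 pods per node) are lower than these raw counts, and the sketch does not try to model whatever headroom accounts for the difference.

```python
import ipaddress
from typing import Tuple


def node_cidr_capacity(cluster_cidr: str, node_mask_size: int) -> Tuple[int, int]:
    """Return (node CIDR blocks, addresses per node block) for an IPv4 cluster CIDR.

    Hypothetical helper for illustration; mirrors the meaning of
    --cluster-cidr and --node-cidr-mask-size but is not Kubernetes code.
    """
    network = ipaddress.ip_network(cluster_cidr)
    if node_mask_size < network.prefixlen or node_mask_size > 32:
        raise ValueError("node mask size must be between the cluster prefix and 32")
    # Each extra prefix bit doubles the number of per-node blocks.
    node_blocks = 2 ** (node_mask_size - network.prefixlen)
    # The remaining host bits determine how many addresses one node gets.
    addresses_per_node = 2 ** (32 - node_mask_size)
    return node_blocks, addresses_per_node


for mask in (24, 25):
    blocks, addresses = node_cidr_capacity("10.2.0.0/16", mask)
    print(f"/{mask}: up to {blocks} node CIDRs, {addresses} addresses per node")
```

With a /16 cluster CIDR, going from /24 to /25 doubles the number of node blocks and halves the address space per node, which is the trade-off between node count and pods per node described above.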