Kubernetes Fails: 3 ways to kill your clusters


Kubernetes has its virtues and is worth investing in, but it is undoubtedly complex and comes with many operational challenges. We faced many of them on our journey towards "cloud native" at Zalando.

We constantly learned from other organizations that shared their failures and insights, so I started to compile a list of public failure/horror stories related to Kubernetes. The goal was to make it easier for people tasked with operations to find outage reports to learn from.

Many of these failures had a few things in common. Here are the factors that contributed to failure.

Missing Operational Maturity

Infrastructure operations is a challenge for most organizations and the transformation towards end-to-end responsibility ("DevOps", "you build it, you run it") is often in full swing. Smaller organizations usually use a tool to bootstrap a cluster (e.g. kops), but do not dedicate time to set up full continuous delivery for the infrastructure. This leads to painful manual Kubernetes upgrades, untested infrastructure changes, and brittle clusters. The same situation applies to managed infrastructure, as cloud offerings never come with all batteries included. Infrastructure changes should get at least the same attention and rigor as your customer-facing app deployments. [1]

Upstream Kubernetes/Docker Issues

Some of the failures can be attributed to upstream issues, e.g. the Docker daemon hanging, kubelet not reconnecting to the control plane, kernel CPU throttling bugs, unsafe CronJob defaults, and kubelet memory leaks.

If you hit an upstream issue, congratulations! You can follow or file an upstream issue and hope for a fix, or contribute one yourself and help many others. I would expect this class of failure causes to shrink over time as CNCF projects mature and the user base grows, making it less likely that you will be the first to hit an upstream issue.
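To make the "unsafe CronJob defaults" item above more concrete, here is a minimal sketch of a manifest check. The post does not say which defaults it has in mind; this sketch assumes two fields that Kubernetes leaves permissive when unset: `concurrencyPolicy` (defaults to `Allow`) and `startingDeadlineSeconds` (unset by default). The helper name `cronjob_warnings` and the example manifest are hypothetical, not part of any Kubernetes tooling.

```python
from typing import List


def cronjob_warnings(manifest: dict) -> List[str]:
    """Hypothetical lint helper: flag CronJob fields left at their defaults."""
    spec = manifest.get("spec", {})
    warnings = []
    # Kubernetes defaults concurrencyPolicy to "Allow", so slow runs may overlap.
    if spec.get("concurrencyPolicy", "Allow") == "Allow":
        warnings.append("concurrencyPolicy is Allow: overlapping runs are possible")
    # startingDeadlineSeconds has no default value; flag manifests that omit it.
    if "startingDeadlineSeconds" not in spec:
        warnings.append("startingDeadlineSeconds is unset")
    return warnings


# Illustrative manifest (as a dict) that relies entirely on the defaults.
example = {
    "apiVersion": "batch/v1",
    "kind": "CronJob",
    "metadata": {"name": "nightly-report"},
    "spec": {"schedule": "0 2 * * *"},
}

for warning in cronjob_warnings(example):
    print(warning)
```

A check like this could run in a deployment pipeline before manifests reach the cluster; which fields are worth flagging depends on your workloads.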

Cloud/Organization Integrations

Kubernetes comes in more than one flavor: There are many possible combinations of Kubernetes components and configurations. Kubernetes needs to interact with your cloud platform, such as Google Cloud or AWS, and your existing IT landscape. And all of these integrations can lead to failure scenarios.

We saw Kubernetes' AWS cloud provider code easily hit AWS API rate limits and have problems with EBS persistent volume attachments. Using the AWS ELB with dynamic IPs caused problems with kubelet losing connections. The AWS IAM integration (kube2iam) is notoriously prone to race conditions.
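For your own tooling that calls a rate-limited cloud API, a common mitigation is to retry with capped exponential backoff and random jitter. The sketch below is a generic client-side illustration, not how the Kubernetes AWS cloud provider handles throttling; `RateLimitError`, `call_with_backoff`, and `flaky_request` are hypothetical names introduced for this example.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for a cloud API throttling error."""


def call_with_backoff(request, max_attempts=5, base_delay=0.5,
                      max_delay=30.0, sleep=time.sleep):
    """Call request(), retrying on RateLimitError with capped backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return request()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # The delay cap doubles on every attempt, bounded by max_delay.
            cap = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter: wait a random time in [0, cap] to avoid lockstep retries.
            sleep(random.uniform(0, cap))


# Usage: a fake request that is throttled twice before it succeeds.
calls = {"count": 0}


def flaky_request():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RateLimitError("throttled")
    return "ok"


# sleep is replaced with a no-op so the example finishes instantly.
result = call_with_backoff(flaky_request, sleep=lambda seconds: None)
print(result, calls["count"])
```

The jitter matters when many clients are throttled at the same moment: without it, they all retry at the same instants and hit the limit again together.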

Human Error

Let us be clear: There is no such thing as "human error" as a root cause. If your Root Cause Analysis (RCA) concludes with "human error", start over and ask some hard questions.

Share Your Learnings

Nowadays everybody is talking about "failure culture," but what organization is truly ready to share its failures and learnings publicly? Kubernetes gives us a common ground where we can all benefit from sharing our experiences and learning from each other.

Many contributing factors are not new, such as the maturity of infrastructure change processes, Docker, distributed systems, and so on. But Kubernetes gives us a common language to talk through and address them. Sharing experiences reduces the unknown unknowns of operating or using Kubernetes, so it will get easier for everyone over time.

Thanks to Joe Beda, the list of Kubernetes failure stories is now available on its own domain:

Do you have experiences to share? Open a Pull Request!

For more on Kubernetes failures, come to my talk, “Kubernetes Failure Stories and How to Crash Your Clusters” at KubeCon + CloudNativeCon Europe 2019 in Barcelona on May 20-23.

[1] To learn how we roll out changes to our Kubernetes infrastructure at Zalando, watch the KubeCon 2018 talk "Continuously Deliver your Kubernetes Infrastructure"

This blog post was originally published on TechBeacon.