Kubernetes Fails: 3 ways to kill your clusters
Kubernetes has its virtues and is worth investing in, but it is undoubtedly complex and comes with many operational challenges. We faced many of them on our journey towards "cloud native" at Zalando.
We constantly learned from other organizations that shared their failures and insights, so I started to compile a list of public failure/horror stories related to Kubernetes. The goal was to make it easier for people tasked with operations to find outage reports to learn from.
Many of these failures had a few things in common; here are the recurring contributing factors.
Missing Operational Maturity
Infrastructure operations is a challenge for most organizations, and the transformation towards end-to-end responsibility ("DevOps", "you build it, you run it") is often in full swing. Smaller organizations usually use a tool to bootstrap a cluster (e.g. kops), but do not dedicate time to set up full continuous delivery for the infrastructure. This leads to painful manual Kubernetes upgrades, untested infrastructure changes, and brittle clusters. The same situation applies to managed infrastructure, as cloud offerings never come with all batteries included. Infrastructure changes should get at least the same attention and rigor as your customer-facing app deployments. [1]
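As a concrete illustration of that rigor: even a small automated smoke test run after every infrastructure change catches regressions early. The following is only a sketch, assuming the official kubernetes Python client and a kubeconfig pointing at the target cluster; the two checks are examples, not an exhaustive test suite.

    # smoke_test.py -- minimal post-change cluster smoke test (illustrative sketch)
    # Assumes the official "kubernetes" Python client and a kubeconfig for the target cluster.
    import sys

    from kubernetes import client, config


    def main() -> int:
        config.load_kube_config()  # or config.load_incluster_config() when run inside the cluster
        v1 = client.CoreV1Api()

        # 1. Every node should report Ready.
        not_ready = []
        for node in v1.list_node().items:
            ready = any(c.type == "Ready" and c.status == "True"
                        for c in node.status.conditions or [])
            if not ready:
                not_ready.append(node.metadata.name)
        if not_ready:
            print(f"nodes not Ready: {not_ready}")
            return 1

        # 2. No pods stuck outside Running/Succeeded in kube-system.
        bad_pods = [p.metadata.name
                    for p in v1.list_namespaced_pod("kube-system").items
                    if p.status.phase not in ("Running", "Succeeded")]
        if bad_pods:
            print(f"unhealthy kube-system pods: {bad_pods}")
            return 1

        print("cluster smoke test passed")
        return 0


    if __name__ == "__main__":
        sys.exit(main())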
Upstream Kubernetes/Docker Issues
Some of the failures can be attributed to upstream issues, e.g. the Docker daemon hanging, the kubelet not reconnecting to the control plane, kernel CPU throttling bugs, unsafe CronJob defaults, and kubelet memory leaks.
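The CronJob defaults are a good example: concurrencyPolicy defaults to Allow and there is no starting or active deadline, so a slow or hanging job can quietly pile up overlapping runs. Below is a minimal sketch of setting these fields explicitly, assuming a recent official kubernetes Python client that exposes batch/v1 CronJob objects; the name, schedule, image, and limits are placeholders, not recommendations.

    # cronjob_safer_defaults.py -- illustrative sketch of a CronJob with explicit,
    # more conservative settings than the API defaults. Assumes a kubernetes Python
    # client new enough to expose batch/v1 CronJob objects.
    from kubernetes import client, config

    config.load_kube_config()

    cron_job = client.V1CronJob(
        metadata=client.V1ObjectMeta(name="nightly-report"),  # hypothetical name
        spec=client.V1CronJobSpec(
            schedule="0 3 * * *",
            concurrency_policy="Forbid",       # default is Allow: runs can overlap
            starting_deadline_seconds=300,     # default is unset: a missed run may start arbitrarily late
            successful_jobs_history_limit=3,
            failed_jobs_history_limit=3,
            job_template=client.V1JobTemplateSpec(
                spec=client.V1JobSpec(
                    active_deadline_seconds=3600,  # terminate jobs that hang instead of letting them pile up
                    backoff_limit=2,
                    template=client.V1PodTemplateSpec(
                        spec=client.V1PodSpec(
                            restart_policy="Never",
                            containers=[client.V1Container(
                                name="report",
                                image="example.org/report:latest",  # placeholder image
                            )],
                        )
                    ),
                )
            ),
        ),
    )

    client.BatchV1Api().create_namespaced_cron_job(namespace="default", body=cron_job)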
If you hit an upstream issue: congratulations! You can follow or file the issue upstream and either hope for a fix or contribute one yourself, helping many others. I would expect this class of failure causes to shrink over time as CNCF projects mature and the user base grows: it becomes less and less probable that you are the first to hit a given upstream issue.
Cloud/Organization Integrations
Kubernetes comes in more than one flavor: there are many possible combinations of Kubernetes components and configurations. Kubernetes also has to interact with your cloud platform, such as Google Cloud or AWS, and with your existing IT landscape. Every one of these integrations is a potential failure scenario.
We saw Kubernetes' AWS cloud provider code easily hit AWS API rate limits and run into problems with EBS persistent volume attachments. The dynamic IPs of the AWS ELB in front of the API server caused kubelets to lose their connection. The AWS IAM integration (kube2iam) is notoriously prone to race conditions.
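Rate limits are especially nasty because they affect everything else in the same AWS account that talks to the same APIs. For tooling you control yourself, client-side throttling and backoff help a lot; the sketch below shows boto3's built-in adaptive retry mode and is only an illustration of that idea, not a change to how the Kubernetes cloud provider itself calls AWS.

    # describe_volumes_with_backoff.py -- illustrative sketch: client-side retry/backoff
    # for your own AWS tooling, so it degrades gracefully when the account is being
    # rate limited (e.g. by a busy Kubernetes cloud provider). Assumes boto3/botocore.
    import boto3
    from botocore.config import Config

    # "adaptive" retry mode adds client-side rate limiting on top of exponential backoff.
    retry_config = Config(retries={"max_attempts": 10, "mode": "adaptive"})

    ec2 = boto3.client("ec2", config=retry_config)

    # Example call: list EBS volumes that are still attaching, a common symptom when
    # persistent volume attachment is slow or stuck.
    paginator = ec2.get_paginator("describe_volumes")
    for page in paginator.paginate(Filters=[{"Name": "attachment.status", "Values": ["attaching"]}]):
        for volume in page["Volumes"]:
            print(volume["VolumeId"], volume["State"])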
Human Error
Let us be clear: there is no such thing as "human error" as a root cause. If your Root Cause Analysis (RCA) concludes with "human error", start over and ask some hard questions.