Kubernetes Failure Stories
I started to compile a list of public failure/horror stories related to Kubernetes. It should make it easier for people tasked with operations to find outage reports to learn from.
Since we started with Kubernetes at Zalando in 2016, we have collected many internal postmortems. Docker bugs (daemon unresponsive, process stuck in pipe wait, ...) were a major pain point in the beginning, but Docker itself has become more mature and has not bitten us recently. The biggest chunk of problems can be attributed to the nature of distributed systems and "cascading failures": for example, a Kubernetes API server outage should not affect running workloads, but it did. See also our recent CoreDNS incident.
We shared some of our incidents and Kubernetes failures in talks:
My main motivation for giving such talks about failures is that I want to hear more of them myself! Nordstrom's talk "101 Ways to Crash Your Cluster" at KubeCon 2017 was my inspiration (as you can even see from the similarity in talk titles ;-)). I hope to see more people share their postmortems and give failure talks. Monzo's transparency and public postmortem are a great service to the community and should be something we all strive towards.
Compiling a List of Kubernetes Failure Stories
On my quest to find more public Kubernetes failure stories, I discovered that either they are really hard to find (or my web search skills are lacking) or only very few have been published. Search terms I tried on DuckDuckGo and Google:
I also tried various combinations, as well as "k8s", "kube-dns", and "kube-proxy" instead of "kubernetes". This did not yield many results, and most of the pages I found are really "success" stories that highlight how to prevent outages from happening. That's boring!
The compiled list of Kubernetes Failure Stories I found so far is available on GitHub. I hope to see many contributions to the list from the community, but I guess the hard part is encouraging people to publish their outage reports. Please contribute to the list by opening an issue, creating a PR or reaching out to me on Twitter!
I'll be at a meetup in Hamburg in February to talk more about Kubernetes failures. Please join if you can: "Let’s talk about Failures with Kubernetes!" meetup Hamburg.
At Zalando, we will try to publish a write-up of our recent Kubernetes DNS incident and hopefully find a way to share postmortems with the community more systematically. Sharing our failure stories is something we can all benefit from to harden our setups and help prioritize upstream issues. "Production-readiness" is, from my perspective, still something mostly discussed behind closed doors (i.e. inside organizations). For example, CPU CFS quota behavior and its latency impact are not well known and not mentioned in the docs. Let's change that!
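To make the CFS quota point a bit more concrete, here is a minimal Python sketch of the arithmetic behind CPU throttling. It is a simplified model, not a measurement from any cluster: it assumes the default CFS period of 100 ms, a container whose threads are all continuously CPU-bound, and enough cores to run those threads in parallel. The function names are made up for illustration.

```python
# Simplified model of CFS quota throttling for a container with a CPU limit.
# Assumptions (not from the original post): default 100 ms CFS period,
# all threads are continuously CPU-bound and can run in parallel.

CFS_PERIOD_US = 100_000  # 100 ms, the default CFS period


def cfs_quota_us(cpu_limit_millicores: int, period_us: int = CFS_PERIOD_US) -> int:
    """CPU time (in microseconds) the container may consume per period."""
    return cpu_limit_millicores * period_us // 1000


def throttled_us_per_period(
    cpu_limit_millicores: int,
    busy_threads: int,
    period_us: int = CFS_PERIOD_US,
) -> float:
    """Wall-clock time per period during which the container is throttled."""
    quota = cfs_quota_us(cpu_limit_millicores, period_us)
    # With N busy threads, the quota is consumed N times faster than wall time.
    wall_time_until_exhausted = quota / busy_threads
    if wall_time_until_exhausted >= period_us:
        return 0.0  # quota is never exhausted within the period
    return period_us - wall_time_until_exhausted


if __name__ == "__main__":
    # A 500m limit with 4 busy threads: the 50 ms quota is gone after
    # 12.5 ms of wall time, so the container stalls for the other 87.5 ms.
    print(throttled_us_per_period(500, 4))  # 87500.0
```

In this model, a request that arrives right after the quota is exhausted waits for the rest of the period before it gets any CPU, which is one way a seemingly generous limit can still show up as tail latency.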
BTW: I'm also still looking for the first Istio failure talk...
Some recommended talks/reads for Kubernetes in production:
Hardening Kubernetes Setups: War Stories from the Trenches of Production - Giant Swarm - KubeCon North America 2018: not very deep, but mentions some good points to look out for
90 days of AWS EKS in Production - Graham Moore - blog post 2018: many tunable system parameters (which you probably should not copy 1-1 without understanding them), mentions important
Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latency - Zalando - Highload++ 2018: why you should consider disabling CPU throttling (CFS quota) in your cluster(s)
Kubernetes the very hard way at Datadog: good insights into common (DNS issues, OOM) and less common (e.g. Datadog uses containerd and IPVS) challenges
Inside Kubernetes Resource Management (QoS) – Mechanics and Lessons from the Field - Michael Gasch - KubeCon Europe 2018: fundamental information on how Kubernetes resources work