Kubernetes Failure Stories


I started to compile a list of public failure/horror stories related to Kubernetes. It should make it easier for people tasked with operations to find outage reports to learn from.

Since we started with Kubernetes at Zalando in 2016, we have collected many internal postmortems. Docker bugs (unresponsive daemon, processes stuck in pipe wait, etc.) were a major pain point in the beginning, but Docker itself has become more mature and has not bitten us recently. The biggest chunk of problems can be attributed to the nature of distributed systems and "cascading failures": for example, a Kubernetes API server outage should not affect running workloads, but it did; see also our recent CoreDNS incident.

We shared some of our incidents and Kubernetes failures in talks:

My main motivation for giving such talks about failures is that I want to hear more of them myself! Nordstrom's talk "101 Ways to Crash Your Cluster" at KubeCon 2017 was my inspiration (as you can even see from the similarity in talk titles ;-)). I hope to see more people share their postmortems and give failure talks. Monzo's transparency and public postmortem is a great service to the community and something we should all strive towards.

Compiling a List of Kubernetes Failure Stories

On my quest to find more public Kubernetes failure stories, I discovered that they are either really hard to find (or my web search skills are lacking) or that very few have been published. Search terms I tried on DuckDuckGo and Google:

I also tried various combinations with "k8s", "kube-dns", and "kube-proxy" instead of "kubernetes". This did not yield many results, and most of the pages I found are really "success" stories that highlight how to prevent outages from happening. That's boring!

The compiled list of Kubernetes Failure Stories I found so far is available on GitHub. I hope to see many contributions to the list from the community, but I guess the hard part is encouraging people to publish their outage reports. Please contribute to the list by opening an issue, creating a PR or reaching out to me on Twitter!


What's Next

I'll be at a meetup in Hamburg in February to talk more about Kubernetes failures; please join if you can: "Let’s talk about Failures with Kubernetes!" meetup Hamburg.

At Zalando, we will try to publish a write-up of our recent Kubernetes DNS incident and hopefully find a way to share postmortems with the community more systematically. Sharing our failure stories is something we can all benefit from to harden our setups and help prioritize upstream issues. "Production-readiness" is, from my perspective, still something mostly discussed behind closed doors (i.e. inside organizations); for example, CPU CFS quota behavior and its latency impact are not well known and not mentioned in the docs. Let's change that!
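To make the CFS quota point concrete: the kubelet translates a container's CPU limit into a cgroup CFS quota by multiplying the limit by the CFS period (100ms by default). Once a container burns through its quota within a period, all of its threads are throttled until the next period starts, which is where the latency spikes come from. A minimal sketch of that mapping, assuming the default period (the helper name is mine):

```python
def cpu_limit_to_cfs_quota(cpu_limit: str, period_us: int = 100_000) -> int:
    """Convert a Kubernetes CPU limit (e.g. "500m" or "2") into a CFS
    quota in microseconds, assuming the default 100ms CFS period."""
    if cpu_limit.endswith("m"):
        # millicores, e.g. "500m" -> 0.5 cores
        cores = int(cpu_limit[:-1]) / 1000
    else:
        cores = float(cpu_limit)
    return int(cores * period_us)

# A pod limited to 500m gets 50ms of CPU time per 100ms period;
# exhausting it means the container is throttled for the rest of
# the period, even if the node has idle CPU.
print(cpu_limit_to_cfs_quota("500m"))  # 50000
print(cpu_limit_to_cfs_quota("2"))     # 200000
```

This is only an illustration of the cgroup arithmetic, not a kubelet implementation detail beyond what the resource-management docs describe.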

BTW: I'm also still looking for the first Istio failure talk...


Some recommended talks/reads for Kubernetes in production:

UPDATE 2019-01-28

I did a brief write-up on what happened after posting this blog article on Hacker News.