Liveness Probes are DangerousPosted: | More posts about kubernetes
Kubernetes livenessProbe can be dangerous. I recommend to avoid them unless you have a clear use case and understand the consequences. This post looks at both Liveness and Readiness Probes and describes some "DOs" and "DON'Ts"
My colleague Sandor recently tweeted about common mistakes he sees, including wrong Readiness/Liveness Probe usage:
A wrong livenessProbe setting can worsen high-load situations (cascading failure + potential long container/app start) and lead to other negative consequences like bringing down dependencies (see also my recent post about K3s+ACME rate limits). A Liveness Probe in combination with an external DB health check dependency is the worst situation: a single DB hiccup will restart all your containers!
A blanket statement of "don't use Liveness Probes" is not helpful, so let's look at what Readiness and Liveness Probe are for. NOTE: most of the following text was initially put together for Zalando's internal developer documentation.
Readiness and Liveness Probes
Kubernetes provides two essential features called Liveness Probes and Readiness Probes. Essentially, Liveness/Readiness Probes will periodically perform an action (e.g. make an HTTP request, open a TCP connection, or run a command in your container) to confirm that your application is working as intended.
Kubernetes uses Readiness Probes to know when a container is ready to start accepting traffic. A Pod is considered ready when all of its containers are ready. One use of this signal is to control which Pods are used as backends for Kubernetes Services (and esp. Ingress).
Kubernetes uses Liveness Probes to know when to restart a container. For example, Liveness Probes could catch a deadlock, where an application is running, but unable to make progress. Restarting a container in such a state can help to make the application more available despite bugs, but restarting can also lead to cascading failures (see below).
If you attempt to deploy a change to your application that fails the Liveness/Readiness Probe, the rolling deploy will hang as it waits for all of your Pods to become Ready.
Example Readiness Probe checking the /health path via HTTP with default settings (interval: 10 seconds, timeout: 1 second, success threshold: 1, failure threshold: 3):
- for microservices providing an HTTP endpoint (REST service etc), always define a Readiness Probe which checks that your application (Pod) is ready to receive traffic
- make sure that your Readiness Probe covers the readiness of the
actual webserver port
- when using an "admin" or "management" port (e.g. 9090) for your readinessProbe, make sure that the endpoint only returns OK if the main HTTP port (e.g. 8080) is ready to accept traffic 
- having a different port for the Readiness Probe also can lead to thread pool congestion problems on the main port which are not reflected in the health check (i.e. main server thread pool is full, but health check still answers OK)
- make sure that your Readiness Probe includes database initialization/migration
- use httpGet for Readiness Probes to a well-known health check endpoint (e.g. /health)
understand the default behavior (interval: 10s, timeout: 1s,
successThreshold: 1, failureThreshold: 3) of probes
- the default values mean that a Pod will become not-ready after ~30s (3 failing health checks)
- do use a different "admin" or "management" port if your tech stack
(e.g. Java/Spring) allows this to separate management health and
metrics from normal traffic
- but check point 2
- you can use the Readiness Probe for prewarming/cache loading if
needed and return 503 status code until the app container is "warm"
- check also the new startupProbe introduced in 1.16
do not depend on external dependencies (like data stores) for
your Readiness/Liveness checks as this might lead to cascading
- e.g. a stateful REST service with 10 pods which depends on a single Postgres database: when your probe depends on a working DB connection, all 10 pods will be "down" if the database/network has a hiccup --- this usually makes the impact worse than it should
- note that the default behavior of Spring Data is checking the DB connection 
- "external" in this context can also mean other Pods of the same
application, i.e. your probe should ideally not depend on the
state of other Pods in the same cluster to prevent cascading
- Your mileage may vary for apps with distributed state (e.g. in-memory caching across Pods)
do not use a Liveness Probe for your Pods unless you understand
the consequences and why you need a Liveness Probe
- Liveness Probe can help recover "stuck" containers, but as you are fully owning your application, things like "stuck" processes and deadlocks should not be expected --- a better alternative is to crash on purpose to recover to a known-good state
- a failing Liveness Probe will lead to container restarts, thus potentially making the impact of load-related errors worse: container restart will lead to downtime (at least your app's startup time, e.g. 30s+), thus causing more errors and giving other containers more traffic load, leading to more failing containers, and so on
- Liveness Probes in combination with an external dependency are the worst situation leading to cascading failures: a single DB hiccup will restart all your containers!
- if you use Liveness Probe, don’t set the same specification for
Liveness and Readiness Probe
- you can use a Liveness Probe with the same health check, but a higher failureThreshold (e.g. mark as not-ready after 3 attempts and fail Liveness Probe after 10 attempts)
- do not use "exec" probes as there are known problems with them resulting in zombie processes
- use Readiness Probes for your web app to decide when the Pod should receive traffic
- use Liveness Probes only when you have a use case for them
- incorrect use of Readiness/Liveness Probes can lead to reduced availability and cascading failures
UPDATE 2019-09-29 #1
On init containers for DB migrations: added a footnote.
EJ reminded me about PDBs: one problem of Liveness Probes is the missing coordination across a fleet of Pods: Kubernetes has Pod Disruption Budgets (PDB) to limit the number of concurrent disruptions that an application can experience, but PDBs are not respected by probes. Ideally we could tell K8s "restart one pod if its probe fails, but don't restart all of them to not make it worse".
Bryan put it nicely: "Use a liveness probe when you know for sure the best possible thing to do is to kill your app." and don't sprinkle them like pixie dust:
UPDATE 2019-09-29 #2
On "read the docs before using": I created a feature request for the Kubernetes documentation about Liveness Probes: https://github.com/kubernetes/website/issues/16607
|||I know at least one Zalando incident where this was not the case, i.e. the readinessProbe checked the "management" port, but the actual webserver never came up due to being stuck in cache loading.|
|||You can also trigger/run DB migrations from init containers or from outside the Pod. I'm still a fan of self-contained apps, i.e. where the app container knows how to bring the DB into the right state without external coordination.|
|||This was the default behavior of Spring Data Redis (at least the last time I checked) which led to "catastrophic" failure: when Redis was briefly unavailable, all app Pods were "down".|