K3s, Traefik, Let's Encrypt, and Local Path

[Image: Kubernetes logo on fire]

The story of how I messed up my K3s demo site with Traefik as Ingress controller and Let's Encrypt rate limits, or: how to configure K3s with local-path volumes.

Approx. one month ago, I set up a K3s demo site on a cheap VPS to show Kubernetes Web View (see announcement blog post). I used K3s' default Ingress controller Traefik and configured Let's Encrypt (ACME) for TLS termination. This all worked fine and took me less than a day to set up (the tricky part was finding out the right Traefik ACME config).

Five days ago I decided to upgrade K3s from 0.7 to the current version 0.8.1.

Upgrading K3s

Upgrading K3s is straightforward: just run the install command again! At least that's what I do on my demo server (I probably would do something else for production workloads):

curl -sfL https://get.k3s.io | sh -

The K3s upgrade went fine and Ingress worked (with Traefik's default TLS cert), but I needed to apply my overwrites to configure ACME (Let's Encrypt) in Traefik for proper TLS certificates.

Applying Overwrites

My strategy for overwriting the Traefik config was pretty dumb: just use kubectl to apply my custom manifests, especially Traefik's TOML configuration:

kubectl apply -f manifests/
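
For context, the interesting overwrite is essentially a ConfigMap in kube-system carrying a custom traefik.toml with an [acme] section for Let's Encrypt. A minimal sketch of what such a manifest can look like for Traefik 1.7 (the resource name, e-mail address, and storage path are assumptions here, not the exact files from my demo repo):

apiVersion: v1
kind: ConfigMap
metadata:
  name: traefik            # assumption: the name used by the K3s Traefik deployment
  namespace: kube-system
data:
  traefik.toml: |
    logLevel = "info"
    defaultEntryPoints = ["http", "https"]
    [entryPoints]
      [entryPoints.http]
      address = ":80"
      [entryPoints.https]
      address = ":443"
        [entryPoints.https.tls]
    # Let's Encrypt via the ACME TLS challenge
    [acme]
    email = "admin@example.org"     # assumption: contact address
    storage = "/acme/acme.json"     # assumption: path inside the Traefik pod
    entryPoint = "https"
    onHostRule = true
    [acme.tlsChallenge]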

This broke the Ingress: https://demo.j-serv.de/ (and https://kube-web-view.demo.j-serv.de/) were now down!

[Image: Kubernetes logo on fire]

Looking into the Traefik logs revealed "connection refused" errors from the ACME TLS challenge, and the Traefik pod was now crash-looping :-(

I tried to switch the ACME challenge from "TLS" to "HTTP", but nothing helped!

This was just a demo site, so I gave up after 1-2 hours and decided to wait until Monday, hoping that the ACME challenge would eventually succeed over the course of the weekend.

Spoiler: it did not succeed :-(

The Fix

Javier reminded me on Twitter on Monday that the site was still down:

[Screenshot: tweet pointing out that the Kubernetes Web View demo is down]

So I reset my whole demo server and started from scratch with Ubuntu 18.04.

This took less than 10 minutes, and the result: Ingress worked again with Traefik's default cert!

Now I checked Traefik's K3s default configuration and discovered the culprit:

logLevel = "info"
defaultEntryPoints = ["http","https"]
[entryPoints]
  [entryPoints.http]
  address = ":80"
  compress = true
  [entryPoints.https]
  address = ":443"
  compress = true
    [entryPoints.https.tls]
[ping] # <-- this entry was not there with K3s 0.7!
entryPoint = "http"
[kubernetes]
  [kubernetes.ingressEndpoint]
  publishedService = "kube-system/traefik"
[traefikLog]
  format = "json"

So K3s 0.8.1 ships with the Traefik Helm chart 1.76.1 instead of 1.64, and the new chart configures a livenessProbe against the "ping" endpoint. This was causing the constant pod restarts: my overwrite did not have the "ping" endpoint, so the livenessProbe always failed.
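
For reference, the relevant part of the upgraded Traefik Deployment looks roughly like this sketch (image tag, port name, and timings are assumptions, not copied from the chart). Without a [ping] section in traefik.toml nothing answers on /ping, so the kubelet keeps restarting the container:

spec:
  template:
    spec:
      containers:
        - name: traefik
          image: traefik:1.7.12
          livenessProbe:
            httpGet:
              path: /ping    # only served when [ping] is configured in traefik.toml
              port: http     # assumption: named port for the ":80" entrypoint
            initialDelaySeconds: 10
            failureThreshold: 3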

Great, so everything should work now after adding the "ping" section to my overwrite manifest!

Spoiler: it was not enough to make it work :-(

Let's Encrypt Rate Limits

Let's Encrypt has a number of rate limits, and the frequent pod restarts (each one registering a fresh ACME account, since nothing was persisted yet) led to being rate limited (max. 10 account creations per IP address per 3 hours).

So waiting some hours should fix it automatically?

Not quite! Traefik does not retry ACME challenges, so we are back to restarting the pod (manually). Sadly, this also does not work immediately, as there is a race condition between server start and the ACME challenge. Here the Traefik logs show "Connection refused" as a symptom of the race condition (line breaks added for readability):

{"level":"info","msg":"Using TOML configuration file /config/traefik.toml","time":"2019-09-17T10:30:47Z"}
{"level":"info","msg":"Traefik version v1.7.12 built on 2019-05-29_07:35:02PM","time":"2019-09-17T10:30:47Z"}
...
{"level":"info","msg":"Server configuration reloaded on :80","time":"2019-09-17T10:30:48Z"}
{"level":"info","msg":"Server configuration reloaded on :443","time":"2019-09-17T10:30:48Z"}
{"level":"error","msg":"Unable to obtain ACME certificate for domains \"kube-resource-report.demo.j-serv.de\" detected thanks to rule \"Host:kube-resource-report.demo.j-serv.de\" :
  unable to generate a certificate for the domains [kube-resource-report.demo.j-serv.de]: acme: Error -\u003e One or more domains had a problem:\n[kube-resource-report.demo.j-serv.de]
  acme: error: 400 :: urn:ietf:params:acme:error:connection ::
    Connection refused, url: \n","time":"2019-09-17T10:31:05Z"}
{"level":"error","msg":"Unable to obtain ACME certificate for domains \"kube-ops-view.demo.j-serv.de\" detected thanks to rule \"Host:kube-ops-view.demo.j-serv.de\" :
  unable to generate a certificate for the domains [kube-ops-view.demo.j-serv.de]: acme: Error -\u003e One or more domains had a problem:\n[kube-ops-view.demo.j-serv.de]
  acme: error: 400 :: urn:ietf:params:acme:error:connection ::
    Connection refused, url: \n","time":"2019-09-17T10:31:08Z"}
{"level":"error","msg":"Unable to obtain ACME certificate for domains \"demo.j-serv.de\" detected thanks to rule \"Host:demo.j-serv.de\" :
  unable to generate a certificate for the domains [demo.j-serv.de]: acme: Error -\u003e One or more domains had a problem:\n[demo.j-serv.de]
  acme: error: 400 :: urn:ietf:params:acme:error:connection ::
    Connection refused, url: \n","time":"2019-09-17T10:31:09Z"}

Restarting the pod multiple times might eventually succeed, but let's first configure a persistent volume for the Let's Encrypt certs so we don't run into rate limits again. After some web search, I found a K3s issue describing how to configure "local-path-storage":

sudo mkdir /opt/local-path-provisioner
kubectl apply -f https://raw.githubusercontent.com/rancher/local-path-provisioner/master/deploy/local-path-storage.yaml
kubectl patch storageclass local-path -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'

Now we can configure a PVC for Traefik and mount it via another overwrite of the Deployment manifest. After changing storage = "/local-path/acme.json" in the "acme" TOML section, we have persistent storage for our Let's Encrypt TLS certificates!
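
A sketch of what this can look like, with assumed resource names and size (the real manifests are linked at the end of this post):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: traefik-acme        # assumption: any name works, it just has to match the Deployment
  namespace: kube-system
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: local-path
  resources:
    requests:
      storage: 128Mi

The Traefik Deployment overwrite then mounts this claim at /local-path (via volumes and volumeMounts), so acme.json, and with it the ACME account and certificates, survives pod restarts.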

Some pod restarts (kubectl delete pod traefik..) later, I could get some certs to work despite the startup race condition.

I could eventually work around the race condition completely by changing the Ingress object, which makes Traefik reload its configuration without a restart (no race condition there, yay!).

All Ingress hostnames are now available again with proper Let's Encrypt TLS certificates.

Summary

This journey of a K3s upgrade took longer than expected, but I learned some things:

  • overwriting upstream manifests requires careful double-checking on version upgrades (I did not see the new "ping" livenessProbe!)

  • Let's Encrypt rate limits can lead to outages where only waiting (for hours!) helps (I don't want to have that situation in prod!)

  • Traefik's ACME module is prone to race conditions and does not retry ACME challenges

  • persisting ACME certs is important to avoid denial-of-service by rate limits

  • K3s documentation is severely lacking, but now I know how to configure "local-path-storage"

The impact of this outage was very low (demo site, who cares?), otherwise I would have prioritized an immediate fix much earlier.

You can find all configuration files for the K3s demo at https://codeberg.org/hjacobs/k3s-demo