Sreeram Venkitesh

cgroup drivers and how kubelet and containerd work together

Over the last month, I've been writing my own toy implementation of kubeadm, k8sbootstrap.

So far I've implemented the equivalent of the kubeadm init command, which sets up the Kubernetes control plane on the node you're running the command on. The reference page for kubeadm init has a really good list of all the different "phases" the init command executes. This includes things like preflight checks to make sure the container runtime is set up properly, generating certificates and then kubeconfig files for the different components using those certificates, generating static pod manifests for the control plane components, and actually starting the kubelet.

When I implemented the init command in k8sbootstrap, I followed the same structure but simplified a couple of steps, such as using the same CA to generate the certs for both the Kubernetes components and etcd (this works, but it's best to keep the two CAs separate).

After creating the certs, I generated kubeconfig files from them, storing the certs under /etc/kubernetes/pki and the kubeconfigs under /etc/kubernetes. Now that I had everything needed to bring up the control plane, I wrote the code to generate the static pod manifests: I initialized the different control plane components as corev1.Pod objects and encoded them into files in /etc/kubernetes/manifests with a JSON serializer, roughly like the sketch below. After this, systemd starts the kubelet on the node. The kubelet is configured to know where the static pod manifests are, and it creates the containers.
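
To make that concrete, here's a trimmed-down sketch of what that manifest-writing code can look like. The pod spec, image tag and file paths here are illustrative and not the exact ones from k8sbootstrap:

```go
package main

import (
	"os"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/serializer/json"
	"k8s.io/client-go/kubernetes/scheme"
)

// writeStaticPodManifest encodes a corev1.Pod with apimachinery's JSON
// serializer and writes it into the kubelet's static pod directory.
func writeStaticPodManifest(pod *corev1.Pod, path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	// The kubelet's file source accepts both YAML and JSON manifests,
	// so a plain JSON serializer is enough here.
	s := json.NewSerializerWithOptions(
		json.DefaultMetaFactory, scheme.Scheme, scheme.Scheme,
		json.SerializerOptions{Pretty: true},
	)
	return s.Encode(pod, f)
}

func main() {
	// A heavily trimmed etcd pod; a real manifest also needs volumes,
	// probes, resource requests and the full set of etcd flags.
	etcdPod := &corev1.Pod{
		TypeMeta:   metav1.TypeMeta{APIVersion: "v1", Kind: "Pod"},
		ObjectMeta: metav1.ObjectMeta{Name: "etcd", Namespace: "kube-system"},
		Spec: corev1.PodSpec{
			HostNetwork: true,
			Containers: []corev1.Container{{
				Name:    "etcd",
				Image:   "registry.k8s.io/etcd:3.5.12-0", // illustrative tag
				Command: []string{"etcd", "--data-dir=/var/lib/etcd"},
			}},
		},
	}

	if err := writeStaticPodManifest(etcdPod, "/etc/kubernetes/manifests/etcd.json"); err != nil {
		panic(err)
	}
}
```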

At this point, I was able to run crictl ps and see the containers for etcd, kube-apiserver, kube-scheduler and kube-controller-manager running on my node. Things were going well until then, but I soon noticed that all my pods were effectively in CrashLoopBackOff. I wasn't tracking the status of the mirror pods from the Kubernetes API just yet, but the containers kept exiting and restarting over and over.

Looking at the apiserver logs, I could see that it was restarting because it couldn't reach the etcd server, so etcd getting restarted was what was taking the other components down as well. The etcd logs were confusing because they didn't report anything wrong. etcd comes up, the server starts and accepts connections, I can even curl the health endpoint and get a 200 OK response, and then the pod randomly receives a SIGTERM and exits gracefully, only to come right back up again.

To debug etcd in isolation, I commented out everything except etcd in my manifest.go file, which was responsible for writing the static pod manifests. This confirmed that the problem was indeed with etcd and had nothing to do with the rest of the Kubernetes components' containers: those were only exiting because of whatever was going on with etcd. I went through the kubelet and containerd logs, suspecting that one of the two was sending etcd the stop signal. I was discussing this with Claude, and some of the possible reasons it suggested were failing liveness/readiness probes, an inadequate RuntimeRequestTimeout in the kubelet config, and the possibility of the /var/lib/etcd directory being reused by different instances of etcd when it should've been cleaned up each time. I tried playing around with all of these, and the etcd pod still kept getting killed randomly, a minute or so after it came up every time.

At this point I ditched the AI assistant and decided to go through issues in the kubernetes repository to research more. I googled "etcd sigterm" and saw a similar question asked on ServerFault. The post itself didn't have the answer to my problem, but the title of the question helped me rewrite my query properly. I copied the title, "All kube-system pods keep crashing, etcd receives sigterm", googled that instead, and found similar questions [1][2].

A screenshot of a Google search for "etcd sigterm". The first result is a question on ServerFault titled "All kube-system pods keep crashing, etcd receives sigterm".

I learnt from these posts that I had forgotten to set SystemdCgroup to true when installing containerd in the virtual machine I was using to run k8sbootstrap init. I recreated the containerd config file, stopped the kubelet, and ran k8sbootstrap init again to bring everything back up, and the etcd pod was finally stable!
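
For reference, the relevant bit of the containerd config (typically /etc/containerd/config.toml) looks something like this. I'm showing the containerd 1.x layout with config version 2; the plugin path differs slightly in containerd 2.x. You can regenerate a full default config with containerd config default, flip this one flag, and restart containerd with systemctl restart containerd:

```toml
# /etc/containerd/config.toml (relevant part only; containerd 1.x, config version 2)
version = 2

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"

  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
    # Make runc create cgroups through systemd instead of writing to
    # cgroupfs directly. This has to match the kubelet's cgroup driver.
    SystemdCgroup = true
```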

The reason behind this bug was that the kubelet and containerd were configured to manage cgroups in different ways. Because of this, even though containerd had created the containers (using the cgroupfs driver), the kubelet would look for them under systemd-managed cgroups. This mismatch meant the kubelet couldn't track the containers properly, so it would kill and restart them thinking they were unresponsive, leading to the containers being killed and recreated in a loop.
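
The kubelet's half of that agreement is its cgroupDriver setting. Recent kubeadm versions default it to systemd in the generated KubeletConfiguration, and a matching config would look roughly like this (illustrative, not my exact file):

```yaml
# /var/lib/kubelet/config.yaml (illustrative path)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Must agree with containerd's SystemdCgroup setting: either both sides
# use systemd, or both use cgroupfs.
cgroupDriver: systemd
```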

This was a very annoying bug to figure out, but I got to learn a lot about how the kubelet and containerd work together along the way.