
Debugging Kubernetes Deployments

Kubernetes is a useful tool for deploying and managing containerized applications, yet even seasoned Kubernetes enthusiasts agree that it’s hard to debug Kubernetes deployments and failing pods. This is because of the distributed nature of Kubernetes, which makes it hard to reproduce the exact issue and determine the root cause.
This article will cover some of the general guidelines you can use to debug your Kubernetes deployments and some of the common errors and issues you can expect to encounter.
Additionally, Scott Surovich and Marc Boorshtein help you learn about Kubernetes cluster deployment and management and discover how to deploy an entire platform to the cloud using continuous integration and continuous delivery (CI/CD) in their book Kubernetes and Docker - An Enterprise Guide.
Tools to Use to Debug the Kubernetes Cluster
It’s important to ensure that our process remains the same whether we’re debugging our application in Kubernetes or a physical machine. The tools we use will be the same, but with Kubernetes we're going to probe the state and outputs of the system. We can always start our debugging process using kubectl, or we can use some of the common Kubernetes debugging tools.
The kubectl describe command displays detailed information about resources. For example, kubectl describe node <node name> displays detailed information about the node named <node name>. Similarly, to get detailed information about a pod, use kubectl describe pod <pod name>. To see the full object in YAML format, use kubectl get with the -o yaml flag (kubectl get pod <pod name> -o yaml).
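For reference, these variants look like the following (node and pod names are placeholders):

```shell
# Detailed information about a specific node
kubectl describe node node01

# Detailed information about a specific pod
kubectl describe pod my-pod

# The full pod object in YAML format
kubectl get pod my-pod -o yaml
```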
Kubectl logs is another useful debugging command; it prints the logs for a container in a pod. For example, kubectl logs <pod name> prints the logs of the pod named <pod name>.
Note: To stream the logs, use the -f flag along with the kubectl logs command. The -f flag works similarly to the tail -f command in Linux. An example is kubectl logs -f <pod name>.
For kubectl, verbosity is controlled with the -v or --v flag, which takes an integer log level ranging from 0 to 9, where 0 is least verbose and 9 is most verbose. For example, kubectl -v=9 get nodes displays HTTP request contents without truncation.
Kubectl exec is a useful command for debugging a running container. You can run commands like kubectl exec <pod name> -- cat /var/log/messages to look at the logs of a given pod, or kubectl exec -it <pod name> -- sh to open a shell in the given pod.
Kubernetes events give you high-level information about what is happening inside the cluster. To list all the events, use kubectl get events; if you are looking for a specific type of event, such as Warning, use kubectl get events --field-selector type=Warning.
Debugging Pods
There are two common reasons why pods fail in Kubernetes:
- Startup failure: A container inside the pod doesn’t start.
- Runtime failure: The application code fails after container startup.
Debugging Common Pod Errors with Step-by-Step and Real-World Examples
CrashLoopBackOff
A CrashLoopBackOff error means that when your pod starts, it crashes; it then tries to start again, but crashes again. Here are some of the common causes of CrashLoopBackOff:
- An error in the application inside the container.
- Misconfiguration inside the container (such as a missing ENTRYPOINT or CMD).
- A liveness probe that failed too many times.
Commands to Run to Identify the Issue
The first step in debugging this issue is to check pod status by running the kubectl get pods command.
Here we see our pod status under the STATUS column and verify that it’s in a CrashLoopBackOff state.
As a next troubleshooting step, we are going to run the kubectl logs command to print the logs for a container in a pod.
Here we see that the reason for this broken pod is the specification of an unknown option -z in the sleep command. Verify your pod definition file.
To fix this issue, specify a valid option for the sleep command in your pod definition file and recreate the pod.
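The exact commands depend on how the pod was created; assuming it came from a definition file (the pod and file names below are placeholders), the fix might look like this:

```shell
# In pod.yaml, replace the invalid sleep option with a valid duration, e.g.:
#   command: ["sleep", "1000"]
kubectl delete pod my-pod
kubectl apply -f pod.yaml
```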
Verify the status of your pod.
ErrImagePull/ImagePullBackOff
An ErrImagePull/ImagePullBackOff error occurs when you are not able to pull the desired Docker image. These are some of the common causes of this error:
- The image you have provided has an invalid name.
- The image tag doesn’t exist.
- The specified image is in a private registry.
Commands to Run to Identify the Issue
As with the CrashLoopBackOff error, our first troubleshooting step starts with getting the pod status using the kubectl get pods command.
Next we will run the kubectl describe command to get detailed information about the pod.
As you can see in the output of the kubectl describe command, it was unable to pull the image named redis123.
To find the correct image name, you can either go to Docker Hub or run the docker search command specifying the image name on the command line.
To fix this issue, we can follow either one of the following approaches:
- Start with the kubectl edit command, which allows you to directly edit any API resource you have retrieved via the command-line tool. Go under the spec section and change the image name from redis123 to redis.
- If you already have the YAML file, you can edit it directly; if you don't, you can use the --dry-run option to generate it.
Similarly, under image spec, change the image name to redis and recreate the pod.
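As a sketch, generating the manifest with --dry-run might look like this (the image name redis comes from the example above; the output file name is a placeholder):

```shell
# Generate a pod manifest without actually creating the pod
kubectl run redis --image=redis --dry-run=client -o yaml > redis-pod.yaml

# After fixing the image name in the file, recreate the pod
kubectl apply -f redis-pod.yaml
```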
CreateContainerConfigError
This error usually occurs when you’re missing a ConfigMap or Secret. Both are ways to inject data into the container when it starts up.
Commands to Run to Identify the Issue
We will start our debugging process with the standard Kubernetes debugging command kubectl get pods.
As the output of kubectl get pods shows, the pod is in a CreateContainerConfigError state. Our next command is kubectl describe, which gets detailed information about the pod.
To retrieve information about the ConfigMap, use this command:
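The command in question is most likely:

```shell
kubectl get configmaps
```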
As the output of the above command is null, the next step is to verify the pod definition file and create the ConfigMap.
ContainerCreating Pod
The main cause of a pod being stuck in the ContainerCreating state is that your container is looking for a secret that is not present. Secrets in Kubernetes let you store sensitive information such as tokens, passwords, and SSH keys.
Commands to Run to Identify the Issue
We will start our debugging process with the standard Kubernetes debugging command kubectl get pods.
As the output of kubectl get pods shows, the pod is in a ContainerCreating state. Our next command is the kubectl describe command, which gets detailed information about the pod.
To retrieve information about the secret, use this command:
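The command in question is most likely:

```shell
kubectl get secrets
```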
As the above command's output is null, the next step is to verify the pod definition file and create the secret.
To fix the error, create a secret file whose content looks like this.
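The original file isn't shown; a minimal Secret manifest (the secret name, key, and value are illustrative), written here with a shell heredoc, might look like this:

```shell
cat <<'EOF' > my-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: my-secret
type: Opaque
data:
  password: cGFzc3dvcmQ=   # base64 of "password"
EOF
```

Apply it with kubectl apply -f my-secret.yaml.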
Secrets are similar to ConfigMaps, except their values are stored in a base64-encoded format. Note that base64 is an encoding, not encryption; Secrets are not encrypted by default.
For example, you can use the base64 command to encode any text data.
Use the following commands to decode the text and print the original text.
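For example:

```shell
# Encode text data with base64
echo -n 'password' | base64
# → cGFzc3dvcmQ=

# Decode it to print the original text
echo -n 'cGFzc3dvcmQ=' | base64 --decode
# → password
```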
Now run kubectl get secret to retrieve information about a secret, and this time you will see the newly created secret.
Verify the status of the pod, and you will see that the pod is in running state now.
Debugging Worker Nodes
A worker node in the Kubernetes cluster is responsible for running your containerized applications. To debug worker node failure, we need to follow the same systematic approach and tools as we used while debugging pod failures. To reinforce the concept, we will look at three different scenarios where your worker node is in NotReady state.
NotReady State
There may be multiple reasons why your worker node goes into NotReady state. Here are some of them:
- The virtual machine where the worker node is running has shut down.
- There is a network issue between the worker and master nodes.
- There is a crash within the Kubernetes software.
As you know, the kubelet is the binary present on the worker node, and it is responsible for running containers on the node. We'll start our debugging with the kubelet.
Scenario 1: Worker Node is in NotReady State (kubelet is in inactive [dead] state)
Commands to Run to Identify the Issue
Before we start debugging at the kubelet binary, let’s first check the status of worker nodes using kubectl get nodes.
Now, as we have confirmed that our worker node (node01) is in NotReady state, the next command we will run is ps to check the status of the kubelet.
As we see, the kubelet process is not running. We can run the systemctl command to verify it further.
Using the systemctl command, we confirmed that the kubelet is not running (Active: inactive (dead)). Before we debug it further, we can try to start the kubelet service and see if that helps. To start the kubelet, run the systemctl start kubelet command.
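On the worker node, that sequence of checks might look like this (expected output summarized in the comments):

```shell
ps -ef | grep kubelet      # no kubelet process appears in the output
systemctl status kubelet   # shows Active: inactive (dead)
systemctl start kubelet
```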
Verify the status of kubelet again.
Based on the systemctl status kubelet output, we can say that starting kubelet helped, and now it’s in running state. Let’s verify it from the Kubernetes master node.
As you can see, the worker node is now in Ready state.
Scenario 2: Worker Node is in NotReady state (kubelet is in activating [auto-restart] state)
In scenario 2, your worker node is again in NotReady state, and you’ve followed all the steps to start the kubelet (systemctl start kubelet), but that doesn’t help. This time, the kubelet service status is activating (auto-restart), which means it’s not able to start.
Here we’ll start our debugging with a command called journalctl. Using journalctl, you can view the logs collected by systemd. In this case, we will pass the -u option to see the logs of a particular service, kubelet.
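For the kubelet, that command is:

```shell
journalctl -u kubelet
```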
You will see the following error message:
Note: To jump to the end of the journalctl output, press SHIFT+G.
Our next step is to identify the configuration file. Since this scenario is created by kubeadm, you will see the custom configuration file.
This file also has the reference to the actual configuration used by kubelet. Open this file and check the line starting with clientCAFile:
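On a kubeadm-provisioned node, these are typically the following files (exact paths can vary by Kubernetes version):

```shell
# systemd drop-in that tells the kubelet where its config file lives
cat /etc/systemd/system/kubelet.service.d/10-kubeadm.conf

# the kubelet configuration itself; look for the clientCAFile line
grep clientCAFile /var/lib/kubelet/config.yaml
```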
You will see the incorrect CA path for clients (/etc/kubernetes/pki/my-test-file.crt). Let's check for the valid path under /etc/kubernetes/pki.
Replace it with the correct file path and restart the daemon.
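Assuming the standard kubeadm layout, where the cluster CA certificate lives at /etc/kubernetes/pki/ca.crt, the fix might look like this:

```shell
# In /var/lib/kubelet/config.yaml, change:
#   clientCAFile: /etc/kubernetes/pki/my-test-file.crt
# to:
#   clientCAFile: /etc/kubernetes/pki/ca.crt
systemctl daemon-reload
systemctl restart kubelet
```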
Again, verify it from the Kubernetes master node.
As you can see, the worker node is now in Ready state.
Scenario 3: Worker Node is in NotReady State (kubelet is in active [running] state)
In scenario 3, your worker node is again in NotReady state, and although you’ve followed all the steps for starting the kubelet (systemctl start kubelet), that doesn’t help. This time, the kubelet service is loaded and active (running), which means systemd has read the unit from disk and the process is up, yet the node is still not Ready. We will start our debugging with the standard process by running the systemctl status command.
If we look at the last line of the systemctl status command, we’ll see that the kubelet cannot communicate with the API server.
To get information about the API server, go to the Kubernetes master node and run kubectl cluster-info.
As you can see, there is a mismatch between the port in which the API server is running (6443) and the one to which kubectl is trying to connect (6553). Here we’ll follow the same step as in scenario 2, that is, identify the configuration file.
This time, though, you should check /etc/kubernetes/kubelet.conf, where the details about the API server are stored.
If you open this file, you’ll see that the port number defined for the API server is wrong; it should be 6443, as we saw in the output of kubectl cluster-info.
Change port 6553 to 6443 and restart the kubelet daemon.
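A sketch of the fix (the server host name below is a placeholder; keep whatever host your kubelet.conf already uses):

```shell
# In /etc/kubernetes/kubelet.conf, change:
#   server: https://controlplane:6553
# to:
#   server: https://controlplane:6443
systemctl daemon-reload
systemctl restart kubelet
```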
Debugging Control Plane
The Kubernetes control plane is the brains behind Kubernetes and is responsible for managing the Kubernetes cluster. To debug control plane failure, we need to follow the same systematic approach and tools we used while debugging worker node failures. In this case, we will try to replicate the scenario where we are trying to deploy an application and it is failing.
Pod in a Pending State
What are some of the reasons that a pod goes into a Pending state?
- The cluster doesn’t have enough resources.
- The current namespace has a resource quota.
- The pod references a PersistentVolumeClaim that is not yet bound.
- A control plane component, such as the scheduler, is down.
Commands to Run to Identify the Issue
We will start our debugging with the kubectl get all command. So far we have used kubectl get pods and kubectl get nodes, but to list all the resources, we need to pass all as the argument. The command looks like this:
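```shell
kubectl get all
```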
As you can see, the pod is in a Pending state. The scheduler is the component responsible for assigning pods to nodes. Let’s check the scheduler status in the kube-system namespace.
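A likely way to check (the kube-system namespace is standard; pod names vary by cluster):

```shell
kubectl get pods -n kube-system
```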
As you can see in the output, the kube-scheduler-controlplane pod is in CrashLoopBackOff state, and we already know that when the pod is in this state, it will try to start but will crash.
The next command we’ll run is the kubectl describe command to get detailed information about the pod.
As you can see, the name of the scheduler is incorrect. To fix this, go to the Kubernetes manifests directory and open kube-scheduler.yaml.
Under the command section, the name of the scheduler is wrong. Change it from kube-schedulerror to kube-scheduler.
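On a kubeadm control plane, the scheduler runs as a static pod whose manifest lives in /etc/kubernetes/manifests; editing it might look like this:

```shell
# Open the static pod manifest for the scheduler
vi /etc/kubernetes/manifests/kube-scheduler.yaml

# Under spec.containers[0].command, change:
#   - kube-schedulerror
# to:
#   - kube-scheduler
```

The kubelet watches this directory, so saving the file is enough to trigger recreation of the pod.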
Once you fix it, the scheduler pod will be recreated, and the application pod will be scheduled onto a worker node.
Summary
In this article, we have covered some of the ways to debug the Kubernetes cluster. Because Kubernetes is dynamic, it's hard to cover all use cases, but some of the techniques we’ve discussed will help you get started on your Kubernetes debugging journey.
If you want to go beyond debugging Kubernetes and debug the application itself, Sidekick provides a rich feature set for production debugging with tracepoints.
You can check Sidekick's documentation here, and if you still haven't stepped into Sidekick yet, start your journey here.