Troubleshooting Kubernetes Clusters
Introduction
This tutorial provides a practical guide to troubleshooting common issues in Kubernetes control plane and worker nodes. We’ll cover essential kubectl commands, log analysis techniques, and system utilities to diagnose and resolve problems. Basic familiarity with Kubernetes architecture and kubectl is assumed.
Task 1: Checking the Cluster Status
A good starting point for troubleshooting is to check the overall health of your Kubernetes cluster.
```mermaid
sequenceDiagram
    participant Admin as Admin/CLI
    participant K8s_API as Kubernetes API Server
    participant ControlPlane as controlplane Node
    participant Worker1 as worker-node-1 Node
    Note over Admin, K8s_API: Step 1: Check Node Status
    Admin->>K8s_API: kubectl get nodes
    activate K8s_API
    K8s_API-->>Admin: controlplane (Ready), worker-node-1 (Ready), ...
    deactivate K8s_API
    Note right of Admin: (Optional: Repeat for any problematic node)
    Note over Admin, ControlPlane: Step 2: Describe Specific Node
    Admin->>K8s_API: kubectl describe node controlplane
    activate K8s_API
    K8s_API->>ControlPlane: Get detailed node data
    activate ControlPlane
    ControlPlane-->>K8s_API: Node details (Conditions, etc.)
    deactivate ControlPlane
    K8s_API-->>Admin: Full Node description
    deactivate K8s_API
    Note right of Admin: Check Conditions: MemoryPressure, DiskPressure, PIDPressure, Ready
```
- Use `kubectl` to retrieve the status of all nodes:

  ```bash
  kubectl get nodes
  ```

  ```output
  NAME            STATUS   ROLES                  AGE   VERSION
  controlplane    Ready    control-plane,master   30m   v1.27.4
  worker-node-1   Ready    <none>                 28m   v1.27.4
  worker-node-2   Ready    <none>                 28m   v1.27.4
  ```

  A node's `STATUS` should be `Ready`. If a node is in `NotReady`, `SchedulingDisabled`, or another state, it indicates a problem.

- Inspect individual node details for more information:

  ```bash
  kubectl describe node <node-name>
  ```

  Replace `<node-name>` with the name of the node you want to inspect.

  ```output
  Name:               controlplane
  Roles:              control-plane,master
  ...
  Conditions:
    Type             Status  LastHeartbeatTime  LastTransitionTime  Reason                      Message
    ----             ------  -----------------  ------------------  ------                      -------
    MemoryPressure   False   ...                ...                 KubeletHasSufficientMemory  Kubelet has sufficient memory available
    DiskPressure     False   ...                ...                 KubeletHasNoDiskPressure    Kubelet has no disk pressure
    PIDPressure      False   ...                ...                 KubeletHasSufficientPID     Kubelet has sufficient PID available
    Ready            True    ...                ...                 KubeletReady                kubelet is posting ready status
  ...
  ```

  Pay close attention to the `Conditions` section. Conditions like `MemoryPressure`, `DiskPressure`, or `PIDPressure` can indicate resource exhaustion. A `KubeletNotReady` reason usually means the kubelet itself is having problems.
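The node check above is easy to automate. Here is a minimal sketch that filters the output of `kubectl get nodes` for nodes not in the `Ready` state; the `$sample` variable stands in for live command output (it is hypothetical data for illustration), and on a real cluster you would pipe `kubectl get nodes` into the same `awk` filter instead.

```shell
# Hypothetical captured output of `kubectl get nodes` (sample data, not live).
sample='NAME            STATUS     ROLES                  AGE   VERSION
controlplane    Ready      control-plane,master   30m   v1.27.4
worker-node-1   NotReady   <none>                 28m   v1.27.4
worker-node-2   Ready      <none>                 28m   v1.27.4'

# Skip the header row, then report any node whose STATUS column is not "Ready".
# Live equivalent: kubectl get nodes | awk 'NR > 1 && $2 != "Ready" ...'
unhealthy=$(printf '%s\n' "$sample" | awk 'NR > 1 && $2 != "Ready" { print $1 " (" $2 ")" }')
echo "$unhealthy"
```

Any node this prints is a candidate for the `kubectl describe node` step above.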
Task 2: Examining Control Plane Components
The control plane consists of several essential components: kube-apiserver, kube-scheduler, kube-controller-manager, and etcd.
```mermaid
sequenceDiagram
    participant Admin as Admin/CLI
    participant K8s_API as kube-apiserver
    participant KubeSystem as kube-system Pods
    participant ETCD as etcd Store
    Note over Admin, KubeSystem: Step 1: Check Control Plane Pod Health
    Admin->>K8s_API: kubectl get pods -n kube-system
    activate K8s_API
    K8s_API-->>Admin: List of pods (etcd, apiserver, scheduler, etc.)
    deactivate K8s_API
    Note right of Admin: Verify STATUS is "Running"
    Note over Admin, KubeSystem: Step 2: Inspect Component Logs
    Admin->>K8s_API: kubectl logs -n kube-system POD_NAME
    activate K8s_API
    K8s_API->>KubeSystem: Fetch container logs
    activate KubeSystem
    KubeSystem-->>K8s_API: Log stream
    deactivate KubeSystem
    K8s_API-->>Admin: Display logs (check for Errors/Warnings)
    deactivate K8s_API
    Note over Admin, ETCD: Step 3: Verify Data Store Health
    Admin->>ETCD: etcdctl endpoint health (via certificates)
    activate ETCD
    ETCD-->>Admin: health: healthy (Revision, Member ID)
    deactivate ETCD
    Note right of Admin: If etcd is unhealthy, cluster state is at risk.
```
- Check the status of control plane pods:

  ```bash
  kubectl get pods -n kube-system
  ```

  ```output
  NAME                                   READY   STATUS    RESTARTS   AGE
  coredns-66bff4672f-7wbqk               1/1     Running   0          30m
  coredns-66bff4672f-z2w7z               1/1     Running   0          30m
  etcd-controlplane                      1/1     Running   0          30m
  kube-apiserver-controlplane            1/1     Running   0          30m
  kube-controller-manager-controlplane   1/1     Running   0          30m
  kube-proxy-jqqv9                       1/1     Running   0          29m
  kube-scheduler-controlplane            1/1     Running   0          30m
  ```

  All control plane pods in the `kube-system` namespace should be in the `Running` state. A `Pending` or `CrashLoopBackOff` status indicates a problem.

- View logs for specific control plane components:

  ```bash
  kubectl logs -n kube-system <pod-name>
  ```

  Replace `<pod-name>` with the name of the pod you want to inspect (e.g., `kube-apiserver-controlplane`).

  ```output
  ... (API server logs) ...
  ```

  Examine the logs for errors, warnings, or unusual activity. Focus on timestamps around when the problem started.

- Check etcd health. The exact command depends on how your cluster is set up; the following example uses `etcdctl`:

  ```bash
  ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    endpoint health
  ```

  ```output
  127.0.0.1:2379, health: healthy, revision: 5, cluster_id: 272b8c24e6dd171a, member_id: 8e9e05c52f4573f8, took: 1.836668ms
  ```

  etcd is the Kubernetes backing store. If etcd is unhealthy, your entire cluster will be affected. Consult the etcd documentation for troubleshooting steps; common issues include disk space exhaustion and network connectivity problems.
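The control plane pod check above can be scripted the same way. The sketch below flags any `kube-system` pod that is not `Running` or whose `READY` count (`x/y`) shows containers missing; the `$sample` output is hypothetical, and on a live cluster you would feed `kubectl get pods -n kube-system` into the same filter.

```shell
# Hypothetical captured output of `kubectl get pods -n kube-system`.
sample='NAME                                   READY   STATUS             RESTARTS   AGE
etcd-controlplane                      1/1     Running            0          30m
kube-apiserver-controlplane            1/1     Running            0          30m
kube-scheduler-controlplane            0/1     CrashLoopBackOff   4          30m'

# Skip the header; split READY as x/y and flag pods that are not Running
# or have fewer containers ready than expected.
problems=$(printf '%s\n' "$sample" | awk 'NR > 1 {
  split($2, r, "/")
  if ($3 != "Running" || r[1] != r[2]) print $1 ": " $3
}')
echo "$problems"
```

Each pod this prints is a candidate for the `kubectl logs -n kube-system` step above.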
Task 3: Investigating Worker Node Issues
Worker nodes run your application workloads. Issues on worker nodes can prevent pods from running correctly.
```mermaid
sequenceDiagram
    participant Admin as Admin/Local Machine
    participant Node as Worker Node (OS)
    participant Kubelet as Kubelet Service
    participant Runtime as Container Runtime (Docker/Containerd)
    Note over Admin, Node: Step 1: Access Node
    Admin->>Node: ssh USER@NODE_IP
    activate Node
    Note over Node, Kubelet: Step 2 & 3: Kubelet Health
    Node->>Kubelet: systemctl status kubelet
    Kubelet-->>Node: Status (Active/Inactive)
    Node->>Kubelet: journalctl -u kubelet
    Kubelet-->>Node: Logs (Errors, Image pulls, Networking)
    Note over Node, Runtime: Step 4: Runtime Health
    Node->>Runtime: systemctl status docker/containerd
    Runtime-->>Node: Status (Active/Inactive)
    Node->>Runtime: journalctl -u docker/containerd
    Runtime-->>Node: Runtime Logs
    Note over Admin, Node: Step 5: Resource Audit
    Admin->>Node: top / htop
    Node-->>Admin: CPU/Memory/Load metrics
    deactivate Node
    Note right of Admin: Analyze for resource contention or service failures.
```
- SSH into the worker node exhibiting problems:

  ```bash
  ssh <user>@<worker-node-ip>
  ```

  Replace `<user>` and `<worker-node-ip>` with the appropriate values.

- Check kubelet status using `systemctl`:

  ```bash
  sudo systemctl status kubelet
  ```

  ```output
  ● kubelet.service - kubelet: The Kubernetes Node Agent
     Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
     Active: active (running) since Tue 2023-10-24 14:30:00 UTC; 30min ago
  ...
  ```

  If the kubelet is not running, start it with `sudo systemctl start kubelet`. Check the logs with `sudo journalctl -u kubelet` to identify the reason for the failure.

- Inspect kubelet logs using `journalctl`:

  ```bash
  sudo journalctl -u kubelet
  ```

  ```output
  ... (kubelet logs) ...
  ```

  Look for errors related to image pulls, networking, or resource limitations.

- Check Docker/containerd status using `systemctl`:

  ```bash
  sudo systemctl status docker   # or containerd
  ```

  ```output
  ● docker.service - Docker Application Container Engine
     Loaded: loaded (/lib/systemd/system/docker.service; enabled; vendor preset: enabled)
     Active: active (running) since Tue 2023-10-24 14:28:00 UTC; 32min ago
  ...
  ```

  If the container runtime is not running, start it with `sudo systemctl start docker` (or `containerd`). Check the logs with `sudo journalctl -u docker` (or `containerd`) to identify the reason for the failure. Common causes mirror the kubelet issues above (image pull problems, disk space, etc.).

- Examine resource usage on the node using `top` or `htop`:

  ```bash
  top
  ```

  ```output
  top - 14:59:30 up 30 min,  1 user,  load average: 0.00, 0.01, 0.00
  Tasks: 110 total,   1 running, 109 sleeping,   0 stopped,   0 zombie
  %Cpu(s):  0.3 us,  0.3 sy,  0.0 ni, 99.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
  KiB Mem :  1009504 total,   745560 free,   119756 used,   144188 buff/cache
  KiB Swap:        0 total,        0 free,        0 used.   803532 avail Mem
  ```

  High CPU, memory, or I/O usage can indicate resource contention. Identify the processes consuming the most resources and investigate further.
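As a sketch of the service-health steps above, the snippet below extracts the service state from the `Active:` line of `systemctl status` output. The `$status` variable holds a hypothetical captured sample; on a real node you would run `systemctl status kubelet | grep 'Active:'` and apply the same parsing.

```shell
# Hypothetical first lines of `systemctl status kubelet` output.
status='kubelet.service - kubelet: The Kubernetes Node Agent
   Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
   Active: active (running) since Tue 2023-10-24 14:30:00 UTC; 30min ago'

# Pull the second field of the "Active:" line ("active", "inactive", "failed", ...).
state=$(printf '%s\n' "$status" | awk '/Active:/ { print $2 }')

if [ "$state" = "active" ]; then
  echo "kubelet is running"
else
  echo "kubelet is $state -- check: journalctl -u kubelet"
fi
```

The same parsing works for `systemctl status docker` or `containerd`, since systemd formats the `Active:` line identically for every unit.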
Task 4: Pod-Specific Troubleshooting
If a specific pod is failing, you can troubleshoot it directly.
```mermaid
sequenceDiagram
    participant Admin as Admin/CLI
    participant K8s as Kubernetes Cluster
    participant BB_A as busybox-a Pod
    participant BB_B as busybox-b Pod
    Note over Admin, K8s: Step 1 and 2: Deploy and Apply
    Admin->>K8s: kubectl apply (busybox pods)
    activate K8s
    K8s-->>Admin: pods created
    deactivate K8s
    Note over Admin, K8s: Step 3: Get Pod IPs
    Admin->>K8s: kubectl get pods -o wide
    activate K8s
    K8s-->>Admin: BB_A (10.244.0.5), BB_B (10.244.0.6)
    deactivate K8s
    Note over Admin, BB_B: Step 4: Verify Connectivity
    Admin->>BB_A: kubectl exec (ping 10.244.0.6)
    activate BB_A
    BB_A->>BB_B: ICMP request
    activate BB_B
    BB_B-->>BB_A: ICMP response
    deactivate BB_B
    BB_A-->>Admin: Ping Statistics
    deactivate BB_A
```
- Get the pod's status:

  ```bash
  kubectl get pod <pod-name> -n <namespace>
  ```

  Replace `<pod-name>` and `<namespace>` with the correct values.

  ```output
  NAME                         READY   STATUS             RESTARTS   AGE
  my-app-pod-7d6c7b964-mkt9q   0/1     ImagePullBackOff   0          5m
  ```

  Pay attention to the `STATUS` and `READY` columns. `ImagePullBackOff` indicates a problem pulling the container image, while `CrashLoopBackOff` means the container is crashing repeatedly. `0/1` in the `READY` column indicates the pod is not ready.

- Describe the pod for more detailed information:

  ```bash
  kubectl describe pod <pod-name> -n <namespace>
  ```

  ```output
  Name:         my-app-pod-7d6c7b964-mkt9q
  Namespace:    default
  ...
  Events:
    Type     Reason     Age               From               Message
    ----     ------     ----              ----               -------
    Normal   Scheduled  6m                default-scheduler  Successfully assigned default/my-app-pod-7d6c7b964-mkt9q to worker-node-1
    Warning  Failed     1m (x5 over 5m)   kubelet            Failed to pull image "my-app:latest": rpc error: code = NotFound desc = failed to pull and unpack image "docker.io/library/my-app:latest": failed to resolve reference "docker.io/library/my-app:latest": docker.io/library/my-app:latest: not found
    Warning  Failed     1m (x5 over 5m)   kubelet            Error: ErrImagePull
    Warning  Failed     1m (x5 over 5m)   kubelet            Error: ImagePullBackOff: Back-off pulling image "my-app:latest"
  ```

  The `Events` section often contains valuable clues about why a pod is failing.

- View the pod's logs:

  ```bash
  kubectl logs <pod-name> -n <namespace>
  ```

  ```output
  ... (application logs) ...
  ```

  If the pod is crashing frequently, you might need the `--previous` flag to view the logs from the previous container instance:

  ```bash
  kubectl logs --previous <pod-name> -n <namespace>
  ```

- Execute a command inside the running container (if possible):

  ```bash
  kubectl exec -it <pod-name> -n <namespace> -- /bin/bash
  ```

  This lets you inspect the container's filesystem, network configuration, and running processes. If the container is not running, you can't use `kubectl exec`.
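Since the `Events` section is usually where the root cause surfaces, a quick filter for `Warning` events can save scrolling through a long `kubectl describe pod` dump. The sketch below runs the filter over a shortened hypothetical sample of that output; on a live cluster you would pipe `kubectl describe pod <pod-name> -n <namespace>` into the same `awk` expression.

```shell
# Hypothetical (shortened) Events section from `kubectl describe pod`.
events='Events:
  Type     Reason     Age               From               Message
  ----     ------     ----              ----               -------
  Normal   Scheduled  6m                default-scheduler  Successfully assigned default/my-app-pod to worker-node-1
  Warning  Failed     1m (x5 over 5m)   kubelet            Failed to pull image "my-app:latest"
  Warning  Failed     1m (x5 over 5m)   kubelet            Error: ImagePullBackOff'

# Keep only rows whose first column is "Warning"; Normal events are
# usually noise when hunting for a failure cause.
warnings=$(printf '%s\n' "$events" | awk '$1 == "Warning"')
echo "$warnings"
```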
Task 5: Network Troubleshooting
Networking issues can prevent pods from communicating with each other or with external services.
```mermaid
sequenceDiagram
    participant Admin as Admin/CLI
    participant PodA as Source Pod
    participant CoreDNS as CoreDNS (kube-dns)
    participant K8s_API as K8s API Server
    participant Target as Target Pod / Service
    Note over Admin, CoreDNS: Step 1: DNS Resolution
    Admin->>PodA: kubectl exec (nslookup SERVICE_NAME)
    activate PodA
    PodA->>CoreDNS: DNS Query
    activate CoreDNS
    CoreDNS-->>PodA: Service IP (ClusterIP)
    deactivate CoreDNS
    PodA-->>Admin: Display Resolved IP
    deactivate PodA
    Note over Admin, Target: Step 2: Connectivity Test
    Admin->>PodA: kubectl exec (curl SERVICE_URL)
    activate PodA
    PodA->>Target: HTTP Request / Ping
    alt Connection Success
        Target-->>PodA: 200 OK / Response
        PodA-->>Admin: Success
    else Connection Failure
        Target--xPodA: Timeout / Connection Refused
        PodA-->>Admin: Error (Check Policies/CNI)
    end
    deactivate PodA
    Note over Admin, K8s_API: Step 3: Verify Service Endpoints
    Admin->>K8s_API: kubectl get endpoints SERVICE_NAME
    activate K8s_API
    K8s_API-->>Admin: List of Pod IPs + Ports
    deactivate K8s_API
    Note right of Admin: If list is empty, check Pod labels and selectors.
```
- Verify DNS resolution. From inside a pod, try to resolve a service name or external hostname using `nslookup` or `dig`:

  ```bash
  kubectl exec -it <pod-name> -n <namespace> -- nslookup <service-name>
  ```

  ```output
  Server:    10.96.0.10
  Address:   10.96.0.10#53

  Name:      <service-name>.default.svc.cluster.local
  Address:   10.97.141.170
  ```

  If DNS resolution fails, check your CoreDNS configuration and ensure that the kube-dns service is running correctly in the `kube-system` namespace.

- Test network connectivity using `ping` or `curl`. From inside a pod, try to reach another pod or service:

  ```bash
  kubectl exec -it <pod-name> -n <namespace> -- ping <pod-ip>
  ```

  ```bash
  kubectl exec -it <pod-name> -n <namespace> -- curl <service-url>
  ```

  If `ping` or `curl` fails, check your network policies, firewall rules, and routing configuration.

- Verify service endpoints:

  ```bash
  kubectl get endpoints <service-name> -n <namespace>
  ```

  ```output
  NAME         ENDPOINTS                           AGE
  my-service   10.244.2.10:8080,10.244.3.15:8080   30m
  ```

  If the `ENDPOINTS` list is empty, no pods are backing the service. Check your pod selectors and ensure that the pods are running and healthy.
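The endpoint check above also lends itself to a one-liner. The sketch below flags services whose `ENDPOINTS` column is `<none>`, i.e. services with no backing pods; the `$sample` data is hypothetical, and on a live cluster you would pipe `kubectl get endpoints -n <namespace>` into the same filter.

```shell
# Hypothetical captured output of `kubectl get endpoints`.
sample='NAME         ENDPOINTS                           AGE
my-service   10.244.2.10:8080,10.244.3.15:8080   30m
broken-svc   <none>                              12m'

# Skip the header, then print any service whose ENDPOINTS column is "<none>"
# (kubectl prints "<none>" when no ready pods match the service selector).
empty=$(printf '%s\n' "$sample" | awk 'NR > 1 && $2 == "<none>" { print $1 }')
echo "$empty"
```

For each service this prints, check the service's selector against your pod labels and confirm the pods are `Running` and passing their readiness probes.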
Conclusion
This tutorial covered essential techniques for troubleshooting Kubernetes control plane and worker nodes. By checking cluster status, examining logs, inspecting pod details, and verifying network connectivity, you can effectively diagnose and resolve common issues in your Kubernetes environment. Remember to consult the official Kubernetes documentation for more in-depth information on specific error messages and troubleshooting scenarios.