Troubleshooting Kubernetes Cluster and Node Failures
Introduction
This tutorial aims to equip you with the skills to troubleshoot common Kubernetes cluster and node failure scenarios. It assumes a working knowledge of Kubernetes concepts like pods, deployments, services, and namespaces, as well as familiarity with kubectl. You’ll need a Kubernetes cluster (minikube, kind, or a cloud-based cluster) and kubectl configured to access it. Basic Linux command-line skills are also required.
Task 1: Simulating a Node Failure
To understand the impact of node failures, we’ll simulate one. This is a crucial step to observe how Kubernetes responds and what indicators to look for.
```mermaid
sequenceDiagram
    participant Admin as Admin/CLI
    participant K8s_API as K8s API Server
    participant Worker as worker-node1 (Host)
    participant Kubelet as Kubelet Service
    Note over Admin, K8s_API: Step 1: Identify Target
    Admin->>K8s_API: kubectl get nodes
    K8s_API-->>Admin: worker-node1 (Ready)
    Note over Admin, K8s_API: Step 2: Prevent New Pods
    Admin->>K8s_API: kubectl cordon worker-node1
    activate K8s_API
    K8s_API-->>Admin: node/worker-node1 cordoned
    deactivate K8s_API
    Note right of K8s_API: Status: Ready, SchedulingDisabled
    Note over Admin, Worker: Step 3: Simulate Failure (SSH)
    Admin->>Worker: ssh USER@NODE_IP
    activate Worker
    Admin->>Worker: sudo systemctl stop kubelet
    Worker->>Kubelet: Stop Service
    deactivate Worker
    Note over K8s_API, Worker: Resulting Cluster State
    loop Node Health Check
        K8s_API-xKubelet: Heartbeat fails
    end
    K8s_API-->>Admin: worker-node1 status: NotReady
```
1. Identify a worker node in your cluster:

   ```bash
   kubectl get nodes
   ```

   ```output
   NAME           STATUS   ROLES                  AGE   VERSION
   controlplane   Ready    control-plane,master   20d   v1.29.0
   worker-node1   Ready    <none>                 20d   v1.29.0
   worker-node2   Ready    <none>                 20d   v1.29.0
   ```

2. Isolate the node by marking it as unschedulable. Replace `<node_name>` with the actual name of your worker node. In this case, we will target `worker-node1`.

   ```bash
   kubectl cordon worker-node1
   ```

   Cordoning a node prevents new pods from being scheduled on it. Existing pods continue to run.

3. Simulate a node failure by stopping the kubelet service on the node. Do this on the actual node machine (e.g., via SSH if it's a VM). If using minikube, run `minikube ssh` to enter the VM, then stop the kubelet service.

   ```bash
   sudo systemctl stop kubelet
   ```

   This command is executed on the actual worker node, NOT through kubectl on your control plane. Make sure you have shell access to the node.
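The target-selection step above can be automated. The sketch below is a minimal, hedged example: it parses the sample `kubectl get nodes` output from this tutorial (held in a heredoc so the script runs without a cluster) and picks the first schedulable worker. Against real infrastructure you would substitute live kubectl output, and the `USER@` placeholder in the trailing comment is hypothetical.

```bash
#!/bin/sh
# Sketch: pick a Ready worker node as the simulation target.
# The heredoc holds the sample `kubectl get nodes` output from this
# tutorial; on a live cluster, use the real command's output instead.
nodes_output=$(cat <<'EOF'
NAME           STATUS   ROLES                  AGE   VERSION
controlplane   Ready    control-plane,master   20d   v1.29.0
worker-node1   Ready    <none>                 20d   v1.29.0
worker-node2   Ready    <none>                 20d   v1.29.0
EOF
)

# Skip the header row, keep Ready nodes whose ROLES column is <none>
# (plain workers), and take the first match.
target=$(printf '%s\n' "$nodes_output" \
  | awk 'NR > 1 && $2 == "Ready" && $3 == "<none>" { print $1; exit }')

echo "target node: $target"
# On a real cluster the next steps would be:
#   kubectl cordon "$target"
#   ssh USER@"$target" 'sudo systemctl stop kubelet'
```

Filtering on the ROLES column keeps the control plane out of the blast radius, which matters when you run this against a real cluster.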
Task 2: Observing the Impact of Node Failure
Now that the node is failing, let’s observe how Kubernetes reacts.
1. List the nodes again and observe the STATUS of the node you cordoned and stopped the kubelet on.

   ```bash
   kubectl get nodes
   ```

   ```output
   NAME           STATUS                        ROLES                  AGE   VERSION
   controlplane   Ready                         control-plane,master   20d   v1.29.0
   worker-node1   NotReady,SchedulingDisabled   <none>                 20d   v1.29.0
   worker-node2   Ready                         <none>                 20d   v1.29.0
   ```

   You'll likely see the node in a `NotReady,SchedulingDisabled` state. This indicates Kubernetes has detected an issue with the node.

2. Examine the pods running on the failed node. Identify pods that should be automatically rescheduled.

   ```bash
   kubectl get pods --all-namespaces -o wide | grep worker-node1
   ```

   The `-o wide` option shows the node each pod is running on.

   ```output
   NAMESPACE   NAME                                READY   STATUS    RESTARTS   AGE   IP           NODE           NOMINATED NODE   READINESS GATES
   default     nginx-deployment-54db667d6d-abcde   1/1     Running   0          5d    10.244.2.5   worker-node1   <none>           <none>
   ```

3. After a few minutes (roughly five minutes by default, governed by the pod eviction timeout), Kubernetes should start rescheduling pods from the failed node onto other available nodes. Check the status of the pods again.

   ```bash
   kubectl get pods --all-namespaces -o wide | grep nginx-deployment
   ```

   ```output
   NAMESPACE   NAME                                READY   STATUS    RESTARTS   AGE   IP           NODE           NOMINATED NODE   READINESS GATES
   default     nginx-deployment-54db667d6d-fghij   1/1     Running   0          1m    10.244.3.6   worker-node2   <none>           <none>
   ```

   Notice that the pod `nginx-deployment-54db667d6d-abcde` that was running on `worker-node1` is no longer present, and a new pod `nginx-deployment-54db667d6d-fghij` now runs on `worker-node2`. The suffix of the pod name differs because it is a brand-new Pod instance.
Task 3: Diagnosing Node Failure with kubectl describe node
The kubectl describe node command provides a wealth of information about a node, including its status, capacity, and any recent events.
```mermaid
sequenceDiagram
    participant Admin as Admin/CLI
    participant K8s_API as K8s API Server
    participant Cache as Node Resource Store
    Note over Admin, K8s_API: Step 1: Request Detailed Diagnostics
    Admin->>K8s_API: kubectl describe node worker-node1
    activate K8s_API
    Note over K8s_API, Cache: Step 2: Aggregate Node Data
    K8s_API->>Cache: Fetch Node Metadata
    K8s_API->>Cache: Fetch Node Conditions
    K8s_API->>Cache: Fetch Capacity/Allocatable
    K8s_API->>Cache: Fetch Cluster Events
    Note over Admin, K8s_API: Step 3: Return Detailed Report
    K8s_API-->>Admin: Detailed Node Output
    deactivate K8s_API
    Note over Admin: Analysis Phase
    Note right of Admin: 1. Conditions: Check if Ready = False
    Note right of Admin: 2. Resources: Check Pressure flags
    Note right of Admin: 3. Events: Look for KubeletNotReady warnings
```
1. Use `kubectl describe node` to get detailed information about the failed node.

   ```bash
   kubectl describe node worker-node1
   ```

2. Examine the output. Pay close attention to the following sections:

   - Conditions: Shows the health of the node, including whether the kubelet is posting status (`Ready`), whether disk space is low (`DiskPressure`), and whether the node is network-reachable (`NetworkUnavailable`). In our case, the `Ready` condition will be `False` with reason `KubeletNotReady`.
   - Capacity: Shows the resources available on the node (CPU, memory, pods).
   - Allocatable: Shows the resources available for pods to use, after subtracting resources reserved for the system.
   - Events: Lists recent events that have occurred on the node, such as the kubelet failing to report status, pods being evicted due to resource pressure, or network connectivity issues. The Events section is your best friend for quickly identifying the root cause of a node failure.

3. Example snippet from the `Events` section that might indicate the problem:

   ```output
   Events:
     Type     Reason           Age                From             Message
     ----     ------           ----               ----             -------
     Warning  KubeletNotReady  3m (x10 over 10m)  kubelet, node1   Node became not ready.
     Normal   NodeNotReady     3m                 kubelet, node1   Node worker-node1 status is now: NodeNotReady
   ```
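On a busy node the Events table gets long, so it helps to filter for warnings first. This sketch extracts the Reason of every `Warning` event from the sample snippet above (held in a heredoc so it runs standalone); in practice you would feed it real `kubectl describe node` output.

```bash
#!/bin/sh
# Sketch: pull Warning event reasons out of a node's Events section.
# The heredoc is the sample Events snippet from this tutorial.
events=$(cat <<'EOF'
Events:
  Type     Reason           Age                From             Message
  ----     ------           ----               ----             -------
  Warning  KubeletNotReady  3m (x10 over 10m)  kubelet, node1   Node became not ready.
  Normal   NodeNotReady     3m                 kubelet, node1   Node worker-node1 status is now: NodeNotReady
EOF
)

# Column 1 is the event Type, column 2 the Reason.
warnings=$(printf '%s\n' "$events" | awk '$1 == "Warning" { print $2 }')
echo "warning reasons: $warnings"
```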
Task 4: Resolving the Node Failure
Now that you’ve diagnosed the node failure, let’s resolve it.
1. On the failed node (`worker-node1`), restart the kubelet service.

   ```bash
   sudo systemctl start kubelet
   ```

2. Uncordon the node to allow new pods to be scheduled on it.

   ```bash
   kubectl uncordon worker-node1
   ```

3. Verify the node returns to `Ready` status.

   ```bash
   kubectl get nodes
   ```

   ```output
   NAME           STATUS   ROLES                  AGE   VERSION
   controlplane   Ready    control-plane,master   20d   v1.29.0
   worker-node1   Ready    <none>                 20d   v1.29.0
   worker-node2   Ready    <none>                 20d   v1.29.0
   ```

4. Note that Kubernetes will not automatically move pods back to `worker-node1`. New pods, however, can now be scheduled there.
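The recovery check in step 3 is often wrapped in a polling loop. The sketch below uses a stubbed `node_ready` function (which succeeds on the third call) so the loop is runnable without a cluster; the real kubectl jsonpath query it would replace is shown in the comment.

```bash
#!/bin/sh
# Sketch: poll until a node reports Ready again after uncordoning.
# `node_ready` is a stub that succeeds on the third call; on a real
# cluster its body would instead test:
#   kubectl get node "$1" \
#     -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
calls=0
node_ready() {
  calls=$((calls + 1))
  [ "$calls" -ge 3 ]   # succeeds from the third call onward
}

attempts=0
until node_ready worker-node1; do
  attempts=$((attempts + 1))
  # sleep 5   # poll interval on a real cluster
done
echo "node Ready after $attempts retries"
```

Calling the function directly (rather than inside a command substitution) keeps the counter in the parent shell, so the stub terminates as intended.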
Task 5: Troubleshooting Common Node Failure Scenarios
Let’s explore some common node failure scenarios and how to troubleshoot them.
Scenario 1: Disk Pressure
Nodes can fail due to disk pressure, meaning they are running out of disk space.
```mermaid
sequenceDiagram
    autonumber
    participant Admin as Admin/CLI
    participant K8s as K8s API Server
    participant Node as Worker Node (Linux)
    participant Pods as Application Pods
    Note over Admin, Pods: Step 1: Cluster-Level Discovery
    Admin->>K8s: kubectl describe node
    K8s-->>Admin: Conditions: DiskPressure = True
    Admin->>K8s: kubectl get events
    K8s-->>Admin: "Evicted due to DiskPressure"
    Note over Admin, Node: Step 2: The Remediation Path
    rect rgba(46, 70, 255, 0.1)
        Admin->>Node: ssh & df -h (Check root/var/lib)
        Note right of Admin: Action: Identify log bloat / orphaned images
        Admin->>Node: docker system prune -a (Example remedy)
        Admin->>Node: df -h (Confirm usage < eviction threshold)
    end
    Note over Admin, Pods: Step 3: Verification
    Admin->>K8s: kubectl get node (Condition clears)
    K8s-->>Admin: Status: Ready
    Note right of Pods: Normal scheduling resumes.
```
- Symptoms:
  - Pods are evicted from the node.
  - `kubectl describe node` shows the `DiskPressure` condition is `True`.
  - Events indicate pods are being evicted due to `DiskPressure`.
- Troubleshooting:
  - Check disk usage on the node using `df -h`.
  - Identify and remove unnecessary files or containers.
  - Increase the disk size of the node (if possible).
  - Configure log rotation to prevent logs from filling up the disk.
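The `df -h` check above can be turned into a quick filter that flags any filesystem over a usage threshold. The sketch runs against hypothetical sample `df -h` output (the device names and sizes are made up); on the node itself you would pipe real `df -h` output into the same awk filter.

```bash
#!/bin/sh
# Sketch: flag filesystems above a usage threshold from `df -h`-style
# output. Sample data only; device names and sizes are hypothetical.
df_output=$(cat <<'EOF'
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        80G   78G  2.0G  98% /
/dev/sdb1       200G   40G  160G  20% /var/lib/containerd
EOF
)

threshold=85
# Column 5 is Use%; strip the % sign and compare numerically.
full=$(printf '%s\n' "$df_output" \
  | awk -v t="$threshold" 'NR > 1 { sub(/%/, "", $5); if ($5 + 0 > t) print $6 }')
echo "filesystems over ${threshold}%: $full"
```

A threshold around 85% gives headroom before the kubelet's default disk-pressure eviction logic kicks in.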
Scenario 2: Network Connectivity Issues
Nodes can become unreachable due to network connectivity problems.
```mermaid
sequenceDiagram
    autonumber
    participant Admin as Admin/CLI
    participant K8s as K8s API Server
    participant Node as Worker Node (Linux)
    participant CNI as CNI Plugin (e.g., Calico)
    Note over Admin, CNI: Step 1: The Outage Detection
    Admin->>K8s: kubectl describe node node-name
    K8s-->>Admin: Condition: NetworkUnavailable = True
    Note right of Node: Node lost route to API Server
    Node-xAdmin: ping API_SERVER_IP (Fails)
    Note over Admin, CNI: Step 2: The Root Cause Investigation
    rect rgba(46, 70, 255, 0.1)
        Note right of Admin: Investigating CNI Health
        Admin->>K8s: kubectl get pods -n kube-system -l k8s-app=cni
        K8s-->>Admin: list of CNI agent pods
        %% Use quotes around the text with angle brackets to avoid HTML parsing errors
        Admin->>K8s: kubectl logs -n kube-system "cni-pod-name"
        Note right of Admin: Verify Node OS Layer
        Admin->>Node: ssh & check "ip route"
        Node-->>Admin: Routing table output
    end
```
- Symptoms:
  - Node status is `NotReady`.
  - `kubectl describe node` shows the `NetworkUnavailable` condition is `True`.
  - Events indicate network connectivity issues.
- Troubleshooting:
  - Check network configuration on the node (e.g., routing tables, firewall rules).
  - Verify DNS resolution is working correctly.
  - Check network cables and switches.
  - Ensure the node can communicate with the Kubernetes API server.
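A common first routing-table check is simply: does the node still have a default route? The sketch below runs the check against hypothetical sample `ip route` output (the addresses and interface names are made up); on the affected node you would run `ip route` directly and apply the same grep.

```bash
#!/bin/sh
# Sketch: verify a default route exists, a frequent first check when a
# node reports NetworkUnavailable. Sample routing table; addresses and
# interface names are hypothetical.
routes=$(cat <<'EOF'
default via 192.168.1.1 dev eth0
10.244.2.0/24 dev cni0 proto kernel scope link src 10.244.2.1
192.168.1.0/24 dev eth0 proto kernel scope link src 192.168.1.50
EOF
)

if printf '%s\n' "$routes" | grep -q '^default '; then
  route_ok=yes
else
  route_ok=no
fi
echo "default route present: $route_ok"
```

A missing default route points at the OS network layer rather than the CNI plugin, which narrows the investigation quickly.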
Scenario 3: Kubelet Failure
The kubelet is the agent that runs on each node and manages the containers. If the kubelet fails, the node will become NotReady.
```mermaid
sequenceDiagram
    autonumber
    participant Admin as Admin/CLI
    participant API as K8s API Server (Lease Mechanism)
    participant Node as Worker Node (Linux)
    participant Kubelet as Kubelet Service (Heartbeat)
    Note over Admin, Kubelet: Step 1: Heartbeat Timeout
    Kubelet--xAPI: (STOPPED SENDING HEARTBEATS)
    Note right of API: Grace period (40s) expires.
    Admin->>API: kubectl get nodes
    API-->>Admin: Status: NotReady / Unknown
    Note over Admin, Kubelet: Step 2: Investigation (Local)
    Admin->>Node: ssh (Connection works, OS is fine)
    rect rgba(46, 70, 255, 0.1)
        Admin->>Node: systemctl status kubelet
        Node-->>Admin: Active: failed / inactive (Exit Code 1)
        Admin->>Node: journalctl -u kubelet | tail -n 50
        Note right of Admin: Action: Fix config, certificate, or OOM kill issue.
        Admin->>Node: systemctl restart kubelet
    end
    Note over Admin, Kubelet: Step 3: Re-Integration
    Kubelet->>API: (POST Heartbeat)
    Admin->>API: kubectl get node (Heartbeat accepted)
    API-->>Admin: Status: Ready
```
- Symptoms:
  - Node status is `NotReady`.
  - `kubectl describe node` shows the `Ready` condition is `False` with reason `KubeletNotReady`.
  - Events indicate the kubelet is failing to report status or is crashing.
- Troubleshooting:
  - Check the kubelet logs for errors (`journalctl -u kubelet` on systemd-based nodes).
  - Restart the kubelet service.
  - Check resource usage on the node (CPU, memory) to see if the kubelet is being starved.
  - Ensure the kubelet configuration is correct.
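Log triage can be partially scripted by grepping kubelet logs for common failure signatures. The sketch below runs against a hypothetical sample of `journalctl -u kubelet` output (the log lines are illustrative, not real captures), and the pattern list is a starting point, not exhaustive.

```bash
#!/bin/sh
# Sketch: triage kubelet logs for common failure signatures.
# `logs` is a hypothetical sample of `journalctl -u kubelet` output.
logs=$(cat <<'EOF'
Jan 10 12:00:01 worker-node1 kubelet[1234]: E0110 12:00:01 server.go:145] "Failed to load kubelet config file" err="open /var/lib/kubelet/config.yaml: no such file or directory"
Jan 10 12:00:02 worker-node1 systemd[1]: kubelet.service: Main process exited, code=exited, status=1/FAILURE
EOF
)

# Count lines matching any known signature (config, certs, memory).
matches=$(printf '%s\n' "$logs" \
  | grep -c -e 'Failed to load kubelet config' \
            -e 'certificate has expired' \
            -e 'OOM')
echo "known failure signatures found: $matches"
```

A nonzero count tells you which class of failure to chase before you restart the service, so the kubelet does not simply crash again for the same reason.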
Conclusion
In this tutorial, you learned how to simulate and troubleshoot node failures in Kubernetes. You learned how to identify the root cause of a node failure using kubectl get nodes and kubectl describe node, and how to resolve common issues like disk pressure, network connectivity problems, and kubelet failures. These skills are essential for maintaining a healthy and resilient Kubernetes cluster.