
Troubleshooting Kubernetes Cluster and Node Failures

Introduction

This tutorial aims to equip you with the skills to troubleshoot common Kubernetes cluster and node failure scenarios. It assumes a working knowledge of Kubernetes concepts like pods, deployments, services, and namespaces, as well as familiarity with kubectl. You’ll need a Kubernetes cluster (minikube, kind, or a cloud-based cluster) and kubectl configured to access it. Basic Linux command-line skills are also required.
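Before diving in, it helps to confirm that kubectl can actually reach your cluster. A quick sanity check (cluster and context names will differ in your environment):

```shell
# Verify connectivity to the API server and list the nodes kubectl can see.
kubectl cluster-info
kubectl get nodes -o wide

# Check which kubeconfig context is active, in case you manage several clusters.
kubectl config current-context
```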

Task 1: Simulating a Node Failure

To understand the impact of node failures, we’ll simulate one. This is a crucial step to observe how Kubernetes responds and what indicators to look for.

```mermaid
sequenceDiagram
    participant Admin as Admin/CLI
    participant K8s_API as K8s API Server
    participant Worker as worker-node1 (Host)
    participant Kubelet as Kubelet Service

    Note over Admin, K8s_API: Step 1: Identify Target
    Admin->>K8s_API: kubectl get nodes
    K8s_API-->>Admin: worker-node1 (Ready)

    Note over Admin, K8s_API: Step 2: Prevent New Pods
    Admin->>K8s_API: kubectl cordon worker-node1
    activate K8s_API
    K8s_API-->>Admin: node/worker-node1 cordoned
    deactivate K8s_API
    Note right of K8s_API: Status: Ready, SchedulingDisabled

    Note over Admin, Worker: Step 3: Simulate Failure (SSH)
    Admin->>Worker: ssh user@NODE_IP
    activate Worker
    Admin->>Worker: sudo systemctl stop kubelet
    Worker->>Kubelet: Stop Service
    deactivate Worker

    Note over K8s_API, Worker: Resulting Cluster State
    loop Node Health Check
        K8s_API-xKubelet: Heartbeat fails
    end
    K8s_API-->>Admin: worker-node1 status: NotReady
```
  1. Identify a worker node in your cluster:

    ```bash
    kubectl get nodes
    ```

    ```text
    NAME           STATUS   ROLES                  AGE   VERSION
    controlplane   Ready    control-plane,master   20d   v1.29.0
    worker-node1   Ready    <none>                 20d   v1.29.0
    worker-node2   Ready    <none>                 20d   v1.29.0
    ```
  2. Isolate the node by marking it as unschedulable. Replace <node_name> with the actual name of your worker node. In this case, we will target worker-node1.

    ```bash
    kubectl cordon worker-node1
    ```

    Cordoning a node prevents new pods from being scheduled on it. Existing pods continue to run.
  3. Simulate a node failure by stopping the kubelet service on the node. Do this on the actual node machine (e.g., via SSH if it’s a VM). If using minikube, you will need to ssh into the minikube VM, and then stop the kubelet service.

    ```bash
    sudo systemctl stop kubelet
    ```

    This command is executed on the actual worker node, NOT through kubectl on your control plane. Make sure you have access to the node’s shell.
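Before moving on, you can confirm the cordon took effect from the control plane. A small sketch, assuming the node is named worker-node1 as above:

```shell
# kubectl cordon sets the node's spec.unschedulable flag to true...
kubectl get node worker-node1 -o jsonpath='{.spec.unschedulable}'; echo
# expected: true

# ...and the node is tainted so the scheduler skips it.
kubectl describe node worker-node1 | grep 'Taints:'
# Taints: node.kubernetes.io/unschedulable:NoSchedule
```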

Task 2: Observing the Impact of Node Failure

Now that the node is failing, let’s observe how Kubernetes reacts.

  1. List the nodes again and observe the STATUS of the node you cordoned and stopped the kubelet on.

    ```bash
    kubectl get nodes
    ```

    ```text
    NAME           STATUS                        ROLES                  AGE   VERSION
    controlplane   Ready                         control-plane,master   20d   v1.29.0
    worker-node1   NotReady,SchedulingDisabled   <none>                 20d   v1.29.0
    worker-node2   Ready                         <none>                 20d   v1.29.0
    ```

    You’ll see the node reported as NotReady,SchedulingDisabled: NotReady because the kubelet has stopped reporting heartbeats, and SchedulingDisabled because of the earlier cordon.

  2. Examine the pods running on the failed node. Identify pods that should be automatically rescheduled.

    ```bash
    kubectl get pods --all-namespaces -o wide | grep worker-node1
    ```

    The -o wide option shows the node each pod is running on. Alternatively, filter server-side with kubectl get pods --all-namespaces --field-selector spec.nodeName=worker-node1.

    ```text
    NAMESPACE   NAME                                READY   STATUS    RESTARTS   AGE   IP           NODE           NOMINATED NODE   READINESS GATES
    default     nginx-deployment-54db667d6d-abcde   1/1     Running   0          5d    10.244.2.5   worker-node1   <none>           <none>
    ```
  3. After a few minutes (by default about five, governed by the pods’ not-ready and unreachable tolerations of 300 seconds), Kubernetes starts rescheduling pods that were running on the failed node onto other available nodes. Only pods managed by a controller such as a Deployment are replaced; bare pods are not recreated. Check the status of the pods again.

    ```bash
    kubectl get pods --all-namespaces -o wide | grep nginx-deployment
    ```

    ```text
    NAMESPACE   NAME                                READY   STATUS    RESTARTS   AGE   IP           NODE           NOMINATED NODE   READINESS GATES
    default     nginx-deployment-54db667d6d-fghij   1/1     Running   0          1m    10.244.3.6   worker-node2   <none>           <none>
    ```

    Notice that the pod nginx-deployment-54db667d6d-abcde from worker-node1 is gone, and a new pod nginx-deployment-54db667d6d-fghij now runs on worker-node2. The random suffix of the pod name is different because this is a brand-new Pod instance, not the old one moved.
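The delay before rescheduling is not arbitrary: the admission controller gives every pod default tolerations for the not-ready and unreachable node taints, each with tolerationSeconds of 300 (five minutes). You can inspect them on any pod; the pod name below comes from the example output and will differ in your cluster:

```shell
# Print each toleration's key and how long the pod tolerates that taint.
kubectl get pod nginx-deployment-54db667d6d-fghij \
  -o jsonpath='{range .spec.tolerations[*]}{.key}={.tolerationSeconds}{"\n"}{end}'
# Expected to include:
#   node.kubernetes.io/not-ready=300
#   node.kubernetes.io/unreachable=300
```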

Task 3: Diagnosing Node Failure with kubectl describe node

The kubectl describe node command provides a wealth of information about a node, including its status, capacity, and any recent events.

```mermaid
sequenceDiagram
    participant Admin as Admin/CLI
    participant K8s_API as K8s API Server
    participant Cache as Node Resource Store

    Note over Admin, K8s_API: Step 1: Request Detailed Diagnostics
    Admin->>K8s_API: kubectl describe node worker-node1
    activate K8s_API

    Note over K8s_API, Cache: Step 2: Aggregate Node Data
    K8s_API->>Cache: Fetch Node Metadata
    K8s_API->>Cache: Fetch Node Conditions
    K8s_API->>Cache: Fetch Capacity/Allocatable
    K8s_API->>Cache: Fetch Cluster Events

    Note over Admin, K8s_API: Step 3: Return Detailed Report
    K8s_API-->>Admin: Detailed Node Output
    deactivate K8s_API

    Note over Admin: Analysis Phase
    Note right of Admin: 1. Conditions: Check if Ready = False
    Note right of Admin: 2. Resources: Check Pressure flags
    Note right of Admin: 3. Events: Look for KubeletNotReady warnings
```
  1. Use kubectl describe node to get detailed information about the failed node.

    ```bash
    kubectl describe node worker-node1
    ```
  2. Examine the output. Pay close attention to the following sections:

    • Conditions: This section shows the health of the node, including the Ready condition (reported via kubelet heartbeats), DiskPressure (disk space is low), MemoryPressure, PIDPressure, and NetworkUnavailable (the node’s network is not correctly configured). In our case, Ready will no longer be True; once heartbeats stop, its status typically becomes Unknown with a message that the kubelet stopped posting node status.
    • Capacity: Shows the resources available on the node (CPU, memory, pods).
    • Allocatable: Shows the resources available for pods to use, after subtracting resources reserved for the system.
    • Events: This section is crucial. It lists recent events that have occurred on the node, such as the kubelet failing to report status, pods being evicted due to resource pressure, or network connectivity issues.
    The Events section is your best friend for quickly identifying the root cause of a node failure.
  3. Example snippet from the Events section that might indicate the problem:

    ```text
    Events:
      Type     Reason           Age                From                    Message
      ----     ------           ----               ----                    -------
      Warning  KubeletNotReady  3m (x10 over 10m)  kubelet, worker-node1   Node became not ready.
      Normal   NodeNotReady     3m                 node-controller         Node worker-node1 status is now: NodeNotReady
    ```
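When the describe output is long, you can also pull just the events that reference the node, using a field selector. A sketch with the node name from this tutorial:

```shell
# Show only events whose involved object is worker-node1, oldest first.
kubectl get events --all-namespaces \
  --field-selector involvedObject.name=worker-node1 \
  --sort-by=.lastTimestamp
```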

Task 4: Resolving the Node Failure

Now that you’ve diagnosed the node failure, let’s resolve it.

  1. On the failed node (worker-node1), restart the kubelet service.

    ```bash
    sudo systemctl start kubelet
    ```
  2. Uncordon the node to allow new pods to be scheduled on it.

    ```bash
    kubectl uncordon worker-node1
    ```
  3. Verify the node returns to Ready status.

    ```bash
    kubectl get nodes
    ```

    ```text
    NAME           STATUS   ROLES                  AGE   VERSION
    controlplane   Ready    control-plane,master   20d   v1.29.0
    worker-node1   Ready    <none>                 20d   v1.29.0
    worker-node2   Ready    <none>                 20d   v1.29.0
    ```
  4. Kubernetes will not automatically move pods back to worker-node1. New pods can now be scheduled there.
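If you want the workload balanced across nodes again right away, one option (a sketch, not the only approach) is a rolling restart of the deployment: the replacement pods go through scheduling again and may land on the recovered node. The app=nginx label below is an assumption; substitute your deployment’s actual labels.

```shell
# Recreate the deployment's pods gradually; each new pod is scheduled fresh.
kubectl rollout restart deployment/nginx-deployment
kubectl rollout status deployment/nginx-deployment

# Confirm where the pods ended up (label selector is an assumption).
kubectl get pods -o wide -l app=nginx
```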

Task 5: Troubleshooting Common Node Failure Scenarios

Let’s explore some common node failure scenarios and how to troubleshoot them.

Scenario 1: Disk Pressure

Nodes can fail due to disk pressure, meaning they are running out of disk space.

```mermaid
sequenceDiagram
    autonumber
    participant Admin as Admin/CLI
    participant K8s as K8s API Server
    participant Node as Worker Node (Linux)
    participant Pods as Application Pods

    Note over Admin, Pods: Step 1: Cluster-Level Discovery
    Admin->>K8s: kubectl describe node node-name
    K8s-->>Admin: Conditions: DiskPressure = True
    Admin->>K8s: kubectl get events
    K8s-->>Admin: "Evicted due to DiskPressure"

    Note over Admin, Node: Step 2: The Remediation Path
    rect rgba(46, 70, 255, 0.1)
        Admin->>Node: ssh & df -h (Check root/var/lib)
        Note right of Admin: Action: Identify log bloat / orphaned images
        Admin->>Node: docker system prune -a (Example remedy)
        Admin->>Node: df -h (Confirm usage < eviction threshold)
    end

    Note over Admin, Pods: Step 3: Verification
    Admin->>K8s: kubectl get node (Condition clears)
    K8s-->>Admin: Status: Ready
    Note right of Pods: Normal scheduling resumes.
```
  1. Symptoms:

    • Pods are evicted from the node.
    • kubectl describe node shows DiskPressure condition is True.
    • Events indicate pods are being evicted due to DiskPressure.
  2. Troubleshooting:

    • Check disk usage on the node using df -h.
    • Identify and remove unnecessary files or containers.
    • Increase the disk size of the node (if possible).
    • Configure log rotation to prevent logs from filling up the disk.
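On the node itself, that investigation might look like the following sketch. The paths and the prune command depend on your container runtime: crictl for containerd/CRI-O nodes, docker for Docker-based ones.

```shell
# Where is the space going? kubelet state and container images live here.
df -h / /var/lib/kubelet /var/lib/containerd 2>/dev/null
sudo du -sh /var/log/* 2>/dev/null | sort -rh | head -5

# Reclaim space from unused container images (pick the line for your runtime).
sudo crictl rmi --prune
# sudo docker system prune -a

# Old systemd journal entries are another frequent culprit.
sudo journalctl --vacuum-size=200M
```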

Scenario 2: Network Connectivity Issues

Nodes can become unreachable due to network connectivity problems.

```mermaid
sequenceDiagram
    autonumber
    participant Admin as Admin/CLI
    participant K8s as K8s API Server
    participant Node as Worker Node (Linux)
    participant CNI as CNI Plugin (e.g., Calico)

    Note over Admin, CNI: Step 1: The Outage Detection
    Admin->>K8s: kubectl describe node node-name
    K8s-->>Admin: Condition: NetworkUnavailable = True

    Note right of Node: Node lost route to API Server
    Node-xAdmin: ping API_SERVER_IP (Fails)

    Note over Admin, CNI: Step 2: The Root Cause Investigation
    rect rgba(46, 70, 255, 0.1)
        Note right of Admin: Investigating CNI Health
        Admin->>K8s: kubectl get pods -n kube-system -l k8s-app=cni
        K8s-->>Admin: list of CNI agent pods

        %% Use quotes around the text with angle brackets to avoid HTML parsing errors
        Admin->>K8s: kubectl logs -n kube-system "cni-pod-name"

        Note right of Admin: Verify Node OS Layer
        Admin->>Node: ssh & check "ip route"
        Node-->>Admin: Routing table output
    end
```
  1. Symptoms:

    • Node status is NotReady.
    • kubectl describe node shows NetworkUnavailable condition is True.
    • Events indicate network connectivity issues.
  2. Troubleshooting:

    • Check network configuration on the node (e.g., routing tables, firewall rules).
    • Verify DNS resolution is working correctly.
    • Check network cables and switches.
    • Ensure the node can communicate with the Kubernetes API server.
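Run from the failed node’s shell, those checks might look like this sketch. The API server address and port are assumptions; take the real values from kubectl cluster-info on a working machine.

```shell
API_SERVER=10.0.0.10   # assumption: replace with your control plane IP
API_PORT=6443          # default kube-apiserver port

# Basic reachability, then the API port itself (no netcat needed).
ping -c 3 "$API_SERVER"
timeout 3 bash -c "echo > /dev/tcp/$API_SERVER/$API_PORT" \
  && echo "API port reachable" || echo "API port blocked"

# Routing table and firewall rules are the usual suspects.
ip route
sudo iptables -L -n | head -20

# DNS from the node (cluster-internal *.cluster.local names only resolve from pods).
nslookup kubernetes.io
```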

Scenario 3: Kubelet Failure

The kubelet is the agent that runs on each node and manages the containers. If the kubelet fails, the node will become NotReady.

```mermaid
sequenceDiagram
    autonumber
    participant Admin as Admin/CLI
    participant API as K8s API Server (Lease Mechanism)
    participant Node as Worker Node (Linux)
    participant Kubelet as Kubelet Service (Heartbeat)

    Note over Admin, Kubelet: Step 1: Heartbeat Timeout
    Kubelet--xAPI: (STOPPED SENDING HEARTBEATS)
    Note right of API: Grace period (40s) expires.
    Admin->>API: kubectl get nodes
    API-->>Admin: Status: NotReady / Unknown

    Note over Admin, Kubelet: Step 2: Investigation (Local)
    Admin->>Node: ssh (Connection works, OS is fine)
    rect rgba(46, 70, 255, 0.1)
        Admin->>Node: systemctl status kubelet
        Node-->>Admin: Active: failed / inactive (Exit Code 1)
        Admin->>Node: journalctl -u kubelet | tail -n 50
        Note right of Admin: Action: Fix config, certificate, or OOM kill issue.
        Admin->>Node: systemctl restart kubelet
    end

    Note over Admin, Kubelet: Step 3: Re-Integration
    Kubelet->>API: (POST Heartbeat)
    Admin->>API: kubectl get node (Heartbeat accepted)
    API-->>Admin: Status: Ready
```
  1. Symptoms:

    • Node status is NotReady.
    • kubectl describe node shows KubeletReady condition is False.
    • Events indicate the kubelet is failing to report status or is crashing.
  2. Troubleshooting:

    • Check the kubelet logs for errors: on systemd-managed nodes, run journalctl -u kubelet; some installations also write to /var/log/kubelet.log.
    • Restart the kubelet service.
    • Check resource usage on the node (CPU, memory) to see if the kubelet is being starved.
    • Ensure the kubelet configuration is correct.
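Put together, a typical kubelet diagnosis session on the node looks like this sketch (assumes a systemd-managed kubelet; the unit name and config path can vary by distribution):

```shell
# Is the service up, and if not, how did it exit?
systemctl status kubelet --no-pager

# The last log lines usually contain the fatal error
# (bad flag, unreadable config, expired certificate, OOM kill).
sudo journalctl -u kubelet -n 50 --no-pager

# Quick resource and config sanity checks before restarting.
free -m
sudo ls -l /var/lib/kubelet/config.yaml   # common config path; may differ

sudo systemctl restart kubelet
systemctl is-active kubelet   # expect: active
```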

Conclusion

In this tutorial, you learned how to simulate and troubleshoot node failures in Kubernetes. You learned how to identify the root cause of a node failure using kubectl get nodes and kubectl describe node, and how to resolve common issues like disk pressure, network connectivity problems, and kubelet failures. These skills are essential for maintaining a healthy and resilient Kubernetes cluster.
