
Troubleshooting Kubernetes Clusters

Introduction

This tutorial provides a practical guide to troubleshooting common issues in Kubernetes control plane and worker nodes. We’ll cover essential kubectl commands, log analysis techniques, and system utilities to diagnose and resolve problems. Basic familiarity with Kubernetes architecture and kubectl is assumed.

Task 1: Checking the Cluster Status

A good starting point for troubleshooting is to check the overall health of your Kubernetes cluster.

sequenceDiagram
    participant Admin as Admin/CLI
    participant K8s_API as Kubernetes API Server
    participant ControlPlane as controlplane Node
    participant Worker1 as worker-node-1 Node

    Note over Admin, K8s_API: Step 1: Check Node Status
    Admin->>K8s_API: kubectl get nodes
    activate K8s_API
    K8s_API-->>Admin: controlplane (Ready), worker-node-1 (Ready), ...
    deactivate K8s_API

    Note right of Admin: (Optional: Repeat for any problematic node)

    Note over Admin, ControlPlane: Step 2: Describe Specific Node
    Admin->>K8s_API: kubectl describe node controlplane
    activate K8s_API
    K8s_API->>ControlPlane: Get detailed node data
    activate ControlPlane
    ControlPlane-->>K8s_API: Node details (Conditions, etc.)
    deactivate ControlPlane
    K8s_API-->>Admin: Full Node description
    deactivate K8s_API

    Note right of Admin: Check Conditions: MemoryPressure, DiskPressure, PIDPressure, Ready
  1. Use kubectl to retrieve the status of all nodes:

    NODE_TYPE // bash
    kubectl get nodes
    NODE_TYPE // output
    NAME             STATUS   ROLES                  AGE   VERSION
    controlplane     Ready    control-plane,master   30m   v1.27.4
    worker-node-1    Ready    <none>                 28m   v1.27.4
    worker-node-2    Ready    <none>                 28m   v1.27.4
    A node’s STATUS should be Ready. NotReady indicates a problem (often a failed kubelet or lost network connectivity), while SchedulingDisabled means the node has been cordoned and will not accept new pods.
  2. Inspect individual node details for more information:

    NODE_TYPE // bash
    kubectl describe node <node-name>

    Replace <node-name> with the name of the node you want to inspect.

    NODE_TYPE // output
    Name:               controlplane
    Roles:              control-plane,master
    ...
    Conditions:
      Type             Status  LastHeartbeatTime   LastTransitionTime  Reason                       Message
      ----             ------  -----------------   ------------------  ------                       -------
      MemoryPressure     False   ...               ...                  KubeletHasSufficientMemory   Kubelet has sufficient memory available
      DiskPressure       False   ...               ...                  KubeletHasNoDiskPressure     Kubelet has no disk pressure
      PIDPressure        False   ...               ...                  KubeletHasSufficientPID      Kubelet has sufficient PID available
      Ready              True    ...               ...                  KubeletReady                 kubelet is posting ready status
    ...
    Pay close attention to the Conditions section. A Status of True for MemoryPressure, DiskPressure, or PIDPressure indicates resource exhaustion, and a Ready condition that is False or Unknown means the kubelet is unhealthy or unreachable.
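The node check above can be scripted for larger clusters. The helper below is a small illustrative sketch (not a kubectl feature): it assumes the default column layout of kubectl get nodes --no-headers and prints only nodes whose STATUS is not exactly Ready.

```shell
# Sketch: print only nodes whose STATUS column is not exactly "Ready".
# Assumes the default `kubectl get nodes --no-headers` column order.
filter_not_ready() {
  awk '$2 != "Ready" { print $1, $2 }'
}

# Live usage would be: kubectl get nodes --no-headers | filter_not_ready
# Here we feed captured sample output instead:
printf '%s\n' \
  'controlplane    Ready      control-plane,master   30m   v1.27.4' \
  'worker-node-1   NotReady   <none>                 28m   v1.27.4' |
  filter_not_ready
# prints: worker-node-1 NotReady
```

Because a cordoned node reports Ready,SchedulingDisabled as a single STATUS value, the filter flags it too, which is usually what you want during an audit.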

Task 2: Examining Control Plane Components

The control plane consists of several essential components: kube-apiserver, kube-scheduler, kube-controller-manager, and etcd.

sequenceDiagram
    participant Admin as Admin/CLI
    participant K8s_API as kube-apiserver
    participant KubeSystem as kube-system Pods
    participant ETCD as etcd Store

    Note over Admin, KubeSystem: Step 1: Check Control Plane Pod Health
    Admin->>K8s_API: kubectl get pods -n kube-system
    activate K8s_API
    K8s_API-->>Admin: List of pods (etcd, apiserver, scheduler, etc.)
    deactivate K8s_API
    
    Note right of Admin: Verify STATUS is "Running"

    Note over Admin, KubeSystem: Step 2: Inspect Component Logs
    Admin->>K8s_API: kubectl logs -n kube-system <pod-name>
    activate K8s_API
    K8s_API->>KubeSystem: Fetch container logs
    activate KubeSystem
    KubeSystem-->>K8s_API: Log stream
    deactivate KubeSystem
    K8s_API-->>Admin: Display logs (check for Errors/Warnings)
    deactivate K8s_API

    Note over Admin, ETCD: Step 3: Verify Data Store Health
    Admin->>ETCD: etcdctl endpoint health (via certificates)
    activate ETCD
    ETCD-->>Admin: health: healthy (Revision, Member ID)
    deactivate ETCD

    Note right of Admin: If etcd is unhealthy, cluster state is at risk.
  1. Check the status of control plane pods:

    NODE_TYPE // bash
    kubectl get pods -n kube-system
    NODE_TYPE // output
    NAME                                    READY   STATUS    RESTARTS   AGE
    coredns-66bff4672f-7wbqk                1/1     Running   0          30m
    coredns-66bff4672f-z2w7z                1/1     Running   0          30m
    etcd-controlplane                       1/1     Running   0          30m
    kube-apiserver-controlplane             1/1     Running   0          30m
    kube-controller-manager-controlplane    1/1     Running   0          30m
    kube-proxy-jqqv9                        1/1     Running   0          29m
    kube-scheduler-controlplane             1/1     Running   0          30m
    All control plane pods in the kube-system namespace should be in the Running state. A Pending or CrashLoopBackOff status indicates a problem.
  2. View logs for specific control plane components:

    NODE_TYPE // bash
    kubectl logs -n kube-system <pod-name>

    Replace <pod-name> with the name of the pod you want to inspect (e.g., kube-apiserver-controlplane).

    NODE_TYPE // output
    ... (API server logs) ...
    Examine the logs for errors, warnings, or unusual activity. Focus on timestamps around when the problem started.
  3. Check etcd health. This may vary depending on how your cluster is set up. The following is an example using etcdctl:

    NODE_TYPE // bash
    ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
      --cacert=/etc/kubernetes/pki/etcd/ca.crt \
      --cert=/etc/kubernetes/pki/etcd/server.crt \
      --key=/etc/kubernetes/pki/etcd/server.key \
      endpoint health
    NODE_TYPE // output
    https://127.0.0.1:2379 is healthy: successfully committed proposal: took = 1.836668ms
    etcd is the Kubernetes backing store. If etcd is unhealthy, your entire cluster will be affected. Consult the etcd documentation for troubleshooting steps. Common issues include disk space exhaustion and network connectivity problems.
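The same column-filtering pattern from Task 1 works for control plane pods: anything in kube-system that is not Running (or Completed, for one-shot jobs) deserves a look. The helper below is an illustrative sketch, assuming kubectl get pods -n kube-system --no-headers output on stdin.

```shell
# Sketch: flag kube-system pods not in a Running (or Completed) state.
# Assumes `kubectl get pods -n kube-system --no-headers` column order.
unhealthy_pods() {
  awk '$3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Live usage: kubectl get pods -n kube-system --no-headers | unhealthy_pods
printf '%s\n' \
  'etcd-controlplane             1/1   Running            0   30m' \
  'kube-scheduler-controlplane   0/1   CrashLoopBackOff   5   30m' |
  unhealthy_pods
# prints: kube-scheduler-controlplane CrashLoopBackOff
```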

Task 3: Investigating Worker Node Issues

Worker nodes run your application workloads. Issues on worker nodes can prevent pods from running correctly.

sequenceDiagram
    participant Admin as Admin/Local Machine
    participant Node as Worker Node (OS)
    participant Kubelet as Kubelet Service
    participant Runtime as Container Runtime (Docker/Containerd)

    Note over Admin, Node: Step 1: Access Node
    Admin->>Node: ssh <user>@<worker-node-ip>
    activate Node

    Note over Node, Kubelet: Step 2 & 3: Kubelet Health
    Node->>Kubelet: systemctl status kubelet
    Kubelet-->>Node: Status (Active/Inactive)
    Node->>Kubelet: journalctl -u kubelet
    Kubelet-->>Node: Logs (Errors, Image pulls, Networking)

    Note over Node, Runtime: Step 4: Runtime Health
    Node->>Runtime: systemctl status docker/containerd
    Runtime-->>Node: Status (Active/Inactive)
    Node->>Runtime: journalctl -u docker/containerd
    Runtime-->>Node: Runtime Logs

    Note over Admin, Node: Step 5: Resource Audit
    Admin->>Node: top / htop
    Node-->>Admin: CPU/Memory/Load metrics
    deactivate Node

    Note right of Admin: Analyze for resource contention or service failures.
  1. SSH into the worker node exhibiting problems.

    NODE_TYPE // bash
    ssh <user>@<worker-node-ip>

    Replace <user> and <worker-node-ip> with the appropriate values.

  2. Check kubelet status using systemctl:

    NODE_TYPE // bash
    sudo systemctl status kubelet
    NODE_TYPE // output
    ● kubelet.service - kubelet: The Kubernetes Node Agent
      Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
      Active: active (running) since Tue 2023-10-24 14:30:00 UTC; 30min ago
        ...
    If the kubelet is not running, start it with sudo systemctl start kubelet. Check the logs with sudo journalctl -u kubelet to identify the reason for the failure.
  3. Inspect kubelet logs using journalctl:

    NODE_TYPE // bash
    sudo journalctl -u kubelet
    NODE_TYPE // output
    ... (kubelet logs) ...
    Look for errors related to image pulls, networking, or resource limitations.
  4. Check the container runtime (Docker or containerd) status using systemctl:

    NODE_TYPE // bash
    sudo systemctl status docker # or containerd
    NODE_TYPE // output
    ● docker.service - Docker Application Container Engine
      Loaded: loaded (/lib/systemd/system/docker.service; enabled; vendor preset: enabled)
      Active: active (running) since Tue 2023-10-24 14:28:00 UTC; 32min ago
        ...
    If the container runtime is not running, start it with sudo systemctl start docker (or containerd). Check the logs with sudo journalctl -u docker (or containerd) to identify the reason for the failure. Common issues mirror those for the kubelet: image pull problems, disk space exhaustion, and misconfiguration.
  5. Examine resource usage on the node using top or htop:

    NODE_TYPE // bash
    top
    NODE_TYPE // output
    top - 14:59:30 up 30 min,  1 user,  load average: 0.00, 0.01, 0.00
    Tasks: 110 total,   1 running, 109 sleeping,   0 stopped,   0 zombie
    %Cpu(s):  0.3 us,  0.3 sy,  0.0 ni, 99.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
    KiB Mem :  1009504 total,   745560 free,   119756 used,   144188 buff/cache
    KiB Swap:        0 total,        0 free,        0 used.   803532 avail Mem
    High CPU, memory, or I/O usage can indicate resource contention. Identify the processes consuming the most resources and investigate further.
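Kubelet and runtime journals are verbose, so it helps to narrow them to likely error lines first. The filter below is a rough sketch: it matches klog-style error prefixes (an E followed by the four-digit MMDD date) plus the words "error" and "failed", so expect some false positives.

```shell
# Sketch: keep only likely error lines from kubelet/containerd journal output.
# Matches klog "Emmdd" prefixes and the words error/failed (case-insensitive).
kubelet_errors() {
  grep -iE 'E[0-9]{4}|error|failed'
}

# Live usage (on the node): sudo journalctl -u kubelet --no-pager | kubelet_errors
printf '%s\n' \
  'I1024 14:30:01 kubelet started' \
  'E1024 14:31:12 failed to pull image my-app:latest' |
  kubelet_errors
# prints: E1024 14:31:12 failed to pull image my-app:latest
```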

Task 4: Pod-Specific Troubleshooting

If a specific pod is failing, you can troubleshoot it directly.

sequenceDiagram
    participant Admin as Admin/CLI
    participant K8s as Kubernetes Cluster
    participant BB_A as busybox-a Pod
    participant BB_B as busybox-b Pod

    Note over Admin, K8s: Step 1 and 2: Deploy and Apply
    Admin->>K8s: kubectl apply (busybox pods)
    activate K8s
    K8s-->>Admin: pods created
    deactivate K8s

    Note over Admin, K8s: Step 3: Get Pod IPs
    Admin->>K8s: kubectl get pods -o wide
    activate K8s
    K8s-->>Admin: BB_A (10.244.0.5), BB_B (10.244.0.6)
    deactivate K8s

    Note over Admin, BB_B: Step 4: Verify Connectivity
    Admin->>BB_A: kubectl exec (ping 10.244.0.6)
    activate BB_A
    BB_A->>BB_B: ICMP request
    activate BB_B
    BB_B-->>BB_A: ICMP response
    deactivate BB_B
    BB_A-->>Admin: Ping Statistics
    deactivate BB_A
  1. Get the pod’s status:

    NODE_TYPE // bash
    kubectl get pod <pod-name> -n <namespace>

    Replace <pod-name> and <namespace> with the correct values.

    NODE_TYPE // output
    NAME                         READY   STATUS             RESTARTS   AGE
    my-app-pod-7d6c7b964-mkt9q   0/1     ImagePullBackOff   0          5m
    Pay attention to the STATUS and READY columns. ImagePullBackOff indicates a problem pulling the container image. CrashLoopBackOff means the container is crashing repeatedly. 0/1 in the READY column indicates the pod is not ready.
  2. Describe the pod for more detailed information:

    NODE_TYPE // bash
    kubectl describe pod <pod-name> -n <namespace>
    NODE_TYPE // output
    Name:         my-app-pod-7d6c7b964-mkt9q
    Namespace:    default
    ...
    Events:
      Type     Reason     Age                From               Message
      ----     ------     ----               ----               -------
      Normal   Scheduled  6m                 default-scheduler  Successfully assigned default/my-app-pod-7d6c7b964-mkt9q to worker-node-1
      Warning  Failed     1m (x5 over 5m)  kubelet            Failed to pull image "my-app:latest": rpc error: code = NotFound desc = failed to pull and unpack image "docker.io/library/my-app:latest": failed to resolve reference "docker.io/library/my-app:latest": docker.io/library/my-app:latest: not found
      Warning  Failed     1m (x5 over 5m)  kubelet            Error: ErrImagePull
      Normal   BackOff    1m (x5 over 5m)  kubelet            Back-off pulling image "my-app:latest"
      Warning  Failed     1m (x5 over 5m)  kubelet            Error: ImagePullBackOff
    The Events section often contains valuable clues about why a pod is failing.
  3. View the pod’s logs:

    NODE_TYPE // bash
    kubectl logs <pod-name> -n <namespace>
    NODE_TYPE // output
    ... (application logs) ...

    If the pod is crashing frequently, you might need to use the --previous flag to view the logs from the previous container instance:

    NODE_TYPE // bash
    kubectl logs --previous <pod-name> -n <namespace>
  4. Execute a command inside the running container (if possible):

    NODE_TYPE // bash
    kubectl exec -it <pod-name> -n <namespace> -- /bin/bash

    This allows you to inspect the container’s filesystem, network configuration, and running processes. If the image does not include bash, try /bin/sh instead.

    If the container is not running, you can’t use kubectl exec.
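Each of the statuses above suggests a different first move. A tiny dispatch helper (purely illustrative; the hints are this tutorial's suggestions, not kubectl output) makes the mapping explicit:

```shell
# Sketch: map a pod STATUS value to a suggested first troubleshooting step.
triage_hint() {
  case "$1" in
    ImagePullBackOff|ErrImagePull)
      echo "verify the image name/tag and registry credentials" ;;
    CrashLoopBackOff)
      echo "read the previous container logs: kubectl logs --previous" ;;
    Pending)
      echo "describe the pod and check scheduling events" ;;
    *)
      echo "describe the pod and review the Events section" ;;
  esac
}

triage_hint ImagePullBackOff
# prints: verify the image name/tag and registry credentials
```

A live wrapper might feed it the STATUS column, e.g. triage_hint "$(kubectl get pod my-app-pod -o jsonpath='{.status.containerStatuses[0].state.waiting.reason}')".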

Task 5: Network Troubleshooting

Networking issues can prevent pods from communicating with each other or with external services.

sequenceDiagram
    participant Admin as Admin/CLI
    participant PodA as Source Pod
    participant CoreDNS as CoreDNS (kube-dns)
    participant K8s_API as K8s API Server
    participant Target as Target Pod / Service

    Note over Admin, CoreDNS: Step 1: DNS Resolution
    Admin->>PodA: kubectl exec (nslookup SERVICE_NAME)
    activate PodA
    PodA->>CoreDNS: DNS Query
    activate CoreDNS
    CoreDNS-->>PodA: Service IP (ClusterIP)
    deactivate CoreDNS
    PodA-->>Admin: Display Resolved IP
    deactivate PodA

    Note over Admin, Target: Step 2: Connectivity Test
    Admin->>PodA: kubectl exec (curl SERVICE_URL)
    activate PodA
    PodA->>Target: HTTP Request / Ping
    alt Connection Success
        Target-->>PodA: 200 OK / Response
        PodA-->>Admin: Success
    else Connection Failure
        Target--xPodA: Timeout / Connection Refused
        PodA-->>Admin: Error (Check Policies/CNI)
    end
    deactivate PodA

    Note over Admin, K8s_API: Step 3: Verify Service Endpoints
    Admin->>K8s_API: kubectl get endpoints SERVICE_NAME
    activate K8s_API
    K8s_API-->>Admin: List of Pod IPs + Ports
    deactivate K8s_API

    Note right of Admin: If list is empty, check Pod labels and selectors.
  1. Verify DNS resolution:

    From inside a pod, try to resolve a service name or external hostname using nslookup or dig.

    NODE_TYPE // bash
    kubectl exec -it <pod-name> -n <namespace> -- nslookup <service-name>
    NODE_TYPE // output
    Server:         10.96.0.10
    Address:        10.96.0.10#53
    
    Name:   <service-name>.default.svc.cluster.local
    Address: 10.97.141.170
    If DNS resolution fails, check your CoreDNS configuration and ensure that the kube-dns service is running correctly in the kube-system namespace.
  2. Test network connectivity using ping or curl:

    From inside a pod, try to ping another pod or service.

    NODE_TYPE // bash
    kubectl exec -it <pod-name> -n <namespace> -- ping <pod-ip>
    NODE_TYPE // bash
    kubectl exec -it <pod-name> -n <namespace> -- curl <service-url>
    If ping or curl fails, check your network policies, firewall rules, and routing configuration.
  3. Verify service endpoints:

    NODE_TYPE // bash
    kubectl get endpoints <service-name> -n <namespace>
    NODE_TYPE // output
    NAME         ENDPOINTS                           AGE
    my-service   10.244.2.10:8080,10.244.3.15:8080   30m
    If the ENDPOINTS column is empty (shown as <none>), no pods are backing the service. Check that the service’s selector matches your pod labels and that the pods are running and passing their readiness probes.
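Because kubectl prints <none> when a service has no endpoints, scripting this check means treating that value as a failure. A minimal sketch, assuming kubectl get endpoints <service-name> --no-headers output on stdin:

```shell
# Sketch: exit non-zero when a service has no backing endpoints.
# kubectl prints "<none>" in the ENDPOINTS column when the list is empty.
endpoints_ok() {
  awk '$2 == "<none>" || $2 == "" { exit 1 }'
}

# Live usage: kubectl get endpoints my-service --no-headers | endpoints_ok
if printf '%s\n' 'my-service   <none>   30m' | endpoints_ok; then
  echo "service has backing pods"
else
  echo "no endpoints: check the service selector and pod labels"
fi
# prints: no endpoints: check the service selector and pod labels
```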

Conclusion

This tutorial covered essential techniques for troubleshooting Kubernetes control plane and worker nodes. By checking cluster status, examining logs, inspecting pod details, and verifying network connectivity, you can effectively diagnose and resolve common issues in your Kubernetes environment. Remember to consult the official Kubernetes documentation for more in-depth information on specific error messages and troubleshooting scenarios.
