Learn Troubleshooting (CKA) with Interactive Flashcards
Troubleshoot clusters and nodes
Troubleshooting Kubernetes clusters and nodes is a pivotal competency for the CKA exam, focusing on diagnosing why nodes are `NotReady` or why the control plane is unresponsive. The process begins with high-level observation using `kubectl get nodes` to identify the problem scope. If a node is unhealthy, `kubectl describe node <node-name>` is the first step to inspect the 'Conditions' section (checking for MemoryPressure, DiskPressure, or PIDPressure) and the 'Events' stream for recent errors.
At the node level, the Kubelet is the most critical component. Since it runs as a system daemon, troubleshooting requires SSH access to the node. You must verify the service status using `systemctl status kubelet` and analyze logs via `journalctl -u kubelet`. Common Kubelet failure points include misconfigured CNI plugins, expired certificates in `/var/lib/kubelet/pki`, or swap memory being enabled (which must be disabled). If the Kubelet is running but containers fail, check the Container Runtime (e.g., containerd) status and logs.
For Control Plane issues where the API server is unreachable, `kubectl` commands will fail. You must access the master node directly and inspect static pod manifests in `/etc/kubernetes/manifests/`. Since control plane components run as containers, use `crictl ps` and `crictl logs` to diagnose the API Server, Scheduler, or Etcd. Additionally, verify certificate validity using `kubeadm certs check-expiration`.
Effective troubleshooting follows a layered approach: validate the Node status, inspect the Kubelet and Runtime services on the host, verify network/CNI configurations, and ensure certificate validity across the cluster.
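A condensed command sequence for this layered workflow is sketched below; the node name `node01` and the log line counts are placeholders.

```bash
# 1. Scope the problem from a machine where kubectl works
kubectl get nodes
kubectl describe node node01          # inspect Conditions (MemoryPressure, DiskPressure, PIDPressure) and Events

# 2. On the unhealthy node, check the Kubelet service and its logs
sudo systemctl status kubelet
sudo journalctl -u kubelet --no-pager | tail -n 50

# 3. If the Kubelet runs but containers fail, check the container runtime
sudo systemctl status containerd
sudo journalctl -u containerd --no-pager | tail -n 50

# 4. On a control plane node, confirm certificates have not expired
sudo kubeadm certs check-expiration
```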
Troubleshoot cluster components
Troubleshooting cluster components is a critical domain in the Certified Kubernetes Administrator (CKA) curriculum, focusing on diagnosing failures within the Control Plane and Worker Nodes. The workflow involves isolating whether the issue lies with the cluster services (systemd) or the Kubernetes components (Pods/Containers).
**1. Control Plane Failure:**
Most control plane components (API Server, Scheduler, Controller Manager) run as Static Pods. If the API server is unreachable, `kubectl` will not work. You must SSH into the master node and:
- **Check the Kubelet:** The kubelet manages static pods. Use `systemctl status kubelet` and `journalctl -u kubelet` to ensure it is active.
- **Verify Manifests:** Check `/etc/kubernetes/manifests` for YAML syntax errors or misconfigurations (e.g., incorrect command arguments or volume mounts).
- **Container Logs:** Since the API is down, use `crictl ps` and `crictl logs <container-id>` or inspect `/var/log/pods` directly to identify why a component crashed.
- **PKI/Certificates:** Ensure certificates in `/etc/kubernetes/pki` are valid and not expired.
**2. Worker Node Failure:**
Nodes often report a `NotReady` status.
- **Kubelet Status:** This is the primary agent. If it is stopped, the node cannot report to the control plane. Check its configuration file (usually `/var/lib/kubelet/config.yaml`) and restart the service.
- **Container Runtime:** Ensure the runtime (containerd/CRI-O) is running via `systemctl status containerd`.
- **CNI/Networking:** If pods are pending or nodes are NotReady, check if the CNI plugin is correctly installed and initialized.
**3. ETCD:**
A failed ETCD cluster halts all cluster operations. Troubleshoot by checking `journalctl` logs for the etcd service or static pod, verifying data directory permissions, and ensuring leader election is successful.
Success requires mastery of `systemctl`, `journalctl`, `crictl`, and locating standard config paths.
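As a rough sketch of that workflow on a kubeadm control plane node (container IDs and exact file names vary by cluster):

```bash
sudo systemctl status kubelet                    # the kubelet must be healthy to launch static pods
ls /etc/kubernetes/manifests/                    # kube-apiserver.yaml, etcd.yaml, scheduler/controller manifests
sudo crictl ps -a | grep kube-apiserver          # is the component container running, or exited?
sudo crictl logs <container-id>                  # substitute the ID reported by 'crictl ps -a'
sudo ls /var/log/pods/                           # fallback: raw pod logs written to disk
```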
Monitor cluster and application resource usage
In the context of the Certified Kubernetes Administrator (CKA) exam and troubleshooting, monitoring resource usage is fundamental to maintaining cluster health and application performance. Kubernetes does not provide a native storage solution for historical metrics by default; instead, it relies on the Metrics Server, a cluster-wide aggregator of resource usage data. The Metrics Server efficiently collects CPU and memory usage metrics from the Summary API exposed by the Kubelet on each node.
For an administrator, the primary interface for accessing this data is the `kubectl top` command. You use `kubectl top nodes` to analyze the capacity and utilization of the underlying infrastructure, identifying if nodes are overcommitted or under pressure. Similarly, `kubectl top pods` allows you to inspect application-level consumption. This command is versatile, supporting flags to sort results (e.g., `--sort-by=cpu` or `--sort-by=memory`), filter with label selectors, or check specific namespaces, which is vital for quickly identifying 'noisy neighbor' containers that are causing resource contention.
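A few illustrative invocations (the `production` namespace and `app=web` label are placeholders):

```bash
kubectl top nodes                                        # node capacity vs. current utilization
kubectl top pods -A --sort-by=memory                     # heaviest memory consumers across all namespaces
kubectl top pods -n production --sort-by=cpu -l app=web  # scoped to a namespace and label selector
```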
Crucially, this monitoring pipeline is not just for observation; it drives automation. The Horizontal Pod Autoscaler (HPA) queries the Metrics Server API to determine when to scale the number of pod replicas up or down based on defined utilization thresholds. Without a functional Metrics Server, HPA cannot operate.
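For instance, an autoscaler created with `kubectl autoscale` only reports live utilization targets when the Metrics Server is healthy; the deployment name and thresholds below are illustrative.

```bash
# Scale the 'web' Deployment between 2 and 10 replicas, targeting 50% average CPU utilization
kubectl autoscale deployment web --cpu-percent=50 --min=2 --max=10

# TARGETS shows '<unknown>' when the Metrics Server pipeline is broken
kubectl get hpa web
```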
From a troubleshooting perspective, monitoring usage is key to diagnosing issues like `OOMKilled` (Out of Memory) errors, where a container exceeds its memory limit, or CPU throttling, which degrades performance. While the CKA focuses on these immediate, in-memory metrics, production environments typically integrate robust tools like Prometheus and Grafana for long-term data retention, historical trend analysis, and complex alerting.
Manage and evaluate container output streams
In the context of the Certified Kubernetes Administrator (CKA) exam and general troubleshooting, managing and evaluating container output streams is a fundamental skill for diagnosing application issues. Kubernetes captures standard output (stdout) and standard error (stderr) from containerized applications, treating them as logs.
The primary tool for accessing these streams is the `kubectl logs` command. For a simple single-container pod, `kubectl logs <pod-name>` retrieves the current logs. However, troubleshooting often requires more specific flags. If a pod contains multiple containers (e.g., sidecars), you must specify the container name using `-c <container-name>`.
When a pod enters a `CrashLoopBackOff` state, current logs might be empty because the container is dead. Here, the `--previous` (or `-p`) flag is vital to view the logs of the previous instance that failed, revealing the root cause of the crash. To monitor real-time activity, similar to the Linux `tail -f` command, use the `--follow` (or `-f`) flag.
For bulk evaluation, you can view logs from all containers in a pod with `--all-containers=true` or use label selectors with `-l <key>=<value>` to aggregate logs from multiple pods (like a Deployment). Understanding log lifecycle is also crucial. Logs are stored on the node (usually `/var/log/pods`). If a pod is evicted or the node dies, logs are lost unless a cluster-level logging architecture is implemented. A CKA must know how to retrieve these raw streams to debug failing workloads effectively.
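The common invocations look roughly like this; the pod name `mypod`, container `sidecar`, and label `app=web` are placeholders.

```bash
kubectl logs mypod                               # current logs of a single-container pod
kubectl logs mypod -c sidecar                    # a specific container in a multi-container pod
kubectl logs mypod --previous                    # logs from the last crashed instance
kubectl logs mypod -f --tail=100                 # follow in real time, starting from the last 100 lines
kubectl logs mypod --all-containers=true         # every container in the pod
kubectl logs -l app=web --prefix                 # aggregate across pods matching a label, prefixed by source
```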
Troubleshoot services and networking
Troubleshooting Kubernetes services and networking for the CKA exam requires a systematic approach, tracing the packet flow from the source Pod to the destination. Issues generally fall into four categories: Pod Networking, Service Discovery (DNS), Service Configuration, and Network Policies.
1. **Pod Networking (CNI)**: First, ensure Pods are `Running` and have IP addresses (`kubectl get pods -o wide`). If Pods cannot communicate via IP, check the CNI plugin (e.g., Calico, Flannel, Weave). Verify the CNI pods are running in the `kube-system` namespace and inspect their logs for errors.
2. **Service Discovery (DNS)**: If IP communication works but Service names fail, check CoreDNS. Verify the CoreDNS pods are active, then launch a temporary busybox pod to test resolution: `kubectl run test --image=busybox:1.28 --rm -it -- nslookup <service-name>`. If this fails, check `/etc/resolv.conf` inside the Pod.
3. **Service Configuration & Endpoints**: If DNS resolves but the connection times out, check the Service definition. A common error is a mismatch between the Service `selector` and the Pod `labels`. Run `kubectl get endpoints <service-name>` to verify that the Service targets actual Pod IPs. If the endpoints list is empty, the selector is incorrect. Also, ensure `kube-proxy` is running on the nodes, as it manages the iptables/IPVS rules that route Service traffic.
4. **Network Policies**: If configuration is correct but traffic is dropped, check for `NetworkPolicy` objects. By default all traffic is allowed, but once a `NetworkPolicy` selects a Pod for a given direction (Ingress or Egress), traffic in that direction is denied unless a policy explicitly allows it.
Essential tools for this process include `nslookup`, `curl`, `nc` (netcat), `ip a`, and `kubectl describe`.
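Put together, a minimal end-to-end check might look like the following, with `my-service` and the namespaces as placeholders:

```bash
# 1. Pod networking: do Pods have IPs, and are the CNI and kube-proxy pods healthy?
kubectl get pods -o wide
kubectl get pods -n kube-system

# 2. DNS: resolve the Service name from a throwaway pod
kubectl run test --image=busybox:1.28 --rm -it --restart=Never -- nslookup my-service

# 3. Service wiring: does the selector actually match Pod labels?
kubectl describe svc my-service                  # compare Selector, Port and TargetPort
kubectl get endpoints my-service                 # an empty ENDPOINTS column means no matching Pods

# 4. Network policies that might drop the traffic
kubectl get networkpolicy -A
```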
Debugging pods and containers
Debugging pods and containers is a fundamental competency for the Certified Kubernetes Administrator (CKA) exam and real-world cluster maintenance. The troubleshooting workflow typically follows a hierarchy: Status Check, Event Inspection, Log Analysis, and Direct Interaction.
The process begins with `kubectl get pods` to identify the state (e.g., `Pending`, `CrashLoopBackOff`, `ImagePullBackOff`). If a pod is not `Running`, the most critical command is `kubectl describe pod <pod-name>`. The 'Events' section at the bottom of this output reveals scheduling failures, image pull errors, or liveness probe failures.
Next, examine application behavior using `kubectl logs <pod-name>`. If the pod has multiple containers, specify one using `-c <container-name>`. Crucially, if a container is in a restart loop, adding the `--previous` flag retrieves logs from the crashed instance, which is often necessary to find the root cause of startup failures.
For runtime issues, use `kubectl exec -it <pod-name> -- /bin/sh` to enter the container. This allows you to verify environment variables (`env`), check file permissions, and test network connectivity (`curl` or `nslookup`) from within the pod's network namespace.
However, strict security policies or 'distroless' images may lack shells, making `exec` impossible. In these scenarios, CKA candidates must know how to use **Ephemeral Containers**. By running `kubectl debug -it <pod-name> --image=busybox --target=<container-name>`, you attach a temporary container equipped with tools to the running pod. This enables you to inspect processes and filesystems without restarting the pod or modifying the original deployment specification.
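The whole hierarchy condenses to a handful of commands; the pod name `web-0` and container name `app` are placeholders.

```bash
kubectl get pods                                          # spot Pending, CrashLoopBackOff, ImagePullBackOff
kubectl describe pod web-0                                # Events explain scheduling, image and probe failures
kubectl logs web-0 -c app --previous                      # logs from the crashed instance
kubectl exec -it web-0 -- /bin/sh                         # interactive shell, if the image ships one
kubectl debug -it web-0 --image=busybox --target=app      # ephemeral debug container for shell-less images
```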
Analyzing control plane logs
Analyzing control plane logs is a critical competency for the Certified Kubernetes Administrator (CKA) exam, serving as the primary method for diagnosing cluster failures. The control plane consists of four core components: kube-apiserver, kube-scheduler, kube-controller-manager, and etcd. The approach to log analysis depends on the deployment method.
In a standard `kubeadm` cluster, these components run as **Static Pods**. If the API server is responsive, you can check logs via `kubectl logs -n kube-system <pod-name>`. However, troubleshooting usually implies the API server is down. In this scenario, you must SSH into the control plane node and bypass kubectl. Use the container runtime CLI (e.g., `crictl ps` or `docker ps`) to identify the failing container's ID, then view logs using `crictl logs <id>`. You are looking for "CrashLoopBackOff" causes, typically resulting from typos in YAML manifests located in `/etc/kubernetes/manifests`, incorrect command-line arguments, or certificate path mismatches.
If the cluster was set up using binaries (systemd), logs are managed by `journalctl`. You would inspect them using commands like `journalctl -u kube-apiserver -f`.
Specific patterns to look for include:
1. **API Server:** Connection refused errors to etcd, indicating database unavailability.
2. **Etcd:** "Database space exceeded" or high disk latency warnings.
3. **Scheduler:** "Failed to schedule" errors indicating resource starvation or taint/toleration conflicts.
Successfully analyzing these logs allows you to pinpoint the exact configuration error—whether it is a syntax error in a static pod manifest or a networking issue—and apply the necessary fix to restore the control plane to a Running state.
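A hedged sketch of that log hunt on a kubeadm control plane node; the container ID and the grep patterns are examples of what to look for.

```bash
# Static pod setup: locate the failing container and read its logs
sudo crictl ps -a --name kube-apiserver
sudo crictl logs <container-id> 2>&1 | grep -iE 'error|refused|x509'

# The kubelet log explains why a static pod could not be created at all
sudo journalctl -u kubelet --no-pager | grep -i apiserver

# Binary/systemd installation
sudo journalctl -u kube-apiserver -f
```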
Node troubleshooting and kubelet issues
In the context of the CKA exam, troubleshooting a worker node frequently centers on the **Kubelet**, the primary agent running on every node. When a node is marked `NotReady`, your initial step should be `kubectl describe node <node-name>` to identify conditions such as `DiskPressure`, `MemoryPressure`, or network unavailability.
If the issue persists, SSH into the affected node. First, verify the service status using `systemctl status kubelet`. If it is inactive or crashing, examine the logs using `journalctl -u kubelet` to find specific error messages.
Common failure scenarios include:
1. **Certificate Mismatches:** The Kubelet requires valid certificates to authenticate with the API server. Ensure the paths in the `kubeconfig` file (usually `/etc/kubernetes/kubelet.conf`) point to valid, non-expired client certificates and the correct CA.
2. **Configuration Errors:** Check `/var/lib/kubelet/config.yaml`. Misconfigured paths to the CNI (Container Network Interface) binaries or the container runtime endpoint will prevent the Kubelet from starting pods.
3. **Container Runtime Issues:** The Kubelet depends on a runtime (like `containerd` or `CRI-O`). If the runtime service is stopped (`systemctl status containerd`), the Kubelet cannot operate. Furthermore, ensure that the `SystemdCgroup` driver settings match between the Kubelet and the runtime.
4. **Swap Memory:** By default, the Kubelet fails to start if swap is enabled. Ensure `swapoff -a` has been run and that the swap entry is removed or commented out in `/etc/fstab`.
A systematic approach—checking the service status, analyzing logs for cert/config errors, and verifying runtime dependencies—is the standard for resolving node issues.
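On a kubeadm-provisioned node, those checks translate roughly to the commands below; the file paths, including the kubelet client certificate name, assume kubeadm defaults.

```bash
sudo systemctl status kubelet
sudo journalctl -u kubelet --since "10 minutes ago" --no-pager

sudo grep -iE 'cgroupDriver|staticPodPath|containerRuntimeEndpoint' /var/lib/kubelet/config.yaml
sudo openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -dates
sudo systemctl status containerd
swapon --show                                    # any output means swap is still enabled
```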
Resource quota and limit troubleshooting
In the CKA context, troubleshooting resource constraints revolves around distinguishing between Pod-level definitions and Namespace-level policies.
**1. Pod-Level Troubleshooting (Requests vs. Limits)**
* **OOMKilled (Exit Code 137):** This occurs when a container attempts to use more memory than its configured **limit**. To diagnose, run `kubectl describe pod <pod-name>` and check the 'LastState' or 'State' for 'OOMKilled'.
* **Fix:** Increase the memory limit in the manifest or debug the application for memory leaks.
* **CPU Throttling:** If a container hits its CPU limit, Kubernetes throttles it (performance degrades) rather than killing it. Use `kubectl top pod` to monitor usage.
* **Pending State (Insufficient Capacity):** If the cluster nodes do not have enough unallocated CPU/Memory to match a Pod's **requests**, the Pod remains 'Pending'. `kubectl describe pod` will show 'FailedScheduling' with 'Insufficient cpu/memory'.
* **Fix:** Add nodes, scale down other workloads, or reduce the Pod's resource requests.
**2. Namespace-Level Troubleshooting (ResourceQuota)**
ResourceQuotas enforce hard limits on the aggregate resource usage within a Namespace.
* **Symptoms:** You receive a 'Forbidden' error upon creation (e.g., `exceeded quota: compute-quota`), or a ReplicaSet fails to create Pods.
* **Diagnosis:** Inspect the current usage against the hard limits using `kubectl describe resourcequota -n <namespace>`. This displays columns for 'Used' vs. 'Hard' limits for CPU, memory, and object counts (e.g., number of Pods).
* **Fix:** Increase the ResourceQuota limits, delete unused resources in that Namespace to free up capacity, or lower the requests/limits on the specific Pods you are trying to deploy.
**3. LimitRange:** If a Pod is rejected for violating minimum/maximum constraints, check `kubectl describe limitrange` in the namespace.
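A short diagnostic pass covering both levels, assuming a pod named `web-0` and a namespace `team-a`:

```bash
# Pod level: was the container OOMKilled, or is it stuck Pending?
kubectl describe pod web-0 | grep -A 5 'Last State'        # look for Reason: OOMKilled, Exit Code: 137
kubectl get events --field-selector reason=FailedScheduling

# Namespace level: compare 'Used' against 'Hard' limits
kubectl describe resourcequota -n team-a
kubectl describe limitrange -n team-a
```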
kubectl debug and ephemeral containers
In the context of the Certified Kubernetes Administrator (CKA) exam and general troubleshooting, `kubectl debug` is the primary mechanism for inspecting running Pods that utilize minimal, secure images. Because production containers often use 'distroless' images lacking shells or utilities like `curl`, `netstat`, or `ps`, standard `kubectl exec` commands are often impossible to use.
`kubectl debug` solves this by injecting an **ephemeral container** into the running Pod. Ephemeral containers are temporary containers intended for inspection rather than application logic. They are added dynamically to the Pod's API object at runtime. Crucially, because they run within the context of the existing Pod, they share the Pod's network namespace and IPC (Inter-Process Communication) namespace.
For a CKA candidate, knowing how to leverage this is vital for two specific scenarios:
1. **Network Debugging:** Since the ephemeral container shares the network IP and routing table of the Pod, you can launch a container with a tool-rich image (like `busybox` or `nicolaka/netshoot`) to test connectivity or check if local ports are listening.
2. **Filesystem/Process Inspection:** By using the `--target` flag, the ephemeral container shares the process namespace of a specific container. This allows you to see the target's running processes and access its filesystem via `/proc/<pid>/root`, which is essential for diagnosing why an application might be crashing on startup (CrashLoopBackOff).
It is important to note that ephemeral containers cannot be removed from a Pod once added; they stop when the Pod stops or when they complete their task.
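Typical invocations, with the pod `web-0` and container `app` as placeholders (the PID under `/proc` depends on the target process):

```bash
# Ephemeral container sharing the Pod's network namespace (good for curl/nslookup tests)
kubectl debug -it web-0 --image=busybox -- sh

# Share the process namespace of the 'app' container to inspect its processes and files
kubectl debug -it web-0 --image=busybox --target=app
# inside the debug shell:  ps aux;  ls /proc/1/root/

# The injected ephemeral container remains visible in the Pod's spec
kubectl describe pod web-0
```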
etcd troubleshooting and backup
In the Certified Kubernetes Administrator (CKA) exam, etcd is the critical component storing the entire cluster state. Mastery of its maintenance is essential.
**Troubleshooting**
If the API server cannot communicate with etcd, the cluster becomes unresponsive. To troubleshoot:
1. **Check Process/Pod:** If etcd runs as a static pod, ensure the container is running via `crictl ps`. If it is a systemd service, check `systemctl status etcd`.
2. **Verify Health:** Use the command line tool. Ensure `ETCDCTL_API=3` is set. Run `etcdctl endpoint health` and `etcdctl endpoint status --write-out=table`. You must provide TLS flags (`--cacert`, `--cert`, `--key`), typically found in `/etc/kubernetes/pki/etcd`.
3. **Analyze Logs:** Check logs for 'fsync' latency warnings (indicating disk I/O issues) or space quota errors.
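For example, the health check with kubeadm's default endpoint and certificate paths:

```bash
sudo ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health
# the same TLS flags apply to: etcdctl endpoint status --write-out=table
```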
**Backup**
Kubernetes does not back up etcd automatically; you must create snapshots yourself.
Command: `ETCDCTL_API=3 etcdctl snapshot save <path-to-backup> [flags]`.
Use the necessary TLS certificates for authentication. Verify the snapshot integrity afterwards using `etcdctl snapshot status <path-to-backup>`.
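A backup sketch using the same kubeadm default paths; the snapshot location is an example.

```bash
sudo ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save /opt/backup/etcd-snapshot.db

sudo ETCDCTL_API=3 etcdctl snapshot status /opt/backup/etcd-snapshot.db --write-out=table
```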
**Restore**
Restoring is a destructive action for the current state.
1. **Stop Access:** It is often best to stop the static pod or API server to prevent writes.
2. **Restore Command:** Run `etcdctl snapshot restore <path-to-backup> --data-dir <new-directory>`. Do not overwrite the existing data directory; restore to a new path to isolate the restored state.
3. **Update Manifest:** Edit the etcd static pod manifest (usually `/etc/kubernetes/manifests/etcd.yaml`). Change the `hostPath` volume configuration to point to the `<new-directory>` created in the previous step.
4. **Restart:** Kubelet will restart the pod upon manifest modification. Verify the cluster state recovers once the pod is running.
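Sketched end to end, with example paths and kubeadm's default manifest layout:

```bash
# 1. Restore into a NEW data directory
sudo ETCDCTL_API=3 etcdctl snapshot restore /opt/backup/etcd-snapshot.db \
  --data-dir /var/lib/etcd-from-backup

# 2. Point the static pod at the new directory
sudo vi /etc/kubernetes/manifests/etcd.yaml
#    in the etcd-data hostPath volume, change
#    path: /var/lib/etcd   ->   path: /var/lib/etcd-from-backup

# 3. The kubelet recreates the etcd pod; confirm the cluster recovers
sudo crictl ps | grep etcd
kubectl get nodes
```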
Certificate and authentication issues
In the context of the Certified Kubernetes Administrator (CKA) exam and troubleshooting, certificate and authentication issues are primarily rooted in the Public Key Infrastructure (PKI) that secures cluster communication. Kubernetes requires mutual TLS (mTLS) for practically all internal traffic (e.g., Kubelet to API Server, ETCD to API Server), meaning both parties must present valid certificates signed by a trusted Certificate Authority (CA).
Common issues fall into three main categories:
1. **Expiration:** Certificates have a finite lifespan (often one year for clients). If a control plane component fails to start or `kubectl` commands fail unexpectedly, the first step is to verify the expiration date using `openssl x509 -in <path-to-cert> -text -noout`.
2. **Identity Mismatches:** Authentication relies on the Common Name (CN) for user identity and Organization (O) for group membership within the certificate's Subject. If a user is authenticated but denied access (RBAC issues), the certificate often lacks the correct group (e.g., `system:masters`) or user name matching the associated RoleBinding.
3. **Configuration and SANs:** A frequent point of failure is the Subject Alternative Name (SAN). If a client tries to connect to the API server via an IP or DNS name not listed in the server certificate's SAN list, the connection is rejected. Furthermore, `kubeconfig` files must reference the correct CA data; if the client CA does not match the server's signing authority, the TLS handshake will fail.
Troubleshooting involves inspecting service logs (`journalctl -u kubelet` or container logs) for 'x509: certificate signed by unknown authority' or 'certificate has expired' errors and validating certificate paths in manifest files.
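Representative checks, assuming the kubeadm default PKI layout under `/etc/kubernetes/pki`:

```bash
# Expiry and identity (CN/O) of a certificate
sudo openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -dates -subject

# Subject Alternative Names the server certificate actually carries
sudo openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -text | grep -A 1 'Subject Alternative Name'

# kubeadm summary of all managed certificates
sudo kubeadm certs check-expiration

# Typical TLS error strings in the kubelet log
sudo journalctl -u kubelet --no-pager | grep -iE 'x509|certificate'
```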