Module 5.3: Control Plane Failures
Complexity: [COMPLEX] - Critical infrastructure troubleshooting
Time to Complete: 50-60 minutes
Prerequisites: Module 5.1 (Methodology), Module 1.1 (Control Plane Deep-Dive)
What You’ll Be Able to Do
After completing this module, you will transition from basic debugging to advanced cluster rescue operations. Specifically, you will be able to:
- Diagnose complex control plane failures by cross-referencing static pod manifests, container runtime events, and kubelet logs.
- Implement restorative actions for critical components including etcd quorum loss, API server certificate expiration, and scheduler crashes.
- Evaluate the systemic impact of specific component failures to rapidly isolate the root cause of cluster-wide freezes.
- Design safe troubleshooting workflows that preserve forensic evidence while returning the cluster to a healthy state.
Why This Module Matters
When the control plane fails, your entire platform teeters on the edge of catastrophe. If the API server is down, you have zero visibility and zero control. If the scheduler is down, auto-scaling is dead in the water. If the controller manager is down, self-healing mechanisms cease to exist. These are the most critical, high-pressure incidents an infrastructure engineer will face, and the mean time to recovery directly impacts business survival.
Consider the highly publicized outage of a major global retailer in May 2018. During a critical promotional event, their primary Kubernetes clusters went entirely dark. No deployments could roll out, no crashed pods were replaced, and the engineering team lost all kubectl access. The root cause? A control plane certificate had expired exactly 365 days after initial cluster bootstrapping. The API server static pods failed to start, causing a complete management plane blackout.
This single failure caused an estimated $12 million in lost transaction revenue over an eight-hour recovery window. If the engineers had known how to manually verify static pod logs using crictl and immediately renew certificates via kubeadm, the downtime could have been restricted to fifteen minutes. Mastery of control plane troubleshooting separates the novice operators from the true Kubernetes experts.
The Air Traffic Control Analogy
The control plane is exactly like air traffic control for your cluster. The API server is the radio tower; if it goes down, absolutely no communication occurs. The scheduler is the flight planner; without it, new flights (pods) cannot take off and remain stranded at the gate. The controller manager is the monitoring system; it ensures all planes follow their assigned routes and handles emergencies. Finally, etcd is the flight record database; if it corrupts, the entire airport forgets what planes exist. When any of these systems fail, you must act decisively.
Did You Know?
- Static Pod Exclusivity: In standard kubeadm deployments, control plane components do not run as Deployments; they run as static pods managed directly by the local kubelet, bypassing the scheduler entirely.
- Certificate Default Lifespans: By default, Kubernetes internal certificates generated by kubeadm expire exactly 365 days after issuance, which is the most common cause of “sudden” control plane deaths on a cluster’s first anniversary.
- Port Allocations: etcd’s ports — 2379 for client API requests and 2380 for peer-to-peer cluster communication — have been the IANA-assigned defaults since 2015.
- Data Capacity Limits: An etcd database defaults to a storage quota of 2 gigabytes; if this limit is breached, etcd raises an alarm and refuses further writes until the keyspace is compacted and defragmented.
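The 365-day expiry above is easy to check with plain openssl. A minimal sketch, using a throwaway self-signed certificate so it runs without a cluster; on a real control plane you would point `-in` at `/etc/kubernetes/pki/apiserver.crt` instead:

```shell
# Generate a demo cert valid for 2 days (stand-in for apiserver.crt)
openssl req -x509 -newkey rsa:2048 -nodes \
  -keyout /tmp/demo.key -out /tmp/demo.crt \
  -days 2 -subj "/CN=kube-apiserver-demo" 2>/dev/null

# Same Validity grep you would run against the real apiserver.crt
openssl x509 -in /tmp/demo.crt -noout -text | grep -A 2 "Validity"

# -checkend N exits 0 if the cert is still valid N seconds from now
openssl x509 -in /tmp/demo.crt -noout -checkend 86400 \
  && echo "OK: more than 1 day of validity left" \
  || echo "WARN: expires within 1 day"
```

Wiring `-checkend 2592000` (30 days) into a cron job gives early warning long before the anniversary outage described later in this module.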
Part 1: Control Plane Architecture Review
To effectively troubleshoot, you must first understand the architectural flow and dependencies.
1.1 Component Dependencies
The control plane is not a monolith; it is a collection of highly interdependent microservices. The API server acts as the sole gateway. No other component speaks directly to the database (etcd).
Here is the dynamic architectural flow:
```mermaid
flowchart TD
    ETCD[(etcd\nStorage)]
    API[API Server\nGateway]
    SCHED[Scheduler]
    CM[Controller\nManager]
    CCM[Cloud\nController]
    KUBECTL[kubectl]
    KUBELET[kubelet]
    CONTROLLERS[External\nControllers]

    API -->|Reads/Writes| ETCD
    KUBECTL -->|REST calls| API
    KUBELET -->|Status updates| API
    CONTROLLERS -->|Reconciliation| API

    SCHED -.->|Watches/Binds| API
    CM -.->|Watches/Updates| API
    CCM -.->|Watches/Updates| API

    classDef critical fill:#f9f,stroke:#333,stroke-width:2px;
    class ETCD,API critical;
```

For reference when viewing legacy terminal documentation, this relationship is often mapped conceptually as follows:

```
CONTROL PLANE DEPENDENCIES

        ┌─────────────┐
        │    etcd     │
        │  (storage)  │
        └──────┬──────┘
               │
               ▼
        ┌─────────────┐
        │ API Server  │ ◄── kubectl
        │  (gateway)  │ ◄── kubelet
        └──────┬──────┘ ◄── controllers
               │
     ┌─────────┼─────────┐
     ▼         ▼         ▼
 Scheduler  Controller   Cloud
             Manager     Controller

If etcd fails           → Everything fails
If API server fails     → Nothing can communicate
If scheduler fails      → New pods won't be scheduled
If controller-mgr fails → Resources won't reconcile
```

1.2 Static Pods Overview
The core components are deployed as static pods. The local kubelet daemon constantly monitors a specific directory on the host machine.
```bash
# Static pod manifest location
/etc/kubernetes/manifests/
├── etcd.yaml
├── kube-apiserver.yaml
├── kube-controller-manager.yaml
└── kube-scheduler.yaml

# kubelet watches this directory
# Changes to these files = automatic restart of component
```

1.3 Baseline Health Verification
Before diving into logs, always check the high-level status of the control plane. While the componentstatuses endpoint is deprecated, it is still occasionally used for rapid triage.
```bash
# Quick health check (deprecated but useful)
k get componentstatuses

# Check control plane pods
k -n kube-system get pods | grep -E 'etcd|api|controller|scheduler'

# Verify all components are running
k -n kube-system get pods -o wide | grep -E 'kube-'
```

Part 2: API Server Troubleshooting
The API server is the heart of the cluster. If it fails, kubectl becomes useless, and you must rely on node-level tools like crictl and journalctl.
2.1 Failure Symptoms
When the API server experiences degradation, the symptoms are immediate and severe.
```
API SERVER FAILURE SYMPTOMS

Symptom                          Indicates
─────────────────────────────────────────────────────
kubectl hangs/times out          API server unreachable
"connection refused"             API server not listening
"unable to connect to server"    Network/firewall issue
"Unauthorized"                   Auth/cert issue
"etcd cluster is unavailable"    API can't reach etcd
Very slow responses              Overloaded or etcd slow
```

Stop and think: If your API server is down, how will you find out why it is down if you cannot run `kubectl logs`? You must SSH into the control plane node and use the container runtime interface.
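The symptom table above can be turned into a quick triage helper. A minimal sketch — the `classify_symptom` function and its categories are illustrative, not a standard tool:

```shell
# Hypothetical triage helper: map a kubectl error message to the
# failure layer suggested by the symptom table.
classify_symptom() {
  case "$1" in
    *"connection refused"*)          echo "API server not listening" ;;
    *"Unauthorized"*)                echo "Auth/cert issue" ;;
    *"etcd cluster is unavailable"*) echo "API cannot reach etcd" ;;
    *"unable to connect"*)           echo "Network/firewall issue" ;;
    *)                               echo "Unknown - check kubelet logs" ;;
  esac
}

classify_symptom "The connection to the server was refused: connection refused"
classify_symptom "Error from server: etcd cluster is unavailable or misconfigured"
```

Pattern order matters: real kubectl errors often contain several of these substrings, so the most specific match should come first.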
2.2 Diagnosing Issues at the Node Level
Step 1: Check if the API server pod is running

```bash
# From a control plane node
crictl ps | grep kube-apiserver

# Or check static pod status
ls -la /etc/kubernetes/manifests/kube-apiserver.yaml
```

If it is not in the running list, check if it recently crashed:

```bash
crictl ps -a | grep kube-apiserver       # See if it exists but stopped
journalctl -u kubelet | grep apiserver   # Check kubelet logs
```

Step 2: Inspecting the logs natively
```bash
# If running as pod
k -n kube-system logs kube-apiserver-<node>

# If pod is down, use crictl
crictl logs $(crictl ps -a | grep apiserver | awk '{print $1}')

# Or check kubelet logs for why it's not starting
journalctl -u kubelet | grep apiserver
```

Sometimes crictl will show you multiple stopped containers. Grab the most recent one (crictl lists the newest first):

```bash
crictl ps -a | grep kube-apiserver | head -1
```

Step 3: Validating Cryptographic Health

Certificate issues are the number one cause of API server failures.
```bash
# Verify certificates
openssl x509 -in /etc/kubernetes/pki/apiserver.crt -text -noout | grep -A 2 "Validity"

# Check if certs are expired
kubeadm certs check-expiration
```

2.3 Remediation Workflows
Here is a mapping of common API server issues and their respective fixes:
| Issue | Symptom | Fix |
|---|---|---|
| Certificate expired | “x509: certificate has expired” | kubeadm certs renew all |
| etcd unreachable | “etcd cluster is unavailable” | Check etcd health, fix etcd |
| Wrong etcd endpoints | Startup failure | Check --etcd-servers in manifest |
| Port conflict | “bind: address already in use” | Find and kill conflicting process |
| Out of memory | OOMKilled, slow responses | Increase node resources |
| Incorrect flags | Won’t start | Check manifest YAML syntax |
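Several fixes in the table come down to reading a flag out of the manifest. A sketch of doing that with grep; a trimmed sample file stands in for the real `/etc/kubernetes/manifests/kube-apiserver.yaml`:

```shell
# Create a trimmed stand-in for the real API server manifest
cat > /tmp/kube-apiserver-sample.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: kube-apiserver
spec:
  containers:
  - name: kube-apiserver
    command:
    - kube-apiserver
    - --etcd-servers=https://127.0.0.1:2379
    - --secure-port=6443
EOF

# Extract the value of --etcd-servers ("--" stops grep from
# treating the pattern as an option)
grep -o -- '--etcd-servers=[^ ]*' /tmp/kube-apiserver-sample.yaml | cut -d= -f2-
# prints: https://127.0.0.1:2379
```

The same one-liner works for `--secure-port`, `--client-ca-file`, or any other flag you need to verify against the table.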
If certificates are the culprit, the fix is straightforward:

```bash
# Check certificate status
kubeadm certs check-expiration

# Renew all certificates
kubeadm certs renew all

# Restart control plane pods:
# kubelet automatically restarts static pods when manifests change
```

If the manifest configuration is damaged, you must edit it directly. The local kubelet will notice the file change and restart the container.
```bash
# Edit static pod manifest
sudo vim /etc/kubernetes/manifests/kube-apiserver.yaml

# Common fixes:
# - Fix typos in flags
# - Correct certificate paths
# - Fix etcd endpoints

# kubelet automatically detects changes and restarts the pod
```

Part 3: Scheduler Troubleshooting
The scheduler’s only job is finding a suitable home for incoming pods. When it breaks, your existing infrastructure hums along perfectly, but scaling up becomes impossible.
3.1 Failure Symptoms
```
SCHEDULER FAILURE SYMPTOMS

Symptom                            Check
──────────────────────────────────────────────────────────
All new pods stuck Pending         Scheduler not running
"no nodes available to schedule"   All nodes unschedulable
Pods not being distributed         Scheduler misconfigured
Very slow scheduling               Scheduler overloaded

Remember: Existing pods keep running when the scheduler fails!
Only NEW pods are affected.
```

3.2 Diagnosing Placement Issues
You can trace scheduling logic by looking at cluster events.
```bash
# Check scheduler pod status
k -n kube-system get pod -l component=kube-scheduler

# Check scheduler logs
k -n kube-system logs kube-scheduler-<node>

# Check for scheduling events
k get events -A --field-selector reason=FailedScheduling

# Describe pending pod for scheduling failure reason
k describe pod <pending-pod> | grep -A 10 Events
```

| Issue | Symptom | Fix |
|---|---|---|
| Scheduler not running | All new pods Pending | Check static pod manifest |
| Can’t connect to API | “connection refused” | Check kubeconfig, certs |
| Leader election failed | Scheduler not active | Check --leader-elect flag |
| No nodes available | Scheduling failures | Check node taints, resources |
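The first row's symptom — a growing pile of Pending pods — is easy to quantify from `get pods` output. A sketch; the captured sample below stands in for a live cluster:

```shell
# Sample `k get pods -A`-style output (stand-in for a live cluster)
cat > /tmp/pods.txt <<'EOF'
NAMESPACE   NAME        READY   STATUS    RESTARTS   AGE
default     web-1       1/1     Running   0          2d
default     web-2       0/1     Pending   0          5m
default     web-3       0/1     Pending   0          5m
kube-system coredns-a   1/1     Running   1          30d
EOF

# Column 4 is STATUS; a rising Pending count with zero
# FailedScheduling events is the classic scheduler-down signature.
awk 'NR > 1 && $4 == "Pending" { n++ } END { print n+0 }' /tmp/pods.txt
# prints: 2
```

Run against live output (`k get pods -A --no-headers | awk '$4 == "Pending"' | wc -l` is the equivalent one-liner) this gives you a single number to watch while you repair the scheduler.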
3.3 Fixing the Scheduler
Usually, scheduler failures stem from a corrupted kubeconfig path or invalid YAML indentation in the manifest.
```bash
# Check manifest exists
cat /etc/kubernetes/manifests/kube-scheduler.yaml

# Check for YAML errors
cat /etc/kubernetes/manifests/kube-scheduler.yaml | python3 -c "import yaml,sys; yaml.safe_load(sys.stdin)"

# Common fixes in manifest:
# --kubeconfig=/etc/kubernetes/scheduler.conf
# --leader-elect=true

# Verify kubeconfig exists
ls -la /etc/kubernetes/scheduler.conf
```

War Story: If you face a severe outage and cannot wait for the scheduler pod to recover, you can schedule a pod manually by setting its spec.nodeName at creation time. This bypasses the scheduler entirely. (Note that nodeName is immutable on an existing pod, so patching a pod that is already Pending will be rejected.)

```bash
# If the scheduler is down, create the pod with nodeName pre-set
k run manual-nginx --image=nginx \
  --overrides='{"apiVersion":"v1","spec":{"nodeName":"worker-1"}}'
```

Part 4: Controller Manager Troubleshooting
The controller manager contains dozens of individual control loops (ReplicaSet, Node, Endpoints). If it dies, the cluster loses its ability to self-heal.
4.1 Failure Symptoms
```
CONTROLLER MANAGER FAILURE SYMPTOMS

Symptom                              Affected Controller
────────────────────────────────────────────────────────────
Pods not created from Deployment     ReplicaSet controller
Deleted pods not replaced            ReplicaSet controller
PVCs stay Pending                    PV controller
Services have no endpoints           Endpoints controller
Nodes stay NotReady forever          Node controller
Jobs don't complete                  Job controller
No automatic cleanup                 GC controller

The cluster "freezes" in its current state - no reconciliation.
```

4.2 Diagnostic Workflow
First, verify whether the pod is crash-looping or throwing fatal authentication errors.
```bash
# Check controller manager pod
k -n kube-system get pod -l component=kube-controller-manager

# Check logs
k -n kube-system logs kube-controller-manager-<node>

# Check for specific controller issues
k -n kube-system logs kube-controller-manager-<node> | grep -i error

# Verify controllers are working:
# create a deployment and verify a ReplicaSet is created
k create deployment test --image=nginx
k get rs | grep test
```

| Issue | Symptom | Fix |
|---|---|---|
| Not running | No reconciliation | Check static pod manifest |
| Service account key missing | Can’t create pods | Check --service-account-private-key-file |
| Can’t connect to API | All controllers fail | Check kubeconfig path |
| Cluster-signing-cert missing | CSR not approved | Check cert paths in manifest |
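Every row in this table breaks the same underlying pattern: a reconciliation loop that closes the gap between desired and actual state. A toy shell version of that pattern — plain numbers stand in for the replica counts the real ReplicaSet controller reads from and writes to the API server:

```shell
# Toy reconciliation loop: the pattern every controller in
# kube-controller-manager runs continuously.
desired=5
actual=2

reconcile() {
  while [ "$actual" -lt "$desired" ]; do
    actual=$((actual + 1))
    echo "created replica $actual"
  done
}

reconcile
echo "actual=$actual desired=$desired"
```

When the controller manager is dead, this loop simply never runs — which is exactly the "phantom replicas" failure: the gap between 5 desired and 2 actual persists forever.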
4.3 Correcting Configurations
The controller manager requires access to multiple certificates to sign tokens and communicate with the API. A simple typo in the volume mounts can cause permanent failure.
```bash
# Check manifest
cat /etc/kubernetes/manifests/kube-controller-manager.yaml

# Key flags to verify:
# --kubeconfig=/etc/kubernetes/controller-manager.conf
# --service-account-private-key-file=/etc/kubernetes/pki/sa.key
# --cluster-signing-cert-file=/etc/kubernetes/pki/ca.crt
# --root-ca-file=/etc/kubernetes/pki/ca.crt

# Verify files exist
ls -la /etc/kubernetes/pki/
```

Part 5: etcd Troubleshooting
If etcd is corrupted, you have no cluster. All state, secrets, and configuration live here.
5.1 Systemic Impact
```mermaid
flowchart TD
    ETCD[etcd DOWN]
    W[No writes]
    R[No reads]
    A[API errors]

    C1[Can't create\nresources]
    C2[Can't list\nresources]
    C3["etcd cluster\nis unavailable"]

    ETCD --> W
    ETCD --> R
    ETCD --> A

    W --> C1
    R --> C2
    A --> C3

    style ETCD fill:#ffcccc,stroke:#ff0000
```

Legacy Terminal View:

```
ETCD FAILURE IMPACT

                etcd DOWN
                    │
      ┌─────────────┼─────────────┐
      ▼             ▼             ▼
  No writes      No reads     API errors
      │             │             │
      ▼             ▼             ▼
 Can't create   Can't list   "etcd cluster
  resources      resources   is unavailable"

Note: Existing pods keep running (kubelet is independent),
but no new changes can be made to the cluster.
```

5.2 Diagnosing Quorum Health
Because etcd requires mutual TLS for its client API, you must pass full cryptographic credentials to its native command-line tool, etcdctl.
```bash
# Check etcd pod status
k -n kube-system get pod -l component=etcd

# Check etcd logs
k -n kube-system logs etcd-<node>

# Check etcd health with etcdctl
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health

# Check etcd member list
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  member list
```

The health command is worth drilling into memory, as it is tested heavily:

```bash
ETCDCTL_API=3 etcdctl endpoint health \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
```

If you have the corresponding environment variables set in your shell profile, this simplifies drastically:

```bash
etcdctl endpoint health
```

| Issue | Symptom | Fix |
|---|---|---|
| Data directory corrupt | Won’t start | Restore from backup |
| Certificate expired | TLS errors | Renew certificates |
| Disk full | Write failures | Free disk space |
| Member not reachable | Cluster unhealthy | Check network, restart member |
| Clock skew | Raft failures | Sync NTP |
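The "Disk full" row is the easiest one to catch before it becomes an outage. A sketch of a threshold check on the filesystem holding etcd's data directory — `/var/lib/etcd` on a real node; `.` is used here so the example runs anywhere:

```shell
# Warn when the filesystem backing a directory crosses a usage
# threshold. df -P gives stable POSIX output ($5 = capacity %).
check_disk() {
  dir=$1 threshold=$2
  used=$(df -P "$dir" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }')
  if [ "$used" -ge "$threshold" ]; then
    echo "WARN: $dir at ${used}% (>= ${threshold}%)"
  else
    echo "OK: $dir at ${used}%"
  fi
}

check_disk . 90
```

On a control plane node you would run `check_disk /var/lib/etcd 80` from cron; etcd's write failures typically start well before the disk is literally at 100%.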
5.3 Backup and Restore Procedures
Taking regular snapshots prevents total data loss.
```bash
ETCDCTL_API=3 etcdctl snapshot save /tmp/etcd-backup.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Verify backup
ETCDCTL_API=3 etcdctl snapshot status /tmp/etcd-backup.db
```

To restore from a snapshot, you must prevent the API server from writing new data during the process.
```bash
# Stop API server first
mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/

# Restore snapshot
ETCDCTL_API=3 etcdctl snapshot restore /tmp/etcd-backup.db \
  --data-dir=/var/lib/etcd-restored

# Update etcd manifest to use the new data dir
# Move API server manifest back
```

Part 6: Static Pod Troubleshooting Deep Dive
To master control plane restoration, you must intimately understand the static pod lifecycle.
6.1 How Static Pods Work
```mermaid
sequenceDiagram
    participant DIR as /etc/kubernetes/manifests
    participant KUB as kubelet
    participant CRI as Container Runtime

    KUB->>DIR: Watches directory
    DIR-->>KUB: File changed/created
    KUB->>CRI: Create Pod from manifest
    DIR-->>KUB: File deleted
    KUB->>CRI: Terminate Pod
```

Legacy Terminal View:

```
STATIC POD LIFECYCLE

/etc/kubernetes/manifests/          kubelet
┌───────────────────────┐     ┌──────────────────┐
│ kube-apiserver.yaml   │◄────│                  │
│ kube-scheduler.yaml   │watch│ Creates pods     │
│ controller-manager... │────▶│ from manifests   │
│ etcd.yaml             │     │                  │
└───────────────────────┘     └──────────────────┘

File changed/created ──▶ pod created/restarted
File deleted         ──▶ pod deleted

Naming: <name>-<node-name> (e.g., kube-apiserver-master)
```

Pause and predict: If you edit a live pod via `kubectl edit pod kube-apiserver-master -n kube-system`, what will happen when the node reboots? The changes will be lost, because the source of truth is the manifest file on disk, not the API database.
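The lifecycle above boils down to change detection on a directory. A toy simulation of that idea — the real kubelet uses filesystem notifications plus a periodic resync, but content hashing shows the same effect:

```shell
# Toy version of the kubelet's manifest watch: hash the manifest
# directory contents and compare before/after an edit.
dir=/tmp/manifests-demo
mkdir -p "$dir"
echo "image: kube-apiserver:v1" > "$dir/kube-apiserver.yaml"

snapshot() { cat "$dir"/*.yaml | md5sum | cut -d' ' -f1; }

before=$(snapshot)
echo "image: kube-apiserver:v2" > "$dir/kube-apiserver.yaml"   # edit the manifest
after=$(snapshot)

if [ "$before" != "$after" ]; then
  echo "manifest changed -> kubelet would restart the static pod"
fi
```

This also explains the "restart trick" used throughout this module: `mv` the manifest out of the directory (content disappears, pod terminated), then `mv` it back (content reappears, pod recreated).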
6.2 Validating the Kubelet Engine
If you place a manifest in the directory and nothing happens, the kubelet might be configured to look elsewhere, or the YAML is broken.
```bash
# Check kubelet is configured to watch the manifests dir
cat /var/lib/kubelet/config.yaml | grep staticPodPath

# Check manifest syntax
cat /etc/kubernetes/manifests/kube-apiserver.yaml | head -20

# Common issues:
# - YAML syntax errors (tabs instead of spaces)
# - Wrong file extension (must be .yaml or .yml)
# - Wrong file permissions (must be readable)
# - Missing required fields
```

6.3 Lower-Level Debugging
When kubectl fails, journalctl is your best friend.
```bash
# If static pod won't start, check kubelet logs
journalctl -u kubelet -f

# Look for errors about a specific manifest
journalctl -u kubelet | grep -i "kube-apiserver\|error\|failed"

# Check if container exists but unhealthy
crictl ps -a | grep kube-

# Get container logs directly
crictl logs <container-id>
```

Common Mistakes
When stress levels are high, engineers frequently make these critical errors:
| Mistake | Problem | Solution |
|---|---|---|
| Editing pods instead of manifests | Changes lost on restart | Edit /etc/kubernetes/manifests/ files |
| Using kubectl when API is down | Commands fail | Use crictl for container management |
| Not checking kubelet logs | Miss root cause | Always check journalctl -u kubelet |
| Forgetting cert dependencies | Components can’t communicate | Verify all cert paths exist |
| Not checking etcd first | Miss storage-level issues | etcd problems affect everything |
| Restarting before diagnosing | Lose evidence | Gather logs first, then restart |
| Assuming API server holds state | Wasting time backing up API pods | Always target etcd for backups; API is stateless |
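The "lose evidence" mistake is worth automating away. A sketch of a hypothetical `collect_evidence` helper that snapshots logs into a timestamped directory BEFORE anything is restarted — on a real node you would feed it `crictl logs` dumps and `journalctl -u kubelet` output; dummy files stand in here:

```shell
# Hypothetical helper: copy log files into /tmp/evidence-<timestamp>
# and print the destination directory last.
collect_evidence() {
  outdir="/tmp/evidence-$(date +%Y%m%d-%H%M%S)"
  mkdir -p "$outdir"
  for f in "$@"; do
    cp "$f" "$outdir/" 2>/dev/null && echo "saved $(basename "$f")"
  done
  echo "$outdir"
}

# Dummy stand-ins for real forensic material
echo "fake apiserver log" > /tmp/apiserver.log
echo "fake kubelet log"   > /tmp/kubelet.log

dest=$(collect_evidence /tmp/apiserver.log /tmp/kubelet.log | tail -1)
ls "$dest"
```

Once the evidence directory exists, you can restart components freely: the old container logs may vanish, but your copies survive the recovery.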
Evaluate your deep understanding of control plane mechanisms with these scenario-based challenges.
Q1: API Silence Scenario
You are paged at 2:00 AM. kubectl get nodes returns a “connection refused” error. You SSH into the master node. What is your very first diagnostic action to isolate the failure layer?
[QUIZ-1] Answer
You must verify if the API server container is actively running using the local container runtime interface. Run `crictl ps | grep kube-apiserver`. If it is missing from the active list, you immediately know the pod has crashed and should proceed to check `journalctl -u kubelet` to find out why the kubelet cannot start the static manifest.

Q2: The Phantom Replicas
A developer complains that they deleted several crashing pods in their namespace, but no new pods are spinning up to replace them. The Deployment resource shows 5 desired replicas, but only 2 currently exist. The API server is fully responsive. Diagnose the failing component.
[QUIZ-2] Answer
The Controller Manager is failing or dead. The API server is responsive (hence you can query the Deployment), but the reconciliation loop responsible for noticing the discrepancy between desired state (5) and actual state (2) is broken. The ReplicaSet controller lives inside the `kube-controller-manager` pod, which needs immediate inspection.

Q3: Permanent Pending State
You successfully deploy a new DaemonSet. You can see the pods created via kubectl get pods, but they are all stuck in a Pending state indefinitely. The cluster has plenty of CPU and memory available. Which control plane component requires investigation?
[QUIZ-3] Answer
The Scheduler is failing or crashed. When a pod is created via the API, it enters the `Pending` state by default. It is the sole responsibility of the `kube-scheduler` to evaluate node resources, assign a `nodeName` to the pod specification, and update the API. If the scheduler is dead, pods remain pending forever, even if resources are abundant.

Q4: Storage Layer Validation
During a major cluster upgrade, the API server begins throwing intermittent “etcd cluster is unavailable” errors. You need to verify the cryptographically secure health of the database layer. Implement the exact command string required.
[QUIZ-4] Answer
You must invoke the etcdctl tool while passing the correct PKI paths for authentication. The command is:

```bash
ETCDCTL_API=3 etcdctl endpoint health \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
```

This bypasses the API server entirely and queries the storage layer directly.

Q5: Anniversary Outage
Exactly one year after bootstrapping a new production cluster, the entire control plane drops offline simultaneously. No configuration changes were made. Diagnose the root cause and identify the remediation command.
[QUIZ-5] Answer
The internal TLS certificates generated by kubeadm have hit their default 365-day expiration limit. All components instantly lose the ability to mutually authenticate, causing systemic collapse. You must run `kubeadm certs renew all` on the control plane nodes, then wait for the kubelet to restart the static pods with the fresh certificates.

Q6: The Restart Fallacy
A junior engineer notices the scheduler pod is failing to elect a leader. They immediately run kubectl delete pod -n kube-system kube-scheduler-master hoping it will restart and fix itself. Evaluate why this action is ineffective and what will actually happen.
[QUIZ-6] Answer
Deleting a static pod via the API server is an illusion. The API server will mark it for deletion, but the local kubelet is the ultimate source of truth because it is watching the physical YAML file in `/etc/kubernetes/manifests/`. The kubelet will immediately recreate the mirror pod from the exact same broken configuration on disk, solving nothing.

Hands-On Exercise: Control Plane Troubleshooting
Scenario
You have been granted access to a sandbox cluster. Your objective is to practice diagnosing and intentionally manipulating control plane components to observe failure modes firsthand.
Prerequisites
This exercise requires a kubeadm-based cluster with SSH access to control plane nodes.
Log in to the primary management node to begin your forensics.
```bash
# Verify you have control plane access
ssh <control-plane-node>
sudo ls /etc/kubernetes/manifests/
```

Task 1: Explore Control Plane Components
Examine the physical files that dictate the control plane’s existence.
```bash
# List all static pod manifests
ls -la /etc/kubernetes/manifests/

# Check current control plane pod status
k -n kube-system get pods | grep -E 'etcd|api|scheduler|controller'

# View API server configuration
cat /etc/kubernetes/manifests/kube-apiserver.yaml | grep -A 5 "command:"
```

Task 2: Validate Cryptographic Health
Check the expiration dates of the core certificates to ensure the cluster isn’t a ticking time bomb.
```bash
# Use kubeadm to check all certificates
sudo kubeadm certs check-expiration

# Manually check a specific certificate
sudo openssl x509 -in /etc/kubernetes/pki/apiserver.crt -text -noout | grep -A 2 Validity
```

Task 3: Check etcd Health
Configure an alias to make interacting with the secure database easier, then interrogate its status.
```bash
# Use etcdctl to check health
# First, set up an alias for convenience
alias etcdctl='ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key'

# Check health
etcdctl endpoint health

# Check member list
etcdctl member list

# Check cluster status
etcdctl endpoint status --write-out=table
```

Task 4: Simulate Scheduler Failure (Careful!)
We will intentionally break the scheduler to observe the exact symptoms when attempting to deploy workloads.
```bash
# First, note a pending pod's behavior
k run test-scheduler --image=nginx
k get pods test-scheduler

# Temporarily rename scheduler manifest (this stops it)
sudo mv /etc/kubernetes/manifests/kube-scheduler.yaml /tmp/

# Wait 30 seconds, try to create another pod
sleep 30
k run test-scheduler-2 --image=nginx

# Check status - should be Pending
k get pods test-scheduler-2
k describe pod test-scheduler-2 | grep -A 5 Events

# Restore scheduler
sudo mv /tmp/kube-scheduler.yaml /etc/kubernetes/manifests/

# Wait for scheduler to restart
sleep 30
k get pods -w
```

Cleanup
Remove the test artifacts to return the environment to a clean state.
```bash
k delete pod test-scheduler test-scheduler-2
```

Success Criteria
- Listed all static pod manifests physically residing on disk.
- Verified etcd health using strict PKI authentication flags.
- Successfully simulated and observed a scheduler outage.
- Verified certificate expiration horizons using kubeadm.
Practice Drills: Rapid Incident Response
When the pager goes off, muscle memory saves time. Use these rapid-fire drills to memorize the exact commands needed to diagnose different layers of the control plane stack.
Drill 1: Control Plane Pod Status (30 sec)
Objective: Quickly isolate which major component is crashing from a high level.

```bash
# Task: Show all control plane pods status
k -n kube-system get pods | grep -E 'etcd|api|scheduler|controller'
```

Drill 2: Check Component Logs (1 min)
Objective: Extract the immediate failure reason from a looping container.

```bash
# Task: View last 50 lines of API server logs
k -n kube-system logs kube-apiserver-<node> --tail=50
```

Drill 3: Static Pod Manifest Check (30 sec)
Objective: Inspect the configuration source-of-truth for typos or bad flags.

```bash
# Task: View scheduler configuration
cat /etc/kubernetes/manifests/kube-scheduler.yaml
```

Drill 4: Deep etcd Health Verification (1 min)
Objective: Bypass the API completely and ensure the storage quorum is intact.

```bash
# Task: Check etcd endpoint health
ETCDCTL_API=3 etcdctl endpoint health \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
```

Drill 5: Preventative Certificate Maintenance (30 sec)
Objective: Audit the cluster for impending cryptographic doom.

```bash
# Task: Check all certificate expiration dates
kubeadm certs check-expiration
```

Drill 6: The Kubelet Engine Logs (1 min)
Objective: Discover why a static pod manifest is being rejected by the local node daemon.

```bash
# Task: Check kubelet logs for control plane errors
journalctl -u kubelet --since "10 minutes ago" | grep -i "error\|failed"
```

Drill 7: Container Runtime Forensics (30 sec)
Objective: Check the actual running processes when kubectl is completely unresponsive.

```bash
# Task: List all control plane containers
crictl ps | grep kube
```

Drill 8: API Server Network Test (30 sec)
Objective: Verify if the API server is rejecting traffic at the network/socket level.

```bash
# Task: Test API server endpoint
curl -k https://localhost:6443/healthz
```

Next Module
Now that you can resurrect a dead control plane, it is time to look at the other half of the cluster architecture. Continue to Module 5.4: Worker Node Failures to learn how to diagnose and resolve massive node evictions, container runtime crashes, and kubelet communication blackouts.