
Module 5.3: Control Plane Failures

Hands-On Lab Available: K8s Cluster · advanced · 45 min (opens in Killercoda in a new tab)

Complexity: [COMPLEX] - Critical infrastructure troubleshooting

Time to Complete: 50-60 minutes

Prerequisites: Module 5.1 (Methodology), Module 1.1 (Control Plane Deep-Dive)


After completing this module, you will transition from basic debugging to advanced cluster rescue operations. Specifically, you will be able to:

  • Diagnose complex control plane failures by cross-referencing static pod manifests, container runtime events, and kubelet logs.
  • Implement restorative actions for critical components including etcd quorum loss, API server certificate expiration, and scheduler crashes.
  • Evaluate the systemic impact of specific component failures to rapidly isolate the root cause of cluster-wide freezes.
  • Design safe troubleshooting workflows that preserve forensic evidence while returning the cluster to a healthy state.

When the control plane fails, your entire platform teeters on the edge of catastrophe. If the API server is down, you have zero visibility and zero control. If the scheduler is down, auto-scaling is dead in the water. If the controller manager is down, self-healing mechanisms cease to exist. These are the most critical, high-pressure incidents an infrastructure engineer will face, and the mean time to recovery directly impacts business survival.

Consider the highly publicized outage of a major global retailer in May 2018. During a critical promotional event, their primary Kubernetes clusters went entirely dark. No deployments could roll out, no crashed pods were replaced, and the engineering team lost all kubectl access. The root cause? A control plane certificate had expired exactly 365 days after initial cluster bootstrapping. The API server static pods failed to start, causing a complete management plane blackout.

This single failure caused an estimated $12 million in lost transaction revenue over an eight-hour recovery window. If the engineers had known how to manually verify static pod logs using crictl and immediately renew certificates via kubeadm, the downtime could have been restricted to fifteen minutes. Mastery of control plane troubleshooting separates the novice operators from the true Kubernetes experts.

The Air Traffic Control Analogy

The control plane is much like air traffic control for your cluster. The API server is the radio tower; if it goes down, absolutely no communication occurs. The scheduler is the flight planner; without it, new flights (pods) cannot take off and remain stranded at the gate. The controller manager is the monitoring system; it ensures all planes follow their assigned routes and handles emergencies. Finally, etcd is the flight record database; if it corrupts, the entire airport forgets what planes exist. When any of these systems fail, you must act decisively.


  • Static Pod Exclusivity: In standard kubeadm deployments, control plane components do not run as standard deployments; they run as static pods managed directly by the local kubelet, bypassing the scheduler entirely.
  • Certificate Default Lifespans: By default, Kubernetes internal certificates generated by kubeadm are configured to expire in exactly 365 days, which is the most common cause of “sudden” control plane deaths on a cluster’s first anniversary.
  • Port Allocations: etcd serves client API requests on port 2379 and peer-to-peer cluster communication on port 2380.
  • Data Capacity Limits: etcd defaults to a backend storage quota of 2 gigabytes; if the quota is exceeded without compaction and defragmentation, etcd raises a NOSPACE alarm and accepts only reads and deletes until space is reclaimed.
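The 365-day default is easy to demonstrate locally. Below is a hedged sketch, assuming only that openssl is installed: it generates a throwaway self-signed certificate with a one-year lifetime (the file names are illustrative, not kubeadm's) and checks its expiry the same way you would check a real apiserver certificate.

```shell
# Throwaway self-signed cert valid for 365 days (illustrative names, not kubeadm paths)
tmpdir=$(mktemp -d)
openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
  -subj "/CN=kube-apiserver" \
  -keyout "$tmpdir/apiserver.key" -out "$tmpdir/apiserver.crt" 2>/dev/null

# Print the expiry date, exactly as you would for /etc/kubernetes/pki/apiserver.crt
openssl x509 -in "$tmpdir/apiserver.crt" -noout -enddate

# -checkend exits 0 if the cert is still valid N seconds from now (30 days here)
openssl x509 -in "$tmpdir/apiserver.crt" -noout -checkend 2592000 \
  && echo "OK: more than 30 days of validity left" \
  || echo "WARNING: expires within 30 days"
```

Pointing the same two openssl commands at /etc/kubernetes/pki/apiserver.crt makes a quick, cron-able expiry alarm.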

To effectively troubleshoot, you must first understand the architectural flow and dependencies.

The control plane is not a monolith; it is a collection of highly interdependent microservices. The API server acts as the sole gateway. No other component speaks directly to the database (etcd).

Here is the dynamic architectural flow:

flowchart TD
ETCD[(etcd\nStorage)]
API[API Server\nGateway]
SCHED[Scheduler]
CM[Controller\nManager]
CCM[Cloud\nController]
KUBECTL[kubectl]
KUBELET[kubelet]
CONTROLLERS[External\nControllers]
API -->|Reads/Writes| ETCD
KUBECTL -->|REST calls| API
KUBELET -->|Status updates| API
CONTROLLERS -->|Reconciliation| API
SCHED -.->|Watches/Binds| API
CM -.->|Watches/Updates| API
CCM -.->|Watches/Updates| API
classDef critical fill:#f9f,stroke:#333,stroke-width:2px;
class ETCD,API critical;

For reference when viewing legacy terminal documentation, this relationship is often mapped conceptually as follows:

┌──────────────────────────────────────────────────────────────┐
│ CONTROL PLANE DEPENDENCIES │
│ │
│ ┌─────────────┐ │
│ │ etcd │ │
│ │ (storage) │ │
│ └──────┬──────┘ │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ API Server │◄──── kubectl │
│ │ (gateway) │◄──── kubelet │
│ └──────┬──────┘◄──── controllers │
│ │ │
│ ┌──────────────┼──────────────┐ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌───────────┐ ┌───────────┐ ┌───────────┐ │
│ │ Scheduler │ │ Controller│ │ Cloud │ │
│ │ │ │ Manager │ │ Controller│ │
│ └───────────┘ └───────────┘ └───────────┘ │
│ │
│ If etcd fails → Everything fails │
│ If API server → Nothing can communicate │
│ If scheduler → New pods won't be scheduled │
│ If controller-mgr → Resources won't reconcile │
│ │
└──────────────────────────────────────────────────────────────┘

The core components are deployed as static pods. The local kubelet daemon constantly monitors a specific directory on the host machine.

Terminal window
# Static pod manifest location
/etc/kubernetes/manifests/
├── etcd.yaml
├── kube-apiserver.yaml
├── kube-controller-manager.yaml
└── kube-scheduler.yaml
# kubelet watches this directory
# Changes to these files = automatic restart of component

Before diving into logs, always check the high-level status of the control plane. While the componentstatuses endpoint is deprecated, it is still occasionally used for rapid triage.

Terminal window
# Quick health check (deprecated but still handy for rapid triage)
# Note: `k` is used as an alias for kubectl throughout this module
k get componentstatuses
# Check control plane pods
k -n kube-system get pods | grep -E 'etcd|api|controller|scheduler'
# Verify all components are running
k -n kube-system get pods -o wide | grep -E 'kube-'

The API server is the heart of the cluster. If it fails, kubectl becomes useless, and you must rely on node-level tools like crictl and journalctl.

When the API server experiences degradation, the symptoms are immediate and severe.

┌──────────────────────────────────────────────────────────────┐
│ API SERVER FAILURE SYMPTOMS │
│ │
│ Symptom Indicates │
│ ───────────────────────────────────────────────────────── │
│ kubectl hangs/times out API server unreachable │
│ "connection refused" API server not listening │
│ "unable to connect to server" Network/firewall issue │
│ "Unauthorized" Auth/cert issue │
│ "etcd cluster is unavailable" API can't reach etcd │
│ Very slow responses Overloaded or etcd slow │
│ │
└──────────────────────────────────────────────────────────────┘

Stop and think: If your API server is down, how will you find out why it is down if you cannot run kubectl logs? You must SSH into the control plane node and use the container runtime interface.

Step 1: Check if the API server pod is running

Terminal window
# From a control plane node
crictl ps | grep kube-apiserver
# Or check static pod status
ls -la /etc/kubernetes/manifests/kube-apiserver.yaml

If it is not in the running list, check if it recently crashed:

Terminal window
crictl ps -a | grep kube-apiserver # See if it exists but stopped
journalctl -u kubelet | grep apiserver # Check kubelet logs

Step 2: Inspecting the logs natively

Terminal window
# If running as pod
k -n kube-system logs kube-apiserver-<node>
# If pod is down, use crictl
crictl logs $(crictl ps -a | grep apiserver | awk '{print $1}')
# Or check kubelet logs for why it's not starting
journalctl -u kubelet | grep apiserver

Sometimes crictl ps -a shows several stopped containers for the same component. Grab the most recent one (crictl, like docker, lists containers newest first):

Terminal window
crictl ps -a | grep kube-apiserver | head -1

Step 3: Validate cryptographic health

Certificate issues are the number one cause of API server failures.

Terminal window
# Verify certificates
openssl x509 -in /etc/kubernetes/pki/apiserver.crt -text -noout | grep -A 2 "Validity"
# Check if certs are expired
kubeadm certs check-expiration

Here is a mapping of common API server issues and their respective fixes:

Issue                  Symptom                           Fix
Certificate expired    "x509: certificate has expired"   kubeadm certs renew all
etcd unreachable       "etcd cluster is unavailable"     Check etcd health, fix etcd
Wrong etcd endpoints   Startup failure                   Check --etcd-servers in manifest
Port conflict          "bind: address already in use"    Find and kill the conflicting process
Out of memory          OOMKilled, slow responses         Increase node resources
Incorrect flags        Won't start                       Check manifest YAML syntax

If certificates are the culprit, the fix is straightforward:

Terminal window
# Check certificate status
kubeadm certs check-expiration
# Renew all certificates
kubeadm certs renew all
# Restart control plane pods
# kubelet automatically restarts static pods when manifests change

If the manifest configuration is damaged, you must edit it directly. The local kubelet will instantly notice the file hash change and restart the container.

Terminal window
# Edit static pod manifest
sudo vim /etc/kubernetes/manifests/kube-apiserver.yaml
# Common fixes:
# - Fix typos in flags
# - Correct certificate paths
# - Fix etcd endpoints
# kubelet automatically detects changes and restarts the pod

The scheduler’s only job is finding a suitable home for incoming pods. When it breaks, your existing infrastructure hums along perfectly, but scaling up becomes impossible.

┌──────────────────────────────────────────────────────────────┐
│ SCHEDULER FAILURE SYMPTOMS │
│ │
│ Symptom Check │
│ ───────────────────────────────────────────────────────── │
│ All new pods stuck Pending Scheduler not running │
│ "no nodes available to schedule" All nodes unschedulable │
│ Pods not being distributed Scheduler misconfigured │
│ Very slow scheduling Scheduler overloaded │
│ │
│ Remember: Existing pods keep running when scheduler fails! │
│ Only NEW pods are affected. │
│ │
└──────────────────────────────────────────────────────────────┘

You can trace scheduling logic by looking at cluster events.

Terminal window
# Check scheduler pod status
k -n kube-system get pod -l component=kube-scheduler
# Check scheduler logs
k -n kube-system logs kube-scheduler-<node>
# Check for scheduling events
k get events -A --field-selector reason=FailedScheduling
# Describe pending pod for scheduling failure reason
k describe pod <pending-pod> | grep -A 10 Events

Issue                    Symptom                 Fix
Scheduler not running    All new pods Pending    Check static pod manifest
Can't connect to API     "connection refused"    Check kubeconfig, certs
Leader election failed   Scheduler not active    Check --leader-elect flag
No nodes available       Scheduling failures     Check node taints, resources

Usually, scheduler failures stem from a corrupted kubeconfig path or invalid YAML indentation in the manifest.

Terminal window
# Check manifest exists
cat /etc/kubernetes/manifests/kube-scheduler.yaml
# Check for YAML errors
python3 -c "import yaml,sys; yaml.safe_load(sys.stdin)" < /etc/kubernetes/manifests/kube-scheduler.yaml
# Common fixes in manifest:
# --kubeconfig=/etc/kubernetes/scheduler.conf
# --leader-elect=true
# Verify kubeconfig exists
ls -la /etc/kubernetes/scheduler.conf

Emergency workaround: if you face a severe outage and cannot wait for the scheduler pod to recover, you can schedule a pod manually by setting the nodeName field yourself, which bypasses the scheduler entirely. Note that nodeName cannot be patched onto an already-created Pending pod (the API server rejects the update), so set it at creation time, or POST a Binding object to the pod's binding subresource, which is what the scheduler itself does.

Terminal window
# If the scheduler is down, create the pod with a node pre-assigned
k run emergency-pod --image=nginx --overrides='{"spec":{"nodeName":"worker-1"}}'

Part 4: Controller Manager Troubleshooting


The controller manager contains dozens of individual control loops (ReplicaSet, Node, Endpoints). If it dies, the cluster loses its ability to self-heal.

┌──────────────────────────────────────────────────────────────┐
│ CONTROLLER MANAGER FAILURE SYMPTOMS │
│ │
│ Symptom Affected Controller │
│ ───────────────────────────────────────────────────────── │
│ Pods not created from Deployment ReplicaSet controller │
│ Deleted pods not replaced ReplicaSet controller │
│ PVCs stay Pending PV controller │
│ Services have no endpoints Endpoints controller │
│ Nodes stay NotReady forever Node controller │
│ Jobs don't complete Job controller │
│ No automatic cleanup GC controller │
│ │
│ The cluster "freezes" in current state - no reconciliation │
│ │
└──────────────────────────────────────────────────────────────┘

First, verify if the pod is crash-looping or throwing fatal authentication errors.

Terminal window
# Check controller manager pod
k -n kube-system get pod -l component=kube-controller-manager
# Check logs
k -n kube-system logs kube-controller-manager-<node>
# Check for specific controller issues
k -n kube-system logs kube-controller-manager-<node> | grep -i error
# Verify controllers are working
# Create a deployment and verify ReplicaSet is created
k create deployment test --image=nginx
k get rs | grep test

Issue                         Symptom                Fix
Not running                   No reconciliation      Check static pod manifest
Service account key missing   Can't create pods      Check --service-account-private-key-file
Can't connect to API          All controllers fail   Check kubeconfig path
Cluster-signing cert missing  CSRs not approved      Check cert paths in manifest

The controller manager requires access to multiple certificates to sign tokens and communicate with the API. A simple typo in the volume mounts can cause permanent failure.

Terminal window
# Check manifest
cat /etc/kubernetes/manifests/kube-controller-manager.yaml
# Key flags to verify:
# --kubeconfig=/etc/kubernetes/controller-manager.conf
# --service-account-private-key-file=/etc/kubernetes/pki/sa.key
# --cluster-signing-cert-file=/etc/kubernetes/pki/ca.crt
# --root-ca-file=/etc/kubernetes/pki/ca.crt
# Verify files exist
ls -la /etc/kubernetes/pki/
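The "verify files exist" step can be scripted. A hedged sketch, run here against a staged directory with one file deliberately missing (the directory and file list are illustrative, not your real /etc/kubernetes/pki):

```shell
# Staged stand-in for /etc/kubernetes/pki, with ca.key deliberately missing
pki=$(mktemp -d)
touch "$pki/sa.key" "$pki/ca.crt"

missing=0
for f in sa.key ca.crt ca.key; do
  if [ -f "$pki/$f" ]; then
    echo "OK      $pki/$f"
  else
    echo "MISSING $pki/$f"
    missing=$((missing + 1))
  fi
done
echo "$missing file(s) missing"
```

On a real node, point the loop at the paths extracted from the controller-manager manifest instead of a hard-coded list.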

If etcd is corrupted, you have no cluster. All states, secrets, and configurations live here.

flowchart TD
ETCD[etcd DOWN]
W[No writes]
R[No reads]
A[API errors]
C1[Can't create\nresources]
C2[Can't list\nresources]
C3["etcd cluster\nis unavailable"]
ETCD --> W
ETCD --> R
ETCD --> A
W --> C1
R --> C2
A --> C3
style ETCD fill:#ffcccc,stroke:#ff0000

Legacy Terminal View:

┌──────────────────────────────────────────────────────────────┐
│ ETCD FAILURE IMPACT │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ etcd DOWN │ │
│ └────────────────────────┬────────────────────────────┘ │
│ │ │
│ ┌─────────────┼─────────────┐ │
│ ▼ ▼ ▼ │
│ No writes No reads API errors │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ Can't create Can't list "etcd cluster │
│ resources resources is unavailable" │
│ │
│ Note: Existing pods keep running (kubelet is independent) │
│ But no new changes can be made to the cluster │
│ │
└──────────────────────────────────────────────────────────────┘

Because etcd is highly secure, you must pass full cryptographic credentials to use its native command-line tool, etcdctl.

Terminal window
# Check etcd pod status
k -n kube-system get pod -l component=etcd
# Check etcd logs
k -n kube-system logs etcd-<node>
# Check etcd health with etcdctl
ETCDCTL_API=3 etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
endpoint health
# Check etcd member list
ETCDCTL_API=3 etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
member list

This health command is worth drilling into muscle memory, as it is tested heavily:

Terminal window
ETCDCTL_API=3 etcdctl endpoint health \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key

If you have environment variables set up previously in your shell profile, you can simplify this drastically:

Terminal window
etcdctl endpoint health
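That simplification works because etcdctl reads its connection settings from ETCDCTL_* environment variables. A sketch of the exports you might add to a shell profile on a kubeadm control plane node (paths are kubeadm's defaults):

```shell
# etcdctl reads these in place of the equivalent command-line flags
export ETCDCTL_API=3
export ETCDCTL_ENDPOINTS=https://127.0.0.1:2379
export ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt
export ETCDCTL_CERT=/etc/kubernetes/pki/etcd/server.crt
export ETCDCTL_KEY=/etc/kubernetes/pki/etcd/server.key

# With the variables set, the bare command works:
#   etcdctl endpoint health
```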

Issue                    Symptom             Fix
Data directory corrupt   Won't start         Restore from backup
Certificate expired      TLS errors          Renew certificates
Disk full                Write failures      Free disk space
Member not reachable     Cluster unhealthy   Check network, restart member
Clock skew               Raft failures       Sync NTP

Taking snapshots safely prevents total data loss.

Terminal window
ETCDCTL_API=3 etcdctl snapshot save /tmp/etcd-backup.db \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# Verify backup
ETCDCTL_API=3 etcdctl snapshot status /tmp/etcd-backup.db
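Snapshots only help if you keep more than one. A hedged rotation sketch that keeps the three newest backups in a directory; the directory and file names are staged stand-ins for the demo:

```shell
# Stand-in backup directory with five fake snapshots at different mtimes
backups=$(mktemp -d)
for i in 1 2 3 4 5; do
  touch -d "2024-01-0$i 00:00" "$backups/etcd-backup-$i.db"
done

# Keep the three newest snapshots, delete the rest
ls -1t "$backups" | tail -n +4 | while read -r f; do
  rm "$backups/$f"
done

ls -1 "$backups"
```

In production, run the `etcdctl snapshot save` step into a timestamped file name first, then apply the same rotation.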

To restore from a snapshot, you must prevent the API server from writing new data during the process.

Terminal window
# Stop API server first
mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
# Restore snapshot
ETCDCTL_API=3 etcdctl snapshot restore /tmp/etcd-backup.db \
--data-dir=/var/lib/etcd-restored
# Update the etcd manifest's --data-dir flag and hostPath volume to the restored dir
# Move API server manifest back
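The step "update etcd manifest to use new data dir" means pointing both the --data-dir flag and the hostPath volume in etcd.yaml at the restored directory. A hedged sketch on a stand-in manifest fragment (a real etcd.yaml has many more fields):

```shell
# Stand-in fragment of etcd.yaml; a real manifest has many more fields
m=$(mktemp)
cat > "$m" <<'EOF'
    - --data-dir=/var/lib/etcd
  volumes:
  - hostPath:
      path: /var/lib/etcd
EOF

# Point both the flag and the volume at the restored directory
sed -i 's|/var/lib/etcd|/var/lib/etcd-restored|g' "$m"
grep etcd-restored "$m"
```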

Part 6: Static Pod Troubleshooting Deep Dive


To master control plane restoration, you must intimately understand the static pod lifecycle.

sequenceDiagram
participant DIR as /etc/kubernetes/manifests
participant KUB as kubelet
participant CRI as Container Runtime
KUB->>DIR: Watches directory
DIR-->>KUB: File changed/created
KUB->>CRI: Create Pod from manifest
DIR-->>KUB: File deleted
KUB->>CRI: Terminate Pod

Legacy Terminal View:

┌──────────────────────────────────────────────────────────────┐
│ STATIC POD LIFECYCLE │
│ │
│ /etc/kubernetes/manifests/ kubelet │
│ ┌───────────────────────┐ ┌──────────────────┐ │
│ │ kube-apiserver.yaml │◄─ watch ──│ │ │
│ │ kube-scheduler.yaml │ │ Creates pods │ │
│ │ controller-manager... │──────────▶│ from manifests │ │
│ │ etcd.yaml │ │ │ │
│ └───────────────────────┘ └──────────────────┘ │
│ │ │
│ File changed/created ─────────────────────▶│ │
│ File deleted ─────────────────────────────▶│ │
│ ▼ │
│ Pod created/deleted │
│ │
│ Naming: <name>-<node-name> (e.g., kube-apiserver-master) │
│ │
└──────────────────────────────────────────────────────────────┘

Pause and predict: If you edit a live pod via kubectl edit pod kube-apiserver-master -n kube-system, what will happen when the node reboots? The changes will be destroyed because the source of truth is the manifest file on disk, not the API database.

If you place a manifest in the directory and nothing happens, the kubelet might be configured to look elsewhere, or the YAML is broken.

Terminal window
# Check kubelet is configured to watch manifests dir
cat /var/lib/kubelet/config.yaml | grep staticPodPath
# Check manifest syntax
cat /etc/kubernetes/manifests/kube-apiserver.yaml | head -20
# Common issues:
# - YAML syntax errors (tabs instead of spaces)
# - Wrong file extension (must be .yaml or .yml)
# - Wrong file permissions (must be readable)
# - Missing required fields
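Those common issues can be screened mechanically. A hedged sketch that flags tab indentation and wrong file extensions in a manifest directory, staged here with deliberate problems so the checks have something to find:

```shell
# Staged manifest directory with deliberate problems
dir=$(mktemp -d)
printf 'kind: Pod\n' > "$dir/good.yaml"
printf '\tkind: Pod\n' > "$dir/tabbed.yaml"      # tab indentation: invalid YAML
printf 'kind: Pod\n' > "$dir/ignored.yml.bak"    # wrong extension: kubelet skips it

tab=$(printf '\t')
problems=0
for f in "$dir"/*; do
  case "$f" in
    *.yaml|*.yml) ;;                             # kubelet only reads .yaml/.yml
    *) echo "SKIPPED by kubelet (extension): $f"
       problems=$((problems + 1)); continue ;;
  esac
  if grep -q "$tab" "$f"; then
    echo "TAB indentation (invalid YAML): $f"
    problems=$((problems + 1))
  fi
done
echo "$problems potential problem(s)"
```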

When kubectl fails, journalctl is your best friend.

Terminal window
# If static pod won't start, check kubelet logs
journalctl -u kubelet -f
# Look for errors about specific manifest
journalctl -u kubelet | grep -i "kube-apiserver\|error\|failed"
# Check if container exists but unhealthy
crictl ps -a | grep kube-
# Get container logs directly
crictl logs <container-id>

When stress levels are high, engineers frequently make these critical errors:

Mistake                               Problem                           Solution
Editing pods instead of manifests     Changes lost on restart           Edit /etc/kubernetes/manifests/ files
Using kubectl when API is down        Commands fail                     Use crictl for container management
Not checking kubelet logs             Miss the root cause               Always check journalctl -u kubelet
Forgetting cert dependencies          Components can't communicate      Verify all cert paths exist
Not checking etcd first               Miss storage-level issues         etcd problems affect everything
Restarting before diagnosing          Lose evidence                     Gather logs first, then restart
Assuming the API server holds state   Wasted time backing up API pods   Always target etcd for backups; the API server is stateless

Evaluate your deep understanding of control plane mechanisms with these scenario-based challenges.

You are paged at 2:00 AM. kubectl get nodes returns a “connection refused” error. You SSH into the master node. What is your very first diagnostic action to isolate the failure layer?

[QUIZ-1] Answer You must verify if the API server container is actively running using the local container runtime interface. Run `crictl ps | grep kube-apiserver`. If it is missing from the active list, you immediately know the pod has crashed and should proceed to check `journalctl -u kubelet` to find out why the kubelet cannot start the static manifest.

A developer complains that they deleted several crashing pods in their namespace, but no new pods are spinning up to replace them. The Deployment resource shows 5 desired replicas, but only 2 currently exist. The API server is fully responsive. Diagnose the failing component.

[QUIZ-2] Answer The Controller Manager is failing or dead. The API server is responsive (hence you can query the Deployment), but the reconciliation loop responsible for noticing the discrepancy between desired state (5) and actual state (2) is broken. The ReplicaSet controller lives inside the `kube-controller-manager` pod, which needs immediate inspection.

You successfully deploy a new DaemonSet. You can see the pods created via kubectl get pods, but they are all stuck in a Pending state indefinitely. The cluster has plenty of CPU and memory available. Which control plane component requires investigation?

[QUIZ-3] Answer The Scheduler is failing or crashed. When a pod is created via the API, it enters the `Pending` state by default. It is the sole responsibility of the `kube-scheduler` to evaluate node resources, assign a `nodeName` to the pod specification, and update the API. If the scheduler is dead, pods remain pending forever, even if resources are abundant.

During a major cluster upgrade, the API server begins throwing intermittent “etcd cluster is unavailable” errors. You need to verify the cryptographically secure health of the database layer. Implement the exact command string required.

[QUIZ-4] Answer You must invoke the etcdctl tool while passing the correct PKI paths for authentication. The command is:

```bash
ETCDCTL_API=3 etcdctl endpoint health \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
```

This bypasses the API server entirely and queries the storage layer directly.

Exactly one year after bootstrapping a new production cluster, the entire control plane drops offline simultaneously. No configuration changes were made. Diagnose the root cause and identify the remediation command.

[QUIZ-5] Answer The internal TLS certificates generated by kubeadm have hit their default 365-day expiration limit. All components instantly lose the ability to mutually authenticate, causing systemic collapse. You must run `kubeadm certs renew all` on the control plane nodes, then wait for the kubelet to restart the static pods with the fresh certificates.

A junior engineer notices the scheduler pod is failing to elect a leader. They immediately run kubectl delete pod -n kube-system kube-scheduler-master hoping it will restart and fix itself. Evaluate why this action is ineffective and what will actually happen.

[QUIZ-6] Answer Deleting a static pod via the API server is an illusion. The API server will mark it for deletion, but the local kubelet is the ultimate source of truth because it is watching the physical YAML file in `/etc/kubernetes/manifests/`. The kubelet will instantly notice the pod is gone and recreate it using the exact same broken configuration from the disk, solving nothing while erasing the old container logs.

Hands-On Exercise: Control Plane Troubleshooting


You have been granted access to a sandbox cluster. Your objective is to practice diagnosing and intentionally manipulating control plane components to observe failure modes firsthand.

This exercise requires a kubeadm-based cluster with SSH access to control plane nodes.

Log in to the primary management node to begin your forensics.

Terminal window
# Verify you have control plane access
ssh <control-plane-node>
sudo ls /etc/kubernetes/manifests/

Examine the physical files that dictate the control plane’s existence.

Terminal window
# List all static pod manifests
ls -la /etc/kubernetes/manifests/
# Check current control plane pod status
k -n kube-system get pods | grep -E 'etcd|api|scheduler|controller'
# View API server configuration
cat /etc/kubernetes/manifests/kube-apiserver.yaml | grep -A 5 "command:"

Check the expiration dates of the core certificates to ensure the cluster isn’t a ticking time bomb.

Terminal window
# Use kubeadm to check all certificates
sudo kubeadm certs check-expiration
# Manually check a specific certificate
sudo openssl x509 -in /etc/kubernetes/pki/apiserver.crt -text -noout | grep -A 2 Validity

Configure an alias to make interacting with the secure database easier, then interrogate its status.

Terminal window
# Use etcdctl to check health
# First, set up an alias for convenience
alias etcdctl='ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key'
# Check health
etcdctl endpoint health
# Check member list
etcdctl member list
# Check cluster status
etcdctl endpoint status --write-out=table

Task 4: Simulate Scheduler Failure (Careful!)


We will intentionally break the scheduler to observe the exact symptoms when attempting to deploy workloads.

Terminal window
# First, note a pending pod's behavior
k run test-scheduler --image=nginx
k get pods test-scheduler
# Temporarily rename scheduler manifest (this stops it)
sudo mv /etc/kubernetes/manifests/kube-scheduler.yaml /tmp/
# Wait 30 seconds, try to create another pod
sleep 30
k run test-scheduler-2 --image=nginx
# Check status - should be Pending
k get pods test-scheduler-2
k describe pod test-scheduler-2 | grep -A 5 Events
# Restore scheduler
sudo mv /tmp/kube-scheduler.yaml /etc/kubernetes/manifests/
# Wait for scheduler to restart
sleep 30
k get pods -w

Remove the test artifacts to return the environment to a clean state.

Terminal window
k delete pod test-scheduler test-scheduler-2

In this exercise, you:

  • Listed all static pod manifests physically residing on disk.
  • Verified etcd health using strict PKI authentication flags.
  • Successfully simulated and observed a scheduler outage.
  • Verified certificate expiration horizons using kubeadm.

When the pager goes off, muscle memory saves time. Use these rapid-fire drills to memorize the exact commands needed to diagnose different layers of the control plane stack.

Drill 1: Control Plane Pod Status (30 sec)


Objective: Quickly isolate which major component is crashing from a high level.

Terminal window
# Task: Show all control plane pods status
k -n kube-system get pods | grep -E 'etcd|api|scheduler|controller'

Objective: Extract the immediate failure reason from a looping container.

Terminal window
# Task: View last 50 lines of API server logs
k -n kube-system logs kube-apiserver-<node> --tail=50

Drill 3: Static Pod Manifest Check (30 sec)


Objective: Inspect the configuration source-of-truth for typos or bad flags.

Terminal window
# Task: View scheduler configuration
cat /etc/kubernetes/manifests/kube-scheduler.yaml

Drill 4: Deep etcd Health Verification (1 min)


Objective: Bypass the API completely and ensure the storage quorum is intact.

Terminal window
# Task: Check etcd endpoint health
ETCDCTL_API=3 etcdctl endpoint health \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key

Drill 5: Preventative Certificate Maintenance (30 sec)


Objective: Audit the cluster for impending cryptographic doom.

Terminal window
# Task: Check all certificate expiration dates
kubeadm certs check-expiration

Objective: Discover why a static pod manifest is being rejected by the local node daemon.

Terminal window
# Task: Check kubelet logs for control plane errors
journalctl -u kubelet --since "10 minutes ago" | grep -i "error\|failed"

Drill 7: Container Runtime Forensics (30 sec)

Section titled “Drill 7: Container Runtime Forensics (30 sec)”

Objective: Check the actual running processes when kubectl is completely unresponsive.

Terminal window
# Task: List all control plane containers
crictl ps | grep kube

Objective: Verify if the API server is rejecting traffic at the network/socket level.

Terminal window
# Task: Test API server endpoint
curl -k https://localhost:6443/healthz

Now that you can resurrect a dead control plane, it is time to look at the other half of the cluster architecture. Continue to Module 5.4: Worker Node Failures to learn how to diagnose and resolve massive node evictions, container runtime crashes, and kubelet communication blackouts.