Module 1.1: Control Plane Deep-Dive
Module 1.1: Control Plane Deep-Dive
Section titled “Module 1.1: Control Plane Deep-Dive”Complexity:
[MEDIUM]- Conceptual understanding requiredTime to Complete: 35-45 minutes
Prerequisites: Module 0.1 (working cluster)
Learning Outcomes
Section titled “Learning Outcomes”After this module, you will be able to:
- Diagnose control plane failures by connecting symptoms to the API server, etcd, scheduler, controller manager, kubelet, kube-proxy, or container runtime.
- Trace a Kubernetes object request from
kubectlthrough authentication, authorization, admission, storage, scheduling, reconciliation, and node execution. - Compare control plane and worker node responsibilities so you can choose the right logs, health checks, and recovery path during an outage.
- Recover a failed static-pod control plane component by inspecting manifests, reading local node logs, and restoring the correct file.
Why This Module Matters
Section titled “Why This Module Matters”Hypothetical scenario: you are on call for a practice cluster shortly before a CKA-style maintenance window, and developers report that new workloads no longer start. Existing applications still respond, kubectl get nodes sometimes works, a Deployment says it wants more replicas, and every new Pod sits in Pending. Those symptoms sound similar when you are under pressure, but they point to different parts of the system. If the API server is unhealthy, nearly every command becomes unreliable. If etcd is slow or unavailable, state changes stall. If the scheduler is missing, Pods can be created but never assigned. If the controller manager is stopped, higher-level objects can be accepted but never reconciled.
The control plane is the part of Kubernetes that decides what the cluster should be and records those decisions. Worker nodes make containers run, but they do not decide which Deployment should own a ReplicaSet, which node should receive a Pod, or whether a request is allowed. For the CKA exam and for real operational work, this distinction matters because it changes where you investigate first. You do not fix a missing scheduler by deleting a Pod from kube-system, and you do not debug an etcd certificate issue by staring at application container logs.
This module builds a mental map you can use when the cluster is partially broken. You will preserve the original component diagrams, command probes, troubleshooting drills, and static-pod recovery examples, but the prose now connects those assets into a single diagnostic story. By the end, you should be able to look at a symptom, predict which control loop is stuck, and choose the next command with intent rather than cycling through random checks.
Think of Kubernetes operations like a busy airport control room. The API server is the radio desk where every official request arrives, etcd is the durable operations ledger, the scheduler assigns aircraft to available gates, controllers compare planned schedules with reality, and kubelets coordinate the ground crew on each runway. The analogy is imperfect because Kubernetes is software, not aviation, but it helps you remember one crucial rule: the control plane coordinates and records decisions, while worker nodes perform the container work.
Control Plane Map and Request Path
Section titled “Control Plane Map and Request Path”The first mistake learners make is picturing the control plane as one large process. Kubernetes is deliberately split into specialized components so each responsibility can fail, scale, and recover independently. The API server handles the front door and validation path, etcd holds persisted cluster state, the scheduler chooses nodes for unscheduled Pods, and the controller manager runs reconciliation loops that keep desired and observed state moving toward each other. Worker-side components then make those decisions real by starting containers, reporting status, and programming network rules.
The following architecture diagram is the starting map. It shows a kubeadm-style cluster where the major control plane components usually run as static Pods on a control plane node, while kubelet, kube-proxy, and the container runtime run on every node. In highly available production clusters, you may have several API servers, schedulers, and controller managers, with leader election deciding which scheduler and controller manager instance is active. The basic communication pattern remains the same: components talk through the API server, and only the API server persists Kubernetes API objects into etcd.
┌─────────────────────────────────────────────────────────────────────┐│ CONTROL PLANE ││ ┌─────────────┐ ┌─────────────┐ ┌──────────────────────────────┐ ││ │ API Server │ │ etcd │ │ Controller Manager │ ││ │ (kube-api) │◄─┤ (storage) │ │ ┌────────────────────────┐ │ ││ │ │ │ │ │ │ Deployment Controller │ │ ││ └──────┬──────┘ └─────────────┘ │ │ ReplicaSet Controller │ │ ││ │ │ │ Node Controller │ │ ││ │ ┌─────────────────────┐ │ │ Job Controller │ │ ││ │ │ Scheduler │ │ │ ... (40+ controllers) │ │ ││ │ │ (kube-scheduler) │ │ └────────────────────────┘ │ ││ │ └─────────────────────┘ └──────────────────────────────┘ │└─────────┼───────────────────────────────────────────────────────────┘ │ │ kubelet talks to API server ▼┌─────────────────────────────────────────────────────────────────────┐│ WORKER NODES ││ ┌─────────────────────────────────────────────────────────────────┐││ │ Node 1 Node 2 Node 3 │││ │ ┌─────────┐ ┌──────────┐ ┌─────────┐ ┌──────────┐ │││ │ │ kubelet │ │kube-proxy│ │ kubelet │ │kube-proxy│ ... │││ │ └─────────┘ └──────────┘ └─────────┘ └──────────┘ │││ │ ┌──────────────────────┐ ┌──────────────────────┐ │││ │ │ Container Runtime │ │ Container Runtime │ │││ │ │ (containerd) │ │ (containerd) │ │││ │ └──────────────────────┘ └──────────────────────┘ │││ └─────────────────────────────────────────────────────────────────┘│└─────────────────────────────────────────────────────────────────────┘The split between control plane and worker node components is not trivia. It tells you whether a failure prevents new decisions, existing containers, service networking, or all API access. A kubelet problem is local to the node that cannot launch or report Pods, while an API server problem can make the entire cluster look unavailable to clients. A stopped controller manager can leave existing Pods alone but prevent Deployments, Jobs, Nodes, and Endpoints from being reconciled. A scheduler problem usually shows up as Pods that exist in the API but never receive a nodeName.
| Component | Runs On | Purpose |
|---|---|---|
| kube-apiserver | Control plane | API gateway, all communication |
| etcd | Control plane | Cluster state storage |
| kube-scheduler | Control plane | Pod placement decisions |
| kube-controller-manager | Control plane | Reconciliation loops |
| kubelet | Every node | Container lifecycle |
| kube-proxy | Every node | Network rules |
| Container runtime | Every node | Actually runs containers |
When you run a command such as kubectl create deployment nginx --image=nginx --replicas=3, you are not talking to the scheduler or to a node directly. kubectl sends an HTTP request to the API server using credentials from your kubeconfig. The API server authenticates the user, authorizes the action, applies admission logic, validates the object, and stores the accepted desired state. Only after the object exists in the API can controllers and the scheduler react to it.
┌─────────────────────────────────────────────────────────────────┐│ All Roads Lead to API Server ││ ││ kubectl ────────┐ ││ Scheduler ──────┼────► kube-apiserver ◄───► etcd ││ Controllers ────┤ ││ kubelet ────────┤ ││ Dashboard ──────┘ ││ │└─────────────────────────────────────────────────────────────────┘The API server being the only Kubernetes component that directly stores API objects in etcd is a protective design choice. It centralizes authentication, authorization, validation, admission, version conversion, and audit behavior instead of allowing every component to write raw state. That design also creates a clear troubleshooting rule: if kubectl cannot reach the API server, normal cluster introspection is unavailable, so you must move to the control plane node and use local tools such as journalctl, crictl, or static manifest inspection.
1. kubectl → API Server: "Create this pod please"2. API Server: Authentication check ✓3. API Server: Authorization check ✓4. API Server: Admission controllers run5. API Server: Validation check ✓6. API Server → etcd: "Store this pod spec"7. API Server → kubectl: "Pod created (pending)"At this point the Pod may exist only as desired state. Nothing in the sequence above guarantees that a container is already running. The scheduler still has to bind the Pod to a node, the kubelet on that node still has to observe the assignment, and the container runtime still has to pull the image and create the container. This gap between “accepted by the API” and “running on a node” is where many control plane troubleshooting scenarios live.
Pause and predict: If the API server is unavailable but a worker node already has healthy containers running, which activities can continue and which activities stop?
Pause and predict: if the API server is unavailable but a worker node already has healthy containers running, which activities can continue and which activities stop? Existing containers can keep running because kubelet and the runtime already have local work, but new API decisions, status updates, scheduling, and controller reconciliation are blocked or delayed. That prediction gives you a quick way to separate “the application process died” from “the cluster control loops cannot make progress.”
Kubernetes 1.35+ still follows this separation of concerns. Component flags, health endpoints, and exact manifest details evolve across releases, but the model remains stable enough for diagnostic reasoning. On a CKA-style kubeadm cluster, the control plane components are commonly defined as files under /etc/kubernetes/manifests/. The kubelet watches that directory and keeps the corresponding static Pods alive, which is why file recovery is often more important than kubectl delete pod during control plane repair.
API Server and etcd: Front Door and Source of Truth
Section titled “API Server and etcd: Front Door and Source of Truth”The API server is both a gateway and a policy enforcement point. It receives requests from humans, controllers, schedulers, kubelets, dashboards, operators, and automation. It does not run application containers, but it decides whether an object is accepted into the cluster record. If you are investigating a symptom that affects all clients, such as widespread kubectl timeouts, admission failures, or TLS errors, the API server is usually the first control plane component to check.
# Is the API server responding?kubectl cluster-info
# Check API server component status (legacy)kubectl get componentstatuses # Deprecated, may not work on all clusters
# Modern health endpoints (preferred)kubectl get --raw='/readyz?verbose'kubectl get --raw='/livez?verbose'
# Direct health endpointkubectl get --raw='/healthz'
# Detailed healthkubectl get --raw='/healthz?verbose'The livez and readyz endpoints are more useful than the old component status command because they expose API server health checks through a supported endpoint model. Liveness tells you whether the process should be restarted, while readiness tells you whether it is prepared to serve normal traffic. A readiness failure can be caused by dependencies such as etcd connectivity, post-start hooks, or other internal checks. The important habit is to read the verbose output instead of treating health as a single green or red light.
API server logs may be available through kubectl logs when the API path is healthy enough to read the mirror Pod. During a serious failure, that assumption may be false, so you need the local path too. In kubeadm clusters, the static Pod manifest records the image, command flags, certificates, etcd endpoints, admission configuration, and bind settings. A small edit in that manifest can break the front door for the entire cluster, so inspect before changing and keep a copy when practicing.
# If running as a static pod (kubeadm setup)kubectl logs -n kube-system kube-apiserver-<control-plane-node>
# If running as systemd servicejournalctl -u kube-apiserver
# Static pod manifest locationcat /etc/kubernetes/manifests/kube-apiserver.yamletcd is the durable memory behind that front door. Kubernetes stores API objects under keys that reflect resource type, namespace, and object name, though you normally interact through the Kubernetes API rather than through raw etcd reads. Losing etcd data is not like losing a cache; it means losing the cluster record. That is why backups, disk monitoring, quorum awareness, and certificate health are operational topics, not optional extras for large teams only.
Key format: /registry/<resource-type>/<namespace>/<name>
Examples:/registry/pods/default/nginx/registry/services/kube-system/kube-dns/registry/secrets/default/my-secret/registry/deployments/production/web-appIn a single-control-plane training cluster, etcd may be a single static Pod. In production, etcd is commonly deployed as a small odd-sized cluster because its Raft consensus model requires a majority of members to agree. A three-member etcd cluster can tolerate one member failure; a five-member cluster can tolerate two. Adding members does not simply make every operation faster because consensus still requires communication, disk writes, and quorum. The design goal is durable agreement, not raw key-value speed at any cost.
┌─────────────────────────────────────────────────────────────────┐│ etcd Cluster (Raft Consensus) ││ ││ ┌──────────┐ ┌──────────┐ ┌──────────┐ ││ │ etcd-1 │◄────►│ etcd-2 │◄────►│ etcd-3 │ ││ │ (Leader) │ │(Follower)│ │(Follower)│ ││ └──────────┘ └──────────┘ └──────────┘ ││ ││ Writes go to leader, replicated to followers ││ Reads can go to any node ││ Survives loss of 1 node (quorum = 2/3) ││ │└─────────────────────────────────────────────────────────────────┘Checking etcd health often requires local certificates because the etcd API is protected. The following command uses the kubeadm-style certificate paths that are common in exam labs. If it fails, distinguish between transport problems, certificate problems, member health problems, and command configuration mistakes. A wrong certificate path can look like an etcd outage if you read only the last line of the error.
# etcd member list (if you have etcdctl configured)ETCDCTL_API=3 etcdctl member list \ --endpoints=https://127.0.0.1:2379 \ --cacert=/etc/kubernetes/pki/etcd/ca.crt \ --cert=/etc/kubernetes/pki/etcd/server.crt \ --key=/etc/kubernetes/pki/etcd/server.key
# Check etcd podkubectl get pods -n kube-system | grep etcdkubectl logs -n kube-system etcd-<control-plane-node>Hypothetical scenario: kubectl get pods -A works slowly, kubectl apply often times out, and API server logs show repeated storage latency warnings. You should suspect the API server is waiting on etcd rather than blaming the scheduler, because scheduling is only one later consumer of accepted state. In that case, check etcd member health, disk pressure, certificate expiration, and whether compaction or defragmentation has been neglected in a long-running cluster. The operational lesson is that every control loop depends on the shared state store being responsive.
Before running: What output do you expect from
kubectl get --raw='/readyz?verbose'on a healthy kubeadm control plane?
Before running this, what output do you expect from kubectl get --raw='/readyz?verbose' on a healthy kubeadm control plane? Predict the shape of the output before you execute it: you should expect named checks with ok results, not just a single success string. If a check fails, copy the exact failing check name into your notes, because it usually points more precisely than the final HTTP status.
Scheduler, Controllers, and Node Agents
Section titled “Scheduler, Controllers, and Node Agents”After the API server accepts a Pod object, the scheduler watches for Pods that do not yet have .spec.nodeName. Its job is not to run the Pod; its job is to choose a feasible node and bind the Pod by updating the API. The scheduler first filters out nodes that cannot run the Pod, then scores the remaining candidates. That two-stage model matters because a Pod stuck in Pending may have no feasible node at all, or it may have feasible nodes but be waiting behind other scheduling work.
┌────────────────────────────────────────────────────────────────┐│ Scheduling Process ││ ││ 1. New pod created (no nodeName) ─────────────────────┐ ││ │ ││ 2. Scheduler watches API server ▼ ││ "Any pods need scheduling?" ◄────────────────── Pod Queue ││ ││ 3. Filtering: Which nodes CAN run this pod? ││ - Enough CPU/memory? ││ - Taints/tolerations match? ││ - Node selectors match? ││ - Affinity rules satisfied? ││ ││ 4. Scoring: Which node is BEST? ││ - Resource balance ││ - Spreading across zones ││ - Custom priorities ││ ││ 5. Binding: Assign pod to winning node ││ Scheduler → API Server: "pod X goes to node Y" ││ │└────────────────────────────────────────────────────────────────┘Filtering is about hard constraints. If a Pod requests more CPU than any node can offer, requires a label no node has, lacks a toleration for a taint, or depends on a storage claim that is not bound, the scheduler cannot create a good answer by being clever. Scoring is different: it chooses among nodes that passed the hard tests, often preferring better resource balance, topology spread, or other configured priorities. In practice, the Events section of kubectl describe pod is your best first clue because it records why no node was accepted.
# Pod stuck in Pendingkubectl describe pod <pod-name>
# Look for Events section:# "0/3 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/control-plane: },# 2 node(s) didn't match Pod's node affinity/selector"Pause and predict: a Pod requests 4 CPU cores and 8Gi of memory, but no single node currently has that much allocatable capacity available. The Pod should remain Pending, and kubectl describe pod should show scheduling events that mention insufficient resources or no available nodes meeting the request. That prediction is useful because it keeps you from restarting the scheduler when the scheduler is actually doing the correct thing.
# Scheduler podkubectl get pods -n kube-system | grep schedulerkubectl logs -n kube-system kube-scheduler-<control-plane-node>
# Scheduler leader election (in HA setups)kubectl get endpoints kube-scheduler -n kube-system -o yamlControllers solve a different problem. A controller is a reconciliation loop: it watches desired state and observed state, compares them, and takes action when they differ. The kube-controller-manager is a single binary that runs many built-in controllers, including Deployment, ReplicaSet, Node, Job, Endpoint, ServiceAccount, and Namespace controllers. The design is powerful because each controller can focus on one kind of convergence rather than placing all cluster behavior inside the API server.
┌────────────────────────────────────────────────────────────────┐│ Controller Loop Pattern ││ ││ ┌─────────────────┐ ││ │ Desired State │ ││ │ (in etcd) │ ││ └────────┬────────┘ ││ │ ││ Compare │ ││ ▼ ││ Is current state = desired state? ││ │ ││ ┌──────────────┴──────────────┐ ││ │ YES NO │ ││ ▼ ▼ ││ Do nothing Take action ││ (wait & watch) (create/delete/update) ││ │└────────────────────────────────────────────────────────────────┘| Controller | Watches | Does |
|---|---|---|
| Deployment | Deployments | Creates/updates ReplicaSets |
| ReplicaSet | ReplicaSets | Ensures correct pod count |
| Node | Nodes | Monitors node health, evicts pods from dead nodes |
| Job | Jobs | Creates pods, tracks completion |
| Endpoint | Services, Pods | Updates Service endpoints |
| ServiceAccount | Namespaces | Creates default ServiceAccount |
| Namespace | Namespaces | Cleans up resources when namespace deleted |
The ReplicaSet example is the cleanest way to understand reconciliation. You create a desired replica count, and the ReplicaSet controller observes that zero matching Pods exist. It then creates Pods through the API server. If a Pod later disappears, the controller does not mourn the specific object; it sees that the count is below desired state and creates another matching Pod. This is why labels and selectors are central to Kubernetes behavior.
# You create this:apiVersion: apps/v1kind: ReplicaSetmetadata: name: webspec: replicas: 3 selector: matchLabels: app: web template: metadata: labels: app: web spec: containers: - name: nginx image: nginxController loop:1. Watch: "ReplicaSet 'web' wants 3 pods"2. Check: "How many pods with label 'app=web' exist?"3. Compare: "0 exist, 3 desired"4. Act: "Create 3 pods"5. Repeat forever...
Later:- Pod dies → Controller sees 2 pods → Creates 1 more- You scale to 5 → Controller sees 3 pods → Creates 2 more- You scale to 2 → Controller sees 5 pods → Deletes 3What would happen if the controller manager crashes while the API server, scheduler, and etcd are still running? You can still create a bare Pod because the API server can accept it and the scheduler can assign it, but Deployments, ReplicaSets, Jobs, Endpoints, and node lifecycle reactions stop converging. Existing containers can keep serving because kubelet owns their local lifecycle, yet higher-level Kubernetes intent becomes frozen until controllers return and catch up.
# Controller manager podkubectl get pods -n kube-system | grep controller-managerkubectl logs -n kube-system kube-controller-manager-<control-plane-node>
# Check for specific controller issues in logskubectl logs -n kube-system kube-controller-manager-<node> | grep -i "error\|failed"Node agents complete the path from desired state to running containers. The kubelet registers the node, watches for Pods assigned to it, asks the container runtime to create containers, reports status, and runs health probes. kube-proxy watches Services and EndpointSlices, then programs node-level network rules so Service virtual IPs can reach backing Pods. The container runtime, usually containerd in kubeadm-style clusters, performs the actual image and container operations through the Container Runtime Interface.
# Check kubelet statussystemctl status kubelet
# kubelet logsjournalctl -u kubelet -f
# kubelet configurationcat /var/lib/kubelet/config.yaml# Check kube-proxykubectl get pods -n kube-system | grep kube-proxykubectl logs -n kube-system kube-proxy-<id>
# See iptables rules kube-proxy creatediptables -t nat -L KUBE-SERVICES# Check containerdsystemctl status containerdcrictl ps # List running containerscrictl images # List imagesWhich approach would you choose here and why: if Pods are Running but Services cannot reach them, would you start with the scheduler or with node networking state? Start with Service, EndpointSlice, kube-proxy, and node firewall or dataplane checks because scheduling has already completed. This kind of component-to-symptom mapping is the difference between fast diagnosis and noisy command collection.
End-to-End Deployment Walkthrough and Recovery
Section titled “End-to-End Deployment Walkthrough and Recovery”The following command looks simple, but it causes several independent control loops to cooperate. Read the timeline slowly and notice how each step depends on the previous state becoming visible through the API. The Deployment controller cannot create a ReplicaSet until the Deployment exists. The ReplicaSet controller cannot create Pods until the ReplicaSet exists. The scheduler cannot bind Pods until Pods exist without nodeName. Kubelet cannot start containers until it sees Pods assigned to its node.
kubectl create deployment nginx --image=nginx --replicas=3┌─────────────────────────────────────────────────────────────────┐│ Timeline of Events │├─────────────────────────────────────────────────────────────────┤│ ││ 0ms kubectl → API Server: "Create Deployment nginx" ││ 5ms API Server → etcd: Store Deployment ││ ││ 10ms Deployment Controller sees new Deployment ││ 15ms Deployment Controller → API: "Create ReplicaSet" ││ 20ms API Server → etcd: Store ReplicaSet ││ ││ 25ms ReplicaSet Controller sees new ReplicaSet ││ 30ms ReplicaSet Controller → API: "Create Pod 1, 2, 3" ││ 35ms API Server → etcd: Store 3 Pods (Pending) ││ ││ 40ms Scheduler sees 3 unscheduled Pods ││ 50ms Scheduler → API: "Pod 1→node1, Pod 2→node2, Pod 3→node1" ││ 55ms API Server → etcd: Update Pods with nodeName ││ ││ 60ms kubelet on node1 sees 2 Pods assigned to it ││ 65ms kubelet on node2 sees 1 Pod assigned to it ││ 70ms kubelets → containerd: "Start nginx containers" ││ ││ 500ms Containers running ││ 505ms kubelets → API: "Pods are Running" ││ 510ms API Server → etcd: Update Pod status ││ ││ Done! kubectl get pods shows 3/3 Running ││ │└─────────────────────────────────────────────────────────────────┘This timeline is also a troubleshooting checklist. If no Deployment appears, focus on the client request, API server, admission, authorization, and etcd write path. If the Deployment exists but no ReplicaSet appears, suspect the Deployment controller inside the controller manager. If ReplicaSets exist but no Pods appear, suspect the ReplicaSet controller or selector issues. If Pods exist but remain unscheduled, inspect scheduler status and Pod events. If Pods are assigned but containers do not start, move to kubelet, image pull, runtime, and node conditions.
Static Pods add one important recovery rule. In kubeadm clusters, the API server, scheduler, controller manager, and etcd are often specified as files in /etc/kubernetes/manifests/, and kubelet creates mirror Pods so you can see them through the API. Deleting a mirror Pod through kubectl does not repair a missing manifest because kubelet is the real manager. If a manifest file disappears, the durable fix is to restore the file with correct content and permissions, then let kubelet recreate the static Pod.
Hypothetical scenario: a learner accidentally moves /etc/kubernetes/manifests/kube-scheduler.yaml during a lab, then tries to restart scheduling by deleting the scheduler Pod from kube-system. The visible Pod may vanish, but the root cause remains because the kubelet no longer has a manifest to run. The right repair is to restore kube-scheduler.yaml, watch kubelet recreate the static Pod, then verify that Pending Pods receive node assignments. This is a common exam-shaped pattern because it tests component ownership rather than command memorization.
The same reasoning applies to local logs. When the API server is working, kubectl logs -n kube-system ... is convenient. When the API server is down, it is unavailable by definition, so local node tools become mandatory. You may need to inspect /etc/kubernetes/manifests/kube-apiserver.yaml, read journalctl -u kubelet, or use crictl ps and crictl logs to inspect static Pod containers. The control plane is not magical; it is a set of processes that kubelet and the runtime still manage locally.
Reading Control Plane Symptoms Like a Chain
Section titled “Reading Control Plane Symptoms Like a Chain”A reliable diagnosis usually begins by locating the first broken link in the object lifecycle. Kubernetes exposes many views of the same event, and beginners often collect all of them without deciding which transition they are testing. A Deployment that exists proves the client request, authentication, authorization, admission, and first etcd write succeeded. A missing ReplicaSet after that points to Deployment reconciliation. A ReplicaSet with no Pods points to ReplicaSet reconciliation. Pods with no assigned node point to scheduling or impossible constraints. Assigned Pods that fail to start point to kubelet, image pull, runtime, storage mount, or node pressure.
This chain-based approach is faster than memorizing component names because it follows evidence. Imagine a learner says, “My Deployment failed.” That statement is too broad to debug. Ask which object exists and which object should have appeared next. If the Deployment object is absent, investigate the API request. If the Deployment exists and has status conditions, read those conditions before moving lower. If the ReplicaSet exists but has zero desired Pods, check selectors and Deployment rollout state. Each answer narrows the component set without guessing.
The same method works in reverse when existing workloads keep running during control plane trouble. If containers are still serving traffic, kubelet and the runtime have already done enough local work to keep them alive. If the API server is down, those containers do not instantly die because Kubernetes is not a remote process supervisor that must constantly stream permission to every container. However, status reporting, new scheduling, scaling, rollout, and many self-healing actions will be impaired. That distinction explains why partial outages can look calm for a few minutes before hidden control loops matter.
There is also a difference between “not happening yet” and “cannot happen.” Controllers and schedulers watch the API and work asynchronously, so a small delay after creating an object can be normal. A persistent delay with clear Events is different. If Events say no nodes match an affinity rule, waiting longer does not create a matching node. If controller logs show failed leader election, waiting might resolve after lease changes, but you still need to understand why leadership is unstable. Your job is to separate normal eventual consistency from a blocked precondition.
Worked example: suppose you run kubectl create deployment api --image=nginx --replicas=3, and the command returns success. One minute later, kubectl get deployment api shows the Deployment, but kubectl get rs shows no ReplicaSet. The scheduler has not had a chance to matter because there are no Pods to schedule. The kubelet has not had a chance to matter because there are no assigned Pods. The component responsible for creating the next object is the Deployment controller inside the controller manager, so that is where you look next.
If the ReplicaSet exists and three Pods exist, but all three Pods are Pending, the diagnostic owner changes. The Deployment and ReplicaSet controllers did their jobs well enough to create Pods. Now inspect Pod Events and scheduler logs. The answer might be a failed scheduler, but it might also be a correct scheduler refusing an impossible request. A Pod that asks for a label no node has should stay Pending until the label changes, the Pod spec changes, or a matching node joins. Treat the scheduler as a decision engine, not as a magic placement force.
If the Pods have nodeName set and remain in ContainerCreating, the scheduler is done. Look at kubelet Events, image pull status, volume attachment, CNI setup, and runtime logs on the assigned node. A missing image pull secret, a broken container runtime socket, or a CNI configuration error can all appear after scheduling succeeds. The control plane still records status, but the next useful evidence is often node-local. This is where CKA candidates lose time if they keep reading scheduler logs after the Pod is already bound.
API server symptoms require a different entry point because the API server is your normal window into the cluster. If kubectl get --raw='/readyz?verbose' works, use it because it gives structured readiness checks. If it fails with connection refused or TLS errors, move to the control plane node and inspect kubelet plus the API server manifest. If it works but reports etcd-related readiness failures, you have evidence that the API process is alive but its storage dependency is not healthy enough for normal service. That distinction changes whether you inspect API flags, certs, network reachability, or etcd itself.
etcd symptoms are often broad because every persisted state transition depends on storage. Slow etcd does not always mean etcd is fully down; it may mean disk latency, database growth, compaction issues, member communication trouble, or certificate errors between the API server and the etcd endpoint. Readiness output, API server logs, and etcdctl endpoint health or member commands can triangulate the cause. In production, you would also check metrics, disk capacity, and recent maintenance. In an exam lab, you focus on certificate paths, endpoints, static Pod health, and obvious disk or process failures.
Leader election adds another layer in highly available clusters. The scheduler and controller manager may run multiple Pods, but only the leader actively performs work. A standby instance is not broken just because it is not making scheduling or reconciliation decisions. If no leader can be elected, the cluster may accept objects without progressing. This is why checking leader election endpoints or leases can be useful when component Pods are Running but behavior is still frozen. Running means the process exists; leadership means the process owns the active work.
Static Pod recovery also deserves careful sequencing. First identify whether the manifest exists and whether kubelet is reading it. Then inspect whether the container was created and why it exited. If the manifest is missing, restore it before spending time on API-level deletes. If the manifest exists but contains a bad flag, compare it against a known-good backup or another control plane node. If the container runs but fails readiness, read the component logs and dependency checks. Each step tests a different layer of the local static Pod stack.
When you practice, write down the proof each command gives you. kubectl get deployment proves the API can read a Deployment object, not that the controller is healthy. kubectl get pods -n kube-system | grep scheduler proves a scheduler mirror Pod is visible, not that every Pod is schedulable. journalctl -u kubelet proves what the local kubelet is doing, not what etcd has stored. These distinctions seem fussy at first, but they are what make an incident timeline accurate.
The final habit is to avoid changing multiple layers at once. If you edit a static manifest, restart kubelet, drain a node, and delete workloads in one burst, you cannot tell which action fixed or worsened the symptom. Make one scoped change, watch the expected next transition, then continue. Kubernetes is built from reconciliation loops, so good troubleshooting uses the same patience: observe desired state, observe current state, change one input, and verify the next loop response.
One useful way to practice this discipline is to narrate a failure in terms of verbs rather than nouns. Instead of saying “the scheduler is broken,” say “the Pod has not been bound to a node.” Instead of saying “the controller manager is broken,” say “the Deployment has not produced a ReplicaSet” or “the ReplicaSet has not produced Pods.” Verb-based descriptions keep you close to evidence, and they make peer handoffs clearer. Another engineer can test the same transition without inheriting your unproven conclusion.
The API object status fields are also part of the evidence chain, not decorative output. Conditions, timestamps, observed generations, ready counts, and Events tell you whether a controller has seen the current desired state. For example, a Deployment may have a newer metadata.generation than status.observedGeneration, which means the controller has not yet processed the latest spec. A Pod may have a scheduled condition, a ready condition, and container state details that each belong to a different owner. Read these fields as breadcrumbs through the control loops.
Be careful with label selectors during control plane diagnosis. A ReplicaSet controller counts Pods by selector, not by the story you have in your head. If a Pod template label does not match the selector, Kubernetes may reject the object, or a controller may ignore Pods you expected it to own. If a Service selector does not match ready Pods, the Service can exist while having no useful endpoints. These mistakes are not control plane process outages, but they look like broken automation until you verify the selector relationship.
Node conditions provide another boundary marker. If a node is NotReady, the scheduler may stop placing normal Pods there, and the node controller may eventually react to missed heartbeats. If a node is Ready but under memory, disk, or PID pressure, kubelet may reject or evict work even though the control plane accepted the desired state. In a small lab, those details may feel secondary, yet they explain why “the Pod exists” is not enough. The node still has to prove it can execute the assignment safely.
Timeouts deserve similar care. A client-side timeout from kubectl can mean local network trouble, API server overload, admission webhook latency, etcd latency, or a request that needs to stream more data than expected. The message text, the affected verbs, and the affected resources matter. If only Pod creation is slow, compare admission, quota, image policy, and scheduler paths. If every list and watch operation is slow across the cluster, look harder at the API server and etcd. The breadth of the timeout is evidence.
For CKA practice, convert every broad incident into two artifacts: a suspected owner and a falsifiable next check. “Suspected owner: controller manager. Next check: does a new Deployment create a ReplicaSet, and do controller manager logs show reconciliation or leader election errors?” That format forces you to choose a test that could prove you wrong. If the ReplicaSet appears immediately, the controller manager is not the current blocker. You then move to the next transition instead of defending the first guess, and your notes remain useful to someone reviewing the path later under exam pressure or team review.
Patterns & Anti-Patterns
Section titled “Patterns & Anti-Patterns”Good control plane troubleshooting starts with symptom classification. Ask whether the failure blocks API access, state persistence, scheduling, reconciliation, node execution, or service networking. That question narrows the search before you touch anything. It also prevents harmful fixes, such as restarting every component when a single Pod has an impossible node selector. Restarting may hide timing clues and can turn a partial outage into a broader one if several components are already close to unhealthy.
| Pattern | When to Use It | Why It Works | Scaling Consideration |
|---|---|---|---|
| Start at the failed transition | A resource is stuck between accepted, scheduled, running, or ready | Each transition maps to a small set of components | Build runbooks around transitions, not around random commands |
| Prefer health endpoints and events before restarts | The API is reachable but behavior is degraded | Readiness checks and events preserve evidence | Store important outputs in incident notes before making changes |
| Treat static manifests as source for static Pods | A control plane Pod keeps disappearing or cannot restart | kubelet reconciles files in /etc/kubernetes/manifests/ | Back up manifests and compare against a healthy control plane node |
| Separate desired state from execution state | A Deployment, ReplicaSet, Pod, or container disagrees | Kubernetes progresses through controllers, scheduler, kubelet, and runtime | Use object owner references and status fields to follow the chain |
Anti-patterns usually come from treating Kubernetes as one box. A team sees Pods in Pending and restarts kubelet on every node, even though the Events section says the Pods request too much memory. Another team deletes a static Pod mirror and expects it to stay deleted, even though kubelet immediately recreates it from the manifest. These actions are understandable because the surface area is large, but they waste time because they ignore ownership boundaries.
| Anti-Pattern | What Goes Wrong | Better Alternative |
|---|---|---|
| Restarting all control plane components first | Evidence disappears and the outage can widen | Read health, events, and logs before choosing one component |
Using kubectl only during API server failure | The main diagnostic tool depends on the failed component | SSH to the control plane and use manifest, kubelet, and runtime logs |
| Blaming the scheduler for every Pending Pod | Resource, taint, affinity, and storage constraints are missed | Inspect kubectl describe pod Events before touching scheduler |
| Editing static manifests without a backup | A typo can keep a core component crash-looping | Copy the manifest, make one change, and watch kubelet logs |
The safest habit is to name the owner of the next state transition. API server owns request acceptance, etcd owns persisted state, controllers own reconciliation, scheduler owns Pod binding, kubelet owns node-local Pod execution, kube-proxy owns many Service dataplane rules, and the runtime owns containers. If you cannot name the owner, pause before changing the system. That small discipline prevents most beginner control plane mistakes.
Decision Framework
Section titled “Decision Framework”Use this framework when a cluster symptom looks broad but the exact component is unclear. Start with what still works, because negative evidence is often more useful than a long command list. If no kubectl command succeeds, move below the API and inspect the control plane node locally. If API reads work but writes are slow or failing, inspect API server readiness and etcd health. If objects are accepted but not acted on, follow the desired-state chain through controllers, scheduler, and kubelet.
Symptom observed | vCan kubectl reach the API server? | +-- No --> Check kube-apiserver static Pod, kubelet logs, certificates, | bind address, and local runtime status on the control plane. | +-- Yes --> Are API reads or writes slow across many resources? | +-- Yes --> Check API server readyz/livez and etcd health. | +-- No --> Does the object exist but no child object appears? | +-- Yes --> Check kube-controller-manager logs. | +-- No --> Are Pods created but Pending? | +-- Yes --> Check scheduler and Pod Events. | +-- No --> Are Pods assigned but not Running? | +-- Yes --> Check kubelet and runtime. | +-- No --> Check Service, EndpointSlice, kube-proxy, and application readiness.| Symptom | Most Likely Owner | First Useful Check | Why This Check Comes First |
|---|---|---|---|
kubectl cannot connect | API server or node-local static Pod stack | journalctl -u kubelet and API server manifest | kubectl depends on the component you are testing |
| All writes are slow or time out | API server to etcd path | kubectl get --raw='/readyz?verbose' and etcd member health | Every write must be persisted before other loops react |
| Deployment exists but no ReplicaSet appears | Controller manager | Controller manager logs and leader election status | Deployment reconciliation creates the next object |
Pods exist but have no nodeName | Scheduler or scheduling constraints | kubectl describe pod <pod-name> Events | Events reveal resource, taint, affinity, and volume blockers |
| Pod assigned but containers absent | kubelet or runtime | journalctl -u kubelet and crictl ps | Node-local execution starts after scheduling |
| Pod Running but Service traffic fails | kube-proxy or Service endpoints | Service, EndpointSlice, kube-proxy logs, node rules | Scheduling is complete, so networking owns the next hop |
This decision framework is not a replacement for documentation or deeper debugging. It is a guardrail for the first ten minutes of an incident or exam task. In those minutes, the goal is to avoid confusing the component that records desired state with the component that executes it. Once you identify the stuck transition, detailed component documentation and logs become far easier to interpret.
Did You Know?
Section titled “Did You Know?”-
Static Pods are managed by kubelet, not by the API server. In kubeadm clusters, core control plane components commonly live as manifests under
/etc/kubernetes/manifests/, and the Pods shown inkube-systemare mirror objects. This is why restoring a deleted manifest repairs the component while deleting the mirror Pod does not. -
The API server is designed to be stateless. Durable Kubernetes state lives in etcd, so restarting an API server should not erase objects. That statelessness is what lets highly available clusters run multiple API server instances behind a load balancer.
-
etcd uses Raft consensus and needs quorum. A three-member etcd cluster tolerates one failed member, while a five-member cluster tolerates two. Even-sized clusters are usually avoided because they add cost without improving failure tolerance in the way many people expect.
-
Schedulers and controller managers use leader election in highly available clusters. Multiple instances can run, but only the active leader performs the main work at a given time. Standby instances are still valuable because they can take over when the leader fails.
Common Mistakes
Section titled “Common Mistakes”| Mistake | Why It Happens | How to Fix It |
|---|---|---|
| Thinking kubelet runs in a Pod | Most other Kubernetes components are visible as Pods, so learners assume kubelet is too | Check kubelet with systemctl status kubelet and journalctl -u kubelet on the node |
| Ignoring etcd health when the API is slow | The symptom appears in kubectl, so the storage path is easy to overlook | Check API server readiness and etcd member health, disk space, certificates, and latency |
| Not checking component logs | The resource state shows the effect, not always the cause | Read the relevant kube-system logs when the API works, and local node logs when it does not |
| Confusing control plane with worker responsibilities | Kubernetes presents one API even though many components own different transitions | Map the stuck transition to API server, etcd, controller manager, scheduler, kubelet, kube-proxy, or runtime |
| Forgetting static Pods are file-backed | Mirror Pods look normal in kubectl get pods output | Restore or edit the manifest in /etc/kubernetes/manifests/ and watch kubelet recreate the Pod |
| Restarting scheduler for impossible Pod constraints | Pending Pods look like scheduler failures even when the scheduler is correct | Inspect Pod Events for insufficient resources, taints, affinity, selectors, and unbound PVCs |
| Using deprecated component status as the only health signal | Old habits and older tutorials still mention kubectl get componentstatuses | Prefer readyz, livez, component logs, and direct etcd checks on Kubernetes 1.35+ clusters |
Question 1: Your monitoring alert says etcd latency is high, and developers report that `kubectl` commands are slow or timing out. Why does etcd latency affect `kubectl`, and which other cluster behaviors would degrade?
The API server is the only Kubernetes component that persists API objects into etcd, and every normal kubectl request goes through the API server. When etcd is slow, the API server can block on reads or writes, so clients experience timeouts even if the API server process is still alive. Scheduler binding decisions, controller status updates, kubelet status reports, and admission paths that require persisted state can all degrade. The key diagnostic clue is that many unrelated resource operations become slow together, which points to the shared API-to-etcd path rather than one workload controller.
Question 2: A developer reports that a new Deployment shows `0/3` replicas ready. You run `kubectl get pods` and see three Pods stuck in `Pending`, while the scheduler Pod is Running. What are the most likely causes, and how would you investigate?
A Running scheduler does not guarantee that any node is eligible for the Pods. Start with kubectl describe pod <pod-name> and read the Events section because it will usually mention insufficient CPU or memory, untolerated taints, node selector mismatches, affinity failures, or unbound PersistentVolumeClaims. Then compare those messages against node labels, taints, allocatable resources, and storage state. Restarting the scheduler would be premature unless Events or scheduler logs show the scheduler itself is unhealthy.
Question 3: During an incident, the kube-controller-manager has been down for 10 minutes. Existing Pods are still serving traffic, but a Deployment scaled from 3 to 5 replicas never creates the extra Pods. Explain the difference.
Existing containers keep running because kubelet and the container runtime manage local execution after Pods have already been assigned. Scaling a Deployment requires controller reconciliation: the Deployment controller updates ReplicaSets, and the ReplicaSet controller creates or deletes Pods to match desired replica count. With the controller manager down, the new desired state can sit in etcd without the controller that acts on it. Once the controller manager returns, it should observe the mismatch and create the missing Pods.
Question 4: A colleague accidentally deleted `/etc/kubernetes/manifests/kube-scheduler.yaml`. They try `kubectl delete pod kube-scheduler -n kube-system` to restart it, but scheduling stays broken. What went wrong, and what is the correct fix?
The scheduler in a kubeadm control plane is commonly a static Pod managed by kubelet from the manifest file. The Pod visible through kubectl is a mirror Pod, so deleting it does not recreate the missing source manifest. The correct fix is to restore /etc/kubernetes/manifests/kube-scheduler.yaml from backup or from a healthy control plane node, then watch kubelet recreate the static Pod. After the scheduler is Running, verify that Pending Pods receive node assignments.
Question 5: `kubectl create pod` succeeds, the Pod gets a `nodeName`, but the container never starts and kubelet reports image pull failures. Which component owns the next diagnostic step, and why?
The API server and scheduler have already done their parts because the Pod exists and has been assigned to a node. The next owner is the kubelet on the selected node, working with the container runtime to pull the image and create the container. You should inspect kubectl describe pod for image events, then move to journalctl -u kubelet and runtime commands such as crictl ps or crictl images on that node. The scheduler is not the first suspect because binding already happened.
Question 6: A Service has a ClusterIP, the backing Pods are Running and Ready, but traffic to the Service fails from one node only. Which part of the architecture should you investigate first?
Because the Pods are Running and Ready, the control plane has already accepted, scheduled, and reconciled the workload. A one-node Service failure points toward node-local dataplane behavior, especially kube-proxy rules, firewall rules, CNI state, or EndpointSlice propagation on that node. Start by checking the Service and EndpointSlice objects, then inspect kube-proxy logs and node rules such as the KUBE-SERVICES chain if that cluster uses iptables mode. Restarting the controller manager would not be the first move because the failure is localized to traffic handling.
Question 7: The API server is down, and `kubectl logs -n kube-system kube-apiserver-` cannot connect. What should you do next, and why?
When the API server is down, kubectl logs depends on the failed path, so it is the wrong tool for the next step. SSH to the control plane node and inspect local state: journalctl -u kubelet, the static manifest at /etc/kubernetes/manifests/kube-apiserver.yaml, and runtime logs through crictl if needed. This works because kubelet and the runtime are local processes that can still expose evidence even when the Kubernetes API is unavailable. The goal is to find whether the static Pod is missing, crash-looping, misconfigured, or unable to reach etcd.
Hands-On Exercise
Section titled “Hands-On Exercise”This exercise uses the protected command drills from the original module and places them in a safer diagnostic sequence. Run destructive steps only in a disposable practice cluster, such as the linked lab environment, because moving static manifests intentionally breaks control plane components. If you are on a shared cluster, perform only the read-only inspection steps and discuss the recovery tasks instead of executing them.
Task: Explore your cluster’s control plane components and practice mapping symptoms to owners.
Guided Exploration
Section titled “Guided Exploration”- List all control plane pods:
kubectl get pods -n kube-system- Check component health:
kubectl get componentstatuseskubectl get --raw='/healthz?verbose'- View API server configuration:
# On control plane nodesudo cat /etc/kubernetes/manifests/kube-apiserver.yaml | grep -A5 "command:"- Check scheduler logs for recent activity:
kubectl logs -n kube-system -l component=kube-scheduler --tail=20- Watch controller manager in action:
# Terminal 1: Watch controller logskubectl logs -n kube-system -l component=kube-controller-manager -f
# Terminal 2: Create and delete a deploymentkubectl create deployment test --image=nginx --replicas=2
# Verify the deployment was created successfully before deletingkubectl rollout status deployment test
kubectl delete deployment test- Explore etcd if available:
# On control plane node with etcdctlsudo ETCDCTL_API=3 etcdctl get /registry/namespaces/default \ --endpoints=https://127.0.0.1:2379 \ --cacert=/etc/kubernetes/pki/etcd/ca.crt \ --cert=/etc/kubernetes/pki/etcd/server.crt \ --key=/etc/kubernetes/pki/etcd/server.keyPractice Drills
Section titled “Practice Drills”Drill 1: Component Identification Race
Section titled “Drill 1: Component Identification Race”Without looking at notes, identify which component handles each scenario. The target is two minutes, but accuracy matters more than speed because each row maps to an exam troubleshooting path.
| Scenario | Component |
|---|---|
| Stores all cluster state | ___ |
| Decides which node runs a new pod | ___ |
| Authenticates kubectl requests | ___ |
| Creates pods when ReplicaSet changes | ___ |
| Reports node status to control plane | ___ |
| Maintains iptables rules for Services | ___ |
Solution
- etcd
- kube-scheduler
- kube-apiserver
- kube-controller-manager (ReplicaSet controller)
- kubelet
- kube-proxy
Drill 2: Troubleshooting Missing Scheduler
Section titled “Drill 2: Troubleshooting Missing Scheduler”Exercise scenario: Pods are stuck in Pending forever because the scheduler manifest has been moved out of the watched manifest directory. Do this only on a disposable cluster where breaking scheduling is acceptable.
# Setup: Break the schedulersudo mv /etc/kubernetes/manifests/kube-scheduler.yaml /tmp/
# Create a test podkubectl run drill-pod --image=nginx
# Observe the problemkubectl get pods # Pending foreverkubectl describe pod drill-pod | grep -A5 Events
# YOUR TASK: Diagnose and fix# 1. What's missing?# 2. How do you restore it?Solution
# Check control plane podskubectl get pods -n kube-system | grep scheduler # Nothing!
# Restore schedulersudo mv /tmp/kube-scheduler.yaml /etc/kubernetes/manifests/
# Wait for scheduler and verifykubectl get pods -n kube-system | grep scheduler # Running!kubectl get pod drill-pod # Now Running
# Cleanupkubectl delete pod drill-podDrill 3: Troubleshooting Controller Manager Down
Section titled “Drill 3: Troubleshooting Controller Manager Down”Exercise scenario: Deployments are accepted by the API server, but ReplicaSets and Pods are never created because the controller manager manifest has been moved away.
# Setupsudo mv /etc/kubernetes/manifests/kube-controller-manager.yaml /tmp/
# Create deploymentkubectl create deployment drill-deploy --image=nginx --replicas=3
# Observekubectl get deploy # Shows 0/3 readykubectl get rs # No ReplicaSet created!kubectl get pods # No pods!
# YOUR TASK: Diagnose and fixSolution
# Check controller managerkubectl get pods -n kube-system | grep controller # Nothing!
# Restore controller managersudo mv /tmp/kube-controller-manager.yaml /etc/kubernetes/manifests/
# Watch pods appearkubectl get pods -w # 3 pods created
# Cleanupkubectl delete deployment drill-deployDrill 4: API Server Health Deep Dive
Section titled “Drill 4: API Server Health Deep Dive”Check API server health using multiple methods and compare what each method proves. The direct curl method is useful from the control plane node, while the kubectl get --raw methods verify the authenticated API path.
# Method 1: Basic connectivitykubectl cluster-info
# Method 2: Health endpointskubectl get --raw='/healthz'kubectl get --raw='/readyz'kubectl get --raw='/livez'
# Method 3: Detailed healthkubectl get --raw='/readyz?verbose' | grep -E "^\[|ok|failed"
# Method 4: Direct curl (from control plane)curl -k https://localhost:6443/healthz
# Method 5: Check API server logs for errorskubectl logs -n kube-system -l component=kube-apiserver --tail=20 | grep -i errorDrill 5: Watch the Reconciliation Loop
Section titled “Drill 5: Watch the Reconciliation Loop”Use two terminals to observe that changing desired state does not directly run containers. The controller manager reacts first, then lower-level objects and node components continue the chain.
# Terminal 1: Watch controller manager logskubectl logs -n kube-system -l component=kube-controller-manager -f | grep -i "replicaset\|deployment"
# Terminal 2: Create, scale, delete deploymentkubectl create deployment watch-me --image=nginx --replicas=2kubectl rollout status deployment watch-me
kubectl scale deployment watch-me --replicas=5kubectl rollout status deployment watch-me
kubectl delete deployment watch-me
# Observe logs in Terminal 1 - see the controller react to each changeDrill 6: etcd Exploration
Section titled “Drill 6: etcd Exploration”Explore what etcd stores if you have etcdctl and the local certificates on the control plane node. Avoid listing all keys in production clusters because the output can be large and may expose sensitive object names.
# Set up etcdctl aliasexport ETCDCTL_API=3alias etcdctl='etcdctl --endpoints=https://127.0.0.1:2379 \ --cacert=/etc/kubernetes/pki/etcd/ca.crt \ --cert=/etc/kubernetes/pki/etcd/server.crt \ --key=/etc/kubernetes/pki/etcd/server.key'
# List all keys (be careful in production!)etcdctl get / --prefix --keys-only | head -50
# Find all podsetcdctl get /registry/pods --prefix --keys-only
# Get a specific pod's dataetcdctl get /registry/pods/default/<pod-name>Drill 7: Challenge Full Control Plane Restart
Section titled “Drill 7: Challenge Full Control Plane Restart”Exercise scenario: restart control plane static Pods by restarting kubelet, then verify that the cluster recovers. This is an advanced practice task for disposable clusters only.
# WARNING: Only do this on practice clusters!
# 1. Note current statekubectl get nodeskubectl get pods -A | wc -l
# 2. Restart all control plane componentssudo systemctl restart kubelet# Static pods will restart automatically
# 3. Wait and verify recoverysleep 30kubectl get nodes # All Ready?kubectl get pods -n kube-system # All Running?
# 4. Test workload schedulingkubectl run recovery-test --image=nginxkubectl wait --for=condition=Ready pod/recovery-test --timeout=60skubectl get podskubectl delete pod recovery-testSuccess Criteria
Section titled “Success Criteria”- Can identify all control plane components and their Pods.
- Can check API server liveness and readiness using supported health endpoints.
- Can find and read control plane component logs through
kubectlwhen the API is healthy. - Can switch to local kubelet, manifest, and runtime inspection when the API server is unavailable.
- Can explain what each component does during Deployment creation.
- Can restore a moved scheduler or controller manager static Pod manifest in a practice cluster.
Cleanup
Section titled “Cleanup”# Remove test deployment if createdkubectl delete deployment test --ignore-not-foundSources
Section titled “Sources”- https://kubernetes.io/docs/concepts/overview/components/
- https://kubernetes.io/docs/concepts/architecture/control-plane-node-communication/
- https://kubernetes.io/docs/reference/command-line-tools-reference/kube-apiserver/
- https://kubernetes.io/docs/reference/command-line-tools-reference/kube-scheduler/
- https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager/
- https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/
- https://kubernetes.io/docs/reference/command-line-tools-reference/kube-proxy/
- https://kubernetes.io/docs/reference/using-api/health-checks/
- https://kubernetes.io/docs/tasks/administer-cluster/configure-upgrade-etcd/
- https://kubernetes.io/docs/tasks/debug/debug-cluster/
- https://kubernetes.io/docs/tasks/configure-pod-container/static-pod/
- https://etcd.io/docs/v3.6/op-guide/runtime-configuration/
Next Module
Section titled “Next Module”Module 1.2: Extension Interfaces (CNI, CSI, CRI) - How Kubernetes plugs in networking, storage, and container runtimes.