Module 3.1: Application Probes
Complexity:
[MEDIUM]- Critical exam topic with production consequences and multiple configuration trade-offsTime to Complete: 50-60 minutes
Prerequisites: Module 1.1 (Pods), basic Service routing, and comfort reading Pod events with
kubectl describe
Learning Outcomes
Section titled “Learning Outcomes”After completing this module, you will be able to:
- Design liveness, readiness, and startup probe configurations that match an application’s startup behavior, dependency model, and failure modes.
- Compare HTTP, TCP, exec, and gRPC probe mechanisms, then select the least risky check for a given workload.
- Debug restart loops, missing Service endpoints, and slow rollouts by reading probe events, Pod status, and endpoint membership.
- Evaluate probe timing parameters so recovery is fast without causing false positives during normal startup, load, or garbage collection pauses.
- Implement runnable Kubernetes manifests that combine startup, liveness, and readiness probes for exam-style and production-style scenarios.
Why This Module Matters
Section titled “Why This Module Matters”A team ships a new checkout service on Friday afternoon. The container starts, the Pod enters Running, and the rollout looks healthy, so the deployment continues across the cluster. A few minutes later, customer traffic starts failing because the process is alive but the application has not finished loading its pricing cache. Kubernetes did exactly what it was told: it ran containers and routed traffic to ready Pods. The problem was that nobody taught Kubernetes what “ready” meant for this application.
Probes are the contract between your application and the kubelet. They tell Kubernetes whether a container should be restarted, whether a Pod should receive traffic, and whether a slow-starting application deserves more time before normal health checks begin. Without probes, Kubernetes can only observe process-level facts such as “the container exists” and “the main process has not exited.” Those facts are useful, but they are not the same as “the service can answer real requests.”
The CKAD exam tests probes because they sit at the boundary between YAML fluency and operational judgment. You need to write valid manifests quickly, but you also need to know which probe failure restarts a container and which probe failure removes a Pod from Service endpoints. In real clusters, that distinction decides whether an outage heals itself, hides a broken Pod from users, or becomes worse because Kubernetes restarts healthy but slow containers.
The Hospital Monitoring Analogy
A liveness probe is like asking, “Does this patient need emergency intervention because the vital process has stopped?” A readiness probe is like asking, “Can this patient safely receive visitors right now?” A startup probe is like telling the monitoring system, “Do not apply normal alarm rules while the patient is still waking from surgery.” Each question has a different consequence, and using one question for every situation causes noisy alarms or missed failures.
The Mental Model: Three Questions, Three Actions
Section titled “The Mental Model: Three Questions, Three Actions”A probe is not just a health check endpoint. It is a health check endpoint plus a Kubernetes action. The fastest way to avoid probe mistakes is to ask what Kubernetes will do after a failure, because the same /healthz URL can be harmless in one probe and destructive in another.
┌──────────────────────────────────────────────────────────────────────────────┐│ Kubernetes Probe Decision Map │├──────────────────────────────────────────────────────────────────────────────┤│ ││ startupProbe ││ Question: Has the application completed its startup sequence? ││ Failure: Keep waiting until failureThreshold is exceeded. ││ Success: Enable livenessProbe and readinessProbe evaluation. ││ ││ livenessProbe ││ Question: Is the application alive enough that continuing is useful? ││ Failure: Kill the container and let the restart policy recreate it. ││ Success: Leave the running container alone. ││ ││ readinessProbe ││ Question: Should this Pod receive Service traffic right now? ││ Failure: Remove the Pod IP from matching Service endpoints. ││ Success: Add or keep the Pod IP in matching Service endpoints. ││ │└──────────────────────────────────────────────────────────────────────────────┘The most important operational distinction is that liveness changes container lifetime, while readiness changes traffic routing. A failing liveness probe increases restart counts and can create CrashLoopBackOff behavior if the application never satisfies the check. A failing readiness probe does not restart the container; it changes whether Services send traffic to that Pod. Startup probes gate the other two so that slow applications are not punished before they have had a fair chance to initialize.
Active Learning Prompt: Predict the Action
A Pod is
Runningwith zero restarts, but the Service has no endpoints for it. Which probe family should you investigate first, and why would checking the restart count alone mislead you?
If your answer was “readiness,” your mental model is on track. Readiness is the probe that controls endpoint membership, so the container can look stable while traffic is still blocked. Restart count is a liveness clue, not a readiness guarantee.
Probe Types in Context
Section titled “Probe Types in Context”Liveness Probe: Recover When Continuing Is Worse Than Restarting
Section titled “Liveness Probe: Recover When Continuing Is Worse Than Restarting”A liveness probe should detect situations where the process is still present but the application is no longer making useful progress. Examples include a deadlocked worker pool, a server stuck in an unrecoverable internal state, or a process that accepts connections but can no longer serve core requests. The consequence is severe: Kubernetes kills the container. That makes liveness valuable for self-healing, but dangerous when the check is too broad or too sensitive.
A good liveness check is usually narrow. It should verify that the application runtime can make basic forward progress, not that every external dependency is perfect. If a downstream database is temporarily unavailable, restarting every application Pod may amplify the incident. In that case, readiness should usually fail so traffic drains, while liveness should continue to pass unless the process itself is stuck.
apiVersion: v1kind: Podmetadata: name: liveness-webspec: containers: - name: app image: nginx:1.27 ports: - containerPort: 80 livenessProbe: httpGet: path: / port: 80 initialDelaySeconds: 10 periodSeconds: 10 timeoutSeconds: 2 failureThreshold: 3This Pod uses a simple HTTP liveness probe because nginx can answer / quickly after startup. The kubelet waits ten seconds before the first check, probes every ten seconds, and restarts the container after three consecutive failures. For a real application, the path might be /healthz, but the same principle applies: the endpoint should mean “restart me if this keeps failing.”
Readiness Probe: Control Traffic Without Killing the Container
Section titled “Readiness Probe: Control Traffic Without Killing the Container”A readiness probe answers a different question: can this Pod safely receive requests right now? It may fail during startup, cache warmup, overloaded periods, maintenance windows, or temporary loss of a required dependency. When readiness fails, the Pod stays alive, but Services stop routing traffic to it until the probe succeeds again.
Readiness is where application-specific truth belongs. If the app requires a database connection before it can serve user requests, readiness may check that connection. If the app needs a local cache loaded before request latency is acceptable, readiness may stay false until the cache is warm. The goal is not to prove the process is alive; the goal is to avoid sending user traffic to a Pod that cannot handle it.
apiVersion: v1kind: Podmetadata: name: readiness-web labels: app: readiness-webspec: containers: - name: app image: nginx:1.27 ports: - containerPort: 80 readinessProbe: httpGet: path: / port: 80 initialDelaySeconds: 3 periodSeconds: 5 timeoutSeconds: 2 failureThreshold: 2If this readiness probe fails, k get pod may still show the Pod as Running, but the READY column will not show the container as ready. If a Service selects the Pod, endpoint membership changes as readiness changes. That is why readiness probes are central to zero-downtime rollouts: Kubernetes can wait for new Pods to become ready before sending them traffic.
Startup Probe: Protect Slow Boot Without Weakening Liveness Forever
Section titled “Startup Probe: Protect Slow Boot Without Weakening Liveness Forever”A startup probe is for applications whose startup time is long, variable, or difficult to predict. Before startup probes existed, teams often used very large initialDelaySeconds values on liveness probes. That workaround prevented early restarts, but it also delayed detection of real failures after startup. Startup probes solve this by giving the app a separate startup budget, then handing control to normal liveness and readiness checks once startup succeeds.
The key behavior is gating. When a startup probe is configured, liveness and readiness probes do not run until the startup probe succeeds. If the startup probe never succeeds and its failure threshold is exceeded, Kubernetes restarts the container. After startup succeeds once, the startup probe is done for that container instance, and normal probe behavior begins.
apiVersion: v1kind: Podmetadata: name: startup-webspec: containers: - name: app image: nginx:1.27 ports: - containerPort: 80 startupProbe: httpGet: path: / port: 80 periodSeconds: 5 failureThreshold: 24 livenessProbe: httpGet: path: / port: 80 periodSeconds: 10 failureThreshold: 3 readinessProbe: httpGet: path: / port: 80 periodSeconds: 5 failureThreshold: 2This startup probe gives the container up to about two minutes to pass its first health check. Once it passes, the liveness probe can remain reasonably aggressive, which means post-startup failures are detected quickly. This is usually better than setting initialDelaySeconds: 120 on the liveness probe, because the startup path and the steady-state recovery path are separate.
Active Learning Prompt: Choose the Consequence
Your application takes ninety seconds to compile templates on boot, then usually serves traffic reliably. During a traffic spike, the database sometimes rejects connections for ten seconds. Which probe should tolerate startup, which probe should stop traffic during database rejection, and which probe should avoid checking the database directly?
A strong answer uses startup for the template compilation window, readiness for database-dependent traffic routing, and liveness for process health rather than downstream availability. That division keeps Kubernetes from restarting containers just because a dependency had a temporary problem.
Probe Mechanisms: How Kubernetes Checks Health
Section titled “Probe Mechanisms: How Kubernetes Checks Health”Kubernetes 1.35 supports several probe mechanisms. The mechanism should match what the application can prove cheaply and reliably. A probe that is expensive, flaky, or dependent on the wrong layer will eventually become an outage trigger.
| Mechanism | Best Fit | Success Condition | Common Risk |
|---|---|---|---|
| HTTP GET | Web services and APIs with health endpoints | HTTP status code from 200 through 399 | Endpoint performs too much work or redirects to a generic login page |
| TCP socket | Protocols where opening a port proves basic availability | TCP connection can be established | Port is open even though the application protocol is unhealthy |
| Exec | Containers with a local health command or file-based signal | Command exits with status 0 | Command is slow, missing, or consumes too many resources |
| gRPC | gRPC services implementing the standard health checking protocol | gRPC health response is serving | Service does not implement the expected health service |
HTTP GET Probes
Section titled “HTTP GET Probes”HTTP probes are common because most web applications can expose cheap health endpoints. Kubernetes treats status codes from 200 through 399 as success. Any other status code, timeout, DNS problem, or connection failure counts as a failed probe. Headers can be supplied when the endpoint requires a specific host or custom marker, but health checks should not require user authentication.
livenessProbe: httpGet: path: /healthz port: 8080 httpHeaders: - name: X-Probe value: kubelet initialDelaySeconds: 10 periodSeconds: 10 timeoutSeconds: 2A production HTTP health endpoint should be intentionally boring. It should not allocate large objects, call half the platform, or perform an expensive database migration check. For liveness, it should answer whether the process can still serve basic work. For readiness, it may include dependencies that are required for serving user traffic, but those dependency checks should have tight timeouts and clear failure behavior.
TCP Socket Probes
Section titled “TCP Socket Probes”A TCP socket probe only checks whether the kubelet can open a TCP connection to the container port. This is useful for services that do not speak HTTP, such as Redis, MySQL, or custom binary protocols. It is also weaker than an application-level check because a process may accept TCP connections while failing every real command after connection.
livenessProbe: tcpSocket: port: 6379 initialDelaySeconds: 10 periodSeconds: 10 timeoutSeconds: 2Use TCP probes when connection acceptance is the best cheap signal available. For example, an exam question may ask for a Redis liveness check, and a TCP probe on port 6379 is fast to write and usually acceptable. In a production Redis deployment, you might prefer an exec probe that runs redis-cli ping if the image includes the tool and the command is reliable.
Exec Probes
Section titled “Exec Probes”An exec probe runs a command inside the container. It succeeds when the command exits with status 0 and fails when the command exits with a nonzero status or times out. Exec probes are powerful because they can inspect local files, run application-specific commands, or query a local process without exposing another network endpoint.
livenessProbe: exec: command: - sh - -c - test -f /tmp/healthy initialDelaySeconds: 5 periodSeconds: 5 timeoutSeconds: 2The power of exec probes comes with overhead and image coupling. The command must exist in the container image, run quickly, and behave consistently under load. If a minimal image does not contain sh, cat, curl, or a database client, the probe will fail even if the application is healthy. That failure mode is common in exam practice because learners copy an exec command into an image that does not include the binary.
gRPC Probes
Section titled “gRPC Probes”A gRPC probe checks a service using the gRPC health checking protocol. It is a good fit when the application already implements the standard health service and exposes it on a known port. It avoids bundling a separate probing binary into the image, which used to be a common workaround for gRPC workloads.
livenessProbe: grpc: port: 50051 initialDelaySeconds: 10 periodSeconds: 10 timeoutSeconds: 2Use gRPC probes when the app actually implements gRPC health checking. Do not choose gRPC just because the application uses gRPC for business traffic. The probe and the service need to agree on the health protocol, otherwise the kubelet cannot interpret the response correctly.
Timing Parameters: Fast Recovery Without False Positives
Section titled “Timing Parameters: Fast Recovery Without False Positives”Probe timing is where many correct-looking manifests become operationally dangerous. The YAML may validate, but the thresholds may restart healthy applications during garbage collection pauses, slow cold starts, or temporary CPU pressure. Probe timing should be chosen from observed application behavior, not copied from a generic snippet.
| Parameter | Meaning | Default | Practical Guidance |
|---|---|---|---|
initialDelaySeconds | Delay before the first probe after container start | 0 | Useful without startup probes, but avoid using it as a long-term startup workaround |
periodSeconds | Time between probe attempts | 10 | Shorter periods detect failures faster but increase probe load |
timeoutSeconds | Time allowed for one probe attempt | 1 | Increase when the health endpoint is valid but predictably slower than one second |
successThreshold | Consecutive successes required after failure | 1 | For readiness, values above one can prevent rapid endpoint flapping |
failureThreshold | Consecutive failures required before action | 3 | Higher values reduce false positives but delay restart or endpoint removal |
The approximate failure action time is initialDelaySeconds + (failureThreshold * periodSeconds), ignoring timeout details and scheduling jitter. For example, a liveness probe with initialDelaySeconds: 10, periodSeconds: 10, and failureThreshold: 3 will usually restart the container after roughly forty seconds if every probe fails. That is long enough to avoid killing during a brief hiccup, but short enough to recover from a real stuck process.
The calculation changes in meaning depending on probe type. For liveness, the action is restart. For readiness, the action is endpoint removal. For startup, the action is restart after the startup budget is exhausted. The arithmetic is similar, but the consequence is different, so the same numbers may be reasonable for one probe and too aggressive for another.
┌──────────────────────────────────────────────────────────────────────────────┐│ Probe Timing Timeline │├──────────────────────────────────────────────────────────────────────────────┤│ ││ container starts ││ │ ││ ├── initialDelaySeconds ││ │ ││ ├── probe attempt 1 fails ││ │ wait periodSeconds ││ ├── probe attempt 2 fails ││ │ wait periodSeconds ││ ├── probe attempt 3 fails ││ │ ││ └── failureThreshold reached, so Kubernetes performs probe action ││ ││ For liveness: restart container. For readiness: remove endpoint. ││ For startup: restart container because startup never completed. ││ │└──────────────────────────────────────────────────────────────────────────────┘Worked Example: Debug the Math Before Editing YAML
A Java API starts in thirty-five seconds on a warm node and up to ninety seconds on a cold node. The team configured a liveness probe with no startup probe,
initialDelaySeconds: 15,periodSeconds: 10, andfailureThreshold: 3. The first failure window ends at about forty-five seconds, which means cold starts can be killed before they finish. A better design is a startup probe withperiodSeconds: 5andfailureThreshold: 24, giving roughly two minutes for startup, then a liveness probe with normal steady-state timing.
The worked example shows the decision sequence you should use on the CKAD exam and in production reviews. First, estimate realistic startup time. Second, decide whether startup timing and steady-state failure detection need separate budgets. Third, configure startup and liveness independently so one concern does not distort the other.
Designing Probe Endpoints
Section titled “Designing Probe Endpoints”A health endpoint is part of the application interface. It deserves the same design care as any other API because Kubernetes will make scheduling and restart decisions from it. A weak endpoint hides broken workloads, while an overambitious endpoint creates false outages.
For liveness, design the endpoint to answer whether restarting this container is likely to improve the situation. It can check internal event loops, local worker status, or a lightweight self-test. It should usually avoid external dependencies because restarting the process rarely fixes a database outage, a DNS incident, or a third-party API failure.
For readiness, design the endpoint to answer whether the Pod should receive user traffic right now. This may include required dependencies, migrations, warmed caches, or feature flag readiness. If the Pod cannot serve useful requests without a database, failing readiness during database unavailability is often correct. The Pod stays alive, and Kubernetes stops sending it traffic until it can serve again.
For startup, design the endpoint to become successful only when the app has finished the phase that makes normal probes meaningful. It might reuse the liveness endpoint if liveness is not valid until startup completes, or it might use a dedicated endpoint that reports boot sequence completion. The startup check should not become a second readiness check; its job is to unlock normal probing.
┌──────────────────────────────────────────────────────────────────────────────┐│ Endpoint Design Boundary │├──────────────────────────────────────────────────────────────────────────────┤│ ││ /livez ││ - checks local process progress ││ - avoids broad downstream dependency checks ││ - failure means restart could help ││ ││ /readyz ││ - checks whether requests can be served now ││ - may include critical dependencies with strict timeouts ││ - failure means remove from Service traffic ││ ││ /startupz ││ - checks whether boot has completed enough for normal probes ││ - protects slow initialization without weakening steady-state liveness ││ - failure means keep waiting until startup budget is exhausted ││ │└──────────────────────────────────────────────────────────────────────────────┘A common senior-level design choice is to separate /livez and /readyz even when they initially return the same result. The separation gives the application room to evolve without changing Kubernetes semantics later. If the team later adds a database readiness check, it can update /readyz without accidentally causing liveness-driven restarts during a database incident.
Worked Example: Fix a Restart Loop Caused by Liveness
Section titled “Worked Example: Fix a Restart Loop Caused by Liveness”In this scenario, a learner inherits a Pod that keeps restarting. The container image is fine, but the liveness probe points at a missing path. The Pod reaches Running, the kubelet checks /not-here, receives a failing HTTP status, and restarts the container after the threshold is reached.
First, create the broken Pod so the failure is observable. The command uses kubectl; after this first full command, this module uses the common CKAD alias k for kubectl, which you can create with alias k=kubectl in your shell.
kubectl apply -f - << 'EOF'apiVersion: v1kind: Podmetadata: name: broken-livenessspec: containers: - name: app image: nginx:1.27 ports: - containerPort: 80 livenessProbe: httpGet: path: /not-here port: 80 initialDelaySeconds: 5 periodSeconds: 5 failureThreshold: 2EOFWait long enough for the kubelet to run the probe more than once, then inspect the Pod and its events. The restart count is the first clue because liveness failures restart containers. The event stream is the stronger clue because it includes the failed probe message and often the failing status code.
sleep 20k get pod broken-livenessk describe pod broken-livenessYou should see events indicating that the liveness probe failed. The exact wording can vary by Kubernetes version and image behavior, but the important details are the probe type, the failing path, and the restart action. In an exam, this event section is often the fastest way to confirm whether the problem is a wrong path, wrong port, timeout, or command failure.
Now replace the Pod with a valid probe path. Pods are mostly immutable for container probe changes in practical exam workflows, so deleting and recreating the Pod is usually faster than trying to patch nested fields under pressure.
k delete pod broken-liveness --ignore-not-found=true
k apply -f - << 'EOF'apiVersion: v1kind: Podmetadata: name: broken-livenessspec: containers: - name: app image: nginx:1.27 ports: - containerPort: 80 livenessProbe: httpGet: path: / port: 80 initialDelaySeconds: 5 periodSeconds: 10 failureThreshold: 3EOFVerify that the restart count remains stable after the initial start. A single successful k get pod is useful, but not sufficient for this failure mode because liveness failures happen over time. Wait through more than one probe period before deciding the fix worked.
k wait --for=condition=Ready pod/broken-liveness --timeout=60ssleep 25k get pod broken-livenessk describe pod broken-liveness | grep -A 12 EventsThis worked example demonstrates the full loop: observe the symptom, connect restart behavior to liveness, read kubelet events, correct the probe path, and verify over time. The same loop applies to wrong ports, missing exec commands, and probe timeouts.
Debugging Readiness and Service Endpoints
Section titled “Debugging Readiness and Service Endpoints”Readiness problems often look like networking problems because the application process is alive but the Service has no eligible endpoints. Users report connection failures, while k get pods shows Running. That mismatch is a strong signal to inspect readiness, labels, and endpoints together.
Create a small Deployment and Service to see the relationship. This example intentionally uses a readiness probe that succeeds, so you can observe the healthy baseline before thinking about failures.
apiVersion: apps/v1kind: Deploymentmetadata: name: ready-demospec: replicas: 2 selector: matchLabels: app: ready-demo template: metadata: labels: app: ready-demo spec: containers: - name: app image: nginx:1.27 ports: - containerPort: 80 readinessProbe: httpGet: path: / port: 80 initialDelaySeconds: 2 periodSeconds: 5 failureThreshold: 2---apiVersion: v1kind: Servicemetadata: name: ready-demospec: selector: app: ready-demo ports: - port: 80 targetPort: 80Apply it and verify all layers. The Deployment rollout tells you whether Kubernetes sees the desired replicas as available. The Pod list tells you readiness at the Pod level. The endpoint list tells you whether the Service has routable backend IPs.
k apply -f ready-demo.yamlk rollout status deployment/ready-demo --timeout=90sk get pods -l app=ready-demok get endpoints ready-demoWhen readiness is broken, use the same three-layer approach. If Pods are Running but not Ready, inspect k describe pod. If Pods are Ready but endpoints are missing, inspect Service selectors and Pod labels. If endpoints exist but traffic still fails, move on to Service ports, NetworkPolicy, application behavior, or node-level networking.
k describe pod -l app=ready-demok get svc ready-demo -o yamlk get pods -l app=ready-demo --show-labelsThis distinction matters under exam pressure because not every “Service has no endpoints” problem is a probe problem. A bad readiness probe removes endpoints. A bad selector also removes endpoints. A good debugger checks both before changing manifests.
Choosing the Right Probe Mechanism
Section titled “Choosing the Right Probe Mechanism”The best probe is the cheapest one that answers the operational question accurately. Do not make a liveness probe perform a full user journey just because a full user journey is important. A full user journey belongs in synthetic monitoring outside the container restart path, while a probe should be fast, local enough to be reliable, and tied to a specific Kubernetes action.
For a web API, HTTP probes usually provide the clearest signal. Use /livez for local process health and /readyz for traffic readiness. For Redis or another TCP service, a TCP probe may be acceptable for liveness, but an exec probe with a protocol command can be stronger if the image includes the tool. For gRPC services, prefer gRPC probes when the application implements the health protocol. For minimal images, avoid exec probes that assume shell utilities exist.
| Scenario | Recommended Probe | Why This Choice Fits | Watch Out For |
|---|---|---|---|
| HTTP API with dedicated health routes | HTTP liveness and readiness | Clear application-level signal with low overhead | Do not make liveness depend on every downstream service |
| Redis container in an exam task | TCP liveness on port 6379 | Fast to write and validates listener availability | TCP success does not prove Redis commands work |
PostgreSQL image with pg_isready available | Exec readiness or liveness depending on goal | Protocol-aware check from inside the container | Missing binary or slow command causes false failures |
| gRPC service with health protocol implemented | gRPC probe | Avoids extra binaries and checks native health status | The app must expose the standard health service |
| Slow legacy Java service | Startup plus HTTP liveness and readiness | Separates boot budget from steady-state recovery | Do not replace startup probe with a huge liveness delay |
A senior probe design also considers blast radius. If one shared dependency fails and every Pod’s liveness probe checks that dependency, the cluster may restart hundreds of containers at once. That does not repair the dependency; it adds load, destroys caches, and complicates recovery. In most designs, dependency availability affects readiness first, while liveness remains focused on whether this container should continue running.
Common Patterns
Section titled “Common Patterns”Pattern 1: Web Application with Separate Lifecycle Endpoints
Section titled “Pattern 1: Web Application with Separate Lifecycle Endpoints”This pattern is the safest default for production web applications. The startup probe protects initialization, the liveness probe checks basic process health, and the readiness probe decides whether traffic should flow. The endpoints are separate because their meanings are separate.
apiVersion: v1kind: Podmetadata: name: webapp-probes labels: app: webapp-probesspec: containers: - name: app image: nginx:1.27 ports: - containerPort: 80 startupProbe: httpGet: path: / port: 80 failureThreshold: 24 periodSeconds: 5 timeoutSeconds: 2 livenessProbe: httpGet: path: / port: 80 periodSeconds: 10 timeoutSeconds: 2 failureThreshold: 3 readinessProbe: httpGet: path: / port: 80 periodSeconds: 5 timeoutSeconds: 2 failureThreshold: 2In a real application, / would usually become /startupz, /livez, and /readyz. The nginx image is used here because it gives you a runnable manifest in any standard Kubernetes practice cluster. The lesson is the lifecycle structure, not that every production service should use /.
Pattern 2: Readiness Gate for Traffic During Warmup
Section titled “Pattern 2: Readiness Gate for Traffic During Warmup”Some applications can start their process quickly but need extra time before they are useful. For example, an API might need to load routing tables, build an in-memory search index, or connect to a required database. In those cases, readiness should stay false until traffic can be handled correctly.
readinessProbe: httpGet: path: /readyz port: 8080 initialDelaySeconds: 5 periodSeconds: 5 timeoutSeconds: 2 successThreshold: 2 failureThreshold: 2The successThreshold: 2 setting is only valid above one for readiness probes, and it can reduce endpoint flapping after a failure. That does not mean every readiness probe needs it. Use it when a single successful check is not enough evidence that the Pod should rejoin traffic.
Pattern 3: Exec Probe for a Local Health File
Section titled “Pattern 3: Exec Probe for a Local Health File”A local file check is useful for teaching exec probes because it is easy to observe and break. The application creates /tmp/healthy at startup, and the liveness probe restarts the container if that file disappears. This is not a universal production pattern, but it demonstrates the exec success rule clearly.
apiVersion: v1kind: Podmetadata: name: exec-health-filespec: containers: - name: app image: busybox:1.36 command: - sh - -c - touch /tmp/healthy && sleep 3600 livenessProbe: exec: command: - cat - /tmp/healthy initialDelaySeconds: 5 periodSeconds: 5 timeoutSeconds: 2 failureThreshold: 2If you delete the file with k exec exec-health-file -- rm /tmp/healthy, the next failing probe sequence restarts the container. After restart, the startup command recreates the file, so the Pod becomes healthy again. That is a compact way to see the relationship between probe failure, container restart, and application initialization.
Pattern 4: TCP Probe for a Non-HTTP Listener
Section titled “Pattern 4: TCP Probe for a Non-HTTP Listener”A TCP probe works when the service’s useful health signal is “is the port accepting connections?” This is common in exam tasks for Redis or simple TCP daemons. It is less expressive than a protocol-specific command, but it is valid Kubernetes configuration and often the intended answer.
apiVersion: v1kind: Podmetadata: name: redis-tcp-probespec: containers: - name: redis image: redis:7 ports: - containerPort: 6379 livenessProbe: tcpSocket: port: 6379 initialDelaySeconds: 10 periodSeconds: 10 timeoutSeconds: 2 failureThreshold: 3If the exam asks for TCP, write TCP. If a production review asks whether TCP is enough, evaluate what the team needs to know. A TCP accept proves less than a Redis PING, but it also avoids depending on client tools inside the image.
Exam Workflow: Fast, Correct, Verifiable
Section titled “Exam Workflow: Fast, Correct, Verifiable”Probe tasks on the CKAD exam are usually YAML editing tasks. Speed matters, but verification matters more because a one-character path or port mistake can leave a Pod restarting. The safest workflow is generate, edit, apply, describe, and verify the exact probe fields.
Start by generating a manifest when possible. Imperative commands do not expose every probe option cleanly, so use dry-run output as a starting point and edit the YAML.
k run webapp --image=nginx:1.27 --port=80 --dry-run=client -o yaml > pod.yamlOpen pod.yaml, add the probe block under the container, and apply it. Be careful with indentation: probes are fields of the container, not fields of the Pod spec and not fields under ports.
k apply -f pod.yamlk describe pod webapp | grep -E "Liveness|Readiness|Startup"k get pod webappFor readiness tasks, always verify endpoint membership if a Service is involved. A Pod can be Running without being a Service backend, and the exam often rewards checking the exact resource affected by the configuration.
k get endpointsk get endpoints webappk describe pod webapp | grep -A 20 EventsFor liveness tasks, wait long enough to observe stability or failure. If a liveness probe has periodSeconds: 10 and failureThreshold: 3, checking one second after creation tells you almost nothing. Time-based behavior requires time-based verification.
sleep 35k get pod webappk describe pod webapp | grep -A 20 EventsDid You Know?
Section titled “Did You Know?”-
Startup probes prevent a specific anti-pattern: before startup probes, teams often hid slow startup by setting a huge liveness delay, which also delayed real failure detection after the app was already running.
-
Readiness failures do not restart containers: they change Service endpoint membership, which is why a Pod can have zero restarts while still receiving no traffic.
-
Exec probes run inside the container image: if the image does not include the command you configured, the probe fails even when the application process is healthy.
-
HTTP probe redirects can count as success: Kubernetes treats HTTP status codes from 200 through 399 as successful, so a health endpoint that redirects may hide a badly designed check.
Common Mistakes
Section titled “Common Mistakes”| Mistake | Why It Hurts | Better Practice |
|---|---|---|
| Using the same deep dependency check for liveness and readiness | A database or third-party outage can trigger restarts that do not fix the dependency and may worsen recovery | Keep liveness focused on local process health, and put traffic-critical dependencies in readiness with tight timeouts |
| Making liveness too aggressive during startup | Slow but healthy containers are killed before they finish booting, creating restart loops | Use a startup probe for slow startup, then keep liveness tuned for steady-state recovery |
| Forgetting that readiness controls Service endpoints | Debugging focuses on restart count while traffic fails because the Pod is not considered ready | Check READY, Pod events, Service selectors, and k get endpoints together |
| Assuming TCP success means application success | A port can accept connections while the protocol handler or business logic is broken | Use HTTP, exec, or gRPC probes when you need an application-level signal |
| Configuring an exec probe with missing tools | Minimal images often lack sh, cat, curl, database clients, or custom binaries | Verify the command exists in the image, or choose a network probe that matches the app |
Leaving timeoutSeconds at one second for slow checks | Legitimate health checks fail under load, causing endpoint flapping or restarts | Make health checks cheap, then set timeout values based on observed response behavior |
| Putting probe fields at the wrong YAML level | The manifest may be rejected, or the probe may not attach to the intended container | Place startupProbe, livenessProbe, and readinessProbe under the specific container entry |
| Treating startup probes as permanent health checks | Teams misunderstand why liveness and readiness do not run during startup, then miss post-startup behavior | Remember that startup gates normal probes only until it succeeds for that container instance |
-
Your team deploys a Java API that starts in about twenty seconds on warm nodes but sometimes takes two minutes after a node image pull. The current liveness probe starts after thirty seconds and kills the container after three failed checks. During a rollout, new Pods repeatedly restart before becoming ready. What probe design would you apply, and how would you justify it?
Answer
Add a startup probe that gives the application enough time for the slow startup case, then keep liveness tuned for steady-state failures. For example, a startup probe with
periodSeconds: 5andfailureThreshold: 30gives about two and a half minutes for startup before Kubernetes restarts the container. Liveness and readiness should begin only after startup succeeds. This is better than increasing the livenessinitialDelaySecondsto a very large value because it separates boot-time tolerance from normal failure detection after the app is running. -
A Service intermittently has no endpoints, but all selected Pods show
Runningand restart counts remain at zero. Users see connection failures during these windows. Which Kubernetes objects and probe behavior should you inspect first, and what failure would you expect to find?Answer
Inspect Pod readiness status, Pod events, Service selectors, Pod labels, and
k get endpoints <service-name>. Zero restarts make liveness failure less likely, while disappearing endpoints point toward readiness failure or a selector mismatch. If the labels and selector match, expect readiness probe failures ink describe pod, possibly caused by a flaky readiness endpoint, an external dependency check, or thresholds that are too aggressive under normal load. -
A team uses
/healthzfor both liveness and readiness. The endpoint checks the database, message queue, and a third-party billing API. When the billing API slows down, Kubernetes restarts every application Pod. How would you redesign the probes to reduce blast radius?Answer
Split the health semantics into separate endpoints. The liveness endpoint should check local process progress and avoid broad downstream checks, because restarting the container will not repair a third-party billing outage. The readiness endpoint can check dependencies required to serve traffic, using strict timeouts and clear failure behavior so Pods are removed from endpoints when they cannot serve. This keeps traffic away from unready Pods without causing cluster-wide restarts during a dependency incident.
-
You are asked to add a liveness probe to a Redis Pod during an exam. The requirement says the probe must verify that port
6379accepts connections, and no HTTP endpoint exists. Which probe mechanism should you use, and what limitation should you keep in mind?Answer
Use a TCP socket liveness probe on port
6379. It matches the stated requirement and is valid for a non-HTTP service. The limitation is that TCP success only proves the port accepts connections; it does not prove Redis can successfully process commands. In a production setting, an exec probe using a Redis client might provide a stronger signal if the image includes the tool and the command is reliable. -
A Pod uses an exec liveness probe with
command: ["cat", "/tmp/healthy"]. It runs well in abusyboxtest image but fails immediately after the team switches to a distroless application image. The application logs show no startup error. How would you diagnose and fix the issue?Answer
The probe command likely fails because the distroless image does not include
cator the expected shell utilities. Check the Pod events withk describe podto confirm exec probe failures, and inspect the image design if needed. Fix the probe by using a command that actually exists in the image, adding a small purpose-built health binary, or switching to an HTTP or gRPC probe exposed by the application. Do not assume tools from a teaching image exist in a minimal production image. -
During a rollout, new Pods become ready for a few seconds, then drop out of endpoints, then return again. The readiness endpoint performs a slow cache verification that sometimes takes longer than the configured one-second timeout. What changes would you evaluate before increasing replica count?
Answer
First, make the readiness check cheaper if possible, because health endpoints should be fast and predictable. Then evaluate
timeoutSeconds,failureThreshold,periodSeconds, and possiblysuccessThresholdfor readiness so endpoint membership does not flap on a single slow response. Increasing replicas may hide symptoms temporarily, but it does not fix the readiness signal. The correct fix is to align readiness timing with realistic endpoint behavior while keeping the check narrow enough to avoid unnecessary load. -
You inherit a Pod manifest where
livenessProbeis indented underspecbesidecontainersinstead of under the container item. The Pod fails to apply with a schema-related error. How would you correct the manifest, and why does the nesting matter operationally?Answer
Move
livenessProbeunder the specific container entry, at the same indentation level as fields such asimage,ports, andreadinessProbe. Probes are container-level configuration because each container in a Pod may have different health behavior, ports, commands, and lifecycle needs. The nesting matters because Kubernetes must know exactly which container the kubelet should probe and restart when liveness fails.
Hands-On Exercise
Section titled “Hands-On Exercise”Task: Configure, verify, break, and repair probes for a web application so you can connect manifest fields to kubelet behavior.
Goal: By the end of this exercise, you should be able to prove that startup gates normal probes, liveness restarts a broken container, and readiness controls Service endpoint membership.
Step 1: Create a Pod with Startup, Liveness, and Readiness Probes
Section titled “Step 1: Create a Pod with Startup, Liveness, and Readiness Probes”Apply a runnable Pod manifest that uses nginx and HTTP probes. The paths use / because this image serves that path by default, which lets you focus on Kubernetes probe behavior rather than application code.
k apply -f - << 'EOF'apiVersion: v1kind: Podmetadata: name: probe-demo labels: app: probe-demospec: containers: - name: app image: nginx:1.27 ports: - containerPort: 80 startupProbe: httpGet: path: / port: 80 failureThreshold: 12 periodSeconds: 5 timeoutSeconds: 2 livenessProbe: httpGet: path: / port: 80 periodSeconds: 10 failureThreshold: 3 timeoutSeconds: 2 readinessProbe: httpGet: path: / port: 80 periodSeconds: 5 failureThreshold: 2 timeoutSeconds: 2EOFStep 2: Verify the Pod and Probe Configuration
Section titled “Step 2: Verify the Pod and Probe Configuration”Use wait, get, and describe so you verify both high-level status and the actual probe fields attached to the container.
k wait --for=condition=Ready pod/probe-demo --timeout=90sk get pod probe-demok describe pod probe-demo | grep -E "Startup|Liveness|Readiness"Success criteria:
-
k waitreports thatpod/probe-demomet theReadycondition within the timeout. -
k get pod probe-demoshows the Pod inRunningstatus with the container ready. -
k describe pod probe-demoshows startup, liveness, and readiness probe configuration for theappcontainer.
Step 3: Expose the Pod and Confirm Readiness Controls Endpoints
Section titled “Step 3: Expose the Pod and Confirm Readiness Controls Endpoints”Create a Service and confirm that the ready Pod appears as a backend endpoint. This connects the readiness probe to traffic routing rather than just Pod status.
k expose pod probe-demo --port=80 --target-port=80k get svc probe-demok get endpoints probe-demoSuccess criteria:
-
k get svc probe-demoshows a Service namedprobe-demo. -
k get endpoints probe-demoshows at least one endpoint address and port for the Pod. - You can explain why a failing readiness probe would remove this endpoint without increasing the restart count.
Step 4: Break the Liveness Probe Signal and Observe Restart Behavior
Section titled “Step 4: Break the Liveness Probe Signal and Observe Restart Behavior”Delete the default nginx index file so the / path no longer returns the same successful response. Then wait long enough for the liveness probe to fail according to the configured period and threshold.
k exec probe-demo -- rm /usr/share/nginx/html/index.htmlsleep 40k get pod probe-demok describe pod probe-demo | grep -A 20 EventsSuccess criteria:
- The Pod event stream shows failed liveness probe messages or a container restart related to probe failure.
-
k get pod probe-demoshows that restart count changed after the liveness failures. - You can state why Kubernetes restarted the container instead of merely removing the Pod from Service endpoints.
Step 5: Repair the Application Signal and Verify Stability
Section titled “Step 5: Repair the Application Signal and Verify Stability”Because the container restart recreates the default file for this image, wait for the Pod to become ready again and confirm that endpoint membership returns.
k wait --for=condition=Ready pod/probe-demo --timeout=90ssleep 15k get pod probe-demok get endpoints probe-demoSuccess criteria:
- The Pod returns to
Readyafter the restart. - The Service endpoint exists again after readiness succeeds.
- The restart count remains stable during the final observation window.
Step 6: Clean Up
Section titled “Step 6: Clean Up”Remove the Service and Pod so the practice namespace is ready for the next exercise.
k delete svc probe-demo --ignore-not-found=truek delete pod probe-demo --ignore-not-found=trueSuccess criteria:
-
k get pod probe-demono longer returns the practice Pod. -
k get svc probe-demono longer returns the practice Service. - No unrelated resources were deleted during cleanup.
Practice Drills
Section titled “Practice Drills”Drill 1: HTTP Liveness Probe
Section titled “Drill 1: HTTP Liveness Probe”Create a Pod named drill-http-live running nginx:1.27 with an HTTP liveness probe against / on port 80. The probe should wait five seconds before the first check and run every ten seconds.
Solution
k apply -f - << 'EOF'apiVersion: v1kind: Podmetadata: name: drill-http-livespec: containers: - name: nginx image: nginx:1.27 livenessProbe: httpGet: path: / port: 80 initialDelaySeconds: 5 periodSeconds: 10EOF
k describe pod drill-http-live | grep Livenessk delete pod drill-http-live --ignore-not-found=trueDrill 2: Exec Liveness Probe
Section titled “Drill 2: Exec Liveness Probe”Create a Pod named drill-exec-live using busybox:1.36. The container should create /tmp/healthy and then sleep. Configure an exec liveness probe that checks the file.
Solution
k apply -f - << 'EOF'apiVersion: v1kind: Podmetadata: name: drill-exec-livespec: containers: - name: app image: busybox:1.36 command: - sh - -c - touch /tmp/healthy && sleep 3600 livenessProbe: exec: command: - cat - /tmp/healthy initialDelaySeconds: 5 periodSeconds: 5EOF
k describe pod drill-exec-live | grep Livenessk delete pod drill-exec-live --ignore-not-found=trueDrill 3: TCP Liveness Probe
Section titled “Drill 3: TCP Liveness Probe”Create a Pod named drill-tcp-live running redis:7 with a TCP liveness probe on port 6379. The probe should wait ten seconds before the first check and run every five seconds.
Solution
k apply -f - << 'EOF'apiVersion: v1kind: Podmetadata: name: drill-tcp-livespec: containers: - name: redis image: redis:7 ports: - containerPort: 6379 livenessProbe: tcpSocket: port: 6379 initialDelaySeconds: 10 periodSeconds: 5EOF
k describe pod drill-tcp-live | grep Livenessk delete pod drill-tcp-live --ignore-not-found=trueDrill 4: Readiness Probe with Service Verification
Section titled “Drill 4: Readiness Probe with Service Verification”Create a Deployment named drill-ready with two nginx:1.27 replicas and an HTTP readiness probe on /. Expose it with a Service and verify that both replicas appear as endpoints.
Solution
k apply -f - << 'EOF'apiVersion: apps/v1kind: Deploymentmetadata: name: drill-readyspec: replicas: 2 selector: matchLabels: app: drill-ready template: metadata: labels: app: drill-ready spec: containers: - name: nginx image: nginx:1.27 ports: - containerPort: 80 readinessProbe: httpGet: path: / port: 80 initialDelaySeconds: 2 periodSeconds: 3EOF
k expose deployment drill-ready --port=80 --target-port=80k rollout status deployment/drill-ready --timeout=90sk get endpoints drill-readyk delete service drill-ready --ignore-not-found=truek delete deployment drill-ready --ignore-not-found=trueDrill 5: Startup Probe for Slow Initialization
Section titled “Drill 5: Startup Probe for Slow Initialization”Create a Pod named drill-startup that sleeps for twenty seconds before starting nginx. Configure a startup probe that gives enough time for the delayed start, then add liveness and readiness probes for the steady state.
Solution
k apply -f - << 'EOF'apiVersion: v1kind: Podmetadata: name: drill-startupspec: containers: - name: app image: nginx:1.27 command: - sh - -c - sleep 20 && nginx -g 'daemon off;' ports: - containerPort: 80 startupProbe: httpGet: path: / port: 80 periodSeconds: 5 failureThreshold: 8 livenessProbe: httpGet: path: / port: 80 periodSeconds: 10 failureThreshold: 3 readinessProbe: httpGet: path: / port: 80 periodSeconds: 5 failureThreshold: 2EOF
k wait --for=condition=Ready pod/drill-startup --timeout=90sk describe pod drill-startup | grep -E "Startup|Liveness|Readiness"k delete pod drill-startup --ignore-not-found=trueDrill 6: Diagnose a Broken Probe Path
Section titled “Drill 6: Diagnose a Broken Probe Path”Create a Pod with an intentionally broken liveness path, observe the restart behavior, then recreate it with the correct path. This drill mirrors a common exam troubleshooting task.
Solution
k apply -f - << 'EOF'apiVersion: v1kind: Podmetadata: name: drill-broken-pathspec: containers: - name: app image: nginx:1.27 livenessProbe: httpGet: path: /missing port: 80 initialDelaySeconds: 5 periodSeconds: 3 failureThreshold: 2EOF
sleep 20k get pod drill-broken-pathk describe pod drill-broken-path | grep -A 20 Events
k delete pod drill-broken-path --ignore-not-found=true
k apply -f - << 'EOF'apiVersion: v1kind: Podmetadata: name: drill-broken-pathspec: containers: - name: app image: nginx:1.27 livenessProbe: httpGet: path: / port: 80 initialDelaySeconds: 5 periodSeconds: 10EOF
k wait --for=condition=Ready pod/drill-broken-path --timeout=60sk get pod drill-broken-pathk delete pod drill-broken-path --ignore-not-found=trueDrill 7: Distinguish Selector Problems from Readiness Problems
Section titled “Drill 7: Distinguish Selector Problems from Readiness Problems”Create a Deployment with a valid readiness probe, then create a Service with the wrong selector. Use endpoints and labels to prove the problem is not the probe.
Solution
k apply -f - << 'EOF'apiVersion: apps/v1kind: Deploymentmetadata: name: drill-selectorspec: replicas: 1 selector: matchLabels: app: drill-selector template: metadata: labels: app: drill-selector spec: containers: - name: nginx image: nginx:1.27 ports: - containerPort: 80 readinessProbe: httpGet: path: / port: 80 periodSeconds: 5---apiVersion: v1kind: Servicemetadata: name: drill-selectorspec: selector: app: wrong-label ports: - port: 80 targetPort: 80EOF
k rollout status deployment/drill-selector --timeout=90sk get pods -l app=drill-selector --show-labelsk get endpoints drill-selectork get service drill-selector -o yaml
k delete service drill-selector --ignore-not-found=truek delete deployment drill-selector --ignore-not-found=trueSenior Review Checklist
Section titled “Senior Review Checklist”Before approving a probe configuration in a real cluster, review what the probe causes Kubernetes to do. A liveness probe is not a dashboard metric; it is a restart trigger. A readiness probe is not just a health URL; it is a traffic-routing gate. A startup probe is not a permanent monitor; it is a temporary shield for initialization.
Ask whether each probe checks something the container can answer reliably under normal load. If the probe depends on a broad dependency graph, consider whether that dependency belongs in readiness instead of liveness. If a probe endpoint is slow, fix the endpoint before simply raising thresholds. If an exec command depends on tools in the image, verify those tools exist in the exact image tag being deployed.
Finally, verify behavior with the Kubernetes resources that actually change. Use restart counts and events for liveness. Use Pod readiness, endpoint membership, and rollout status for readiness. Use startup timing and gated probe behavior for startup. Probe design is finished only when the manifest is valid and the observed cluster behavior matches the intended consequence.
Next Module
Section titled “Next Module”Module 3.2: Container Logging - Access, manage, and troubleshoot container logs.
Sources
Section titled “Sources”- Kubernetes v1.35: Liveness, Readiness, and Startup Probes — This is the canonical v1.35 reference for probe semantics, mechanisms, defaults, and failure behavior.
- Configure Liveness, Readiness and Startup Probes — This task page provides runnable examples and concrete event output for probe troubleshooting.
- Pod Conditions — It clarifies the difference between Running and Ready and explains why readiness affects Service endpoint membership.
- Certified Kubernetes Application Developer (CKAD) — It shows the official CKAD v1.35 domain coverage, including probes and health checks.