Module 1.2: Jobs and CronJobs
Complexity: [MEDIUM] - Essential CKAD skill with specific production tradeoffs
Time to Complete: 45-50 minutes
Prerequisites: Module 1.1 (Container Images), basic Pod lifecycle knowledge, and a working Kubernetes 1.35+ cluster
The examples in this module use the standard CKAD shortcut alias k=kubectl. Create it in your shell before practicing so the commands match the exam-style workflow and the troubleshooting examples below.

    alias k=kubectl

Learning Outcomes
After completing this module, you will be able to:
- Design Jobs and CronJobs with correct completion counts, parallelism, retry limits, cleanup controls, and schedules.
- Configure CronJob concurrency, deadline, suspension, and history behavior for recurring batch workloads.
- Diagnose failed batch workloads by correlating Job status, Pod events, container logs, restart policy, and controller retry behavior.
- Compare Jobs, CronJobs, Deployments, and ad hoc Pods so you can choose the right controller for one-time, recurring, and long-running work.
Why This Module Matters
In 2017, GitLab published a detailed incident report after a production database maintenance operation went badly wrong and left the company restoring data under intense customer pressure. The failure was not a Kubernetes Job failure, but the lesson is directly relevant: operational work that feels like “just a script” can become a business incident when retries, scheduling, ownership, and cleanup are vague. A backup, migration, report, or cleanup task is not safer because it is short-lived; it is safer only when the platform knows when it should start, when it should stop, how many times it may retry, and what evidence it should leave behind for diagnosis.
Kubernetes separates this kind of work from Deployments because the desired end state is different. A Deployment tries to keep a number of Pods running indefinitely, while a Job tries to reach a number of successful completions and then stop creating Pods. A CronJob adds time to that model: it creates Jobs on a schedule, applies overlap rules when a previous run is still active, and retains a bounded amount of history so teams can inspect recent successes and failures without letting old Pods fill the namespace.
For CKAD work, Jobs and CronJobs are high-value because they mix API knowledge with debugging judgment. You may be asked to create a Job imperatively, generate YAML for a CronJob, change parallelism, fix an invalid restartPolicy, or explain why a scheduled backup did not start. This module teaches the mechanics, but it also teaches the operational shape: every batch workload has a start rule, a completion rule, a retry rule, a cleanup rule, and a failure investigation path.
That operational shape is the difference between “a command that happened to run” and “a workload the platform can reason about.” If a one-off migration fails on a developer laptop, the evidence may live only in a terminal scrollback. If the same migration runs as a Job, the cluster records status, events, logs, owner references, and retry attempts in places the whole team can inspect. The controller does not make bad scripts good, but it gives good scripts a reliable execution envelope.
CronJobs add another layer of accountability because schedules often outlive the person who wrote them. A cleanup job created during a release crunch may still be running months later, and a backup job may become part of compliance evidence. For that reason, a CronJob manifest should read like an operational policy: when it runs, what happens if it is late, what happens if it overlaps, how much history remains, and how a human can test the same template without waiting for the next scheduled time.
The Factory Shift Analogy
Deployments are like permanent factory staff: they clock in and stay until the business changes the staffing plan. Jobs are like contractors hired for a specific piece of work: they arrive, complete the work, and leave records behind. CronJobs are like scheduled maintenance crews: they arrive at specific times, do the work, and follow rules about whether a late or overlapping shift should still happen.
Jobs: Completion Controllers For One-Time Work
A Job is a controller for finite work. It owns one or more Pods, watches their exit results, and decides whether enough Pods have succeeded to mark the work complete. That distinction matters because Kubernetes does not read your script and infer whether the backup, migration, or report was meaningful; it only sees whether containers exit with status zero or non-zero. Your container command therefore becomes part of the contract between the application and the control plane.
The simplest Job runs one Pod and needs one successful completion. This is the pattern you use for small migrations, smoke reports, and exam tasks where the command does a single piece of work and exits. The imperative command is fast for CKAD tasks, and the dry-run form is useful when you need to add fields that k create job does not expose directly.

    # Simple job
    k create job backup --image=busybox -- echo "Backup complete"

    # Job with a shell command
    k create job report --image=busybox -- /bin/sh -c "date; echo 'Report generated'"

    # Generate YAML
    k create job backup --image=busybox --dry-run=client -o yaml -- echo "done" > job.yaml

When you turn that command into YAML, the important part is not the kind: Job line by itself. The important part is the nested Pod template and the Job-level controls around it. The template says what work should run; backoffLimit, activeDeadlineSeconds, completions, parallelism, and ttlSecondsAfterFinished say how the controller should behave when work succeeds, fails, takes too long, or finishes and needs cleanup.

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: backup-job
    spec:
      template:
        spec:
          containers:
          - name: backup
            image: busybox
            command: ["sh", "-c", "echo 'Backing up data' && sleep 10"]
          restartPolicy: Never        # or OnFailure
      backoffLimit: 4                 # Retry attempts
      ttlSecondsAfterFinished: 100    # Auto-cleanup

| Property | Purpose | Default |
|---|---|---|
| restartPolicy | What to do on failure | Must be Never or OnFailure |
| backoffLimit | Max retry attempts | 6 |
| activeDeadlineSeconds | Max job runtime | None (runs forever) |
| ttlSecondsAfterFinished | Auto-delete after completion | None (keep forever) |
| completions | Required successful completions | 1 |
| parallelism | Max parallel pods | 1 |
The restartPolicy rule is one of the easiest ways to catch a fake Job manifest. Jobs may use Never or OnFailure; they may not use Always, because an always-restarting Pod can never naturally express successful completion. With Never, a failed container leaves a failed Pod and the Job controller creates another Pod if retries remain. With OnFailure, kubelet restarts the container inside the same Pod, which can be cheaper but sometimes hides the history you wanted to inspect.
The tradeoff between Never and OnFailure is partly about evidence. Never tends to leave a clearer trail because each failed Pod has its own events, status, and logs, which helps when you are learning or when the failure changes between attempts. OnFailure can be better for short transient failures because kubelet can restart the container without forcing the Job controller to create a replacement Pod. Neither policy fixes a broken command; each only changes where retries happen and what evidence remains.
Pause and predict: A Job requires restartPolicy to be set to either Never or OnFailure. Why can’t you use Always, the default style you may associate with long-running controllers? Think about what a Job is supposed to prove before you read the explanation in the next paragraph.
The reason is that a Job is complete only when the controller can count successful Pod completions. A container that restarts forever is useful for a web server, but it is a poor signal for batch work because it keeps moving instead of finishing. If you need a worker process that continuously watches a queue, a Deployment may be the better controller; if you need a bounded number of successful attempts, a Job is the better controller.

    # Never: Don't restart failed containers (create new pod)
    restartPolicy: Never
    # Pod fails -> New pod created (up to backoffLimit)

    # OnFailure: Restart failed container in same pod
    restartPolicy: OnFailure
    # Container fails -> Same pod restarts container

The next design choice is how many successful completions you want and how much concurrency the cluster should allow. A single completion is the default and fits one-off work. Multiple sequential completions fit repeatable independent work when order or resource pressure matters. Parallel completions fit a workload where many shards can run at the same time, provided the application can determine which shard to process and can tolerate several Pods executing together.
This is where many learners accidentally over-credit Kubernetes. The Job controller can count successful Pods, but it does not automatically divide your input data into safe units. If ten Pods all run process-all-files.sh, you may process the same files ten times unless the script, queue, or indexed-completion design prevents it. Parallelism is a capacity knob, not a correctness guarantee. Correctness still comes from idempotent application behavior, safe task claiming, and output paths that tolerate retries.

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: single-job
    spec:
      template:
        spec:
          containers:
          - name: worker
            image: busybox
            command: ["echo", "Single task done"]
          restartPolicy: Never

Sequential completions are useful when the cluster should run the same template several times but keep only one Pod active at a time. This can model a batch import that must respect an external rate limit, a database operation that should not have concurrent writers, or a training task where each completion consumes a separate unit of work from an external system. The key is that Kubernetes counts completions, not business objects, unless your script maps each Pod execution to a specific object.
In practice, sequential Jobs are also useful as a stepping stone while you harden a batch process. You may begin with completions: 5 and parallelism: 1 because you want predictable logs and low pressure on a downstream service. After you prove that each attempt is independent and repeatable, you can raise parallelism carefully. This staged approach is slower than jumping to maximum concurrency, but it reveals hidden assumptions before they become production outages.

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: sequential-job
    spec:
      completions: 5             # Run 5 times
      parallelism: 1             # One at a time
      completionMode: Indexed    # Needed so JOB_COMPLETION_INDEX is set in each Pod
      template:
        spec:
          containers:
          - name: worker
            image: busybox
            command: ["sh", "-c", "echo Task $JOB_COMPLETION_INDEX"]
          restartPolicy: Never

Parallel processing changes the capacity profile. The controller still wants a fixed number of successful completions, but it is allowed to keep several Pods active at the same time. That means you must think about shared resources: a CPU-bound image processor may benefit from parallelism, while a schema migration that writes to one database table may become dangerous if several Pods execute the same mutation concurrently.
The safe way to choose parallelism is to start from the narrowest bottleneck, not from the number of nodes. A cluster may have enough CPU for many Pods, but the external API, database connection pool, object store, or license server may tolerate far fewer simultaneous clients. A Job that overwhelms a dependency can look like a Kubernetes failure even when the controller is behaving perfectly. Resource requests protect the cluster; application-level limits protect the systems your Pods call.

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: parallel-job
    spec:
      completions: 10    # 10 total completions
      parallelism: 3     # 3 pods at a time
      template:
        spec:
          containers:
          - name: worker
            image: busybox
            command: ["sh", "-c", "echo Processing batch && sleep 5"]
          restartPolicy: Never

A work-queue Job is slightly different because the queue, not Kubernetes, defines when there is no more work. In that model you often set parallelism and omit completions, then let workers pull items until each worker exits successfully. This is powerful, but it shifts correctness into the queue protocol: workers must claim work safely, handle duplicate attempts, and exit when the queue is empty rather than sitting idle forever.
Queue-driven Jobs are common in production because they decouple cluster capacity from business demand. If a queue has many items, you can raise parallelism and run more workers; if the queue is empty, workers exit and the Job finishes. The danger is that a worker that polls forever will prevent completion, while a worker that exits too early may leave work behind. A reliable worker needs clear empty-queue semantics, timeout behavior, and logging that explains whether it completed work or found nothing to do.

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: queue-job
    spec:
      parallelism: 3    # 3 workers
      # No completions: workers process until they exit 0
      template:
        spec:
          containers:
          - name: worker
            image: busybox
            # process-queue is a placeholder for your own worker program
            command: ["sh", "-c", "process-queue && exit 0"]
          restartPolicy: Never

A useful way to debug your own design is to state the completion rule in plain language before writing YAML. For example, “run one backup and stop” maps to default completions and parallelism. “Process ten independent shards with at most three Pods active” maps to completions: 10 and parallelism: 3. “Run workers until Redis says no tasks remain” maps to a queue-driven design, where Kubernetes controls worker count but your application controls task ownership.
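The queue-driven case puts task ownership in the application, so it can help to see what a safe claim and a clean empty-queue exit might look like. The sketch below is only an illustration: a directory of files stands in for a real queue such as Redis, and run_worker, process_item, and the two directory arguments are hypothetical names. It assumes both directories live on the same filesystem, because the claim relies on rename being atomic there.

```shell
#!/bin/sh
# Hypothetical worker: a directory of files stands in for a real queue.
# run_worker and process_item are illustrative names, not Kubernetes features.

process_item() {
  # Placeholder for real work; here it only reports the item name.
  echo "processed $(basename "$1")"
}

run_worker() {
  queue_dir=$1
  claim_dir=$2
  mkdir -p "$claim_dir"
  while :; do
    claimed=""
    for item in "$queue_dir"/*; do
      [ -e "$item" ] || continue          # glob matched nothing: queue is empty
      target="$claim_dir/$(basename "$item")"
      # rename is atomic on one filesystem, so only one worker wins the claim
      if mv "$item" "$target" 2>/dev/null; then
        claimed=$target
        break
      fi
    done
    if [ -z "$claimed" ]; then
      echo "queue empty, exiting"
      return 0                            # exit 0 so the Job counts a success
    fi
    process_item "$claimed" || return 1   # non-zero asks the Job controller to retry
  done
}
```

A container command could call run_worker with its queue and claim directories. The two properties that matter for Job completion are that a lost claim race is skipped rather than treated as an error, and that an empty queue produces exit status zero instead of an endless poll.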
One more design habit pays off on both the exam and real teams: decide whether failed attempts are valuable evidence or just noise. During development, you may deliberately keep finished Jobs around so you can compare Pods, logs, and events. In stable scheduled work, you usually keep only a small window of history and rely on alerts or metrics for long-term visibility. Retention is not an aesthetic preference; it controls whether operators can see enough evidence without drowning in stale objects.
CronJobs: Schedules That Create Jobs
A CronJob is a Job factory with a clock. It does not run your container directly; it creates Jobs from spec.jobTemplate whenever the schedule fires, and each created Job then manages its own Pods. This layering is why CronJob debugging often has two levels: first verify whether the CronJob created a Job, then inspect the Job and Pods to learn whether the run succeeded.
The schedule uses the familiar five-field cron format: minute, hour, day of month, month, and day of week. Kubernetes 1.35 CronJobs also support a timeZone field, which is safer than relying on controller-manager local time or comments in a manifest. If your schedule represents a business-local deadline, document the time zone directly in the spec so daylight saving changes and operator geography do not become hidden assumptions.
The time-zone detail is easy to underestimate because many lab clusters run everything in UTC and examples often avoid local business rules. Real schedules are rarely that neutral. A payroll export, end-of-day report, or maintenance window may be tied to a region, a legal jurisdiction, or a customer contract. When the schedule is business-local, the manifest should carry that fact. Otherwise, a future platform migration or daylight saving transition can change behavior without any application release.
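To make the time-zone point concrete, here is a minimal sketch that writes a CronJob manifest with an explicit spec.timeZone. The name payroll-export, the weekday 18:00 schedule, the Europe/Berlin zone, and the placeholder command are all assumptions chosen for illustration; only the timeZone field itself comes from the text above.

```shell
# Writes a hypothetical CronJob manifest; the name, schedule, and zone are assumptions.
cat > payroll-export-cronjob.yaml <<'EOF'
apiVersion: batch/v1
kind: CronJob
metadata:
  name: payroll-export
spec:
  schedule: "0 18 * * 1-5"     # 18:00 on weekdays...
  timeZone: "Europe/Berlin"    # ...interpreted in this IANA time zone
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: export
            image: busybox
            command: ["sh", "-c", "date; echo 'export placeholder'"]
          restartPolicy: OnFailure
EOF
echo "wrote payroll-export-cronjob.yaml"
# Apply it with: k apply -f payroll-export-cronjob.yaml
```

Keeping the zone in the spec means a reviewer can read the business intent from the manifest alone, without knowing which time the control plane happens to use.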

    # Every minute
    k create cronjob minute-task --image=busybox --schedule="* * * * *" -- echo "Every minute"

    # Every hour at minute 30
    k create cronjob hourly-task --image=busybox --schedule="30 * * * *" -- date

    # Daily at midnight
    k create cronjob daily-cleanup --image=busybox --schedule="0 0 * * *" -- echo "Daily cleanup"

    # Generate YAML
    k create cronjob backup --image=busybox --schedule="0 2 * * *" --dry-run=client -o yaml -- /backup.sh > cronjob.yaml

The generated YAML deserves careful reading because the Job template is nested more deeply than a normal Job. Fields such as schedule, concurrencyPolicy, successfulJobsHistoryLimit, failedJobsHistoryLimit, startingDeadlineSeconds, and suspend belong to the CronJob. Fields such as backoffLimit, ttlSecondsAfterFinished, and the Pod template belong under jobTemplate.spec, because they configure each Job the CronJob creates.

    apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: daily-backup
    spec:
      schedule: "0 2 * * *"              # 2 AM daily
      concurrencyPolicy: Forbid          # Don't overlap
      successfulJobsHistoryLimit: 3      # Keep last 3 successful
      failedJobsHistoryLimit: 1          # Keep last 1 failed
      startingDeadlineSeconds: 200       # Max delay to start
      jobTemplate:
        spec:
          template:
            spec:
              containers:
              - name: backup
                image: busybox
                # Double quotes inside the shell string so $(date) is expanded
                command: ["sh", "-c", "echo \"Backup at $(date)\""]
              restartPolicy: OnFailure

The cron expression is compact, so use the diagram as a reading tool rather than memorizing examples blindly. The leftmost field changes most often, and the rightmost fields narrow the calendar. In production reviews, the biggest mistake is not a typo; it is a schedule that was technically valid but expressed the wrong business intent because the author confused day-of-month, day-of-week, or UTC.
Cron expressions also hide frequency cost. A schedule of * * * * * looks tiny, but it creates up to 1,440 opportunities per day for the controller to create Jobs. That may be fine for a lightweight heartbeat, but it is excessive for a report that pulls large data sets. When you choose a schedule, estimate how many Jobs it creates per day and how much log, event, and object churn that implies. The cost is not only compute; it is also operational noise.

    ┌───────────── minute (0 - 59)
    │ ┌───────────── hour (0 - 23)
    │ │ ┌───────────── day of month (1 - 31)
    │ │ │ ┌───────────── month (1 - 12)
    │ │ │ │ ┌───────────── day of week (0 - 6) (Sunday = 0)
    │ │ │ │ │
    * * * * *

| Schedule | Meaning |
|---|---|
| * * * * * | Every minute |
| */5 * * * * | Every 5 minutes |
| 0 * * * * | Every hour (at minute 0) |
| 0 */2 * * * | Every 2 hours |
| 0 0 * * * | Daily at midnight |
| 0 0 * * 0 | Weekly on Sunday at midnight |
| 0 0 1 * * | Monthly on the 1st at midnight |
| 30 4 * * 1-5 | 4:30 AM on weekdays |
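The frequency cost mentioned earlier is easy to estimate before you commit to a schedule. The helper below is a rough sketch, not a cron parser: runs_per_day is a hypothetical function that only understands a minute field of * or */N and assumes the other four fields are all *.

```shell
# Estimate Jobs per day for the minute-field forms "*" and "*/N",
# assuming the other four cron fields are "*". Anything else is rejected.
runs_per_day() {
  minute_field=$1
  case "$minute_field" in
    '*')
      echo 1440 ;;
    '*/'*)
      step=${minute_field#\*/}
      case "$step" in
        ''|*[!0-9]*|0) echo "invalid step: $step" >&2; return 1 ;;
      esac
      # Minutes 0, N, 2N, ... below 60 fire in every hour.
      echo $(( ((59 / step) + 1) * 24 )) ;;
    *)
      echo "unsupported minute field: $minute_field" >&2
      return 1 ;;
  esac
}

runs_per_day '*'      # prints 1440
runs_per_day '*/5'    # prints 288
runs_per_day '*/15'   # prints 96
```

Multiplying that count by the history limits and the typical log volume per run gives a quick sense of the object and log churn a schedule implies.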
Concurrency policy is the CronJob field that turns a valid schedule into a safe schedule. The default Allow means the controller may create a new Job even if the previous one is still running. Forbid skips a new run when the previous run has not finished. Replace terminates the old run and starts the new one, which can be correct for freshness-oriented work but dangerous for destructive maintenance.
Choosing among those policies is a business decision disguised as a YAML field. Allow says every scheduled time matters independently, even if work overlaps. Forbid says completion matters more than strict schedule count, so the system may skip a run to protect shared state. Replace says the latest run is more valuable than finishing the previous run. If you cannot explain that choice in one sentence, the CronJob probably needs a design review before it is trusted.
Stop and think: You have a CronJob that runs a database backup every hour, but sometimes the backup takes 75 minutes. What happens when the next scheduled run triggers while the previous one is still running? Which policy would you choose here and why: Allow, Forbid, or Replace?

    spec:
      concurrencyPolicy: Allow      # Run concurrent (default)
      # or
      concurrencyPolicy: Forbid     # Skip if previous still running
      # or
      concurrencyPolicy: Replace    # Kill previous, start new

| Policy | Behavior | Use Case |
|---|---|---|
| Allow | Run concurrent jobs | Independent tasks |
| Forbid | Skip if previous running | Avoid resource contention |
| Replace | Stop previous, start new | Latest data matters |
startingDeadlineSeconds handles missed schedules, not ordinary runtime. If the controller is unavailable, the cluster is under pressure, or the CronJob is otherwise delayed, this field tells Kubernetes how long after the scheduled time a run is still worth starting. A short deadline is useful for work that loses value quickly, such as frequent cache refreshes; a longer deadline is better for work that must eventually happen, such as compliance exports.
This field is especially important after outages. When startingDeadlineSeconds is unset, the controller counts every schedule missed since the last scheduled run, and if it finds more than 100 missed schedules it does not start the Job and logs an error instead. A deadline lets you express intent: a cache refresh that is ten minutes late may be pointless, while a daily accounting export may still matter hours later. The more business meaning a run has, the more deliberate you should be about the missed-start policy.

    spec:
      startingDeadlineSeconds: 100    # Must start within 100s of schedule

Suspension is a quieter but important operational control. Setting spec.suspend: true stops future schedules without deleting the CronJob, which is useful during incident response, maintenance windows, or risky releases. It does not normally delete already-created Jobs, so you still inspect and clean active Jobs separately. That separation is helpful because stopping future work should not erase the evidence from the current failed run.
Manual triggering pairs well with suspension. During a risky change, you can suspend future scheduled runs, create a manual Job from the current template, watch the result, and then resume the schedule when you are confident. This workflow is safer than editing the schedule to an artificial time because it keeps the production schedule intact. It also gives reviewers a clean audit trail: the CronJob was paused, a manual validation ran, and the schedule was resumed.
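The suspend, validate, resume workflow can be sketched as a small shell function. This is one possible arrangement, not a prescribed procedure: validate_cronjob, the -manual name suffix, and the 300 second timeout are assumptions, and the K variable exists only so you can preview the commands with K=echo when no cluster is available.

```shell
# K defaults to kubectl; set K=echo to preview the commands without a cluster.
K=${K:-kubectl}

validate_cronjob() {
  cronjob=$1                      # for example: daily-backup
  manual_job="$cronjob-manual"    # hypothetical naming convention
  # 1. Pause future scheduled runs without deleting the CronJob.
  $K patch cronjob "$cronjob" -p '{"spec":{"suspend":true}}' || return 1
  # 2. Run the current template once, right now.
  $K create job "$manual_job" --from="cronjob/$cronjob" || return 1
  # 3. Block until the manual Job completes or the timeout expires.
  $K wait --for=condition=complete "job/$manual_job" --timeout=300s || return 1
  # 4. Resume the schedule only after the manual run succeeded.
  $K patch cronjob "$cronjob" -p '{"spec":{"suspend":false}}'
}
```

If the manual Job fails or times out, the function returns early and the CronJob stays suspended, which is deliberate in this sketch: a human decides when the schedule is safe to resume.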
Retry, Cleanup, and Debugging Behavior
Troubleshooting batch workloads begins by identifying which controller state you are looking at. A CronJob may be healthy while the Job it created is failing, and a Job may be retrying while individual Pods have already terminated. Start broad with k get cronjobs, k get jobs, and labels, then narrow to k describe, logs, events, and Pod status. This order prevents a common mistake: reading the newest Pod log and assuming it explains the entire Job history.
The controller hierarchy is your map. CronJob status answers scheduling questions, Job status answers completion and retry questions, Pod status answers scheduling and container lifecycle questions, and container logs answer application questions. Moving down the hierarchy too early wastes time because an application log cannot tell you why the CronJob never created a Job. Moving up too late also wastes time because a healthy CronJob does not prove the created Pod had permission to read a ConfigMap or write output.

    # List jobs
    k get jobs

    # List cronjobs
    k get cronjobs

    # Get job pods
    k get pods -l job-name=my-job

    # Check job status
    k describe job my-job

    # Watch job completion
    k get job my-job -w

Logs are easiest when the Job has exactly one active or completed Pod, because k logs job/my-job can find the Pod for you. When there have been multiple retries, use the job-name label to list Pods and inspect the specific Pod that failed in the way you care about. With restartPolicy: OnFailure, remember that the same Pod may contain restarted container attempts, so look at restart counts and previous logs when needed.
Events explain what logs cannot. An image pull error, failed scheduling decision, missing ConfigMap, denied volume mount, or exceeded deadline may happen before your application starts, so there may be no useful application log at all. k describe combines status and events in one view, which is why it should be part of every Job investigation. If the container never ran, the fix is usually in the Pod template, permissions, image reference, or cluster capacity rather than in the application script.
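Besides k describe, events can be queried directly, which is handy when a namespace is busy. The helpers below are a sketch under a few assumptions: job_events and job_pod_events are hypothetical names, the Job is called my-job in the usage, and K can be set to echo to print the commands instead of running them.

```shell
# K defaults to kubectl; set K=echo to preview the commands without a cluster.
K=${K:-kubectl}

# Show events recorded against one Job, oldest first.
job_events() {
  $K get events \
    --field-selector "involvedObject.kind=Job,involvedObject.name=$1" \
    --sort-by=.lastTimestamp
}

# Show status and events for every Pod the Job created.
job_pod_events() {
  $K describe pods -l "job-name=$1"
}
```

Running job_events my-job first and job_pod_events my-job second follows the same top-down order as the rest of the investigation: controller decisions before Pod-level detail.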

    # Get logs from job's pod
    k logs job/my-job

    # Get logs from specific pod
    k logs my-job-abc12

    # Follow logs
    k logs -f job/my-job

Manual triggering is the safest way to test a CronJob template without waiting for the clock. The command below creates a one-off Job from the CronJob’s current template, which lets you verify image pull behavior, command syntax, ConfigMap mounts, and RBAC before relying on the schedule. It does not prove the cron expression is correct, but it proves the Job template can run right now.
This distinction is useful in incident response because it narrows the search quickly. If a manual trigger fails the same way as the scheduled run, focus on the Job template and dependencies. If a manual trigger succeeds but the scheduled run did not appear, focus on the CronJob schedule, suspension state, deadline, time zone, and controller events. Separating those two failure classes keeps you from rewriting working container commands while the real problem is a scheduling rule.

    # Create job from cronjob immediately
    k create job manual-backup --from=cronjob/daily-backup

Cleanup should be designed, not treated as an afterthought. Finished Jobs and Pods are useful evidence, but they also create visual noise and consume API objects. CronJob history limits control how many created Jobs remain attached to the CronJob, while ttlSecondsAfterFinished lets the TTL controller remove finished Jobs after a delay. Use both when you want recent evidence without keeping every successful run forever.
The right retention window depends on how quickly humans and monitoring systems notice failures. For a frequent cleanup job, keeping three successful runs and one failed run may be enough because the next failure will happen soon and alerts should fire promptly. For a monthly compliance export, you might keep more history or export evidence elsewhere before TTL removes the Job. Kubernetes retention settings should complement, not replace, logs, metrics, and external audit storage.

    # Delete job
    k delete job my-job

    # Delete cronjob (also deletes jobs it created)
    k delete cronjob my-cronjob

    # Delete completed jobs older than TTL
    # (Automatic if ttlSecondsAfterFinished is set)

When a Job will not complete, read it like a chain of contracts. The image must pull, the Pod must schedule, the command must run, the container must exit with the intended code, and the Job controller must still have retries and time remaining. A non-zero exit code is not inherently bad; it is the application’s way of telling Kubernetes that the work did not meet the success contract and should be retried or marked failed.
The exit code contract is why shell wrappers deserve care. A command like sh -c "step1; step2; echo done" may print a happy message even when an earlier step failed, depending on how the shell is written. In production scripts, teams often use set -e or explicit error checks so the container exits non-zero when the work did not actually succeed. Kubernetes can only act on the exit status it receives, so sloppy scripting can turn real failures into false completions.
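The difference is easy to reproduce locally without a cluster. In this small demonstration, false stands in for a failing step1 and echo done for the happy message; the variable names are only for illustration.

```shell
# "false" stands in for a failing step1; "echo done" is the happy message.

# Without set -e the shell keeps going and the last command decides the exit code.
plain_status=0
sh -c 'false; echo done' >/dev/null || plain_status=$?
echo "without set -e: exit $plain_status"    # prints: without set -e: exit 0

# With set -e the shell stops at the first failing command.
strict_status=0
sh -c 'set -e; false; echo done' >/dev/null || strict_status=$?
echo "with set -e: exit $strict_status"      # prints: with set -e: exit 1
```

A Job running the first form would be recorded as a successful completion even though the real work failed, while the second form gives the controller the non-zero status it needs to retry or mark the Job failed.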

    # Check status
    k describe job my-job

    # Common issues:
    # - Container command exits non-zero
    # - Image pull fails
    # - Resource limits too low
    # - restartPolicy not set correctly

    # Check pod logs
    k logs $(k get pods -l job-name=my-job -o jsonpath='{.items[0].metadata.name}')

What would happen if: You create a Job with backoffLimit: 6, which is the default, and restartPolicy: Never. The container’s script has a bug that always exits with code 1. Before running this, what output do you expect from k get pods -l job-name=my-job, and why?
The controller will keep creating replacement Pods, with an increasing back-off delay between attempts, until backoffLimit is exhausted and the Job is marked Failed. With restartPolicy: Never, each failed attempt is visible as a separate failed Pod, which is noisy but very helpful during diagnosis. With restartPolicy: OnFailure, you may instead see restarts inside a smaller number of Pods, so your debugging habit must match the restart policy. In both cases, the root fix is the same: inspect the exit reason and application log, then correct the command, image, dependencies, or permissions.

    # Check backoffLimit
    k get job my-job -o jsonpath='{.spec.backoffLimit}'

    # If hitting limit, check why pods fail
    k describe pods -l job-name=my-job

CronJob failures require one extra question: did the schedule create a Job at all? If lastScheduleTime is empty or stale, inspect the schedule, suspension flag, deadline, and controller events. If Jobs exist but fail, shift your attention to the Job and Pods. This split keeps you from changing retry fields when the real issue is a suspended CronJob or a missed schedule.
Names can also guide the investigation. Jobs created by a CronJob normally include the CronJob name plus generated suffixes, and their Pods carry labels that link them back to the Job. Use owner references and labels instead of guessing from timestamps when several batch workloads run in the same namespace. This becomes important during outages, when several CronJobs may all create delayed or failed work around the same time.
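Owner references and labels can be read with ordinary kubectl queries. The two helpers below are a sketch with hypothetical function names; they assume the Job has a single owner reference, which is the usual case for Jobs created by a CronJob, and K can again be set to echo for a dry preview.

```shell
# K defaults to kubectl; set K=echo to preview the commands without a cluster.
K=${K:-kubectl}

# Print the kind and name of the object that owns a Job, e.g. CronJob/daily-backup.
job_owner() {
  $K get job "$1" \
    -o jsonpath='{.metadata.ownerReferences[0].kind}/{.metadata.ownerReferences[0].name}'
}

# List the Pods a Job created, using the job-name label set by the controller.
job_pods() {
  $K get pods -l "job-name=$1" -o wide
}
```

Calling job_owner on a Job and then job_pods on the same name links a Pod back to its schedule in two steps, without relying on timestamps.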

    # Check cronjob status
    k describe cronjob my-cronjob

    # Check last schedule time
    k get cronjob my-cronjob -o jsonpath='{.status.lastScheduleTime}'

    # Check if suspended
    k get cronjob my-cronjob -o jsonpath='{.spec.suspend}'

    # Resume if suspended
    k patch cronjob my-cronjob -p '{"spec":{"suspend":false}}'

A practical war story makes the difference concrete. A platform team once scheduled a nightly report generator as a CronJob with the default Allow policy because the YAML looked smaller and the first few runs finished quickly. At month end, each run took longer, the next schedule created another Job, and several Pods competed for the same database read replica until dashboard latency spiked. The fix was not exotic: they set concurrencyPolicy: Forbid, added a runtime deadline, tuned resource requests, and created an alert when a run was skipped because the previous one was still active.
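A manifest along the lines of that fix might look like the sketch below. Everything specific in it is an assumption made for illustration: the name nightly-report, the 01:00 schedule, the one hour deadline, the retry limit, and the resource figures. The alert on skipped runs lives in a monitoring system and is not shown.

```shell
# Writes a hypothetical manifest; names, schedule, and sizes are assumptions.
cat > nightly-report-cronjob.yaml <<'EOF'
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report
spec:
  schedule: "0 1 * * *"
  concurrencyPolicy: Forbid          # skip a run while the previous one is active
  jobTemplate:
    spec:
      activeDeadlineSeconds: 3600    # assumed ceiling: fail the Job after one hour
      backoffLimit: 2
      template:
        spec:
          containers:
          - name: report
            image: busybox
            command: ["sh", "-c", "echo 'report placeholder'"]
            resources:
              requests:
                cpu: 250m            # assumed values; size from real measurements
                memory: 256Mi
          restartPolicy: Never
EOF
echo "wrote nightly-report-cronjob.yaml"
```

Note where each control sits: concurrencyPolicy belongs to the CronJob, while activeDeadlineSeconds and backoffLimit sit under jobTemplate.spec because they govern each created Job.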
Patterns & Anti-Patterns
Good batch design starts by making the work idempotent. A Job may retry after a node failure, a container crash, or a deadline miss, and a CronJob may create a later run after an earlier run partially completed. If your script can safely run twice and converge on the same result, Kubernetes retry behavior becomes an asset instead of a threat. If your script cannot tolerate duplicates, you need external locking, transaction boundaries, or a different workflow.
Idempotency does not mean the work has no side effects. It means repeated attempts produce an acceptable final state. Uploading a file to a deterministic object key, marking a database row processed inside a transaction, or writing output with a unique completion index can all be idempotent designs. Appending blindly to a report, deleting by a broad pattern, or charging a customer inside a retrying Job are dangerous unless the application uses safeguards beyond Kubernetes.
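The deterministic-key idea can be sketched in a few lines of shell. This is an illustration under stated assumptions: write_shard_output is a hypothetical helper, the output directory stands in for whatever storage the real job uses, and the index would typically come from JOB_COMPLETION_INDEX in an Indexed Job.

```shell
# Write one shard's result to a path derived from its index, so a retry
# overwrites the same file instead of appending a duplicate.
write_shard_output() {
  out_dir=$1
  index=$2      # for example the value of JOB_COMPLETION_INDEX in an Indexed Job
  content=$3
  final="$out_dir/shard-$index.out"
  tmp="$final.tmp.$$"
  printf '%s\n' "$content" > "$tmp" || return 1
  mv "$tmp" "$final"    # rename publishes the complete file in one step
}
```

Calling the function twice with the same index leaves exactly one file with one copy of the content, which is the acceptable final state that idempotency asks for. Writing to a temporary name first also means a Pod killed mid-write does not leave a half-written result under the final name.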
| Pattern | When to Use It | Why It Works | Scaling Consideration |
|---|---|---|---|
| Single-completion Job | One migration, one backup test, one report export | The controller needs exactly one successful Pod and then stops | Keep backoffLimit low enough that a bad command fails visibly |
| Fixed parallel completions | Many independent shards or files | completions defines total work and parallelism caps active Pods | Match parallelism to cluster capacity and downstream rate limits |
| Queue-driven workers | Work items live in Redis, a database table, or another queue | Kubernetes controls worker count while the queue controls item ownership | Workers must claim items atomically and exit cleanly when no work remains |
| CronJob with forbidden overlap | Backups, cleanup, compaction, or reports that touch shared state | A new run is skipped instead of stacking onto unfinished work | Alert on missed runs so skipped schedules are visible |
Anti-patterns usually come from treating Jobs as small Deployments or treating CronJobs as ordinary crontab lines. Kubernetes adds controller behavior, status, events, and garbage collection, but it also expects your manifest to define the lifecycle clearly. The safest review question is, “What happens if this command fails halfway through and Kubernetes tries again?”
Another useful review question is, “Who owns the result after the Pod exits?” A Job can tell you that a container exited successfully, but it cannot prove that the backup is restorable, the report is correct, or the cleanup deleted only intended data. Mature batch systems pair Kubernetes controller status with application-level verification, such as checksum validation, row counts, smoke queries, or a restore test. The controller proves execution; the application must prove business correctness.
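Application-level verification can be as small as a checksum comparison that makes the container exit non-zero when the artifact does not match. This sketch assumes `sha256sum` is available in the image and that a checksum file was recorded when the backup was created. The function name and file layout are hypothetical.

```shell
# Verify a backup artifact against a checksum recorded at creation time.
# A non-zero return makes the surrounding Job attempt fail visibly.
verify_backup() {
  file="$1"
  sum_file="$2"
  expected=$(cut -d' ' -f1 "$sum_file")
  actual=$(sha256sum "$file" | cut -d' ' -f1)
  if [ "$expected" = "$actual" ]; then
    echo "verified: $file"
  else
    echo "checksum mismatch: $file" >&2
    return 1
  fi
}
```

A checksum only shows the file is unchanged since it was written. It does not replace a periodic restore test.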
| Anti-Pattern | What Goes Wrong | Better Alternative |
|---|---|---|
| Using a Deployment for finite migration work | The Pod restarts after success and the migration may run repeatedly | Use a Job with a clear command, retry limit, and cleanup policy |
| Leaving CronJob concurrency at Allow for shared-state work | Long runs overlap and compete for the same files, locks, or databases | Use Forbid or design the task to be safely concurrent |
| Setting high retries without observability | A broken command burns time and creates many failing Pods before anyone notices | Use a deliberate backoffLimit, inspect events, and alert on failed Jobs |
| Keeping every completed run forever | Namespaces become cluttered and humans stop noticing real failures | Combine history limits with ttlSecondsAfterFinished |
The pattern that scales best is the one whose failure mode you have rehearsed. A parallel image processor can retry individual shards if the output path includes the shard identity. A database backup should avoid overlap because two backup streams may overload storage and produce confusing evidence. A report generator may use Replace if only the newest data matters, but a cleanup job should rarely be replaced mid-delete unless the cleanup operation is explicitly transactional.
For CKAD, you will not build a complete production batch platform, but the same reasoning helps you answer scenario questions quickly. Identify whether the workload is finite or continuous, decide whether time is part of the requirement, choose the controller, then tune retry, concurrency, and cleanup. Most wrong answers violate one of those steps. They use a Deployment for finite work, forget overlap policy for scheduled work, or debug a Pod log when no Job was ever created.
Decision Framework
Choose the controller by asking what “healthy” means. If healthy means “there should always be three Pods serving traffic,” use a Deployment or another long-running controller. If healthy means “this task should eventually complete once,” use a Job. If healthy means “a Job should be created on a calendar,” use a CronJob. The controller should encode the desired lifecycle, not just happen to launch a container.
After you choose the controller, choose the failure budget for the work. Expensive or destructive jobs usually deserve fewer retries, stronger observability, and manual review when they fail. Cheap and idempotent jobs can often tolerate more retries because repeated attempts are safe and useful. The mistake is to copy retry values from an example without asking what a retry costs. A failed thumbnail job and a failed data migration do not deserve the same operational policy.
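As a rough illustration of two different failure budgets, the sketch below writes two Jobs side by side. The names, images, commands, and numbers are hypothetical. The only fact borrowed from Kubernetes is that `backoffLimit` defaults to 6 when omitted.

```shell
# Two hypothetical Jobs with deliberately different retry policies.
cat << 'EOF' > failure-budgets.yaml
# Destructive or expensive work: fail fast and leave the decision to a human.
apiVersion: batch/v1
kind: Job
metadata:
  name: schema-migration
spec:
  backoffLimit: 0              # no automatic retries after the first failure
  activeDeadlineSeconds: 600   # give up after 10 minutes
  template:
    spec:
      containers:
      - name: migrate
        image: busybox
        command: ["sh", "-c", "echo running migration"]
      restartPolicy: Never
---
# Cheap, idempotent work: retries are safe, so allow more of them.
apiVersion: batch/v1
kind: Job
metadata:
  name: thumbnail-batch
spec:
  backoffLimit: 6              # same as the default, stated explicitly
  template:
    spec:
      containers:
      - name: resize
        image: busybox
        command: ["sh", "-c", "echo resizing images"]
      restartPolicy: Never
EOF
```

Writing the value down even when it matches the default records that someone chose it on purpose.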
```text
Need the workload to keep running?
|
+-- yes --> Use a Deployment, StatefulSet, or another long-running controller.
|
+-- no --> Is the work scheduled repeatedly?
           |
           +-- no --> Use a Job with completions, parallelism, retry, and TTL rules.
           |
           +-- yes --> Use a CronJob with schedule, concurrency, deadline, and history rules.
```

| Decision | Prefer This | Avoid This |
|---|---|---|
| One migration must run once | Job with parallelism: 1 and deliberate retries | Deployment that restarts the migration container forever |
| Hourly backup sometimes exceeds one hour | CronJob with concurrencyPolicy: Forbid | Default overlap that launches competing backup Pods |
| Cache refresh where newest run matters most | CronJob with Replace after checking termination safety | Killing non-idempotent work without cleanup guarantees |
| Hundreds of independent files need processing | Job with fixed completions and capped parallelism | One enormous Pod that serializes everything and hides partial failures |
| Queue workers consume until empty | Job with parallelism and queue-aware worker logic | Assuming Kubernetes knows how many external queue items remain |
For CKAD speed, build a mental template that you can adapt under pressure. Create or generate the object, inspect the nested spec, set lifecycle controls, run it, then diagnose from controller to Pod. The exam rarely rewards exotic features; it rewards clear resource choice, valid YAML, and practical debugging. In real clusters, the same habits prevent batch jobs from becoming silent background hazards.
For production speed, make the template readable enough that the next engineer can audit it under stress. Put the important lifecycle decisions in fields, not only in comments or runbooks. A CronJob with explicit concurrency, deadline, history, and time-zone behavior is easier to review than a minimal manifest whose behavior depends on defaults. Defaults are not bad, but hidden defaults are expensive when a scheduled task fails at an inconvenient hour and the responder has to infer the original intent.
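A sketch of what "decisions in fields" can look like. The CronJob name, schedule, time zone, image, and command are assumptions for illustration. The comments note the documented defaults that would apply if each field were omitted.

```shell
# Hypothetical CronJob that states its lifecycle decisions explicitly.
cat << 'EOF' > explicit-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: weekday-export
spec:
  schedule: "0 6 * * 1-5"
  timeZone: "Europe/Berlin"        # IANA name; if omitted, the controller's local zone applies
  concurrencyPolicy: Forbid        # default is Allow
  suspend: false                   # default is false
  successfulJobsHistoryLimit: 3    # default is 3
  failedJobsHistoryLimit: 1        # default is 1
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: export
            image: busybox
            command: ["sh", "-c", "date"]
          restartPolicy: OnFailure
EOF
```

Several of these lines repeat defaults. That redundancy is the point: a responder reading the manifest does not have to remember what the API would have chosen.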
Did You Know?
- Indexed Jobs give each completion an index. With `completionMode: Indexed`, each Pod can read its index from `JOB_COMPLETION_INDEX`, which is useful when you shard data and want each Pod to process a different partition.
- CronJobs support time zones in modern Kubernetes. The `spec.timeZone` field lets you say that a business schedule should run in a named time zone instead of relying on the controller’s local clock behavior.
- The `activeDeadlineSeconds` field limits the whole Job runtime. If the deadline expires, Kubernetes terminates active Pods for that Job even if some individual attempts were still making progress.
- CronJob history limits and Job TTL solve different cleanup problems. History limits decide how many Jobs a CronJob keeps, while `ttlSecondsAfterFinished` lets the TTL controller remove finished Jobs after a configured delay.
Common Mistakes
| Mistake | Why It Happens | How to Fix It |
|---|---|---|
| restartPolicy: Always in a Job template | The author copies a Pod or Deployment template and forgets that Jobs must finish | Use Never when separate failed Pods help debugging, or OnFailure when in-Pod restarts are acceptable |
| Leaving backoffLimit implicit for risky commands | The default feels harmless during creation but can hide repeated application failures | Set an explicit retry limit that matches the cost and safety of the operation |
| Using the wrong cron field for business time | Cron expressions are compact, and UTC or time zone assumptions are easy to miss | Validate the expression, set spec.timeZone when needed, and test with a manual Job trigger |
| Omitting cleanup controls on frequent CronJobs | Finished Jobs are useful at first, so teams postpone retention decisions | Set successfulJobsHistoryLimit, failedJobsHistoryLimit, and a Job TTL where appropriate |
| Allowing overlap for shared-state tasks | Allow is the default, and short test runs do not reveal month-end runtime | Use Forbid for backups, compaction, and cleanup unless concurrent runs are explicitly safe |
| Debugging only the newest Pod log | Retries create several Pods or restarts, and the newest attempt may not show the first error | Inspect Job status, events, all Pods with the job-name label, and previous logs when using OnFailure |
| Treating startingDeadlineSeconds as a runtime limit | The field name sounds like a general timeout, but it controls missed starts | Use activeDeadlineSeconds for runtime limits and startingDeadlineSeconds for late schedules |
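The two deadline fields are easier to tell apart when you see where each one lives in the object. This manifest is a hypothetical example with made-up values, and only the field placement is the point.

```shell
# Hypothetical CronJob showing where each deadline field belongs.
cat << 'EOF' > deadlines.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: deadline-demo
spec:
  schedule: "*/10 * * * *"
  startingDeadlineSeconds: 120     # CronJob field: how late a missed run may still start
  jobTemplate:
    spec:
      activeDeadlineSeconds: 300   # Job field: how long a started run may keep running
      template:
        spec:
          containers:
          - name: task
            image: busybox
            command: ["sh", "-c", "sleep 30"]
          restartPolicy: Never
EOF
```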
Your team deploys a database migration as a Job, but the manifest uses `restartPolicy: Always` because it was copied from a Deployment. What happens, and what should you change?
The API server rejects the Job because the Pod template for a Job must use Never or OnFailure. A Job needs Pods that can finish so the controller can count completions, and Always describes long-running behavior instead. Use Never if you want each failed attempt preserved as a separate Pod for easier postmortem inspection. Use OnFailure if restarting inside the same Pod is acceptable and you want fewer replacement Pods.
Your operations team needs a log cleanup script to run at 4:30 AM on weekdays only, and the cleanup sometimes takes more than a day. What schedule and concurrency policy would you choose?
The schedule is 30 4 * * 1-5, which means minute 30, hour 4, any day of month, any month, Monday through Friday. For the concurrency policy, choose Forbid unless the cleanup is proven safe to run concurrently. Allow could stack multiple cleanups against the same filesystem or database, and Replace could terminate a cleanup halfway through. Skipping an overlapping run is usually safer than multiplying destructive maintenance.
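If the field order is hard to remember under exam pressure, a small helper that labels each position can serve as a memory aid. This is a plain shell sketch that only handles the standard five-field form. The function name is made up.

```shell
# Split a five-field cron expression and label each field.
explain_cron() {
  set -f        # disable globbing so * is not expanded to file names
  set -- $1     # split the expression on whitespace into $1..$5
  set +f
  if [ "$#" -ne 5 ]; then
    echo "expected 5 fields, got $#" >&2
    return 1
  fi
  echo "minute=$1 hour=$2 day-of-month=$3 month=$4 day-of-week=$5"
}

explain_cron "30 4 * * 1-5"
# prints: minute=30 hour=4 day-of-month=* month=* day-of-week=1-5
```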
You need to process 100 images through a thumbnail generator, each image takes about 10 seconds, and the cluster can handle five extra Pods. How should you configure the Job and what must the application handle?
Set completions: 100 and parallelism: 5 so Kubernetes runs at most five Pods while it works toward 100 successful completions. The application still needs a reliable way to map each completion to a specific image, such as an indexed completion strategy or an external queue. Without that mapping, five Pods may all process the same image or skip work. The controller manages completion counts; your application must manage business item ownership.
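One possible mapping, sketched under assumptions: the Job uses `completionMode: Indexed` so that Kubernetes sets `JOB_COMPLETION_INDEX`, and every Pod can read a shared list with one image path per line. The `IMAGE_LIST` variable, its default path, and the `pick_image` name are hypothetical.

```shell
# Pick the work item for this Pod from a shared list using the completion index.
# JOB_COMPLETION_INDEX is only set for Jobs with completionMode: Indexed.
pick_image() {
  list="${IMAGE_LIST:-/work/images.txt}"   # hypothetical list, one path per line
  index="${JOB_COMPLETION_INDEX:?not running as an Indexed Job}"
  # Completion indexes start at 0, sed line numbers start at 1.
  sed -n "$((index + 1))p" "$list"
}
```

Because each index maps to exactly one line, a retried Pod for index 7 picks the same image again, which keeps retries from touching another worker's item.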
A CronJob runs every five minutes, but completed Jobs and Pods are piling up in the namespace. Which controls should you add, and why are there two kinds?
Add successfulJobsHistoryLimit and failedJobsHistoryLimit to the CronJob so it keeps only a bounded number of recent Jobs. Add ttlSecondsAfterFinished under the Job template when you also want finished Jobs removed after a time delay. The history limits are CronJob retention controls, while the TTL is enforced per finished Job by the TTL controller. Using both gives operators recent evidence without letting frequent runs clutter the namespace indefinitely.
A scheduled backup did not run during a short control-plane outage, and now the CronJob shows no new Job for that schedule. What fields and status would you inspect before changing the image or command?
First inspect the CronJob with k describe cronjob and check status.lastScheduleTime, spec.suspend, spec.schedule, spec.timeZone, and spec.startingDeadlineSeconds. If the missed schedule exceeded the starting deadline, Kubernetes may correctly skip it rather than starting late. If the CronJob is suspended, no new Jobs are created until it is resumed. Only after confirming that a Job was actually created should you move to image, command, Pod, and log debugging.
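Those checks can be bundled into a small helper. This is a sketch, not a standard tool: the function name is made up, and it calls `kubectl` directly because shell aliases such as `k` are usually unavailable inside scripts.

```shell
# Print the CronJob fields that decide whether a Job should have been created.
cronjob_check() {
  cj="$1"
  for path in .spec.suspend .spec.schedule .spec.timeZone \
              .spec.startingDeadlineSeconds .status.lastScheduleTime; do
    printf '%s=' "$path"
    kubectl get cronjob "$cj" -o jsonpath="{$path}"
    echo
  done
  # Events and currently active Jobs appear in the describe output.
  kubectl describe cronjob "$cj"
  # List Jobs oldest-first to confirm whether one exists for the missed schedule.
  kubectl get jobs --sort-by=.metadata.creationTimestamp
}
```

Call it as `cronjob_check <cronjob-name>`. An empty value after `=` means the field is unset on the object.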
A Job with `restartPolicy: Never` and a failing command has several failed Pods. A teammate wants to delete the Job and recreate it immediately. What should you inspect first?
Inspect k describe job, list Pods with k get pods -l job-name=<name>, and read logs from the failed Pods before deleting evidence. With restartPolicy: Never, each failed attempt can preserve a different event or log sequence, especially if scheduling, image pull, and application failures happened at different times. You should also check backoffLimit and any deadline fields to understand whether Kubernetes stopped retrying as configured. Recreating the Job too quickly can erase the trail that explains the root cause.
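The evidence-gathering steps above can be sketched as one function that you run before deleting anything. The function name is hypothetical, and it uses `kubectl` rather than the `k` alias so it also works when saved as a script.

```shell
# Collect evidence for a Job before deleting it.
job_triage() {
  job="$1"
  kubectl describe job "$job"
  kubectl get pods -l "job-name=$job" -o wide
  # Read logs from every Pod the Job created, not just the newest one.
  for pod in $(kubectl get pods -l "job-name=$job" -o name); do
    echo "--- logs for $pod ---"
    kubectl logs "$pod" || true   # keep going if one Pod has no logs
  done
  # Show the configured retry limit next to the observed failures.
  kubectl get job "$job" -o jsonpath='{.spec.backoffLimit}{"\n"}'
}
```

Redirecting the output to a file, for example `job_triage retry-job > retry-job-evidence.txt`, keeps the trail available after the Job is recreated.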
Your team is deciding between a Deployment, a Job, and a CronJob for a nightly report generator that exits after publishing a file. Which controller fits, and when would the answer change?
A CronJob fits because the work is finite and recurring on a schedule. A plain Job would fit for a one-time report, but it would not create future runs by itself. A Deployment would be the wrong default because the desired state is not a continuously running Pod; the report command should exit after success. The answer changes only if the report process is actually a long-running service that watches for requests or queue messages indefinitely.
Hands-On Exercise
This exercise builds a small backup and cleanup workflow in layers. You will create a one-time Job, trigger a CronJob manually, run parallel completions, observe retry behavior, and finish with a more complete backup CronJob that uses history and TTL controls. Work in a disposable namespace if you have one, and clean up at the end so your later CKAD practice is not polluted by old Jobs.
As you work through the tasks, keep a small investigation journal in your terminal or notes: object created, expected controller behavior, command used to verify it, and cleanup command. That may feel formal for tiny examples, but it trains the exact loop you need when batch work fails in a shared namespace. The goal is not only to make the commands pass; it is to connect each command to the controller decision it confirms.
Task 1: Run and Inspect a One-Time Job
Start with the smallest useful Job: one Pod, one successful completion, one log stream to inspect. This task teaches the basic loop you will use throughout the module: create the object, wait for the controller condition, inspect status, and read logs from the Job-owned Pod.
Before running the command, predict what objects should exist after completion. You should have a Job object and at least one Pod owned by that Job. The Job should report a completed condition, and the Pod should be in a succeeded phase. If your cluster keeps finished Pods for inspection, the log should still be available through the Job reference; if retention or cleanup removes evidence quickly, describe the Job first and capture what remains.
```shell
# Create a job that simulates a database backup
k create job db-backup --image=busybox -- sh -c "echo 'Backing up database' && sleep 5 && echo 'Backup complete'"

# Wait for completion
k wait --for=condition=complete job/db-backup --timeout=60s
k get job db-backup

# Check logs
k logs job/db-backup
```

Solution notes for Task 1
The Job should reach the Complete condition, and the logs should show both backup messages. If the wait times out, describe the Job and list Pods with the job-name=db-backup label before changing anything. That keeps your investigation aligned with the controller hierarchy.
Task 2: Create a CronJob and Trigger It Manually
Now create a scheduled controller, but do not wait for the clock to prove the template works. A manual Job created from the CronJob gives quick feedback on the image, command, and Pod template while keeping schedule debugging separate.
This is the workflow many teams use before enabling a new schedule. They create or update the CronJob, trigger one manual Job, inspect logs and status, and only then trust the schedule. It avoids the awkward pattern of waiting until the next hour, discovering an image or command failure, editing under time pressure, and then waiting again. Manual triggering turns a scheduled workload into a testable template.
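```shell
# Create cronjob for hourly cleanup.
# Single quotes stop your local shell from expanding $(date) at creation time,
# so each run prints its own timestamp from inside the container.
k create cronjob hourly-cleanup \
  --image=busybox \
  --schedule="0 * * * *" \
  -- sh -c 'echo "Cleanup at $(date)"'

# Manually trigger for testing
k create job manual-cleanup --from=cronjob/hourly-cleanup

# Check results
k wait --for=condition=complete job/manual-cleanup --timeout=60s
k get jobs
k logs job/manual-cleanup
```

Solution notes for Task 2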
You should see the manual-cleanup Job complete and print the cleanup timestamp. This proves the Job template can run, but it does not prove the hourly schedule has fired. Use k describe cronjob hourly-cleanup when you want to inspect schedule status, last schedule time, and events.
Task 3: Run a Parallel Job
The next manifest gives you a concrete way to observe completions and active Pods. Apply it, watch the Pods, and compare what the Job wants with what the cluster is currently running.
Do not expect every k get pods call to show exactly two running Pods, because timing matters. Some Pods may finish between list operations, and the controller may create replacements quickly enough that the snapshot changes from second to second. The important evidence is the trend: the Job progresses toward six completions while maintaining no more than the configured parallelism under normal conditions. If it stalls, describe the Job and then inspect the Pods that did not succeed.
```shell
# Create parallel-job.yaml
cat << 'EOF' > parallel-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: parallel-process
spec:
  completions: 6
  parallelism: 2
  completionMode: Indexed   # required for JOB_COMPLETION_INDEX to be set
  template:
    spec:
      containers:
      - name: worker
        image: busybox
        command: ["sh", "-c", "echo Processing item $JOB_COMPLETION_INDEX && sleep 3"]
      restartPolicy: Never
EOF

k apply -f parallel-job.yaml
k get pods -l job-name=parallel-process
```

Solution notes for Task 3
The Job targets six successful completions while allowing two active Pods at a time. Depending on when you list Pods, you may see running Pods, completed Pods, or a mix. The important observation is that Kubernetes launches more Pods as earlier completions finish until the completion target is reached.
Task 4: Practice Focused CKAD Drills
These drills are short on purpose, but do not treat them as memorization only. After each command, name the controller behavior you expect before checking the output. That habit makes exam debugging faster because you notice when the object behaves differently from the lifecycle you intended.
The retry drill is especially useful because it creates a controlled failure. A failing Job is not an accident here; it is a lab instrument. Watch how failed Pods accumulate with restartPolicy: Never, how backoffLimit changes when the controller stops trying, and how k describe job summarizes the state. Once you have seen intentional failure, accidental failure is much less mysterious.
```shell
# Create a job that:
# - Named: hello-job
# - Runs busybox
# - Echoes "Hello from job"

k create job hello-job --image=busybox -- echo "Hello from job"

# Verify completion
k get job hello-job

# Check logs
k logs job/hello-job

# Cleanup
k delete job hello-job
```

```shell
# Create a cronjob that:
# - Named: every-minute
# - Runs every minute
# - Prints current date

k create cronjob every-minute --image=busybox --schedule="* * * * *" -- date

# Wait 1 minute and check
sleep 65
k get jobs

# Check logs of triggered job
k logs job/$(k get jobs -o jsonpath='{.items[0].metadata.name}')

# Cleanup
k delete cronjob every-minute
```

```shell
# Create a job that fails and retries
cat << 'EOF' | k apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: retry-job
spec:
  backoffLimit: 3
  template:
    spec:
      containers:
      - name: fail
        image: busybox
        command: ["sh", "-c", "echo 'Trying...' && exit 1"]
      restartPolicy: Never
EOF

# Verify retries
sleep 5
k get pods -l job-name=retry-job

# Check job status
k describe job retry-job | grep -A5 Conditions

# Cleanup
k delete job retry-job
```

```shell
# Create a parallel job
cat << 'EOF' | k apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: parallel
spec:
  completions: 5
  parallelism: 2
  template:
    spec:
      containers:
      - name: worker
        image: busybox
        command: ["sh", "-c", "echo Worker done && sleep 2"]
      restartPolicy: Never
EOF

# Verify parallel execution
sleep 5
k get pods -l job-name=parallel

# Verify all completed
k get job parallel

# Cleanup
k delete job parallel
```

```shell
# Create cronjob that forbids overlap
cat << 'EOF' | k apply -f -
apiVersion: batch/v1
kind: CronJob
metadata:
  name: no-overlap
spec:
  schedule: "* * * * *"
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: worker
            image: busybox
            command: ["sh", "-c", "echo 'Start' && sleep 90 && echo 'Done'"]
          restartPolicy: Never
EOF

# Check policy
k get cronjob no-overlap -o jsonpath='{.spec.concurrencyPolicy}'

# Wait 2 minutes and verify only 1 job runs
sleep 120
k get jobs | grep no-overlap

# Cleanup
k delete cronjob no-overlap
```

Solution notes for Task 4
The basic Job should complete once, the every-minute CronJob should create at least one Job after the schedule fires, the retry Job should show failed attempts until the backoff limit is reached, the parallel Job should run up to two Pods at a time, and the no-overlap CronJob should avoid starting a second active Job while the first one is still running. If your output differs, use k describe before deleting anything.
Task 5: Build a Complete Backup CronJob
The final task combines a ConfigMap-provided script, a CronJob, forbidden concurrency, history limits, and Job TTL cleanup. This is closer to production shape because the script is separated from the CronJob object and the retention behavior is visible in the spec. In a real cluster, you would also add service account permissions, resource requests, storage credentials from a safe Secret workflow, and monitoring.
Notice that the example still uses BusyBox and simulated output. That is deliberate because the learning target is Kubernetes controller behavior, not storage integration. In a real backup system, the script would need authenticated storage access, encryption decisions, restore validation, and alerting when the backup cannot be verified. The Kubernetes CronJob is the scheduler and execution envelope; it is not a complete backup product by itself.
```shell
# 1. Create configmap with backup script
k create configmap backup-script --from-literal=script.sh='#!/bin/sh
echo "Starting backup at $(date)"
echo "Compressing data..."
sleep 3
echo "Uploading to storage..."
sleep 2
echo "Backup complete at $(date)"'

# 2. Create CronJob using the script
cat << 'EOF' | k apply -f -
apiVersion: batch/v1
kind: CronJob
metadata:
  name: backup-system
spec:
  schedule: "*/5 * * * *"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      ttlSecondsAfterFinished: 300
      template:
        spec:
          containers:
          - name: backup
            image: busybox
            command: ["sh", "/scripts/script.sh"]
            volumeMounts:
            - name: scripts
              mountPath: /scripts
          restartPolicy: OnFailure
          volumes:
          - name: scripts
            configMap:
              name: backup-script
EOF

# 3. Test with manual trigger
k create job test-backup --from=cronjob/backup-system

# 4. Wait for the run to finish, then check logs
k wait --for=condition=complete job/test-backup --timeout=60s
k logs job/test-backup

# 5. Verify history limits
k get cronjob backup-system -o jsonpath='{.spec.successfulJobsHistoryLimit}'

# Cleanup (the manual Job may already be gone once its CronJob is deleted)
k delete cronjob backup-system
k delete job test-backup --ignore-not-found
k delete configmap backup-script
```

Solution notes for Task 5
The manual trigger should create test-backup, run the script from the ConfigMap, and print the backup steps. The CronJob should show Forbid concurrency and the configured history limits. The TTL value applies to Jobs created from the template after they finish, so do not expect it to delete the CronJob itself.
Success Criteria
- You can create a one-time Job, wait for completion, and read its logs through the Job reference.
- You can create a CronJob and manually trigger a Job from its template for fast validation.
- You can explain how `completions` and `parallelism` interact in a fixed-size batch.
- You can diagnose a failing Job by checking Job status, labeled Pods, events, logs, and retry limits.
- You can configure a CronJob with overlap protection, history limits, and finished-Job cleanup.
- You can clean up every Job, CronJob, and ConfigMap created during the exercise.
Sources
- Kubernetes Documentation: Jobs
- Kubernetes Documentation: CronJob
- Kubernetes Documentation: TTL Mechanism for Finished Jobs
- Kubernetes Documentation: Pod Lifecycle
- Kubernetes Documentation: Managing Resources for Containers
- Kubernetes Documentation: Configure a Pod to Use a ConfigMap
- Kubernetes Documentation: Labels and Selectors
- Kubernetes Documentation: kubectl create job
- Kubernetes Documentation: kubectl create cronjob
- Kubernetes API Reference: batch/v1
Next Module
Module 1.3: Multi-Container Pods - Sidecar, init, and ambassador patterns help you design Pods where multiple containers cooperate instead of forcing every responsibility into one image.