Module 2.4: Jobs & CronJobs
Complexity: QUICK - Straightforward batch workloads
Time to Complete: 30-40 minutes
Prerequisites: Module 2.1 (Pods)
What You’ll Be Able to Do
After this module, you will be able to:
- Create Jobs and CronJobs with appropriate parallelism, completion counts, and backoff limits
- Debug failed Jobs by checking pod logs, exit codes, and restart policies
- Configure CronJob concurrency policies and history limits for production use
- Explain when to use Jobs vs Deployments and the implications of each for batch workloads
Why This Module Matters
Not all workloads run forever. Some run once and exit:
- Database migrations
- Batch processing
- Report generation
- Backup operations
Jobs handle one-time tasks. CronJobs handle scheduled, recurring tasks. The CKA exam tests creating Jobs with specific completion requirements and troubleshooting failed Jobs.
The Task Manager Analogy
Think of Jobs like tasks on a to-do list. A Job is a single task: “Generate monthly report.” Once done, you check it off. A CronJob is a recurring task: “Generate monthly report on the 1st of every month.” The task manager (Kubernetes) ensures the task runs, retries if it fails, and tracks completion.
What You’ll Learn
By the end of this module, you’ll be able to:
- Create Jobs for one-time tasks
- Configure parallelism and completions
- Handle Job failures and retries
- Create CronJobs for scheduled tasks
- Debug failed Jobs
Part 1: Jobs
1.1 What Is a Job?
A Job creates pods that run to completion. Unlike Deployments (which keep pods running forever), Jobs expect pods to terminate successfully.
```
┌──────────────────────────────────────────────────────────────┐
│                        Job Lifecycle                         │
│                                                              │
│  Job Created                                                 │
│      │                                                       │
│      ▼                                                       │
│  Pod Created ◄──────────────────────────────────┐            │
│      │                                          │            │
│      ▼                                          │            │
│  Pod Running                                    │            │
│      │                                          │            │
│      ├──► Exit 0 (Success) ──► Job Complete     │            │
│      │                                          │            │
│      └──► Exit ≠ 0 (Fail) ──► Retry? ───────────┘            │
│                       (based on backoffLimit)                │
│                                                              │
└──────────────────────────────────────────────────────────────┘
```

1.2 Creating a Job
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: pi-calculation
spec:
  template:
    spec:
      containers:
      - name: pi
        image: perl
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never    # Required for Jobs
  backoffLimit: 4             # Retry up to 4 times on failure
```

```bash
# Create job imperatively
kubectl create job pi --image=perl -- perl -Mbignum=bpi -wle "print bpi(100)"

# Generate YAML
kubectl create job pi --image=perl --dry-run=client -o yaml -- perl -Mbignum=bpi -wle "print bpi(100)"
```

1.3 Job Commands
```bash
# List jobs
kubectl get jobs

# Watch job progress
kubectl get jobs -w

# Describe job
kubectl describe job pi-calculation

# Get job logs
kubectl logs job/pi-calculation

# Delete job (also deletes pods)
kubectl delete job pi-calculation
```

Pause and predict: A Job has `restartPolicy: Never` and `backoffLimit: 4`. The container fails on every attempt. How many pods will you see in `kubectl get pods` after the Job gives up? Now consider the same scenario with `restartPolicy: OnFailure` -- how many pods would you see?
1.4 Restart Policy
Jobs require `restartPolicy` to be either `Never` or `OnFailure`:
| Policy | Behavior |
|---|---|
| `Never` | Create a new pod on failure |
| `OnFailure` | Restart the container in the same pod on failure |
```yaml
spec:
  template:
    spec:
      restartPolicy: Never        # New pod per failure
      # restartPolicy: OnFailure  # Restart same pod
```

Did You Know?
With `restartPolicy: Never`, failed attempts create new pods. With a `backoffLimit` of 4, you might see 5 pods (1 original + 4 retries). With `OnFailure`, you see fewer pods because containers restart in place.
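That arithmetic can be sketched as a tiny helper function (the function name is invented for illustration; it only models the rule described above):

```shell
# Hypothetical helper: how many pods you'd expect in `kubectl get pods`
# once a Job exhausts its retries, given the restart policy.
pods_after_failure() {
  policy=$1
  backoff_limit=$2
  if [ "$policy" = "Never" ]; then
    # every retry is a brand-new pod, plus the original attempt
    echo $(( backoff_limit + 1 ))
  else
    # OnFailure restarts the container inside the same pod
    echo 1
  fi
}

pods_after_failure Never 4      # 5
pods_after_failure OnFailure 4  # 1
```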
Part 2: Job Completions and Parallelism
2.1 Running Multiple Completions
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-job
spec:
  completions: 5     # Job succeeds when 5 pods complete successfully
  parallelism: 2     # Run 2 pods at a time
  template:
    spec:
      containers:
      - name: worker
        image: busybox
        command: ["sh", "-c", "echo Processing item; sleep 5"]
      restartPolicy: Never
```

2.2 Parallelism Patterns
| Pattern | completions | parallelism | Behavior |
|---|---|---|---|
| Single pod | 1 (default) | 1 (default) | One pod runs to completion |
| Fixed completions | N | M | M pods run in parallel until N succeed |
| Work queue | unset | N | N pods run until one succeeds |
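For the fixed-completions pattern, the rough runtime is `ceil(completions / parallelism) * time-per-pod`, assuming every pod succeeds and takes the same time. A small sketch of that estimate (the helper name is invented):

```shell
# Hypothetical helper: estimate wall-clock seconds for a fixed-completions Job,
# assuming all pods succeed and each takes the same number of seconds.
estimate_seconds() {
  completions=$1
  parallelism=$2
  per_pod=$3
  # ceiling division: how many "waves" of parallel pods are needed
  waves=$(( (completions + parallelism - 1) / parallelism ))
  echo $(( waves * per_pod ))
}

estimate_seconds 5 2 5     # 3 waves of 5s = 15
estimate_seconds 100 6 30  # 17 waves of 30s = 510
```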
```
┌──────────────────────────────────────────────────────────────┐
│                 Completions=5, Parallelism=2                 │
│                                                              │
│  Time ─────────────────────────────────────────────────►     │
│                                                              │
│  Slot 1: [Pod 1 ✓] [Pod 3 ✓] [Pod 5 ✓]                       │
│  Slot 2: [Pod 2 ✓] [Pod 4 ✓]                                 │
│                                                              │
│  2 pods run concurrently, until 5 completions achieved       │
│                                                              │
└──────────────────────────────────────────────────────────────┘
```

2.3 Examples
```bash
# Run 10 tasks, 3 at a time.
# Note: kubectl create job has no flag for completions/parallelism, and
# spec.completions cannot be changed on an existing Job -- define them in YAML:
cat << 'EOF' | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: parallel-job
spec:
  completions: 10
  parallelism: 3
  template:
    spec:
      containers:
      - name: worker
        image: busybox
        command: ["sh", "-c", "echo Task complete; sleep 2"]
      restartPolicy: Never
EOF

# Watch progress
kubectl get jobs parallel-job -w
```

Part 3: Job Failure Handling
Pause and predict: A Job with `activeDeadlineSeconds: 60` and `backoffLimit: 10` runs a container that takes 15 seconds per attempt and always fails. Will the Job hit the backoff limit or the deadline first? How many pods will be created?
3.1 backoffLimit
Controls how many times to retry:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: failing-job
spec:
  backoffLimit: 3    # Retry 3 times, then fail
  template:
    spec:
      containers:
      - name: fail
        image: busybox
        command: ["sh", "-c", "exit 1"]   # Always fails
      restartPolicy: Never
```

3.2 activeDeadlineSeconds
Maximum time for the job to run:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: timeout-job
spec:
  activeDeadlineSeconds: 60    # Kill job after 60 seconds
  template:
    spec:
      containers:
      - name: long-task
        image: busybox
        command: ["sleep", "120"]   # Tries to run 2 minutes
      restartPolicy: Never
```

3.3 Checking Job Status
```bash
# Job status
kubectl get job myjob
# NAME    COMPLETIONS   DURATION   AGE
# myjob   3/5           2m         5m

# Detailed status
kubectl describe job myjob | grep -A5 "Pods Statuses"

# Check failed pods
kubectl get pods -l job-name=myjob --field-selector=status.phase=Failed
```

Part 4: CronJobs
4.1 What Is a CronJob?
A CronJob creates Jobs on a schedule, like cron in Linux.
```
┌──────────────────────────────────────────────────────────────┐
│                           CronJob                            │
│                                                              │
│  Schedule: "0 * * * *" (hourly)                              │
│                                                              │
│  1:00 ──► Creates Job ──► Creates Pod ──► Completes          │
│  2:00 ──► Creates Job ──► Creates Pod ──► Completes          │
│  3:00 ──► Creates Job ──► Creates Pod ──► Completes          │
│  ...                                                         │
│                                                              │
└──────────────────────────────────────────────────────────────┘
```

4.2 Cron Schedule Syntax
```
┌───────────── minute (0 - 59)
│ ┌───────────── hour (0 - 23)
│ │ ┌───────────── day of month (1 - 31)
│ │ │ ┌───────────── month (1 - 12)
│ │ │ │ ┌───────────── day of week (0 - 6) (Sunday = 0)
│ │ │ │ │
* * * * *
```

| Schedule | Description |
|---|---|
| `* * * * *` | Every minute |
| `0 * * * *` | Every hour |
| `0 0 * * *` | Every day at midnight |
| `0 0 * * 0` | Every Sunday at midnight |
| `*/5 * * * *` | Every 5 minutes |
| `0 9-17 * * 1-5` | Hourly from 9:00 to 17:00, Mon-Fri |
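Step values like `*/5` fire whenever the field value is divisible by the step. A minimal sketch of that rule for the minute field (the helper name is invented for illustration):

```shell
# Hypothetical helper: does a given minute match a "*/step" cron expression?
matches_step() {
  minute=$1
  step=$2
  if [ $(( minute % step )) -eq 0 ]; then
    echo "run"
  else
    echo "skip"
  fi
}

matches_step 10 5   # run  (10 is divisible by 5)
matches_step 12 5   # skip
```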
4.3 Creating a CronJob
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: backup
spec:
  schedule: "0 2 * * *"    # Daily at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: backup
            image: busybox
            command: ["sh", "-c", "echo Backup started; sleep 10; echo Backup done"]
          restartPolicy: OnFailure
  successfulJobsHistoryLimit: 3    # Keep 3 successful job records
  failedJobsHistoryLimit: 1        # Keep 1 failed job record
```

```bash
# Create CronJob imperatively
kubectl create cronjob backup --image=busybox --schedule="0 2 * * *" -- sh -c "echo Backup done"

# Generate YAML
kubectl create cronjob backup --image=busybox --schedule="*/5 * * * *" --dry-run=client -o yaml -- echo "hello"
```

4.4 CronJob Commands
```bash
# List CronJobs
kubectl get cronjobs
kubectl get cj   # Short form

# Describe
kubectl describe cronjob backup

# Manually trigger a job from a CronJob
kubectl create job --from=cronjob/backup backup-manual

# Suspend CronJob
kubectl patch cronjob backup -p '{"spec":{"suspend":true}}'

# Resume CronJob
kubectl patch cronjob backup -p '{"spec":{"suspend":false}}'

# Delete CronJob (also deletes Jobs it created)
kubectl delete cronjob backup
```

Stop and think: You have a CronJob that runs a database backup every hour, but sometimes the backup takes 90 minutes. With the default `concurrencyPolicy: Allow`, two backup jobs would overlap. What could go wrong with concurrent backups, and which concurrency policy would you choose instead?
4.5 CronJob Concurrency Policy
```yaml
spec:
  concurrencyPolicy: Allow       # Default - allow concurrent jobs
  # concurrencyPolicy: Forbid    # Skip if previous still running
  # concurrencyPolicy: Replace   # Kill previous, start new
```

| Policy | Behavior |
|---|---|
| `Allow` | Multiple Jobs can run simultaneously |
| `Forbid` | Skip the new Job if the previous one is still running |
| `Replace` | Kill the running Job, start the new one |
Exam Tip
For scheduled backup tasks, use `concurrencyPolicy: Forbid` to prevent overlapping runs. For quick tasks that shouldn’t overlap, `Replace` might be better.
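A hedged sketch of a non-overlapping backup CronJob combining these ideas (the name is made up; `startingDeadlineSeconds` is a real CronJob field that bounds how late a missed run may still be started):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: hourly-backup            # hypothetical name
spec:
  schedule: "0 * * * *"
  concurrencyPolicy: Forbid      # skip a run if the previous one is still going
  startingDeadlineSeconds: 300   # a missed run may still start up to 5 minutes late
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: backup
            image: busybox
            command: ["sh", "-c", "echo backing up; sleep 30"]
          restartPolicy: OnFailure
```

With `Forbid`, a run skipped because its predecessor is still going counts as missed; `startingDeadlineSeconds` then decides whether it may still fire once the predecessor finishes within that window.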
Part 5: Debugging Jobs
5.1 Common Job Issues
| Issue | Symptom | Debug Command |
|---|---|---|
| Image pull failure | Pod in `ImagePullBackOff` | `kubectl describe pod <pod>` |
| Command failure | Job never completes | `kubectl logs job/<job-name>` |
| Timeout | Job killed | Check `activeDeadlineSeconds` |
| Too many retries | Multiple failed pods | Check `backoffLimit` |
5.2 Debugging Workflow
```bash
# 1. Check job status
kubectl get job myjob
kubectl describe job myjob

# 2. Find pods created by job
kubectl get pods -l job-name=myjob

# 3. Check pod logs
kubectl logs <pod-name>
kubectl logs job/myjob   # Auto-selects a pod

# 4. If still running, exec into pod
kubectl exec -it <pod-name> -- /bin/sh

# 5. Check events
kubectl get events --field-selector involvedObject.name=myjob
```

Did You Know?
- Jobs don’t auto-delete by default. Set `ttlSecondsAfterFinished` to auto-cleanup completed Jobs.
- CronJob schedules are interpreted in the controller-manager’s timezone (usually UTC) unless you set `spec.timeZone` (stable since Kubernetes 1.27). Plan schedules accordingly.
- Job pods remain after completion for log inspection. Delete the Job to clean up its pods.
- Indexed Jobs (Kubernetes 1.21+) assign unique indexes to pods for parallel processing patterns.
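The cleanup and indexing features above can be combined in a single spec. A sketch (the Job name is hypothetical; `completionMode: Indexed`, `ttlSecondsAfterFinished`, and the `JOB_COMPLETION_INDEX` variable are real Job API features):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: indexed-demo             # hypothetical name
spec:
  completions: 3
  parallelism: 3
  completionMode: Indexed        # each pod gets JOB_COMPLETION_INDEX (0, 1, 2)
  ttlSecondsAfterFinished: 120   # delete the Job and its pods 2 minutes after it finishes
  template:
    spec:
      containers:
      - name: worker
        image: busybox
        command: ["sh", "-c", "echo Processing shard $JOB_COMPLETION_INDEX"]
      restartPolicy: Never
```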
Common Mistakes
| Mistake | Problem | Solution |
|---|---|---|
| Using `restartPolicy: Always` | Job never completes | Use `Never` or `OnFailure` |
| Forgetting `backoffLimit` | Unwanted retries (default is 6) | Set an explicit `backoffLimit` |
| Wrong cron syntax | Job never triggers | Verify with crontab.guru |
| Not checking logs | Unknown failure cause | Always check `kubectl logs job/<name>` |
| CronJob overlap | Resource contention | Set `concurrencyPolicy: Forbid` |
- A developer creates a Job with `restartPolicy: Always` and wonders why it gets rejected. They argue that retrying should mean restarting. Explain why `Always` is invalid for Jobs and describe the practical difference between `Never` and `OnFailure` for a Job that might fail.

  Answer: `restartPolicy: Always` is invalid for Jobs because it would create a pod that never terminates -- the kubelet would restart the container forever, and the Job could never reach a "completed" state. Jobs need pods to eventually exit. With `Never`, each failure creates a new pod (the old failed pod stays for log inspection), so with `backoffLimit: 4` you might see 5 pods total. With `OnFailure`, the same pod's container is restarted in place, so you see only 1 pod but with multiple restarts. Use `Never` when you need to inspect failed pod logs side-by-side; use `OnFailure` to keep your pod count clean.

- Your data pipeline needs to process 100 items. Each item takes about 30 seconds. You want to finish in under 10 minutes. Design the Job spec with appropriate `completions` and `parallelism` values, and explain what happens if one of the parallel pods fails halfway through.

  Answer: Set `completions: 100` and `parallelism: 6` (or higher). With 6 pods running in parallel, each taking 30 seconds, you can complete 100 items in roughly `ceil(100/6) * 30s = 510s` (about 8.5 minutes), safely under 10 minutes. If one pod fails, the Job controller creates a replacement pod to redo that specific completion (failed completions don't count toward the 100). The `backoffLimit` controls how many total failures are tolerated before the Job is marked as failed. Set it high enough to handle transient failures (e.g., `backoffLimit: 10`) but not so high that a systematic bug creates hundreds of failed pods.

- It’s 3 AM and your on-call pager fires because a CronJob-created backup hasn’t run. The CronJob schedule is `0 2 * * *` (daily at 2 AM). You run `kubectl get cronjobs` and see `LAST SCHEDULE: <none>`. How do you investigate, and how do you immediately trigger the backup while you fix the root cause?

  Answer: First, check if the CronJob is suspended: `kubectl get cronjob backup -o yaml | grep suspend`. If `suspend: true`, that explains it. Next, check `kubectl describe cronjob backup` for events -- the CronJob controller may have logged failures. Also verify the cron schedule syntax is correct (a common mistake is swapping the minute and hour fields). To trigger the backup immediately while investigating, run `kubectl create job --from=cronjob/backup backup-emergency`. This creates a Job using the CronJob's template without waiting for the next scheduled time. After the emergency run succeeds, fix the root cause (unsuspend, fix the schedule, or check RBAC permissions).

- You have a CronJob that runs every 5 minutes to aggregate metrics, but sometimes the aggregation takes 7 minutes. With `concurrencyPolicy: Allow` (the default), overlapping runs are causing duplicate data. You switch to `Forbid`, but now some scheduled runs are being skipped entirely. What is the trade-off between `Forbid` and `Replace`, and which would you choose for this use case?

  Answer: With `Forbid`, the new scheduled run is silently skipped if the previous one is still running. You avoid duplicates but miss data from the skipped interval. With `Replace`, the running Job is terminated and a new one starts fresh, which means the in-progress aggregation is lost but the most recent run is always executing. For a metrics aggregation use case, `Forbid` is usually better because the long-running job will eventually complete and cover that interval's data. `Replace` would waste the 7 minutes of work already done. However, the real fix is to optimize the aggregation to finish within 5 minutes, or change the schedule to every 10 minutes to prevent overlap entirely.
Hands-On Exercise
Task: Create Jobs and CronJobs, handle failures.
Steps:
- Create a simple Job:
```bash
kubectl create job hello --image=busybox -- echo "Hello from job"
kubectl get jobs
kubectl logs job/hello
kubectl delete job hello
```

- Create Job with completions:
```bash
cat << 'EOF' | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-processor
spec:
  completions: 5
  parallelism: 2
  template:
    spec:
      containers:
      - name: processor
        image: busybox
        command: ["sh", "-c", "echo Processing $(hostname); sleep 3"]
      restartPolicy: Never
EOF

kubectl get jobs batch-processor -w   # Watch completions
kubectl get pods -l job-name=batch-processor
kubectl delete job batch-processor
```

- Create a failing Job:
```bash
cat << 'EOF' | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: failing-job
spec:
  backoffLimit: 2
  template:
    spec:
      containers:
      - name: fail
        image: busybox
        command: ["sh", "-c", "echo 'About to fail'; exit 1"]
      restartPolicy: Never
EOF

kubectl get jobs failing-job -w
kubectl get pods -l job-name=failing-job   # Multiple failed pods
kubectl logs job/failing-job
kubectl delete job failing-job
```

- Create a CronJob:
```bash
kubectl create cronjob minute-job --image=busybox --schedule="*/1 * * * *" -- date

# Wait for it to run
sleep 70
kubectl get cronjobs
kubectl get jobs
kubectl logs job/<job-name>   # Use actual job name

kubectl delete cronjob minute-job
```

- Manually trigger CronJob:
```bash
kubectl create cronjob backup --image=busybox --schedule="0 0 * * *" -- echo "backup"

# Trigger manually
kubectl create job --from=cronjob/backup backup-now
kubectl get jobs
kubectl logs job/backup-now

kubectl delete cronjob backup
kubectl delete job backup-now
```

Success Criteria:
- Can create Jobs imperatively and declaratively
- Understand completions and parallelism
- Can debug failed Jobs
- Can create CronJobs
- Can manually trigger CronJobs
Practice Drills
Section titled “Practice Drills”Drill 1: Job Creation Speed Test (Target: 2 minutes)
```bash
# Create job
kubectl create job quick --image=busybox -- echo "done"

# Wait for completion
kubectl wait --for=condition=complete job/quick --timeout=60s

# Check logs
kubectl logs job/quick

# Cleanup
kubectl delete job quick
```

Drill 2: Parallel Job (Target: 3 minutes)
Section titled “Drill 2: Parallel Job (Target: 3 minutes)”cat << 'EOF' | kubectl apply -f -apiVersion: batch/v1kind: Jobmetadata: name: parallelspec: completions: 6 parallelism: 3 template: spec: containers: - name: worker image: busybox command: ["sh", "-c", "echo Pod: $HOSTNAME; sleep 5"] restartPolicy: NeverEOF
# Watchkubectl get pods -l job-name=parallel -w &kubectl get job parallel -w &sleep 30kill %1 %2 2>/dev/null
# Cleanupkubectl delete job parallelDrill 3: Job with Timeout (Target: 3 minutes)
Section titled “Drill 3: Job with Timeout (Target: 3 minutes)”cat << 'EOF' | kubectl apply -f -apiVersion: batch/v1kind: Jobmetadata: name: timeout-testspec: activeDeadlineSeconds: 10 template: spec: containers: - name: long-task image: busybox command: ["sleep", "60"] restartPolicy: NeverEOF
# Watch job timeoutkubectl get job timeout-test -w &sleep 15kill %1 2>/dev/null
# Check statuskubectl describe job timeout-test | grep -A3 "Conditions"
# Cleanupkubectl delete job timeout-testDrill 4: CronJob Creation (Target: 2 minutes)
Section titled “Drill 4: CronJob Creation (Target: 2 minutes)”# Create CronJobkubectl create cronjob every-minute --image=busybox --schedule="*/1 * * * *" -- date
# Verifykubectl get cronjob every-minute
# Wait for first runsleep 70
# Check jobs createdkubectl get jobs -l job-name
# Cleanupkubectl delete cronjob every-minuteDrill 5: Manual CronJob Trigger (Target: 2 minutes)
Section titled “Drill 5: Manual CronJob Trigger (Target: 2 minutes)”# Create CronJob (won't run for a while)kubectl create cronjob daily --image=busybox --schedule="0 0 * * *" -- echo "daily task"
# Trigger manuallykubectl create job --from=cronjob/daily daily-manual-run
# Checkkubectl get jobskubectl logs job/daily-manual-run
# Cleanupkubectl delete cronjob dailykubectl delete job daily-manual-runDrill 6: Troubleshooting Failed Job (Target: 5 minutes)
Section titled “Drill 6: Troubleshooting Failed Job (Target: 5 minutes)”# Create intentionally broken jobcat << 'EOF' | kubectl apply -f -apiVersion: batch/v1kind: Jobmetadata: name: brokenspec: backoffLimit: 2 template: spec: containers: - name: app image: busybox command: ["sh", "-c", "cat /nonexistent/file"] restartPolicy: NeverEOF
# Diagnosekubectl get job brokenkubectl get pods -l job-name=brokenkubectl describe job brokenkubectl logs job/broken
# Answer: What's the error? How would you fix it?
# Cleanupkubectl delete job brokenDrill 7: Challenge - Complete Job Workflow
Section titled “Drill 7: Challenge - Complete Job Workflow”Create a Job that:
- Runs 4 completions, 2 at a time
- Each pod echoes its hostname and sleeps 3 seconds
- Has a backoff limit of 2
- Automatically deletes after 60 seconds
```bash
# YOUR TASK: Create this Job
```

Solution
```bash
cat << 'EOF' | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: challenge-job
spec:
  completions: 4
  parallelism: 2
  backoffLimit: 2
  ttlSecondsAfterFinished: 60
  template:
    spec:
      containers:
      - name: worker
        image: busybox
        command: ["sh", "-c", "echo $HOSTNAME; sleep 3"]
      restartPolicy: Never
EOF

kubectl get job challenge-job -w
```

Next Module
Module 2.5: Resource Management - Requests, limits, and QoS classes.