
Module 2.4: Jobs & CronJobs

Complexity: [QUICK] - Straightforward batch workloads

Time to Complete: 30-40 minutes

Prerequisites: Module 2.1 (Pods)


After this module, you will be able to:

  • Create Jobs and CronJobs with appropriate parallelism, completion counts, and backoff limits
  • Debug failed Jobs by checking pod logs, exit codes, and restart policies
  • Configure CronJob concurrency policies and history limits for production use
  • Explain when to use Jobs vs Deployments and the implications of each for batch workloads

Not all workloads run forever. Some run once and exit:

  • Database migrations
  • Batch processing
  • Report generation
  • Backup operations

Jobs handle one-time tasks. CronJobs handle scheduled, recurring tasks. The CKA exam tests creating Jobs with specific completion requirements and troubleshooting failed Jobs.

The Task Manager Analogy

Think of Jobs like tasks on a to-do list. A Job is a single task: “Generate monthly report.” Once done, you check it off. A CronJob is a recurring task: “Generate monthly report on the 1st of every month.” The task manager (Kubernetes) ensures the task runs, retries if it fails, and tracks completion.


By the end of this module, you’ll be able to:

  • Create Jobs for one-time tasks
  • Configure parallelism and completions
  • Handle Job failures and retries
  • Create CronJobs for scheduled tasks
  • Debug failed Jobs

A Job creates pods that run to completion. Unlike Deployments (which keep pods running forever), Jobs expect pods to terminate successfully.

┌────────────────────────────────────────────────────────────────┐
│                         Job Lifecycle                          │
│                                                                │
│   Job Created                                                  │
│       │                                                        │
│       ▼                                                        │
│   Pod Created ────────────────────────────────────────┐        │
│       │                                               │        │
│       ▼                                               │        │
│   Pod Running                                         │        │
│       │                                               │        │
│       ├───► Exit 0 (Success) ──► Job Complete         │        │
│       │                                               │        │
│       └───► Exit ≠ 0 (Fail) ──► Retry? ──────────────►┘        │
│                                 (based on backoffLimit)        │
│                                                                │
└────────────────────────────────────────────────────────────────┘
apiVersion: batch/v1
kind: Job
metadata:
  name: pi-calculation
spec:
  template:
    spec:
      containers:
      - name: pi
        image: perl
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never   # Jobs require Never or OnFailure
  backoffLimit: 4            # Retry up to 4 times on failure
Terminal window
# Create job imperatively
kubectl create job pi --image=perl -- perl -Mbignum=bpi -wle "print bpi(100)"
# Generate YAML
kubectl create job pi --image=perl --dry-run=client -o yaml -- perl -Mbignum=bpi -wle "print bpi(100)"
Terminal window
# List jobs
kubectl get jobs
# Watch job progress
kubectl get jobs -w
# Describe job
kubectl describe job pi-calculation
# Get job logs
kubectl logs job/pi-calculation
# Delete job (also deletes pods)
kubectl delete job pi-calculation

Pause and predict: A Job has restartPolicy: Never and backoffLimit: 4. The container fails on every attempt. How many pods will you see in kubectl get pods after the Job gives up? Now consider the same scenario with restartPolicy: OnFailure — how many pods would you see?

Jobs require either Never or OnFailure:

| Policy | Behavior |
|---|---|
| Never | Create new pod on failure |
| OnFailure | Restart container in same pod on failure |

spec:
  template:
    spec:
      restartPolicy: Never        # New pod per failure
      # restartPolicy: OnFailure  # Restart same pod

Did You Know?

With restartPolicy: Never, failed attempts create new pods. With a backoffLimit of 4, you might see 5 pods (1 original + 4 retries). With OnFailure, you see fewer pods because containers restart in place.


apiVersion: batch/v1
kind: Job
metadata:
  name: batch-job
spec:
  completions: 5     # Job succeeds when 5 pods complete successfully
  parallelism: 2     # Run 2 pods at a time
  template:
    spec:
      containers:
      - name: worker
        image: busybox
        command: ["sh", "-c", "echo Processing item; sleep 5"]
      restartPolicy: Never

| Pattern | completions | parallelism | Behavior |
|---|---|---|---|
| Single pod | 1 (default) | 1 (default) | One pod runs to completion |
| Fixed completions | N | M | M pods run in parallel until N succeed |
| Work queue | unset | N | N pods run until one succeeds |

┌────────────────────────────────────────────────────────────────┐
│                  Completions=5, Parallelism=2                  │
│                                                                │
│   Time ─────────────────────────────────────────────────►      │
│                                                                │
│   Slot 1:  [Pod 1 ✓]  [Pod 3 ✓]  [Pod 5 ✓]                     │
│   Slot 2:  [Pod 2 ✓]  [Pod 4 ✓]                                │
│                                                                │
│   2 pods run concurrently, until 5 completions achieved        │
│                                                                │
└────────────────────────────────────────────────────────────────┘
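
The wave pattern in the diagram gives a quick way to estimate how long a parallel Job takes. The sketch below is a rough estimate only: it assumes every pod takes about the same time, nothing fails, and scheduling and image-pull overhead are negligible. The function name `estimate_job_seconds` is illustrative, not part of kubectl.

```shell
# Estimate wall-clock seconds for a Job:
# ceil(completions / parallelism) waves, each lasting one pod's runtime.
estimate_job_seconds() {
  completions=$1
  parallelism=$2
  seconds_per_pod=$3
  # Integer ceiling division
  waves=$(( (completions + parallelism - 1) / parallelism ))
  echo $(( waves * seconds_per_pod ))
}

estimate_job_seconds 5 2 5      # 3 waves of 5s  -> prints 15
estimate_job_seconds 100 6 30   # 17 waves of 30s -> prints 510
```

Real Jobs usually finish somewhat later than this estimate, because each new pod must be scheduled and started before it does any work.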
Terminal window
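# Run 10 tasks, 3 at a time
# kubectl create job has no flags for completions or parallelism, and
# spec.completions cannot be changed after the Job exists, so generate YAML first
kubectl create job batch --image=busybox --dry-run=client -o yaml -- sh -c "echo done; sleep 2" > batch.yaml
# Edit batch.yaml: add "completions: 10" and "parallelism: 3" under spec, then apply
kubectl apply -f batch.yaml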
# Or create with YAML
cat << 'EOF' | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: parallel-job
spec:
  completions: 10
  parallelism: 3
  template:
    spec:
      containers:
      - name: worker
        image: busybox
        command: ["sh", "-c", "echo Task complete; sleep 2"]
      restartPolicy: Never
EOF
# Watch progress
kubectl get jobs parallel-job -w

Pause and predict: A Job with activeDeadlineSeconds: 60 and backoffLimit: 10 runs a container that takes 15 seconds per attempt and always fails. Will the Job hit the backoff limit or the deadline first? How many pods will be created?

backoffLimit controls how many times a failed Job is retried before it is marked as failed (default: 6):

apiVersion: batch/v1
kind: Job
metadata:
  name: failing-job
spec:
  backoffLimit: 3    # Retry 3 times, then fail
  template:
    spec:
      containers:
      - name: fail
        image: busybox
        command: ["sh", "-c", "exit 1"]   # Always fails
      restartPolicy: Never

activeDeadlineSeconds sets the maximum time the Job may run, regardless of how many retries remain:

apiVersion: batch/v1
kind: Job
metadata:
  name: timeout-job
spec:
  activeDeadlineSeconds: 60    # Kill job after 60 seconds
  template:
    spec:
      containers:
      - name: long-task
        image: busybox
        command: ["sleep", "120"]   # Tries to run 2 minutes
      restartPolicy: Never
Terminal window
# Job status
kubectl get job myjob
# NAME COMPLETIONS DURATION AGE
# myjob 3/5 2m 5m
# Detailed status
kubectl describe job myjob | grep -A5 "Pods Statuses"
# Check failed pods
kubectl get pods -l job-name=myjob --field-selector=status.phase=Failed
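
When a Job fails, the reason on its Failed condition tells you which limit was hit. The helper below is a minimal sketch that maps the two reasons relevant to this module (BackoffLimitExceeded and DeadlineExceeded) to a next step. The function name `explain_job_failure` is hypothetical, and the kubectl line is left as a comment because it needs a live cluster.

```shell
# Map the reason on a Job's Failed condition to a suggested next step.
explain_job_failure() {
  case "$1" in
    BackoffLimitExceeded)
      echo "Retries exhausted: check pod logs and exit codes, then fix the command or raise backoffLimit" ;;
    DeadlineExceeded)
      echo "Ran longer than activeDeadlineSeconds: raise the deadline or speed up the task" ;;
    "")
      echo "No Failed condition: the Job is still running or has completed" ;;
    *)
      echo "Unrecognized reason: $1 (see kubectl describe job)" ;;
  esac
}

# Against a live cluster (myjob is the example name used above):
# reason=$(kubectl get job myjob -o jsonpath='{.status.conditions[?(@.type=="Failed")].reason}')
# explain_job_failure "$reason"
explain_job_failure DeadlineExceeded
```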

A CronJob creates Jobs on a schedule, like cron in Linux.

┌────────────────────────────────────────────────────────────────┐
│                            CronJob                             │
│                                                                │
│   Schedule: "0 * * * *" (hourly)                               │
│                                                                │
│   1:00 ──► Creates Job ──► Creates Pod ──► Completes           │
│   2:00 ──► Creates Job ──► Creates Pod ──► Completes           │
│   3:00 ──► Creates Job ──► Creates Pod ──► Completes           │
│   ...                                                          │
│                                                                │
└────────────────────────────────────────────────────────────────┘
┌───────────── minute (0 - 59)
│ ┌───────────── hour (0 - 23)
│ │ ┌───────────── day of month (1 - 31)
│ │ │ ┌───────────── month (1 - 12)
│ │ │ │ ┌───────────── day of week (0 - 6) (Sunday = 0)
│ │ │ │ │
* * * * *

| Schedule | Description |
|---|---|
| `* * * * *` | Every minute |
| `0 * * * *` | Every hour |
| `0 0 * * *` | Every day at midnight |
| `0 0 * * 0` | Every Sunday at midnight |
| `*/5 * * * *` | Every 5 minutes |
| `0 9-17 * * 1-5` | Every hour on the hour from 9:00 to 17:00, Mon-Fri |

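Because a swapped minute and hour field still parses as a valid schedule, it can help to label each field before applying a CronJob. The sketch below is a small helper written for illustration (the name `cron_fields` is hypothetical). It only splits and labels the five fields; it does not check that the values are in range.

```shell
# Label the five fields of a cron expression; reject anything else.
cron_fields() {
  # read splits on whitespace without glob-expanding "*"
  read -r minute hour dom month dow extra <<EOF
$1
EOF
  if [ -z "$dow" ] || [ -n "$extra" ]; then
    echo "expected exactly 5 fields" >&2
    return 1
  fi
  printf 'minute=%s hour=%s day-of-month=%s month=%s day-of-week=%s\n' \
    "$minute" "$hour" "$dom" "$month" "$dow"
}

cron_fields "0 2 * * *"
# prints: minute=0 hour=2 day-of-month=* month=* day-of-week=*
```
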
apiVersion: batch/v1
kind: CronJob
metadata:
  name: backup
spec:
  schedule: "0 2 * * *"            # Daily at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: backup
            image: busybox
            command: ["sh", "-c", "echo Backup started; sleep 10; echo Backup done"]
          restartPolicy: OnFailure
  successfulJobsHistoryLimit: 3    # Keep 3 successful job records
  failedJobsHistoryLimit: 1        # Keep 1 failed job record
Terminal window
# Create CronJob imperatively
kubectl create cronjob backup --image=busybox --schedule="0 2 * * *" -- sh -c "echo Backup done"
# Generate YAML
kubectl create cronjob backup --image=busybox --schedule="*/5 * * * *" --dry-run=client -o yaml -- echo "hello"
Terminal window
# List CronJobs
kubectl get cronjobs
kubectl get cj # Short form
# Describe
kubectl describe cronjob backup
# Manually trigger a job from CronJob
kubectl create job --from=cronjob/backup backup-manual
# Suspend CronJob
kubectl patch cronjob backup -p '{"spec":{"suspend":true}}'
# Resume CronJob
kubectl patch cronjob backup -p '{"spec":{"suspend":false}}'
# Delete CronJob (also deletes Jobs it created)
kubectl delete cronjob backup

Stop and think: You have a CronJob that runs a database backup every hour, but sometimes the backup takes 90 minutes. With the default concurrencyPolicy: Allow, two backup jobs would overlap. What could go wrong with concurrent backups, and which concurrency policy would you choose instead?

spec:
  concurrencyPolicy: Allow       # Default - allow concurrent jobs
  # concurrencyPolicy: Forbid    # Skip if previous still running
  # concurrencyPolicy: Replace   # Kill previous, start new

| Policy | Behavior |
|---|---|
| Allow | Multiple Jobs can run simultaneously |
| Forbid | Skip new Job if previous still running |
| Replace | Kill running Job, start new one |

Exam Tip

For scheduled backup tasks, use concurrencyPolicy: Forbid to prevent overlapping runs. When only the most recent run matters (for example, refreshing a cache), Replace may be the better choice.
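
To show where concurrencyPolicy sits in a full manifest, here is a minimal sketch of an hourly backup CronJob that forbids overlap. The name `backup-no-overlap` and the file name are illustrative, and the container command is the placeholder used earlier in this module rather than a real backup.

```shell
# Write the manifest to a file so it can be reviewed before applying.
cat << 'EOF' > backup-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: backup-no-overlap        # hypothetical name
spec:
  schedule: "0 * * * *"          # Hourly
  concurrencyPolicy: Forbid      # Skip a run if the previous Job is still active
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: backup
            image: busybox
            command: ["sh", "-c", "echo Backup started; sleep 10; echo Backup done"]
          restartPolicy: OnFailure
EOF
# Apply with: kubectl apply -f backup-cronjob.yaml
```

Note that concurrencyPolicy is a sibling of schedule under the CronJob spec, not part of jobTemplate.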


| Issue | Symptom | Debug Command |
|---|---|---|
| Image pull failure | Pod in ImagePullBackOff | kubectl describe pod <pod> |
| Command failure | Job never completes | kubectl logs job/<job-name> |
| Timeout | Job killed | Check activeDeadlineSeconds |
| Too many retries | Multiple failed pods | Check backoffLimit |

Terminal window
# 1. Check job status
kubectl get job myjob
kubectl describe job myjob
# 2. Find pods created by job
kubectl get pods -l job-name=myjob
# 3. Check pod logs
kubectl logs <pod-name>
kubectl logs job/myjob # Auto-selects a pod
# 4. If still running, exec into pod
kubectl exec -it <pod-name> -- /bin/sh
# 5. Check events
kubectl get events --field-selector involvedObject.name=myjob
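
Exit codes often narrow down a failure before you read any logs. The helper below is a sketch based on common shell conventions (126, 127, and 128 plus the signal number); applications are free to use their own codes, so treat the output as a hint. The function name `explain_exit_code` is hypothetical, and the kubectl line is commented out because it needs a live cluster.

```shell
# Interpret a container exit code using common shell conventions.
explain_exit_code() {
  code=$1
  if [ "$code" -eq 0 ]; then
    echo "success"
  elif [ "$code" -eq 126 ]; then
    echo "command found but not executable"
  elif [ "$code" -eq 127 ]; then
    echo "command not found"
  elif [ "$code" -gt 128 ]; then
    echo "killed by signal $(( code - 128 ))"
  else
    echo "application error"
  fi
}

# Read the exit code of a finished pod's first container:
# kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].state.terminated.exitCode}'
explain_exit_code 137   # prints: killed by signal 9
```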

  • Jobs don’t auto-delete by default. Set ttlSecondsAfterFinished to auto-cleanup completed Jobs.

  • CronJob schedules are evaluated in the kube-controller-manager’s time zone (usually UTC) unless you set spec.timeZone (stable since Kubernetes 1.27). Plan schedules accordingly.

  • Job pods remain after completion for log inspection. Delete the Job to clean up pods.

  • Indexed Jobs (Kubernetes 1.21+) assign unique indexes to pods for parallel processing patterns.
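
The Indexed Job note can be sketched as follows. This is a minimal example, assuming a cluster version where Indexed completion mode is available; the name `indexed-demo`, the file name, and the echo command are illustrative. Each pod receives its index in the JOB_COMPLETION_INDEX environment variable, and ttlSecondsAfterFinished handles the cleanup mentioned in the first note.

```shell
# Write the manifest to a file so it can be reviewed before applying.
cat << 'EOF' > indexed-demo.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: indexed-demo             # hypothetical name
spec:
  completionMode: Indexed        # Pods get indexes 0 to completions-1
  completions: 3
  parallelism: 3
  ttlSecondsAfterFinished: 120   # Auto-delete the Job 2 minutes after it finishes
  template:
    spec:
      containers:
      - name: worker
        image: busybox
        command: ["sh", "-c", "echo Processing shard $JOB_COMPLETION_INDEX"]
      restartPolicy: Never
EOF
# Apply with: kubectl apply -f indexed-demo.yaml
```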


| Mistake | Problem | Solution |
|---|---|---|
| Using restartPolicy: Always | Job is rejected (Always is invalid for Jobs) | Use Never or OnFailure |
| Forgetting backoffLimit | Job retries up to 6 times (the default) before failing | Set an appropriate backoffLimit |
| Wrong cron syntax | Job never triggers | Verify with crontab.guru |
| Not checking logs | Unknown failure cause | Always check kubectl logs job/name |
| CronJob overlap | Resource contention | Set concurrencyPolicy: Forbid |

  1. A developer creates a Job with restartPolicy: Always and wonders why it gets rejected. They argue that retrying should mean restarting. Explain why Always is invalid for Jobs and describe the practical difference between Never and OnFailure for a Job that might fail.

    Answer: `restartPolicy: Always` is invalid for Jobs because it would create a pod that never terminates -- the kubelet would restart the container forever, and the Job could never reach a "completed" state. Jobs need pods to eventually exit. With `Never`, each failure creates a new pod (the old failed pod stays for log inspection), so with `backoffLimit: 4` you might see 5 pods total. With `OnFailure`, the same pod's container is restarted in place, so you see only 1 pod but with multiple restarts. Use `Never` when you need to inspect failed pod logs side-by-side; use `OnFailure` to keep your pod count clean.
  2. Your data pipeline needs to process 100 items. Each item takes about 30 seconds. You want to finish in under 10 minutes. Design the Job spec with appropriate completions and parallelism values, and explain what happens if one of the parallel pods fails halfway through.

    Answer: Set `completions: 100` and `parallelism: 6` (or higher). With 6 pods running in parallel, each taking 30 seconds, you can complete 100 items in roughly `ceil(100/6) * 30s = 510s` (about 8.5 minutes), safely under 10 minutes. If one pod fails, the Job controller creates a replacement pod to redo that specific completion (failed completions don't count toward the 100). The `backoffLimit` controls how many total failures are tolerated before the Job is marked as failed. Set it high enough to handle transient failures (e.g., `backoffLimit: 10`) but not so high that a systematic bug creates hundreds of failed pods.
  3. It’s 3 AM and your on-call pager fires because a CronJob-created backup hasn’t run. The CronJob schedule is 0 2 * * * (daily at 2 AM). You run kubectl get cronjobs and see LAST SCHEDULE: <none>. How do you investigate, and how do you immediately trigger the backup while you fix the root cause?

    Answer: First, check if the CronJob is suspended: `kubectl get cronjob backup -o yaml | grep suspend`. If `suspend: true`, that explains it. Next, check `kubectl describe cronjob backup` for events -- the CronJob controller may have logged failures. Also verify the cron schedule syntax is correct (a common mistake is swapping minute/hour fields). To trigger the backup immediately while investigating, run `kubectl create job --from=cronjob/backup backup-emergency`. This creates a Job using the CronJob's template without waiting for the next scheduled time. After the emergency run succeeds, fix the root cause (unsuspend, fix schedule, or check RBAC permissions).
  4. You have a CronJob that runs every 5 minutes to aggregate metrics, but sometimes the aggregation takes 7 minutes. With concurrencyPolicy: Allow (the default), overlapping runs are causing duplicate data. You switch to Forbid, but now some scheduled runs are being skipped entirely. What is the trade-off between Forbid and Replace, and which would you choose for this use case?

    Answer: With `Forbid`, the new scheduled run is silently skipped if the previous is still running. You avoid duplicates but miss data from the skipped interval. With `Replace`, the running Job is terminated and a new one starts fresh, which means the in-progress aggregation is lost but you always have the most recent run executing. For a metrics aggregation use case, `Forbid` is usually better because the long-running job will eventually complete and cover that interval's data. `Replace` would waste the 7 minutes of work already done. However, the real fix is to optimize the aggregation to finish within 5 minutes, or change the schedule to every 10 minutes to prevent overlap entirely.

Task: Create Jobs and CronJobs, handle failures.

Steps:

  1. Create a simple Job:
Terminal window
kubectl create job hello --image=busybox -- echo "Hello from job"
kubectl get jobs
kubectl logs job/hello
kubectl delete job hello
  2. Create a Job with completions:
Terminal window
cat << 'EOF' | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-processor
spec:
  completions: 5
  parallelism: 2
  template:
    spec:
      containers:
      - name: processor
        image: busybox
        command: ["sh", "-c", "echo Processing $(hostname); sleep 3"]
      restartPolicy: Never
EOF
kubectl get jobs batch-processor -w # Watch completions
kubectl get pods -l job-name=batch-processor
kubectl delete job batch-processor
  3. Create a failing Job:
Terminal window
cat << 'EOF' | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: failing-job
spec:
  backoffLimit: 2
  template:
    spec:
      containers:
      - name: fail
        image: busybox
        command: ["sh", "-c", "echo 'About to fail'; exit 1"]
      restartPolicy: Never
EOF
kubectl get jobs failing-job -w
kubectl get pods -l job-name=failing-job # Multiple failed pods
kubectl logs job/failing-job
kubectl delete job failing-job
  4. Create a CronJob:
Terminal window
kubectl create cronjob minute-job --image=busybox --schedule="*/1 * * * *" -- date
# Wait for it to run
sleep 70
kubectl get cronjobs
kubectl get jobs
kubectl logs job/<job-name> # Use actual job name
kubectl delete cronjob minute-job
  5. Manually trigger the CronJob:
Terminal window
kubectl create cronjob backup --image=busybox --schedule="0 0 * * *" -- echo "backup"
# Trigger manually
kubectl create job --from=cronjob/backup backup-now
kubectl get jobs
kubectl logs job/backup-now
kubectl delete cronjob backup
kubectl delete job backup-now

Success Criteria:

  • Can create Jobs imperatively and declaratively
  • Understand completions and parallelism
  • Can debug failed Jobs
  • Can create CronJobs
  • Can manually trigger CronJobs

Drill 1: Job Creation Speed Test (Target: 2 minutes)

Terminal window
# Create job
kubectl create job quick --image=busybox -- echo "done"
# Wait for completion
kubectl wait --for=condition=complete job/quick --timeout=60s
# Check logs
kubectl logs job/quick
# Cleanup
kubectl delete job quick

Drill 2: Parallel Job

Terminal window
cat << 'EOF' | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: parallel
spec:
  completions: 6
  parallelism: 3
  template:
    spec:
      containers:
      - name: worker
        image: busybox
        command: ["sh", "-c", "echo Pod: $HOSTNAME; sleep 5"]
      restartPolicy: Never
EOF
# Watch
kubectl get pods -l job-name=parallel -w &
kubectl get job parallel -w &
sleep 30
kill %1 %2 2>/dev/null
# Cleanup
kubectl delete job parallel

Drill 3: Job with Timeout (Target: 3 minutes)

Terminal window
cat << 'EOF' | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: timeout-test
spec:
  activeDeadlineSeconds: 10
  template:
    spec:
      containers:
      - name: long-task
        image: busybox
        command: ["sleep", "60"]
      restartPolicy: Never
EOF
# Watch job timeout
kubectl get job timeout-test -w &
sleep 15
kill %1 2>/dev/null
# Check status
kubectl describe job timeout-test | grep -A3 "Conditions"
# Cleanup
kubectl delete job timeout-test

Drill 4: CronJob Creation (Target: 2 minutes)

Terminal window
# Create CronJob
kubectl create cronjob every-minute --image=busybox --schedule="*/1 * * * *" -- date
# Verify
kubectl get cronjob every-minute
# Wait for first run
sleep 70
# Check jobs created
kubectl get jobs
# Cleanup
kubectl delete cronjob every-minute

Drill 5: Manual CronJob Trigger (Target: 2 minutes)

Terminal window
# Create CronJob (won't run for a while)
kubectl create cronjob daily --image=busybox --schedule="0 0 * * *" -- echo "daily task"
# Trigger manually
kubectl create job --from=cronjob/daily daily-manual-run
# Check
kubectl get jobs
kubectl logs job/daily-manual-run
# Cleanup
kubectl delete cronjob daily
kubectl delete job daily-manual-run

Drill 6: Troubleshooting Failed Job (Target: 5 minutes)

Terminal window
# Create intentionally broken job
cat << 'EOF' | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: broken
spec:
  backoffLimit: 2
  template:
    spec:
      containers:
      - name: app
        image: busybox
        command: ["sh", "-c", "cat /nonexistent/file"]
      restartPolicy: Never
EOF
# Diagnose
kubectl get job broken
kubectl get pods -l job-name=broken
kubectl describe job broken
kubectl logs job/broken
# Answer: What's the error? How would you fix it?
# Cleanup
kubectl delete job broken

Drill 7: Challenge - Complete Job Workflow


Create a Job that:

  1. Runs 4 completions, 2 at a time
  2. Each pod echoes its hostname and sleeps 3 seconds
  3. Has a backoff limit of 2
  4. Automatically deletes after 60 seconds
Terminal window
# YOUR TASK: Create this Job
Solution
Terminal window
cat << 'EOF' | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: challenge-job
spec:
  completions: 4
  parallelism: 2
  backoffLimit: 2
  ttlSecondsAfterFinished: 60
  template:
    spec:
      containers:
      - name: worker
        image: busybox
        command: ["sh", "-c", "echo $HOSTNAME; sleep 3"]
      restartPolicy: Never
EOF
kubectl get job challenge-job -w

Module 2.5: Resource Management - Requests, limits, and QoS classes.