Module 6.3: Velero
Toolkit Track | Complexity: [MEDIUM] | Time: 40-45 minutes
Overview
Backups are like insurance: you hope you never need them, but you’ll be glad you have them when disaster strikes. Velero provides backup and disaster recovery for Kubernetes clusters, including resources, persistent volumes, and the ability to migrate workloads between clusters.
What You’ll Learn:
- Velero architecture and backup strategies
- Scheduled backups and retention policies
- Disaster recovery procedures
- Cluster migration patterns
Prerequisites:
- Kubernetes resources and persistent volumes
- SRE Discipline — Disaster recovery concepts
- Cloud storage basics (S3, GCS, Azure Blob)
What You’ll Be Able to Do
After completing this module, you will be able to:
- Deploy Velero for Kubernetes cluster backup with scheduled backups and object storage backends
- Configure Velero backup schedules with resource filtering, namespace selection, and TTL policies
- Implement disaster recovery workflows with Velero restore operations and cross-cluster migration
- Secure Velero backups with encryption, RBAC restrictions, and backup validation testing procedures
Why This Module Matters
“We’ll restore from etcd backup” sounds good until you realize you also need the PVs, the Secrets, and the correct order of restoration. Velero provides application-aware backups, not just etcd snapshots. It backs up what you need to actually restore a working application.
💡 Did You Know? Velero was originally called “Heptio Ark” and was created by Heptio (founded by Kubernetes co-creators Joe Beda and Craig McLuckie). After VMware acquired Heptio, the project was renamed Velero (Spanish for “sailboat”). It’s now a CNCF sandbox project used by thousands of organizations for Kubernetes disaster recovery.
Backup Strategies
```
KUBERNETES BACKUP APPROACHES
════════════════════════════════════════════════════════════════════

1. ETCD BACKUP (infrastructure level)
─────────────────────────────────────────────────────────────────
• Backs up cluster state database
• All resources, all namespaces
• Doesn't include PV data
• Requires cluster access to restore
• Good for: cluster-level disaster recovery

2. VELERO (application level)
─────────────────────────────────────────────────────────────────
• Backs up selected resources
• Includes PV snapshots
• Namespace-aware
• Can restore to different cluster
• Good for: application DR, migration, namespace backup

3. GITOPS (configuration level)
─────────────────────────────────────────────────────────────────
• Git is the source of truth
• Manifests stored in version control
• Doesn't include runtime state
• Doesn't include PV data
• Good for: configuration recovery, multi-cluster sync

RECOMMENDED: Use ALL THREE
─────────────────────────────────────────────────────────────────
• etcd backup: cluster-level recovery
• Velero: application + data recovery
• GitOps: configuration source of truth
```
Architecture
```
VELERO ARCHITECTURE
════════════════════════════════════════════════════════════════════

┌─────────────────────────────────────────────────────────────────┐
│                       KUBERNETES CLUSTER                        │
│                                                                 │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │                       VELERO SERVER                        │ │
│  │                                                            │ │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐         │ │
│  │  │   Backup    │  │   Restore   │  │  Schedule   │         │ │
│  │  │ Controller  │  │ Controller  │  │ Controller  │         │ │
│  │  └─────────────┘  └─────────────┘  └─────────────┘         │ │
│  │                                                            │ │
│  │  ┌─────────────────────────────────────────────────────┐   │ │
│  │  │  BackupStorageLocation                              │   │ │
│  │  │  VolumeSnapshotLocation                             │   │ │
│  │  └─────────────────────────────────────────────────────┘   │ │
│  └────────────────────────────────────────────────────────────┘ │
│                              │                                  │
└──────────────────────────────┼──────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────┐
│                          CLOUD STORAGE                          │
│                                                                 │
│  ┌─────────────────────┐     ┌─────────────────────┐            │
│  │   Object Storage    │     │  Volume Snapshots   │            │
│  │   (S3, GCS, etc)    │     │  (EBS, GCE PD)      │            │
│  │                     │     │                     │            │
│  │ • Backup metadata   │     │ • PV data copies    │            │
│  │ • Resource YAMLs    │     │ • Point-in-time     │            │
│  │ • Tarball of data   │     │                     │            │
│  └─────────────────────┘     └─────────────────────┘            │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
What Gets Backed Up
| Component | Default | Configurable |
|---|---|---|
| Resources | All namespaced resources | Include/exclude by type, label |
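| Cluster resources | Auto: included when all namespaces are backed up; with a namespace filter, only those tied to included resources (such as PVs bound to backed-up PVCs) | Force on or off with --include-cluster-resources (RBAC, CRDs) |
| Persistent Volumes | Snapshotted only when a volume snapshot location is configured and the volume type is supported | Control with --snapshot-volumes, or use file-level backup (Restic/Kopia) |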
| Secrets | Included | Can exclude |
| ConfigMaps | Included | Can exclude |
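The Configurable column corresponds to fields on the Backup spec. As a minimal sketch, the shell function below prints a manifest that overrides three of the defaults at once. The backup name `override-demo` and the `production` namespace are hypothetical; the spec fields are the ones Velero's Backup resource uses for these settings.

```shell
# Print a Backup manifest that overrides several defaults from the table.
# Hypothetical names: backup "override-demo", namespace "production".
render_backup_manifest() {
  cat <<'EOF'
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: override-demo
  namespace: velero
spec:
  includedNamespaces:
    - production
  includeClusterResources: true   # force cluster-scoped resources (CRDs, RBAC) in
  excludedResources:
    - secrets                     # drop a resource type that is included by default
  snapshotVolumes: false          # skip volume snapshots for this backup
EOF
}

render_backup_manifest
```

To create the backup for real, the output could be piped to `kubectl apply -f -` on a cluster where Velero is installed.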
Installation
```bash
# Install Velero CLI
brew install velero  # macOS
# or download from https://velero.io/docs/main/basic-install/

# Install Velero with AWS provider
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.8.0 \
  --bucket velero-backups \
  --backup-location-config region=us-west-2 \
  --snapshot-location-config region=us-west-2 \
  --secret-file ./credentials-velero

# Verify installation
velero version
kubectl get pods -n velero
```
Credentials File (AWS)
```ini
# credentials-velero
[default]
aws_access_key_id=YOUR_ACCESS_KEY
aws_secret_access_key=YOUR_SECRET_KEY
```
Backup Operations
Manual Backup
```bash
# Backup entire cluster
velero backup create full-backup

# Backup specific namespace
velero backup create app-backup --include-namespaces production

# Backup with volume snapshots (backup names must be unique)
velero backup create full-backup-with-volumes --snapshot-volumes

# Backup by label selector
velero backup create myapp-backup --selector app=myapp

# Backup excluding resources
velero backup create no-config-backup --exclude-resources secrets,configmaps

# Check backup status
velero backup describe full-backup
velero backup logs full-backup
```
Backup Spec (Declarative)
```yaml
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: production-backup
  namespace: velero
spec:
  includedNamespaces:
    - production
    - staging
  excludedResources:
    - events
    - events.events.k8s.io
  snapshotVolumes: true
  storageLocation: default
  volumeSnapshotLocations:
    - default
  ttl: 720h  # 30 days retention
  hooks:
    resources:
      - name: backup-hook
        includedNamespaces:
          - production
        labelSelector:
          matchLabels:
            app: database
        pre:
          - exec:
              container: postgres
              command:
                - /bin/bash
                - -c
                - pg_dump -U postgres > /backup/dump.sql
```
💡 Did You Know? Velero’s backup hooks let you run commands before and after backups. This is crucial for databases: you can flush writes, create consistent snapshots, or dump data to ensure backup consistency. Without hooks, you might back up a database mid-transaction and end up with inconsistent data.
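Hooks can also be attached to pods through annotations rather than the Backup spec, in which case the command is written as a JSON array. The sketch below builds that array in shell for the same `pg_dump` command used in the spec above. The `json_array` helper is hypothetical and assumes the arguments contain no double quotes or backslashes.

```shell
# Join arguments into a JSON array string, e.g. ["a", "b"].
# Assumption: arguments contain no double quotes or backslashes.
json_array() {
  out=""
  for arg in "$@"; do
    if [ -z "$out" ]; then
      out="\"$arg\""
    else
      out="$out, \"$arg\""
    fi
  done
  printf '[%s]\n' "$out"
}

hook_command=$(json_array /bin/bash -c "pg_dump -U postgres > /backup/dump.sql")

# Print the two pod annotations for a pre-backup hook in the postgres container.
echo "pre.hook.backup.velero.io/container=postgres"
echo "pre.hook.backup.velero.io/command=$hook_command"
```

Each printed line could then be passed to `kubectl -n production annotate pod -l app=database ...`, quoting the command annotation so the shell does not split it.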
Scheduled Backups
```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"  # 2 AM daily
  template:
    includedNamespaces:
      - production
    snapshotVolumes: true
    ttl: 168h  # 7 day retention
---
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: weekly-backup
  namespace: velero
spec:
  schedule: "0 3 * * 0"  # 3 AM Sundays
  template:
    includedNamespaces:
      - "*"  # All namespaces
    snapshotVolumes: true
    ttl: 720h  # 30 day retention
```
```bash
# Create schedule via CLI
velero schedule create daily-prod \
  --schedule="0 2 * * *" \
  --include-namespaces production \
  --snapshot-volumes \
  --ttl 168h

# List schedules
velero schedule get

# Check scheduled backup history
velero backup get | grep daily-prod
```
Restore Operations
Basic Restore
```bash
# List available backups
velero backup get

# Restore entire backup
velero restore create --from-backup full-backup

# Restore to different namespace
velero restore create --from-backup app-backup \
  --namespace-mappings production:production-restored

# Restore specific resources only
velero restore create --from-backup full-backup \
  --include-resources deployments,services

# Restore excluding volumes (resources only)
velero restore create --from-backup full-backup \
  --restore-volumes=false

# Check restore status
velero restore describe <restore-name>
velero restore logs <restore-name>
```
Disaster Recovery Procedure
```
DISASTER RECOVERY STEPS
════════════════════════════════════════════════════════════════════

1. ASSESS THE DAMAGE
─────────────────────────────────────────────────────────────────
$ kubectl get nodes
$ kubectl get pods -A
# Determine what needs recovery

2. VERIFY BACKUP AVAILABILITY
─────────────────────────────────────────────────────────────────
$ velero backup get
# Find most recent successful backup

3. IF CLUSTER IS GONE: Create new cluster
─────────────────────────────────────────────────────────────────
# Re-install Velero pointing to same backup location
$ velero install --provider aws --bucket velero-backups ...

4. RESTORE
─────────────────────────────────────────────────────────────────
$ velero restore create disaster-recovery \
    --from-backup <latest-backup>

5. VERIFY RESTORATION
─────────────────────────────────────────────────────────────────
$ kubectl get pods -A
$ kubectl get pvc -A
# Test applications
```
Volume Backup Options
CSI Snapshots (Native)
```yaml
# VolumeSnapshotLocation for provider-native volume snapshots
apiVersion: velero.io/v1
kind: VolumeSnapshotLocation
metadata:
  name: aws-default
  namespace: velero
spec:
  provider: aws
  config:
    region: us-west-2
```
File-Level Backup (Kopia/Restic)
For volumes without snapshot support:
```bash
# Install with file system backup enabled.
# --use-node-agent enables file-level backup; a comment cannot follow a
# trailing backslash, so it is documented here instead of inline.
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.8.0 \
  --bucket velero-backups \
  --use-node-agent \
  --default-volumes-to-fs-backup
```
```yaml
# Or opt in specific pod volumes with an annotation
# (fragment: pod name and containers omitted)
apiVersion: v1
kind: Pod
metadata:
  annotations:
    backup.velero.io/backup-volumes: data-volume
spec:
  volumes:
    - name: data-volume
      persistentVolumeClaim:
        claimName: my-pvc
```
Cluster Migration
```
CLUSTER MIGRATION WITH VELERO
════════════════════════════════════════════════════════════════════

SOURCE CLUSTER                    TARGET CLUSTER
─────────────────                 ─────────────────

1. Install Velero                 1. Install Velero
   (pointing to S3)                  (same S3 bucket!)

2. Create backup                  2. Wait for backup
   $ velero backup create            to sync
     migration-backup                $ velero backup get

3. Backup completes               3. Restore
   → Stored in S3                    $ velero restore create
                                       --from-backup migration-backup

Result: Application running on new cluster with data!
```
Migration Best Practices
```bash
# 1. Back up the source cluster and block until the backup finishes
velero backup create migration-backup \
  --include-namespaces app-namespace \
  --snapshot-volumes \
  --wait

# 2. Confirm the backup phase is Completed
velero backup describe migration-backup

# 3. On target cluster, verify backup is visible
velero backup get

# 4. Restore (with any needed transformations)
velero restore create migration-restore \
  --from-backup migration-backup \
  --namespace-mappings app-namespace:new-ns

# 5. Verify and test
kubectl get pods -n new-ns
```
Backup Retention and Lifecycle
```yaml
# Different retention for different backup types
---
# Daily backups - keep 7 days
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily
  namespace: velero
spec:
  schedule: "0 2 * * *"
  template:
    ttl: 168h  # 7 days
---
# Weekly backups - keep 4 weeks
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: weekly
  namespace: velero
spec:
  schedule: "0 3 * * 0"
  template:
    ttl: 672h  # 28 days
---
# Monthly backups - keep 1 year
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: monthly
  namespace: velero
spec:
  schedule: "0 4 1 * *"
  template:
    ttl: 8760h  # 365 days
```
💡 Did You Know? Velero’s TTL (Time To Live) is set at backup creation time, not schedule time. This means even if you delete a Schedule, the backups it created will remain until their individual TTL expires. Plan your retention carefully: old backups can accumulate significant storage costs.
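The TTL values in this module are written in hours, so it helps to check the arithmetic. The sketch below converts retention days to an hour-based TTL and estimates how many backups each tier holds at steady state. Both helper names are hypothetical, and the monthly count assumes roughly 30 days between backups.

```shell
# Convert a retention period in days to an hour-based TTL string.
ttl_hours() {
  echo "$(( $1 * 24 ))h"
}

# Approximate steady-state backup count: retention days / days between backups.
retained_backups() {
  echo $(( $1 / $2 ))
}

echo "daily:   ttl=$(ttl_hours 7) kept=$(retained_backups 7 1)"
echo "weekly:  ttl=$(ttl_hours 28) kept=$(retained_backups 28 7)"
echo "monthly: ttl=$(ttl_hours 365) kept=$(retained_backups 365 30)"
```

Multiplying the retained count by the typical backup size gives a rough idea of the storage each tier consumes.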
💡 Did You Know? Velero’s resource filtering is more powerful than most realize. You can back up by namespace, by label selector, by resource type, or any combination. Teams use this for “logical backups”: backing up just a single application (all resources with app=payment) rather than entire namespaces. This makes restores faster and reduces storage costs by excluding unrelated resources.
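A logical backup of this kind might combine a namespace, a label selector, and a resource exclusion in one command. The sketch below wraps the CLI in a hypothetical `velero_cmd` helper that only prints the command unless `DRY_RUN=0` is set, so it can be tried without a cluster. The backup name `payment-backup` is also an assumption.

```shell
# Echo the velero command instead of running it unless DRY_RUN=0 is set.
velero_cmd() {
  if [ "${DRY_RUN:-1}" = "0" ]; then
    velero "$@"
  else
    echo "velero $*"
  fi
}

# Logical backup: only resources labeled app=payment in one namespace,
# skipping Event objects. Backup and namespace names are hypothetical.
velero_cmd backup create payment-backup \
  --include-namespaces production \
  --selector app=payment \
  --exclude-resources events
```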
Common Mistakes
| Mistake | Problem | Solution |
|---|---|---|
| No PV snapshots | Data lost on restore | Use --snapshot-volumes or file-level backup |
| Not testing restores | Discover issues during disaster | Regular restore drills (monthly) |
| Single backup location | Backup lost with region | Cross-region replication |
| No backup hooks | Database corruption | Use pre/post hooks for consistency |
| Backing up too much | Slow backups, high costs | Use include/exclude filters |
| TTL too short | Can’t recover from old corruption | Keep long-term backups (monthly/yearly) |
War Story: The Backup That Wasn’t
A team had Velero running for 6 months. Their cluster crashed. They tried to restore. Nothing happened.
What went wrong:
- Velero server had crashed 3 months ago (OOM)
- No monitoring on Velero pods
- No alerts on backup failures
- They assumed backups were happening
The fix:
```yaml
# Alert on backup failures (increase() so the alert clears once failures
# stop, instead of firing forever after the first one)
- alert: VeleroBackupFailure
  expr: increase(velero_backup_failure_total[1h]) > 0
  labels:
    severity: critical
  annotations:
    summary: "Velero backup failed"

# Alert on no recent backups
- alert: VeleroNoRecentBackup
  expr: time() - velero_backup_last_successful_timestamp > 86400
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "No successful backup in 24 hours"
```
Lesson: Monitor your backups. Test your restores. Backups you can’t restore from aren’t backups.
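The "no recent backup" rule is a simple age check: current time minus the last successful timestamp, compared against 86400 seconds. As a sketch, the same comparison can be run from a script or cron job where Prometheus is not available. The `backup_is_stale` helper is hypothetical, and how the last-success timestamp is obtained is left to the reader.

```shell
# Succeed (exit 0) when the last successful backup is older than max_age seconds.
# Arguments: last_success_epoch now_epoch [max_age_seconds, default 86400]
backup_is_stale() {
  last_success=$1
  now=$2
  max_age=${3:-86400}
  [ $(( now - last_success )) -gt "$max_age" ]
}

now=$(date +%s)
last=$(( now - 3 * 86400 ))   # pretend the last success was three days ago
if backup_is_stale "$last" "$now"; then
  echo "STALE: no successful backup in the last 24 hours"
fi
```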
Question 1
What’s the difference between Velero backup and etcd backup?
Show Answer
etcd backup:
- Backs up entire cluster state database
- All resources, all namespaces
- Doesn’t include PV data
- Must restore to same cluster (or identical)
- Infrastructure-level backup
Velero backup:
- Backs up selected resources (configurable)
- Can include PV data (snapshots or file-level)
- Can restore to different cluster
- Application-level backup
- Supports hooks for application consistency
Use both: etcd for cluster-level DR, Velero for application-level.
Question 2
How do you back up persistent volume data with Velero?
Show Answer
Two methods:
1. CSI Snapshots (native):
```bash
velero backup create <backup-name> --snapshot-volumes
```
- Uses cloud provider’s snapshot API
- Fast, point-in-time
- Requires CSI snapshot support
2. File-level backup (Kopia/Restic):
```bash
velero install --use-node-agent --default-volumes-to-fs-backup
```
- Copies files from PV to object storage
- Works with any storage
- Slower but more compatible
Choose based on your storage provider’s capabilities.
Question 3
Why should you test restores regularly?
Show Answer
Reasons to test restores:
- Verify backups are actually complete
- Validate restore procedure works
- Measure restore time (RTO)
- Find issues before real disaster
- Train team on recovery procedures
What can go wrong:
- Backup corrupted
- Missing volumes
- Wrong permissions
- Changed dependencies
- Network/storage issues
Recommendation: Run a monthly restore drill into a test environment and document the findings.
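Measuring restore time during a drill gives a concrete number to compare against your RTO. The sketch below times an arbitrary command in whole seconds. The `time_command` helper is hypothetical, and `sleep 1` stands in for the real restore so the example runs without a cluster.

```shell
# Run a command and print how many whole seconds it took.
# The command's own stdout is sent to stderr so only the number is captured.
time_command() {
  start=$(date +%s)
  "$@" >&2
  status=$?
  end=$(date +%s)
  echo $(( end - start ))
  return "$status"
}

# Stand-in for a real drill so the sketch runs anywhere.
elapsed=$(time_command sleep 1)
echo "restore drill took ${elapsed}s"

# In a real drill (hypothetical names), something like:
#   time_command velero restore create drill-restore --from-backup demo-backup --wait
```

Recording the number after each drill shows whether restore time is drifting as the data grows.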
Hands-On Exercise
Objective
Set up Velero, create backups, and perform a restore.
Environment Setup
```bash
# For local testing, use Velero with MinIO
kubectl apply -f https://raw.githubusercontent.com/vmware-tanzu/velero/main/examples/minio/00-minio-deployment.yaml

# credentials-minio (create this file before running velero install):
# [default]
# aws_access_key_id = minio
# aws_secret_access_key = minio123

# Install Velero with MinIO backend
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.8.0 \
  --bucket velero \
  --secret-file ./credentials-minio \
  --use-volume-snapshots=false \
  --backup-location-config region=minio,s3ForcePathStyle="true",s3Url=http://minio.velero.svc:9000
```
1. Deploy sample app:
```bash
kubectl create namespace demo
kubectl -n demo run nginx --image=nginx
kubectl -n demo expose pod nginx --port=80
kubectl -n demo create configmap app-config --from-literal=key=value
```
2. Create backup:
```bash
velero backup create demo-backup --include-namespaces demo
velero backup describe demo-backup
```
3. Simulate disaster (delete namespace):
```bash
kubectl delete namespace demo
kubectl get namespace demo  # Should be gone
```
4. Restore from backup:
```bash
velero restore create --from-backup demo-backup
velero restore describe <restore-name>
```
5. Verify restoration:
```bash
kubectl get pods -n demo
kubectl get svc -n demo
kubectl get configmap -n demo
```
6. Create scheduled backup (this cron expression runs at the top of every hour, so the schedule is named accordingly):
```bash
velero schedule create demo-hourly \
  --schedule="0 * * * *" \
  --include-namespaces demo \
  --ttl 24h
```
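Schedule expressions use the standard five cron fields, and it is easy to misread which field is which. The sketch below labels the fields of an expression so the cadence can be confirmed before creating a schedule. The `describe_cron` helper is hypothetical and only splits the string; it does not validate it.

```shell
# Label the five fields of a cron expression.
describe_cron() {
  set -f                 # stop * from expanding to filenames
  set -- $1              # split the expression on whitespace
  set +f
  echo "minute=$1 hour=$2 day-of-month=$3 month=$4 day-of-week=$5"
}

describe_cron "0 * * * *"   # every hour, on the hour
describe_cron "0 2 * * *"   # 02:00 every day
describe_cron "0 3 * * 0"   # 03:00 on Sundays
```

Seeing `hour=*` in the first result makes it clear that the exercise schedule fires hourly rather than once a day.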
Success Criteria
- Velero installed and running
- Backup created successfully
- Namespace deleted (simulated disaster)
- Restore completed
- All resources recovered (pods, services, configmaps)
- Schedule created
Bonus Challenge
Set up backup hooks to run a command before backing up (simulating a database dump).
Further Reading
Next Module
Continue to Platforms Toolkit to learn about Backstage, Crossplane, and cert-manager for internal developer platforms.
“Hope is not a strategy. Backups are. Test them.”