Module 8.5: Disaster Recovery: RTO/RPO for Kubernetes
Complexity: [COMPLEX]
Time to Complete: 2.5 hours
Prerequisites: Module 8.1: Multi-Account Architecture, experience operating at least one Kubernetes cluster in production
Track: Advanced Cloud Operations
What You’ll Be Able to Do
After completing this module, you will be able to:
- Design disaster recovery architectures for Kubernetes with defined RTO/RPO targets across cloud regions
- Implement Velero-based backup and restore strategies for cluster state, persistent volumes, and application data
- Configure cross-region replication for etcd snapshots, container images, and persistent storage volumes
- Deploy automated DR failover runbooks that validate recovery procedures through regular chaos testing
Why This Module Matters
January 2017. GitLab.
A database engineer executed an rm -rf command on what they believed was a replicated staging database directory. It was the production PostgreSQL data directory. The primary database for GitLab.com was gone. The team immediately looked to their backup systems: five different backup methods were configured. None of them worked. LVM snapshots had never been verified. pg_dump hadn’t run successfully in months due to a silent error. Azure disk snapshots were untested. The S3 backup process was only partially configured. The only thing that saved GitLab was a staging database that happened to be a six-hour-old copy of production, created for an unrelated test.
GitLab lost six hours of database data: merge requests, comments, issues, and snippets, affecting roughly 5,000 projects. They also lost something harder to quantify: customer trust. The incident became a case study in every disaster recovery talk for the next five years.
The lesson is not “have backups.” Every team has backups. The lesson is: untested backups are not backups. And in the Kubernetes world, the situation is even more complex. Your cluster state lives in etcd. Your application state lives in databases, PersistentVolumes, and external services. Your configuration lives in Git repos and Helm releases. A real disaster recovery plan must account for all of these, with tested procedures and clear targets for how fast you recover (RTO) and how much data you can afford to lose (RPO).
RTO and RPO: The Two Numbers That Define Your DR Strategy
Every disaster recovery plan starts with two numbers. Get these wrong, and everything downstream is wrong.

```mermaid
flowchart LR
    A[Last Backup or Snapshot] -->|RPO: Data Loss Window| B[Disaster Occurs]
    B -->|RTO: Downtime Window| C[Service Restored]
```

RPO (Recovery Point Objective): “How much data can we afford to lose?”
- RPO = 0: No data loss (synchronous replication)
- RPO = 1 hour: Lose up to 1 hour of data (hourly backups)
- RPO = 24 hours: Lose up to 1 day (daily backups)
RTO (Recovery Time Objective): “How long can we be down?”
- RTO = 0: No downtime (active-active, covered in Module 8.6)
- RTO = 15 min: Quick failover (warm standby)
- RTO = 4 hours: Cold standby or backup restore
- RTO = 24 hours: Full rebuild from IaC
Mapping RTO/RPO to DR Strategies
| Strategy | RTO | RPO | Cost | Complexity |
|---|---|---|---|---|
| Backup & Restore | 4-24 hours | Hours to days | Low ($) | Low |
| Pilot Light | 30-60 min | Minutes to hours | Medium ($$) | Medium |
| Warm Standby | 5-15 min | Seconds to minutes | High ($$$) | High |
| Active-Active | Near-zero | Near-zero | Very High ($$$$) | Very High |
The War Story: When RTO Doesn’t Match Reality
A retail company set their RTO at 4 hours for their Kubernetes platform. During their annual DR test, the actual recovery took 11 hours. Why?
- Restoring etcd from snapshot: 20 minutes (as expected)
- Waiting for nodes to rejoin and pods to schedule: 45 minutes (expected 15)
- Discovering that three CRDs were missing from the backup: 90 minutes to figure out
- DNS propagation for the new cluster endpoint: 35 minutes
- Discovering that PersistentVolume claims couldn’t bind because the storage class didn’t exist: 60 minutes
- Application health checks failing because the database connection string pointed to the old cluster: 120 minutes
- Load testing to verify the restored cluster could handle production traffic: 90 minutes
The lesson: your RTO should be based on tested recovery time, not theoretical recovery time. Add a 2x safety margin to your best-case test result.
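The arithmetic behind that lesson can be sketched in a few lines of shell. The step durations are the ones itemized in the war story above; `sum_minutes` and `rto_with_margin` are hypothetical helper names used only for illustration, not part of any tool.

```bash
#!/usr/bin/env bash

# Sum a list of step durations (in minutes) recorded during a DR test.
sum_minutes() {
  local total=0 m
  for m in "$@"; do
    total=$(( total + m ))
  done
  echo "$total"
}

# Apply the 2x safety margin to a measured recovery time (in minutes).
rto_with_margin() {
  echo $(( $1 * 2 ))
}

# Steps itemized in the war story, in minutes: etcd restore, node rejoin,
# missing CRDs, DNS propagation, storage class, connection string, load test.
itemized=$(sum_minutes 20 45 90 35 60 120 90)
echo "Itemized steps: ${itemized} min"                    # prints 460 min

# The test took 11 hours end to end, so the target is based on that figure.
measured=$(( 11 * 60 ))
echo "RTO target: $(rto_with_margin "$measured") min"     # prints 1320 min
```

The itemized steps add up to 460 minutes (7 hours 40 minutes) of the 660-minute test. The list does not break down the remaining time, which is one more reason to base the RTO on the measured end-to-end duration rather than on a sum of expected steps.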
Pause and predict: Your database performs asynchronous replication to a DR region with an average lag of 5 minutes. You also take full database snapshots every 12 hours. If your primary region completely fails and you promote the replica in the DR region, what is your actual RPO? If the replica also fails and you must restore from the last snapshot, how does your RPO change?
etcd Backup and Restore
For self-managed Kubernetes clusters (kubeadm, kOps, Rancher), etcd is the single source of truth. Lose etcd, lose your cluster.
etcd Snapshot Backup
```bash
# Take a snapshot of etcd (run on a control plane node)
ETCDCTL_API=3 etcdctl snapshot save /var/backups/etcd/snapshot-$(date +%Y%m%d-%H%M%S).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Verify the snapshot
ETCDCTL_API=3 etcdctl snapshot status /var/backups/etcd/snapshot-20260324-100000.db \
  --write-out=table

# Expected output:
# +----------+----------+------------+------------+
# |   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
# +----------+----------+------------+------------+
# | 3e6d0a12 | 15284032 |      12847 |      42 MB |
# +----------+----------+------------+------------+

# Upload to S3 for off-site storage
aws s3 cp /var/backups/etcd/snapshot-20260324-100000.db \
  s3://company-etcd-backups/prod-cluster/$(date +%Y/%m/%d)/ \
  --sse aws:kms \
  --sse-kms-key-id alias/etcd-backup-key
```

Automated etcd Backup with CronJob
```yaml
# For self-managed clusters, run etcd backup as a CronJob
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 */4 * * *"  # Every 4 hours
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          nodeName: control-plane-1  # Pin to control plane
          hostNetwork: true
          tolerations:
            - key: node-role.kubernetes.io/control-plane
              effect: NoSchedule
          containers:
            - name: etcd-backup
              image: registry.k8s.io/etcd:3.5.16-0
              command:
                - /bin/sh
                - -c
                - |
                  set -e
                  BACKUP_FILE="/backups/snapshot-$(date +%Y%m%d-%H%M%S).db"
                  etcdctl snapshot save "$BACKUP_FILE" \
                    --endpoints=https://127.0.0.1:2379 \
                    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
                    --cert=/etc/kubernetes/pki/etcd/server.crt \
                    --key=/etc/kubernetes/pki/etcd/server.key
                  etcdctl snapshot status "$BACKUP_FILE" --write-out=json
                  echo "Backup complete: $BACKUP_FILE"
                  # Upload to S3 (requires aws-cli)
                  aws s3 cp "$BACKUP_FILE" \
                    "s3://etcd-backups/prod/$(date +%Y/%m/%d)/" \
                    --sse aws:kms
                  # Retain only last 7 days locally
                  find /backups -name "*.db" -mtime +7 -delete
              volumeMounts:
                - name: etcd-certs
                  mountPath: /etc/kubernetes/pki/etcd
                  readOnly: true
                - name: backup-dir
                  mountPath: /backups
          volumes:
            - name: etcd-certs
              hostPath:
                path: /etc/kubernetes/pki/etcd
            - name: backup-dir
              hostPath:
                path: /var/backups/etcd
          restartPolicy: OnFailure
```

etcd Restore Procedure
```bash
# Stop all control plane components
systemctl stop kubelet

# Restore from snapshot
ETCDCTL_API=3 etcdctl snapshot restore /var/backups/etcd/snapshot-20260324-100000.db \
  --name etcd-member-1 \
  --data-dir=/var/lib/etcd-restored \
  --initial-cluster="etcd-member-1=https://10.0.1.10:2380" \
  --initial-cluster-token=etcd-cluster-restored \
  --initial-advertise-peer-urls=https://10.0.1.10:2380

# Replace the data directory
mv /var/lib/etcd /var/lib/etcd-old
mv /var/lib/etcd-restored /var/lib/etcd

# Restart kubelet (which starts etcd and other control plane components)
systemctl start kubelet

# Verify the cluster is healthy
kubectl get nodes
kubectl get pods -A
```

Velero: Kubernetes-Native Backup and Restore
For managed Kubernetes (EKS, GKE, AKS) where you don’t manage etcd directly, Velero is the standard tool for backing up and restoring Kubernetes resources and persistent volumes.
```mermaid
flowchart TD
    subgraph K8s["Kubernetes Cluster"]
        VS["Velero Server (Deployment)<br/>- Watches CRDs<br/>- Snapshots K8s resources<br/>- Triggers volume snapshots<br/>- Uploads to object storage"]
        VNA["Velero Node Agent (DaemonSet)<br/>- File-level backup of PVs<br/>- For volumes without CSI"]
    end

    OS["S3 / GCS / Blob<br/>(backup files)"]
    VSnap["EBS / GCE PD / Azure Disk<br/>(Volume snapshots)"]

    VS -->|K8s resource JSON| OS
    VS -->|Volume snapshots| VSnap
    VNA -->|File-level backup| OS
```

Installing Velero
```bash
# Install Velero CLI
brew install velero

# Install Velero in the cluster (AWS example with EBS snapshots)
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.10.0 \
  --bucket velero-backups-prod \
  --backup-location-config region=us-east-1 \
  --snapshot-location-config region=us-east-1 \
  --secret-file ./velero-credentials \
  --use-node-agent \
  --default-volumes-to-fs-backup=false

# Verify installation
velero version
kubectl get pods -n velero
```

Backup Strategies
```bash
# Full cluster backup
velero backup create full-backup-20260324 \
  --include-cluster-resources=true \
  --snapshot-volumes=true \
  --ttl 720h  # Retain for 30 days

# Namespace-level backup (for team-specific DR)
velero backup create payments-backup-20260324 \
  --include-namespaces payments \
  --snapshot-volumes=true \
  --ttl 2160h  # Retain for 90 days

# Scheduled backups
velero schedule create daily-full \
  --schedule="0 2 * * *" \
  --include-cluster-resources=true \
  --snapshot-volumes=true \
  --ttl 720h

velero schedule create hourly-critical \
  --schedule="0 * * * *" \
  --include-namespaces payments,orders,inventory \
  --snapshot-volumes=true \
  --ttl 168h  # Retain for 7 days

# Label-based backup (only backup PCI workloads)
velero backup create pci-backup-20260324 \
  --selector compliance=pci \
  --snapshot-volumes=true \
  --ttl 8760h  # Retain for 1 year (compliance)

# Check backup status
velero backup describe full-backup-20260324
velero backup logs full-backup-20260324
```

Restore Procedures
```bash
# Restore entire cluster to a new cluster
velero restore create full-restore \
  --from-backup full-backup-20260324

# Restore specific namespace only
velero restore create payments-restore \
  --from-backup full-backup-20260324 \
  --include-namespaces payments

# Restore with namespace mapping (restore to different namespace)
velero restore create payments-dr-test \
  --from-backup full-backup-20260324 \
  --include-namespaces payments \
  --namespace-mappings payments:payments-dr-test

# Restore excluding certain resources (e.g., keep existing services)
velero restore create selective-restore \
  --from-backup full-backup-20260324 \
  --include-namespaces payments \
  --exclude-resources services,ingresses

# Monitor restore progress
velero restore describe full-restore
velero restore logs full-restore
```

Stop and think: You just ran a Velero restore of a critical namespace to a new cluster. The pods are starting, but they are all stuck in Pending state. The persistent volume claims (PVCs) remain unbound. What Kubernetes resource did you likely forget to include in your backup or pre-create in the new cluster, and how would you fix it?
DR Patterns for Kubernetes
Pattern 1: Backup & Restore (Cold DR)

```mermaid
flowchart LR
    subgraph Primary["Primary Region (us-east-1)"]
        EKS_P["EKS Cluster (active)<br/>Workloads running"]
        S3_P["S3 Bucket<br/>(Velero backups)"]
        EKS_P -->|Hourly backups| S3_P
    end

    subgraph DR["DR Region (eu-west-1)"]
        EKS_D["New EKS Cluster<br/>(Provisioned on disaster)"]
        S3_D["S3 Bucket<br/>(Replicated)"]
        S3_D -.->|Velero restore<br/>RTO: 2-4 hours| EKS_D
    end

    S3_P -->|Cross-region replication<br/>RPO: 1 hour| S3_D
```

Cost: Lowest. You pay only for S3 storage and cross-region replication in steady state. The DR cluster is provisioned only during a disaster.
Risk: Longest recovery time. You are building infrastructure from scratch during the most stressful moment possible.
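For this pattern, the worst-case data loss is larger than the backup interval alone: a backup that finished just before the outage may not have reached the DR bucket yet. A rough sketch of that arithmetic, assuming the hourly backups from the diagram, plus a hypothetical 5-minute backup duration and a hypothetical 15-minute replication delay (both illustrative figures, not measurements):

```bash
#!/usr/bin/env bash

# Worst-case RPO (minutes) for backup-and-restore DR: the age of the newest
# backup that is guaranteed to be present in the DR region.
worst_case_rpo() {
  local backup_interval_min=$1   # how often backups start
  local backup_duration_min=$2   # assumed time for one backup to complete
  local replication_lag_min=$3   # assumed maximum cross-region delay
  echo $(( backup_interval_min + backup_duration_min + replication_lag_min ))
}

# Hourly backups, 5 min to complete, up to 15 min to replicate.
worst_case_rpo 60 5 15   # prints 80
```

Under these assumptions the nominal "RPO: 1 hour" is closer to 80 minutes in the worst case, so measure your own backup duration and replication delay before committing to a target.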
Pattern 2: Pilot Light
```mermaid
flowchart LR
    subgraph Primary["Primary Region (us-east-1)"]
        EKS_P["EKS Cluster (active)<br/>3 nodes, full load"]
        RDS_P["RDS Primary<br/>Active, writes"]
    end

    subgraph DR["DR Region (eu-west-1)"]
        EKS_D["EKS Cluster (minimal)<br/>1 node, idle<br/>Core infra only"]
        RDS_D["RDS Read Replica<br/>Standby, reads only"]
    end

    RDS_P -->|Async replication<br/>RPO: seconds| RDS_D
    EKS_D -.->|On disaster:<br/>Scale to 3 nodes<br/>RTO: 15-30 mins| EKS_D
```

Pattern 3: Warm Standby
```mermaid
flowchart LR
    subgraph Primary["Primary Region (us-east-1)"]
        EKS_P["EKS Cluster (active)<br/>6 nodes, 100% traffic"]
        RDS_P["RDS Multi-AZ Primary<br/>Active, writes"]
    end

    subgraph DR["DR Region (eu-west-1)"]
        EKS_D["EKS Cluster (warm)<br/>3 nodes, 0% traffic<br/>All apps deployed"]
        RDS_D["RDS Cross-Region<br/>Hot standby"]
    end

    RDS_P -->|Async replication<br/>RPO: seconds| RDS_D
    EKS_D -.->|On disaster:<br/>Scale to 6 nodes<br/>Route 100% traffic<br/>RTO: 5-10 mins| EKS_D
```

DNS Failover
DNS is the traffic director in any DR scenario. How you configure DNS failover determines whether your users experience a smooth transition or a prolonged outage.
Route53 Health Check + Failover
```bash
# Create health check for primary region
PRIMARY_HC=$(aws route53 create-health-check \
  --caller-reference "primary-$(date +%s)" \
  --health-check-config '{
    "Type": "HTTPS",
    "ResourcePath": "/healthz",
    "FullyQualifiedDomainName": "primary.api.example.com",
    "Port": 443,
    "RequestInterval": 10,
    "FailureThreshold": 3,
    "MeasureLatency": true,
    "Regions": ["us-east-1", "eu-west-1", "ap-southeast-1"]
  }' \
  --query 'HealthCheck.Id' --output text)

# Create failover routing policy
# Note: each AliasTarget HostedZoneId must be the canonical hosted zone ID of
# the target load balancer (reported by `aws elbv2 describe-load-balancers`),
# not the ID of your own hosted zone. Treat the IDs below as examples.
aws route53 change-resource-record-sets \
  --hosted-zone-id Z1234567890 \
  --change-batch '{
    "Changes": [
      {
        "Action": "CREATE",
        "ResourceRecordSet": {
          "Name": "api.example.com",
          "Type": "A",
          "SetIdentifier": "primary",
          "Failover": "PRIMARY",
          "AliasTarget": {
            "HostedZoneId": "Z2FDTNDATAQYW2",
            "DNSName": "primary-nlb.elb.us-east-1.amazonaws.com",
            "EvaluateTargetHealth": true
          },
          "HealthCheckId": "'$PRIMARY_HC'"
        }
      },
      {
        "Action": "CREATE",
        "ResourceRecordSet": {
          "Name": "api.example.com",
          "Type": "A",
          "SetIdentifier": "secondary",
          "Failover": "SECONDARY",
          "AliasTarget": {
            "HostedZoneId": "Z3AADJGX6KTTL2",
            "DNSName": "dr-nlb.elb.eu-west-1.amazonaws.com",
            "EvaluateTargetHealth": true
          }
        }
      }
    ]
  }'
```

DNS TTL Considerations
| Time | Event |
|---|---|
| T+0s | Primary begins failing health checks (3 consecutive failures at a 10s interval take 30s) |
| T+30s | Route53 marks primary unhealthy |
| T+30s | Route53 starts returning DR IP for new DNS queries |
| T+30s | Clients with EXPIRED DNS cache get DR IP immediately |
| T+30s to T+30s+TTL | Clients with CACHED DNS still hit primary (depends on TTL) |
With TTL=60s: Most clients fail over within 90 seconds. With TTL=300s: Some clients are stuck for up to 330 seconds.
RECOMMENDATION: Set TTL=60s for DR-critical records. Lower TTLs mean more DNS queries (more cost) but faster failover. TTL=30s is the practical minimum—below that, many resolvers ignore the TTL and cache for at least 30 seconds anyway.
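The 90-second and 330-second figures above come from adding the health-check detection time to the record TTL. A minimal sketch of that calculation, where `dns_failover_seconds` is a hypothetical helper name and resolvers are assumed to honor the TTL exactly:

```bash
#!/usr/bin/env bash

# Worst-case seconds until a client with a cached record reaches the DR
# endpoint: detection time (interval x failure threshold) plus the TTL.
dns_failover_seconds() {
  local interval=$1 threshold=$2 ttl=$3
  local detection=$(( interval * threshold ))
  echo $(( detection + ttl ))
}

dns_failover_seconds 10 3 60    # prints 90
dns_failover_seconds 10 3 300   # prints 330
```

Plugging in your own health-check interval, failure threshold, and TTL gives the DNS share of the RTO, which should be added to the infrastructure recovery time rather than treated as instant.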
IaC as Disaster Recovery
The most powerful DR strategy for Kubernetes is often the simplest: your entire infrastructure is defined in code, tested regularly, and can be recreated from scratch.
```mermaid
flowchart TD
    subgraph Git["Git Repository (source of truth)"]
        TF["terraform/<br/>├── modules/<br/>└── environments/"]
        GO["gitops/<br/>├── base/<br/>└── overlays/"]
    end

    Infra["Infrastructure<br/>created from code"]
    Apps["Workloads<br/>deployed from code"]

    TF -->|terraform apply| Infra
    GO -->|argocd sync| Apps

    Infra -.->|DR = terraform apply + argocd sync + restore data| Apps
```

DR Terraform Module
```hcl
# environments/eu-west-1/main.tf (DR region)
# Same modules as production, different variables

module "networking" {
  source = "../../modules/networking"

  region     = "eu-west-1"
  cidr_block = "10.1.0.0/16"
  azs        = ["eu-west-1a", "eu-west-1b", "eu-west-1c"]

  # DR: same structure, different region
  enable_nat_gateway = var.dr_active  # Only create NAT GW when DR is active
}

module "eks" {
  source = "../../modules/eks-cluster"

  cluster_name    = "prod-dr"
  cluster_version = "1.35"
  vpc_id          = module.networking.vpc_id
  subnet_ids      = module.networking.private_subnet_ids

  # DR: start small, scale up during failover
  node_groups = {
    general = {
      desired_size   = var.dr_active ? 6 : 1
      min_size       = var.dr_active ? 3 : 1
      max_size       = 12
      instance_types = ["m7i.xlarge"]
    }
  }
}

module "database" {
  source = "../../modules/databases"

  # DR: cross-region read replica that can be promoted
  create_primary      = false
  create_read_replica = true
  source_db_arn       = var.primary_rds_arn
  promote_on_failover = var.dr_active
}

variable "dr_active" {
  description = "Set to true during DR failover to scale up resources"
  type        = bool
  default     = false
}
```

```bash
# DR failover procedure
# Step 1: Activate DR infrastructure
cd terraform/environments/eu-west-1
terraform apply -var="dr_active=true" -auto-approve

# Step 2: Promote database replica
aws rds promote-read-replica \
  --db-instance-identifier prod-dr-replica

# Step 3: Update kubeconfig for DR cluster
aws eks update-kubeconfig --name prod-dr --region eu-west-1

# Step 4: Trigger ArgoCD sync (if not auto-syncing)
# Sync every application by name, one at a time
argocd app list -o name | xargs -n 1 argocd app sync --prune

# Step 5: Verify workloads
kubectl get pods -A | grep -v Running | grep -v Completed

# Step 6: Switch DNS
aws route53 change-resource-record-sets \
  --hosted-zone-id Z1234567890 \
  --change-batch file://failover-dns.json

# Step 7: Monitor
kubectl top nodes
kubectl top pods -A --sort-by=cpu
```

Did You Know?
- Velero was originally called “Heptio Ark” and was created by the team at Heptio (founded by two of Kubernetes’ co-creators, Joe Beda and Craig McLuckie). After VMware acquired Heptio in 2018, the project was renamed to Velero (Spanish for “sailboat”). It remains open source and is widely treated as the de facto standard for Kubernetes backup, with over 8,000 GitHub stars and production use at thousands of organizations.
- etcd can handle a cluster state restore in under 60 seconds for a typical cluster with 10,000-15,000 objects. The bottleneck is not the restore itself but the time for all controllers to reconcile state after the restore. The kube-controller-manager must re-evaluate every ReplicaSet, Deployment, and StatefulSet, which can take several minutes for large clusters. During this reconciliation window, some pods may be temporarily evicted and rescheduled.
- Standard AWS S3 Cross-Region Replication is asynchronous and does not guarantee a replication time. Most objects replicate within 15 minutes, and often much faster, but some can take longer. This matters for Velero backups: if your primary region fails immediately after a backup completes, the backup files may not have replicated to the DR region yet. For critical RPO requirements, enable S3 Replication Time Control (RTC), which is designed to replicate 99.99% of objects within 15 minutes.
- The GitLab 2017 data loss incident was live-streamed on YouTube. The engineering team broadcast their recovery efforts in real time, including the moments of panic when they discovered each backup method had failed. The video became one of the most-watched incident response recordings in tech history and directly inspired hundreds of companies to test their backup procedures. GitLab later published a detailed post-mortem that became a template for incident documentation.
Common Mistakes
| Mistake | Why It Happens | How to Fix It |
|---|---|---|
| Never testing restores | “We have backups, that’s enough” | Schedule quarterly DR tests. Restore to a separate namespace or cluster. Verify data integrity. If you haven’t tested it, it doesn’t work. |
| Backing up K8s resources but not PersistentVolumes | Without a configured volume snapshotter or file-system backup, Velero captures only resource manifests | Explicitly enable --snapshot-volumes=true or --default-volumes-to-fs-backup=true. Verify PV data after restore. |
| Setting unrealistic RTO/RPO without testing | Business says “4 hours” without engineering input | Run a DR test, measure actual recovery time, report to business. Then set RTO = tested_time x 2. |
| Storing backups in the same region as the cluster | “S3 is durable enough” | Enable cross-region replication. If the region fails, backups stored only in that region are inaccessible. |
| Forgetting CRDs and cluster-scoped resources in backups | Velero includes them but some custom configs are missed | Use --include-cluster-resources=true. Also back up your Helm releases, ArgoCD applications, and external secrets separately. |
| No runbook for DR procedures | “We’ll figure it out during the incident” | Write step-by-step runbooks. Include exact commands, expected outputs, and decision points. Store in a location accessible when your primary infra is down (not in a wiki hosted on the same cluster). |
| Ignoring DNS TTL in RTO calculations | “DNS is instant” | DNS propagation with a 300s TTL adds up to 5 minutes to your RTO. Set DR-critical records to 60s TTL. |
| Not backing up secrets and config maps separately | “They’re in the cluster backup” | External Secrets Operator configs, sealed secrets keys, and TLS certificates need special handling. Verify they’re included and restorable. |
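Several of these mistakes hide a simpler failure: backups that silently stopped running, as happened with pg_dump in the GitLab incident. A minimal sketch of a freshness check follows. `backup_is_fresh` is a hypothetical helper, and the timestamps are passed in as arguments; how you obtain the last backup's completion time depends on your tooling.

```bash
#!/usr/bin/env bash

# Return 0 if the newest backup is within the RPO window, 1 otherwise.
# Arguments: last backup time (epoch seconds), current time (epoch seconds),
# and the RPO in seconds.
backup_is_fresh() {
  local last_backup_epoch=$1 now_epoch=$2 rpo_seconds=$3
  local age=$(( now_epoch - last_backup_epoch ))
  if (( age <= rpo_seconds )); then
    echo "OK: last backup is ${age}s old (RPO ${rpo_seconds}s)"
    return 0
  fi
  echo "STALE: last backup is ${age}s old, exceeds RPO of ${rpo_seconds}s"
  return 1
}

now=$(date +%s)

# A 30-minute-old backup against a 1-hour RPO passes.
backup_is_fresh $(( now - 1800 )) "$now" 3600

# A 25-hour-old backup fails, so the alert branch runs.
backup_is_fresh $(( now - 90000 )) "$now" 3600 || echo "Alert the on-call engineer"
```

A check like this only proves that a backup exists and is recent. It does not replace a restore test, which is the only way to confirm the backup is usable.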
1. You are meeting with the VP of Engineering to define the DR strategy for a new payment processing system. They state, "We cannot afford to lose a single transaction, but if the system goes down, we have 4 hours to bring it back online before we face compliance fines." How would you translate this into RTO and RPO metrics, and how do these two metrics influence your architectural choices for this system?
The VP’s requirements translate to an RPO (Recovery Point Objective) of zero and an RTO (Recovery Time Objective) of 4 hours. RPO dictates how much data you can afford to lose; an RPO of zero means you cannot rely on periodic backups and must implement synchronous replication across regions so data is committed in both places simultaneously before acknowledging the transaction. RTO dictates how long the system can be unavailable; an RTO of 4 hours means you do not need the expense of an active-active or warm standby setup. You can use a ‘Pilot Light’ or even an automated ‘Backup & Restore’ infrastructure provisioning process, as long as the data itself is synchronously replicated and protected.
2. Your team manages three Kubernetes clusters: a self-hosted kubeadm cluster on bare metal, and two managed EKS clusters. You need to implement a backup strategy that captures the cluster state and persistent application data across all three. How would your approach differ between the bare-metal and managed clusters, and why?
For the self-hosted kubeadm cluster, you should utilize etcd snapshots to capture the entire cluster state at a specific point in time, as you have direct access to the control plane nodes. etcd snapshots are incredibly fast and ensure total consistency of the Kubernetes data store, though they do not back up persistent volume data on their own. For the EKS clusters, you do not have access to the underlying etcd instances, so you must use a tool like Velero. Velero operates at the Kubernetes API level, backing up resource manifests and coordinating with cloud provider APIs to trigger volume snapshots (like EBS snapshots) to capture persistent data. While Velero can be used on the bare-metal cluster as well, etcd snapshots provide a lower-level, highly reliable bare-metal recovery option.
3. Your startup has grown, and your single-region EKS cluster is now a single point of failure. The CFO has approved a DR budget, but balks at the cost of doubling the infrastructure for an "Active-Active" setup. The CTO, however, insists that a 4-hour recovery time (Cold DR) will destroy customer trust during an outage. Which DR pattern should you recommend to balance these competing concerns, and why does it work?
You should recommend the “Pilot Light” pattern. In this architecture, you maintain a minimal, scaled-down version of your infrastructure in the DR region—such as a single-node EKS cluster with core services (like ArgoCD and monitoring) running, and a database read replica synchronizing data. This addresses the CFO’s concern because the steady-state cloud compute costs are a fraction of your primary region. It addresses the CTO’s concern because the control plane and data are already present; during a disaster, recovery is simply a matter of scaling up the node groups and promoting the database replica, which typically takes 15 to 30 minutes. This provides a dramatic reduction in RTO compared to Cold DR without the prohibitive costs of Active-Active.
4. During your annual DR simulation, your team initiates a failover to the secondary region. According to the architecture document, the RTO is 4 hours. However, it takes the team 11 hours to fully restore service and pass all health checks. Based on common Kubernetes disaster recovery pitfalls, what are the most likely architectural or procedural reasons for this massive discrepancy?
The most common cause of extended recovery times in Kubernetes is discovering missing cluster-scoped resources, such as CustomResourceDefinitions (CRDs) or StorageClasses, which were not explicitly included in the backup scope. Another major factor is PersistentVolume binding failures, which occur when the DR region lacks the exact storage configurations or availability zones expected by the PVCs. Procedurally, extended RTO is often the result of manual interventions required to fix hardcoded configuration strings (like database endpoints or S3 bucket names) that still point to the failed primary region. Finally, if infrastructure provisioning limits, such as cloud provider API rate limits or quota exhaustion, were not verified in advance, the team may spend hours just waiting for nodes to provision. The solution is to mandate quarterly testing and automate these edge cases via infrastructure-as-code.
5. Your company is migrating from a legacy VM-based architecture to Kubernetes. In the old system, DR involved restoring entire VM snapshots from cold storage, which took over 12 hours. You propose implementing "Infrastructure as Code (IaC) as DR" for the new Kubernetes environment. How would you explain to the change management board why this approach is faster and more reliable than their legacy snapshot restores?
In the legacy system, VM snapshots contained everything: the OS, the application binaries, the configuration, and the data, making them massive and slow to transfer and restore. With “IaC as DR”, we completely decouple the infrastructure and application state from the persistent data. When a disaster occurs, we execute our Terraform or Pulumi scripts to provision a fresh, identical Kubernetes cluster in minutes, and our GitOps tools (like ArgoCD) instantly pull and deploy the application manifests from version control. The only thing we actually need to restore from a backup is the persistent database state. This approach is significantly faster because infrastructure creation is parallelized by the cloud provider, and it is more reliable because the DR environment is guaranteed to be configurationally identical to production, eliminating the “configuration drift” that plagues traditional snapshot restores.
6. A massive regional cloud outage takes down your primary Kubernetes cluster. The SRE on call immediately tries to access the company's internal Confluence wiki to follow the disaster recovery runbook, but the wiki is hosted on that exact same Kubernetes cluster and is inaccessible. What structural change must you implement after the post-mortem to prevent this, and what characteristics should the new runbook have?
You must completely decouple your disaster recovery documentation from the infrastructure it is meant to recover. The runbook should be stored in a highly available, out-of-band location, such as a separate cloud provider’s storage bucket, a static site hosted on an independent CDN, or even a physical binder. This ensures that a localized failure or targeted attack does not simultaneously eliminate both your systems and your ability to restore them. Furthermore, the runbook must be written under the assumption that the original author is unavailable. It must contain exact commands, expected terminal outputs, explicit decision trees, and hardcoded escalation contacts so that any on-call engineer can execute the recovery steps without hesitation.
Hands-On Exercise: Build and Test a DR Plan
In this exercise, you will set up Velero backups, perform a simulated disaster, and verify recovery.
Prerequisites
- kind or minikube cluster
- Velero CLI installed
- MinIO (for local S3-compatible backup storage)
Task 1: Set Up MinIO as Backup Storage
Solution

```bash
# Create a kind cluster
kind create cluster --name dr-test

# Deploy MinIO as backup storage
kubectl create namespace velero-storage

kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: minio
  namespace: velero-storage
spec:
  replicas: 1
  selector:
    matchLabels:
      app: minio
  template:
    metadata:
      labels:
        app: minio
    spec:
      containers:
        - name: minio
          image: minio/minio:latest
          command: ["minio", "server", "/data", "--console-address", ":9001"]
          env:
            - name: MINIO_ROOT_USER
              value: "minioadmin"
            - name: MINIO_ROOT_PASSWORD
              value: "minioadmin"
          ports:
            - containerPort: 9000
            - containerPort: 9001
          volumeMounts:
            - name: data
              mountPath: /data
      volumes:
        - name: data
          emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: minio
  namespace: velero-storage
spec:
  selector:
    app: minio
  ports:
    - name: api
      port: 9000
    - name: console
      port: 9001
EOF

# Wait for MinIO to be ready
kubectl wait --for=condition=Ready pod -l app=minio -n velero-storage --timeout=120s

# Create the velero bucket
kubectl run minio-client --rm -it --restart=Never \
  --image=minio/mc:latest \
  --command -- sh -c '
    mc alias set myminio http://minio.velero-storage.svc:9000 minioadmin minioadmin
    mc mb myminio/velero-backups
    echo "Bucket created"
  '
```

Task 2: Install Velero and Create a Sample Application
Section titled “Task 2: Install Velero and Create a Sample Application”Solution
# Create Velero credentials filecat <<'EOF' > /tmp/velero-creds[default]aws_access_key_id = minioadminaws_secret_access_key = minioadminEOF
# Install Velerovelero install \ --provider aws \ --plugins velero/velero-plugin-for-aws:v1.10.0 \ --bucket velero-backups \ --secret-file /tmp/velero-creds \ --use-volume-snapshots=false \ --backup-location-config \ region=minio,s3ForcePathStyle=true,s3Url=http://minio.velero-storage.svc:9000 \ --use-node-agent
# Wait for Velero to be readykubectl wait --for=condition=Ready pod -l app.kubernetes.io/name=velero -n velero --timeout=120s
# Deploy a sample applicationkubectl create namespace payments
kubectl apply -f - <<'EOF'apiVersion: apps/v1kind: Deploymentmetadata: name: payment-api namespace: paymentsspec: replicas: 3 selector: matchLabels: app: payment-api template: metadata: labels: app: payment-api spec: containers: - name: api image: nginx:stable ports: - containerPort: 80---apiVersion: v1kind: Servicemetadata: name: payment-api namespace: paymentsspec: selector: app: payment-api ports: - port: 80---apiVersion: v1kind: ConfigMapmetadata: name: payment-config namespace: paymentsdata: DATABASE_URL: "postgres://prod-db.us-east-1.rds.amazonaws.com:5432/payments" CACHE_URL: "redis://prod-cache.us-east-1.cache.amazonaws.com:6379" LOG_LEVEL: "info"EOF
# Verify everything is running
kubectl get all -n payments
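Before taking the first backup, it can help to confirm that Velero actually reached the MinIO bucket, not just that its pod started. The sketch below is one way to poll for that. The `poll_until` helper is hypothetical (not part of Velero or kubectl), and the commented usage assumes the backup storage location created by `velero install` is named `default` in the `velero` namespace.

```shell
#!/usr/bin/env bash
# poll_until EXPECTED ATTEMPTS DELAY_SECONDS CMD [ARGS...]
# Hypothetical helper: runs CMD repeatedly until its stdout equals EXPECTED.
# Returns 0 on a match, 1 if ATTEMPTS runs pass without one.
poll_until() {
  local expected=$1 attempts=$2 delay=$3
  shift 3
  local i=0 out
  while [ "$i" -lt "$attempts" ]; do
    # A failing command counts as "no match yet" rather than aborting the loop.
    out=$("$@" 2>/dev/null) || true
    if [ "$out" = "$expected" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Usage against this lab's cluster (assumes the location is named "default"):
#   poll_until Available 24 5 \
#     kubectl get backupstoragelocation default -n velero \
#       -o jsonpath='{.status.phase}' \
#     && echo "Velero can reach the bucket"
```

Polling on the storage location phase rather than on pod readiness catches wrong credentials or a wrong `s3Url`, which a healthy pod alone would not reveal.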
Task 3: Create a Backup
Solution
# Create a backup of the payments namespace
velero backup create payments-dr-test \
  --include-namespaces payments \
  --include-cluster-resources=true \
  --wait

# Verify the backup succeeded
velero backup describe payments-dr-test
velero backup logs payments-dr-test
# List the backup contents
velero backup describe payments-dr-test --details
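Reading `velero backup describe` output by eye works in a lab, but a scripted check is harder to misread. The following is a minimal sketch, assuming Velero runs in the `velero` namespace; the function names `backup_phase` and `is_backup_usable` are hypothetical. It treats only the `Completed` phase as usable, so a `PartiallyFailed` backup is rejected rather than silently accepted.

```shell
#!/usr/bin/env bash
# backup_phase NAME: print the phase recorded on the Velero Backup object.
# Assumes Velero is installed in the "velero" namespace.
backup_phase() {
  kubectl get backups.velero.io "$1" -n velero -o jsonpath='{.status.phase}'
}

# is_backup_usable PHASE: succeed only for a fully completed backup.
# PartiallyFailed, Failed, InProgress, and an empty phase are all rejected.
is_backup_usable() {
  [ "$1" = "Completed" ]
}

# Usage:
#   if is_backup_usable "$(backup_phase payments-dr-test)"; then
#     echo "Backup is safe to restore from"
#   else
#     echo "Do not rely on this backup" >&2
#   fi
```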
Task 4: Simulate a Disaster and Restore
Solution
# DISASTER: Delete the entire payments namespace
kubectl delete namespace payments

# Verify it's gone
kubectl get namespace payments 2>&1 || echo "Namespace deleted - disaster simulated"

# RESTORE: Recover from backup
velero restore create payments-recovery \
  --from-backup payments-dr-test \
  --wait

# Verify the restore
velero restore describe payments-recovery

# Check that everything is back
kubectl get all -n payments
kubectl get configmap -n payments

# Verify the ConfigMap data is intact
kubectl get configmap payment-config -n payments -o yaml
# Verify pods are running
kubectl wait --for=condition=Ready pod -l app=payment-api -n payments --timeout=120s
echo "DR recovery complete!"
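The "ConfigMap data is intact" check above relies on reading YAML by eye. One possible way to make it mechanical is to fingerprint the data before the simulated disaster and compare after the restore. This is a sketch under assumptions: the helper names are hypothetical, and it only works if the first fingerprint was captured before the namespace was deleted.

```shell
#!/usr/bin/env bash
# fingerprint: read text on stdin and print a short checksum for it.
# cksum is a POSIX tool; for stdin it prints "<crc> <byte count>".
fingerprint() {
  cksum | awk '{ print $1 "-" $2 }'
}

# configmap_data NAMESPACE NAME: print the ConfigMap's .data as one string.
configmap_data() {
  kubectl get configmap "$2" -n "$1" -o jsonpath='{.data}'
}

# Usage (the first line must run BEFORE the namespace is deleted):
#   before=$(configmap_data payments payment-config | fingerprint)
#   ... delete the namespace, then restore with Velero ...
#   after=$(configmap_data payments payment-config | fingerprint)
#   if [ "$before" = "$after" ]; then
#     echo "ConfigMap data intact"
#   else
#     echo "ConfigMap data changed" >&2
#   fi
```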
Task 5: Write a DR Runbook
Document the exact steps for disaster recovery of the payments service, including pre-checks, recovery steps, and verification.
Solution
# Payments Service DR Runbook
## Pre-Disaster Checklist (verify quarterly)
- [ ] Velero backup schedule is running (check: velero schedule get)
- [ ] Latest backup completed successfully (check: velero backup get)
- [ ] Cross-region replication is active (check: S3 replication metrics)
- [ ] DR cluster infrastructure exists (check: terraform plan on DR env)

## During Disaster

### Step 1: Confirm the disaster (5 min)
- Verify primary region is actually down (not a monitoring false positive)
- Check AWS Health Dashboard for the affected region
- Confirm with second team member before proceeding

### Step 2: Activate DR infrastructure (15 min)
- cd terraform/environments/eu-west-1
- terraform apply -var="dr_active=true" -auto-approve
- aws eks update-kubeconfig --name prod-dr --region eu-west-1

### Step 3: Restore from backup (10 min)
- velero restore create disaster-$(date +%Y%m%d) \
    --from-backup <latest-successful-backup> --wait
- kubectl get pods -n payments (verify all pods running)
- kubectl get configmap -n payments (verify configs present)

### Step 4: Promote database (5 min)
- aws rds promote-read-replica --db-instance-identifier prod-dr-replica
- Wait for DB status = "available"
- Update DATABASE_URL in payment-config ConfigMap if needed

### Step 5: Switch DNS (2 min)
- aws route53 change-resource-record-sets (use failover-dns.json)
- Verify: dig api.example.com (should return DR region IP)

### Step 6: Verify (10 min)
- curl https://api.example.com/healthz (should return 200)
- Run smoke tests: ./scripts/smoke-test.sh
- Check Grafana dashboards for error rates
- Notify #incident channel: "DR failover complete"
## Total expected time: 47 min (round up to 60 min)
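Two parts of the runbook lend themselves to scripting: filling in the `<latest-successful-backup>` placeholder in Step 3, and checking that the per-step estimates still fit the time budget. The sketch below is illustrative only. The helper names are hypothetical, the 60-minute budget is an assumption taken from the runbook's "round up to 60 min" note, and the selection logic assumes completion timestamps are in the same UTC RFC 3339 form so that they sort as plain text.

```shell
#!/usr/bin/env bash
# latest_completed_backup: read "NAME PHASE COMPLETION_TIMESTAMP" lines on stdin
# and print the name of the most recent backup whose phase is Completed.
# Lines without a timestamp (for example, backups still in progress) are skipped.
latest_completed_backup() {
  awk '$2 == "Completed" && NF >= 3 { print $3, $1 }' \
    | LC_ALL=C sort \
    | tail -n 1 \
    | awk '{ print $2 }'
}

# sum_minutes N...: add up per-step time estimates given in minutes.
sum_minutes() {
  local total=0 m
  for m in "$@"; do
    total=$((total + m))
  done
  echo "$total"
}

# Step estimates copied from the runbook:
# confirm, activate infrastructure, restore, promote database, DNS, verify.
total=$(sum_minutes 5 15 10 5 2 10)
RTO_MINUTES=60  # assumption: the rounded-up figure used as the budget

if [ "$total" -le "$RTO_MINUTES" ]; then
  echo "Runbook estimate ${total} min fits the ${RTO_MINUTES} min budget"
else
  echo "Runbook estimate ${total} min exceeds the ${RTO_MINUTES} min budget"
fi

# Producing the input lines from a live cluster (assumes the velero namespace):
#   kubectl get backups.velero.io -n velero \
#     -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.phase}{" "}{.status.completionTimestamp}{"\n"}{end}' \
#     | latest_completed_backup
```

Re-running the budget check whenever a step estimate changes keeps the runbook's total honest. Timing each step during a real DR drill and feeding in the measured values gives a better figure than the estimates.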
Clean Up
kind delete cluster --name dr-test
rm /tmp/velero-creds
Success Criteria
- MinIO deployed as backup storage target
- Velero installed and connected to MinIO
- Sample application backed up successfully
- Namespace deleted (disaster simulated) and restored from backup
- All pods, services, and configmaps recovered with correct data
- DR runbook includes pre-checks, step-by-step recovery, and verification
Next Module
Module 8.6: Multi-Region Active-Active Deployments. Disaster recovery is about surviving failure. Active-active is about eliminating downtime entirely. Learn how to run your Kubernetes workloads in multiple regions simultaneously, handle global state management, and deal with the cost and complexity trade-offs.