Module 8.3: Cloud Repatriation & Migration
Complexity: [ADVANCED] | Time: 60 minutes | Prerequisites: Module 8.1: Multi-Site & Disaster Recovery, Module 8.2: Hybrid Cloud Connectivity
What You’ll Be Able to Do
After completing this module, you will be able to:
- Evaluate cloud repatriation economics with accurate 5-year TCO models that include hidden migration and staffing costs
- Design a phased migration plan that replaces cloud-managed services (RDS, ElastiCache, ALB) with self-managed on-premises equivalents
- Implement workload migration strategies that maintain service continuity during the transition from cloud to bare metal
- Plan staffing and operational readiness requirements for self-managing infrastructure previously handled by cloud providers
Why This Module Matters
In 2022, 37signals published a detailed accounting of their cloud exit. They spent $3.2 million per year on AWS (EC2, RDS, S3, and EKS). Their CTO calculated that equivalent hardware would cost $600,000 upfront, with $840,000 annual operations. Over five years, the on-prem path would save over $7 million.
The migration took eight months. It was not lift-and-shift. Every AWS-managed service had to be replaced: RDS became self-managed PostgreSQL, ElastiCache became self-hosted Redis, ALB became HAProxy, CloudWatch became Prometheus and Grafana. Containers were the easy part. The hard part was everything around them — load balancing, DNS, storage, secrets, monitoring, and dozens of AWS services quietly adopted over five years.
They completed the migration in early 2023 with a 60% infrastructure cost reduction. But they also hired two additional engineers and spent six months stabilizing self-managed PostgreSQL. Cloud repatriation is real, the economics can be compelling, and the execution is harder than anyone expects.
The Moving House Analogy
Moving from cloud to on-prem is like moving from a furnished rental to a house you buy. The rental included furniture (managed services) and a maintenance crew (cloud ops). Your house is cheaper long-term, but you buy all the furniture and learn to fix your own plumbing.
What You’ll Learn
- When cloud repatriation makes economic sense
- Translating cloud load balancers (ALB/NLB) to MetalLB
- Storage migration from EBS/EFS to Ceph
- IAM translation from AWS IAM to Keycloak
- Data gravity and migration sequencing
- Phased migration with rollback plans
When Repatriation Makes Sense
```
Annual cloud spend > $1M?
  No  ──► STAY (savings won't justify effort)
  Yes ──► Workloads steady-state (not bursty)?
            No  ──► STAY (on-prem can't burst)
            Yes ──► < 10 managed services?
                      No  ──► PARTIAL (move compute, keep managed)
                      Yes ──► Can hire 2-4 infra engineers?
                                No  ──► STAY (can't operate on-prem)
                                Yes ──► PROCEED WITH PLANNING
```

| Factor | Cloud (Annual) | On-Prem (Annual) |
|---|---|---|
| Compute (200 nodes) | $1,200,000 | $180,000 (amortized 4yr) |
| Storage (100TB) | $240,000 | $40,000 (Ceph, amortized) |
| Network egress | $180,000 (20TB/mo) | $12,000 (colo bandwidth) |
| Managed services | $360,000 | $0 (self-managed) |
| Additional staff | $0 | $400,000 (2 SREs) |
| Colocation | $0 | $144,000 |
| Total | $1,980,000 | $776,000 (61% savings) |
Warning: At 20 nodes, cloud is almost always cheaper when you factor in staff time. Breakeven is typically 50-100 nodes depending on workload density and cloud discounts (Reserved Instances, Committed Use Discounts).
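The totals in the table, and the lower end of the breakeven range in the warning, can be reproduced with a few lines of shell. This is a rough sketch, not a full TCO model: it assumes costs scale linearly with node count and treats the two SREs as a fixed cost, which real pricing (discount tiers, hardware step costs) will not follow exactly.

```shell
#!/bin/sh
# Annual costs in USD, copied from the table above (200-node example).
cloud_annual=$((1200000 + 240000 + 180000 + 360000))
onprem_annual=$((180000 + 40000 + 12000 + 400000 + 144000))

# Savings as a whole-number percentage of the cloud bill.
annual_savings_pct=$(awk -v c="$cloud_annual" -v o="$onprem_annual" \
  'BEGIN { printf "%.0f", (c - o) / c * 100 }')

# ASSUMPTION: linear per-node costs, derived by dividing the table by 200 nodes.
nodes=200
staff_fixed=400000                                    # 2 SREs, does not shrink with the cluster
cloud_per_node=$((cloud_annual / nodes))
onprem_per_node=$(( (onprem_annual - staff_fixed) / nodes ))

# Smallest node count where the cloud bill covers on-prem variable cost plus staff.
breakeven_nodes=$(awk -v c="$cloud_per_node" -v o="$onprem_per_node" -v s="$staff_fixed" \
  'BEGIN { n = s / (c - o); r = int(n); if (r < n) r++; print r }')

echo "cloud=$cloud_annual on-prem=$onprem_annual savings=${annual_savings_pct}%"
echo "per node: cloud=$cloud_per_node on-prem=$onprem_per_node (staff excluded)"
echo "breakeven at about $breakeven_nodes nodes"
```

Under these assumptions the breakeven lands at the bottom of the 50-100 node range. One-time migration costs, cloud discounts, or a third hire all push it higher.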
Pause and predict: 37signals spent $3.2M/year on AWS. The example table above puts on-prem at $776K/year, a figure that already includes $400K for two additional SREs. At what cloud spend level does that engineering cost make repatriation not worthwhile?
Translating Cloud Load Balancers to MetalLB
```
CLOUD (AWS)                        ON-PREM (MetalLB)

Internet ──► ALB (managed) ──►     Internet ──► Border Router ──►
             NodePort                           MetalLB Speaker
             Pods                               (BGP announces IPs)
                                                Pods
```

```yaml
# MetalLB with BGP mode
apiVersion: metallb.io/v1beta2
kind: BGPPeer
metadata:
  name: datacenter-router
  namespace: metallb-system
spec:
  myASN: 64500
  peerASN: 64501
  peerAddress: 10.0.0.1
---
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: production-pool
  namespace: metallb-system
spec:
  addresses:
    - 192.168.1.240/28  # /28 = 16 addresses (.240-.255) for LoadBalancer services
---
apiVersion: metallb.io/v1beta1
kind: BGPAdvertisement
metadata:
  name: production-advertisement
  namespace: metallb-system
spec:
  ipAddressPools:
    - production-pool
```

AWS ALB Annotation Translation
| AWS Annotation | On-Prem Equivalent |
|---|---|
| `scheme: internet-facing` | MetalLB IPAddressPool with routable IPs |
| `certificate-arn` | cert-manager with Let’s Encrypt or internal CA |
| `wafv2-acl-arn` | ModSecurity in NGINX Ingress |
| `target-type: ip` | Default kube-proxy behavior |
| `healthcheck-path` | Pod `readinessProbe` (NGINX Ingress only routes to Ready endpoints) |
| `ssl-redirect: "443"` | `nginx.ingress.kubernetes.io/force-ssl-redirect: "true"` |
Storage Migration: EBS/EFS to Ceph
```
AWS (Source)                    On-Prem (Target)
┌────────────────┐              ┌────────────────┐
│ EBS Volumes    │──rsync──────►│ Ceph RBD       │
│ EFS (NFS)      │──rsync──────►│ CephFS         │
│ S3 Buckets     │──rclone─────►│ Ceph RGW (S3)  │
└────────────────┘              └────────────────┘
```

Stop and think: You need to migrate 50TB of data from AWS S3 to on-premises Ceph RGW over a 1 Gbps Direct Connect. Even at full line rate that is about 4.6 days of continuous transfer, and realistically 7-10 days. During that time, the application is still writing new data to S3. How do you handle the gap between the initial sync and the final cutover?
EBS to Ceph RBD
The migration pattern for block storage is: snapshot the EBS volume, restore the snapshot to a volume mounted on a transfer instance, then rsync the data to a migration pod on the on-premises cluster that writes to a Ceph RBD PVC. For databases, stop the application first to ensure consistency.
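```bash
# On AWS: snapshot the volume, then restore and mount it on a transfer instance
aws ec2 create-snapshot --volume-id vol-0123456789abcdef

# On-prem: create the StorageClass (the pod below assumes a PVC named app-data that uses it)
# A complete Rook StorageClass also sets the csi.storage.k8s.io/*-secret-name parameters.
kubectl apply -f - <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-block
provisioner: rook-ceph.rbd.csi.ceph.com
parameters:
  clusterID: rook-ceph
  pool: replicapool
  imageFormat: "2"
reclaimPolicy: Retain
allowVolumeExpansion: true
EOF

# Transfer via a migration pod
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: data-migration
  namespace: production
spec:
  containers:
  - name: rsync
    image: instrumentisto/rsync-ssh:latest
    command: ["rsync", "-avz", "--progress",
              "-e", "ssh -i /keys/transfer-key -o StrictHostKeyChecking=accept-new",
              "ubuntu@aws-transfer.example.com:/mnt/ebs-data/",
              "/target-data/"]
    volumeMounts:
    - name: target-vol
      mountPath: /target-data
    - name: ssh-key              # the key path used by rsync must be mounted
      mountPath: /keys
      readOnly: true
  volumes:
  - name: target-vol
    persistentVolumeClaim:
      claimName: app-data
  - name: ssh-key
    secret:
      secretName: transfer-ssh-key   # assumed Secret with the private key under "transfer-key"
      defaultMode: 0400
  restartPolicy: Never
EOF
```

S3 to Ceph RGW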
rclone provides an idempotent sync operation that can resume after interruptions and run incremental syncs to catch up with new data written during the migration period.
```bash
# Configure rclone for both endpoints
rclone config  # Set up aws-s3 and ceph-rgw remotes

# Sync
rclone sync aws-s3:app-assets ceph-rgw:app-assets --progress --transfers 16

# Verify
rclone check aws-s3:app-assets ceph-rgw:app-assets
```

IAM Translation: AWS IAM to Keycloak
```
AWS IAM            On-Prem (Keycloak)

IAM Users     ──►  Keycloak Users
IAM Groups    ──►  Keycloak Groups
IAM Roles     ──►  Keycloak Roles
IRSA (OIDC)   ──►  Keycloak OIDC + ServiceAccount
AWS SSO       ──►  Keycloak Identity Brokering
```

Kubernetes OIDC with Keycloak
```yaml
# kube-apiserver flags
- --oidc-issuer-url=https://keycloak.example.com/realms/kubernetes
- --oidc-client-id=kubernetes-apiserver
- --oidc-username-claim=preferred_username
- --oidc-groups-claim=groups
- --oidc-ca-file=/etc/kubernetes/pki/keycloak-ca.crt
```

```yaml
# RBAC binding for Keycloak groups
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: keycloak-platform-admins
subjects:
- kind: Group
  name: platform-admins  # Matches Keycloak group
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: cluster-admin
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: keycloak-developers
  namespace: development
subjects:
- kind: Group
  name: developers
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: edit
  apiGroup: rbac.authorization.k8s.io
```

Warning: IRSA (IAM Roles for Service Accounts) is deeply AWS-specific. Any application using IRSA needs code or configuration changes to authenticate with Keycloak OIDC instead of AWS STS. Audit your ServiceAccounts for `eks.amazonaws.com/role-arn` annotations, and the pods that use them, before migration.
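The IRSA annotation is set on ServiceAccount objects, so one way to run this audit is to list every ServiceAccount together with its role ARN and keep only the rows that have one. The sketch below shows the filtering step on hypothetical sample rows. The `kubectl` invocation in the comment is an assumption about a workable query and should be checked against your cluster and kubectl version.

```shell
#!/bin/sh
# Keep "NAMESPACE NAME ROLE_ARN" rows whose ServiceAccount carries an IRSA role ARN.
# kubectl prints <none> in a custom column when the annotation is absent.
filter_irsa() {
  awk '$3 != "" && $3 != "<none>" { print $1 "/" $2 " -> " $3 }'
}

# Against a live cluster (needs kubectl access; not executed here):
#   kubectl get serviceaccounts --all-namespaces --no-headers \
#     -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,ARN:.metadata.annotations.eks\.amazonaws\.com/role-arn' \
#     | filter_irsa

# Hypothetical sample rows, for illustration only.
sample='production api-gateway arn:aws:iam::123456789012:role/api-gateway-s3
production worker <none>
batch report-job arn:aws:iam::123456789012:role/report-reader'

printf '%s\n' "$sample" | filter_irsa
```

Each line of output is a ServiceAccount whose workloads need a replacement authentication path before cutover.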
Data Gravity
Data gravity is the principle that large datasets attract applications. Moving 100TB takes days or weeks. Moving the application that reads it takes minutes. This means your migration sequence must follow the data: migrate storage first, then the applications that depend on it.
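"Follow the data" is a dependency-ordering problem, and the standard `tsort` utility can turn a list of dependencies into a valid migration sequence. The component names below are hypothetical. Each input line means "the left item must be migrated before the right one".

```shell
#!/bin/sh
# Each line is "dependency dependent": storage first, then whatever reads from it.
# All component names are made up for illustration.
migration_order=$(tsort <<'EOF'
ceph-rgw asset-service
ceph-rbd postgres
postgres orders-api
postgres asset-service
orders-api web-frontend
asset-service web-frontend
EOF
)

echo "$migration_order"
```

`tsort` may print any order that respects the constraints, so the exact sequence can differ between systems. What it guarantees is that storage backends appear before the services that depend on them.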
Phased Migration and Cutover
```
Month 1-2         Month 3-4         Month 5-6         Month 7-8
PREPARATION       DATA MIGRATION    APP MIGRATION     CUTOVER
┌─────────────┐   ┌─────────────┐   ┌─────────────┐   ┌─────────────┐
│ Provision   │   │ rclone/     │   │ Deploy      │   │ DNS swap    │
│ hardware    │   │ rsync       │   │ apps        │   │ to on-prem  │
│ Install K8s │   │ ongoing     │   │ Run both    │   │ Monitor     │
│ Set up      │   │ sync        │   │ in          │   │ Decommis-   │
│ network     │   │             │   │ parallel    │   │ sion cloud  │
│ Deploy      │   │ IAM         │   │ Shadow      │   │ (30 days)   │
│ platform    │   │ migration   │   │ traffic     │   │             │
└─────────────┘   └─────────────┘   └─────────────┘   └─────────────┘
```

Rollback Plan
```
Cutover complete. Error rate > 5%?
│
Yes ──► Fixable in 30 min?
          Yes ──► Fix and monitor
          No  ──► Data issue?
                    Yes ──► IMMEDIATE ROLLBACK (DNS back to cloud)
                    No  ──► Performance issue?
                              Yes ──► Split traffic 50/50, investigate
                              No  ──► ROLLBACK if unresolved in 2 hours
```

```bash
# Pre-cutover validation
rclone check aws-s3:production-data ceph-rgw:production-data
# --no-headers keeps the column header out of the output, so "empty" really means all pods are Running
kubectl --context on-prem get pods -n production --no-headers | grep -v Running  # Should be empty

# Rollback: redirect DNS back to cloud (--overwrite in case the annotation already exists)
kubectl --context cloud annotate service api-gateway --overwrite \
  external-dns.alpha.kubernetes.io/hostname=api.example.com

# Sync any data written to on-prem back to cloud
rclone sync ceph-rgw:production-data aws-s3:production-data --progress
```

Did You Know?
- Dropbox moved 90% of storage from AWS S3 to custom infrastructure (“Magic Pocket”) in 2016, saving ~$75 million over two years. They kept unpredictable workloads (ML training, experiments) on AWS.
- Cloud data egress is asymmetric by design. AWS charges $0.09/GB out but $0.00/GB in. A 100TB dataset costs ~$9,200 just to download: the “Hotel California” pricing model makes leaving expensive.
- MetalLB in BGP mode makes your cluster look like a router. Each LoadBalancer IP is a BGP route. If a speaker node goes down, the router drops its routes when the BGP hold timer expires (or sooner with BFD), so failover can be tuned to a few seconds, faster than DNS failover.
- Self-hosted Keycloak handles 2,500+ auth requests per second on a single instance. AWS Cognito’s soft limit is 120/s per user pool. A three-replica Keycloak cluster supports 500,000+ users.
Common Mistakes
| Mistake | Why It Happens | What To Do Instead |
|---|---|---|
| Big-bang migration | Impatience | Migrate in phases: non-critical first, production last |
| Ignoring data egress costs | Focus on destination | Budget $0.09/GB for AWS egress upfront |
| Forgetting managed service deps | Developers use services silently | Audit all AWS API calls via CloudTrail |
| No parallel running period | “We tested in staging” | Run both environments 2-4 weeks with shadow traffic |
| Hardcoded cloud endpoints | SDK defaults (s3.amazonaws.com) | Use env vars for all endpoints; grep for cloud URLs |
| No rollback plan | Optimism bias | Document and rehearse rollback; keep cloud running 30 days |
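For the hardcoded-endpoints row, a recursive `grep` over source and configuration trees is usually the quickest first pass. The sketch below wraps it in a small function and demonstrates it on a throwaway directory with made-up files. The pattern only covers `amazonaws.com` hostnames and is a starting point, not a complete list of cloud URLs.

```shell
#!/bin/sh
# Print file:line:content for every line that mentions an AWS endpoint hostname.
find_cloud_endpoints() {
  grep -rnE '[A-Za-z0-9.-]*\.amazonaws\.com' "$1"
}

# Demo on a temporary directory with hypothetical config files.
demo_dir=$(mktemp -d)
printf 'S3_ENDPOINT=https://s3.amazonaws.com\n' > "$demo_dir/app.env"
printf 'S3_ENDPOINT=https://rgw.onprem.example.com\n' > "$demo_dir/onprem.env"

find_cloud_endpoints "$demo_dir"

rm -rf "$demo_dir"
```

In a real audit, point the function at each repository checkout and at rendered Kubernetes manifests, since endpoints often hide in ConfigMaps and Helm values rather than in code.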
Question 1
Your company spends $800K/year on AWS running a 50-node Kubernetes cluster. The CFO reads about 37signals saving millions through cloud repatriation and asks you to plan a move to on-premises. Your estimate: $600K/year operating cost plus 2 additional SRE hires at $200K each. Should you recommend proceeding?
Answer
No. At 50 nodes, the economics do not justify repatriation.
The math: On-premises operating cost is $600K/year + $400K/year for 2 SREs = $1M/year ongoing. Add $500K in hardware CapEx (amortized over 4 years = $125K/year) and $200-400K in one-time migration engineering costs. First-year cash outlay: $1.7-1.9M ($1M operating + $500K hardware + $200-400K migration). Ongoing, with hardware amortized: $1.125M/year. This is MORE expensive than the $800K/year cloud bill.
The better approach: Optimize cloud spend without migrating. Reserved Instances or Savings Plans reduce EC2 costs by 30-40%. Right-sizing instances (most are over-provisioned) saves another 10-20%. Spot instances for batch workloads save 60-90%. These optimizations can reduce the $800K bill to $480-560K/year with minimal engineering effort.
When to revisit: If the company grows to 150+ nodes and the cloud bill exceeds $2M/year, repatriation economics become compelling because the infrastructure staff cost is fixed while cloud costs scale linearly. The breakeven point where on-prem becomes cheaper is typically 50-100 nodes, depending on workload density, cloud discounts, and staff costs.
Key insight from 37signals: They spent $3.2M/year on cloud (hundreds of servers). At that scale, a $400K/year SRE cost is about 12% of the cloud bill it replaces. At $800K/year, the same SRE cost is 50% of the bill, a completely different equation.
Question 2
Your AWS-hosted application uses an ALB with three annotations: `certificate-arn` (for TLS termination), `wafv2-acl-arn` (for web application firewall), and `ssl-redirect: "443"` (for HTTPS redirect). You are migrating to on-premises Kubernetes with NGINX Ingress. How do you replicate each capability?
Answer
Each AWS-managed capability maps to a specific on-premises tool:
- `certificate-arn` (TLS certificates): Deploy cert-manager with a ClusterIssuer. For internet-facing services, use Let’s Encrypt ACME. For internal services, use an internal CA. Reference the issuer via the `cert-manager.io/cluster-issuer` annotation on the Ingress resource. cert-manager handles certificate issuance, renewal, and rotation automatically, replacing the manual ACM certificate management workflow.
- `wafv2-acl-arn` (Web Application Firewall): Enable ModSecurity in the NGINX Ingress ConfigMap with `enable-modsecurity: "true"` and `enable-owasp-modsecurity-crs: "true"`. The OWASP Core Rule Set provides protection against SQL injection, XSS, and other common attacks. For more advanced WAF needs, deploy a dedicated WAF like Coraza (the successor to ModSecurity) as a sidecar or upstream proxy.
- `ssl-redirect: "443"` (HTTPS redirect): Set `nginx.ingress.kubernetes.io/force-ssl-redirect: "true"` on the Ingress resource. This configures NGINX to return a 308 redirect for all HTTP requests to their HTTPS equivalent.
Key difference from AWS: On AWS, these three features are a few annotations on a single ALB resource. On-premises, they require three separate systems (cert-manager, ModSecurity, NGINX config) that you must install, configure, and maintain. This operational overhead is often underestimated during migration planning.
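To make the three translations concrete, the sketch below writes an example manifest to a file for review before applying it. The names `webapp`, `letsencrypt-prod`, `webapp-tls`, and `app.example.com` are placeholders, and the ConfigMap name and namespace are assumptions that depend on how the ingress controller was installed.

```shell
#!/bin/sh
# Write an example manifest; review it, then apply with: kubectl apply -f "$manifest"
manifest="${TMPDIR:-/tmp}/ingress-onprem.yaml"
cat > "$manifest" <<'EOF'
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: webapp                 # placeholder
  namespace: production
  annotations:
    # Replaces certificate-arn: cert-manager issues and renews the certificate.
    cert-manager.io/cluster-issuer: letsencrypt-prod   # placeholder issuer name
    # Replaces ssl-redirect: "443".
    nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - app.example.com
      secretName: webapp-tls   # cert-manager stores the issued certificate here
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: webapp
                port:
                  number: 80
---
# Replaces wafv2-acl-arn: controller-wide ModSecurity with the OWASP Core Rule Set.
# ASSUMPTION: ConfigMap name/namespace match your ingress controller installation.
apiVersion: v1
kind: ConfigMap
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
data:
  enable-modsecurity: "true"
  enable-owasp-modsecurity-crs: "true"
EOF

echo "wrote $manifest"
```

Note that the WAF setting lives in a different object, owned by a different team in many organizations, than the two Ingress annotations. That split is part of the operational overhead described above.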
Question 3
You are using rclone to migrate 50TB of data from AWS S3 to on-premises Ceph RGW over a 1 Gbps Direct Connect. How long will the transfer take, what are the risks, and how do you handle data that changes during the migration?
Answer
Transfer time calculation: At 80% effective throughput (accounting for protocol overhead, TCP windowing, and S3 API latency), you get ~100 MB/s. 50 TB / 100 MB/s = ~500,000 seconds = ~5.8 days. With retries, throttling, and real-world variability, plan for 7-10 days.
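The same estimate can be wrapped in a small helper for trying other dataset sizes and link speeds. This is a sketch using decimal units (1 TB = 10^12 bytes, 1 Gbps = 10^9 bits per second). The efficiency factor is a planning assumption, not a measured value.

```shell
#!/bin/sh
# transfer_days SIZE_TB LINK_GBPS EFFICIENCY -> estimated days, one decimal place
transfer_days() {
  awk -v tb="$1" -v gbps="$2" -v eff="$3" 'BEGIN {
    bytes_per_sec = gbps * 1e9 / 8 * eff          # usable throughput in bytes/s
    printf "%.1f\n", tb * 1e12 / bytes_per_sec / 86400
  }'
}

transfer_days 50 1 0.8     # the scenario in this question
transfer_days 50 1 1.0     # theoretical line rate, no overhead
transfer_days 50 10 0.8    # same data over a 10 Gbps link
```

Comparing the first and third calls shows how much a temporary link upgrade would shorten the bulk-sync window.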
Risks and mitigations:
- Connection interruption: Direct Connect circuits can experience brief outages. `rclone sync` is idempotent: on a re-run it skips objects already present and unchanged at the destination, so repeating the same command effectively resumes from where it left off. Unlike `rclone copy`, it also removes destination objects that were deleted at the source.
- Data changing during transfer: The application continues writing new objects to S3 during the 7-10 day initial sync. Solution: start the bulk sync 2-3 weeks before cutover. Run incremental `rclone sync` nightly to catch new and modified objects. The final sync before cutover will only transfer the delta from the last 24 hours, which typically takes minutes, not days.
- S3 API rate limits: AWS throttles to 5,500 GET requests per second per prefix. With 50TB of small files, you may hit this limit. Monitor for 503 SlowDown errors and use `--transfers 16` (not 64) to stay within limits.
- Bandwidth contention: If production traffic also uses the Direct Connect, the migration competes for bandwidth. rclone's `--bwlimit` is measured in bytes per second, so use `--bwlimit 60M` (roughly 500 Mbit/s, half the link) during business hours and remove the limit overnight.
Strategy: Start bulk sync 2-3 weeks early. Nightly incremental syncs. Final sync in a 2-hour maintenance window. Verify with rclone check before cutover.
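One way to encode the nightly run is a small wrapper that builds the rclone commands and prints them for review before they go into cron. The sketch reuses the remote names from this module. The bucket name and the 08:00-19:00 business-hours window are placeholders. It relies on rclone's `--bwlimit` timetable form, where each entry is `HH:MM,LIMIT` and limits are in bytes per second.

```shell
#!/bin/sh
# Print (not run) the nightly incremental sync command for one bucket.
# 60M is about 500 Mbit/s, half of a 1 Gbps link; "off" removes the cap overnight.
build_sync_cmd() {
  bucket=$1
  printf 'rclone sync aws-s3:%s ceph-rgw:%s --transfers 16 --bwlimit "08:00,60M 19:00,off"\n' \
    "$bucket" "$bucket"
}

# Print the verification command to run after the final sync.
build_check_cmd() {
  printf 'rclone check aws-s3:%s ceph-rgw:%s\n' "$1" "$1"
}

build_sync_cmd production-data
build_check_cmd production-data
```

Because the timetable is evaluated while rclone runs, a sync that starts overnight and spills into the morning is throttled automatically without restarting the job.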
Question 4
After migrating from AWS to on-premises, your application pods cannot authenticate to the self-managed PostgreSQL database. On AWS, the application used IRSA (IAM Roles for Service Accounts) to obtain temporary credentials for RDS IAM database authentication. What broke, and how do you fix it?
Answer
The entire authentication chain is AWS-specific and breaks completely on-premises.
What broke: IRSA works through a mutating webhook that injects a projected ServiceAccount token and AWS environment variables into pods whose ServiceAccount carries the `eks.amazonaws.com/role-arn` annotation. The application SDK (e.g., AWS SDK) exchanges that token with AWS STS for temporary credentials, which are then used to authenticate to RDS with IAM-based database authentication. On-premises, there is no STS endpoint, no IRSA webhook, and self-managed PostgreSQL does not support AWS IAM authentication. Every link in the chain is missing.
Fix options (in order of preference):
- Standard PostgreSQL authentication: Create database users with password authentication. Store credentials in Kubernetes Secrets. The application needs a configuration change (connection string) but no code change if using standard database drivers.
- External Secrets Operator + Vault: Use Vault to generate dynamic PostgreSQL credentials with automatic rotation. ESO syncs credentials to Kubernetes Secrets. This provides similar security properties to IRSA (short-lived credentials, automatic rotation) without AWS dependencies.
- Keycloak OIDC for service identity: If the application supports OIDC-based database authentication (e.g., via a custom auth plugin), configure Keycloak to issue tokens for service accounts. This is the most complex option and rarely necessary.
Key lesson: Before migration, audit all ServiceAccounts for `eks.amazonaws.com/role-arn` annotations. Every pod that runs under an annotated ServiceAccount requires a migration plan for its authentication mechanism. IRSA is the single most common “hidden” AWS dependency.
Hands-On Exercise: Simulate Cloud-to-On-Prem Migration
Objective: Migrate a workload between two kind clusters, translating cloud endpoints to on-prem equivalents.
```bash
# 1. Create clusters
kind create cluster --name cloud-sim
kind create cluster --name onprem-sim

# 2. Deploy "cloud" app with cloud-specific config
kubectl config use-context kind-cloud-sim
kubectl create namespace webapp
kubectl create configmap app-settings -n webapp \
  --from-literal=DB_HOST=rds.aws.example.com \
  --from-literal=CACHE_HOST=elasticache.aws.example.com \
  --from-literal=S3_ENDPOINT=https://s3.amazonaws.com
kubectl create deployment webapp --image=nginx:1.27 -n webapp --replicas=3

# 3. Deploy on on-prem with translated config
kubectl config use-context kind-onprem-sim
kubectl create namespace webapp
kubectl create configmap app-settings -n webapp \
  --from-literal=DB_HOST=postgres.database.svc.cluster.local \
  --from-literal=CACHE_HOST=redis.cache.svc.cluster.local \
  --from-literal=S3_ENDPOINT=https://rgw.onprem.example.com
kubectl create deployment webapp --image=nginx:1.27 -n webapp --replicas=3

# 4. Compare configurations
echo "=== Cloud ==="
kubectl --context kind-cloud-sim get configmap app-settings -n webapp -o yaml
echo "=== On-Prem ==="
kubectl --context kind-onprem-sim get configmap app-settings -n webapp -o yaml

# 5. Clean up
kind delete cluster --name cloud-sim
kind delete cluster --name onprem-sim
```

Success Criteria
- Application deployed on both clusters
- ConfigMap translated from cloud to on-prem endpoints
- Both environments verified with running pods
- Differences between configurations documented
Next Module
This is the final module in the Resilience & Migration section. Return to the Resilience & Migration overview to review the full section, or continue to the next section in the on-premises track.