Scaling & Reliability Toolkit

Toolkit Track | 3 Modules | ~2.5 hours total

The Scaling & Reliability Toolkit covers advanced autoscaling and disaster recovery for Kubernetes. Karpenter provides intelligent node provisioning, KEDA enables event-driven workload scaling, and Velero ensures you can recover from disasters.

This toolkit applies concepts from SRE Discipline and Reliability Engineering.

Before starting this toolkit:

  • SRE Discipline — Scaling and reliability concepts
  • Kubernetes HPA basics
  • Persistent Volume concepts
  • Cloud provider fundamentals

#     Module                  Complexity   Time
6.1   Karpenter               [COMPLEX]    45-50 min
6.2   KEDA                    [MEDIUM]     40-45 min
6.3   Velero                  [MEDIUM]     40-45 min
6.4   FinOps with OpenCost    [MEDIUM]     40-45 min
6.5   Chaos Engineering       [COMPLEX]    50 min
6.6   Knative                 [MEDIUM]     40-45 min

After completing this toolkit, you will be able to:

  1. Configure Karpenter — Intelligent node provisioning, spot instances, consolidation
  2. Implement KEDA — Event-driven scaling, scale to zero, custom metrics
  3. Set up Velero — Backups, restores, cluster migration
  4. Design for reliability — Autoscaling strategies, disaster recovery plans
WHICH SCALING/RELIABILITY TOOL?
─────────────────────────────────────────────────────────────────
"I need faster, smarter node autoscaling"
└──▶ Karpenter
     • Provisions nodes in seconds
     • Right-sizes for actual workloads
     • Automatic consolidation
     • No node groups to manage, unlike Cluster Autoscaler

"I need to scale workloads on queues/events"
└──▶ KEDA
     • 60+ scalers (SQS, Kafka, Prometheus...)
     • Scale to zero
     • Event-driven, not just metrics

"I need backup and disaster recovery"
└──▶ Velero
     • Application-aware backups
     • PV snapshots
     • Cross-cluster migration

SCALING COMPARISON:
─────────────────────────────────────────────────────────────────
                 HPA        Cluster-AS     Karpenter       KEDA
─────────────────────────────────────────────────────────────────
Scales           Pods       Nodes          Nodes           Pods
Based on         CPU/mem    Pending pods   Pending pods    Any metric
Min replicas     1          ASG min        0 (with KEDA)   0
Speed            Seconds    Minutes        Seconds         Seconds
Custom metrics   Complex    N/A            N/A             Built-in
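
The "Based on" row for HPA can be made concrete with the ratio rule from the Kubernetes HPA documentation: desired replicas = ceil(current replicas × current metric / target metric), skipped when the ratio is within a tolerance of 1.0 (0.1 by default). The sketch below is a simplified illustration of that rule only. The function name and the default bounds are assumptions for this example, and the real controller also accounts for pod readiness, missing metrics, and stabilization windows.

```python
import math


def hpa_desired_replicas(current_replicas, current_metric, target_metric,
                         min_replicas=1, max_replicas=10, tolerance=0.1):
    """Replica count from the HPA ratio rule, clamped to [min, max].

    Hypothetical helper: only the core formula is modeled here.
    """
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: leave the replica count unchanged.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # 4 replicas averaging 90% CPU against a 60% target.
    print(hpa_desired_replicas(4, 90, 60))
    # 62% against a 60% target falls inside the tolerance band.
    print(hpa_desired_replicas(4, 62, 60))
```

Note that `min_replicas` defaults to 1 here, matching the "Min replicas" row above: plain HPA does not normally scale a workload to zero.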
┌─────────────────────────────────────────────────────────────────┐
│                  KUBERNETES RELIABILITY STACK                   │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  WORKLOAD SCALING                                               │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │  HPA (CPU/memory)        │  KEDA (events, custom metrics) │  │
│  │  VPA (right-size)        │  Custom metrics adapter        │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                │                                │
│  NODE SCALING                                                   │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │  Karpenter (intelligent) │  Cluster Autoscaler (legacy)   │  │
│  │  Node pools              │  Auto-scaling groups           │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                │                                │
│  DISASTER RECOVERY                                              │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │  Velero (application)    │  etcd backup (cluster)         │  │
│  │  PV snapshots            │  GitOps (configuration)        │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Module 6.1: Karpenter
    │  Node provisioning fundamentals
    │  Spot instances, consolidation
    ▼
Module 6.2: KEDA
    │  Event-driven workload scaling
    │  Custom metric triggers
    ▼
Module 6.3: Velero
    │  Backup and disaster recovery
    │  Cluster migration
    ▼
[Toolkit Complete] → Platforms Toolkit

Level            Tool               What Scales      Trigger
Application      HPA, KEDA          Pod replicas     Metrics, events
Infrastructure   Karpenter          Nodes            Pending pods
Cost             Karpenter + Spot   Instance types   Availability, price
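
To illustrate the "events" trigger at the application level, here is a minimal sketch of queue-driven scaling with scale to zero. It is loosely modeled on how queue-length scalers are usually described (one replica per N waiting messages, zero replicas when the queue is empty). It is not KEDA's implementation, and the function and parameter names are hypothetical.

```python
import math


def queue_based_replicas(queue_length, messages_per_replica,
                         min_replicas=0, max_replicas=20):
    """Replica count for a queue consumer, allowing scale to zero.

    Hypothetical helper: real event-driven scalers also apply polling
    intervals and cooldown periods, which are not modeled here.
    """
    if queue_length <= 0:
        # Nothing to process: fall back to the floor (0 by default).
        return min_replicas
    # One replica per `messages_per_replica` waiting messages, rounded up.
    desired = math.ceil(queue_length / messages_per_replica)
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    for depth in (0, 1, 23, 1000):
        print(depth, queue_based_replicas(depth, messages_per_replica=5))
```

The contrast with CPU-based scaling is the trigger: the replica count follows the backlog directly, so an idle consumer can drop to zero and return as soon as work arrives.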
RELIABILITY LAYERS
─────────────────────────────────────────────────────────────────
1. PREVENT OUTAGES
• HPA/KEDA: Scale before overload
• Karpenter: Provision capacity fast
• Pod disruption budgets: Safe updates
2. SURVIVE OUTAGES
• Multi-AZ: Survive zone failure
• Spot instance handling: Graceful interruption
• Circuit breakers: Prevent cascade
3. RECOVER FROM OUTAGES
• Velero: Restore applications
• etcd backup: Restore cluster
• GitOps: Rebuild from source
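
The "Pod disruption budgets: Safe updates" item rests on simple arithmetic: a voluntary eviction is allowed only while the number of healthy pods stays at or above the budget's minimum. The sketch below assumes an integer `minAvailable` style budget. The helper names are hypothetical, and the real eviction API also handles percentage budgets and `maxUnavailable`.

```python
def allowed_disruptions(healthy_pods, min_available):
    """How many pods may be evicted right now without breaking the budget."""
    return max(0, healthy_pods - min_available)


def can_evict(healthy_pods, min_available, evictions_requested=1):
    """True if the requested number of evictions fits inside the budget."""
    return evictions_requested <= allowed_disruptions(healthy_pods, min_available)


if __name__ == "__main__":
    # 5 healthy pods with a floor of 4: a node drain may evict one at a time.
    print(allowed_disruptions(5, 4), can_evict(5, 4))
    # Already at the floor: the eviction is refused until a pod recovers.
    print(allowed_disruptions(4, 4), can_evict(4, 4))
```

This is why a budget matters during node consolidation and rolling node upgrades: the drain waits instead of taking the workload below its floor.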
HA SCALING ARCHITECTURE
─────────────────────────────────────────────────────────────────
               ┌─────────────┐
               │   Metrics   │
               │  (Prom/CW)  │
               └──────┬──────┘
         ┌────────────┼────────────┐
         │            │            │
         ▼            ▼            ▼
    ┌────────┐   ┌────────┐   ┌────────┐
    │  HPA   │   │  KEDA  │   │ Custom │
    │ (CPU)  │   │(events)│   │ Alert  │
    └────┬───┘   └────┬───┘   └────┬───┘
         │            │            │
         └────────────┼────────────┘
                      ▼
               ┌─────────────┐
               │ Deployment  │
               │ (replicas)  │
               └──────┬──────┘
                      │ Pending pods?
                      ▼
               ┌─────────────┐
               │  Karpenter  │
               │   (nodes)   │
               └─────────────┘

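
The last hop in the diagram (pending pods lead to new nodes) can be sketched as a bin-packing problem. The example below uses first-fit decreasing on CPU requests against a single, fixed node size. This is a stand-in technique chosen for illustration, not Karpenter's actual scheduling logic, which weighs many instance types, memory, and other constraints. The function name and the millicore figures are assumptions.

```python
def nodes_needed(pending_cpu_millicores, node_capacity_millicores):
    """Estimate the node count for pending pods via first-fit decreasing.

    Hypothetical helper: considers CPU only and one node size.
    """
    free = []  # remaining CPU on each node opened so far
    for request in sorted(pending_cpu_millicores, reverse=True):
        if request > node_capacity_millicores:
            raise ValueError(f"pod requesting {request}m fits on no node")
        for i, remaining in enumerate(free):
            if request <= remaining:
                # Place the pod on the first node with enough room.
                free[i] -= request
                break
        else:
            # No existing node fits: open a new one.
            free.append(node_capacity_millicores - request)
    return len(free)


if __name__ == "__main__":
    pending = [500, 1500, 1000, 2000, 250]
    print(nodes_needed(pending, node_capacity_millicores=2000))
```

Sorting the largest requests first tends to leave fewer half-empty nodes, which is the same goal consolidation pursues after the fact.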
DR ARCHITECTURE
─────────────────────────────────────────────────────────────────
  PRIMARY CLUSTER                    DR CLUSTER
    (us-west-2)                      (us-east-1)
┌─────────────────┐              ┌─────────────────┐
│   Production    │              │     Standby     │
│    Workloads    │              │  (scaled down)  │
└────────┬────────┘              └────────┬────────┘
         │                                │
         │ Velero backup                  │
         ▼                                │
┌─────────────────┐                       │
│    S3 Bucket    │───── Replication ────▶│
│    (backups)    │                       │
└─────────────────┘              ┌────────▼────────┐
                                 │    S3 Bucket    │
                                 │    (replica)    │
                                 └─────────────────┘

FAILOVER: Restore from replicated backup to DR cluster
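
The failover step implies a choice: which replicated backup to restore, and how much data that choice gives up. The sketch below picks the newest backup that finished successfully before the failure and reports the resulting data-loss window. The backup records are hypothetical dictionaries invented for this example. Only the "Completed" and "PartiallyFailed" phase names are borrowed from Velero's backup status, and nothing here calls Velero itself.

```python
from datetime import datetime, timedelta


def pick_backup(backups, failure_time):
    """Return (name, data_loss_window) for the newest usable backup.

    Hypothetical helper: a backup is usable if it completed successfully
    at or before the failure. Returns (None, None) if none qualifies.
    """
    candidates = [
        b for b in backups
        if b["phase"] == "Completed" and b["completed"] <= failure_time
    ]
    if not candidates:
        return None, None
    best = max(candidates, key=lambda b: b["completed"])
    return best["name"], failure_time - best["completed"]


if __name__ == "__main__":
    failure = datetime(2024, 1, 1, 12, 0)
    backups = [
        {"name": "daily-0600", "phase": "Completed",
         "completed": failure - timedelta(hours=6)},
        {"name": "daily-1000", "phase": "Completed",
         "completed": failure - timedelta(hours=2)},
        {"name": "daily-1100", "phase": "PartiallyFailed",
         "completed": failure - timedelta(hours=1)},
    ]
    name, window = pick_backup(backups, failure)
    print(name, window)
```

The returned window is the effective recovery point for that failover, so backup frequency and replication lag together bound how much data a restore can lose.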

Module      Key Exercise
Karpenter   Configure a NodePool, observe provisioning
KEDA        Scale on Prometheus metrics, test scale to zero
Velero      Back up a namespace, delete it, restore it

“Scale automatically. Recover gracefully. Sleep peacefully.”