Scaling & Reliability Toolkit
Toolkit Track | 3 Modules | ~2.5 hours total
Overview
The Scaling & Reliability Toolkit covers advanced autoscaling and disaster recovery for Kubernetes. Karpenter provides intelligent node provisioning, KEDA enables event-driven workload scaling, and Velero ensures you can recover from disasters.
This toolkit applies concepts from SRE Discipline and Reliability Engineering.
Prerequisites
Before starting this toolkit:
- SRE Discipline — Scaling and reliability concepts
- Kubernetes HPA basics
- Persistent Volume concepts
- Cloud provider fundamentals
Modules
| # | Module | Complexity | Time |
|---|---|---|---|
| 6.1 | Karpenter | [COMPLEX] | 45-50 min |
| 6.2 | KEDA | [MEDIUM] | 40-45 min |
| 6.3 | Velero | [MEDIUM] | 40-45 min |
| 6.4 | FinOps with OpenCost | [MEDIUM] | 40-45 min |
| 6.5 | Chaos Engineering | [COMPLEX] | 50 min |
| 6.6 | Knative | [MEDIUM] | 40-45 min |
Learning Outcomes
After completing this toolkit, you will be able to:
- Configure Karpenter — Intelligent node provisioning, spot instances, consolidation
- Implement KEDA — Event-driven scaling, scale to zero, custom metrics
- Set up Velero — Backups, restores, cluster migration
- Design for reliability — Autoscaling strategies, disaster recovery plans
Tool Selection Guide
WHICH SCALING/RELIABILITY TOOL?
─────────────────────────────────────────────────────────────────

"I need faster, smarter node autoscaling"
└──▶ Karpenter
     • Provisions nodes in seconds
     • Right-sizes for actual workloads
     • Automatic consolidation
     • Better than Cluster Autoscaler

"I need to scale workloads on queues/events"
└──▶ KEDA
     • 60+ scalers (SQS, Kafka, Prometheus...)
     • Scale to zero
     • Event-driven, not just metrics

"I need backup and disaster recovery"
└──▶ Velero
     • Application-aware backups
     • PV snapshots
     • Cross-cluster migration

SCALING COMPARISON:

| | HPA | Cluster Autoscaler | Karpenter | KEDA |
|---|---|---|---|---|
| Scales | Pods | Nodes | Nodes | Pods |
| Based on | CPU/mem | Pending pods | Pending pods | Any metric |
| Min replicas | 1 | ASG min | 0 (with KEDA) | 0 |
| Speed | Seconds | Minutes | Seconds | Seconds |
| Custom metrics | Complex | N/A | N/A | Built-in |

The Reliability Stack
┌─────────────────────────────────────────────────────────────────┐
│                  KUBERNETES RELIABILITY STACK                   │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  WORKLOAD SCALING                                               │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │  HPA (CPU/memory)        │  KEDA (events, custom metrics) │  │
│  │  VPA (right-size)        │  Custom metrics adapter        │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                │                                │
│  NODE SCALING                                                   │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │  Karpenter (intelligent) │  Cluster Autoscaler (legacy)   │  │
│  │  Node pools              │  Auto-scaling groups           │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                │                                │
│  DISASTER RECOVERY                                              │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │  Velero (application)    │  etcd backup (cluster)         │  │
│  │  PV snapshots            │  GitOps (configuration)        │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Study Path
Module 6.1: Karpenter
     │
     │  Node provisioning fundamentals
     │  Spot instances, consolidation
     ▼
Module 6.2: KEDA
     │
     │  Event-driven workload scaling
     │  Custom metric triggers
     ▼
Module 6.3: Velero
     │
     │  Backup and disaster recovery
     │  Cluster migration
     ▼
[Toolkit Complete] → Platforms Toolkit

Key Concepts
Scaling Levels
| Level | Tool | What Scales | Trigger |
|---|---|---|---|
| Application | HPA, KEDA | Pod replicas | Metrics, events |
| Infrastructure | Karpenter | Nodes | Pending pods |
| Cost | Karpenter + Spot | Instance types | Availability, price |
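To make the application-level row concrete, here is a minimal Python sketch of how a replica count can be derived from a metric. The first function follows the replica rule documented for the Kubernetes HPA, `ceil(currentReplicas * currentMetric / targetMetric)`, with its default 10% tolerance band. The second is a deliberately simplified model of queue-driven scaling with scale to zero; it is not KEDA's implementation. All function and parameter names are hypothetical and chosen for illustration.

```python
import math


def hpa_desired_replicas(current_replicas, current_metric, target_metric,
                         min_replicas=1, max_replicas=10, tolerance=0.1):
    """Approximate the HPA rule: ceil(current_replicas * current / target)."""
    ratio = current_metric / target_metric
    # Inside the tolerance band the replica count is left unchanged.
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds (assumed defaults for this sketch).
    return max(min_replicas, min(max_replicas, desired))


def queue_driven_replicas(queue_length, messages_per_replica, max_replicas=20):
    """Simplified event-driven model: an empty queue means zero replicas."""
    if queue_length <= 0:
        return 0
    return min(max_replicas, math.ceil(queue_length / messages_per_replica))


print(hpa_desired_replicas(4, 90, 60))   # 6: 50% over target, so scale out
print(hpa_desired_replicas(4, 62, 60))   # 4: within tolerance, no change
print(queue_driven_replicas(12, 5))      # 3: one replica per 5 messages
print(queue_driven_replicas(0, 5))       # 0: scale to zero
```

When the resulting pods cannot be scheduled, they remain pending, which is the trigger for the infrastructure row of the table.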
Reliability Layers
RELIABILITY LAYERS
─────────────────────────────────────────────────────────────────

1. PREVENT OUTAGES
   • HPA/KEDA: Scale before overload
   • Karpenter: Provision capacity fast
   • Pod disruption budgets: Safe updates

2. SURVIVE OUTAGES
   • Multi-AZ: Survive zone failure
   • Spot instance handling: Graceful interruption
   • Circuit breakers: Prevent cascade

3. RECOVER FROM OUTAGES
   • Velero: Restore applications
   • etcd backup: Restore cluster
   • GitOps: Rebuild from source

Common Architectures
High-Availability Scaling
Section titled “High-Availability Scaling”HA SCALING ARCHITECTURE─────────────────────────────────────────────────────────────────
┌─────────────┐ │ Metrics │ │ (Prom/CW) │ └──────┬──────┘ │ ┌────────────┼────────────┐ │ │ │ ▼ ▼ ▼ ┌────────┐ ┌────────┐ ┌────────┐ │ HPA │ │ KEDA │ │ Custom │ │ (CPU) │ │(events)│ │ Alert │ └────┬───┘ └────┬───┘ └────┬───┘ │ │ │ └────────────┼────────────┘ │ ▼ ┌─────────────┐ │ Deployment │ │ (replicas) │ └──────┬──────┘ │ │ Pending pods? ▼ ┌─────────────┐ │ Karpenter │ │ (nodes) │ └─────────────┘Disaster Recovery Architecture
DR ARCHITECTURE
─────────────────────────────────────────────────────────────────

PRIMARY CLUSTER                  DR CLUSTER
(us-west-2)                      (us-east-1)

┌─────────────────┐              ┌─────────────────┐
│   Production    │              │     Standby     │
│    Workloads    │              │  (scaled down)  │
└────────┬────────┘              └────────┬────────┘
         │                                │
         │ Velero backup                  │
         ▼                                │
┌─────────────────┐                       │
│    S3 Bucket    │───── Replication ────▶│
│    (backups)    │                       │
└─────────────────┘              ┌────────▼────────┐
                                 │    S3 Bucket    │
                                 │    (replica)    │
                                 └─────────────────┘

FAILOVER: Restore from replicated backup to DR cluster

Hands-On Focus
| Module | Key Exercise |
|---|---|
| Karpenter | Configure NodePool, observe provisioning |
| KEDA | Scale on Prometheus metrics, test zero |
| Velero | Backup namespace, delete, restore |
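As a rough illustration of the Velero exercise (back up a namespace, delete it, restore it), the Python sketch below only builds the command sequence and prints it; nothing is executed against a cluster. It assumes the `velero` and `kubectl` CLIs are installed and a backup storage location is already configured. The namespace `demo` and backup name `demo-backup` are hypothetical placeholders.

```python
import shlex


def velero_exercise_plan(namespace, backup_name):
    """Return the backup, delete, restore steps as argv lists (not executed)."""
    return [
        # 1. Back up everything in the namespace and wait for completion.
        ["velero", "backup", "create", backup_name,
         "--include-namespaces", namespace, "--wait"],
        # 2. Simulate the disaster by deleting the namespace.
        ["kubectl", "delete", "namespace", namespace],
        # 3. Restore the namespace from the backup taken in step 1.
        ["velero", "restore", "create", "--from-backup", backup_name, "--wait"],
    ]


if __name__ == "__main__":
    # Print shell-quoted commands so they can be reviewed before running.
    for step in velero_exercise_plan("demo", "demo-backup"):
        print(shlex.join(step))
```

Returning the steps as data rather than running them keeps the destructive delete step visible and reviewable; each list could be passed to `subprocess.run` once the plan looks right.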
Related Tracks
- Before: SRE Discipline — SRE concepts
- Before: Reliability Engineering — Theory
- Related: IaC Tools Toolkit — Terraform modules for Karpenter, KEDA
- Related: Observability Toolkit — Metrics for scaling
- After: Platforms Toolkit — Platform features
“Scale automatically. Recover gracefully. Sleep peacefully.”