Cloud-Native Databases Toolkit
Toolkit Track | 5 Modules | ~4 hours total
Overview
The Cloud-Native Databases Toolkit covers running databases on Kubernetes, something that went from “don’t do it” to “actually, it’s great now” in just a few years. Modern operators, distributed architectures, and serverless models have made stateful workloads on Kubernetes not just viable but often preferable to traditional deployments.
This toolkit applies concepts from Distributed Systems Foundation and Reliability Engineering Foundation.
Prerequisites
Before starting this toolkit:
- Solid Kubernetes fundamentals (StatefulSets, PVCs, Services)
- Basic SQL and database concepts
- Understanding of replication and high availability
- Distributed Systems Foundation - Consensus, consistency
- Reliability Engineering Foundation - SLOs, failure modes
Modules
| # | Module | Complexity | Time |
|---|---|---|---|
| 15.1 | CockroachDB | [COMPLEX] | 55-65 min |
| 15.2 | CloudNativePG | [MEDIUM] | 45-50 min |
| 15.3 | Neon & PlanetScale | [MEDIUM] | 40-45 min |
| 15.4 | Vitess | [COMPLEX] | 50-55 min |
| 15.5 | etcd-operator | [MEDIUM] | 40-45 min |
Learning Outcomes
After completing this toolkit, you will be able to:
- Deploy CockroachDB — Globally distributed SQL that survives regional failures
- Run PostgreSQL on K8s — CloudNativePG operator for day-2 operations
- Use serverless databases — Neon and PlanetScale for developer productivity
- Scale MySQL horizontally — Vitess for YouTube/Slack-scale sharding
- Manage etcd with operators — etcd-operator for automated TLS, upgrades, and cluster lifecycle
- Choose the right database — Understand trade-offs for your use case
Tool Selection Guide
WHICH CLOUD-NATIVE DATABASE?
─────────────────────────────────────────────────────────────────

"I need multi-region, survive datacenter failures"
└──▶ CockroachDB
     • Distributed SQL (PostgreSQL wire protocol)
     • Automatic sharding and rebalancing
     • Strong consistency across regions
     • Sleep through outages

"I want PostgreSQL with great K8s operations"
└──▶ CloudNativePG
     • Best PostgreSQL operator
     • Declarative configuration
     • Automated failover, backups, PITR
     • You manage the database, not K8s complexity

"I need database branching for dev/preview environments"
└──▶ Neon or PlanetScale
     • Branch databases like git branches
     • Serverless scaling (pay for usage)
     • Neon = PostgreSQL, PlanetScale = MySQL
     • Perfect for preview environments

"I need to scale MySQL to millions of QPS"
└──▶ Vitess
     • Horizontal MySQL sharding
     • YouTube, Slack, Square scale
     • Keep your MySQL app, add scale
     • Complex but proven

"I need simple, single-node database on K8s"
└──▶ CloudNativePG or Bitnami charts
     • Don't over-engineer
     • Simple HA with replicas
     • Managed backups to S3
COMPARISON MATRIX:

| Feature | CockroachDB | CloudNativePG | Neon | Vitess |
|---|---|---|---|---|
| Database type | Distributed | PostgreSQL | Postgres | MySQL |
| Multi-region | ✓✓ | Manual | ✓ | Manual |
| Auto-sharding | ✓ | ✗ | ✓ | ✓✓ |
| Consistency | Strong | Strong | Strong | Eventual* |
| Wire protocol | PostgreSQL | PostgreSQL | Postgres | MySQL |
| Serverless | ✓ (cloud) | ✗ | ✓✓ | ✗ |
| Database branching | ✗ | ✗ | ✓✓ | ✗ |
| K8s native | ✓ | ✓✓ | Managed | ✓ |
| Self-hosted cost | $$$ | $ | N/A | $$ |
| Operational burden | Medium | Low | None | High |
| CNCF status | ✗ | Sandbox | ✗ | Graduated |

* Vitess supports various consistency modes
The Database-on-Kubernetes Evolution
DATABASES ON KUBERNETES - THE EVOLUTION
─────────────────────────────────────────────────────────────────

2015-2017: "DON'T RUN DATABASES ON KUBERNETES"
  • StatefulSets were new and scary
  • Persistent volumes unreliable
  • No good operators
  • "Pets vs cattle" mentality

2018-2020: "MAYBE FOR DEV/TEST..."
  • Operators emerged (Zalando, KubeDB)
  • CSI standardized storage
  • Success stories appeared
  • Still nervous about production

2021-NOW: "ACTUALLY, IT'S GREAT"
  • CloudNativePG (CNCF Sandbox)
  • CockroachDB on K8s is default
  • Vitess (CNCF Graduated)
  • Major banks running production DBs on K8s
  • DoK (Data on Kubernetes) community thriving

THE FUTURE: SERVERLESS + KUBERNETES HYBRID
  • Neon, PlanetScale for dev/preview
  • Self-hosted for production compliance
  • eBPF for database observability
  • AI-assisted query optimization
Why Databases on Kubernetes?
BENEFITS OF DATABASES ON KUBERNETES
─────────────────────────────────────────────────────────────────

✓ UNIFIED OPERATIONS
  └── Same tools (kubectl, Helm, ArgoCD) for apps AND databases
      Same monitoring (Prometheus), same alerts, same runbooks

✓ DECLARATIVE CONFIGURATION
  └── Database config in Git, version controlled
      "I want 3 replicas with 100GB storage" = YAML

✓ AUTOMATED DAY-2 OPERATIONS
  └── Operators handle: failover, backups, scaling, upgrades
      No more 3 AM manual failover procedures

✓ CONSISTENCY WITH APPLICATIONS
  └── Same deployment patterns, same CI/CD pipelines
      Database changes go through same review process

✓ COST EFFICIENCY (sometimes)
  └── Right-size database instances
      Scale down dev/test environments
      Bin-packing with other workloads
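
To make the declarative idea concrete, here is a minimal Python sketch that turns "3 replicas with 100GB storage" into a manifest you could commit to Git. The `cluster_manifest` helper is hypothetical and exists only for illustration; the `apiVersion`, `kind`, `instances`, and `storage.size` fields follow the CloudNativePG `Cluster` resource used in this toolkit. The sketch emits JSON, which `kubectl apply -f` accepts alongside YAML, so it needs only the standard library.

```python
import json


def cluster_manifest(name: str, instances: int, storage: str) -> dict:
    """Build a CloudNativePG Cluster manifest as plain data.

    Hypothetical helper for illustration: it covers only the fields
    named in this toolkit, not the full Cluster spec.
    """
    return {
        "apiVersion": "postgresql.cnpg.io/v1",
        "kind": "Cluster",
        "metadata": {"name": name},
        "spec": {
            "instances": instances,       # desired number of PostgreSQL pods
            "storage": {"size": storage}, # PVC size requested per instance
        },
    }


# "I want 3 replicas with 100GB storage", expressed as data.
manifest = cluster_manifest("my-postgres", instances=3, storage="100Gi")

# Serialize to JSON so the desired state can be reviewed and versioned.
print(json.dumps(manifest, indent=2))
```

The point of the sketch is that the desired state is just data: changing `instances` from 3 to 5 is a one-line diff that goes through the same review as any other change, and the operator reconciles the running cluster toward it.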
RISKS TO CONSIDER
─────────────────────────────────────────────────────────────────

⚠ STORAGE COMPLEXITY
  └── CSI driver quality varies
      Local NVMe vs network storage trade-offs
      Snapshot/backup integration

⚠ NETWORKING
  └── Pod IP changes on restart (use Services)
      Cross-AZ latency for replicas
      Network policies for security

⚠ RESOURCE CONTENTION
  └── "Noisy neighbor" with shared nodes
      Need proper resource limits and node affinity
      Consider dedicated node pools for databases

⚠ OPERATIONAL MATURITY
  └── Team needs K8s AND database expertise
      Debugging is harder (more layers)
      Disaster recovery is different
Architecture Patterns
Single-Region HA (Most Common)
SINGLE-REGION HIGH AVAILABILITY
─────────────────────────────────────────────────────────────────

┌───────────────────────────────────────────────────────────┐
│                    Kubernetes Cluster                     │
│                  (3 Availability Zones)                   │
│                                                           │
│  ┌─────────────────────────────────────────────────────┐  │
│  │                CloudNativePG Cluster                │  │
│  │                                                     │  │
│  │      AZ-1              AZ-2              AZ-3       │  │
│  │    ┌───────┐         ┌───────┐         ┌───────┐    │  │
│  │    │  Pod  │         │  Pod  │         │  Pod  │    │  │
│  │    │PRIMARY│◄────────│REPLICA│         │REPLICA│    │  │
│  │    └───┬───┘  sync   └───┬───┘  async  └───┬───┘    │  │
│  │        │                 │                 │        │  │
│  │    ┌───┴───┐         ┌───┴───┐         ┌───┴───┐    │  │
│  │    │  PVC  │         │  PVC  │         │  PVC  │    │  │
│  │    │ 100GB │         │ 100GB │         │ 100GB │    │  │
│  │    └───────┘         └───────┘         └───────┘    │  │
│  │                                                     │  │
│  └──────────────────────────┬──────────────────────────┘  │
│                             │                             │
│                     ┌───────┴───────┐                     │
│                     │    Service    │                     │
│                     │ (LoadBalancer)│                     │
│                     └───────┬───────┘                     │
│                             │                             │
└─────────────────────────────┼─────────────────────────────┘
                              │
                      Application Pods
Multi-Region Active-Active
MULTI-REGION WITH COCKROACHDB
─────────────────────────────────────────────────────────────────

┌───────────────────┐     ┌───────────────────┐     ┌───────────────────┐
│      US-EAST      │     │      US-WEST      │     │      EU-WEST      │
│                   │     │                   │     │                   │
│  ┌─────────────┐  │     │  ┌─────────────┐  │     │  ┌─────────────┐  │
│  │ CockroachDB │◄─┼─────┼─▶│ CockroachDB │◄─┼─────┼─▶│ CockroachDB │  │
│  │    Nodes    │  │     │  │    Nodes    │  │     │  │    Nodes    │  │
│  └─────────────┘  │     │  └─────────────┘  │     │  └─────────────┘  │
│         │         │     │         │         │     │         │         │
│         ▼         │     │         ▼         │     │         ▼         │
│    ┌─────────┐    │     │    ┌─────────┐    │     │    ┌─────────┐    │
│    │  Apps   │    │     │    │  Apps   │    │     │    │  Apps   │    │
│    └─────────┘    │     │    └─────────┘    │     │    └─────────┘    │
│         ▲         │     │         ▲         │     │         ▲         │
│         │         │     │         │         │     │         │         │
└─────────┼─────────┘     └─────────┼─────────┘     └─────────┼─────────┘
          │                         │                         │
          └─────────────────────────┼─────────────────────────┘
                                    │
                          Global Load Balancer
                           (GeoDNS / Anycast)

• Writes go to any region (CockroachDB handles consensus)
• Reads are local (low latency)
• Region failure = automatic failover
• Data locality controls for compliance
Serverless for Development
SERVERLESS DATABASE WORKFLOW
─────────────────────────────────────────────────────────────────

Developer Workflow with Neon:
┌───────────────────────────────────────────────────────────┐
│                                                           │
│  main branch (production)                                 │
│  ┌─────────────────────────────────────────────────────┐  │
│  │████████████████████████████████████ 500GB data      │  │
│  └─────────────────────────────────────────────────────┘  │
│        │                                                  │
│        │  git checkout -b feature/new-schema              │
│        │  neon branch create feature/new-schema           │
│        ▼                                                  │
│  feature branch (copy-on-write, instant)                  │
│  ┌─────────────────────────────────────────────────────┐  │
│  │████████████████████████████████████ 500GB (shared)  │  │
│  │░░░░░░░░░ Changes only (10MB)                        │  │
│  └─────────────────────────────────────────────────────┘  │
│        │                                                  │
│        │  Run migrations, test, PR approved               │
│        │                                                  │
│        ▼                                                  │
│  Merge to main                                            │
│  ┌─────────────────────────────────────────────────────┐  │
│  │████████████████████████████████████ 500GB + changes │  │
│  └─────────────────────────────────────────────────────┘  │
│                                                           │
│  Feature branch deleted (only paid for 10MB, not 500GB)   │
│                                                           │
└───────────────────────────────────────────────────────────┘
Study Path
Module 15.1: CockroachDB
     │
     │  Distributed SQL fundamentals
     │  Multi-region deployment
     │  Consensus and consistency
     ▼
Module 15.2: CloudNativePG
     │
     │  PostgreSQL on Kubernetes
     │  Operator-managed operations
     │  Backup and restore
     ▼
Module 15.3: Neon & PlanetScale
     │
     │  Serverless database model
     │  Database branching
     │  Developer productivity
     ▼
Module 15.4: Vitess
     │
     │  Horizontal MySQL sharding
     │  Migration from monolith
     │  Extreme scale patterns
     ▼
[Toolkit Complete] → IaC Tools or Security Tools
Key Concepts
CAP Theorem in Practice
CAP THEOREM - PICK TWO
─────────────────────────────────────────────────────────────────

                    Consistency
                         │
                         │
             ┌───────────┴───────────┐
             │                       │
             │      CockroachDB      │
             │   (CP - sacrifices    │
             │     availability      │
             │    during network     │
             │      partitions)      │
             │                       │
Availability ├───────────────────────┤ Partition
             │                       │ Tolerance
             │       Cassandra       │
             │   (AP - sacrifices    │
             │      consistency      │
             │   for availability)   │
             │                       │
             └───────────────────────┘
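
The "CP" label can be illustrated with a simplified majority-quorum model. CockroachDB replicates data using the Raft consensus protocol, which commits a write only after a majority of replicas acknowledge it. The sketch below is not CockroachDB's implementation; `quorum` and `can_commit` are hypothetical names, and the model ignores leases, leader election, and timing.

```python
def quorum(replicas: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replicas // 2 + 1


def can_commit(replicas: int, reachable: int) -> bool:
    """A write commits only if a majority of replicas can acknowledge it."""
    return reachable >= quorum(replicas)


# Three replicas, one per region. A network partition isolates one region.
majority_side = can_commit(replicas=3, reachable=2)  # two regions still connected
minority_side = can_commit(replicas=3, reachable=1)  # the isolated region

print(majority_side, minority_side)  # True False
```

Under this model the isolated region refuses writes rather than risk diverging data, which is the availability a CP system gives up during a partition. The two connected regions keep serving traffic, which is why losing a single region does not take the whole database down.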
REALITY CHECK:
─────────────────────────────────────────────────────────────────
• Network partitions are rare but happen
• "Consistency" has different meanings (linearizable vs eventual)
• Most cloud-native DBs are "CP" with good availability
• CockroachDB: 99.99% available while strongly consistent
• Trade-offs are about WHAT happens during rare partitions
Operators and Custom Resources
# CloudNativePG Custom Resource
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: my-postgres
spec:
  instances: 3

  storage:
    size: 100Gi
    storageClass: premium-ssd

  postgresql:
    parameters:
      max_connections: "200"
      shared_buffers: "256MB"

  backup:
    barmanObjectStore:
      destinationPath: s3://my-backups/postgres
      s3Credentials:
        accessKeyId:
          name: backup-creds
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: backup-creds
          key: SECRET_ACCESS_KEY
      wal:
        compression: gzip

  # Automated operations the operator handles:
  # ✓ Leader election and failover
  # ✓ Replica synchronization
  # ✓ Continuous WAL archiving
  # ✓ Point-in-time recovery
  # ✓ Certificate rotation
  # ✓ Configuration changes (rolling restart)
Hands-On Focus
| Module | Key Exercise |
|---|---|
| CockroachDB | Deploy 3-region cluster, simulate region failure |
| CloudNativePG | Deploy cluster, perform failover, restore from backup |
| Neon/PlanetScale | Create database branches, test schema migrations |
| Vitess | Shard a MySQL database, run cross-shard queries |
Cost Considerations
TOTAL COST COMPARISON (100GB, 10K QPS workload)
─────────────────────────────────────────────────────────────────

CockroachDB Self-Hosted (3 nodes):
├── Compute: 3x 8 vCPU, 32GB ≈ $600/month
├── Storage: 3x 200GB SSD ≈ $60/month
├── Bandwidth: Inter-region ≈ $100/month
├── Engineering: 0.2 FTE ≈ $3,000/month
└── Total: ~$3,800/month

CockroachDB Serverless:
├── Storage: 100GB × $1 ≈ $100/month
├── Compute: 10K RU/s × $0.20 ≈ $200/month
└── Total: ~$300/month (but less control)

CloudNativePG Self-Hosted (3 replicas):
├── Compute: 3x 4 vCPU, 16GB ≈ $300/month
├── Storage: 3x 100GB SSD ≈ $30/month
├── Engineering: 0.1 FTE ≈ $1,500/month
└── Total: ~$1,830/month

RDS PostgreSQL (Multi-AZ):
├── Instance: db.r6g.xlarge × 2 ≈ $500/month
├── Storage: 100GB gp3 ≈ $15/month
├── Engineering: Minimal
└── Total: ~$515/month + less control

Neon (Serverless PostgreSQL):
├── Storage: 100GB × $0.25 ≈ $25/month
├── Compute: Pay per active time
└── Total: ~$50-200/month (depends on usage)
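
The totals above are simple sums of the line items, and recomputing them shows where the money goes. The sketch below uses only the figures from the comparison; the `COSTS` layout and the helper names are illustrative assumptions. Neon is left out because its compute line is usage-dependent, and RDS engineering time ("Minimal") is treated as zero.

```python
# Monthly line items in USD, copied from the comparison above.
COSTS = {
    "CockroachDB self-hosted": {
        "compute": 600, "storage": 60, "bandwidth": 100, "engineering": 3000,
    },
    "CockroachDB Serverless": {"storage": 100, "compute": 200},
    "CloudNativePG self-hosted": {
        "compute": 300, "storage": 30, "engineering": 1500,
    },
    "RDS PostgreSQL (Multi-AZ)": {"instance": 500, "storage": 15},
}


def monthly_total(items: dict) -> int:
    """Sum every line item for one option."""
    return sum(items.values())


def engineering_share(items: dict) -> float:
    """Fraction of the monthly total spent on engineering time."""
    return items.get("engineering", 0) / monthly_total(items)


for name, items in COSTS.items():
    total = monthly_total(items)
    share = engineering_share(items)
    print(f"{name}: ${total:,}/month, engineering {share:.0%}")
```

The CockroachDB self-hosted items sum to $3,760, which the comparison rounds to ~$3,800. In both self-hosted options, engineering time is roughly 80% of the total, so the self-hosted versus managed decision depends far more on how much platform-team time you already have than on compute or storage prices.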
WHEN SELF-HOSTED WINS:
─────────────────────────────────────────────────────────────────
• Multi-cloud / hybrid requirements
• Data residency / compliance
• Extreme scale (>1PB, >1M QPS)
• Already have K8s expertise

WHEN MANAGED/SERVERLESS WINS:
─────────────────────────────────────────────────────────────────
• Small-medium scale
• Developer productivity priority
• Limited platform team
• Variable workloads (scale to zero)
Related Tracks
- Before: Distributed Systems Foundation — Consensus, consistency
- Before: Reliability Engineering Foundation — SLOs, failure modes
- Related: Observability Toolkit — Database monitoring
- Related: GitOps & Deployments — Database GitOps
- Related: Security Tools — Database encryption, access control
“The question isn’t ‘should I run databases on Kubernetes?’ anymore. It’s ‘which databases, and how?’”