Skip to content

Module 1.1: Stateful Workloads & Storage Deep Dive

Discipline Module | Complexity: [COMPLEX] | Time: 3 hours

Before starting this module:

  • Required: Kubernetes Storage fundamentals — PersistentVolumes, PersistentVolumeClaims, StorageClasses
  • Required: Working knowledge of StatefulSets, Deployments, and Services
  • Recommended: Experience with at least one database (PostgreSQL, MySQL, MongoDB, etc.)
  • Recommended: Familiarity with Linux filesystem and block device concepts

After completing this module, you will be able to:

  • Design StatefulSet configurations that handle persistent storage, ordered deployment, and stable network identity
  • Implement operator-managed stateful workloads using patterns like the sidecar, init container, and ambassador
  • Configure persistent volume claims with appropriate storage classes for database and queue workloads
  • Diagnose common stateful workload failures — split-brain, data corruption, volume mount issues — on Kubernetes

There is a persistent myth in the Kubernetes community: “Don’t run databases on Kubernetes.” Five years ago, that was reasonable advice. Today, it is dangerously outdated.

The world’s largest organizations — Uber, Spotify, Bloomberg, Apple — run critical stateful workloads on Kubernetes. Not because they enjoy complexity, but because Kubernetes gives them something traditional VM-based deployments cannot: declarative lifecycle management for stateful applications at scale.

But here is the catch. Running stateless web servers on Kubernetes is like driving on a straight highway. Running stateful workloads is like navigating a mountain pass at night. The vehicle is the same, but the skill required is entirely different. You need to understand storage internals, ordering guarantees, data protection, and failure modes that simply do not exist in the stateless world.

This module takes you from “I can deploy a Deployment” to “I can run a production database cluster on Kubernetes and sleep at night.”


  • Etcd, the brain of Kubernetes itself, is a stateful workload. Every Kubernetes cluster already runs a distributed stateful system. If Kubernetes could not handle state, it could not run itself.
  • A single misconfigured reclaimPolicy has caused more production data loss than any Kubernetes bug. The default policy for dynamically provisioned volumes is Delete — meaning when you delete a PVC, your data is gone forever.
  • Local PVs can deliver 3-10x the IOPS of network-attached storage. For write-heavy workloads like databases and message queues, the storage layer is almost always the bottleneck, and local disks eliminate the network hop entirely.

When a Pod writes data on Kubernetes, it passes through multiple layers. Understanding each layer is critical for performance tuning and troubleshooting.

┌─────────────────────────────────────────────┐
│ Application (Pod) │
├─────────────────────────────────────────────┤
│ Filesystem (ext4, xfs) │
├─────────────────────────────────────────────┤
│ Volume Plugin (CSI Driver) │
├─────────────────────────────────────────────┤
│ Storage Backend (Cloud/Local) │
├─────────────────────────────────────────────┤
│ Physical/Virtual Disk │
└─────────────────────────────────────────────┘

Each layer introduces latency. Network-attached storage (EBS, Persistent Disk, Azure Disk) adds a network round-trip. Local storage eliminates that round-trip but sacrifices portability.

Before CSI, every storage vendor wrote their own volume plugin directly inside the Kubernetes codebase. This was a nightmare — storage vendors had to match Kubernetes release cycles, bugs in one plugin could affect the entire cluster, and adding new storage backends required recompiling Kubernetes.

CSI changed everything. It defines a standard gRPC interface between Kubernetes and storage providers:

┌──────────────┐ gRPC ┌──────────────┐
│ Kubernetes │ ←───────────────→ │ CSI Driver │
│ (kubelet) │ │ (vendor) │
└──────────────┘ └──────────────┘
┌──────────────┐
│ Storage │
│ Backend │
└──────────────┘

CSI has three core RPCs:

RPCPurposeExample
CreateVolumeProvision new storageCreate an EBS volume
NodeStageVolumeAttach volume to nodeMount the block device
NodePublishVolumeMount into containerBind-mount into Pod path

The beauty of CSI is that Kubernetes does not need to know anything about the underlying storage. Whether it is AWS EBS, a Ceph cluster, or a local NVMe drive, the interface is identical.

A CSI driver consists of two components:

# Controller Plugin — runs as a Deployment (one per cluster)
# Handles: CreateVolume, DeleteVolume, ControllerPublishVolume
apiVersion: apps/v1
kind: Deployment
metadata:
name: csi-controller
namespace: kube-system
spec:
replicas: 2
selector:
matchLabels:
app: csi-controller
template:
metadata:
labels:
app: csi-controller
spec:
serviceAccountName: csi-controller-sa
containers:
- name: csi-provisioner
image: registry.k8s.io/sig-storage/csi-provisioner:v5.1.0
args:
- "--csi-address=/csi/csi.sock"
- "--leader-election"
- name: csi-attacher
image: registry.k8s.io/sig-storage/csi-attacher:v4.7.0
args:
- "--csi-address=/csi/csi.sock"
- "--leader-election"
- name: csi-driver
image: my-storage-vendor/csi-driver:v2.3.0
volumeMounts:
- name: socket-dir
mountPath: /csi
volumes:
- name: socket-dir
emptyDir: {}
# Node Plugin — runs as a DaemonSet (one per node)
# Handles: NodeStageVolume, NodePublishVolume, NodeGetInfo
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: csi-node
namespace: kube-system
spec:
selector:
matchLabels:
app: csi-node
template:
metadata:
labels:
app: csi-node
spec:
containers:
- name: csi-node-driver-registrar
image: registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.12.0
args:
- "--csi-address=/csi/csi.sock"
- "--kubelet-registration-path=/var/lib/kubelet/plugins/my-driver/csi.sock"
- name: csi-driver
image: my-storage-vendor/csi-driver:v2.3.0
securityContext:
privileged: true
volumeMounts:
- name: socket-dir
mountPath: /csi
- name: kubelet-dir
mountPath: /var/lib/kubelet
mountPropagation: Bidirectional

A StorageClass is the contract between a workload and its storage. It defines what kind of storage you get and how it behaves.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: fast-ssd
provisioner: ebs.csi.aws.com
parameters:
type: gp3
iops: "16000"
throughput: "1000"
encrypted: "true"
reclaimPolicy: Retain # CRITICAL: Keep data on PVC deletion
volumeBindingMode: WaitForFirstConsumer # Bind when Pod is scheduled
allowVolumeExpansion: true # Allow online resize
mountOptions:
- noatime # Skip access time updates for performance

Key fields explained:

FieldWhy It Matters
reclaimPolicy: RetainPrevents accidental data deletion. Delete is the default and has caused countless incidents
volumeBindingMode: WaitForFirstConsumerEnsures volume is created in the same zone as the Pod. Without this, you get cross-zone scheduling failures
allowVolumeExpansion: trueLets you resize PVCs without recreating them. Essential for growing databases

Local Persistent Volumes: Maximum Performance

Section titled “Local Persistent Volumes: Maximum Performance”

Network-attached storage is convenient — it survives node failures and can be attached to any node. But for workloads that demand maximum I/O performance, the network is the enemy.

Typical latency comparison:

Storage TypeRead LatencyWrite LatencyIOPS
Network SSD (gp3)0.5-1 ms0.5-1 ms16,000
Network SSD (io2)0.2-0.5 ms0.2-0.5 ms64,000
Local NVMe0.02-0.1 ms0.02-0.1 ms200,000+
Local SSD (SATA)0.1-0.3 ms0.1-0.3 ms80,000+

For databases like CockroachDB, Cassandra, or Kafka — where every microsecond of write latency affects throughput — local storage can be transformative.

Local PVs require manual provisioning. You tell Kubernetes exactly which disk on which node to use.

apiVersion: v1
kind: PersistentVolume
metadata:
name: local-pv-node1-ssd0
spec:
capacity:
storage: 500Gi
volumeMode: Filesystem
accessModes:
- ReadWriteOnce
persistentVolumeReclaimPolicy: Retain
storageClassName: local-nvme
local:
path: /mnt/disks/ssd0
nodeAffinity:
required:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.io/hostname
operator: In
values:
- worker-node-1
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: local-nvme
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Retain

Critical detail: volumeBindingMode: WaitForFirstConsumer is mandatory for local PVs. Without it, the scheduler might bind a PVC to a PV on node A, then try to schedule the Pod on node B, resulting in a stuck Pod.

Manually creating PV manifests for every disk on every node does not scale. The Local Static Provisioner automates this:

apiVersion: v1
kind: ConfigMap
metadata:
name: local-provisioner-config
namespace: kube-system
data:
storageClassMap: |
local-nvme:
hostDir: /mnt/disks/nvme
mountDir: /mnt/disks/nvme
blockCleanerCommand:
- "/scripts/quick_reset.sh"
volumeMode: Filesystem
fsType: ext4
namePattern: "local-pv-*"

The provisioner watches the configured directory. When a new disk appears (formatted and mounted), it automatically creates a PV. When a PV is released and cleaned, it becomes available again.

Local PVs come with a fundamental trade-off:

┌─────────────────────────────────────────────┐
│ LOCAL PV TRADE-OFFS │
├─────────────────┬───────────────────────────┤
│ YOU GET │ YOU LOSE │
├─────────────────┼───────────────────────────┤
│ 3-10x IOPS │ Cross-node portability │
│ Sub-ms latency │ Automatic reattachment │
│ Predictable IO │ Node failure resilience │
│ No network tax │ Easy volume snapshots │
└─────────────────┴───────────────────────────┘

This means your application must handle replication itself. If a node dies, the data on its local PVs is inaccessible (or gone). Databases like CockroachDB, TiDB, and Cassandra handle this gracefully because they replicate at the application layer. A single-instance PostgreSQL on a local PV? That is a disaster waiting to happen.


Deployments treat Pods as interchangeable cattle. Any Pod can be replaced by any other Pod. This is perfect for stateless web servers but catastrophic for databases.

Stateful applications need three guarantees that Deployments cannot provide:

  1. Stable network identity — Each Pod gets a predictable DNS name
  2. Stable storage — Each Pod gets its own PVC that follows it across rescheduling
  3. Ordered lifecycle — Pods start and stop in a defined sequence
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: cockroachdb
namespace: data
spec:
serviceName: cockroachdb # Required: headless service name
replicas: 3
podManagementPolicy: Parallel # or OrderedReady (default)
updateStrategy:
type: RollingUpdate
rollingUpdate:
partition: 0 # Update all Pods (set higher to canary)
maxUnavailable: 1 # Allow 1 Pod down during update
selector:
matchLabels:
app: cockroachdb
template:
metadata:
labels:
app: cockroachdb
spec:
terminationGracePeriodSeconds: 60
containers:
- name: cockroachdb
image: cockroachdb/cockroach:v24.3.2
ports:
- containerPort: 26257
name: grpc
- containerPort: 8080
name: http
volumeMounts:
- name: datadir
mountPath: /cockroach/cockroach-data
readinessProbe:
httpGet:
path: /health?ready=1
port: http
initialDelaySeconds: 10
periodSeconds: 5
resources:
requests:
cpu: "2"
memory: 4Gi
limits:
memory: 4Gi
volumeClaimTemplates:
- metadata:
name: datadir
spec:
accessModes: ["ReadWriteOnce"]
storageClassName: local-nvme
resources:
requests:
storage: 100Gi

What this creates:

ResourceNamePurpose
Podcockroachdb-0First replica (ordinal 0)
Podcockroachdb-1Second replica (ordinal 1)
Podcockroachdb-2Third replica (ordinal 2)
PVCdatadir-cockroachdb-0Storage for Pod 0
PVCdatadir-cockroachdb-1Storage for Pod 1
PVCdatadir-cockroachdb-2Storage for Pod 2

The headless Service gives each Pod a DNS entry:

cockroachdb-0.cockroachdb.data.svc.cluster.local
cockroachdb-1.cockroachdb.data.svc.cluster.local
cockroachdb-2.cockroachdb.data.svc.cluster.local

With the default podManagementPolicy: OrderedReady:

Startup: cockroachdb-0 must be Running and Ready before cockroachdb-1 starts. cockroachdb-1 must be Running and Ready before cockroachdb-2 starts.

Shutdown: The reverse. cockroachdb-2 terminates first, then cockroachdb-1, then cockroachdb-0.

Rolling updates: cockroachdb-2 is updated first, then cockroachdb-1, then cockroachdb-0.

This ordering is essential for many distributed systems. For example, a database might elect its leader as the lowest-ordinal Pod. Shutting down Pod 0 first could trigger an unnecessary leader election.

For databases that handle their own coordination (like CockroachDB with Raft), you can use podManagementPolicy: Parallel to speed up scaling operations.

StatefulSets support canary deployments through the partition field:

updateStrategy:
type: RollingUpdate
rollingUpdate:
partition: 2 # Only update Pods with ordinal >= 2

With partition: 2 and 3 replicas, only cockroachdb-2 gets the new image. Pods 0 and 1 stay on the old version. This lets you validate the new version on a single replica before rolling it out to the entire cluster.


The Operator Pattern for Stateful Applications

Section titled “The Operator Pattern for Stateful Applications”

StatefulSets handle Pod identity, ordering, and storage. But they know nothing about your application’s domain logic:

  • How do you initialize a database cluster?
  • How do you add a new replica and rebalance data?
  • How do you perform a zero-downtime schema migration?
  • How do you handle a split-brain scenario?
  • How do you take consistent backups?

This is where Operators come in. An Operator encodes the operational knowledge of running a specific application into a Kubernetes controller.

┌────────────────────────────────────────────────┐
│ Operator │
│ │
│ ┌──────────┐ ┌───────────┐ ┌─────────┐ │
│ │ Watch │───→│ Analyze │───→│ Act │ │
│ │ State │ │ Diff │ │ Reconcile││
│ └──────────┘ └───────────┘ └─────────┘ │
│ ▲ │ │
│ │ │ │
│ └────────────────────────────────┘ │
│ Reconciliation Loop │
└────────────────────────────────────────────────┘
│ │
▼ ▼
┌──────────┐ ┌──────────────┐
│ Custom │ │ StatefulSet │
│ Resource │ │ Services │
│ (CR) │ │ ConfigMaps │
└──────────┘ └──────────────┘

You declare what you want (a 3-node CockroachDB cluster), and the Operator figures out how to make it happen:

apiVersion: crdb.cockroachlabs.com/v1alpha1
kind: CrdbCluster
metadata:
name: cockroachdb
namespace: data
spec:
dataStore:
pvc:
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 100Gi
storageClassName: local-nvme
volumeMode: Filesystem
resources:
requests:
cpu: "2"
memory: 4Gi
limits:
memory: 4Gi
tlsEnabled: true
image:
name: cockroachdb/cockroach:v24.3.2
nodes: 3
additionalLabels:
app.kubernetes.io/part-of: data-platform

The Operator translates this into StatefulSets, Services, Jobs (for initialization), TLS certificates, and monitoring resources — plus it handles day-2 operations like scaling, upgrades, and backups.

OperatorDatabaseMaturityKey Features
CockroachDB OperatorCockroachDBProductionAuto-scaling, rolling upgrades, backup/restore
StrimziApache KafkaProductionFull Kafka lifecycle, Connect, MirrorMaker
CloudNativePGPostgreSQLProductionHA, backups to S3, connection pooling
Percona OperatorsMySQL/MongoDB/PGProductionMulti-cloud, backup, encryption
Crunchy PGOPostgreSQLProductionHA, monitoring, disaster recovery
Redis Operator (Spotahome)RedisProductionRedis Sentinel, cluster mode

CSI Volume Snapshots provide point-in-time copies of PVCs:

# 1. Create a VolumeSnapshotClass
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
name: csi-snapclass
driver: ebs.csi.aws.com
deletionPolicy: Retain
parameters:
tagSpecification_1: "Department=data-engineering"
---
# 2. Take a snapshot
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
name: cockroachdb-0-snap-20260324
namespace: data
spec:
volumeSnapshotClassName: csi-snapclass
source:
persistentVolumeClaimName: datadir-cockroachdb-0
---
# 3. Restore from snapshot (create a new PVC from it)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: datadir-cockroachdb-0-restored
namespace: data
spec:
accessModes:
- ReadWriteOnce
storageClassName: fast-ssd
resources:
requests:
storage: 100Gi
dataSource:
name: cockroachdb-0-snap-20260324
kind: VolumeSnapshot
apiGroup: snapshot.storage.k8s.io

For production stateful workloads, snapshots alone are not enough. You need a comprehensive backup strategy:

┌──────────────────────────────────────────────────┐
│ BACKUP STRATEGY LAYERS │
├──────────────────────────────────────────────────┤
│ Layer 1: Application-level backups │
│ (pg_dump, cockroach backup, etc.) │
│ → Most consistent, application-aware │
├──────────────────────────────────────────────────┤
│ Layer 2: Volume Snapshots (CSI) │
│ → Fast, crash-consistent │
├──────────────────────────────────────────────────┤
│ Layer 3: Velero cluster backup │
│ → Full namespace/cluster recovery │
├──────────────────────────────────────────────────┤
│ Layer 4: Cross-region replication │
│ → Disaster recovery across regions │
└──────────────────────────────────────────────────┘

Velero for Kubernetes-native backup:

# Schedule automated backups with Velero
apiVersion: velero.io/v1
kind: Schedule
metadata:
name: cockroachdb-daily
namespace: velero
spec:
schedule: "0 2 * * *" # Daily at 2 AM
template:
includedNamespaces:
- data
labelSelector:
matchLabels:
app: cockroachdb
snapshotVolumes: true
storageLocation: aws-backup
ttl: 720h0m0s # Retain for 30 days
defaultVolumesToFsBackup: false

MistakeWhy It HappensWhat To Do Instead
Using reclaimPolicy: Delete for database PVCsIt is the default for dynamic provisioningAlways set reclaimPolicy: Retain for stateful workloads
Running a single-instance database on a local PVLocal PVs are fast, seems like a good ideaUse network storage for single instances, local PVs only with replicated databases
Skipping WaitForFirstConsumer binding modeNot understanding zone topologyAlways use WaitForFirstConsumer for topology-constrained workloads
Setting CPU limits on database PodsFollowing generic “always set limits” adviceSet CPU requests but skip CPU limits — databases need burst capacity
Not testing failover before going to production”It will probably work”Run chaos tests monthly: kill Pods, drain nodes, corrupt a replica
Using emptyDir for database state in developmentQuick and easy for testingEven in dev, use PVCs. Develop the habit now to avoid mistakes later
Ignoring terminationGracePeriodSecondsDefault 30s seems enoughDatabases need 60-300s to flush buffers and deregister from the cluster cleanly

Question 1: What are the three guarantees that StatefulSets provide that Deployments do not?

Show Answer
  1. Stable network identity — Each Pod gets a persistent hostname (e.g., db-0, db-1) and DNS entry via a headless Service.
  2. Stable persistent storage — Each Pod gets its own PVC via volumeClaimTemplates, and that PVC follows the Pod across rescheduling.
  3. Ordered, graceful deployment and scaling — Pods are created, updated, and deleted in a defined sequential order.

Question 2: Why is volumeBindingMode: WaitForFirstConsumer critical for local PVs?

Show Answer

Without WaitForFirstConsumer, the PVC could be bound to a PV on a different node than where the Pod gets scheduled. Since local PVs are physically tied to a specific node, this would result in the Pod being stuck in Pending state forever — it cannot access a local disk on another node.

WaitForFirstConsumer delays volume binding until the Pod is scheduled, ensuring the PV and Pod are on the same node.

Question 3: What is the difference between reclaimPolicy: Retain and reclaimPolicy: Delete?

Show Answer
  • Delete: When a PVC is deleted, the underlying PV and its storage (e.g., the EBS volume) are automatically deleted. This is the default for dynamically provisioned volumes and is dangerous for stateful workloads.
  • Retain: When a PVC is deleted, the PV is marked as Released but the data is preserved. An administrator must manually reclaim or delete the volume. This is essential for preventing accidental data loss.

Question 4: Why would you use podManagementPolicy: Parallel instead of the default OrderedReady?

Show Answer

Use Parallel when the application handles its own coordination and does not rely on StatefulSet ordering. Databases like CockroachDB use Raft consensus and do not need Kubernetes to start Pods sequentially. Parallel significantly speeds up initial deployment and scaling — all Pods start simultaneously instead of one at a time, which matters when you have dozens of replicas.

Question 5: An Operator extends Kubernetes. What three components does it consist of, and what role does each play?

Show Answer
  1. Custom Resource Definition (CRD) — Defines the schema of a new resource type (e.g., CrdbCluster), extending the Kubernetes API.
  2. Custom Resource (CR) — An instance of the CRD that declares the desired state (e.g., “I want a 3-node CockroachDB cluster with TLS”).
  3. Controller — A control loop that watches CRs, compares desired state to actual state, and takes actions (create StatefulSets, run init Jobs, issue TLS certs) to reconcile the difference.

Hands-On Exercise: Deploy CockroachDB with the Operator on Local PVs

Section titled “Hands-On Exercise: Deploy CockroachDB with the Operator on Local PVs”

Deploy a 3-node CockroachDB cluster using the CockroachDB Operator with local persistent volumes, then simulate a node failure and verify the cluster survives.

You need a multi-node kind cluster:

kind-config.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
extraMounts:
- hostPath: /tmp/cockroachdb/node1
containerPath: /mnt/disks/ssd0
- role: worker
extraMounts:
- hostPath: /tmp/cockroachdb/node2
containerPath: /mnt/disks/ssd0
- role: worker
extraMounts:
- hostPath: /tmp/cockroachdb/node3
containerPath: /mnt/disks/ssd0
Terminal window
# Create host directories and the kind cluster
mkdir -p /tmp/cockroachdb/node{1,2,3}
kind create cluster --name data-lab --config kind-config.yaml

Step 1: Create the Local PV StorageClass and PVs

Section titled “Step 1: Create the Local PV StorageClass and PVs”
Terminal window
# Get worker node names
NODES=$(kubectl get nodes --selector='!node-role.kubernetes.io/control-plane' -o jsonpath='{.items[*].metadata.name}')
echo "Worker nodes: $NODES"
local-storage.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: local-nvme
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Retain
---
apiVersion: v1
kind: PersistentVolume
metadata:
name: local-pv-worker-1
spec:
capacity:
storage: 10Gi
volumeMode: Filesystem
accessModes:
- ReadWriteOnce
persistentVolumeReclaimPolicy: Retain
storageClassName: local-nvme
local:
path: /mnt/disks/ssd0
nodeAffinity:
required:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.io/hostname
operator: In
values:
- data-lab-worker
---
apiVersion: v1
kind: PersistentVolume
metadata:
name: local-pv-worker-2
spec:
capacity:
storage: 10Gi
volumeMode: Filesystem
accessModes:
- ReadWriteOnce
persistentVolumeReclaimPolicy: Retain
storageClassName: local-nvme
local:
path: /mnt/disks/ssd0
nodeAffinity:
required:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.io/hostname
operator: In
values:
- data-lab-worker2
---
apiVersion: v1
kind: PersistentVolume
metadata:
name: local-pv-worker-3
spec:
capacity:
storage: 10Gi
volumeMode: Filesystem
accessModes:
- ReadWriteOnce
persistentVolumeReclaimPolicy: Retain
storageClassName: local-nvme
local:
path: /mnt/disks/ssd0
nodeAffinity:
required:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.io/hostname
operator: In
values:
- data-lab-worker3
Terminal window
kubectl apply -f local-storage.yaml
kubectl get pv
# All three PVs should show Available
Terminal window
# Install the CockroachDB Operator via its manifest
kubectl apply -f https://raw.githubusercontent.com/cockroachdb/cockroach-operator/v2.15.0/install/crds.yaml
kubectl apply -f https://raw.githubusercontent.com/cockroachdb/cockroach-operator/v2.15.0/install/operator.yaml
# Wait for the operator to be ready
kubectl -n cockroach-operator-system wait --for=condition=Available \
deployment/cockroach-operator-manager --timeout=120s
cockroachdb-cluster.yaml
apiVersion: crdb.cockroachlabs.com/v1alpha1
kind: CrdbCluster
metadata:
name: cockroachdb
namespace: default
spec:
dataStore:
pvc:
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 10Gi
storageClassName: local-nvme
volumeMode: Filesystem
resources:
requests:
cpu: 500m
memory: 1Gi
limits:
memory: 1Gi
tlsEnabled: false
image:
name: cockroachdb/cockroach:v24.3.2
nodes: 3
Terminal window
kubectl apply -f cockroachdb-cluster.yaml
# Watch Pods come up one by one
kubectl get pods -w -l app.kubernetes.io/name=cockroachdb
# Verify cluster health (wait for all 3 Pods to be Running)
kubectl exec cockroachdb-0 -- cockroach node status --insecure
Terminal window
# Open a SQL shell
kubectl exec -it cockroachdb-0 -- cockroach sql --insecure
# Inside the SQL shell:
CREATE DATABASE testdb;
USE testdb;
CREATE TABLE sensors (
id UUID DEFAULT gen_random_uuid() PRIMARY KEY,
sensor_name STRING NOT NULL,
reading FLOAT NOT NULL,
recorded_at TIMESTAMP DEFAULT now()
);
INSERT INTO sensors (sensor_name, reading) VALUES
('temp-1', 23.5), ('temp-2', 24.1), ('pressure-1', 1013.25),
('humidity-1', 65.2), ('temp-3', 22.8);
SELECT * FROM sensors;
-- You should see 5 rows
\q
Terminal window
# Identify which node cockroachdb-1 is on
NODE=$(kubectl get pod cockroachdb-1 -o jsonpath='{.spec.nodeName}')
echo "cockroachdb-1 is on node: $NODE"
# Cordon and drain the node (simulates node failure)
kubectl cordon $NODE
kubectl drain $NODE --delete-emptydir-data --ignore-daemonsets --force --timeout=60s
# Check cluster status — should show 2/3 nodes healthy
kubectl exec cockroachdb-0 -- cockroach node status --insecure
# Verify data is still accessible
kubectl exec -it cockroachdb-0 -- cockroach sql --insecure \
-e "SELECT count(*) FROM testdb.sensors;"
# Should return 5 — data is safe because CockroachDB replicates across nodes
Terminal window
# Uncordon the node
kubectl uncordon $NODE
# The Pod will be rescheduled and rejoin the cluster
kubectl get pods -w -l app.kubernetes.io/name=cockroachdb
# Verify all 3 nodes are healthy again
kubectl exec cockroachdb-0 -- cockroach node status --insecure
Terminal window
kubectl delete crdbcluster cockroachdb
kubectl delete pvc -l app.kubernetes.io/name=cockroachdb
kubectl delete -f local-storage.yaml
kind delete cluster --name data-lab
rm -rf /tmp/cockroachdb

You have completed this exercise when you:

  • Created 3 local PVs bound to specific worker nodes
  • Deployed CockroachDB via the Operator with local storage
  • Wrote test data and verified it across the cluster
  • Simulated a node failure and confirmed data availability
  • Recovered the node and verified cluster health returned to 3/3

  1. CSI standardized storage in Kubernetes — Every storage vendor speaks the same gRPC interface, making storage pluggable and vendor-neutral.
  2. Local PVs trade portability for performance — Use them only with applications that replicate data at the application layer.
  3. StatefulSets provide identity, ordering, and stable storage — They are the foundation for stateful workloads but insufficient alone for complex databases.
  4. Operators encode operational knowledge — They automate day-2 operations (scaling, upgrades, backups) that StatefulSets cannot handle.
  5. Backup in layers — Application-level backups, volume snapshots, and Velero each protect against different failure modes.

Books:

  • “Managing Kubernetes” — Brendan Burns, Craig Tracey (O’Reilly) — Deep dive on cluster operations
  • “Kubernetes Patterns” — Bilgin Ibryam, Roland Huss (O’Reilly) — Stateful Service patterns

Articles:

  • “Running Databases on Kubernetes” — DoK (Data on Kubernetes) Community white paper
  • “Local Persistent Volumes” — Kubernetes documentation (kubernetes.io/docs/concepts/storage/volumes/#local)

Talks:

  • “Data on Kubernetes: How We Got Here” — Melissa Logan, KubeCon 2024 (YouTube)
  • “CockroachDB on Kubernetes at Scale” — Cockroach Labs Engineering Blog

Running stateful workloads on Kubernetes is no longer an experiment — it is a proven production practice. The key is understanding the stack: CSI drivers provide pluggable storage, StorageClasses define the contract, StatefulSets provide identity and ordering, and Operators automate the complex lifecycle management that stateful applications demand.

The most common mistake is not a technical one — it is treating stateful workloads like stateless ones. Databases are not Deployments. They need stable identities, careful ordering, tested backup strategies, and operators who understand both Kubernetes and the specific database. Respect the complexity, and Kubernetes becomes an excellent platform for data workloads.


Continue to Module 1.2: Apache Kafka on Kubernetes (Strimzi) to learn how to deploy and operate the most widely-used distributed streaming platform on Kubernetes.


“Data is not just state. Data is the reason most applications exist.” — Kelsey Hightower