Module 4.2: Software-Defined Storage with Ceph & Rook
Complexity: [COMPLEX] | Time: 60 minutes
Prerequisites: Module 4.1: Storage Architecture, Rook/Ceph Toolkit
Why This Module Matters
Ceph is the dominant distributed storage system for on-premises Kubernetes. It turns a collection of local disks across multiple servers into a unified, replicated, self-healing storage pool that Kubernetes can consume via CSI. When a disk fails, Ceph automatically redistributes data. When a node goes down, Ceph keeps serving from replicas on surviving nodes. When you add new servers, Ceph rebalances automatically.
But Ceph is not simple. It has its own daemons (MON, OSD, MDS, MGR), its own consensus protocol (Paxos for monitors), its own networking requirements (separate public and cluster networks), and its own failure modes. Running Ceph poorly is worse than not running it at all — a misconfigured Ceph cluster can amplify failures instead of preventing them.
Rook is the Kubernetes operator that manages Ceph. It turns Ceph deployment from a multi-day manual process into a kubectl apply. But understanding what Rook does under the hood is essential for troubleshooting.
What You’ll Be Able to Do
After completing this module, you will be able to:
- Deploy a production-grade Ceph cluster via Rook with properly sized MON, OSD, and MDS components
- Configure Ceph storage classes for block (RBD), filesystem (CephFS), and object (RGW) storage in Kubernetes
- Optimize Ceph performance by tuning OSD placement, replication factors, CRUSH rules, and network separation
- Troubleshoot Ceph health warnings, slow OSD recovery, and PG degradation during node failures
What You’ll Learn
- Ceph architecture (MON, OSD, MDS, MGR, RADOS)
- Rook operator deployment and CephCluster CRD
- Storage classes for block (RBD), filesystem (CephFS), and object (RGW)
- Performance tuning for on-premises workloads
- Monitoring and alerting for Ceph health
- Failure recovery procedures
Ceph Architecture
```
┌─────────────────────────────────────────────────────────────┐
│                      CEPH ARCHITECTURE                      │
│                                                             │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐                   │
│  │  MON 1   │  │  MON 2   │  │  MON 3   │  Monitors         │
│  │ (Paxos)  │  │ (Paxos)  │  │ (Paxos)  │  - Cluster map    │
│  └──────────┘  └──────────┘  └──────────┘  - Quorum (odd #) │
│                                            - 3 or 5 MONs    │
│  ┌──────────┐                                               │
│  │   MGR    │  Manager                                      │
│  │          │  - Dashboard, metrics, modules                │
│  │          │  - Active/standby HA                          │
│  └──────────┘                                               │
│                                                             │
│  ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐      │
│  │OSD 0 │ │OSD 1 │ │OSD 2 │ │OSD 3 │ │OSD 4 │ │OSD 5 │      │
│  │NVMe  │ │NVMe  │ │NVMe  │ │NVMe  │ │NVMe  │ │NVMe  │      │
│  │Node 1│ │Node 1│ │Node 2│ │Node 2│ │Node 3│ │Node 3│      │
│  └──────┘ └──────┘ └──────┘ └──────┘ └──────┘ └──────┘      │
│                                                             │
│  Object Storage Daemons (OSDs):                             │
│  - One per physical drive                                   │
│  - Stores data as objects in a flat namespace               │
│  - Replicates data to other OSDs (default 3x)               │
│  - Self-heals: rebuilds replicas when an OSD fails          │
│                                                             │
│  ┌──────────┐  (Optional)                                   │
│  │   MDS    │  Metadata Server                              │
│  │          │  - Required only for CephFS                   │
│  │          │  - Manages file/directory namespace           │
│  └──────────┘                                               │
│                                                             │
│  Storage Types:                                             │
│  RBD (Block)  → PersistentVolumes (ReadWriteOnce)           │
│  CephFS (File)→ PersistentVolumes (ReadWriteMany)           │
│  RGW (Object) → S3-compatible object storage                │
└─────────────────────────────────────────────────────────────┘
```

Ceph Networking
```
┌─────────────────────────────────────────────────────────────┐
│                     CEPH NETWORK DESIGN                     │
│                                                             │
│  PUBLIC NETWORK (VLAN 20 or 30):                            │
│  ├── Client → OSD communication (read/write data)           │
│  ├── MON communication (cluster map queries)                │
│  └── K8s nodes → Ceph (CSI driver traffic)                  │
│                                                             │
│  CLUSTER NETWORK (VLAN 30, separate from public):           │
│  ├── OSD → OSD replication (write amplification: 3x data)   │
│  ├── OSD → OSD recovery (backfill after failure)            │
│  └── Heartbeat between OSDs                                 │
│                                                             │
│  WHY SEPARATE:                                              │
│  Replication traffic = 2x the client write traffic          │
│  Recovery traffic = can saturate the network for hours      │
│  Separating prevents storage operations from impacting pods │
│                                                             │
│  RECOMMENDED:                                               │
│  Public:  25GbE (shared with K8s node network)              │
│  Cluster: 25GbE (dedicated, same bond, different VLAN)      │
│  MTU: 9000 (jumbo frames) on both networks                  │
└─────────────────────────────────────────────────────────────┘
```

Deploying Ceph with Rook
Step 1: Install Rook Operator
```shell
# Add Rook Helm repo
helm repo add rook-release https://charts.rook.io/release
helm repo update

# Install Rook operator
helm install rook-ceph rook-release/rook-ceph \
  --namespace rook-ceph --create-namespace \
  --set csi.enableRBDDriver=true \
  --set csi.enableCephFSDriver=true

# Wait for operator
kubectl -n rook-ceph wait --for=condition=Ready pod \
  -l app=rook-ceph-operator --timeout=300s
```

Pause and predict: The CephCluster definition below explicitly lists which devices on which nodes to use as OSDs, rather than setting `useAllDevices: true`. Why is explicit device listing safer? What could go wrong with `useAllDevices: true` on a server that has both OS drives and data drives?
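One way to see the danger before deploying (a toy sketch; the selection logic and device names are hypothetical, not Rook's actual implementation):

```python
def select_osd_devices(node_devices, explicit=None, use_all=False):
    """Toy model of OSD device selection -- NOT Rook's real logic.

    node_devices maps device name -> role on the node ('os' or 'data').
    """
    if use_all:
        # 'useAllDevices: true': every raw block device is claimed,
        # including sda, which holds the operating system
        return sorted(node_devices)
    # explicit allowlist: only the named data drives become OSDs
    return [d for d in sorted(node_devices) if d in (explicit or [])]


# hypothetical node: one OS drive plus two NVMe data drives
devices = {"sda": "os", "nvme0n1": "data", "nvme1n1": "data"}

all_mode = select_osd_devices(devices, use_all=True)   # grabs "sda" too
safe_mode = select_osd_devices(devices, explicit=["nvme0n1", "nvme1n1"])
```

With `use_all=True` the OS drive is swept up with the data drives, which is exactly why the CephCluster below names every device explicitly.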
Step 2: Create CephCluster
The CephCluster CRD below configures a production-grade Ceph deployment. Notice three critical design decisions: (1) `allowMultiplePerNode: false` for MONs ensures that a single node failure cannot lose quorum, (2) `provider: host` for networking bypasses container networking overhead for storage I/O, and (3) resource limits on OSDs prevent them from consuming all CPU and memory during recovery operations:
```yaml
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    image: quay.io/ceph/ceph:v19.2
  dataDirHostPath: /var/lib/rook

  mon:
    count: 3
    allowMultiplePerNode: false  # 1 MON per node (HA)

  mgr:
    count: 2
    allowMultiplePerNode: false

  dashboard:
    enabled: true
    ssl: true

  network:
    provider: host  # Use host networking for best performance
    # Or specify Multus for dedicated storage networks:
    # provider: multus
    # selectors:
    #   public: rook-ceph/public-net
    #   cluster: rook-ceph/cluster-net

  storage:
    useAllNodes: false
    useAllDevices: false
    nodes:
      - name: "storage-01"
        devices:
          - name: "nvme0n1"
          - name: "nvme1n1"
          - name: "nvme2n1"
          - name: "nvme3n1"
      - name: "storage-02"
        devices:
          - name: "nvme0n1"
          - name: "nvme1n1"
          - name: "nvme2n1"
          - name: "nvme3n1"
      - name: "storage-03"
        devices:
          - name: "nvme0n1"
          - name: "nvme1n1"
          - name: "nvme2n1"
          - name: "nvme3n1"

  resources:
    osd:
      limits:
        cpu: "2"
        memory: "4Gi"
      requests:
        cpu: "1"
        memory: "2Gi"

  placement:
    mon:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: role
                  operator: In
                  values: ["storage"]
```

Stop and think: The CephBlockPool below sets `failureDomain: host` with `replicated.size: 3`. This means each block is copied to 3 different servers. If you accidentally set `failureDomain: osd` instead of `host`, two replicas could land on different drives of the same server. What happens when that server loses power?
Step 3: Create StorageClasses
```yaml
# Block storage (RBD) — most common for databases, stateful apps
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: replicated-pool
  namespace: rook-ceph
spec:
  failureDomain: host  # Replicate across hosts, not just OSDs
  replicated:
    size: 3  # 3 copies of every block
    requireSafeReplicaSize: true
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-block
provisioner: rook-ceph.rbd.csi.ceph.com
parameters:
  clusterID: rook-ceph
  pool: replicated-pool
  imageFormat: "2"
  imageFeatures: layering,fast-diff,object-map,deep-flatten,exclusive-lock
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
  csi.storage.k8s.io/fstype: ext4
reclaimPolicy: Delete
allowVolumeExpansion: true
---
# Filesystem storage (CephFS) — for shared access (ReadWriteMany)
apiVersion: ceph.rook.io/v1
kind: CephFilesystem
metadata:
  name: shared-fs
  namespace: rook-ceph
spec:
  metadataPool:
    replicated:
      size: 3
  dataPools:
    - name: data0
      replicated:
        size: 3
  metadataServer:
    activeCount: 1
    activeStandby: true
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-filesystem
provisioner: rook-ceph.cephfs.csi.ceph.com
parameters:
  clusterID: rook-ceph
  fsName: shared-fs
  pool: shared-fs-data0
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-cephfs-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
reclaimPolicy: Delete
allowVolumeExpansion: true
```

Step 4: Verify Ceph Health
```shell
# Check Ceph cluster status
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status
# cluster:
#   id: a1b2c3d4-...
#   health: HEALTH_OK
#
# services:
#   mon: 3 daemons, quorum a,b,c
#   mgr: a(active), standbys: b
#   osd: 12 osds: 12 up, 12 in
#
# data:
#   pools: 2 pools, 128 pgs
#   objects: 1.23k objects, 4.5 GiB
#   usage: 15 GiB used, 45 TiB / 45 TiB avail

# Check OSD status
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd tree
# ID  CLASS  WEIGHT    TYPE NAME            STATUS  REWEIGHT
# -1         45.00000  root default
# -3         15.00000      host storage-01
#  0   ssd    3.75000          osd.0        up      1.00000
#  1   ssd    3.75000          osd.1        up      1.00000
#  2   ssd    3.75000          osd.2        up      1.00000
#  3   ssd    3.75000          osd.3        up      1.00000
# ...

# Check pool IOPS
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd pool stats
```

Pause and predict: After a storage node failure, Ceph starts rebuilding data replicas on surviving nodes. This recovery traffic can consume 80% of available I/O bandwidth. Your production database is on the same Ceph cluster. How would you balance the trade-off between fast recovery (data safety) and application performance?
Performance Tuning
Section titled “Performance Tuning”Key Tuning Parameters
The tuning parameters below control the tension between recovery speed and client I/O performance. Setting `osd_recovery_max_active` to 1 means only one recovery operation per OSD runs at a time — slower recovery, but application latency stays predictable. Setting it to 3 recovers faster but can spike I/O latency by 5-10x during the recovery window:
```shell
# Inside rook-ceph-tools pod:

# Increase OSD recovery speed (at cost of client I/O)
ceph config set osd osd_recovery_max_active 3
ceph config set osd osd_recovery_sleep 0

# Or throttle recovery to protect client I/O
ceph config set osd osd_recovery_max_active 1
ceph config set osd osd_recovery_sleep 0.5

# Enable RBD caching for read-heavy workloads
ceph config set client rbd_cache true
ceph config set client rbd_cache_size 134217728  # 128MB

# Set scrub schedule (background data integrity check)
ceph config set osd osd_scrub_begin_hour 2  # Start at 2 AM
ceph config set osd osd_scrub_end_hour 6    # End at 6 AM

# Monitor PG (Placement Group) count — critical for performance
# Rule of thumb: total PGs = (OSDs * 100) / replication_factor
# 12 OSDs, replication 3: (12 * 100) / 3 = 400 PGs total (divided among pools)
# Round to the nearest power of 2: 512
```

Did You Know?
- Ceph’s CRUSH algorithm (Controlled Replication Under Scalable Hashing) determines where data is stored without a central lookup table. This means Ceph can scale to thousands of OSDs without a metadata bottleneck — any client can calculate the location of any object independently.
- Ceph monitors use Paxos consensus, not Raft. Paxos predates Raft by more than two decades (1989 vs 2013) and solves the same consensus problem, but is notoriously harder to implement correctly. The Ceph team chose Paxos because Raft did not exist when Ceph was designed.
- A single Ceph cluster can scale to exabytes. CERN runs one of the largest Ceph deployments: 30+ PB across thousands of OSDs, storing physics experiment data from the Large Hadron Collider.
- BlueStore replaced FileStore as the default OSD backend in Ceph Luminous (2017). BlueStore writes directly to raw block devices, bypassing the Linux filesystem entirely. This eliminates the double-write penalty that FileStore suffered and roughly doubles write performance.
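The CRUSH idea in the first bullet above, computing placement instead of looking it up, can be sketched as a toy model (the `place` helper is hypothetical; real CRUSH uses weighted pseudo-random draws over a device hierarchy):

```python
import hashlib


def place(object_name, pg_num, osds_per_host, replicas=3):
    """Toy stand-in for CRUSH: deterministically map an object to a
    placement group, then pick `replicas` OSDs on distinct hosts.
    No central lookup table is consulted -- any client with the same
    cluster map computes the same answer."""
    # Step 1: hash the object name into a placement group
    pg = int(hashlib.md5(object_name.encode()).hexdigest(), 16) % pg_num
    # Step 2: derive replica locations from the PG id alone,
    # rotating through hosts so no host holds two replicas
    hosts = sorted(osds_per_host)
    chosen = []
    for i in range(replicas):
        host = hosts[(pg + i) % len(hosts)]
        osds = osds_per_host[host]
        chosen.append(osds[pg % len(osds)])
    return pg, chosen
```

Because placement is a pure function of the object name and the cluster map, two independent clients always agree on where an object lives, which is what removes the metadata bottleneck.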
Common Mistakes
| Mistake | Problem | Solution |
|---|---|---|
| Too few PGs | Uneven data distribution, hotspots | Calculate PGs: (OSDs × 100) / replication_factor |
| No cluster network | Replication competes with client I/O | Separate public and cluster networks |
| MONs on OSD nodes | MON fails when OSD saturates CPU/RAM | Dedicated MON nodes (or on K8s control plane nodes) |
| No resource limits on OSDs | OSD consumes all node RAM during recovery | Set CPU/memory limits in CephCluster spec |
| Replication factor 2 | Single failure = data at risk during rebuild | Always use replication factor 3 |
| Not monitoring disk health | Drive fails silently, OSD goes down | smartmontools + Prometheus SMART exporter |
| Scrubbing during peak hours | Background scrub competes with workload I/O | Schedule scrubs for off-peak hours |
| Using `useAllDevices: true` | Accidentally formats OS drives as OSDs | Explicitly list devices per node |
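As a companion to the PG row in the table, here is a small helper implementing the rule of thumb (a sketch only; in practice Ceph's pg_autoscaler manages PG counts for you):

```python
def pg_count(osds, replication, target_pgs_per_osd=100):
    """Rule of thumb: (OSDs * ~100) / replication_factor,
    rounded up to the next power of two."""
    raw = osds * target_pgs_per_osd // replication
    power = 1
    while power < raw:
        power *= 2
    return power


# 12 OSDs, replication 3 -> raw 400 -> 512
suggested = pg_count(12, 3)
```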
Question 1
You have 12 NVMe drives across 3 storage nodes (4 per node). What replication factor and failure domain should you use?
Answer
Replication factor 3, failure domain host.
- Each object is replicated to 3 different OSDs on 3 different hosts
- If an entire host fails (all 4 OSDs), 2 copies remain on the other 2 hosts
- Recovery redistributes the lost replicas using the 8 surviving OSDs
- Failure domain `host` ensures no two replicas of the same object are on the same server
Do NOT use failure domain `osd` — this would allow 2 replicas on the same host, meaning a host failure could lose 2 of 3 copies.
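A tiny simulation makes the difference concrete (the OSD-to-host mapping mirrors this question's 12-drive layout; this is illustrative Python, not Ceph code):

```python
# osd id -> host, 4 OSDs per host as in this question
OSD_HOST = {osd: f"storage-{osd // 4 + 1:02d}" for osd in range(12)}


def surviving_replicas(replica_osds, failed_host):
    """Count how many of an object's replicas are NOT on the failed host."""
    return sum(OSD_HOST[o] != failed_host for o in replica_osds)


# failureDomain: host -> the 3 replicas land on 3 distinct hosts
host_placement = [0, 4, 8]
# failureDomain: osd -> two replicas may share a host (osd.0 and osd.1)
osd_placement = [0, 1, 8]

# Power loss on storage-01:
# host_placement keeps 2 copies; osd_placement keeps only 1
```

One surviving copy means any further failure during recovery loses data, which is why the failure domain matters as much as the replica count.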
```yaml
spec:
  failureDomain: host  # Not "osd"
  replicated:
    size: 3
```

Question 2
Your Ceph cluster shows HEALTH_WARN: 1 osds down. What is the immediate impact and what should you do?
Answer
Immediate impact: Minimal. With replication factor 3, all data has 2 remaining copies. No data is lost and all volumes are accessible. However, those placement groups that had replicas on the failed OSD now have only 2 copies instead of 3 (degraded).
What happens automatically:
- Ceph marks the OSD as `down` and starts a timer
- After 10 minutes (default `mon_osd_down_out_interval`), Ceph marks the OSD as `out`
- Ceph begins redistributing data to rebuild the third copy on surviving OSDs
- Recovery time depends on data size and network speed (~100GB/min on 25GbE)
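The last bullet can be turned into a rough estimator (the ~100GB/min figure is this module's rule of thumb for 25GbE, not a measured guarantee; the drive size is a hypothetical example):

```python
def recovery_minutes(lost_data_tb, rate_gb_per_min=100.0):
    """Rough time for Ceph to re-replicate the data that lived on a
    failed OSD, given an effective recovery rate in GB/min."""
    return lost_data_tb * 1024 / rate_gb_per_min


# e.g. a failed 3.84 TB NVMe OSD that was ~50% full:
# 1.92 TB to rebuild -> just under 20 minutes at 100 GB/min
est = recovery_minutes(1.92)
```

Halving the recovery rate (to protect client I/O) doubles the window during which the cluster runs with only 2 copies — that is the trade-off the tuning section's `osd_recovery_sleep` knob controls.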
What you should do:
- Check which OSD and which node: `ceph osd tree`
- Check if the node is reachable (is this a disk failure or a node failure?)
- If disk failure: replace the drive, then `ceph osd purge <id> --yes-i-really-mean-it` and let Rook redeploy
- If node failure: fix the node; when it comes back, the OSD will rejoin automatically
- Monitor recovery: `ceph -w` (watch recovery progress)
Question 3
When should you use CephFS (ReadWriteMany) instead of RBD (ReadWriteOnce)?
Answer
Use RBD (block) for:
- Databases (PostgreSQL, MySQL, MongoDB) — need consistent block I/O
- Single-pod workloads that need persistent storage
- Any workload where only one pod writes at a time
- Best performance (direct block device, no filesystem overhead)
Use CephFS (filesystem) for:
- Shared data that multiple pods read/write simultaneously
- ML training datasets (multiple training pods read the same data)
- CMS content directories (multiple web servers serve the same files)
- Log aggregation (multiple pods write to a shared directory)
- Any workload that needs `ReadWriteMany` access mode
Do NOT use CephFS for databases — the POSIX filesystem layer adds latency and doesn’t provide the consistency guarantees that databases expect from block devices.
```yaml
# RBD PVC
accessModes: ["ReadWriteOnce"]
storageClassName: ceph-block

# CephFS PVC
accessModes: ["ReadWriteMany"]
storageClassName: ceph-filesystem
```

Hands-On Exercise: Deploy Rook-Ceph in Kind
Note: Rook removed support for directory-backed OSDs in v1.4. This exercise uses PVC-based OSDs with Kind’s default `standard` StorageClass (local-path provisioner), which is the recommended approach for test clusters.
```shell
# Create a kind cluster with 3 worker nodes
cat <<EOF | kind create cluster --config=-
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
  - role: worker
  - role: worker
EOF

# Install Rook operator
helm repo add rook-release https://charts.rook.io/release
helm install rook-ceph rook-release/rook-ceph \
  --namespace rook-ceph --create-namespace

# Wait for operator
kubectl -n rook-ceph wait --for=condition=Ready pod \
  -l app=rook-ceph-operator --timeout=300s

# Deploy a test CephCluster using PVC-based OSDs
# This uses Kind's default StorageClass to back each OSD with a PVC
kubectl apply -f - <<EOF
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    image: quay.io/ceph/ceph:v19.2
    allowUnsupported: true
  dataDirHostPath: /var/lib/rook
  mon:
    count: 1
    allowMultiplePerNode: true
  mgr:
    count: 1
    allowMultiplePerNode: true
  dashboard:
    enabled: false
  crashCollector:
    disable: true
  storage:
    storageClassDeviceSets:
      - name: set1
        count: 3
        portable: true
        volumeClaimTemplates:
          - metadata:
              name: data
            spec:
              resources:
                requests:
                  storage: 5Gi
              storageClassName: standard
              volumeMode: Block
              accessModes:
                - ReadWriteOnce
  resources:
    mon:
      limits:
        memory: "512Mi"
      requests:
        memory: "256Mi"
    osd:
      limits:
        memory: "1Gi"
      requests:
        memory: "512Mi"
EOF

# Wait for Ceph to be healthy (takes 3-5 minutes)
kubectl -n rook-ceph wait --for=condition=Ready cephcluster/rook-ceph --timeout=600s

# Deploy toolbox for ceph commands
kubectl apply -f https://raw.githubusercontent.com/rook/rook/release-1.16/deploy/examples/toolbox.yaml

# Check Ceph health
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status

# Create a block pool and storage class
kubectl apply -f https://raw.githubusercontent.com/rook/rook/release-1.16/deploy/examples/csi/rbd/storageclass-test.yaml

# Create a test PVC
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: rook-ceph-block
  resources:
    requests:
      storage: 1Gi
EOF

# Verify PVC is bound
kubectl get pvc test-pvc
# NAME       STATUS   VOLUME    CAPACITY   ACCESS MODES   STORAGECLASS
# test-pvc   Bound    pvc-...   1Gi        RWO            rook-ceph-block

# Cleanup
kubectl delete pvc test-pvc
kind delete cluster
```

Success Criteria
- Rook operator deployed and running
- CephCluster healthy (HEALTH_OK)
- Block pool and StorageClass created
- PVC bound successfully
- `ceph status` shows healthy cluster
Next Module
Continue to Module 4.3: Local Storage & Alternatives to learn about lightweight storage options that do not require a distributed storage system.