Module 4.1: Storage Architecture Decisions
Complexity:
[MEDIUM]| Time: 45 minutesPrerequisites: Module 1.2: Server Sizing, CKA: Storage
Why This Module Matters
Section titled “Why This Module Matters”A data analytics company deployed a 40-node Kubernetes cluster on bare metal with local SATA SSDs in each server. For the first six months, everything worked. Then they deployed Apache Kafka, which needed persistent storage that survived node rescheduling. When a worker node failed, Kafka brokers were rescheduled to new nodes — but their data was on the dead node’s local SSDs. They lost 4 hours of event data and spent 3 days manually rebalancing partitions.
They then deployed Ceph via Rook on the same SSDs. Performance dropped 60%. Ceph’s replication tripled the write load on drives that were already serving application I/O. The OSD processes competed with Kubernetes workloads for CPU and memory. Their monitoring showed that Ceph rebalancing after a node failure consumed 80% of cluster I/O bandwidth for 4 hours, making all applications sluggish.
The fix was dedicating 6 servers as Ceph OSD nodes with NVMe drives, separated from the Kubernetes worker nodes. Storage traffic ran on a dedicated VLAN with jumbo frames. The lesson: storage architecture decisions must be made before you buy hardware, not after you deploy workloads.
What You’ll Be Able to Do
Section titled “What You’ll Be Able to Do”After completing this module, you will be able to:
- Design storage architectures that separate compute and storage tiers with dedicated networks and hardware
- Evaluate distributed storage (Ceph, Longhorn) vs. local storage vs. external SAN/NAS for specific workload requirements
- Plan storage hardware procurement including drive types (NVMe, SSD, HDD), RAID configurations, and capacity projections
- Diagnose storage performance bottlenecks caused by I/O contention, replication overhead, and network saturation
What You’ll Learn
Section titled “What You’ll Learn”- Direct-Attached Storage (DAS) vs Network-Attached Storage (NAS) vs SAN
- NVMe vs SSD vs HDD tiering for Kubernetes workloads
- etcd storage requirements (the most demanding component)
- When to use local storage vs distributed storage
- Dedicated storage nodes vs hyper-converged (storage on worker nodes)
- IOPS, throughput, and latency benchmarking
Storage Options for On-Premises K8s
Section titled “Storage Options for On-Premises K8s”The Three Storage Models
Section titled “The Three Storage Models”┌─────────────────────────────────────────────────────────────┐│ STORAGE MODELS ││ ││ DAS (Direct-Attached Storage) ││ ┌──────────────────────────────┐ ││ │ Server │ ││ │ ┌──────┐ ┌──────┐ ┌──────┐│ ││ │ │NVMe 1│ │NVMe 2│ │NVMe 3││ ← drives inside server ││ │ └──────┘ └──────┘ └──────┘│ ││ └──────────────────────────────┘ ││ ✓ Lowest latency (no network hop) ││ ✓ Highest IOPS (direct PCIe) ││ ✗ Data lost if node fails (no replication) ││ ✗ Cannot move PVs between nodes ││ Best for: etcd, databases with app-level replication ││ ││ NAS (Network-Attached Storage) ││ ┌──────────────────────────────┐ ││ │ NFS/SMB Server │ ││ │ ┌──────────────────────────┐│ ← shared over network ││ │ │ File System (XFS/ZFS) ││ ││ │ │ ┌──────┐ ┌──────┐ ││ ││ │ │ │Disk 1│ │Disk 2│ ... ││ ││ │ │ └──────┘ └──────┘ ││ ││ │ └──────────────────────────┘│ ││ └──────────────────────────────┘ ││ ✓ Shared access (ReadWriteMany) ││ ✓ Simple to set up (NFS CSI driver) ││ ✗ Higher latency (network + file protocol overhead) ││ ✗ Single point of failure (unless HA NFS) ││ Best for: shared data, ML datasets, media files ││ ││ SDS/SAN (Software-Defined / Storage Area Network) ││ ┌──────────────────────────────┐ ││ │ Ceph / StorageClass │ ││ │ ┌─────┐ ┌─────┐ ┌─────┐ │ ← distributed across ││ │ │OSD 1│ │OSD 2│ │OSD 3│ │ multiple nodes ││ │ └─────┘ └─────┘ └─────┘ │ ││ │ Data replicated 3x │ ││ └──────────────────────────────┘ ││ ✓ Replicated (survives node failures) ││ ✓ PVs can be accessed from any node ││ ✓ Self-healing (auto-rebalances) ││ ✗ Higher latency than DAS (network + replication) ││ ✗ Requires dedicated storage nodes (or hyper-converged) ││ Best for: general-purpose PVs, databases without replication││ │└─────────────────────────────────────────────────────────────┘Pause and predict: You need to provide storage for three workloads: (1) PostgreSQL with streaming replication, (2) a shared ML training dataset read by 8 pods simultaneously, and (3) etcd. Before looking at the decision matrix below, assign each workload to DAS, NAS, or SDS and explain your reasoning.
Decision Matrix
Section titled “Decision Matrix”| Workload | Storage Type | Why |
|---|---|---|
| etcd | DAS (NVMe) | Requires <10ms p99 fsync; network storage too slow |
| PostgreSQL (with streaming replication) | DAS (NVMe) | App handles replication; local NVMe gives best latency |
| PostgreSQL (single instance) | Ceph RBD | Needs to survive node failure; no app-level replication |
| Kafka | DAS or Ceph | Kafka replicates data; DAS is faster but Ceph is safer |
| Prometheus TSDB | DAS (NVMe) | Write-heavy, latency-sensitive; Thanos handles replication |
| ML training data | NFS / CephFS | Large datasets, read-heavy, shared access needed |
| CI/CD workspace | DAS (SSD) | Ephemeral, high IOPS for builds, data is disposable |
| Container images | Ceph RBD or NFS | Harbor/registry storage, moderate IOPS, shared access |
| Backup targets | NFS / HDD | Large capacity, low IOPS, cost-optimized |
Storage Tiers
Section titled “Storage Tiers”┌─────────────────────────────────────────────────────────────┐│ STORAGE TIER GUIDE ││ ││ Tier Tech IOPS Latency $/TB Use ││ ──── ──── ──── ─────── ──── ─── ││ 0 NVMe Gen5 1M+ <100μs $300 etcd, ││ (PCIe 5.0) hot DB ││ ││ 1 NVMe Gen4 500K+ <200μs $150 App data ││ (PCIe 4.0) Ceph OSD ││ ││ 2 SATA SSD 50-100K 1-5ms $80 Logs, ││ (enterprise) bulk ││ ││ 3 HDD SAS 100-200 5-15ms $20 Backup, ││ (10K/15K RPM) cold ││ ││ CRITICAL RULES: ││ • etcd: Tier 0 or 1 ONLY (SATA SSD will cause elections) ││ • Ceph OSD: Tier 1 minimum (SATA SSD limits throughput) ││ • Ceph WAL/DB: Tier 0 on separate device from OSD data ││ • Backup: Tier 3 is fine (capacity over speed) ││ │└─────────────────────────────────────────────────────────────┘Stop and think: A team proposes running Ceph OSD processes on the same servers as their Kubernetes worker pods to save money on hardware. Their cluster is 40 nodes running a mix of web services and a PostgreSQL database. What happens to PostgreSQL I/O latency when a Ceph OSD node fails and triggers data rebalancing? Would your recommendation change if the cluster had 200 nodes?
Dedicated Storage Nodes vs Hyper-Converged
Section titled “Dedicated Storage Nodes vs Hyper-Converged”Dedicated Storage Nodes
Section titled “Dedicated Storage Nodes”┌─────────────────────────────────────────────────────────────┐│ DEDICATED STORAGE NODES ││ ││ Worker Nodes (compute only): Storage Nodes (Ceph only): ││ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ││ │ Worker 1 │ │ Worker 2 │ │ Ceph OSD1│ │ Ceph OSD2│ ││ │ K8s pods │ │ K8s pods │ │ 12x NVMe │ │ 12x NVMe │ ││ │ 2x NVMe │ │ 2x NVMe │ │ 256GB RAM│ │ 256GB RAM│ ││ │ (OS only)│ │ (OS only)│ │ No K8s │ │ No K8s │ ││ └──────────┘ └──────────┘ │ workloads│ │ workloads│ ││ └──────────┘ └──────────┘ ││ ││ ✓ No resource contention (storage doesn't compete with pods)││ ✓ Storage can have different hardware (more drives, less CPU)││ ✓ Storage failures don't affect compute ││ ✗ More servers needed (higher cost) ││ ✗ Network is the bottleneck (all I/O crosses the network) ││ ││ Best for: > 100 nodes, high I/O workloads, regulated envs ││ │└─────────────────────────────────────────────────────────────┘Hyper-Converged (Storage on Worker Nodes)
Section titled “Hyper-Converged (Storage on Worker Nodes)”┌─────────────────────────────────────────────────────────────┐│ HYPER-CONVERGED ││ ││ Each worker runs BOTH K8s pods AND Ceph OSD: ││ ┌───────────────────┐ ┌───────────────────┐ ││ │ Worker/OSD Node 1 │ │ Worker/OSD Node 2 │ ││ │ │ │ │ ││ │ K8s pods │ │ K8s pods │ ││ │ Ceph OSD │ │ Ceph OSD │ ││ │ NVMe 1: OS + pods │ │ NVMe 1: OS + pods │ ││ │ NVMe 2-4: Ceph │ │ NVMe 2-4: Ceph │ ││ └───────────────────┘ └───────────────────┘ ││ ││ ✓ Fewer servers (lower CapEx) ││ ✓ Data locality (pods can read from local OSD) ││ ✗ Resource contention (OSD competes with pods for CPU/RAM) ││ ✗ Node failure loses both compute AND storage ││ ✗ Harder to size (need enough CPU/RAM for both) ││ ││ Best for: < 50 nodes, budget-constrained, moderate I/O ││ │└─────────────────────────────────────────────────────────────┘Pause and predict: Your vendor claims their enterprise SATA SSD delivers “100K IOPS.” Based on the storage tier guide above, would this drive be suitable for etcd? What specific metric would you benchmark to verify, and what threshold would you consider a pass or fail?
Benchmarking Storage
Section titled “Benchmarking Storage”Always benchmark before deploying workloads. The three tests below cover the three most important storage patterns for Kubernetes: sequential writes (log files, WAL), random reads (database queries), and fsync-heavy writes (etcd). Running all three tells you which storage tier your hardware actually belongs to, regardless of what the vendor’s datasheet claims:
# Install fioapt-get install fio
# Test sequential write throughput (simulates log writes)fio --name=seq-write \ --rw=write --bs=128k --direct=1 \ --size=4g --numjobs=4 --runtime=60 \ --group_reporting --filename=/data/fio-test
# Test random read IOPS (simulates database reads)fio --name=rand-read \ --rw=randread --bs=4k --direct=1 \ --size=4g --numjobs=8 --runtime=60 \ --group_reporting --filename=/data/fio-test
# Test etcd-like workload (sequential write with fdatasync)fio --name=etcd-wal \ --rw=write --bs=2300 --fdatasync=1 \ --size=22m --runtime=60 --time_based \ --filename=/data/etcd-test
# Expected results by storage type:# NVMe Gen4: seq write ~3 GB/s, rand read ~500K IOPS, fsync ~0.1ms# SATA SSD: seq write ~500 MB/s, rand read ~80K IOPS, fsync ~2-5ms# HDD SAS: seq write ~200 MB/s, rand read ~200 IOPS, fsync ~5-15msDid You Know?
Section titled “Did You Know?”-
Ceph was started at UC Santa Cruz by Sage Weil in 2004 and formalized in his PhD thesis in 2007. It became the storage backbone for most OpenStack deployments and is now the default distributed storage for on-premises Kubernetes via Rook.
-
NVMe drives can fail faster under sustained write loads than SATA SSDs because they have higher write amplification at full speed. Enterprise NVMe drives (like Samsung PM9A3) have much higher endurance (DWPD — Drive Writes Per Day) than consumer NVMe. Always use enterprise-grade drives for Ceph OSDs.
-
A single NVMe drive provides more IOPS than an entire SAN array from 10 years ago. A Samsung PM9A3 (3.84TB) delivers 1M random read IOPS. A mid-range EMC VNX from 2014 delivered ~200K IOPS from 120 spinning disks. Modern on-prem storage can outperform legacy enterprise storage at 1/10th the cost.
-
Kubernetes persistent volumes are statically provisioned by default — you create a PV manually and a PVC claims it. Dynamic provisioning (via StorageClass + CSI driver) is essential for on-prem. Without it, every PV request becomes a manual ticket.
Common Mistakes
Section titled “Common Mistakes”| Mistake | Problem | Solution |
|---|---|---|
| SATA SSD for etcd | Leader elections under load | NVMe only for etcd (Tier 0 or 1) |
| Hyper-converged without resource limits | Ceph OSD steals CPU/RAM from pods | Use cgroup limits: 2 cores + 4GB per OSD |
| No separate storage network | Ceph replication competes with pod traffic | Dedicated VLAN with jumbo frames for storage |
| Single replication factor | Data loss on node failure | Ceph replication factor 3 (tolerates 1 failure) |
| Ceph on worker nodes at scale | Rebalancing kills application performance | Dedicated storage nodes above 100 nodes |
| Not benchmarking before deploy | Discover performance issues in production | fio benchmark on every storage tier before use |
| Mixing drive types in one pool | Inconsistent performance across PVs | Separate storage classes per tier |
| No monitoring for disk health | Drives fail silently until data loss | smartctl monitoring + Prometheus alerts |
Question 1
Section titled “Question 1”You need persistent storage for a PostgreSQL database that uses streaming replication across 3 pods. Should you use local NVMe or Ceph RBD?
Answer
Local NVMe (DAS).
PostgreSQL with streaming replication handles data replication at the application level — each replica maintains its own copy of the data. Adding Ceph replication on top would mean triple replication (Ceph 3x) times triple replication (PostgreSQL 3 replicas) = 9 copies of every write. This wastes storage and adds unnecessary latency.
With local NVMe:
- Write latency: <0.1ms (direct PCIe, no network)
- No double-replication overhead
- If a node fails, PostgreSQL promotes a replica (application-level failover)
Use local-path-provisioner or TopoLVM as the CSI driver for local NVMe volumes with topology awareness.
Exception: If PostgreSQL is a single instance without replication (dev/test), use Ceph RBD so the volume can be remounted on another node if the original fails.
Question 2
Section titled “Question 2”Your Ceph cluster shows 80% I/O bandwidth consumed during rebalancing after a node failure. How do you prevent this from affecting application performance?
Answer
Limit Ceph recovery/backfill bandwidth:
# Set recovery limits (Ceph CLI)ceph config set osd osd_recovery_max_active 1 # default 3ceph config set osd osd_recovery_sleep 0.5 # 500ms sleep between opsceph config set osd osd_max_backfills 1 # default 1ceph config set osd osd_recovery_op_priority 1 # lowest priority (default 3)
# Set client I/O priority higher than recoveryceph config set osd osd_client_op_priority 63 # highest priorityArchitectural solutions:
- Dedicated storage network: Recovery traffic on VLAN 30, application I/O on VLAN 20
- More OSDs: Distribute data across more drives so each OSD rebalances less data
- Faster drives: NVMe for OSDs so recovery completes faster
- Spare capacity: Keep cluster below 70% utilization so rebalancing has headroom
Question 3
Section titled “Question 3”What is the minimum number of Ceph OSD nodes for a production deployment, and why?
Answer
Minimum 3 OSD nodes with replication factor 3.
Why 3:
- Ceph replicates each object to N different OSDs (default N=3)
- Each replica must be on a different failure domain (different node)
- With 3 nodes, each object has one copy on each node
- If one node fails, 2 copies remain (no data loss)
- Ceph can rebuild the third copy using the remaining 2
Why not 2: With replication factor 3 and only 2 nodes, Ceph cannot place 3 replicas on 3 different nodes. It will either place 2 replicas on one node (unsafe) or refuse to create the volume.
Recommended for production: 5+ OSD nodes. This allows:
- Node failure with no capacity pressure
- Rolling maintenance (take one node offline, still have 4)
- Better data distribution (more OSDs = more even load)
Question 4
Section titled “Question 4”Your etcd nodes use SATA SSDs and you see frequent leader elections. Explain the root cause and the fix.
Answer
Root cause: etcd uses Raft consensus, which requires the leader to fsync the Write-Ahead Log (WAL) on every write. SATA SSDs have fsync latency of 2-10ms (99th percentile). Under load, this can spike to 20-50ms.
etcd’s heartbeat interval is 100ms, and the election timeout is 1,000ms (10x heartbeat). If the leader’s fsync latency consistently exceeds 50ms, it cannot process writes fast enough to send heartbeats within the 100ms interval. Followers assume the leader is dead and trigger an election. During the election, the cluster is read-only.
Fix: Replace SATA SSDs with NVMe drives for etcd data directory (/var/lib/etcd). NVMe fsync latency is 0.05-0.5ms (99th percentile) — 10-100x faster than SATA.
Verification:
# Before (SATA SSD):etcdctl check perf# ... PASS: Throughput is ... FAIL: Slowest request took ...
# After (NVMe):etcdctl check perf# ... PASS: Throughput is ... PASS: Slowest request took ...This is one of the most common on-premises Kubernetes issues and is entirely preventable by selecting the right storage during hardware procurement.
Hands-On Exercise: Benchmark Storage Tiers
Section titled “Hands-On Exercise: Benchmark Storage Tiers”Task: Benchmark different storage types and compare performance.
# Create a test directorymkdir -p /mnt/storage-benchmark# NOTE: Run benchmarks on the target storage device mount point.# /tmp is often tmpfs (RAM disk) which gives misleading results.
# Test 1: Sequential write throughputecho "=== Sequential Write ==="fio --name=seq-write --rw=write --bs=128k --direct=1 \ --size=1g --numjobs=1 --runtime=30 --time_based \ --group_reporting --filename=/mnt/storage-benchmark/seq-test
# Test 2: Random read IOPSecho "=== Random Read IOPS ==="fio --name=rand-read --rw=randread --bs=4k --direct=1 \ --size=1g --numjobs=4 --runtime=30 --time_based \ --group_reporting --filename=/mnt/storage-benchmark/iops-test
# Test 3: etcd WAL simulationecho "=== etcd WAL (fdatasync) ==="fio --name=etcd-wal --rw=write --bs=2300 --fdatasync=1 \ --size=22m --runtime=30 --time_based \ --filename=/mnt/storage-benchmark/etcd-test
# Record results and compare against the tier guide above# NVMe should show: >1 GB/s seq, >100K IOPS rand, <1ms fdatasync# SATA SSD: ~400 MB/s seq, ~50K IOPS rand, ~2-5ms fdatasync
# Cleanuprm -rf /mnt/storage-benchmarkSuccess Criteria
Section titled “Success Criteria”- Sequential write throughput measured (MB/s)
- Random read IOPS measured
- fsync latency measured (p99)
- Results compared against tier guide
- Storage tier identified for the tested device
Next Module
Section titled “Next Module”Continue to Module 4.2: Software-Defined Storage (Ceph/Rook) to learn how to deploy and operate Ceph as distributed storage for Kubernetes.