Module 4.3: Local Storage & Alternatives
Complexity: [INTERMEDIATE] | Time: 45 minutes
Prerequisites: Module 4.1: Storage Architecture, Module 4.2: Ceph & Rook
What You’ll Be Able to Do
After completing this module, you will be able to:
- Evaluate when local storage, Longhorn, OpenEBS, or LVM-based solutions are appropriate vs. full distributed storage
- Implement local-path-provisioner and TopoLVM for workloads that tolerate node-local storage semantics
- Deploy Longhorn as a lightweight replicated storage solution with automated backups and snapshot capabilities
- Design a storage strategy that matches each workload tier to the right storage backend based on persistence and performance needs
Why This Module Matters
A fintech startup ran six microservices on three bare-metal servers. They deployed Rook-Ceph because “everyone uses Ceph.” Within a week, the Ceph monitors consumed 2 GB of RAM each, the OSD recovery traffic saturated their 10 GbE links during a node reboot, and a junior engineer accidentally deleted the mon keyring while debugging. Total downtime: 14 hours.
They replaced Ceph with Longhorn. Deployment took 20 minutes. Each microservice got a 3-replica volume backed by local NVMe. The total storage overhead dropped from 9 pods (3 MONs, 3 MGRs, 3 OSDs) to a single DaemonSet. Six months later, they have had zero storage incidents.
Not every on-premises cluster needs a distributed storage system. If your workloads are stateless, or if each pod can tolerate node-local storage that disappears when the node dies, local storage solutions are simpler, faster, and cheaper. Even when you need replication, lightweight alternatives like Longhorn and OpenEBS provide it without the operational complexity of Ceph.
This module covers the full spectrum: from zero-overhead local-path-provisioner to replicated-but-simple Longhorn, with LVM-based solutions in between.
What You’ll Learn
- When local storage is the right choice (and when it is not)
- local-path-provisioner for development and ephemeral workloads
- TopoLVM and LVM CSI for production local volumes with topology awareness
- OpenEBS local engines (LocalPV-HostPath, LocalPV-LVM, LocalPV-ZFS)
- Longhorn for replicated storage without Ceph complexity
- Topology-aware provisioning and scheduler constraints
- Decision framework for choosing the right storage solution
The Local Storage Spectrum
```
┌─────────────────────────────────────────────────────────────────────┐
│                       LOCAL STORAGE SPECTRUM                        │
│                                                                     │
│  Simplest                                                     Most  │
│  ─────────────────────────────────────────────────>        Capable  │
│                                                                     │
│  hostPath    local-path    LVM CSI     OpenEBS     Longhorn         │
│  (manual)    provisioner   TopoLVM     LocalPV     (replicated)     │
│                                                                     │
│  No CSI      CSI driver    LVM thin    LVM/ZFS     Cross-node       │
│  No dynamic  Dynamic PV    Snapshots   Snapshots   replication      │
│  No cleanup  Auto cleanup  Quota       Quota       Snapshots        │
│  ───────     ───────────   ─────────   ─────────   ──────────       │
│  Dev only    Dev / Edge    Production  Production  Production       │
│                                                                     │
│  Risk: data on one node. Node failure = volume gone.                │
│  Exception: Longhorn replicates across nodes.                       │
└─────────────────────────────────────────────────────────────────────┘
```

local-path-provisioner
Rancher’s local-path-provisioner is the simplest dynamic provisioner. It creates a directory on the node’s filesystem and binds it to a PV. No LVM, no snapshots, no replication. Just a directory.
When to Use
- Development clusters (kind, k3s, single-node labs)
- Edge deployments where simplicity outweighs durability
- Workloads that already handle their own replication (etcd, CockroachDB, Kafka)
Pause and predict: Your cluster has a mix of workloads: some are stateless web services, some are databases with built-in replication (like CockroachDB), and some are single-instance PostgreSQL databases. Which of these need distributed storage (like Ceph), and which can use local storage? What happens to a local-path PVC when its node fails?
How It Works
```
┌────────────────────────────────────────────┐
│  Node: worker-01                           │
│                                            │
│  /opt/local-path-provisioner/              │
│  ├── pvc-abc123/  (PV for Pod A)           │
│  ├── pvc-def456/  (PV for Pod B)           │
│  └── pvc-ghi789/  (PV for Pod C)           │
│                                            │
│  PVC created ──> Provisioner creates dir   │
│  PVC deleted ──> Provisioner deletes dir   │
│  Pod scheduled ──> Must land on this node  │
└────────────────────────────────────────────┘
```

Deployment
```bash
# Install local-path-provisioner
kubectl apply -f https://raw.githubusercontent.com/rancher/local-path-provisioner/v0.0.30/deploy/local-path-storage.yaml

# Create a PVC (stays Pending until a Pod references it)
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-local
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: local-path
  resources:
    requests:
      storage: 1Gi
EOF
```

Key limitation: The PVC has no actual size enforcement. You requested 1 Gi but nothing prevents the pod from using the entire node disk. For real quota enforcement, use LVM-based solutions.
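Because the provisioner only creates a directory, over-consumption can only be detected after the fact. The following is a minimal sketch of such a check, not a feature of local-path-provisioner: `usage_exceeds_request` is a hypothetical helper that compares `du` output for a volume directory with the size the PVC asked for. The demonstration runs against a throwaway temp directory so it can be tried anywhere.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of local-path-provisioner): succeeds when the
# directory backing a volume uses more disk than the PVC requested.
# Usage: usage_exceeds_request <volume-dir> <requested-KiB>
usage_exceeds_request() {
  local dir="$1" requested_kib="$2" used_kib
  # du -sk prints the directory's disk usage in KiB, followed by the path.
  used_kib="$(du -sk "$dir" | awk '{print $1}')"
  [ "$used_kib" -gt "$requested_kib" ]
}

# Self-contained demonstration against a throwaway directory.
demo_dir="$(mktemp -d)"
head -c 2097152 /dev/urandom > "$demo_dir/data.bin"   # write 2 MiB
if usage_exceeds_request "$demo_dir" 1024; then        # pretend the request was 1 MiB
  echo "volume directory exceeds its request"
fi
rm -rf "$demo_dir"
```

On a real node the first argument would be one of the pvc-* directories under the provisioner's data path, and the second the PVC's requested size converted to KiB.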
TopoLVM: Production-Grade Local Volumes
TopoLVM uses LVM (Logical Volume Manager) to carve local disks into thin-provisioned volumes with actual capacity enforcement. It integrates with the Kubernetes scheduler via a mutating webhook to ensure pods land on nodes that have enough free space.
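The capacity-aware placement described above can be pictured as a filter over per-node free space. The sketch below only illustrates that idea with made-up node names and numbers; it is not TopoLVM's implementation, and `nodes_with_capacity` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Illustration only: keep the nodes whose volume group has enough free space.
# Input on stdin: one "<node> <free-GiB>" pair per line.
# Usage: nodes_with_capacity <requested-GiB>
nodes_with_capacity() {
  awk -v need="$1" '$2 + 0 >= need + 0 { print $1 }'
}

# Made-up free-space report for three workers; a 50 GiB request
# leaves only the nodes with at least 50 GiB free.
printf '%s\n' 'worker-01 120' 'worker-02 40' 'worker-03 75' |
  nodes_with_capacity 50
```

A pod requesting 50 GiB could be placed on worker-01 or worker-03 in this example, but never on worker-02.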
Architecture
```
┌─────────────────────────────────────────────────────────────┐
│                    TopoLVM ARCHITECTURE                     │
│                                                             │
│  ┌────────────────┐       ┌───────────────┐                 │
│  │ kube-scheduler │ ◄──── │ topolvm-sched │                 │
│  │ (filter/score) │       │ (webhook)     │                 │
│  └────────────────┘       └───────────────┘                 │
│         │                                                   │
│         │  Schedule pod to node with enough VG space        │
│         ▼                                                   │
│  ┌────────────────┐       ┌───────────────┐                 │
│  │ topolvm-node   │ ◄──── │ topolvm-csi   │                 │
│  │ (DaemonSet)    │       │ (controller)  │                 │
│  │                │       │               │                 │
│  │ Reports VG     │       │ Creates LVs   │                 │
│  │ free space to  │       │ via CSI calls │                 │
│  │ scheduler      │       │               │                 │
│  └────────────────┘       └───────────────┘                 │
│         │                                                   │
│         ▼                                                   │
│  ┌────────────────┐                                         │
│  │ LVM VG         │  Volume Group on local NVMe/SSD         │
│  │ ├── lv-pvc1    │  Thin-provisioned logical volumes       │
│  │ ├── lv-pvc2    │  Actual capacity enforcement            │
│  │ └── lv-pvc3    │  Snapshots supported                    │
│  └────────────────┘                                         │
└─────────────────────────────────────────────────────────────┘
```

```bash
# Prerequisite: Create an LVM Volume Group on each worker node
# (run on each node via SSH or DaemonSet)
pvcreate /dev/nvme1n1
vgcreate myvg /dev/nvme1n1

# Install TopoLVM via Helm
helm repo add topolvm https://topolvm.github.io/topolvm
helm install topolvm topolvm/topolvm \
  --namespace topolvm-system --create-namespace \
  --set lvmd.deviceClasses[0].name=nvme \
  --set lvmd.deviceClasses[0].volume-group=myvg \
  --set lvmd.deviceClasses[0].default=true \
  --set storageClasses[0].name=topolvm-provisioner \
  --set storageClasses[0].storageClass.fsType=xfs \
  --set storageClasses[0].storageClass.isDefaultClass=true
```

StorageClass with Device Classes
```yaml
# Use device classes to separate NVMe from HDD
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: topolvm-nvme
provisioner: topolvm.io
parameters:
  topolvm.io/device-class: "nvme"
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: topolvm-hdd
provisioner: topolvm.io
parameters:
  topolvm.io/device-class: "hdd"
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
```

OpenEBS Local Engines
OpenEBS provides three local engines: LocalPV-HostPath (simple directory-based, like local-path with OpenEBS scheduling), LocalPV-LVM (LVM thin provisioning with quota enforcement and snapshots), and LocalPV-ZFS (ZFS pools with checksums, compression, and native snapshots).
OpenEBS LocalPV-LVM
```bash
# Install OpenEBS LocalPV-LVM
helm repo add openebs https://openebs.github.io/openebs
helm install openebs openebs/openebs \
  --namespace openebs --create-namespace \
  --set engines.local.lvm.enabled=true \
  --set engines.replicated.mayastor.enabled=false

# Create a VolumeGroup on each node first
# (same as TopoLVM: pvcreate + vgcreate)

# StorageClass
kubectl apply -f - <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: openebs-lvmpv
provisioner: local.csi.openebs.io
parameters:
  storage: "lvm"
  volgroup: "lvmvg"
  fsType: "xfs"
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
EOF
```

OpenEBS also offers LocalPV-ZFS for workloads needing data integrity guarantees: checksums on every block, copy-on-write snapshots, native compression (lz4/zstd), and self-healing with mirrors. Create a ZFS pool on each node (zpool create zfspv-pool /dev/nvme1n1) and reference it in a StorageClass with provisioner: zfs.csi.openebs.io.
Longhorn: Replicated Storage Without Ceph
Longhorn occupies a unique position: it provides cross-node replication (like Ceph) but with drastically simpler operations. Each volume is an independent replicated block device. There is no global storage pool, no placement groups, no CRUSH map.
Stop and think: A team is debating between Ceph and Longhorn for a 5-node cluster running 10 stateful services. Ceph would require 3 MON pods, 2 MGR pods, and at least 5 OSD pods (10 storage pods total). Longhorn requires a single DaemonSet (5 pods). Beyond operational complexity, what is the architectural difference that makes Longhorn simpler per-volume but potentially less efficient at scale?
Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│                      LONGHORN ARCHITECTURE                      │
│                                                                 │
│  Volume: pvc-abc123 (3 replicas)                                │
│                                                                 │
│  ┌──────────────┐                                               │
│  │ Engine       │  iSCSI target (runs on the pod's node)        │
│  │ (worker-01)  │  Coordinates reads/writes to replicas         │
│  └──────┬───────┘                                               │
│         │                                                       │
│    ┌────┴────┬──────────┐                                       │
│    ▼         ▼          ▼                                       │
│  ┌────┐    ┌────┐    ┌────┐                                     │
│  │ R1 │    │ R2 │    │ R3 │   Replicas (sparse files on disk)   │
│  │w-01│    │w-02│    │w-03│   Each replica is a full copy       │
│  └────┘    └────┘    └────┘   Written synchronously             │
│                                                                 │
│  Node failure (worker-02):                                      │
│  - Engine continues with R1 + R3                                │
│  - Longhorn rebuilds R2 on worker-04 (if available)             │
│  - No manual intervention required                              │
│                                                                 │
│  Key difference from Ceph:                                      │
│  - No global pool, no CRUSH map, no placement groups            │
│  - Each volume is independent: no blast radius                  │
│  - Simpler to debug: one volume = one engine + N replicas       │
└─────────────────────────────────────────────────────────────────┘
```

Deployment
```bash
# Install Longhorn via Helm
helm repo add longhorn https://charts.longhorn.io
helm install longhorn longhorn/longhorn \
  --namespace longhorn-system --create-namespace \
  --set defaultSettings.defaultDataPath=/var/lib/longhorn \
  --set defaultSettings.defaultReplicaCount=3

# Verify all components are running
kubectl -n longhorn-system get pods
# longhorn-manager-xxxxx          (DaemonSet - one per node)
# longhorn-driver-deployer-xxxxx  (deploys the CSI driver)
# longhorn-ui-xxxxx               (Web UI)

# Default StorageClass is created automatically
kubectl get sc longhorn
# NAME       PROVISIONER          RECLAIMPOLICY   VOLUMEBINDINGMODE
# longhorn   driver.longhorn.io   Delete          Immediate

# Create a replicated PVC
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: longhorn-test
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: longhorn
  resources:
    requests:
      storage: 5Gi
EOF
```

Comparison Table
| Feature | local-path | TopoLVM | OpenEBS LVM | OpenEBS ZFS | Longhorn |
|---|---|---|---|---|---|
| Dynamic provisioning | Yes | Yes | Yes | Yes | Yes |
| Capacity enforcement | No | Yes | Yes | Yes | Yes |
| Snapshots | No | Yes | Yes | Yes (native) | Yes |
| Volume expansion | No | Yes | Yes | Yes | Yes |
| Replication | No | No | No | Mirror only | Yes (cross-node) |
| Topology-aware scheduling | Basic | Advanced | Yes | Yes | N/A (Immediate) |
| Compression | No | No | No | Yes (lz4/zstd) | No |
| Checksums | No | No | No | Yes | No |
| Overhead (RAM per node) | ~10 MB | ~50 MB | ~50 MB | ~100 MB | ~500 MB |
| Complexity | Minimal | Low | Low | Medium | Medium |
| Best for | Dev/edge | Prod local | Prod local | Data integrity | Replicated local |
Pause and predict: You deploy a StatefulSet with 3 replicas using a local storage StorageClass, but the binding mode is set to Immediate instead of WaitForFirstConsumer. What happens? (Hint: the PVs are created before the scheduler decides where to place the pods.)
Topology-Aware Provisioning
All local storage solutions must handle a fundamental constraint: once a volume is created on a node, the pod must always run on that node. Kubernetes manages this through topology keys and WaitForFirstConsumer binding mode.
```yaml
# StorageClass with WaitForFirstConsumer
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-nvme
provisioner: topolvm.io
volumeBindingMode: WaitForFirstConsumer  # CRITICAL
allowedTopologies:
  - matchLabelExpressions:
      - key: node.kubernetes.io/instance-type
        values:
          - storage-optimized
```

Without WaitForFirstConsumer, the PV is provisioned on a random node before the pod is scheduled, causing conflicts. With it, the PV is provisioned on the same node the scheduler chose for the pod.
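One way to catch this misconfiguration is to list StorageClasses and flag node-local provisioners that bind immediately. The sketch below assumes its input comes from the kubectl command shown in the comment; the helper name `flag_immediate_local`, the sample class names, and the choice of which provisioners count as node-local are assumptions for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical audit helper: print StorageClasses that use a node-local
# provisioner but bind with volumeBindingMode Immediate.
# Input on stdin: "<name> <provisioner> <bindingMode>" per line, e.g. from
#   kubectl get sc --no-headers \
#     -o custom-columns=NAME:.metadata.name,PROV:.provisioner,MODE:.volumeBindingMode
flag_immediate_local() {
  awk '
    # Node-local provisioners: the first three appear in this module;
    # rancher.io/local-path is the default name used by local-path-provisioner.
    $2 == "topolvm.io" || $2 == "local.csi.openebs.io" ||
    $2 == "zfs.csi.openebs.io" || $2 == "rancher.io/local-path" {
      if ($3 == "Immediate") print $1
    }'
}

# Sample input; the class names are made up.
printf '%s\n' \
  'local-nvme topolvm.io WaitForFirstConsumer' \
  'fast-local topolvm.io Immediate' \
  'longhorn driver.longhorn.io Immediate' |
  flag_immediate_local
```

Longhorn is deliberately left out of the check: its volumes are reachable from any node, so Immediate binding does not pin the pod.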
When to Use Local vs Distributed Storage
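The decision tree in this section can also be written as a small function, which makes the order of the questions explicit. This is a sketch with a hypothetical `choose_storage` helper that takes the three yes/no answers in the order the tree asks them.

```bash
#!/usr/bin/env bash
# Usage: choose_storage <app-replicates> <needs-rwx> <tolerates-failover>
# Each argument is "yes" or "no", answering the tree's questions in order.
choose_storage() {
  local app_replicates="$1" needs_rwx="$2" tolerates_failover="$3"
  if [ "$app_replicates" = "yes" ]; then
    echo "local storage (TopoLVM / OpenEBS)"
  elif [ "$needs_rwx" = "yes" ]; then
    echo "Ceph (CephFS) or NFS"
  elif [ "$tolerates_failover" = "yes" ]; then
    echo "Longhorn"
  else
    echo "Ceph (RBD)"
  fi
}

choose_storage yes no no   # an app that replicates its own data
choose_storage no yes no   # shared read-write access is required
choose_storage no no yes   # single-writer volume that can wait for failover
```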
```
┌────────────────────────────────────────────────────┐
│        DECISION TREE: LOCAL vs DISTRIBUTED         │
│                                                    │
│  Does your app handle its own replication?         │
│  (etcd, CockroachDB, Kafka, Cassandra, Elastic)    │
│       │                                            │
│  Yes ─┤                                            │
│       └──> Use LOCAL storage (TopoLVM / OpenEBS)   │
│            App replicates data, no need to pay     │
│            for storage-level replication.          │
│                                                    │
│  No ──┤                                            │
│       │  Do you need ReadWriteMany?                │
│       │       │                                    │
│       │  Yes ─┴──> Use Ceph (CephFS) or NFS        │
│       │                                            │
│       │  No ──┤                                    │
│       │       │  Can you tolerate 30-60s           │
│       │       │  failover to another node?         │
│       │       │       │                            │
│       │       │  Yes ─┴──> Use Longhorn            │
│       │       │                                    │
│       │       │  No ──┴──> Use Ceph (RBD)          │
│       │       │            with fast failover      │
└────────────────────────────────────────────────────┘
```

Did You Know?
- local-path-provisioner ships with every default k3s installation. When you run k3s on an edge device, the local-path StorageClass is created automatically. Over 100,000 edge clusters worldwide rely on it for lightweight persistent storage without any distributed storage overhead.
- TopoLVM was created by Cybozu, the Japanese enterprise software company that also created the coil CNI plugin and the accurate namespace controller. Cybozu runs hundreds of Kubernetes clusters on bare metal and needed a storage solution that enforced LVM quotas without the complexity of a distributed system.
- Longhorn was originally a Rancher Labs project before SUSE acquired Rancher in 2020. It became a CNCF sandbox project in 2019 and later moved up to incubating status. Unlike Ceph, each Longhorn volume is an independent replicated unit, so a failure in one volume’s engine does not affect any other volume.
- OpenEBS joined the CNCF as a sandbox project in 2019 and popularized the concept of “Container Attached Storage”: the idea that storage controllers should run as containers alongside application containers, not as a separate infrastructure layer.
Common Mistakes
| Mistake | Problem | Solution |
|---|---|---|
| Using hostPath in production | No lifecycle management, no cleanup, security risk | Use local-path-provisioner at minimum |
| Missing WaitForFirstConsumer | PV created on wrong node, pod cannot start | Always set volumeBindingMode: WaitForFirstConsumer for local storage |
| Longhorn on slow networks | Synchronous replication adds write latency proportional to network RTT | Ensure dedicated storage network or 10 GbE minimum |
| No VG space monitoring | Node runs out of LVM space, new PVCs fail | Monitor VG free space with Prometheus + node-exporter textfile collector |
| Choosing Ceph for 3-node clusters | Ceph overhead dominates small clusters (9+ pods just for storage) | Use Longhorn or TopoLVM for clusters under 6 nodes |
| Ignoring disk I/O isolation | One pod’s heavy writes starve other pods on the same disk | Use LVM thin pools with I/O limits, or separate VGs per workload class |
| Running databases on local-path | No capacity enforcement, no snapshots, no backup integration | Use TopoLVM or OpenEBS LVM for databases |
| Longhorn replica count > node count | Replicas cannot be placed, volume stays degraded | Set replica count to min(3, number_of_nodes) |
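For the VG monitoring row, one possible approach is a small script that turns `vgs` output into a file for the node-exporter textfile collector. This is a sketch under assumptions: the metric name `node_lvm_vg_free_bytes` is hypothetical rather than an established convention, and the function only formats text, so it can be tried without LVM installed.

```bash
#!/usr/bin/env bash
# Convert "<vg-name> <free-bytes>" lines on stdin into Prometheus text format.
# Intended input (run as root on each node):
#   vgs --noheadings --units b --nosuffix -o vg_name,vg_free
# The metric name below is an assumption, not an established convention.
vg_free_metrics() {
  echo '# HELP node_lvm_vg_free_bytes Free space in the LVM volume group.'
  echo '# TYPE node_lvm_vg_free_bytes gauge'
  awk 'NF == 2 { printf "node_lvm_vg_free_bytes{vg=\"%s\"} %s\n", $1, $2 }'
}

# Sample line shaped like vgs output (the value is made up).
printf '  myvg 107374182400\n' | vg_free_metrics
```

On a node, the output would be redirected to a .prom file inside the directory passed to node-exporter via --collector.textfile.directory, typically from a cron job or systemd timer.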
Question 1
You have a 3-node bare-metal cluster running PostgreSQL with streaming replication (1 primary, 2 replicas). Should you use Longhorn or TopoLVM for the database volumes?
Answer
Use TopoLVM (or OpenEBS LVM). PostgreSQL streaming replication already handles data replication at the application level. Each PostgreSQL instance maintains a full copy of the data.
Adding Longhorn replication would mean:
- Each PostgreSQL replica (3 copies at the app level) is stored with 3 copies at the storage level
- Total copies: 3 x 3 = 9 copies of the same data
- Triple the write amplification (every PostgreSQL write goes to 3 Longhorn replicas)
- Triple the network traffic for storage replication
With TopoLVM:
- Each PostgreSQL instance gets a local LVM volume on its node
- PostgreSQL handles replication itself (WAL shipping)
- Total copies: 3 (one per PostgreSQL instance)
- Writes go directly to local NVMe with no network overhead
Rule of thumb: If the application replicates, the storage should not.
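The arithmetic behind this rule of thumb, as a short sketch (`total_copies` is a hypothetical helper):

```bash
#!/usr/bin/env bash
# Physical copies of the data =
#   application-level replicas x storage-level replicas per volume.
# Usage: total_copies <app-replicas> <storage-replicas>
total_copies() {
  echo $(( $1 * $2 ))
}

echo "3 PostgreSQL instances on Longhorn (3 replicas): $(total_copies 3 3) copies"
echo "3 PostgreSQL instances on TopoLVM (no replication): $(total_copies 3 1) copies"
```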
Question 2
A developer creates a PVC with storageClassName: local-path requesting 50 Gi. The node has 100 Gi free. Can the pod use more than 50 Gi?
Answer
Yes. local-path-provisioner does not enforce capacity limits. The 50Gi in the PVC is purely informational. The pod can write up to the full 100 Gi of free disk space on the node (or until the filesystem is full).
This is because local-path-provisioner creates a plain directory on the host filesystem. There is no LVM logical volume, no quota, no cgroup constraint limiting the directory’s size.
To enforce actual capacity:
- TopoLVM: Creates an LVM logical volume of exactly the requested size. Writes beyond that size fail with ENOSPC.
- OpenEBS LVM: Same LVM-based enforcement.
- OpenEBS ZFS: ZFS quota on the dataset.
```bash
# With TopoLVM, the LV has a fixed size
lvs myvg
# LV          VG    Size
# pvc-abc123  myvg  50.00g   # Cannot exceed this
```

Question 3
Your Longhorn volume shows 2 of 3 replicas healthy. The third replica is on a node that was permanently removed from the cluster. What happens?
Answer
Longhorn automatically rebuilds the missing replica on another available node. The process:
- Longhorn detects the node is gone (node controller marks it as not ready)
- After the replica-replenishment-wait-interval (default: 600 seconds / 10 minutes), Longhorn schedules a new replica on a healthy node with sufficient disk space
- The engine copies data from one of the 2 healthy replicas to the new replica
- During rebuild, reads and writes continue normally (served by the 2 healthy replicas)
- Once complete, the volume returns to 3 healthy replicas
If no other node has space, the volume remains degraded at 2 replicas. It is still fully operational but has reduced redundancy. Longhorn will continuously retry placing the third replica.
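The states described above come down to comparing the number of healthy replicas with the desired count. A sketch, assuming a hypothetical `volume_state` helper: it does not query Longhorn, and the "faulted" label for a volume with no usable replica is taken from Longhorn's robustness terminology rather than from this module.

```bash
#!/usr/bin/env bash
# Usage: volume_state <healthy-replicas> <desired-replicas>
volume_state() {
  local healthy="$1" desired="$2"
  if [ "$healthy" -ge "$desired" ]; then
    echo healthy      # full redundancy
  elif [ "$healthy" -gt 0 ]; then
    echo degraded     # still serving I/O, reduced redundancy
  else
    echo faulted      # no usable replica left
  fi
}

volume_state 3 3   # before the node was removed
volume_state 2 3   # while waiting for or running the rebuild
```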
You can check replica status:
```bash
kubectl -n longhorn-system get replicas.longhorn.io \
  -l longhornvolume=pvc-abc123
```

Question 4
You need to store 500 Gi of ML training data that 8 pods read simultaneously. The data is written once and read many times. Which local storage solution should you use?
Answer
None of the local storage solutions in this module are appropriate. This workload requires ReadWriteMany (RWX) access mode, which none of the local storage options support natively:
- local-path, TopoLVM, OpenEBS LVM/ZFS: Only ReadWriteOnce (RWO), one node at a time
- Longhorn: Supports RWX only via NFS export, which adds a single-node bottleneck
Better options:
- CephFS (from Module 4.2): True distributed filesystem, handles concurrent reads well
- NFS server: Simple, good for read-heavy workloads
- Object storage (MinIO): If the ML framework supports S3-compatible storage
- Pre-load to local storage: Copy the dataset to each node’s local volume, accept the storage duplication
If the data fits on a single NVMe and performance matters more than storage efficiency, the pre-load approach is fastest:
```yaml
# Init container copies data from shared source to local volume
initContainers:
  - name: data-loader
    image: busybox
    command: ["cp", "-r", "/shared/dataset", "/local/dataset"]
```

Hands-On Exercise: Compare local-path and Longhorn
```bash
# Create a kind cluster with 3 worker nodes
cat <<EOF | kind create cluster --config=-
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
  - role: worker
  - role: worker
EOF

# Step 1: local-path is already included in kind
kubectl get sc standard

# Step 2: Create a PVC and pod with local-path
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: local-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: standard
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: writer
spec:
  containers:
    - name: app
      image: busybox
      command: ["sh", "-c", "echo 'hello' > /data/test.txt && sleep 3600"]
      volumeMounts:
        - mountPath: /data
          name: vol
  volumes:
    - name: vol
      persistentVolumeClaim:
        claimName: local-pvc
EOF

# Step 3: Verify pod is pinned to a specific node
kubectl get pod writer -o jsonpath='{.spec.nodeName}'

# Step 4: Install Longhorn and create a replicated PVC
helm repo add longhorn https://charts.longhorn.io
helm install longhorn longhorn/longhorn \
  --namespace longhorn-system --create-namespace \
  --set defaultSettings.defaultReplicaCount=2 --wait --timeout 5m

kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: longhorn-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: longhorn
  resources:
    requests:
      storage: 1Gi
EOF

# Step 5: Verify Longhorn replicas span multiple nodes
kubectl -n longhorn-system get replicas.longhorn.io

# Cleanup
kubectl delete pod writer && kubectl delete pvc local-pvc longhorn-pvc
kind delete cluster
```

Success Criteria
- local-path PVC bound and data written successfully
- Pod pinned to specific node (verified with spec.nodeName)
- Longhorn PVC has replicas on multiple nodes
- Understood the difference: local-path has no replication, Longhorn does
Next Module
Continue to Module 5.1: Private Cloud Platforms to learn how VMware vSphere, OpenStack, and Harvester provide infrastructure abstraction layers for on-premises Kubernetes.