Module 1.4: High-Performance Storage for AI
Discipline Module | Complexity: [MEDIUM] | Time: 3 hours
Prerequisites
Before starting this module:
- Required: Kubernetes storage fundamentals (PersistentVolumes, PersistentVolumeClaims, StorageClasses, CSI drivers)
- Required: Basic understanding of ML training data pipelines (datasets, batches, data loaders)
- Recommended: Module 1.1: GPU Provisioning — GPU workload basics
- Recommended: Experience with object storage (S3, GCS, MinIO)
What You’ll Be Able to Do
After completing this module, you will be able to:
- Design high-throughput storage architectures for AI workloads — training data, checkpoints, and model artifacts
- Implement storage solutions using CSI drivers, NFS, and object storage optimized for large-scale data access
- Configure caching layers that reduce data loading bottlenecks during distributed training
- Evaluate storage options — local NVMe, network-attached, cloud object stores — against AI workload I/O patterns
Why This Module Matters
You spent months building a beautiful GPU platform. The GPUs are provisioned, shared efficiently, connected by InfiniBand. Then your ML team starts training and reports this:
“Our 8-GPU job only uses 40% GPU utilization. The GPUs are waiting for data.”
This is the IO bottleneck — the most common and most underestimated performance killer in AI infrastructure. Your $300,000 DGX node is sitting idle 60% of the time because the storage system cannot feed data to the GPUs fast enough.
The numbers are stark:
| Component | Throughput | Latency |
|---|---|---|
| GPU compute (A100 BF16) | 312 TFLOPS | nanoseconds |
| GPU memory (A100 HBM2e) | 2 TB/s | nanoseconds |
| NVMe SSD (local) | 7 GB/s | 10-100 μs |
| Network storage (CephFS) | 1-5 GB/s | 0.5-5 ms |
| Object storage (S3) | 100-500 MB/s | 10-100 ms |
There is a 1,000x gap between GPU memory speed and network storage speed. Bridging this gap is what this module is about.
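To make the gap concrete, here is a rough sketch that estimates how long one full pass over a dataset would take at each tier. The throughput values are picked from the ranges in the table above, and the 150 GB dataset size is an assumption for illustration only.

```python
# Approximate sustained throughput per tier in GB/s, picked from the table
# above (illustrative values, not measurements).
TIER_THROUGHPUT_GBPS = {
    "GPU memory": 2000.0,
    "Local NVMe": 7.0,
    "Network storage (CephFS)": 5.0,
    "Object storage (S3)": 0.5,
}


def seconds_to_read(dataset_gb: float, throughput_gbps: float) -> float:
    """Time in seconds to stream dataset_gb once at a sustained throughput."""
    if throughput_gbps <= 0:
        raise ValueError("throughput must be positive")
    return dataset_gb / throughput_gbps


if __name__ == "__main__":
    # Assumed dataset size: 150 GB
    for tier, gbps in TIER_THROUGHPUT_GBPS.items():
        print(f"{tier:26s} {seconds_to_read(150, gbps):8.2f} s per pass")
```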
The IO Bottleneck in ML Workloads
Where IO Happens
Every training step involves IO at multiple stages:

    ┌─────────────────────────────────────────────────────────────┐
    │ Training Loop                                               │
    │                                                             │
    │ 1. Load batch        2. Transfer to GPU   3. Compute        │
    │ ┌──────────────┐     ┌──────────────┐     ┌────────────┐    │
    │ │ Read from    │     │ CPU RAM →    │     │ Forward +  │    │
    │ │ storage      │ ──→ │ GPU VRAM     │ ──→ │ Backward   │    │
    │ │ (IO bound)   │     │ (PCIe bound) │     │ (compute)  │    │
    │ └──────────────┘     └──────────────┘     └────────────┘    │
    │ 100ms - 5s           1-10ms               10-100ms          │
    │                                                             │
    │ ← This dominates when storage is slow                       │
    └─────────────────────────────────────────────────────────────┘

Workload IO Profiles
Different ML workloads have radically different IO characteristics:
| Workload | Data Size | Access Pattern | Read Size | Throughput Need |
|---|---|---|---|---|
| Image classification (ImageNet) | 150 GB | Random, small files | 100-500 KB | 2-5 GB/s |
| Object detection (COCO) | 20 GB | Random, medium files | 200 KB - 5 MB | 1-3 GB/s |
| NLP pre-training (C4) | 800 GB | Sequential, large files | 1-100 MB | 5-20 GB/s |
| Video training | 5-50 TB | Sequential, very large | 50-500 MB | 10-50 GB/s |
| LLM fine-tuning (tokenized) | 10-100 GB | Sequential | 1-10 MB | 1-5 GB/s |
| Checkpoint save | 1-50 GB per save | Sequential write | Full model | 5-20 GB/s burst |
The key insight: image training does millions of small random reads (hard for network storage), while LLM training does large sequential reads (easier to cache).
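A back-of-the-envelope sketch shows how the throughput column can be derived from training speed. The function name and the example numbers (2,500 images/s per GPU, 110 KB average image) are hypothetical; substitute values measured on your own jobs.

```python
def required_read_throughput(samples_per_sec_per_gpu: float,
                             num_gpus: int,
                             avg_sample_bytes: int):
    """Return (bytes_per_sec, reads_per_sec) needed to keep every GPU fed.

    Assumes one storage read per sample, which holds for file-per-image
    datasets but not for sharded formats such as TFRecord.
    """
    reads_per_sec = samples_per_sec_per_gpu * num_gpus
    bytes_per_sec = reads_per_sec * avg_sample_bytes
    return bytes_per_sec, reads_per_sec


if __name__ == "__main__":
    # Hypothetical image-classification job on 8 GPUs
    bps, iops = required_read_throughput(2500, 8, 110_000)
    print(f"Need {bps / 1e9:.1f} GB/s and {iops} random reads/s")
```

The read rate matters as much as the byte rate: a storage system can meet the GB/s figure on large sequential files and still fall short on tens of thousands of small random reads per second.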
Profiling IO Bottlenecks
Before optimizing, measure. Run your training job with GPU utilization monitoring:

    # Monitor GPU utilization during training.
    # If GPU util is < 80% and you're not memory-bound, you're likely IO-bound.

    # Quick check: watch nvidia-smi during training
    kubectl exec -it training-pod -- watch -n 1 \
      'nvidia-smi --query-gpu=utilization.gpu,utilization.memory --format=csv,noheader'

    # Better: check PyTorch DataLoader timing.
    # Add this to your training script:
    # import time
    # load_start = time.time()          # start the clock before the first batch
    # for batch in dataloader:
    #     load_end = time.time()
    #     print(f"Data load: {load_end - load_start:.3f}s")
    #     # ... training step ...
    #     load_start = time.time()      # restart the clock after compute

Storage Tiers for AI
The Storage Pyramid
AI workloads need a multi-tier storage architecture:
| Tier | Role | Throughput, latency | Examples |
|---|---|---|---|
| GPU VRAM | Training | 2 TB/s, μs latency | Managed by framework |
| Local NVMe | Hot cache | 3-14 GB/s, 10-100 μs | TopoLVM, OpenEBS LVM |
| Distributed FS | Warm | 1-10 GB/s, 0.5-5 ms | CephFS, GlusterFS, JuiceFS |
| Object storage | Cold | 100 MB/s-5 GB/s, 10-100 ms | S3, GCS, MinIO |
| Tape/Archive | Archival | | Glacier, Coldline |

Speed and cost increase toward the top of the pyramid.

The platform team’s job is to build infrastructure that automatically moves data between tiers based on access patterns.
Local NVMe Caching
Why Local Storage Matters
A modern NVMe SSD delivers 3-7 GB/s sequential read and 500K-1M IOPS random read. This is 10-50x faster than network storage for the random small-file reads that image training demands.
The strategy: keep the active dataset (or a cache of it) on local NVMe while the canonical copy lives in object storage.
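A minimal sketch of that strategy, assuming the canonical copy is reachable as a directory (for example a mounted bucket) and the cache lives on a local NVMe path. The function name and both paths are hypothetical; a real platform would more likely use a staging job or a caching filesystem, as covered later in this module.

```python
import shutil
from pathlib import Path


def stage_to_cache(source_dir: str, cache_dir: str) -> int:
    """Copy files from source_dir into cache_dir, skipping files already cached.

    A file counts as cached when a file of the same size exists at the same
    relative path. Returns the number of files copied.
    """
    src, dst = Path(source_dir), Path(cache_dir)
    copied = 0
    for src_file in src.rglob("*"):
        if not src_file.is_file():
            continue
        target = dst / src_file.relative_to(src)
        if target.exists() and target.stat().st_size == src_file.stat().st_size:
            continue  # already staged
        target.parent.mkdir(parents=True, exist_ok=True)
        # Write to a temporary name first so readers never see a partial file.
        tmp = target.with_name(target.name + ".partial")
        shutil.copyfile(src_file, tmp)
        tmp.replace(target)
        copied += 1
    return copied


if __name__ == "__main__":
    # Hypothetical paths: a mounted bucket and a local NVMe cache directory
    print(stage_to_cache("/data/s3/imagenet", "/data/cache/imagenet"))
```

Comparing sizes is a cheap freshness check; it would miss a file that changed without changing length, so a checksum or modification-time comparison may be preferable for mutable datasets.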
TopoLVM: Topology-Aware Local Volumes
TopoLVM is a CSI driver that provisions PersistentVolumes from local LVM volume groups, with topology awareness: it ensures Pods are scheduled on nodes that have available local storage.

    # Install TopoLVM
    helm repo add topolvm https://topolvm.github.io/topolvm
    helm repo update

    helm install topolvm topolvm/topolvm \
      --namespace topolvm-system \
      --create-namespace \
      --set controller.replicaCount=2 \
      --set node.volumeGroup.name=nvme-vg  # LVM VG name on each node

Create a StorageClass:

    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: nvme-local
    provisioner: topolvm.io
    parameters:
      topolvm.io/device-class: nvme  # Maps to a device class in TopoLVM config
    volumeBindingMode: WaitForFirstConsumer  # Delay binding until Pod is scheduled
    allowVolumeExpansion: true
    reclaimPolicy: Delete

Use in a training Pod:

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: training-cache
      namespace: ml-training
    spec:
      storageClassName: nvme-local
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 500Gi
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: image-trainer
      namespace: ml-training
    spec:
      containers:
        - name: trainer
          image: nvcr.io/nvidia/pytorch:24.09-py3
          volumeMounts:
            - name: cache
              mountPath: /data/cache
            - name: dataset
              mountPath: /data/s3  # S3 FUSE mount or pre-downloaded
          resources:
            limits:
              nvidia.com/gpu: 4
      volumes:
        - name: cache
          persistentVolumeClaim:
            claimName: training-cache
        - name: dataset
          persistentVolumeClaim:
            claimName: imagenet-s3

OpenEBS LVM Local PV
OpenEBS provides a simpler alternative for local NVMe provisioning:
    # Install OpenEBS LVM LocalPV
    helm repo add openebs https://openebs.github.io/openebs
    helm repo update

    helm install openebs openebs/openebs \
      --namespace openebs \
      --create-namespace \
      --set lvm-localpv.enabled=true \
      --set engines.replicated.mayastor.enabled=false

    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: openebs-nvme
    provisioner: local.csi.openebs.io
    parameters:
      storage: "lvm"
      vgPattern: "nvme-vg"  # LVM volume group pattern
      fsType: "xfs"         # XFS recommended for large files
    volumeBindingMode: WaitForFirstConsumer

Init Container Pattern for Data Staging
A common pattern: use an init container to stage data from object storage to local NVMe before training begins:
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: training-with-staging
      namespace: ml-training
    spec:
      template:
        spec:
          initContainers:
            - name: stage-data
              image: amazon/aws-cli:2.17
              command: ["sh", "-c"]
              args:
                - |
                  echo "Staging dataset from S3..."
                  start=$(date +%s)
                  aws s3 sync s3://my-datasets/imagenet/ /data/cache/imagenet/ \
                    --no-sign-request --quiet
                  end=$(date +%s)
                  size=$(du -sh /data/cache/imagenet/ | cut -f1)
                  echo "Staged $size in $((end-start)) seconds"
              volumeMounts:
                - name: cache
                  mountPath: /data/cache
              resources:
                requests:
                  cpu: "4"
                  memory: 8Gi
          containers:
            - name: trainer
              image: nvcr.io/nvidia/pytorch:24.09-py3
              command: ["torchrun", "--nproc_per_node=4", "train.py", "--data_dir=/data/cache/imagenet"]
              volumeMounts:
                - name: cache
                  mountPath: /data/cache
              resources:
                limits:
                  nvidia.com/gpu: 4
          volumes:
            - name: cache
              persistentVolumeClaim:
                claimName: training-cache
          restartPolicy: OnFailure

Distributed Filesystems
CephFS
Ceph is the most widely deployed distributed storage system in Kubernetes. CephFS provides a POSIX-compatible filesystem backed by the Ceph cluster.
Strengths for AI:
- POSIX semantics (training frameworks expect filesystem APIs)
- Scalable metadata server (can handle millions of small files)
- Multi-reader access (ReadWriteMany) for data-parallel training
- Integrated with Rook for Kubernetes-native deployment
Weaknesses for AI:
- Latency: 0.5-5 ms per operation (roughly 5-500x the 10-100 μs of local NVMe)
- Throughput ceiling: limited by network and OSD count
- Small file performance: poor for millions of tiny files (image datasets)
    # Rook-Ceph CephFS StorageClass
    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: cephfs-ai
    provisioner: rook-ceph.cephfs.csi.ceph.com
    parameters:
      clusterID: rook-ceph
      fsName: ai-filesystem
      pool: ai-data-pool
      csi.storage.k8s.io/provisioner-secret-name: rook-csi-cephfs-provisioner
      csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
      csi.storage.k8s.io/node-stage-secret-name: rook-csi-cephfs-node
      csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
    mountOptions:
      - noatime     # Disable access time updates (major perf win)
      - nodiratime

JuiceFS
JuiceFS is a cloud-native distributed filesystem purpose-built for the gap between object storage and high-performance compute. It separates metadata (stored in Redis, PostgreSQL, or TiKV) from data (stored in any object storage).

    ┌─────────────────────────────────────────────────────┐
    │ JuiceFS Architecture                                │
    │                                                     │
    │ ┌────────────┐     ┌──────────────┐    ┌──────────┐ │
    │ │ POSIX      │     │ Metadata     │    │ Object   │ │
    │ │ Client     │──→  │ Engine       │    │ Storage  │ │
    │ │ (FUSE/CSI) │     │ (Redis/PG)   │    │ (S3)     │ │
    │ │            │     └──────────────┘    │          │ │
    │ │            │──────────────────────→  │          │ │
    │ └────────────┘      Data path          └──────────┘ │
    │       │                                             │
    │       ▼                                             │
    │ ┌─────────────┐                                     │
    │ │ Local Cache │ ← NVMe or RAM                       │
    │ │ (read/write)│                                     │
    │ └─────────────┘                                     │
    └─────────────────────────────────────────────────────┘

Why JuiceFS excels for AI:
- Transparent caching: Reads are cached on local NVMe. Second read of same file is at NVMe speed.
- POSIX compatible: Drop-in replacement for local filesystem in training scripts.
- Any object store backend: S3, GCS, Azure Blob, MinIO — your data stays where it is.
- Metadata engine flexibility: Redis for speed, PostgreSQL for durability, TiKV for scale.
- Kubernetes-native: CSI driver with dynamic provisioning.
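The transparent caching idea in the first bullet can be illustrated with a small read-through cache. This is a conceptual sketch only, not how JuiceFS is implemented (JuiceFS caches fixed-size blocks and evicts under a size limit); the class name and the `fetch` callback standing in for an object-store GET are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Callable


class ReadThroughCache:
    """Serve reads from a local directory, falling back to a slow backend."""

    def __init__(self, cache_dir: str, fetch: Callable[[str], bytes]):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.fetch = fetch  # stand-in for an object-store GET
        self.hits = 0
        self.misses = 0

    def _path_for(self, key: str) -> Path:
        # Hash the key so arbitrary object names map to safe file names.
        return self.cache_dir / hashlib.sha256(key.encode("utf-8")).hexdigest()

    def read(self, key: str) -> bytes:
        path = self._path_for(key)
        if path.exists():
            self.hits += 1
            return path.read_bytes()      # warm: local disk speed
        self.misses += 1
        data = self.fetch(key)            # cold: backend speed
        tmp = path.with_name(path.name + ".tmp")
        tmp.write_bytes(data)
        tmp.replace(path)                 # publish the entry atomically
        return data
```

The first `read` of a key pays the backend latency; every later read of the same key is served locally, which is the behavior the training job sees as "second epoch is much faster".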
Installing JuiceFS CSI Driver
    # Install JuiceFS CSI Driver
    helm repo add juicefs https://juicedata.github.io/charts/
    helm repo update

    helm install juicefs-csi juicefs/juicefs-csi-driver \
      --namespace kube-system \
      --set storageClasses[0].name=juicefs-sc \
      --set storageClasses[0].enabled=true \
      --set storageClasses[0].backend.name=ai-data \
      --set storageClasses[0].backend.metaurl=redis://:password@redis-master:6379/1 \
      --set storageClasses[0].backend.storage=s3 \
      --set storageClasses[0].backend.bucket=s3://my-ai-datasets \
      --set storageClasses[0].backend.accessKey=AKIAIOSFODNN7EXAMPLE \
      --set storageClasses[0].backend.secretKey=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY \
      --set storageClasses[0].cachePVC=juicefs-cache

JuiceFS StorageClass with Caching

    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: juicefs-ai
    provisioner: csi.juicefs.com
    parameters:
      csi.storage.k8s.io/provisioner-secret-name: juicefs-secret
      csi.storage.k8s.io/provisioner-secret-namespace: kube-system
      # Mount options are passed as one comma-separated string. Comments must
      # stay outside the value, otherwise they become part of the options.
      #   cache-dir=/var/jfsCache   local cache directory
      #   cache-size=512000         500GB local cache
      #   buffer-size=1024          1GB read-ahead buffer
      #   prefetch=3                prefetch 3 blocks ahead
      #   max-uploads=40            parallel upload threads
      #   metacache-expire=300      metadata cache TTL (seconds)
      #   open-cache=300            open file handle cache
      juicefs/mount-options: "cache-dir=/var/jfsCache,cache-size=512000,buffer-size=1024,prefetch=3,max-uploads=40,metacache-expire=300,open-cache=300"
    reclaimPolicy: Retain
    volumeBindingMode: Immediate

Dataset Caching: Fluid and Alluxio
The Caching Problem
Consider this scenario: 50 ML engineers share a 500GB ImageNet dataset stored in S3. Without caching:
    Engineer 1 training job:  Downloads 500GB from S3 → 30 min
    Engineer 2 training job:  Downloads 500GB from S3 → 30 min
    ...
    Engineer 50 training job: Downloads 500GB from S3 → 30 min

    Total: 25 TB downloaded, 25 hours of wait time, $50+ in S3 egress

With a caching layer:

    Engineer 1 training job:  Downloads 500GB from S3 → 30 min (cold)
    Engineer 2 training job:  Reads from cache → 2 min (warm)
    ...
    Engineer 50 training job: Reads from cache → 2 min (warm)

    Total: 500GB downloaded, 2 hours total wait, $1 in S3 egress

Fluid: Kubernetes-Native Dataset Orchestration
Fluid is a CNCF sandbox project that brings dataset-aware scheduling to Kubernetes. It manages datasets as first-class resources and uses cache engines (Alluxio, JuiceFS, JindoFS) under the hood.
    # Install Fluid
    helm repo add fluid https://fluid-cloudnative.github.io/charts
    helm repo update

    helm install fluid fluid/fluid \
      --namespace fluid-system \
      --create-namespace

Define a dataset and its caching runtime:

    apiVersion: data.fluid.io/v1alpha1
    kind: Dataset
    metadata:
      name: imagenet
      namespace: ml-training
    spec:
      mounts:
        - mountPoint: s3://my-datasets/imagenet/
          name: imagenet
          options:
            aws.accessKeyId: AKIAIOSFODNN7EXAMPLE
            aws.region: us-east-1
          encryptOptions:
            - name: aws.secretAccessKey
              valueFrom:
                secretKeyRef:
                  name: s3-credentials
                  key: secretAccessKey
    ---
    apiVersion: data.fluid.io/v1alpha1
    kind: AlluxioRuntime
    metadata:
      name: imagenet
      namespace: ml-training
    spec:
      replicas: 3  # 3 cache workers
      tieredstore:
        levels:
          - mediumtype: SSD
            path: /dev/shm,/var/cache/alluxio
            quota: 100Gi,400Gi  # 100GB RAM + 400GB SSD cache per worker
            high: "0.95"
            low: "0.7"
      fuse:
        args:
          - fuse
          - --attr-timeout=7200s
          - --entry-timeout=7200s
        cleanPolicy: OnDemand
      properties:
        alluxio.user.metadata.cache.enabled: "true"
        alluxio.user.metadata.cache.expireTime: "2day"
        alluxio.user.streaming.data.timeout: "300sec"

Use the dataset in a training Pod:

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: imagenet-training
      namespace: ml-training
    spec:
      template:
        spec:
          containers:
            - name: trainer
              image: nvcr.io/nvidia/pytorch:24.09-py3
              command: ["python", "train.py", "--data_dir=/data/imagenet"]
              volumeMounts:
                - name: imagenet
                  mountPath: /data/imagenet
                  readOnly: true
              resources:
                limits:
                  nvidia.com/gpu: 4
          volumes:
            - name: imagenet
              persistentVolumeClaim:
                claimName: imagenet  # Automatically created by Fluid
          restartPolicy: OnFailure

Fluid’s Data-Aware Scheduling
Fluid tracks where cached data resides and preferentially schedules Pods on nodes that already have the data cached:
    # Fluid automatically injects scheduling hints.
    # Pods using the 'imagenet' dataset prefer nodes where Alluxio workers
    # have already cached ImageNet data.

    # You can also trigger pre-warming:
    apiVersion: data.fluid.io/v1alpha1
    kind: DataLoad
    metadata:
      name: imagenet-warmup
      namespace: ml-training
    spec:
      dataset:
        name: imagenet
        namespace: ml-training
      loadMetadata: true
      target:
        - path: /
          replicas: 2  # Cache 2 copies for fault tolerance

Alluxio Standalone (without Fluid)
For teams that want more control, Alluxio can be deployed independently:
    helm repo add alluxio https://alluxio-charts.storage.googleapis.com/openSource
    helm repo update

    # Dots inside a --set key must be escaped, otherwise Helm treats them
    # as nested keys.
    helm install alluxio alluxio/alluxio \
      --namespace alluxio \
      --create-namespace \
      --set master.count=1 \
      --set worker.count=3 \
      --set worker.resources.limits.memory=32Gi \
      --set tieredStore.levels[0].level=0 \
      --set tieredStore.levels[0].mediumtype=MEM \
      --set tieredStore.levels[0].path=/dev/shm \
      --set tieredStore.levels[0].quota=16Gi \
      --set tieredStore.levels[1].level=1 \
      --set tieredStore.levels[1].mediumtype=SSD \
      --set tieredStore.levels[1].path=/mnt/nvme/alluxio \
      --set tieredStore.levels[1].quota=500Gi \
      --set 'properties.alluxio\.underfs\.s3\.region=us-east-1'

Checkpoint Storage
Why Checkpoint IO Matters
During training, checkpoints must be saved periodically. A checkpoint for a 70B parameter model is:

    Model parameters: 70B × 2 bytes (BF16) = 140 GB
    Optimizer state:  70B × 8 bytes (Adam) = 560 GB
    Total:            ~700 GB per checkpoint

If your storage can write at 2 GB/s, saving one checkpoint takes 350 seconds (almost 6 minutes). During this time, GPUs are either idle (synchronous checkpoint) or must continue while carefully not overwriting the in-flight checkpoint (asynchronous).
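The same arithmetic can be wrapped in two small helpers to try other model sizes and storage speeds. This is a sketch with hypothetical function names; it uses the byte counts from the example above (2 bytes per BF16 parameter, 8 bytes of Adam state per parameter) and assumes each parallel writer sustains the full write rate.

```python
def checkpoint_size_gb(params_billions: float,
                       bytes_per_param: int = 2,
                       optimizer_bytes_per_param: int = 8) -> float:
    """Checkpoint size in GB: 1 billion parameters at 1 byte each is 1 GB."""
    return params_billions * (bytes_per_param + optimizer_bytes_per_param)


def save_seconds(size_gb: float, write_gbps: float, parallel_writers: int = 1) -> float:
    """Seconds to write a checkpoint, assuming every writer sustains write_gbps."""
    return size_gb / (write_gbps * parallel_writers)


if __name__ == "__main__":
    size = checkpoint_size_gb(70)
    print(f"Checkpoint size: {size:.0f} GB")
    print(f"Single writer at 2 GB/s: {save_seconds(size, 2.0):.0f} s")
    print(f"8 parallel writers at 2 GB/s each: {save_seconds(size, 2.0, 8):.2f} s")
```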
Strategies for Fast Checkpointing
Synchronous (simple, slow):

    # Training pauses during save
    torch.save(model.state_dict(), "/checkpoints/latest.pt")
    # 6 minutes of idle GPUs

Asynchronous with background thread:

    import threading

    import torch

    def save_async(state_dict, path):
        torch.save(state_dict, path)

    # Copy the state dict to CPU memory, then save in the background
    path = "/checkpoints/latest.pt"
    state_dict_cpu = {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}
    thread = threading.Thread(target=save_async, args=(state_dict_cpu, path))
    thread.start()
    # Training continues immediately; join the thread before starting the next save

Sharded checkpoints (PyTorch FSDP / DeepSpeed):

    # Each GPU saves its own shard in parallel
    # 8 GPUs writing 87.5 GB each at 2 GB/s = 44 seconds (vs 350 seconds serial)
    from torch.distributed.checkpoint import save

    save(model.state_dict(), checkpoint_id=f"/checkpoints/step_{step}")

Recommended Checkpoint Storage
| Storage Type | Write Speed | Best For |
|---|---|---|
| Local NVMe (TopoLVM) | 5-7 GB/s | Fastest saves; risk of data loss on node failure |
| CephFS / GlusterFS (RWX) | 1-5 GB/s | Shared access, multi-node distributed saves |
| JuiceFS (NVMe cache + S3) | 3-7 GB/s local, async to S3 | Best of both: fast writes, durable storage |
| NFS | 0.5-2 GB/s | Simple, widely available; potential bottleneck |
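Whichever tier is chosen, an asynchronous save has to avoid two hazards: training mutating the state while it is being written, and a crash leaving a half-written file where the last good checkpoint used to be. The sketch below shows one way to handle both with the standard library, using `pickle` as a stand-in for `torch.save`; the function name is hypothetical and the atomic rename assumes the temporary file and the target are on the same POSIX filesystem.

```python
import os
import pickle
import threading


def save_checkpoint_async(state: dict, path: str) -> threading.Thread:
    """Snapshot state now, write it in the background, publish it atomically."""
    # Serialize on the caller's thread so later mutations cannot leak in.
    snapshot = pickle.dumps(state)

    def _write() -> None:
        tmp_path = f"{path}.tmp"
        with open(tmp_path, "wb") as f:
            f.write(snapshot)
            f.flush()
            os.fsync(f.fileno())      # make sure the bytes reached the device
        os.replace(tmp_path, path)    # readers see the old or the new file, never a partial one

    thread = threading.Thread(target=_write)
    thread.start()
    return thread
```

The caller keeps the returned thread and joins it before launching the next save, so two writers never target the same temporary file.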
Try This: Measure Your Storage Performance
Run this inside a Pod on your cluster to understand your storage baseline:

    # Create a test pod with your storage class
    # (assumes a PVC named bench-pvc already exists in the default namespace)
    cat <<'EOF' | kubectl apply -f -
    apiVersion: v1
    kind: Pod
    metadata:
      name: storage-bench
      namespace: default
    spec:
      containers:
        - name: bench
          image: ubuntu:22.04
          command: ["sleep", "infinity"]
          volumeMounts:
            - name: test-vol
              mountPath: /data
          resources:
            requests:
              cpu: "4"
              memory: 8Gi
      volumes:
        - name: test-vol
          persistentVolumeClaim:
            claimName: bench-pvc
    EOF

    # Inside the pod:
    kubectl exec -it storage-bench -- bash

    apt-get update && apt-get install -y fio

    # Sequential read (simulates loading a large dataset)
    fio --name=seq-read --rw=read --bs=1M --size=10G \
      --numjobs=4 --direct=1 --directory=/data \
      --runtime=60 --time_based --group_reporting

    # Random read (simulates image dataset loading)
    # --iodepth only takes effect with an asynchronous engine such as libaio
    fio --name=rand-read --rw=randread --bs=256K --size=10G \
      --numjobs=8 --direct=1 --directory=/data \
      --runtime=60 --time_based --group_reporting \
      --ioengine=libaio --iodepth=32

    # Sequential write (simulates checkpoint saves)
    fio --name=seq-write --rw=write --bs=1M --size=10G \
      --numjobs=4 --direct=1 --directory=/data \
      --runtime=60 --time_based --group_reporting

Did You Know?
- ImageNet, the dataset that launched the deep learning revolution, contains more than 14 million images in its full form. The ILSVRC-2012 subset used for most training runs has about 1.28 million images totaling roughly 150GB, so the average JPEG is only around 110KB. Every epoch therefore issues more than a million small random reads, a hard case for any storage system. This is why ImageNet training was one of the first workloads to expose storage bottlenecks in GPU clusters.
- The concept of “data gravity” is literal in AI infrastructure. Moving a 10TB dataset across the internet takes hours to days, but computing on it takes seconds to minutes. This is why cloud providers offer “data import” services where they physically ship hard drives. Google’s Transfer Appliance can hold 1 PB and ships via FedEx. Sometimes the highest-bandwidth network is a truck full of disks.
- Meta reported that the storage fabric behind Llama 3 pre-training, built on its Tectonic distributed filesystem, provided 240 PB of capacity across 7,500 SSD-equipped servers, with 2 TB/s of sustained and 7 TB/s of peak throughput. A major design challenge was absorbing the bursty checkpoint writes that saturate storage for short periods.
War Story: The Cache That Saved $200K
A medical imaging startup trained models on a 2TB dataset of CT scans stored in Google Cloud Storage (GCS). They had 20 GPU nodes, each running training jobs that loaded the full dataset.
Before caching: Each job downloaded 2TB from GCS at ~500 MB/s = 67 minutes startup time. With 20 nodes running 3 jobs/day each, they downloaded 120 TB/day from GCS.
- GCS egress cost: $0.12/GB × 120,000 GB/day = $14,400/day = $432,000/month
- GPU idle time during downloads: 20 nodes × 3 jobs × 67 min = 67 GPU-hours/day wasted
After JuiceFS with NVMe cache: First job on each node downloads from GCS (cold cache). Subsequent jobs read from local NVMe cache at 5 GB/s = 7 minutes.
- GCS egress: 20 nodes × 2TB × 1 download/week = 40TB/week = $19,200/month
- GPU idle time: negligible (7 min cached vs 67 min uncached)
Monthly savings: $432,000 - $19,200 = $412,800/month. The caching infrastructure (JuiceFS + NVMe on each node) cost $3,000/month.
Lesson: In AI infrastructure, the most impactful optimization is often the simplest: cache the dataset close to the GPUs.
Common Mistakes
| Mistake | Problem | Solution |
|---|---|---|
| Using S3 FUSE mounts for training | FUSE adds 2-10x latency overhead per IO operation | Use JuiceFS or Alluxio with local NVMe cache; or download to local disk first |
| Network storage for ImageNet-style training | Millions of small random reads kill network storage | Cache dataset on local NVMe; or use WebDataset/TFRecord for sequential access |
| Synchronous checkpoints on slow storage | GPUs idle for minutes during each checkpoint save | Use async checkpointing or sharded distributed checkpoints |
| No noatime mount option | Every file read triggers a metadata write (access time update) | Always mount with noatime,nodiratime for training volumes |
| RWO volumes for multi-node training | ReadWriteOnce cannot be mounted on multiple nodes | Use RWX storage (CephFS, NFS, JuiceFS) or local cache per node |
| Ignoring storage class volumeBindingMode | PVC binds to wrong node before Pod is scheduled | Always use WaitForFirstConsumer for local storage |
| Not pre-warming cache before training | First epoch runs at cold-cache speed, skewing benchmark results | Use Fluid’s DataLoad CRD or an init container to warm cache |
| Using ext4 for large files | ext4 fragments large sequential writes | Use XFS for datasets and checkpoint volumes; it handles large files better |
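Several of the fixes above come down to turning many small random reads into a few large sequential ones. As a rough illustration of the idea behind WebDataset-style formats, the sketch below packs a directory of small files into tar shards using only the standard library. The function name and shard naming are hypothetical, and this is not the WebDataset library's own tooling.

```python
import tarfile
from pathlib import Path
from typing import List


def pack_into_shards(source_dir: str, output_dir: str, files_per_shard: int = 1000) -> List[str]:
    """Pack files under source_dir into sequentially readable tar shards.

    Returns the list of shard paths that were written.
    """
    src = Path(source_dir)
    files = sorted(p for p in src.rglob("*") if p.is_file())
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    shard_paths = []
    for start in range(0, len(files), files_per_shard):
        shard_path = out / f"shard-{start // files_per_shard:06d}.tar"
        # Uncompressed tar keeps reads purely sequential; JPEGs are already compressed.
        with tarfile.open(shard_path, "w") as tar:
            for f in files[start:start + files_per_shard]:
                tar.add(f, arcname=str(f.relative_to(src)))
        shard_paths.append(str(shard_path))
    return shard_paths
```

With 1,000 files per shard, a dataset of a million images becomes about a thousand large objects, which object stores and network filesystems serve far more efficiently than a million tiny ones.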
Quiz: Check Your Understanding
Question 1
Why is storing a dataset of 14 million small images on S3 problematic for GPU training?
Answer
Three compounding issues:
- Latency: Each S3 GET request has 10-100ms latency. With 14M random reads, the cumulative latency is enormous even with parallelism.
- Request overhead: S3 is optimized for throughput on large objects, not IOPS on small objects. Each 10KB image requires a full HTTP GET with request signing and, unless connections are reused, a TLS handshake. The protocol overhead can exceed the data size.
- No prefetching: S3 has no concept of “read the next file”, so each read is independent. Local filesystems and caching layers can prefetch adjacent files, but S3 cannot.
Solutions: (a) Convert to sequential formats like WebDataset or TFRecord, (b) cache on local NVMe with JuiceFS/Alluxio, or (c) download the entire dataset to local storage before training.
Question 2
Explain the difference between JuiceFS and Alluxio as caching solutions for AI workloads.
Answer
JuiceFS is a full POSIX filesystem with caching:
- Separates metadata (Redis/PostgreSQL) from data (any object store)
- Provides a complete filesystem (create, write, read, delete, rename)
- Client-side caching on local NVMe
- Can be used as primary storage, not just a cache layer
- Simpler architecture (no separate master/worker topology)
Alluxio is a caching middleware layer:
- Sits between compute and existing storage systems
- Master/worker architecture with distributed cache
- Does not store data itself — always backed by an “under filesystem” (S3, HDFS)
- Richer data management: pinning, TTL, replication policies
- More complex to operate but more features for large-scale deployments
When to choose which:
- JuiceFS: when you need a filesystem that also caches, or when simplicity matters
- Alluxio (via Fluid): when you need dataset-aware scheduling, multi-tier caching, or already have complex data infrastructure
Question 3
A training job saves a 700GB checkpoint every 1000 steps. Steps take 2 seconds each. Checkpoint save takes 350 seconds on the current storage. What percentage of GPU time is wasted on checkpointing, and how would you reduce it?
Answer
Waste calculation:
- Steps between checkpoints: 1000 × 2s = 2000s of training
- Checkpoint time: 350s
- Waste: 350 / (2000 + 350) = 14.9% of total time
Reduction strategies:
- Sharded checkpoints: 8 GPUs each save 87.5GB in parallel → 44s instead of 350s → waste drops to 2.1%
- Async checkpointing: Clone state dict to CPU RAM (takes ~10s), save in background thread while training continues → waste drops to ~0.5%
- Faster storage: Local NVMe at 7 GB/s → 100s → waste drops to 4.8%. Combined with sharding: 12.5s → 0.6%
- Less frequent checkpoints: Every 2000 steps instead of 1000 → halves the waste, but doubles max lost work on failure
The best approach combines sharded + async: each GPU clones its shard to CPU, then background threads write to fast storage while training continues. This achieves <1% waste.
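The percentages in this answer all come from one formula, stall time divided by total time per checkpoint interval. A short sketch (the function name is hypothetical) reproduces them so other intervals or storage speeds can be tried:

```python
def checkpoint_waste(train_seconds: float, stall_seconds: float) -> float:
    """Fraction of wall-clock time during which GPUs wait on a checkpoint."""
    return stall_seconds / (train_seconds + stall_seconds)


if __name__ == "__main__":
    interval = 1000 * 2  # 1000 steps at 2 seconds each
    scenarios = {
        "synchronous, 2 GB/s": 350,
        "sharded across 8 GPUs": 44,
        "async (CPU clone only)": 10,
        "local NVMe at 7 GB/s": 100,
        "NVMe + sharding": 12.5,
    }
    for name, stall in scenarios.items():
        print(f"{name:24s} {checkpoint_waste(interval, stall) * 100:5.1f}% wasted")
```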
Question 4
What does Fluid’s data-aware scheduling do that a regular PVC does not?
Answer
A regular PVC binds to a volume and any node that can access that volume can run the Pod. It has no awareness of data locality — a Pod might run on a node that has no cached data, causing a cold-cache start.
Fluid’s data-aware scheduling:
- Tracks cache location: Knows which nodes’ Alluxio workers have cached which datasets
- Prefers warm nodes: Injects scheduling hints (nodeAffinity) so Pods prefer nodes where their dataset is already cached
- Enables pre-warming: The DataLoad CRD can prefill the cache before Pods start
- Manages cache lifecycle: Evicts stale data, rebalances across workers, manages multi-tier (RAM + SSD) caching
- Abstracts the cache engine: User sees a PVC; Fluid manages Alluxio/JuiceFS/JindoFS underneath
Hands-On Exercise: JuiceFS Cache Over S3 with Latency Measurement
Objective
Deploy JuiceFS with a local NVMe cache backed by S3-compatible object storage, load a dataset, and measure the difference between cold-cache and warm-cache read performance.
Environment
- Kubernetes cluster with at least one node
- MinIO or any S3-compatible storage (we will deploy MinIO for this exercise)
- A node with local storage available (emptyDir is acceptable for the exercise)
Step 1: Deploy MinIO (S3-compatible Object Store)
    # Install MinIO for local S3-compatible storage
    kubectl create namespace storage

    cat <<'EOF' | kubectl apply -f -
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: minio
      namespace: storage
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: minio
      template:
        metadata:
          labels:
            app: minio
        spec:
          containers:
            - name: minio
              image: minio/minio:RELEASE.2024-10-13T13-34-11Z
              args: ["server", "/data", "--console-address", ":9001"]
              env:
                - name: MINIO_ROOT_USER
                  value: minioadmin
                - name: MINIO_ROOT_PASSWORD
                  value: minioadmin123
              ports:
                - containerPort: 9000
                - containerPort: 9001
              volumeMounts:
                - name: data
                  mountPath: /data
          volumes:
            - name: data
              emptyDir:
                sizeLimit: 20Gi
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: minio
      namespace: storage
    spec:
      ports:
        - port: 9000
          targetPort: 9000
          name: api
        - port: 9001
          targetPort: 9001
          name: console
      selector:
        app: minio
    EOF

    kubectl -n storage wait --for=condition=Ready pod -l app=minio --timeout=120s

Step 2: Create a Test Dataset in MinIO
    # Create a bucket and upload test data.
    # --command overrides the image entrypoint (mc) so the script runs under bash.
    kubectl -n storage run mc --rm -it --restart=Never \
      --image=minio/mc:RELEASE.2024-10-08T09-37-26Z \
      --command -- bash -c '
        mc alias set local http://minio:9000 minioadmin minioadmin123
        mc mb local/ai-datasets

        # Create a 1GB test dataset (256 files of 4MB each)
        for i in $(seq 1 256); do
          dd if=/dev/urandom of=/tmp/data_${i}.bin bs=4M count=1 2>/dev/null
          mc cp /tmp/data_${i}.bin local/ai-datasets/training/
          rm /tmp/data_${i}.bin
        done

        echo "Dataset created:"
        mc ls local/ai-datasets/training/ | wc -l
        mc du local/ai-datasets/
      '

Step 3: Deploy Redis (JuiceFS Metadata Engine)
    cat <<'EOF' | kubectl apply -f -
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: redis
      namespace: storage
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: redis
      template:
        metadata:
          labels:
            app: redis
        spec:
          containers:
            - name: redis
              image: redis:7-alpine
              ports:
                - containerPort: 6379
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: redis
      namespace: storage
    spec:
      ports:
        - port: 6379
      selector:
        app: redis
    EOF

Step 4: Install JuiceFS CSI Driver
    helm repo add juicefs https://juicedata.github.io/charts/
    helm repo update

    # Create the JuiceFS secret
    cat <<'EOF' | kubectl apply -f -
    apiVersion: v1
    kind: Secret
    metadata:
      name: juicefs-secret
      namespace: storage
    type: Opaque
    stringData:
      name: ai-data
      metaurl: redis://redis.storage.svc:6379/1
      storage: s3
      bucket: http://minio.storage.svc:9000/ai-datasets
      access-key: minioadmin
      secret-key: minioadmin123
    EOF

    # Install the CSI driver
    helm install juicefs-csi juicefs/juicefs-csi-driver \
      --namespace kube-system \
      --version v0.24.8

Step 5: Create JuiceFS StorageClass and PVC
    cat <<'EOF' | kubectl apply -f -
    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: juicefs-cache
    provisioner: csi.juicefs.com
    parameters:
      csi.storage.k8s.io/provisioner-secret-name: juicefs-secret
      csi.storage.k8s.io/provisioner-secret-namespace: storage
      csi.storage.k8s.io/node-publish-secret-name: juicefs-secret
      csi.storage.k8s.io/node-publish-secret-namespace: storage
      juicefs/mount-options: "cache-size=10240,buffer-size=512"
    reclaimPolicy: Delete
    volumeBindingMode: Immediate
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: juicefs-data
      namespace: storage
    spec:
      storageClassName: juicefs-cache
      accessModes:
        - ReadWriteMany
      resources:
        requests:
          storage: 20Gi
    EOF

Step 6: Measure Cold vs Warm Cache Performance
```shell
cat <<'BENCHEOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: cache-benchmark
  namespace: storage
spec:
  containers:
    - name: bench
      image: ubuntu:22.04
      command: ["bash", "-c"]
      args:
        - |
          apt-get update -qq && apt-get install -y -qq time bc 2>/dev/null

          echo "=== COLD CACHE READ (first access, data fetched from S3) ==="
          sync; echo 3 > /proc/sys/vm/drop_caches 2>/dev/null || true

          cold_start=$(date +%s%N)
          total_bytes=0
          for f in /data/training/data_*.bin; do
            cat "$f" > /dev/null 2>&1
            total_bytes=$((total_bytes + $(stat -c%s "$f" 2>/dev/null || echo 0)))
          done
          cold_end=$(date +%s%N)

          cold_ms=$(( (cold_end - cold_start) / 1000000 ))
          cold_mbps=$(echo "scale=2; $total_bytes / 1048576 / ($cold_ms / 1000)" | bc 2>/dev/null || echo "N/A")
          echo "Cold read: ${cold_ms}ms for $(echo "$total_bytes / 1048576" | bc)MB = ${cold_mbps} MB/s"
          echo ""

          echo "=== WARM CACHE READ (second access, data from local cache) ==="
          warm_start=$(date +%s%N)
          for f in /data/training/data_*.bin; do
            cat "$f" > /dev/null 2>&1
          done
          warm_end=$(date +%s%N)

          warm_ms=$(( (warm_end - warm_start) / 1000000 ))
          warm_mbps=$(echo "scale=2; $total_bytes / 1048576 / ($warm_ms / 1000)" | bc 2>/dev/null || echo "N/A")
          echo "Warm read: ${warm_ms}ms for $(echo "$total_bytes / 1048576" | bc)MB = ${warm_mbps} MB/s"
          echo ""

          speedup=$(echo "scale=1; $cold_ms / $warm_ms" | bc 2>/dev/null || echo "N/A")
          echo "=== RESULT: Warm cache is ${speedup}x faster than cold ==="

          sleep 3600
      volumeMounts:
        - name: data
          mountPath: /data
      resources:
        requests:
          cpu: "2"
          memory: 4Gi
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: juicefs-data
  restartPolicy: Never
BENCHEOF

# Wait and check results
kubectl -n storage wait --for=condition=Ready pod/cache-benchmark --timeout=300s
sleep 60
# The pod is Ready as soon as the container starts, so the benchmark may still
# be running; if the RESULT line is missing, wait and re-run the logs command.
kubectl -n storage logs cache-benchmark
```
Step 7: Cleanup
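```shell
kubectl delete namespace storage
```
Deleting the namespace does not remove the cluster-scoped `juicefs-cache` StorageClass or the `juicefs-csi` Helm release installed in `kube-system`; remove those separately if you no longer need them.
Success Criteria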
You have completed this exercise when:
- MinIO is running and contains a 1GB test dataset (256 files)
- Redis metadata engine is running
- JuiceFS CSI driver is installed and StorageClass is created
- PVC is bound and mountable
- Cold cache read time is measured (expect: 30-120 seconds for 1GB)
- Warm cache read time is measured (expect: 2-10 seconds for 1GB)
- Warm cache read is at least 3x faster than cold cache read
- You can explain why the speedup occurs (local NVMe/memory vs S3 network round-trip)
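The timing criteria above can be checked mechanically once you have the cold and warm read times from the benchmark logs. The following is a minimal sketch, assuming you copy the millisecond values by hand; `mb_per_s` and `check_speedup` are hypothetical helper names introduced here for illustration, and the numbers in the example calls are illustrative, not measured results.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for checking the lab's timing criteria.
# Integer arithmetic only, so there is no bc or awk dependency.

# mb_per_s BYTES MILLISECONDS -> whole MB/s, rounded down
mb_per_s() {
  local bytes=$1 ms=$2
  if (( ms <= 0 )); then
    echo "invalid"
    return 2
  fi
  echo $(( bytes * 1000 / ms / 1048576 ))
}

# check_speedup COLD_MS WARM_MS [MIN_RATIO] -> prints pass or fail
# MIN_RATIO defaults to 3, matching the "at least 3x faster" criterion.
check_speedup() {
  local cold_ms=$1 warm_ms=$2 min_ratio=${3:-3}
  if (( warm_ms <= 0 )); then
    echo "invalid"
    return 2
  fi
  if (( cold_ms >= min_ratio * warm_ms )); then
    echo "pass"
  else
    echo "fail"
    return 1
  fi
}

# Illustrative values: a 1 GiB dataset read in 60 s cold and 5 s warm.
echo "cold: $(mb_per_s 1073741824 60000) MB/s"
echo "warm: $(mb_per_s 1073741824 5000) MB/s"
check_speedup 60000 5000
```

Comparing `cold_ms >= min_ratio * warm_ms` instead of dividing avoids both floating-point handling and a division by zero when the warm read is very fast.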
Key Takeaways
- IO is the most common bottleneck in GPU training: if GPU utilization is below 80%, investigate storage first
- Local NVMe is 10-50x faster than network storage for the random small-file reads that image training demands
- JuiceFS bridges the gap between object storage (cheap, durable, slow) and local NVMe (fast, ephemeral)
- Fluid/Alluxio add data-aware scheduling: Pods prefer nodes that already have their dataset cached
- Checkpointing must be fast: synchronous saves to slow storage can waste 15%+ of GPU time; use sharded async checkpoints
- Mount with `noatime`: a one-line fix that eliminates unnecessary metadata writes on every file read
- Measure before optimizing: use `fio` to benchmark your storage and `nvidia-smi` to correlate with GPU utilization
- The init container pattern for data staging is simple, reliable, and often sufficient for datasets under 1TB
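The `noatime` takeaway is easy to verify on a node. The following is a minimal sketch, assuming a Linux host where `/proc/mounts` lists active mounts; `has_noatime` is a hypothetical helper name, and `/mnt/nvme` is a placeholder mount point, not a path from this module.

```shell
#!/usr/bin/env bash
# has_noatime MOUNTPOINT [MOUNTS_FILE]
# Exit status: 0 = noatime set, 1 = not set, 2 = mount point not found.
# MOUNTS_FILE defaults to /proc/mounts (fields: device, mount point,
# filesystem type, comma-separated options, dump, pass).
has_noatime() {
  local mountpoint=$1 mounts_file=${2:-/proc/mounts}
  awk -v mp="$mountpoint" '
    $2 == mp {
      # If the mount point appears more than once, the last entry wins.
      found = 1; ok = 0
      n = split($4, opts, ",")
      for (i = 1; i <= n; i++) if (opts[i] == "noatime") ok = 1
    }
    END {
      if (!found) exit 2
      exit (ok ? 0 : 1)
    }
  ' "$mounts_file"
}

# Example: report on a placeholder dataset mount point.
target=${1:-/mnt/nvme}
if has_noatime "$target"; then
  echo "$target: noatime is set"
else
  echo "$target: noatime not set (or mount point not found)"
fi
```

Matching whole option names after splitting on commas avoids false positives from options that merely contain the substring, such as `nodiratime`.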
Further Reading
Documentation:
- JuiceFS: juicefs.com/docs/
- Fluid: github.com/fluid-cloudnative/fluid
- Alluxio: docs.alluxio.io
- TopoLVM: github.com/topolvm/topolvm
- Rook-Ceph: rook.io/docs/rook/latest/
Papers:
- “Analyzing and Mitigating Data Stalls in DNN Training”, Mohan et al., VLDB 2021 (IO bottleneck analysis)
- “CoorDL: Coordinated and Progressive Data Loading for Deep Learning”, Mohan et al.
Talks:
- “Building a Petabyte-Scale AI Data Platform on Kubernetes” — KubeCon EU 2024
- “JuiceFS: A Cloud-Native Distributed File System for AI Workloads” — CNCF Webinar
Summary
Storage is the hidden bottleneck that prevents expensive GPUs from reaching their potential. A multi-tier approach (local NVMe for hot data, a distributed filesystem for warm data, object storage for cold data), combined with intelligent caching (JuiceFS, Fluid/Alluxio), bridges the 1,000x performance gap between GPU memory and network storage. Fast checkpoint storage with async and sharded writes minimizes GPU idle time during saves. Measure, cache, and measure again.
Next Module
Continue to Module 1.5: Serving LLMs at Scale to learn how to deploy large language models for inference with vLLM, continuous batching, and KEDA autoscaling.
“Data is the new oil, but storage is the pipeline. A clogged pipeline makes the oil worthless.” — Anonymous infrastructure engineer