Module 1.3: Distributed Training Infrastructure
Discipline Module | Complexity: [COMPLEX] | Time: 5 hours
Prerequisites
Before starting this module:
- Required: Module 1.2: Advanced GPU Scheduling & Sharing — GPU topology, multi-GPU allocation
- Required: Kubernetes networking fundamentals (Services, CNI, Pod-to-Pod communication)
- Recommended: Basic understanding of neural network training (forward pass, backward pass, gradient descent)
- Recommended: Familiarity with PyTorch or TensorFlow distributed APIs
- Recommended: Access to a cluster with at least 2 GPU nodes (for multi-node exercises)
What You’ll Be Able to Do
After completing this module, you will be able to:
- Implement distributed training jobs on Kubernetes using PyTorch DDP, Horovod, or DeepSpeed
- Design multi-node training architectures with proper NCCL configuration and network topology awareness
- Configure Kubernetes operators like Kubeflow Training Operator for managing distributed training lifecycle
- Diagnose common distributed training failures — NCCL timeouts, OOM errors, stragglers — in Kubernetes environments
Why This Module Matters
Modern AI models are too large and too slow to train on a single GPU. GPT-4 is estimated to have been trained on ~25,000 GPUs for roughly 100 days. Even a “small” LLM like Llama-3-8B requires 1.3 million GPU-hours to train.
This reality creates a brutal infrastructure challenge: you must make hundreds or thousands of GPUs across dozens of machines work together as if they were a single giant accelerator. Every microsecond of network latency between GPUs becomes a tax on your training throughput. Every node failure forces a decision: restart from scratch or recover from a checkpoint?
The platform team’s job is to build the infrastructure that makes this possible:
- High-speed networking (InfiniBand, RoCE) so GPUs can exchange gradients at wire speed
- Kubernetes operators (MPI Operator, PyTorch Operator) that orchestrate distributed jobs
- Multi-network CNIs (Multus) that give pods secondary high-speed network interfaces
- Fault tolerance that recovers from node failures without losing days of work
Get this right, and your ML team ships models on schedule. Get it wrong, and they burn millions of dollars in GPU-hours on jobs that fail, stall, or run at a fraction of their potential speed.
Distributed Training Fundamentals
Why Distribute?
A single A100 GPU delivers roughly 300 TFLOPS of BF16 compute. Training Llama-3-70B requires approximately 6.4 × 10^23 FLOPs. At 300 TFLOPS:
```
6.4 × 10^23 FLOPs / (300 × 10^12 FLOPs/s) = 2.1 × 10^9 seconds ≈ 67 years on a single GPU
```

With 2,048 GPUs running at 50% efficiency:

```
67 years / (2,048 × 0.50) ≈ 24 days
```

The math is clear: you either distribute or you don’t train.
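The arithmetic above can be checked in a few lines (a back-of-envelope sketch; the FLOP count, per-GPU throughput, and 50% efficiency figure are the estimates quoted in this section):

```python
# Back-of-envelope check of the figures quoted above.
total_flops = 6.4e23       # estimated FLOPs to train Llama-3-70B (from the text)
per_gpu_flops = 300e12     # ~300 TFLOPS of BF16 on one A100

single_gpu_seconds = total_flops / per_gpu_flops
single_gpu_years = single_gpu_seconds / (365 * 24 * 3600)   # ≈ 67-68 years

n_gpus, efficiency = 2048, 0.50
cluster_days = single_gpu_years * 365 / (n_gpus * efficiency)  # ≈ 24 days

print(f"single GPU: ~{single_gpu_years:.0f} years")
print(f"{n_gpus} GPUs at {efficiency:.0%} efficiency: ~{cluster_days:.0f} days")
```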
Parallelism Strategies
Distributed training uses three complementary strategies:
```
┌─────────────────────────────────────────────────────────────┐
│                    Distributed Training                     │
├──────────────────┬──────────────────┬───────────────────────┤
│ Data Parallel    │ Model Parallel   │ Pipeline Parallel     │
│                  │ (Tensor)         │                       │
│ Same model on    │ Split layers     │ Split layers          │
│ every GPU.       │ across GPUs.     │ into stages.          │
│ Different data.  │ Each GPU has     │ Micro-batches         │
│                  │ part of each     │ flow through          │
│ Sync gradients   │ layer.           │ stages.               │
│ after each step. │                  │                       │
│                  │ Forward/backward │ Reduces memory        │
│ Scales to many   │ require all-     │ per stage.            │
│ GPUs easily.     │ to-all comms.    │                       │
└──────────────────┴──────────────────┴───────────────────────┘
```
Data Parallelism (most common for models that fit in one GPU’s memory):
```
GPU 0: Full model + Batch shard 0 ─┐
GPU 1: Full model + Batch shard 1 ─┤─ All-Reduce gradients
GPU 2: Full model + Batch shard 2 ─┤  after each step
GPU 3: Full model + Batch shard 3 ─┘
```
Tensor Parallelism (for layers too large for one GPU):
```
Layer N: [GPU 0: columns 0-2047] [GPU 1: columns 2048-4095]
         ────── All-Reduce activations between layers ──────
```
Pipeline Parallelism (for models with many layers):
```
GPU 0: Layers 0-9   → micro-batch → GPU 1: Layers 10-19
GPU 2: Layers 20-29 → micro-batch → GPU 3: Layers 30-39
```
In practice, large training runs use 3D parallelism: data parallel across nodes, tensor parallel within a node (NVLink), and pipeline parallel across node groups.
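The data-parallel pattern can be sketched in pure Python (a toy illustration, not how DDP is implemented; the model and numbers are made up): each worker computes a gradient on its own data shard, an all-reduce averages the gradients, and every replica applies the identical update:

```python
# Toy data parallelism: fit y = w*x by gradient descent across two "GPUs".
# Each worker sees only its shard; the all-reduce keeps replicas in sync.

def local_gradient(w, shard):
    # Gradient of mean squared error: d/dw (w*x - y)^2 = 2*x*(w*x - y)
    g = sum(2 * x * (w * x - y) for x, y in shard)
    return g / len(shard)

def all_reduce_mean(grads):
    # What NCCL's all-reduce (plus a divide by world size) computes
    return sum(grads) / len(grads)

data = [(x, 3.0 * x) for x in range(1, 9)]   # ground truth: w = 3
shards = [data[0:4], data[4:8]]              # one shard per "GPU"

w = 0.0
for step in range(200):
    grads = [local_gradient(w, s) for s in shards]  # parallel compute phase
    g = all_reduce_mean(grads)                      # communication phase
    w -= 0.001 * g                                  # identical update everywhere

print(f"learned w = {w:.3f}")   # converges toward 3.0
```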
The Communication Bottleneck
Here is the key insight: distributed training is a networking problem disguised as a compute problem.
During data-parallel training, every GPU must synchronize its gradients with every other GPU after each training step. For a model with 70 billion parameters in BF16:
```
Gradient size per step:           70 × 10^9 × 2 bytes = 140 GB
All-Reduce data volume:           2 × 140 GB × (N-1)/N ≈ 280 GB (ring all-reduce)
Training steps per second target: 1 step/s
Required network bandwidth:       280 GB/s bidirectional across the cluster
```
This is why standard Kubernetes networking (typically 10-25 Gbps) is completely inadequate for distributed training. You need specialized high-bandwidth, low-latency networks.
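The same estimate, spelled out step by step (a sketch; the 16-GPU ring size is an assumed example):

```python
# Ring all-reduce moves 2*(N-1)/N times the gradient size per GPU per step.
params = 70e9           # 70B parameters
bytes_per_param = 2     # BF16
n = 16                  # GPUs in the ring (assumed example)

grad_bytes = params * bytes_per_param              # 140 GB
allreduce_bytes = 2 * grad_bytes * (n - 1) / n     # ≈ 262-280 GB per GPU per step

steps_per_second = 1.0
required_gbps = allreduce_bytes * steps_per_second * 8 / 1e9  # in Gbit/s

print(f"gradient size:     {grad_bytes / 1e9:.0f} GB")
print(f"all-reduce volume: {allreduce_bytes / 1e9:.0f} GB per step")
print(f"needed bandwidth:  {required_gbps:.0f} Gbps per GPU link")
```

Even a 400 Gbps NDR link falls far short of sustaining one step per second at this scale, which is why real runs overlap communication with the backward pass and shard the model.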
High-Speed Networking: InfiniBand and RoCE
InfiniBand
InfiniBand (IB) is a specialized network fabric designed for high-performance computing. It operates entirely outside the TCP/IP stack:
| Property | InfiniBand HDR | InfiniBand NDR | Ethernet (25GbE) |
|---|---|---|---|
| Bandwidth per port | 200 Gbps | 400 Gbps | 25 Gbps |
| Latency | ~0.6 μs | ~0.5 μs | ~10-25 μs |
| Protocol | Native IB verbs | Native IB verbs | TCP/IP |
| RDMA | Yes (native) | Yes (native) | No (requires RoCE) |
| CPU overhead | Near zero | Near zero | Significant |
| Cost | $$$$ | $$$$$ | $ |
InfiniBand’s killer feature is RDMA (Remote Direct Memory Access): one machine can read from or write to another machine’s memory without involving either machine’s CPU or operating system. The network card (HCA) handles the entire transfer in hardware.
```
Traditional TCP:                    RDMA:
App → Kernel → TCP → NIC            App → NIC ──────→ Remote NIC → Remote Memory
  ↑ CPU copies data                   ↑ Zero CPU involvement
  ↑ Context switches                  ↑ Zero-copy
  ↑ Interrupt handling                ↑ Kernel bypass
```
RoCE (RDMA over Converged Ethernet)
RoCE brings RDMA to standard Ethernet networks. It is cheaper than InfiniBand but requires lossless Ethernet (PFC/ECN configuration):
```
RoCEv2 stack:
┌────────────┐
│  RDMA App  │
├────────────┤
│  IB Verbs  │  (same API as InfiniBand)
├────────────┤
│    RoCE    │  (encapsulates IB over UDP)
├────────────┤
│   UDP/IP   │
├────────────┤
│  Ethernet  │  (requires lossless config: PFC + ECN)
└────────────┘
```
GPUDirect RDMA
The ultimate optimization: GPUDirect RDMA allows a network card to read/write directly to GPU memory, bypassing both system memory and the CPU entirely:
```
Without GPUDirect:
GPU Memory → PCIe → System RAM → PCIe → NIC → Network → NIC → PCIe → System RAM → PCIe → GPU Memory
  4 PCIe hops, 2 memory copies, CPU involvement

With GPUDirect RDMA:
GPU Memory → PCIe → NIC → Network → NIC → PCIe → GPU Memory
  2 PCIe hops, 0 memory copies, 0 CPU involvement
```
GPUDirect RDMA requires:
- The GPU and NIC on the same PCIe root complex (ideally the same PCIe switch)
- A supported NIC (Mellanox/NVIDIA ConnectX-6 or newer)
- The NVIDIA peer memory kernel module loaded (`nv_peer_mem` on older stacks, `nvidia-peermem` with current drivers)
In Kubernetes, the GPU Operator can manage the peermem driver:
```yaml
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  driver:
    enabled: true
    rdma:
      enabled: true       # Enable GPUDirect RDMA
      useHostMofed: true  # Use host-installed Mellanox OFED drivers
```
NCCL: The GPU Communication Library
What NCCL Does
NCCL (NVIDIA Collective Communications Library, pronounced “nickel”) is the library that implements collective operations (all-reduce, all-gather, broadcast, reduce-scatter) across GPUs. Every distributed training framework (PyTorch DDP, Horovod, DeepSpeed) uses NCCL under the hood.
NCCL automatically selects the best communication path:
```
┌─────────────────────────────────────────────────┐
│                      NCCL                       │
│                                                 │
│  ┌─────────────┐  ┌──────────┐  ┌────────────┐  │
│  │   NVLink    │  │   PCIe   │  │  Network   │  │
│  │   (intra-   │  │  (intra- │  │  (inter-   │  │
│  │   node,     │  │   node,  │  │   node,    │  │
│  │   fastest)  │  │  slower) │  │  IB/RoCE)  │  │
│  └─────────────┘  └──────────┘  └────────────┘  │
│                                                 │
│  Automatically selects best path per GPU pair   │
└─────────────────────────────────────────────────┘
```
Critical NCCL Environment Variables
These environment variables dramatically affect distributed training performance:
```shell
# Network selection
NCCL_IB_DISABLE=0           # 0 = use InfiniBand if available
NCCL_SOCKET_IFNAME=eth0     # Fallback TCP interface (for bootstrapping)
NCCL_IB_HCA=mlx5            # InfiniBand HCA device name

# Performance tuning
NCCL_ALGO=Ring              # Algorithm: Ring, Tree, CollNetDirect
NCCL_PROTO=Simple           # Protocol: Simple, LL, LL128
NCCL_MIN_NCHANNELS=4        # Minimum parallel channels
NCCL_MAX_NCHANNELS=12       # Maximum parallel channels
NCCL_BUFFSIZE=8388608       # Buffer size per channel (8MB)

# GPUDirect
NCCL_NET_GDR_LEVEL=5        # GPUDirect RDMA level (5 = across PCIe switches)
NCCL_P2P_LEVEL=5            # Peer-to-peer level (intra-node)

# Debugging
NCCL_DEBUG=INFO             # Logging: WARN, INFO, TRACE
NCCL_DEBUG_SUBSYS=INIT,NET  # Subsystem-specific debugging
```
NCCL Topology Detection
NCCL probes the system topology at initialization and logs its findings. Look for these lines in your training job logs:
```
NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1
NCCL INFO Channel 00 : 0 1 2 3 4 5 6 7
NCCL INFO 0 : NET/IB/0/GDRDMA [receive] via NET/IB/0/GDRDMA [send]
NCCL INFO Using 12 channels per connection
NCCL INFO comm 0x7f8b00003c00 rank 0 nranks 16 - Init COMPLETE
```
The key line is `NET/IB/0/GDRDMA` — this confirms NCCL is using InfiniBand with GPUDirect RDMA, the optimal path.
If you see `NET/Socket` instead, NCCL has fallen back to TCP, and gradient communication will run 10-50x slower.
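Because this fallback is silent, it is worth automating the check. A small sketch (the helper name is hypothetical, not part of NCCL; the log strings mirror the examples above):

```python
# Scan NCCL INFO output and flag the TCP fallback described above.
def nccl_transport(log_text: str) -> str:
    """Return 'rdma', 'tcp', or 'unknown' based on NCCL debug output."""
    if "NET/IB" in log_text:        # covers NET/IB/0 and NET/IB/0/GDRDMA
        return "rdma"
    if "NET/Socket" in log_text:
        return "tcp"
    return "unknown"

good = "NCCL INFO 0 : NET/IB/0/GDRDMA [receive] via NET/IB/0/GDRDMA [send]"
bad  = "NCCL INFO 0 : NET/Socket/1 [receive] via NET/Socket/1 [send]"

assert nccl_transport(good) == "rdma"
assert nccl_transport(bad) == "tcp"
print("transport check passed")
```

Wiring a check like this into a post-start hook or CI smoke test catches the fallback before a multi-day run burns GPU-hours at TCP speed.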
Multus CNI: Multi-Network Pods
Section titled “Multus CNI: Multi-Network Pods”The Problem
Kubernetes gives each Pod a single network interface (typically eth0) on the cluster’s primary CNI network. This network is designed for general traffic — Service discovery, API calls, metrics scraping — not for 400 Gbps GPU-to-GPU data transfers.
For distributed training, Pods need a second network interface connected to the high-speed InfiniBand or RoCE fabric.
Multus Architecture
Multus CNI is a “meta-plugin” that chains multiple CNI plugins, giving Pods multiple network interfaces:
```
┌─────────────────────────────────────────────────┐
│                      Pod                        │
│                                                 │
│  ┌─────────────┐      ┌──────────────────┐      │
│  │    eth0     │      │       net1       │      │
│  │  (primary)  │      │   (secondary)    │      │
│  │   Calico/   │      │  SRIOV/macvlan/  │      │
│  │   Cilium    │      │   host-device    │      │
│  │  10.0.1.5   │      │   192.168.1.5    │      │
│  └──────┬──────┘      └────────┬─────────┘      │
└─────────┼──────────────────────┼────────────────┘
          │                      │
    ┌─────┴──────┐        ┌──────┴──────┐
    │  Cluster   │        │ InfiniBand  │
    │  Network   │        │   /RoCE     │
    │  (10GbE)   │        │ (200 Gbps)  │
    └────────────┘        └─────────────┘
```
Installing Multus
Section titled “Installing Multus”# Install Multus CNI (thick plugin — recommended)kubectl apply -f https://raw.githubusercontent.com/k8snetworkplumbingwg/multus-cni/master/deployments/multus-daemonset-thick.yml
# Verifykubectl get pods -n kube-system -l app=multusNetwork Attachment Definitions
Define secondary networks using NetworkAttachmentDefinition CRDs:
macvlan (for RoCE/Ethernet high-speed networks):
```yaml
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: roce-network
  namespace: ml-training
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "enp94s0f0",
      "mode": "bridge",
      "ipam": {
        "type": "host-local",
        "subnet": "192.168.10.0/24",
        "rangeStart": "192.168.10.100",
        "rangeEnd": "192.168.10.200",
        "routes": [
          { "dst": "192.168.10.0/24" }
        ]
      }
    }
```
SR-IOV (for InfiniBand with hardware-level network isolation):
```yaml
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: ib-sriov-network
  namespace: ml-training
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "ib-sriov",
      "pkey": "0x00FF",
      "link_state": "enable",
      "rdmaIsolation": true,
      "ibKubernetesEnabled": true,
      "ipam": {
        "type": "whereabouts",
        "range": "192.168.20.0/24"
      }
    }
```
host-device (directly attach host NIC to Pod):
```yaml
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: ib-host-device
  namespace: ml-training
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "host-device",
      "device": "mlx5_0",
      "ipam": {
        "type": "host-local",
        "subnet": "192.168.30.0/24"
      }
    }
```
Using Secondary Networks in Pods
Annotate Pods to attach secondary networks:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: training-worker-0
  namespace: ml-training
  annotations:
    k8s.v1.cni.cncf.io/networks: roce-network
spec:
  containers:
    - name: trainer
      image: nvcr.io/nvidia/pytorch:24.09-py3
      env:
        - name: NCCL_SOCKET_IFNAME
          value: "net1"   # Use the secondary interface for NCCL
        - name: NCCL_IB_DISABLE
          value: "0"
```
Inside the Pod, you will see both interfaces:
```shell
# ip addr show
1: lo: <LOOPBACK> ...
2: eth0@if123: <BROADCAST> ...   # Primary (cluster network)
    inet 10.0.1.5/24
3: net1@if456: <BROADCAST> ...   # Secondary (high-speed network)
    inet 192.168.10.105/24
```
Kubeflow Training Operators
Overview
The Kubeflow Training Operator manages distributed training jobs on Kubernetes. It provides CRDs for:
| CRD | Framework | Communication |
|---|---|---|
| `PyTorchJob` | PyTorch DDP/FSDP | NCCL + Gloo |
| `MPIJob` | Horovod, DeepSpeed | MPI (OpenMPI/MPICH) |
| `TFJob` | TensorFlow | gRPC |
| `PaddleJob` | PaddlePaddle | NCCL + Gloo |
| `JAXJob` | JAX/XLA | gRPC |
Installing the Training Operator
```shell
# Install via Helm
helm repo add kubeflow https://kubeflow.github.io/training-operator
helm repo update

helm install training-operator kubeflow/training-operator \
  --namespace kubeflow \
  --create-namespace \
  --version v1.8.1
```
PyTorchJob Example
A complete multi-node PyTorch DDP training job:
```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: llama-finetune
  namespace: ml-training
spec:
  nprocPerNode: "4"   # 4 GPUs per node
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            k8s.v1.cni.cncf.io/networks: roce-network
        spec:
          tolerations:
            - key: nvidia.com/gpu
              operator: Exists
              effect: NoSchedule
          containers:
            - name: pytorch
              image: my-registry/llama-trainer:v1.2
              command:
                - torchrun
              args:
                - --nnodes=2
                - --nproc_per_node=4
                - --rdzv_backend=c10d
                - --rdzv_endpoint=$(MASTER_ADDR):$(MASTER_PORT)
                - train.py
                - --model_name=meta-llama/Llama-3-8B
                - --batch_size=8
                - --gradient_accumulation_steps=4
                - --fp16
                - --output_dir=/checkpoints/llama-ft
              env:
                - name: NCCL_SOCKET_IFNAME
                  value: "net1"
                - name: NCCL_DEBUG
                  value: "INFO"
                - name: NCCL_IB_DISABLE
                  value: "0"
              resources:
                limits:
                  nvidia.com/gpu: 4
                  cpu: "32"
                  memory: 128Gi
              volumeMounts:
                - name: checkpoints
                  mountPath: /checkpoints
                - name: dshm
                  mountPath: /dev/shm
          volumes:
            - name: checkpoints
              persistentVolumeClaim:
                claimName: training-checkpoints
            - name: dshm
              emptyDir:
                medium: Memory
                sizeLimit: 64Gi   # Large shared memory for NCCL
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            k8s.v1.cni.cncf.io/networks: roce-network
        spec:
          tolerations:
            - key: nvidia.com/gpu
              operator: Exists
              effect: NoSchedule
          containers:
            - name: pytorch
              image: my-registry/llama-trainer:v1.2
              command:
                - torchrun
              args:
                - --nnodes=2
                - --nproc_per_node=4
                - --rdzv_backend=c10d
                - --rdzv_endpoint=$(MASTER_ADDR):$(MASTER_PORT)
                - train.py
                - --model_name=meta-llama/Llama-3-8B
                - --batch_size=8
                - --gradient_accumulation_steps=4
                - --fp16
                - --output_dir=/checkpoints/llama-ft
              env:
                - name: NCCL_SOCKET_IFNAME
                  value: "net1"
                - name: NCCL_DEBUG
                  value: "INFO"
                - name: NCCL_IB_DISABLE
                  value: "0"
              resources:
                limits:
                  nvidia.com/gpu: 4
                  cpu: "32"
                  memory: 128Gi
              volumeMounts:
                - name: checkpoints
                  mountPath: /checkpoints
                - name: dshm
                  mountPath: /dev/shm
          volumes:
            - name: checkpoints
              persistentVolumeClaim:
                claimName: training-checkpoints
            - name: dshm
              emptyDir:
                medium: Memory
                sizeLimit: 64Gi
```
MPIJob Example (for Horovod/DeepSpeed)
```yaml
apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  name: deepspeed-training
  namespace: ml-training
spec:
  slotsPerWorker: 4   # GPUs per worker
  runPolicy:
    cleanPodPolicy: Running
    backoffLimit: 3
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: launcher
              image: my-registry/deepspeed-trainer:v2.0
              command:
                - mpirun
              args:
                - --allow-run-as-root
                - -np
                - "8"
                - -bind-to
                - none
                - -map-by
                - slot
                - -x             # each -x flag and its value must be
                - NCCL_DEBUG=INFO  # separate argv elements for mpirun
                - -x
                - NCCL_IB_DISABLE=0
                - -x
                - NCCL_SOCKET_IFNAME=net1
                - -x
                - LD_LIBRARY_PATH
                - python
                - train_deepspeed.py
                - --deepspeed_config=ds_config.json
              resources:
                limits:
                  cpu: "2"
                  memory: 4Gi
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            k8s.v1.cni.cncf.io/networks: roce-network
        spec:
          containers:
            - name: worker
              image: my-registry/deepspeed-trainer:v2.0
              resources:
                limits:
                  nvidia.com/gpu: 4
                  cpu: "32"
                  memory: 128Gi
              volumeMounts:
                - name: dshm
                  mountPath: /dev/shm
          volumes:
            - name: dshm
              emptyDir:
                medium: Memory
                sizeLimit: 64Gi
```
Topology Spread and Pod Placement
Section titled “Topology Spread and Pod Placement”The Problem
When training across multiple nodes, you want pods placed on nodes that are physically close in the network. In a data center with multiple racks and spine-leaf networking, two nodes on the same leaf switch have ~2 μs latency, while nodes on different spine switches may have ~10 μs.
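A rough cost model makes the stakes concrete (an illustrative sketch, not NCCL's actual performance model; the bucket size, bandwidth, and latencies are assumed values): a ring all-reduce over N ranks takes 2(N-1) steps and pays one link latency per step, so small gradient buckets feel per-hop latency the most:

```python
# Estimate ring all-reduce time: 2*(n-1) steps, each moving size/n bytes
# over a link of the given bandwidth, plus one link latency per step.
def ring_allreduce_seconds(size_bytes, n_ranks, link_gbps, latency_s):
    per_step = (size_bytes / n_ranks) * 8 / (link_gbps * 1e9) + latency_s
    return 2 * (n_ranks - 1) * per_step

bucket = 25e6   # one DDP-style gradient bucket (~25 MB) — small messages
same_leaf   = ring_allreduce_seconds(bucket, 32, 200, 2e-6)    # ~2 μs hops
cross_spine = ring_allreduce_seconds(bucket, 32, 200, 10e-6)   # ~10 μs hops

print(f"same leaf:   {same_leaf * 1e3:.2f} ms per bucket")
print(f"cross spine: {cross_spine * 1e3:.2f} ms per bucket")
print(f"slowdown:    {cross_spine / same_leaf:.2f}x")
```

Under these assumptions the extra 8 μs per hop costs roughly 20-25% per bucket, and training exchanges many buckets per step, which is why placement matters.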
Topology Spread Constraints
Keep training workers close together with scheduling constraints: pod affinity co-locates all workers of a job in one zone, while a hostname topology spread constraint places one worker per node within it:
```yaml
spec:
  # Pull all workers of this job into the same zone
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - topologyKey: topology.kubernetes.io/zone
          labelSelector:
            matchLabels:
              training.kubeflow.org/job-name: llama-finetune
  # Spread them one per node within that zone
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          training.kubeflow.org/job-name: llama-finetune
```
Cloud Provider Placement Groups
For cloud-based training clusters:
```shell
# GCP: Compact placement policy
gcloud compute resource-policies create group-placement ml-training-compact \
  --collocation=COLLOCATED \
  --vm-count=16

# AWS: Cluster placement group
aws ec2 create-placement-group \
  --group-name ml-training-cluster \
  --strategy cluster
```
Node Failure Handling
Section titled “Node Failure Handling”The Reality
In a cluster with 128 GPU nodes running a multi-day training job, node failures are not a possibility — they are a certainty. Common failure modes:
| Failure | Frequency (per 1000 GPU-hours) | Impact |
|---|---|---|
| GPU Xid errors (recoverable) | 2-5 | Training step fails, retry |
| GPU fallen off bus (Xid 79) | 0.5-1 | Node reboot required |
| NIC failures | 0.2-0.5 | NCCL timeout, job stalls |
| OOM kills | 1-3 | Worker restarts |
| Node kernel panic | 0.1-0.3 | Node replacement |
| Disk failures | 0.05-0.1 | Checkpoint loss if local |
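Multiplying the table's rates out for a concrete job shows why checkpointing is non-negotiable (a planning sketch; the rate midpoints and the 64-GPU, one-week job are assumed values):

```python
# Expected interruptions, using midpoints of the per-1000-GPU-hour rates above.
rates = {                              # failures per 1000 GPU-hours
    "Xid errors (recoverable)": 3.5,
    "GPU fallen off bus":       0.75,
    "NIC failures":             0.35,
    "OOM kills":                2.0,
    "Node kernel panics":       0.2,
    "Disk failures":            0.075,
}

gpu_hours = 64 * 7 * 24                # 64 GPUs for one week
expected = {k: r * gpu_hours / 1000 for k, r in rates.items()}

for kind, n in expected.items():
    print(f"{kind:>26}: ~{n:.1f}")
print(f"{'total':>26}: ~{sum(expected.values()):.0f} interruptions")
```

Even this modest job should expect dozens of interruptions over a week, which is why the next section treats checkpoint-and-resume as mandatory infrastructure.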
Checkpointing Strategy
Checkpointing saves model state periodically so training can resume after failures:
```python
# In your training script (PyTorch example)
import torch
import torch.distributed as dist

def save_checkpoint(model, optimizer, epoch, step, path):
    if dist.get_rank() == 0:  # Only rank 0 saves
        checkpoint = {
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'epoch': epoch,
            'step': step,
            'rng_state': torch.cuda.get_rng_state(),
        }
        torch.save(checkpoint, f"{path}/checkpoint_epoch{epoch}_step{step}.pt")
        # Keep only last 3 checkpoints
        cleanup_old_checkpoints(path, keep=3)
```
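The snippet calls `cleanup_old_checkpoints` without defining it; a minimal sketch (the function name comes from the snippet, but the retention logic here is an assumption, and it expects `keep >= 1`):

```python
import os
import re

def cleanup_old_checkpoints(path, keep=3):
    """Delete all but the `keep` newest checkpoint_epoch{E}_step{S}.pt files."""
    pattern = re.compile(r"checkpoint_epoch(\d+)_step(\d+)\.pt$")
    found = sorted(
        ((int(m.group(1)), int(m.group(2))), m.group(0))
        for name in os.listdir(path)
        if (m := pattern.match(name))
    )
    # Sorted oldest-first by (epoch, step); remove everything but the newest `keep`
    for _, name in found[:-keep]:
        os.remove(os.path.join(path, name))
```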
```python
# Checkpoint every N steps
for step, batch in enumerate(dataloader):
    loss = train_step(model, batch)
    if step % 500 == 0:
        save_checkpoint(model, optimizer, epoch, step, "/checkpoints")
```
Elastic Training
PyTorch Elastic (torchrun) supports elastic training — workers can join or leave during training:
```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: elastic-training
spec:
  elasticPolicy:
    minReplicas: 2     # Minimum workers to continue training
    maxReplicas: 8     # Maximum workers if available
    maxRestarts: 10    # Total restart budget
    rdzvBackend: c10d
  pytorchReplicaSpecs:
    Worker:
      replicas: 4      # Desired replicas (elastic: can scale 2-8)
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: trainer
              image: my-registry/elastic-trainer:v1
              command: ["torchrun"]
              args:
                - --nnodes=2:8
                - --nproc_per_node=4
                - --rdzv_backend=c10d
                - --rdzv_endpoint=elastic-training-master-0:29400
                - --max_restarts=10
                - train_elastic.py
              resources:
                limits:
                  nvidia.com/gpu: 4
```
With elastic training, if a node fails:
- Remaining workers detect the failure via rendezvous timeout
- Workers re-form a new group (excluding the failed node)
- Training resumes from the last checkpoint with fewer GPUs
- When the node recovers, it can rejoin the group
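Step 3 above ("resumes from the last checkpoint") requires every worker to locate the newest checkpoint on shared storage at startup. A minimal sketch (the helper name is an assumption; the filename scheme matches the checkpointing example earlier):

```python
import os
import re

def latest_checkpoint(path):
    """Return the newest checkpoint_epoch{E}_step{S}.pt filename, or None."""
    pattern = re.compile(r"checkpoint_epoch(\d+)_step(\d+)\.pt$")
    best_key, best = (-1, -1), None
    for name in os.listdir(path):
        m = pattern.match(name)
        if m:
            key = (int(m.group(1)), int(m.group(2)))
            if key > best_key:
                best_key, best = key, name
    return best

# At startup, every rank resolves the same file before training:
#   ckpt = latest_checkpoint("/checkpoints")
#   if ckpt is not None:
#       state = torch.load(os.path.join("/checkpoints", ckpt))
```

Because all ranks read the same shared volume, they deterministically agree on the resume point without extra coordination.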
Did You Know?
- Meta’s Grand Teton cluster for Llama 3 training used 16,384 H100 GPUs connected via 400 Gbps RoCE fabric. During the 54-day training run, they experienced 466 job interruptions — roughly 8.6 per day. Their checkpoint-and-resume automation was so good that these interruptions cost only 3% of total GPU time.
- The `/dev/shm` mount is one of the most overlooked performance bottlenecks in distributed training on Kubernetes. NCCL and PyTorch DataLoader use shared memory extensively. Docker’s default `/dev/shm` size is 64MB, while a multi-GPU training job can easily need 16-64GB of shared memory. Forgetting to set `emptyDir.medium: Memory` with an adequate `sizeLimit` is one of the most common reasons distributed training silently runs 2-5x slower than expected.
- The term “NCCL” was originally an acronym for “NVIDIA Collective Communications Library” but the team internally jokes it stands for “Nickel” because every distributed training job is “nickel-and-dimed” by communication overhead. At Meta’s scale, a 1% improvement in NCCL efficiency saves millions of dollars per year.
War Story: The Mysterious 50% Throughput Drop
A team running a 32-GPU (4 nodes x 8 GPUs) training job on RoCE noticed that throughput was exactly half of what they expected based on single-node benchmarks.
The investigation:
- NCCL logs showed `NET/Socket` instead of `NET/IB` — NCCL was using TCP, not RDMA
- Root cause: Multus was correctly attaching the RoCE NIC, but `NCCL_SOCKET_IFNAME` was set to `eth0` instead of `net1`
- Secondary issue: even after fixing the interface name, RoCE was slow because the switch did not have PFC (Priority Flow Control) enabled, causing packet drops and retransmissions
The fix:
- Set `NCCL_SOCKET_IFNAME=net1` (or the correct secondary interface)
- Configured PFC on the ToR switches for the RoCE VLAN
- Verified with `NCCL_DEBUG=INFO` that logs showed `NET/IB/0/GDRDMA`
Result: throughput jumped from 50% to 93% of linear scaling.
Lesson: Always check NCCL debug output. The difference between TCP fallback and RDMA is not a 10% performance difference — it is a 10x difference.
Common Mistakes
| Mistake | Problem | Solution |
|---|---|---|
| Missing `/dev/shm` mount | NCCL and DataLoader OOM or slow | Always mount `emptyDir: {medium: Memory, sizeLimit: 64Gi}` at `/dev/shm` |
| Wrong `NCCL_SOCKET_IFNAME` | NCCL uses slow primary interface instead of high-speed secondary | Set to the Multus secondary interface name (e.g., `net1`) |
| No checkpointing | Node failure loses all training progress | Checkpoint every 500-1000 steps to shared storage |
| TCP fallback without noticing | Training runs 10-50x slower; team blames “Kubernetes overhead” | Always check `NCCL_DEBUG=INFO` output for `NET/IB` vs `NET/Socket` |
| Pods on different racks | Inter-rack latency kills all-reduce performance | Use topology spread constraints or placement groups |
| Insufficient `backoffLimit` | Job fails permanently after first transient error | Set `backoffLimit: 3-5` for training jobs |
| Not setting `hostIPC: true` when needed | Some NCCL/MPI configurations need host IPC namespace | Enable `hostIPC` for MPI jobs that use shared memory transports |
| Forgetting RDMA device permissions | Pod cannot open IB device — errno 13 | Use RDMA device plugin or set `securityContext.capabilities.add: ["IPC_LOCK"]` |
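Several of these mistakes are mechanically checkable before submission. A preflight sketch (a hypothetical helper, not part of any operator) that catches the first two table entries in a pod spec dict:

```python
# Check a pod spec (as a Python dict) for a missing memory-backed /dev/shm
# volume and a wrong NCCL_SOCKET_IFNAME value.
def preflight(pod_spec, expected_ifname="net1"):
    problems = []
    shm_ok = any(
        v.get("emptyDir", {}).get("medium") == "Memory"
        for v in pod_spec.get("volumes", [])
    )
    if not shm_ok:
        problems.append("no memory-backed emptyDir for /dev/shm")
    for c in pod_spec.get("containers", []):
        env = {e["name"]: e.get("value") for e in c.get("env", [])}
        if env.get("NCCL_SOCKET_IFNAME") != expected_ifname:
            problems.append(f"{c['name']}: NCCL_SOCKET_IFNAME != {expected_ifname}")
    return problems

spec = {
    "containers": [{"name": "trainer",
                    "env": [{"name": "NCCL_SOCKET_IFNAME", "value": "eth0"}]}],
    "volumes": [],
}
print(preflight(spec))   # reports both problems
```

Running a check like this in CI, against the rendered PyTorchJob pod template, turns two silent 2-50x slowdowns into loud pre-submit failures.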
Quiz: Check Your Understanding
Question 1
Why is distributed training fundamentally a networking problem, not a compute problem?
Answer:
In data-parallel training, every GPU computes gradients independently (the compute part), but then all GPUs must synchronize gradients before the next training step via an all-reduce operation. For a 70B parameter model in BF16, this means exchanging ~280GB of data every training step.
If the network is slow, GPUs spend most of their time waiting for gradient synchronization instead of computing. With 10 Gbps Ethernet, transferring 280GB takes ~224 seconds. With 400 Gbps InfiniBand, it takes ~5.6 seconds. The compute (forward + backward pass) typically takes 1-3 seconds.
```
With Ethernet:   GPU utilization = 3 / (3 + 224) ≈ 1.3%
With InfiniBand: GPU utilization = 3 / (3 + 5.6) ≈ 35%
```
The network determines whether your expensive GPUs are computing or waiting.
Question 2
What is GPUDirect RDMA and why does it matter for distributed training?
Answer:
GPUDirect RDMA allows a network card (NIC/HCA) to read from and write to GPU memory directly, without copying data through system RAM or involving the CPU.
```
Without GPUDirect: GPU → PCIe → RAM → PCIe → NIC  (4 PCIe hops, 2 memory copies, CPU involvement)
With GPUDirect:    GPU → PCIe → NIC               (2 PCIe hops, zero copies, zero CPU)
```
This reduces gradient synchronization latency by 30-50% and eliminates CPU bottlenecks. It requires the GPU and NIC to be on the same PCIe root complex, the nvidia-peermem kernel module, and a supported NIC (ConnectX-6+).
Question 3
Why do distributed training Pods need a large /dev/shm mount, and what happens without it?
Answer:
`/dev/shm` is shared memory backed by RAM. NCCL uses it for intra-node GPU-to-GPU communication buffers, and PyTorch DataLoader uses it (when `num_workers > 0`) to share data between loader processes and the training process.
Docker/containerd defaults /dev/shm to 64MB. A 4-GPU training job with NCCL can easily need 8-16GB of shared memory, and DataLoader with 8 workers loading images can need another 8-16GB.
Without adequate /dev/shm:
- NCCL falls back to slower communication paths (socket instead of shared memory)
- DataLoader workers get killed by OOM or fail to allocate shared tensors
- Training silently runs 2-5x slower without any obvious error message
Fix: mount emptyDir: {medium: Memory, sizeLimit: 64Gi} at /dev/shm.
Question 4
Explain the difference between a PyTorchJob and an MPIJob in the Kubeflow Training Operator.
Answer:
PyTorchJob:
- Uses PyTorch’s native distributed launcher (`torchrun`)
- Workers discover each other via a rendezvous mechanism (c10d, etcd)
- The master sets `MASTER_ADDR` and `MASTER_PORT` environment variables
- Supports elastic training (workers can join/leave)
- Best for: PyTorch DDP and FSDP workloads

MPIJob:
- Uses MPI (OpenMPI/MPICH) to launch processes
- A separate Launcher pod runs `mpirun` to start workers via SSH
- Workers must have SSH access configured (the operator handles this)
- Better for legacy HPC workloads, Horovod, and DeepSpeed with the MPI launcher
- Does not natively support elasticity
Key architectural difference: PyTorchJob workers self-organize via rendezvous, while MPIJob workers are orchestrated by a central launcher via MPI’s process management.
Question 5
A 4-node distributed training job reports NCCL timeout errors after 5 minutes. Node 3 is the last to report the timeout. What is your investigation workflow?
Answer:
Investigation workflow:
1. Check NCCL debug logs (`NCCL_DEBUG=INFO`): is it using IB/RDMA or Socket/TCP? If TCP fallback, that is likely the cause.
2. Check node 3 specifically — since it is the last to time out, it is likely the source:
   - `kubectl describe pod` — is the pod on the right node? Does it have GPU and network resources?
   - `kubectl exec` into the pod — can it ping other workers on the secondary network (net1)?
   - Check `nvidia-smi` — are the GPUs healthy?
   - Check `ibstat` — is the InfiniBand/RoCE link up?
3. Check network connectivity: from node 3’s pod, verify RDMA connectivity:
   - Run `ib_write_bw` / `ib_read_bw` to other nodes
   - Check for packet drops on the switch port connected to node 3
4. Check for asymmetric issues:
   - Different NCCL versions across nodes
   - One node missing the `nvidia-peermem` module
   - One node’s NIC on a different VLAN
5. Check resource limits: is the pod hitting CPU/memory limits that cause NCCL thread starvation?
Hands-On Exercise: Multi-Node Distributed PyTorch Training with Multus
Objective
Deploy Multus CNI, configure a secondary macvlan network for high-speed inter-node communication, and run a distributed PyTorch training job across 2 nodes using the Kubeflow Training Operator.
Environment
- Kubernetes cluster with 2+ GPU nodes (at least 1 GPU each)
- GPU Operator installed (Module 1.1)
- Helm installed
- Nodes with a shared Layer-2 network on a secondary NIC (for the macvlan exercise; if using cloud, adapt to use the primary network)
Step 1: Install Multus CNI
```shell
# Install Multus thick plugin
kubectl apply -f https://raw.githubusercontent.com/k8snetworkplumbingwg/multus-cni/v4.1.4/deployments/multus-daemonset-thick.yml

# Verify Multus is running on all nodes
kubectl -n kube-system get pods -l app=multus -o wide
```
Step 2: Create a Network Attachment Definition
```shell
# Identify the secondary NIC on your GPU nodes
# For cloud: this might be the same as the primary interface
# For bare-metal: find the high-speed interface
# kubectl debug node/<gpu-node> -it --image=ubuntu -- ip link show

# Create the namespace first
kubectl create namespace ml-training

# Create a macvlan NetworkAttachmentDefinition in that namespace
cat <<'EOF' | kubectl apply -f -
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: training-network
  namespace: ml-training
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "eth0",
      "mode": "bridge",
      "ipam": {
        "type": "host-local",
        "subnet": "192.168.100.0/24",
        "rangeStart": "192.168.100.10",
        "rangeEnd": "192.168.100.50"
      }
    }
EOF
```
Step 3: Install the Kubeflow Training Operator
```shell
helm repo add kubeflow https://kubeflow.github.io/training-operator
helm repo update

helm install training-operator kubeflow/training-operator \
  --namespace kubeflow \
  --create-namespace \
  --version v1.8.1

# Verify
kubectl -n kubeflow get pods
```
Step 4: Create Shared Storage for Checkpoints
```bash
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-checkpoints
  namespace: ml-training
spec:
  accessModes:
    - ReadWriteMany     # Must be RWX for multi-node access
  storageClassName: ""  # "" disables dynamic provisioning; set the name of your RWX-capable StorageClass here
  resources:
    requests:
      storage: 50Gi
EOF
```
Step 5: Deploy a Distributed PyTorch Training Job
```bash
# First, create a ConfigMap with the training script
# (torchrun does not support -c for inline scripts — it requires a file path)
cat <<'SCRIPTEOF' | kubectl apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: mnist-train-script
  namespace: ml-training
data:
  train.py: |
    import torch
    import torch.nn as nn
    import torch.distributed as dist
    import torch.optim as optim
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torchvision import datasets, transforms
    from torch.utils.data.distributed import DistributedSampler

    dist.init_process_group(backend='nccl')
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    device = torch.device('cuda')
    print(f"Rank {rank}/{world_size} on {device}")

    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))
    ])
    dataset = datasets.MNIST('/tmp/data', train=True, download=True, transform=transform)
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = torch.utils.data.DataLoader(dataset, batch_size=64, sampler=sampler)

    model = nn.Sequential(
        nn.Flatten(),
        nn.Linear(784, 256),
        nn.ReLU(),
        nn.Linear(256, 10)
    ).to(device)
    model = DDP(model)
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    criterion = nn.CrossEntropyLoss()

    for epoch in range(3):
        sampler.set_epoch(epoch)
        total_loss = 0
        for batch_idx, (data, target) in enumerate(loader):
            data, target = data.to(device), target.to(device)
            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
            if batch_idx % 100 == 0:
                print(f"Rank {rank} Epoch {epoch} Batch {batch_idx} Loss {loss.item():.4f}")
        print(f"Rank {rank} Epoch {epoch} Avg Loss: {total_loss/len(loader):.4f}")

    if rank == 0:
        torch.save(model.state_dict(), '/checkpoints/mnist_ddp.pt')
        print("Model saved!")
    dist.destroy_process_group()
SCRIPTEOF
```
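A side note on checkpoint sizing: the 50Gi PVC from Step 4 is enormously oversized for this toy model, which a back-of-envelope calculation (plain Python, independent of the cluster) makes concrete:

```python
# Parameter count of the MLP in train.py:
# Linear(784, 256) and Linear(256, 10), each with weights (in*out) plus biases (out)
params = (784 * 256 + 256) + (256 * 10 + 10)
print(params)  # 203530

# state_dict stores fp32 tensors: 4 bytes per parameter
weights_mib = params * 4 / 2**20
print(f"~{weights_mib:.2f} MiB of weights")  # ~0.78 MiB

# A resumable checkpoint that also saves Adam's two per-parameter moment
# buffers is roughly 3x that (an estimate that ignores metadata overhead)
print(f"~{3 * weights_mib:.2f} MiB with optimizer state")
```

For real models the same arithmetic bites quickly: an 8B-parameter model is roughly 30 GiB of fp32 weights alone, before any optimizer state, so PVC sizing stops being generous fast.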
```bash
# Now create the PyTorchJob referencing the script file
cat <<'JOBEOF' | kubectl apply -f -
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: distributed-mnist
  namespace: ml-training
spec:
  nprocPerNode: "1"
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            k8s.v1.cni.cncf.io/networks: training-network
        spec:
          containers:
            - name: pytorch
              image: nvcr.io/nvidia/pytorch:24.09-py3
              command: ["torchrun"]
              args:
                - --nnodes=2
                - --nproc_per_node=1
                - --rdzv_backend=c10d
                - --rdzv_endpoint=$(MASTER_ADDR):$(MASTER_PORT)
                - /scripts/train.py
              env:
                - name: NCCL_DEBUG
                  value: "INFO"
              resources:
                limits:
                  nvidia.com/gpu: 1
                  cpu: "4"
                  memory: 16Gi
              volumeMounts:
                - name: scripts
                  mountPath: /scripts
                - name: checkpoints
                  mountPath: /checkpoints
                - name: dshm
                  mountPath: /dev/shm
          volumes:
            - name: scripts
              configMap:
                name: mnist-train-script
            - name: checkpoints
              persistentVolumeClaim:
                claimName: training-checkpoints
            - name: dshm
              emptyDir:
                medium: Memory
                sizeLimit: 8Gi
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            k8s.v1.cni.cncf.io/networks: training-network
        spec:
          containers:
            - name: pytorch
              image: nvcr.io/nvidia/pytorch:24.09-py3
              command: ["torchrun"]
              args:
                - --nnodes=2
                - --nproc_per_node=1
                - --rdzv_backend=c10d
                - --rdzv_endpoint=$(MASTER_ADDR):$(MASTER_PORT)
                - /scripts/train.py
              env:
                - name: NCCL_DEBUG
                  value: "INFO"
              resources:
                limits:
                  nvidia.com/gpu: 1
                  cpu: "4"
                  memory: 16Gi
              volumeMounts:
                - name: scripts
                  mountPath: /scripts
                - name: dshm
                  mountPath: /dev/shm
          volumes:
            - name: scripts
              configMap:
                name: mnist-train-script
            - name: dshm
              emptyDir:
                medium: Memory
                sizeLimit: 8Gi
JOBEOF
```
```bash
# Watch the job
kubectl -n ml-training get pods -w
```
Step 6: Verify Distributed Communication
```bash
# Check NCCL initialization in master logs
kubectl -n ml-training logs distributed-mnist-master-0 | grep "NCCL INFO"

# Look for:
# - "Init COMPLETE"          — NCCL initialized successfully
# - "NET/IB" or "NET/Socket" — which transport is being used
# - "nranks 2"               — both nodes participating

# Check training progress
kubectl -n ml-training logs -f distributed-mnist-master-0
```
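The transport line is easy to miss in verbose NCCL output. A small, hypothetical helper in plain Python that keys on the same `NET/IB` / `NET/Socket` substrings the grep above looks for (the surrounding log formatting is an assumption, not a documented contract):

```python
# Hypothetical helper: classify which transport NCCL chose, given its
# NCCL_DEBUG=INFO log lines. Only the NET/IB and NET/Socket markers are
# taken from this exercise's verification step; other details are assumed.
TRANSPORTS = {
    "NET/IB": "InfiniBand/RoCE (RDMA) - fast path",
    "NET/Socket": "TCP sockets - slow path; check NCCL_SOCKET_IFNAME",
}

def classify_transport(log_lines):
    """Return a label for the first recognized NCCL NET transport, else None."""
    for line in log_lines:
        for marker, label in TRANSPORTS.items():
            if marker in line:
                return label
    return None

# Illustrative log line (format approximate)
demo = ["node-0:23:42 [0] NCCL INFO NET/IB : Using interface mlx5_0"]
print(classify_transport(demo))  # InfiniBand/RoCE (RDMA) - fast path
```

Seeing `NET/Socket` on hardware that has InfiniBand is the classic sign of a misconfigured interface selection.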
```bash
# Verify the secondary network interface exists
kubectl -n ml-training exec distributed-mnist-master-0 -- ip addr show net1
```
Step 7: Cleanup
```bash
kubectl -n ml-training delete pytorchjob distributed-mnist
kubectl delete namespace ml-training
```
Success Criteria
You have completed this exercise when:
- Multus CNI is running on all nodes
- The `NetworkAttachmentDefinition` is created and Pods get a `net1` interface
- The PyTorchJob creates both Master and Worker pods on different nodes
- NCCL logs show `Init COMPLETE` with `nranks 2`
- Both ranks report decreasing loss values across 3 epochs
- The master saves a model checkpoint to the shared PVC
- You can identify which transport NCCL used (IB, Socket, etc.) from the logs
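The "decreasing loss" criterion can be checked mechanically by parsing the per-epoch summary lines that train.py prints. A sketch (the regex matches the exact print format in the script above; the sample numbers are illustrative):

```python
import re

# Matches the per-epoch summary printed by train.py, e.g.:
#   "Rank 0 Epoch 1 Avg Loss: 0.2519"
PATTERN = re.compile(r"Rank (\d+) Epoch (\d+) Avg Loss: ([\d.]+)")

def losses_decreasing(log_text):
    """True iff every rank's average loss strictly decreases epoch over epoch."""
    per_rank = {}
    for rank, epoch, loss in PATTERN.findall(log_text):
        per_rank.setdefault(int(rank), []).append((int(epoch), float(loss)))
    if not per_rank:
        return False
    for entries in per_rank.values():
        values = [loss for _, loss in sorted(entries)]
        if any(nxt >= cur for cur, nxt in zip(values, values[1:])):
            return False
    return True

# Illustrative log excerpt with made-up loss values
sample = """Rank 0 Epoch 0 Avg Loss: 0.4012
Rank 0 Epoch 1 Avg Loss: 0.2519
Rank 0 Epoch 2 Avg Loss: 0.1987"""
print(losses_decreasing(sample))  # True
```

Piping `kubectl logs` for each pod through a check like this is handy when grading many runs.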
Key Takeaways
- Distributed training is bottlenecked by network bandwidth — standard Kubernetes networking (10-25 Gbps) is orders of magnitude too slow for gradient synchronization
- InfiniBand/RoCE with RDMA eliminates CPU overhead and memory copies, providing 200-400 Gbps with sub-microsecond latency
- GPUDirect RDMA is the optimal path — the NIC reads and writes GPU memory directly, with no system memory involvement
- Multus CNI gives training Pods secondary network interfaces connected to the high-speed fabric
- NCCL environment variables are critical — a wrong `NCCL_SOCKET_IFNAME` can silently degrade performance by 10-50x
- Always mount `/dev/shm` with adequate size (16-64GB) for NCCL and DataLoader shared memory
- Checkpointing is mandatory — node failures in multi-day training runs are certain, not merely possible
- Elastic training (PyTorch Elastic) allows training to continue with fewer workers after failures
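The bandwidth takeaway can be made quantitative with a rough ring all-reduce cost model (plain Python; the link speeds and model size are illustrative assumptions, and real NCCL performance also depends on latency, overlap with compute, and bucket sizes):

```python
def allreduce_seconds(param_count, world_size, link_gbps, bytes_per_param=4):
    """Lower-bound estimate of one full-gradient all-reduce.

    The ring algorithm moves 2 * (N-1)/N * data_size bytes through each
    rank's link, so link bandwidth sets a hard floor on sync time.
    """
    data_bytes = param_count * bytes_per_param
    traffic_bytes = 2 * (world_size - 1) / world_size * data_bytes
    link_bytes_per_s = link_gbps * 1e9 / 8
    return traffic_bytes / link_bytes_per_s

# Illustrative: 8B parameters, fp32 gradients, 8 ranks
for label, gbps in [("25 Gbps Pod network", 25), ("400 Gbps InfiniBand", 400)]:
    t = allreduce_seconds(8e9, world_size=8, link_gbps=gbps)
    print(f"{label}: ~{t:.1f} s per all-reduce")
```

Under these assumptions the Pod network spends roughly 18 seconds per synchronization versus about 1.1 seconds on InfiniBand: a 16x link-speed difference that translates directly into step time.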
Further Reading
Documentation:
- Kubeflow Training Operator: www.kubeflow.org/docs/components/training/
- NCCL Documentation: docs.nvidia.com/deeplearning/nccl/user-guide/
- Multus CNI: github.com/k8snetworkplumbingwg/multus-cni
- GPUDirect RDMA: docs.nvidia.com/cuda/gpudirect-rdma/
Papers:
- “Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM” — NVIDIA (3D parallelism)
- “PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel” — Meta (scaling DDP)
Talks:
- “Training LLMs at Scale on Kubernetes” — Meta, KubeCon NA 2024
- “High Performance Networking for AI/ML in Kubernetes” — NVIDIA, KubeCon EU 2024
Summary
Distributed training infrastructure is the highest-stakes platform engineering challenge in AI. It requires mastering high-speed networking (InfiniBand/RoCE), GPU communication libraries (NCCL), multi-network Kubernetes (Multus), distributed job orchestration (Training Operator), and fault tolerance (checkpointing, elastic training). Every layer must work perfectly — a single misconfiguration in any layer can silently reduce your million-dollar GPU cluster to a fraction of its potential throughput.
Next Module
Continue to Module 1.4: High-Performance Storage for AI to learn how to solve the data pipeline bottleneck with NVMe caching, distributed filesystems, and dataset caching layers.
“In distributed training, the network is the computer.” — Adapted from John Gage (Sun Microsystems)