Module 9.7: GPU Scheduling & NVIDIA GPU Operator on Kubernetes
Complexity: [COMPLEX]
Time to Complete: 50 minutes
Prerequisites: Kubernetes scheduling (taints, tolerations, node affinity), Module 9.1 (Kubeflow basics), basic ML/AI concepts
Learning Objectives:
- Understand GPU device plugin architecture in Kubernetes
- Install and configure the NVIDIA GPU Operator
- Configure GPU sharing with time-slicing and MIG
- Optimize GPU node pools for cost and utilization
- Monitor GPU workloads with DCGM and Prometheus
What You’ll Be Able to Do
After completing this module, you will be able to:
- Configure Kubernetes GPU scheduling with device plugins, resource limits, and multi-GPU node management
- Implement GPU sharing strategies using MIG, time-slicing, and MPS for cost-effective GPU utilization
- Deploy GPU-aware autoscaling with Karpenter or Cluster Autoscaler for dynamic ML workload demand
- Monitor GPU utilization metrics and optimize scheduling policies for mixed ML training and inference workloads
Why This Module Matters
GPUs are the most expensive resource in any Kubernetes cluster. A single NVIDIA A100 node costs $12-30/hour on cloud providers. Most ML teams treat GPU scheduling as an afterthought: they request whole GPUs for jobs that use 15% of capacity, leave nodes idle overnight, and never set up proper monitoring.
The result? Teams routinely waste $50K-$100K/month on underutilized GPU infrastructure.
Proper GPU scheduling is the highest-ROI infrastructure work you can do for an ML team.
A Series B startup came to us after their cloud bill hit $240K/month. Their ML team had 32 A100 GPUs running 24/7 across three clusters. Average utilization? 11%. Fine-tuning jobs that needed 2 GPUs were requesting 8 “just in case.” Inference workloads sat on dedicated A100s when an L4 would have been fine. Nobody had configured time-slicing. Nobody had set up preemption. Within three weeks, we cut their GPU spend to $80K/month—same throughput, same training times—by implementing proper scheduling, right-sizing, and spot instances for fault-tolerant jobs. The lead ML engineer said, “We were basically lighting $160K on fire every month.”
Did You Know?
- A single NVIDIA H100 GPU costs $30,000-$40,000 to purchase, and cloud instances with 8 H100s can exceed $98/hour ($72K/month if left running)
- NVIDIA’s MIG technology can split one A100 into 7 independent GPU instances, each with isolated memory and compute—turning one $15K GPU into 7 smaller ones
- The average GPU utilization in enterprise Kubernetes clusters is under 15%, according to Run.ai’s 2025 GPU Utilization Report—meaning 85% of GPU spend is wasted
- Google’s Borg system (Kubernetes’ predecessor) supported GPU scheduling internally 4 years before Kubernetes device plugins reached beta in v1.10 (2018)
GPU Device Plugin Architecture
Before we touch the NVIDIA operator, you need to understand how Kubernetes discovers and allocates GPUs. Kubernetes itself knows nothing about GPUs. It relies on device plugins to advertise specialized hardware.
```
┌────────────────────────────────────────────────────────────────────┐
│                          Kubernetes Node                           │
│                                                                    │
│  ┌──────────────┐         ┌──────────────────────────────────┐     │
│  │   kubelet    │◄───────►│  Device Plugin (gRPC)            │     │
│  │              │  gRPC   │                                  │     │
│  │ • Allocate   │  socket │  1. ListAndWatch()               │     │
│  │ • Track      │         │     → Reports available GPUs     │     │
│  │ • Advertise  │         │     → "nvidia.com/gpu: 4"        │     │
│  │              │         │                                  │     │
│  │ Extended     │         │  2. Allocate()                   │     │
│  │ Resources:   │         │     → Returns device paths       │     │
│  │ nvidia.com/  │         │     → /dev/nvidia0, /dev/nvidia1 │     │
│  │ gpu: 4       │         │     → Sets NVIDIA_VISIBLE_DEVICES│     │
│  └──────┬───────┘         └──────────────┬───────────────────┘     │
│         │                                │                         │
│         │ Schedule pod                   │ Expose devices          │
│         ▼                                ▼                         │
│  ┌──────────────┐         ┌──────────────────────────────────┐     │
│  │     Pod      │         │  GPU Hardware                    │     │
│  │              │         │                                  │     │
│  │ Container:   │────────►│  GPU 0: A100 80GB (allocated)    │     │
│  │ nvidia.com/  │         │  GPU 1: A100 80GB (allocated)    │     │
│  │ gpu: 2       │         │  GPU 2: A100 80GB (free)         │     │
│  │              │         │  GPU 3: A100 80GB (free)         │     │
│  └──────────────┘         └──────────────────────────────────┘     │
└────────────────────────────────────────────────────────────────────┘
```

Key takeaway: GPU allocation is all-or-nothing by default. If you request nvidia.com/gpu: 1, you get an entire physical GPU. There is no native fractional GPU support in Kubernetes; that requires time-slicing or MIG, which we cover below.
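To make the all-or-nothing behavior concrete, here is a minimal Python sketch of the bookkeeping a device plugin performs. It is a toy model, not the real gRPC API: `ToyDevicePlugin` and its method names are hypothetical and only mirror the two calls shown in the diagram.

```python
class ToyDevicePlugin:
    """Toy model of device-plugin bookkeeping (hypothetical, not the real gRPC API)."""

    def __init__(self, gpu_count):
        # One device path per physical GPU, as in the diagram.
        self.free = [f"/dev/nvidia{i}" for i in range(gpu_count)]
        self.allocated = {}  # pod name -> list of device paths

    def list_and_watch(self):
        """Report the extended-resource capacity the kubelet would advertise."""
        in_use = sum(len(devs) for devs in self.allocated.values())
        return {"nvidia.com/gpu": len(self.free) + in_use}

    def allocate(self, pod, count):
        """Hand out whole devices only; a fractional request cannot be expressed."""
        if not isinstance(count, int) or count < 1:
            raise ValueError("GPU requests must be whole positive integers")
        if count > len(self.free):
            raise RuntimeError(f"only {len(self.free)} GPU(s) free, {count} requested")
        devices, self.free = self.free[:count], self.free[count:]
        self.allocated[pod] = devices
        return devices


plugin = ToyDevicePlugin(gpu_count=4)
devices = plugin.allocate("pod-a", 2)
print(devices)           # ['/dev/nvidia0', '/dev/nvidia1']
print(len(plugin.free))  # 2
```

The request is an integer count of devices, which is why a pod that only needs a sliver of a GPU still takes a whole one.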
NVIDIA GPU Operator
Managing GPU software on Kubernetes nodes is painful. You need the NVIDIA driver, container toolkit, device plugin, monitoring tools, and optionally MIG configuration, all version-matched. The GPU Operator automates the entire stack.
Components
| Component | What It Does | Why You Need It |
|---|---|---|
| NVIDIA Driver | Kernel module for GPU access | Without it, the GPU is invisible to software |
| Container Toolkit | nvidia-container-runtime | Enables containers to access GPU devices |
| Device Plugin | Advertises GPUs to kubelet | Kubernetes scheduling of nvidia.com/gpu |
| DCGM Exporter | GPU metrics in Prometheus format | Monitoring utilization, temperature, errors |
| GPU Feature Discovery | Labels nodes with GPU details | Schedule workloads to specific GPU types |
| MIG Manager | Configures Multi-Instance GPU | Split A100/H100 into isolated instances |
| Node Status Exporter | Reports operator health | Alerts when GPU stack is unhealthy |
Installation
```bash
# Add the NVIDIA Helm repo
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install the GPU Operator (installs ALL components)
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set devicePlugin.enabled=true \
  --set dcgmExporter.enabled=true \
  --set migManager.enabled=false \
  --set gfd.enabled=true

# Verify installation (all pods should be Running)
k get pods -n gpu-operator

# Check that GPUs are discovered
k get nodes -o json | jq '.items[].status.capacity | select(."nvidia.com/gpu")'
```

After installation, every GPU node automatically gets labeled with GPU metadata:
```bash
# GPU Feature Discovery labels examples
k get node gpu-node-1 --show-labels | tr ',' '\n' | grep nvidia

# nvidia.com/cuda.driver.major=535
# nvidia.com/cuda.runtime.major=12
# nvidia.com/gpu.count=4
# nvidia.com/gpu.memory=81920
# nvidia.com/gpu.product=NVIDIA-A100-SXM4-80GB
# nvidia.com/mig.capable=true
```

Running Your First GPU Workload
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda-test
      image: nvidia/cuda:12.3.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1  # Request exactly 1 GPU
```

```bash
k apply -f gpu-test.yaml
k logs gpu-test
# Should show nvidia-smi output with GPU details
```

GPU Sharing: Time-Slicing and MIG
By default, one pod gets one whole GPU. For many workloads, especially inference, development, and small training jobs, this wastes massive capacity. Two solutions exist.
Time-Slicing (Software Sharing)
Time-slicing lets multiple pods share a single physical GPU by rapidly switching between them, similar to how a CPU time-shares between processes. There is no memory isolation: a misbehaving pod can OOM the entire GPU.
```yaml
# ConfigMap for time-slicing configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-sharing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
          - name: nvidia.com/gpu
            replicas: 4  # Each physical GPU appears as 4 schedulable units
```

```bash
# Patch the GPU Operator to enable time-slicing.
# config.default selects the "any" key of the ConfigMap for every node;
# without it, nodes must be labeled with nvidia.com/device-plugin.config=any.
helm upgrade gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --set devicePlugin.config.name=gpu-sharing-config \
  --set devicePlugin.config.default=any

# Now each physical GPU reports 4 allocatable units
k get node gpu-node-1 -o json | jq '.status.allocatable["nvidia.com/gpu"]'
# "16" (4 physical GPUs x 4 replicas each)
```

When to use time-slicing: Inference workloads, Jupyter notebooks, development environments, workloads that do not need memory isolation.
When to avoid it: Training jobs that need guaranteed GPU memory, production inference with strict latency SLAs.
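The capacity arithmetic behind time-slicing is simple enough to sketch in a few lines of Python. The function name and the "fair share" figure are illustrative assumptions: the scheduler only multiplies the unit count, and nothing enforces the memory share.

```python
def time_sliced_capacity(physical_gpus, replicas, gpu_memory_gb):
    """Schedulable units and the *unenforced* memory share per unit."""
    if physical_gpus < 1 or replicas < 1:
        raise ValueError("physical_gpus and replicas must be >= 1")
    return {
        "allocatable_units": physical_gpus * replicas,
        # Time-slicing does not partition memory; this is only the share each
        # pod would have to stay within voluntarily to avoid OOM-ing the GPU.
        "fair_share_memory_gb": gpu_memory_gb / replicas,
    }


capacity = time_sliced_capacity(physical_gpus=4, replicas=4, gpu_memory_gb=80)
print(capacity)  # {'allocatable_units': 16, 'fair_share_memory_gb': 20.0}
```

A node with four 80GB GPUs and `replicas: 4` therefore advertises 16 units, but each pod can only count on roughly 20GB if every neighbor behaves.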
MIG (Multi-Instance GPU) for A100/H100
MIG provides hardware-level isolation. An A100 can be partitioned into up to 7 independent GPU instances, each with its own memory, cache, and compute units. A crashed process in one MIG instance cannot affect another.
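MIG profile names encode how much of the GPU an instance takes: `3g.40gb` means 3 compute slices and 40GB of memory. The sketch below checks whether a set of instances fits within an A100 80GB's totals of 7 compute slices and 8 memory slices. It is a simplified budget check under those assumptions; real MIG placement rules are stricter, so passing this check is necessary but not sufficient. The function name is hypothetical.

```python
# (compute slices, memory slices) per profile on an A100 80GB.
A100_80GB_PROFILES = {
    "1g.10gb": (1, 1),
    "2g.20gb": (2, 2),
    "3g.40gb": (3, 4),
    "4g.40gb": (4, 4),
    "7g.80gb": (7, 8),
}
MAX_COMPUTE_SLICES = 7
MAX_MEMORY_SLICES = 8


def fits_slice_budget(layout):
    """Return True if the requested instances fit the GPU's slice totals.

    layout maps a profile name to the number of instances requested.
    """
    compute = sum(A100_80GB_PROFILES[p][0] * n for p, n in layout.items())
    memory = sum(A100_80GB_PROFILES[p][1] * n for p, n in layout.items())
    return compute <= MAX_COMPUTE_SLICES and memory <= MAX_MEMORY_SLICES


print(fits_slice_budget({"3g.40gb": 1, "2g.20gb": 2}))  # True  (7 compute, 8 memory)
print(fits_slice_budget({"1g.10gb": 7}))                # True  (7 compute, 7 memory)
print(fits_slice_budget({"3g.40gb": 2, "2g.20gb": 1}))  # False (8 compute slices)
```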
```
┌─────────────────────────────────────────────────────┐
│                  NVIDIA A100 80GB                   │
│                                                     │
│  Full GPU Mode:                                     │
│ ┌─────────────────────────────────────────────────┐ │
│ │                1 x 80GB Instance                │ │
│ └─────────────────────────────────────────────────┘ │
│                                                     │
│  MIG Mode (3g.40gb + 2g.20gb + 2g.20gb):            │
│ ┌─────────────────────────┬──────────┬──────────┐   │
│ │ 3g.40gb (42 SMs)        │ 2g.20gb  │ 2g.20gb  │   │
│ │ 40GB Memory             │ 20GB Mem │ 20GB Mem │   │
│ │ Pod A: Training         │ Pod B:   │ Pod C:   │   │
│ │                         │ Inference│ Notebook │   │
│ └─────────────────────────┴──────────┴──────────┘   │
│                                                     │
│  MIG Mode (7 x 1g.10gb):                            │
│ ┌──────┬──────┬──────┬──────┬──────┬──────┬──────┐  │
│ │1g.10g│1g.10g│1g.10g│1g.10g│1g.10g│1g.10g│1g.10g│  │
│ │Pod A │Pod B │Pod C │Pod D │Pod E │Pod F │Pod G │  │
│ └──────┴──────┴──────┴──────┴──────┴──────┴──────┘  │
└─────────────────────────────────────────────────────┘
```

```yaml
# MIG configuration via GPU Operator
apiVersion: v1
kind: ConfigMap
metadata:
  name: mig-parted-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    mig-configs:
      mixed-workload:
        - devices: [0]
          mig-enabled: true
          mig-devices:
            "3g.40gb": 1
            "2g.20gb": 2
      all-small:
        - devices: [0]
          mig-enabled: true
          mig-devices:
            "1g.10gb": 7
```

```yaml
# Request a specific MIG instance in a pod
apiVersion: v1
kind: Pod
metadata:
  name: inference-pod
spec:
  containers:
    - name: model
      image: my-inference:latest
      resources:
        limits:
          nvidia.com/mig-2g.20gb: 1  # Request a 2g.20gb MIG slice
```

GPU Node Management
GPU Node Pools with Taints and Tolerations
GPU nodes are expensive. You do not want random pods landing on them. Use taints to reserve GPU nodes exclusively for GPU workloads.
```bash
# Taint GPU nodes so only GPU workloads schedule there
k taint nodes gpu-pool-node-1 nvidia.com/gpu=present:NoSchedule
k taint nodes gpu-pool-node-2 nvidia.com/gpu=present:NoSchedule
```

```yaml
# Training job that tolerates the GPU taint
apiVersion: batch/v1
kind: Job
metadata:
  name: model-training
spec:
  template:
    spec:
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Equal"
          value: "present"
          effect: "NoSchedule"
      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB
      containers:
        - name: trainer
          image: my-training:latest
          resources:
            limits:
              nvidia.com/gpu: 4
      restartPolicy: Never
```

Cost Optimization with Spot GPU Instances
Spot/preemptible GPU instances cost 60-90% less than on-demand. For fault-tolerant workloads (training with checkpointing), this is free money.
```yaml
# Karpenter NodePool for spot GPU instances
# (See Module 6.1: Karpenter for NodePool fundamentals)
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-spot-training
spec:
  template:
    spec:
      nodeClassRef:  # required in karpenter.sh/v1; assumes an EC2NodeClass named "default" exists
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["p4d.24xlarge", "p5.48xlarge"]
      taints:
        - key: nvidia.com/gpu
          value: "present"
          effect: NoSchedule
  limits:
    nvidia.com/gpu: 32  # Max 32 GPUs in this pool
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 5m
```

Checkpointing is mandatory for spot GPU training. When a spot instance is reclaimed, your training job loses all progress unless it saves checkpoints. Most frameworks (PyTorch Lightning, Hugging Face Trainer) have built-in checkpointing; make sure it is enabled.
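The checkpoint-and-resume pattern itself is framework-independent. The following is a minimal standard-library sketch, not how PyTorch Lightning or Hugging Face Trainer implement it: the function names, the JSON file format, and the simulated "training" step are all assumptions made for illustration.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write atomically so an interruption mid-write never corrupts the file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)  # atomic rename on the same filesystem


def load_checkpoint(path):
    """Resume from the last checkpoint, or start fresh if none exists."""
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    return {"step": 0, "loss": None}


def train(path, total_steps, checkpoint_every, interrupt_at=None):
    state = load_checkpoint(path)
    for step in range(state["step"] + 1, total_steps + 1):
        if interrupt_at is not None and step == interrupt_at:
            raise InterruptedError(f"spot instance reclaimed at step {step}")
        state = {"step": step, "loss": 1.0 / step}  # stand-in for real training
        if step % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state


ckpt = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
try:
    train(ckpt, total_steps=100, checkpoint_every=10, interrupt_at=57)
except InterruptedError:
    pass  # the node was reclaimed; a replacement pod starts below

resumed_from = load_checkpoint(ckpt)["step"]
final = train(ckpt, total_steps=100, checkpoint_every=10)
print(resumed_from, final["step"])  # 50 100
```

The interruption at step 57 costs only the 6 steps since the last checkpoint. In a real job the checkpoint path must live on storage that survives the node, such as a persistent volume or object store.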
Gang Scheduling for Distributed Training
Distributed training jobs need all their GPUs simultaneously. If a job needs 8 GPUs across 2 nodes and only 6 are available, the job cannot start, but the pods that were already placed keep those 6 GPUs reserved and idle, blocking other work.
Gang scheduling solves this by ensuring all pods in a group are scheduled together or not at all. Kubernetes 1.35 introduced native gang scheduling as an alpha feature. The example below instead uses the PodGroup CRD from the scheduler-plugins Coscheduling plugin, which must be installed separately and is where the minMember and scheduleTimeoutSeconds fields come from.

```yaml
# PodGroup CRD from the scheduler-plugins Coscheduling plugin (installed separately)
apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: distributed-training
spec:
  scheduleTimeoutSeconds: 300
  minMember: 4  # All 4 pods must be scheduled together
---
apiVersion: batch/v1
kind: Job
metadata:
  name: distributed-training
spec:
  parallelism: 4
  completions: 4
  template:
    metadata:
      labels:
        scheduling.x-k8s.io/pod-group: distributed-training
    spec:
      schedulerName: coscheduling  # must match the name of your secondary scheduler
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Equal"
          value: "present"
          effect: "NoSchedule"
      containers:
        - name: worker
          image: my-distributed-training:latest
          resources:
            limits:
              nvidia.com/gpu: 2
          env:
            - name: WORLD_SIZE
              value: "4"
            - name: NCCL_DEBUG
              value: "INFO"
      restartPolicy: Never
```

For production use today, consider Volcano (a CNCF incubating project) or the Coscheduling plugin for kube-scheduler, which provide mature gang scheduling support.
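The difference between piecemeal and all-or-nothing placement can be shown with a toy model. This sketch treats the cluster as a single pool of free GPUs and ignores per-node fragmentation, queueing, and timeouts; both function names are hypothetical and neither reflects how kube-scheduler is implemented.

```python
def schedule_piecemeal(free_gpus, pods, gpus_per_pod):
    """Default behaviour: place pods one by one until GPUs run out."""
    placed = min(pods, free_gpus // gpus_per_pod)
    return {
        "placed": placed,
        "pending": pods - placed,
        "gpus_held": placed * gpus_per_pod,
        "job_can_run": placed == pods,
    }


def schedule_gang(free_gpus, pods, gpus_per_pod):
    """All-or-nothing: place every pod, or hold none of the GPUs."""
    needed = pods * gpus_per_pod
    if free_gpus >= needed:
        return {"placed": pods, "pending": 0, "gpus_held": needed, "job_can_run": True}
    return {"placed": 0, "pending": pods, "gpus_held": 0, "job_can_run": False}


# 4 workers x 2 GPUs each, but only 6 GPUs are free.
print(schedule_piecemeal(free_gpus=6, pods=4, gpus_per_pod=2))
# {'placed': 3, 'pending': 1, 'gpus_held': 6, 'job_can_run': False}
print(schedule_gang(free_gpus=6, pods=4, gpus_per_pod=2))
# {'placed': 0, 'pending': 4, 'gpus_held': 0, 'job_can_run': False}
```

Neither job can run, but only the piecemeal case ties up 6 GPUs while waiting. With gang scheduling those GPUs stay available to other work.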
GPU Monitoring with DCGM Exporter
The DCGM (Data Center GPU Manager) Exporter exposes GPU metrics in Prometheus format for scraping. If you installed the GPU Operator with dcgmExporter.enabled=true, it is already running.
Key Metrics
| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| DCGM_FI_DEV_GPU_UTIL | GPU compute utilization % | < 10% for 30min = wasted |
| DCGM_FI_DEV_MEM_COPY_UTIL | Memory bandwidth utilization % | Indicates data transfer bottleneck |
| DCGM_FI_DEV_FB_USED | GPU memory used (MB) | Near limit = OOM risk |
| DCGM_FI_DEV_GPU_TEMP | GPU temperature (C) | > 85C = throttling risk |
| DCGM_FI_DEV_POWER_USAGE | Power draw (W) | Track for cost allocation |
| DCGM_FI_DEV_XID_ERRORS | Hardware/driver errors | Any value > 0 = investigate |
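The thresholds in the table can be read as a small set of rules. Here is a hedged Python sketch of that logic applied to one sample per GPU. The dictionary keys and alert names are hypothetical, and the 95% cutoff for "near limit" is an assumption, since the table does not give a number.

```python
OOM_RISK_RATIO = 0.95  # assumption: "near limit" means above 95% of framebuffer


def check_gpu_sample(sample):
    """Return the alert names a single metrics sample would trigger."""
    alerts = []
    if sample["avg_util_30m"] < 10:      # DCGM_FI_DEV_GPU_UTIL averaged over 30 min
        alerts.append("underutilized")
    if sample["temp_c"] > 85:            # DCGM_FI_DEV_GPU_TEMP
        alerts.append("throttling-risk")
    if sample["xid_errors"] > 0:         # DCGM_FI_DEV_XID_ERRORS
        alerts.append("xid-error")
    if sample["fb_used_mb"] / sample["fb_total_mb"] > OOM_RISK_RATIO:
        alerts.append("oom-risk")        # DCGM_FI_DEV_FB_USED vs total memory
    return alerts


idle_gpu = {"avg_util_30m": 8, "temp_c": 62, "xid_errors": 0,
            "fb_used_mb": 4096, "fb_total_mb": 81920}
hot_gpu = {"avg_util_30m": 97, "temp_c": 88, "xid_errors": 1,
           "fb_used_mb": 80000, "fb_total_mb": 81920}

print(check_gpu_sample(idle_gpu))  # ['underutilized']
print(check_gpu_sample(hot_gpu))   # ['throttling-risk', 'xid-error', 'oom-risk']
```

In practice these rules would live in Prometheus alerting rules rather than application code; the sketch only makes the thresholds explicit.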
Prometheus ServiceMonitor
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  endpoints:
    - port: gpu-metrics
      interval: 15s
```

Grafana Dashboard Query Examples

```promql
# Average GPU utilization per GPU and host
avg(DCGM_FI_DEV_GPU_UTIL) by (gpu, Hostname)

# GPU memory usage percentage (used / total, where total = used + free)
DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) * 100

# Idle GPUs (utilization below 5% for 30 minutes)
avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < 5
```

Import NVIDIA’s official Grafana dashboard (ID: 12239) for an out-of-the-box GPU monitoring view.
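Two of these queries reduce to plain arithmetic, which is easy to sanity-check offline. The sketch below measures memory usage against total framebuffer (used plus free) and mimics a windowed average for idle detection. Function names and sample values are illustrative assumptions.

```python
from statistics import mean


def memory_used_percent(fb_used_mb, fb_free_mb):
    """Used framebuffer as a share of the total (used + free)."""
    return fb_used_mb / (fb_used_mb + fb_free_mb) * 100


def is_idle(util_samples, threshold=5.0):
    """Mimic avg_over_time(...) < threshold over one window of samples."""
    return mean(util_samples) < threshold


print(memory_used_percent(20480, 61440))  # 25.0
print(is_idle([0, 2, 1, 12, 3]))          # True  (mean 3.6)
print(is_idle([0, 0, 40, 0, 0]))          # False (mean 8)
```

Dividing used by free instead of by the total would report 33% for the first case and grows without bound as the GPU fills up, which is why the denominator matters.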
GPU Vendor Comparison
| Feature | NVIDIA (CUDA) | AMD (ROCm) | Intel (oneAPI) |
|---|---|---|---|
| K8s Device Plugin | Mature, GPU Operator | Community amd-gpu plugin | Intel Device Plugins Operator |
| ML Framework Support | Universal (PyTorch, TF, JAX) | PyTorch (good), TF (limited) | PyTorch (growing), oneAPI DPC++ |
| GPU Sharing | Time-slicing + MIG | No equivalent to MIG | SR-IOV based partitioning |
| Monitoring | DCGM Exporter (Prometheus) | ROCm SMI Exporter | Intel GPU metrics (limited) |
| Cloud Availability | All major clouds | Limited (Azure, some AWS) | Intel Flex/Max on select clouds |
| Ecosystem Maturity | Production-grade | Catching up rapidly | Early stage for ML |
| Cost | Premium ($10K-$40K/GPU) | 30-50% cheaper for similar perf | Competitive for inference |
| Best For | Training + inference (default) | Budget-conscious training | Intel-shop inference |
The honest take: NVIDIA dominates. ROCm has made impressive strides—PyTorch on AMD MI300X is competitive with H100 for many workloads—but the ecosystem gap in tooling, monitoring, and community support is still significant. Choose AMD or Intel only if you have a specific cost or vendor strategy.
Common Mistakes
| Mistake | Why It Happens | How to Fix |
|---|---|---|
| Requesting whole GPUs for inference | Copy-paste from training configs | Use time-slicing or MIG for inference workloads |
| No GPU node taints | Did not realize non-GPU pods would land there | Taint GPU nodes with NoSchedule from day one |
| Ignoring GPU utilization metrics | DCGM exporter not installed or not dashboarded | Install GPU Operator with DCGM, import Grafana dashboard 12239 |
| Running training on on-demand instances | Default instance type in node pool config | Use spot/preemptible for fault-tolerant training with checkpointing |
| No resource quotas on GPU namespaces | Single team hoards all GPUs | Set a ResourceQuota on requests.nvidia.com/gpu per namespace |
| Using A100 for small inference | Over-provisioning “just in case” | Right-size: use T4/L4 for inference, A100/H100 for training |
| Not enabling MIG on multi-tenant clusters | Unaware of MIG or think it is complex | GPU Operator MIG Manager automates partitioning |
| Distributed training without gang scheduling | Pods scheduled piecemeal, deadlocking | Use Volcano or CoScheduling for all-or-nothing scheduling |
Question 1: A pod requests nvidia.com/gpu: 1. The node has 4 physical GPUs with time-slicing configured at replicas: 4. How many pods requesting 1 GPU can this node run simultaneously?
Show Answer
16 pods. Time-slicing with replicas: 4 makes each physical GPU appear as 4 schedulable units. 4 physical GPUs x 4 replicas = 16 allocatable nvidia.com/gpu units. Note that all 16 pods share the physical GPU memory, so actual capacity depends on memory usage.
Question 2: Why is MIG preferred over time-slicing for production multi-tenant GPU sharing?
Show Answer
MIG provides hardware-level isolation. Each MIG instance has dedicated compute units (SMs), memory, and L2 cache. A process in one MIG instance cannot access another’s memory or cause it to OOM. Time-slicing has no memory isolation—a single pod can exhaust all GPU memory and crash every other pod sharing that GPU.
Question 3: Your distributed training job needs 8 GPUs across 4 nodes (2 GPUs each). Only 6 GPUs are currently available. Without gang scheduling, what happens?
Show Answer
Without gang scheduling, the scheduler places pods on the 6 available GPUs. The remaining 2 pods stay Pending. The 6 running pods cannot start training (they need all 8 workers for NCCL communication). Result: 6 expensive GPUs sit completely idle waiting for the last 2, while blocking other jobs from using those GPUs. Gang scheduling prevents this by requiring all 8 pods to be schedulable before any are placed.
Question 4: You notice DCGM_FI_DEV_GPU_UTIL averaging 8% on your inference nodes. What are two actions to improve utilization?
Show Answer
- Enable time-slicing or MIG to pack multiple inference workloads onto each GPU. At 8% utilization, each GPU could likely serve 4-8 models simultaneously.
- Right-size the GPU type. If inference workloads only need 2-4GB of GPU memory, switch from A100 ($12/hr) to T4 ($0.70/hr) or L4 ($1.20/hr). An A100 at 8% utilization is doing work that a well-utilized T4 could handle for about 94% less cost.
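The savings figure follows directly from the hourly rates quoted in the answer. A short sketch of the arithmetic, assuming 730 hours per month and treating the quoted prices as illustrative rather than current list prices:

```python
HOURS_PER_MONTH = 730  # assumption: average hours in a month


def monthly_cost(hourly_rate, gpus=1):
    """Cost of running the given number of GPUs around the clock for a month."""
    return hourly_rate * HOURS_PER_MONTH * gpus


def savings_percent(current_rate, new_rate):
    """Relative saving when moving from current_rate to new_rate."""
    return (1 - new_rate / current_rate) * 100


a100, t4, l4 = 12.00, 0.70, 1.20  # $/hr figures from the answer above

print(round(savings_percent(a100, t4), 1))           # 94.2
print(round(savings_percent(a100, l4), 1))           # 90.0
print(round(monthly_cost(a100) - monthly_cost(t4)))  # 8249
```

Per GPU left running all month, the A100-to-T4 switch saves a little over $8,000 under these assumptions.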
Hands-On Exercise: GPU Scheduling with Time-Slicing
Goal: Configure GPU time-slicing and demonstrate multi-pod GPU sharing.
Note: This exercise requires a GPU-equipped cluster. If you do not have GPU hardware, you can follow along conceptually or use a cloud provider’s GPU node pool (even a single T4 instance works).
Step 1: Install the GPU Operator
```bash
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --wait --timeout 10m
```

Step 2: Verify GPU Discovery
```bash
# Wait for all operator pods to be Running
k get pods -n gpu-operator -w

# Confirm GPU count on your node
k describe node <gpu-node> | grep nvidia.com/gpu
# Expected: nvidia.com/gpu: 1 (or however many GPUs your node has)
```

Step 3: Enable Time-Slicing (4 replicas per GPU)
```bash
k create configmap gpu-sharing-config -n gpu-operator --from-literal=any='version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4'

# config.default applies the "any" key to every node without per-node labels
helm upgrade gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --set devicePlugin.config.name=gpu-sharing-config \
  --set devicePlugin.config.default=any

# Verify: allocatable GPUs should now be 4x physical count
k describe node <gpu-node> | grep nvidia.com/gpu
# Expected: nvidia.com/gpu: 4 (1 physical GPU x 4 replicas)
```

Step 4: Run 4 Pods on 1 Physical GPU
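```bash
# kubectl run no longer accepts --limits (removed in v1.24),
# so the GPU limit is set through --overrides instead.
for i in 1 2 3 4; do
  k run gpu-pod-$i --image=nvidia/cuda:12.3.1-base-ubuntu22.04 \
    --overrides='{"spec":{"containers":[{"name":"cuda","image":"nvidia/cuda:12.3.1-base-ubuntu22.04","command":["sleep","3600"],"resources":{"limits":{"nvidia.com/gpu":"1"}}}]}}'
done

# All 4 should be Running on the same node
k get pods -o wide | grep gpu-pod
```

Step 5: Verify GPU Sharing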
```bash
# Each pod sees the same physical GPU
for i in 1 2 3 4; do
  echo "=== gpu-pod-$i ==="
  k exec gpu-pod-$i -- nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
done
```

Success Criteria
- GPU Operator pods all Running in gpu-operator namespace
- Node reports 4x allocatable GPUs after time-slicing config
- All 4 pods are Running simultaneously on a single physical GPU
- Each pod can execute nvidia-smi and sees the GPU
Cleanup
```bash
k delete pod gpu-pod-1 gpu-pod-2 gpu-pod-3 gpu-pod-4
```

Current Landscape
| Tool | Purpose | When to Use |
|---|---|---|
| NVIDIA GPU Operator | Full GPU stack management | Any NVIDIA GPU cluster |
| Run.ai | GPU virtualization and scheduling | Enterprise multi-tenant GPU sharing |
| Volcano | Batch/gang scheduling for K8s | Distributed training (production-ready today) |
| Kueue | K8s-native job queueing | GPU job queuing with fair sharing |
| Karpenter | Node autoscaling | Auto-provision GPU nodes on demand (Module 6.1) |
For MLOps pipeline integration with GPU workloads, see Module 9.1: Kubeflow.
Best Practices
- Taint every GPU node from the moment it joins the cluster. No exceptions.
- Set namespace-level GPU quotas to prevent a single team from monopolizing GPUs.
- Use time-slicing for dev/inference, MIG for multi-tenant production.
- Monitor utilization weekly. Any GPU averaging under 20% needs right-sizing or consolidation.
- Use spot instances for all training workloads that support checkpointing.
- Right-size GPU types: T4/L4 for inference, A100/H100 for training.
- Enable gang scheduling for any distributed training job.
- Label GPU nodes with GPU type, memory, and MIG capability for precise scheduling.
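The namespace-level GPU quota from the list above can be generated programmatically when you manage many teams. The sketch below builds a ResourceQuota manifest as a Python dictionary and prints it as JSON. The namespace `ml-team-a`, the quota name, and the limit of 8 are hypothetical; the one detail worth noting is that quotas on extended resources such as GPUs use the `requests.` prefix.

```python
import json


def gpu_quota_manifest(namespace, max_gpus):
    """Build a ResourceQuota that caps total GPU requests in one namespace."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": "gpu-quota", "namespace": namespace},
        # Extended resources are quota'd through the "requests." prefix.
        "spec": {"hard": {"requests.nvidia.com/gpu": str(max_gpus)}},
    }


manifest = gpu_quota_manifest("ml-team-a", 8)
print(json.dumps(manifest, indent=2))
```

Because kubectl accepts JSON as well as YAML, the printed output can be piped straight into `kubectl apply -f -`.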
Further Reading
- NVIDIA GPU Operator Documentation
- Kubernetes Device Plugin Framework
- NVIDIA MIG User Guide
- Volcano Project - Gang Scheduling
- DCGM Exporter for Prometheus
- Run.ai GPU Utilization Report 2025
Next Module
Module 10.1: Anomaly Detection Tools - Apply AI to your infrastructure with AIOps.