AI/GPU Infrastructure on Kubernetes
The infrastructure side of AI — GPU scheduling, distributed training, and LLM serving at scale.
This discipline focuses on the infrastructure challenges of running AI workloads on Kubernetes. It complements the existing MLOps discipline (model lifecycle) and ML Platforms toolkit (tools like Kubeflow, MLflow). Here you’ll learn to provision GPUs, schedule them efficiently, run distributed training, and serve models in production.
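To make the provisioning side concrete: once the NVIDIA device plugin (installed standalone or via the GPU Operator) is running, nodes advertise an extended resource that Pods can request like CPU or memory. A minimal sketch, assuming the standard `nvidia.com/gpu` resource name exposed by the NVIDIA device plugin (the Pod name and image tag are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      # Illustrative CUDA base image; pick a tag matching your driver version
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          # Extended resource advertised by the NVIDIA device plugin;
          # the scheduler only places this Pod on nodes with a free GPU
          nvidia.com/gpu: 1
```

Because GPUs are requested as whole units by default, sharing techniques such as MIG and time-slicing (Module 1.2) exist to subdivide them.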
Modules
| # | Module | Time | What You’ll Learn |
|---|---|---|---|
| 1.1 | GPU Provisioning & Device Plugins | 3h | GPU Operator, NFD, DCGM-Exporter |
| 1.2 | Advanced GPU Scheduling & Sharing | 4h | MIG, time-slicing, DRA, topology-aware |
| 1.3 | Distributed Training Infrastructure | 5h | NCCL, Multus CNI, PyTorch Operator |
| 1.4 | High-Performance Storage for AI | 3h | NVMe caching, JuiceFS, Fluid/Alluxio |
| 1.5 | Serving LLMs at Scale | 4h | vLLM, TGI, PagedAttention, KEDA autoscaling |
| 1.6 | Cost & Capacity Planning | 3h | Spot GPUs, Karpenter, Kueue, cost per inference |
Total time: ~22 hours
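As a preview of Module 1.2, the GPU Operator can oversubscribe GPUs via time-slicing: a single physical GPU is advertised as multiple schedulable replicas. A sketch, assuming the GPU Operator's time-slicing ConfigMap format (the ConfigMap name and replica count are illustrative; no memory or fault isolation is provided between sharing Pods):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          # Each physical GPU is advertised as 4 nvidia.com/gpu units,
          # letting 4 Pods share one GPU in round-robin fashion
          - name: nvidia.com/gpu
            replicas: 4
```

Time-slicing suits bursty, latency-tolerant workloads; MIG is the better fit when workloads need hardware-enforced isolation.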
Prerequisites
- Kubernetes Administration (CKA level)
- Basic Linux hardware knowledge
- Familiarity with ML concepts (helpful but not required)