AI/GPU Infrastructure on Kubernetes
The infrastructure side of AI — GPU scheduling, distributed training, and LLM serving at scale.
This discipline focuses on the infrastructure challenges of running AI workloads on Kubernetes. It complements the existing MLOps discipline (model lifecycle) and ML Platforms toolkit (tools like Kubeflow, MLflow). Here you’ll learn to provision GPUs, schedule them efficiently, run distributed training, and serve models in production.
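To make the provisioning side concrete: once the NVIDIA device plugin (installed standalone or via the GPU Operator) is running, nodes advertise an extended resource that Pods can request like CPU or memory. A minimal sketch, assuming the standard `nvidia.com/gpu` resource name exposed by the NVIDIA device plugin (the Pod name and image tag are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      # Illustrative CUDA base image; pick a tag matching your driver version
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          # Extended resource advertised by the NVIDIA device plugin;
          # the scheduler only places this Pod on nodes with a free GPU
          nvidia.com/gpu: 1
```

Because GPUs are requested as whole units by default, sharing techniques such as MIG and time-slicing (Module 1.2) exist to subdivide them.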
Modules
| # | Module | Time | What You’ll Learn |
|---|---|---|---|
| 1.1 | GPU Provisioning & Device Plugins | 3h | GPU Operator, NFD, DCGM-Exporter |
| 1.2 | Advanced GPU Scheduling & Sharing | 4h | MIG, time-slicing, DRA, topology-aware |
| 1.3 | Distributed Training Infrastructure | 5h | NCCL, Multus CNI, PyTorch Operator |
| 1.4 | High-Performance Storage for AI | 3h | NVMe caching, JuiceFS, Fluid/Alluxio |
| 1.5 | Serving LLMs at Scale | 4h | vLLM, TGI, PagedAttention, KEDA autoscaling |
| 1.6 | Cost & Capacity Planning | 3h | Spot GPUs, Karpenter, Kueue, cost per inference |
Total time: ~22 hours
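As a preview of Module 1.2, the GPU Operator can oversubscribe GPUs via time-slicing: a single physical GPU is advertised as multiple schedulable replicas. A sketch, assuming the GPU Operator's time-slicing ConfigMap format (the ConfigMap name and replica count are illustrative; no memory or fault isolation is provided between sharing Pods):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          # Each physical GPU is advertised as 4 nvidia.com/gpu units,
          # letting 4 Pods share one GPU in round-robin fashion
          - name: nvidia.com/gpu
            replicas: 4
```

Time-slicing suits bursty, latency-tolerant workloads; MIG is the better fit when workloads need hardware-enforced isolation.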
Prerequisites
- Kubernetes Administration (CKA level)
- Basic Linux hardware knowledge
- Familiarity with ML concepts (helpful but not required)