
AI/GPU Infrastructure on Kubernetes

The infrastructure side of AI — GPU scheduling, distributed training, and LLM serving at scale.

This discipline focuses on the infrastructure challenges of running AI workloads on Kubernetes. It complements the existing MLOps discipline (model lifecycle) and ML Platforms toolkit (tools like Kubeflow, MLflow). Here you’ll learn to provision GPUs, schedule them efficiently, run distributed training, and serve models in production.


| # | Module | Time | What You'll Learn |
|-----|--------------------------------------|------|--------------------------------------------------|
| 1.1 | GPU Provisioning & Device Plugins | 3h | GPU Operator, NFD, DCGM-Exporter |
| 1.2 | Advanced GPU Scheduling & Sharing | 4h | MIG, time-slicing, DRA, topology-aware scheduling |
| 1.3 | Distributed Training Infrastructure | 5h | NCCL, Multus CNI, PyTorch Operator |
| 1.4 | High-Performance Storage for AI | 3h | NVMe caching, JuiceFS, Fluid/Alluxio |
| 1.5 | Serving LLMs at Scale | 4h | vLLM, TGI, PagedAttention, KEDA autoscaling |
| 1.6 | Cost & Capacity Planning | 3h | Spot GPUs, Karpenter, Kueue, cost per inference |

Total time: ~22 hours
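
To give a flavor of where Module 1.1 starts: once the NVIDIA GPU Operator (or the standalone device plugin) is running, pods request GPUs through the extended resource `nvidia.com/gpu`. A minimal sketch, assuming the device plugin is already installed (the pod name and image tag here are illustrative placeholders):

```yaml
# Minimal pod requesting one NVIDIA GPU via the device plugin's
# extended resource. Assumes the NVIDIA device plugin / GPU Operator
# is already deployed on the cluster.
apiVersion: v1
kind: Pod
metadata:
  name: cuda-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04  # placeholder tag
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1  # schedules only onto a node with a free GPU
```

Note that extended resources like `nvidia.com/gpu` are requested in whole units; sharing a physical GPU between pods requires mechanisms such as MIG or time-slicing, which Module 1.2 covers.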


Prerequisites

  • Kubernetes Administration (CKA level)
  • Basic knowledge of Linux and server hardware
  • Familiarity with ML concepts (helpful but not required)