AI/ML Infrastructure (On-Prem)
Complexity: [ADVANCED] | 6 modules | ~6 hours
Prerequisites: Planning & Economics, Day-2 Operations, practical familiarity with Kubernetes GPU workloads.
Running AI and ML workloads on bare metal is a different sport from running them in the cloud. There is no managed SageMaker, no Vertex AI, no g5.24xlarge you can spin up for an afternoon. You pick the GPU vendor, rack the hardware, tune the driver stack, build the networking fabric that keeps NCCL happy, and decide how training datasets move through your storage tier without starving the GPUs. This section covers the entire private-AI-infrastructure stack — from GPU scheduling primitives to a full on-prem MLOps platform — with the tradeoffs that only matter when you can’t hand the bill to a hyperscaler.
Modules
| Module | Focus | Time |
|---|---|---|
| 9.1 GPU Nodes & Accelerated Computing | NVIDIA GPU Operator, MIG, time slicing, DCGM monitoring, AMD ROCm, Intel Gaudi | 60 min |
| 9.2 Private AI Training Infrastructure | Distributed training, NCCL over InfiniBand/RoCE, Volcano/Kueue, fault-tolerant jobs | 75 min |
| 9.3 Private LLM Serving | vLLM, TGI, Ollama at scale, quantization, KServe, continuous batching | 75 min |
| 9.4 Private MLOps Platform | Kubeflow, MLflow, Feast, model registry, experiment tracking on bare metal | 60 min |
| 9.5 Private AIOps | Anomaly detection, predictive scaling, AI-augmented incident response with guardrails | 60 min |
| 9.6 High-Performance Storage for AI | NFS-over-RDMA, Lustre/BeeGFS/WekaFS, avoiding GPU idle from storage bottlenecks | 60 min |
Why a dedicated section?
The rest of the on-prem track covers general Kubernetes operations. Accelerated computing adds a layer of complexity that doesn’t exist in CPU-only workloads: driver management, MIG partitioning, RDMA fabrics, and dataset I/O patterns that can cost you 50% of your GPU utilization if the storage tier can’t keep up. These modules assume you already know how to run a bare-metal K8s cluster and focus on what’s specific to AI workloads.
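To make the storage point concrete, here is a back-of-the-envelope sketch of how a data pipeline can cap GPU utilization. It assumes a deliberately simple model: GPUs stall whenever input data is late, with no caching or prefetch hiding the gap. The function name and all numbers are hypothetical, chosen only for illustration.

```python
def gpu_utilization_ceiling(storage_gbps: float, num_gpus: int,
                            per_gpu_demand_gbps: float) -> float:
    """Upper bound on average GPU utilization when data input is the only bottleneck.

    Simplifying assumption: utilization scales linearly with the fraction of
    required read throughput that the storage tier can actually deliver.
    """
    required = num_gpus * per_gpu_demand_gbps
    if required <= 0:
        raise ValueError("total demand must be positive")
    return min(1.0, storage_gbps / required)


if __name__ == "__main__":
    # Hypothetical node: 8 GPUs, each consuming 2 GB/s of training data,
    # fed by a storage tier that sustains 8 GB/s in aggregate.
    print(gpu_utilization_ceiling(8.0, 8, 2.0))  # 0.5
```

In this made-up scenario the storage tier delivers half of what the GPUs could consume, so utilization cannot exceed 50% no matter how well the rest of the stack is tuned. Real pipelines overlap I/O with compute, so treat this as a ceiling estimate, not a prediction.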
Not covered here
- General Kubernetes cluster operations — see Day-2 Operations
- Cloud AI services — see the Cloud track’s managed-services section
- Platform engineering for AI teams (org design, self-service) — see Platform Engineering
- AI/ML curriculum content (LLMs, transformers, RAG, fine-tuning) — see the AI/ML Engineering track