
AI/ML Infrastructure (On-Prem)

Complexity: [ADVANCED] | 6 modules | ~6.5 hours

Prerequisites: Planning & Economics, Day-2 Operations, practical familiarity with Kubernetes GPU workloads.

Running AI and ML workloads on bare metal is a different sport from running them in the cloud. There is no managed SageMaker, no Vertex AI, no g5.24xlarge you can spin up for an afternoon. You pick the GPU vendor, rack the hardware, tune the driver stack, build the networking fabric that keeps NCCL happy, and decide how training datasets move through your storage tier without starving the GPUs. This section covers the entire private-AI-infrastructure stack — from GPU scheduling primitives to a full on-prem MLOps platform — with the tradeoffs that only matter when you can’t hand the bill to a hyperscaler.
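
Most of those decisions eventually surface in the cluster as schedulable GPU resources. As a minimal sketch of what "GPU scheduling primitives" means in practice — assuming the official `kubernetes` Python client and a working kubeconfig, neither of which this track mandates — the snippet below lists nodes that advertise NVIDIA GPUs to the scheduler:

```python
# Hedged sketch: list nodes exposing the extended resource "nvidia.com/gpu"
# (registered by the NVIDIA device plugin / GPU Operator). Assumes the
# official `kubernetes` Python client and a reachable cluster via kubeconfig.
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() inside a pod
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    allocatable = node.status.allocatable or {}
    gpus = allocatable.get("nvidia.com/gpu", "0")
    if gpus != "0":
        print(f"{node.metadata.name}: {gpus} allocatable GPU(s)")
```

Pods then request those GPUs via resource limits on `nvidia.com/gpu`; MIG partitioning and time slicing (module 9.1) change what a single advertised "GPU" actually represents.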

| Module | Focus | Time |
|--------|-------|------|
| 9.1 GPU Nodes & Accelerated Computing | NVIDIA GPU Operator, MIG, time slicing, DCGM monitoring, AMD ROCm, Intel Gaudi | 60 min |
| 9.2 Private AI Training Infrastructure | Distributed training, NCCL over InfiniBand/RoCE, Volcano/Kueue, fault-tolerant jobs | 75 min |
| 9.3 Private LLM Serving | vLLM, TGI, Ollama at scale, quantization, KServe, continuous batching | 75 min |
| 9.4 Private MLOps Platform | Kubeflow, MLflow, Feast, model registry, experiment tracking on bare metal | 60 min |
| 9.5 Private AIOps | Anomaly detection, predictive scaling, AI-augmented incident response with guardrails | 60 min |
| 9.6 High-Performance Storage for AI | NFS-over-RDMA, Lustre/BeeGFS/WekaFS, avoiding GPU idle from storage bottlenecks | 60 min |

The rest of the on-prem track covers general Kubernetes operations. Accelerated computing adds a layer of complexity that doesn’t exist in CPU-only workloads: driver management, MIG partitioning, RDMA fabrics, and dataset I/O patterns that can cost you 50% of your GPU utilization if the storage tier can’t keep up. These modules assume you already know how to run a bare-metal K8s cluster and focus on what’s specific to AI workloads.
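
To make the utilization point concrete, here is a small, hedged sketch using NVIDIA's NVML bindings (`pynvml`, packaged as `nvidia-ml-py`) — an assumption of this example, not a tool the modules require. It samples per-GPU utilization during a training run; if SM utilization sits well below 100% while the job is "running", the data pipeline or storage tier is a likely suspect.

```python
# Hedged sketch: sample per-GPU utilization via NVML (pynvml / nvidia-ml-py).
# Sustained low SM utilization during training is a common symptom of a
# data-loading or storage bottleneck rather than a compute-bound job.
import time
import pynvml

pynvml.nvmlInit()
try:
    count = pynvml.nvmlDeviceGetCount()
    for _ in range(5):                      # a handful of one-second samples
        for i in range(count):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            print(f"GPU {i}: sm={util.gpu}% mem={util.memory}%")
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()
```

In production you would scrape the same signal through DCGM-based monitoring (module 9.1) rather than ad-hoc scripts; the point is that "GPUs allocated" and "GPUs busy" are different numbers.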

Covered elsewhere:

  • General Kubernetes cluster operations — see Day-2 Operations
  • Cloud AI services — see the Cloud track’s managed-services section
  • Platform engineering for AI teams (org design, self-service) — see Platform Engineering
  • AI/ML curriculum content (LLMs, transformers, RAG, fine-tuning) — see the AI/ML Engineering track