Data Engineering on Kubernetes

Run data infrastructure where your applications already live.

Kafka, Spark, Flink, Airflow — the backbone of modern data pipelines. This discipline teaches you to run them on Kubernetes, where they benefit from the same orchestration, scaling, and observability as your application workloads.


#    Module                           Time   What You'll Learn
1.1  Stateful Workloads & Storage     3h     CSI internals, Local PVs, StatefulSets, Operators for state
1.2  Apache Kafka on K8s (Strimzi)    3.5h   KRaft, Strimzi Operator, schema management, securing Kafka
1.3  Stream Processing with Flink     3.5h   Flink Operator, checkpointing, event time, watermarks
1.4  Batch Processing with Spark      3h     Spark on K8s, Spark Operator, dynamic allocation
1.5  Data Orchestration with Airflow  2.5h   KubernetesExecutor, DAGs, Helm deployment
1.6  Building a Data Lakehouse        3.5h   Iceberg, Delta Lake, Trino on K8s, Hive Metastore

Total time: ~19 hours
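
A pattern that recurs across these modules is the Kubernetes operator: you declare the desired state of a data system as a custom resource, and the operator reconciles the cluster toward it. As a taste, here is a sketch of a Strimzi Kafka custom resource. Field names follow Strimzi's v1beta2 API, but the cluster name and sizes are illustrative, and depending on the Strimzi version and whether you run KRaft mode, additional resources (such as KafkaNodePools) may be required; check the docs for the operator version you deploy.

```yaml
# Sketch of a Strimzi Kafka custom resource (v1beta2 API).
# Names and sizes are illustrative, not recommendations.
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
spec:
  kafka:
    replicas: 3
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
    storage:
      type: persistent-claim   # each broker backed by its own PVC
      size: 100Gi
  entityOperator:
    topicOperator: {}          # manage topics as KafkaTopic resources
    userOperator: {}           # manage credentials as KafkaUser resources
```

Once applied, the operator creates and manages the StatefulSet-style broker pods, services, and PVCs for you, which is exactly the kind of machinery modules 1.1 and 1.2 unpack.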


Prerequisites:

  • Kubernetes Storage (PV/PVC, StorageClasses)
  • Kubernetes Jobs and CronJobs
  • Basic Python (for Airflow and Spark exercises)
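
If those are rusty, the baseline assumed throughout is roughly this level of manifest literacy: a scheduled batch workload that mounts persistent storage. The sketch below is illustrative only; the names (nightly-export, data-cache) are hypothetical, and the referenced PVC is assumed to already exist.

```yaml
# Illustrative only: a nightly CronJob mounting a pre-existing PVC.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-export
spec:
  schedule: "0 2 * * *"        # 02:00 every night
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: export
              image: python:3.12-slim
              command: ["python", "-c", "print('exporting...')"]
              volumeMounts:
                - name: scratch
                  mountPath: /data
          volumes:
            - name: scratch
              persistentVolumeClaim:
                claimName: data-cache   # hypothetical pre-created PVC
```

If you can read this manifest comfortably, you have what the modules assume; they build from here toward operators that generate such resources for you.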

Running data infrastructure on K8s gives you unified tooling (same CI/CD, same monitoring, same RBAC), elastic scaling, and reduced operational overhead. The trade-off is complexity — these modules teach you to handle it.