Data Engineering on Kubernetes
Run data infrastructure where your applications already live.
Kafka, Spark, Flink, Airflow — the backbone of modern data pipelines. This discipline teaches you to run them on Kubernetes, where they benefit from the same orchestration, scaling, and observability as your application workloads.
Modules
Section titled “Modules”| # | Module | Time | What You’ll Learn |
|---|---|---|---|
| 1.1 | Stateful Workloads & Storage | 3h | CSI internals, Local PVs, StatefulSets, Operators for state |
| 1.2 | Apache Kafka on K8s (Strimzi) | 3.5h | KRaft, Strimzi Operator, schema management, securing Kafka |
| 1.3 | Stream Processing with Flink | 3.5h | Flink Operator, checkpointing, event time, watermarks |
| 1.4 | Batch Processing with Spark | 3h | Spark on K8s, Spark Operator, dynamic allocation |
| 1.5 | Data Orchestration with Airflow | 2.5h | KubernetesExecutor, DAGs, Helm deployment |
| 1.6 | Building a Data Lakehouse | 3.5h | Iceberg, Delta Lake, Trino on K8s, Hive Metastore |
Total time: ~19 hours
Prerequisites
Section titled “Prerequisites”- Kubernetes Storage (PV/PVC, StorageClasses)
- Kubernetes Jobs and CronJobs
- Basic Python (for Airflow and Spark exercises)
Why on Kubernetes?
Section titled “Why on Kubernetes?”Running data infrastructure on K8s gives you unified tooling (same CI/CD, same monitoring, same RBAC), elastic scaling, and reduced operational overhead. The trade-off is complexity — these modules teach you to handle it.