# Module 9.7: Streaming Data Pipelines (MSK / Confluent / Dataflow)
Complexity: [COMPLEX] | Time to Complete: 3h | Prerequisites: Module 9.2 (Message Brokers), Module 9.6 (Search & Analytics), distributed systems basics
## What You'll Be Able to Do

After completing this module, you will be able to:
- Configure Kubernetes consumers for managed streaming platforms (Amazon MSK, Confluent Cloud, Azure Event Hubs with Kafka protocol)
- Implement exactly-once processing patterns with Kafka transactions and Kubernetes StatefulSet consumer groups
- Deploy stream processing applications (Kafka Streams, Flink) on Kubernetes with managed streaming backends
- Design data pipeline architectures that combine managed streaming with Kubernetes batch and real-time processing workloads
## Why This Module Matters

In February 2024, an online marketplace processed 850,000 orders per day. Their event pipeline (order-placed, payment-confirmed, inventory-updated, shipment-created) ran through a self-managed Kafka cluster on EKS. Six brokers, each on r6i.2xlarge instances with 2 TB of gp3 storage. Total monthly cost: $9,400. The platform team spent 15 hours per week on Kafka operations: broker rolling restarts, partition rebalancing, disk monitoring, ZooKeeper maintenance, and upgrading between Kafka versions.
One Tuesday, a broker lost its EBS volume due to an AZ-level storage event. Kafka’s under-replicated partitions jumped to 340. The cluster’s ISR (In-Sync Replicas) dropped below the minimum for 28 topic-partitions, blocking producers. The team spent 6 hours manually reassigning partitions and rebuilding replicas. During that time, the order pipeline was partially degraded — 12% of orders were delayed by up to 4 hours.
They evaluated Amazon MSK (Managed Streaming for Kafka) and Confluent Cloud. The migration to MSK took five weeks. The same 6-broker cluster now costs $7,200/month (cheaper because MSK handles the control plane), and the platform team spends 2 hours per week on Kafka-related tasks instead of 15. Broker replacements happen automatically. ZooKeeper is gone (MSK uses KRaft since 2024). The operational difference is transformative.
This module teaches you when to use managed Kafka versus running Strimzi in-cluster, how partitioning and consumer groups work at scale, how exactly-once semantics prevent duplicate processing, how to monitor consumer lag and prevent data loss, how schema registries maintain data contracts, and how to build stream processing pipelines on Kubernetes.
## Managed Kafka vs In-Cluster Strimzi

### Decision Framework
Section titled “Decision Framework”| Factor | Managed (MSK/Confluent) | In-Cluster (Strimzi on K8s) |
|---|---|---|
| Operational burden | Provider handles brokers, patching, storage | You handle everything |
| Cost at small scale | Higher (minimum 2-3 brokers) | Lower (share cluster resources) |
| Cost at large scale | Often cheaper (no ops engineer time) | Higher (hidden ops costs) |
| Network latency | 1-5ms (VPC connectivity) | < 1ms (in-cluster) |
| Customization | Limited (provider-defined configs) | Full control |
| Multi-cluster | Each cluster is independent | Strimzi MirrorMaker2 for replication |
| Kafka version | Provider cadence (1-3 months behind) | Any version you want |
| ZooKeeper | Gone (KRaft mode on MSK since 2024) | Strimzi manages for you |
### When to Choose Each

Choose managed Kafka when:
- Your team does not have Kafka expertise
- You process more than 100 MB/s sustained
- You need guaranteed durability (financial, healthcare)
- You want to focus on producers/consumers, not broker operations
Choose Strimzi when:
- Sub-millisecond latency matters (in-cluster communication)
- You are on a tight budget with low volume
- You need full control over Kafka configuration
- You are running in environments without managed services (on-prem, edge)
### Strimzi Quick Setup (for comparison)

```yaml
# Strimzi Kafka cluster in Kubernetes
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: event-cluster
  namespace: kafka
spec:
  kafka:
    version: 3.8.0
    replicas: 3
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
      - name: tls
        port: 9093
        type: internal
        tls: true
    config:
      offsets.topic.replication.factor: 3
      transaction.state.log.replication.factor: 3
      transaction.state.log.min.isr: 2
      default.replication.factor: 3
      min.insync.replicas: 2
      num.partitions: 12
    storage:
      type: jbod
      volumes:
        - id: 0
          type: persistent-claim
          size: 500Gi
          class: gp3-encrypted
    resources:
      requests:
        memory: 4Gi
        cpu: "2"
  zookeeper:
    replicas: 3
    storage:
      type: persistent-claim
      size: 50Gi
      class: gp3-encrypted
  entityOperator:
    topicOperator: {}
    userOperator: {}
```
## Partitioning: The Foundation of Kafka Scalability

A Kafka topic is divided into partitions. Partitions are the unit of parallelism: more partitions means more consumers can process data concurrently.
### How Partitions Work

```mermaid
graph TD
    Prod[Producer] -->|"key: order-123 → hash % 6 = 2"| P2
    subgraph T["Topic: order-events"]
        P0
        P1
        P2
        P3
        P4
        P5
    end
    P2 --> Events["offset 0: order-123 created<br/>offset 1: order-123 confirmed<br/>offset 2: order-123 shipped<br/>(ordered within partition)"]
```

### Partition Count Guidelines
Section titled “Partition Count Guidelines”| Throughput | Partitions | Reasoning |
|---|---|---|
| < 10 MB/s | 6 | Enough for small workloads, easy to manage |
| 10-100 MB/s | 12-24 | Allows 12-24 parallel consumers |
| 100 MB/s - 1 GB/s | 50-100 | Match consumer count to partition count |
| > 1 GB/s | 100+ | Carefully test; more partitions = more overhead |
Rule of thumb: partitions should equal or exceed the maximum number of consumers in your largest consumer group.
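Partition counts can be grown later, but never shrunk, and adding partitions changes which partition new messages for an existing key land on, so per-key ordering is only guaranteed from that point forward. A minimal sketch of growing a topic with confluent_kafka's AdminClient; the broker address, topic name, and target count are placeholders:

```python
from confluent_kafka.admin import AdminClient, NewPartitions

admin = AdminClient({'bootstrap.servers': 'msk-broker1:9092'})  # placeholder address

# Grow order-events to 24 partitions (the count can only ever increase)
futures = admin.create_partitions([NewPartitions('order-events', 24)])
for topic, future in futures.items():
    try:
        future.result()  # raises if the request fails (e.g., count lower than current)
        print(f"{topic}: partition count is now 24")
    except Exception as e:
        print(f"{topic}: failed to add partitions: {e}")
```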
### Partition Key Design

The partition key determines which partition a message lands in. All messages with the same key go to the same partition, in order.
```python
import json
from datetime import datetime

from confluent_kafka import Producer

producer = Producer({
    'bootstrap.servers': 'msk-broker1:9092,msk-broker2:9092',
    'acks': 'all',
    'enable.idempotence': True,
    'max.in.flight.requests.per.connection': 5,
})

def delivery_report(err, msg):
    if err:
        print(f"Delivery failed: {err}")
    else:
        print(f"Delivered to {msg.topic()} [{msg.partition()}] @ offset {msg.offset()}")

# Key by order_id -- all events for one order land in the same partition
def publish_order_event(order_id, event_type, payload):
    producer.produce(
        topic='order-events',
        key=str(order_id).encode('utf-8'),
        value=json.dumps({
            'order_id': order_id,
            'event_type': event_type,
            'payload': payload,
            'timestamp': datetime.utcnow().isoformat()
        }).encode('utf-8'),
        callback=delivery_report,
    )
    producer.flush()
```

Common key strategies:
- Customer ID: All events for one customer in order
- Order ID: All order lifecycle events in order
- Null key: Round-robin across partitions (maximum throughput, no ordering)
- Tenant ID: Multi-tenant isolation per partition
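All of these strategies rely on the same underlying mapping from key to partition. A conceptual sketch of what the client's partitioner does; the `partition_for` helper, the crc32 hash, and the partition count are stand-ins for illustration, not Kafka's actual implementation (the Java client, for example, uses a murmur2-based hash):

```python
import random
import zlib

NUM_PARTITIONS = 6

def partition_for(key: bytes | None) -> int:
    if key is None:
        # Null key: the client spreads messages across partitions (round-robin
        # or sticky batching), maximizing throughput but giving up ordering
        return random.randrange(NUM_PARTITIONS)
    # Same key -> same hash -> same partition, which preserves per-key ordering
    return zlib.crc32(key) % NUM_PARTITIONS

assert partition_for(b"order-123") == partition_for(b"order-123")
```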
Pause and predict: If you have a topic with 12 partitions and a consumer deployment with 15 replicas, what exactly happens to the last 3 pods? How will Kubernetes metrics report their status compared to their actual utility?
## Consumer Groups and Lag Monitoring

### Consumer Group Mechanics

```mermaid
graph TD
    subgraph T["Topic: order-events"]
        P0
        P1
        P2
        P3
        P4
        P5
    end
    subgraph CG["Consumer Group: order-processor"]
        Pod1[Pod 1]
        Pod2[Pod 2]
        Pod3[Pod 3]
    end
    P0 --> Pod1
    P1 --> Pod1
    P2 --> Pod2
    P3 --> Pod2
    P4 --> Pod3
    P5 --> Pod3
```

Each pod gets an equal share of partitions. Adding a 4th pod triggers a rebalance. A 7th pod would sit idle (6 partitions, 7 consumers).
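You can watch these assignments happen from inside a consumer. A minimal sketch using confluent_kafka's rebalance callbacks; the broker address and group name are placeholders:

```python
from confluent_kafka import Consumer

def on_assign(consumer, partitions):
    # Called after a rebalance, once this consumer receives its share
    print("Assigned:", [(p.topic, p.partition) for p in partitions])

def on_revoke(consumer, partitions):
    # Called when partitions are taken away (e.g., a new pod joined the group)
    print("Revoked:", [(p.topic, p.partition) for p in partitions])

consumer = Consumer({
    'bootstrap.servers': 'msk-broker1:9092',  # placeholder
    'group.id': 'order-processor',
    'auto.offset.reset': 'earliest',
})
consumer.subscribe(['order-events'], on_assign=on_assign, on_revoke=on_revoke)

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    print(f"partition {msg.partition()} offset {msg.offset()}: {msg.value()}")
```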
### Kubernetes Consumer Deployment

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-processor
  namespace: streaming
spec:
  replicas: 6  # Match partition count
  selector:
    matchLabels:
      app: order-processor
  template:
    metadata:
      labels:
        app: order-processor
    spec:
      serviceAccountName: kafka-consumer
      containers:
        - name: consumer
          image: mycompany/order-processor:2.3.0
          env:
            - name: KAFKA_BROKERS
              value: "b-1.msk-cluster.abc123.kafka.us-east-1.amazonaws.com:9092,b-2.msk-cluster.abc123.kafka.us-east-1.amazonaws.com:9092"
            - name: KAFKA_TOPIC
              value: "order-events"
            - name: KAFKA_GROUP_ID
              value: "order-processor"
            - name: KAFKA_AUTO_OFFSET_RESET
              value: "earliest"
            - name: KAFKA_SECURITY_PROTOCOL
              value: "SASL_SSL"
            - name: KAFKA_SASL_MECHANISM
              value: "AWS_MSK_IAM"
          resources:
            requests:
              cpu: 500m
              memory: 512Mi
```

### Consumer Lag Monitoring
Section titled “Consumer Lag Monitoring”Consumer lag is the difference between the latest message in a partition and the last message processed by a consumer. High lag means consumers are falling behind.
```
Partition 3:
  Latest offset:    1,000,000
  Consumer offset:    985,000
  Lag:                 15,000 messages
```
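For a quick spot check without a monitoring stack, lag can be computed directly from the watermark and committed offsets. A sketch with confluent_kafka, assuming a 6-partition order-events topic and placeholder broker/group names:

```python
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    'bootstrap.servers': 'msk-broker1:9092',  # placeholder
    'group.id': 'order-processor',
})

for partition in range(6):
    tp = TopicPartition('order-events', partition)
    low, high = consumer.get_watermark_offsets(tp, timeout=5.0)
    committed = consumer.committed([tp], timeout=5.0)[0]
    # committed.offset is negative (OFFSET_INVALID) if the group has no commit yet
    current = committed.offset if committed.offset >= 0 else low
    print(f"partition {partition}: latest={high} committed={current} lag={high - current}")

consumer.close()
```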
Section titled “Monitoring with Prometheus and Kafka Exporter”# Deploy kafka-exporter for Prometheus metricsapiVersion: apps/v1kind: Deploymentmetadata: name: kafka-exporter namespace: monitoringspec: replicas: 1 selector: matchLabels: app: kafka-exporter template: metadata: labels: app: kafka-exporter annotations: prometheus.io/scrape: "true" prometheus.io/port: "9308" spec: containers: - name: exporter image: danielqsj/kafka-exporter:v1.8.0 args: - --kafka.server=b-1.msk-cluster.abc123.kafka.us-east-1.amazonaws.com:9092 - --kafka.server=b-2.msk-cluster.abc123.kafka.us-east-1.amazonaws.com:9092 - --topic.filter=order-.* - --group.filter=.* ports: - containerPort: 9308# PrometheusRule for consumer lag alertsapiVersion: monitoring.coreos.com/v1kind: PrometheusRulemetadata: name: kafka-lag-alerts namespace: monitoringspec: groups: - name: kafka-consumer-lag rules: - alert: KafkaConsumerLagHigh expr: | sum by (consumergroup, topic) (kafka_consumergroup_lag) > 50000 for: 10m labels: severity: warning annotations: summary: "Consumer group {{ $labels.consumergroup }} has high lag on {{ $labels.topic }}" - alert: KafkaConsumerLagCritical expr: | sum by (consumergroup, topic) (kafka_consumergroup_lag) > 500000 for: 5m labels: severity: critical annotations: summary: "Consumer group {{ $labels.consumergroup }} critically behind on {{ $labels.topic }}"KEDA Scaling on Consumer Lag
Section titled “KEDA Scaling on Consumer Lag”apiVersion: keda.sh/v1alpha1kind: ScaledObjectmetadata: name: order-processor-scaler namespace: streamingspec: scaleTargetRef: name: order-processor minReplicaCount: 3 maxReplicaCount: 12 # Never exceed partition count pollingInterval: 10 triggers: - type: kafka metadata: bootstrapServers: b-1.msk-cluster.abc123.kafka.us-east-1.amazonaws.com:9092 consumerGroup: order-processor topic: order-events lagThreshold: "1000" offsetResetPolicy: earliestExactly-Once Processing and Stateful Consumers
Section titled “Exactly-Once Processing and Stateful Consumers”For financial transactions or inventory updates, processing a message more than once (at-least-once semantics) or dropping it (at-most-once) is unacceptable. Kafka achieves exactly-once semantics (EOS) through the combination of idempotent producers and transactional APIs.
### The Transactional Pipeline

When a stream processing application consumes from Topic A, processes the event, and writes to Topic B, the consumer offset commit and the producer write must happen atomically.
```python
from confluent_kafka import Producer, Consumer

# 1. Producer requires a transactional.id
producer = Producer({
    'bootstrap.servers': 'msk-cluster:9092',
    'transactional.id': 'order-processor-txn-1',
    'enable.idempotence': True
})
producer.init_transactions()

# Consumer setup (shown for completeness): offsets are committed through the
# transaction, so auto-commit must be disabled
consumer = Consumer({
    'bootstrap.servers': 'msk-cluster:9092',
    'group.id': 'order-processor',
    'enable.auto.commit': False,
    'isolation.level': 'read_committed',  # only see committed transactional writes
})
consumer.subscribe(['order-events'])

# 2. Consume an event
msg = consumer.poll(timeout=1.0)
if msg:
    producer.begin_transaction()
    try:
        # 3. Process and produce the derived event (enrich() is app-specific)
        producer.produce('enriched-orders', key=msg.key(), value=enrich(msg.value()))

        # 4. Send the consumer offsets as part of the transaction
        producer.send_offsets_to_transaction(
            consumer.position(consumer.assignment()),
            consumer.consumer_group_metadata()
        )
        # 5. Commit atomically: the output message and the offsets land together or not at all
        producer.commit_transaction()
    except Exception:
        producer.abort_transaction()
```
Section titled “Why StatefulSets for Kafka Streams?”When using frameworks like Kafka Streams that maintain local state (e.g., using RocksDB to aggregate windowed data or join topics), Deployments are an anti-pattern. If a Deployment pod is restarted, its local state is wiped out. When the new pod joins the consumer group, it must download the entire state from Kafka’s changelog topic before it can process a single new message, which can take hours for large states.
Instead, stream processors with local state should be deployed as a StatefulSet with persistent volume claims:
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: payment-aggregator
  namespace: streaming
spec:
  serviceName: "payment-aggregator"
  replicas: 6
  selector:
    matchLabels:
      app: payment-aggregator
  template:
    metadata:
      labels:
        app: payment-aggregator
    spec:
      containers:
        - name: processor
          image: mycompany/payment-aggregator:1.2.0
          volumeMounts:
            - name: state-store
              mountPath: /var/lib/kafka-streams
  volumeClaimTemplates:
    - metadata:
        name: state-store
      spec:
        accessModes: [ "ReadWriteOnce" ]
        resources:
          requests:
            storage: 50Gi
```

By using a StatefulSet, payment-aggregator-0 maintains its stable network identity and keeps its state-store volume across restarts. When the pod comes back up, its RocksDB cache is already populated, requiring only a minimal delta update before processing resumes.
## Schema Registry: Data Contracts for Events

Without schema management, producers can change the event structure without warning, breaking consumers.

### The Problem

```
Week 1: {"order_id": "123", "amount": 49.99, "currency": "USD"}
Week 2: {"orderId": "123", "total": 49.99}        <-- broke every consumer
```
Section titled “Schema Registry Architecture”sequenceDiagram participant P as Producer participant SR as Schema Registry participant K as Kafka Cluster participant C as Consumer
P->>SR: 1. Register schema SR-->>P: Returns schema ID: 42 P->>K: 2. Produce [schema_id=42][serialized_data] C->>K: Poll messages K-->>C: Returns [schema_id=42][serialized_data] C->>SR: 3. Get schema by ID (42) SR-->>C: Returns schema definition Note over C: 4. Deserialize with schemaConfluent Schema Registry on Kubernetes
Section titled “Confluent Schema Registry on Kubernetes”apiVersion: apps/v1kind: Deploymentmetadata: name: schema-registry namespace: streamingspec: replicas: 2 selector: matchLabels: app: schema-registry template: metadata: labels: app: schema-registry spec: containers: - name: schema-registry image: confluentinc/cp-schema-registry:7.7.0 ports: - containerPort: 8081 env: - name: SCHEMA_REGISTRY_HOST_NAME valueFrom: fieldRef: fieldPath: status.podIP - name: SCHEMA_REGISTRY_KAFKASTORE_BOOTSTRAP_SERVERS value: "b-1.msk-cluster.abc123.kafka.us-east-1.amazonaws.com:9092" - name: SCHEMA_REGISTRY_KAFKASTORE_SECURITY_PROTOCOL value: "SASL_SSL" resources: requests: cpu: 250m memory: 512Mi---apiVersion: v1kind: Servicemetadata: name: schema-registry namespace: streamingspec: selector: app: schema-registry ports: - port: 8081Registering and Using Schemas
Section titled “Registering and Using Schemas”# Register an Avro schema for order eventscurl -XPOST "http://schema-registry:8081/subjects/order-events-value/versions" \ -H "Content-Type: application/vnd.schemaregistry.v1+json" \ -d '{ "schema": "{\"type\":\"record\",\"name\":\"OrderEvent\",\"namespace\":\"com.example\",\"fields\":[{\"name\":\"order_id\",\"type\":\"string\"},{\"name\":\"event_type\",\"type\":{\"type\":\"enum\",\"name\":\"EventType\",\"symbols\":[\"CREATED\",\"CONFIRMED\",\"SHIPPED\",\"DELIVERED\",\"CANCELLED\"]}},{\"name\":\"amount\",\"type\":\"double\"},{\"name\":\"currency\",\"type\":\"string\"},{\"name\":\"timestamp\",\"type\":\"long\",\"logicalType\":\"timestamp-millis\"}]}"}'
# Set compatibility mode (BACKWARD = new schema can read old data)curl -XPUT "http://schema-registry:8081/config/order-events-value" \ -H "Content-Type: application/vnd.schemaregistry.v1+json" \ -d '{"compatibility": "BACKWARD"}'Schema Compatibility Modes
Section titled “Schema Compatibility Modes”| Mode | Rule | Example |
|---|---|---|
| BACKWARD | New schema can read old data | Adding optional field with default |
| FORWARD | Old schema can read new data | Removing optional field |
| FULL | Both backward and forward | Only adding/removing optional fields with defaults |
| NONE | No compatibility checking | Any change allowed (dangerous) |
For most pipelines, BACKWARD compatibility is the safest choice. It ensures that new consumers can always read messages produced by older producers.
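To see the registry in the write path, here is a sketch of an Avro producer using confluent_kafka's schema-registry client (it requires the package's avro extras, e.g. `pip install "confluent-kafka[avro]"`; broker and registry URLs are the placeholders from above, and the trimmed schema is illustrative). The serializer registers or looks up the schema and prepends its ID to every message:

```python
from confluent_kafka import Producer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
from confluent_kafka.serialization import SerializationContext, MessageField

schema_registry = SchemaRegistryClient({'url': 'http://schema-registry:8081'})

# Trimmed version of the OrderEvent schema registered above
schema_str = """
{
  "type": "record",
  "name": "OrderEvent",
  "namespace": "com.example",
  "fields": [
    {"name": "order_id", "type": "string"},
    {"name": "amount", "type": "double"},
    {"name": "currency", "type": "string"}
  ]
}
"""

avro_serializer = AvroSerializer(schema_registry, schema_str)
producer = Producer({'bootstrap.servers': 'msk-cluster:9092'})  # placeholder

event = {"order_id": "123", "amount": 49.99, "currency": "USD"}
producer.produce(
    topic='order-events',
    key=b'123',
    # Serializes to [schema_id][avro bytes]; registration fails if incompatible
    value=avro_serializer(event, SerializationContext('order-events', MessageField.VALUE)),
)
producer.flush()
```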
Stop and think: Your team decides to deploy a new schema that changes an integer field `quantity` to a string field `quantity_str` to support formats like "1 dozen". If you are using BACKWARD compatibility, what will the schema registry do when the producer tries to register this schema?
## Stream Processing on Kubernetes

### Architecture Options

| Tool | Deployment Model | Best For | Complexity |
|---|---|---|---|
| Kafka Streams | Library (runs in your pods) | Simple transformations, joins | Low |
| Apache Flink | Operator (FlinkDeployment) | Complex event processing, windows | High |
| ksqlDB | Deployment | SQL-like stream processing | Medium |
| Google Dataflow | Managed (GCP only) | Batch + stream unified | Medium |
### Kafka Streams Application on Kubernetes

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-enricher
  namespace: streaming
spec:
  replicas: 6
  selector:
    matchLabels:
      app: order-enricher
  template:
    metadata:
      labels:
        app: order-enricher
    spec:
      containers:
        - name: enricher
          image: mycompany/order-enricher:1.0.0
          env:
            - name: KAFKA_BROKERS
              value: "b-1.msk-cluster.abc123.kafka.us-east-1.amazonaws.com:9092"
            - name: APPLICATION_ID
              value: "order-enricher"
            - name: INPUT_TOPIC
              value: "raw-orders"
            - name: OUTPUT_TOPIC
              value: "enriched-orders"
            - name: SCHEMA_REGISTRY_URL
              value: "http://schema-registry:8081"
            - name: STATE_DIR
              value: "/tmp/kafka-streams"
          resources:
            requests:
              cpu: "1"
              memory: 2Gi
          volumeMounts:
            - name: state-store
              mountPath: /tmp/kafka-streams
      volumes:
        - name: state-store
          # emptyDir is acceptable for stateless or small-state topologies only;
          # stateful processors should use the StatefulSet + PVC pattern shown earlier
          emptyDir:
            sizeLimit: 10Gi
```
### Flink on Kubernetes (Flink Operator)

```yaml
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: order-analytics
  namespace: streaming
spec:
  image: mycompany/flink-order-analytics:1.0.0
  flinkVersion: v1_19
  flinkConfiguration:
    taskmanager.numberOfTaskSlots: "2"
    state.backend: rocksdb
    state.checkpoints.dir: s3://flink-checkpoints/order-analytics
    execution.checkpointing.interval: "60000"
  serviceAccount: flink-sa
  jobManager:
    resource:
      memory: "2048m"
      cpu: 1
  taskManager:
    resource:
      memory: "4096m"
      cpu: 2
    replicas: 3
  job:
    jarURI: local:///opt/flink/usrlib/order-analytics.jar
    parallelism: 6
    upgradeMode: savepoint
```
Section titled “Setting Up Amazon MSK”# Create MSK Serverless cluster (simplest option)aws kafka create-cluster-v2 \ --cluster-name orders-streaming \ --serverless '{ "vpcConfigs": [{ "subnetIds": ["subnet-aaa", "subnet-bbb", "subnet-ccc"], "securityGroupIds": ["sg-kafka"] }], "clientAuthentication": { "sasl": {"iam": {"enabled": true}} } }'
# Or create a provisioned cluster for predictable costsaws kafka create-cluster \ --cluster-name orders-streaming \ --kafka-version 3.7.0 \ --number-of-broker-nodes 3 \ --broker-node-group-info '{ "InstanceType": "kafka.m7g.large", "ClientSubnets": ["subnet-aaa", "subnet-bbb", "subnet-ccc"], "SecurityGroups": ["sg-kafka"], "StorageInfo": { "EbsStorageInfo": {"VolumeSize": 500} } }' \ --encryption-info '{ "EncryptionInTransit": {"ClientBroker": "TLS", "InCluster": true}, "EncryptionAtRest": {"DataVolumeKMSKeyId": "alias/msk-key"} }'Did You Know?
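With IAM authentication enabled, clients sign in with AWS credentials rather than passwords. One way to wire this up from Python is AWS's aws-msk-iam-sasl-signer-python package together with librdkafka's OAUTHBEARER callback; this is a sketch under those assumptions (the package, the callback contract, and the 9098 IAM listener port should be verified against your cluster):

```python
# pip install aws-msk-iam-sasl-signer-python confluent-kafka  (assumed packages)
from aws_msk_iam_sasl_signer import MSKAuthTokenProvider
from confluent_kafka import Consumer

REGION = 'us-east-1'

def oauth_cb(oauth_config):
    # Called by librdkafka whenever a fresh SASL token is needed
    token, expiry_ms = MSKAuthTokenProvider.generate_auth_token(REGION)
    return token, expiry_ms / 1000  # confluent_kafka expects expiry in seconds

consumer = Consumer({
    # Provisioned MSK exposes the IAM listener on port 9098 (assumption to verify)
    'bootstrap.servers': 'b-1.msk-cluster.abc123.kafka.us-east-1.amazonaws.com:9098',
    'security.protocol': 'SASL_SSL',
    'sasl.mechanisms': 'OAUTHBEARER',
    'oauth_cb': oauth_cb,
    'group.id': 'order-processor',
})
consumer.subscribe(['order-events'])
```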
Section titled “Did You Know?”-
Apache Kafka processes over 7 trillion messages per day at LinkedIn, where it was originally created. LinkedIn’s Kafka clusters handle over 35 PB of data per day across 100+ clusters. The project was open-sourced in 2011 and is now the most widely deployed streaming platform in the world.
-
Amazon MSK Serverless eliminates cluster capacity planning entirely. You create a topic, produce and consume data, and AWS automatically provisions and scales the underlying infrastructure. Pricing is per-partition-hour and per-GB of data, which can save 30-50% for workloads with variable throughput compared to provisioned MSK.
-
The KRaft mode (Kafka Raft) replaced ZooKeeper starting with Kafka 3.3 (2022) and became production-ready in Kafka 3.5. ZooKeeper was the #1 operational pain point for Kafka clusters — it required separate monitoring, backup, and scaling. KRaft eliminates ZooKeeper entirely by embedding metadata management in Kafka brokers themselves.
-
Schema Registry compatibility checking has prevented billions of dollars in downstream damage according to Confluent’s estimates. A single incompatible schema change in a high-throughput pipeline can corrupt millions of messages before anyone notices. The registry acts as a gatekeeper, rejecting incompatible schemas at registration time rather than at consumption time.
## Common Mistakes

| Mistake | Why It Happens | How to Fix It |
|---|---|---|
| More consumers than partitions | "More pods = more throughput" | Extra consumers sit idle; scale partitions first, then consumers |
| Not using a partition key when ordering matters | Null key gives the best throughput | Use customer/order ID as the key for ordered event processing |
| Setting `auto.offset.reset=latest` in production | "We only want new messages" | Use `earliest` and let the consumer group track its position; `latest` loses messages on restart |
| Not monitoring consumer lag | "If messages are flowing, everything is fine" | Deploy kafka-exporter and alert on lag > threshold |
| Skipping schema registry | "We will coordinate schema changes manually" | Manual coordination fails at scale; the registry enforces compatibility |
| Under-replicating topics (replication factor = 1) | Testing configuration leaked to production | Always use replication factor >= 3 and `min.insync.replicas` >= 2 |
| Running Kafka Streams without persistent state store volumes | Using emptyDir for state | State is lost on pod restart, causing full reprocessing; use PVCs for state stores |
| Not setting producer `acks=all` for critical data | Default was `acks=1` before Kafka 3.0 | Always set `acks=all` and `enable.idempotence=true` for data safety |
1. Your team deploys an e-commerce order processing service scaling via HPA. During a flash sale, the HPA scales the consumer Deployment to 20 pods to handle a massive spike in CPU usage. However, the Kafka topic `orders-raw` only has 12 partitions. Despite having 20 pods running, processing throughput hits a hard ceiling. What is happening to pods 13-20, and how should you re-architect to handle this scale?
In a Kafka consumer group, each partition is exclusively assigned to a single consumer to guarantee ordering. Because there are only 12 partitions, pods 13 through 20 are completely starved of data; they sit entirely idle, receiving zero messages and processing nothing, while consuming cluster resources. The hard throughput ceiling is bound by the 12 active pods. To handle this scale, you must first increase the partition count of the orders-raw topic to at least 20 (or higher, like 50) so that the incoming data stream can be fanned out. Only then can the additional pods take ownership of the new partitions and begin contributing to the overall throughput.
2. You are designing an architecture for a healthcare startup that processes 500 MB/s of patient telemetry data across three AWS regions. The team is small, consisting of four application developers and zero dedicated database administrators. They are debating between deploying Strimzi on EKS or paying for Amazon MSK. Which solution should they choose, and what operational realities drive this decision?
The startup should unequivocally choose Amazon MSK (or Confluent Cloud). At 500 MB/s of sustained throughput, a Kafka cluster is a massive, IO-heavy distributed system requiring continuous tuning, disk monitoring, partition rebalancing, and failover management. Deploying Strimzi means the four application developers become part-time Kafka administrators. When a broker disk fails at 3 AM or ZooKeeper/KRaft desynchronizes, the developers are responsible for the repair, directly detracting from product velocity. While MSK carries a higher line-item cost on the AWS bill, it absorbs the operational burden of node replacements, patching, and control-plane management, ensuring data durability for critical healthcare telemetry without requiring the team to hire a dedicated infrastructure engineer.
3. At 3:00 AM on Black Friday, your alerting system triggers. CPU and memory metrics for your payment processing pods are well within normal limits, and pod logs show zero error messages. However, the `kafka_consumergroup_lag` metric has spiked from 50 to 85,000 for the `payment-events` topic. What is happening in your system, and why is this metric catching an issue that standard Kubernetes metrics missed?
Your payment processing pods are failing to keep pace with the incoming surge of payment events. The producers are writing messages to the topic much faster than the consumers can process and acknowledge them, resulting in a rapidly growing backlog (lag). This situation often occurs when consumers become bottlenecked by a downstream system, such as a sluggish database or external API. In these scenarios, the consumer pods are simply blocked waiting for I/O; they aren’t crashing, generating errors, or maxing out CPU/memory. Standard Kubernetes metrics indicate the pods are “healthy” because they are running, but only the kafka_consumergroup_lag metric reveals the true business reality: the data pipeline is silently falling dangerously behind.
4. The data engineering team proposes changing the schema compatibility mode from BACKWARD to FORWARD in the Confluent Schema Registry. They argue this will force producers to update their applications before consumers can read the new data formats. In a decoupled microservices architecture where you manage the consumer but another team manages the producer, why is switching away from BACKWARD compatibility a dangerous anti-pattern?
Switching away from BACKWARD compatibility severely jeopardizes deployment safety because it destroys the guarantee that an updated consumer can read historical data. In event-driven architectures, topics act as ledgers containing older messages. If the producer team introduces a schema change and you deploy a new version of your consumer, your consumer will immediately encounter older messages on the topic that were serialized with the previous schema. If the new schema is not backward compatible, your consumer will fail to deserialize those older messages, resulting in a catastrophic pipeline crash or a poison-pill loop. BACKWARD compatibility is strictly required to ensure that consumers can be safely upgraded at any time without coordinating downtime with the producers.
5. You configured a standard Kubernetes HPA targeting 80% CPU utilization for a legacy inventory syncing consumer. During a database slowdown, the inventory consumers spend all their time waiting for the database to respond (I/O bound). The Kafka topic backs up with 200,000 unprocessed messages, but the HPA does not scale up the deployment. Why did the CPU-based HPA fail to scale, and how would replacing it with KEDA solve this exact problem?
The CPU-based HPA failed because threads blocked on network I/O (waiting for a database response) do not consume CPU cycles. The pods appeared idle to the Kubernetes metrics server, hovering well below the 80% threshold, so the HPA saw no reason to scale out, despite the massive backlog of work. Replacing the standard HPA with KEDA solves this by shifting the scaling metric from internal resource utilization (CPU) to external queue depth (Kafka lag). KEDA directly queries the Kafka cluster for the consumer group lag and scales the deployment proportionally. If the lag crosses the configured threshold, KEDA will immediately spawn new pods to help chew through the backlog, entirely bypassing the misleading CPU metrics.
6. An SRE configures a critical financial ledger topic with a replication factor of 3, `min.insync.replicas=1`, and instructs developers to use `acks=all` in their producers. Later that day, broker 2 crashes, followed shortly by broker 3. The producer successfully writes a deposit event to the remaining broker 1, but then broker 1's disk fails before any other broker recovers. Why did `acks=all` fail to prevent data loss in this scenario, and how would changing `min.insync.replicas` have changed the outcome?
The acks=all setting guarantees that the producer will wait for all currently in-sync replicas to acknowledge the write. However, because min.insync.replicas was set to 1, the cluster was perfectly willing to accept writes even when only a single broker (broker 1) was alive and in-sync. The producer received a success acknowledgment after writing solely to broker 1. When broker 1’s disk failed, that un-replicated data was permanently lost. If min.insync.replicas had been configured to 2, the cluster would have proactively rejected the producer’s write attempt once brokers 2 and 3 went down. The producer would have received an error instead of a false confirmation, allowing the application to safely retry or alert, thereby preserving data integrity at the cost of temporary unavailability.
## Hands-On Exercise: Kafka Pipeline with Strimzi

```bash
# Create kind cluster with extra resources
cat > /tmp/kind-kafka.yaml << 'EOF'
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
  - role: worker
  - role: worker
EOF

kind create cluster --name kafka-lab --config /tmp/kind-kafka.yaml

# Install Strimzi operator
k create namespace kafka
k apply -f 'https://strimzi.io/install/latest?namespace=kafka'

k wait --for=condition=ready pod -l name=strimzi-cluster-operator \
  --namespace kafka --timeout=180s
```
### Task 1: Create a Kafka Cluster

Deploy a 3-broker Kafka cluster using Strimzi.
**Solution**
```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: lab-cluster
  namespace: kafka
spec:
  kafka:
    version: 3.8.0
    replicas: 3
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
    config:
      offsets.topic.replication.factor: 3
      transaction.state.log.replication.factor: 3
      transaction.state.log.min.isr: 2
      default.replication.factor: 3
      min.insync.replicas: 2
      num.partitions: 6
    storage:
      type: ephemeral
    resources:
      requests:
        memory: 1Gi
        cpu: 500m
  zookeeper:
    replicas: 3
    storage:
      type: ephemeral
    resources:
      requests:
        memory: 512Mi
        cpu: 250m
  entityOperator:
    topicOperator: {}
```

```bash
k apply -f /tmp/kafka-cluster.yaml
# This takes 3-5 minutes
k wait kafka/lab-cluster --for=condition=Ready --timeout=300s -n kafka
```
### Task 2: Create a Topic and Produce Messages

Create an `order-events` topic and publish messages.
**Solution**
```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: order-events
  namespace: kafka
  labels:
    strimzi.io/cluster: lab-cluster
spec:
  partitions: 6
  replicas: 3
  config:
    retention.ms: 86400000
    min.insync.replicas: 2
```

```bash
k apply -f /tmp/topic.yaml

# Produce messages
k run kafka-producer --rm -it --image=quay.io/strimzi/kafka:0.44.0-kafka-3.8.0 \
  -n kafka --restart=Never -- \
  bin/kafka-console-producer.sh \
    --broker-list lab-cluster-kafka-bootstrap:9092 \
    --topic order-events \
    --property "parse.key=true" \
    --property "key.separator=:" << 'EOF'
order-001:{"order_id":"001","event":"created","amount":29.99}
order-002:{"order_id":"002","event":"created","amount":49.99}
order-001:{"order_id":"001","event":"confirmed","amount":29.99}
order-003:{"order_id":"003","event":"created","amount":99.99}
order-002:{"order_id":"002","event":"confirmed","amount":49.99}
order-001:{"order_id":"001","event":"shipped","amount":29.99}
order-003:{"order_id":"003","event":"confirmed","amount":99.99}
order-002:{"order_id":"002","event":"shipped","amount":49.99}
order-003:{"order_id":"003","event":"shipped","amount":99.99}
order-001:{"order_id":"001","event":"delivered","amount":29.99}
EOF
```
### Task 3: Deploy Consumer Group and Observe Partition Assignment

Create a consumer Deployment with 3 replicas and verify partition distribution.
**Solution**
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-consumer
  namespace: kafka
spec:
  replicas: 3
  selector:
    matchLabels:
      app: order-consumer
  template:
    metadata:
      labels:
        app: order-consumer
    spec:
      containers:
        - name: consumer
          image: quay.io/strimzi/kafka:0.44.0-kafka-3.8.0
          command:
            - /bin/sh
            - -c
            - |
              bin/kafka-console-consumer.sh \
                --bootstrap-server lab-cluster-kafka-bootstrap:9092 \
                --topic order-events \
                --group order-processor \
                --from-beginning \
                --property print.key=true \
                --property print.partition=true
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
```

```bash
k apply -f /tmp/consumer-deployment.yaml
sleep 15

# Check consumer group partition assignments
k run check-group --rm -it --image=quay.io/strimzi/kafka:0.44.0-kafka-3.8.0 \
  -n kafka --restart=Never -- \
  bin/kafka-consumer-groups.sh \
    --bootstrap-server lab-cluster-kafka-bootstrap:9092 \
    --describe --group order-processor
```
### Task 4: Monitor Consumer Lag

Produce more messages and observe lag building up.
**Solution**
```bash
# Produce 1000 messages rapidly
k run bulk-producer --rm -it --image=quay.io/strimzi/kafka:0.44.0-kafka-3.8.0 \
  -n kafka --restart=Never -- \
  /bin/sh -c '
    for i in $(seq 1 1000); do
      echo "order-$((i % 100)):$(printf "{\"order_id\":\"%03d\",\"event\":\"created\",\"amount\":%d.%02d}" $i $((RANDOM % 100)) $((RANDOM % 100)))"
    done | bin/kafka-console-producer.sh \
      --broker-list lab-cluster-kafka-bootstrap:9092 \
      --topic order-events \
      --property "parse.key=true" \
      --property "key.separator=:"
    echo "Produced 1000 messages"
  '

# Check lag
k run check-lag --rm -it --image=quay.io/strimzi/kafka:0.44.0-kafka-3.8.0 \
  -n kafka --restart=Never -- \
  bin/kafka-consumer-groups.sh \
    --bootstrap-server lab-cluster-kafka-bootstrap:9092 \
    --describe --group order-processor
```
### Task 5: Verify Ordering Within Partitions

Confirm that messages with the same key always appear in order.
**Solution**
```bash
# Consume from a specific partition to verify ordering
k run partition-check --rm -it --image=quay.io/strimzi/kafka:0.44.0-kafka-3.8.0 \
  -n kafka --restart=Never -- \
  bin/kafka-console-consumer.ssh \
    --bootstrap-server lab-cluster-kafka-bootstrap:9092 \
    --topic order-events \
    --partition 0 \
    --from-beginning \
    --max-messages 20 \
    --property print.key=true \
    --property print.offset=true

# All messages with the same key will have increasing offsets
# within the same partition, confirming order is preserved
```

### Success Criteria
Section titled “Success Criteria”- 3-broker Kafka cluster is running and Ready
- order-events topic has 6 partitions with replication factor 3
- Consumer group shows 3 consumers with 2 partitions each
- Consumer lag is visible after bulk production
- Messages with the same key appear in order within their partition
## Cleanup

```bash
kind delete cluster --name kafka-lab
```

Next Module: Module 9.8: Secrets Management Deep Dive. Learn how External Secrets Operator, Secrets Store CSI, and HashiCorp Vault integrate with Kubernetes to manage dynamic secrets, TTLs, and credential rotation at scale.