Module 1.7: Event Streaming Fundamentals

Complexity: [INTERMEDIATE]

Time to Complete: 3 hours

Prerequisites: Module 1.1 — Stateful Workloads & Storage Deep Dive, basic Kubernetes operations on version 1.35+, and familiarity with HTTP APIs or message queues


After completing this module, you will be able to:

  • Design event-streaming topologies that use an append-only log as the replayable source of truth
  • Evaluate partitioning schemes and explain the exact ordering guarantees they do and do not provide
  • Compare at-most-once, at-least-once, and exactly-once processing using producer, broker, and consumer settings
  • Diagnose backpressure by separating producer throttling, broker quotas, consumer lag, and downstream saturation
  • Choose between Kafka and NATS JetStream based on operator posture, latency target, ordering model, and Kubernetes maturity

At first, the outage looked like a normal analytics delay.

The checkout service was still accepting orders. The payment provider was still sending settlement events. The warehouse service was still updating inventory. Dashboards were late, but nobody expected a dashboard to be perfectly real time.

Then support agents noticed something worse. Customers were receiving shipment notifications for orders that had been refunded. Inventory counts were drifting in different directions across regions. Fraud review was making decisions with stale account state.

The team had built “streaming” as if it were just a faster queue. Every consumer advanced independently. Some services retried failed messages without idempotency keys. Other services assumed Kafka ordering was global. One team increased partitions during a promotion and accidentally split a customer’s events across multiple workers. Another team treated a growing consumer lag graph as a broker problem when the real bottleneck was a slow database write behind the consumer.

None of those mistakes came from not knowing a command. They came from using streaming without a mental model.

Event streaming is not just “send a message and receive it later.” It is a storage and coordination pattern built around an append-only log. Producers append facts. Brokers retain those facts. Consumers replay them at their own pace. Multiple teams can build different projections from the same history without asking the producer to send the same event again.

That design is powerful because it decouples time. It is dangerous because the system keeps moving even when one part falls behind. The broker can be healthy while a consumer corrupts state. The producer can succeed while the downstream database is unavailable. “Exactly-once” can protect writes to a stream but still leave duplicate side effects in an email service.

This module gives you the map before you operate the tools. You will learn what the log guarantees, what partitions change, why ordering is usually per key rather than global, how delivery semantics are assembled, and when Kafka or NATS JetStream is the right operational shape.

If you later deploy Kafka with Strimzi in Module 1.2, this module is the reasoning layer underneath the YAML.

Most engineers first meet messaging through queues. A producer puts work in. A worker takes work out. After the worker acknowledges the item, the queue forgets it.

That model is useful for distributing jobs, but it is not the core idea behind event streaming.

In streaming, the durable log is the product. Consumers do not “take” an event from the log. They read it, remember their own position, and leave the event available for other consumers. That one difference changes the architecture.

                 append
        +----------------------+
        |      Producers       |
        |  orders, payments,   |
        |  inventory, devices  |
        +----------+-----------+
                   |
                   v
        +----------------------+
        |  Append-Only Stream  |
        |                      |
        |  offset 100: A       |
        |  offset 101: B       |
        |  offset 102: C       |
        |  offset 103: D       |
        |  offset 104: E       |
        +----+----------+------+
             |          |
     replay  |          |  fan-out
             v          v
     +---------------+ +----------------+
     | Billing view  | | Fraud model    |
     | offset: 104   | | offset: 102    |
     +---------------+ +----------------+
                   |
                   v
        +----------------------+
        | Warehouse projection |
        | offset: 099          |
        +----------------------+

The same event history can feed billing, fraud detection, search indexing, ML features, warehouse loading, and incident replay. Each consumer group tracks its own offset. One group can be caught up while another is deliberately replaying last month.
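The read-without-removing behavior can be sketched with a small in-memory model. This is an illustration only, not a broker client: the `Stream` class, its method names, and the group names are hypothetical, and a real broker adds durability, replication, and retention on top.

```python
from dataclasses import dataclass, field


@dataclass
class Stream:
    """In-memory stand-in for one ordered, append-only stream (hypothetical)."""
    records: list = field(default_factory=list)
    positions: dict = field(default_factory=dict)  # group name -> next offset

    def append(self, event):
        """Add a fact at the end of the log and return its offset."""
        self.records.append(event)
        return len(self.records) - 1

    def poll(self, group, max_records=10):
        """Return unread records for one group without removing them."""
        start = self.positions.get(group, 0)
        batch = self.records[start:start + max_records]
        self.positions[group] = start + len(batch)
        return batch

    def seek(self, group, offset):
        """Move one group's position; other groups are unaffected."""
        self.positions[group] = offset


stream = Stream()
for name in ["A", "B", "C", "D", "E"]:
    stream.append(name)

billing = stream.poll("billing")              # reads everything available
fraud = stream.poll("fraud", max_records=3)   # reads at its own pace
stream.seek("billing", 0)                     # rewind only the billing group
replayed = stream.poll("billing")             # same history, read again
```

Reading never shrinks `records`, and each group only moves its own entry in `positions`, which is the property that makes fan-out and replay possible.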

The log gives you three important properties.

First, it is append-only. New facts are added at the end. Existing facts are not edited in place. If an order is refunded, the stream receives a refund event rather than mutating the original order event. This is why streams pair naturally with audit trails.

Second, it is replayable. A new service can start from the beginning and build its own state. A broken deployment can rewind to a known offset and reprocess corrected logic. A data science team can replay historical events into a feature pipeline.

Third, it supports fan-out. The producer does not need to know every future consumer. The stream becomes a contract between teams. The payment service publishes PaymentAuthorized. Fraud, billing, analytics, and support tooling can subscribe without adding synchronous calls to the payment path.

That does not mean every event should be huge. An event should usually contain the facts that changed, the identity of the thing that changed, and enough metadata to route, deduplicate, and debug it. A common envelope looks like this:

id: evt_20260518_9c1d
source: checkout-api
type: order.created
subject: order-88391
time: "2026-05-18T10:25:13Z"
datacontenttype: application/json
partitionKey: customer-991
data:
  orderId: order-88391
  customerId: customer-991
  totalCents: 12850
  currency: USD

The important field for this module is partitionKey. It is the bridge between a business invariant and a broker guarantee.

If all events for customer-991 must be processed in order, the key needs to be stable for that customer. If all events for order-88391 must be processed in order, the key needs to be stable for that order. You cannot postpone that decision until the consumer is written. The producer is choosing the shape of ordering when it chooses the key.

Imagine a support dashboard stores a materialized view of each order. It consumes an orders stream and writes the latest status into PostgreSQL.

At noon, the team discovers a bug: OrderRefunded events were ignored for orders above a certain value. In a non-replayable queue, the team would need a compensating migration from another database. In a stream, they can do this:

1. Pause the broken consumer deployment.
2. Deploy fixed projection code under a new consumer group name.
3. Start the new group at the earliest retained offset, or at a trusted snapshot offset.
4. Let it replay all order events into a clean table.
5. Compare counts and selected orders against the old table.
6. Switch the dashboard read path to the corrected projection.

The replay works because the stream retained the source facts. The consumer’s table was not the source of truth. It was a projection.
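A minimal sketch of that rebuild, assuming a simplified event shape. The `project` function, the `handle_refunds` flag that stands in for the bug, and the sample events are all hypothetical.

```python
def project(events, handle_refunds=True):
    """Fold order events into a {order_id: status} view."""
    view = {}
    for event in events:
        if event["type"] == "OrderCreated":
            view[event["order"]] = "created"
        elif event["type"] == "OrderRefunded" and handle_refunds:
            view[event["order"]] = "refunded"
    return view


# The retained log is the source of truth; both views are derived from it.
log = [
    {"type": "OrderCreated", "order": "order-1"},
    {"type": "OrderCreated", "order": "order-2"},
    {"type": "OrderRefunded", "order": "order-2"},
]

broken_view = project(log, handle_refunds=False)  # buggy deploy ignored refunds
fixed_view = project(log)                          # replay with corrected logic

# Orders whose status differs between the old and rebuilt projection.
changed = {k for k in fixed_view if fixed_view[k] != broken_view.get(k)}
```

Comparing the two views before switching the read path corresponds to step 5 in the list above.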

This distinction is not academic. When a team treats a projection as the truth, it starts patching derived state directly. When a team treats the log as the truth, it asks whether the stream has enough retained history to rebuild the projection correctly.

Check yourself: Pick a service you know. If its derived database were deleted, which event stream would let you rebuild it? If the answer is “none,” that service is not using the log as its source of truth.

A stream can behave like a queue for one consumer group, but the design intent is different.

| Question | Queue answer | Stream answer |
| --- | --- | --- |
| What happens after a worker acknowledges a message? | The message is usually removed or hidden permanently. | The event remains until retention removes or compacts it. |
| Can two independent teams consume the same event history? | Usually not without duplicating messages or topics. | Yes, each consumer group tracks its own position. |
| Can a new service replay old events? | Usually no, unless dead-letter or archive behavior was added. | Yes, if retention still contains the needed history. |
| Is the broker the source of truth? | Usually no, it is a work dispatcher. | Often yes, or at least the durable integration source. |

This is why “just use Kafka as a queue” often leads to pain. If there is exactly one worker pool, no replay requirement, no fan-out, and no need to retain history, a durable task queue may be simpler. Streaming earns its complexity when history has value.

No serious streaming platform gives total ordering, unlimited throughput, and uninterrupted availability at the same time. Partitions are the compromise.

A topic or stream is split into ordered shards. Each shard has an ordered sequence. The broker can place shards across nodes. Consumers can process different shards in parallel.

That gives you scale, but it narrows the ordering guarantee.

Kafka orders records within a partition. NATS JetStream orders messages within a stream or ordered consumer view, but subject filters and shared pull consumers change how applications observe that order. Neither system gives you simple global ordering across all business entities at high scale.

The practical rule is:

Ordering is per key when the key always maps to the same ordered shard.
Ordering is not global unless you accept a single ordered shard.

Suppose the orders topic has three partitions. The producer uses customerId as the key. The partitioner hashes the key and maps it to a partition.

orders topic
Partition 0
+--------+-------------------+------------------------------+
| Offset | Key               | Event                        |
+--------+-------------------+------------------------------+
| 0      | customer-a        | OrderCreated(order-10)       |
| 1      | customer-a        | PaymentAuthorized(order-10)  |
| 2      | customer-d        | OrderCreated(order-13)       |
+--------+-------------------+------------------------------+
Partition 1
+--------+-------------------+------------------------------+
| Offset | Key               | Event                        |
+--------+-------------------+------------------------------+
| 0      | customer-b        | OrderCreated(order-11)       |
| 1      | customer-b        | OrderCancelled(order-11)     |
| 2      | customer-e        | OrderCreated(order-14)       |
+--------+-------------------+------------------------------+
Partition 2
+--------+-------------------+------------------------------+
| Offset | Key               | Event                        |
+--------+-------------------+------------------------------+
| 0      | customer-c        | OrderCreated(order-12)       |
| 1      | customer-c        | AddressChanged(customer-c)   |
| 2      | customer-f        | OrderCreated(order-15)       |
+--------+-------------------+------------------------------+

For customer-a, OrderCreated is processed before PaymentAuthorized because both events are in partition 0. For customer-b, OrderCreated is processed before OrderCancelled because both events are in partition 1.

There is no meaningful global order between customer-a offset 1 and customer-b offset 1. They live in different partitions. The broker can process, replicate, and deliver them independently.
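The key-to-partition step can be sketched as follows. This is an approximation: `partition_for` is a hypothetical helper that uses CRC32 from the standard library, whereas real Kafka clients use their own hash function, so the partition numbers will not match the table above or a real cluster. The point is only that a stable key always lands in the same shard.

```python
import zlib


def partition_for(key: str, partitions: int) -> int:
    """Map a key to a partition with a stable hash (CRC32 as a stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % partitions


events = [
    ("customer-a", "OrderCreated(order-10)"),
    ("customer-b", "OrderCreated(order-11)"),
    ("customer-a", "PaymentAuthorized(order-10)"),
    ("customer-b", "OrderCancelled(order-11)"),
]

# Append each event to the partition chosen by its key.
partitions = {0: [], 1: [], 2: []}
for key, event in events:
    partitions[partition_for(key, 3)].append((key, event))

# Every event for customer-a sits in one partition, in publish order.
customer_a = [e for p in partitions.values() for k, e in p if k == "customer-a"]
```

Order between `customer-a` and `customer-b` is only defined if both keys happen to hash to the same partition, which an application should never rely on.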

If a dashboard says “show all events exactly as they happened across the company,” the stream cannot magically produce one total order unless the design forces all events through one partition or adds a separate sequencing service. That usually trades away throughput and availability.

The same event can be partitioned several ways. Each choice protects a different invariant.

| Partition key | Ordering you get | Ordering you lose | Typical fit |
| --- | --- | --- | --- |
| orderId | All events for one order stay ordered. | A customer’s many orders may interleave across partitions. | Order lifecycle, fulfillment, refunds. |
| customerId | All customer events stay ordered. | A hot customer or tenant can overload one partition. | Account state, fraud, entitlement changes. |
| tenantId | Tenant-level audit order is preserved. | High-volume tenants become hot spots. | Multi-tenant billing and compliance logs. |
| Random or round-robin | Maximum distribution. | Business ordering is mostly gone. | Metrics, independent telemetry, fire-and-forget analytics. |

There is no “best” key without a business rule. The key is the answer to the question: “Which events would be dangerous to process out of order?”

Your team emits these events:

UserEmailChanged(user-7, old=a@example.com, new=b@example.com)
PasswordResetRequested(user-7, email=b@example.com)
MarketingConsentRevoked(user-7)

If the producer partitions by eventType, the three events may land in three different partitions. The password reset worker might observe the reset before the email change. The marketing system might send a campaign before consent revocation is processed.

Check yourself: What partition key would you choose if account security is the highest-risk workflow? What if marketing analytics is the only consumer and all events are independent counters?

Increasing partitions sounds harmless. It is not always harmless.

In Kafka, changing a topic from three partitions to six partitions changes the key-to-partition mapping for many keys. New events for a key may go to a different partition than old events. That can break order for consumers that replay a mixed range of old and new records.
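The remapping effect is easy to see with modular arithmetic. The sketch below assumes a generic hash-modulo partitioner with CRC32 as a stand-in hash; `partition_for` and the synthetic keys are hypothetical, and a real client's hash will move a different set of keys.

```python
import zlib


def partition_for(key: str, partitions: int) -> int:
    """Stable hash-modulo partitioner (CRC32 as a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % partitions


keys = [f"customer-{n}" for n in range(1000)]

# Keys whose partition changes when the topic grows from 3 to 6 partitions.
moved = [k for k in keys if partition_for(k, 3) != partition_for(k, 6)]
moved_fraction = len(moved) / len(keys)
```

With a hash-modulo scheme, a key that moves from three to six partitions always moves from partition `p` to partition `p + 3`, so its old records and new records end up in different ordered shards.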

A safe repartition plan usually includes one of these moves:

| Strategy | How it works | Trade-off |
| --- | --- | --- |
| Create a new topic | Write to orders-v2 with the new partition count and migrate consumers. | Cleanest ordering boundary, but more rollout work. |
| Version the key | Include a routing version and teach consumers where the boundary is. | More application complexity. |
| Keep old topic stable | Add throughput by splitting by domain, not by mutating partitions. | Requires topic design work. |
| Accept the break | Use only when records are independent and order is not a correctness requirement. | Easy to underestimate risk. |

Operators need to treat partition count as an architectural decision, not just a capacity knob.

Delivery semantics describe what can happen during failures. They are not moral grades. At-most-once can be correct for telemetry. At-least-once can be correct for payments when consumers are idempotent. Exactly-once can be necessary for stream-to-stream transformations but still insufficient for side effects outside the transaction boundary.

The ladder looks like this:

At-most-once
  Message may be lost.
  Message should not be redelivered.
At-least-once
  Message should not be lost after acknowledgment rules are met.
  Message may be redelivered.
Exactly-once processing
  The system coordinates writes and offsets so each input affects the committed output once.
  Scope matters: stream-to-stream is different from stream-to-email.

The dangerous phrase is “exactly-once delivery.” In most real systems, the broker can only coordinate the parts it owns or the clients that participate in its protocol. If a consumer sends an email and crashes before committing an offset, the broker can redeliver the event. The broker cannot unsend the email.

Worked Example: The Same Failure Under Three Semantics

A consumer reads PaymentCaptured(order-10). It writes a row to a reporting table. Then the process crashes before committing its offset.

| Semantic target | What happens after restart | Correct design response |
| --- | --- | --- |
| At-most-once | The offset may have been committed before processing, so the row might be missing. | Use only if missing data is acceptable or can be recovered elsewhere. |
| At-least-once | The event is read again, so the row might be written twice. | Make the database write idempotent with event_id or an upsert key. |
| Exactly-once stream processing | The output write and consumed offset commit are atomic inside the streaming transaction. | Keep output in a transactional sink that participates, or accept a weaker boundary. |

The lesson is not “always use exactly-once.” The lesson is “know where duplicates or loss can appear, then design the side effect.”
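The first two rows of the worked example come down to whether the offset is committed before or after the write. The simulation below is a sketch with no broker involved: the `run` function, the `crash_at` parameter, and the list used as a sink are hypothetical stand-ins.

```python
def run(events, commit_first, crash_at):
    """Consume with one simulated crash, then restart from the committed offset."""
    committed = 0   # next offset to read after a restart
    sink = []       # stand-in for the reporting table

    def consume(start, crash_index=None):
        nonlocal committed
        for offset in range(start, len(events)):
            if commit_first:
                committed = offset + 1        # at-most-once: commit, then work
                if offset == crash_index:
                    return                    # crash after commit, before write
                sink.append(events[offset])
            else:
                sink.append(events[offset])   # at-least-once: work, then commit
                if offset == crash_index:
                    return                    # crash after write, before commit
                committed = offset + 1

    consume(0, crash_index=crash_at)  # first run crashes partway through
    consume(committed)                # restart resumes from committed offset
    return sink


events = ["e0", "e1", "e2"]
at_most_once = run(events, commit_first=True, crash_at=1)
at_least_once = run(events, commit_first=False, crash_at=1)
```

Committing first loses `e1`; committing last writes it twice. Neither outcome is a broker bug, which is why the sink design carries the correctness burden.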

Kafka’s official producer documentation describes idempotence, retries, acknowledgments, and transactions as separate pieces. The table below turns those pieces into an operator checklist.

| Goal | Producer settings | Consumer settings | Offset behavior | What it protects | What it does not protect |
| --- | --- | --- | --- | --- | --- |
| At-most-once | acks=0 or low durability settings, retries disabled or irrelevant | Commit offset before processing | Offset advances before work finishes | Avoids duplicate processing from broker redelivery | Can lose messages during producer, broker, or consumer failure |
| Basic at-least-once | acks=all, retries enabled | Process first, commit offset after success | Offset advances only after work succeeds | Avoids acknowledged data loss under normal failures | Can duplicate side effects after crashes |
| Idempotent production | enable.idempotence=true, acks=all, retries enabled, bounded in-flight requests | Same as at-least-once | Same as at-least-once | Prevents duplicate records caused by producer retry within the producer session | Does not deduplicate application-level resends with new IDs |
| Transactional stream-to-stream | transactional.id set, idempotence active | isolation.level=read_committed for downstream readers | Consumer offsets are sent to the producer transaction | Atomically commits output records and consumed offsets | Does not make arbitrary external calls exactly-once |
| Application-level exactly-once effect | Same as transactional where possible | Consumer uses idempotency key at sink | Sink records processed event_id or equivalent | Gives exactly-once effect in a database or API designed for idempotency | Requires sink support and careful schema design |

For Kafka, remember the three moving parts:

  • Idempotent producers prevent duplicate writes caused by producer retries.
  • Transactions coordinate output records and consumed offsets.
  • isolation.level=read_committed prevents consumers from reading aborted transactional writes.

Those are necessary for exactly-once stream processing with Kafka. They are not enough for every side effect.
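One way to keep the producer side honest is to lint client configuration before deployment. The checker below is a hypothetical helper, not part of any Kafka client. It uses the standard Kafka property names and insists that each value is set explicitly, because defaults differ between client versions; the limit of five in-flight requests is the bound Kafka documents for idempotent producers.

```python
def idempotence_problems(config):
    """Return reasons a producer config does not explicitly meet idempotence rules."""
    problems = []
    if str(config.get("enable.idempotence", "false")).lower() != "true":
        problems.append("enable.idempotence is not true")
    if str(config.get("acks", "")) not in ("all", "-1"):
        problems.append("acks must be all")
    if int(config.get("retries", 1)) < 1:
        problems.append("retries must be greater than 0")
    if int(config.get("max.in.flight.requests.per.connection", 5)) > 5:
        problems.append("max.in.flight.requests.per.connection must be at most 5")
    return problems


safe = {
    "enable.idempotence": "true",
    "acks": "all",
    "retries": 2147483647,
    "max.in.flight.requests.per.connection": 5,
}
risky = {"enable.idempotence": "true", "acks": "1", "retries": 0}

safe_report = idempotence_problems(safe)
risky_report = idempotence_problems(risky)
```

A check like this only covers producer retries. It says nothing about transactions or consumer-side `isolation.level`, which need their own review.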

NATS Core and NATS JetStream have different reliability postures. Core NATS is a fast message bus with at-most-once delivery. JetStream adds persistence, replay, acknowledgments, redelivery, retention policies, and publish acknowledgments.

| Goal | JetStream posture | Consumer posture | What it protects | What to design yourself |
| --- | --- | --- | --- | --- |
| Fast at-most-once | Core NATS or unacknowledged push consumer | Subscriber must be online and fast | Very low latency message distribution | Recovery after subscriber outage |
| At-least-once | JetStream publish acknowledgments and explicit consumer acknowledgments | Ack only after processing succeeds | Redelivery when processing fails or ack is lost | Idempotent consumer side effects |
| Publisher duplicate suppression | Set a unique message ID header for a deduplication window | Same as at-least-once | Retries that publish the same message ID | Duplicate sends outside the dedupe window |
| Stronger consumer acknowledgment | Use double-ack style confirmation where the client waits for the server to confirm the ack | Consumer treats unconfirmed ack as uncertain | Reduces false redelivery after ack uncertainty | External side effects still need idempotency |
| Work queue retention | Use work-queue retention when each message should be consumed by one worker group | Avoid overlapping consumers on the same subject | Queue-like distribution with persistence | Long-term replay to many independent teams |

JetStream’s model is often easier for request-adjacent messaging and edge deployments. Kafka’s model is often stronger for large analytic logs, long retention, and mature stream-processing ecosystems. The correct choice depends on the failure you are trying to make boring.
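The "duplicate sends outside the dedupe window" caveat in the table is easier to reason about with a toy model. The class below is a conceptual sketch, not how the NATS server implements deduplication: `DedupeWindow`, its method, and the 120-second window are hypothetical.

```python
class DedupeWindow:
    """Remember message IDs for a fixed window and reject repeats inside it."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.seen = {}  # message ID -> time it was first accepted

    def accept(self, msg_id, now):
        """Return True if the message is stored, False if it is a duplicate."""
        # Forget IDs that have aged out of the window.
        self.seen = {m: t for m, t in self.seen.items() if now - t < self.window}
        if msg_id in self.seen:
            return False
        self.seen[msg_id] = now
        return True


window = DedupeWindow(window_seconds=120)
first = window.accept("evt-1", now=0)         # original publish
quick_retry = window.accept("evt-1", now=30)  # retry inside the window
late_retry = window.accept("evt-1", now=200)  # retry after the window expired
```

The late retry is stored again, so consumers still need their own idempotency for publishers that can retry after long outages.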

At-least-once systems intentionally allow duplicates. That means the consumer must be safe to run twice.

For a database sink, the simplest pattern is to store the event ID with a unique constraint.

CREATE TABLE processed_events (
  event_id TEXT PRIMARY KEY,
  processed_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

Then the consumer makes the side effect conditional on the insert succeeding.

1. Begin database transaction.
2. Insert event_id into processed_events.
3. If the insert conflicts, the event was already processed: skip the business update and commit.
4. Otherwise, apply the business update.
5. Commit database transaction.
6. Acknowledge or commit the stream offset.

This pattern does not require a magic broker feature. It requires the sink to remember which event IDs already changed state.
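The steps above can be sketched with the standard-library `sqlite3` module standing in for PostgreSQL. The `order_totals` table, the `handle` function, and the event fields are hypothetical, and `INSERT OR IGNORE` is SQLite syntax (PostgreSQL would use `ON CONFLICT DO NOTHING`).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
conn.execute(
    "CREATE TABLE order_totals (order_id TEXT PRIMARY KEY, paid_cents INTEGER NOT NULL)"
)


def handle(conn, event):
    """Apply one event at most once; return True if state changed."""
    with conn:  # one transaction: commits on success, rolls back on error
        cursor = conn.execute(
            "INSERT OR IGNORE INTO processed_events (event_id) VALUES (?)",
            (event["id"],),
        )
        if cursor.rowcount == 0:
            return False  # duplicate delivery: skip the business update
        conn.execute(
            "INSERT OR IGNORE INTO order_totals (order_id, paid_cents) VALUES (?, 0)",
            (event["order_id"],),
        )
        conn.execute(
            "UPDATE order_totals SET paid_cents = paid_cents + ? WHERE order_id = ?",
            (event["cents"], event["order_id"]),
        )
        return True


event = {"id": "evt-1", "order_id": "order-10", "cents": 500}
first = handle(conn, event)    # applies the payment
second = handle(conn, event)   # redelivery is ignored
total = conn.execute(
    "SELECT paid_cents FROM order_totals WHERE order_id = 'order-10'"
).fetchone()[0]
```

The stream offset is acknowledged only after `handle` returns, so a crash between the database commit and the acknowledgment results in a harmless redelivery.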

Backpressure Is a Signal, Not a Panic Button

Backpressure means one part of the pipeline cannot keep up with another. In streaming, this is normal. The point of a durable log is that consumers can fall behind temporarily without forcing producers to stop immediately.

The question is whether lag is bounded and intentional.

Producer rate:  20,000 events/sec
Broker append:  20,000 events/sec
Consumer read:  18,000 events/sec
Sink write:     12,000 events/sec
Result:
  Broker looks healthy.
  Consumer lag grows.
  The real bottleneck is the sink.

A healthy operator asks where pressure is accumulating.

| Pressure location | Symptom | First metric to inspect | Common fix |
| --- | --- | --- | --- |
| Producer client | Sends block, time out, or receive throttle responses | Producer request latency, buffer pool wait, error rate | Batch better, slow producer, increase broker capacity, fix quota |
| Broker | Disk, network, or CPU saturation | Broker request queue, disk I/O, under-replicated partitions | Add brokers, rebalance partitions, tune retention, reduce hot keys |
| Consumer | Lag rises while broker remains healthy | Lag per partition, processing time, rebalance count | Scale consumers up to partition count, optimize handler, reduce rebalances |
| Downstream sink | Consumer processing time rises | Database latency, connection pool wait, write errors | Batch writes, add indexes carefully, isolate slow sinks |

Consumer lag is not automatically bad. Lag during a replay is expected. Lag after a planned downstream maintenance window is expected. Lag that grows during normal traffic and never drains is a capacity problem.
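Whether lag is bounded can be estimated with simple arithmetic on the rates from the example above. The helper names and the ten-million-event limit below are hypothetical; the rates are the ones quoted earlier in this section.

```python
def lag_growth_per_sec(produce_rate, consume_rate, sink_rate):
    """Lag grows by whatever the slowest downstream stage cannot absorb."""
    effective = min(consume_rate, sink_rate)
    return max(0, produce_rate - effective)


def seconds_until_lag(limit_events, growth_per_sec):
    """Time until lag reaches a limit, or None if lag is not growing."""
    if growth_per_sec <= 0:
        return None
    return limit_events / growth_per_sec


# Rates from the example: the sink, not the consumer read path, sets the pace.
growth = lag_growth_per_sec(20_000, 18_000, 12_000)

# Hypothetical alert threshold of ten million events of lag.
seconds = seconds_until_lag(10_000_000, growth)
```

At these rates lag grows by 8,000 events per second, so the hypothetical threshold is reached in 1,250 seconds, roughly 21 minutes. The same calculation against the retention window tells you how long you have before data is lost.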

A platform team receives an alert:

consumer_group=fraud-score-writer
topic=payments
lag=9,800,000
lag_trend=increasing
broker_cpu=42%
broker_disk_io=normal
consumer_rebalances=0
db_write_latency_p95=1800ms

The first instinct is to add Kafka brokers. That would not fix the incident.

The brokers are not saturated. The consumer group is stable. The database write latency is high.

A better response:

1. Confirm whether lag is spread across all partitions or concentrated in one hot partition.
2. Check consumer handler timing: deserialize, score, write, commit.
3. Inspect downstream database p95 and p99 write latency.
4. Temporarily increase consumer batch size if the sink benefits from batching.
5. Add consumers only if partitions are available and the sink can absorb more writes.
6. Throttle noncritical producers if retention risk appears.

Adding consumers without fixing the sink can make the outage worse by increasing database concurrency.
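The reasoning in this example can be captured as a rough first-pass triage function. This is a sketch, not a monitoring rule: `likely_bottleneck` is hypothetical, and the thresholds (85% CPU, 500 ms write latency) are illustrative assumptions that would need tuning for a real system.

```python
def likely_bottleneck(metrics):
    """Rough first guess at where pressure accumulates (illustrative thresholds)."""
    if metrics["broker_cpu_pct"] > 85 or metrics["broker_disk_io"] != "normal":
        return "broker"
    if metrics["consumer_rebalances"] > 0:
        return "consumer group stability"
    if metrics["db_write_latency_p95_ms"] > 500:
        return "downstream sink"
    return "consumer handler"


# Values taken from the alert in this example.
alert = {
    "broker_cpu_pct": 42,
    "broker_disk_io": "normal",
    "consumer_rebalances": 0,
    "db_write_latency_p95_ms": 1800,
}
verdict = likely_bottleneck(alert)
```

The ordering of the checks mirrors the narrative: rule out the broker and group instability first, then look downstream before blaming the handler.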

Kafka and NATS JetStream both expose pressure, but they guide operators differently.

| Backpressure concern | Kafka posture | NATS JetStream posture | Operator interpretation |
| --- | --- | --- | --- |
| Consumer lag | First-class concept through offsets per partition and consumer group lag | Consumer state tracks delivered and acknowledged positions; pull consumers make demand explicit | Kafka makes lag dashboards very natural; JetStream often makes per-consumer pending and ack metrics central |
| Producer throttling | Broker quotas can throttle producers or consumers; clients may block when buffers fill | Server limits, max pending, publish acknowledgments, and discard policies shape producer behavior | Kafka is quota-heavy for multi-tenant clusters; JetStream is often subject and stream limit heavy |
| Broker-side quotas | Quotas can be applied to clients, users, or IPs depending on deployment | Account limits, stream limits, and server resource limits are the common levers | Both need intentional tenancy design before noisy neighbors appear |
| Slow consumer behavior | Lag accumulates in the log until retention removes old records | Push consumers can hit pending limits; pull consumers naturally request what they can handle | Pull-based JetStream consumers are often easier for worker-style backpressure |
| Replay pressure | Replays can generate high read load across brokers | Replays can be instant or original-rate depending on consumer policy | Replays need SLOs; do not let a backfill starve production consumers |

Backpressure handling is an operations contract. The producing team, platform team, and consuming team should agree on these values before the launch:

  • Maximum acceptable lag by consumer group.
  • Retention window large enough to survive expected outages.
  • Whether noncritical producers may be throttled first.
  • Which consumers are allowed to replay during peak hours.
  • What happens when a sink is down longer than retention.

Retention is often configured as if it were storage housekeeping. It is really product policy.

If the stream is the source of truth for rebuilding a projection, retention defines how far back you can recover without another system. If the stream feeds compliance analytics, retention defines what the audit can prove. If the stream contains sensitive data, retention also defines how long risk remains on disk.

There are three common retention modes.

| Retention mode | How it works | Choose it when | Avoid it when |
| --- | --- | --- | --- |
| Time-based | Keep records for a duration such as seven days or ninety days. | Consumers need a recovery window measured in time. | Storage cost or privacy rules require tighter bounds. |
| Size-based | Keep records until the log reaches a byte limit. | Storage budget is the hard cap and traffic is predictable. | Traffic spikes could delete history earlier than consumers expect. |
| Log-compacted | Keep the latest record for each key, plus some recent history depending on broker behavior. | Consumers need the latest state per key, such as account settings or feature flags. | Every historical transition matters for audit or analytics. |

Time-based retention answers, “How long can a consumer be down?” Size-based retention answers, “How much storage can this stream consume?” Compaction answers, “What is the latest value for each key?”
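Compaction is the least intuitive of the three, so here is a minimal model of its end state. The `compact` function and the sample records are hypothetical, and a real broker compacts in the background and may keep recent superseded records for a while.

```python
def compact(records):
    """Keep only the latest value per key, preserving log order of survivors."""
    latest = {}
    for offset, (key, value) in enumerate(records):
        latest[key] = (offset, value)  # later offsets overwrite earlier ones
    survivors = sorted(latest.items(), key=lambda item: item[1][0])
    return [(key, value) for key, (offset, value) in survivors]


log = [
    ("user-1", "theme=dark"),
    ("user-2", "theme=light"),
    ("user-1", "theme=solarized"),
]
compacted = compact(log)
```

The first `user-1` record is gone after compaction, which is fine for current-state consumers and unacceptable for anyone who needs the full history of changes.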

Three teams ask for event storage.

| Stream | Need | Retention choice | Reasoning |
| --- | --- | --- | --- |
| orders.events | Rebuild operational projections after a bad deploy. | Time-based, long enough for rollback and incident response. | Every transition matters, so compaction would lose history. |
| user.preferences | New services need the latest preference for each user. | Log-compacted by user ID. | Consumers care about current state, not every toggle. |
| edge.metrics.raw | High-volume device metrics used for near-real-time alerts. | Size-based or short time-based retention. | Long-term analytics should be downsampled into another store. |

The mistake is configuring all three the same way. They carry different business promises.

If retention is seven days and the warehouse consumer is down for eight days, the stream cannot fully rebuild the warehouse. The missing data may still exist in another source, but the stream contract failed.

That is why production stream reviews should include this question:

What is the longest credible outage for each consumer,
and does retention exceed that outage plus detection and repair time?

If the answer is no, increase retention, add a cold archive, or stop claiming the stream can rebuild that projection.
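The review question reduces to one comparison. The function name and the detection and repair figures below are hypothetical; the seven-day retention and eight-day outage come from the example above.

```python
from datetime import timedelta


def retention_covers(retention, outage, detection, repair):
    """True if the stream still holds everything a recovering consumer needs."""
    return retention >= outage + detection + repair


# The warehouse example: seven days of retention, eight days of downtime.
warehouse_ok = retention_covers(
    retention=timedelta(days=7),
    outage=timedelta(days=8),
    detection=timedelta(0),
    repair=timedelta(0),
)

# A hypothetical stream with headroom for detection and repair time.
orders_ok = retention_covers(
    retention=timedelta(days=14),
    outage=timedelta(days=8),
    detection=timedelta(days=1),
    repair=timedelta(days=2),
)
```

Running this check per consumer group, rather than per stream, surfaces the one slow-to-recover consumer that sets the real retention requirement.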

Kafka and NATS JetStream overlap, but they do not feel the same to operate. The difference is not simply “big versus small.” It is the shape of the guarantees, clients, and operational ecosystem.

Kafka is a distributed commit log with a large data engineering ecosystem. It shines when teams need high-throughput durable logs, partitioned replay, long retention, stream processing integrations, schema governance, and mature Kubernetes automation through Strimzi.

NATS is a connective technology built around subjects, low latency, request/reply, and simple service communication. JetStream adds persistence and replay without turning NATS into a Kafka clone. It shines when teams want lightweight messaging, edge-to-cloud deployments, pull-based work distribution, request-adjacent workflows, or lower operational surface area.

| Criterion | Kafka posture | NATS JetStream posture | Default choice when this dominates |
| --- | --- | --- | --- |
| Throughput | Built for sustained high-throughput logs across many partitions and brokers. | Very fast for messaging; persistent streams can scale well but are usually chosen for simpler operational shape and lower latency. | Kafka for very large analytic event logs. |
| Latency | Low enough for most pipelines, but batching and replication are usually tuned for throughput and durability. | Often chosen for very low-latency service messaging and request/reply with optional persistence. | NATS when tail latency matters more than replay depth. |
| Ordering model | Strong order within each partition; key choice controls business order. | Order is natural within a stream or ordered consumer view; queue and pull patterns can distribute work without Kafka-style partitions. | Kafka when per-key partitioned logs are the core abstraction. |
| Operational complexity | More moving parts: brokers, controllers, storage, topic configs, rebalances, quotas, schema ecosystem. | Smaller operational surface for many cases; JetStream still requires storage, replicas, limits, and careful stream design. | NATS for smaller platform teams or edge clusters. |
| Kubernetes operator maturity | Strimzi is a mature, widely used Kafka operator with Kafka, topic, user, and connector patterns. | Official NATS Helm charts are common; JetStream stream and consumer CRDs exist, but the ecosystem is narrower than Strimzi’s Kafka lifecycle coverage. | Kafka when K8s-native Kafka lifecycle automation is a hard requirement. |
| Ecosystem | Kafka Connect, Kafka Streams, Flink, Spark, Schema Registry patterns, lakehouse ingestion. | NATS services, request/reply, key-value, object store, edge messaging, simple client model. | Match the surrounding platform more than the benchmark. |
| Failure posture | Durable log first; consumers can be far behind if retention permits. | Messaging fabric first with optional durable streams and explicit ack behavior. | Kafka for replay-heavy data platforms; NATS for service communication with selective durability. |

No table replaces a proof of concept. But the table prevents a common mistake: choosing Kafka for every message or choosing NATS for a long-retention analytic backbone because the first demo was simpler.

Kubernetes Manifests: Same Intent, Different Posture

In Strimzi, a topic is an explicit Kubernetes resource. This example sets partitions, replicas, and retention.

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: orders-events
  labels:
    strimzi.io/cluster: data-kafka
spec:
  partitions: 6
  replicas: 3
  config:
    retention.ms: 604800000  # 7 days
    cleanup.policy: delete
    min.insync.replicas: 2

In JetStream, a stream captures subjects and applies retention limits. The exact CRD version can change with the installed controller, so operators should verify it against their cluster.

apiVersion: jetstream.nats.io/v1beta2
kind: Stream
metadata:
  name: orders-events
spec:
  name: ORDERS
  subjects:
    - orders.>
  storage: file
  replicas: 3
  retention: limits
  maxAge: 168h  # 7 days
  discard: old

Both examples are “streaming.” They express different operator instincts. Kafka asks you to reason about partitions early. JetStream asks you to reason about subjects, stream capture, consumers, and limits early.

Streaming is attractive because it sounds future-proof. It can also be the wrong tool.

The anti-pattern is not “using Kafka” or “using NATS.” The anti-pattern is making a simple coordination problem harder by adding replay, partitioning, offset management, retention, and eventual consistency without a business need.

Anti-Pattern: Synchronous Request/Response Hidden in a Stream

If a user clicks “Place Order” and the API must return a payment authorization immediately, a stream is usually not the request path. Use synchronous HTTP, gRPC, or NATS request/reply for the decision that must happen now. Publish an event after the decision.

Good shape:
Checkout API --sync--> Payment API
Checkout API --append--> OrderCreated stream
Payment API --append--> PaymentAuthorized stream

The stream records the fact. It does not pretend an asynchronous consumer is part of the immediate user response.

Anti-Pattern: Single Durable Worker With No Replay Need

If there is exactly one logical consumer and messages should disappear after work completes, a durable queue may be a better fit.

Examples:

  • Resize uploaded images.
  • Send a password reset email.
  • Run one background export job.
  • Trigger a one-off cache warm task.

You can implement these with a stream. But if nobody needs fan-out or replay, the stream’s extra concepts may not pay rent.

A stream is not a replacement for the database that enforces current transactional state.

Do not ask a stream to answer:

  • “Is this username available right now?”
  • “Can this account spend this balance?”
  • “Did this unique invoice number already get issued?”
  • “Which row should this transaction lock?”

Those are OLTP questions. Use a database transaction. Then publish the event describing what happened.

Anti-Pattern: Infinite Retention Without Ownership

Keeping everything forever sounds safe until storage costs, privacy obligations, and schema drift arrive. Every retained stream needs an owner, deletion policy, schema compatibility policy, and replay expectation.

If no team owns those decisions, the stream becomes an expensive archive nobody trusts.

  • Kafka’s exactly-once story is scoped to cooperating clients. Idempotent producers and transactions are powerful, but consumers that call non-transactional external APIs still need idempotency keys or compensating logic.
  • NATS JetStream can replay at original speed. That makes it useful for staging traffic against a test consumer without flooding it faster than production originally published.
  • Consumer lag can be healthy. A replaying consumer, a paused backfill, or a planned downstream outage can create lag that is expected and bounded.
  • Log compaction is not archival. It preserves the latest value per key for state reconstruction, but it is the wrong mode when every historical transition must be audited.

| Mistake | Why it hurts | Better move |
| --- | --- | --- |
| Assuming Kafka ordering is global | Different partitions advance independently, so cross-key order is not guaranteed. | Choose a key that matches the business invariant that must stay ordered. |
| Increasing partitions as a casual scaling fix | Key-to-partition mappings can change, splitting old and new records for the same key. | Treat repartitioning as a migration with a new topic or explicit boundary. |
| Treating exactly-once as an end-to-end checkbox | Broker transactions do not make emails, webhooks, or arbitrary database writes exactly-once. | Use idempotency keys and transactional sink design. |
| Alerting on raw lag without context | A planned replay can look like an outage, while a small lag on a critical stream can be serious. | Alert on lag age, trend, consumer purpose, and retention risk. |
| Using streaming for synchronous decisions | Users wait on an asynchronous path that is harder to debug and time out correctly. | Make the immediate decision synchronously, then publish the fact. |
| Setting retention by storage budget only | Consumers may be unable to recover after an outage that exceeds the retained window. | Size retention from recovery objectives, then manage cost explicitly. |
| Choosing random keys for business events | Throughput improves, but related events can process out of order. | Use random keys only when events are genuinely independent. |
| Running every workload through one giant topic | Schemas, retention, ownership, and access control become tangled. | Split streams by domain contract and lifecycle. |
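
The repartitioning mistake is easier to see with a small simulation. The sketch below is an illustration only: `partition_for` is a hypothetical stand-in that hashes with CRC32, not the hash that Kafka's default partitioner uses, and the key names are invented. It shows the general effect of any hash-modulo scheme: changing the partition count changes where some keys land.

```python
import zlib


def partition_for(key: str, partitions: int) -> int:
    """Stand-in partitioner: stable hash of the key modulo the partition count."""
    return zlib.crc32(key.encode("utf-8")) % partitions


def moved_keys(keys, old_count, new_count):
    """Return the keys whose partition differs after the partition count changes."""
    return [
        k for k in keys
        if partition_for(k, old_count) != partition_for(k, new_count)
    ]


# Hypothetical order keys, used only for the demonstration.
keys = [f"order-{i}" for i in range(1000)]
moved = moved_keys(keys, 6, 12)
print(f"{len(moved)} of {len(keys)} keys map to a different partition after 6 -> 12")
```

Every moved key has older records on one partition and newer records on another, so two workers can process the same key concurrently. That is why the table recommends a new topic or an explicit boundary rather than an in-place partition increase.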

Your team deployed a Kafka consumer that writes payment summaries to PostgreSQL. Consumer lag is climbing, broker CPU is normal, and database write latency increased after a new index was added. What do you check first?

Answer

Start with the consumer handler and the downstream database, not the broker. The broker is not showing saturation, while the sink latency changed at the same time lag started rising. Check write p95/p99, connection pool waits, transaction duration, batch size, and whether the new index made each write more expensive. Adding brokers or consumers may increase pressure on the database and make the incident worse.

A product team says all events in the company must be processed “in the exact order they happened.” They also expect high throughput and independent scaling by customer. How do you respond?

Answer

Clarify which business entities require ordering. High-throughput streaming systems usually provide order within a partition or stream sequence, not a single global order across every entity. If customer-level correctness matters, key by customer. If order-level correctness matters, key by order. A true global sequence would require a single ordered bottleneck or a separate sequencing service, which trades away throughput and availability.

A Kafka producer uses retries and acks=all, but enable.idempotence was disabled because an old client config conflicted with it. During a broker leadership change, the team sees duplicate records in the topic. What change reduces this class of duplicate?

Answer

Enable idempotent production with compatible settings. Idempotence prevents producer retry duplicates within the producer session when the broker actually received the first attempt but the producer did not see the acknowledgment. The team should also verify acks=all, retries, and allowable in-flight request settings. This does not remove the need for idempotent consumers because application-level resends can still create duplicates.
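
A pre-deployment check can catch this kind of config conflict before it reaches production. The sketch below is a hypothetical helper, not part of any Kafka client. It assumes the producer settings are available as a plain dictionary using the standard Kafka property names, and it deliberately requires every setting to be explicit because client defaults differ between versions.

```python
REQUIRED = (
    "enable.idempotence",
    "acks",
    "retries",
    "max.in.flight.requests.per.connection",
)


def idempotence_conflicts(config):
    """Return the reasons this producer config cannot run idempotently."""
    missing = [f"{name} is not set explicitly" for name in REQUIRED if name not in config]
    if missing:
        return missing
    problems = []
    if str(config["enable.idempotence"]).lower() != "true":
        problems.append("enable.idempotence must be true")
    if str(config["acks"]).lower() not in ("all", "-1"):
        problems.append("acks must be all")
    if int(config["retries"]) <= 0:
        problems.append("retries must be greater than 0")
    if int(config["max.in.flight.requests.per.connection"]) > 5:
        problems.append("max.in.flight.requests.per.connection must be 5 or lower")
    return problems


# Illustrative values only; the real legacy config is not shown in this module.
old_config = {
    "enable.idempotence": "false",
    "acks": "all",
    "retries": "10",
    "max.in.flight.requests.per.connection": "10",
}
fixed_config = {
    **old_config,
    "enable.idempotence": "true",
    "max.in.flight.requests.per.connection": "5",
}
print(idempotence_conflicts(old_config))
print(idempotence_conflicts(fixed_config))
```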

A NATS JetStream consumer sends a webhook to a partner API and then crashes before acknowledging the message. After restart, the webhook is sent again. Which guarantee failed?

Answer

The broker did what at-least-once delivery allows: it redelivered a message that was not acknowledged. The missing guarantee is idempotency at the external side effect. The consumer should send a stable idempotency key, store processed event IDs, or use a partner API that deduplicates requests. JetStream acknowledgments protect message processing progress, not arbitrary external effects.

The warehouse consumer has been down for nine days. The topic retains seven days of events. The team wants to replay from the last committed offset. What is the risk?

Answer

The earliest needed records may have aged out of retention. The consumer can no longer rebuild purely from the stream. The team needs another source such as an archive, database snapshot, or upstream replay. After recovery, retention should be changed to exceed the longest credible outage plus detection and repair time, or the team should stop promising full replay from that stream.

A team uses tenantId as the partition key. One enterprise tenant suddenly sends most of the traffic, and only one partition is lagging. What happened, and what are the options?

Answer

The tenant became a hot key. All events for that tenant are correctly ordered, but they concentrate on one partition. Options include splitting that tenant’s stream by a narrower key if the business can tolerate weaker tenant-wide order, creating a dedicated stream for large tenants, adding a routing version, or keeping the key and scaling the slow consumer logic. Adding more consumers cannot parallelize one partition beyond its ordering limit.
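
Two of those options can be sketched in a few lines. The helper names and the `orderId` field below are assumptions made for illustration. `hottest_key` measures how much traffic one key carries, and `routing_key` shows the "narrower key" option, which trades tenant-wide ordering for per-order ordering on the tenants you list as hot.

```python
from collections import Counter


def hottest_key(keys):
    """Return (key, share_of_traffic) for the most frequent partition key."""
    counts = Counter(keys)
    key, count = counts.most_common(1)[0]
    return key, count / len(keys)


def routing_key(event, hot_tenants):
    """Use a narrower key for known hot tenants.

    This gives up tenant-wide ordering for those tenants and keeps
    ordering only within a single order.
    """
    if event["tenantId"] in hot_tenants:
        return f'{event["tenantId"]}:{event["orderId"]}'
    return event["tenantId"]


# Invented traffic sample: one tenant dominates.
sample = ["tenant-a"] * 80 + ["tenant-b"] * 10 + ["tenant-c"] * 10
print(hottest_key(sample))
print(routing_key({"tenantId": "tenant-a", "orderId": "o-17"}, {"tenant-a"}))
print(routing_key({"tenantId": "tenant-b", "orderId": "o-18"}, {"tenant-a"}))
```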

A platform team is choosing between Kafka and NATS JetStream for edge clusters that need low-latency command messages, occasional durable replay, and a small operations team. Which way do you lean, and what caveat do you give?

Answer

Lean toward NATS with JetStream if the main need is low-latency service messaging with selective persistence and simpler operations. The caveat is to prove the retention, acknowledgment, stream limits, and replication behavior under failure before committing. If the workload turns into a long-retention analytic backbone with many independent replaying teams, Kafka may become a better fit.

A team wants to use log compaction for an OrderStatusChanged stream because it saves storage. Auditors require the full sequence of status transitions for each order. Is compaction appropriate?

Answer

Not as the audit stream. Compaction keeps the latest value per key and can remove earlier transitions. It is useful for rebuilding current state, such as the latest order status, but it is not appropriate when every transition is evidence. The team can keep an uncompacted audit stream and optionally derive a compacted current-state stream from it.
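
The difference is easy to demonstrate with a toy log. The sketch below is a simplified model, not broker behavior: `compact` is a hypothetical function that keeps only the last value per key, which is the end state compaction converges toward, and the status names are invented.

```python
def compact(log):
    """Keep only the latest value per key, discarding earlier values."""
    latest = {}
    for key, value in log:
        latest[key] = value
    return latest


def transitions(log, key):
    """Return the full ordered history for one key from an uncompacted log."""
    return [value for k, value in log if k == key]


audit_log = [
    ("order-1", "CREATED"),
    ("order-1", "PAID"),
    ("order-2", "CREATED"),
    ("order-1", "REFUNDED"),
]

print(compact(audit_log))                 # current state only
print(transitions(audit_log, "order-1"))  # evidence the auditors need
```

The compacted view answers "what is the status now?" The uncompacted log answers "how did it get there?" Only the second satisfies the audit requirement.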

In this exercise, you will design the streaming contract for a small commerce platform before writing any deployment YAML. The goal is to practice the mental model: log ownership, key choice, delivery semantics, retention, and backpressure.

You operate three services:

  • checkout-api creates orders and requests payment authorization.
  • payment-worker records payment outcomes from a provider.
  • warehouse-projector builds a read model for support agents.

The business requirements are:

  • Support agents must see the correct final state for each order.
  • Fraud analysts need to replay the last thirty days of payment events into experiments.
  • Password reset and checkout request paths must remain synchronous.
  • Warehouse projection can tolerate duplicates but not missing confirmed payments.
  • A regional database outage may last up to two days.

Create a short design note with at least two streams. For each stream, write the owner, event types, and consumers.

Use this template:

streams:
  - name: orders.events
    owner: checkout-platform
    eventTypes:
      - OrderCreated
      - OrderCancelled
      - OrderRefunded
    primaryConsumers:
      - warehouse-projector
      - support-search-indexer
  - name: payments.events
    owner: payments-platform
    eventTypes:
      - PaymentAuthorized
      - PaymentCaptured
      - PaymentFailed
    primaryConsumers:
      - warehouse-projector
      - fraud-experiment-runner

For each stream, choose a key and explain the invariant it protects.

Example:

partitioning:
  orders.events:
    key: orderId
    protects: "All lifecycle events for one order are processed in order."
  payments.events:
    key: orderId
    protects: "Payment outcomes for one order are applied in order to the warehouse projection."

If you choose customerId instead, explain what improves and what gets worse.

Write the target semantics for each consumer. Do not just write “exactly-once.” Name the duplicate or loss risk and how the sink handles it.

consumers:
  warehouse-projector:
    target: at-least-once with idempotent database writes
    duplicateDefense: "Unique event_id table inside the same database transaction as the projection update."
    offsetRule: "Commit or acknowledge after the database transaction commits."
  fraud-experiment-runner:
    target: replayable at-least-once
    duplicateDefense: "Experiment jobs deduplicate by event_id before feature generation."
    offsetRule: "Use separate consumer group for each experiment run."
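
The `duplicateDefense` for warehouse-projector can be sketched with an in-memory SQLite database. This is a minimal illustration under assumptions: the table names, column names, and event shape are invented, and a real projector would use its production database and driver. What matters is that the dedup insert and the projection update share one transaction.

```python
import sqlite3


def open_db():
    """Create an in-memory database with a dedup table and a projection table."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
    db.execute(
        "CREATE TABLE order_projection (order_id TEXT PRIMARY KEY, payment_status TEXT)"
    )
    db.commit()
    return db


def apply_event(db, event):
    """Apply one event. Return True if applied, False if it was a duplicate."""
    try:
        with db:  # one transaction: commits on success, rolls back on error
            # The primary key rejects an event_id that was already processed.
            db.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)",
                (event["event_id"],),
            )
            db.execute(
                "INSERT OR REPLACE INTO order_projection (order_id, payment_status) "
                "VALUES (?, ?)",
                (event["order_id"], event["status"]),
            )
        return True
    except sqlite3.IntegrityError:
        return False


db = open_db()
authorized = {"event_id": "evt-1", "order_id": "order-1", "status": "PaymentAuthorized"}
captured = {"event_id": "evt-2", "order_id": "order-1", "status": "PaymentCaptured"}
print(apply_event(db, authorized))
print(apply_event(db, captured))
print(apply_event(db, authorized))  # redelivery of an already processed event
```

In either case the caller can commit the offset or acknowledge the message after `apply_event` returns, because a duplicate has already been applied durably. A redelivered older event is rejected before it can overwrite the newer projection state.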

Choose retention for each stream. Your answer must mention the thirty-day fraud replay and two-day outage requirement.

retention:
  orders.events:
    mode: time
    minimum: 7d
    reason: "Operational recovery exceeds the expected regional database outage plus repair time."
  payments.events:
    mode: time
    minimum: 35d
    reason: "Fraud analysts need thirty days of replay plus a buffer for detection and rerun time."
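
The arithmetic behind those minimums can be written down so that reviewers can challenge the inputs. In this sketch, `minimum_retention` is a hypothetical helper. The detection time, repair time, and five-day replay buffer are assumed values chosen for illustration; only the two-day outage and thirty-day replay come from the scenario.

```python
from datetime import timedelta


def minimum_retention(outage, detection, repair,
                      replay_window=timedelta(0), replay_buffer=timedelta(0)):
    """Return the larger of the recovery need and the replay need."""
    recovery_need = outage + detection + repair
    replay_need = replay_window + replay_buffer
    return max(recovery_need, replay_need)


# Assumed: 12h to detect the outage and 1 day to repair and restart consumers.
orders_floor = minimum_retention(
    outage=timedelta(days=2),
    detection=timedelta(hours=12),
    repair=timedelta(days=1),
)

# Assumed: the same recovery inputs plus a 5-day buffer on the 30-day replay.
payments_floor = minimum_retention(
    outage=timedelta(days=2),
    detection=timedelta(hours=12),
    repair=timedelta(days=1),
    replay_window=timedelta(days=30),
    replay_buffer=timedelta(days=5),
)

print(orders_floor, payments_floor)
```

Under these assumptions, the 7d setting on orders.events leaves headroom above the computed floor, while payments.events is driven entirely by the replay requirement.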

Add alerts that separate broker pressure from consumer pressure.

alerts:
  - name: warehouse-lag-age
    signal: "Oldest unprocessed event age for warehouse-projector"
    pageWhen: "Age exceeds 2h during normal traffic"
  - name: retention-risk
    signal: "Oldest required consumer offset is close to stream retention boundary"
    pageWhen: "Remaining replay window is less than 24h"
  - name: producer-throttle-rate
    signal: "Producer receives sustained broker throttle responses"
    pageWhen: "Throttle persists for 10m on critical producers"
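
The first two alerts derive from the same input: the timestamp of the oldest event the consumer has not yet processed. The sketch below assumes that timestamp is available from your metrics pipeline. The function names are hypothetical and the thresholds mirror the example above.

```python
from datetime import datetime, timedelta, timezone


def lag_age(oldest_unprocessed, now):
    """How long the oldest unprocessed event has been waiting."""
    return now - oldest_unprocessed


def remaining_replay_window(oldest_unprocessed, retention, now):
    """Time left before the oldest unprocessed event ages out of retention."""
    return (oldest_unprocessed + retention) - now


def fired_alerts(oldest_unprocessed, retention, now,
                 max_lag_age=timedelta(hours=2),
                 min_window=timedelta(hours=24)):
    """Return the names of the alerts that should page for this consumer."""
    fired = []
    if lag_age(oldest_unprocessed, now) > max_lag_age:
        fired.append("warehouse-lag-age")
    if remaining_replay_window(oldest_unprocessed, retention, now) < min_window:
        fired.append("retention-risk")
    return fired


now = datetime(2025, 1, 10, 12, 0, tzinfo=timezone.utc)
retention = timedelta(days=7)
print(fired_alerts(now - timedelta(minutes=10), retention, now))
print(fired_alerts(now - timedelta(hours=3), retention, now))
print(fired_alerts(now - timedelta(days=6, hours=6), retention, now))
```

Keeping the two signals separate matters: a three-hour lag is a consumer problem to investigate, while a shrinking replay window is a data-loss deadline.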

Write a short decision record. Use Kafka if your priority is long-retention analytic replay, large fan-out, and mature Strimzi operations. Use NATS JetStream if your priority is low-latency service messaging, smaller operational surface, and selective persistence.

Your record should include:

  • The platform you choose.
  • The reason tied to this scenario.
  • The risk that would make you revisit the decision.
  • The first failure test you would run.

Your design is complete when:

  • The design names at least two streams and their owning teams.
  • Each stream has a partition key tied to a business ordering invariant.
  • Each consumer has a delivery target and an idempotency strategy.
  • Retention covers the stated replay and outage requirements.
  • Backpressure alerts distinguish lag, retention risk, and producer throttling.
  • The Kafka vs NATS decision explains operator posture, not only feature preference.
  • The design explicitly keeps synchronous request/response paths out of the stream.

Continue to Module 1.2: Apache Kafka on Kubernetes (Strimzi) to deploy and operate Kafka after you understand the streaming mental model.