Module 3.15: Azure Event Hub + Event Grid — Operator Path
Цей контент ще не доступний вашою мовою.
Complexity: [COMPLEX]
Time: 90-120 min
Prereqs: 3.1-entra-id, 3.2-vnet, 3.9-key-vault, 3.10-monitor
What You’ll Be Able to Do
Section titled “What You’ll Be Able to Do”After completing this module, you will be able to:
- Compare Event Hubs, Event Grid, Service Bus, Storage Queues, and Apache Kafka on AKS by matching each service to ordering, replay, fan-out, transaction, and ownership requirements.
- Design a secure Azure event pipeline with managed identity, RBAC, private endpoints, Capture, delivery retry, dead-lettering, and CloudEvents payloads.
- Diagnose Event Hubs throttling, consumer lag, Capture gaps, Event Grid delivery failures, and retry storms by using Azure Monitor metrics and Log Analytics queries.
- Implement an end-to-end Event Hubs Capture plus Event Grid notification flow with Azure CLI, a Python producer, and a webhook test endpoint.
- Evaluate the cost impact of throughput units, processing units, capture windows, retained blobs, Event Grid operations, namespace throughput, and cross-region replication.
The operator path is about choosing the right event service before an incident chooses it for you. Azure has several queues, brokers, and event routers because event-driven systems are not one shape. A telemetry firehose from thousands of devices, a storage-account blob-created notification, a command that must be processed once in order, and a developer-platform audit event all look like “messages” from far away. From close range they are different operational contracts, and the wrong contract creates expensive ambiguity.
Why This Module Matters
Section titled “Why This Module Matters”Exercise scenario: a platform team receives three requests in the same planning meeting. The data team wants to ingest ten thousand telemetry records per second and replay a failed analytics job from yesterday. The application team wants a storage-account notification to call a webhook whenever a report blob lands. The finance team wants invoice commands processed exactly once, in order, with a dead-letter queue and manual replay. All three teams say they need “events”, but only the first request naturally belongs on Event Hubs, only the second is a clean Event Grid fit, and the third is closer to Service Bus.
This distinction matters because managed event services fail in different ways. Event Hubs fails operationally when partitions are undersized, throughput units are capped too low, consumer groups are overloaded, checkpoints stall, or retention is shorter than the replay window. Event Grid fails operationally when a handler is down, filters match too broadly, validation is not handled, retries multiply delivery operations, or dead-letter storage is missing. Service Bus fails operationally when lock duration, sessions, duplicate detection, transactions, or dead-letter handling are misunderstood. The right service makes the incident obvious; the wrong service turns every event into a forensic puzzle.
The Event Hubs overview describes Event Hubs as a scalable event ingestion service, while the Event Grid overview describes Event Grid as a highly scalable event routing service. Those words are easy to skim past, but they are the first split in the operator mental model. Event Hubs is where a stream lives long enough to be read, replayed, partitioned, and analyzed. Event Grid is where an event is routed to interested handlers and retried if delivery fails.
Use this picture as the simplest map:
continuous stream discrete eventhigh volume low to moderate volumeordered per partition fan-out to handlersreader controls pace publisher pushes to routerconsumer checkpoints delivery attempts and dead-letterreplay from retention no stream replay contract
+------------------+ +------------------+ | Event Hubs | | Event Grid | | stream ingestion | | event routing | +--------+---------+ +--------+---------+ | | v v analytics, Kafka clients, Functions, webhooks, queues, Capture, Stream Analytics Event Hubs, Service BusPause and predict: a producer emits one million sensor readings each minute, and a downstream job must replay the last six hours after a bug fix. Would Event Grid or Event Hubs make the operator’s recovery easier, and what property drives the answer?
The rest of the module treats Event Hubs and Event Grid side by side. That is deliberate. Operators do not receive tickets that say “please create the correct service”. They receive workload requirements, cost constraints, security rules, and incident symptoms. Your job is to translate those signals into the right managed event contract.
Why Event-Driven Azure Has More Than One Service
Section titled “Why Event-Driven Azure Has More Than One Service”Event-driven architecture is not only about decoupling. It is about choosing where time, order, pressure, and responsibility live. If the producer writes faster than the consumer reads, someone must absorb that pressure. If a handler is offline, someone must decide how long to retry. If a record is malformed, someone must keep enough evidence to debug it. If a consumer deploys a bad version, someone must decide whether replay is available.
Event Hubs and Event Grid answer those questions differently. Event Hubs gives producers a durable append path and gives consumers a pull-based stream. Consumers read from partitions, maintain checkpoints, and can fall behind without forcing the producer to wait. Event Grid gives publishers a routing layer and gives subscribers delivery attempts. Subscribers receive discrete notifications through handlers such as Azure Functions, Logic Apps, webhooks, Event Hubs, Service Bus, Storage Queues, and Hybrid Connections.
The Azure messaging comparison is worth reading because it prevents a common mistake: treating every message-shaped problem as Event Grid. Service Bus exists because enterprise commands need ordering, sessions, transactions, duplicate detection, lock renewal, scheduled delivery, and a first-class dead-letter queue. Storage Queues exist because some workloads need a simple, low-cost queue with huge storage capacity and fewer broker semantics. Event Grid exists because many systems only need a small event to say “something happened over there”. Event Hubs exists because streams need ingestion, retention, partitioned reads, replay, and analytics integration.
Streaming Versus Eventing
Section titled “Streaming Versus Eventing”Streaming is a time-ordered flow of records. The data often has analytic value as a sequence, not just as individual messages. You care about partition keys because order is only guaranteed inside a partition. You care about retention because the stream is a recovery tool. You care about consumer lag because a reader can be healthy but behind. You care about throughput because one weak namespace can throttle every producer.
Eventing is a notification pattern. The event usually says that a resource changed, a workflow step completed, or an integration point should react. You care about filters because a handler should receive only relevant events. You care about delivery retries because the router pushes to subscribers. You care about validation because webhook endpoints must prove they own the address. You care about dead-lettering because a handler outage should not erase evidence.
+---------------------+------------------------------+------------------------------+| Operator question | Streaming answer | Eventing answer |+---------------------+------------------------------+------------------------------+| Who controls pace? | Consumer reads at its pace | Router delivers to handler || What is ordered? | Records inside a partition | No broad ordering contract || How do I recover? | Re-read retained records | Retry or dead-letter events || Where is pressure? | Namespace throughput and lag | Handler health and retries || Common Azure fit | Event Hubs | Event Grid |+---------------------+------------------------------+------------------------------+Where Service Bus and Storage Queues Fit
Section titled “Where Service Bus and Storage Queues Fit”Service Bus is not a footnote. It is the first candidate when the unit is a business command rather than a telemetry record or notification. If a payment command, invoice command, or provisioning command must be processed once with lock renewal, sessions, transaction boundaries, and a dead-letter queue, Service Bus is the safer default. Event Grid can route a notification to Service Bus, but Event Grid should not be asked to become Service Bus by convention.
Storage Queues are useful when the message contract is simple and cost or storage capacity matters more than broker features. They are common in low-complexity worker patterns, simple retry loops, or systems that already depend heavily on Azure Storage. They do not replace Event Hubs for replayable streams or Event Grid for native Azure event routing.
Worked Example: Four Requirements, Four Choices
Section titled “Worked Example: Four Requirements, Four Choices”A fleet of point-of-sale devices emits one-kilobyte telemetry records continuously. The analytics team wants to replay yesterday’s records after fixing a parsing bug. Choose Event Hubs because retention, partitions, consumer groups, and replay are core to the service.
A storage account receives a nightly export file and must notify a compliance workflow. Choose Event Grid because the storage service already emits a blob-created event, and the workflow only needs a discrete notification.
An order command must preserve customer-level order and must move to a dead-letter queue after repeated processing failures. Choose Service Bus because sessions, lock renewal, and DLQ handling are the operational contract.
A research platform already runs Strimzi on AKS, requires custom Kafka broker configuration, and has a team that owns broker upgrades. Kafka on AKS may be justified because control and ecosystem compatibility matter more than managed-service convenience.
Now you solve it: your platform receives a request for “events” from a team that needs webhook fan-out to three SaaS endpoints, no replay, low message volume, and detailed failure evidence when an endpoint is down. Which service would you start with, and which feature must be configured before production?
Event Hubs Deep Dive
Section titled “Event Hubs Deep Dive”Event Hubs is easiest to operate when you stop thinking of it as a queue. It is a partitioned commit log with managed ingestion, retention, and consumer coordination. Producers append events. Consumers read from partitions. Each consumer group has its own view of progress. Checkpoint storage records where a reader stopped so it can continue after a restart.
The Event Hubs scalability guidance ties the model to throughput units, processing units, and capacity units. Those units matter because Event Hubs is often placed at the front door of a data system. If the front door throttles, downstream services may look idle while producers fail.
Partitions and Partition Keys
Section titled “Partitions and Partition Keys”A partition is an ordered sequence inside
an event hub. Event Hubs guarantees
ordering within a partition, not across
the entire event hub. That means a
partition key is an operational decision,
not a cosmetic field. If all records for
device-123 use the same partition key,
those records stay ordered relative to
each other. If every record uses the same
key, one partition becomes hot while other
partitions sit idle. If no key is used,
Event Hubs distributes records across
partitions, which is often good for
throughput but removes affinity.
event hub: telemetry
partition 0: device-a, device-a, device-apartition 1: device-b, device-c, device-bpartition 2: device-d, device-e, device-dpartition 3: device-f, device-g, device-f
consumer group: analytics-prod checkpoint store says partition 0 offset N checkpoint store says partition 1 offset M checkpoint store says partition 2 offset P checkpoint store says partition 3 offset QChoose partition count before production with care. More partitions can improve parallelism, but they also increase reader coordination and may complicate per-key ordering assumptions. You cannot fix a poor partition-key strategy only by buying more throughput. If all writes land on one hot partition, the namespace can have spare capacity while one partition creates latency and throttling symptoms.
Consumer Groups and Checkpoint Store
Section titled “Consumer Groups and Checkpoint Store”A consumer group is an independent view over the same stream. The analytics team, fraud team, and archive verifier can each have their own consumer group. One team falling behind does not move another team’s checkpoint. That independence is valuable, but it also means every new consumer group can multiply outbound read load.
The checkpoint store is where the client records progress. Most Azure SDK examples use Blob Storage for checkpoints because it is durable, cheap, and easy to coordinate across multiple reader instances. The checkpoint is not the event data. It is the bookmark. If the checkpoint store is unavailable or permissions are wrong, readers may repeatedly reprocess events or fail to coordinate partition ownership.
Tiers: Basic, Standard, Premium, Dedicated
Section titled “Tiers: Basic, Standard, Premium, Dedicated”The Event Hubs quotas and tiers page is the source you check before committing a design. The short version is that Basic is a low-cost ingestion tier with fewer features, Standard is the common shared-capacity production tier, Premium gives isolated resources and larger limits through processing units, and Dedicated gives single-tenant capacity through capacity units. Do not choose the tier from a feature name alone. Choose it from throughput, isolation, event size, Capture needs, private networking, schema registry, and operational blast radius.
| Tier | Capacity unit | Operator fit | Main tradeoff |
|---|---|---|---|
| Basic | Throughput unit | Low-cost ingestion where advanced production features are not needed | Fewer features and tighter limits |
| Standard | Throughput unit | Common production start for shared-capacity event ingestion | Multi-tenant capacity and explicit TU sizing |
| Premium | Processing unit | Stronger isolation, larger limits, private networking, schema registry, and included features | Higher hourly baseline |
| Dedicated | Capacity unit | Very high scale, strict isolation, and predictable broker capacity | Capacity-unit commitment and platform planning |
Throughput Units, Processing Units, and Capacity Units
Section titled “Throughput Units, Processing Units, and Capacity Units”In Standard, a throughput unit is the planning unit. The commonly used mental model is one TU for up to one megabyte per second or one thousand events per second ingress, and up to two megabytes per second egress. The exact current limits and tier details belong in the Microsoft quota page, not in a runbook copied once and forgotten. The operator point is that both event count and bytes matter.
Worked sizing example: a telemetry workload sends ten thousand events per second, with one kilobyte average payloads, and the team wants eighty percent headroom. The event-rate requirement is ten thousand divided by one thousand divided by 0.8, which rounds up to thirteen TUs. The byte-rate requirement is about ten megabytes per second divided by one megabyte per second divided by 0.8, which also rounds up to thirteen TUs. The namespace should start at thirteen TUs or use auto-inflate with a ceiling above that value.
Using the public East US retail meters
observed while writing this module,
Standard Throughput Unit pricing was
$0.03 per hour and Standard ingress
events were $0.028 per million events.
Thirteen TUs for a thirty-day month at
seven hundred thirty hours is about
$284.70. Ten thousand events per second
for that month is about twenty-five
billion nine hundred twenty million
events, so the ingress-events meter is
about $725.76. If Standard Capture is
enabled on the namespace and billed at
$0.10 per hour, thirteen TUs add about
$949.00 before storage, reads, and
regional data transfer. The rough monthly
service subtotal is about $1,959.46, and
the bill still needs blob storage,
analytics readers, and Log Analytics
ingestion.
That example is intentionally plain. It shows why a stream that sounds “small” because each record is one kilobyte can still have a meaningful monthly bill when the rate is continuous. Cost is not a finance-only concern here. If you undersize TUs, you buy incidents. If you oversize without telemetry, you buy idle capacity. If you enable Capture and retain blobs forever, you buy storage growth that never pages anyone until the budget alert fires.
Auto-Inflate and Explicit Ceilings
Section titled “Auto-Inflate and Explicit Ceilings”Auto-inflate lets a Standard namespace increase throughput units when ingress pressure rises. It is useful because traffic rarely respects the spreadsheet. It is also not a license to ignore capacity planning. Auto-inflate needs a maximum ceiling, and that ceiling is a cost and reliability decision. Set it too low and producers still throttle during spikes. Set it too high and one buggy producer can create an expensive month.
Use auto-inflate when traffic is bursty, producers are important, and the team has alerts on both throttling and capacity. Use fixed TUs when traffic is stable, cost ceilings are strict, and the producer behavior is controlled. For critical workloads, alert before the ceiling is reached so the operator has time to decide whether to raise the ceiling, shed load, tune partitions, or move to Premium.
Premium and Dedicated Capacity
Section titled “Premium and Dedicated Capacity”Premium changes the conversation from shared throughput units to processing units. The Event Hubs Premium overview is the entry point for features such as stronger resource isolation, larger event sizes, private links, customer-managed keys, and included capabilities that are add-ons or constrained elsewhere. Premium often pays off when the workload is important enough that noisy-neighbor risk, larger events, private networking, schema registry, or predictable performance matter more than the lower Standard entry price.
Dedicated changes the conversation again. The Event Hubs Dedicated overview describes a single-tenant cluster model with capacity units. Dedicated is not the first stop for a normal application team. It is for organizations that have very high throughput, strict isolation requirements, or a platform team prepared to own capacity planning at a larger scale.
Geo-Disaster Recovery Pairing
Section titled “Geo-Disaster Recovery Pairing”The Event Hubs geo-disaster recovery documentation describes alias-based namespace pairing. The alias is the stable name clients use, while the primary and secondary namespaces sit behind it. Failover moves the alias, which means clients can reconnect without a full application reconfiguration.
Geo-DR is not magic replication of every operational concern. Operators still need to understand what metadata is replicated, what event data is not available after a failover, how producers reconnect, how consumers recover checkpoints, and how downstream systems handle duplicates. If the business requirement is cross-region data durability for the event stream itself, Capture to geo-redundant storage or a separate replication design may be needed. An alias helps with namespace failover; it does not replace a recovery architecture.
Capture to Blob or ADLS Gen2
Section titled “Capture to Blob or ADLS Gen2”The Event Hubs Capture overview explains how Event Hubs can write stream data to Azure Blob Storage or ADLS Gen2 in Avro format. Capture is a powerful operator feature because it decouples stream ingestion from batch analytics. The event hub can keep receiving data while downstream systems read files later. It is also a cost and data-lifecycle feature because those Avro blobs live in storage until lifecycle policy removes or tiers them.
Capture windows are configured by time and size. A short window creates smaller files and faster downstream visibility, but it can produce more objects and more listing overhead. A larger window produces fewer files, but it delays batch consumers. For many operational pipelines, a five-minute window is a practical starting point because it keeps the exercise observable without creating a file per event.
Exactly-once language needs precision. Event Hubs can preserve order within a partition and Capture can write Avro files reliably, but end-to-end exactly-once processing depends on producers, consumers, checkpointing, idempotent sinks, and failure recovery. If a consumer writes to a database and checkpoints before the write is durable, data can be lost. If it writes to a database and crashes before checkpointing, data can be processed twice. Design idempotent consumers and use deterministic event IDs when the business cannot tolerate duplicate side effects.
Schema Registry
Section titled “Schema Registry”The Event Hubs Schema Registry overview gives teams a managed place to store schemas for Avro, JSON, and Protobuf payloads. That matters because streams usually outlive one producer version. If every producer invents its own payload shape, consumers become archaeology projects. Compatibility modes give operators a way to control evolution, such as allowing additive fields while blocking changes that break old consumers.
Schema Registry also helps when Kafka clients use Event Hubs through the Kafka surface. The schema contract should be independent of the client library. Whether a producer uses an Azure SDK or a Kafka client, the platform still needs a clear rule for payload evolution, ownership, and breaking changes.
Kafka Surface
Section titled “Kafka Surface”The Event Hubs Kafka overview explains how Event Hubs exposes an Apache Kafka-compatible endpoint for many Kafka clients. This is useful during migration or when an application already speaks the Kafka protocol. The operational benefit is that teams can use managed Event Hubs without running brokers, ZooKeeper-era components, or broker storage.
Compatibility is not the same as identical behavior. Event Hubs does not become a self-managed Kafka cluster with every broker-side feature, every admin API, every topic configuration knob, or every ecosystem expectation. Operators should test client versions, authentication, partition behavior, offset handling, transactions, idempotent producer requirements, and monitoring assumptions before promising a transparent migration. Kafka compatibility is a bridge, not a reason to ignore managed-service limits.
Pause and predict: if a Kafka application depends on broker-level topic configuration and an admin API that Event Hubs does not support, what should the migration plan test before the cutover date?
Event Grid Deep Dive
Section titled “Event Grid Deep Dive”Event Grid is an event router. It receives an event from a source, evaluates subscriptions and filters, and delivers matching events to handlers. The Event Grid overview is the map for topics, event subscriptions, schemas, and handlers. The important operator idea is that Event Grid owns routing and delivery attempts, not a durable stream that consumers replay at will.
source service or app | v+--------------------+| Event Grid topic || system/custom/etc. |+---------+----------+ | +--> subscription: Functions handler | +--> subscription: webhook handler | +--> subscription: Event Hubs sink | +--> subscription: Service Bus queueTopics, System Topics, Custom Topics, Partner Topics, and Domains
Section titled “Topics, System Topics, Custom Topics, Partner Topics, and Domains”A topic is the endpoint where Event Grid receives events. System topics represent Azure resources that emit events, such as Storage Accounts. Custom topics are application-owned event entry points. Partner topics are used when a partner service publishes events into Azure. Domains group many topics under one management boundary, which is useful for SaaS-style multi-tenant publishing.
Use a system topic when the source is already an Azure resource that emits events. Use a custom topic when your application is the publisher. Use a domain when one publishing surface must serve many tenants or applications with topic-level isolation. Use a partner topic when the source is an integrated partner service. The question is not “which object can I create fastest”; it is “who owns the source, who owns the subscribers, and how many routing boundaries are needed?”
Subscribers and Event Handlers
Section titled “Subscribers and Event Handlers”Event handlers include Azure Functions, Logic Apps, webhooks, Event Hubs, Service Bus, Storage Queues, Hybrid Connections, and other Azure endpoints. That list is the reason Event Grid is so useful as glue. An Event Grid subscription can call a serverless function, send to a queue, or bridge into Event Hubs for analytics. The handler choice should reflect the reliability contract.
Choose Functions for code that reacts immediately and scales with events. Choose Logic Apps for workflow integration where connectors and human-readable steps matter. Choose a webhook for external systems, but treat validation, authentication, idempotency, and retry behavior as production requirements. Choose Event Hubs as a destination when discrete Azure events need to join an analytics stream. Choose Service Bus when the event should become a durable command for enterprise processing. Choose Storage Queues when the consumer is simple and the queue semantics are enough.
Event Grid Native Schema and CloudEvents
Section titled “Event Grid Native Schema and CloudEvents”Event Grid supports its native event schema and the CloudEvents 1.0 schema. The Event Grid event schema documentation describes the native shape and CloudEvents support. The CloudEvents 1.0 specification exists because event producers and consumers need a common envelope for source, type, ID, time, subject, and data content.
Use CloudEvents when events cross team, platform, or vendor boundaries. It gives integration partners a standard envelope and reduces custom adapter code. Use the native Event Grid schema when you are staying inside Azure examples, legacy handlers expect it, or the source and handler already use that shape. Do not mix schemas casually. Schema is an API contract, and changing it can break subscribers even if the event data itself is unchanged.
Example CloudEvents-shaped storage notification, trimmed for readability:
{ "specversion": "1.0", "type": "Microsoft.Storage.BlobCreated", "source": "/subscriptions/.../storageAccounts/stdojocapture", "id": "event-id-from-event-grid", "time": "2026-05-19T12:00:00Z", "subject": "/blobServices/default/containers/capture/blobs/ns/hub/0/file.avro", "data": { "api": "PutBlob", "contentLength": 12345, "url": "https://stdojocapture.blob.core.windows.net/capture/ns/hub/0/file.avro" }}Filtering
Section titled “Filtering”Filtering is where Event Grid becomes an operational tool instead of a noisy broadcast channel. The Event Grid filtering documentation covers event type filters, subject prefix and suffix filters, and advanced filters over event data. A good subscription is narrow enough that the handler can trust the incoming event stream. A broad subscription pushes filtering into code, increases delivery attempts, and makes incident triage harder.
Use event type filters to subscribe only
to events such as
Microsoft.Storage.BlobCreated. Use
subject prefix filters to narrow to a
container or virtual path. Use subject
suffix filters for file extensions such as
.avro when the subject convention is
stable. Use advanced filters when the
decision depends on data fields, but keep
them understandable enough that an
operator can audit the match rule at 2 AM.
Delivery Retry and Dead-Lettering
Section titled “Delivery Retry and Dead-Lettering”The Event Grid delivery and retry documentation is required reading before production. Event Grid retries delivery when a handler fails or times out. Retries are valuable because transient failures should not drop events. Retries are dangerous when the handler is down for a long time, authentication is broken, or the endpoint returns repeated failures. In those cases, the retry policy can amplify traffic against an already unhealthy dependency.
Dead-lettering sends undeliverable events to Storage after retry policy is exhausted or delivery is impossible. Configure it for production subscriptions unless the event is truly disposable. Dead-letter storage is not only for replay. It is evidence. It tells you what Event Grid tried to deliver, which subscription matched, and what payload the handler failed to process.
Event Grid Namespaces, MQTT, and Pull Delivery
Section titled “Event Grid Namespaces, MQTT, and Pull Delivery”Event Grid Namespaces matter when Event Grid becomes more than a push-webhook router. The Event Grid namespace concepts describe namespace topics, clients, topic spaces, permission bindings, and pull delivery. Namespaces also support an MQTT v5 broker surface, which is important for IoT-style clients that speak MQTT and need brokered pub-sub behavior.
Use the namespace model when clients need MQTT v5, pull delivery, or a more explicit client/topic permission model. Do not choose namespaces just because they are newer. For a simple Azure Storage blob-created notification to an Azure Function, a system topic and event subscription are still the clearer shape. For many device clients publishing through MQTT topic spaces, namespaces may be the feature that makes Event Grid the right entry point.
Decision Framework: Event Hubs, Event Grid, Service Bus, or Kafka on AKS
Section titled “Decision Framework: Event Hubs, Event Grid, Service Bus, or Kafka on AKS”Operators need a decision framework because service names are poor requirements. Start with the failure you need to survive. If a consumer is down, do you need replay from a stream, delivery retry from a router, a dead-letter queue with lock semantics, or broker control inside Kubernetes? The answer usually selects the service before any feature checklist does.
start | vNeed ordered replayable high-throughput stream? | yes -> Event Hubs | no vNeed discrete notification fan-out to handlers? | yes -> Event Grid | no vNeed commands, sessions, transactions, FIFO, DLQ? | yes -> Service Bus | no vNeed Kafka broker control, plugins, or self-managed compliance? | yes -> Kafka on AKS with Strimzi | no -> Storage Queue, Function trigger, or simpler integration| Requirement | First candidate | Why |
|---|---|---|
| High-throughput telemetry ingestion | Event Hubs | Partitioned stream, retention, consumer groups, replay, analytics |
| Blob-created event to webhook | Event Grid | Native system topic, filters, push delivery, retry, dead-letter |
| Invoice command with sessions | Service Bus | FIFO-like sessions, locks, transactions, DLQ, duplicate detection |
| Simple background work item | Storage Queue | Low ceremony, storage-backed queue, fewer broker features |
| Existing Kafka app without broker ownership | Event Hubs Kafka surface | Kafka protocol bridge to managed ingestion |
| Kafka with custom broker configuration | Kafka on AKS | Control over brokers, plugins, storage, network, and upgrade timing |
| Azure control-plane event fan-out | Event Grid | System topics and event subscriptions match the pattern |
Event Hubs Wins When
Section titled “Event Hubs Wins When”Use Event Hubs when the workload is high-throughput ingestion, observability telemetry, clickstream data, device readings, security events, or analytics data. The key signals are continuous arrival, partition affinity, replay requirements, multiple independent readers, Kafka client compatibility, or downstream systems such as Stream Analytics, Azure Data Explorer, Databricks, or custom consumers. Event Hubs is also useful as the Event Grid destination when discrete Azure events need to become a stream for audit or analytics.
Event Grid Wins When
Section titled “Event Grid Wins When”Use Event Grid when the workload is a discrete event notification. The key signals are fan-out, webhook integration, Azure resource events, SaaS-style integration, serverless handlers, low to moderate event rate, and push or pull delivery. Event Grid is the right mental model for “notify subscribers that something happened” rather than “store this stream so consumers can replay it later”.
Service Bus Wins When
Section titled “Service Bus Wins When”Use Service Bus when messages are business commands and the handler must coordinate processing state. Sessions, duplicate detection, scheduled messages, transactions, lock renewal, and dead-letter queues are not optional details for many enterprise workflows. This module does not teach Service Bus in depth, but it must protect you from choosing Event Grid when the real requirement is command processing.
Apache Kafka on AKS Wins When
Section titled “Apache Kafka on AKS Wins When”Use Kafka on AKS with an operator such as Strimzi only when the organization needs broker-level control and accepts broker-level responsibility. Valid reasons include regulatory constraints that block a managed service, custom broker plugins, strict network topology, specialized retention and compaction policies, ecosystem tools that require Kafka broker behavior, or a platform team already operating Kafka well. Invalid reasons include “we know Kafka” when no one wants to patch brokers, size disks, tune JVMs, test rolling upgrades, or own weekend incidents.
Identity, Networking, and Security
Section titled “Identity, Networking, and Security”Security for event systems is mostly about removing shared secrets, shrinking network paths, and making data-plane permissions explicit. The same lesson from Entra ID, VNet, Key Vault, and Monitor appears again here: production failures often come from identity and network assumptions that were invisible during a portal demo.
Managed Identity and RBAC
Section titled “Managed Identity and RBAC”For Event Hubs data access, prefer managed
identity with Azure RBAC when the producer
or consumer runs in Azure. Use roles such
as Azure Event Hubs Data Sender, Azure Event Hubs Data Receiver, and Azure Event Hubs Data Owner. Grant the
narrowest useful scope, often the event
hub or namespace rather than the
subscription. For local development,
DefaultAzureCredential can use a
developer login, but production should run
under a managed identity or workload
identity.
Example role assignment for the signed-in user during the lab:
EVENTHUB_ID="$(az eventhubs eventhub show \ --resource-group "$RG" \ --namespace-name "$EH_NAMESPACE" \ --name "$EVENTHUB_NAME" \ --query id \ --output tsv)"
USER_OBJECT_ID="$(az ad signed-in-user show --query id --output tsv)"
az role assignment create \ --assignee "$USER_OBJECT_ID" \ --role "Azure Event Hubs Data Sender" \ --scope "$EVENTHUB_ID"SAS tokens still exist because not every producer can use Entra ID. They are useful for external clients, constrained devices, legacy libraries, and integration points that understand connection strings but not managed identity. They also increase secret-management burden. If you use SAS, create separate policies for send and listen, rotate keys, store secrets in Key Vault, avoid namespace-wide root policies for applications, and prefer short-lived generated SAS tokens when possible.
Private Endpoints, Firewalls, and Service Endpoints
Section titled “Private Endpoints, Firewalls, and Service Endpoints”Network controls answer a different question than RBAC. RBAC says who can call the data plane. Private endpoints, firewalls, and service endpoints decide which network paths can reach the service endpoint. For Event Hubs Standard and Premium, private networking patterns are common when producers and consumers run in VNets. Private endpoints move access through Private Link. IP firewall rules restrict public ingress. VNet service endpoints can be useful in supported patterns, but private endpoints are usually the cleaner zero-public-path design for new production workloads.
Event Grid has its own network realities. A webhook endpoint must be reachable by Event Grid unless you route through supported private or Azure-native handlers. External webhooks need validation, authentication, idempotency, and rate limits. When a private application needs to receive events, consider Azure-native handlers, private endpoints where supported, or queue-based indirection rather than exposing a fragile internal webhook.
TLS and Customer-Managed Keys
Section titled “TLS and Customer-Managed Keys”TLS should be table stakes for every path. The more interesting production decision is key ownership. Customer-managed keys for Event Hubs are available in higher tiers such as Premium and Dedicated, and they pull Key Vault operations into the availability path. That is not bad, but it must be designed deliberately. If a key is disabled, deleted, or blocked by networking, the event service can be affected. Treat CMK as a compliance and operational commitment, not a checkbox.
Security Review Checklist
Section titled “Security Review Checklist”- Producers use managed identity or tightly scoped SAS policies.
- Consumers have receiver permissions only where they read.
- Capture storage has lifecycle policy, RBAC, and network controls.
- Event Grid webhook subscriptions have validation and authentication.
- Dead-letter storage is not public and has a retention owner.
- Diagnostic settings send logs and metrics to the right workspace.
- Private endpoints and DNS are tested from the actual producer subnet.
- Break-glass procedures cover key rotation and namespace failover.
Cost Lens: Capacity, Operations, and Hidden Spikes
Section titled “Cost Lens: Capacity, Operations, and Hidden Spikes”Cost belongs in the design review because event systems scale quietly. A stream can run every second of the month. A broken Event Grid handler can retry for hours. Capture can create storage that outlives the application. Geo-DR and replication can add cross-region data transfer. Log Analytics can ingest large diagnostic volumes if every event is logged by application code.
Event Hubs Standard Cost Model
Section titled “Event Hubs Standard Cost Model”Standard Event Hubs has several cost surfaces. Throughput units are hourly capacity. Ingress events can be metered per million events. Capture can add an hourly meter and storage charges. Kafka endpoint usage can add a meter in some pricing models. Networking, cross-region transfer, storage transactions, and monitoring are outside the simple namespace price.
Worked example recap:
| Input | Value |
|---|---|
| Event rate | 10,000 events/sec |
| Average payload | 1 KB |
| Ingress data | about 10 MB/sec |
| Headroom target | 80 percent utilization |
| Required TUs by event count | 13 |
| Required TUs by bytes | 13 |
| Planning TUs | 13 |
| Approx TU monthly cost | $284.70 |
| Approx ingress event cost | $725.76 |
| Approx Standard Capture cost | $949.00 |
The important part is not the exact dollar total. Prices vary by region, currency, tier, reservation, and date, so the Event Hubs pricing page is the source of truth before purchase. The important part is the method: calculate both event rate and byte rate, apply headroom, add retention and Capture costs, then model the downstream readers.
Event Hubs Premium Cost Model
Section titled “Event Hubs Premium Cost Model”Premium uses processing units instead of Standard throughput units. The hourly baseline is higher, but Premium can pay off when it avoids throttling incidents, multi-tenant contention, larger-event workarounds, separate add-on costs, or private-network and schema requirements that would otherwise create complexity. Premium is usually a production decision for important streams, not a default for every small integration.
The inflection point is not only math. If a Standard namespace needs many TUs, constant auto-inflate, separate isolation per team, private networking, schema registry, and larger event sizes, Premium may be cheaper operationally even if the service line item is higher. That is the difference between cloud cost and engineering cost.
Event Grid Cost Model
Section titled “Event Grid Cost Model”Event Grid charges around operations. Publishing, matching, delivery attempts, namespace throughput, and MQTT operations can all matter depending on the tier and feature set. The Event Grid pricing page should be checked during design because operation counting can surprise teams that think a single event is always a single billable unit.
Example: one million storage blob-created events per month delivered to three subscribers is not only one million events in the system. There is a publish side, matching work, and delivery to each subscriber. If one webhook is unavailable and Event Grid retries repeatedly, delivery attempts increase. If dead-lettering is enabled, Storage costs also appear. Retry storms are therefore both reliability and cost incidents.
Where Costs Spike
Section titled “Where Costs Spike”Undersized Event Hubs TUs spike cost and risk in two ways. If auto-inflate is enabled with a high ceiling, a sudden producer bug can increase hourly capacity. If auto-inflate is disabled or capped too low, producers throttle, retry, and push costs elsewhere. Set a ceiling, alert near the ceiling, and tie the alert to a runbook that distinguishes expected launch traffic from malformed producer loops.
Capture costs spike when files are retained without lifecycle rules. Set storage lifecycle policies for raw Avro blobs. Move old data to cool or archive tiers only when downstream readers can tolerate access latency and retrieval charges. If compliance requires retention, record the owner and retention period in the module or Terraform variable, not in tribal memory.
Geo-DR and replication costs spike when data crosses regions or when duplicate pipelines run hot. Namespace alias failover is not the same as free replicated analytics data. Plan cross-region data transfer, Capture storage replication, and duplicate consumer behavior before the disaster recovery exercise.
Event Grid costs spike when handlers are unavailable. Every failing delivery path creates retries until policy is exhausted. The operator response is not simply “increase retry duration”. Fix authentication, scale the handler, reduce filter breadth, use a queue as a buffer, or dead-letter events so the retry loop does not punish a dependency that is already unhealthy.
Observability and Operator Queries
Section titled “Observability and Operator Queries”Observability for event systems has two sides. The broker or router tells you about capacity, delivery, and throttling. The application tells you whether it processed the event safely. Do not accept a dashboard that only shows producer success. A stream can ingest successfully while every consumer is a day behind. An Event Grid subscription can publish successfully while the webhook is failing validation.
Event Hubs Metrics
Section titled “Event Hubs Metrics”The Event Hubs monitoring reference is the metric source to use when building dashboards. Operators commonly watch incoming bytes, incoming messages or requests, outgoing bytes, outgoing messages, throttled requests, server errors, user errors, connections, and captured bytes. For Capture, watch whether captured bytes move when events are flowing. For consumers, Azure Monitor does not magically know every checkpoint store, so lag often requires comparing last enqueued sequence numbers with stored checkpoints.
Practical Event Hubs dashboard panels should answer the questions an on-call operator asks before opening the portal blades one by one. Each panel needs an owner, an expected range, and a runbook link so the dashboard becomes an operating surface rather than a wall of charts.
| Panel | Signal | Operator question |
|---|---|---|
| Incoming bytes and messages | Producer volume | Is traffic normal or spiking? |
| Throttled requests | Capacity pressure | Are TUs or partitions limiting writes? |
| Outgoing bytes and messages | Consumer reads | Are consumers active? |
| Captured bytes | Capture health | Are Avro files being written? |
| Server errors and user errors | Service and client failures | Is the problem Azure-side or caller-side? |
| Active connections | Client behavior | Did a deploy drop or multiply connections? |
Example KQL for Event Hubs metrics in Azure Monitor can start with the namespace-level pressure signals, then drill into a specific event hub or consumer application once you know whether the broker is throttling:
AzureMetrics| where ResourceProvider == "MICROSOFT.EVENTHUB"| where Resource has "ehns-dojo"| where MetricName in ("IncomingBytes", "OutgoingBytes", "ThrottledRequests")| summarize Total = sum(Total) by MetricName, bin(TimeGenerated, 5m)| order by TimeGenerated ascExample KQL for a throttling view should be used as an alert investigation query because any nonzero throttling means producers are already experiencing backpressure or retries:
AzureMetrics| where ResourceProvider == "MICROSOFT.EVENTHUB"| where MetricName == "ThrottledRequests"| summarize Throttled = sum(Total) by Resource, bin(TimeGenerated, 5m)| where Throttled > 0| order by TimeGenerated descConsumer lag is harder because the checkpoint store is application-owned. One practical approach is to emit application metrics that record the last processed sequence number per partition and compare it with the last enqueued sequence number from Event Hubs runtime properties. If the difference grows, the consumer is behind. If the difference shrinks, it is catching up. If it is flat while incoming volume continues, the consumer is stuck.
Event Grid Metrics
Section titled “Event Grid Metrics”The Event Grid metrics reference for topics includes delivery and failure signals such as successful delivery, failed delivery attempts, dropped events, and publish success or failure. For production subscriptions, alert on failed attempts before dropped events. Dropped events mean the system has already exhausted delivery or could not deliver. Failed attempts mean the operator still has time to fix the handler, authentication, filter, or network path.
Example KQL for Event Grid delivery failures should compare successful delivery, failed attempts, and dropped events so the operator can separate a transient endpoint issue from actual event loss:
AzureMetrics| where ResourceProvider == "MICROSOFT.EVENTGRID"| where MetricName in ("DeliveryAttemptFailCount", "DeliverySuccessCount", "DroppedEventCount")| summarize Total = sum(Total) by Resource, MetricName, bin(TimeGenerated, 5m)| order by TimeGenerated descExample KQL for a handler outage pattern should focus on the failure rate over time because raw failure counts can look alarming during a traffic spike and quiet during a low-volume outage:
AzureMetrics| where ResourceProvider == "MICROSOFT.EVENTGRID"| where MetricName in ("DeliveryAttemptFailCount", "DeliverySuccessCount")| summarize Failed = sumif(Total, MetricName == "DeliveryAttemptFailCount"), Succeeded = sumif(Total, MetricName == "DeliverySuccessCount") by Resource, bin(TimeGenerated, 10m)| extend FailureRate = todouble(Failed) / todouble(Failed + Succeeded)| where Failed > 0| order by TimeGenerated descOperational Runbook: Event Hubs Throttling
Section titled “Operational Runbook: Event Hubs Throttling”When producers report throttling, first verify whether the namespace is at its capacity ceiling. Check incoming bytes, incoming requests, throttled requests, and TU or PU settings. Then check partition distribution. If one partition is hot because of a bad partition key, adding TUs may not fix the real problem. Next check producer retry behavior. Aggressive retries can multiply pressure. Finally decide whether to raise the TU ceiling, enable auto-inflate, tune partition keys, add partitions for new hubs, or move to Premium.
Operational Runbook: Event Grid Delivery Failure
Section titled “Operational Runbook: Event Grid Delivery Failure”When Event Grid delivery fails, start with the subscription. Verify endpoint validation, authentication, filter rules, endpoint DNS, TLS certificate, and handler response codes. Then check retry policy and dead-letter configuration. If the handler is a webhook, confirm it returns success only after durable acceptance. If the handler does real work synchronously and times out, put a queue or Function in front of the slow work. Event Grid should not be forced to wait while a downstream batch job runs.
Patterns & Anti-Patterns
Section titled “Patterns & Anti-Patterns”Patterns and anti-patterns are where service knowledge becomes operational judgment. Most production mistakes are not because a team never heard of Event Hubs or Event Grid. They happen because the team used the right feature in the wrong place, or ignored the cost and failure mode attached to it.
| Pattern | When to use it | Why it works | Scaling consideration |
|---|---|---|---|
| Event Hubs for telemetry streams | Continuous high-volume records with replay needs | Consumers control pace and checkpoints | Plan partitions, TUs, retention, and consumer groups |
| Event Grid to Service Bus | Discrete events that trigger reliable command processing | Event Grid routes, Service Bus owns command semantics | Monitor delivery and queue depth separately |
| Event Grid to Event Hubs | Azure resource events need analytics or audit replay | Event Grid captures discrete source events into a stream | Avoid broad filters that flood the stream |
| Capture to ADLS Gen2 | Stream data needs batch analytics or compliance archive | Avro files decouple ingestion from readers | Add lifecycle policy and schema ownership |
| CloudEvents at boundaries | Events cross teams, vendors, or platforms | Standard envelope reduces adapter code | Version payload schemas separately |
| Private endpoint for critical producers | Producers live in controlled VNets | Reduces public exposure and simplifies network policy | Test DNS and failover path before launch |
| Anti-pattern | What goes wrong | Better alternative |
|---|---|---|
| Event Grid as a command queue | Retries and delivery do not replace sessions, locks, and transactions | Use Service Bus for commands |
| One partition key for all events | One hot partition limits throughput and order assumptions become fragile | Choose a key with useful affinity and distribution |
| One consumer group per pod | Outbound load and checkpoint coordination become chaotic | Share a consumer group per logical application |
| Capture without lifecycle policy | Storage grows forever and no one owns retention | Add lifecycle rules and retention owners |
| Webhook does slow processing inline | Event Grid retries when the handler times out | Acknowledge quickly, enqueue work, process asynchronously |
| SAS root policy in application config | Key rotation and blast radius become painful | Use managed identity or narrow SAS policies |
Did You Know?
Section titled “Did You Know?”- Event Hubs Standard capacity planning must consider both event count and byte rate; a workload can be under the event limit and over the byte limit at the same time.
- Event Hubs Capture writes Avro files, so downstream readers should treat the archive as a typed data product rather than a pile of opaque blobs.
- CloudEvents 1.0 standardizes the event envelope, but it does not standardize your business payload; you still need schema ownership and compatibility rules.
- Event Grid retry behavior can turn one unhealthy webhook into many delivery attempts, so dead-letter storage and handler health alerts are cost controls as well as reliability controls.
Common Mistakes
Section titled “Common Mistakes”| Mistake | Why It Happens | How to Fix It |
|---|---|---|
| Choosing Event Grid for ordered business commands | The word event hides the need for sessions, locks, and DLQ handling | Use Service Bus when command semantics drive the requirement |
| Sizing Event Hubs only by event count | Payload size is ignored during planning | Calculate required units from both events per second and megabytes per second |
| Treating partition count as a late tuning knob | Early demos use defaults and production needs arrive later | Choose partition count and key strategy during design review |
| Creating many consumer groups casually | Every group is an independent reader view and can add load | Create groups only for real independent applications |
| Leaving Capture blobs forever | Archive data feels harmless because it is cheap per GB | Add lifecycle rules, retention owners, and storage cost alerts |
| Filtering Event Grid in handler code only | It is faster to write code than to design subscriptions | Use event type, subject, and advanced filters at the subscription |
| Skipping Event Grid dead-letter storage | The happy path works during development | Configure dead-lettering before production and test handler failure |
Your telemetry producer sends steady one-kilobyte records at a rate that sometimes doubles during business events. Consumers must replay the stream after a failed deployment. Which Azure service is the first candidate, and what two capacity signals do you check?
Event Hubs is the first candidate because the workload is continuous, high-throughput, and replayable. Check event rate and byte rate because Standard throughput planning can be constrained by either messages per second or megabytes per second. You should also watch throttled requests and consumer lag during the business event. Event Grid would be a poor fit because the requirement is a retained stream, not only push notification fan-out.A storage account writes Avro Capture files, and three teams want notification when new blobs land. One team wants a webhook, one wants a Function, and one wants messages in Service Bus. What design keeps the source simple?
Use an Event Grid system topic on the Storage Account with separate event subscriptions for each handler. Filter the subscriptions to the capture container and blob path so unrelated storage events do not reach the handlers. The Service Bus subscriber can turn the notification into durable command processing if that team needs queue semantics. Keep dead-letter storage configured because webhook and Function failures should leave evidence.A payment workflow requires per-customer ordering, duplicate detection, transactions, and a dead-letter queue for manual repair. A developer proposes Event Grid because the messages are called events. How do you respond?
The requirements point to Service Bus, not Event Grid. Event Grid is an event router with delivery attempts; it is not the right primitive for transactional command handling with sessions and broker-managed dead-letter queues. You can still use Event Grid upstream to notify that a resource changed, but the payment command itself belongs on a messaging service built for that contract. The design review should name the failure mode: a duplicate or out-of-order payment command is a business incident.Consumers are behind by several hours even though Event Hubs incoming metrics look normal and producers are not throttled. What do you inspect next?
Inspect outgoing bytes or messages, consumer application health, checkpoint ownership, and the checkpoint store permissions. Event Hubs can ingest successfully while a consumer is stuck or repeatedly reprocessing the same partition. Compare last processed sequence numbers with last enqueued sequence numbers per partition if the application emits that telemetry. Adding throughput units may not help if the reader has a code, identity, or checkpoint problem.An Event Grid webhook subscription shows rising failed delivery attempts after a certificate rotation. What operator actions come before increasing the retry duration?
First validate the endpoint TLS chain, DNS, authentication, and webhook response status. Then check the Event Grid subscription validation and dead-letter configuration so failed events are not lost. Increasing retry duration can multiply pressure against a broken endpoint, so it is not the first fix. If the handler does slow work, put a queue or Function in front of it and acknowledge receipt quickly.A Kafka application team wants to move to Event Hubs through the Kafka endpoint. They use admin APIs and broker-specific topic settings in their deployment scripts. What should the migration plan prove?
The plan must prove that the specific Kafka client version, authentication mode, offset behavior, producer settings, and admin operations work against Event Hubs. Kafka protocol compatibility does not mean every broker feature or configuration knob exists. Scripts that assume self-managed broker control may need to change or remain on Kafka. If those broker controls are mandatory, Kafka on AKS with Strimzi may be the more honest platform choice.A team enables Capture for a busy event hub and celebrates that analytics now has raw files. Two months later storage cost grows faster than expected. What design gap caused the surprise?
Capture created durable Avro blobs, but the team did not define lifecycle policy, retention ownership, or downstream access patterns. Capture is not only an ingestion feature; it is a storage lifecycle commitment. The fix is to set tiering and deletion rules that match compliance and analytics needs, and to alert on captured bytes and storage growth. The cost review should include both Event Hubs meters and storage meters.Hands-On Exercise: Event Hubs Capture to Event Grid Webhook
Section titled “Hands-On Exercise: Event Hubs Capture to Event Grid Webhook”Exercise scenario: you will build a small event pipeline without AKS. The flow is Event Hubs ingestion, Capture to Storage as Avro files, Event Grid notification from the Storage Account, and a webhook test endpoint that receives CloudEvents-shaped blob-created events. Use a non-production subscription or a short-lived resource group. Delete the resource group when you finish.
Success criteria:
- Resource group, Storage Account, Event Hubs namespace, and event hub exist in one region.
- Event hub has four partitions and Capture enabled with a five-minute window.
- Your identity can send events to the event hub through Azure RBAC.
- The Python producer sends at least one hundred events without a connection string.
- Storage contains at least one Avro Capture blob under the expected container path.
- Event Grid system topic exists for the Storage Account.
- Webhook test endpoint receives a CloudEvents-formatted
Microsoft.Storage.BlobCreatedevent for the Capture container. - Backpressure test produces or explains throttling signals and records the operator response.
Step One: Set Variables and Create the Resource Group
Section titled “Step One: Set Variables and Create the Resource Group”Pick a short unique suffix because Storage
Account and Event Hubs namespace names are
globally constrained. The commands use
eastus for consistency with the cost
examples, but you can choose a closer
supported region.
export LOCATION="eastus"export SUFFIX="$(date +%s | tail -c 6)"export RG="rg-dojo-eventing-${SUFFIX}"export STORAGE="stdojoeh${SUFFIX}"export CAPTURE_CONTAINER="capture"export EH_NAMESPACE="ehns-dojo-${SUFFIX}"export EVENTHUB_NAME="telemetry"export SYSTEM_TOPIC="egst-dojo-storage-${SUFFIX}"export EG_SUB="egsub-capture-webhook"
az group create \ --name "$RG" \ --location "$LOCATION"Expected result: the resource group exists, and every later resource lands in the same region unless the command says otherwise.
Step Two: Create Storage for Capture and Dead-Letter Evidence
Section titled “Step Two: Create Storage for Capture and Dead-Letter Evidence”The lab uses one Storage Account for Capture output and Event Grid dead-letter examples. In production, you may separate archive and dead-letter storage so retention, access, and incident review permissions are clearer.
az storage account create \ --resource-group "$RG" \ --name "$STORAGE" \ --location "$LOCATION" \ --sku Standard_LRS \ --kind StorageV2 \ --min-tls-version TLS1_2 \ --allow-blob-public-access false
STORAGE_ID="$(az storage account show \ --resource-group "$RG" \ --name "$STORAGE" \ --query id \ --output tsv)"
USER_OBJECT_ID="$(az ad signed-in-user show --query id --output tsv)"
az role assignment create \ --assignee "$USER_OBJECT_ID" \ --role "Storage Blob Data Contributor" \ --scope "$STORAGE_ID"
az storage container create \ --account-name "$STORAGE" \ --name "$CAPTURE_CONTAINER" \ --auth-mode loginIf the container creation fails immediately after role assignment, wait a minute and retry. Azure RBAC propagation is not instant, and the failure is usually an authorization delay rather than a storage problem.
Step Three: Create Event Hubs Namespace and Event Hub with Capture
Section titled “Step Three: Create Event Hubs Namespace and Event Hub with Capture”Create a Standard namespace with two throughput units. Then create one event hub with four partitions and Capture enabled to the container. The archive naming format keeps namespace, hub, partition, and time in the path so operators can inspect Capture output without opening every file.
az eventhubs namespace create \ --resource-group "$RG" \ --name "$EH_NAMESPACE" \ --location "$LOCATION" \ --sku Standard \ --capacity 2
az eventhubs eventhub create \ --resource-group "$RG" \ --namespace-name "$EH_NAMESPACE" \ --name "$EVENTHUB_NAME" \ --partition-count 4 \ --message-retention 1 \ --enable-capture true \ --capture-interval 300 \ --capture-size-limit 10485760 \ --destination-name EventHubArchive.AzureBlockBlob \ --storage-account "$STORAGE_ID" \ --blob-container "$CAPTURE_CONTAINER" \ --archive-name-format "{Namespace}/{EventHub}/{PartitionId}/{Year}/{Month}/{Day}/{Hour}/{Minute}/{Second}"Expected result: the event hub exists with
four partitions, and the Capture settings
point to your storage container. If your
installed Azure CLI uses slightly
different parameter names for Capture, run
az eventhubs eventhub create --help and
keep the same values.
Step Four: Grant Send Permission and Install Producer Dependencies
Section titled “Step Four: Grant Send Permission and Install Producer Dependencies”The producer uses
DefaultAzureCredential, so it can use
your Azure CLI login locally and managed
identity in Azure. This avoids putting an
Event Hubs connection string in a script.
EVENTHUB_ID="$(az eventhubs eventhub show \ --resource-group "$RG" \ --namespace-name "$EH_NAMESPACE" \ --name "$EVENTHUB_NAME" \ --query id \ --output tsv)"
az role assignment create \ --assignee "$USER_OBJECT_ID" \ --role "Azure Event Hubs Data Sender" \ --scope "$EVENTHUB_ID"
.venv/bin/python -m pip install azure-eventhub azure-identityCreate send_events.py in a temporary lab
directory outside the repo or paste it
into your shell with an editor. The
snippet is intentionally complete and
runnable as-is after the environment
variables are set.
import jsonimport osimport timeimport uuidfrom datetime import datetime, timezone
from azure.eventhub import EventData, EventHubProducerClientfrom azure.identity import DefaultAzureCredential
fully_qualified_namespace = f"{os.environ['EH_NAMESPACE']}.servicebus.windows.net"eventhub_name = os.environ["EVENTHUB_NAME"]event_count = int(os.environ.get("EVENT_COUNT", "100"))batch_size = int(os.environ.get("EVENT_BATCH_SIZE", "20"))payload_bytes = int(os.environ.get("EVENT_PAYLOAD_BYTES", "1024"))sleep_ms = int(os.environ.get("EVENT_SLEEP_MS", "50"))
credential = DefaultAzureCredential()producer = EventHubProducerClient( fully_qualified_namespace=fully_qualified_namespace, eventhub_name=eventhub_name, credential=credential,)
def build_event(index: int) -> EventData: device_id = f"device-{index % 8:02d}" body = { "event_id": str(uuid.uuid4()), "device_id": device_id, "sequence": index, "observed_at": datetime.now(timezone.utc).isoformat(), "payload": "x" * payload_bytes, } event = EventData(json.dumps(body)) event.properties = { "content-type": "application/json", "schema": "dojo.telemetry.v1", } return event
sent = 0started = time.perf_counter()
with producer: while sent < event_count: batch = producer.create_batch(partition_key=f"device-{sent % 8:02d}") for _ in range(min(batch_size, event_count - sent)): batch.add(build_event(sent)) sent += 1 producer.send_batch(batch) if sleep_ms: time.sleep(sleep_ms / 1000)
elapsed = time.perf_counter() - startedprint(f"sent={sent} elapsed_seconds={elapsed:.2f} rate={sent / elapsed:.1f}_events_per_second")Run the producer:
export EH_NAMESPACEexport EVENTHUB_NAME.venv/bin/python send_events.pyExpected result: the script prints
sent=100 and exits without a connection
string. If authentication fails, run az login, verify the RBAC assignment scope,
and wait for propagation.
Step Five: Verify Capture Wrote Avro Blobs
Section titled “Step Five: Verify Capture Wrote Avro Blobs”Capture writes on a time or size window, so wait at least five minutes unless your event volume reaches the size limit first. Then list blobs in the capture container.
az storage blob list \ --account-name "$STORAGE" \ --container-name "$CAPTURE_CONTAINER" \ --auth-mode login \ --query "[].name" \ --output tableExpected result: at least one path appears
with the namespace, event hub name,
partition ID, and time components. If no
blobs appear, check CapturedBytes, the
Capture configuration, Storage
permissions, and whether enough time has
passed for the Capture window.
Step Six: Create the Event Grid System Topic
Section titled “Step Six: Create the Event Grid System Topic”Create a system topic for the Storage Account. This makes the Storage Account the Event Grid source for blob-created notifications.
az eventgrid system-topic create \ --resource-group "$RG" \ --name "$SYSTEM_TOPIC" \ --location "$LOCATION" \ --topic-type Microsoft.Storage.StorageAccounts \ --source "$STORAGE_ID"Expected result: the system topic exists and points to the Storage Account resource ID.
Step Seven: Subscribe a Webhook to Capture BlobCreated Events
Section titled “Step Seven: Subscribe a Webhook to Capture BlobCreated Events”Create a temporary endpoint at a service
such as https://webhook.site. Copy the
unique URL into WEBHOOK_URL. Do not use
a production webhook for this lab.
export WEBHOOK_URL="https://webhook.site/your-temporary-id"
az eventgrid system-topic event-subscription create \ --resource-group "$RG" \ --system-topic-name "$SYSTEM_TOPIC" \ --name "$EG_SUB" \ --endpoint-type webhook \ --endpoint "$WEBHOOK_URL" \ --event-delivery-schema cloudeventschemav1_0 \ --included-event-types Microsoft.Storage.BlobCreated \ --subject-begins-with "/blobServices/default/containers/${CAPTURE_CONTAINER}/blobs/${EH_NAMESPACE}/${EVENTHUB_NAME}/"Event Grid sends a validation event when
you create a webhook subscription. Webhook
test sites usually record the validation
request but may not automatically echo the
validation code. If the portal or CLI
shows validation pending, open the
captured validation event and use the
validationUrl if present. That step
proves why production webhooks need
explicit validation handling.
Run the producer again and wait for another Capture window:
EVENT_COUNT=200 EVENT_BATCH_SIZE=25 EVENT_SLEEP_MS=25 .venv/bin/python send_events.pyExpected result: the webhook site receives
a CloudEvents payload with type equal to
Microsoft.Storage.BlobCreated, and the
subject points into your Capture
container.
Step Eight: Trigger and Interpret Backpressure
Section titled “Step Eight: Trigger and Interpret Backpressure”Lower the namespace from two throughput units to one. Then run a larger producer burst with bigger payloads. The goal is to observe throttling or understand why your local producer cannot generate enough pressure.
az eventhubs namespace update \ --resource-group "$RG" \ --name "$EH_NAMESPACE" \ --capacity 1
EVENT_COUNT=5000 \EVENT_BATCH_SIZE=100 \EVENT_PAYLOAD_BYTES=2048 \EVENT_SLEEP_MS=0 \ .venv/bin/python send_events.pyCheck the metrics:
az monitor metrics list \ --resource "$EVENTHUB_ID" \ --metric IncomingRequests,ThrottledRequests,IncomingBytes \ --interval PT1M \ --aggregation Total \ --output tableExpected result: if your producer exceeds
the namespace capacity,
ThrottledRequests rises. If it does not
rise, record the measured producer rate
and explain that the workstation did not
exceed the one-TU envelope. The operator
response is still the same decision tree:
enable auto-inflate with a ceiling, raise
explicit TUs, reduce producer retry
pressure, tune partition keys, increase
partitions for new hubs, or move to
Premium when isolation and features
justify it.
Cleanup
Section titled “Cleanup”Delete the resource group when you are done. This removes the namespace, storage, system topic, event subscription, and captured blobs.
az group delete \ --name "$RG" \ --yes \ --no-waitSources
Section titled “Sources”- Azure Event Hubs overview - Defines Event Hubs as the managed ingestion and streaming service used for the stream mental model.
- Event Hubs quotas and limits - Grounds tier comparisons, capacity limits, and planning boundaries.
- Event Hubs scalability - Explains throughput units, processing units, capacity units, and auto-inflate.
- Event Hubs Capture overview - Documents Capture to Blob Storage or ADLS Gen2 and Avro archive behavior.
- Event Hubs geo-disaster recovery - Explains alias-based namespace pairing and failover behavior.
- Event Hubs Premium overview - Supports the Premium tier isolation and feature discussion.
- Event Hubs Dedicated overview - Supports capacity-unit and single-tenant cluster guidance.
- Event Hubs Schema Registry - Documents schema groups and Avro, JSON, and Protobuf schema management.
- Event Hubs for Apache Kafka - Documents the Kafka-compatible endpoint and migration considerations.
- Event Hubs monitoring reference - Provides Event Hubs metrics and diagnostic categories for dashboards.
- Azure Event Grid overview - Defines Event Grid as Azure’s event routing service.
- Azure messaging services comparison - Compares Event Grid, Event Hubs, Service Bus, and Storage Queues.
- Event Grid event schema - Documents native Event Grid schema and CloudEvents support.
- Event Grid delivery and retry - Explains retry behavior, dead-lettering, and delivery failure handling.
- Event Grid filtering - Documents subject filters, event type filters, and advanced filters.
- Event Grid namespace concepts - Explains namespaces, clients, topic spaces, permissions, and pull delivery.
- Event Grid topics metrics reference - Provides Event Grid delivery and failure metrics.
- Azure Event Hubs pricing - Pricing source for throughput, ingress, Capture, Premium, and Dedicated cost checks.
- Azure Event Grid pricing - Pricing source for Event Grid operations, namespace throughput, and MQTT cost checks.
- CloudEvents 1.0 specification - Defines the standard event envelope used for cross-platform event contracts.
Next Module
Section titled “Next Module”Continue to AWS Essentials to compare how another hyperscaler approaches identity, networking, compute, storage, and event-driven service boundaries.