Module 7.4: AKS Storage, Observability & Scaling
Complexity: [MEDIUM] | Time to Complete: 2.5h | Prerequisites: Module 7.1: AKS Architecture & Node Management
What You’ll Be Able to Do
After completing this module, you will be able to:
- Configure KEDA (Kubernetes Event-Driven Autoscaling) on AKS for scaling based on Azure service metrics
- Implement AKS observability with Azure Monitor Container Insights, Managed Prometheus, and Managed Grafana
- Deploy Azure Disk and Azure Files CSI drivers with storage classes optimized for performance and cost on AKS
- Design AKS cost optimization strategies using Spot node pools, cluster autoscaler tuning, and right-sizing
Why This Module Matters
In November 2023, an online retailer running on AKS experienced a catastrophic failure during their Black Friday sale. Their order processing service used Azure Premium SSD disks for a write-ahead log. When traffic spiked to 15x normal levels, the disk hit its IOPS ceiling and writes started queuing. The application had no metrics on disk I/O latency; their observability stack only monitored CPU and memory. Without visibility into the real bottleneck, the on-call engineer scaled the deployment from 6 to 30 replicas, which made things dramatically worse: 30 pods now competed for the same disk's IOPS budget. The queue grew, timeouts cascaded, and the entire order pipeline froze for 90 minutes during peak sales hours. Post-incident analysis estimated $4.2 million in lost revenue. The fix was straightforward: migrate to Ultra Disks with provisioned IOPS, add disk I/O metrics to their Grafana dashboards, and implement KEDA-based scaling that responded to queue depth rather than CPU utilization.
This story illustrates a pattern that repeats across organizations: storage, observability, and scaling are treated as afterthoughts during initial cluster setup, then become the root cause of the most painful production incidents. The three topics are deeply interconnected. Without proper observability, you cannot make informed scaling decisions. Without proper scaling, your storage layer gets overwhelmed. Without proper storage, your observability pipeline loses data during the exact moments you need it most.
In this module, you will learn how to choose between Azure Disks and Azure Files for different workload patterns, configure Container Insights with Managed Prometheus and Grafana for full-stack observability, and implement event-driven autoscaling with the KEDA add-on. By the end, you will have a cluster that monitors itself, scales based on real business signals, and stores data on the right tier for each workload.
Azure Storage for Kubernetes: Disks vs Files
AKS integrates with two primary Azure storage services for persistent volumes: Azure Disks and Azure Files. The choice between them depends on your access patterns, performance requirements, and cross-zone needs.
Azure Disks: Block Storage for Single-Pod Workloads
Azure Disks provide block-level storage that attaches to a single node at a time. This maps to the ReadWriteOnce (RWO) access mode in Kubernetes: the volume can be mounted read-write by only a single node at a time, so the pod using the disk is pinned to wherever the disk is attached.
Azure Disk types for AKS:

| | Standard HDD | Standard SSD | Premium SSD | Ultra Disk |
|---|---|---|---|---|
| Max IOPS | 2,000 | 6,000 | 20,000 | 160,000 |
| Max bandwidth | 500 MB/s | 750 MB/s | 900 MB/s | 4 GB/s |
| Latency | ~10ms | ~4ms | ~1ms | Sub-ms |
| Use for | Backups, cold data | Dev/test, light workloads | Most production databases | High-perf DBs, real-time analytics |
| Cost | $ | $$ | $$$ | $$$$ |

AKS uses CSI (Container Storage Interface) drivers for storage. The `disk.csi.azure.com` driver handles Azure Disks. You create a StorageClass that specifies the disk type, then reference it in PersistentVolumeClaims.
```yaml
# StorageClass for Premium SSD v2 with provisioned IOPS
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: premium-ssd-v2
provisioner: disk.csi.azure.com
parameters:
  skuName: PremiumV2_LRS
  DiskIOPSReadWrite: "5000"
  DiskMBpsReadWrite: "200"
  cachingMode: None
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
```
```yaml
---
# PVC using the StorageClass
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data
  namespace: database
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: premium-ssd-v2
  resources:
    requests:
      storage: 256Gi
```

The `volumeBindingMode: WaitForFirstConsumer` setting is critical for AKS clusters with availability zones. It delays disk creation until a pod actually needs it, ensuring the disk is created in the same zone as the node where the pod is scheduled. Without this, the disk might be created in Zone 1 while the pod gets scheduled to Zone 2, causing a permanent scheduling failure.
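The zone-mismatch mechanics can be expressed as a toy model. This is a sketch with hypothetical function and zone names, not the CSI driver's actual logic, assuming a three-zone cluster:

```python
import random

ZONES = ["westeurope-1", "westeurope-2", "westeurope-3"]

def provision_disk(binding_mode, pod_zone=None):
    """Toy model of which zone an Azure Disk lands in for each binding mode.

    Immediate: the disk is created as soon as the PVC appears, before the
    scheduler has picked a node, so its zone is chosen with no knowledge
    of pod placement.
    WaitForFirstConsumer: creation waits until the pod is scheduled, so
    the disk lands in the pod's zone by construction.
    """
    if binding_mode == "WaitForFirstConsumer":
        if pod_zone is None:
            raise ValueError("creation is deferred until the pod is scheduled")
        return pod_zone
    return random.choice(ZONES)  # Immediate: zone picked independently of the pod

# WaitForFirstConsumer can never mismatch:
assert provision_disk("WaitForFirstConsumer", "westeurope-2") == "westeurope-2"

# Immediate mismatches roughly 2 times in 3 on a three-zone cluster:
mismatches = sum(provision_disk("Immediate") != "westeurope-2" for _ in range(10_000))
print(f"Immediate binding mismatch rate: {mismatches / 10_000:.0%}")
```

The simulation makes the failure probability concrete: with `Immediate` binding on a three-zone cluster, roughly two out of three disks end up in the wrong zone for the eventually scheduled pod.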
Ultra Disks: When Premium SSD Is Not Enough
Ultra Disks allow you to independently provision IOPS and throughput, decoupled from disk size. A 64 GB Ultra Disk can deliver 50,000 IOPS if you need it. This makes them ideal for databases like PostgreSQL, MySQL, and Cassandra that have high I/O requirements relative to their data size.
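To make the decoupling concrete, here is a sketch of the size-to-IOPS coupling on Premium SSD. The tier values are taken from Azure's published P-series specs at the time of writing; verify them against current Azure documentation:

```python
# Premium SSD managed-disk tiers: (size_gib, baseline_iops).
# Assumed from Azure's published P-series tiers; check current docs.
PREMIUM_SSD_TIERS = [
    (64, 240),     # P6
    (128, 500),    # P10
    (256, 1100),   # P15
    (512, 2300),   # P20
    (1024, 5000),  # P30
    (2048, 7500),  # P40
]

def min_premium_size_for_iops(required_iops):
    """Smallest Premium SSD size (GiB) whose baseline IOPS meet the target."""
    for size_gib, iops in PREMIUM_SSD_TIERS:
        if iops >= required_iops:
            return size_gib
    return None  # beyond these tiers: use Ultra Disk or Premium SSD v2

# A 50 GB database needing 5,000 IOPS forces a 1 TiB Premium SSD...
assert min_premium_size_for_iops(5000) == 1024
# ...while Ultra Disk provisions IOPS independently of size, so the same
# workload fits on a 64 GiB disk with DiskIOPSReadWrite: "5000".
```

This is exactly the trap in the Black Friday story: paying for a terabyte of capacity just to buy IOPS, when the actual dataset is a fraction of that size.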
```bash
# Enable Ultra Disk support on a node pool
az aks nodepool add \
  --resource-group rg-aks-prod \
  --cluster-name aks-prod-westeurope \
  --name dbpool \
  --node-count 3 \
  --node-vm-size Standard_D8s_v5 \
  --zones 1 2 3 \
  --enable-ultra-ssd \
  --mode User \
  --node-taints "workload=database:NoSchedule" \
  --labels workload=database
```

```yaml
# StorageClass for Ultra Disk
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ultra-disk
provisioner: disk.csi.azure.com
parameters:
  skuName: UltraSSD_LRS
  DiskIOPSReadWrite: "50000"
  DiskMBpsReadWrite: "1000"
  cachingMode: None
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
```

Azure Files: Shared Storage for Multi-Pod Access
Pause and predict: If you have a legacy CMS that writes user uploads to a local filesystem and you want to scale it to 3 replicas across different nodes, which Azure storage solution must you use and why?
Azure Files provides SMB and NFS file shares that multiple pods across multiple nodes can mount simultaneously (ReadWriteMany / RWX). This is essential for workloads that need shared storage: CMS platforms, shared configuration files, machine learning training data, and legacy applications that expect a shared filesystem.
Azure Files access patterns:

| SMB Protocol (default) | NFS Protocol (Premium only) |
|---|---|
| Windows + Linux | Linux only |
| Broad compatibility | POSIX-compliant |
| AD-based authentication | No authentication overhead |
| Lower throughput | Higher throughput |
| Use: general shared storage, Windows workloads | Use: high-performance shared storage, ML training data, media processing |

```yaml
# StorageClass for Azure Files NFS (Premium tier)
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: azure-files-nfs-premium
provisioner: file.csi.azure.com
parameters:
  protocol: nfs
  skuName: Premium_LRS
mountOptions:
  - nconnect=4
  - noresvport
reclaimPolicy: Retain
volumeBindingMode: Immediate
allowVolumeExpansion: true
```
```yaml
---
# PVC for shared ML training data
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data
  namespace: ml-pipeline
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: azure-files-nfs-premium
  resources:
    requests:
      storage: 1Ti
```

Shared Disks for High Availability
Azure Shared Disks allow a single Premium SSD or Ultra Disk to be attached to multiple nodes simultaneously. This enables cluster-aware applications (like SQL Server Failover Cluster Instances or custom HA storage engines) to share a disk at the block level.
```yaml
# StorageClass for shared disks
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: shared-premium-disk
provisioner: disk.csi.azure.com
parameters:
  skuName: Premium_LRS
  maxShares: "3"
  cachingMode: None
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
```

Warning: Shared Disks do not provide a filesystem. The application must handle concurrent block-level access using a cluster filesystem (like GFS2) or its own coordination protocol. Do not mount a shared disk with ext4 or xfs from multiple nodes: you will corrupt your data.
The Storage Decision Matrix
| Criteria | Azure Disk (Premium) | Azure Disk (Ultra) | Azure Files (SMB) | Azure Files (NFS) |
|---|---|---|---|---|
| Access mode | RWO | RWO | RWX | RWX |
| Max IOPS | 20,000 | 160,000 | 10,000 | 100,000 |
| Cross-zone | No (zone-locked) | No (zone-locked) | Yes (ZRS available) | Yes (ZRS available) |
| Latency | ~1ms | Sub-ms | ~5-10ms | ~2-5ms |
| Windows support | Yes | Yes | Yes | No |
| Best for | Databases, stateful apps | High-IOPS databases | Shared config, CMS | ML data, media |
| Cost | $$$ | $$$$ | $$ | $$$ |
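The matrix above can be reduced to a first-pass selection function. This is illustrative only: it checks the SKU ceilings from the table and deliberately ignores the Premium SSD size-to-IOPS coupling discussed earlier:

```python
def choose_storage(access_mode, max_iops, needs_windows=False):
    """First-pass storage pick per the decision matrix (illustrative only)."""
    if access_mode == "ReadWriteMany":
        if needs_windows:
            return "Azure Files (SMB)"  # NFS shares are Linux-only
        # SMB tops out around 10k IOPS in the matrix; NFS goes to 100k
        return "Azure Files (NFS)" if max_iops > 10_000 else "Azure Files (SMB)"
    # ReadWriteOnce: block storage; Premium SSD caps at 20k IOPS
    return "Azure Disk (Ultra)" if max_iops > 20_000 else "Azure Disk (Premium)"

assert choose_storage("ReadWriteOnce", 5_000) == "Azure Disk (Premium)"
assert choose_storage("ReadWriteOnce", 50_000) == "Azure Disk (Ultra)"
assert choose_storage("ReadWriteMany", 50_000) == "Azure Files (NFS)"
```

In practice you would also factor in the size-to-IOPS coupling (a small disk with a high IOPS requirement may still need Ultra or Premium SSD v2) and cross-zone requirements.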
Container Insights and Azure Monitor
Container Insights is Azure's native observability solution for AKS. It collects logs, metrics, and performance data from your cluster and presents them in the Azure portal with pre-built dashboards and query capabilities.
Enabling Container Insights
```bash
# Create a Log Analytics workspace
az monitor log-analytics workspace create \
  --resource-group rg-aks-prod \
  --workspace-name law-aks-prod \
  --location westeurope \
  --retention-in-days 90

WORKSPACE_ID=$(az monitor log-analytics workspace show \
  -g rg-aks-prod -n law-aks-prod --query id -o tsv)

# Enable Container Insights
az aks enable-addons \
  --resource-group rg-aks-prod \
  --name aks-prod-westeurope \
  --addons monitoring \
  --workspace-resource-id "$WORKSPACE_ID"

# Verify the monitoring agent is running
k get pods -n kube-system -l component=ama-logs
```

What Container Insights Collects
Container Insights deploys a monitoring agent (Azure Monitor Agent) as a DaemonSet on each node. This agent collects:
- Node metrics: CPU, memory, disk I/O, network throughput per node
- Pod metrics: CPU/memory requests vs actual usage, restart counts, OOM kills
- Container logs: stdout/stderr from all containers (sent to Log Analytics)
- Kubernetes events: Pod scheduling, image pulls, resource quota violations
- Inventory data: Running pods, nodes, deployments, services
```bash
# Query container logs in Log Analytics.
# Note: this command takes the workspace GUID (customerId), not the ARM resource ID.
WORKSPACE_GUID=$(az monitor log-analytics workspace show \
  -g rg-aks-prod -n law-aks-prod --query customerId -o tsv)

az monitor log-analytics query \
  --workspace "$WORKSPACE_GUID" \
  --analytics-query "ContainerLogV2 | where ContainerName == 'payment-service' | where LogMessage contains 'error' | top 20 by TimeGenerated desc" \
  --timespan "PT6H"
```

Cost Control for Container Insights
Pause and predict: You just deployed Container Insights on a busy cluster and your Log Analytics bill spiked by $500 in one day. What is the most likely culprit, and what configuration component will fix it?
Container Insights can generate significant Log Analytics costs if you send every log line from every container. Use the ConfigMap to control what gets collected:
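A rough ingestion-cost model shows why filtering matters. The per-GB price here is an assumed pay-as-you-go figure, so substitute your region's current Azure Monitor pricing:

```python
def monthly_log_cost(gb_per_day, price_per_gb=2.76, excluded_fraction=0.0):
    """Estimate monthly Log Analytics ingestion cost.

    price_per_gb: assumed pay-as-you-go rate in USD (check current pricing).
    excluded_fraction: share of log volume dropped by ConfigMap filters
    (kube-system alone is often a large share of stdout volume).
    """
    ingested_gb_per_day = gb_per_day * (1 - excluded_fraction)
    return round(ingested_gb_per_day * price_per_gb * 30, 2)

# A busy cluster shipping 80 GB/day unfiltered vs. excluding ~40% noise:
assert monthly_log_cost(80) == 6624.0
assert monthly_log_cost(80, excluded_fraction=0.4) == 3974.4
```

Dropping noisy system namespaces is usually the single highest-leverage change: it cuts the bill without touching the application logs you actually query during incidents.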
```yaml
# Save as container-insights-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: container-azm-ms-agentconfig
  namespace: kube-system
data:
  schema-version: v1
  config-version: v1
  log-data-collection-settings: |
    [log_collection_settings]
      [log_collection_settings.stdout]
        enabled = true
        exclude_namespaces = ["kube-system", "gatekeeper-system"]
      [log_collection_settings.stderr]
        enabled = true
        exclude_namespaces = ["kube-system"]
      [log_collection_settings.env_var]
        enabled = false
  prometheus-data-collection-settings: |
    [prometheus_data_collection_settings.cluster]
      interval = "60s"
      monitor_kubernetes_pods = true
```

```bash
k apply -f container-insights-config.yaml
```

Managed Prometheus and Grafana: Cloud-Native Monitoring
While Container Insights works well for logs and basic metrics, production teams often need Prometheus for application-specific metrics and Grafana for custom dashboards. Azure offers fully managed versions of both, eliminating the operational burden of running your own Prometheus server and Grafana instance.
Setting Up Managed Prometheus
Section titled “Setting Up Managed Prometheus”Azure Monitor managed service for Prometheus stores metrics in an Azure Monitor workspace. AKS ships metrics using a Prometheus-compatible agent.
```bash
# Create an Azure Monitor workspace (for Prometheus)
az monitor account create \
  --resource-group rg-aks-prod \
  --name amw-aks-prod \
  --location westeurope

MONITOR_WORKSPACE_ID=$(az monitor account show \
  -g rg-aks-prod -n amw-aks-prod --query id -o tsv)

# Enable Managed Prometheus on the cluster
az aks update \
  --resource-group rg-aks-prod \
  --name aks-prod-westeurope \
  --enable-azure-monitor-metrics \
  --azure-monitor-workspace-resource-id "$MONITOR_WORKSPACE_ID"

# Verify the Prometheus agent is running
k get pods -n kube-system -l rsName=ama-metrics
```

Setting Up Managed Grafana
Section titled “Setting Up Managed Grafana”# Create a Managed Grafana instanceaz grafana create \ --resource-group rg-aks-prod \ --name grafana-aks-prod \ --location westeurope
# Link Grafana to the Azure Monitor workspaceGRAFANA_ID=$(az grafana show -g rg-aks-prod -n grafana-aks-prod --query id -o tsv)
az monitor account update \ --resource-group rg-aks-prod \ --name amw-aks-prod \ --linked-grafana "$GRAFANA_ID"
# Get the Grafana URLaz grafana show -g rg-aks-prod -n grafana-aks-prod --query "properties.endpoint" -o tsvOnce linked, Managed Grafana automatically discovers the Prometheus data source. Azure provides pre-built dashboards for Kubernetes cluster monitoring, node performance, pod resource usage, and more.
Custom Prometheus Metrics from Your Application
Section titled “Custom Prometheus Metrics from Your Application”Your application can expose custom Prometheus metrics, and the managed Prometheus agent will scrape them automatically if you annotate your pods correctly.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
  namespace: payments
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payment-service
  template:
    metadata:
      labels:
        app: payment-service
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      containers:
        - name: payment
          image: myregistry.azurecr.io/payment-service:v2.1.0
          ports:
            - containerPort: 8080
              name: http
          resources:
            requests:
              cpu: "250m"
              memory: "256Mi"
            limits:
              cpu: "1"
              memory: "512Mi"
```

Creating Alert Rules
```bash
# Create a Prometheus alert rule for high error rate
az monitor metrics alert create \
  --resource-group rg-aks-prod \
  --name "payment-high-error-rate" \
  --scopes "$MONITOR_WORKSPACE_ID" \
  --condition "avg http_requests_total{status=~'5..',service='payment-service'} by (service) / avg http_requests_total{service='payment-service'} by (service) > 0.05" \
  --description "Payment service error rate exceeds 5%" \
  --severity 1 \
  --window-size 5m \
  --evaluation-frequency 1m
```

For more flexible alerting, use Prometheus-native alert rules through the Azure Monitor workspace:
```yaml
# PrometheusRuleGroup for custom alerts
apiVersion: alerts.monitor.azure.com/v1
kind: PrometheusRuleGroup
metadata:
  name: payment-alerts
spec:
  rules:
    - alert: PaymentServiceHighLatency
      expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{service="payment-service"}[5m])) > 2
      for: 3m
      labels:
        severity: warning
      annotations:
        summary: "Payment service p99 latency exceeds 2 seconds"
    - alert: PaymentServiceDown
      expr: up{job="payment-service"} == 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "Payment service is down"
```

KEDA: Event-Driven Autoscaling
The standard Kubernetes Horizontal Pod Autoscaler (HPA) scales based on CPU and memory utilization. This works for stateless web servers but fails spectacularly for event-driven workloads: message queue consumers, batch processors, and services that need to scale based on business metrics rather than infrastructure metrics.
KEDA (Kubernetes Event-Driven Autoscaler) extends the HPA with over 60 scalers that can trigger scaling from external event sources: Azure Service Bus queue depth, Azure Event Hubs partition lag, PostgreSQL query results, Prometheus metrics, and many more.
| Traditional HPA | KEDA |
|---|---|
| Metrics Server (CPU/memory only) | KEDA Operator (60+ scalers) |
| "Pod at 80% CPU" → scale up | "Queue has 500 messages" → scale up |
| Cannot scale to zero | Can scale to zero |

Enabling the KEDA Add-on
```bash
# Enable KEDA as an AKS add-on
az aks update \
  --resource-group rg-aks-prod \
  --name aks-prod-westeurope \
  --enable-keda

# Verify KEDA pods are running
k get pods -n kube-system -l app.kubernetes.io/name=keda-operator
```

Scaling Based on Azure Service Bus Queue Depth
This is the most common KEDA pattern in Azure: scale your consumer pods based on how many messages are waiting in a queue.
```yaml
# ScaledObject: scale order-processor based on Service Bus queue depth
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor-scaler
  namespace: orders
spec:
  scaleTargetRef:
    name: order-processor
  pollingInterval: 15
  cooldownPeriod: 120
  minReplicaCount: 0   # Scale to zero when queue is empty!
  maxReplicaCount: 50
  triggers:
    - type: azure-servicebus
      metadata:
        queueName: incoming-orders
        namespace: sb-prod-westeurope
        messageCount: "10"   # 1 pod per 10 messages
      authenticationRef:
        name: servicebus-auth
```

The `messageCount: "10"` setting means KEDA targets 1 pod for every 10 messages in the queue. If there are 250 messages, KEDA will scale to 25 replicas. When the queue drains to zero, KEDA scales the deployment down to 0 replicas, saving costs entirely.
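The arithmetic KEDA performs here can be approximated as follows. This is a simplification: real KEDA also applies activation thresholds and HPA stabilization behavior:

```python
import math

def keda_target_replicas(queue_length, message_count,
                         min_replicas=0, max_replicas=50):
    """Approximate KEDA's scaling target for a queue-based scaler.

    KEDA exposes the queue length to the HPA as an external metric with
    messageCount as the per-pod target; the effective replica target is
    the ceiling of the ratio, clamped to the ScaledObject's bounds.
    """
    if queue_length == 0:
        return min_replicas  # scale-to-zero when the queue is empty
    desired = math.ceil(queue_length / message_count)
    return max(min_replicas or 1, min(desired, max_replicas))

assert keda_target_replicas(250, 10) == 25   # 1 pod per 10 messages
assert keda_target_replicas(1000, 10) == 50  # clamped at maxReplicaCount
assert keda_target_replicas(0, 10) == 0      # empty queue -> zero pods
```

Note how `maxReplicaCount` caps the burst: 1,000 messages would "want" 100 pods, but the ScaledObject holds the line at 50.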
KEDA Authentication with Workload Identity
KEDA needs credentials to check the queue depth. Using Workload Identity (from Module 7.3), you can avoid storing connection strings:
```yaml
# TriggerAuthentication using Workload Identity
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: servicebus-auth
  namespace: orders
spec:
  podIdentity:
    provider: azure-workload
    identityId: "<CLIENT_ID_OF_MANAGED_IDENTITY>"
```

The managed identity needs the "Azure Service Bus Data Receiver" role on the Service Bus namespace to check queue metrics.
Scaling Based on Prometheus Metrics
Section titled “Scaling Based on Prometheus Metrics”KEDA can also scale based on custom Prometheus metrics from your Azure Monitor workspace. This lets you scale on any business metric your application exposes.
```yaml
# Scale based on a custom Prometheus metric
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: api-gateway-scaler
  namespace: gateway
spec:
  scaleTargetRef:
    name: api-gateway
  pollingInterval: 30
  cooldownPeriod: 180
  minReplicaCount: 2
  maxReplicaCount: 30
  triggers:
    - type: prometheus
      metadata:
        serverAddress: "http://prometheus-server.monitoring:9090"
        metricName: http_requests_per_second
        query: "sum(rate(http_requests_total{service='api-gateway'}[2m]))"
        threshold: "100"   # 1 pod per 100 requests/sec
```

KEDA Scaling Strategies Compared
| Scaler | Trigger Source | Scale to Zero | Typical Use Case |
|---|---|---|---|
| azure-servicebus | Queue message count | Yes | Order processing, async tasks |
| azure-eventhub | Consumer group lag | Yes | Event streaming, IoT data |
| azure-queue | Storage queue length | Yes | Background jobs, batch processing |
| prometheus | Any Prometheus metric | Yes (below activation threshold) | RPS-based scaling, custom metrics |
| cron | Time schedule | Yes | Predictable traffic patterns |
| azure-monitor | Azure Monitor metrics | Yes | Infrastructure-based triggers |
Combining KEDA with Cluster Autoscaler
KEDA scales pods. The cluster autoscaler scales nodes. They work together beautifully:
- KEDA detects 500 messages in the queue and scales the deployment to 50 replicas
- The scheduler finds that existing nodes can only fit 30 of those pods
- 20 pods go to `Pending` state
- The cluster autoscaler detects pending pods and adds nodes to the VMSS
- New nodes register, and the scheduler places the remaining pods
- Messages get processed and the queue drains
- KEDA scales pods down to 0
- The cluster autoscaler detects underutilized nodes and removes them after the cool-down period
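The scheduling arithmetic in the sequence above reduces to simple capacity math. This is a toy model assuming uniformly sized pods and ignoring DaemonSet and system-pod overhead:

```python
import math

def autoscaler_step(target_pods, nodes, pods_per_node):
    """One round of the KEDA + cluster autoscaler interaction:
    how many pods schedule, how many stay Pending, and how many
    nodes the cluster autoscaler must add to absorb the backlog.
    """
    capacity = nodes * pods_per_node
    running = min(target_pods, capacity)
    pending = target_pods - running
    nodes_to_add = math.ceil(pending / pods_per_node) if pending else 0
    return running, pending, nodes_to_add

# KEDA targets 50 pods; 6 nodes fit 5 pods each (30 total), matching
# the timeline below: 30 run, 20 pend, 4 nodes get added.
assert autoscaler_step(50, 6, 5) == (30, 20, 4)
```

The key takeaway is the division of labor: KEDA only sets the replica target, the scheduler reveals the capacity shortfall as `Pending` pods, and the cluster autoscaler converts that shortfall into new VMSS instances.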
```
Queue depth: 500 messages

t=0s      KEDA: 0 pods → 50 pods (target)
t=10s     Scheduler: 30 pods running, 20 pending
t=20s     Cluster Autoscaler: adding 4 nodes to VMSS
t=80s     New nodes ready: 50/50 pods running
t=300s    Queue drained to 0 messages
t=420s    KEDA: 50 pods → 0 pods
t=1020s   Cluster Autoscaler: removing 4 underutilized nodes
```

Cost Optimization: Spot Instances and Right-Sizing
Compute costs dominate the typical Kubernetes bill. While auto-scaling ensures you only run the nodes you need, cost optimization ensures you pay the lowest possible price for those nodes and pack them as efficiently as possible.
Spot Node Pools
Azure Spot Virtual Machines offer unused Azure capacity at a deep discount of up to 90% off the pay-as-you-go rate. The trade-off is that Azure can evict these VMs at any time with only a 30-second warning when the capacity is needed for full-price customers.
Spot VMs are perfect for fault-tolerant, interruptible workloads:
- Batch processing and background jobs
- Stateless web servers (if you run enough replicas across both Spot and regular nodes)
- CI/CD build agents
- Machine learning training jobs
```bash
# Add a Spot node pool to an existing cluster
az aks nodepool add \
  --resource-group rg-aks-prod \
  --cluster-name aks-prod-westeurope \
  --name spotpool \
  --priority Spot \
  --eviction-policy Delete \
  --spot-max-price -1 \
  --enable-cluster-autoscaler \
  --min-count 1 \
  --max-count 10 \
  --node-vm-size Standard_D4s_v5
```

When you create a Spot node pool, AKS automatically adds the taint `kubernetes.azure.com/scalesetpriority=spot:NoSchedule`. This prevents normal pods from being scheduled on Spot nodes unless they explicitly tolerate the taint.
```yaml
# Pod configured to run on Spot nodes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-worker
spec:
  selector:            # added: a Deployment requires a selector
    matchLabels:
      app: batch-worker
  template:
    metadata:
      labels:
        app: batch-worker
    spec:
      tolerations:
        - key: "kubernetes.azure.com/scalesetpriority"
          operator: "Equal"
          value: "spot"
          effect: "NoSchedule"
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              preference:
                matchExpressions:
                  - key: kubernetes.azure.com/scalesetpriority
                    operator: In
                    values:
                      - spot
      containers:
        - name: worker
          image: myregistry.azurecr.io/batch-worker:v1  # placeholder image
```

Stop and think: If your entire web frontend is running on a Spot node pool and Azure experiences a sudden surge in demand for that VM size in your region, what happens to your application? How should you architect a production deployment to utilize Spot savings without risking downtime?
To use Spot instances safely in production, employ a mixed strategy: run your baseline minimum replicas on regular (On-Demand) nodes, and use KEDA or HPA to scale out onto Spot nodes during traffic spikes.
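A back-of-the-envelope comparison shows what the mixed strategy buys you. The hourly rate and the 70% Spot discount are illustrative assumptions; real Spot prices fluctuate by region, VM size, and available capacity:

```python
def monthly_compute_cost(baseline_nodes, burst_nodes, on_demand_rate,
                         spot_discount=0.7, hours=730):
    """Cost of the mixed strategy: baseline on On-Demand, burst on Spot.

    on_demand_rate: assumed USD/hour for the VM size.
    spot_discount: assumed fraction off On-Demand for Spot capacity.
    """
    on_demand = baseline_nodes * on_demand_rate * hours
    spot = burst_nodes * on_demand_rate * (1 - spot_discount) * hours
    return round(on_demand + spot, 2)

# 3 baseline On-Demand nodes + 5 Spot burst nodes at $0.20/h On-Demand:
mixed = monthly_compute_cost(3, 5, 0.20)
all_on_demand = monthly_compute_cost(8, 0, 0.20)
assert mixed < all_on_demand  # 657.0 vs 1168.0 under these assumptions
```

Even with only the burst capacity on Spot, the bill nearly halves under these assumptions, while the 3 On-Demand baseline nodes keep the service alive through any Spot eviction wave.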
Workload Right-Sizing
Running workloads with CPU and memory requests vastly larger than their actual usage creates "slack" capacity. The cluster autoscaler provisions new nodes because the requested resources exceed capacity, even if the existing nodes are physically sitting at 10% CPU utilization.
Right-sizing involves aligning your container requests with reality.
- Analyze Historical Usage: Use Azure Monitor Container Insights or Grafana dashboards to compare `kube_pod_container_resource_requests` against actual `container_cpu_usage_seconds_total`.
- Vertical Pod Autoscaler (VPA): Run the VPA in `Recommendation` mode. It analyzes pod metrics over time and suggests optimal CPU and memory requests without actively restarting your pods.
- Set Requests = Limits for Memory: To prevent unexpected Out-Of-Memory (OOM) kills during traffic spikes, a common best practice is to set memory requests equal to memory limits.
- Allow CPU Throttling (Carefully): Unlike memory, CPU is a compressible resource. Setting CPU limits higher than requests allows a pod to burst during startup or brief spikes, though aggressive throttling can cause latency.
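The recommendation step can be sketched as follows. This is greatly simplified: the real VPA builds decaying histograms over long windows and emits separate lower/target/upper bounds rather than a single number:

```python
def recommend_cpu_request(cpu_samples_millicores, safety_margin=1.15):
    """VPA-style sketch: take a high percentile of observed CPU usage
    (in millicores) and add headroom for spikes."""
    samples = sorted(cpu_samples_millicores)
    p95 = samples[int(0.95 * (len(samples) - 1))]
    return round(p95 * safety_margin)

# A pod requesting 1000m whose observed usage hovers around 80m:
usage_mc = [60, 70, 75, 80, 80, 85, 90, 95, 100, 140]
recommended = recommend_cpu_request(usage_mc)
assert recommended < 1000  # far below the over-provisioned 1000m request
```

Applying such a recommendation (here, roughly a tenth of the original request) lets the scheduler pack many more pods per node, which is exactly the slack the cluster autoscaler would otherwise paper over with extra nodes.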
```yaml
# A well-sized container specification
resources:
  requests:
    memory: "256Mi"
    cpu: "100m"        # 1/10th of a core for baseline
  limits:
    memory: "256Mi"    # Equal to request to prevent OOM
    cpu: "500m"        # Allowed to burst up to half a core
```

Did You Know?
- Azure Disk IOPS scale with disk size on Premium SSD, but Ultra Disk decouples them. A 256 GB Premium SSD v1 gets 1,100 IOPS. To get 5,000 IOPS you need a 1 TB disk, even if you only store 50 GB of data. Ultra Disk lets you provision 50,000 IOPS on a 64 GB disk. This decoupling can save thousands of dollars per month for I/O-intensive databases that do not need large storage volumes.
- KEDA can scale to zero replicas, which the standard HPA cannot do. The HPA requires a minimum of 1 replica. KEDA's ability to scale to zero is transformative for cost optimization on batch processing workloads. A cluster with 200 different queue consumers that are each idle 95% of the time can run zero pods for most of those consumers, only spinning them up when messages arrive. Combined with the cluster autoscaler, this means you can run a multi-tenant batch processing platform where idle tenants cost nothing.
- Azure Managed Prometheus stores metrics for 18 months at no additional retention cost. Self-hosted Prometheus typically requires careful capacity planning for long-term storage (using Thanos or Cortex). Azure Monitor workspace handles this natively, making it possible to query 18 months of historical metrics for capacity planning and trend analysis without managing any storage infrastructure.
- The `nconnect` mount option for Azure Files NFS multiplies throughput by opening multiple TCP connections. A single NFS connection typically tops out at 300-400 MB/s due to TCP window limitations. Setting `nconnect=4` in your StorageClass mount options opens 4 parallel TCP connections per mount, effectively quadrupling throughput. This is essential for ML training workloads that read large datasets from shared storage.
Common Mistakes
| Mistake | Why It Happens | How to Fix It |
|---|---|---|
| Using Premium SSD when IOPS requirement exceeds the disk-size-to-IOPS ratio | Not understanding that Premium SSD IOPS are tied to disk size | Calculate required IOPS first. If you need high IOPS on small storage, use Ultra Disk or Premium SSD v2 |
| Mounting Azure Disks without WaitForFirstConsumer binding mode | Copying StorageClass examples that use Immediate binding | Always use volumeBindingMode: WaitForFirstConsumer on zone-aware clusters to prevent zone mismatches |
| Sending all container logs to Log Analytics without filtering | Default Container Insights config collects everything | Use the ConfigMap to exclude noisy namespaces (kube-system, monitoring) and disable env_var collection |
| Setting KEDA minReplicaCount to 0 for latency-sensitive services | Attracted by cost savings of scale-to-zero | Only scale to zero for batch/queue consumers. Latency-sensitive services need minReplicaCount >= 1 to avoid cold start delays |
| Not configuring PodDisruptionBudgets for KEDA-scaled workloads | PDBs seem unnecessary for “elastic” workloads | KEDA scales pods, but node upgrades drain them. Without PDBs, all replicas can be evicted simultaneously during cluster upgrades |
| Mounting Azure Files SMB when NFS would perform better | SMB is the default and works on both Windows and Linux | For Linux-only workloads needing high throughput, always use NFS with the nconnect mount option |
| Creating Grafana dashboards without alert rules | "We will check the dashboards when something is wrong" | If nobody is watching the dashboard when the incident starts, it has zero value. Always pair dashboards with alert rules |
| Ignoring disk I/O metrics in observability setup | CPU and memory are the default metrics; disk I/O requires explicit configuration | Add disk IOPS, throughput, and latency to your monitoring ConfigMap and Grafana dashboards |
1. Scenario: You deployed a StatefulSet using a Premium SSD StorageClass with `Immediate` binding mode across a 3-zone AKS cluster. The first pod comes up fine, but the second pod is permanently stuck in `Pending` state. What architectural constraint caused this, and how does `WaitForFirstConsumer` solve it?
Azure Disks are zone-locked resources, meaning a disk created in Availability Zone 1 can only be attached to a virtual machine physically located in Zone 1. When you use Immediate binding mode, the Kubernetes control plane creates the disk immediately upon seeing the PersistentVolumeClaim, without knowing which node the scheduler will eventually choose for the pod. If the disk happens to be created in Zone 1, but the pod is scheduled onto a node in Zone 2, the pod cannot mount the volume and remains stuck in Pending. Using WaitForFirstConsumer solves this by delaying the disk creation API call until the exact moment the scheduler places the pod on a specific node, ensuring the disk is provisioned in the correct matching zone.
2. Scenario: Your DBA team needs to migrate a high-transaction PostgreSQL database to AKS. The database is only 50 GB in size, but requires a guaranteed 15,000 IOPS to handle peak loads. Why would provisioning a 50 GB Premium SSD fail to meet this requirement, and what storage tier is mathematically required instead?
Standard Premium SSDs tie their IOPS and throughput performance directly to the provisioned capacity of the disk. A 64 GB Premium SSD (P6) provides only 240 IOPS, meaning you would have to provision and pay for a 1 TB disk just to achieve the 5,000 IOPS tier, and even larger to hit 15,000. Ultra Disks and Premium SSD v2 solve this by decoupling capacity from performance, allowing you to independently dial in exact IOPS and throughput metrics. By using Ultra Disk, you can provision a 50 GB disk but explicitly set the DiskIOPSReadWrite parameter to 15,000, paying only for the performance you need without wasting money on empty terabytes of storage.
3. Scenario: A machine learning pipeline needs to train a model using 5 TB of image data shared across 20 GPU pods simultaneously. The data scientists initially used Azure Files SMB but are complaining that the data loading phase takes hours due to network bottlenecking. Which Azure Files protocol should they switch to, and what specific mount option will drastically reduce their load times?
The data scientists should switch their StorageClass to use Azure Files with the NFS protocol, which avoids the authentication overhead and Windows-centric design of SMB. NFS on Azure Files Premium provides significantly higher throughput for Linux-based workloads like machine learning containers. Furthermore, they must add the nconnect=4 (or up to 16) setting in their StorageClass mount options. By default, an NFS mount uses a single TCP connection that tops out at around 300-400 MB/s due to TCP window limits; nconnect opens multiple parallel TCP connections to the storage account, multiplying the throughput and drastically reducing data load times.
4. Scenario: An e-commerce backend uses standard HPA (CPU/Memory) to scale its order processing workers. During a flash sale, 10,000 orders hit the Azure Service Bus queue in seconds. The workers process them so quickly that their CPU never exceeds 40%, so the HPA never scales them up, resulting in a 2-hour processing backlog. How would KEDA fundamentally change how this scaling decision is made?
The standard HPA is entirely blind to external business metrics like queue depth, relying solely on lagging infrastructure metrics like CPU utilization which may not correlate with the actual backlog. KEDA replaces this paradigm by connecting directly to the Azure Service Bus API and reading the exact number of pending messages waiting to be processed. Instead of waiting for CPU to spike, KEDA can be configured to instantly provision one worker pod for every 50 messages in the queue. This event-driven approach ensures the deployment scales out preemptively the moment the queue begins to fill, processing the 10,000 orders in minutes rather than hours, and then safely scaling back down to zero when the queue is empty.
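A minimal ScaledObject for this scenario might look like the following sketch (the names are illustrative, and the referenced TriggerAuthentication is assumed to already exist):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-worker-scaler      # illustrative name
spec:
  scaleTargetRef:
    name: order-worker           # the existing worker Deployment
  minReplicaCount: 0             # scale to zero when the queue is empty
  maxReplicaCount: 40
  triggers:
    - type: azure-servicebus
      metadata:
        queueName: orders
        messageCount: "50"       # target: one pod per 50 pending messages
      authenticationRef:
        name: servicebus-auth    # assumed TriggerAuthentication
```

Note that the scaling signal here is the queue itself: CPU utilization never enters the decision.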
5. Scenario: You configure KEDA to scale a consumer deployment to 100 replicas based on queue depth, but your AKS cluster currently only has 3 nodes which can fit 30 pods total. Walk through the exact sequence of events that occurs between KEDA and the Cluster Autoscaler when 1,000 messages suddenly arrive in the queue.
When the messages arrive, the KEDA operator detects the queue depth and immediately updates the deployment’s target replica count to 100. The Kubernetes scheduler successfully places 30 pods on the existing 3 nodes, but the remaining 70 pods transition into a Pending state due to insufficient CPU or memory resources on the cluster. The Cluster Autoscaler constantly watches for Pending pods; upon detecting them, it calculates how many new nodes are required and makes an API call to Azure to expand the Virtual Machine Scale Set. Once the new VMs boot up and join the AKS cluster as Ready nodes, the scheduler automatically places the remaining 70 pods onto them, allowing all 100 consumers to process the queue in parallel.
6. Scenario: A junior engineer enables Container Insights on a production cluster with default settings to troubleshoot a specific microservice. A week later, the Azure Log Analytics bill arrives at $2,000. Why did this happen by default, and what specific configuration changes in the `container-azm-ms-agentconfig` ConfigMap are required to stop the bleeding while still monitoring the application?
By default, the Azure Monitor Agent deployed by Container Insights captures every single line of standard output (stdout) and standard error (stderr) from every container in the cluster, including incredibly noisy system components. This massive ingestion volume is billed per gigabyte by Log Analytics, leading to the rapid cost spike. To fix this, the engineer must deploy a custom ConfigMap named container-azm-ms-agentconfig in the kube-system namespace. In this configuration, they need to explicitly add kube-system and other high-volume namespaces to the exclude_namespaces array for stdout and stderr, and disable environment variable collection (env_var.enabled = false), ensuring only relevant application logs are ingested and billed.
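A trimmed-down version of that ConfigMap might look like this sketch; the keys follow the agent's documented TOML-in-ConfigMap format, but treat the exact namespace list as a starting point for your own cluster:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: container-azm-ms-agentconfig
  namespace: kube-system
data:
  schema-version: v1
  config-version: ver1
  log-data-collection-settings: |-
    [log_collection_settings]
      [log_collection_settings.stdout]
        enabled = true
        # Skip noisy system namespaces; application namespaces still flow through
        exclude_namespaces = ["kube-system", "gatekeeper-system"]
      [log_collection_settings.stderr]
        enabled = true
        exclude_namespaces = ["kube-system", "gatekeeper-system"]
      [log_collection_settings.env_var]
        # Stop collecting environment variables from every container
        enabled = false
```

After applying this, the agent pods restart and ingestion volume (and the Log Analytics bill) drops accordingly.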
7. Scenario: To save money, a team creates a single 1 TB Premium SSD with `maxShares: 3` and mounts it to three different web server pods using the default `ext4` filesystem so they can share static assets. Within an hour, the filesystem is completely corrupted and the data is lost. What architectural rule of Shared Disks did they violate, and what is required to share block storage safely?
The team misunderstood the difference between block storage and file storage; Azure Shared Disks provide concurrent block-level access to the underlying storage device, not a managed filesystem. Standard Linux filesystems like ext4 or xfs cache data in memory and are completely unaware that other operating systems might be modifying the same underlying disk blocks simultaneously, inevitably leading to catastrophic data corruption. To share a disk safely, the pods must either utilize a specialized cluster-aware filesystem (like GFS2) that coordinates locks across nodes, or the application itself must be explicitly designed to manage concurrent block-level arbitration, such as SQL Server Failover Cluster Instances. For simple shared static assets, the team should have used Azure Files (NFS or SMB) instead.
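For the rare cases where shared block access is genuinely required, the disk must be consumed as a raw block device so that no node-local filesystem cache is involved. A sketch, with illustrative names:

```yaml
# StorageClass allowing up to 3 simultaneous attachments
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: shared-premium-disk      # illustrative name
provisioner: disk.csi.azure.com
parameters:
  skuName: Premium_LRS
  maxShares: "3"
  cachingMode: None              # host caching must be disabled for shared disks
---
# The PVC must request raw block mode, not a filesystem
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data
spec:
  accessModes:
    - ReadWriteMany
  volumeMode: Block              # pods see a /dev/... device; no ext4/xfs is created
  storageClassName: shared-premium-disk
  resources:
    requests:
      storage: 1Ti
```

The `volumeMode: Block` line is the safety-critical detail: it hands the application the raw device and makes it responsible for coordinating writes, rather than letting an unaware filesystem corrupt itself.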
Hands-On Exercise: KEDA + Azure Service Bus Queue Scaling + Monitor Alerts
In this exercise, you will set up event-driven autoscaling where a consumer deployment scales from zero to many replicas based on Azure Service Bus queue depth, with monitoring alerts that fire when the queue exceeds a threshold. You will also create a zone-aware StorageClass to properly deploy stateful workloads.
Prerequisites
- AKS cluster with KEDA add-on enabled
- Azure CLI authenticated
- Workload Identity configured (from Module 7.3)
Task 1: Create a Zone-Aware StorageClass and PVC
Before setting up scaling, provision a Premium SSD v2 StorageClass that correctly handles availability zones, and create a PersistentVolumeClaim.
Solution
```bash
# Create a zone-aware StorageClass
kubectl apply -f - <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: premium-ssd-v2-zone-aware
provisioner: disk.csi.azure.com
parameters:
  skuName: PremiumV2_LRS
  DiskIOPSReadWrite: "3000"
  DiskMBpsReadWrite: "125"
  cachingMode: None
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
EOF

# Create a PersistentVolumeClaim
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: order-db-pvc
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: premium-ssd-v2-zone-aware
  resources:
    requests:
      storage: 100Gi
EOF

# Verify the PVC stays in Pending state (because WaitForFirstConsumer
# delays provisioning until a Pod uses it)
kubectl get pvc order-db-pvc
```

Task 2: Create the Azure Service Bus Namespace and Queue
Solution
```bash
# Create the Service Bus namespace
az servicebus namespace create \
  --resource-group rg-aks-prod \
  --name sb-aks-lab-$(openssl rand -hex 4) \
  --location westeurope \
  --sku Standard

SB_NAMESPACE=$(az servicebus namespace list -g rg-aks-prod \
  --query "[0].name" -o tsv)

# Create the queue
az servicebus queue create \
  --resource-group rg-aks-prod \
  --namespace-name "$SB_NAMESPACE" \
  --name incoming-orders \
  --max-size 1024 \
  --default-message-time-to-live "PT1H"

# Get the connection string for the producer script
SB_CONNECTION=$(az servicebus namespace authorization-rule keys list \
  --resource-group rg-aks-prod \
  --namespace-name "$SB_NAMESPACE" \
  --name RootManageSharedAccessKey \
  --query primaryConnectionString -o tsv)

echo "Service Bus Namespace: $SB_NAMESPACE"
```

Task 3: Set Up Workload Identity for KEDA and the Consumer
Create a managed identity that KEDA and the consumer pods will use to read from the queue.
Solution
```bash
# Get the OIDC issuer
OIDC_ISSUER=$(az aks show -g rg-aks-prod -n aks-prod-westeurope \
  --query "oidcIssuerProfile.issuerUrl" -o tsv)

# Create the managed identity
az identity create \
  --resource-group rg-aks-prod \
  --name id-order-processor \
  --location westeurope

SB_CLIENT_ID=$(az identity show -g rg-aks-prod -n id-order-processor \
  --query clientId -o tsv)
SB_PRINCIPAL_ID=$(az identity show -g rg-aks-prod -n id-order-processor \
  --query principalId -o tsv)

# Grant Service Bus Data Receiver role
SB_ID=$(az servicebus namespace show -g rg-aks-prod -n "$SB_NAMESPACE" --query id -o tsv)

az role assignment create \
  --assignee-object-id "$SB_PRINCIPAL_ID" \
  --assignee-principal-type ServicePrincipal \
  --role "Azure Service Bus Data Receiver" \
  --scope "$SB_ID"

# Create federated credential
az identity federated-credential create \
  --name fed-order-processor \
  --identity-name id-order-processor \
  --resource-group rg-aks-prod \
  --issuer "$OIDC_ISSUER" \
  --subject "system:serviceaccount:orders:order-processor-sa" \
  --audiences "api://AzureADTokenExchange"

# Create the namespace and service account
kubectl create namespace orders

kubectl apply -f - <<EOF
apiVersion: v1
kind: ServiceAccount
metadata:
  name: order-processor-sa
  namespace: orders
  annotations:
    azure.workload.identity/client-id: "$SB_CLIENT_ID"
  labels:
    azure.workload.identity/use: "true"
EOF
```

Task 4: Deploy the Consumer Application and KEDA ScaledObject
Deploy the consumer and configure KEDA to scale it based on queue depth.
Solution
```bash
# Deploy the order processor (a simple consumer simulator)
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-processor
  namespace: orders
spec:
  replicas: 0
  selector:
    matchLabels:
      app: order-processor
  template:
    metadata:
      labels:
        app: order-processor
    spec:
      serviceAccountName: order-processor-sa
      containers:
        - name: processor
          image: busybox:1.36
          command:
            - /bin/sh
            - -c
            - |
              echo "Order processor started. Processing messages..."
              while true; do
                echo "$(date): Processing order batch..."
                sleep 5
              done
          resources:
            requests:
              cpu: "100m"
              memory: "128Mi"
            limits:
              cpu: "250m"
              memory: "256Mi"
EOF

# Create the KEDA TriggerAuthentication
TENANT_ID=$(az account show --query tenantId -o tsv)

kubectl apply -f - <<EOF
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: servicebus-workload-auth
  namespace: orders
spec:
  podIdentity:
    provider: azure-workload
    identityId: "$SB_CLIENT_ID"
EOF

# Create the ScaledObject
kubectl apply -f - <<EOF
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor-scaler
  namespace: orders
spec:
  scaleTargetRef:
    name: order-processor
  pollingInterval: 10
  cooldownPeriod: 60
  minReplicaCount: 0
  maxReplicaCount: 20
  triggers:
    - type: azure-servicebus
      metadata:
        queueName: incoming-orders
        namespace: $SB_NAMESPACE
        messageCount: "5"
      authenticationRef:
        name: servicebus-workload-auth
EOF

# Verify KEDA is watching the queue
kubectl get scaledobject -n orders
kubectl get hpa -n orders
```

Task 5: Send Messages and Observe Scaling
Flood the queue with messages and watch KEDA scale the consumer.
Solution
```bash
# Verify current state: 0 replicas
kubectl get deployment order-processor -n orders

# Send 100 messages to the queue
for i in $(seq 1 100); do
  az servicebus queue message send \
    --resource-group rg-aks-prod \
    --namespace-name "$SB_NAMESPACE" \
    --queue-name incoming-orders \
    --body "{\"orderId\": \"ORD-$i\", \"amount\": $((RANDOM % 1000 + 1))}"
done

echo "Sent 100 messages. Watching KEDA scale..."

# Watch the scaling happen (KEDA polls every 10 seconds)
# Run this in a loop or use watch
kubectl get deployment order-processor -n orders -w

# After a few moments, you should see replicas increasing:
#   order-processor   0/20    0    0   0s
#   order-processor   20/20   20   0   15s
# (KEDA targets 1 pod per 5 messages: 100/5 = 20 pods)

# Check the HPA that KEDA created
kubectl describe hpa -n orders

# Check queue depth decreasing (in a real app, consumers would drain the queue)
az servicebus queue show \
  --resource-group rg-aks-prod \
  --namespace-name "$SB_NAMESPACE" \
  --name incoming-orders \
  --query "countDetails.activeMessageCount" -o tsv
```

Task 6: Set Up Azure Monitor Alert for Queue Backlog
Create an alert that fires when the queue depth exceeds a threshold, indicating consumers cannot keep up.
Solution
```bash
# Create an action group for notifications
az monitor action-group create \
  --resource-group rg-aks-prod \
  --name ag-aks-oncall \
  --short-name aks-oncall \
  --email-receiver name="Platform Team" address="platform-oncall@contoso.com"

ACTION_GROUP_ID=$(az monitor action-group show \
  -g rg-aks-prod -n ag-aks-oncall --query id -o tsv)

# Create metric alert on Service Bus queue depth
az monitor metrics alert create \
  --resource-group rg-aks-prod \
  --name "high-order-queue-depth" \
  --scopes "$SB_ID" \
  --condition "avg ActiveMessages > 200" \
  --window-size 5m \
  --evaluation-frequency 1m \
  --severity 2 \
  --description "Order queue has more than 200 active messages for 5 minutes. Consumers may not be keeping up." \
  --action "$ACTION_GROUP_ID"

# Verify the alert rule
az monitor metrics alert show \
  -g rg-aks-prod -n "high-order-queue-depth" -o table

# Create a second alert for KEDA scaling failures
# (when KEDA hits maxReplicaCount but the queue is still growing)
az monitor metrics alert create \
  --resource-group rg-aks-prod \
  --name "order-queue-critical" \
  --scopes "$SB_ID" \
  --condition "avg ActiveMessages > 1000" \
  --window-size 5m \
  --evaluation-frequency 1m \
  --severity 1 \
  --description "CRITICAL: Order queue exceeds 1000 messages. KEDA may have hit maxReplicaCount. Investigate immediately." \
  --action "$ACTION_GROUP_ID"
```

Task 7: Verify Scale-to-Zero
Drain the queue and confirm KEDA scales the deployment back to zero.
Solution
```bash
# In a real scenario, consumers process messages. For the lab, purge the queue:
az servicebus queue message purge \
  --resource-group rg-aks-prod \
  --namespace-name "$SB_NAMESPACE" \
  --queue-name incoming-orders

# Watch the deployment scale down (takes cooldownPeriod seconds: 60s in our config)
echo "Waiting for KEDA cooldown (60 seconds)..."
kubectl get deployment order-processor -n orders -w

# After ~60-90 seconds:
#   order-processor   20/0   20   20   2m
#   order-processor   0/0    0    0    3m

# Verify final state
kubectl get pods -n orders
# Expected: No resources found in orders namespace

# Verify the ScaledObject status
kubectl describe scaledobject order-processor-scaler -n orders | grep -A5 "Status:"

echo "Scale-to-zero verified. Clean up when ready:"
echo "az group delete --name rg-aks-prod --yes --no-wait"
```

Success Criteria
- Premium SSD v2 zone-aware StorageClass and PVC created
- Azure Service Bus namespace and queue created
- Workload Identity configured for the consumer (managed identity + federated credential + service account)
- Consumer deployment starts at 0 replicas
- KEDA ScaledObject and TriggerAuthentication deployed
- Sending 100 messages causes KEDA to scale to 20 replicas (100 messages / 5 per pod)
- HPA created by KEDA is visible with `kubectl get hpa`
- Azure Monitor alert configured for queue depth > 200 (warning) and > 1000 (critical)
- After queue is drained, deployment scales back to 0 replicas within the cooldown period
- No credentials stored in Kubernetes Secrets (Workload Identity used throughout)
Next Module
This is the final module in the AKS Deep Dive series. You now have the knowledge to architect, secure, network, observe, and scale production AKS clusters. For further learning, explore the Platform Engineering Track to deepen your understanding of SRE practices, GitOps workflows, and DevSecOps pipelines that build on this AKS foundation.