Module 3.10: Azure Monitor & Log Analytics
Complexity: [MEDIUM] | Time to Complete: 2.5h | Prerequisites: Module 3.3 (VMs), Module 3.1 (Entra ID)
What You’ll Be Able to Do
Section titled “What You’ll Be Able to Do”After completing this module, you will be able to:
- Configure Azure Monitor with Log Analytics workspaces, diagnostic settings, and custom metric collection
- Implement alert rules with action groups, dynamic thresholds, and multi-resource metric alerts
- Deploy Application Insights for distributed tracing, dependency mapping, and performance anomaly detection
- Design centralized monitoring architectures using Azure Monitor across multiple subscriptions and resource types
Why This Module Matters
Section titled “Why This Module Matters”In April 2023, a logistics company running a fleet management application on Azure noticed that their customer complaints had tripled over two weeks. Drivers were reporting slow load times and intermittent errors. The engineering team had no idea what was happening because they had not configured any monitoring beyond the default Azure portal metrics. When they finally investigated, they discovered that their SQL database had been running at 98% DTU utilization for 11 days. A new reporting query, deployed two weeks earlier as part of a routine release, was running every 5 minutes and consuming massive database resources. Two weeks of degraded customer experience, a 15% spike in customer churn, and an estimated $180,000 in lost contracts---all because nobody was watching the metrics and nobody had set up an alert for “database utilization exceeds 80%.”
Observability is not a luxury. It is the difference between proactively fixing problems and reactively apologizing to customers. Azure Monitor is the platform’s unified observability service, collecting metrics and logs from every Azure resource, virtual machine, container, and application. Log Analytics provides the query engine to make sense of that data. Alerts turn signals into actions. Application Insights traces requests through distributed systems.
In this module, you will learn how Azure Monitor collects and organizes data, how to explore metrics in Metrics Explorer, how to write KQL (Kusto Query Language) queries against Log Analytics, how to create alerts with action groups, and how Application Insights provides end-to-end request tracing. By the end, you will deploy monitoring agents on a VM, ingest custom logs, write KQL queries, and create an alert that sends email when a threshold is breached.
Azure Monitor Architecture
Section titled “Azure Monitor Architecture”Azure Monitor is not a single service---it is a platform composed of several interconnected components:
┌──────────────────────────────────────────────────────────────────┐ │ Data Sources │ │ │ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────────────┐ │ │ │ Azure │ │ VMs │ │ Containers│ │ Applications │ │ │ │ Resources│ │ (Agent) │ │ (AKS,ACA) │ │ (App Insights) │ │ │ └────┬─────┘ └────┬─────┘ └────┬──────┘ └────────┬─────────┘ │ │ │ │ │ │ │ └───────┼─────────────┼────────────┼─────────────────┼─────────────┘ │ │ │ │ ▼ ▼ ▼ ▼ ┌──────────────────────────────────────────────────────────────────┐ │ Azure Monitor Platform │ │ │ │ ┌────────────────────────┐ ┌──────────────────────────────┐ │ │ │ Metrics Store │ │ Log Analytics Workspace │ │ │ │ (Time-series DB) │ │ (Kusto / KQL engine) │ │ │ │ │ │ │ │ │ │ - Platform metrics │ │ - Activity logs │ │ │ │ - Custom metrics │ │ - Resource logs │ │ │ │ - 93 day retention │ │ - VM agent logs │ │ │ │ - Near real-time │ │ - Application traces │ │ │ │ │ │ - Custom logs │ │ │ └─────────┬──────────────┘ └──────────┬───────────────────┘ │ │ │ │ │ │ ▼ ▼ │ │ ┌────────────────────────────────────────────────────────┐ │ │ │ Consumers │ │ │ │ Metrics Explorer │ KQL Queries │ Dashboards │ Alerts │ │ │ │ Workbooks │ Power BI │ Grafana │ APIs │ │ │ └────────────────────────────────────────────────────────┘ │ └──────────────────────────────────────────────────────────────────┘Stop and think: If a critical application crashes due to an out-of-memory exception, which monitoring capability (metrics or logs) would alert you that the memory was exhausted, and which would help you find the specific line of code that caused it?
Metrics vs Logs
Section titled “Metrics vs Logs”| Aspect | Metrics | Logs |
|---|---|---|
| Data type | Numeric time-series (counters, gauges) | Semi-structured text (JSON, syslog) |
| Latency | Near real-time (~1 minute) | 2-5 minutes (ingestion delay) |
| Retention | 93 days (free) | 31 days free, up to 12 years paid |
| Query language | Simple (filter, aggregate) | KQL (powerful, SQL-like) |
| Cost | Free (platform metrics) | $2.76/GB ingested |
| Best for | Dashboards, alerting on thresholds | Root cause analysis, auditing, forensics |
| Examples | CPU %, memory %, request count, latency | Error messages, audit events, request traces |
Metrics Explorer: Real-Time Dashboards
Section titled “Metrics Explorer: Real-Time Dashboards”Every Azure resource automatically emits platform metrics to the Azure Monitor metrics store. No configuration needed---metrics like CPU utilization, memory percentage, network bytes, and disk IOPS are collected the moment a resource is created.
# List available metrics for a VMaz monitor metrics list-definitions \ --resource "/subscriptions/<sub>/resourceGroups/myRG/providers/Microsoft.Compute/virtualMachines/myVM" \ --query '[].{Name:name.value, Unit:unit, Aggregation:primaryAggregationType}' -o table
# Query a specific metric (CPU percentage, last 1 hour, 5-minute intervals)az monitor metrics list \ --resource "/subscriptions/<sub>/resourceGroups/myRG/providers/Microsoft.Compute/virtualMachines/myVM" \ --metric "Percentage CPU" \ --interval PT5M \ --start-time "$(date -u -v-1H '+%Y-%m-%dT%H:%M:%SZ' 2>/dev/null || date -u -d '-1 hour' '+%Y-%m-%dT%H:%M:%SZ')" \ --aggregation Average Maximum \ -o table
# Query storage account metricsaz monitor metrics list \ --resource "/subscriptions/<sub>/resourceGroups/myRG/providers/Microsoft.Storage/storageAccounts/mystorage" \ --metric "Transactions" \ --interval PT1H \ --aggregation Total \ -o tableCommon metrics you should always monitor:
| Resource | Critical Metrics | Alert Threshold (suggestion) |
|---|---|---|
| VM | Percentage CPU, Available Memory Bytes, Disk IOPS | CPU > 85%, Memory < 10%, Disk queue > 5 |
| SQL Database | DTU percentage, Connection failed | DTU > 80%, Failed connections > 10/5min |
| Storage Account | Availability, E2E Latency | Availability < 99.9%, Latency > 100ms |
| App Service | Response time, HTTP 5xx, CPU % | Response > 2s, 5xx > 5/min, CPU > 80% |
| Container App | Replica count, Request count, Response time | Response > 1s, Restarts > 3/hour |
| Key Vault | Service API Latency, Availability | Latency > 1s, Availability < 99.9% |
Log Analytics Workspaces
Section titled “Log Analytics Workspaces”A Log Analytics workspace is the central repository for log data. All logs---from Azure resources, VMs, containers, and applications---are sent to a workspace where you query them using KQL.
# Create a Log Analytics workspaceaz monitor log-analytics workspace create \ --resource-group myRG \ --workspace-name kubedojo-logs \ --location eastus2 \ --retention-time 90
# Get workspace detailsaz monitor log-analytics workspace show \ --resource-group myRG \ --workspace-name kubedojo-logs \ --query '{Name:name, ID:customerId, RetentionDays:retentionInDays}' -o table
# Enable diagnostic logs for a resource (send to Log Analytics)WORKSPACE_ID=$(az monitor log-analytics workspace show \ -g myRG -n kubedojo-logs --query id -o tsv)
az monitor diagnostic-settings create \ --name "send-to-log-analytics" \ --resource "/subscriptions/<sub>/resourceGroups/myRG/providers/Microsoft.KeyVault/vaults/myVault" \ --workspace "$WORKSPACE_ID" \ --logs '[{"category":"AuditEvent","enabled":true}]' \ --metrics '[{"category":"AllMetrics","enabled":true}]'Common Log Tables
Section titled “Common Log Tables”| Table | Source | Contains |
|---|---|---|
AzureActivity | All resources | Control plane operations (create, delete, update) |
AzureDiagnostics | Resources with diagnostics | Resource-specific logs (SQL, Key Vault, etc.) |
Heartbeat | Azure Monitor Agent | VM health status (every minute) |
Syslog | Linux VMs | System logs (auth, kernel, daemon) |
Event | Windows VMs | Windows Event Log entries |
Perf | VMs with agent | Performance counters (CPU, memory, disk) |
ContainerLogV2 | AKS | Container stdout/stderr |
AppRequests | Application Insights | HTTP request traces |
AppExceptions | Application Insights | Exception traces |
KQL: Kusto Query Language
Section titled “KQL: Kusto Query Language”KQL is the query language for Azure Monitor logs. It is similar to SQL but uses a pipe-based syntax where each operator transforms the result of the previous one.
KQL Fundamentals
Section titled “KQL Fundamentals”// Basic query structure: Table | operator1 | operator2 | ...
// Find all failed login attempts in the last 24 hoursSigninLogs| where TimeGenerated > ago(24h)| where ResultType != 0 // 0 = success| project TimeGenerated, UserPrincipalName, IPAddress, ResultDescription| order by TimeGenerated desc| take 50
// Count errors by type in the last hourAzureDiagnostics| where TimeGenerated > ago(1h)| where Category == "AuditEvent"| where ResultType == "Failure"| summarize FailureCount = count() by OperationName| order by FailureCount desc
// Calculate P50, P95, P99 response timesAppRequests| where TimeGenerated > ago(1h)| summarize P50 = percentile(DurationMs, 50), P95 = percentile(DurationMs, 95), P99 = percentile(DurationMs, 99), Total = count() by bin(TimeGenerated, 5m)| order by TimeGenerated asc
// Find VMs with high CPU (from performance counters)Perf| where TimeGenerated > ago(15m)| where CounterName == "% Processor Time"| where InstanceName == "_Total"| summarize AvgCPU = avg(CounterValue), MaxCPU = max(CounterValue) by Computer| where AvgCPU > 80| order by AvgCPU descEssential KQL Operators
Section titled “Essential KQL Operators”| Operator | Purpose | Example |
|---|---|---|
where | Filter rows | where Status == 500 |
project | Select/rename columns | project Time=TimeGenerated, User |
summarize | Aggregate (count, avg, sum, etc.) | summarize count() by Status |
order by | Sort results | order by TimeGenerated desc |
take | Limit result count | take 100 |
extend | Add computed columns | extend DurationSec = DurationMs / 1000 |
bin | Group time into buckets | bin(TimeGenerated, 5m) |
join | Join two tables | join kind=inner (Table2) on Key |
render | Visualize as chart | render timechart |
ago | Relative time | ago(24h), ago(7d) |
between | Range filter | where Value between (10 .. 100) |
Practical KQL Queries for Day-to-Day Operations
Section titled “Practical KQL Queries for Day-to-Day Operations”// Who deleted resources in the last 7 days?AzureActivity| where TimeGenerated > ago(7d)| where OperationNameValue endswith "delete"| where ActivityStatusValue == "Success"| project TimeGenerated, Caller, ResourceGroup, Resource = tostring(split(ResourceId, "/")[-1]), Operation = OperationNameValue| order by TimeGenerated desc
// Disk space usage trends (last 24 hours)Perf| where TimeGenerated > ago(24h)| where ObjectName == "LogicalDisk" and CounterName == "% Free Space"| where InstanceName !in ("_Total", "HarddiskVolume1")| summarize AvgFreeSpace = avg(CounterValue) by Computer, InstanceName, bin(TimeGenerated, 1h)| order by AvgFreeSpace asc
// Top 10 slowest API endpointsAppRequests| where TimeGenerated > ago(1h)| where Success == true| summarize AvgDuration = avg(DurationMs), P99 = percentile(DurationMs, 99), Count = count() by Name| where Count > 10 // Filter out infrequent endpoints| order by P99 desc| take 10
// Anomaly detection: sudden spike in errorsAppExceptions| where TimeGenerated > ago(24h)| summarize ErrorCount = count() by bin(TimeGenerated, 15m)| render timechartPause and predict: If you configure an alert to trigger when CPU exceeds 90%, but you do not assign an Action Group to it, what will happen when the CPU hits 100%?
Alerts and Action Groups
Section titled “Alerts and Action Groups”Alerts proactively notify you when conditions are met. An alert has three components: a condition (what to watch), an action group (what to do), and severity (how critical).
Alert Types
Section titled “Alert Types”| Type | Monitors | Evaluation Frequency | Use Case |
|---|---|---|---|
| Metric alert | Metric thresholds | 1 min | CPU > 85%, Response time > 2s |
| Log alert | KQL query results | 5-15 min | Error count > threshold, security events |
| Activity log alert | Control plane events | Real-time | Resource deleted, role assigned |
| Smart detection | AI-detected anomalies | Continuous | Unusual failure rate, performance degradation |
# Create an action group (email + webhook)az monitor action-group create \ --resource-group myRG \ --name "platform-team" \ --short-name "platform" \ --email-receiver "oncall" "oncall@company.com" \ --webhook-receiver "pagerduty" "https://events.pagerduty.com/integration/xxx/enqueue"
# Create a metric alert: VM CPU > 85% for 5 minutesVM_ID="/subscriptions/<sub>/resourceGroups/myRG/providers/Microsoft.Compute/virtualMachines/myVM"ACTION_GROUP_ID=$(az monitor action-group show -g myRG -n platform-team --query id -o tsv)
az monitor metrics alert create \ --resource-group myRG \ --name "high-cpu-alert" \ --description "VM CPU exceeds 85% for 5 minutes" \ --severity 2 \ --scopes "$VM_ID" \ --condition "avg Percentage CPU > 85" \ --window-size 5m \ --evaluation-frequency 1m \ --action "$ACTION_GROUP_ID"
# Create a log alert: more than 10 errors in 15 minutesWORKSPACE_ID=$(az monitor log-analytics workspace show -g myRG -n kubedojo-logs --query id -o tsv)
az monitor scheduled-query create \ --resource-group myRG \ --name "high-error-rate" \ --description "More than 10 errors in 15 minutes" \ --severity 1 \ --scopes "$WORKSPACE_ID" \ --condition "count 'AzureDiagnostics | where Category == \"AuditEvent\" | where ResultType == \"Failure\"' > 10" \ --condition-query "AzureDiagnostics | where Category == 'AuditEvent' | where ResultType == 'Failure'" \ --window-size 15m \ --evaluation-frequency 5m \ --action-groups "$ACTION_GROUP_ID"
# Create an activity log alert: resource group deletedaz monitor activity-log alert create \ --resource-group myRG \ --name "rg-deleted-alert" \ --description "Resource group was deleted" \ --condition "category=Administrative and operationName=Microsoft.Resources/subscriptions/resourceGroups/delete" \ --action-group "$ACTION_GROUP_ID"Application Insights: End-to-End Tracing
Section titled “Application Insights: End-to-End Tracing”Application Insights is a component of Azure Monitor focused on application performance monitoring (APM). It traces requests through distributed systems, captures exceptions, and profiles performance.
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ Client │ ──► │ API GW │ ──► │ Order │ ──► │ Payment │ │ Browser │ │ Service │ │ Service │ │ Service │ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │ │ │ │ │ Request ID: abc-123 │ │ │ ──────────────────────────────────────────── │ │ All services share the same │ │ correlation ID (Operation ID) │ │ │ └── Application Insights traces the entire ───────┘ request across all services# Create Application Insights resourceaz monitor app-insights component create \ --app kubedojo-app-insights \ --resource-group myRG \ --location eastus2 \ --kind web \ --application-type web \ --workspace "$WORKSPACE_ID"
# Get the instrumentation key and connection stringaz monitor app-insights component show \ --app kubedojo-app-insights -g myRG \ --query '{InstrumentationKey:instrumentationKey, ConnectionString:connectionString}' -o jsonMost Azure SDKs and popular frameworks auto-instrument when you set the APPLICATIONINSIGHTS_CONNECTION_STRING environment variable:
# Configure App Insights on a Container Appaz containerapp update \ --resource-group myRG \ --name my-app \ --set-env-vars "APPLICATIONINSIGHTS_CONNECTION_STRING=$APPINSIGHTS_CONN_STRING"Useful App Insights KQL Queries
Section titled “Useful App Insights KQL Queries”// Request performance by endpoint (last hour)AppRequests| where TimeGenerated > ago(1h)| summarize AvgDuration = avg(DurationMs), P95 = percentile(DurationMs, 95), FailRate = countif(Success == false) * 100.0 / count(), Total = count() by Name| order by Total desc
// Dependency call failures (DB, HTTP, etc.)AppDependencies| where TimeGenerated > ago(1h)| where Success == false| summarize FailCount = count() by Target, DependencyType, ResultCode| order by FailCount desc
// End-to-end transaction searchAppRequests| where OperationId == "abc123"| union AppDependencies | where OperationId == "abc123"| union AppExceptions | where OperationId == "abc123"| project TimeGenerated, ItemType = Type, Name, DurationMs, Success, ResultCode| order by TimeGenerated ascAzure Monitor Agent (AMA)
Section titled “Azure Monitor Agent (AMA)”The Azure Monitor Agent replaces the legacy Log Analytics Agent (MMA/OMS). It collects performance counters, syslog, Windows events, and custom logs from VMs and sends them to Log Analytics.
# Install Azure Monitor Agent on a Linux VMaz vm extension set \ --resource-group myRG \ --vm-name myVM \ --name AzureMonitorLinuxAgent \ --publisher Microsoft.Azure.Monitor \ --enable-auto-upgrade true
# Create a Data Collection Rule (DCR) - defines what to collectaz monitor data-collection rule create \ --resource-group myRG \ --name "vm-performance-rule" \ --location eastus2 \ --data-flows '[{ "streams": ["Microsoft-Perf", "Microsoft-Syslog"], "destinations": ["logAnalyticsWorkspace"] }]' \ --destinations '{ "logAnalytics": [{ "workspaceResourceId": "'$WORKSPACE_ID'", "name": "logAnalyticsWorkspace" }] }' \ --data-sources '{ "performanceCounters": [{ "name": "perfCounters", "streams": ["Microsoft-Perf"], "samplingFrequencyInSeconds": 60, "counterSpecifiers": [ "\\Processor(_Total)\\% Processor Time", "\\Memory\\Available Bytes", "\\LogicalDisk(_Total)\\% Free Space", "\\LogicalDisk(_Total)\\Disk Reads/sec", "\\LogicalDisk(_Total)\\Disk Writes/sec" ] }], "syslog": [{ "name": "syslogDataSource", "streams": ["Microsoft-Syslog"], "facilityNames": ["auth", "authpriv", "cron", "daemon", "kern"], "logLevels": ["Warning", "Error", "Critical", "Alert", "Emergency"] }] }'
# Associate the DCR with the VMDCR_ID=$(az monitor data-collection rule show -g myRG -n vm-performance-rule --query id -o tsv)az monitor data-collection rule association create \ --name "vm-dcr-association" \ --resource "/subscriptions/<sub>/resourceGroups/myRG/providers/Microsoft.Compute/virtualMachines/myVM" \ --rule-id "$DCR_ID"Did You Know?
Section titled “Did You Know?”-
Log Analytics ingestion costs $2.76 per GB, but the first 5 GB per billing account per month are free. A single VM running the Azure Monitor Agent with default performance counters and syslog collection generates approximately 1-3 GB of log data per month. However, enabling verbose application logging or NSG flow logs can push a single VM to 10+ GB per month. Always estimate your ingestion volume before enabling logging on production fleets.
-
KQL was designed by the same team that built Azure Data Explorer (Kusto), and it processes over 15 exabytes of data per day across Microsoft’s internal telemetry systems. The same language is used in Microsoft Sentinel (SIEM), Azure Monitor, Azure Data Explorer, and Microsoft 365 Defender. Learning KQL is one of the highest-ROI skills for any Azure engineer.
-
Azure Monitor can detect anomalies automatically using machine learning. The
series_decompose_anomalies()KQL function analyzes time-series data and flags data points that deviate significantly from the expected pattern. You do not need to set static thresholds---the model learns what “normal” looks like and alerts on deviations. This catches issues like “response time gradually increased 40% over 3 days” that static thresholds miss. -
Application Insights sampling reduces data volume without losing visibility. By default, adaptive sampling kicks in when your application generates more than 5 events per second per server. It intelligently drops redundant telemetry (like 1000 identical successful requests) while preserving anomalies (errors, slow requests). A team generating 100 GB of App Insights data per month reduced it to 8 GB with adaptive sampling, with zero loss in diagnostic capability.
Common Mistakes
Section titled “Common Mistakes”| Mistake | Why It Happens | How to Fix It |
|---|---|---|
| Not setting up monitoring until after an incident | ”We will add monitoring later” is the most expensive sentence in engineering | Set up baseline monitoring (CPU, memory, disk, response time) and alerts on day one, before the application goes to production. |
| Creating one Log Analytics workspace per resource | Misunderstanding of workspace purpose | Use one workspace per environment (dev, staging, prod) or per team. Multiple resources send logs to the same workspace. Separate workspaces fragment your data and make cross-resource queries impossible. |
| Setting alert thresholds too aggressively (e.g., CPU > 50%) | Teams want to catch problems early | Aggressive thresholds cause alert fatigue. The team ignores alerts, and real problems are missed. Set thresholds at actionable levels (CPU > 85%, not 50%). |
| Not configuring action groups on alerts | Alerts are created but nobody receives them | Every alert must have an action group. At minimum, email the on-call team. Better: integrate with PagerDuty/Opsgenie for proper incident management. |
| Using the legacy Log Analytics Agent (MMA) instead of Azure Monitor Agent | MMA appears in older documentation and tutorials | MMA is deprecated. Always use the Azure Monitor Agent (AMA) with Data Collection Rules. AMA supports identity-based auth, multiple workspaces, and data transformation. |
| Writing KQL queries that scan entire tables without time filters | The query “works” but takes 30 seconds | Always start KQL queries with a where TimeGenerated > ago(...) filter. Log Analytics partitions data by time, so time filters dramatically reduce scan volume and cost. |
| Not enabling diagnostic settings on critical resources | Diagnostic settings are not enabled by default | Use Azure Policy to enforce diagnostic settings across all resources. At minimum, enable them on Key Vault, SQL, storage accounts, and networking resources. |
| Logging sensitive data (passwords, tokens, PII) to Log Analytics | Applications log full request/response bodies for debugging | Implement log scrubbing. Never log request bodies, authorization headers, or user PII. Application Insights allows custom telemetry processors to redact sensitive fields before ingestion. |
1. Your team is building a new microservice and wants to be notified immediately if the HTTP 500 error rate exceeds 5% for more than a minute. They also want to be able to search the exact error messages and stack traces from those failing requests to find the root cause. How should you use Azure Monitor metrics and logs to satisfy both requirements?
You should use metrics for the alerting requirement and logs for the root cause analysis. Metrics are numeric time-series data points that are evaluated in near real-time (1-minute granularity), making them perfect for triggering fast, threshold-based alerts like a 5% error rate. Logs, on the other hand, contain rich semi-structured text data such as full stack traces and error messages, which are necessary for debugging. However, logs have a slight ingestion delay (2-5 minutes) and are more expensive to store, making them less ideal for instantaneous alerting but essential for deep investigation.
2. A critical production storage account was unexpectedly deleted sometime during the past weekend, causing an application outage. Your manager has asked you to figure out exactly who deleted the resource and when the deletion occurred. Write a KQL query that retrieves this information from the control plane logs for the last 7 days.
AzureActivity| where TimeGenerated > ago(7d)| where OperationNameValue endswith "delete"| where ActivityStatusValue == "Success"| project TimeGenerated, Caller, ResourceGroup, ResourceType = tostring(split(ResourceId, "/")[-2]), ResourceName = tostring(split(ResourceId, "/")[-1]), OperationName = OperationNameValue| order by TimeGenerated descThis query targets the AzureActivity table, which records all control plane operations (like creations, updates, and deletions) across your Azure subscription. By filtering for operations ending with “delete” and a “Success” status, you isolate the exact events where resources were removed. The project operator then extracts only the most relevant fields—specifically the Caller (which identifies the user or service principal) and the precise TimeGenerated—allowing you to quickly answer who performed the action and when. It sorts the output by time in descending order to surface the most recent deletions first. By querying this centralized control plane log rather than resource-specific logs, you can guarantee visibility even after the resource itself is gone.
3. A junior engineer on your team suggests that to keep things organized, you should create a separate Log Analytics workspace for every single Azure App Service and SQL Database in your production environment (over 50 workspaces total). Why is this a problematic architectural decision?
Creating a separate workspace for every resource severely fragments your log data and breaks your ability to perform end-to-end tracing. When a user request fails, you often need to correlate logs from the frontend App Service with logs from the backend SQL Database. If those logs are in different workspaces, writing a single KQL query to join them becomes significantly harder, slower, and more expensive. The recommended best practice is to centralize logs into a single workspace per environment (e.g., one for production, one for staging) and use Role-Based Access Control (RBAC) to restrict who can query specific tables or resources.
4. You have a fleet of 100 Linux VMs running the Azure Monitor Agent. The security team wants to collect all `auth` and `kern` syslog events, while the platform team only wants to collect CPU and Memory performance counters. How can you configure the agents to satisfy both teams without installing two different monitoring agents?
You can satisfy both teams by using a Data Collection Rule (DCR) in Azure Monitor. A DCR is a centralized configuration resource that defines exactly what data should be collected (like performance counters or syslog facilities), how it should be transformed, and to which Log Analytics workspace it should be routed. Instead of configuring each VM’s agent individually, you create one or more DCRs defining the security and platform requirements, and then associate those rules with your 100 VMs. The Azure Monitor Agent will automatically download the rules and begin collecting only the specified data streams, keeping configuration consistent and easy to manage centrally.
5. Since deploying a new application, your on-call engineers are receiving over 50 alert emails per day. Most of these alerts trigger when a VM's CPU exceeds 50% for a single minute, but the CPU drops back to 20% almost immediately. The engineers have started ignoring all monitoring emails. How should you redesign the alerting strategy to fix this?
You need to redesign the strategy to eliminate alert fatigue by adjusting both the thresholds and the evaluation windows. First, a 50% CPU spike for one minute is normal behavior for many applications; you should raise the threshold to an actionable level (e.g., 85%) and increase the window size so the alert only fires if the CPU remains elevated for 5 or 10 consecutive minutes. Second, you should route alerts based on severity: transient or non-critical issues (Sev 3/4) should be logged to a dashboard or chat channel, while only critical, user-impacting issues (Sev 0/1) should trigger emails or wake up an engineer via PagerDuty. Finally, you might consider using dynamic thresholds that rely on machine learning to alert only when the CPU deviates significantly from its historical baseline.
6. Your architecture consists of a frontend web app, an API gateway, and three backend microservices. A user reports that when they click "Checkout", the page spins for 10 seconds and then fails. You need to figure out which specific backend microservice is causing the delay. How does Application Insights track this single checkout request across all five components?
Application Insights tracks the request across multiple services by implementing distributed tracing based on the W3C Trace Context standard. When the user clicks “Checkout”, the first component (the frontend web app) generates a unique Operation ID for that specific transaction. This Operation ID is automatically injected into the HTTP headers (such as traceparent) of all subsequent outbound calls to the API gateway and backend microservices. Because every service logs its telemetry (requests, dependencies, exceptions) using that exact same Operation ID, you can query Application Insights for that ID to visualize the entire end-to-end transaction, instantly identifying which specific microservice took 10 seconds to respond.
Hands-On Exercise: Monitor Agent on VM with Custom Logs, KQL Query, and Email Alert
Section titled “Hands-On Exercise: Monitor Agent on VM with Custom Logs, KQL Query, and Email Alert”In this exercise, you will deploy a VM with the Azure Monitor Agent, collect performance data and custom logs, write KQL queries, and create an email alert.
Prerequisites: Azure CLI installed and authenticated.
Task 1: Create Infrastructure
Section titled “Task 1: Create Infrastructure”RG="kubedojo-monitor-lab"LOCATION="eastus2"
az group create --name "$RG" --location "$LOCATION"
# Create Log Analytics workspaceaz monitor log-analytics workspace create \ --resource-group "$RG" \ --workspace-name monitor-lab-logs \ --location "$LOCATION" \ --retention-time 30
WORKSPACE_ID=$(az monitor log-analytics workspace show \ -g "$RG" -n monitor-lab-logs --query id -o tsv)WORKSPACE_CUSTOMER_ID=$(az monitor log-analytics workspace show \ -g "$RG" -n monitor-lab-logs --query customerId -o tsv)Verify Task 1
az monitor log-analytics workspace show -g "$RG" -n monitor-lab-logs \ --query '{Name:name, ID:customerId, Retention:retentionInDays}' -o tableTask 2: Deploy a VM with Azure Monitor Agent
Section titled “Task 2: Deploy a VM with Azure Monitor Agent”# Create a VMaz vm create \ --resource-group "$RG" \ --name monitor-lab-vm \ --image Ubuntu2204 \ --size Standard_B2s \ --admin-username azureuser \ --generate-ssh-keys \ --assign-identity '[system]'
# Install Azure Monitor Agentaz vm extension set \ --resource-group "$RG" \ --vm-name monitor-lab-vm \ --name AzureMonitorLinuxAgent \ --publisher Microsoft.Azure.Monitor \ --enable-auto-upgrade true
# Grant the VM's identity access to the DCR (Monitoring Metrics Publisher)VM_IDENTITY=$(az vm identity show -g "$RG" -n monitor-lab-vm --query principalId -o tsv)Verify Task 2
az vm extension list -g "$RG" --vm-name monitor-lab-vm \ --query '[].{Name:name, Publisher:publisher, State:provisioningState}' -o tableYou should see AzureMonitorLinuxAgent with Succeeded state.
Task 3: Create a Data Collection Rule
Section titled “Task 3: Create a Data Collection Rule”az monitor data-collection rule create \ --resource-group "$RG" \ --name "vm-perf-syslog-rule" \ --location "$LOCATION" \ --data-flows '[{ "streams": ["Microsoft-Perf"], "destinations": ["logAnalytics"] }, { "streams": ["Microsoft-Syslog"], "destinations": ["logAnalytics"] }]' \ --destinations "{ \"logAnalytics\": [{ \"workspaceResourceId\": \"$WORKSPACE_ID\", \"name\": \"logAnalytics\" }] }" \ --data-sources '{ "performanceCounters": [{ "name": "perfCounters", "streams": ["Microsoft-Perf"], "samplingFrequencyInSeconds": 30, "counterSpecifiers": [ "\\Processor(_Total)\\% Processor Time", "\\Memory\\Available Bytes", "\\Memory\\% Used Memory", "\\LogicalDisk(_Total)\\% Free Space" ] }], "syslog": [{ "name": "syslog", "streams": ["Microsoft-Syslog"], "facilityNames": ["auth", "authpriv", "daemon", "kern"], "logLevels": ["Warning", "Error", "Critical", "Alert", "Emergency"] }] }'
# Associate DCR with the VMDCR_ID=$(az monitor data-collection rule show -g "$RG" -n vm-perf-syslog-rule --query id -o tsv)VM_ID=$(az vm show -g "$RG" -n monitor-lab-vm --query id -o tsv)
az monitor data-collection rule association create \ --name "vm-dcr-assoc" \ --resource "$VM_ID" \ --rule-id "$DCR_ID"Verify Task 3
az monitor data-collection rule association list --resource "$VM_ID" \ --query '[].{Name:name, RuleId:dataCollectionRuleId}' -o tableTask 4: Generate Load and Wait for Data
Section titled “Task 4: Generate Load and Wait for Data”# Generate some CPU load on the VMaz vm run-command invoke -g "$RG" -n monitor-lab-vm \ --command-id RunShellScript \ --scripts "stress-ng --cpu 2 --timeout 120 &" 2>/dev/null || \az vm run-command invoke -g "$RG" -n monitor-lab-vm \ --command-id RunShellScript \ --scripts "for i in \$(seq 1 4); do yes > /dev/null & done; sleep 120; kill %1 %2 %3 %4 2>/dev/null"
echo "Generating CPU load... Waiting 5 minutes for data to appear in Log Analytics."echo "(Data ingestion typically takes 3-8 minutes)"sleep 300Verify Task 4
# Query Log Analytics for performance dataaz monitor log-analytics query \ --workspace "$WORKSPACE_CUSTOMER_ID" \ --analytics-query "Perf | where TimeGenerated > ago(10m) | where Computer contains 'monitor-lab' | summarize avg(CounterValue) by CounterName | order by CounterName asc" \ --output tableYou should see CPU and memory counter data.
Task 5: Create an Alert Rule with Action Group
Section titled “Task 5: Create an Alert Rule with Action Group”# Create an action group (email notification)az monitor action-group create \ --resource-group "$RG" \ --name "lab-alerts" \ --short-name "labalerts" \ --email-receiver "lab-oncall" "your-email@example.com"
# Create a metric alert for high CPUACTION_GROUP_ID=$(az monitor action-group show -g "$RG" -n lab-alerts --query id -o tsv)
az monitor metrics alert create \ --resource-group "$RG" \ --name "vm-high-cpu" \ --description "Alert when CPU exceeds 80% for 5 minutes" \ --severity 2 \ --scopes "$VM_ID" \ --condition "avg Percentage CPU > 80" \ --window-size 5m \ --evaluation-frequency 1m \ --action "$ACTION_GROUP_ID"Verify Task 5
az monitor metrics alert show -g "$RG" -n vm-high-cpu \ --query '{Name:name, Severity:severity, Enabled:enabled, Condition:criteria.allOf[0].{Metric:metricName, Operator:operator, Threshold:threshold}}' -o jsonTask 6: Run KQL Queries
Section titled “Task 6: Run KQL Queries”# Query 1: Average CPU over the last 10 minutesaz monitor log-analytics query \ --workspace "$WORKSPACE_CUSTOMER_ID" \ --analytics-query " Perf | where TimeGenerated > ago(10m) | where CounterName == '% Processor Time' | where InstanceName == '_Total' | summarize AvgCPU = avg(CounterValue), MaxCPU = max(CounterValue) by bin(TimeGenerated, 1m) | order by TimeGenerated desc " --output table
# Query 2: Memory usageaz monitor log-analytics query \ --workspace "$WORKSPACE_CUSTOMER_ID" \ --analytics-query " Perf | where TimeGenerated > ago(10m) | where CounterName == '% Used Memory' | summarize AvgMemory = avg(CounterValue) by bin(TimeGenerated, 1m) | order by TimeGenerated desc " --output table
# Query 3: Agent heartbeat (proves the agent is reporting)az monitor log-analytics query \ --workspace "$WORKSPACE_CUSTOMER_ID" \ --analytics-query " Heartbeat | where TimeGenerated > ago(10m) | summarize LastHeartbeat = max(TimeGenerated) by Computer, OSType, Version " --output tableVerify Task 6
All three queries should return data. If they are empty, wait another 5 minutes for data ingestion and retry. The Heartbeat query is the most reliable indicator that the agent is working.
Cleanup
Section titled “Cleanup”az group delete --name "$RG" --yes --no-waitSuccess Criteria
Section titled “Success Criteria”- Log Analytics workspace created
- VM deployed with Azure Monitor Agent and system-assigned managed identity
- Data Collection Rule created and associated with the VM
- Performance counter data visible in Log Analytics (Perf table)
- KQL queries returning CPU, memory, and heartbeat data
- Metric alert created with email action group
Next Module
Section titled “Next Module”Module 3.11: CI/CD with Azure DevOps & GitHub Actions --- Learn how to build automated deployment pipelines that securely deploy to Azure using OIDC authentication, Azure Container Registry, and Container Apps.