Skip to content

Module 3.10: Azure Monitor & Log Analytics

Complexity: [MEDIUM] | Time to Complete: 2.5h | Prerequisites: Module 3.3 (VMs), Module 3.1 (Entra ID)

After completing this module, you will be able to:

  • Configure Azure Monitor with Log Analytics workspaces, diagnostic settings, and custom metric collection
  • Implement alert rules with action groups, dynamic thresholds, and multi-resource metric alerts
  • Deploy Application Insights for distributed tracing, dependency mapping, and performance anomaly detection
  • Design centralized monitoring architectures using Azure Monitor across multiple subscriptions and resource types

In April 2023, a logistics company running a fleet management application on Azure noticed that their customer complaints had tripled over two weeks. Drivers were reporting slow load times and intermittent errors. The engineering team had no idea what was happening because they had not configured any monitoring beyond the default Azure portal metrics. When they finally investigated, they discovered that their SQL database had been running at 98% DTU utilization for 11 days. A new reporting query, deployed two weeks earlier as part of a routine release, was running every 5 minutes and consuming massive database resources. Two weeks of degraded customer experience, a 15% spike in customer churn, and an estimated $180,000 in lost contracts---all because nobody was watching the metrics and nobody had set up an alert for “database utilization exceeds 80%.”

Observability is not a luxury. It is the difference between proactively fixing problems and reactively apologizing to customers. Azure Monitor is the platform’s unified observability service, collecting metrics and logs from every Azure resource, virtual machine, container, and application. Log Analytics provides the query engine to make sense of that data. Alerts turn signals into actions. Application Insights traces requests through distributed systems.

In this module, you will learn how Azure Monitor collects and organizes data, how to explore metrics in Metrics Explorer, how to write KQL (Kusto Query Language) queries against Log Analytics, how to create alerts with action groups, and how Application Insights provides end-to-end request tracing. By the end, you will deploy monitoring agents on a VM, ingest custom logs, write KQL queries, and create an alert that sends email when a threshold is breached.


Azure Monitor is not a single service---it is a platform composed of several interconnected components:

┌──────────────────────────────────────────────────────────────────┐
│ Data Sources │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────────────┐ │
│ │ Azure │ │ VMs │ │ Containers│ │ Applications │ │
│ │ Resources│ │ (Agent) │ │ (AKS,ACA) │ │ (App Insights) │ │
│ └────┬─────┘ └────┬─────┘ └────┬──────┘ └────────┬─────────┘ │
│ │ │ │ │ │
└───────┼─────────────┼────────────┼─────────────────┼─────────────┘
│ │ │ │
▼ ▼ ▼ ▼
┌──────────────────────────────────────────────────────────────────┐
│ Azure Monitor Platform │
│ │
│ ┌────────────────────────┐ ┌──────────────────────────────┐ │
│ │ Metrics Store │ │ Log Analytics Workspace │ │
│ │ (Time-series DB) │ │ (Kusto / KQL engine) │ │
│ │ │ │ │ │
│ │ - Platform metrics │ │ - Activity logs │ │
│ │ - Custom metrics │ │ - Resource logs │ │
│ │ - 93 day retention │ │ - VM agent logs │ │
│ │ - Near real-time │ │ - Application traces │ │
│ │ │ │ - Custom logs │ │
│ └─────────┬──────────────┘ └──────────┬───────────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ Consumers │ │
│ │ Metrics Explorer │ KQL Queries │ Dashboards │ Alerts │ │
│ │ Workbooks │ Power BI │ Grafana │ APIs │ │
│ └────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────┘

Stop and think: If a critical application crashes due to an out-of-memory exception, which monitoring capability (metrics or logs) would alert you that the memory was exhausted, and which would help you find the specific line of code that caused it?

AspectMetricsLogs
Data typeNumeric time-series (counters, gauges)Semi-structured text (JSON, syslog)
LatencyNear real-time (~1 minute)2-5 minutes (ingestion delay)
Retention93 days (free)31 days free, up to 12 years paid
Query languageSimple (filter, aggregate)KQL (powerful, SQL-like)
CostFree (platform metrics)$2.76/GB ingested
Best forDashboards, alerting on thresholdsRoot cause analysis, auditing, forensics
ExamplesCPU %, memory %, request count, latencyError messages, audit events, request traces

Every Azure resource automatically emits platform metrics to the Azure Monitor metrics store. No configuration needed---metrics like CPU utilization, memory percentage, network bytes, and disk IOPS are collected the moment a resource is created.

Terminal window
# List available metrics for a VM
az monitor metrics list-definitions \
--resource "/subscriptions/<sub>/resourceGroups/myRG/providers/Microsoft.Compute/virtualMachines/myVM" \
--query '[].{Name:name.value, Unit:unit, Aggregation:primaryAggregationType}' -o table
# Query a specific metric (CPU percentage, last 1 hour, 5-minute intervals)
az monitor metrics list \
--resource "/subscriptions/<sub>/resourceGroups/myRG/providers/Microsoft.Compute/virtualMachines/myVM" \
--metric "Percentage CPU" \
--interval PT5M \
--start-time "$(date -u -v-1H '+%Y-%m-%dT%H:%M:%SZ' 2>/dev/null || date -u -d '-1 hour' '+%Y-%m-%dT%H:%M:%SZ')" \
--aggregation Average Maximum \
-o table
# Query storage account metrics
az monitor metrics list \
--resource "/subscriptions/<sub>/resourceGroups/myRG/providers/Microsoft.Storage/storageAccounts/mystorage" \
--metric "Transactions" \
--interval PT1H \
--aggregation Total \
-o table

Common metrics you should always monitor:

ResourceCritical MetricsAlert Threshold (suggestion)
VMPercentage CPU, Available Memory Bytes, Disk IOPSCPU > 85%, Memory < 10%, Disk queue > 5
SQL DatabaseDTU percentage, Connection failedDTU > 80%, Failed connections > 10/5min
Storage AccountAvailability, E2E LatencyAvailability < 99.9%, Latency > 100ms
App ServiceResponse time, HTTP 5xx, CPU %Response > 2s, 5xx > 5/min, CPU > 80%
Container AppReplica count, Request count, Response timeResponse > 1s, Restarts > 3/hour
Key VaultService API Latency, AvailabilityLatency > 1s, Availability < 99.9%

A Log Analytics workspace is the central repository for log data. All logs---from Azure resources, VMs, containers, and applications---are sent to a workspace where you query them using KQL.

Terminal window
# Create a Log Analytics workspace
az monitor log-analytics workspace create \
--resource-group myRG \
--workspace-name kubedojo-logs \
--location eastus2 \
--retention-time 90
# Get workspace details
az monitor log-analytics workspace show \
--resource-group myRG \
--workspace-name kubedojo-logs \
--query '{Name:name, ID:customerId, RetentionDays:retentionInDays}' -o table
# Enable diagnostic logs for a resource (send to Log Analytics)
WORKSPACE_ID=$(az monitor log-analytics workspace show \
-g myRG -n kubedojo-logs --query id -o tsv)
az monitor diagnostic-settings create \
--name "send-to-log-analytics" \
--resource "/subscriptions/<sub>/resourceGroups/myRG/providers/Microsoft.KeyVault/vaults/myVault" \
--workspace "$WORKSPACE_ID" \
--logs '[{"category":"AuditEvent","enabled":true}]' \
--metrics '[{"category":"AllMetrics","enabled":true}]'
TableSourceContains
AzureActivityAll resourcesControl plane operations (create, delete, update)
AzureDiagnosticsResources with diagnosticsResource-specific logs (SQL, Key Vault, etc.)
HeartbeatAzure Monitor AgentVM health status (every minute)
SyslogLinux VMsSystem logs (auth, kernel, daemon)
EventWindows VMsWindows Event Log entries
PerfVMs with agentPerformance counters (CPU, memory, disk)
ContainerLogV2AKSContainer stdout/stderr
AppRequestsApplication InsightsHTTP request traces
AppExceptionsApplication InsightsException traces

KQL is the query language for Azure Monitor logs. It is similar to SQL but uses a pipe-based syntax where each operator transforms the result of the previous one.

// Basic query structure: Table | operator1 | operator2 | ...
// Find all failed login attempts in the last 24 hours
SigninLogs
| where TimeGenerated > ago(24h)
| where ResultType != 0 // 0 = success
| project TimeGenerated, UserPrincipalName, IPAddress, ResultDescription
| order by TimeGenerated desc
| take 50
// Count errors by type in the last hour
AzureDiagnostics
| where TimeGenerated > ago(1h)
| where Category == "AuditEvent"
| where ResultType == "Failure"
| summarize FailureCount = count() by OperationName
| order by FailureCount desc
// Calculate P50, P95, P99 response times
AppRequests
| where TimeGenerated > ago(1h)
| summarize
P50 = percentile(DurationMs, 50),
P95 = percentile(DurationMs, 95),
P99 = percentile(DurationMs, 99),
Total = count()
by bin(TimeGenerated, 5m)
| order by TimeGenerated asc
// Find VMs with high CPU (from performance counters)
Perf
| where TimeGenerated > ago(15m)
| where CounterName == "% Processor Time"
| where InstanceName == "_Total"
| summarize AvgCPU = avg(CounterValue), MaxCPU = max(CounterValue) by Computer
| where AvgCPU > 80
| order by AvgCPU desc
OperatorPurposeExample
whereFilter rowswhere Status == 500
projectSelect/rename columnsproject Time=TimeGenerated, User
summarizeAggregate (count, avg, sum, etc.)summarize count() by Status
order bySort resultsorder by TimeGenerated desc
takeLimit result counttake 100
extendAdd computed columnsextend DurationSec = DurationMs / 1000
binGroup time into bucketsbin(TimeGenerated, 5m)
joinJoin two tablesjoin kind=inner (Table2) on Key
renderVisualize as chartrender timechart
agoRelative timeago(24h), ago(7d)
betweenRange filterwhere Value between (10 .. 100)

Practical KQL Queries for Day-to-Day Operations

Section titled “Practical KQL Queries for Day-to-Day Operations”
// Who deleted resources in the last 7 days?
AzureActivity
| where TimeGenerated > ago(7d)
| where OperationNameValue endswith "delete"
| where ActivityStatusValue == "Success"
| project TimeGenerated, Caller, ResourceGroup,
Resource = tostring(split(ResourceId, "/")[-1]),
Operation = OperationNameValue
| order by TimeGenerated desc
// Disk space usage trends (last 24 hours)
Perf
| where TimeGenerated > ago(24h)
| where ObjectName == "LogicalDisk" and CounterName == "% Free Space"
| where InstanceName !in ("_Total", "HarddiskVolume1")
| summarize AvgFreeSpace = avg(CounterValue) by Computer, InstanceName, bin(TimeGenerated, 1h)
| order by AvgFreeSpace asc
// Top 10 slowest API endpoints
AppRequests
| where TimeGenerated > ago(1h)
| where Success == true
| summarize AvgDuration = avg(DurationMs), P99 = percentile(DurationMs, 99), Count = count() by Name
| where Count > 10 // Filter out infrequent endpoints
| order by P99 desc
| take 10
// Anomaly detection: sudden spike in errors
AppExceptions
| where TimeGenerated > ago(24h)
| summarize ErrorCount = count() by bin(TimeGenerated, 15m)
| render timechart

Pause and predict: If you configure an alert to trigger when CPU exceeds 90%, but you do not assign an Action Group to it, what will happen when the CPU hits 100%?

Alerts proactively notify you when conditions are met. An alert has three components: a condition (what to watch), an action group (what to do), and severity (how critical).

TypeMonitorsEvaluation FrequencyUse Case
Metric alertMetric thresholds1 minCPU > 85%, Response time > 2s
Log alertKQL query results5-15 minError count > threshold, security events
Activity log alertControl plane eventsReal-timeResource deleted, role assigned
Smart detectionAI-detected anomaliesContinuousUnusual failure rate, performance degradation
Terminal window
# Create an action group (email + webhook)
az monitor action-group create \
--resource-group myRG \
--name "platform-team" \
--short-name "platform" \
--email-receiver "oncall" "oncall@company.com" \
--webhook-receiver "pagerduty" "https://events.pagerduty.com/integration/xxx/enqueue"
# Create a metric alert: VM CPU > 85% for 5 minutes
VM_ID="/subscriptions/<sub>/resourceGroups/myRG/providers/Microsoft.Compute/virtualMachines/myVM"
ACTION_GROUP_ID=$(az monitor action-group show -g myRG -n platform-team --query id -o tsv)
az monitor metrics alert create \
--resource-group myRG \
--name "high-cpu-alert" \
--description "VM CPU exceeds 85% for 5 minutes" \
--severity 2 \
--scopes "$VM_ID" \
--condition "avg Percentage CPU > 85" \
--window-size 5m \
--evaluation-frequency 1m \
--action "$ACTION_GROUP_ID"
# Create a log alert: more than 10 errors in 15 minutes
WORKSPACE_ID=$(az monitor log-analytics workspace show -g myRG -n kubedojo-logs --query id -o tsv)
az monitor scheduled-query create \
--resource-group myRG \
--name "high-error-rate" \
--description "More than 10 errors in 15 minutes" \
--severity 1 \
--scopes "$WORKSPACE_ID" \
--condition "count 'AzureDiagnostics | where Category == \"AuditEvent\" | where ResultType == \"Failure\"' > 10" \
--condition-query "AzureDiagnostics | where Category == 'AuditEvent' | where ResultType == 'Failure'" \
--window-size 15m \
--evaluation-frequency 5m \
--action-groups "$ACTION_GROUP_ID"
# Create an activity log alert: resource group deleted
az monitor activity-log alert create \
--resource-group myRG \
--name "rg-deleted-alert" \
--description "Resource group was deleted" \
--condition "category=Administrative and operationName=Microsoft.Resources/subscriptions/resourceGroups/delete" \
--action-group "$ACTION_GROUP_ID"

Application Insights is a component of Azure Monitor focused on application performance monitoring (APM). It traces requests through distributed systems, captures exceptions, and profiles performance.

┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│ Client │ ──► │ API GW │ ──► │ Order │ ──► │ Payment │
│ Browser │ │ Service │ │ Service │ │ Service │
└──────────┘ └──────────┘ └──────────┘ └──────────┘
│ │ │ │
│ Request ID: abc-123 │ │
│ ──────────────────────────────────────────── │
│ All services share the same │
│ correlation ID (Operation ID) │
│ │
└── Application Insights traces the entire ───────┘
request across all services
Terminal window
# Create Application Insights resource
az monitor app-insights component create \
--app kubedojo-app-insights \
--resource-group myRG \
--location eastus2 \
--kind web \
--application-type web \
--workspace "$WORKSPACE_ID"
# Get the instrumentation key and connection string
az monitor app-insights component show \
--app kubedojo-app-insights -g myRG \
--query '{InstrumentationKey:instrumentationKey, ConnectionString:connectionString}' -o json

Most Azure SDKs and popular frameworks auto-instrument when you set the APPLICATIONINSIGHTS_CONNECTION_STRING environment variable:

Terminal window
# Configure App Insights on a Container App
az containerapp update \
--resource-group myRG \
--name my-app \
--set-env-vars "APPLICATIONINSIGHTS_CONNECTION_STRING=$APPINSIGHTS_CONN_STRING"
// Request performance by endpoint (last hour)
AppRequests
| where TimeGenerated > ago(1h)
| summarize
AvgDuration = avg(DurationMs),
P95 = percentile(DurationMs, 95),
FailRate = countif(Success == false) * 100.0 / count(),
Total = count()
by Name
| order by Total desc
// Dependency call failures (DB, HTTP, etc.)
AppDependencies
| where TimeGenerated > ago(1h)
| where Success == false
| summarize FailCount = count() by Target, DependencyType, ResultCode
| order by FailCount desc
// End-to-end transaction search
AppRequests
| where OperationId == "abc123"
| union AppDependencies | where OperationId == "abc123"
| union AppExceptions | where OperationId == "abc123"
| project TimeGenerated, ItemType = Type, Name, DurationMs, Success, ResultCode
| order by TimeGenerated asc

The Azure Monitor Agent replaces the legacy Log Analytics Agent (MMA/OMS). It collects performance counters, syslog, Windows events, and custom logs from VMs and sends them to Log Analytics.

Terminal window
# Install Azure Monitor Agent on a Linux VM
az vm extension set \
--resource-group myRG \
--vm-name myVM \
--name AzureMonitorLinuxAgent \
--publisher Microsoft.Azure.Monitor \
--enable-auto-upgrade true
# Create a Data Collection Rule (DCR) - defines what to collect
az monitor data-collection rule create \
--resource-group myRG \
--name "vm-performance-rule" \
--location eastus2 \
--data-flows '[{
"streams": ["Microsoft-Perf", "Microsoft-Syslog"],
"destinations": ["logAnalyticsWorkspace"]
}]' \
--destinations '{
"logAnalytics": [{
"workspaceResourceId": "'$WORKSPACE_ID'",
"name": "logAnalyticsWorkspace"
}]
}' \
--data-sources '{
"performanceCounters": [{
"name": "perfCounters",
"streams": ["Microsoft-Perf"],
"samplingFrequencyInSeconds": 60,
"counterSpecifiers": [
"\\Processor(_Total)\\% Processor Time",
"\\Memory\\Available Bytes",
"\\LogicalDisk(_Total)\\% Free Space",
"\\LogicalDisk(_Total)\\Disk Reads/sec",
"\\LogicalDisk(_Total)\\Disk Writes/sec"
]
}],
"syslog": [{
"name": "syslogDataSource",
"streams": ["Microsoft-Syslog"],
"facilityNames": ["auth", "authpriv", "cron", "daemon", "kern"],
"logLevels": ["Warning", "Error", "Critical", "Alert", "Emergency"]
}]
}'
# Associate the DCR with the VM
DCR_ID=$(az monitor data-collection rule show -g myRG -n vm-performance-rule --query id -o tsv)
az monitor data-collection rule association create \
--name "vm-dcr-association" \
--resource "/subscriptions/<sub>/resourceGroups/myRG/providers/Microsoft.Compute/virtualMachines/myVM" \
--rule-id "$DCR_ID"

  1. Log Analytics ingestion costs $2.76 per GB, but the first 5 GB per billing account per month are free. A single VM running the Azure Monitor Agent with default performance counters and syslog collection generates approximately 1-3 GB of log data per month. However, enabling verbose application logging or NSG flow logs can push a single VM to 10+ GB per month. Always estimate your ingestion volume before enabling logging on production fleets.

  2. KQL was designed by the same team that built Azure Data Explorer (Kusto), and it processes over 15 exabytes of data per day across Microsoft’s internal telemetry systems. The same language is used in Microsoft Sentinel (SIEM), Azure Monitor, Azure Data Explorer, and Microsoft 365 Defender. Learning KQL is one of the highest-ROI skills for any Azure engineer.

  3. Azure Monitor can detect anomalies automatically using machine learning. The series_decompose_anomalies() KQL function analyzes time-series data and flags data points that deviate significantly from the expected pattern. You do not need to set static thresholds---the model learns what “normal” looks like and alerts on deviations. This catches issues like “response time gradually increased 40% over 3 days” that static thresholds miss.

  4. Application Insights sampling reduces data volume without losing visibility. By default, adaptive sampling kicks in when your application generates more than 5 events per second per server. It intelligently drops redundant telemetry (like 1000 identical successful requests) while preserving anomalies (errors, slow requests). A team generating 100 GB of App Insights data per month reduced it to 8 GB with adaptive sampling, with zero loss in diagnostic capability.


MistakeWhy It HappensHow to Fix It
Not setting up monitoring until after an incident”We will add monitoring later” is the most expensive sentence in engineeringSet up baseline monitoring (CPU, memory, disk, response time) and alerts on day one, before the application goes to production.
Creating one Log Analytics workspace per resourceMisunderstanding of workspace purposeUse one workspace per environment (dev, staging, prod) or per team. Multiple resources send logs to the same workspace. Separate workspaces fragment your data and make cross-resource queries impossible.
Setting alert thresholds too aggressively (e.g., CPU > 50%)Teams want to catch problems earlyAggressive thresholds cause alert fatigue. The team ignores alerts, and real problems are missed. Set thresholds at actionable levels (CPU > 85%, not 50%).
Not configuring action groups on alertsAlerts are created but nobody receives themEvery alert must have an action group. At minimum, email the on-call team. Better: integrate with PagerDuty/Opsgenie for proper incident management.
Using the legacy Log Analytics Agent (MMA) instead of Azure Monitor AgentMMA appears in older documentation and tutorialsMMA is deprecated. Always use the Azure Monitor Agent (AMA) with Data Collection Rules. AMA supports identity-based auth, multiple workspaces, and data transformation.
Writing KQL queries that scan entire tables without time filtersThe query “works” but takes 30 secondsAlways start KQL queries with a where TimeGenerated > ago(...) filter. Log Analytics partitions data by time, so time filters dramatically reduce scan volume and cost.
Not enabling diagnostic settings on critical resourcesDiagnostic settings are not enabled by defaultUse Azure Policy to enforce diagnostic settings across all resources. At minimum, enable them on Key Vault, SQL, storage accounts, and networking resources.
Logging sensitive data (passwords, tokens, PII) to Log AnalyticsApplications log full request/response bodies for debuggingImplement log scrubbing. Never log request bodies, authorization headers, or user PII. Application Insights allows custom telemetry processors to redact sensitive fields before ingestion.

1. Your team is building a new microservice and wants to be notified immediately if the HTTP 500 error rate exceeds 5% for more than a minute. They also want to be able to search the exact error messages and stack traces from those failing requests to find the root cause. How should you use Azure Monitor metrics and logs to satisfy both requirements?

You should use metrics for the alerting requirement and logs for the root cause analysis. Metrics are numeric time-series data points that are evaluated in near real-time (1-minute granularity), making them perfect for triggering fast, threshold-based alerts like a 5% error rate. Logs, on the other hand, contain rich semi-structured text data such as full stack traces and error messages, which are necessary for debugging. However, logs have a slight ingestion delay (2-5 minutes) and are more expensive to store, making them less ideal for instantaneous alerting but essential for deep investigation.

2. A critical production storage account was unexpectedly deleted sometime during the past weekend, causing an application outage. Your manager has asked you to figure out exactly who deleted the resource and when the deletion occurred. Write a KQL query that retrieves this information from the control plane logs for the last 7 days.
AzureActivity
| where TimeGenerated > ago(7d)
| where OperationNameValue endswith "delete"
| where ActivityStatusValue == "Success"
| project
TimeGenerated,
Caller,
ResourceGroup,
ResourceType = tostring(split(ResourceId, "/")[-2]),
ResourceName = tostring(split(ResourceId, "/")[-1]),
OperationName = OperationNameValue
| order by TimeGenerated desc

This query targets the AzureActivity table, which records all control plane operations (like creations, updates, and deletions) across your Azure subscription. By filtering for operations ending with “delete” and a “Success” status, you isolate the exact events where resources were removed. The project operator then extracts only the most relevant fields—specifically the Caller (which identifies the user or service principal) and the precise TimeGenerated—allowing you to quickly answer who performed the action and when. It sorts the output by time in descending order to surface the most recent deletions first. By querying this centralized control plane log rather than resource-specific logs, you can guarantee visibility even after the resource itself is gone.

3. A junior engineer on your team suggests that to keep things organized, you should create a separate Log Analytics workspace for every single Azure App Service and SQL Database in your production environment (over 50 workspaces total). Why is this a problematic architectural decision?

Creating a separate workspace for every resource severely fragments your log data and breaks your ability to perform end-to-end tracing. When a user request fails, you often need to correlate logs from the frontend App Service with logs from the backend SQL Database. If those logs are in different workspaces, writing a single KQL query to join them becomes significantly harder, slower, and more expensive. The recommended best practice is to centralize logs into a single workspace per environment (e.g., one for production, one for staging) and use Role-Based Access Control (RBAC) to restrict who can query specific tables or resources.

4. You have a fleet of 100 Linux VMs running the Azure Monitor Agent. The security team wants to collect all `auth` and `kern` syslog events, while the platform team only wants to collect CPU and Memory performance counters. How can you configure the agents to satisfy both teams without installing two different monitoring agents?

You can satisfy both teams by using a Data Collection Rule (DCR) in Azure Monitor. A DCR is a centralized configuration resource that defines exactly what data should be collected (like performance counters or syslog facilities), how it should be transformed, and to which Log Analytics workspace it should be routed. Instead of configuring each VM’s agent individually, you create one or more DCRs defining the security and platform requirements, and then associate those rules with your 100 VMs. The Azure Monitor Agent will automatically download the rules and begin collecting only the specified data streams, keeping configuration consistent and easy to manage centrally.

5. Since deploying a new application, your on-call engineers are receiving over 50 alert emails per day. Most of these alerts trigger when a VM's CPU exceeds 50% for a single minute, but the CPU drops back to 20% almost immediately. The engineers have started ignoring all monitoring emails. How should you redesign the alerting strategy to fix this?

You need to redesign the strategy to eliminate alert fatigue by adjusting both the thresholds and the evaluation windows. First, a 50% CPU spike for one minute is normal behavior for many applications; you should raise the threshold to an actionable level (e.g., 85%) and increase the window size so the alert only fires if the CPU remains elevated for 5 or 10 consecutive minutes. Second, you should route alerts based on severity: transient or non-critical issues (Sev 3/4) should be logged to a dashboard or chat channel, while only critical, user-impacting issues (Sev 0/1) should trigger emails or wake up an engineer via PagerDuty. Finally, you might consider using dynamic thresholds that rely on machine learning to alert only when the CPU deviates significantly from its historical baseline.

6. Your architecture consists of a frontend web app, an API gateway, and three backend microservices. A user reports that when they click "Checkout", the page spins for 10 seconds and then fails. You need to figure out which specific backend microservice is causing the delay. How does Application Insights track this single checkout request across all five components?

Application Insights tracks the request across multiple services by implementing distributed tracing based on the W3C Trace Context standard. When the user clicks “Checkout”, the first component (the frontend web app) generates a unique Operation ID for that specific transaction. This Operation ID is automatically injected into the HTTP headers (such as traceparent) of all subsequent outbound calls to the API gateway and backend microservices. Because every service logs its telemetry (requests, dependencies, exceptions) using that exact same Operation ID, you can query Application Insights for that ID to visualize the entire end-to-end transaction, instantly identifying which specific microservice took 10 seconds to respond.


Hands-On Exercise: Monitor Agent on VM with Custom Logs, KQL Query, and Email Alert

Section titled “Hands-On Exercise: Monitor Agent on VM with Custom Logs, KQL Query, and Email Alert”

In this exercise, you will deploy a VM with the Azure Monitor Agent, collect performance data and custom logs, write KQL queries, and create an email alert.

Prerequisites: Azure CLI installed and authenticated.

Terminal window
RG="kubedojo-monitor-lab"
LOCATION="eastus2"
az group create --name "$RG" --location "$LOCATION"
# Create Log Analytics workspace
az monitor log-analytics workspace create \
--resource-group "$RG" \
--workspace-name monitor-lab-logs \
--location "$LOCATION" \
--retention-time 30
WORKSPACE_ID=$(az monitor log-analytics workspace show \
-g "$RG" -n monitor-lab-logs --query id -o tsv)
WORKSPACE_CUSTOMER_ID=$(az monitor log-analytics workspace show \
-g "$RG" -n monitor-lab-logs --query customerId -o tsv)
Verify Task 1
Terminal window
az monitor log-analytics workspace show -g "$RG" -n monitor-lab-logs \
--query '{Name:name, ID:customerId, Retention:retentionInDays}' -o table

Task 2: Deploy a VM with Azure Monitor Agent

Section titled “Task 2: Deploy a VM with Azure Monitor Agent”
Terminal window
# Create a VM
az vm create \
--resource-group "$RG" \
--name monitor-lab-vm \
--image Ubuntu2204 \
--size Standard_B2s \
--admin-username azureuser \
--generate-ssh-keys \
--assign-identity '[system]'
# Install Azure Monitor Agent
az vm extension set \
--resource-group "$RG" \
--vm-name monitor-lab-vm \
--name AzureMonitorLinuxAgent \
--publisher Microsoft.Azure.Monitor \
--enable-auto-upgrade true
# Grant the VM's identity access to the DCR (Monitoring Metrics Publisher)
VM_IDENTITY=$(az vm identity show -g "$RG" -n monitor-lab-vm --query principalId -o tsv)
Verify Task 2
Terminal window
az vm extension list -g "$RG" --vm-name monitor-lab-vm \
--query '[].{Name:name, Publisher:publisher, State:provisioningState}' -o table

You should see AzureMonitorLinuxAgent with Succeeded state.

Terminal window
az monitor data-collection rule create \
--resource-group "$RG" \
--name "vm-perf-syslog-rule" \
--location "$LOCATION" \
--data-flows '[{
"streams": ["Microsoft-Perf"],
"destinations": ["logAnalytics"]
}, {
"streams": ["Microsoft-Syslog"],
"destinations": ["logAnalytics"]
}]' \
--destinations "{
\"logAnalytics\": [{
\"workspaceResourceId\": \"$WORKSPACE_ID\",
\"name\": \"logAnalytics\"
}]
}" \
--data-sources '{
"performanceCounters": [{
"name": "perfCounters",
"streams": ["Microsoft-Perf"],
"samplingFrequencyInSeconds": 30,
"counterSpecifiers": [
"\\Processor(_Total)\\% Processor Time",
"\\Memory\\Available Bytes",
"\\Memory\\% Used Memory",
"\\LogicalDisk(_Total)\\% Free Space"
]
}],
"syslog": [{
"name": "syslog",
"streams": ["Microsoft-Syslog"],
"facilityNames": ["auth", "authpriv", "daemon", "kern"],
"logLevels": ["Warning", "Error", "Critical", "Alert", "Emergency"]
}]
}'
# Associate DCR with the VM
DCR_ID=$(az monitor data-collection rule show -g "$RG" -n vm-perf-syslog-rule --query id -o tsv)
VM_ID=$(az vm show -g "$RG" -n monitor-lab-vm --query id -o tsv)
az monitor data-collection rule association create \
--name "vm-dcr-assoc" \
--resource "$VM_ID" \
--rule-id "$DCR_ID"
Verify Task 3
Terminal window
az monitor data-collection rule association list --resource "$VM_ID" \
--query '[].{Name:name, RuleId:dataCollectionRuleId}' -o table
Terminal window
# Generate some CPU load on the VM
az vm run-command invoke -g "$RG" -n monitor-lab-vm \
--command-id RunShellScript \
--scripts "stress-ng --cpu 2 --timeout 120 &" 2>/dev/null || \
az vm run-command invoke -g "$RG" -n monitor-lab-vm \
--command-id RunShellScript \
--scripts "for i in \$(seq 1 4); do yes > /dev/null & done; sleep 120; kill %1 %2 %3 %4 2>/dev/null"
echo "Generating CPU load... Waiting 5 minutes for data to appear in Log Analytics."
echo "(Data ingestion typically takes 3-8 minutes)"
sleep 300
Verify Task 4
Terminal window
# Query Log Analytics for performance data
az monitor log-analytics query \
--workspace "$WORKSPACE_CUSTOMER_ID" \
--analytics-query "Perf | where TimeGenerated > ago(10m) | where Computer contains 'monitor-lab' | summarize avg(CounterValue) by CounterName | order by CounterName asc" \
--output table

You should see CPU and memory counter data.

Task 5: Create an Alert Rule with Action Group

Section titled “Task 5: Create an Alert Rule with Action Group”
Terminal window
# Create an action group (email notification)
az monitor action-group create \
--resource-group "$RG" \
--name "lab-alerts" \
--short-name "labalerts" \
--email-receiver "lab-oncall" "your-email@example.com"
# Create a metric alert for high CPU
ACTION_GROUP_ID=$(az monitor action-group show -g "$RG" -n lab-alerts --query id -o tsv)
az monitor metrics alert create \
--resource-group "$RG" \
--name "vm-high-cpu" \
--description "Alert when CPU exceeds 80% for 5 minutes" \
--severity 2 \
--scopes "$VM_ID" \
--condition "avg Percentage CPU > 80" \
--window-size 5m \
--evaluation-frequency 1m \
--action "$ACTION_GROUP_ID"
Verify Task 5
Terminal window
az monitor metrics alert show -g "$RG" -n vm-high-cpu \
--query '{Name:name, Severity:severity, Enabled:enabled, Condition:criteria.allOf[0].{Metric:metricName, Operator:operator, Threshold:threshold}}' -o json
Terminal window
# Query 1: Average CPU over the last 10 minutes
az monitor log-analytics query \
--workspace "$WORKSPACE_CUSTOMER_ID" \
--analytics-query "
Perf
| where TimeGenerated > ago(10m)
| where CounterName == '% Processor Time'
| where InstanceName == '_Total'
| summarize AvgCPU = avg(CounterValue), MaxCPU = max(CounterValue) by bin(TimeGenerated, 1m)
| order by TimeGenerated desc
" --output table
# Query 2: Memory usage
az monitor log-analytics query \
--workspace "$WORKSPACE_CUSTOMER_ID" \
--analytics-query "
Perf
| where TimeGenerated > ago(10m)
| where CounterName == '% Used Memory'
| summarize AvgMemory = avg(CounterValue) by bin(TimeGenerated, 1m)
| order by TimeGenerated desc
" --output table
# Query 3: Agent heartbeat (proves the agent is reporting)
az monitor log-analytics query \
--workspace "$WORKSPACE_CUSTOMER_ID" \
--analytics-query "
Heartbeat
| where TimeGenerated > ago(10m)
| summarize LastHeartbeat = max(TimeGenerated) by Computer, OSType, Version
" --output table
Verify Task 6

All three queries should return data. If they are empty, wait another 5 minutes for data ingestion and retry. The Heartbeat query is the most reliable indicator that the agent is working.

Terminal window
az group delete --name "$RG" --yes --no-wait
  • Log Analytics workspace created
  • VM deployed with Azure Monitor Agent and system-assigned managed identity
  • Data Collection Rule created and associated with the VM
  • Performance counter data visible in Log Analytics (Perf table)
  • KQL queries returning CPU, memory, and heartbeat data
  • Metric alert created with email action group

Module 3.11: CI/CD with Azure DevOps & GitHub Actions --- Learn how to build automated deployment pipelines that securely deploy to Azure using OIDC authentication, Azure Container Registry, and Container Apps.