Module 1.10: CloudWatch & Observability
Complexity: [MEDIUM] | Time to Complete: 2 hours | Track: AWS DevOps Essentials
Prerequisites
Before starting this module, ensure you have:
- Completed Module 1.3: EC2 & Compute Fundamentals (launching instances, security groups, IAM instance profiles)
- An AWS account with admin access (or scoped permissions for CloudWatch, EC2, IAM)
- AWS CLI v2 installed and configured
- At least one running EC2 instance to instrument (or willingness to launch one)
- Basic understanding of metrics, logs, and alerting concepts
What You’ll Be Able to Do
After completing this module, you will be able to:
- Configure the CloudWatch Agent to collect custom metrics including memory utilization, disk I/O, and application-level telemetry
- Deploy CloudWatch Alarms with composite alarm logic and SNS notification routing for multi-signal alerting
- Implement CloudWatch Logs Insights queries to diagnose application errors across distributed services
- Design CloudWatch dashboards with metric math expressions to visualize service health and cost trends
Why This Module Matters
In July 2019, a major financial services company experienced a 14-hour outage that cost them an estimated $12 million in lost transactions. The root cause was a memory leak in a Java microservice running on EC2. The leak took roughly 6 hours to exhaust available memory, at which point the application began throwing OutOfMemoryError exceptions. The operations team did not notice for another 3 hours because they only monitored CPU utilization — the default CloudWatch metric for EC2. Memory usage, application-level errors, and garbage collection pauses were invisible to them. By the time a customer complaint triggered investigation, cascading failures had spread to three downstream services.
Had they installed the CloudWatch Agent to collect memory and disk metrics, configured a custom metric for JVM heap usage, and set an alarm at 80% memory utilization, they would have received an alert 6 hours before the outage. A simple auto-scaling policy tied to memory pressure could have launched fresh instances automatically. Total cost of prevention: about $3/month in CloudWatch custom metrics.
In this module, you will learn the full CloudWatch observability stack — from the free standard metrics that every AWS resource emits, to custom metrics you define, to log aggregation with CloudWatch Logs, to alerting with CloudWatch Alarms, to tracing with X-Ray. You will understand what AWS gives you for free, what costs money, and where the sharp edges are that catch teams off guard.
Did You Know?
- CloudWatch ingests over 1 trillion metrics per day across all AWS customers. It is one of the oldest AWS services, launched in 2009, and has grown from a simple CPU-monitoring tool into a full observability platform.
- EC2 standard metrics have a 5-minute resolution by default and are completely free. Enabling “detailed monitoring” bumps this to 1-minute resolution but costs ~$2.10 per instance per month (7 metrics at $0.30 each). Most production workloads need 1-minute resolution — 5-minute intervals can hide spikes that cause real user impact.
- CloudWatch Logs Insights can query terabytes of logs in seconds using a purpose-built query language. It was released in November 2018 and has largely eliminated the need for teams to ship logs to Elasticsearch just for ad-hoc querying. You pay $0.005 per GB of data scanned.
- The CloudWatch Agent replaced three older tools: the CloudWatch Monitoring Scripts (Perl-based `mon-put-instance-data.pl`), the SSM CloudWatch Plugin (on Windows), and the CloudWatch Logs Agent (`awslogs`). If you encounter tutorials referencing the `awslogs` agent or `mon-put-instance-data.pl`, they are outdated — the unified CloudWatch Agent handles everything.
Standard Metrics: What AWS Gives You for Free
Every AWS service automatically publishes metrics to CloudWatch at no cost. These are called standard metrics (sometimes “basic monitoring” or “vended metrics”). Understanding what is free versus paid prevents surprise bills.
EC2 Standard Metrics
```
+------------------------------------------------------------------+
|                      EC2 Standard Metrics                        |
|                    (Free, 5-minute intervals)                    |
+------------------------------------------------------------------+
|                                                                  |
|  CPU                              Network                        |
|  - CPUUtilization (%)             - NetworkIn (bytes)            |
|  - CPUCreditUsage (T-series)      - NetworkOut (bytes)           |
|  - CPUCreditBalance               - NetworkPacketsIn             |
|                                   - NetworkPacketsOut            |
|  Disk (instance store only)                                      |
|  - DiskReadOps                    Status Checks                  |
|  - DiskWriteOps                   - StatusCheckFailed            |
|  - DiskReadBytes                  - StatusCheckFailed_Instance   |
|  - DiskWriteBytes                 - StatusCheckFailed_System     |
|                                                                  |
+------------------------------------------------------------------+
|                                                                  |
|  NOT included (requires CloudWatch Agent):                       |
|  - Memory utilization                                            |
|  - Disk space utilization (EBS volumes)                          |
|  - Swap usage                                                    |
|  - Process-level metrics                                         |
|                                                                  |
+------------------------------------------------------------------+
```

The biggest gap in EC2 standard metrics is memory. AWS cannot see inside your instance’s operating system (the hypervisor only sees CPU, network, and instance-store disk I/O), so memory and EBS disk space metrics require an agent running inside the instance.

Stop and think: If an EC2 instance exhausts its memory and crashes, which of the standard free metrics might give you a clue that something went wrong, given that `MemoryUtilization` is not tracked?
Viewing Standard Metrics
```bash
# List all available metrics for an instance
aws cloudwatch list-metrics \
  --namespace "AWS/EC2" \
  --dimensions "Name=InstanceId,Value=i-0abc123def456789"

# Get CPU utilization for the last hour (5-minute periods)
aws cloudwatch get-metric-statistics \
  --namespace "AWS/EC2" \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0abc123def456789 \
  --start-time "$(date -u -v-1H '+%Y-%m-%dT%H:%M:%S')" \
  --end-time "$(date -u '+%Y-%m-%dT%H:%M:%S')" \
  --period 300 \
  --statistics Average Maximum

# On Linux, use date -d instead of -v:
# --start-time "$(date -u -d '1 hour ago' '+%Y-%m-%dT%H:%M:%S')"
```

Other Services’ Free Metrics
| Service | Key Free Metrics | Default Resolution |
|---|---|---|
| RDS | CPUUtilization, FreeStorageSpace, ReadIOPS, WriteIOPS, DatabaseConnections | 1 minute |
| ALB | RequestCount, TargetResponseTime, HTTPCode_Target_4XX_Count, HealthyHostCount | 1 minute |
| ECS | CPUUtilization, MemoryUtilization (per service) | 1 minute |
| Lambda | Invocations, Duration, Errors, Throttles, ConcurrentExecutions | 1 minute |
| SQS | NumberOfMessagesSent, ApproximateNumberOfMessagesVisible, ApproximateAgeOfOldestMessage | 5 minutes |
| DynamoDB | ConsumedReadCapacityUnits, ConsumedWriteCapacityUnits, ThrottledRequests | 1 minute |
Notice that ECS gives you memory utilization for free (it can see container-level memory from the task metadata), while EC2 does not.
Custom Metrics: Measuring What Matters
Standard metrics tell you about infrastructure. Custom metrics tell you about your application. Business-critical values — requests per second, payment processing latency, queue depth, cache hit ratio — need custom metrics.
Publishing Custom Metrics
```bash
# Publish a single metric data point
aws cloudwatch put-metric-data \
  --namespace "MyApp/Production" \
  --metric-name "OrdersProcessed" \
  --value 142 \
  --unit Count \
  --dimensions Environment=production,Service=order-processor

# Publish with a timestamp (useful for backfilling)
aws cloudwatch put-metric-data \
  --namespace "MyApp/Production" \
  --metric-name "PaymentLatencyMs" \
  --value 238 \
  --unit Milliseconds \
  --timestamp "2026-03-24T10:30:00Z"

# Publish multiple metrics in one call (more efficient)
aws cloudwatch put-metric-data \
  --namespace "MyApp/Production" \
  --metric-data '[
    {"MetricName": "ActiveUsers", "Value": 1834, "Unit": "Count"},
    {"MetricName": "ErrorRate", "Value": 0.023, "Unit": "Percent"},
    {"MetricName": "CacheHitRatio", "Value": 94.6, "Unit": "Percent"}
  ]'
```

Pricing Reality Check
Custom metrics cost $0.30 per metric per month for the first 10,000 metrics, dropping to $0.10 at scale. A “metric” is defined by its unique combination of namespace, metric name, and dimensions.
This means these are three separate billable metrics:
- `MyApp/Production` + `OrdersProcessed` + `Environment=production,Service=orders`
- `MyApp/Production` + `OrdersProcessed` + `Environment=staging,Service=orders`
- `MyApp/Production` + `OrdersProcessed` + `Environment=production,Service=payments`

Teams that over-use dimensions (adding instance ID, request ID, or user ID as dimensions) can accidentally create millions of metrics and face bills in the thousands. A good rule: dimensions should have low cardinality (tens or hundreds of values, not thousands).
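To see how quickly dimensions multiply into billable metrics, you can estimate the count before publishing. A quick back-of-the-envelope sketch (the dimension counts here are hypothetical; the $0.30 rate comes from the text above):

```python
# Worst-case estimate of billable CloudWatch custom metrics: if every
# combination of dimension values is published, the count is the product
# of each dimension's distinct values, times the number of metric names.

def billable_metrics(metric_names, dimension_cardinalities):
    """metric_names: metric names in one namespace.
    dimension_cardinalities: dimension name -> number of distinct values."""
    combos = 1
    for n in dimension_cardinalities.values():
        combos *= n
    return len(metric_names) * combos

# Low-cardinality dimensions: cheap and useful
ok = billable_metrics(["OrdersProcessed"], {"Environment": 3, "Service": 5})
print(ok, "metrics ->", f"${ok * 0.30:.2f}/month")   # 15 metrics -> $4.50/month

# Adding a per-instance dimension across a 200-node fleet: 200x the bill
bad = billable_metrics(["OrdersProcessed"],
                       {"Environment": 3, "Service": 5, "InstanceId": 200})
print(bad, "metrics ->", f"${bad * 0.30:.2f}/month")  # 3000 metrics -> $900.00/month
```

This is why the rule of thumb above says to keep dimensions to tens or hundreds of values: each new dimension multiplies, rather than adds to, the bill.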
Embedded Metric Format (EMF)
If your application writes structured JSON logs, CloudWatch can automatically extract metrics from them. This is called the Embedded Metric Format, and it is the most cost-effective way to publish custom metrics from Lambda functions and ECS tasks:
```python
import json
import sys
import time

def emit_metric(metric_name, value, unit="Count", dimensions=None):
    """Emit a CloudWatch metric via Embedded Metric Format."""
    emf = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # epoch ms
            "CloudWatchMetrics": [
                {
                    "Namespace": "MyApp/Production",
                    "Dimensions": [list(dimensions.keys())] if dimensions else [[]],
                    "Metrics": [
                        {"Name": metric_name, "Unit": unit}
                    ]
                }
            ]
        },
        metric_name: value,
    }
    if dimensions:
        emf.update(dimensions)
    # Print to stdout -- CloudWatch Logs automatically extracts the metric
    print(json.dumps(emf))
    sys.stdout.flush()

# Usage
emit_metric("CheckoutLatency", 234, "Milliseconds",
            {"Environment": "production", "Region": "us-east-1"})
```

The beauty of EMF: you get both a log entry AND a CloudWatch metric from a single print statement. No separate put-metric-data API call needed.
Pause and predict: If you use `put-metric-data` synchronously in a Lambda function that processes 10,000 requests per second, what two major bottlenecks or operational issues will you likely encounter?
CloudWatch Alarms: Getting Notified Before Users Do
Metrics without alarms are just dashboards that nobody watches at 3 AM. Alarms bridge the gap between data collection and incident response.
Alarm Anatomy
Every CloudWatch Alarm has three states:

```
+----------+     threshold breached      +---------+
|    OK    | --------------------------> |  ALARM  |
|          | <-------------------------- |         |
+----------+     threshold recovered     +---------+
      |                                       |
      |          insufficient data            |
      v                                       v
+---------------------------------------------------+
|                 INSUFFICIENT_DATA                 |
|        (not enough data points to evaluate)       |
+---------------------------------------------------+
```

Creating Alarms
```bash
# CPU alarm: trigger if average CPU > 80% for 3 consecutive 5-minute periods
aws cloudwatch put-metric-alarm \
  --alarm-name "high-cpu-i-0abc123" \
  --alarm-description "CPU utilization exceeds 80% for 15 minutes" \
  --metric-name CPUUtilization \
  --namespace AWS/EC2 \
  --statistic Average \
  --period 300 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 3 \
  --dimensions Name=InstanceId,Value=i-0abc123def456789 \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts \
  --ok-actions arn:aws:sns:us-east-1:123456789012:ops-alerts \
  --treat-missing-data missing

# Status check alarm (recover the instance automatically)
aws cloudwatch put-metric-alarm \
  --alarm-name "status-check-i-0abc123" \
  --alarm-description "Recover instance on status check failure" \
  --metric-name StatusCheckFailed_System \
  --namespace AWS/EC2 \
  --statistic Maximum \
  --period 60 \
  --threshold 1 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --evaluation-periods 2 \
  --dimensions Name=InstanceId,Value=i-0abc123def456789 \
  --alarm-actions arn:aws:automate:us-east-1:ec2:recover

# Custom metric alarm: order processing errors
aws cloudwatch put-metric-alarm \
  --alarm-name "order-errors-high" \
  --metric-name "OrderErrors" \
  --namespace "MyApp/Production" \
  --statistic Sum \
  --period 300 \
  --threshold 10 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 1 \
  --dimensions Name=Environment,Value=production \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts
```

The treat-missing-data Gotcha
This setting determines what happens when CloudWatch has no data points for an evaluation period. The options:
| Setting | Behavior | Best For |
|---|---|---|
| `missing` | Maintains current state | Most alarms (conservative) |
| `notBreaching` | Treats missing data as OK | Sporadic metrics (batch jobs) |
| `breaching` | Treats missing data as ALARM | Critical systems where silence is bad |
| `ignore` | Skips the period entirely | Alarms with naturally gappy data |
The default is `missing`, which is usually correct. But for critical health checks, consider `breaching` — if your application stops reporting metrics, that itself is a problem worth alerting on.
Stop and think: You have an alarm monitoring a batch job that runs once an hour. If `treat-missing-data` is set to `missing`, what state will the alarm be in for the 59 minutes the job isn’t running, and how might that affect your incident response?
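To build intuition for the question above, here is a toy Python model of the four settings' documented behavior. This is an illustration, not CloudWatch's actual evaluation logic (in particular, it simplifies `ignore` and omits the INSUFFICIENT_DATA state), and the batch-job numbers are hypothetical:

```python
# Toy model of treat-missing-data semantics for a single evaluation period.

def next_state(current_state, datapoint, threshold, treat_missing):
    if datapoint is None:                      # no data this period
        return {
            "missing": current_state,          # hold the current state
            "notBreaching": "OK",              # pretend everything is fine
            "breaching": "ALARM",              # silence itself is an alarm
            "ignore": current_state,           # simplified: period skipped
        }[treat_missing]
    return "ALARM" if datapoint > threshold else "OK"

# Hourly batch job: one datapoint, then 11 empty 5-minute periods
periods = [42] + [None] * 11
state = "OK"
for p in periods:
    state = next_state(state, p, threshold=100, treat_missing="missing")
print(state)  # OK -- the alarm simply holds OK all hour, even if the job dies
```

Run the same loop with `treat_missing="breaching"` and the alarm flips to ALARM at the first empty period, which is exactly why that setting is wrong for naturally gappy metrics.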
Composite Alarms
Section titled “Composite Alarms”When a single metric alarm is too noisy, combine multiple alarms with boolean logic:
# Only alert if BOTH CPU is high AND memory is highaws cloudwatch put-composite-alarm \ --alarm-name "instance-stressed" \ --alarm-rule 'ALARM("high-cpu-i-0abc123") AND ALARM("high-memory-i-0abc123")' \ --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alertsThis reduces alert fatigue significantly. A CPU spike alone is often transient. A CPU spike combined with high memory and elevated error rate is a real problem.
CloudWatch Logs: Centralized Log Management
Every application produces logs. CloudWatch Logs gives you a central place to store, search, and analyze them.
Core Concepts
CloudWatch Logs Architecture:

```
Log Group: /myapp/production/api
  |
  +-- Log Stream: i-0abc123/application.log
  |     - Log Event: "2026-03-24T10:30:01Z INFO Request processed in 234ms"
  |     - Log Event: "2026-03-24T10:30:02Z ERROR Database connection timeout"
  |
  +-- Log Stream: i-0def456/application.log
  |     - Log Event: "2026-03-24T10:30:01Z INFO Request processed in 189ms"
  |
  +-- Log Stream: i-0ghi789/application.log
        - Log Event: "2026-03-24T10:30:03Z WARN Cache miss rate above 20%"
```

- Log Group: A container for log streams, typically one per application/environment combination. Retention, access policies, and encryption are set at the group level.
- Log Stream: A sequence of log events from a single source (one per instance, container, or Lambda invocation).
- Log Event: A single log message with a timestamp.
Setting Retention (Cost Control)
By default, CloudWatch Logs retains data forever. This is the single biggest cost surprise for CloudWatch newcomers.
```bash
# Set retention to 30 days (common for production)
aws logs put-retention-policy \
  --log-group-name "/myapp/production/api" \
  --retention-in-days 30

# Valid retention periods (days):
# 1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545, 731, 1096,
# 1827, 2192, 2557, 2922, 3288, 3653

# Check current retention for all log groups
aws logs describe-log-groups \
  --query 'logGroups[*].[logGroupName,retentionInDays,storedBytes]' \
  --output table
```

A good strategy: 7-14 days for development, 30-90 days for production, and archive to S3 for long-term compliance needs.
CloudWatch Logs Insights
This is where CloudWatch Logs becomes genuinely powerful. Logs Insights lets you write SQL-like queries across log groups:
```bash
# Find the 20 slowest requests in the last hour
aws logs start-query \
  --log-group-name "/myapp/production/api" \
  --start-time $(date -u -v-1H '+%s') \
  --end-time $(date -u '+%s') \
  --query-string '
    fields @timestamp, @message
    | filter @message like /processed in/
    | parse @message "processed in *ms" as latency
    | sort latency desc
    | limit 20
  '

# Count errors by type in the last 24 hours
aws logs start-query \
  --log-group-name "/myapp/production/api" \
  --start-time $(date -u -v-24H '+%s') \
  --end-time $(date -u '+%s') \
  --query-string '
    fields @timestamp, @message
    | filter @message like /ERROR/
    | parse @message "ERROR * - *" as errorType, errorMessage
    | stats count(*) as errorCount by errorType
    | sort errorCount desc
  '

# Get the query results (use the queryId from start-query response)
aws logs get-query-results --query-id "a1b2c3d4-5678-90ab-cdef-example"
```

Key Logs Insights query patterns:
| Pattern | Example | Use Case |
|---|---|---|
| `filter` | `filter @message like /ERROR/` | Narrow to relevant logs |
| `parse` | `parse @message "status=*" as code` | Extract fields from unstructured logs |
| `stats` | `stats count(*) by code` | Aggregate and group |
| `sort` | `sort @timestamp desc` | Order results |
| `limit` | `limit 50` | Cap result size |
| `fields` | `fields @timestamp, @message` | Select columns |
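If the query syntax feels opaque, here is a rough Python analogue of the filter/parse/stats pipeline from the error-counting query, run against a few made-up log lines. It is only for intuition; Logs Insights runs this server-side over your actual log groups:

```python
import re
from collections import Counter

# Sample log events (made up for illustration)
logs = [
    "2026-03-24T10:30:02Z ERROR DBTimeout - connection pool exhausted",
    "2026-03-24T10:30:05Z INFO Request processed in 234ms",
    "2026-03-24T10:30:09Z ERROR DBTimeout - connection pool exhausted",
    "2026-03-24T10:31:12Z ERROR AuthFailure - token expired",
]

# filter @message like /ERROR/
errors = [line for line in logs if "ERROR" in line]

# parse @message "ERROR * - *" as errorType, errorMessage
parsed = [re.search(r"ERROR (\S+) - (.*)", line).groups() for line in errors]

# stats count(*) as errorCount by errorType | sort errorCount desc
counts = Counter(error_type for error_type, _ in parsed)
print(counts.most_common())  # [('DBTimeout', 2), ('AuthFailure', 1)]
```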
Pause and predict: You run a Logs Insights query searching for an error over a 30-day window on a high-traffic API. It costs $15 to run. If you add a `limit 10` clause to the exact same query and run it again, will the cost decrease? Why or why not?
Metric Filters: Turning Logs Into Metrics
You can create CloudWatch Metrics from log patterns without changing your application code:
```bash
# Create a metric filter that counts ERROR lines
aws logs put-metric-filter \
  --log-group-name "/myapp/production/api" \
  --filter-name "ErrorCount" \
  --filter-pattern "ERROR" \
  --metric-transformations \
    metricName=ApplicationErrors,metricNamespace=MyApp/Production,metricValue=1,defaultValue=0

# More specific: count 5xx responses in JSON logs
aws logs put-metric-filter \
  --log-group-name "/myapp/production/api" \
  --filter-name "5xxResponses" \
  --filter-pattern '{ $.statusCode >= 500 }' \
  --metric-transformations \
    metricName=Server5xxErrors,metricNamespace=MyApp/Production,metricValue=1,defaultValue=0
```

Now you can alarm on ApplicationErrors or Server5xxErrors just like any other CloudWatch metric.
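Conceptually, the ErrorCount filter behaves like this small sketch: each matching log event contributes `metricValue`, and `defaultValue` keeps the metric gap-free when nothing matches. This is a simplified model; the real evaluation happens inside CloudWatch Logs as events are ingested:

```python
# Simplified model of a metric filter transformation.

def metric_filter(events, pattern="ERROR", metric_value=1, default_value=0):
    """Sum metric_value over matching events; emit default_value if none match."""
    matches = [metric_value for event in events if pattern in event]
    return sum(matches) if matches else default_value

print(metric_filter(["INFO ok", "ERROR boom", "ERROR again"]))  # 2
print(metric_filter(["INFO ok"]))                               # 0
print(metric_filter([]))                                        # 0
```

The `defaultValue=0` part matters: without it, periods with no matches produce no datapoint at all, which interacts badly with alarm `treat-missing-data` settings.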
The CloudWatch Agent: Unlocking OS-Level Metrics and Custom Logs
The CloudWatch Agent is a lightweight daemon that runs inside your EC2 instances (and on-premises servers). It collects operating system metrics that the hypervisor cannot see and ships log files to CloudWatch Logs.
Installation
```bash
# Amazon Linux 2 / Amazon Linux 2023
sudo yum install -y amazon-cloudwatch-agent

# Ubuntu/Debian
wget https://amazoncloudwatch-agent.s3.amazonaws.com/ubuntu/amd64/latest/amazon-cloudwatch-agent.deb
sudo dpkg -i amazon-cloudwatch-agent.deb

# Verify installation
amazon-cloudwatch-agent-ctl -a status
```

Configuration
The agent is configured with a JSON file. You can generate one interactively with a wizard or write it directly:
```json
{
  "agent": {
    "metrics_collection_interval": 60,
    "run_as_user": "cwagent"
  },
  "metrics": {
    "namespace": "CWAgent",
    "append_dimensions": {
      "InstanceId": "${aws:InstanceId}",
      "AutoScalingGroupName": "${aws:AutoScalingGroupName}"
    },
    "aggregation_dimensions": [
      ["InstanceId"],
      ["AutoScalingGroupName"]
    ],
    "metrics_collected": {
      "mem": {
        "measurement": ["mem_used_percent", "mem_available_percent", "mem_total"],
        "metrics_collection_interval": 60
      },
      "disk": {
        "measurement": ["disk_used_percent", "disk_free"],
        "resources": ["/", "/data"],
        "metrics_collection_interval": 60
      },
      "swap": {
        "measurement": ["swap_used_percent"]
      },
      "cpu": {
        "measurement": ["cpu_usage_idle", "cpu_usage_user", "cpu_usage_system", "cpu_usage_iowait"],
        "totalcpu": true,
        "metrics_collection_interval": 60
      }
    }
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/myapp/application.log",
            "log_group_name": "/myapp/production/api",
            "log_stream_name": "{instance_id}/application.log",
            "retention_in_days": 30,
            "timestamp_format": "%Y-%m-%dT%H:%M:%S"
          },
          {
            "file_path": "/var/log/syslog",
            "log_group_name": "/myapp/production/system",
            "log_stream_name": "{instance_id}/syslog",
            "retention_in_days": 14
          }
        ]
      }
    }
  }
}
```

Storing Config in SSM and Starting the Agent
```bash
# Store the config in SSM Parameter Store
aws ssm put-parameter \
  --name "AmazonCloudWatch-linux-config" \
  --type String \
  --value file:///opt/aws/amazon-cloudwatch-agent/etc/config.json

# Fetch config from SSM and start the agent
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
  -a fetch-config \
  -m ec2 \
  -s \
  -c ssm:AmazonCloudWatch-linux-config

# Check agent status
amazon-cloudwatch-agent-ctl -a status
```

Required IAM Policy
The EC2 instance role needs these permissions:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "cloudwatch:PutMetricData",
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents",
        "logs:DescribeLogStreams",
        "ssm:GetParameter"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": "ec2:DescribeTags",
      "Resource": "*"
    }
  ]
}
```

AWS provides a managed policy CloudWatchAgentServerPolicy that covers these permissions. Use it instead of maintaining a custom policy:

```bash
aws iam attach-role-policy \
  --role-name my-ec2-role \
  --policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy
```

EventBridge: Event-Driven Automation
EventBridge (formerly CloudWatch Events) is the event bus that connects AWS services together. When something happens in your account — an EC2 instance changes state, a deployment completes, an alarm triggers — EventBridge can route that event to a target for automated response.
Common Event Patterns
Section titled “Common Event Patterns”# React when an EC2 instance stops unexpectedlyaws events put-rule \ --name "ec2-instance-stopped" \ --event-pattern '{ "source": ["aws.ec2"], "detail-type": ["EC2 Instance State-change Notification"], "detail": { "state": ["stopped", "terminated"] } }' \ --state ENABLED
# Send to SNS topicaws events put-targets \ --rule "ec2-instance-stopped" \ --targets '[{"Id":"notify-ops","Arn":"arn:aws:sns:us-east-1:123456789012:ops-alerts"}]'
# Schedule-based rule (cron): run a Lambda every day at 6 AM UTCaws events put-rule \ --name "daily-health-check" \ --schedule-expression "cron(0 6 * * ? *)" \ --state ENABLEDEventBridge vs CloudWatch Alarms
Think of alarms as threshold-based monitoring (“alert me when X exceeds Y”) and EventBridge as event-based automation (“when X happens, do Y”). They are complementary:
- Alarm: CPU > 80% for 15 minutes -> send SNS notification
- EventBridge: ECS task failed -> trigger Lambda to investigate and post to Slack
- Combined: Alarm triggers -> EventBridge rule catches alarm state change -> Lambda creates a PagerDuty incident
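EventBridge decides delivery by pattern matching on the event JSON: every key in the rule's pattern must appear in the event, with the event's value contained in the pattern's list. This toy matcher sketches that core idea for the ec2-instance-stopped rule above (real EventBridge patterns support many more operators, such as prefix and numeric matching):

```python
# Simplified EventBridge-style event pattern matching.

def matches(pattern, event):
    """True if every pattern key is satisfied by the event."""
    for key, expected in pattern.items():
        if isinstance(expected, dict):  # nested pattern (e.g. "detail")
            if not isinstance(event.get(key), dict) or not matches(expected, event[key]):
                return False
        elif event.get(key) not in expected:  # value must be in the allowed list
            return False
    return True

rule = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["stopped", "terminated"]},
}

stopped = {"source": "aws.ec2",
           "detail-type": "EC2 Instance State-change Notification",
           "detail": {"state": "stopped"}}
running = {"source": "aws.ec2",
           "detail-type": "EC2 Instance State-change Notification",
           "detail": {"state": "running"}}

print(matches(rule, stopped), matches(rule, running))  # True False
```

Keys the pattern does not mention (instance ID, region, account) are ignored, which is why one rule can cover every instance in the account.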
X-Ray: Distributed Tracing (Brief Overview)
When a request flows through multiple services (API Gateway to Lambda to DynamoDB to SQS to another Lambda), logs alone cannot tell you which service caused the slowdown. AWS X-Ray provides distributed tracing.
```
User Request
     |
     v
[API Gateway] --trace--> [Lambda A] --trace--> [DynamoDB]
     2ms                    45ms                  12ms
                              |
                              +--trace--> [SQS] --trace--> [Lambda B]
                                           3ms               28ms

Total request time: 90ms
Bottleneck: Lambda A (45ms)
```

X-Ray integration requires adding the X-Ray SDK to your application code and enabling tracing on the service. It is most useful for Lambda and ECS-based microservice architectures. For a deep dive, the X-Ray service deserves its own module — here, know that it exists and what it solves.
Cost Considerations
CloudWatch costs sneak up on teams that do not plan. Here is a pricing summary for US East (as of 2026):
| Component | Free Tier | Paid Rate |
|---|---|---|
| Standard metrics | All included | Free |
| Detailed monitoring (1-min) | 10 metrics | $0.30/metric/month |
| Custom metrics | First 10 metrics | $0.30/metric/month (first 10K) |
| Alarms | 10 standard alarms | $0.10/alarm/month |
| Logs ingestion | 5 GB/month | $0.50/GB |
| Logs storage | 5 GB/month | $0.03/GB/month |
| Logs Insights queries | None free | $0.005/GB scanned |
| Dashboards | 3 dashboards (50 metrics) | $3.00/dashboard/month |
The three biggest cost drivers are usually:
- Log ingestion — verbose application logging at scale adds up fast
- Custom metric cardinality — too many dimension combinations
- Log retention — the default is “forever,” which compounds monthly
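To put rough numbers on the first and third drivers, you can combine the ingestion and storage rates from the pricing table above. A quick sketch (the 20 GB/day volume is hypothetical; steady-state storage is approximated as daily volume times retention days):

```python
# Rough monthly CloudWatch Logs cost, using the US East rates from the table.

INGEST_PER_GB = 0.50          # $/GB ingested
STORAGE_PER_GB_MONTH = 0.03   # $/GB-month stored

def monthly_logs_cost(gb_per_day, retention_days):
    ingestion = gb_per_day * 30 * INGEST_PER_GB                    # every GB pays once
    storage = gb_per_day * retention_days * STORAGE_PER_GB_MONTH   # steady-state backlog
    return ingestion + storage

# 20 GB/day with 30-day retention
print(f"${monthly_logs_cost(20, 30):.2f}/month")   # $318.00/month
# Same volume with "never expire", two years in (and still growing)
print(f"${monthly_logs_cost(20, 730):.2f}/month")  # $738.00/month
```

Note that ingestion dominates at short retention; it is the logging volume itself, not storage, that usually needs trimming first.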
Common Mistakes
| Mistake | Why It Happens | How to Fix It |
|---|---|---|
| Not setting log group retention | Default is “never expire” and it accumulates silently | Set retention on every log group at creation time; audit with describe-log-groups regularly |
| Monitoring only CPU on EC2 | It is the only visible metric without agent setup | Install CloudWatch Agent on day one; memory and disk are essential signals |
| High-cardinality custom metric dimensions | Adding request ID, user ID, or IP as dimensions | Dimensions should have low cardinality (environment, service, region); put high-cardinality data in logs |
| Setting alarm evaluation period too short | Wanting to catch issues fast | A single 1-minute breach is often noise; use 3+ evaluation periods to reduce false alarms |
| Using `treat-missing-data = breaching` on metrics that naturally gap | Sporadic batch jobs or infrequent Lambda invocations | Use `notBreaching` or `ignore` for intermittent data sources |
| Not using Logs Insights, querying raw streams instead | Habit from grep/tail workflows | Logs Insights is faster, supports aggregation, and works across streams; invest 30 minutes learning the query syntax |
| Forgetting IAM permissions for CloudWatch Agent | Agent installed but fails silently | Attach CloudWatchAgentServerPolicy managed policy to the instance role; check agent logs at /opt/aws/amazon-cloudwatch-agent/logs/ |
| Creating dashboards instead of alarms | Dashboards feel productive | Dashboards require someone watching; alarms notify you proactively; build alarms first, dashboards second |
Review Questions

1. You are migrating a Java application from ECS to EC2. On ECS, you had a CloudWatch dashboard showing memory utilization without installing any agents. On EC2, the dashboard is blank. Why is this happening, and how do you fix it?
EC2 standard metrics are collected by the hypervisor, which sits outside the instance’s operating system and only sees hardware-level data like CPU cycles and network I/O. It cannot see inside the guest OS to measure memory allocation or process-level metrics. ECS, however, collects container metrics through the ECS agent running inside the instance, which has direct access to container resource usage via the container runtime API. To fix the blank dashboard on EC2, you must install and configure the CloudWatch Agent inside the OS to explicitly collect and publish memory metrics.
2. You configure a CloudWatch Alarm on CPUUtilization with a period of 300 seconds, an evaluation period of 3, and a threshold of 80%. A bug causes CPU to spike to 100% for 10 minutes, drop to 50% for 5 minutes, and spike back to 100% for 5 minutes. Does the alarm trigger? Why or why not?
The alarm does not trigger under these specific conditions. For an alarm to trigger with the default settings, the metric must breach the threshold for all consecutive evaluation periods—in this case, three consecutive 5-minute periods (15 minutes total). Since the CPU dropped below the 80% threshold during the third 5-minute period, the consecutive breach chain was broken. To catch intermittent spikes like this, you would need to use the “M out of N” evaluation model, such as requiring 2 out of 3 periods to breach the threshold.
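The consecutive-breach reasoning is easy to verify with a toy evaluator (not CloudWatch's implementation; the datapoints are the 5-minute averages from the scenario in question 2):

```python
# Toy alarm evaluation: "m out of n" datapoints must breach the threshold.
# m == n reproduces the default consecutive-periods behavior.

def alarm_fires(datapoints, threshold=80, m=3, n=3):
    """True if any window of n consecutive periods contains >= m breaches."""
    breaches = [d > threshold for d in datapoints]
    return any(sum(breaches[i:i + n]) >= m
               for i in range(len(breaches) - n + 1))

# 10 min at 100%, 5 min at 50%, 5 min at 100% -> four 5-minute periods
cpu = [100, 100, 50, 100]

print(alarm_fires(cpu, m=3, n=3))  # False: never 3 consecutive breaches
print(alarm_fires(cpu, m=2, n=3))  # True: "2 out of 3" catches the spikes
```

In the real CLI, the "M out of N" model corresponds to setting `--datapoints-to-alarm` below `--evaluation-periods` on `put-metric-alarm`.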
3. Your team is writing a high-throughput Lambda function that processes thousands of payment events per second. A developer suggests using the boto3 SDK to call `put_metric_data` for every payment to track custom business metrics. Why is this a poor architectural choice, and what should be used instead?
Calling the put-metric-data API directly within a high-throughput Lambda function introduces significant latency and cost, as every invocation must wait for a synchronous HTTP network call to CloudWatch to complete. At thousands of requests per second, this could lead to API throttling limits and inflate your Lambda duration billing. Instead, you should use the Embedded Metric Format (EMF) to write the metric data as structured JSON to stdout. CloudWatch Logs will asynchronously parse the EMF logs and publish the metrics behind the scenes, eliminating the API latency and cost from your function’s execution path.
4. You inherit an AWS environment where the monthly CloudWatch bill has inexplicably jumped from $50 to $800. The application architecture has not changed, but traffic has doubled. What are the first three areas you should investigate to identify the root cause?
First, you should investigate log ingestion volume, as doubled traffic often means doubled logs, and verbose logging quickly consumes terabytes of expensive ingestion data. Second, you must check the log group retention policies; if the default “Never expire” is set, storage costs will compound infinitely over time as old logs are never deleted. Third, review the custom metrics for high-cardinality dimensions, such as a developer accidentally adding a unique Request ID or User ID as a dimension. This mistake generates millions of unique billable metrics, which is one of the most common causes of massive CloudWatch billing spikes.
5. You need to automatically reboot an EC2 instance when it fails a system status check, and you also need to trigger a complex Step Functions workflow that opens a Jira ticket and pages the on-call engineer. Should you use a CloudWatch Alarm action, an EventBridge rule, or both? Why?
You should use a combination of both a CloudWatch Alarm action and an EventBridge rule for this scenario. CloudWatch Alarm actions are threshold-based and have built-in, native support for simple EC2 recovery actions (like rebooting or recovering an instance) when a status check fails. However, Alarm actions cannot directly trigger complex workflows like Step Functions. To achieve the second requirement, you would create an EventBridge rule configured to listen for the specific CloudWatch Alarm state change event, which can then flexibly route that event to the Step Functions state machine to handle the ticketing and paging.
6. During a major production incident, your team ran the same complex Logs Insights query across 500 GB of log data dozens of times, resulting in hundreds of dollars in query fees. How can you architect the system to reduce the cost of tracking this specific error pattern in the future?
To prevent repeated query fees for known error patterns, you should create a CloudWatch Metric Filter on the log group that matches the specific error syntax. The metric filter continuously evaluates incoming logs in real-time for free and increments a custom CloudWatch metric whenever the pattern is found. You can then build dashboards and alarms based on this custom metric, which costs a flat, predictable monthly rate rather than incurring per-query scan charges. For ad-hoc querying during an incident, you can also reduce costs by narrowing the Logs Insights time range to just the last few minutes, drastically reducing the gigabytes of data scanned.
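A sketch of the metric-filter approach; the log group, error pattern, and thresholds here are illustrative placeholders:

```bash
# Create a metric filter that counts occurrences of the known error pattern.
aws logs put-metric-filter \
  --log-group-name "/my-service/application" \
  --filter-name "payment-timeout-errors" \
  --filter-pattern '"ERROR PaymentTimeout"' \
  --metric-transformations \
    metricName=PaymentTimeoutErrors,metricNamespace=MyService,metricValue=1,defaultValue=0

# Alarm on the resulting metric instead of re-running Logs Insights queries.
aws cloudwatch put-metric-alarm \
  --alarm-name "payment-timeout-errors" \
  --namespace MyService \
  --metric-name PaymentTimeoutErrors \
  --statistic Sum \
  --period 300 \
  --evaluation-periods 1 \
  --threshold 10 \
  --comparison-operator GreaterThanOrEqualToThreshold
```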
7. Your infrastructure team is deploying 50 EC2 instances via Auto Scaling. A junior engineer suggests baking the CloudWatch Agent JSON configuration file directly into the Golden AMI. Why might this lead to operational headaches, and what service should you use instead?
Baking the configuration file directly into the AMI creates a tight coupling that requires you to rebuild and redeploy the entire Golden AMI across all 50 instances just to change a single metric interval or add a new log path. This turns a trivial configuration change into a time-consuming infrastructure deployment. Instead, you should store the JSON configuration in Systems Manager (SSM) Parameter Store. This centralized approach allows instances to fetch the latest configuration dynamically at startup, and you can push updates to running instances using SSM Run Command without ever needing to touch the base AMI.
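A sketch of the Parameter Store approach; the parameter name and local config file are placeholders, but `-c ssm:` is the agent's built-in mechanism for fetching configuration from Parameter Store:

```bash
# Store the agent config centrally in SSM Parameter Store.
aws ssm put-parameter \
  --name "AmazonCloudWatch-agent-config" \
  --type String \
  --value file://amazon-cloudwatch-agent.json

# On each instance (e.g., in user data or via SSM Run Command), fetch it:
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
  -a fetch-config -m ec2 -s \
  -c ssm:AmazonCloudWatch-agent-config
```

Using a parameter name that starts with `AmazonCloudWatch-` is convenient because the managed `CloudWatchAgentServerPolicy` already grants `ssm:GetParameter` on that prefix.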
Hands-On Exercise: CloudWatch Agent on EC2 with Custom Logs and CPU Alarm
Objective
Install the CloudWatch Agent on an EC2 instance, configure it to collect memory metrics and ship application logs, then create an alarm that fires when CPU exceeds a threshold.
You need:
- An EC2 instance (Amazon Linux 2023 recommended) with an IAM role attached
- SSH access to the instance
- The IAM role must have `CloudWatchAgentServerPolicy` attached
If you do not have an instance ready:
```bash
# Create an IAM role for the instance (if you don't have one)
aws iam create-role \
  --role-name cw-lab-ec2-role \
  --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Principal": {"Service": "ec2.amazonaws.com"},
      "Action": "sts:AssumeRole"
    }]
  }'

aws iam attach-role-policy \
  --role-name cw-lab-ec2-role \
  --policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy

aws iam attach-role-policy \
  --role-name cw-lab-ec2-role \
  --policy-arn arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore

aws iam create-instance-profile --instance-profile-name cw-lab-profile
aws iam add-role-to-instance-profile \
  --instance-profile-name cw-lab-profile \
  --role-name cw-lab-ec2-role

# Launch an instance (use your key pair and security group)
aws ec2 run-instances \
  --image-id resolve:ssm:/aws/service/ami-amazon-linux-latest/al2023-ami-kernel-default-x86_64 \
  --instance-type t3.micro \
  --iam-instance-profile Name=cw-lab-profile \
  --key-name YOUR_KEY_PAIR \
  --security-group-ids sg-YOUR_SG \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=cw-lab}]'
```
Task 1: Install the CloudWatch Agent
SSH into the instance and install the agent.
Solution
```bash
# SSH into the instance
ssh -i your-key.pem ec2-user@INSTANCE_PUBLIC_IP

# Install the CloudWatch Agent
sudo yum install -y amazon-cloudwatch-agent

# Verify installation
amazon-cloudwatch-agent-ctl -a status
# Should show: "status": "stopped"
```
Task 2: Create a Sample Application Log
Generate a log file that simulates application output.
Solution
```bash
# Create the log directory
sudo mkdir -p /var/log/myapp
sudo chown ec2-user:ec2-user /var/log/myapp

# Generate some fake log entries
cat > /tmp/generate-logs.sh <<'SCRIPT'
#!/bin/bash
while true; do
  TIMESTAMP=$(date -u '+%Y-%m-%dT%H:%M:%SZ')
  LATENCY=$((RANDOM % 500 + 10))
  STATUS_CODES=(200 200 200 200 200 201 301 400 404 500)
  STATUS=${STATUS_CODES[$RANDOM % ${#STATUS_CODES[@]}]}
  echo "${TIMESTAMP} INFO request_id=$(uuidgen | cut -c1-8) status=${STATUS} latency=${LATENCY}ms path=/api/orders"
  sleep 2
done >> /var/log/myapp/application.log
SCRIPT

chmod +x /tmp/generate-logs.sh
nohup /tmp/generate-logs.sh &
```
Task 3: Configure and Start the CloudWatch Agent
Write the agent configuration to collect memory metrics and ship the application log.
Solution
```bash
# Write the agent config
sudo tee /opt/aws/amazon-cloudwatch-agent/etc/custom-config.json <<'EOF'
{
  "agent": {
    "metrics_collection_interval": 60,
    "run_as_user": "root"
  },
  "metrics": {
    "namespace": "CWAgentLab",
    "append_dimensions": {
      "InstanceId": "${aws:InstanceId}"
    },
    "metrics_collected": {
      "mem": {
        "measurement": ["mem_used_percent"],
        "metrics_collection_interval": 60
      },
      "disk": {
        "measurement": ["disk_used_percent"],
        "resources": ["/"],
        "metrics_collection_interval": 60
      },
      "cpu": {
        "measurement": ["cpu_usage_idle", "cpu_usage_user"],
        "totalcpu": true,
        "metrics_collection_interval": 60
      }
    }
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/myapp/application.log",
            "log_group_name": "/cw-lab/application",
            "log_stream_name": "{instance_id}",
            "retention_in_days": 7
          }
        ]
      }
    }
  }
}
EOF
```
```bash
# Start the agent with the config
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
  -a fetch-config \
  -m ec2 \
  -s \
  -c file:/opt/aws/amazon-cloudwatch-agent/etc/custom-config.json

# Verify it is running
amazon-cloudwatch-agent-ctl -a status
# Should show: "status": "running"
```
Task 4: Verify Metrics and Logs Appear in CloudWatch
Wait 2-3 minutes for data to flow, then verify from your local machine.
Solution
```bash
# Check that custom metrics are appearing (from your local machine)
aws cloudwatch list-metrics \
  --namespace "CWAgentLab" \
  --query 'Metrics[*].[MetricName,Dimensions]' \
  --output table

# Get the latest memory metric
INSTANCE_ID="i-YOUR_INSTANCE_ID"
aws cloudwatch get-metric-statistics \
  --namespace "CWAgentLab" \
  --metric-name "mem_used_percent" \
  --dimensions Name=InstanceId,Value=$INSTANCE_ID \
  --start-time "$(date -u -v-10M '+%Y-%m-%dT%H:%M:%S')" \
  --end-time "$(date -u '+%Y-%m-%dT%H:%M:%S')" \
  --period 60 \
  --statistics Average
# Note: -v-10M is BSD/macOS date syntax; on Linux, use
# --start-time "$(date -u -d '10 minutes ago' '+%Y-%m-%dT%H:%M:%S')"
```
```bash
# Check logs are flowing
aws logs describe-log-streams \
  --log-group-name "/cw-lab/application" \
  --query 'logStreams[*].[logStreamName,lastEventTimestamp]' \
  --output table
```
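Beyond tailing raw events, you can aggregate over them with a Logs Insights query from the CLI. A sketch, assuming the `/cw-lab/application` log group and the `status=NNN` format produced by the generator script; queries run asynchronously, so you start one and then poll for results:

```bash
# Count 5xx responses per minute over the last hour.
QUERY_ID=$(aws logs start-query \
  --log-group-name "/cw-lab/application" \
  --start-time $(($(date +%s) - 3600)) \
  --end-time $(date +%s) \
  --query-string 'filter @message like /status=5\d\d/ | stats count(*) as errors by bin(1m)' \
  --query 'queryId' --output text)

# Poll for results; status moves from Running to Complete.
aws logs get-query-results --query-id "$QUERY_ID"
```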
```bash
# Read recent log entries
aws logs get-log-events \
  --log-group-name "/cw-lab/application" \
  --log-stream-name "$INSTANCE_ID" \
  --limit 10 \
  --query 'events[*].message' \
  --output text
```
Task 5: Create a CPU Alarm
Create an alarm that triggers when CPU exceeds 70% for 2 consecutive 1-minute periods. Then stress the CPU to trigger it.
Solution
```bash
INSTANCE_ID="i-YOUR_INSTANCE_ID"

# Create an SNS topic for notifications (or use an existing one)
TOPIC_ARN=$(aws sns create-topic --name cw-lab-alerts --query 'TopicArn' --output text)

# Subscribe your email
aws sns subscribe \
  --topic-arn $TOPIC_ARN \
  --protocol email \
  --notification-endpoint your-email@example.com
# Confirm the subscription via the email you receive

# Create the CPU alarm
aws cloudwatch put-metric-alarm \
  --alarm-name "cw-lab-high-cpu" \
  --alarm-description "CPU exceeds 70% for 2 minutes" \
  --metric-name CPUUtilization \
  --namespace AWS/EC2 \
  --statistic Average \
  --period 60 \
  --threshold 70 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 2 \
  --dimensions Name=InstanceId,Value=$INSTANCE_ID \
  --alarm-actions $TOPIC_ARN \
  --ok-actions $TOPIC_ARN \
  --treat-missing-data missing
```
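Optionally, you can also alarm on the custom memory metric the agent is now publishing, which is exactly the signal missing in the outage story at the start of this module. A sketch reusing `$INSTANCE_ID` and `$TOPIC_ARN` from above; the 80% threshold and 5-minute window are illustrative:

```bash
# Alarm when memory exceeds 80% for 5 consecutive minutes,
# using the agent's custom mem_used_percent metric.
aws cloudwatch put-metric-alarm \
  --alarm-name "cw-lab-high-memory" \
  --alarm-description "Memory exceeds 80% for 5 minutes" \
  --metric-name mem_used_percent \
  --namespace CWAgentLab \
  --statistic Average \
  --period 60 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 5 \
  --dimensions Name=InstanceId,Value=$INSTANCE_ID \
  --alarm-actions $TOPIC_ARN
```

If you create this alarm, remember to delete it during cleanup as well: `aws cloudwatch delete-alarms --alarm-names "cw-lab-high-memory"`.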
```bash
# SSH into the instance and stress the CPU
ssh -i your-key.pem ec2-user@INSTANCE_PUBLIC_IP
# Run: stress-ng --cpu 2 --timeout 300
# (Install if needed: sudo yum install -y stress-ng)

# After 2-3 minutes, check alarm state from your local machine
aws cloudwatch describe-alarms \
  --alarm-names "cw-lab-high-cpu" \
  --query 'MetricAlarms[0].[AlarmName,StateValue,StateReason]' \
  --output text
```
Task 6: Clean Up
Solution
```bash
# Delete the alarm
aws cloudwatch delete-alarms --alarm-names "cw-lab-high-cpu"

# Delete SNS topic and subscription
aws sns delete-topic --topic-arn $TOPIC_ARN

# Delete log group
aws logs delete-log-group --log-group-name "/cw-lab/application"

# Terminate the EC2 instance
aws ec2 terminate-instances --instance-ids $INSTANCE_ID

# Clean up IAM (after instance is terminated)
aws iam remove-role-from-instance-profile \
  --instance-profile-name cw-lab-profile \
  --role-name cw-lab-ec2-role
aws iam delete-instance-profile --instance-profile-name cw-lab-profile
aws iam detach-role-policy \
  --role-name cw-lab-ec2-role \
  --policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy
aws iam detach-role-policy \
  --role-name cw-lab-ec2-role \
  --policy-arn arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
aws iam delete-role --role-name cw-lab-ec2-role
```
Success Criteria
- CloudWatch Agent installed and running on the EC2 instance
- Memory metrics (`mem_used_percent`) appearing in CloudWatch under the `CWAgentLab` namespace
- Application logs visible in CloudWatch Logs under `/cw-lab/application`
- CPU alarm created and in `OK` state initially
- CPU stress test triggers alarm to `ALARM` state
- Alarm notification received (email or visible state change)
- All resources cleaned up
Next Module
Continue to Module 1.11: CI/CD on AWS, where you will build automated deployment pipelines using AWS CodeBuild, CodeDeploy, and CodePipeline. Now that you can monitor your infrastructure, it is time to automate how code gets there.