Amazon CloudWatch Metrics - Complete Guide for AWS SysOps Administrator Associate Exam
Why Amazon CloudWatch Metrics is Important
Amazon CloudWatch Metrics is a fundamental AWS monitoring service that every SysOps Administrator must master. It provides the foundation for monitoring AWS resources, troubleshooting issues, and maintaining operational excellence. Understanding CloudWatch Metrics is essential because it enables you to track performance, set alarms, and make data-driven decisions about your infrastructure.
What is Amazon CloudWatch Metrics?
CloudWatch Metrics is a time-series data service that collects and stores metrics from AWS services and your own applications. A metric is the fundamental concept in CloudWatch: a time-ordered set of data points published to CloudWatch.
Key Components:
• Namespace: A container for CloudWatch metrics (e.g., AWS/EC2, AWS/RDS)
• Metric: A variable to monitor (e.g., CPUUtilization, NetworkIn)
• Dimension: A name/value pair that is part of a metric's identity (e.g., InstanceId=i-1234567890abcdef0); the same metric name with different dimensions is a separate metric
• Statistic: Metric data aggregations over time (Average, Sum, Minimum, Maximum, SampleCount)
• Period: Length of time, in seconds, over which a statistic is aggregated (as short as 1 second for high-resolution custom metrics; detailed monitoring provides 1-minute periods)
• Unit: The unit of measure (Bytes, Seconds, Count, Percent)
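To make the relationship between Statistic and Period concrete, here is a minimal sketch in plain Python (not CloudWatch code) that groups raw samples into period-aligned buckets and computes the five basic statistics for each bucket. The function name `aggregate` and the sample values are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def aggregate(datapoints, period):
    """Group (timestamp_seconds, value) samples into period-aligned buckets
    and compute the five basic statistics for each bucket."""
    buckets = defaultdict(list)
    for ts, value in datapoints:
        # Align each sample to the start of the period it falls in.
        buckets[ts - ts % period].append(value)
    return {
        start: {
            "SampleCount": len(values),
            "Sum": sum(values),
            "Minimum": min(values),
            "Maximum": max(values),
            "Average": sum(values) / len(values),
        }
        for start, values in sorted(buckets.items())
    }


# Hypothetical samples: (seconds since start, value)
samples = [(0, 10.0), (20, 30.0), (40, 20.0), (60, 50.0), (90, 70.0)]
stats = aggregate(samples, period=60)
print(stats)
```

With a 60-second period, the first three samples fall into one bucket and the last two into the next, so the same raw data yields different statistics depending on the period you request.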
How CloudWatch Metrics Works
Data Collection:
• AWS services automatically send metrics to CloudWatch
• EC2 instances send basic metrics every 5 minutes (free) or detailed metrics every 1 minute (paid)
• Custom metrics can be published using the PutMetricData API
• CloudWatch Agent collects system-level metrics and logs from EC2 instances and on-premises servers
Metric Resolution:
• Standard Resolution: 1-minute granularity (default for most AWS services)
• High Resolution: 1-second granularity (available for custom metrics)
Retention Periods:
• Data points with period less than 60 seconds: Available for 3 hours
• Data points with 60-second period: Available for 15 days
• Data points with 300-second (5-minute) period: Available for 63 days
• Data points with 3600-second (1-hour) period: Available for 455 days (15 months)
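The retention tiers above can be read as a lookup: given the age of the data, what is the finest period still available? The sketch below encodes that reading in plain Python. The helper name `finest_available_period` is hypothetical, and treating each boundary as inclusive is an assumption made for simplicity.

```python
from datetime import timedelta

# (finest period in seconds, how long data at that period is retained)
RETENTION_TIERS = [
    (1, timedelta(hours=3)),      # sub-minute (high-resolution) data
    (60, timedelta(days=15)),
    (300, timedelta(days=63)),
    (3600, timedelta(days=455)),
]


def finest_available_period(age, high_resolution=False):
    """Return the smallest period (seconds) still retained for data of this age,
    or None if the data is older than the longest retention tier."""
    for period, retention in RETENTION_TIERS:
        if period < 60 and not high_resolution:
            continue  # sub-minute data exists only for high-resolution metrics
        if age <= retention:  # assumption: boundary treated as inclusive
            return period
    return None


print(finest_available_period(timedelta(days=20)))  # falls in the 5-minute tier
```

This only models the tiers listed above. A metric that was published at a coarser interval, such as EC2 basic monitoring at 5 minutes, is never available at a finer period than it was published.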
Important Default Metrics by Service
EC2 Metrics (sent by hypervisor):
• CPUUtilization, NetworkIn, NetworkOut, NetworkPacketsIn, NetworkPacketsOut
• DiskReadOps, DiskWriteOps, DiskReadBytes, DiskWriteBytes (instance store volumes only)
• StatusCheckFailed, StatusCheckFailed_Instance, StatusCheckFailed_System
Metrics NOT available by default (require CloudWatch Agent):
• Memory utilization
• Disk space utilization
• Number of processes running
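As an illustration, the sketch below builds a CloudWatch Agent configuration covering these three gaps and prints it as JSON. It assumes the agent's `mem`, `disk`, and `processes` sections with the measurement names shown, and a root filesystem at `/`. Verify the field names against the agent's configuration reference before using it.

```python
import json

# Sketch of an agent configuration; field names are assumptions to verify
# against the CloudWatch Agent configuration file reference.
agent_config = {
    "metrics": {
        "namespace": "CWAgent",
        "append_dimensions": {"InstanceId": "${aws:InstanceId}"},
        "metrics_collected": {
            "mem": {"measurement": ["mem_used_percent"]},
            "disk": {"measurement": ["used_percent"], "resources": ["/"]},
            "processes": {"measurement": ["running", "total"]},
        },
    }
}

config_json = json.dumps(agent_config, indent=2)
print(config_json)
```

The resulting JSON would be saved as the agent's configuration file. Metrics collected this way are published as custom metrics rather than under AWS/EC2.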
EBS Metrics:
• VolumeReadOps, VolumeWriteOps, VolumeReadBytes, VolumeWriteBytes
• VolumeTotalReadTime, VolumeTotalWriteTime, VolumeIdleTime
• VolumeQueueLength, BurstBalance (for gp2 and st1/sc1 volumes)
RDS Metrics:
• DatabaseConnections, FreeableMemory, FreeStorageSpace
• CPUUtilization, ReadIOPS, WriteIOPS, ReadLatency, WriteLatency
Custom Metrics
You can publish your own metrics using:
• AWS CLI: aws cloudwatch put-metric-data
• AWS SDKs
• CloudWatch Agent
Custom metrics support:
• Standard resolution: data stored at 1-minute granularity (the default)
• High resolution: data stored at 1-second granularity, selected by setting the StorageResolution parameter to 1
• Up to 30 dimensions per metric
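The following sketch shows how a PutMetricData request might be assembled in Python, including the StorageResolution setting. It only builds the request dictionary. The helper `build_put_metric_data`, the namespace `MyApp`, the metric `QueueDepth`, and the `Environment` dimension are hypothetical names used for illustration.

```python
from datetime import datetime, timezone


def build_put_metric_data(namespace, name, value, unit, dimensions, high_resolution=False):
    """Build keyword arguments shaped like a PutMetricData request."""
    return {
        "Namespace": namespace,
        "MetricData": [
            {
                "MetricName": name,
                "Dimensions": [{"Name": k, "Value": v} for k, v in dimensions.items()],
                "Timestamp": datetime.now(timezone.utc),
                "Value": value,
                "Unit": unit,
                # 1 = high resolution (1-second), 60 = standard resolution
                "StorageResolution": 1 if high_resolution else 60,
            }
        ],
    }


request = build_put_metric_data(
    "MyApp", "QueueDepth", 42, "Count", {"Environment": "prod"}, high_resolution=True
)
print(request)

# With boto3 installed and credentials configured, the request could be sent as:
#   boto3.client("cloudwatch").put_metric_data(**request)
```

Building the request separately from sending it keeps the example runnable without AWS credentials and makes the payload easy to inspect.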
CloudWatch Alarms
Alarms watch metrics and perform actions based on thresholds:
Alarm States:
• OK: Metric is within the defined threshold
• ALARM: Metric has breached the threshold
• INSUFFICIENT_DATA: Not enough data to determine state
Alarm Actions:
• Send notifications via SNS
• Execute Auto Scaling policies
• Perform EC2 actions (stop, terminate, reboot, recover)
• Create OpsItems or incidents in Systems Manager
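As an example of an EC2 action, the sketch below assembles the parameters for an alarm that recovers an instance when its system status check fails. It only builds a dictionary shaped like a PutMetricAlarm request. The helper name, the alarm name, the two-period evaluation, and the instance ID and region passed in are illustrative assumptions.

```python
def build_recovery_alarm(instance_id, region):
    """Build keyword arguments shaped like a PutMetricAlarm request that
    triggers the EC2 recover action on system status check failure."""
    return {
        "AlarmName": f"recover-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 2,  # assumption: two failing minutes before acting
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:recover"],
    }


alarm = build_recovery_alarm("i-1234567890abcdef0", "us-east-1")
print(alarm)

# With boto3 installed and credentials configured, this could be sent as:
#   boto3.client("cloudwatch", region_name="us-east-1").put_metric_alarm(**alarm)
```

Swapping the final segment of the action ARN for stop, terminate, or reboot would select one of the other EC2 actions listed above.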
Evaluation Periods and Datapoints to Alarm:
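• Evaluation Periods is the number of most recent periods examined; Datapoints to Alarm is how many of those periods must be breaching for the alarm to enter the ALARM state
• Example: with Evaluation Periods = 5 and Datapoints to Alarm = 3, any 3 breaching datapoints among the last 5 trigger the alarm; they do not need to be consecutive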
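The "3 out of 5" example can be sketched as a small simulation. This is a simplified model in plain Python, not CloudWatch's actual evaluation logic: it ignores missing-data treatment and returns INSUFFICIENT_DATA whenever fewer datapoints than the evaluation window are available. The function name and CPU values are hypothetical.

```python
def alarm_state(datapoints, threshold, evaluation_periods, datapoints_to_alarm):
    """Simplified 'M out of N' evaluation for a greater-than-threshold alarm."""
    window = datapoints[-evaluation_periods:]  # the N most recent periods
    if len(window) < evaluation_periods:
        return "INSUFFICIENT_DATA"  # simplification: no missing-data handling
    breaching = sum(1 for value in window if value > threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"


# Hypothetical CPUUtilization values, one per period
print(alarm_state([50, 95, 60, 92, 91], threshold=90,
                  evaluation_periods=5, datapoints_to_alarm=3))
print(alarm_state([50, 95, 60, 92, 70], threshold=90,
                  evaluation_periods=5, datapoints_to_alarm=3))
```

In the first series three of the last five values exceed 90, so the sketch reports ALARM even though the breaches are not adjacent. In the second only two do, so it reports OK.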
Metric Math
Metric math lets you query multiple CloudWatch metrics and combine them with math expressions to create new time series:
• Useful for calculating rates, aggregations, and derived metrics
• Example: Calculate error rate as (Errors/Requests)*100
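The error-rate example can be sketched in two ways: as the arithmetic itself, and as the query structure a GetMetricData request might use. The namespace `MyApp`, the metric names `Errors` and `Requests`, and the query Ids are hypothetical. Returning None for periods with no requests is a choice made for this sketch, not a statement about how CloudWatch handles division by zero.

```python
def error_rate(errors, requests):
    """Element-wise (Errors/Requests)*100 over two aligned time series.
    Periods with zero requests yield None (a choice made for this sketch)."""
    return [(e / r) * 100 if r else None for e, r in zip(errors, requests)]


def metric_stat(query_id, metric_name):
    """Build one input-metric query; ReturnData=False hides it from the result."""
    return {
        "Id": query_id,
        "MetricStat": {
            "Metric": {"Namespace": "MyApp", "MetricName": metric_name},
            "Period": 60,
            "Stat": "Sum",
        },
        "ReturnData": False,
    }


# Query structure shaped like the MetricDataQueries parameter of GetMetricData
metric_data_queries = [
    metric_stat("errors", "Errors"),
    metric_stat("requests", "Requests"),
    {"Id": "error_rate", "Expression": "(errors/requests)*100", "Label": "Error rate (%)"},
]

print(error_rate([1, 0, 3], [4, 10, 0]))
```

The expression refers to the other queries by their Ids, which is why the input metrics are given short lowercase identifiers.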
Exam Tips: Answering Questions on Amazon CloudWatch Metrics
Key Facts to Remember:
1. Memory and disk space metrics are NOT collected by default for EC2 - You need the CloudWatch Agent to collect these metrics. This is a very common exam question.
2. Basic vs Detailed Monitoring:
• Basic: Free, 5-minute intervals
• Detailed: Paid, 1-minute intervals
• Enable detailed monitoring when you need faster reaction to issues
3. EC2 Status Checks:
• System status checks: Problems with underlying AWS infrastructure
• Instance status checks: Problems requiring your involvement (OS issues)
• The EC2 recover alarm action applies only to system status check failures (StatusCheckFailed_System)
4. Metric Retention: Remember the retention periods. Sub-minute (high-resolution) data points are kept for only 3 hours; after that, the data is available only at coarser, aggregated periods
5. Custom Metrics:
• High resolution stores data at 1-second granularity
• Standard resolution stores data at 1-minute granularity
• Maximum of 30 dimensions per metric
6. Alarm Best Practices:
• Use multiple datapoints to avoid false alarms from brief spikes
• Composite alarms combine multiple alarms to reduce alarm noise
• Missing data treatment: Choose how missing data points are handled (breaching, notBreaching, missing, or ignore); missing is the default
7. Cross-Account and Cross-Region:
• CloudWatch cross-account observability allows sharing metrics across accounts
• Use CloudWatch dashboard widgets to display metrics from multiple regions
Common Exam Scenarios:
• Scenario: Need to monitor memory usage on EC2
• Answer: Install and configure CloudWatch Agent
• Scenario: Need sub-minute monitoring data
• Answer: Use high-resolution custom metrics (1-second granularity)
• Scenario: EC2 instance fails system status check
• Answer: Configure CloudWatch alarm with EC2 recovery action
• Scenario: Need to aggregate metrics across multiple instances
• Answer: Use metric math or publish custom metrics with appropriate dimensions
• Scenario: Reduce alarm noise from multiple related alarms
• Answer: Use composite alarms
Watch Out For:
• Questions that assume memory metrics are available by default - they are not
• Confusion between CloudWatch Logs and CloudWatch Metrics - they serve different purposes
• Choosing between detailed monitoring (1-minute EC2 metrics) and high-resolution custom metrics (sub-minute data)
• Knowing which actions are available for different alarm types