AWS CloudWatch Metrics and Alarms: Practical Observability
Build a meaningful CloudWatch setup with custom metrics, composite alarms, and dashboards that catch real incidents without paging on noise.
What you'll learn
- How CloudWatch metrics are structured
- Dimensions and the cost they create
- Statistic vs metric math
- Designing useful alarms
- Composite alarms and SLO patterns
Prerequisites
- Familiar with terminals and YAML
What and Why
CloudWatch is AWS’s observability surface for metrics, logs, and traces. Most teams use the metrics part poorly: they alarm on CPU and call it done. A good CloudWatch setup catches real customer impact early, suppresses noise during deploys, and gives you a dashboard you actually look at during incidents.
Getting this right matters because alarms shape on-call quality of life. Page on the wrong signal and engineers stop trusting alerts; page on the right signal and incidents end before customers notice.
Mental Model
A CloudWatch metric is identified by a namespace, a name, and a set of dimensions. Each unique dimension combination is a separate metric, and each metric costs money. Datapoints are aggregated over a period (commonly 1 or 5 minutes) and queried with a statistic such as Sum, Average, or p99.
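To make the cost model concrete, here is a minimal Python sketch that counts how many distinct metrics one metric name can produce. The dimension names and value counts are invented for illustration; the only rule it encodes is the one above, that every unique combination of dimension values is its own metric.

```python
from math import prod


def distinct_metrics(dimension_cardinality: dict[str, int]) -> int:
    """Upper bound on distinct metrics for one namespace + metric name.

    Each unique combination of dimension values is a separate metric,
    so the bound is the product of the per-dimension value counts.
    """
    return prod(dimension_cardinality.values())


# Hypothetical cardinalities, for illustration only.
low_cardinality = {"Env": 3, "Region": 4, "RouteGroup": 10}
high_cardinality = {"Env": 3, "UserId": 500_000}

print(distinct_metrics(low_cardinality))   # 120
print(distinct_metrics(high_cardinality))  # 1500000
```

The second case shows why a single per-user dimension dominates everything else: the count scales with the number of users, not with the number of things you actually want to alarm on.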
App -> EMF log line -> CloudWatch Logs -> Metric (with dimensions)
                                              |
                                              v
                                    Statistic over period
                                              |
                                              v
                                 Alarm threshold + evaluation
                                              |
                                              v
                                      SNS -> PagerDuty

A cheap way to emit custom metrics is the Embedded Metric Format (EMF): you log a JSON line and CloudWatch extracts metrics from it during ingestion. No separate PutMetricData API calls are needed.
Hands-on Example
A Node service emits a custom OrderProcessed metric per environment:
{
  "_aws": {
    "Timestamp": 1719500000000,
    "CloudWatchMetrics": [{
      "Namespace": "ShopApp",
      "Dimensions": [["Env"]],
      "Metrics": [{ "Name": "OrderProcessed", "Unit": "Count" }]
    }]
  },
  "Env": "prod",
  "OrderProcessed": 1
}
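A log line like the one above could be produced by a small helper. This is a sketch in Python rather than Node, since only the JSON shape matters; the function name `emf_line` and its parameters are hypothetical, and it assumes the process's stdout is already shipped to CloudWatch Logs (for example by Lambda or the CloudWatch agent).

```python
import json
import time


def emf_line(namespace, dimensions, metric_name, value, unit="Count", timestamp_ms=None):
    """Build one EMF log line: a JSON object with an `_aws` metadata block."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)  # EMF timestamps are epoch milliseconds
    record = {
        "_aws": {
            "Timestamp": timestamp_ms,
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                # One dimension set, listing the dimension keys by name.
                "Dimensions": [list(dimensions)],
                "Metrics": [{"Name": metric_name, "Unit": unit}],
            }],
        },
        # Dimension values and the metric value live at the top level.
        **dimensions,
        metric_name: value,
    }
    return json.dumps(record)


# Emit one datapoint per processed order.
print(emf_line("ShopApp", {"Env": "prod"}, "OrderProcessed", 1))
```

Keeping the dimensions argument small and fixed (here only `Env`) is what keeps the resulting metric count under control.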
Now define an alarm that pages if order throughput drops sharply compared to recent history. Using metric math:
Alarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: orders-throughput-drop
    Metrics:
      - Id: m1
        MetricStat:
          Metric:
            Namespace: ShopApp
            MetricName: OrderProcessed
            Dimensions: [{ Name: Env, Value: prod }]
          Period: 60
          Stat: Sum
        ReturnData: true
      # Expected band around m1, 2 standard deviations wide
      - Id: anom
        Expression: ANOMALY_DETECTION_BAND(m1, 2)
        ReturnData: true
    EvaluationPeriods: 5
    DatapointsToAlarm: 3
    # Anomaly alarms compare m1 to the band instead of a static Threshold
    ThresholdMetricId: anom
    ComparisonOperator: LessThanLowerThreshold
    AlarmActions: [!Ref PagerDutyTopic]
That alarm fires when 3 out of 5 minutes are below the anomaly band - far less noisy than a static threshold.
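The "3 out of 5" behaviour can be sketched as a sliding window over per-minute breach flags. This is a simplified model for intuition, not CloudWatch's actual evaluator: it ignores missing data and the INSUFFICIENT_DATA state, and the function name is hypothetical.

```python
def alarm_states(breaching, evaluation_periods=5, datapoints_to_alarm=3):
    """Return "ALARM" or "OK" for each period, given per-period breach flags.

    Looks at the most recent `evaluation_periods` datapoints and alarms
    when at least `datapoints_to_alarm` of them are breaching.
    """
    states = []
    for i in range(len(breaching)):
        window = breaching[max(0, i - evaluation_periods + 1): i + 1]
        states.append("ALARM" if sum(window) >= datapoints_to_alarm else "OK")
    return states


# One isolated dip (minute 1), then a sustained drop (minutes 6-8).
minutes = [False, True, False, False, False, False, True, True, True, False]
print(alarm_states(minutes))
```

In this run the isolated dip never pages, and the alarm only flips at minute 8, once three of the last five minutes are breaching.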
For a service SLO, combine multiple alarms in a composite alarm that only pages when both error rate and latency are bad:
aws cloudwatch put-composite-alarm \
  --alarm-name svc-slo-breach \
  --alarm-rule "ALARM(errors-high) AND ALARM(latency-p99-high)"
Common Pitfalls
- High-cardinality dimensions. Putting userId or requestId as a dimension creates millions of metrics and a huge bill. Keep dimensions to environment, region, route group.
- Average everywhere. Average hides spikes. Use p99 for latency, Sum for counts, and reserve Average for utilization metrics.
- Single datapoint alarms. Set EvaluationPeriods greater than 1 with DatapointsToAlarm to filter blips, otherwise you page on every transient hiccup.
- No treat missing data policy. A silent producer should usually alarm. Set TreatMissingData: breaching for heartbeat metrics.
- Forgetting log retention. CloudWatch Logs default to never expire. Set retention per log group or pay forever.
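To see why Average hides spikes, here is a small Python sketch comparing the mean and the 99th percentile of the same latency samples. The numbers are invented, and the percentile uses the simple nearest-rank method, which may differ slightly from how CloudWatch computes p99.

```python
import math
from statistics import mean


def percentile(samples, p):
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in ms: 98 fast requests and 2 very slow ones.
latencies = [50] * 98 + [2000] * 2

print(mean(latencies))            # 89
print(percentile(latencies, 99))  # 2000
```

An alarm on Average at, say, 200 ms would stay quiet here, while the slowest requests take two full seconds.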
Production Tips
- Build dashboards from the four golden signals: latency, traffic, errors, saturation. Layer business metrics (orders per minute, signups) above them.
- Use alarm tagging and an SNS topic per severity. Route critical to a pager, warnings to Slack.
- Suppress deploy noise with the composite alarm ActionsSuppressor setting, or with composite alarm rules that reference a deploy-in-progress alarm.
- Use CloudWatch Metric Streams, which deliver through Amazon Data Firehose (formerly Kinesis Data Firehose), if you want to pipe key metrics into Datadog, Grafana, or a long-term store.
- Run a quarterly alarm cleanup: any alarm that fired more than 20 times last quarter without an incident is a candidate to retire or retune.
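As a sketch of the deploy-suppression tip, the helper below builds the request for CloudWatch's PutCompositeAlarm operation with an actions suppressor. The helper name, the alarm names, the SNS topic ARN, and the 60 and 120 second periods are all placeholders chosen for illustration; the dictionary keys are the operation's real parameter names.

```python
def composite_alarm_request(name, child_alarms, suppressor_alarm, topic_arn):
    """Build keyword arguments for the PutCompositeAlarm operation."""
    rule = " AND ".join(f"ALARM({child})" for child in child_alarms)
    return {
        "AlarmName": name,
        "AlarmRule": rule,
        "AlarmActions": [topic_arn],
        # While the suppressor alarm is in ALARM, actions are not executed.
        "ActionsSuppressor": suppressor_alarm,
        # Max seconds to wait for the suppressor to enter ALARM before acting.
        "ActionsSuppressorWaitPeriod": 60,
        # Max seconds to keep suppressing after the suppressor leaves ALARM.
        "ActionsSuppressorExtensionPeriod": 120,
    }


request = composite_alarm_request(
    "svc-slo-breach",
    ["errors-high", "latency-p99-high"],
    "deploy-in-progress",  # hypothetical alarm set to ALARM by the deploy pipeline
    "arn:aws:sns:us-east-1:123456789012:pager-critical",  # placeholder ARN
)
print(request["AlarmRule"])  # ALARM(errors-high) AND ALARM(latency-p99-high)

# With boto3 installed and credentials configured, this could be sent as:
#   boto3.client("cloudwatch").put_composite_alarm(**request)
```

Building the request as plain data keeps the rule easy to review and test before anything is sent to AWS.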
Wrap-up
CloudWatch can be the backbone of your observability stack if you treat metrics as a first-class concern. Emit custom metrics with EMF, alarm on customer-visible signals with anomaly detection or composite rules, and keep dimensions sane to control cost. The result is a quiet inbox during normal operation and a fast, trustworthy signal when things break.