AWS CloudWatch Metrics and Alarms: Practical Observability

Intermediate 10 min read

What you'll learn

✓ How CloudWatch metrics are structured
✓ Dimensions and the cost they create
✓ Statistic vs metric math
✓ Designing useful alarms
✓ Composite alarms and SLO patterns

Prerequisites

• Familiar with terminals and YAML

What and Why

CloudWatch is AWS’s observability surface for metrics, logs, and traces. Most teams use the metrics part poorly: they alarm on CPU and call it done. A good CloudWatch setup catches real customer impact early, suppresses noise during deploys, and gives you a dashboard you actually look at during incidents.

Getting this right matters because alarms shape on-call quality of life. Page on the wrong signal and engineers stop trusting alerts; page on the right signal and incidents end before customers notice.

Mental Model

A CloudWatch metric is identified by a namespace, a name, and a set of dimensions. Each unique combination of dimension values is a separate metric, and each custom metric is billed separately. Datapoints are aggregated over a period you choose at query time (commonly 60 or 300 seconds) and summarized with a statistic such as Sum, Average, or p99.
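To make the "one metric per dimension combination" rule concrete, here is a minimal Python sketch that counts how many distinct metrics a single metric name can fan out into. The function name and the sample dimension values are hypothetical, and the count is an upper bound: CloudWatch only creates a metric for combinations that are actually published.

```python
from __future__ import annotations


def count_metrics(dimension_values: dict[str, list[str]]) -> int:
    """Upper bound on distinct metrics for one metric name.

    Every combination of dimension values is its own metric, so the
    bound is the product of the number of values per dimension.
    """
    total = 1
    for values in dimension_values.values():
        total *= len(set(values))
    return total


# Hypothetical dimension sets, for illustration only.
low_cardinality = {
    "Env": ["prod", "staging"],
    "Region": ["us-east-1", "eu-west-1"],
}
high_cardinality = {
    **low_cardinality,
    "UserId": [f"user-{i}" for i in range(10_000)],
}

print(count_metrics(low_cardinality))   # 4
print(count_metrics(high_cardinality))  # 40000
```

Adding one per-user dimension turns 4 metrics into 40,000, which is why dimensions deserve the same scrutiny as any other cost driver.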

App -> EMF log line -> CloudWatch Logs -> Metric (with dimensions)
                                           |
                                           v
                                     Statistic over period
                                           |
                                           v
                                Alarm threshold + evaluation
                                           |
                                           v
                                     SNS -> PagerDuty

From event to alarm

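A low-overhead way to emit custom metrics is the Embedded Metric Format (EMF): you log a JSON line and CloudWatch extracts metrics from it during ingestion. No PutMetricData calls are needed, though the log lines are still billed as log ingestion and each resulting metric is billed as a custom metric.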

Hands-on Example

A Node service emits a custom OrderProcessed metric per environment:

{
  "_aws": {
    "Timestamp": 1719500000000,
    "CloudWatchMetrics": [{
      "Namespace": "ShopApp",
      "Dimensions": [["Env"]],
      "Metrics": [{ "Name": "OrderProcessed", "Unit": "Count" }]
    }]
  },
  "Env": "prod",
  "OrderProcessed": 1
}
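The same document can be produced programmatically. The sketch below uses Python for brevity (the JSON shape is identical from a Node service); the helper name `emf_line` is hypothetical, and it assumes stdout is already shipped to CloudWatch Logs, for example by Lambda or the CloudWatch agent.

```python
import json
import time


def emf_line(namespace, metric_name, value, dimensions, unit="Count", timestamp_ms=None):
    """Build one EMF log line as a JSON string.

    `dimensions` maps dimension name to value, e.g. {"Env": "prod"}.
    The dimension values and the metric value live at the top level
    of the document; the "_aws" block only names them.
    """
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)  # EMF expects epoch milliseconds
    doc = {
        "_aws": {
            "Timestamp": timestamp_ms,
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                # One dimension set containing all supplied dimension names.
                "Dimensions": [list(dimensions)],
                "Metrics": [{"Name": metric_name, "Unit": unit}],
            }],
        },
        **dimensions,
        metric_name: value,
    }
    return json.dumps(doc)


# Reproduces the example document above (fixed timestamp for repeatability).
line = emf_line("ShopApp", "OrderProcessed", 1, {"Env": "prod"},
                timestamp_ms=1719500000000)
print(line)
```

Keeping the builder as a pure function makes it easy to unit-test the log shape without touching AWS.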

Now define an alarm that pages if order throughput drops sharply compared to recent history. Using metric math:

Alarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: orders-throughput-drop
    Metrics:
      # The raw metric: orders per minute in prod.
      - Id: m1
        MetricStat:
          Metric:
            Namespace: ShopApp
            MetricName: OrderProcessed
            Dimensions: [{ Name: Env, Value: prod }]
          Period: 60
          Stat: Sum
        ReturnData: true
      # Expected range for m1, two standard deviations wide.
      - Id: anom
        Expression: ANOMALY_DETECTION_BAND(m1, 2)
        ReturnData: true
    EvaluationPeriods: 5
    DatapointsToAlarm: 3
    # The band returns an upper and a lower series, so it cannot be compared
    # inside IF(). Point the alarm at the band instead of a static Threshold.
    ThresholdMetricId: anom
    ComparisonOperator: LessThanLowerThreshold
    AlarmActions: [!Ref PagerDutyTopic]

That alarm fires when 3 of the last 5 one-minute datapoints fall below the lower edge of the anomaly band, which is far less noisy than a static threshold.
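The "3 out of 5" logic is easy to reason about in isolation. The following Python sketch is a simplified model, not CloudWatch's actual evaluator: the function name is hypothetical, the lower band is a made-up constant rather than a learned model, and missing-data handling is left out.

```python
def should_alarm(values, lower_band, evaluation_periods=5, datapoints_to_alarm=3):
    """Return True when enough recent datapoints fall below the band.

    `values` and `lower_band` are parallel lists, oldest first. Only the
    last `evaluation_periods` pairs are considered, and the alarm fires
    when at least `datapoints_to_alarm` of them are breaching.
    """
    recent = list(zip(values, lower_band))[-evaluation_periods:]
    breaching = sum(1 for value, low in recent if value < low)
    return breaching >= datapoints_to_alarm


band = [100] * 7  # hypothetical lower band: 100 orders per minute

sustained_drop = [120, 118, 40, 35, 119, 30, 117]
brief_blips = [120, 118, 119, 35, 119, 30, 117]

print(should_alarm(sustained_drop, band))  # True: 3 of the last 5 are below 100
print(should_alarm(brief_blips, band))     # False: only 2 of the last 5 are below 100
```

The breaching datapoints do not need to be consecutive, which is what lets the alarm ride out a single good minute in the middle of a real drop.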

For a service SLO, combine multiple alarms in a composite alarm that only pages when both error rate and latency are bad:

aws cloudwatch put-composite-alarm \
  --alarm-name svc-slo-breach \
  --alarm-rule "ALARM(errors-high) AND ALARM(latency-p99-high)"
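If you generate composite alarms for many services, building the rule string in code keeps it consistent. This is a small Python sketch under stated assumptions: both helper names are hypothetical, the local evaluator only mimics the AND semantics for testing, and alarm names containing spaces or special characters would need the quoted `ALARM("name")` form.

```python
def all_in_alarm_rule(alarm_names):
    """Build a composite rule that is true only when every child is in ALARM."""
    if not alarm_names:
        raise ValueError("at least one child alarm is required")
    return " AND ".join(f"ALARM({name})" for name in alarm_names)


def would_page(states, alarm_names):
    """Local stand-in for the rule: True only if all children are in ALARM.

    `states` maps alarm name to "OK", "ALARM", or "INSUFFICIENT_DATA".
    """
    return all(states.get(name) == "ALARM" for name in alarm_names)


children = ["errors-high", "latency-p99-high"]
rule = all_in_alarm_rule(children)
print(rule)  # ALARM(errors-high) AND ALARM(latency-p99-high)

# Errors alone do not page; errors plus slow responses do.
print(would_page({"errors-high": "ALARM", "latency-p99-high": "OK"}, children))     # False
print(would_page({"errors-high": "ALARM", "latency-p99-high": "ALARM"}, children))  # True
```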

Common Pitfalls

High-cardinality dimensions. Putting userId or requestId as a dimension creates millions of metrics and a huge bill. Keep dimensions to environment, region, route group.
Average everywhere. Average hides spikes. Use p99 for latency, Sum for counts, and reserve Average for utilization metrics.
Single datapoint alarms. Set EvaluationPeriods greater than 1 with DatapointsToAlarm to filter blips, otherwise you page on every transient hiccup.
No TreatMissingData policy. A silent producer should usually alarm, but by default missing datapoints do not count as breaching. Set TreatMissingData: breaching for heartbeat metrics.
Forgetting log retention. CloudWatch Logs default to never expire. Set retention per log group or pay forever.
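The "Average hides spikes" pitfall is easy to demonstrate with numbers. The sketch below uses a hypothetical latency sample and a simple nearest-rank percentile; CloudWatch's own percentile calculation may differ in detail, so treat the figures as illustrative.

```python
from statistics import mean


def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile for an integer `pct` between 1 and 100."""
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, which avoids float rounding surprises.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


# Hypothetical minute of traffic: 97 fast requests and 3 very slow ones.
latencies_ms = [100] * 97 + [5000] * 3

print(mean(latencies_ms))                         # 247
print(percentile_nearest_rank(latencies_ms, 99))  # 5000
```

An average of 247 ms looks healthy, while the p99 shows that some users waited 5 seconds.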

Production Tips

Build dashboards from the four golden signals: latency, traffic, errors, saturation. Layer business metrics (orders per minute, signups) above them.
Use alarm tagging and an SNS topic per severity. Route critical to a pager, warnings to Slack.
Suppress deploy noise with a composite alarm's actions suppressor (the ActionsSuppressor property), or with composite alarm rules that reference a deploy-in-progress alarm.
Use CloudWatch Metric Streams, which deliver through Amazon Data Firehose (formerly Kinesis Data Firehose), if you want to pipe key metrics into Datadog, Grafana, or a long-term store.
Run a quarterly alarm cleanup: any alarm that fired more than 20 times last quarter without an incident is a candidate to retire or retune.
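The quarterly cleanup rule from the last tip can be applied mechanically once you have per-alarm counts. This Python sketch assumes you have already exported those counts from alarm history and your incident tracker; the `AlarmStats` record, its field names, and the sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AlarmStats:
    name: str
    fired: int       # times the alarm entered ALARM last quarter
    incidents: int   # incidents that were linked to this alarm


def cleanup_candidates(stats, max_fires=20):
    """Names of alarms that fired more than `max_fires` times with no incident."""
    return [s.name for s in stats if s.fired > max_fires and s.incidents == 0]


# Made-up quarterly numbers, for illustration only.
last_quarter = [
    AlarmStats("cpu-high", fired=64, incidents=0),
    AlarmStats("orders-throughput-drop", fired=3, incidents=2),
    AlarmStats("disk-queue-depth", fired=20, incidents=0),
    AlarmStats("svc-slo-breach", fired=25, incidents=4),
]

print(cleanup_candidates(last_quarter))  # ['cpu-high']
```

Alarms that fire often but do line up with incidents stay off the list; they may still need tuning, but they are carrying real signal.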

Wrap-up

CloudWatch can be the backbone of your observability stack if you treat metrics as a first-class concern. Emit custom metrics with EMF, alarm on customer-visible signals with anomaly detection or composite rules, and keep dimensions sane to control cost. The result is a quiet inbox during normal operation and a fast, trustworthy signal when things break.