AWS CloudWatch Metrics and Alarms: Practical Observability

Build a meaningful CloudWatch setup with custom metrics, composite alarms, and dashboards that catch real incidents without paging on noise.

By Codeloom · Intermediate

What you'll learn

  • How CloudWatch metrics are structured
  • Dimensions and the cost they create
  • Statistic vs metric math
  • Designing useful alarms
  • Composite alarms and SLO patterns

Prerequisites

  • Familiar with terminals and YAML

What and Why

CloudWatch is AWS’s observability surface for metrics, logs, and traces. Most teams use the metrics part poorly: they alarm on CPU and call it done. A good CloudWatch setup catches real customer impact early, suppresses noise during deploys, and gives you a dashboard you actually look at during incidents.

Getting this right matters because alarms shape on-call quality of life. Page on the wrong signal and engineers stop trusting alerts; page on the right signal and incidents end before customers notice.

Mental Model

A CloudWatch metric is identified by a namespace, a name, and a set of dimensions. Each unique dimension combination is a separate metric, and each custom metric is billed separately. Datapoints are aggregated over a period (one minute or a multiple of it for standard-resolution metrics, shorter for high-resolution ones) and queried with a statistic such as Sum, Average, or p99.
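To make the identity rule concrete, here is a minimal sketch that counts how many distinct metrics a batch of datapoints would create. The `Datapoint` type and both helper functions are hypothetical names for illustration, not part of any AWS SDK, and the `UserId` dimension is included only to show how quickly the count grows.

```typescript
// Hypothetical types for illustration; not part of any AWS SDK.
interface Datapoint {
  namespace: string;
  name: string;
  dimensions: Record<string, string>;
  value: number;
}

// Identity key: namespace + name + sorted dimension pairs.
// Sorting makes {Env, Region} and {Region, Env} resolve to the same metric.
function metricKey(p: Datapoint): string {
  const dims = Object.keys(p.dimensions)
    .sort()
    .map((k) => `${k}=${p.dimensions[k]}`)
    .join(",");
  return `${p.namespace}/${p.name}{${dims}}`;
}

// Count the distinct metrics a batch of datapoints would create.
function countDistinctMetrics(points: Datapoint[]): number {
  const seen: Record<string, boolean> = {};
  for (const p of points) seen[metricKey(p)] = true;
  return Object.keys(seen).length;
}

const samplePoints: Datapoint[] = [
  { namespace: "ShopApp", name: "OrderProcessed", dimensions: { Env: "prod" }, value: 1 },
  { namespace: "ShopApp", name: "OrderProcessed", dimensions: { Env: "prod" }, value: 1 },
  { namespace: "ShopApp", name: "OrderProcessed", dimensions: { Env: "staging" }, value: 1 },
  { namespace: "ShopApp", name: "OrderProcessed", dimensions: { Env: "prod", UserId: "u-1" }, value: 1 },
  { namespace: "ShopApp", name: "OrderProcessed", dimensions: { Env: "prod", UserId: "u-2" }, value: 1 },
];

// Two Env-only metrics plus one more per distinct UserId value.
const distinctMetrics = countDistinctMetrics(samplePoints);
```

Five datapoints already produce four metrics here, and every new `UserId` value would add another one.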

App -> EMF log line -> CloudWatch Logs -> Metric (with dimensions)
                                           |
                                           v
                                     Statistic over period
                                           |
                                           v
                                Alarm threshold + evaluation
                                           |
                                           v
                                     SNS -> PagerDuty
From event to alarm

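Often the cheapest way to emit custom metrics is the Embedded Metric Format (EMF): you log a JSON line and CloudWatch extracts metrics from it during ingestion. There are no PutMetricData calls to make; you pay for log ingestion and for the resulting custom metrics instead.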

Hands-on Example

A Node service emits a custom OrderProcessed metric per environment:

{
  "_aws": {
    "Timestamp": 1719500000000,
    "CloudWatchMetrics": [{
      "Namespace": "ShopApp",
      "Dimensions": [["Env"]],
      "Metrics": [{ "Name": "OrderProcessed", "Unit": "Count" }]
    }]
  },
  "Env": "prod",
  "OrderProcessed": 1
}
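As a rough sketch of how the service might produce that line, the helper below builds the same structure from its arguments. `emfLine` is a hypothetical name, it handles only a single Count metric with one dimension set, and AWS also publishes EMF client libraries that cover more cases.

```typescript
// Hypothetical helper: builds one EMF log line for a single Count metric.
function emfLine(
  namespace: string,
  metricName: string,
  value: number,
  dimensions: Record<string, string>,
  timestampMs: number
): string {
  const record: Record<string, unknown> = {
    _aws: {
      Timestamp: timestampMs,
      CloudWatchMetrics: [
        {
          Namespace: namespace,
          // One dimension set containing every dimension key we were given.
          Dimensions: [Object.keys(dimensions)],
          Metrics: [{ Name: metricName, Unit: "Count" }],
        },
      ],
    },
  };
  // Dimension values and the metric value sit at the top level of the record.
  for (const key of Object.keys(dimensions)) record[key] = dimensions[key];
  record[metricName] = value;
  return JSON.stringify(record);
}

// Same values as the JSON example above.
const orderLine = emfLine("ShopApp", "OrderProcessed", 1, { Env: "prod" }, 1719500000000);
```

In Lambda, writing `orderLine` to stdout is enough because stdout is forwarded to CloudWatch Logs automatically; other runtimes need an agent or log driver to ship the line, which this sketch does not cover.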

Now define an alarm that pages if order throughput drops sharply compared to recent history. Using metric math:

Alarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: orders-throughput-drop
    Metrics:
      # The metric being watched: orders per minute in prod.
      - Id: m1
        MetricStat:
          Metric:
            Namespace: ShopApp
            MetricName: OrderProcessed
            Dimensions: [{ Name: Env, Value: prod }]
          Period: 60
          Stat: Sum
      # Expected range, 2 standard deviations wide. The band cannot be
      # compared inside another expression; the alarm references it
      # through ThresholdMetricId instead of a static Threshold.
      - Id: anom
        Expression: ANOMALY_DETECTION_BAND(m1, 2)
    EvaluationPeriods: 5
    DatapointsToAlarm: 3
    ThresholdMetricId: anom
    # Alarm only when throughput falls below the lower edge of the band.
    ComparisonOperator: LessThanLowerThreshold
    AlarmActions: [!Ref PagerDutyTopic]

That alarm fires when 3 of the last 5 one-minute datapoints fall below the lower edge of the anomaly band, which is far less noisy than a static threshold.
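The "M out of N" rule behind EvaluationPeriods and DatapointsToAlarm can be sketched in a few lines. This is a simplified model under stated assumptions: `isInAlarm` is a hypothetical function, every period is assumed to have a datapoint, and CloudWatch's real evaluation also has to deal with missing data.

```typescript
// Returns true when at least `datapointsToAlarm` of the most recent
// `evaluationPeriods` datapoints are breaching.
function isInAlarm(
  breaching: boolean[], // oldest first, one entry per period
  evaluationPeriods: number,
  datapointsToAlarm: number
): boolean {
  const recent = breaching.slice(-evaluationPeriods);
  const breachCount = recent.filter((b) => b).length;
  return breachCount >= datapointsToAlarm;
}

// A single bad minute does not page.
const blip = isInAlarm([false, false, true, false, false], 5, 3);

// Three bad minutes out of five do, even with a healthy minute in between.
const sustained = isInAlarm([false, true, true, false, true], 5, 3);
```

With 3 out of 5, `blip` is `false` and `sustained` is `true`, which is the filtering behavior the alarm relies on.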

For a service SLO, combine multiple alarms in a composite alarm that only pages when both error rate and latency are bad:

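# PAGER_TOPIC_ARN is a placeholder for the ARN of your paging SNS topic.
aws cloudwatch put-composite-alarm \
  --alarm-name svc-slo-breach \
  --alarm-rule "ALARM(errors-high) AND ALARM(latency-p99-high)" \
  --alarm-actions "$PAGER_TOPIC_ARN"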

Common Pitfalls

  • High-cardinality dimensions. Putting userId or requestId as a dimension creates millions of metrics and a huge bill. Keep dimensions to environment, region, route group.
  • Average everywhere. Average hides spikes. Use p99 for latency, Sum for counts, and reserve Average for utilization metrics.
  • Single datapoint alarms. Set EvaluationPeriods greater than 1 and pair it with DatapointsToAlarm to filter blips; otherwise you page on every transient hiccup.
  • No TreatMissingData policy. A silent producer should usually alarm, but by default missing datapoints are ignored and an alarm with no data at all moves to INSUFFICIENT_DATA rather than ALARM. Set TreatMissingData: breaching for heartbeat metrics.
  • Forgetting log retention. CloudWatch Logs log groups never expire by default. Set retention per log group or pay for storage forever.
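The "Average everywhere" pitfall is easy to see with numbers. The sketch below uses made-up latencies and a simple nearest-rank percentile; CloudWatch's own percentile calculation may differ in detail, so treat this as an illustration of the effect rather than a reproduction of the service.

```typescript
function average(values: number[]): number {
  let sum = 0;
  for (const v of values) sum += v;
  return sum / values.length;
}

// Nearest-rank percentile: the smallest sample with at least p percent
// of all samples at or below it.
function percentile(values: number[], p: number): number {
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.max(Math.ceil((p * sorted.length) / 100), 1);
  return sorted[rank - 1];
}

// Illustrative data: 98 fast requests at 50 ms and 2 slow ones at 3000 ms.
const latenciesMs: number[] = [];
for (let i = 0; i < 98; i++) latenciesMs.push(50);
latenciesMs.push(3000, 3000);

const avgMs = average(latenciesMs);        // 109
const p99Ms = percentile(latenciesMs, 99); // 3000
```

An average of 109 ms looks healthy, while p99 shows that some customers waited three seconds.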

Production Tips

  • Build dashboards from the four golden signals: latency, traffic, errors, saturation. Layer business metrics (orders per minute, signups) above them.
  • Use alarm tagging and an SNS topic per severity. Route critical to a pager, warnings to Slack.
  • Suppress deploy noise with the composite alarm ActionsSuppressor setting, or with composite alarm rules that reference a deploy-in-progress alarm.
  • Export key metrics with CloudWatch Metric Streams, which deliver through Amazon Data Firehose (formerly Kinesis Data Firehose), if you want to pipe them into Datadog, Grafana, or a long-term store.
  • Run a quarterly alarm cleanup: any alarm that fired more than 20 times last quarter without an incident is a candidate to retire or retune.
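The quarterly cleanup rule is a simple filter once you have the history. In this sketch the `AlarmHistory` shape, the alarm named `cpu-high`, and all the counts are hypothetical; how you collect firings and link them to incidents depends on your own tooling.

```typescript
// Hypothetical record of one alarm's behavior over the last quarter.
interface AlarmHistory {
  name: string;
  timesFired: number; // transitions into ALARM
  linkedIncidents: number; // incidents this alarm was associated with
}

// Alarms that fired more than `maxNoisyFirings` times without any
// incident are candidates to retire or retune.
function cleanupCandidates(history: AlarmHistory[], maxNoisyFirings: number): string[] {
  return history
    .filter((a) => a.timesFired > maxNoisyFirings && a.linkedIncidents === 0)
    .map((a) => a.name);
}

const lastQuarter: AlarmHistory[] = [
  { name: "cpu-high", timesFired: 42, linkedIncidents: 0 },
  { name: "errors-high", timesFired: 25, linkedIncidents: 3 },
  { name: "orders-throughput-drop", timesFired: 2, linkedIncidents: 0 },
];

// Threshold of 20 firings, as suggested in the tip above.
const toReview = cleanupCandidates(lastQuarter, 20);
```

Only `cpu-high` is flagged: `errors-high` is noisy but tied to real incidents, and `orders-throughput-drop` rarely fires.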

Wrap-up

CloudWatch can be the backbone of your observability stack if you treat metrics as a first-class concern. Emit custom metrics with EMF, alarm on customer-visible signals with anomaly detection or composite rules, and keep dimensions sane to control cost. The result is a quiet inbox during normal operation and a fast, trustworthy signal when things break.