DevOps Monitoring with Prometheus and Grafana

Intermediate 10 min read

What you'll learn

✓ How Prometheus scrapes metrics
✓ PromQL fundamentals
✓ Dashboarding with Grafana
✓ Alertmanager basics
✓ Avoiding cardinality explosions

Prerequisites

• Familiarity with the shell and YAML

What and Why

Modern services are too dynamic to babysit by hand. Containers come and go, instances scale up and down, and a single user request can fan out across a dozen microservices. Monitoring exists to answer two simple questions: is the system healthy right now, and how is it trending over time? Prometheus and Grafana have become the de facto open-source pairing because they answer those questions cheaply, reliably, and with portable tooling.

Prometheus pulls numeric time series from your services and stores them. Grafana paints those series onto dashboards that humans can read at a glance. Add Alertmanager and you get pages when things go wrong.

Mental Model

your-app  --/metrics-->  Prometheus  --query-->  Grafana
                            |
                            +--alerts-->  Alertmanager  -->  Slack / PagerDuty

Pull-based metrics flow

Prometheus is pull-based. Your application exposes an HTTP endpoint, usually /metrics, that returns counters, gauges, and histograms in a plain text format. Prometheus scrapes that endpoint on an interval, stores the samples, and lets you query them with PromQL. Grafana queries Prometheus and renders charts. Prometheus also evaluates alerting rules and sends firing alerts to Alertmanager, which handles routing, grouping, and silencing.
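
The routing, grouping, and silencing role is easiest to see in Alertmanager's own config file. The following is a minimal sketch rather than a recommended setup: the receiver names, the severity label, the Slack channel, and both credentials are placeholders that do not come from this article.

```yaml
# alertmanager.yml (illustrative; all names and keys are placeholders)
route:
  receiver: slack-default          # fallback receiver for anything not matched below
  group_by: ["alertname", "job"]   # alerts sharing these labels become one notification
  group_wait: 30s                  # wait before sending the first notification for a new group
  group_interval: 5m               # wait before notifying about new alerts added to a group
  repeat_interval: 4h              # re-send interval while an alert keeps firing
  routes:
    - matchers:
        - severity="page"          # assumes alerts carry a severity label
      receiver: pagerduty-oncall
receivers:
  - name: slack-default
    slack_configs:
      - api_url: "https://hooks.slack.com/services/REPLACE/ME"
        channel: "#alerts"
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: "REPLACE_ME"
```

Grouping and routing live in this file. Silences are created at runtime, through the Alertmanager UI or amtool, not in the config.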

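The unit of data is the time series, identified by a metric name and a set of labels, for example http_requests_total{method="GET", status="200"}. Labels are powerful but dangerous: every distinct combination of label values is stored as its own series, so a label with many possible values (high cardinality) can blow up memory.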

Hands-on Example

Run the stack locally with Docker Compose:

version: "3.8"
services:
  prometheus:
    image: prom/prometheus:v2.53.0
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
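    ports:
      - "9090:9090"
    extra_hosts:
      # Lets the container resolve host.docker.internal on Linux (Docker 20.10+).
      - "host.docker.internal:host-gateway"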
  grafana:
    image: grafana/grafana:11.1.0
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    depends_on:
      - prometheus

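A minimal prometheus.yml that scrapes Prometheus itself and a node_exporter assumed to be running on the Docker host at port 9100: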

global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: "node"
    static_configs:
      - targets: ["host.docker.internal:9100"]
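
To let this Prometheus evaluate rules and hand firing alerts to Alertmanager, the same file would typically gain two more top-level blocks. This is a sketch, assuming a rules directory mounted at /etc/prometheus/rules and an Alertmanager container reachable as alertmanager:9093; neither is part of the Compose file above.

```yaml
# Additions to prometheus.yml (path and hostname are assumptions)
rule_files:
  - /etc/prometheus/rules/*.yml     # recording and alerting rule files

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]   # 9093 is Alertmanager's default port
```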

Useful PromQL queries:

# request rate per second over the last 5 minutes
rate(http_requests_total[5m])

# 95th percentile latency
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# error ratio
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m]))

In Grafana, add Prometheus as a data source pointing at http://prometheus:9090, then build panels using those queries. A common starting point is to import a community dashboard (Node Exporter Full is a classic) and customize from there.
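
The data source can also be declared in a file instead of clicked together in the UI. A minimal sketch of a Grafana provisioning file follows, assuming it is mounted into the container under /etc/grafana/provisioning/datasources/; the file name and the data source name are illustrative.

```yaml
# provisioning/datasources/prometheus.yml (file name is an assumption)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                 # Grafana's backend makes the requests, not the browser
    url: http://prometheus:9090   # Compose service name from the example above
    isDefault: true
```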

Common Pitfalls

The first mistake is leaking unbounded label values: putting user_id or full URL paths into a label. Each new value creates a new time series, and Prometheus will eventually run out of memory. Normalize routes (record /users/:id, not /users/42) and avoid per-request labels.
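
The real fix belongs in the application's instrumentation, but the scrape config can act as a guard rail. The sketch below uses a hypothetical job, target, and metric name, and the limit of 10000 is an arbitrary illustration rather than a recommendation.

```yaml
scrape_configs:
  - job_name: "my-app"                 # hypothetical job
    sample_limit: 10000                # the whole scrape fails if a target returns more samples
    static_configs:
      - targets: ["my-app:8080"]       # hypothetical target
    metric_relabel_configs:
      # Drop a metric known to carry per-user labels (metric name is hypothetical).
      - source_labels: [__name__]
        regex: "http_requests_by_user_total"
        action: drop
```

With sample_limit set, a runaway target shows up as a failed scrape instead of quietly inflating memory.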

The second mistake is scraping too aggressively. A 1-second scrape interval feels precise but multiplies storage and CPU. The 15 seconds used in the config above is a common choice; Prometheus's built-in default is 1 minute.
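
The interval does not have to be one global value: a 1s interval stores 15 times as many samples as 15s, so it can be worth slowing down targets that change rarely. A sketch, with a hypothetical batch-exporter job and port:

```yaml
global:
  scrape_interval: 15s            # applies to every job that does not override it
scrape_configs:
  - job_name: "node"
    static_configs:
      - targets: ["host.docker.internal:9100"]
  - job_name: "batch-exporter"    # hypothetical slow-moving target
    scrape_interval: 60s          # per-job override
    scrape_timeout: 10s           # must not exceed the job's scrape_interval
    static_configs:
      - targets: ["batch-exporter:9400"]
```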

The third is alerting on raw counters. Counters reset on restart; use rate() or increase() over a window. And do not alert on every metric. Alert on symptoms users feel: high latency, error rate, queue depth, saturation.
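
As an illustration of a symptom-based alert built on rate() rather than a raw counter, an alerting rule file might look like the following. The group name, the 5% threshold, the 10-minute hold, and the severity label are assumptions chosen for the example, not tuned values.

```yaml
groups:
  - name: http-symptoms
    rules:
      - alert: HighErrorRatio
        # Share of requests answered with a 5xx status over the last 5 minutes.
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 10m                  # condition must hold this long before the alert fires
        labels:
          severity: page
        annotations:
          summary: "More than 5% of requests are returning 5xx"
```

The for clause keeps a brief spike from paging anyone; the alert fires only if the ratio stays above the threshold for the whole period.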

Production Tips

For anything larger than a single team, plan for long-term storage. Prometheus keeps 15 days of data by default. Thanos or Mimir give you cheap object-storage backing and global queries across clusters.
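
One way to ship samples to a long-term store is remote write. The fragment below is a sketch for a Mimir-style endpoint; the hostname is a placeholder, and a real deployment would normally add authentication. Thanos is often attached differently (for example as a sidecar next to Prometheus), so this applies to it only loosely.

```yaml
# Addition to prometheus.yml (URL is a placeholder)
remote_write:
  - url: http://mimir.example.internal/api/v1/push
```

Local retention is controlled separately, with the --storage.tsdb.retention.time flag on the Prometheus command line.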

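Use recording rules to precompute expensive queries. A dashboard that re-evaluates a histogram query across a 30-day range on every refresh is painful; precompute it once a minute into a recorded series (for example, one whose name ends in :p95) and have the dashboard read that instead.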
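
A recording rule for the 95th percentile latency query shown earlier could be sketched like this. The group name and the recorded series name are illustrative; the name follows the common level:metric:operations convention.

```yaml
groups:
  - name: http-latency
    interval: 1m                  # evaluate this group once a minute
    rules:
      - record: job:http_request_duration_seconds:p95
        expr: |
          histogram_quantile(
            0.95,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
          )
```

A dashboard panel can then query job:http_request_duration_seconds:p95 directly, which is a plain series lookup rather than a histogram computation.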

Treat dashboards and alerts as code. Store them in Git, render them via Grafonnet or Terraform, and code-review changes. The team that ships dashboards through PRs catches bad thresholds before they page someone at 3 a.m.
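
On the Grafana side, one simple way to serve dashboards from a Git checkout is file-based provisioning. This is a sketch, assuming the repository's dashboard JSON files are mounted at /var/lib/grafana/dashboards and this file sits under /etc/grafana/provisioning/dashboards/; the provider name is made up.

```yaml
# provisioning/dashboards/default.yml (file name and paths are assumptions)
apiVersion: 1
providers:
  - name: "git-managed"
    type: file
    disableDeletion: true         # dashboards cannot be deleted from the UI
    allowUiUpdates: false         # UI edits cannot be saved over the provisioned version
    options:
      path: /var/lib/grafana/dashboards   # directory of dashboard JSON files from the repo
```

Locking UI edits this way pushes every change through a pull request, which is the point of treating dashboards as code.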

Wrap-up

Prometheus and Grafana are not magic, but together they give you a fast feedback loop for service health. Start with one service exposing /metrics, scrape it, build a dashboard with request rate, error rate, and latency, then add a single alert on error ratio. Expand from there. Observability is built by accretion, not big-bang projects.