Prometheus and Grafana Basics for Metrics

Intermediate 10 min read

What you'll learn

✓ How Prometheus scrapes and stores metrics
✓ The four core metric types
✓ How to write basic PromQL queries
✓ How Grafana panels and dashboards work
✓ How to define alerts that page only when it matters

Prerequisites

• Comfortable with containers (see What is Docker)
• Familiarity with Kubernetes concepts is a plus (see What is Kubernetes)

Metrics tell you what is happening. Logs tell you why. Traces tell you where. The most common starting point for observability is the metrics stack, and the most common metrics stack is Prometheus for collection and Grafana for visualization. This post walks through both, the query language between them, and the alert patterns that keep on-call rotations sane.

The Prometheus model

Prometheus is a time-series database. It stores a stream of numeric samples, each tagged with a metric name and a set of key/value labels. The unit of data is a sample like http_requests_total{method="GET",route="/api"} 1234 @ t=1718650000.

Two architectural choices define it:

Pull-based scraping. Prometheus calls your service’s /metrics endpoint on an interval (1 minute by default; 15 seconds is a common setting and is what the config below uses) and stores what it gets.
Multi-dimensional labels. The same metric name can have many label combinations, each forming its own time series.

You expose metrics, Prometheus scrapes them, queries answer questions across labels.
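
To make the label model concrete, here is a minimal sketch in plain JavaScript of how a metric name plus a label set identifies one series. It is a toy, not Prometheus's storage code; seriesKey, append, and the Map-based store are hypothetical names used only for illustration.

```javascript
// Hypothetical helper: build a stable identity for a series from its
// metric name and labels. Labels are sorted so key order does not matter.
function seriesKey(name, labels) {
  const pairs = Object.keys(labels)
    .sort()
    .map((k) => `${k}="${labels[k]}"`);
  return `${name}{${pairs.join(',')}}`;
}

// Toy in-memory store: one array of [timestamp, value] samples per series.
const store = new Map();

function append(name, labels, timestamp, value) {
  const key = seriesKey(name, labels);
  if (!store.has(key)) store.set(key, []);
  store.get(key).push([timestamp, value]);
}

append('http_requests_total', { method: 'GET', route: '/api' }, 1718650000, 1234);
append('http_requests_total', { route: '/api', method: 'GET' }, 1718650015, 1240);
append('http_requests_total', { method: 'POST', route: '/api' }, 1718650015, 17);

console.log(store.size); // 2 distinct series under one metric name
```

The two GET samples land in the same series even though their labels were written in a different order, while the POST sample starts a second series. Every new label value does the same, which is why label choice drives storage cost.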

Running Prometheus

The simplest setup is a single container.

# docker-compose.yml
services:
  prometheus:
    image: prom/prometheus:v2.54.0
    ports: ["9090:9090"]
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

The config file declares what to scrape.

# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'api'
    static_configs:
      - targets: ['api:3000']
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

Bring it up and visit http://localhost:9090. The Status > Targets page should show each target as UP.

The four metric types

Most application metrics fall into four shapes:

Counter — monotonically increasing. Request counts, errors, bytes sent. Reset only on process restart.
Gauge — value that can go up or down. Queue length, memory in use, temperature.
Histogram — distribution. Request latency in buckets, plus a sum and count.
Summary — similar to histogram but computes quantiles client-side. Use histograms when in doubt.

The choice matters because PromQL functions assume the right type. A counter goes through rate() before you graph it. A gauge does not.
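
Why a counter needs rate() is easier to see in code. The sketch below is a simplified stand-in for PromQL's rate(), assuming samples arrive as [timestampSeconds, value] pairs; simpleRate is a hypothetical name, and unlike the real function it does not extrapolate to the edges of the window.

```javascript
// Simplified per-second rate over a window of [timestampSeconds, value] samples.
// Assumption: no extrapolation to the window edges, unlike real PromQL rate().
function simpleRate(samples) {
  if (samples.length < 2) return NaN;
  const elapsed = samples[samples.length - 1][0] - samples[0][0];
  if (elapsed <= 0) return NaN;

  let increase = 0;
  for (let i = 1; i < samples.length; i++) {
    const prev = samples[i - 1][1];
    const curr = samples[i][1];
    // A drop means the process restarted and the counter began again at zero,
    // so the whole current value counts as new increase.
    increase += curr >= prev ? curr - prev : curr;
  }
  return increase / elapsed;
}

const samples = [
  [0, 100],
  [15, 130],
  [30, 160],
  [45, 30], // counter reset after a restart
  [60, 60],
];
console.log(simpleRate(samples)); // 2
```

Graphing the raw values would show a cliff at the restart. The rate stays at a steady 2 requests per second because the reset is treated as a fresh start rather than a negative change. Applying the same logic to a gauge would misread every legitimate decrease as a reset.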

Instrumenting a service

In a Node app, the prom-client library collects metrics and renders them for a /metrics endpoint.

import express from 'express';
import client from 'prom-client';

const app = express();
client.collectDefaultMetrics();

const httpReqs = new client.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status'],
});

app.use((req, res, next) => {
  res.on('finish', () => {
    httpReqs.inc({
      method: req.method,
      route: req.route?.path ?? 'unknown',
      status: res.statusCode,
    });
  });
  next();
});

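// Serve the current value of every registered metric in the text format.
app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.send(await client.register.metrics());
});

// Listen on the port Prometheus scrapes (the 'api:3000' target above).
app.listen(3000);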
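
For orientation, the /metrics response is plain text with one line per series. The sketch below renders a counter in that format by hand. renderCounter is a hypothetical helper that skips label-value escaping; in the real app, prom-client's register.metrics() produces this output for you.

```javascript
// Hypothetical helper: render one counter family in the text exposition
// format. Assumption: label values contain no quotes, backslashes, or
// newlines, so no escaping is done.
function renderCounter(name, help, series) {
  const lines = [`# HELP ${name} ${help}`, `# TYPE ${name} counter`];
  for (const { labels, value } of series) {
    const pairs = Object.entries(labels)
      .map(([k, v]) => `${k}="${v}"`)
      .join(',');
    lines.push(`${name}{${pairs}} ${value}`);
  }
  return lines.join('\n') + '\n';
}

const body = renderCounter('http_requests_total', 'Total HTTP requests', [
  { labels: { method: 'GET', route: '/api', status: '200' }, value: 1234 },
  { labels: { method: 'GET', route: '/api', status: '500' }, value: 3 },
]);
console.log(body);
```

Running curl against the endpoint and reading this text is usually the quickest way to check that a new metric and its labels look the way you intended.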

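Keep label cardinality low. Adding user_id as a label explodes the time-series count and slows everything down. Stick to bounded sets like methods, route templates, and HTTP status codes or classes. The middleware above labels by req.route.path (the route template, such as /users/:id) rather than the raw URL for the same reason.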
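
The arithmetic behind that advice fits in a few lines. The worst-case series count for one metric is the product of the distinct values each label can take. The label counts below are illustrative numbers, not measurements, and statusClass and maxSeries are hypothetical helper names.

```javascript
// Collapse an exact status code into a bounded class such as "2xx" or "5xx".
function statusClass(code) {
  return `${Math.floor(code / 100)}xx`;
}

// Worst-case series count for one metric: the product of the number of
// distinct values each label can take.
function maxSeries(labelValueCounts) {
  return Object.values(labelValueCounts).reduce((acc, n) => acc * n, 1);
}

// Illustrative numbers only.
const bounded = maxSeries({ method: 5, route: 20, status: 5 });
const withUserId = maxSeries({ method: 5, route: 20, status: 5, user_id: 100000 });

console.log(statusClass(503)); // "5xx"
console.log(bounded);          // 500
console.log(withUserId);       // 50000000
```

One unbounded label turns a few hundred series into tens of millions in the worst case. That is the reason to keep identifiers like user_id in logs or traces instead.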

PromQL essentials

PromQL is the query language. A few patterns cover most needs.

Instant value of a counter (raw, rarely useful):

http_requests_total

Per-second request rate over the last 5 minutes:

rate(http_requests_total[5m])

Error rate by route:

sum by (route) (rate(http_requests_total{status=~"5.."}[5m]))

Latency p95 from a histogram:

histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

CPU usage from node_exporter, as a fraction between 0 and 1:

1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

Two functions appear everywhere: rate() for per-second average over a window, and sum by () for aggregating across labels. Learn those first.
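
The p95 query above depends on how histogram_quantile() turns bucket counts into an estimate. Here is a simplified sketch of that interpolation, assuming cumulative buckets sorted by upper bound, at least one finite bucket, and a first bucket that starts at zero. histogramQuantile is a hypothetical name, and the sketch leaves out edge cases the real function handles.

```javascript
// buckets: cumulative counts sorted by upper bound `le`, ending with Infinity.
function histogramQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;

  // First bucket whose cumulative count reaches the rank.
  const i = buckets.findIndex((b) => b.count >= rank);

  // Rank falls in the +Inf bucket: report the highest finite bound.
  if (buckets[i].le === Infinity) return buckets[i - 1].le;

  const lowerBound = i === 0 ? 0 : buckets[i - 1].le;
  const lowerCount = i === 0 ? 0 : buckets[i - 1].count;
  const inBucket = buckets[i].count - lowerCount;

  // Assume observations are spread evenly inside the bucket.
  return lowerBound + (buckets[i].le - lowerBound) * ((rank - lowerCount) / inBucket);
}

const buckets = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1, count: 100 },
  { le: Infinity, count: 100 },
];
console.log(histogramQuantile(0.95, buckets)); // 0.75
```

The result is an estimate bounded by the bucket edges: the true p95 could be anywhere between 0.5 and 1 second. Picking bucket boundaries close to your latency targets therefore matters more than having many buckets.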

Grafana for dashboards

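Grafana reads from Prometheus (and many other sources) and renders panels. Run it next to Prometheus by adding a service under services: in the same Compose file. The admin password shown here is suitable for local experiments only.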

grafana:
  image: grafana/grafana:11.1.0
  ports: ["3000:3000"]
  environment:
    GF_SECURITY_ADMIN_PASSWORD: admin

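Log in at http://localhost:3000, then add Prometheus as a data source with the URL http://prometheus:9090 (the Compose service name, which resolves inside the Docker network). Then build a dashboard: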

One row for traffic — request rate by route.
One row for errors — 5xx rate, error ratio.
One row for latency — p50, p95, p99 from histograms.
One row for saturation — CPU, memory, queue depth.

These four rows combine the RED method (Rate, Errors, Duration) with the saturation signals of the USE method (Utilization, Saturation, Errors). They cover most “is the system healthy” questions.

Alerting that pages humans

Alerts live in Prometheus rules or in Grafana. The principle is the same: alert on symptoms, not causes.

# alert.rules.yml
groups:
  - name: api
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum(rate(http_requests_total[5m]))
          > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "API error rate above 5 percent"

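Two things make this a good alert: it measures user-facing impact (error ratio), and it requires the condition to hold for 10 minutes (for: 10m) so a transient blip does not page anyone. For Prometheus to evaluate the rule, list the file under rule_files in prometheus.yml and mount it into the container next to the config.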
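
The effect of the for clause can be sketched as a small state machine. This is a toy model under simplifying assumptions, not Prometheus's rule evaluator; makeAlert and the evaluation timestamps are hypothetical.

```javascript
// Toy model of a rule's `for:` clause: an alert fires only after its
// condition has been true on every evaluation for `forSeconds`.
function makeAlert(forSeconds) {
  let activeSince = null; // time the condition first became true
  return function evaluate(now, conditionTrue) {
    if (!conditionTrue) {
      activeSince = null; // any false evaluation resets the clock
      return 'inactive';
    }
    if (activeSince === null) activeSince = now;
    return now - activeSince >= forSeconds ? 'firing' : 'pending';
  };
}

const evaluate = makeAlert(600); // for: 10m
const states = [
  evaluate(0, true),    // pending
  evaluate(300, true),  // pending
  evaluate(360, false), // inactive: the blip ended
  evaluate(420, true),  // pending again, clock restarted
  evaluate(1020, true), // firing: 600 s of continuous breach
];
console.log(states.join(' -> '));
```

The six-minute blip at the start never reaches firing, so nobody is paged for it. Only the later breach, sustained for the full 10 minutes, does.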

Resist the urge to alert on every gauge that crosses a threshold. CPU at 90 percent is not a problem if latency is fine. Latency at 5 seconds is a problem even if CPU is at 20 percent.

Service discovery in Kubernetes

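In Kubernetes, you do not list targets by hand. The Prometheus Operator and the kube-prometheus-stack Helm chart wire up service discovery and ship a starter set of dashboards. The Operator picks targets through ServiceMonitor and PodMonitor resources; setups that use annotation-based relabeling rules (a common convention, for example in the community prometheus Helm chart) instead scrape every pod that carries annotations like these: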

metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "3000"

The cluster becomes self-describing for metrics. New deployments show up automatically.

Retention and long-term storage

Prometheus stores data on local disk and keeps it for 15 days by default; the --storage.tsdb.retention.time flag changes that. For long-term storage, ship to Thanos, Cortex, or Grafana Mimir. The query layer remains PromQL, but storage is now durable and queryable across clusters.

Pitfalls

High-cardinality labels. They are the number one cause of slow Prometheus instances.
Querying long ranges with small intervals. A 30-day graph at 10-second resolution will hurt.
Alerting on raw counters. Always wrap in rate() first.
Treating dashboards as documentation. They go stale. Review them quarterly.
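
The long-range pitfall comes down to arithmetic. The sketch below counts the points a range query returns per series and picks a coarser step for a given point budget. The budget of 1000 points is an illustrative assumption (roughly one per horizontal pixel), and both function names are hypothetical.

```javascript
// Points one series returns for a range query with a fixed step.
function pointsPerSeries(rangeSeconds, stepSeconds) {
  return Math.floor(rangeSeconds / stepSeconds) + 1;
}

// Smallest whole-second step that keeps the result within a point budget.
function stepFor(rangeSeconds, maxPoints) {
  return Math.ceil(rangeSeconds / (maxPoints - 1));
}

const thirtyDays = 30 * 24 * 60 * 60; // 2,592,000 seconds

console.log(pointsPerSeries(thirtyDays, 10)); // 259201
console.log(stepFor(thirtyDays, 1000));       // 2595 seconds, about 43 minutes
```

A quarter of a million points per series is far more than any panel can draw, and the cost multiplies by the number of series the query matches. Letting the step grow with the time range keeps long-range dashboards responsive.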

Wrap up

Prometheus and Grafana cover the metrics half of observability with two pieces: a pull-based time-series database with a powerful query language, and a renderer that turns queries into pictures. Instrument your services with counters, histograms, and a small set of low-cardinality labels. Build dashboards organized by RED and USE. Write a handful of high-quality alerts on user-facing symptoms. That is enough to know when your system is healthy and when it is not — which is the whole job.