DevOps Observability Stack Overview

Beginner 8 min read

What you'll learn

✓ The three pillars and what each is good at
✓ How telemetry flows from app to backend
✓ Where OpenTelemetry fits in
✓ Common pitfalls with cost and cardinality
✓ How to start small and grow the stack

Prerequisites

• Basic familiarity with running a service in production

What and Why

Observability is the ability to ask new questions about a running system without shipping new code. Monitoring tells you the things you decided to watch. Observability lets you investigate the things you did not. The distinction matters because production behavior is full of surprises that no dashboard anticipated.

The reason you need a stack rather than a single tool is that different questions need different data shapes. “Is latency up?” is a metrics question. “What happened to this one request?” is a trace question. “Why did it fail?” is usually a logs question. A good stack lets you pivot between them.

Mental Model

The classic three pillars are metrics, logs, and traces. A useful fourth is events, sometimes folded into logs.

Metrics are numbers over time. They are cheap, aggregable, and answer “how much” or “how many” at a glance.

Logs are structured records of discrete things that happened. They are detailed but expensive at volume.

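Traces are causally linked sets of spans showing how a single request moved through services. Of the three pillars, they are the only one that records cross-service flow directly; logs can approximate it only if every service stamps a shared request ID on its records.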
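
To make "causally linked" concrete, here is a minimal sketch of how spans form a tree through parent references. The Span record and the sample service names are illustrative assumptions, not the OpenTelemetry SDK types.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    """Illustrative span record; not the OpenTelemetry SDK type."""
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the trace
    service: str
    name: str

def render_trace(spans: list[Span]) -> list[str]:
    """Return one indented line per span, with children listed under their parent."""
    children: dict[Optional[str], list[Span]] = {}
    for s in spans:
        children.setdefault(s.parent_id, []).append(s)

    lines: list[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        for s in children.get(parent_id, []):
            lines.append("  " * depth + f"{s.service}: {s.name}")
            walk(s.span_id, depth + 1)

    walk(None, 0)
    return lines

# Hypothetical spans from one checkout request crossing four services.
spans = [
    Span("a", None, "gateway", "POST /checkout"),
    Span("b", "a", "checkout", "checkout"),
    Span("c", "b", "payments", "charge"),
    Span("d", "b", "inventory", "reserve"),
]
print("\n".join(render_trace(spans)))
```

A tracing UI does essentially the same grouping, plus timing, when it draws a waterfall view.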

Events are point-in-time records like deploys, config changes, and incidents that you overlay on the other three to explain shifts.

Most stacks have four layers: instrumentation in the app that emits telemetry, a collector or agent that ships it, a backend that stores and indexes it, and a UI that lets you query it.

Hands-on Example

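A minimal Python service emitting all three signals: traces and metrics through the OpenTelemetry SDK, and logs as JSON lines through the standard logging module:

from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
import json
import logging

# Assumption: the collector listens at this address over plaintext gRPC (hence insecure=True).
ENDPOINT = "otel-collector:4317"

# Traces: batch spans and export them to the collector over OTLP.
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint=ENDPOINT, insecure=True))
)

# Metrics: without a configured MeterProvider the counter below would be a no-op.
reader = PeriodicExportingMetricReader(OTLPMetricExporter(endpoint=ENDPOINT, insecure=True))
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

tracer = trace.get_tracer("checkout")
meter = metrics.get_meter("checkout")
orders = meter.create_counter("orders_total")

# Logs: one JSON object per line, written by the standard logging module.
logging.basicConfig(level=logging.INFO, format="%(message)s")

def checkout(user_id, cart):
    with tracer.start_as_current_span("checkout") as span:
        # Per-user detail goes on the span, not on a metric label.
        span.set_attribute("user.id", user_id)
        orders.add(1, {"status": "ok"})
        logging.info(json.dumps({"event": "checkout", "user": user_id}))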

The OTel Collector receives the telemetry, transforms it, and fans it out to Prometheus (metrics), Loki (logs), and Tempo (traces).

App
 |  OTLP
 v
OTel Collector ---+---------+---------+
                  |         |         |
                  v         v         v
               Metrics     Logs     Traces
               (Prom)    (Loki)    (Tempo)
                    \       |       /
                      \     |     /
                        v   v   v
                         Grafana
                 (queries + dashboards)

Telemetry flow through the stack

Common Pitfalls

High cardinality is the most expensive mistake. A metric with a user_id label becomes at least one time series per user, which inflates Prometheus memory and storage costs and slows every query that touches the metric. Keep labels low-cardinality. Push user-level detail into traces or logs.
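
The arithmetic behind that warning can be sketched in a few lines. The worst-case series count for one metric is the product of the number of distinct values each label can take. The label names and counts below are made-up assumptions for illustration.

```python
from math import prod

def series_count(label_values: dict[str, int]) -> int:
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(label_values.values())

# Hypothetical label sets: bounded labels only, then the same labels plus user_id.
low = series_count({"status": 3, "region": 5, "method": 4})
high = series_count({"status": 3, "region": 5, "method": 4, "user_id": 100_000})

print(low)   # 60
print(high)  # 6000000
```

One unbounded label turns a metric that fits on a single dashboard panel into millions of series.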

Logging everything is the second pitfall. Verbose logs in a high-traffic service can dominate your bill and bury the signal. Log at INFO for state transitions, at ERROR for failures, and use sampling for noisy paths.
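
One way to sample a noisy path is a filter on the standard logging module, sketched below. The SampleFilter class, the logger name, and the 1% rate are assumptions for illustration; warnings and errors are always kept.

```python
import logging
import random
from typing import Optional

class SampleFilter(logging.Filter):
    """Pass every record at WARNING or above; pass lower levels with probability `rate`."""

    def __init__(self, rate: float, rng: Optional[random.Random] = None) -> None:
        super().__init__()
        self.rate = rate
        self.rng = rng or random.Random()

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno >= logging.WARNING:
            return True  # never drop warnings or errors
        return self.rng.random() < self.rate

logger = logging.getLogger("checkout.cart")  # hypothetical noisy logger
logger.addFilter(SampleFilter(rate=0.01))    # keep roughly 1% of INFO and DEBUG records
```

Random sampling is the simplest policy. If you need every log line for a given request, sampling by a hash of the request ID is a common alternative.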

Trace gaps are the third. If one service in the chain is not instrumented, it drops the trace context: the trace shows a hole, and spans from services further downstream may end up in a new, disconnected trace. Instrument every hop, even small internal services.
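
What "instrumenting a hop" has to do at minimum is carry the trace context forward. The sketch below shows that step by hand using the W3C Trace Context traceparent header. In practice OpenTelemetry instrumentation normally does this for you; the function name and the lowercase header key are assumptions of this example, and validation is simplified.

```python
import re
import secrets

# W3C Trace Context header: version-traceid-parentid-flags, all lowercase hex.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def outgoing_headers(incoming: dict[str, str]) -> dict[str, str]:
    """Build headers for a downstream call, continuing the caller's trace if one exists."""
    match = _TRACEPARENT.match(incoming.get("traceparent", ""))
    if match:
        trace_id, _caller_span_id, flags = match.groups()
    else:
        # No valid context arrived, so this hop starts a new trace: the visible gap.
        trace_id, flags = secrets.token_hex(16), "01"
    span_id = secrets.token_hex(8)  # this service's own span becomes the downstream parent
    return {"traceparent": f"00-{trace_id}-{span_id}-{flags}"}

incoming = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
print(outgoing_headers(incoming))  # same trace ID, new span ID
print(outgoing_headers({}))        # fresh trace ID: the request can no longer be followed
```

A service that forwards requests without this step produces exactly the second case.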

Practical Tips

Start with metrics. They are the cheapest and give you the broadest situational awareness. Add tracing once you have multiple services that call each other. Add structured logging from day one, even if your backend is just stdout and grep.
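
Structured logging to stdout needs nothing beyond the standard library. The sketch below is one possible shape; the JsonFormatter class, the field names, and the "fields" key passed through extra are conventions of this example, not a standard.

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object so it stays grep- and machine-friendly."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Extra fields arrive via extra={"fields": {...}}; "fields" is this sketch's convention.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("checkout")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("order placed", extra={"fields": {"event": "checkout", "order_id": "o-123"}})
```

Because every line is valid JSON, the same output works with grep today and with a log backend later.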

Standardize on OpenTelemetry. It decouples your code from your backend, so swapping vendors does not require re-instrumenting.

Tag everything with service, version, and environment. These three labels make every dashboard reusable across services.
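
A minimal sketch of applying the three tags consistently: read them once from the environment and merge them into every label set. The environment variable names and helper functions are assumptions of this example. With the OpenTelemetry SDK, the usual equivalent is a Resource attached to the tracer and meter providers.

```python
import os

def common_tags() -> dict[str, str]:
    """Service, version, and environment, read from hypothetical environment variables."""
    return {
        "service": os.environ.get("SERVICE_NAME", "unknown"),
        "version": os.environ.get("SERVICE_VERSION", "unknown"),
        "environment": os.environ.get("DEPLOY_ENV", "dev"),
    }

def with_common_tags(labels: dict[str, str]) -> dict[str, str]:
    """Merge call-site labels with the common tags; the common tags win on conflict."""
    return {**labels, **common_tags()}

print(with_common_tags({"status": "ok"}))
```

Letting the common tags win on conflict keeps a call site from accidentally reporting under another service's name.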

Treat dashboards as code. Store them in Git, review them in pull requests, and avoid the click-to-create graveyard that every Grafana eventually becomes.
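
One way to keep dashboards in Git is to generate their JSON from a small script, sketched below. The dictionary is a deliberately simplified subset of Grafana's dashboard JSON model and would need more fields before import; the query assumes the counter is exposed to Prometheus as orders_total with a service label.

```python
import json

def dashboard(service: str) -> dict:
    """Build a small dashboard definition for one service."""
    return {
        "title": f"{service} overview",
        "panels": [
            {
                "type": "timeseries",
                "title": "Orders per second",
                "targets": [
                    {"expr": f'sum(rate(orders_total{{service="{service}"}}[5m]))'}
                ],
            }
        ],
    }

def render(service: str) -> str:
    """Serialize with sorted keys and fixed indentation so Git diffs stay small."""
    return json.dumps(dashboard(service), indent=2, sort_keys=True) + "\n"

print(render("checkout"))
```

Deterministic output matters more than the generator itself: a reviewer should see only the lines that actually changed.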

Wrap-up

An observability stack is the union of metrics, logs, traces, and events, each emitted by instrumentation, routed by a collector, stored in a backend, and surfaced in a UI. Get the basics right, watch cardinality, and standardize on OpenTelemetry. The goal is not more dashboards but faster answers when production surprises you.