Observability Is Not Monitoring: The Distinction That Finally Made Sense

For a long time I treated "observability" as a marketing word that monitoring-tool vendors had adopted to sound more sophisticated. Prometheus was monitoring. Grafana was monitoring. Adding Jaeger for traces didn't feel like a categorical shift; it felt like more monitoring.

The distinction clicked during an incident in early 2025 that I couldn't debug with any of our dashboards.

The incident

One of our payment processing services started returning elevated error rates — around 3%, up from a baseline of 0.02%. The errors were intermittent. Our dashboards showed the error rate, showed that p99 latency was elevated, showed that database connection count was normal. Nothing told me which requests were failing or why.

I spent 90 minutes looking at dashboards before I started reading raw logs. The logs told me in three minutes: a specific merchant ID was generating malformed requests that our validation layer mishandled, throwing an exception that was swallowed and counted as a server-side error instead of being returned as a 400. The fix was a one-line change. Those 90 minutes were wasted because our monitoring told me that something was wrong but gave me no path to why.

The actual distinction

Monitoring answers: is the system healthy? Observability answers: why is the system behaving this way?

Monitoring is optimized for known failure modes. You define what "unhealthy" looks like — error rate above X, latency above Y — and alert on it. It works well for the failure modes you anticipated when you wrote the alerts.

Observability is about being able to ask arbitrary questions about system state from the outside. The canonical definition comes from control theory: a system is observable if you can determine its internal state from its outputs. In practice this means: given any user complaint or anomalous behavior, can I find the cause without deploying new instrumentation?

The incident above was not a monitoring failure. The alert fired correctly. It was an observability failure — I couldn't determine what was causing the error rate from the outputs I had available.

The three pillars framing is incomplete

Logs, metrics, traces. You've seen this framing. It's useful as a starting taxonomy but it's not sufficient as a definition of observability, because you can have all three and still be unable to answer questions about your system.

What matters is whether the signals are correlated and high-cardinality. Correlated: a trace ID that links a log line to a trace span to a metric data point, so you can follow a single request across your entire system. High-cardinality: the ability to filter by arbitrary dimensions — by user ID, merchant ID, feature flag, deployment version — not just by service name and environment.

After the incident we added structured logging with request context propagation (user ID, merchant ID, request ID) and started correlating logs with our Jaeger traces via the trace ID. The next time a merchant-specific issue appeared, the investigation took eight minutes instead of ninety.
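
For concreteness, here is a minimal sketch of that kind of context propagation in Python, using only the standard library. The field names match the ones above; the contextvars-based plumbing and the placeholder trace ID are assumptions for illustration, since in a real service the trace ID would come from your Jaeger or OpenTelemetry client and the context would be set in request middleware.

```python
import contextvars
import json
import logging

# Request-scoped context. In a real service this would be set once per
# request in middleware; the values used below are placeholders.
request_context = contextvars.ContextVar("request_context", default={})


class ContextFilter(logging.Filter):
    """Copy the request-scoped fields onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        for key, value in request_context.get().items():
            setattr(record, key, value)
        return True


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line so the aggregation layer can parse fields."""

    FIELDS = ("trace_id", "request_id", "user_id", "merchant_id")

    def format(self, record: logging.LogRecord) -> str:
        payload = {"level": record.levelname, "message": record.getMessage()}
        for field in self.FIELDS:
            if hasattr(record, field):
                payload[field] = getattr(record, field)
        return json.dumps(payload)


logger = logging.getLogger("payments")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.addFilter(ContextFilter())
logger.setLevel(logging.INFO)

# In middleware: trace_id would come from the active Jaeger/OpenTelemetry
# span; the other identifiers come from the request itself.
request_context.set({
    "trace_id": "4bf92f3577b34da6",  # placeholder value
    "request_id": "req-8c21",
    "user_id": "u-1042",
    "merchant_id": "m-7781",
})

logger.info("validation failed for charge request")
# -> {"level": "INFO", "message": "...", "trace_id": "4bf92f3577b34da6", ...}
```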

What this means practically

You don't need to throw away Prometheus and Grafana. Metrics are efficient for alerting and trend analysis. But supplement them with correlated, high-cardinality logs that you can actually query when an alert fires.

We run Loki for log aggregation. Its label-based query model is more constrained than something like Honeycomb for high-cardinality queries, but it's what our team can operate, and it's better than structured logs with no aggregation layer at all.
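
To make that trade-off concrete, here is a hedged sketch of what a merchant-scoped query might look like against Loki's query_range HTTP API, assuming the structured log fields from the sketch above. The service name is a low-cardinality stream label; the high-cardinality merchant_id is parsed out of the JSON log line at query time. The endpoint, labels, and field names are illustrative, not our actual configuration.

```python
import json
import time
import urllib.parse
import urllib.request

# Assumed Loki endpoint; adjust to your deployment.
LOKI_URL = "http://localhost:3100/loki/api/v1/query_range"

# Low-cardinality dimensions (service) live in stream labels; high-cardinality
# dimensions (merchant_id) are parsed from the JSON log line at query time.
logql = '{service="payments"} | json | merchant_id="m-7781" | level="ERROR"'

end_ns = int(time.time() * 1e9)
start_ns = end_ns - int(3600 * 1e9)  # last hour, in nanoseconds

params = urllib.parse.urlencode({
    "query": logql,
    "start": start_ns,
    "end": end_ns,
    "limit": 100,
})

with urllib.request.urlopen(f"{LOKI_URL}?{params}") as resp:
    data = json.load(resp)

# Each result is a log stream; each value is a [timestamp, log line] pair.
for stream in data.get("data", {}).get("result", []):
    for _ts, line in stream.get("values", []):
        print(line)
```

Keeping merchant_id out of the stream labels is deliberate: every unique label combination becomes a separate stream in Loki, so high-cardinality values belong in the log line and get filtered at query time instead.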

The honest version

Full observability — the kind where you can answer any arbitrary question about system state — is expensive and operationally complex. Most teams don't need the full thing. What most teams need is to fix the specific gap between "alert fired" and "I know what to look at next." That gap is usually smaller than it seems, and closing it doesn't require rebuilding your entire telemetry stack.

