A Year of On-Call: What Actually Pages vs. What Should

I spent most of 2025 as the primary on-call engineer for our platform team. We run a one-week rotation across six people, so I was primary roughly every six weeks. Over the year I responded to 94 pages. I kept notes on all of them.

Here's what I learned about the gap between the alerting system we had and the one we should have built.

The distribution was not what I expected

Before the year started I would have guessed most pages would come from our most complex systems — the Kafka pipelines, the payment processing service, the data warehouse jobs. The reality: 41 of 94 pages were from infrastructure-level issues that had nothing to do with application code. Certificate expiry warnings (we had four certificates that had been on a rotation schedule someone deleted). Disk space on log volumes. A recurring issue with our container registry's garbage collection running at the wrong time.

The second-largest category was flapping alerts: things that fired, resolved, and fired again within the same hour. 23 of the 94 pages were flaps that I eventually acknowledged before going back to sleep. Most were latency alerts with thresholds set too tight for normal traffic variance.

Only 30 pages indicated something that was actually broken and required intervention.

The things that should have paged but didn't

More instructive than what fired was what didn't. Three of our five worst incidents in 2025 were discovered not by an alert but by a user report, by a failed deployment check, or by me noticing something in a dashboard while investigating something else. Alerts for those symptoms existed; they were either misconfigured or had been silenced during a maintenance window and never unsilenced.

The pattern: we built alerts when we were thinking about the system. We disabled or misconfigured them when we were under pressure. Incidents happened in the gaps.

Alert quality over alert quantity

After the year ended I did an audit. We had 134 active alert rules, and I categorized each one by when it had last fired and whether the resulting page had required action.

We deleted the 38 that had never fired (if a rule hasn't fired in a year, it's either wrong or monitoring something that doesn't fail), tuned or deleted the 19 noise sources, and fixed the 12 with wrong thresholds. Going from 134 to 65 active rules felt risky. Six months later it hasn't caused a problem, and the on-call experience is measurably better.

The 3am rule

Every alert rule we write now goes through a single question before it's merged: if this fires at 3am on a Saturday, would waking someone up be the right response?

If the answer is "it depends" or "probably not," the alert should get a higher threshold, a longer evaluation window, or a route to a ticket rather than a page. Most of our noisy alerts obviously failed this test; we just hadn't applied the test when we wrote them.
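
The mechanical version of this is a routing decision made once, at review time, instead of at 3am. In a Prometheus/Alertmanager-style setup it looks roughly like the sketch below; the receiver names and webhook URL are placeholders, not real config.

```yaml
# alertmanager.yml (sketch): route on a severity label so that only alerts
# which pass the 3am test can page a human. Everything else files a ticket.
route:
  receiver: ticket-queue            # default: create a ticket, review in business hours
  group_by: [alertname, service]
  routes:
    - matchers:
        - severity="page"           # only rules that pass the 3am test carry this label
      receiver: pagerduty-oncall
      repeat_interval: 4h
receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: "<redacted>"   # placeholder
  - name: ticket-queue
    webhook_configs:
      - url: "https://tickets.example.internal/hooks/alerts"   # placeholder endpoint
```

The point of doing it in routing is that the question gets answered in code review, by someone who is awake, and the answer applies uniformly to every rule that follows.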

What actually matters to alert on

The 30 pages that required real intervention boiled down to four conditions: error rate above baseline for more than five minutes, p99 latency exceeding the SLA threshold sustained for three minutes, consumer lag growing at more than 2x the normal rate for ten minutes, and any certificate within 14 days of expiry. Everything else we monitor in dashboards and review during business hours.

The specificity matters. "High latency" is not an alert condition. "p99 request latency exceeds 800ms for more than 3 consecutive minutes" is.
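
As an illustration, here is roughly what two of those conditions look like as Prometheus alerting rules. The metric names (a standard HTTP latency histogram and the blackbox exporter's certificate expiry gauge) are stand-ins for whatever your instrumentation actually exposes.

```yaml
# Sketch of paging rules with explicit thresholds and evaluation windows.
groups:
  - name: paging-rules
    rules:
      - alert: P99LatencyAboveSLA
        # p99 request latency over 800ms, sustained for 3 consecutive minutes
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 0.8
        for: 3m
        labels:
          severity: page
        annotations:
          summary: "p99 request latency above 800ms for 3 minutes"

      - alert: CertificateExpiringSoon
        # any certificate within 14 days of expiry
        expr: probe_ssl_earliest_cert_expiry - time() < 14 * 24 * 3600
        for: 1h
        labels:
          severity: page
        annotations:
          summary: "TLS certificate expires in under 14 days"
```

Both carry the severity label from the routing sketch above, and both state the threshold and the window in the rule itself rather than in someone's head.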

The emotional component

Noisy on-call is a morale problem. Engineers who get woken up for nothing learn to treat all pages as probably-nothing. When something real happens, the response is slower because the cognitive pattern is "this is probably fine." We had this problem. After the alert cleanup, the team's attitude toward pages changed noticeably — not because I told anyone about the changes, but because the experience was different.

