The Incident That Taught Me to Read Runbooks Before Incidents
In August we had a 47-minute partial outage on our transaction processing service. Payments were not lost — we have enough queuing in place for that — but the processing latency increased to the point where some downstream systems started timing out and users saw errors. The root cause was a known failure mode. The runbook existed. I had written part of it. I had never read it end to end.
What happened
Our Kafka consumer group for transaction processing entered a rebalance loop. One consumer was repeatedly joining and leaving the group due to a slow processing path that was exceeding max.poll.interval.ms. Each time it left, it triggered a full group rebalance, which paused processing for all consumers. The cycle repeated every 3–4 minutes.
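To make the mechanics concrete, here is a minimal sketch of the vulnerable shape of that consumer loop. The group and topic names are made up, not our real ones. Since KIP-62, if the time between two poll() calls exceeds max.poll.interval.ms, the client considers the consumer stuck, leaves the group, and the broker rebalances everyone:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class TxnConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "txn-processing"); // hypothetical group name
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        // Default is 300000 (5 minutes). If the loop body below takes longer
        // than this between poll() calls, the client leaves the group and
        // every consumer in the group pauses while partitions are reassigned.
        props.put("max.poll.interval.ms", "300000");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("transactions")); // hypothetical topic
            while (true) {
                for (ConsumerRecord<String, String> record :
                        consumer.poll(Duration.ofMillis(500))) {
                    process(record); // the slow path lived somewhere in here
                }
            }
        }
    }

    static void process(ConsumerRecord<String, String> record) {
        // transaction processing; occasionally slow enough to blow the budget
    }
}
```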
The alert fired correctly. I was paged at 02:14. I looked at the consumer lag dashboard, confirmed lag was growing, looked at the consumer group status and saw repeated rebalances, and then spent 23 minutes trying things that didn't work — restarting the affected consumer, adjusting the partition assignment manually, checking broker logs for errors.
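For what it's worth, confirming the rebalance loop doesn't require a dashboard. A sketch using Kafka's AdminClient, again with a made-up group name: a healthy group sits at Stable, while one churning through rebalances flips between PreparingRebalance and CompletingRebalance on repeated checks:

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.ConsumerGroupDescription;

public class GroupStateCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            ConsumerGroupDescription desc = admin
                    .describeConsumerGroups(List.of("txn-processing"))
                    .describedGroups().get("txn-processing").get();
            // Stable is healthy; seeing PreparingRebalance/CompletingRebalance
            // across repeated checks means the group is churning.
            System.out.println("state=" + desc.state()
                    + " members=" + desc.members().size());
        }
    }
}
```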
At 02:37 a colleague joined the incident. She opened the runbook for consumer group rebalance issues, read to step 4, and said: "have you checked max.poll.interval.ms against the actual processing time?" I had not. The runbook had that as step 4. We found the slow processing path within four minutes. The fix — temporarily increasing the interval and routing the slow messages to a separate lower-priority consumer — took another six minutes.
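For illustration, the shape of that fix, under the same made-up names as the sketch above: raise the interval as a stopgap, and reroute records classified as slow to a side topic drained by its own lower-priority consumer group, so they can never stall this group's poll loop again. The slow-lane topic name and the classification check are hypothetical:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TxnConsumerWithSlowLane {
    public static void main(String[] args) {
        Properties cProps = new Properties();
        cProps.put("bootstrap.servers", "localhost:9092");
        cProps.put("group.id", "txn-processing");
        cProps.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        cProps.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        // Stopgap: double the poll budget while the slow path still exists.
        cProps.put("max.poll.interval.ms", "600000");

        Properties pProps = new Properties();
        pProps.put("bootstrap.servers", "localhost:9092");
        pProps.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        pProps.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cProps);
             KafkaProducer<String, String> producer = new KafkaProducer<>(pProps)) {
            consumer.subscribe(List.of("transactions"));
            while (true) {
                for (ConsumerRecord<String, String> record :
                        consumer.poll(Duration.ofMillis(500))) {
                    if (isSlowPath(record)) {
                        // Reroute instead of processing inline; a separate,
                        // lower-priority consumer group drains this topic.
                        producer.send(new ProducerRecord<>(
                                "transactions-slow", record.key(), record.value()));
                    } else {
                        process(record);
                    }
                }
            }
        }
    }

    static boolean isSlowPath(ConsumerRecord<String, String> r) {
        return false; // hypothetical classification of the slow message type
    }

    static void process(ConsumerRecord<String, String> r) {
        // fast path
    }
}
```

Worth noting: a bigger max.poll.interval.ms mostly trades rebalance churn for slower detection of genuinely stuck consumers, which is why the rerouting, not the interval bump, is the part worth keeping.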
The runbook problem
We have 34 runbooks. They live in our internal wiki. I know they exist. Before this incident, I had read maybe a third of them, and those were only the ones I'd written myself. Even those I'd written incrementally over time, never reading the whole document end to end after the fact.
The assumption was: runbooks are for people who don't know the system. If you know the system, you don't need the runbook. This assumption is wrong in at least two ways.
First, runbooks encode the accumulated debugging history of everyone who has dealt with the problem before, not just your own. The max.poll.interval.ms check was added to our runbook by a colleague after an incident I hadn't been involved in, eight months earlier. I didn't know about it.
Second, at 02:14 on a Tuesday morning, the version of me that "knows the system" is not operating at full capacity. A checklist is useful precisely when you're tired and under pressure, not when you're rested and thinking clearly. Runbooks are for incidents. Incidents happen when you're not at your best.
What we changed
We scheduled a two-hour session where the on-call rotation reads through all runbooks together. Not to memorize them — to know what's in them and to update anything that's wrong or missing. We do this quarterly now. The first session found eight runbooks with outdated steps (services that had changed their configuration), two that referenced tools we no longer use, and three that were missing steps that the people in the room knew from experience but hadn't written down.
We also added a note to each alert configuration pointing to the relevant runbook. When PagerDuty wakes you up, the notification now includes a direct link. This sounds obvious. We hadn't done it.
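What this looks like depends on your alerting pipeline. For alerts that go straight to PagerDuty's Events API v2, the event payload accepts a links array; a sketch with a placeholder routing key and a hypothetical wiki URL:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class PageWithRunbook {
    public static void main(String[] args) throws Exception {
        // Events API v2 payload; routing key and runbook URL are placeholders.
        String body = """
            {
              "routing_key": "YOUR_ROUTING_KEY",
              "event_action": "trigger",
              "payload": {
                "summary": "txn-processing consumer lag growing",
                "source": "txn-processing",
                "severity": "critical"
              },
              "links": [
                {
                  "href": "https://wiki.internal/runbooks/consumer-group-rebalance",
                  "text": "Runbook: consumer group rebalance"
                }
              ]
            }""";
        HttpRequest req = HttpRequest.newBuilder()
                .uri(URI.create("https://events.pagerduty.com/v2/enqueue"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> resp = HttpClient.newHttpClient()
                .send(req, HttpResponse.BodyHandlers.ofString());
        System.out.println(resp.statusCode() + " " + resp.body());
    }
}
```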
The thing I keep thinking about
The outage lasted 47 minutes. The runbook-based fix took 10 minutes once we started following it. The 23 minutes between the page and opening the runbook were not wasted exactly; I was gathering information and trying things. But I was far less efficient than I would have been if I'd started with the runbook.
The instinct during an incident is to act based on what you know. The better instinct, especially late at night, is to check what's been written down about this problem before you start acting on what you know. These aren't mutually exclusive, but the order matters.