2025: Year in Review
I started keeping a work journal in February 2025 — a short daily note about what I worked on, decisions made, things that surprised me. Reading back through it for this post, the year was harder and more productive than I remember it feeling at the time.
What I shipped
The alert cleanup. Reduced our active alert rule count from 134 to 65, roughly half, by auditing every rule against a year of incident data. On-call quality improved measurably. I wrote about the process here.
Static membership for our Kafka consumers. Rolled out group.instance.id across all consumer groups, which eliminated the rebalance storms we'd been seeing during rolling deployments. The rollout took three weeks to do carefully (you have to coordinate the config change with the deployment strategy, since each instance needs an ID that stays stable across restarts). It has saved roughly an hour of aggregate processing time per deploy cycle.
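For readers who haven't used static membership, here is a minimal sketch of the consumer-side config, not the actual rollout. The group name, instance name, broker address, naming scheme, and timeout value are all hypothetical. The config keys (group.id, group.instance.id, session.timeout.ms) are standard Kafka consumer settings, and the resulting dict is the kind of thing a client such as confluent_kafka.Consumer accepts.

```python
import re


def static_member_id(group_id: str, instance_name: str) -> str:
    """Derive a group.instance.id that stays the same across restarts.

    Kafka only needs the ID to be unique within the group and stable for a
    given instance. This naming scheme is a hypothetical choice.
    """
    safe = re.sub(r"[^A-Za-z0-9._-]", "-", instance_name)
    return f"{group_id}-{safe}"


def build_consumer_config(group_id, instance_name, bootstrap_servers,
                          session_timeout_ms=60_000):
    """Build a consumer config dict with static membership enabled."""
    return {
        "bootstrap.servers": bootstrap_servers,
        "group.id": group_id,
        # A static member that restarts and rejoins with the same ID keeps
        # its partition assignment instead of triggering a rebalance.
        "group.instance.id": static_member_id(group_id, instance_name),
        # The restart has to finish inside this window, otherwise the
        # broker gives up on the member and rebalances anyway.
        "session.timeout.ms": session_timeout_ms,
    }


# Hypothetical instance name, e.g. a pod name that survives redeploys.
config = build_consumer_config("payments", "payments-consumer-2", "kafka:9092")
```

The coordination with the deployment strategy comes from the instance name: it has to be the same before and after a restart, and no two live instances may share it.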
PgBouncer in transaction mode. After six months in session mode where it wasn't helping, migrated to transaction mode. Required fixing three services that relied on session-level state. Connection utilization dropped by 60%. Described in more detail here.
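As an illustration of what "session-level state" means here, the sketch below flags SQL statements that typically depend on keeping the same server connection across transactions, which transaction pooling does not guarantee. The pattern list is an assumption based on commonly cited incompatibilities (session-level SET, LISTEN, SQL-level PREPARE, session advisory locks, WITH HOLD cursors). It is not a record of what the three services actually did, and a regex scan like this is only a rough first pass.

```python
import re

# Hypothetical, non-exhaustive patterns for statements whose effect
# outlives a single transaction on the server connection.
SESSION_STATE_PATTERNS = {
    "session-level SET": re.compile(
        r"^\s*SET\s+(?!\s*(LOCAL|TRANSACTION)\b)", re.I),
    "LISTEN": re.compile(r"^\s*LISTEN\b", re.I),
    "SQL-level PREPARE": re.compile(
        r"^\s*PREPARE\s+(?!\s*TRANSACTION\b)", re.I),
    "session advisory lock": re.compile(
        r"\bpg_(try_)?advisory_lock(_shared)?\s*\(", re.I),
    "WITH HOLD cursor": re.compile(r"\bWITH\s+HOLD\b", re.I),
}


def find_session_state(statements):
    """Return (label, statement) pairs for statements that look session-bound."""
    findings = []
    for stmt in statements:
        for label, pattern in SESSION_STATE_PATTERNS.items():
            if pattern.search(stmt):
                findings.append((label, stmt.strip()))
    return findings


sample = [
    "SET search_path TO tenant_a",            # session-level: flagged
    "SET LOCAL statement_timeout = '5s'",     # transaction-scoped: fine
    "SELECT pg_advisory_xact_lock(42)",       # transaction-scoped: fine
    "SELECT pg_advisory_lock(42)",            # session-level: flagged
    "SELECT 1",
]
findings = find_session_state(sample)
```

The transaction-scoped variants (SET LOCAL, pg_advisory_xact_lock) are left alone on purpose, since they are the usual replacements when moving a service to transaction mode.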
Runbook review process. Quarterly sessions where the on-call rotation reads and updates all runbooks together. This came out of the August incident. The first session was the most valuable — eight outdated runbooks updated, three with new steps added from institutional knowledge that hadn't been written down.
What I broke
The August incident. 47 minutes of elevated transaction processing latency, caused by a consumer rebalance loop I could have resolved in 10 minutes if I'd read the runbook first. Written up in full here. The lesson was delivered clearly and I don't expect to repeat the specific mistake.
An Istio upgrade in November that caused 12 minutes of service-to-service communication failures for three services. The cause was a changed default in AuthorizationPolicy behavior. I'd read the changelog but hadn't connected it to our specific configuration. We now run a pre-upgrade checklist that includes testing in staging with production-equivalent policy configurations.
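One way a "production-equivalent policy configuration" check could be automated is a drift comparison between the AuthorizationPolicy specs in the two environments. The sketch below assumes the specs have already been loaded into dicts keyed by namespace/name (for example from exported manifests). The policy names and principals are made up, and this is not the checklist described above.

```python
def policy_drift(staging, production):
    """Map each policy name to how it differs between the two environments."""
    drift = {}
    for name in sorted(set(staging) | set(production)):
        if name not in staging:
            drift[name] = "missing in staging"
        elif name not in production:
            drift[name] = "only in staging"
        elif staging[name] != production[name]:
            drift[name] = "spec differs"
    return drift


# Hypothetical specs, using the AuthorizationPolicy field names
# (action, rules, from, source, principals).
production = {
    "payments/allow-ledger": {
        "action": "ALLOW",
        "rules": [{"from": [{"source": {
            "principals": ["cluster.local/ns/ledger/sa/ledger"]}}]}],
    },
    "payments/deny-all": {},
}
staging = {
    "payments/allow-ledger": {"action": "ALLOW", "rules": [{}]},
}

drift = policy_drift(staging, production)
```

An empty result means staging exercises the same policies production does, so a changed default in the new version has a chance of showing up before the production upgrade.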
What I read
Designing Data-Intensive Applications by Martin Kleppmann. I'd read it once before, but this time, with more production experience behind me, the chapters on replication and distributed transactions landed differently. The section on the problems with distributed transactions, and why most systems avoid them, is the clearest explanation of the tradeoffs I've read.
The Kafka documentation, more thoroughly than before. Specifically the sections on consumer group protocol and exactly-once semantics. Documentation that rewards careful reading.
What I'm carrying into 2026
The Kafka cluster migration. We've approved moving from Confluent Cloud to self-managed. The cost case is clear; the operational case is what I'll be building in the first half of the year.
More deliberate writing. I have notes from 2025 that would have made decent posts but didn't get written up. The gap between having thoughts about something and having a publishable post about it is real. I haven't solved it, but I've at least named it.
Reading Database Internals by Alex Petrov. I've started it twice and gotten distracted. The third attempt will follow a proper reading schedule rather than relying on picking it up when nothing else is urgent.