PgBouncer in Production: Lessons from a Connection Pool That Fell Over

We added PgBouncer because Postgres was running out of connections. That's the usual reason. We had 18 services, each with a connection pool of 10–20 connections, and our Postgres instance was sitting at max_connections=200 with headroom evaporating as we deployed new replicas during a traffic spike.

PgBouncer solved the immediate problem and introduced three new ones. Here's what they were and what we did about them.

Pool mode determines almost everything

PgBouncer has three pool modes. The choice between them is not stylistic — it has concrete implications for what your application can and can't do.

Session mode: a server connection is held for the lifetime of a client connection. Functionally equivalent to not having a pool at all, in terms of connection count. Useful if you rely on session-level state (session-scoped SET such as search_path, advisory locks, named prepared statements). Note that SET LOCAL is transaction-scoped and is safe under transaction pooling. We started here and it didn't help.

Transaction mode: a server connection is held only for the duration of a transaction. After COMMIT or ROLLBACK, it returns to the pool. This is where the real multiplexing happens. It's also where most PgBouncer-related application bugs come from, because anything that relies on session state breaks silently.

Statement mode: a server connection is returned after each statement. Incompatible with multi-statement transactions. Almost nobody uses this.

We run transaction mode. The migration from session mode required auditing every service for session-level state usage. We found three: one service using SET search_path, one using advisory locks held across requests, and one that had accumulated a pile of named prepared statements. Each required a fix before we could switch.
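
For reference, the switch itself is one line in pgbouncer.ini. A minimal sketch, with made-up host and database names:

[databases]
appdb = host=10.0.0.5 port=5432 dbname=appdb

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
pool_mode = transaction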

The thing that transaction mode breaks

Named prepared statements. Postgres prepared statements are session-scoped. In transaction mode, a client isn't pinned to one server session, so a prepared statement created during one transaction may not exist on the server connection that serves the next. Most ORMs and drivers use prepared statements by default.
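
A minimal SQL reproduction, assuming two transactions that happen to land on different pooled server connections (which is exactly why the failure is intermittent):

BEGIN;
PREPARE s1 AS SELECT $1::int;
COMMIT;        -- transaction mode returns the server connection to the pool here
EXECUTE s1(1); -- may run on a different server connection where s1 was never
               -- prepared: ERROR: prepared statement "s1" does not exist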

The fix: either run PgBouncer 1.21 or newer, which can track named prepared statements across pooled connections once max_prepared_statements is set above zero, or disable prepared statements in your driver. For the Go pgx driver:

// Simple query protocol: pgx sends plain text, no server-side prepared statements.
config.DefaultQueryExecMode = pgx.QueryExecModeSimpleProtocol

For SQLAlchemy, what to disable depends on the underlying driver; with asyncpg, turn off its prepared-statement cache:

create_async_engine(url, connect_args={"statement_cache_size": 0})
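
If you would rather fix it on the PgBouncer side, the 1.21+ route is one setting; the value caps how many prepared statements PgBouncer tracks per server connection (100 below is illustrative, not a recommendation):

[pgbouncer]
max_prepared_statements = 100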

We missed this for one service and spent two hours chasing intermittent prepared statement "s1" does not exist errors that appeared only under load, because that's when connections were being returned to and acquired from the pool frequently enough to expose the problem.

Sizing the pool

The formula that's worked for us: pool_size = (core_count * 2) + effective_spindle_count for the Postgres server. Our primary has 8 cores and SSDs (effective spindle count = 1), so (8 * 2) + 1 = 17 server connections per database. This comes from the PgBouncer documentation and aligns with the older Postgres wiki guidance on connection limits.

max_client_conn is separate: it's how many client-side connections PgBouncer will accept in total. Set this higher than you think you need. Client connections waiting in PgBouncer's queue are cheap; the limit is how many can queue at all. We set ours to 1000. When max_client_conn is too low, PgBouncer turns new clients away with a "no more connections allowed (max_client_conn)" error before Postgres is even involved, which is confusing to debug.
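
In pgbouncer.ini terms, the two settings we've been discussing (the numbers mirror the reasoning above; treat them as a starting point, not a recommendation):

[pgbouncer]
; (8 cores * 2) + 1 effective spindle
default_pool_size = 17
; generous, because queued clients are cheap
max_client_conn = 1000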

What actually fell over

Seven months after the initial deployment, PgBouncer became the bottleneck during a traffic spike. Specifically: one service held transactions open for 8–12 seconds under normal conditions (it was making external API calls inside a transaction, which is a separate problem). During the spike, those long-running transactions occupied server connections for their full duration, preventing the pool from recycling connections to other services. Client connections piled up in the queue. When the queue hit max_client_conn, new connections were rejected.

The fix had two parts. Immediate: increase pool_size temporarily and add server_idle_timeout so stale server connections were reclaimed faster. Permanent: move the external API call outside the transaction in the offending service. The second fix took a sprint; the first bought us time.
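
We also added timeouts as guardrails, so a single slow service degrades alone instead of starving everyone. These are standard PgBouncer settings; the values below are illustrative, not what we run:

[pgbouncer]
; disconnect clients that wait this long for a server connection
query_wait_timeout = 30
; disconnect clients idle in an open transaction this long
idle_transaction_timeout = 60
; close server connections idle longer than this
server_idle_timeout = 300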

The lesson: PgBouncer's metrics (SHOW POOLS, SHOW STATS) are not well-surfaced by default. Add them to your monitoring before you need them. By the time we needed them we were reading them manually from the PgBouncer admin console during an incident, which is not the ideal time to learn what the columns mean.

-- From a shell, connect to the PgBouncer admin database
psql -h 127.0.0.1 -p 6432 -U pgbouncer pgbouncer
-- Then, inside psql, inspect pool state
SHOW POOLS;
-- cl_active:  clients currently linked to a server connection
-- cl_waiting: clients queued, waiting for a server connection
-- sv_idle:    idle server connections ready for immediate use
-- sv_used:    server connections idle past server_check_delay; PgBouncer runs
--             server_check_query on them before reuse
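
The same numbers are easy to scrape on a schedule. psql's -t and -A flags produce unaligned, pipe-separated rows that are simple to parse; how you ship them to your metrics system is up to you (this is a sketch, not our exact collector):

psql -h 127.0.0.1 -p 6432 -U pgbouncer -d pgbouncer -tAc "SHOW POOLS;"
psql -h 127.0.0.1 -p 6432 -U pgbouncer -d pgbouncer -tAc "SHOW STATS;"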
