What I Wish Someone Had Told Me About Kafka Consumer Groups

Three years ago I inherited a Kafka setup from a team that had left the company. The documentation was sparse. The consumer group names were things like my-service-consumer-v2-final-FINAL. The lag monitoring was a Grafana dashboard that showed a number nobody was sure how to interpret. I learned what I needed to learn by breaking things.

Here's what I'd tell myself at the start.

Rebalancing is not a background event

When a consumer joins or leaves a group, or when a topic's partition count changes, the group coordinator triggers a rebalance. With the default eager protocol, every consumer in the group stops processing while every partition is unassigned and reassigned. This takes time, typically 5–30 seconds depending on your session.timeout.ms, the number of consumers, and how fast the group coordinator responds.

The trap: you add a consumer to scale up processing, and instead you get a 20-second gap in throughput while everyone rebalances. If your topic has a high message rate, that gap builds lag. If your downstream depends on low latency, that gap is an incident.

The solution for most cases is static membership (group.instance.id). Static members don't trigger rebalances on restart — the group coordinator waits for the session to expire before considering them gone. For services that restart frequently (rolling deploys, pod restarts in Kubernetes), this eliminates a significant source of rebalance storms.

// Java consumer config for static membership
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "payment-processor");
props.put("group.instance.id", "payment-processor-" + podName); // stable and unique per instance
props.put("session.timeout.ms", "45000"); // give static members time to come back before reassigning
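
Separately, if the stop-the-world pause itself is the problem, cooperative rebalancing (Kafka 2.4+) reassigns partitions incrementally instead of revoking everything. A sketch of the opt-in; roll it out carefully, since every consumer in the group needs a compatible assignor:

// opt in to incremental (cooperative) rebalancing
props.put("partition.assignment.strategy",
        "org.apache.kafka.clients.consumer.CooperativeStickyAssignor");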

Partition count determines your maximum parallelism

A consumer group can have at most one active consumer per partition. If your topic has 12 partitions and you run 16 consumers, 4 of them will be idle. If you have 4 consumers and 12 partitions, each consumer processes 3 partitions.

This means your partition count is effectively decided at topic creation and is hard to change well later. You can add partitions (never remove them), but doing so triggers a rebalance and changes the mapping between message keys and partitions; if you rely on key-based ordering, ordering is broken until all messages written under the old mapping are processed.
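
Growing a topic is one command, if you accept those caveats (the topic name here is illustrative):

# Add partitions to an existing topic (this cannot be undone)
kafka-topics.sh \
  --bootstrap-server localhost:9092 \
  --alter \
  --topic payments \
  --partitions 24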

Our rule now: create topics with at least 2x the partitions you think you'll need in the first year. We've never regretted headroom; we've regretted the opposite several times.
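
So bake the headroom in at creation time, something like (topic name illustrative; a replication factor of 3 is the usual production choice):

# Create with headroom: 24 partitions even if 12 would do today
kafka-topics.sh \
  --bootstrap-server localhost:9092 \
  --create \
  --topic payments \
  --partitions 24 \
  --replication-factor 3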

Consumer lag is not one number

The lag you see in most dashboards is the sum of lag across all partitions in the group. That number hides the distribution. A group with 12 partitions might show 10,000 messages of lag — but if 9,999 of those are on one partition and the rest are near zero, your problem is a single slow consumer or a partition assignment issue, not a capacity problem.

Monitor lag per partition. When lag grows on one partition and not others, it's almost always one of: a poison-pill message causing retries, a hot key causing one consumer to do more work than peers, or an unhealthy consumer that's been assigned too many partitions.

# Check per-partition lag with kafka-consumer-groups.sh
kafka-consumer-groups.sh \
  --bootstrap-server localhost:9092 \
  --describe \
  --group my-service-consumer
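
If you want the same numbers in code, for alerting on skew rather than totals, the Admin API exposes committed and end offsets. A minimal sketch using kafka-clients (imports from org.apache.kafka.clients.admin and java.util assumed; exception handling left to the caller):

// Per-partition lag = end offset minus committed offset
Properties props = new Properties();
props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
try (Admin admin = Admin.create(props)) {
    // offsets the group has committed, per partition
    Map<TopicPartition, OffsetAndMetadata> committed =
        admin.listConsumerGroupOffsets("my-service-consumer")
             .partitionsToOffsetAndMetadata().get();
    // latest (end) offsets for the same partitions
    Map<TopicPartition, OffsetSpec> latest = new HashMap<>();
    committed.keySet().forEach(tp -> latest.put(tp, OffsetSpec.latest()));
    var ends = admin.listOffsets(latest).all().get();
    committed.forEach((tp, om) ->
        System.out.printf("%s lag=%d%n", tp, ends.get(tp).offset() - om.offset()));
}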

Offset commits are your durability guarantee

Auto-commit (enable.auto.commit=true) commits offsets in the background as you call poll(), once auto.commit.interval.ms (default 5 seconds) has elapsed since the last commit. If your consumer dies between processing a message and the next commit, that message will be reprocessed after restart. If your processing is idempotent, this is fine. If it's not, it's a silent data integrity bug.

I now use manual commits in every consumer I write or review. The additional code is minimal, and knowing exactly when an offset becomes durable is worth it. Be precise about what you get, though: commit-after-process is at-least-once, not exactly-once. A crash between processing and the commit still produces duplicates, so processing must tolerate replays.
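
Turn auto-commit off first; then the loop below commits explicitly once the batch is done:

// take control of when offsets are committed
props.put("enable.auto.commit", "false");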

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records) {
        process(record); // if this throws, nothing commits and the batch is redelivered
    }
    consumer.commitSync(); // commit only after every record in the batch is processed
}

The consumer group you can't delete

A consumer group exists in Kafka until you explicitly delete it. Old groups from deprecated services, test runs, one-off scripts — they accumulate. At one point we had 47 consumer groups registered against our main cluster, of which maybe 15 were active. The rest were ghosts from services that no longer existed.

The consequence isn't just aesthetic. Some monitoring tools alert on any group with growing lag, including dead ones whose "lag" is just an old committed offset against a topic that has moved on. We spent an afternoon chasing a lag alert before realizing the consumer group in question belonged to a service we'd decommissioned six months earlier.
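
A periodic audit is cheap. List everything the cluster knows about and compare it against the services that actually exist:

# List all consumer groups registered with the cluster
kafka-consumer-groups.sh \
  --bootstrap-server localhost:9092 \
  --list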

# Delete a dead consumer group (must have no active members)
kafka-consumer-groups.sh \
  --bootstrap-server localhost:9092 \
  --delete \
  --group deprecated-service-consumer

What I check first when something's wrong

In order: per-partition lag distribution, consumer group state (is it Stable, or stuck in PreparingRebalance?), consumer logs for CommitFailedException or WakeupException, then broker metrics. Most consumer problems announce themselves clearly in the logs if you know what to look for.
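
The state check is the same tool with one extra flag:

# Show the group's state (Stable, PreparingRebalance, CompletingRebalance, Empty, Dead)
kafka-consumer-groups.sh \
  --bootstrap-server localhost:9092 \
  --describe --state \
  --group my-service-consumer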

