When Kafka Consumers Become the Bottleneck: Lessons from a Message Backlog Nightmare
#Infrastructure

Backend Reporter

A deep dive into why treating Kafka like a job queue leads to message backlogs, rebalances, and duplicate processing - and how to fix it.

For a while I genuinely could not figure out what was wrong. Nothing was throwing errors. The service was running. But messages were piling up, some were being processed twice, and the lag just kept climbing. I kept waiting for it to sort itself out. It did not. Eventually I had to sit down and actually trace through what was happening.

The Problem

Our consumer was doing too much. Each message triggered external API calls, heavy business logic, and blocking operations. Individually, a message might take a couple of minutes to get through. That feels manageable until you remember that Kafka's default max.poll.records is 500. Pull even a handful of slow messages in one batch, and the cumulative processing time blows past Kafka's default max.poll.interval.ms of 5 minutes without much effort.
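The arithmetic is worth making explicit. A quick back-of-the-envelope check (the numbers below are illustrative, matching the rough figures above, not measurements from our real workload):

```python
MAX_POLL_INTERVAL_MS = 300_000   # Kafka default: 5 minutes
PER_MESSAGE_SECONDS = 120        # roughly 2 minutes of API calls and business logic

def max_safe_batch(per_message_seconds: int, interval_ms: int) -> int:
    """Largest batch a consumer can finish within one poll interval."""
    return interval_ms // (per_message_seconds * 1000)

print(max_safe_batch(PER_MESSAGE_SECONDS, MAX_POLL_INTERVAL_MS))  # 2
```

With two-minute messages, anything beyond a batch of two already overruns the default interval, and the default batch ceiling is 500.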

When that happens, Kafka assumes the consumer has died. It triggers a rebalance, reassigns the partitions, and those same messages get picked up and processed all over again. That was our loop. Consumer pulls a batch, gets bogged down processing it, Kafka loses patience, rebalance happens, repeat.

The Quick Fixes

The first thing was just to stop the bleeding. We bumped max.poll.interval.ms up to 8 minutes to give the consumer a bit more breathing room. Rebalances stopped almost immediately. That was a relief, but it was a band-aid, not a fix.

Next we set max.poll.records = 1. One message at a time. With each message taking a couple of minutes, pulling any larger batch was just asking for trouble. Throughput dropped considerably, but at least the system was stable and we could reason about it.
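Together, those two changes amount to a small consumer-config tweak. A sketch of the settings, using kafka-python's snake_case parameter names (Java clients spell them max.poll.interval.ms and max.poll.records):

```python
# The two poll-related quick fixes: more breathing room per poll,
# and only one slow message fetched at a time.
consumer_config = {
    "max_poll_interval_ms": 8 * 60 * 1000,  # 8 minutes instead of the 5-minute default
    "max_poll_records": 1,                  # one slow message per poll
}
```

These values traded throughput for stability, which was the right trade at the time.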

We also dropped auto-commit and switched to manual offset commits. Honestly we should have done this from the start. Auto-commit quietly marks messages as done on a timer whether processing actually succeeded or not. Manual commits meant we knew exactly what had been handled and what had not.
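The offset discipline we switched to can be sketched as a small, library-agnostic loop. Here `handle`, `commit`, and the `(offset, payload)` shape are all illustrative; a real consumer gets these from its Kafka client, with auto-commit disabled:

```python
def consume(messages, handle, commit):
    """Process each message, committing its offset only after handle() succeeds.

    messages: iterable of (offset, payload) pairs (illustrative shape).
    If handle() raises, commit() never runs for that offset, so the message
    is re-delivered after restart instead of being silently marked done.
    """
    for offset, payload in messages:
        handle(payload)
        commit(offset)

committed = []
consume([(0, "a"), (1, "b")], handle=lambda p: None, commit=committed.append)
# committed == [0, 1]
```

The point is the ordering: commit happens strictly after processing, so the failure mode becomes "reprocess a message" rather than "lose a message".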

The Real Solution

Kafka consumers are not meant to do heavy, long-running work. After things stabilised, we redesigned the flow so that the consumer became lightweight. It would validate the message and quickly hand off the heavy work to background workers.
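A minimal sketch of that shape, with the Kafka plumbing stripped out. The names (`validate`, `heavy_work`, `consume_batch`) and payloads are hypothetical, and in production the hand-off was to background workers rather than an in-process thread pool:

```python
from concurrent.futures import ThreadPoolExecutor

def validate(payload: dict) -> bool:
    """Cheap, fast check: the only work the consumer itself does."""
    return "id" in payload

def heavy_work(payload: dict) -> str:
    """Stand-in for the slow part: external API calls, business logic."""
    return f"processed {payload['id']}"

def consume_batch(payloads, pool):
    """Validate each message, hand the heavy part off, return quickly."""
    return [pool.submit(heavy_work, p) for p in payloads if validate(p)]

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = consume_batch([{"id": 1}, {"oops": True}, {"id": 2}], pool)
    print([f.result() for f in futures])  # ['processed 1', 'processed 2']
```

The consumer loop now finishes in milliseconds regardless of how slow the downstream work is, so poll intervals stop being a constraint.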

Kafka went back to doing what it is good at: moving data fast. And our system stopped fighting it. We also added retries and a dead letter queue so one broken message could not drag everything else down with it.
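The retry-then-dead-letter behaviour boils down to a few lines. A hedged sketch, with all names illustrative and the DLQ reduced to a plain list (in production it was a separate Kafka topic):

```python
def process_with_retries(payload, handler, dead_letters, max_attempts=3):
    """Try handler up to max_attempts times; on repeated failure, park the
    message in the dead-letter list instead of blocking the whole stream."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return handler(payload)
        except Exception as exc:
            last_error = exc
    dead_letters.append((payload, str(last_error)))
    return None

dlq = []
def flaky(p):
    raise ValueError("cannot parse " + p)

process_with_retries("garbage", flaky, dlq)
# dlq == [("garbage", "cannot parse garbage")]
```

One poison message ends up parked with its error attached, and the consumer moves on to the next offset.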

What Stuck With Me

I think I was treating Kafka like a job queue because that is what felt familiar. But it is not that. It is a streaming system that expects you to keep up with it. The moment you do slow, heavy work inside the consumer, you are borrowing time you do not have.

Once we aligned with how Kafka actually works, everything got simpler. The lag cleared. The rebalances stopped. The system finally felt like it was running the way it was supposed to.

Sometimes the fix is technical. But sometimes you just have to admit the design was wrong.

Key Takeaways

  • Kafka is a streaming system, not a job queue. It expects consumers to process messages quickly and keep up with the data flow.

  • Long-running work inside consumers causes rebalances. When processing time exceeds max.poll.interval.ms, Kafka assumes the consumer is dead and triggers rebalancing.

  • Batch size matters. With slow processing, even default batch sizes (500) can cause timeouts. Consider reducing max.poll.records.

  • Manual offset commits provide better control. Auto-commit can mark messages as processed before they're actually done, leading to data loss.

  • Separate concerns. Use Kafka consumers for fast validation and handoff, not for heavy business logic or external API calls.

  • Add resilience patterns. Implement retries and dead letter queues to handle problematic messages without blocking the entire system.

The lesson here isn't just about Kafka configuration - it's about understanding the fundamental nature of the tools we use. When we try to force a streaming system to behave like a job queue, we create problems that no amount of parameter tweaking will solve. The real fix comes from aligning our architecture with the tool's intended purpose.
