At QCon AI, Adi Polak laid out the architecture for moving AI systems from stateless prompts to memory-aware agents. The interesting part for cloud architects isn't the LLM. It's how Confluent decomposed Kafka and Flink into tiered memory, exactly-once state, and tool orchestration through MCP, then mapped each piece to a problem agents actually hit in production.
Most discussions about AI agents start and end at the model. Adi Polak, who spent 15 years in distributed systems and now works at Confluent, spent her QCon AI talk arguing the opposite. The model is the easy part. The hard part is everything around it: where memory lives, how state survives a crash, and how a streaming pipeline decides what context to send when the token budget is finite.
Her framing is useful because it treats agent infrastructure as an engineering problem with known building blocks, not a greenfield mystery. If you already operate event-driven systems, much of what an agent needs is infrastructure you have already deployed and learned to scale.

The shift the talk is actually about
Prompt engineering, Polak noted, covers four familiar moves: role assignment, few-shot examples, chain of thought, and constraint settings. All of them ride inside a single prompt, and all of them share one limitation. The model has a finite context window. Push past it and your carefully written guardrails fall out of scope. Anyone who has tried to enforce a hard constraint through prompt text alone, then watched a jailbreak walk right through it, knows the failure mode.
Context engineering is the larger discipline that wraps prompting. It covers memory management split into short-term and long-term tiers, explicit state management across multi-step agent loops, retrieval (RAG) for pulling external data, and tool access through interfaces like the Model Context Protocol. The architectural consequence is the part worth sitting with: agents move applications from stateless to stateful.
That single change reorders your whole scaling story. A stateless service is trivial to scale horizontally because there is nothing to coordinate. The moment you introduce a memory tier and a feedback loop that grades each response and writes results back, you inherit every distributed-state problem you spent years learning to avoid. State explodes. Context accumulates. Each LLM call has to reconstruct and resend the relevant history, which drives both cost and latency up at exactly the moment more users arrive.
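The cost pressure described above can be made concrete with a little arithmetic. This is a minimal sketch, assuming every call resends the full prior history and every turn adds the same number of tokens; both are simplifications, and `tokens_sent` is a hypothetical helper, not something from the talk.

```python
def tokens_sent(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens across a session when each call resends all
    earlier turns plus the new one (assumed: fixed-size turns)."""
    return sum(turn * tokens_per_turn for turn in range(1, turns + 1))

short_session = tokens_sent(10, 200)    # 10 turns of 200 tokens each
long_session = tokens_sent(100, 200)    # 100 turns of 200 tokens each
growth = long_session / short_session   # how much more is sent overall
```

Under these assumptions a session that is ten times longer sends roughly ninety times as many input tokens, which is why resending everything stops being viable as usage grows.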
The three failure modes
Polak named the symptoms cleanly. First, latency and cost spikes as usage grows and every request drags a larger context payload. Second, the lost-in-the-middle problem, where a model handed too much text simply ignores what sits in the center of the window. Third, context collision and hallucination, where the model has so many competing inputs it cannot tell what matters.
The naive answer (buy a bigger context window from Gemini or Claude and dump everything in) fails on all three counts. Bigger windows cost more per call, lose retrieval quality as they fill, and still end somewhere. The engineering work is deciding what to send, not sending everything.
Her solution set maps each problem to a technique. State management gets made explicit and processed as a flow. Short-term context gets compressed into hierarchies of summaries. Long-term memory gets hybrid retrieval that combines semantic search with other ranking signals, plus re-ranking on the vector database weights, contextual chunk retrieval governed by a judge loop, and dynamic compression to stay inside the token ceiling.
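The talk did not say how the semantic and non-semantic rankings are combined. One common way to do it is reciprocal rank fusion, sketched here as a stand-in rather than as Confluent's method. The document ids and the constant `k=60` are illustrative assumptions.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each list contributes 1 / (k + rank) per document; documents that
    rank well in more than one list rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical results from a vector search and a keyword search.
semantic_hits = ["policy-7", "faq-2", "memo-9"]
keyword_hits = ["faq-2", "ticket-4", "policy-7"]
fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
```

A re-ranker or judge loop would then run over the top of `fused` before anything is placed in the prompt.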
The architecture: Kafka and Flink, decomposed
This is where the talk gets concrete and where the cloud-architecture decisions live. Confluent's approach was not to build a new agent framework from scratch. It was to take two mature streaming systems, Apache Kafka and Apache Flink, break them into parts, and reassemble those parts into the memory and orchestration layers an agent needs.
The reasoning behind Flink is specific. It offers millisecond latency on real-time streams, built-in state management through checkpoints, and configurable delivery guarantees. Polak called out the three processing levels you can choose: exactly-once, at-most-once, and at-least-once. For agents, exactly-once matters because a dropped or duplicated message is a dropped or duplicated piece of context, and that corrupts the memory the agent reasons over. The team built an open-source Flink Agents API on top of this, exposing a ReAct-style agent that thinks, acts, observes, and repeats, defined through a model descriptor, a prompt, a set of tools, and an output schema.
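The think, act, observe loop can be sketched in plain Python. This is not the Flink Agents API; `AgentSpec`, `run_react`, and the scripted model are hypothetical names that only mirror the four ingredients listed above (model, prompt, tools, output schema).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class AgentSpec:
    model: Callable[[List[str]], dict]      # stand-in for a model descriptor
    prompt: str
    tools: Dict[str, Callable[[str], str]]
    output_keys: Tuple[str, ...]            # minimal output schema

def run_react(spec: AgentSpec, question: str, max_steps: int = 5) -> dict:
    transcript = [spec.prompt, f"question: {question}"]
    for _ in range(max_steps):
        step = spec.model(transcript)                    # think
        if "final" in step:
            answer = step["final"]
            missing = [k for k in spec.output_keys if k not in answer]
            if missing:
                raise ValueError(f"answer is missing keys: {missing}")
            return answer
        observation = spec.tools[step["tool"]](step["input"])  # act
        transcript.append(f"observation: {observation}")       # observe
    raise RuntimeError("no final answer within max_steps")

def scripted_model(transcript: List[str]) -> dict:
    """Fake model: calls one tool, then answers. A real LLM call goes here."""
    if not any(line.startswith("observation:") for line in transcript):
        return {"tool": "lookup_volume", "input": "ACME"}
    return {"final": {"symbol": "ACME", "verdict": "normal"}}

spec = AgentSpec(
    model=scripted_model,
    prompt="You monitor trading volume.",
    tools={"lookup_volume": lambda symbol: f"{symbol} volume=1200"},
    output_keys=("symbol", "verdict"),
)
result = run_react(spec, "Is ACME volume unusual?")
```

The step cap and the schema check are the two guardrails worth keeping in any real version: one bounds cost, the other keeps malformed answers out of downstream state.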
Kafka contributes the memory substrate and the integration surface. Kafka Connect pulls real-time data from systems like BigQuery, Salesforce, and SAP to enrich agent context. And because Kafka is an immutable, effectively infinite log, every agent interaction routed through it becomes replayable, observable, and auditable. Connect the same topics to OpenSearch, Elasticsearch, or Datadog and you get observability and governance close to free, which are exactly the concerns teams skip until an incident forces the question.
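Replayability is easiest to see with a toy append-only log. The sketch below is an in-memory stand-in, not a Kafka client; `InteractionLog` and `rebuild_memory` are hypothetical names chosen for illustration.

```python
class InteractionLog:
    """Append-only log: records are added, never modified or removed."""

    def __init__(self):
        self._records = []

    def append(self, record: dict) -> int:
        self._records.append(dict(record))   # copy so callers cannot mutate it
        return len(self._records) - 1        # the record's offset

    def read_from(self, offset: int = 0):
        for position in range(offset, len(self._records)):
            yield position, dict(self._records[position])

def rebuild_memory(log: InteractionLog) -> dict:
    """Replay the log from offset 0 to reconstruct per-session memory."""
    memory = {}
    for _, record in log.read_from(0):
        memory.setdefault(record["session"], []).append(record["text"])
    return memory

log = InteractionLog()
log.append({"session": "s1", "text": "user asked about ACME"})
log.append({"session": "s2", "text": "user asked about XYZ"})
last_offset = log.append({"session": "s1", "text": "agent raised an alert"})

memory = rebuild_memory(log)
```

Because state is derived from the log rather than stored only in the agent, the same replay serves crash recovery, audits, and feeding an observability sink.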

Tiered memory mapped to physical storage
The most transferable idea in the talk is how Confluent maps memory tiers onto storage tiers, a pattern that holds whether or not you run Kafka.
Confluent rebuilt Kafka's internals into an engine called Kora, using a cell-based architecture, the same structural approach AWS uses for S3. A cell is a self-contained, multi-tenant unit, and the system collects usage statistics per availability zone and per cell to scale up or down before a customer hits a limit. Polak claimed the control-plane networking layer (recognized with a VLDB award, with a paper she offered to link) kept customers online through a recent AWS outage.
The relevant decomposition: classic Kafka co-locates compute and storage on one broker. Kora splits them. Compute runs where compute runs. Storage moves to object storage. And the local SSD that many cloud VMs ship with anyway (Azure VMs include a machine disk you cannot decline) becomes a fast caching tier you have already paid for.
That physical split lines up with memory semantics:
- Short-term memory lives on SSD-backed Kafka topics, configured for single-digit-millisecond retrieval. This holds the active session: the scratchpad, recent summaries, whatever the agent needs right now.
- Long-term memory moves to object storage (S3, Azure Blob, GCS) once data crosses a configured threshold, optionally landed in Iceberg or Delta for historical retrieval. This is the durable layer: policies, learned preferences, organizational facts the agent should carry across sessions.
- Flink state for in-flight computation sits in RocksDB or a checkpointing mechanism, tuned for speed.
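The hot-to-cold movement in the list above can be sketched without any streaming system. In this minimal version the threshold is an entry count and both tiers are Python lists; in practice the threshold might be size or age, and the tiers would be SSD-backed topics and object storage. `TieredMemory` and `hot_limit` are hypothetical names.

```python
from dataclasses import dataclass, field

@dataclass
class TieredMemory:
    hot_limit: int = 3                         # assumed threshold for demotion
    hot: list = field(default_factory=list)    # stand-in for the SSD tier
    cold: list = field(default_factory=list)   # stand-in for object storage

    def write(self, entry: str) -> None:
        self.hot.append(entry)
        # Demote the oldest entries once the hot tier crosses its threshold.
        while len(self.hot) > self.hot_limit:
            self.cold.append(self.hot.pop(0))

    def recent(self) -> list:
        return list(self.hot)

    def history(self) -> list:
        return self.cold + self.hot

tiers = TieredMemory(hot_limit=3)
for n in range(1, 6):
    tiers.write(f"turn-{n}")
```

After five writes the two oldest turns sit in the cold tier and the three newest remain hot, while `history()` still returns everything in order.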
The lifecycle question came up in Q&A, and the honest answer was that it depends on the application. A session might end when an agent makes a decision, fires an alert, or logs that nothing happened. You define the boundary; the storage tier follows from it.
A concrete use case
Polak walked through an anomaly-detection system built with a large E*TRADE platform for spotting unusual trading volume. Market data streams into a raw-volume Kafka topic. A Flink cluster consumes it, applies KeyBy, tumbling windows, and aggregation, then runs Flink's built-in anomaly detection.
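The key-by, window, aggregate shape of that pipeline can be shown in plain Python. This is not Flink code and does not reproduce its anomaly detector: the median-multiple rule, the sample trades, and the function names are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import median

def tumbling_window_volume(trades, window_seconds):
    """Sum volume per (symbol, window_start); windows do not overlap."""
    totals = defaultdict(int)
    for symbol, timestamp, volume in trades:
        window_start = (timestamp // window_seconds) * window_seconds
        totals[(symbol, window_start)] += volume
    return dict(totals)

def flag_anomalies(totals, threshold):
    """Flag windows whose volume exceeds threshold x the symbol's median."""
    by_symbol = defaultdict(list)
    for (symbol, _), volume in totals.items():
        by_symbol[symbol].append(volume)
    return sorted(
        key for key, volume in totals.items()
        if volume > threshold * median(by_symbol[key[0]])
    )

# (symbol, event time in seconds, traded volume), all hypothetical
trades = [
    ("ACME", 0, 100), ("ACME", 30, 120), ("ACME", 65, 90), ("ACME", 70, 110),
    ("ACME", 125, 900), ("XYZ", 10, 50), ("XYZ", 61, 55), ("XYZ", 130, 60),
]
totals = tumbling_window_volume(trades, window_seconds=60)
anomalies = flag_anomalies(totals, threshold=3.0)
```

The `threshold` argument is the knob the next paragraph is about: set it too low and every busy minute alerts, too high and real spikes pass unnoticed.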
The agent's job is not to replace that algorithm but to tune it. A small language model trained on trading data watches for variability and recommends threshold adjustments, because anomaly detection lives or dies on threshold configuration. The rollout was deliberately cautious: start with the agent only raising a suggestion or an alert, then graduate to autonomous action through A/B testing once trust is established. That staged path from advisory to actor is a sound pattern for putting any agent near a system where mistakes are expensive.
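The advisory-to-actor progression amounts to a mode switch around the same recommendation path. A minimal sketch, with `ThresholdController` and its mode names as hypothetical choices:

```python
from dataclasses import dataclass, field

@dataclass
class ThresholdController:
    threshold: float
    mode: str = "advisory"                      # "advisory" or "autonomous"
    suggestions: list = field(default_factory=list)

    def recommend(self, new_threshold: float, reason: str) -> str:
        # Every recommendation is recorded, whichever mode is active.
        self.suggestions.append((new_threshold, reason))
        if self.mode == "autonomous":
            self.threshold = new_threshold
            return "applied"
        return "suggested"

controller = ThresholdController(threshold=3.0)
first = controller.recommend(2.5, "volatility rose this week")

controller.mode = "autonomous"                  # flipped once trust is earned
second = controller.recommend(2.8, "volatility easing")
```

Keeping the suggestion log in both modes gives reviewers the record they need to decide when the switch is justified.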
Trade-offs an architect should weigh
The streaming-as-agent-infrastructure pitch is genuinely attractive, but it carries conditions worth stating plainly.
The open-source story is partial. In Q&A, Polak confirmed the AI tool-calling functions are partly available in open-source Flink, but the MCP integration is not yet released, held back by the operational complexity of shipping it without the managed control plane. So the cleanest version of this architecture currently assumes Confluent Cloud. Self-managing Flink and Kafka to reproduce it is possible, but it carries real operational weight.
The context-length math does not disappear either. Confluent supplies the streaming and memory layers, not the inference. You still call Claude, Gemini, or another provider, and you still have to reason about two limits at once: how many tokens you can send, and how fast the provider returns a result under that load. Polak's advice was to build the latency-versus-token chart, make conscious decisions, and weight context dynamically, keeping must-have context and dropping the nice-to-have.
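Keeping the must-have context and dropping the nice-to-have can be expressed as a small budgeted selection. This is a sketch under stated assumptions: tokens are approximated by whitespace-separated words (a real tokenizer would differ), and the weights and items are invented for illustration.

```python
def count_tokens(text: str) -> int:
    """Rough token estimate: whitespace-separated words (an assumption)."""
    return len(text.split())

def select_context(items, budget):
    """items: (text, weight, must_have) tuples. Returns (texts, tokens_used).

    Must-have items are always kept; optional items are added in
    descending weight order while they still fit in the budget.
    """
    required = [item for item in items if item[2]]
    optional = sorted((item for item in items if not item[2]),
                      key=lambda item: -item[1])
    used = sum(count_tokens(text) for text, _, _ in required)
    if used > budget:
        raise ValueError("must-have context alone exceeds the token budget")
    chosen = list(required)
    for item in optional:
        cost = count_tokens(item[0])
        if used + cost <= budget:
            chosen.append(item)
            used += cost
    return [text for text, _, _ in chosen], used

items = [
    ("system policy: never place trades", 1.0, True),
    ("user prefers terse alerts", 0.4, False),
    ("last summary: ACME volume normal all week", 0.8, False),
    ("full transcript of onboarding call with many details", 0.2, False),
]
selected, used = select_context(items, budget=14)
```

Failing loudly when the must-have set alone exceeds the budget is deliberate: silently truncating guardrails is the failure mode described at the start of the article.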
And there is a build-versus-adopt judgment underneath all of it. Her closing point was the pragmatic one: if you already run a distributed streaming system, you may not need a new agent platform so much as a mapping of what you have onto what agents require, plus a vector database and a few summarization and compression routines. The infrastructure is often already in the building.
That reframing is the takeaway for anyone evaluating options. The future Polak describes is stateful agents backed by real-time streams, and the cost of getting there is lower if you stop treating agent infrastructure as exotic and start treating it as the event-driven plumbing you already know how to run.
For the technical resources mentioned, the Flink Agents project and the Confluent documentation on tiered storage are the right starting points for mapping these ideas onto your own stack.
