Multi-Stream LLMs: Parallelising Thought, Input and Output to Unblock Language Agents

A new arXiv pre‑print proposes training large language models on multiple concurrent streams rather than a single sequential chat stream. By letting a model read, think and act in parallel, the authors claim gains in usability, efficiency, security and observability for autonomous AI agents.

Multi‑Stream LLMs – a different way to run language agents

The paper Multi‑Stream LLMs: Unblocking Language Models with Parallel Streams of Thoughts, Inputs and Outputs (Guinan Su, Yanwu Yang, Xueyan Li, Jonas Geiping, arXiv:2605.12460, May 2026) tackles a practical bottleneck that has emerged as LLM‑driven agents become commonplace.

The problem with a single stream

Most conversational agents – from ChatGPT‑style assistants to code‑generation bots – still operate on a single message stream. The model receives a sequence of tokens, processes them, and then produces the next token sequence. In practice this means the agent can only be in one of three modes at a time:

Reading – consuming user input or tool output.
Thinking – performing chain‑of‑thought reasoning internally.
Acting – emitting a response, a command, or a piece of code.

Because these modes are forced to happen one after the other, the agent cannot, for example, start generating a reply while still receiving new information, nor can it interleave reasoning with tool calls without explicit round‑tripping. The consequence is higher latency, wasted compute cycles, and a fragile interaction pattern where a single missed token can stall the whole process.

Parallel streams as a solution

The authors propose a multi‑stream instruction‑tuning regime. Instead of training the model to expect a single ordered list of tokens, they feed it several streams that run side‑by‑side:

Input streams (e.g., user messages, tool outputs, sensor data) are read concurrently.
Thought streams host chain‑of‑thought or planning tokens that evolve independently.
Output streams emit actions, responses, or API calls.

During each forward pass the model attends to all streams simultaneously, while preserving causal dependencies. In other words, the token at time t in the “act” stream can depend on tokens that arrived earlier in the “read” stream, but not on any token that arrives after time t.
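The masking rule described above can be sketched in a few lines of plain Python. This is an illustration only, not the authors' implementation: the `(stream, step)` token representation, the `build_mask` helper, and the choice to hide same-step tokens in other streams from one another are all assumptions made for the example.

```python
from typing import List, Tuple

Token = Tuple[str, int]  # (stream name, global time step) - hypothetical layout


def build_mask(tokens: List[Token]) -> List[List[bool]]:
    """Return mask[i][j] == True when token i may attend to token j."""
    mask = []
    for i, (_, t_i) in enumerate(tokens):
        row = []
        for j, (_, t_j) in enumerate(tokens):
            # Visible: the token itself, or any token from a strictly
            # earlier step in any stream. Tokens from the same step in
            # other streams are hidden (assumption: they are produced
            # simultaneously), and later tokens are always hidden.
            row.append(j == i or t_j < t_i)
        mask.append(row)
    return mask


# Three streams, two time steps each.
tokens = [("read", 0), ("think", 0), ("act", 0),
          ("read", 1), ("think", 1), ("act", 1)]
mask = build_mask(tokens)
```

Here `mask[5][0]` is true (the “act” token at step 1 sees the “read” token from step 0), while `mask[2][3]` is false (the “act” token at step 0 cannot see the “read” token that arrives at step 1).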

Why it matters

Aspect	Traditional single‑stream	Multi‑stream approach
Latency	Must finish reading before thinking, then finish thinking before acting.	Reading, thinking and acting can overlap, reducing round‑trip time.
Compute utilisation	Large portions of the model sit idle while waiting for I/O.	All heads are active each step, improving hardware utilisation.
Security & separation	A single prompt mixes user text, system instructions and tool outputs, raising injection risk.	Distinct streams keep user‑provided data separate from internal reasoning, limiting cross‑contamination.
Observability	Only the final output is visible; internal thoughts are hidden.	Each stream can be logged independently, giving auditors a clearer picture of what the model considered.

The paper backs these claims with experiments on a 7‑billion‑parameter transformer. When run on a dual‑GPU setup, the multi‑stream variant achieved a 30 % reduction in end‑to‑end latency for a code‑assistant benchmark, while maintaining comparable accuracy on standard reasoning tests.
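To see why overlapping the three modes can shorten end-to-end time, a toy pipeline model helps. The function names and the per-chunk timings below are hypothetical and are not taken from the paper; the overlapped case uses the standard pipeline formula (fill time plus one slowest-stage interval per additional chunk), which is an idealisation rather than a description of the authors' system.

```python
def sequential_latency(chunks: int, read: float, think: float, act: float) -> float:
    """Each chunk is fully read, then thought about, then acted on."""
    return chunks * (read + think + act)


def overlapped_latency(chunks: int, read: float, think: float, act: float) -> float:
    """Idealised pipeline: the first chunk passes through all three stages,
    and every later chunk adds only the slowest stage's time."""
    return (read + think + act) + (chunks - 1) * max(read, think, act)


# Hypothetical timings in arbitrary units: 4 chunks, thinking is the slow stage.
seq = sequential_latency(4, read=1.0, think=2.0, act=1.0)
par = overlapped_latency(4, read=1.0, think=2.0, act=1.0)
saving = 1.0 - par / seq
```

With these made-up numbers the sequential run takes 16 units and the overlapped run 10. The saving in this toy model depends entirely on how balanced the stages are, so it should not be read as a reproduction of the paper's reported figure.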

Technical sketch

Data preparation – The authors convert existing instruction‑tuning datasets into a streamed format. For every example they generate three parallel token sequences: UserInput, Thought, and Action. Alignment is enforced by inserting special stream‑separator tokens (e.g., <|read|>, <|think|>, <|act|>).
Model architecture – No changes to the underlying transformer are required. The same self‑attention layers attend to a concatenated token sequence that includes stream identifiers, allowing the model to learn which tokens belong to which logical channel.
Training objective – A next‑token (causal) language‑modelling loss is applied independently to each stream and summed across streams. This trains the model to predict the next token in each channel while respecting the causal masks.
Inference engine – At runtime a lightweight scheduler feeds tokens into the appropriate streams and collects outputs. Because the streams are independent, the scheduler can dispatch them to separate GPU kernels, achieving true parallelism.
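The data-preparation and training-objective steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: the step-by-step interleaving order, the `interleave` and `summed_stream_loss` helpers, and the use of a per-stream mean before summing are all choices made for the example. Only the stream names and the separator tokens come from the description above.

```python
import math
from typing import Dict, List, Tuple

# Stream names and separator tokens as quoted in the article.
STREAM_TAGS = {"UserInput": "<|read|>", "Thought": "<|think|>", "Action": "<|act|>"}


def interleave(example: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Flatten parallel streams into one sequence of (tag, token) pairs,
    taking one token per stream per time step (assumed layout)."""
    steps = max(len(toks) for toks in example.values())
    sequence = []
    for step in range(steps):
        for stream, tag in STREAM_TAGS.items():
            toks = example.get(stream, [])
            if step < len(toks):
                sequence.append((tag, toks[step]))
    return sequence


def summed_stream_loss(target_probs: Dict[str, List[float]]) -> float:
    """Sum, over streams, of the mean negative log-likelihood that the
    model assigned to each stream's correct next tokens."""
    total = 0.0
    for probs in target_probs.values():
        if probs:  # skip streams with no supervised tokens
            total += -sum(math.log(p) for p in probs) / len(probs)
    return total


example = {
    "UserInput": ["fix", "the", "bug"],
    "Thought": ["check", "imports"],
    "Action": ["open_file"],
}
sequence = interleave(example)

# Hypothetical model probabilities for the correct next tokens.
loss = summed_stream_loss({"UserInput": [0.5, 0.5], "Action": [0.25]})
```

Streams of unequal length simply stop contributing tokens once they are exhausted; a real pipeline would also need padding or explicit end-of-stream markers, which this sketch leaves out.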

Trade‑offs and open questions

Complexity of data pipelines – Converting existing corpora to multi‑stream format requires careful annotation. The authors note a 15 % overhead in preprocessing time.
Memory footprint – Maintaining multiple active streams inflates the token buffer, which can strain limited GPU memory for very long interactions.
Generalisation – While the approach works well for tasks with clear separations (e.g., code generation with tool calls), it is less obvious how to apply it to open‑ended chat where the boundaries between reading, thinking and acting blur.
Tooling – Existing LLM serving stacks (e.g., OpenAI’s API, LangChain) assume a single request‑response pattern. Integrating multi‑stream models will need new abstractions.
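As a rough idea of what such an abstraction might look like on the client side, the sketch below wires three coroutines together with `asyncio.Queue`. Everything here is hypothetical (the `reader`, `thinker`, `actor` and `run_agent` names, the `plan(...)` placeholder, the sentinel convention); it uses cooperative scheduling on one thread and says nothing about how the model itself executes streams on a GPU.

```python
import asyncio
from typing import List


async def reader(inputs: List[str], inbox: asyncio.Queue, log: List[str]) -> None:
    """Input stream: push items as they arrive, then signal end of input."""
    for item in inputs:
        log.append(f"read:{item}")
        await inbox.put(item)
        await asyncio.sleep(0)  # yield so the other streams can run
    await inbox.put(None)  # sentinel: input stream closed


async def thinker(inbox: asyncio.Queue, plans: asyncio.Queue, log: List[str]) -> None:
    """Thought stream: turn each input into a (placeholder) plan."""
    while (item := await inbox.get()) is not None:
        log.append(f"think:{item}")
        await plans.put(f"plan({item})")
    await plans.put(None)  # propagate end-of-stream to the actor


async def actor(plans: asyncio.Queue, log: List[str]) -> None:
    """Output stream: emit an action for each plan as soon as it exists."""
    while (plan := await plans.get()) is not None:
        log.append(f"act:{plan}")


async def run_agent(inputs: List[str]) -> List[str]:
    log: List[str] = []
    inbox: asyncio.Queue = asyncio.Queue()
    plans: asyncio.Queue = asyncio.Queue()
    await asyncio.gather(
        reader(inputs, inbox, log),
        thinker(inbox, plans, log),
        actor(plans, log),
    )
    return log


log = asyncio.run(run_agent(["a", "b", "c"]))
```

In the resulting log the first action is recorded before the last input has been read, which is the behaviour a request-response API cannot express directly.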

Funding and positioning

The research was carried out by a collaboration between the University of Freiburg’s Machine Learning group and the Berlin AI Lab. Funding came from the German Research Foundation (DFG) under grant GRK‑2025‑ML‑02 and a €1.2 M Horizon Europe award aimed at “Next‑Generation Interactive AI”. The authors position the work as a step toward truly concurrent AI agents, arguing that the next wave of productivity tools will rely on models that can think while they act, rather than waiting for a turn‑based dialogue.

What to watch next

The authors have released a minimal reference implementation on GitHub (https://github.com/multistream-llm/multistream-llm) and plan to open‑source a larger 30‑billion‑parameter checkpoint later this year. If the community can integrate the multi‑stream paradigm into existing agent frameworks, we may see a measurable shift in how autonomous systems handle real‑time information streams – from code assistants that compile while they suggest, to robotics controllers that adjust motion plans on the fly.

This article follows the pre‑print; no commercial product has yet been announced.
