Cloudflare has unveiled new infrastructure designed to run large language models efficiently across its global network, featuring disaggregated prefill and decode processing, a custom inference engine called Infire, and Unweight model compression technology to optimize performance and reduce costs.
Cloudflare has recently announced significant advancements in infrastructure designed to run large language models (LLMs) across its global network. As these models continue to grow in size and complexity, with some now exceeding one trillion parameters, deploying them efficiently becomes an increasingly critical challenge. Cloudflare's approach addresses this through architectural innovations that separate processing stages and optimize GPU utilization.
Disaggregated Processing Architecture
At the core of Cloudflare's solution is a disaggregated approach that separates the model's input processing and output generation onto different optimized systems. This architectural decision stems from the fundamental differences between the two main stages of processing an LLM request:
- Prefill stage: Processes input tokens and populates the KV cache (typically compute-bound)
- Decode stage: Generates output tokens (typically memory-bound)
By handling these stages on different machines, Cloudflare can optimize each for its specific workload characteristics. Michelle Chen, principal product manager at Cloudflare, explains: "One hardware configuration that we use to improve performance and efficiency is disaggregated prefill. There are two stages to processing an LLM request: prefill, which processes the input tokens and populates the KV cache, and decode, which generates output tokens. Prefill is usually compute bound, while decode is memory bound."
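To make the distinction concrete, the toy sketch below separates the two stages into functions that could, in principle, run on different machines. It is purely illustrative: the function names, the fake KV cache, and the token handling are placeholders, not Cloudflare's code or API.

```python
# Conceptual sketch of the two LLM serving stages (not Cloudflare's code).
# The model, cache, and token handling here are illustrative placeholders.

from dataclasses import dataclass, field


@dataclass
class KVCache:
    """Per-request key/value cache populated during prefill, read during decode."""
    entries: list = field(default_factory=list)


def prefill(prompt_tokens: list[int], cache: KVCache) -> int:
    # Compute-bound: all prompt tokens are processed in one large batched pass,
    # and their attention keys/values are written into the cache.
    for tok in prompt_tokens:
        cache.entries.append(("kv_for", tok))   # stand-in for real K/V tensors
    return prompt_tokens[-1]                    # last token seeds decoding


def decode(first_token: int, cache: KVCache, max_new_tokens: int) -> list[int]:
    # Memory-bound: each step emits a single token but must re-read the entire
    # (growing) KV cache, so memory bandwidth dominates over raw compute.
    output, current = [], first_token
    for _ in range(max_new_tokens):
        _ = len(cache.entries)                  # stand-in for attention over the cache
        current = (current + 1) % 50_000        # stand-in for sampling the next token
        cache.entries.append(("kv_for", current))
        output.append(current)
    return output


# In a disaggregated deployment, prefill() and decode() would run on different,
# specialized machines, with the KV cache handed over between them.
cache = KVCache()
seed = prefill([101, 2023, 2003, 1037, 3231], cache)
print(decode(seed, cache, max_new_tokens=5))
```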
The Infire Inference Engine
To manage these complex workloads across multiple GPUs, Cloudflare developed a custom AI inference engine called Infire. Announced during Cloudflare Birthday Week 2025, Infire addresses the challenges of running large language models that must be split across multiple GPUs. For instance, Kimi K2.5 is so large (over 1 trillion parameters and about 560GB in size) that it requires at least eight H100 GPUs just to load the model into memory, before accounting for additional memory used during processing.
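A quick back-of-envelope calculation, using the figures quoted above and the 80 GB of HBM on an H100, shows why eight GPUs is the floor; the reasoning about overhead beyond the raw weight size is an illustrative assumption, not Cloudflare's published accounting.

```python
# Back-of-envelope memory math based on the figures quoted in the article.
# Overhead reasoning is an illustrative assumption, not Cloudflare's numbers.

H100_MEMORY_GB = 80          # HBM per H100
MODEL_WEIGHTS_GB = 560       # approximate size quoted for Kimi K2.5

gpus_for_weights = MODEL_WEIGHTS_GB / H100_MEMORY_GB
print(f"GPUs needed for the weights alone: {gpus_for_weights:.1f}")   # 7.0

# Weights are only part of the story: activations, runtime overhead, and the
# KV cache all need headroom, which is why eight GPUs is the practical floor.
headroom_gb = 8 * H100_MEMORY_GB - MODEL_WEIGHTS_GB
print(f"Headroom across 8x H100 after loading weights: {headroom_gb} GB")  # 80 GB
```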
Infire optimizes both pipeline parallelism and tensor parallelism:
- Pipeline parallelism: Balances load across all stages of the pipeline so that GPUs at one stage are not starved of work while others are still executing
- Tensor parallelism: Minimizes cross-GPU communication and makes the communication that remains as fast as possible
"For most models, utilizing both pipeline parallelism and tensor parallelism in tandem provides the best balance of throughput and latency," according to Cloudflare's engineering team.
Model Optimization with Unweight
Cloudflare also introduced Unweight, a system that compresses large language model weights by approximately 15-22% without losing accuracy. This compression reduces the amount of data GPUs need to load and move during inference, allowing models to run faster and more efficiently. The impact is particularly significant for large models, where even modest reductions in memory requirements can translate into substantial cost savings and performance improvements.
These optimizations have enabled Cloudflare to run Llama 4 Scout on just two H200 GPUs while still leaving sufficient memory for the KV cache of large context windows. Similarly, Kimi K2.5 can run on eight H100 GPUs with adequate memory headroom for processing.
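The rough arithmetic below shows how compression in the quoted 15-22% range changes the memory picture for both deployments. The Llama 4 Scout weight size (assumed ~218 GB at 16-bit precision) is an illustrative assumption; only the GPU counts, the compression range, and the Kimi K2.5 size come from the announcement.

```python
# Rough memory math using the article's figures; the Llama 4 Scout weight size
# and precision are assumptions used purely for illustration.

def fits(weights_gb: float, compression: float, gpus: int, gpu_mem_gb: float) -> None:
    compressed = weights_gb * (1.0 - compression)
    total = gpus * gpu_mem_gb
    print(f"weights {weights_gb:.0f} GB -> {compressed:.0f} GB after "
          f"{compression:.0%} compression; {total - compressed:.0f} GB left "
          f"for KV cache and overhead on {gpus} GPUs")

# Llama 4 Scout: assumed ~218 GB of 16-bit weights on two 141 GB H200s,
# with Unweight-style compression in the middle of the 15-22% range.
fits(weights_gb=218, compression=0.18, gpus=2, gpu_mem_gb=141)

# Kimi K2.5: ~560 GB of weights (per the article) on eight 80 GB H100s.
fits(weights_gb=560, compression=0.18, gpus=8, gpu_mem_gb=80)
```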
Practical Use Cases and Applications
The infrastructure supports a range of applications that require high-performance LLM inference:
- Real-time conversational AI: The optimized architecture reduces latency for interactive applications
- Content generation: Efficient processing enables faster creation of text, code, and other content
- Customer service bots: The global distribution ensures consistent performance regardless of user location
- Research and development: The cost optimizations make it more feasible to experiment with larger models
Cloudflare has demonstrated this by running open-source models on its AI inference platform, starting with Moonshot AI's Kimi K2.5 model on Workers AI. The team highlighted how they're using a variety of hardware configurations to best serve different models based on their specific characteristics and requirements.
Trade-offs and Challenges
While Cloudflare's approach offers significant advantages, it's not without trade-offs:
- Increased complexity: The disaggregated architecture requires sophisticated orchestration
- Network dependencies: Separating the stages means the KV cache built during prefill must be handed off to the decode machines, introducing additional network latency that must be minimized (see the sketch after this list)
- Optimization effort: Each model type may require specific tuning to achieve optimal performance
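To give a sense of that network cost, the sketch below estimates how large a prefilled KV cache might be and how long it would take to move between machines. Every model dimension and the link speed are hypothetical assumptions; the point is only the order of magnitude.

```python
# Rough estimate of the KV cache that must move from a prefill machine to a
# decode machine in a disaggregated setup. All model dimensions and the link
# speed below are hypothetical, chosen only to show the order of magnitude.

LAYERS = 64
KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2          # fp16/bf16
PROMPT_TOKENS = 8192
LINK_GB_PER_S = 12.5         # ~100 Gbps network link

bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE  # K and V
cache_gb = bytes_per_token * PROMPT_TOKENS / 1e9

print(f"KV cache per token: {bytes_per_token / 1024:.0f} KiB")          # 256 KiB
print(f"KV cache for the prompt: {cache_gb:.2f} GB")                    # ~2.1 GB
print(f"Transfer time at 100 Gbps: {cache_gb / LINK_GB_PER_S * 1000:.0f} ms")
```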
The company has addressed these challenges through careful engineering of the Infire engine and extensive testing across different model types and sizes.
Industry Context
Cloudflare is not alone in addressing the challenges of running LLMs in production. According to Cockroach Labs' recent State of AI Infrastructure report, as companies move AI systems into everyday use, many are finding their current infrastructure is not ready to handle the scale and reliability these workloads require.
"Legacy infrastructure, built around episodic human interaction, simply wasn't designed for this kind of pressure," the report states. "To handle the pace and unpredictability of AI, companies need more than performance upgrades. They need a fundamental shift in how systems are architected."
Cloudflare's approach exemplifies this fundamental shift, moving beyond simple scaling to architectural innovation that addresses the specific characteristics of AI workloads.
Future Implications
As LLMs continue to grow in size and complexity, infrastructure innovations like Cloudflare's will become increasingly important. The disaggregated approach, combined with specialized inference engines and model compression techniques, offers a path forward that balances performance, cost, and scalability.
For organizations looking to deploy large language models, Cloudflare's infrastructure provides a compelling alternative to traditional approaches, particularly for those already leveraging Cloudflare's global network for other services. The integration of these capabilities with Cloudflare's existing edge computing platform creates a comprehensive solution for AI inference that can scale globally while maintaining high performance.
The evolution of this technology will likely continue, with further optimizations in model distribution, inference scheduling, and hardware utilization. As these improvements emerge, they will make increasingly sophisticated AI applications more accessible and practical for a wider range of organizations.
For more information about Cloudflare's AI infrastructure, you can explore their official blog and Workers AI documentation.
