
OpenAI to serve ChatGPT on Cerebras' AI dinner plates


OpenAI has signed a $10+ billion deal with Cerebras to deploy 750 megawatts of wafer-scale AI accelerators through 2028, leveraging Cerebras' massive SRAM bandwidth for real-time inference and reasoning capabilities that could transform ChatGPT's responsiveness.

OpenAI has announced a landmark partnership with Cerebras Systems that will see the ChatGPT provider deploy 750 megawatts of the chip startup's dinner-plate-sized AI accelerators through 2028. The deal, valued at over $10 billion according to sources familiar with the matter, represents a significant strategic shift for OpenAI as it seeks to differentiate its inference services through raw performance gains.


The partnership centers on Cerebras' unique wafer-scale compute architecture, which differs fundamentally from conventional GPU-based systems. Each Cerebras WSE-3 accelerator measures 46,225 mm²—roughly the size of a dinner plate—and packs 44 GB of SRAM directly on the chip. Because that memory sits on the wafer itself rather than in external HBM stacks, aggregate memory bandwidth approaches 21 petabytes per second, nearly 1,000 times the 22 TB/s of HBM bandwidth on a single Nvidia Rubin GPU.
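
A quick sanity check of that ratio, using only the figures quoted above (the constant names are ours):

```python
# Sanity check of the bandwidth comparison above; both figures are the ones
# quoted in the article, nothing more is implied.
WSE3_SRAM_BW_TBPS = 21_000   # ~21 PB/s of on-wafer SRAM bandwidth (WSE-3)
RUBIN_HBM_BW_TBPS = 22       # ~22 TB/s of HBM on a single Nvidia Rubin GPU

ratio = WSE3_SRAM_BW_TBPS / RUBIN_HBM_BW_TBPS
print(f"~{ratio:.0f}x")      # ~955x, i.e. "nearly 1,000 times"
```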

The Performance Advantage for Real-Time AI

For inference workloads, this bandwidth translates directly into speed. When running OpenAI's gpt-oss 120B model, Cerebras' chips can purportedly achieve 3,098 tokens per second per user, compared to 885 tokens per second on competitor Together AI's Nvidia GPU-based system. This 3.5x performance improvement has particular significance for reasoning models and AI agents, where faster inference enables models to "think" longer while maintaining interactive response times.
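
The arithmetic behind that comparison, and a hedged illustration of why it matters for reasoning workloads (the token rates are the reported figures; the 10,000-token reasoning chain is our assumption for illustration):

```python
# Why per-user token rate matters for reasoning models: the same hidden chain
# of "thinking" tokens finishes much sooner. Token rates are those reported
# above; the chain length is an illustrative assumption.
CEREBRAS_TPS = 3098         # gpt-oss 120B on Cerebras, tokens/s per user
GPU_TPS = 885               # same model on Together AI's Nvidia GPU system
REASONING_TOKENS = 10_000   # hypothetical chain-of-thought length

print(f"Speedup: ~{CEREBRAS_TPS / GPU_TPS:.1f}x")               # ~3.5x
print(f"Wafer-scale: {REASONING_TOKENS / CEREBRAS_TPS:.1f} s")  # ~3.2 s
print(f"GPU system:  {REASONING_TOKENS / GPU_TPS:.1f} s")       # ~11.3 s
```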

OpenAI explained the practical implications in a recent blog post: "When you ask a hard question, generate code, create an image, or run an AI agent, there is a loop happening behind the scenes: you send a request, the model thinks, and it sends something back. When AI responds in real time, users do more with it, stay longer, and run higher-value workloads."

The partnership follows OpenAI's introduction of a model router with GPT-5, which intelligently directs queries to appropriate model sizes based on complexity. This architecture helps mitigate Cerebras' memory constraints—while the chips offer exceptional bandwidth, their 44 GB SRAM capacity is comparable to a six-year-old Nvidia A100 PCIe card. At 16-bit precision, each billion parameters requires 2 GB of SRAM, meaning even modest models like Llama 3 70B need at least four CS-3 accelerators.
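
A minimal sketch of that memory math, using only the figures above (2 GB per billion parameters at 16-bit precision, 44 GB of SRAM per CS-3). It ignores KV cache and activations, so it gives a lower bound on chip count:

```python
import math

# Back-of-the-envelope chip count from the figures above. Weights only:
# KV cache, activations and any replication are ignored.
GB_PER_BILLION_PARAMS_FP16 = 2
CS3_SRAM_GB = 44

def min_cs3_chips(params_billions: float) -> int:
    weight_gb = params_billions * GB_PER_BILLION_PARAMS_FP16
    return math.ceil(weight_gb / CS3_SRAM_GB)

print(min_cs3_chips(70))    # Llama 3 70B -> 4 chips, as noted above
print(min_cs3_chips(120))   # a 120B-class model at FP16 -> 6 chips
```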

Strategic Implications and Technical Trade-offs

The deal structure is notable: Cerebras will take on the risk of building and leasing datacenters to serve OpenAI, a departure from traditional cloud procurement models. This arrangement gives OpenAI access to specialized compute without capital expenditure, while Cerebras secures a massive anchor customer for its wafer-scale technology.

The partnership also reflects the evolving competitive landscape in AI hardware. While Nvidia has dominated AI training, inference remains a contested space where architectural differences matter more. Cerebras' shift from training to inference over the past two years aligns with this market opportunity.

However, the architecture has limitations. SRAM is far less dense per bit than the DRAM used in HBM, so Cerebras chips pack less total memory than their enormous die area might suggest. Each CS-3 accelerator draws up to 23 kW of power, and large models require parallelization across multiple chips. The company's next-generation chip is expected to address this by dedicating more die area to SRAM and adding support for modern block floating point data types like MXFP4, which could dramatically expand the range of models that fit on a single chip.
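
To put numbers on the capacity side of that trade-off, here is a rough footprint comparison between FP16 and a block floating point format like MXFP4 (4-bit elements sharing an 8-bit scale per 32-value block, roughly 4.25 bits per weight). The parameter counts are examples, KV cache is ignored, and MXFP4 support itself is, as noted, still a future expectation:

```python
# Rough weight footprint at FP16 versus MXFP4 (~4.25 bits per weight once the
# shared block scales are counted). Illustrative only; ignores KV cache.
CS3_SRAM_GB = 44

def weight_gb(params_billions: float, bits_per_param: float) -> float:
    return params_billions * bits_per_param / 8   # 1e9 params * bits -> GB

for params in (70, 120):
    fp16 = weight_gb(params, 16)
    mxfp4 = weight_gb(params, 4.25)
    fits = "yes" if mxfp4 <= CS3_SRAM_GB else "no"
    print(f"{params}B: FP16 ~{fp16:.0f} GB, MXFP4 ~{mxfp4:.0f} GB, "
          f"fits one 44 GB chip: {fits}")
```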

Disaggregated Inference: A Potential Future Direction

Industry observers speculate about potential disaggregated inference architectures, where compute-heavy prompt processing runs on AMD or Nvidia GPUs while token generation—bandwidth-constrained but compute-light—runs on Cerebras' SRAM-packed accelerators. When asked about this possibility, a Cerebras spokesperson indicated the current agreement is for a cloud service where the company builds out datacenters with its equipment for OpenAI.

That doesn't preclude a disaggregated approach in the future, but it would require Cerebras to deploy GPU systems alongside its wafer-scale accelerators in those datacenters. The technique itself is clearly feasible, since many inference stacks already split prefill and decode phases, but adopting it would hinge on Cerebras' willingness to run heterogeneous compute environments.
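
For the curious, a purely hypothetical sketch of what such a split could look like; the class and method names are illustrative stand-ins, not any vendor's API:

```python
# Hypothetical disaggregated inference: compute-heavy prefill on a GPU tier,
# bandwidth-bound decode on a wafer-scale tier. Stub classes stand in for
# real backends; the hand-off of the KV cache is the hard part in practice.

class GPUPrefillTier:
    """Stand-in for a GPU pool that processes the whole prompt in one pass."""
    def prefill(self, prompt_tokens):
        return {"kv_cache_for": tuple(prompt_tokens)}   # stub KV cache

class WaferDecodeTier:
    """Stand-in for a wafer-scale pool that generates tokens one at a time."""
    eos_token = -1
    def load_kv_cache(self, kv_cache):
        self.kv_cache = kv_cache        # in reality, a costly transfer
    def decode_step(self):
        return 42                        # dummy "next token"

def run_inference(prompt_tokens, gpu_pool, wafer_pool, max_new_tokens=4):
    kv_cache = gpu_pool.prefill(prompt_tokens)   # compute-bound phase
    wafer_pool.load_kv_cache(kv_cache)           # hand-off between tiers
    output = []
    for _ in range(max_new_tokens):              # bandwidth-bound phase
        token = wafer_pool.decode_step()
        if token == wafer_pool.eos_token:
            break
        output.append(token)
    return output

print(run_inference([1, 2, 3], GPUPrefillTier(), WaferDecodeTier()))
```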

The partnership represents a calculated bet by OpenAI on specialized hardware for inference. While Nvidia's GPUs remain the workhorse for training, the company's $20 billion acquisition of Groq last year signaled recognition that inference requires different optimization strategies. Cerebras' dinner-plate architecture offers a fundamentally different approach: massive on-chip SRAM bandwidth versus the distributed memory hierarchies of conventional GPU systems.

For developers and enterprises using ChatGPT, the practical impact will be faster responses for complex queries, potentially enabling new use cases in real-time coding assistance, multi-step reasoning, and interactive AI agents. The deal's $10+ billion valuation and 2028 timeline suggest OpenAI views inference performance as a key competitive differentiator as the AI market matures beyond basic chat interfaces.

The partnership also validates Cerebras' wafer-scale approach after years of skepticism. With OpenAI as an anchor customer, the company has a clear path to scale its datacenter operations and potentially accelerate its long-delayed IPO plans. As Cerebras CEO Andrew Feldman has noted, the company maintains common ground with Nvidia despite being a competitor—both recognize that specialized hardware will be essential for AI's next phase.

The real test will come when ChatGPT users experience the performance difference. If Cerebras' chips deliver on their promised 3.5x speedup for reasoning workloads, the partnership could reshape expectations for real-time AI interaction. If not, OpenAI will have spent billions on specialized hardware that may need to be supplemented with more conventional compute resources.

Either way, the deal signals that the AI industry is moving beyond the one-size-fits-all GPU era into a period of hardware specialization where different architectures target specific workload characteristics. For inference, where latency and throughput directly impact user experience, that specialization is already proving decisive.
