Nvidia integrates Groq's LPUs into Vera Rubin systems to achieve 1000+ tokens/second inference speeds for trillion-parameter models.
Nvidia is leveraging its $20 billion acquisition of Groq to dramatically accelerate AI inference, integrating the startup's language processing unit (LPU) technology into new Vera Rubin rack systems that promise to deliver tokens at unprecedented rates.

The GPU giant revealed the integration during CEO Jensen Huang's GTC keynote, where the company detailed how combining Groq's LPUs with its own Rubin GPUs will enable serving massive trillion-parameter large language models at hundreds or even thousands of tokens per second per user.
The Technical Architecture
The integration represents a fundamental shift in how Nvidia approaches AI inference. Rather than relying solely on GPUs, the company is adopting a heterogeneous computing approach where different processors handle specialized tasks.
Nvidia's newly announced Rubin GPUs provide up to 50 petaFLOPS of compute with 22 TB/s of HBM4 memory bandwidth. Groq's latest LPU chips, by contrast, offer nearly 7x the memory bandwidth at 150 TB/s apiece, making them ideal for the memory-bound decode phase of LLM inference.
Each Groq 3 LPU delivers 1.2 petaFLOPS of FP8 performance and contains 500 MB of onboard memory. While this represents only about 1/500th of the capacity of Nvidia's Rubin GPU, the chips excel at one specific task: generating tokens at extreme speeds.
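Why does memory bandwidth matter more than FLOPS here? During decode, generating each new token requires streaming the model's weights from memory, so token rate is bounded by bandwidth. The back-of-envelope estimate below is our own sketch using the figures quoted above, not Nvidia's published methodology; it ignores KV-cache traffic, interconnect hops, and batching, but it shows the shape of the argument.

```python
# First-order roofline estimate for decode throughput. Assumes decode is
# memory-bandwidth-bound and that producing one token streams the full
# set of 4-bit weights from memory once; KV-cache reads, interconnect
# latency, and batching are ignored. Figures come from the article.

def max_decode_rate(params: float, bits_per_param: int, bandwidth_tbs: float) -> float:
    """Upper-bound tokens/second given aggregate memory bandwidth."""
    bytes_per_token = params * bits_per_param / 8  # weight bytes read per token
    return bandwidth_tbs * 1e12 / bytes_per_token

ONE_TRILLION = 1e12

# A single Rubin GPU's 22 TB/s of HBM4 caps a 1T-parameter, 4-bit model
# at roughly 44 tokens/second for a single stream:
print(max_decode_rate(ONE_TRILLION, 4, 22))          # ~44 tok/s

# ~1,024 LPUs at 150 TB/s apiece raise the aggregate ceiling by orders
# of magnitude (shared across all concurrent users):
print(max_decode_rate(ONE_TRILLION, 4, 1024 * 150))  # ~307,000 tok/s
```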
LPX Rack System Design
Nvidia plans to pack 256 Groq LPUs into a new LPX rack system, connected via a custom Spectrum-X interconnect to a neighboring Vera Rubin NVL72 rack system. This configuration allows GPUs to handle the compute-intensive prompt processing while LPUs rapidly generate responses.
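In software terms, this is the disaggregated prefill/decode pattern that inference-serving frameworks have been converging on. The sketch below is purely illustrative: `gpu_prefill`, `lpu_decode`, and `KVCacheHandle` are hypothetical stand-ins, not APIs that Nvidia or Groq have published.

```python
# Toy illustration of disaggregated prefill/decode serving. All names
# here are hypothetical; neither Nvidia nor Groq has published this API.

from dataclasses import dataclass
from typing import Iterator

@dataclass
class KVCacheHandle:
    """Opaque reference to a KV cache materialized during prefill."""
    request_id: int

def gpu_prefill(prompt: str) -> KVCacheHandle:
    # Compute-bound phase: the NVL72 GPUs process the whole prompt in
    # parallel and build the attention KV cache.
    return KVCacheHandle(request_id=hash(prompt))

def lpu_decode(cache: KVCacheHandle, max_tokens: int) -> Iterator[str]:
    # Memory-bound phase: the LPX rack emits tokens one at a time,
    # streaming weights and KV cache at very high bandwidth.
    for i in range(max_tokens):
        yield f"<token-{i}>"

def serve(prompt: str, max_tokens: int = 8) -> Iterator[str]:
    cache = gpu_prefill(prompt)               # runs on the Vera Rubin rack
    yield from lpu_decode(cache, max_tokens)  # handed off over Spectrum-X

print(list(serve("Explain LPUs in one sentence.")))
```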
The memory constraints are significant: 256 chips per rack provide only 128 GB of ultra-fast memory, which falls well short of what trillion-parameter models require. For context, a 1 trillion-parameter model at 4-bit precision requires at least 512 GB of memory, meaning approximately a thousand LPUs would be needed to hold such a model entirely in memory.
Nvidia addresses this limitation by allowing multiple LPX racks to be ganged together for larger models, though this introduces additional complexity and cost.
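The capacity math is easy to sanity-check. The snippet below reproduces the article's figures for weights alone; the 512 GB figure above adds headroom beyond the raw 500 GB of 4-bit weights, and KV cache and activations would need room on top of that.

```python
# Sanity-check of the rack capacity math quoted above (weights only).

PARAMS = 1e12        # 1 trillion parameters
BITS = 4             # 4-bit quantization
LPU_MEM_GB = 0.5     # 500 MB of onboard memory per Groq 3 LPU
LPUS_PER_RACK = 256  # LPUs per LPX rack

weights_gb = PARAMS * BITS / 8 / 1e9           # 500 GB of weights
rack_capacity_gb = LPU_MEM_GB * LPUS_PER_RACK  # 128 GB per rack
racks_needed = weights_gb / rack_capacity_gb   # ~3.9 racks
lpus_needed = racks_needed * LPUS_PER_RACK     # ~1,000 LPUs

print(f"{weights_gb:.0f} GB of weights -> {racks_needed:.1f} racks "
      f"({lpus_needed:.0f} LPUs)")
```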
Market Implications and Pricing
The performance gains enable Nvidia to target the premium inference market, where providers can charge as much as $45 per million tokens generated. That is roughly triple OpenAI's current rate of approximately $15 per million output tokens for API access to its top GPT-5.4 model.
This pricing strategy targets applications requiring near-instantaneous responses, such as real-time code generation, interactive AI agents, and high-frequency trading analysis.
Strategic Course Correction
The Groq integration marks a strategic pivot for Nvidia. The company had previously announced a dedicated prefill processor called Rubin CPX at Computex last year, which would have used GDDR7-equipped processors for prefill processing and HBM-equipped GPUs for decode. However, that project appears to have been abandoned in favor of Groq's LPU-based approach.
"Integrating LPU and LPX into our written platform to optimize the decode, that's where we're focused right now," said Ian Buck, VP of Hyperscale and HPC at Nvidia.
Industry-Wide Trend
Nvidia isn't alone in pursuing this hybrid approach. Amazon Web Services announced a collaboration with Cerebras to develop a combined inference platform using AWS's Trainium 3 accelerators for prompt processing and Cerebras' WSE-3 ASICs for low-latency token generation.
Cerebras' wafer-scale engine packs 44 GB of SRAM onto a single chip, offering an alternative architecture to Groq's multi-chip approach.
Software and Ecosystem Considerations
Currently, Nvidia's Groq-based LPX systems don't support CUDA natively. Instead, the company treats the LPU as an accelerator attached to CUDA workloads running on the Vera Rubin NVL72 platform, suggesting that software integration remains a work in progress.
This limitation means initial deployments will likely focus on model builders and service providers with the resources to optimize for the new architecture, rather than general-purpose cloud customers.
Timeline and Availability
The Groq-based LPX systems are expected to ship alongside Vera Rubin rack systems later this year. However, both access and software support may be limited initially as Nvidia works through the integration challenges.
The technology targets a specific niche: organizations needing to serve trillion-plus parameter models with high token rates, representing a small but growing segment of the AI infrastructure market.
This acquisition and integration strategy positions Nvidia to compete more effectively in the premium inference market while potentially limiting the growth of specialized inference chip companies that have dominated this space until now.
