Nvidia's Ian Buck reveals why CPX was shelved for LPU decode this year, explains the Vera CPU's unique agentic positioning, and details how LPX racks paired with Rubin GPUs achieve 1,000 tokens/second at economic scale.
Following Nvidia's GTC 2026 keynote, where CEO Jensen Huang unveiled the Vera Rubin architecture and Groq 3 LPU acquisition, Nvidia VP of Hyperscale and HPC Ian Buck addressed the press about the company's strategic shifts. Buck explained why CPX was shelved, how LPU decode works with GPUs, Vera CPU's positioning, and the Intel NVLink Fusion partnership.
CPX Delay and LPU Decode Architecture
Buck confirmed that CPX has been pulled from the immediate roadmap to focus on optimizing decode with LPU this year. "It's still a good idea," he said, "but in order to dedicate our focus on optimizing the decode with LPU this year, we'll be thinking about CPX more in the next generation."
The LPU decode architecture represents a significant shift in how AI inference works. Pairing a Groq 3 LPX rack holding 256 LPU chips with a Vera Rubin NVL72, Nvidia splits the decode process between the LPU and the GPU. "We've combined the Groq software team with our Dynamo team," Buck explained. "We now do not only disaggregation of separate GPUs that you pre-fill and decode, but also the decode itself is actually split between the LPU and GPU."
This split enables extremely fast token generation by keeping the bandwidth-bound weight math in the LPU's fast SRAM, while intermediate activation states are sent to the GPUs for attention math, softmax, routing, and KV calculations. "Only the LPUs need to have copies of the weights," Buck noted, "while all the per-query state, all the KV cache state, which can get quite large, can operate and stay in the HBMs."
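The division of labor Buck describes can be illustrated with a toy NumPy sketch. Every class, shape, and name below is invented for illustration, not Nvidia's implementation: the "LPU" side holds the only copy of the weights and runs the weight-heavy feed-forward math, while the "GPU" side keeps the growing per-query KV cache and runs attention and softmax.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64  # toy model width

class LPUSide:
    """Holds the only copy of the weights (fast SRAM in the real design)."""
    def __init__(self):
        self.w_ffn = rng.standard_normal((D, D)) / np.sqrt(D)

    def feed_forward(self, x):
        # Bandwidth-bound weight math stays on the LPU side.
        return np.maximum(x @ self.w_ffn, 0.0)

class GPUSide:
    """Holds the per-query KV cache and does attention/softmax (HBM in the real design)."""
    def __init__(self):
        self.keys, self.values = [], []  # growing KV cache

    def attention(self, q):
        self.keys.append(q)          # toy simplification: reuse the input as this step's K/V
        self.values.append(q)
        K = np.stack(self.keys)      # (T, D)
        V = np.stack(self.values)
        scores = K @ q / np.sqrt(D)
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()         # softmax over all cached positions
        return probs @ V

lpu, gpu = LPUSide(), GPUSide()
x = rng.standard_normal(D)
for _ in range(4):                   # four decode steps
    h = gpu.attention(x)             # attention + KV cache on the "GPU"
    x = lpu.feed_forward(h)          # weight-heavy FFN on the "LPU"

print(x.shape)  # (64,)
```

The point of the sketch is the data flow, not the math: weights never leave one side, the KV cache never leaves the other, and only small per-step activations cross between them.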
Dynamo's Success and Integration
Dynamo, launched a year ago as the "operating system of the AI factory," has seen remarkable adoption. "We get about 100 GitHub submissions a day now, and about a third of them are coming from external sources," Buck revealed. This success underscores the importance of software optimization in Nvidia's strategy.
Vera CPU: The World's Best Agentic CPU
Nvidia's Vera CPU represents a unique approach to processor design. Buck brought a Vera module to the Q&A, showing a dual-socket server with two Vera CPUs and LPDDR5 memory. "It is the world's best agentic CPU," he stated, emphasizing its 88 cores designed for maximum performance under load.
The Vera CPU's design philosophy centers on single-threaded performance, memory bandwidth, and energy efficiency under load. "You may not need 88 cores," Buck acknowledged, "but it's actually a unique workload because in agentic AI, it's in the critical path for both training and running these models."
Agentic AI's Critical Path
Buck provided a compelling example of why the Vera CPU matters in agentic AI workflows. During model training, when an AI model writes code to solve problems such as computing a Fibonacci sequence or completing a crossword puzzle, the CPU must execute that code to score the results. "We're not going to run that Python on the GPU," he explained. "It's a CPU job. The GPU tells the CPU to go run it."
This puts the CPU on the critical path: it must open a sandbox, boot a Linux instance, start the Python interpreter, and compile and run the code quickly. "What the world is asking for, what it needs, is a really fast CPU that can generate a lot of training data while you're training in order to make the model faster, and never let GPU go idle."
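The loop Buck describes, where a model emits code and a CPU executes and scores it, can be sketched as follows. The "sandbox" here is just a subprocess with a timeout, far weaker than the booted Linux instance he mentions; the function name and scoring rule are invented for illustration.

```python
import subprocess
import sys

def score_generated_code(code: str, expected: str, timeout_s: float = 5.0) -> float:
    """Run model-generated Python in a child process and score its stdout.

    A real training pipeline would boot a proper isolated sandbox;
    a subprocess with a timeout is the toy stand-in here.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return 0.0  # hung code scores zero
    if result.returncode != 0:
        return 0.0  # crashing code scores zero
    return 1.0 if result.stdout.strip() == expected else 0.0

# A model-written attempt at the 10th Fibonacci number (fib(10) = 55).
candidate = """
a, b = 0, 1
for _ in range(10):
    a, b = b, a + b
print(a)
"""

print(score_generated_code(candidate, "55"))  # 1.0
```

Each scoring call pays process-startup and interpreter-startup cost before any useful work happens, which is exactly the single-threaded, latency-sensitive load the Vera pitch targets.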
Market Positioning and Partner Strategy
When asked about Vera's market positioning, Buck was clear that it's not designed as a dollar-per-vCPU chip or a gaming processor. "The amount of technology and, frankly, just the cost to build something that solves that critical workload makes it not for that market."
Nvidia will only build one Vera SKU, with partners building other x86 SKUs. "The world is not going to be served by one SKU of CPU," Buck said. Partners can build systems using the reference board or purchase chips directly, though they're highly motivated to build what Nvidia recommends for agentic use cases.
Intel NVLink Fusion Partnership
Buck confirmed that the Intel partnership announced last year is progressing. NVLink Fusion is an IP block plus a chiplet that lets CPUs, x86 parts included, communicate over NVLink with Nvidia GPUs or other accelerators. "We've announced multiple partnerships including Intel, and that is definitely progressing."
The integration involves IP hooking into the fabric of the processor, with manufacturing and integration handled separately by partners. Buck noted that the Vera module itself uses multiple chiplets, demonstrating the complexity of modern processor design.
LPX Paired with Vera Rubin: The Economic Sweet Spot
By pairing LPX with Vera Rubin, Nvidia achieves 1,000 tokens per second at economically viable scale. "Instead of dozens of racks of LPX, we can deliver that level of performance with just two racks of LPX and one rack of Vera Rubin," Buck explained. This ratio of one Vera Rubin rack to every two to four LPX racks is what makes large-scale deployment economically viable.
The key insight is that LPX handles what it is good at, the mixture-of-experts layers, using SRAM bandwidth roughly seven times that of HBM, while the GPUs handle attention math and the other computations. This combination opens up the market for 100-megawatt, 500-megawatt, and gigawatt data centers.
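A rough back-of-envelope calculation shows why that bandwidth edge matters: decode is memory-bandwidth bound, so token rate is roughly bandwidth divided by the bytes of active weights streamed per token. All of the numbers below are invented assumptions for illustration, not published specs.

```python
# Back-of-envelope: decode is memory-bandwidth bound, so
# tokens/s ≈ memory bandwidth / bytes of active weights per token.
# Every figure here is an illustrative assumption, not a published spec.

active_params = 40e9          # assumed active parameters per token (MoE model)
bytes_per_param = 1           # FP8-style 1-byte storage
bytes_per_token = active_params * bytes_per_param

hbm_bw = 8e12                 # assumed HBM bandwidth, bytes/s
sram_bw = 7 * hbm_bw          # the "seven times" SRAM bandwidth claim

print(f"HBM-bound : {hbm_bw / bytes_per_token:,.0f} tokens/s")   # 200
print(f"SRAM-bound: {sram_bw / bytes_per_token:,.0f} tokens/s")  # 1,400
```

Under these assumed figures the SRAM-bound rate lands in the four-digit range, which is the regime of the 1,000 tokens-per-second claim; the real numbers differ, but the bandwidth-bound scaling is the point.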
Scale-Up Architecture and NVLink
Buck detailed Nvidia's scale-up architecture, explaining how NVLink connections between GPUs provide 10x to 20x more bandwidth than PCIe connections. "The main use of scale-up is parallelism, taking the most compute-intensive part of the computation and doing tensor parallelism across all the GPUs."
With mixture-of-experts models having hundreds of experts per layer, fast communication between GPUs becomes critical. Nvidia's use of copper cables—over 5,000 of them—enables high bandwidth without the cost and power consumption of optical connections.
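Tensor parallelism as Buck describes it, splitting one compute-heavy matmul across GPUs, can be sketched with NumPy. Each array slice stands in for one GPU's shard, and the final concatenate is where the all-gather traffic over NVLink would occur; the shapes are arbitrary toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
n_gpus = 4
x = rng.standard_normal((8, 128))        # activations, replicated on every "GPU"
W = rng.standard_normal((128, 256))      # full weight matrix

# Column-parallel split: "GPU" i holds W[:, i*64:(i+1)*64].
shards = np.split(W, n_gpus, axis=1)

# Each "GPU" computes its slice of the output independently...
partials = [x @ w_shard for w_shard in shards]

# ...then an all-gather (NVLink traffic in the real system) reassembles it.
y_parallel = np.concatenate(partials, axis=1)

assert np.allclose(y_parallel, x @ W)    # identical to the single-device result
```

The sharded compute is embarrassingly parallel; the cost is the gather step after every sharded layer, which is why scale-up bandwidth between GPUs, rather than raw FLOPs, so often sets the ceiling.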
Future Roadmap and Optical Connections
Looking ahead, Buck confirmed that Rubin will get optical capabilities, while Ambulink represents co-packaged optics (CPO). The roadmap includes LP40 with NVLink interconnect and FP4 capabilities, plus Tensor Core math from GPUs.
Nvidia is also developing NVSwitch with front ports to scale up to 576 GPUs per rack, with prototypes already working with Grace Blackwell. This addresses the chicken-and-egg relationship between hardware capabilities and model requirements.
Software Optimization: The Untold Story
Buck emphasized that software optimization is crucial to Nvidia's performance advantages. By optimizing for DeepSeek-like models, the team achieved a 4x uplift in just four months, running 250 simulations and spending 1.2 million GPU hours.
"Software is an untold story here," Buck said. "People like to say who's got the faster chip, and I'm like, who's got the better software ecosystem? That's why it's so hard to benchmark these things, because the whole stack end-to-end matters."
The combinatorial optimization space is massive, requiring 400 engineers and extensive testing to ensure performance and accuracy across different configurations and workloads.
Conclusion
Nvidia's strategic pivot from CPX to LPU decode, combined with the unique positioning of Vera CPU for agentic workloads, represents a comprehensive approach to AI infrastructure. By focusing on the critical path where CPUs and GPUs must work together seamlessly, Nvidia is addressing the specific needs of next-generation AI models while maintaining economic viability at scale.
The success of this strategy will depend on execution across hardware, software, and ecosystem partnerships, but the technical foundation appears solid. As agentic AI continues to evolve, Nvidia's integrated approach to CPU-GPU collaboration may prove essential for achieving the performance and efficiency required for large-scale deployment.