A new AWS paper promises 33% better throughput with 69% fewer routers, and the marketing language has people asking whether the leaf-and-spine fabric is finished. Ivan Pepelnjak's response is a careful reminder that physics doesn't bend to press releases, and that what AWS built is a thoughtful revival of an old idea rather than a break from it.
Every few years the networking world produces a paper that claims to have outgrown its own foundations. The latest arrives from AWS engineers, accompanied by a LinkedIn post and a blog write-up that reads like a victory lap: lean, resilient aggregation fabrics delivering 33% better throughput with 69% fewer routers, 27% lower costs, 40% less power, and a corresponding cut in CO2 emissions. The headline question writes itself. Is this the end of the leaf-and-spine network that has organized hyperscale data centers for more than a decade?
Ivan Pepelnjak's answer, delivered on the ipSpace.net blog, is direct: of course not. What makes his rebuttal worth reading is not the conclusion but the reasoning, because it separates a genuinely interesting engineering result from the promotional packaging wrapped around it.

The thesis: an old idea wearing new vocabulary
The central argument is that AWS rediscovered a design that a startup named Plexxi tried to commercialize years ago. Instead of building a spine layer that every leaf switch connects to, you wire leaf switches directly to one another. Plexxi first attempted this with CWDM optics, imagining dynamic leaf-to-leaf bandwidth, and later moved to a prewired middlebox. AWS calls its version a ShuffleBox. The shape of the problem is identical, and so is the shape of the solution.
The difficulty with connecting leaves directly is obvious once you picture it. At any given moment, some pairs of leaf switches exchange no traffic, yet they hold a direct link that sits idle while other links saturate. You waste capacity. Plexxi solved this with unequal-cost multipathing, routing some traffic along longer indirect paths rather than only the direct ones. The AWS material gives this the more evocative name Routing through Randomness. The technique is real and the relabeling is harmless, but it is relabeling.
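To make the idea of unequal-cost multipathing concrete, here is a minimal sketch of weighted, per-flow path selection. It is not Plexxi's or AWS's implementation; the function name `pick_path`, the path labels, and the 3:1 weights are all hypothetical, chosen only to show how a leaf could send most flows over a direct link while pushing a fixed share onto a longer detour.

```python
import hashlib


def pick_path(flow_id, paths):
    """Pick one path for a flow, in proportion to the path weights.

    paths is a list of (name, weight) pairs with non-negative integer
    weights. Hashing the flow id keeps every packet of a flow on the
    same path, which avoids reordering within the flow.
    """
    total = sum(weight for _, weight in paths)
    if total <= 0:
        raise ValueError("at least one path needs a positive weight")
    digest = hashlib.sha256(flow_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % total
    for name, weight in paths:
        if bucket < weight:
            return name
        bucket -= weight
    raise AssertionError("unreachable: bucket is always below total")


# Hypothetical weights: three shares on the direct link, one on a detour.
paths = [("direct", 3), ("via-leaf-C", 1)]
chosen = [pick_path(f"flow-{i}", paths) for i in range(10000)]
print("share on direct link:", chosen.count("direct") / len(chosen))
```

With equal weights this collapses to ordinary equal-cost multipathing; the unequal weights are what let a longer path carry some of the load without taking an equal share of it.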
The vocabulary substitution runs deeper than naming. A prewired box cannot be random in any meaningful sense, and that claim is where a careful reader's skepticism should activate. The likely mechanism is lane splitting: a 400GE uplink port is composed of four 100GE lanes, and the ShuffleBox hardwires a lane-to-lane matrix between switches. The pattern looks scrambled to a casual observer, the way the XKCD random number generator returns four because a fair die was rolled once. The arXiv paper is honest about this and calls the structure a Quasi-Random Graph, the product of an optimization process searching for the best partial mesh between N switches each having D uplinks. That precision survives in the academic text and evaporates in the blog posts, which is a predictable pattern when a hyperscaler publishes research that doubles as a recruiting signal.
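The phrase "an optimization process searching for the best partial mesh between N switches each having D uplinks" can be illustrated with a toy search. The sketch below is an assumption-laden stand-in, not the algorithm from the arXiv paper: it draws candidate wirings from a seeded random generator, rejects any with self-loops or parallel links, scores the rest by mean hop count between switches, and keeps the best one. The function names, the scoring metric, and the sizes (8 switches, 3 uplinks) are all hypothetical.

```python
import random
from collections import deque


def random_regular_wiring(n, d, rng):
    """Pair up n*d uplink stubs at random; return adjacency sets or None."""
    stubs = [switch for switch in range(n) for _ in range(d)]
    rng.shuffle(stubs)
    adj = [set() for _ in range(n)]
    for a, b in zip(stubs[::2], stubs[1::2]):
        if a == b or b in adj[a]:
            return None  # self-loop or parallel link: reject this attempt
        adj[a].add(b)
        adj[b].add(a)
    return adj


def mean_hops(adj):
    """Average shortest-path length over all ordered switch pairs."""
    n = len(adj)
    total = 0
    for src in range(n):
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        if len(dist) < n:
            return float("inf")  # disconnected wiring is useless
        total += sum(dist.values())
    return total / (n * (n - 1))


def best_wiring(n, d, attempts=2000, seed=1):
    """Keep the lowest-scoring valid wiring seen in a fixed number of tries."""
    rng = random.Random(seed)
    best, best_score = None, float("inf")
    for _ in range(attempts):
        adj = random_regular_wiring(n, d, rng)
        if adj is None:
            continue
        score = mean_hops(adj)
        if score < best_score:
            best, best_score = adj, score
    return best, best_score


if __name__ == "__main__":
    wiring, score = best_wiring(n=8, d=3)
    print("mean hops:", score)
    for switch, peers in enumerate(wiring):
        print(switch, sorted(peers))
```

The seed makes the point from the paragraph above: once the search finishes, the result is a fixed wiring table that gets built into the box. Randomness is used to find the pattern, not to operate it.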
The key argument: throughput claims meet conservation laws
The most substantial part of the analysis is a thought experiment about whether a partial mesh can genuinely beat a leaf-and-spine fabric on throughput. The answer, in an honest apples-to-apples comparison, is no, and the reasoning is clean enough to follow without simulation.
Build a leaf-and-spine fabric with N:1 oversubscription at the leaf, where N is commonly three, meaning the total bandwidth of edge ports is three times the total uplink bandwidth. Critically, the spine or superspine has no oversubscription, so the only congested resource in the entire fabric is the set of leaf switch uplinks. Traffic between any two endpoints on different leaves crosses exactly two leaf uplinks, one at the source leaf and one at the destination leaf, plus a non-oversubscribed core.
Now look at the Plexxi or AWS partial mesh. When two leaves have no direct link, or when their direct link is busy, traffic gets relayed through other leaves, which means it can cross more than two leaf uplinks on its journey. Since those uplinks are the congestion points, consuming extra uplink hops can only reduce total achievable throughput, not increase it. In an environment of many small flows, where load balancing works well, it is simply not possible for a partial mesh to exceed a leaf-and-spine fabric that has no core oversubscription. The traffic profile does not change this conclusion as long as the uplinks remain the bottleneck. This is conservation, not opinion.
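The uplink-counting argument can be turned into a back-of-the-envelope bound. The sketch below is a simplified model of my own, not something taken from the paper or from the ipSpace.net post: it assumes leaf uplinks are the only bottleneck, that load balancing is perfect, that a direct mesh link costs the same two uplink crossings as a leaf-and-spine path, and that every relay leaf adds two more. The function name and the relay shares are hypothetical.

```python
def relative_throughput_bound(relay_fraction, relay_leaves=1):
    """Upper bound on mesh throughput relative to leaf-and-spine (= 1.0).

    relay_fraction: share of traffic that cannot use a direct link.
    relay_leaves:   intermediate leaves a relayed flow passes through.
    Assumes uplinks are the only congested resource (simplified model).
    """
    if not 0.0 <= relay_fraction <= 1.0:
        raise ValueError("relay_fraction must be between 0 and 1")
    direct_crossings = 2                         # source and destination uplink
    relayed_crossings = 2 * (1 + relay_leaves)   # two more per relay leaf
    average = ((1 - relay_fraction) * direct_crossings
               + relay_fraction * relayed_crossings)
    return direct_crossings / average


for share in (0.0, 0.25, 0.5, 1.0):
    print(f"relayed share {share:.2f}: bound {relative_throughput_bound(share):.3f}")
```

Under these assumptions the bound never rises above 1.0, and it drops as soon as any traffic needs a relay, which is the arithmetic behind the claim that extra uplink hops can only cost throughput.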
So how did the paper produce graphs showing better throughput? The honest reading is that the comparison is not quite apples-to-apples, and the paper is vague enough that the exact setup is hard to reconstruct. The numbers appear to come from simulation, the source code is not published, and the fat-tree baselines are described without specifying their parameters. Several mundane explanations fit the results. The baseline spine layer may itself be oversubscribed. The baseline load balancing may be suboptimal, leaving some uplinks congested while others idle, a problem that has known fixes such as the one Cisco ACI reportedly uses. The AWS solution may use load balancing across virtual paths, giving it more alternate routes. It may use a routing algorithm sensitive to link load, or packet spraying that distributes a single session's packets across many paths, while the baseline does not. Any of these would explain the gap without invoking new physics.
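One of those explanations, per-flow hashing in the baseline versus packet spraying in the new design, is easy to illustrate. The following is a minimal sketch under stated assumptions: the flow names and packet counts are invented, `zlib.crc32` stands in for whatever hash a real switch uses, and spraying is modeled as plain round-robin. It says nothing about what either simulated fabric in the paper actually does.

```python
import zlib


def per_flow_loads(flows, uplinks):
    """Pin every flow to one uplink by hashing its name."""
    loads = [0] * uplinks
    for name, packets in flows.items():
        loads[zlib.crc32(name.encode()) % uplinks] += packets
    return loads


def sprayed_loads(flows, uplinks):
    """Spread the packets of all flows round-robin across the uplinks."""
    loads = [0] * uplinks
    next_uplink = 0
    for packets in flows.values():
        for _ in range(packets):
            loads[next_uplink] += 1
            next_uplink = (next_uplink + 1) % uplinks
    return loads


# Hypothetical traffic mix: two large flows and a handful of small ones.
flows = {"backup-a": 900, "backup-b": 900, "web-1": 20, "web-2": 20, "dns": 5}
print("per-flow hashing:", per_flow_loads(flows, uplinks=4))
print("packet spraying: ", sprayed_loads(flows, uplinks=4))
```

Per-flow hashing can never load its busiest uplink with less than the largest single flow, while spraying keeps all uplinks within one packet of each other. Spraying can deliver a flow's packets out of order, so the receiving end has to tolerate that; a comparison that gives this ability to only one side is not measuring topology.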
Implications: power and scale, not throughput
None of this means the work is worthless, and Pepelnjak is careful to say so. The honest case for the AWS design is not throughput; it is power and capital. A ShuffleBox is most likely a passive element, a box of wires rather than a box of silicon with fans and power supplies. Removing an active spine layer removes the energy that layer consumed and the routers you had to buy and cool. The 40% power reduction is plausible precisely because it does not depend on beating physics, only on deleting hardware.
There is also an old piece of practical advice hiding in here. For small fabrics, connecting four leaf switches into a full mesh has always been more sensible than standing up a dedicated spine. The AWS work is that instinct pushed to enormous scale, with optimization theory finding good partial meshes for thousands of nodes where a human cannot reason about the topology by hand.
Counter-perspectives and the limits of the result
The analysis closes with appropriate humility. By Clarke's First Law, a distinguished expert who declares something impossible is very probably wrong, and the author explicitly invites correction if he has missed an obvious mechanism. That openness matters, because the throughput argument rests on the assumption that uplinks stay the only congestion point, and a sufficiently clever encapsulation or traffic-engineering scheme could change which resources are scarce. If you want utilization beyond what unequal-cost multipathing delivers, you need real traffic engineering, which means virtual circuits and an extra encapsulation layer, whether that layer is built from MAC addresses, MPLS, SRv6, or, as the post jokes, carrier pigeons.
The more grounded caution concerns who should care. Enterprises almost certainly should not. Plexxi tried this commercially and went nowhere, and there may be a structural reason for that. When Fortune 50 companies can build two data centers with fewer than a dozen switches, optimizing the fabric topology is not where anyone's time is well spent. The design earns its keep only at the scale where you are deploying tens of thousands of switches and a 40% power difference becomes a line item visible from the executive floor.
That is the quiet lesson underneath the marketing. The AWS result is real engineering aimed at a problem almost no one else has, dressed in language that implies it overturns a principle it actually respects. Reading the arXiv paper instead of the blog post reveals a Quasi-Random Graph, not a break with the laws that govern congested links. The leaf-and-spine fabric is not going anywhere. It has simply been joined, at the very top of the scale curve, by a passive-wiring cousin that gives up some throughput headroom in exchange for a lot of saved electricity, which for a company running fabrics that large is a trade well worth making.
