Amazon’s new Resilient Network Graph (RNG) architecture replaces traditional fat‑tree topologies with a quasi‑random mesh, cutting the number of switches needed by roughly two‑thirds, raising raw throughput by up to one‑third, and reducing network power draw by 40 %. The rollout, now the default fabric for most newly built AWS data centers, promises billions in cap‑ex savings and a measurable reduction in CO₂ emissions.
Announcement
Amazon Web Services announced that its Resilient Network Graph (RNG) architecture has become the default networking fabric for the majority of newly built AWS data centers. The company says RNG delivers up to 33 % higher throughput while using 69 % fewer switches and routers than the conventional fat‑tree designs that have powered cloud networks for the past two decades. Early deployments in Dublin, Frankfurt and Madrid have already shown 40 % lower network power consumption and up to 45 % lower infrastructure cost.
Technical specifications
1. Topology shift
- Traditional fat‑tree – a hierarchical arrangement of edge, aggregation and core layers. Traffic is forced through a limited set of uplinks, creating hot spots even when spare capacity exists elsewhere.
- RNG – a quasi‑random graph where each leaf switch connects to a large set of peer switches (average degree ≈ 12 in the current implementation). The resulting mesh offers hundreds of distinct paths between any two servers, dramatically increasing path diversity.
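To make the topology idea concrete, here is a minimal sketch of a quasi‑random mesh with an average degree of 12, using only the Python standard library. It is not AWS's construction: the ring used to guarantee connectivity, the 200‑switch size, and the names `random_mesh` and `mean_hops` are all assumptions chosen for illustration.

```python
import random
from collections import deque

def random_mesh(n_switches, degree, seed=0):
    """Build a quasi-random graph whose average degree equals `degree`."""
    rng = random.Random(seed)
    adj = {s: set() for s in range(n_switches)}
    # Start from a ring so the graph is guaranteed to be connected (assumption).
    for s in range(n_switches):
        t = (s + 1) % n_switches
        adj[s].add(t)
        adj[t].add(s)
    # Add random links until the total link count hits n * degree / 2.
    target_links = n_switches * degree // 2
    links = n_switches
    while links < target_links:
        a, b = rng.sample(range(n_switches), 2)
        if b not in adj[a]:
            adj[a].add(b)
            adj[b].add(a)
            links += 1
    return adj

def mean_hops(adj):
    """Average shortest-path length in hops over all switch pairs (BFS)."""
    total = pairs = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

mesh = random_mesh(n_switches=200, degree=12)
avg_degree = sum(len(peers) for peers in mesh.values()) / len(mesh)
print(f"average degree: {avg_degree:.1f}, mean hops: {mean_hops(mesh):.2f}")
```

Even this toy version shows why random graphs appeal to network designers: with a dozen links per switch, most pairs of switches end up only a few hops apart, and there is no core layer for traffic to funnel through.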
2. Routing protocol – Spraypoint
- Built on a custom extension of ECMP (Equal‑Cost Multi‑Path) that evaluates all viable paths rather than only the shortest‑hop set.
- Uses per‑flow hashing combined with real‑time congestion feedback to spread traffic across the mesh, achieving a load‑balancing efficiency of 92 % versus ~70 % for classic ECMP.
- The protocol is implemented in the AWS‑owned Spraypoint software stack, running on the same ASICs that power the Switch‑X line‑cards.
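AWS has not published Spraypoint's internals in the material summarized here, so the following is only a sketch of how per‑flow hashing and congestion feedback can be combined. It swaps in the well‑known "power of two choices" technique: hash each flow onto two candidate paths, then take the less congested one. The function names, the 5‑tuple layout, and the `congestion` map are hypothetical.

```python
import hashlib

def flow_hash(flow, salt):
    """Stable hash of a flow 5-tuple; different salts give independent hashes."""
    key = f"{salt}|{'|'.join(map(str, flow))}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big")

def pick_path(flow, paths, congestion):
    """Hash the flow onto two candidate paths and return the less congested one.

    `congestion` maps a path to its current utilization (0.0 to 1.0); paths
    without an entry are treated as idle. This is an illustrative stand-in,
    not the Spraypoint algorithm.
    """
    first = paths[flow_hash(flow, 0) % len(paths)]
    second = paths[flow_hash(flow, 1) % len(paths)]
    return min((first, second), key=lambda p: congestion.get(p, 0.0))

# Hypothetical flow: (src IP, dst IP, protocol, src port, dst port).
flow = ("10.0.0.1", "10.0.1.7", 6, 49152, 443)
# Hypothetical candidate paths between two leaf switches.
paths = [("leaf1", f"mesh{i}", "leaf9") for i in range(8)]
print(pick_path(flow, paths, congestion={}))
```

As long as the congestion readings do not change, a flow always maps to the same path, which avoids packet reordering, while the second candidate gives the fabric an escape route when the first choice is hot.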
3. Optical interconnect – ShuffleBox
- A passive optical back‑plane that aggregates up to 1,024 fibers into a single modular panel. The design eliminates the need for individual fiber splicing at each switch, reducing cabling labor by an estimated 80 %.
- ShuffleBox operates at 400 Gbps per lane using PAM‑4 modulation, giving a raw fabric bandwidth of ≈ 400 Tbps per rack.
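The ≈ 400 Tbps figure can be reproduced from the panel's fiber count and per‑lane rate. The quick check below assumes one 400 Gbps lane per fiber and one fully populated panel per rack; neither assumption is stated in the announcement.

```python
FIBERS_PER_PANEL = 1024   # from the ShuffleBox description
GBPS_PER_LANE = 400       # from the ShuffleBox description
LANES_PER_FIBER = 1       # assumption: one lane per fiber

raw_tbps = FIBERS_PER_PANEL * LANES_PER_FIBER * GBPS_PER_LANE / 1000
print(f"raw bandwidth per panel: {raw_tbps:.1f} Tbps")  # 409.6 Tbps
```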
4. Device count and power
| Metric | Fat‑tree (baseline) | RNG (deployed) | Δ |
|---|---|---|---|
| Switches per rack | 12 | 4 | ‑67 % |
| Power per rack (network) | 7.5 kW | 4.5 kW | ‑40 % |
| Capital cost per rack | $120 k | $66 k | ‑45 % |
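The Δ column can be recomputed from the baseline and deployed values. The helper below is a minimal sketch; the `pct_change` name and the row labels are illustrative.

```python
def pct_change(baseline, deployed):
    """Relative change from baseline to deployed, in percent."""
    return (deployed - baseline) / baseline * 100

# Values taken from the table above.
rows = {
    "Switches per rack": (12, 4),
    "Power per rack (kW)": (7.5, 4.5),
    "Capital cost per rack ($k)": (120, 66),
}
for name, (baseline, deployed) in rows.items():
    print(f"{name}: {pct_change(baseline, deployed):+.1f} %")
```

Going from 12 switches to 4 is a drop of 66.7 %, which is where the "roughly two‑thirds" headline figure comes from.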
5. Performance impact
- Throughput: Benchmarks using 100 Gbps TCP streams show a 33 % increase in sustained bandwidth per server pair.
- Latency: Median round‑trip latency drops from 1.2 µs to 0.9 µs under load, thanks to shorter hop counts and reduced queuing.
- Failure resilience: Simulated failures of up to 15 % of switches result in less than 2 % throughput loss, consistent with the mesh’s inherent redundancy.
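A rough feel for the resilience claim can be had from a toy experiment: build a random graph with an average degree of 12, knock out 15 % of the switches at random, and check how many survivors can still reach each other. This sketch measures connectivity only, not throughput, and the 500‑switch size, the seed, and the function names are assumptions.

```python
import random

def build_mesh(n, degree, rng):
    """Random graph with n * degree / 2 distinct links (average degree = degree)."""
    links = set()
    while len(links) < n * degree // 2:
        a, b = rng.sample(range(n), 2)
        links.add((min(a, b), max(a, b)))
    adj = {s: set() for s in range(n)}
    for a, b in links:
        adj[a].add(b)
        adj[b].add(a)
    return adj

def largest_component_share(adj, failed):
    """Share of surviving switches that sit in the largest connected component."""
    alive = set(adj) - failed
    seen, best = set(), 0
    for start in alive:
        if start in seen:
            continue
        # Depth-first walk over surviving switches reachable from `start`.
        stack, size = [start], 0
        seen.add(start)
        while stack:
            u = stack.pop()
            size += 1
            for v in adj[u]:
                if v in alive and v not in seen:
                    seen.add(v)
                    stack.append(v)
        best = max(best, size)
    return best / len(alive)

rng = random.Random(42)
mesh = build_mesh(n=500, degree=12, rng=rng)
failed = set(rng.sample(range(500), k=75))  # 15 % of 500 switches
share = largest_component_share(mesh, failed)
print(f"surviving switches still mutually reachable: {share:.1%}")
```

Because every switch has many peers, losing a random 15 % of them rarely strands the remaining ones, which is the intuition behind the small throughput loss reported above.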
Market implications
Cost structure shift
The 69 % reduction in switch count translates directly into a lower bill‑of‑materials for every new AWS region. At the table’s 12 versus 4 switches per rack, a 1,000‑rack data center needs 8,000 fewer switches; assuming an average switch price of $10 k, that is about $80 M in switch hardware alone. Adding the 40 % power savings (≈ $2 M per year in electricity for a typical hyperscale pod) puts the five‑year TCO reduction at roughly $90 M.
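The arithmetic behind this kind of estimate is easy to reproduce. The sketch below derives the hardware term from the per‑rack switch counts in the device table (12 versus 4) and the assumed $10 k switch price, then adds five years of the ≈ $2 M annual electricity saving. Applying that electricity figure to the same 1,000‑rack site is an assumption, and the function names are illustrative.

```python
def switch_hardware_saving(racks, before_per_rack, after_per_rack, price_usd):
    """Hardware cost avoided by deploying fewer switches per rack."""
    return racks * (before_per_rack - after_per_rack) * price_usd

def multi_year_saving(hardware_usd, power_usd_per_year, years=5):
    """One-off hardware saving plus recurring electricity saving."""
    return hardware_usd + power_usd_per_year * years

hardware = switch_hardware_saving(
    racks=1_000, before_per_rack=12, after_per_rack=4, price_usd=10_000
)
total = multi_year_saving(hardware, power_usd_per_year=2_000_000)
print(f"hardware: ${hardware / 1e6:.0f} M, five-year total: ${total / 1e6:.0f} M")
# hardware: $80 M, five-year total: $90 M
```

The result is sensitive to the inputs: a different switch price or rack count scales the hardware term linearly, so the output is best read as an order‑of‑magnitude check rather than a forecast.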
Competitive pressure on networking vendors
- Broadcom, Mellanox (NVIDIA) and Cisco have long supplied the ASICs and chassis for fat‑tree fabrics. RNG’s reliance on a custom protocol and passive optical modules means those vendors must either license Spraypoint or develop competing mesh‑ready stacks.
- Early signals suggest a 10‑15 % price dip in conventional top‑of‑rack switches as hyperscalers re‑evaluate spend.
Environmental impact
AWS estimates that the network power cut will avoid ≈ 2 MtCO₂e per year across its global footprint, aligning with the company’s pledge to reach net‑zero carbon for its infrastructure by 2040.
Influence on other cloud providers
Google Cloud and Microsoft Azure have publicly explored random‑graph concepts in academic papers, but have not announced production rollouts. The RNG deployment provides a concrete reference point; we can expect pilot projects from both firms within the next 12‑18 months, especially as AI workloads demand ever‑higher inter‑node bandwidth.
Outlook
RNG demonstrates that graph‑theoretic networking is no longer a research curiosity but a production‑ready architecture capable of delivering measurable cost and performance gains at scale. The key takeaways for enterprises and cloud‑focused engineers are:
- Network topology matters as much as CPU/GPU speed when scaling AI and data‑intensive workloads.
- Software‑defined routing (Spraypoint) can unlock the potential of dense optical meshes without sacrificing predictability.
- Passive optical infrastructure (ShuffleBox) reduces both CAPEX and OPEX, a compelling proposition for any hyperscale operator.
As AWS continues to retrofit existing regions and build new RNG‑enabled pods, the industry will watch closely to see whether the random‑graph model becomes the new baseline for hyperscale data center design.
*For deeper technical details, see the AWS blog post on Resilient Network Graphs and the open‑source Spraypoint protocol specification.*
