The Memory‑Hungry Reality

Large language models, in‑memory databases, and data‑analytics platforms now routinely demand more memory bandwidth and capacity than a single CPU socket can deliver. Coherent fabrics promise to bridge that gap by attaching memory that remains cache‑coherent with the CPU, allowing applications to treat remote memory as if it were local.

A team of eight Ph.D. students at UC San Diego spent 1½ years measuring 13 server systems across three CPU vendors, three fabric types, and five device vendors. Their findings, published on arXiv and accompanied by an open‑source benchmarking suite, reveal that the promise of “just add more memory” is more nuanced than it sounds.

Key Takeaway

Coherent fabrics are not a drop‑in DRAM replacement; they are a new tier of memory that offers massive capacity at a latency and bandwidth cost that must be understood.

Rule #1 – Pin Your Workloads on Early Intel CPUs

On Intel Sapphire Rapids (SPR) and Emerald Rapids (EMR), CXL accesses issued from a remote socket are allotted only a fraction of that socket’s last‑level cache (LLC): effective LLC capacity drops to one‑eighth on SPR and one‑quarter on EMR. AMD Zen 4 and Intel Granite Rapids (GNR) do not exhibit this asymmetry.

Practical advice: use numactl or similar tools to bind CXL‑heavy threads to the local socket, especially on older Intel silicon.

Rule #2 – Expect Read‑Write Asymmetry

On AMD platforms, load (read) bandwidth scales with thread count, but store (write) bandwidth remains flat, even with many cores. This can cripple workloads that write heavily to CXL memory.

Practical advice: profile your application’s memory access pattern; consider restructuring write‑intensive phases or using local DRAM for hot write paths.

Rule #3 – CXL Can Reduce Overall Latency

Surprisingly, adding CXL memory to a system that already has saturated DRAM channels can lower average memory latency. The extra bandwidth prevents queuing delays on the DDR channels, pulling the overall latency down even though CXL accesses themselves are slower.

Practical advice: in a system where DRAM bandwidth is the bottleneck, a modest CXL expander can be a win, both in capacity and in smoothing latency spikes.

Rule #4 – Pick the Right CPU Microarchitecture

Bandwidth and latency characteristics differ markedly between CPUs. AMD CPUs tend to saturate CXL devices more readily, while Intel’s early generations (SPR/EMR) lag. GNR aligns with AMD in both bandwidth and latency.

Practical advice: match the fabric to the CPU generation. On Intel SPR/EMR, expect lower remote bandwidth; on GNR or AMD Zen 4, expect parity with local memory.

Rule #5 – Capacity Expansion Fuels AI‑Based Scientific Workloads

AlphaFold 3, for instance, can fail with out‑of‑memory errors on DIMM‑only systems when processing large RNA inputs. Adding CXL expanders enabled the full workload to run without code changes, with latency penalties outweighed by the capacity gain.

Practical advice: for AI or scientific workloads that are CPU‑bound but memory‑hungry, CXL is a low‑friction way to scale capacity.

Practical Steps for Architects

  1. Measure: run the open‑source benchmark suite on your target hardware to understand local vs. remote bandwidth and latency.
  2. Pin: use NUMA awareness tools to bind CXL‑heavy threads to the local socket.
  3. Profile: check for read/write asymmetry; adjust memory access patterns if needed.
  4. Scale: add CXL expanders only when DRAM bandwidth saturates; monitor queuing delays.
  5. Validate: test with real workloads (e.g., AlphaFold, LLM inference) to confirm that latency penalties remain acceptable.

The Road Ahead

Coherent fabrics are a cornerstone of tomorrow’s heterogeneous systems. As vendors refine the protocols and silicon, we can expect lower latencies and higher bandwidths. For now, the five rules distilled from UC San Diego’s extensive measurements provide a roadmap for developers and system architects to harness CXL, NVLink‑C2C, and Infinity Fabric without falling into common performance pitfalls.

Source: Zixuan Wang et al., “The Hitchhiker’s Guide to Coherent Fabrics: 5 Programming Rules for CXL, NVLink, and InfinityFabric,” SIGARCH, 1 Dec 2025. https://www.sigarch.org/the-hitchhikers-guide-to-coherent-fabrics-5-programming-rules-for-cxl-nvlink-and-infinityfabric/