How 768 GB of Intel Optane Persistent Memory Enabled a Single‑GPU Trillion‑Parameter LLM Run

Chips Reporter

A Reddit user assembled a Xeon workstation with six 128 GB Optane DCPMM modules configured in memory mode to host the 1-trillion-parameter Kimi K2.5 model. The hybrid CPU-GPU inference pipeline delivered roughly 4 tokens/s, showing that cheap, second-hand Optane can bridge the DRAM-SSD gap for large-model inference even though Intel has discontinued the product line.

Image credit: Lenovo

A post on the Local LLaMA subreddit sparked a wave of discussion after user APFrisco demonstrated that a workstation built around Intel’s discontinued Optane Persistent Memory (PMem) could run a 1‑trillion‑parameter language model—Kimi K2.5—at roughly 4 tokens per second using only a single RTX 3060 GPU.


Announcement

  • Goal: Run a trillion‑parameter LLM locally without a multi‑GPU server farm.
  • Key hardware: Six Intel Optane DCPMM sticks (128 GB each, total 768 GB) installed in memory mode.
  • Result: ~4 t/s inference speed on a Xeon Gold 6246 + RTX 3060, with a total system cost well under the price of an equivalent‑capacity DRAM build.

The experiment highlights a niche use‑case for Optane: providing a large, byte‑addressable pool that is faster than NVMe SSDs but far cheaper than DRAM.


Technical Specification

Component | Model | Quantity | Capacity / Specs
CPU | Intel Xeon Gold 6246 | 1 | 12 cores / 24 threads, 3.3 GHz base, 4.2 GHz boost
Motherboard | Tyan S5630GMRE-CGN | 1 | Supports 6-channel DDR4, 2 × PCIe 3.0 x16
GPU | ASUS Dual GeForce RTX 3060 OC 12 GB | 1 | CUDA 12, 12 GB GDDR6
DRAM | Samsung DDR4-2666 ECC 32 GB | 6 | 192 GB total, used as cache for Optane
Persistent Memory | Intel Optane DCPMM PC4-2666 NMA1XBD128GQS | 6 | 768 GB total, latency ~300 ns, bandwidth ~45 GB/s
Storage | WD SN850X 2 TB NVMe | 1 | Up to ~7 GB/s sequential read (rated)
PSU | ASRock Steel Legend SL-850G 850 W 80 PLUS Gold | 1 | Fully modular
Case | Silverstone Grandia Series HTPC | 1 | Compact, good airflow

Optane in Memory Mode

In memory mode, the Optane modules appear as regular system memory to the OS. The DRAM acts as a volatile cache, holding the most frequently accessed pages. This arrangement yields:

  • Effective capacity: 768 GB usable for the model’s weights and activation maps.
  • Latency: ~300 ns, roughly 3× slower than DRAM (≈100 ns) but far faster than NAND-based NVMe SSDs, whose read latencies are typically in the tens of microseconds.
  • Bandwidth: ~45 GB/s, sufficient to keep the CPU feeding the GPU when the model is sliced across the two.
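
To put the bandwidth figure in context, here is a rough back-of-the-envelope sketch of the decode speed that memory bandwidth alone would allow. Only the ~45 GB/s figure comes from the article; the ~32 billion active parameters per token and the ~4-bit weight quantization are assumptions made for illustration, and the function name is hypothetical.

```python
def bandwidth_bound_tokens_per_s(bandwidth_gb_s, active_params_billion, bits_per_weight):
    """Upper bound on decode speed if every active weight is read once per token."""
    bytes_per_token = active_params_billion * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


# Assumptions (not from the post): ~32 B active parameters per token and
# ~4-bit quantized weights. Only the 45 GB/s figure is quoted in the article.
estimate = bandwidth_bound_tokens_per_s(45, 32, 4)
print(f"{estimate:.1f} tokens/s")  # prints "2.8 tokens/s"
```

Under these assumptions the estimate lands in the same low single-digit range as the reported result. Real throughput can sit above or below it, because the DRAM cache and any GPU-resident tensors mean that not every active weight is fetched from Optane on every token.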

Software Stack

  • Framework: llama.cpp compiled with AVX2/AVX512 support.
  • Inference mode: Hybrid CPU‑GPU. As described, the main transformer layers run on the Xeon, while the Mixture‑of‑Experts routing heads are pinned to the GPU with llama.cpp's --override-tensor flag, which assigns tensors whose names match a pattern to a chosen device buffer.
  • Model: Kimi K2.5 (1 T parameters, MoE architecture). The MoE design reduces the per‑token compute cost because only a subset of experts is activated per token, making the hybrid approach viable on modest hardware.
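
As a rough illustration of how such a split is expressed, the sketch below assembles a llama.cpp command line with per-tensor placement overrides. The -m, --n-gpu-layers and --override-tensor options are real llama.cpp flags, but the helper function, the model path, and the specific pattern are hypothetical. The pattern shown (keeping expert feed-forward tensors in system memory) is a common community choice and is not the command from the post.

```python
import shlex


def build_llama_command(model_path, overrides, n_gpu_layers=99, binary="llama-cli"):
    """Assemble a llama.cpp command line with per-tensor placement overrides.

    `overrides` maps a tensor-name pattern to a backend buffer type such as
    "CPU" or "CUDA0"; each pair becomes one --override-tensor argument.
    """
    cmd = [binary, "-m", model_path, "--n-gpu-layers", str(n_gpu_layers)]
    for pattern, buffer_type in overrides.items():
        cmd += ["--override-tensor", f"{pattern}={buffer_type}"]
    return cmd


# Hypothetical path and pattern, for illustration only.
cmd = build_llama_command("/models/kimi-k2.5.gguf", {r"\.ffn_.*_exps\.": "CPU"})
print(shlex.join(cmd))
```

Which tensors end up on which device is a tuning decision. With only 12 GB of VRAM, the split has to leave the bulk of the weights in system memory.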

Performance Breakdown

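Stage | Device | Approx. time per token
Embedding lookup (weights in Optane) | CPU (cached in DRAM) | 0.12 s
Main transformer pass | CPU | 0.55 s
MoE routing + expert computation | GPU (12 GB) | 0.23 s
Total (sum of stages) | | ~0.90 s

Run strictly in sequence, these stages would yield only about 1.1 t/s. The reported ~4 t/s corresponds to roughly 0.25 s per token, so the per-stage figures are best read as rough estimates of relative cost, or as stages that overlap heavily in practice.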

The bottleneck remains the cost of pulling large weight matrices from Optane, whose latency and bandwidth both trail DRAM, but the DRAM cache mitigates repeated accesses. The GPU handles the most parallelizable portion, keeping overall throughput respectable for a single‑GPU setup.
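
The relationship between per-stage times and throughput is simple arithmetic. The following sketch uses only the figures quoted above; the variable names are illustrative.

```python
# Per-token stage times quoted in the breakdown, in seconds.
stage_seconds = {
    "embedding lookup": 0.12,
    "main transformer pass": 0.55,
    "MoE routing + experts": 0.23,
}

sequential_time = sum(stage_seconds.values())  # seconds per token with no overlap
sequential_rate = 1.0 / sequential_time        # tokens per second

reported_rate = 4.0                            # headline figure from the post
reported_time = 1.0 / reported_rate            # seconds per token

print(f"sequential: {sequential_time:.2f} s/token -> {sequential_rate:.2f} t/s")
print(f"reported:   {reported_time:.2f} s/token -> {reported_rate:.2f} t/s")
```

The gap between the two lines indicates how much the stages would have to overlap, or how approximate the per-stage timings are, for the headline number to hold.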


Market Implications

Cost Efficiency

A comparable DRAM configuration (six 128 GB DDR4 sticks) would cost $3,200‑$3,600 at current market rates, whereas the used Optane modules were sourced for ≈$800 total. At equal capacity, that is a price‑per‑GB advantage of roughly 4 to 4.5×.
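
The per-gigabyte comparison can be worked through directly from the quoted prices. This is a minimal sketch and the variable names are illustrative.

```python
CAPACITY_GB = 6 * 128             # six 128 GB modules in either configuration

dram_low, dram_high = 3200, 3600  # quoted DRAM price range, USD
optane_total = 800                # quoted price for the used Optane modules, USD

optane_per_gb = optane_total / CAPACITY_GB
dram_per_gb_low = dram_low / CAPACITY_GB
dram_per_gb_high = dram_high / CAPACITY_GB

print(f"Optane: ${optane_per_gb:.2f}/GB")
print(f"DRAM:   ${dram_per_gb_low:.2f}-${dram_per_gb_high:.2f}/GB")
print(f"Ratio:  {dram_low / optane_total:.1f}x to {dram_high / optane_total:.1f}x")
```

With the quoted figures this works out to about $1.04 per GB for Optane against $4.17 to $4.69 per GB for DRAM.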

Supply‑Chain Context

Intel announced in 2022 that it was winding down its Optane business, and the persistent memory modules are no longer being developed. Existing inventory is being liquidated through secondary markets, creating a brief window where hobbyists and small‑scale labs can acquire large byte‑addressable pools at a fraction of the price of DRAM. This experiment demonstrates a practical, if temporary, use‑case for that inventory.

Competing Technologies

  • NVMe‑based “storage class memory” (e.g., Samsung Z‑SSD) offers lower latency than traditional SSDs but still lags behind Optane’s sub‑microsecond response.
  • CXL‑based memory expanders are expected to appear in 2025‑2026, promising up to 2 TB of affordable, byte‑addressable memory per socket. Until those products ship, Optane remains the only commercially available solution that sits between DRAM and SSD in the memory hierarchy.

Implications for LLM Deployment

  1. Hybrid inference pipelines can now be built on workstation‑class hardware, reducing the barrier to entry for research groups lacking cloud credits.
  2. Memory‑mode Optane provides a practical path for models that exceed DRAM capacity but do not yet require a full‑scale GPU cluster.
  3. The experiment reinforces the importance of Mixture‑of‑Experts designs: by activating only a fraction of the total parameters per token, the compute load becomes tractable on limited GPUs.

Outlook

While the Optane story is winding down, the broader lesson is clear: the industry needs a scalable, cost‑effective memory tier between DRAM and SSD. CXL memory fabrics, persistent memory modules from other vendors, and emerging HBM‑based pool solutions are all being positioned to fill that gap.

For practitioners looking to replicate APFrisco’s build today, the recipe is:

  1. Acquire used Optane DCPMM modules (e.g., via eBay or surplus dealers).
  2. Pair them with a server‑grade Xeon platform that supports memory‑mode.
  3. Use a modest GPU (RTX 3060‑3070 class) and a framework like llama.cpp that can split work between CPU and GPU.
  4. Choose an MoE‑style model, so that only a fraction of the weights is touched per token and both per‑token compute and memory traffic stay manageable on modest hardware.
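
Once the modules are provisioned, one quick sanity check is whether the operating system reports roughly the Optane capacity as total memory, since in memory mode the DRAM is hidden as a cache. The sketch below parses /proc/meminfo-style text on Linux. The function names, the sample value, and the 10% tolerance are assumptions for illustration.

```python
def mem_total_gib(meminfo_text):
    """Return MemTotal from /proc/meminfo-style text, converted from KiB to GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            kib = int(line.split()[1])
            return kib / (1024 ** 2)
    raise ValueError("MemTotal not found")


def looks_like_memory_mode(meminfo_text, expected_gib=768, tolerance=0.10):
    """True if reported memory is within `tolerance` of the expected Optane pool."""
    return mem_total_gib(meminfo_text) >= expected_gib * (1 - tolerance)


# Illustrative sample; on a real system, read the text from /proc/meminfo.
sample = "MemTotal:       790000000 kB\nMemFree:        650000000 kB\n"
print(f"{mem_total_gib(sample):.0f} GiB visible, memory mode plausible: "
      f"{looks_like_memory_mode(sample)}")
```

A machine that instead reports only the 192 GB of DRAM would suggest the modules are not configured in memory mode.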

If the community can continue to share configuration tweaks and performance data, the “large‑model‑on‑a‑desk” niche may evolve from a curiosity into a viable research platform—at least until the next generation of CXL‑enabled memory arrives.


Mark Tyson covers semiconductor trends, PC hardware, and the intersection of AI workloads with system architecture. Follow his reporting for deeper analysis of emerging memory technologies.
