A Reddit user assembled a Xeon workstation with six 128 GB Optane DCPMM modules, running them in memory mode to host the 1‑trillion‑parameter Kimi K2.5 model. The hybrid CPU‑GPU inference pipeline delivered roughly 4 tokens/s, suggesting that cheap, second‑hand Optane can bridge the DRAM‑SSD gap for large‑scale LLM inference even though the product line has been discontinued.
768 GB of Intel Optane Persistent Memory Powers a Trillion‑Parameter LLM on a Single GPU
A post on the Local LLaMA subreddit sparked a wave of discussion after user APFrisco demonstrated that a workstation built around Intel’s discontinued Optane Persistent Memory (PMem) could run a 1‑trillion‑parameter language model—Kimi K2.5—at roughly 4 tokens per second using only a single RTX 3060 GPU.
Announcement
- Goal: Run a trillion‑parameter LLM locally without a multi‑GPU server farm.
- Key hardware: Six Intel Optane DCPMM sticks (128 GB each, total 768 GB) installed in memory mode.
- Result: ~4 t/s inference speed on a Xeon Gold 6246 + RTX 3060, with a total system cost well under the price of an equivalent‑capacity DRAM build.
The experiment highlights a niche use‑case for Optane: providing a large, byte‑addressable pool that is faster than NVMe SSDs but far cheaper than DRAM.
Technical Specification
| Component | Model | Quantity | Capacity / Specs |
|---|---|---|---|
| CPU | Intel Xeon Gold 6246 | 1 | 12 cores / 24 threads, 3.3 GHz base, 4.2 GHz boost |
| Motherboard | Tyan S5630GMRE‑CGN | 1 | Supports 6‑channel DDR4, 2 × PCIe 3.0 x16 |
| GPU | ASUS Dual GeForce RTX 3060 OC 12 GB | 1 | CUDA 12, 12 GB GDDR6 |
| DRAM | Samsung DDR4‑2666 ECC 32 GB | 6 | 192 GB total, used as cache for Optane |
| Persistent Memory | Intel Optane DCPMM PC4‑2666 NMA1XBD128GQS | 6 | 768 GB total, latency ~300 ns, bandwidth ~45 GB/s |
| Storage | WD SN850X 2 TB NVMe | 1 | 7 GB/s sequential read |
| PSU | ASRock Steel Legend SL‑850G 850 W 80 PLUS Gold | 1 | Fully modular |
| Case | Silverstone Grandia Series HTPC | 1 | Compact, good airflow |
Optane in Memory Mode
In memory mode, the Optane modules appear as regular system memory to the OS. The DRAM acts as a volatile cache, holding the most frequently accessed pages. This arrangement yields:
- Effective capacity: 768 GB usable for the model’s weights and activations. The 192 GB of DRAM serves as cache and does not add to the total the OS sees.
- Latency: ~300 ns, roughly 3× slower than DRAM (≈100 ns) but 10‑17× faster than the fastest NVMe SSDs (3‑5 µs).
- Bandwidth: ~45 GB/s, sufficient to keep the CPU feeding the GPU when the model is sliced across the two.
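Token generation from weights held in main memory is largely bandwidth-bound, so the bandwidth figure sets a rough ceiling on throughput. The sketch below is a back-of-the-envelope estimate, not a measurement. The active-parameter count and quantization width are illustrative assumptions that the article does not state, and the function name is hypothetical.

```python
def bandwidth_bound_tokens_per_s(active_params: float,
                                 bits_per_weight: float,
                                 bandwidth_gb_s: float) -> float:
    """Upper bound on tokens/s if every active weight is read once per token."""
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


# Assumed values for illustration only (not from the article):
# 32e9 active parameters per token and 4-bit quantization.
# 45 GB/s is the Optane bandwidth quoted in the list above.
estimate = bandwidth_bound_tokens_per_s(32e9, 4, 45)
print(f"~{estimate:.1f} tokens/s ceiling from Optane bandwidth alone")
```

Hits in the DRAM cache and tensors resident in GPU memory do not consume Optane bandwidth, so measured throughput can sit above this simple bound.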
Software Stack
- Framework: `llama.cpp` compiled with AVX2/AVX512 support.
- Inference mode: Hybrid CPU‑GPU. The main transformer layers run on the Xeon, while the routing (Mixture‑of‑Experts) heads are forced onto the GPU via the `--override-tensor` flag.
- Model: Kimi K2.5 (1 T parameters, MoE architecture). The MoE design reduces the per‑token compute cost because only a subset of experts is activated per token, making the hybrid approach viable on modest hardware.
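The article does not give the exact command line, so the following is only a sketch of how such a launch might be assembled. It relies on `llama.cpp` flags that exist (`--model`, `--n-gpu-layers`, `--threads`, `--override-tensor`), but the model path, the tensor-name pattern, the buffer type and the numeric values are all hypothetical placeholders, as is the helper name `build_llama_cmd`.

```python
import shlex


def build_llama_cmd(model_path: str, tensor_overrides: dict[str, str],
                    gpu_layers: int, threads: int) -> list[str]:
    """Assemble a llama-cli argument list with per-tensor buffer overrides."""
    cmd = ["llama-cli", "--model", model_path,
           "--n-gpu-layers", str(gpu_layers),
           "--threads", str(threads)]
    for pattern, buffer_type in tensor_overrides.items():
        # Each override maps a tensor-name regex to a backend buffer type.
        cmd += ["--override-tensor", f"{pattern}={buffer_type}"]
    return cmd


# Hypothetical values: none of these come from the original post.
cmd = build_llama_cmd(
    model_path="/models/kimi-k2.5.gguf",          # placeholder path
    tensor_overrides={r"ffn_gate_inp": "CUDA0"},  # assumed router-tensor pattern
    gpu_layers=0,                                 # placeholder
    threads=12,                                   # one thread per physical core
)
print(shlex.join(cmd))
```

The script only prints the command, so it can be inspected before anything is launched. Which tensors belong on which device depends on the model and the available VRAM.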
Performance Breakdown
| Stage | Device | Approx. Time per Token |
|---|---|---|
| Embedding lookup (weights in Optane) | CPU (cached in DRAM) | 0.12 s |
| Main transformer pass | CPU | 0.55 s |
| MoE routing + expert computation | GPU (12 GB) | 0.23 s |
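| Total | n/a | ~0.90 s ≈ 1.1 t/s |
The listed stage times sum to about 0.90 s per token, or roughly 1.1 t/s, which is lower than the ~4 t/s headline figure, so the breakdown is best read as an indication of relative cost per stage. The bottleneck remains the latency and bandwidth of pulling large weight matrices from Optane, but the DRAM cache mitigates repeated accesses. The GPU handles the most parallelizable portion, keeping overall throughput respectable for a single‑GPU setup.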
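Throughput is the reciprocal of per-token latency, so a stage breakdown can be checked by summing the stages and inverting. A minimal sketch using the table's figures (the variable names are illustrative):

```python
# Per-token stage times in seconds, copied from the table above.
stage_seconds = {
    "embedding_lookup": 0.12,
    "main_transformer_pass": 0.55,
    "moe_routing_and_experts": 0.23,
}

total_seconds = sum(stage_seconds.values())
tokens_per_second = 1.0 / total_seconds

print(f"total: {total_seconds:.2f} s/token, {tokens_per_second:.2f} tokens/s")
for name, seconds in stage_seconds.items():
    # Share of the per-token time spent in each stage.
    print(f"{name}: {seconds / total_seconds:.0%} of per-token time")
```

With these inputs the CPU transformer pass accounts for roughly 60% of the per-token time, which is where a faster memory tier would help most.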
Market Implications
Cost Efficiency
A comparable DRAM configuration (six 128 GB DDR4 sticks) would cost $3,200‑$3,600 at current market rates, whereas the used Optane modules were sourced for ≈$800 total. That works out to a price‑per‑GB advantage of roughly 4‑4.5×.
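The price-per-GB comparison can be reproduced from the figures quoted above. This is a minimal sketch; the helper name is illustrative and the prices are the article's estimates, not independently verified.

```python
CAPACITY_GB = 6 * 128  # six 128 GB modules


def price_per_gb(total_price_usd: float, capacity_gb: int = CAPACITY_GB) -> float:
    """Price per gigabyte for a given total price and capacity."""
    return total_price_usd / capacity_gb


optane = price_per_gb(800)
dram_low, dram_high = price_per_gb(3200), price_per_gb(3600)

print(f"Optane: ${optane:.2f}/GB")
print(f"DRAM:   ${dram_low:.2f}-${dram_high:.2f}/GB")
print(f"Ratio:  {dram_low / optane:.1f}x to {dram_high / optane:.1f}x")
```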
Supply‑Chain Context
Intel announced in July 2022 that it was winding down its Optane business, including Optane DC Persistent Memory. Existing inventory is being liquidated through secondary markets, creating a brief window where hobbyists and small‑scale labs can acquire large byte‑addressable pools at a fraction of the price. This experiment demonstrates a practical, if temporary, use‑case for that inventory.
Competing Technologies
- NVMe‑based “storage class memory” (e.g., Samsung Z‑SSD) offers lower latency than traditional SSDs but still lags behind Optane’s sub‑microsecond response.
- CXL‑based memory expanders are expected to appear in 2025‑2026, promising up to 2 TB of affordable, byte‑addressable memory per socket. Until those products are widely available, second‑hand Optane remains one of the few obtainable options that sit between DRAM and SSD in the memory hierarchy.
Implications for LLM Deployment
- Hybrid inference pipelines can now be built on workstation‑class hardware, reducing the barrier to entry for research groups lacking cloud credits.
- Memory‑mode Optane provides a practical path for models that exceed DRAM capacity but do not yet require a full‑scale GPU cluster.
- The experiment reinforces the importance of Mixture‑of‑Experts designs: by activating only a fraction of the total parameters per token, the compute load becomes tractable on limited GPUs.
Outlook
While the Optane story is winding down, the broader lesson is clear: the industry needs a scalable, cost‑effective memory tier between DRAM and SSD. CXL memory fabrics, persistent memory modules from other vendors, and emerging HBM‑based pool solutions are all being positioned to fill that gap.
For practitioners looking to replicate APFrisco’s build today, the recipe is:
- Acquire used Optane DCPMM modules (e.g., via eBay or surplus dealers).
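- Pair them with a server‑grade Xeon platform that supports memory mode. The 100‑series modules used here require 2nd Gen Xeon Scalable (Cascade Lake) CPUs, with DRAM installed alongside to act as cache.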
- Use a modest GPU (RTX 3060‑3070 class) and a framework like `llama.cpp` that can split work between CPU and GPU.
- Choose an MoE‑style model so that per‑token compute stays tractable and the GPU‑resident tensors fit in VRAM.
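One quick sanity check after following these steps, assuming a Linux host: in memory mode the OS should report a total close to the Optane capacity rather than the DRAM capacity. The sketch below parses `/proc/meminfo`. The function names and the simple range heuristic are illustrative assumptions, not part of the original build.

```python
import re


def mem_total_gib(meminfo_text: str) -> float:
    """Return MemTotal from /proc/meminfo-style text, in GiB."""
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemTotal line not found")
    return int(match.group(1)) / (1024 ** 2)


def looks_like_memory_mode(meminfo_text: str, dram_gib: float,
                           pmem_gib: float) -> bool:
    """Heuristic: visible memory exceeds the DRAM size but not the PMem size."""
    total = mem_total_gib(meminfo_text)
    return dram_gib < total <= pmem_gib


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as f:
            text = f.read()
        print(f"MemTotal: {mem_total_gib(text):.0f} GiB")
        # 192 GiB DRAM and 768 GiB Optane, as in the build described above.
        print("memory mode plausible:", looks_like_memory_mode(text, 192, 768))
    except FileNotFoundError:
        print("/proc/meminfo not available on this system")
```

The kernel reserves some memory, so the reported total is typically a little below the installed capacity, which is why the check uses a range and not an exact match.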
If the community can continue to share configuration tweaks and performance data, the “large‑model‑on‑a‑desk” niche may evolve from a curiosity into a viable research platform—at least until the next generation of CXL‑enabled memory arrives.
Mark Tyson covers semiconductor trends, PC hardware, and the intersection of AI workloads with system architecture. Follow his reporting for deeper analysis of emerging memory technologies.
