Running large language models locally is expensive, but Apple’s M‑series chips offer a surprisingly capable alternative. Antirez outlines how DwarfStar can stitch together two or more Mac Studios or M5 Max laptops using layer‑wise or expert‑wise parallelism, and even explores model‑ensemble tricks that avoid heavy data movement. The post sketches three distribution strategies, weighs their practical trade‑offs, and points to recent research on LLM ensembles as a promising third path.
What happened
Antirez posted a detailed note on the DwarfStar forum describing the current state of local LLM inference on Apple silicon and why the community is starting to think about distributed inference. High‑end NVIDIA GPUs still dominate raw performance, but the cost of multiple GPUs, power, and cooling makes them prohibitive for most developers. Apple’s Mac Studio with the M3 Ultra (up to 512 GB unified memory) and the newer M5 Max (up to 128 GB) have emerged as the most affordable “big‑memory” platforms capable of running 2‑bit quantized models such as DeepSeek‑v4 PRO, Flash, or Mimo V2.5 at usable speeds.
Antirez measured roughly 150 t/s (tokens per second) pre‑fill and 10‑13 t/s decoding on an M3 Ultra for DeepSeek‑v4 PRO, while an M5 Max can push ≈500 t/s pre‑fill and 35‑40 t/s decoding on the same model when 2‑bit quantized. Those numbers are far from GPU‑cluster performance, but they are enough for many prototype or low‑throughput applications, and the hardware price (≈$6‑7k for a well‑configured Mac Studio) is dramatically lower than that of a single DGX‑Spark.
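To make those rates concrete, here is a rough back-of-the-envelope sketch in Python. The request size (6,000 prompt tokens, 700 output tokens) is a hypothetical example, not a figure from the post, and the decode rates use the low end of each quoted range.

```python
def request_seconds(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Rough wall-clock estimate: the prompt is processed at the pre-fill
    rate, then the output is generated at the decode rate."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps


# Rates quoted in the article (low end of each decode range).
M3_ULTRA = {"prefill_tps": 150.0, "decode_tps": 10.0}
M5_MAX = {"prefill_tps": 500.0, "decode_tps": 35.0}

# Hypothetical request: 6,000 prompt tokens in, 700 tokens out.
for name, rates in (("M3 Ultra", M3_ULTRA), ("M5 Max", M5_MAX)):
    secs = request_seconds(6000, 700, **rates)
    print(f"{name}: about {secs:.0f} s")
```

Under these assumptions the M3 Ultra needs about 110 s and the M5 Max about 32 s, which is why a single long-prompt request is workable but heavy concurrent traffic is not.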
Given those baselines, Antirez asks: What if we could combine two or more of these machines? The post sketches two classic distribution schemes and proposes a third, less explored, ensemble‑based approach.
Why developers care
- Cost vs. capability – Most indie teams and hobbyists cannot afford a multi‑GPU server. Apple silicon offers a sweet spot: enough unified memory (shared by CPU and GPU, so there is no separate VRAM pool) for 2‑bit quantized frontier models, decent bandwidth, and an architecture that simplifies software stacks.
- Scalability without NVLink – Apple’s RDMA over Ethernet is far slower than NVLink, so traditional tensor‑parallelism (splitting a single layer’s matrix multiply across devices) quickly becomes bandwidth‑bound. Understanding which parallelism patterns actually work on this hardware is essential for anyone building a locally‑hosted LLM service.
- Emerging research on model ensembles – A recent arXiv paper, LLM Ensembles: A Simple Yet Powerful Way to Boost Performance (https://arxiv.org/abs/2502.18036), shows that running two independent models on separate machines and merging their logits can improve perplexity without any inter‑layer communication. This opens a path to “shared‑nothing” scaling that sidesteps the bandwidth bottleneck entirely.
If you’re building a product that needs on‑device privacy or low‑latency responses, or you simply want to avoid cloud spend, these ideas could become the foundation of a practical inference pipeline.
Community response & practical takeaways
1. Layer‑wise (pipeline) parallelism
- How it works – Split the transformer by depth: the first half of the layers runs on Machine A, the second half on Machine B. On every forward pass, the activations (the hidden states) cross the network once, at the split point.
- Pros – Minimal data movement (only activations, not weights). Works even with 2‑bit quantized weights because each device stores its own half of the model.
- Cons – Decoding (generating a single token) remains sequential: you still wait for A → B → A … for every token, so latency does not improve much. Prefill (processing a long prompt) can be split into micro‑batches so that network transfers overlap with compute, but the gain is bounded by the slowest stage, network hop included.
- Community tip – Users report that a micro‑batch of 4‑8 prompts can keep both machines busy and achieve a modest pre‑fill speedup (≈1.3× on a 2‑machine Mac Studio setup). The trick is to overlap the network transfer of activations with the compute of the next micro‑batch.
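The micro-batching argument can be illustrated with a toy timing model. This is a minimal sketch under idealized assumptions (fixed per-stage times, perfect overlap of transfer and compute); the stage times used below are made-up values, not measurements from the post, and the function names are illustrative only.

```python
def sequential_time(n_micro, t_a, t_b, t_net):
    """No overlap: each micro-batch finishes stage A, the transfer, and
    stage B before the next micro-batch starts."""
    return n_micro * (t_a + t_net + t_b)


def pipelined_time(n_micro, t_a, t_b, t_net):
    """Ideal overlap: the first micro-batch fills the pipeline, then one
    micro-batch completes every `bottleneck` seconds."""
    fill = t_a + t_net + t_b
    bottleneck = max(t_a, t_net, t_b)
    return fill + (n_micro - 1) * bottleneck


def speedup(n_micro, t_a, t_b, t_net):
    """Ratio of the non-overlapped schedule to the overlapped one."""
    return sequential_time(n_micro, t_a, t_b, t_net) / pipelined_time(
        n_micro, t_a, t_b, t_net
    )


# Hypothetical stage times in seconds: equal halves, a cheaper network hop.
for n in (1, 4, 8):
    print(n, round(speedup(n, t_a=1.0, t_b=1.0, t_net=0.2), 2))
```

With a single micro-batch the ratio is exactly 1, which mirrors the decoding case: there is nothing to overlap. The ratio here is relative to a non-overlapped two-machine schedule, so it is not directly comparable to the ≈1.3× figure reported by users.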
2. Expert‑wise (router) parallelism via Apple RDMA
- How it works – Load the same 2‑bit quantized model on both machines. For each MoE‑style layer, route half of the experts to Machine A and half to Machine B. Because the routing decision is known ahead of time, each device only needs to compute its assigned experts and exchange the tiny expert outputs.
- Pros – The amount of data moved per layer is tiny (just the expert outputs), making it feasible over 10‑GbE or even 2.5‑GbE. Works best for models with large routed experts (e.g., DeepSeek‑v4 PRO) where the compute per expert dominates the communication cost.
- Cons – Requires a custom runtime that can split MoE layers on the fly. Apple’s public APIs do not expose low‑level tensor routing, so most developers would need to patch the inference engine (e.g., a modified version of 🤗 Transformers or vLLM).
- Community tip – A small fork of the mlc-llm runtime adds a simple RDMA‑based expert splitter; early adopters report ~20 % speedup on pre‑fill compared to pure layer‑wise splitting.
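The expert split can be sketched in a few lines of plain Python. This is an in-process simulation, not the RDMA splitter mentioned above: the contiguous ownership rule, the stand-in `expert_forward`, and all function names are assumptions made for illustration, and the final sum stands in for the network exchange of partial outputs.

```python
def owner(expert_id, n_experts, n_machines=2):
    """Contiguous split: machine 0 owns the first block of experts,
    machine 1 the next block, and so on."""
    return expert_id * n_machines // n_experts


def expert_forward(expert_id, x):
    """Stand-in for a real expert FFN: scales the input by (expert_id + 1)."""
    return [(expert_id + 1) * v for v in x]


def partial_output(machine, x, routed, n_experts, n_machines=2):
    """Weighted sum over only those routed experts that `machine` owns."""
    out = [0.0] * len(x)
    for expert_id, weight in routed:
        if owner(expert_id, n_experts, n_machines) == machine:
            y = expert_forward(expert_id, x)
            out = [o + weight * v for o, v in zip(out, y)]
    return out


def moe_layer(x, routed, n_experts, n_machines=2):
    """Each machine computes its partial output; summing the partials
    replaces the exchange that would happen over the network."""
    partials = [
        partial_output(m, x, routed, n_experts, n_machines)
        for m in range(n_machines)
    ]
    return [sum(vals) for vals in zip(*partials)]


# The router picked experts 1 and 6 (of 8) with gate weights 0.75 and 0.25.
x = [1.0, 2.0]
routed = [(1, 0.75), (6, 0.25)]
print(moe_layer(x, routed, n_experts=8))                 # two machines
print(moe_layer(x, routed, n_experts=8, n_machines=1))   # single machine
```

Both calls produce the same vector, which is the property that makes the split safe: only one hidden-state-sized partial result per machine has to be exchanged per MoE layer.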
3. Ensemble‑based “shared‑nothing” inference
- How it works – Run two different quantized models on separate machines (they can even have different vocabularies). After each forward pass, each model emits its logits. A lightweight aggregator merges the logits (e.g., weighted average after aligning vocabularies) or picks the continuation with the lower perplexity.
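- Why it matters – No inter‑layer communication is needed; the only data exchanged is the final logits, one vocabulary‑sized vector per generated token (on the order of hundreds of kilobytes for a vocabulary of 100k+ tokens at 16‑ or 32‑bit precision, and far less if only the top‑k entries are sent). This sidesteps the bandwidth limitation almost entirely and can be implemented over standard HTTP or gRPC.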
- Performance – The arXiv paper reports up to 5‑10 % perplexity reduction on benchmark tasks when combining a 2‑bit quantized DeepSeek‑v4 Flash with a 2‑bit Minimax M2.7. In practice, developers see smoother generation because each model’s “opinion” can veto unlikely tokens.
- Implementation sketch –
  - Deploy each model behind a lightweight inference server (e.g., FastAPI + text-generation-inference).
  - For each request, send the prompt to both servers concurrently.
  - Align the token IDs (if vocabularies differ) using the sentencepiece tokenizer’s decode/encode round‑trip.
  - Combine logits: combined = softmax(logits_a) * w_a + softmax(logits_b) * w_b.
  - Sample from combined and return the token.
- Community reaction – Many HN commenters are excited because this approach works with any hardware, not just Apple silicon. The main hurdle is handling mismatched tokenizers, but open‑source tools like tokenizers (https://github.com/huggingface/tokenizers) make it manageable.
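The combine-and-sample step from the implementation sketch might look like the following. It is a minimal sketch that assumes each server returns next-token logits keyed by the decoded token string, which is a simplification: real tokenizers can segment the same text differently, so exact string matching will not always line up. The function names and example logits are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a {token: logit} mapping."""
    m = max(logits.values())
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    z = sum(exps.values())
    return {tok: e / z for tok, e in exps.items()}


def combine(logits_a, logits_b, w_a=0.5, w_b=0.5):
    """combined = softmax(logits_a) * w_a + softmax(logits_b) * w_b,
    taken over the union of tokens and renormalized."""
    pa, pb = softmax(logits_a), softmax(logits_b)
    tokens = set(pa) | set(pb)
    mixed = {t: w_a * pa.get(t, 0.0) + w_b * pb.get(t, 0.0) for t in tokens}
    z = sum(mixed.values())
    return {t: p / z for t, p in mixed.items()}


def sample(probs, rng):
    """Draw one token according to the combined distribution."""
    tokens = sorted(probs)  # fixed order keeps seeded runs reproducible
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


# Hypothetical next-token logits from two models with different vocabularies.
logits_a = {"cat": 2.0, "dog": 1.0}
logits_b = {"cat": 0.5, "bird": 0.5}

combined = combine(logits_a, logits_b)
print(max(combined, key=combined.get))
print(sample(combined, random.Random(0)))
```

A token that only one model knows about keeps at most that model's weight in the mixture, which is one simple way to read the "veto" effect described above.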
What to try next
- Prototype a two‑Mac pipeline – Use the open‑source vllm fork that supports pipeline parallelism over TCP. Start with a 2‑bit quantized DeepSeek‑v4 Flash on each Mac Studio and measure pre‑fill throughput with micro‑batching.
- Experiment with expert routing – Fork the mlc-llm RDMA patch, load the same model on two M5 Max laptops, and split the MoE layers. Compare the latency to the pure pipeline approach.
- Build an ensemble service – Deploy two quantized models (e.g., DeepSeek‑v4 Flash and Minimax M2.7) on separate machines and try the logit‑averaging trick. Track perplexity on a held‑out set and note any qualitative differences in output.
- Monitor hardware trends – Keep an eye on Apple’s upcoming M5 Ultra announcements. If memory bandwidth climbs above 1.2 TB/s, the expert‑wise approach will become even more attractive.
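For the perplexity tracking suggested above, a minimal sketch is shown below. It assumes you have already collected, for each held-out token, the probability the model (or the ensemble) assigned to the correct token; the probability lists are made-up examples.

```python
import math


def perplexity(token_probs):
    """exp of the mean negative log-probability assigned to the
    reference tokens of a held-out text."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


# Hypothetical per-token probabilities on the same held-out text.
single_model = [0.25, 0.25, 0.25, 0.25]
ensemble = [0.5, 0.25, 0.5, 0.25]

print(round(perplexity(single_model), 3))
print(round(perplexity(ensemble), 3))
```

Lower is better. When the two models use different tokenizers, their per-token perplexities are not directly comparable, so it may be safer to normalize the total log-probability by characters or bytes instead of tokens.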
Bottom line
For most developers, buying a single DGX‑Spark is no longer the only way to run frontier‑size LLMs locally. Apple’s high‑memory silicon, combined with clever distribution strategies, can give you a usable inference pipeline for a fraction of the cost. While classic pipeline parallelism offers modest gains, the real sweet spot may be the ensemble approach: run two independent models on separate machines, merge their logits, and reap a small but meaningful quality boost without drowning in network traffic. The DwarfStar community is already testing these ideas, and the next few months should reveal which pattern scales best for real‑world workloads.