antirez’s ds4: A Narrow, Metal-Only Inference Engine for DeepSeek V4 Flash

Developer antirez has released ds4, a focused local inference engine for the DeepSeek V4 Flash model that skips generic framework overhead in favor of Metal-optimized execution, disk-based KV cache persistence, and asymmetric 2-bit quantization for high-end Apple Silicon Macs.

The local LLM inference space is crowded with tools that aim to support every model, quantization format, and hardware target under the sun. Projects like llama.cpp and Ollama have built massive followings by prioritizing breadth: if a model can be quantized to GGUF, they’ll find a way to run it. A new project from developer antirez flips that logic entirely. The ds4 repository is a deliberately narrow, Metal-only inference engine built exclusively for the DeepSeek V4 Flash model, skipping generic framework overhead in favor of optimizations tailored to one specific model’s unique characteristics.
The project represents a deliberate counterpoint to the trend toward generic inference tools. Most developers building local LLM tooling assume that supporting more models is inherently better, but ds4’s creator argues that focusing on a single model enables validation, optimization, and integration that generic tools can’t match. The project’s documentation notes that new models release constantly, and generic tools often implement minimal support for each new release before moving on to the next, leaving users with runnable but unfinished experiences. ds4 instead aims to make one local model feel "finished end to end" for high-end Mac users.
Why DeepSeek V4 Flash?
The project’s maintainers highlight several properties that make this model worth a dedicated engine. First, it uses a Mixture of Experts (MoE) architecture with fewer active parameters than dense models of similar total parameter counts, leading to faster inference even at large context sizes. Its thinking mode, which generates a reasoning section before the final output, produces sections roughly one-fifth the length of other models’ in many cases, and the length scales with problem complexity. That makes it usable with thinking enabled in scenarios where other models’ verbose reasoning makes them impractical.
DeepSeek V4 Flash also supports a 1-million-token context window, far larger than most accessible local models offer. The project notes that this larger context allows the 284B-parameter model to outperform smaller 27B or 35B dense models on edge-of-knowledge queries, with better English and Italian output quality. Crucially, its KV cache compresses far more efficiently than other models’, and the engine supports asymmetric 2-bit quantization: only the routed MoE experts are quantized, to IQ2_XXS (up/gate) and Q2_K (down), while shared experts, projections, and routing components are left unquantized to preserve output quality. This allows the model to run on MacBooks with 128GB of RAM using the 2-bit GGUF files the project provides on Hugging Face.
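A rough back-of-envelope check (my own arithmetic, not from the project's docs) shows why the asymmetric scheme lands near the reported 81GB file size:

```python
# Rough sanity check of the reported 2-bit file size (back-of-envelope,
# not from the ds4 docs). In llama.cpp's scheme, IQ2_XXS is ~2.06
# bits/weight and Q2_K is ~2.6 bits/weight; the unquantized shared
# experts, projections, and routing tensors push the average up.
total_params = 284e9
file_bytes = 81e9
bits_per_weight = file_bytes * 8 / total_params
print(f"~{bits_per_weight:.2f} bits/weight average")  # ~2.28
```

That average sits between the pure 2-bit formats and the unquantized tensors, which is consistent with quantizing only the routed experts.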
Core Design Choices
Disk-First KV Caching
The project argues that the fast SSDs in modern MacBooks should change the assumption that KV caches must live in RAM. ds4 treats the KV cache as a "first class disk citizen", persisting it to disk so that sessions survive restarts and users can switch between unrelated sessions. The in-memory cache handles the active session, while the disk cache stores checkpoints for resuming past sessions without reprocessing long prompts.
The cache uses SHA1 hashes of exact token IDs as keys, with files storing the token prefix, rendered text for observability, and DS4-specific session payloads including KV state, logits, and compressor state. The documentation details the full binary format of cache files, which is intentionally not portable to other tools to keep the implementation simple. Cache files include a 48-byte header with metadata like creation time, token count, and quantization level, followed by the decoded text of the cached prefix, then the serialized session state. Checkpoints are saved at four points: after initial prefill of long prompts, at regular intervals during generation, when an unrelated session replaces the live cache, and on server shutdown.
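The keying scheme is simple enough to sketch. The snippet below is illustrative, not the project's code: the SHA-1-over-token-IDs keying comes from the documentation, but the token serialization, header field names, and offsets are assumptions made to fill the documented 48 bytes.

```python
import hashlib
import struct

def cache_key(token_ids: list[int]) -> str:
    """Key a cached prefix by the SHA-1 of its exact token IDs, as the
    ds4 docs describe (the byte serialization here is an assumption)."""
    blob = b"".join(struct.pack("<I", t) for t in token_ids)
    return hashlib.sha1(blob).hexdigest()

# Hypothetical 48-byte header: the docs mention creation time, token
# count, and quantization level; magic, version, and reserved padding
# are assumptions.
HEADER = struct.Struct("<8sIQQI16s")  # magic, version, ctime, tokens, quant, reserved

def read_header(path: str) -> dict:
    with open(path, "rb") as f:
        magic, version, ctime, n_tokens, quant, _ = HEADER.unpack(f.read(HEADER.size))
    return {"magic": magic, "version": version, "ctime": ctime,
            "n_tokens": n_tokens, "quant": quant}
```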
Metal-Only Execution
ds4 is Metal-only, with no CUDA support planned in the short term. A CPU path exists only for correctness checks, but a bug in current macOS versions causes kernel crashes when using the CPU backend, so it is not recommended for use. The server component, ds4-server, is also Metal-only, and runs a single live KV cache in memory. Inference is serialized through one Metal worker, so concurrent requests wait in a queue rather than running in parallel.
The server implements OpenAI-compatible and Anthropic-compatible APIs, supporting endpoints for chat completions, completions, and messages, with SSE streaming. It maps OpenAI tool schemas to DeepSeek’s DSML format and back, making it compatible with coding agents that use OpenAI-style APIs. It does not batch multiple independent requests, as the single live graph and KV checkpoint are shared across all clients.
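Because the server speaks the OpenAI wire format, any standard client pointed at the local base URL should work. For example, using the openai Python package, with the base URL, placeholder API key, and model name taken from the agent configs shown below:

```python
from openai import OpenAI

# Point a stock OpenAI client at the local ds4-server.
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="dsv4-local")

stream = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "Summarize this repo's design."}],
    stream=True,  # delivered over SSE
)
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
```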
Performance and Quantization
The project provides benchmark numbers for the q2 GGUF on M3 Max and M3 Ultra Macs, using greedy decoding, a 32768-token context, and thinking disabled:
| Machine | Prompt | Prefill | Generation |
|---|---|---|---|
| MacBook Pro M3 Max, 128 GB | short | 58.52 t/s | 26.68 t/s |
| MacBook Pro M3 Max, 128 GB | 11709 tokens | 250.11 t/s | 21.47 t/s |
| Mac Studio M3 Ultra, 512 GB | short | 84.43 t/s | 36.86 t/s |
| Mac Studio M3 Ultra, 512 GB | 11709 tokens | 468.03 t/s | 27.39 t/s |
The 1M-token maximum context uses ~26GB of memory for the compressed indexer alone, so the project recommends 100k-300k token contexts for 128GB machines running the 81GB 2-bit quant. The 4-bit quant requires 256GB of RAM or more.
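Assuming the indexer's footprint scales roughly linearly with the configured context (my assumption, not a claim the project makes), the recommendation is easy to sanity-check:

```python
# If ~26GB covers the compressed indexer at a 1M-token context, and the
# cost scales roughly linearly with configured context (an assumption):
indexer_gb_per_token = 26 / 1_000_000
for ctx in (100_000, 300_000, 1_000_000):
    print(f"{ctx:>9} tokens -> ~{ctx * indexer_gb_per_token:.1f} GB indexer")
# ~2.6 GB at 100k, ~7.8 GB at 300k, ~26 GB at 1M; the last leaves little
# headroom next to 81 GB of weights on a 128 GB machine.
```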
Agent Integration
The project includes configuration examples for popular coding agents. For opencode, add a provider and agent entry to ~/.config/opencode/opencode.json:
{ "$schema": "https://opencode.ai/config.json", "provider": { "ds4": { "name": "ds4.c (local)", "npm": "@ai-sdk/openai-compatible", "options": { "baseURL": "http://127.0.0.1:8000/v1", "apiKey": "dsv4-local" }, "models": { "deepseek-v4-flash": { "name": "DeepSeek V4 Flash (ds4.c local)", "limit": { "context": 100000, "output": 384000 } } } } }, "agent": { "ds4": { "description": "DeepSeek V4 Flash served by local ds4-server", "model": "ds4/deepseek-v4-flash", "temperature": 0 } } }
For Pi, add a provider to ~/.pi/agent/models.json:
{ "providers": { "ds4": { "name": "ds4.c local", "baseUrl": "http://127.0.0.1:8000/v1", "api": "openai-completions", "apiKey": "dsv4-local", "compat": { "supportsStore": false, "supportsDeveloperRole": false, "supportsReasoningEffort": true, "supportsUsageInStreaming": true, "maxTokensField": "max_tokens", "supportsStrictMode": false, "thinkingFormat": "deepseek", "requiresReasoningContentOnAssistantMessages": true }, "models": [ { "id": "deepseek-v4-flash", "name": "DeepSeek V4 Flash (ds4.c local)", "reasoning": true, "thinkingLevelMap": { "off": null, "minimal": "low", "low": "low", "medium": "medium", "high": "high", "xhigh": "xhigh" }, "input": ["text"], "contextWindow": 100000, "maxTokens": 384000, "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 } } ] } } }
For Claude Code, a wrapper script sets the Anthropic base URL to the local ds4-server, disables non-essential traffic, and sets the model to DeepSeek V4 Flash. The disk KV cache is particularly useful for Claude Code, which often sends ~25k token initial prompts, as the cache avoids reprocessing that prompt on every request or restart.
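The article does not reproduce the wrapper, but its shape is small enough to sketch. The environment variables below are standard Claude Code settings; whether the project's script sets exactly this set is an assumption:

```python
#!/usr/bin/env python3
"""Illustrative launcher (not the project's script): point Claude Code
at the local ds4-server via standard Anthropic/Claude Code env vars."""
import os
import sys

os.environ["ANTHROPIC_BASE_URL"] = "http://127.0.0.1:8000"   # local ds4-server
os.environ["ANTHROPIC_AUTH_TOKEN"] = "dsv4-local"            # placeholder key
os.environ["ANTHROPIC_MODEL"] = "deepseek-v4-flash"
# Claude Code honors this flag to skip telemetry and other extra requests.
os.environ["CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC"] = "1"

os.execvp("claude", ["claude", *sys.argv[1:]])  # hand off to the real CLI
```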
Thinking Modes and Validation
DeepSeek V4 Flash supports three modes: non-thinking (direct answers), thinking (reasoning section scales with complexity), and Think Max (extended reasoning, only available for large enough contexts). The server defaults to thinking mode, with API parameters mapping to DeepSeek’s native controls. OpenAI’s reasoning_effort parameter maps to thinking levels, with xhigh still using normal thinking rather than Think Max.
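As a concrete example, an OpenAI-style request can select the thinking level via reasoning_effort; passing it through extra_body keeps the client library agnostic. The parameter name comes from the article, and the accepted values are assumed to match those listed in the Pi config above:

```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="dsv4-local")

# reasoning_effort maps onto DeepSeek's thinking levels; per the article,
# "xhigh" still uses normal thinking rather than Think Max.
resp = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
    extra_body={"reasoning_effort": "high"},
)
print(resp.choices[0].message.content)
```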
ds4 uses test vectors captured from the official DeepSeek V4 Flash API, comparing logprobs from local runs to official API outputs to catch regressions. The project credits llama.cpp and GGML extensively, noting that while ds4 does not link against GGML, it builds on the ecosystem, quantization formats, and engineering knowledge developed for llama.cpp. Some CPU quantization logic and Metal kernels are adapted from llama.cpp under the MIT license, with copyright notices retained in the LICENSE file.
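The comparison itself is straightforward; the sketch below shows the kind of check involved, though the tolerance and file layout are my illustrative assumptions, not ds4's actual format:

```python
import json

def check_vector(path: str, local_logprobs: list[float], tol: float = 1e-2) -> bool:
    """Compare per-token logprobs from a local run against a reference
    vector captured from the official API. Threshold and JSON layout
    are illustrative assumptions."""
    with open(path) as f:
        reference = json.load(f)["logprobs"]
    worst = max(abs(a - b) for a, b in zip(reference, local_logprobs))
    return worst <= tol
```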
Limitations and Counter-Perspectives
The project is upfront about its constraints. It is alpha-quality code, not production ready. Its narrow scope means it only works with the specific GGUF files the project provides, not with arbitrary GGUF or DeepSeek models. There is no support for Windows, Linux, or CUDA, so it is only usable on Apple Silicon Macs with at least 128GB of RAM for the 2-bit quant, or 256GB for the 4-bit quant.
The server does not batch concurrent requests, so it is not suitable for multi-user setups. MTP speculative decoding support is experimental and currently provides only slight speedups. The project also notes that it was developed with heavy assistance from GPT 5.5, with humans leading ideas, testing, and debugging, so developers who prefer fully human-written code may want to avoid it.
The CPU backend is broken on current macOS versions due to a virtual memory bug that causes kernel crashes, requiring a computer restart to recover, so it is only useful for debugging. The disk KV cache is DS4-specific, so cached data can’t be used with other inference tools. While the asymmetric 2-bit quantization works well for coding agent tasks and tool calling, it may not match the quality of higher-bit quants for other use cases.
Broader Context
This project highlights a split in the local LLM community between generic tools that prioritize coverage and specialized tools that prioritize depth. For users with high-end Macs who want a reliable, validated DeepSeek V4 Flash setup for coding agents, ds4 offers a tailored experience that generic tools can’t match. For users who need to run multiple models, or use non-Apple hardware, it offers no value.
The focus on disk-backed KV caches also points to a shift in how local inference can use modern hardware: fast SSDs are increasingly able to supplement RAM for large context workloads, reducing the memory pressure that usually limits local long-context inference. As MoE models like DeepSeek V4 Flash become more common, specialized engines that optimize for their unique characteristics may become a more common sight in the local inference ecosystem.
