Gemma 4 Multi‑Token Prediction Cuts Inference Latency by Up to Three‑Fold
Google’s Gemma 4 now ships with a lightweight multi‑token prediction drafter that lets the model verify several tokens in parallel, delivering roughly 3× faster generation on consumer hardware without compromising quality.

Google announced that the latest release of its Gemma 4 family can be paired with a multi‑token prediction (MTP) drafter. The drafter runs alongside the main model, proposes a batch of future tokens, and lets the heavyweight target model verify them in a single pass. In practice this reduces how often the full set of weights must be streamed from VRAM to the compute cores per generated token, delivering up to ~3× faster token generation while preserving the frontier‑class reasoning of the original model.
How the MTP drafter works
During ordinary inference each token requires the processor to fetch billions of parameters from VRAM, perform the attention and feed‑forward pass, and write the result back. On consumer GPUs and even on high‑end CPUs this memory‑bandwidth bottleneck dominates latency. The MTP approach introduces a lightweight auxiliary model that runs on the same hardware but with a fraction of the parameter count.
- Draft phase – The drafter predicts n candidate tokens (typically 4‑8) using a shallow network that reuses the target model’s key‑value cache. Because the drafter is small, it can generate these candidates much faster than the full model could produce a single token.
- Verification phase – The full Gemma 4 model receives the batch of candidates, re‑uses the cached attention context, and scores each token in parallel. Tokens are accepted in order for as long as they meet the target model’s acceptance threshold; at the first rejection the target’s own prediction is emitted instead and the remaining drafts are discarded.
- Cache sharing – A key engineering detail is that the drafter and the target model share the same KV cache. This eliminates the need to duplicate the large attention buffers, keeping the extra memory overhead modest (roughly 10‑15 % of the target model’s footprint).
The result is a pipeline that puts otherwise idle compute units to work while the weights stream in from memory, turning a previously serial process into a partially parallel one.
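The draft and verification phases can be sketched in a few lines of Python. This is a minimal illustration of greedy speculative decoding in general, not Google's implementation: the "models" are toy functions that map a token prefix to the next token, acceptance is an exact match rather than a tuned threshold, and all names (`speculative_step`, `generate`, `greedy`, the toy `target` and `draft`) are hypothetical. A real engine would also score all drafted positions in one batched forward pass instead of looping.

```python
from typing import Callable, List, Sequence, Tuple

# Assumption: a "model" is any function mapping a token prefix to the next token.
NextToken = Callable[[Sequence[int]], int]


def speculative_step(prefix: List[int], draft: NextToken,
                     target: NextToken, k: int) -> List[int]:
    """Run one draft/verify round and return the tokens to append."""
    # Draft phase: the cheap model proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # Verification phase: compare each draft with the target's own choice.
    accepted: List[int] = []
    for tok in proposed:
        expected = target(prefix + accepted)
        if tok != expected:
            # First mismatch: emit the target's token, discard later drafts.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # Every draft matched, so the target's pass also yields one extra token.
    accepted.append(target(prefix + accepted))
    return accepted


def generate(prompt: Sequence[int], draft: NextToken, target: NextToken,
             k: int, max_new: int) -> Tuple[List[int], int]:
    """Generate max_new tokens; also return how many verify rounds were used."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(out, draft, target, k))
        rounds += 1
    return out[: len(prompt) + max_new], rounds


def greedy(prompt: Sequence[int], target: NextToken, max_new: int) -> List[int]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out


def target(prefix: Sequence[int]) -> int:
    """Toy target model: counts upward modulo 10."""
    return (prefix[-1] + 1) % 10


def draft(prefix: Sequence[int]) -> int:
    """Toy drafter: agrees with the target except right after token 2."""
    return 9 if prefix[-1] == 2 else (prefix[-1] + 1) % 10


tokens, rounds = generate([0], draft, target, k=4, max_new=8)
print(tokens, rounds)
```

In this toy setup the drafter is wrong once, yet the output is identical to what `greedy` produces with the target alone, and it takes fewer verification rounds than generated tokens. That is the property the verification phase is meant to preserve.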
Real‑world use cases
| Scenario | Benefit |
|---|---|
| Desktop inference – Running Gemma 4‑31B on a consumer RTX 4090 or an AMD Radeon 7900 | Up to 3× lower latency for chat‑style interactions, making local assistants feel snappier. |
| Edge devices – Mobile phones, tablets, or embedded GPUs running the E2B/E4B variants | Faster response without sacrificing the model’s reasoning, enabling richer on‑device assistants that respect privacy. |
| Low‑traffic API endpoints – SaaS providers that serve a few concurrent users per instance | Better utilization of expensive GPU instances, reducing cost per request. |
| Hybrid RAG pipelines – Retrieval‑augmented generation where the LLM must produce long passages | Accepting several drafted tokens per verification pass reduces the number of full‑model forward passes, shortening end‑to‑end latency for document synthesis. |
Developers can pull the MTP‑enabled checkpoints from platforms such as Hugging Face, Ollama, or Kaggle. The packages include both the target Gemma 4 model and the matching drafter, pre‑configured to share the cache.
Trade‑offs and considerations
Memory footprint
While the shared cache mitigates duplication, the drafter still occupies additional VRAM. On a 24 GB GPU a dense 31B Gemma 4 model already consumes ~20 GB; at the 10‑15 % overhead cited above, the drafter adds another 2‑3 GB and pushes usage to roughly 22‑23 GB, leaving little room for other workloads. For truly constrained devices (e.g., 8 GB mobile GPUs) the E2B/E4B variants are the practical choice.
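The budget arithmetic behind that statement is easy to reproduce. The sketch below uses only the figures quoted in this article (24 GB card, ~20 GB target model, 10-15 % drafter overhead); the function name `vram_headroom_gb` is hypothetical, and the result ignores activations and KV cache growth, which also have to fit in the remaining space.

```python
def vram_headroom_gb(total_gb: float, target_gb: float,
                     drafter_fraction: float) -> float:
    """VRAM left after loading the target model and its drafter.

    drafter_fraction is the drafter's size relative to the target model,
    e.g. 0.10 for the 10 % end of the range quoted in the article.
    """
    drafter_gb = target_gb * drafter_fraction
    return total_gb - target_gb - drafter_gb


for fraction in (0.10, 0.15):
    left = vram_headroom_gb(total_gb=24, target_gb=20, drafter_fraction=fraction)
    print(f"drafter at {fraction:.0%} of target: {left:.1f} GB of headroom")
```

Under these assumptions the headroom is between 1 GB and 2 GB, which is why long contexts or a second workload on the same card may not fit.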
Quality vs. speed
The verification step guarantees that any token emitted by the drafter meets the target model’s confidence threshold. In benchmarks Google reports no measurable degradation in perplexity or downstream task performance. However, community testing on Reddit and Hacker News notes occasional edge‑case failures when the drafter’s confidence is overly optimistic. Tuning the acceptance threshold can balance speed gains against occasional re‑generation.
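How much speed a given acceptance rate buys can be estimated with the standard analysis from the speculative decoding literature. The sketch below assumes that each drafted token is accepted independently with probability `alpha`, that `k` tokens are drafted per round, and that one drafter step costs a fraction `draft_cost` of a target pass. These are modelling assumptions, not figures reported by Google, and the function names are hypothetical.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per target verification pass.

    Assumes i.i.d. acceptance with probability alpha for each of k drafts;
    a rejected draft is replaced by the target's own token, and a fully
    accepted batch yields one extra token.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Speedup over plain decoding when each round costs k draft steps
    plus one target pass, with draft_cost measured in target passes."""
    return expected_tokens_per_pass(alpha, k) / (k * draft_cost + 1.0)


for alpha in (0.5, 0.7, 0.9):
    print(alpha, round(estimated_speedup(alpha, k=4, draft_cost=0.05), 2))
```

The estimate makes the trade-off concrete: a lower acceptance rate shrinks the numerator, while drafting more tokens per round raises the fixed cost in the denominator.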
Deployment complexity
Running two models simultaneously introduces a small orchestration burden. Frameworks such as TensorRT‑LLM and vLLM already expose an MTP mode, but custom pipelines may need to manage cache sharing explicitly.
Suitability for high‑throughput services
Analysts on Hacker News argue that MTP shines when compute is abundant but request volume is low, for example in personal assistants or edge inference. Large API providers that serve thousands of concurrent requests may see diminishing returns: heavy batching already keeps the compute units saturated, so verifying extra drafted tokens competes for the same resources and becomes a bottleneck under load.
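A rough roofline-style model illustrates why the benefit fades with batch size. The sketch below is a deliberate simplification: it treats a decoding step as the slower of streaming the weights once and doing about 2 FLOPs per parameter per token, and it ignores attention and KV cache traffic. The hardware numbers are hypothetical placeholders, not measurements of any specific GPU or of Gemma 4.

```python
def step_time_s(batch: int, tokens_per_seq: int, params: float,
                bytes_per_param: float, bandwidth_bytes_s: float,
                peak_flops: float) -> float:
    """Approximate time of one forward pass over batch * tokens_per_seq tokens."""
    memory_time = params * bytes_per_param / bandwidth_bytes_s  # stream weights once
    compute_time = 2.0 * params * batch * tokens_per_seq / peak_flops
    return max(memory_time, compute_time)


def verify_cost_ratio(batch: int, k: int, **hw: float) -> float:
    """Cost of verifying k drafts (k + 1 positions) relative to a normal step."""
    return step_time_s(batch, k + 1, **hw) / step_time_s(batch, 1, **hw)


# Hypothetical hardware and model figures, for illustration only.
HW = dict(params=31e9, bytes_per_param=0.5,
          bandwidth_bytes_s=1e12, peak_flops=1e14)

for batch in (1, 8, 64, 256):
    print(batch, round(verify_cost_ratio(batch, k=4, **HW), 2))
```

With these placeholder numbers, verifying five positions costs about the same as one at batch size 1, because the step is memory-bound, but about five times as much at batch size 256, where the step is compute-bound and the speculative work is no longer free.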
Architectural takeaways for cloud‑native teams
- Speculative decoding is becoming a standard performance knob – As LLMs grow, the memory‑bandwidth gap will only widen. Expect more vendors to ship paired drafter models.
- Cache‑aware designs matter – Sharing the KV cache is a clever way to keep overhead low; similar techniques could be applied to other accelerator‑specific optimizations.
- Hybrid inference stacks – Combining a heavyweight model for correctness with a lightweight predictor for throughput offers a flexible trade‑off surface that can be tuned per deployment.
- Observability is key – Monitoring the acceptance rate of drafted tokens helps detect drift in the drafter’s confidence and informs dynamic threshold adjustments.
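As one way to act on the observability point, the acceptance rate can be tracked over a rolling window of decoding rounds. The class below is a hypothetical sketch: the name `AcceptanceMonitor`, the window size, and the alert threshold are all assumptions, and it presumes the serving stack can report how many tokens were drafted and accepted in each round.

```python
from collections import deque
from typing import Deque, Optional, Tuple


class AcceptanceMonitor:
    """Rolling acceptance rate of drafted tokens over the last `window` rounds."""

    def __init__(self, window: int = 100, alert_below: float = 0.6) -> None:
        # Each entry is (tokens drafted, tokens accepted) for one round.
        self._rounds: Deque[Tuple[int, int]] = deque(maxlen=window)
        self.alert_below = alert_below

    def record(self, drafted: int, accepted: int) -> None:
        if not 0 <= accepted <= drafted:
            raise ValueError("accepted must be between 0 and drafted")
        self._rounds.append((drafted, accepted))

    @property
    def rate(self) -> Optional[float]:
        """Accepted / drafted over the window, or None before any drafts."""
        drafted = sum(d for d, _ in self._rounds)
        if drafted == 0:
            return None
        return sum(a for _, a in self._rounds) / drafted

    def degraded(self) -> bool:
        """True when the rolling rate has fallen below the alert level."""
        rate = self.rate
        return rate is not None and rate < self.alert_below


monitor = AcceptanceMonitor(window=3, alert_below=0.5)
for drafted, accepted in [(4, 4), (4, 3), (4, 1), (4, 0)]:
    monitor.record(drafted, accepted)
print(monitor.rate, monitor.degraded())
```

A sustained drop in this rate would suggest the drafter no longer matches the workload, at which point a deployment could draft fewer tokens per round or fall back to plain decoding.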
Looking ahead
Google’s MTP implementation for Gemma 4 demonstrates that speculative decoding can be packaged as a first‑class feature, not just a research experiment. As the community builds tooling around cache sharing and drafter orchestration, we can anticipate broader adoption across both on‑device and cloud‑hosted LLM services. For architects, the lesson is clear: treat inference latency as a multi‑dimensional problem and consider lightweight auxiliary models as a lever to extract more performance from existing hardware.
Author: Sergio De Simone – senior software engineer, currently leading iOS and macOS development at BigML, Inc.

