Cross-GPU KV Cache Marketplace: Turning Attention into an Infrastructure Primitive


Large language model inference today wastes an astonishing amount of silicon on déjà vu.

Every time your chat stack, RAG pipeline, or multi-tenant LLM service replays the same system prompt—or the same few kilobytes of context—it recomputes the exact same key/value (KV) tensors on every GPU, in every process, for every request. Those attention states live briefly in device memory, then vanish. The next request, even if identical up to the last token, starts from scratch.

The Cross-GPU KV Cache Marketplace project asks a simple, infrastructure-native question:

What if KV caches were shared, addressable assets?

And then it answers it with something both practical and provocative: a runtime that turns transformer attention states into a cooperative, cross-GPU cache—"memcached for attention"—starting with a concrete integration into vLLM.

This is not just an optimization trick. It’s an early blueprint for how serious LLM platforms will treat memory, consistency, and reuse at scale.


The Missed Opportunity in Today’s KV Caching

Autoregressive transformers already rely on KV caching to avoid recomputing attention states for previous tokens during decoding. But current serving stacks make one limiting assumption: the cache belongs to the process or GPU handling the request.

That leads to three systemic inefficiencies:

  • Every GPU recomputes identical prefixes (prompts, system messages, RAG boilerplate) independently.
  • GPU memory is siloed, so KV states can’t be reused across workers even on the same node.
  • Schedulers and routers remain oblivious to prefix reuse potential.

In modern workloads, this is pathological:

  • Chatbots with a fixed system prompt.
  • RAG flows with shared scaffolding and retrieval templates.
  • Multi-tenant deployments with standardized policies, disclaimers, and guardrail prompts.

The same 1–4 KB of text can appear across thousands of concurrent requests and many GPUs. The prefill phase for a long prefix is expensive; recomputing it everywhere is pure waste.
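
To put rough numbers on that waste, here is a back-of-envelope estimate (my own, not from the project) of how much KV state a shared prefix represents, assuming Mistral-7B-like dimensions: 32 layers, 8 grouped-query KV heads, head dimension 128, fp16.

```python
# Back-of-envelope KV-cache footprint for a shared prefix.
# Dimensions assume a Mistral-7B-like configuration (GQA):
# 32 layers, 8 KV heads, head_dim 128, 2 bytes per fp16 element.
num_layers = 32
num_kv_heads = 8
head_dim = 128
bytes_per_elem = 2  # fp16

# Keys and values are both cached, hence the leading factor of 2.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
prefix_tokens = 1_000  # e.g., a long system prompt plus RAG scaffolding

print(f"KV per token:  {kv_bytes_per_token / 1024:.0f} KiB")                      # ~128 KiB
print(f"KV for prefix: {prefix_tokens * kv_bytes_per_token / 2**20:.0f} MiB")     # ~125 MiB
```

At roughly 128 KiB per token, a thousand-token prefix is on the order of 125 MiB of KV state; moving that once over NVLink is typically far cheaper than recomputing it on every GPU that sees the same prompt.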

The KV Cache Marketplace flips the model: treat KV tensors as first-class artifacts that can be published, discovered, and reused across GPUs.


Design: A Marketplace for Attention States

At its core, the Marketplace is a distributed inference runtime pattern with three key ideas:

  1. KV states as shareable artifacts

    • After prefill, each process exports completed prefix KV caches.
    • Entries are keyed by:
      • A hash of the input token sequence (prefix)
      • Model version / configuration
    • These become reusable assets rather than ephemeral, private buffers.
  2. High-speed GPU-to-GPU movement

    • Other processes that encounter the same prefix can import KV caches directly:
      • Via CUDA peer-to-peer copies or NVLink
      • Without bouncing through host memory
    • Transfers are designed to be low-latency enough to beat recomputation.
  3. A registry that behaves like a cache, not a science fair demo

    • Manages:
      • Registration of exported tensors
      • Lookup by prefix key
      • Compatibility enforcement (model params, tokenizer, positional encoding, dtype, layout)
      • Lifetime tracking and correctness checks
    • Falls back to standard execution when no valid cache is found.

Conceptually, you get a cooperative cache fabric spanning GPUs: if one process has paid the cost of prefill, others can simply attach.
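
The project's real registry lives behind the vLLM hooks described below; what follows is a simplified, hypothetical sketch of the keying, publication, and fallback behavior described above. Names such as CacheKey and PrefixRegistry are illustrative, not the project's.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class CacheKey:
    """Identity of a reusable KV entry: the prefix plus everything that
    affects its numerical meaning (model version, dtype, layout, ...)."""
    prefix_hash: str
    model_id: str
    dtype: str
    kv_layout: str

def make_key(token_ids, model_id, dtype, kv_layout):
    # Exact-match semantics: hash the full token sequence of the prefix.
    digest = hashlib.sha256(str(list(token_ids)).encode("utf-8")).hexdigest()
    return CacheKey(digest, model_id, dtype, kv_layout)

class PrefixRegistry:
    """Node-local registry mapping prefix keys to exported KV tensors."""
    def __init__(self):
        self._entries = {}  # CacheKey -> (device_id, kv_handle)

    def publish(self, key, device_id, kv_handle):
        self._entries[key] = (device_id, kv_handle)

    def lookup(self, key):
        # Returns None on a miss; the caller falls back to normal prefill.
        return self._entries.get(key)
```

On a miss the caller simply runs normal prefill; on a hit it imports the published tensors from the owning GPU before decoding continues.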

For inference providers, that cooperative model is existentially relevant: it is how you stop paying full price for every single prompt.


MVP: Starting Local, Proving It Works

The project’s minimum viable prototype is deliberately scoped—and that’s a strength. Instead of overpromising a cross-datacenter KV mesh, it focuses on one high-leverage target: node-local reuse inside vLLM.

What’s shipping now

The initial release provides:

  • A companion fork of vLLM

    • Repo: neelsomani/vllm
    • Branch: vllm-kvm-dev
    • Adds two integration hooks:
      • before_prefill: Attempt KV import when a matching prefix exists.
      • after_prefill: Export KV state for reuse.
    • Includes a thin shim adapter to load the kv-marketplace plugin.
  • CUDA peer-to-peer / IPC transport

    • Enables direct GPU-to-GPU KV transfers on the same machine.
    • Uses CUDA P2P over NVLink when available, falling back to PCIe paths otherwise.
  • Prefix registry with exact-match semantics

    • Exact token-sequence hash matching for now.
    • Verifies configuration compatibility to avoid subtle correctness bugs.
  • Safe fallbacks

    • If no cache is available or configs mismatch, execution reverts to normal vLLM behavior.
    • Tests confirm next-token parity within floating-point tolerance when caches are reused.
  • Developer-facing demos and tests

    • Unit tests for the registry and prefix index.
    • Integration tests for CUDA P2P (automatically skipped on hardware without P2P support).
    • Example scripts:
      • two_gpu_demo.py validates raw peer-to-peer copies (a minimal version of the idea is sketched after this list).
      • vllm_dual_gpu_demo.py benchmarks with/without kv-marketplace integration.
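
For flavor, here is a minimal PyTorch sketch of the kind of raw device-to-device copy that the two-GPU demo validates. It is my own illustration, not the project's two_gpu_demo.py, and it assumes two local CUDA devices.

```python
import torch

assert torch.cuda.device_count() >= 2, "needs two CUDA devices"

# Check whether device 1 can read device 0's memory directly (NVLink or PCIe P2P).
p2p_ok = torch.cuda.can_device_access_peer(1, 0)
print(f"peer access 1 -> 0: {p2p_ok}")

# Simulate an exported KV block on GPU 0: [layers, K/V, tokens, kv_heads, head_dim].
kv_block = torch.randn(32, 2, 1024, 8, 128, dtype=torch.float16, device="cuda:0")

# "Import" it onto GPU 1. With peer access enabled, PyTorch performs this as a
# device-to-device copy without staging through host memory.
imported = kv_block.to("cuda:1", non_blocking=True)
torch.cuda.synchronize()

# Sanity check: the imported tensors match the source exactly.
assert torch.equal(imported.cpu(), kv_block.cpu())
```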

This initial release is enough for practitioners to:

  • Run real models (e.g., GPT-2, Mistral-7B) on dual-GPU setups.
  • Turn on KV reuse with flags (--kv-marketplace, --kv-min-prefix).
  • Observe prefix cache hits, latency changes, and throughput shifts.

And crucially, it keeps the complexity where it belongs: behind an integration layer instead of in every application.
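
To make that integration layer concrete, here is a hypothetical sketch of a plugin sitting behind the before_prefill / after_prefill hooks, reusing the toy registry sketched earlier; the actual hook signatures in the vllm-kvm-dev branch may differ.

```python
class KVMarketplacePlugin:
    """Hypothetical shape of the integration layer: import before prefill,
    export after prefill, and stay out of the way otherwise."""

    def __init__(self, registry, min_prefix_tokens=64):
        self.registry = registry
        self.min_prefix_tokens = min_prefix_tokens  # cf. --kv-min-prefix

    def before_prefill(self, key, request_tokens):
        # Skip short prefixes where a transfer cannot beat recomputation.
        if len(request_tokens) < self.min_prefix_tokens:
            return None
        hit = self.registry.lookup(key)
        if hit is None:
            return None             # miss: vLLM runs its normal prefill
        device_id, kv_handle = hit
        return kv_handle            # hit: caller imports these tensors via P2P

    def after_prefill(self, key, device_id, kv_handle):
        # Publish the freshly computed prefix KV state for other workers.
        self.registry.publish(key, device_id, kv_handle)
```

The important property is the shape: import on a hit, export after paying for prefill, and change nothing about vLLM's behavior on a miss.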


What’s Explicitly Not There (Yet)

The authors are refreshingly clear about what the MVP does not attempt:

  • No cross-host KV transfer or global distributed registry.
  • No custom scheduler or router that routes requests to where prefixes live.
  • No eviction policies, scoring, or advanced lifetime management.
  • No tensor-parallel / pipeline-parallel sharded KV import.
  • No KV compression/quantization in the transport path.
  • No speculative decoding integration.
  • No Longest Common Prefix (LCP) support—only exact-prefix matches.

For now, this is about validating the core premise: that sharing KV caches across GPUs on the same node is both feasible and beneficial without breaking correctness.

That constraint is not a limitation; it’s a staging ground.


Why This Matters for Practitioners

If you’re running production LLM workloads, this project lands at the intersection of three hard truths:

  1. GPU time is your real burn rate.
  2. Prefill for long prompts is expensive and increasingly dominates request cost.
  3. Your workloads are more repetitive than you think.

Concrete benefits

  • Chat & system prompts:

    • Shared system messages and instructions become one-time work per node, not per request-per-GPU.
  • RAG-heavy stacks:

    • Boilerplate scaffolding, chain-of-thought templates, and policy prompts can be cached at the KV level.
  • Multi-tenant inference:

    • Common safety policies or branding prompts across tenants can be amortized.

Even the early demo numbers (from the vLLM integration scripts) show the expected pattern:

  • Phase 1 ("cold" runs) may see overhead or minimal gain while caches warm up.
  • Phase 2 ("warm" runs with repeated prefixes) shows material throughput and latency improvements.

The exact impact will be workload-dependent, but the direction is clear: once caches are hot and reuse is common, prefill becomes cheaper than your mental model expects.


Implementation Notes for the Curious

For teams who want to experiment today, the project is designed to be reproducible on real hardware:

  • Environment:

    • Debian 12
    • Python 3.11
    • PyTorch 2.0+ with CUDA 12.8
    • CUDA-capable GPUs; P2P support (NVLink or modern PCIe) for optimal results.
  • vLLM integration:

    • Use the vllm-kvm-dev branch from the provided fork.
    • Enable via CLI flags or LLM constructor arguments (--kv-marketplace, kv_marketplace=True).
  • Transport:

    • CUDA extension is auto-built on install; manual rebuild is supported if needed.
  • Validation workflow:

    • Run the two-GPU demo to verify direct peer copies.
    • Enable kv-marketplace in vLLM and send repeated prompts (see the sketch after this list).
    • Observe:
      • Prefix registry growth
      • Import hits vs misses
      • Changes in request latency and throughput
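
A minimal Python version of that workflow might look like the following. The vLLM constructor and generate calls are standard, while the kv_marketplace and kv_min_prefix arguments are inferred from the flags mentioned above and may be named differently in the actual branch.

```python
from vllm import LLM, SamplingParams

# Constructor-level enablement; the kv_marketplace / kv_min_prefix argument
# names are taken from the documented flags and are assumptions about the fork.
llm = LLM(model="mistralai/Mistral-7B-v0.1",
          kv_marketplace=True,
          kv_min_prefix=64)

system_prompt = "You are a careful assistant. Follow the house policy strictly.\n"
params = SamplingParams(max_tokens=64)

# Repeated shared prefix: the first request pays for prefill and exports its
# KV state; later requests on other workers should register as import hits.
for question in ["What is NVLink?", "What is CUDA P2P?", "What is a KV cache?"]:
    out = llm.generate([system_prompt + question], params)
    print(out[0].outputs[0].text)
```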

This is clearly aimed at practitioners comfortable with CUDA, custom wheels, and forked inference runtimes—not a plug-and-play toggle in your SaaS console. But that's precisely the audience that will pressure vendors to adopt ideas like this natively.


The Deeper Shift: From Models to Memory Fabrics

The most interesting part of the Cross-GPU KV Cache Marketplace isn’t the current code. It’s the research space it opens.

If you treat KV caches as a shared, queryable resource, you inherit—and can exploit—a whole lineage of distributed systems problems:

  • Cache-aware routing:

    • Route new requests to GPUs that already hold the relevant prefixes.
    • Similar to HTTP/CDN cache locality, but for internal model states.
  • Consistency and correctness at the KV layer:

    • Strict validation of model versions and tokenization.
    • Handling reconfiguration, upgrades, and partial failures.
  • Prefix deduplication and LCP support:

    • Don’t just match full prefixes; reuse the longest shared prefix portion (see the sketch after this list).
    • Essential for prompts that vary slightly but share large common segments.
  • Sharded and pipelined inference:

    • Importing KV states into tensor-parallel / pipeline-parallel topologies.
  • Compression and transport economics:

    • Encoding KV caches to reduce bandwidth for cross-node or cross-rack reuse.
  • Security and isolation:

    • Multi-tenant scenarios raise questions: which prefixes are shareable, and under what guarantees?
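
As one hypothetical direction for LCP support (explicitly not in the MVP), prefixes could be hashed in fixed-size token blocks with chained digests, so a lookup can fall back to the longest run of matching blocks instead of requiring an exact full-prefix hit. A sketch:

```python
import hashlib

BLOCK = 16  # tokens per hashed block; purely illustrative

def block_hashes(token_ids):
    """Chained hashes of each complete BLOCK-sized prefix chunk, so two
    sequences share the first k hashes iff they share the first k blocks."""
    hashes, running = [], hashlib.sha256()
    for i in range(0, len(token_ids) - len(token_ids) % BLOCK, BLOCK):
        running.update(str(token_ids[i:i + BLOCK]).encode("utf-8"))
        hashes.append(running.copy().hexdigest())
    return hashes

def longest_cached_prefix(token_ids, registry_hashes):
    """Return how many tokens of this request are already covered by cached
    KV blocks (registry_hashes is the set of published block-chain hashes)."""
    covered = 0
    for depth, h in enumerate(block_hashes(token_ids), start=1):
        if h in registry_hashes:
            covered = depth * BLOCK
        else:
            break
    return covered
```

Anything beyond the covered portion would still be prefilled normally, so correctness degrades gracefully to the existing exact-match behavior when nothing overlaps.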

If the first wave of LLM infra was about bigger models and smarter schedulers, the next wave is about intelligence in memory: where states live, how they move, and how little we can recompute.

The KV Marketplace is an existence proof: KV sharing is implementable with today’s stacks, and it behaves like the sort of primitive that core platforms could standardize.


When Attention Becomes an API

What starts as a node-local optimization for vLLM hints at a broader future: KV as an addressable interface.

Imagine:

  • A cluster-wide KV fabric where system prompts and common scaffolds are materialized exactly once per model version.
  • Routers that co-locate semantically similar workloads to maximize KV hit rates.
  • Serving frameworks where load_model is only half the story; join_cache(mesh, prefix_signature) is the other half.

We’re not there yet. But Cross-GPU KV Cache Marketplace is a credible step in that direction—and a challenge to inference runtimes that still treat every prefill as an isolated, private expense.

For engineers running LLMs at scale, it poses a sharp question:

If your GPUs are still recomputing the same prefixes all day, how much of your infrastructure is just paying for not having a KV marketplace?


Source: neelsomani/kv-marketplace on GitHub