Inside Nano-vLLM: How Modern Inference Engines Transform Prompts into Tokens

A deep dive into Nano-vLLM's architecture reveals the sophisticated engineering behind LLM inference, from batching strategies to KV cache management and tensor parallelism.

When you send a prompt to an LLM API, what actually happens under the hood? The journey from your text to generated response involves sophisticated engineering that most developers never see. Nano-vLLM, a minimal yet production-grade inference engine, provides a perfect lens for understanding these internals.

The Simple Interface Hiding Complex Architecture

The entry point to Nano-vLLM is deceptively simple: an LLM class with a generate method. You pass in prompts and sampling parameters, and receive generated text. But beneath this straightforward interface lies a carefully orchestrated pipeline that transforms text into tokens, schedules computation efficiently, and manages GPU resources with precision.
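In code, that surface looks roughly like the sketch below, modeled on the vLLM-style API that Nano-vLLM mirrors (exact argument and field names may differ from the actual package):

```python
# Hedged sketch of the public interface; one generate() call hides
# tokenization, scheduling, batching, KV cache management, model
# execution, and sampling.
from nanovllm import LLM, SamplingParams

llm = LLM("/path/to/model", tensor_parallel_size=1)
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain KV caching in one sentence."], params)
print(outputs[0]["text"])
```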

From Natural Language to Token Sequences

The first transformation happens at the tokenizer level. Each prompt string goes through a model-specific tokenizer that splits natural language into tokens—the fundamental units that LLMs process. Different model families (Qwen, LLaMA, DeepSeek) use different tokenizers, which is why the same prompt length can produce different token counts across models.

The tokenizer converts each prompt into a sequence: an internal data structure representing a variable-length array of token IDs. This sequence becomes the core unit of work flowing through the system.
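A minimal sketch of this first step, using a Hugging Face tokenizer and an illustrative Sequence container (not Nano-vLLM's actual class):

```python
# Illustrative only: turn a prompt string into a sequence of token IDs
# with a model-specific tokenizer, then wrap it in the engine's unit of work.
from dataclasses import dataclass
from transformers import AutoTokenizer

@dataclass
class Sequence:
    seq_id: int
    token_ids: list[int]       # prompt tokens; decode appends to this list
    num_prompt_tokens: int

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
ids = tokenizer.encode("Explain KV caching in one sentence.")
seq = Sequence(seq_id=0, token_ids=ids, num_prompt_tokens=len(ids))
```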

The Producer-Consumer Pattern: Decoupling Request Submission from Processing

Rather than processing each sequence immediately, Nano-vLLM adopts a producer-consumer pattern with the Scheduler at its center. The add_request method acts as the producer: it converts prompts to sequences and places them into the Scheduler's queue. Meanwhile, a separate step loop acts as the consumer, pulling batches of sequences from the Scheduler for processing.

This decoupling is key—it allows the system to accumulate multiple sequences and process them together, which is where the performance gains come from. The Scheduler maintains two queues:

  • Waiting Queue: Sequences that have been submitted but not yet started. New sequences from add_request always enter here first.
  • Running Queue: Sequences that are actively being processed—either in prefill or decode phase.
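A stripped-down sketch of that structure (method and field names are illustrative, not Nano-vLLM's exact ones):

```python
# Producer-consumer split: add_request() produces into the waiting queue;
# the step loop consumes batches drawn from the running queue.
from collections import deque

class Scheduler:
    def __init__(self, max_batch_size: int = 8):
        self.waiting: deque = deque()   # submitted, not yet started
        self.running: deque = deque()   # in prefill or decode
        self.max_batch_size = max_batch_size

    def add_request(self, seq) -> None:
        # Producer side: new sequences always enter the waiting queue first.
        self.waiting.append(seq)

    def schedule(self) -> list:
        # Consumer side: promote waiting sequences while capacity allows,
        # then hand back the batch to process this step.
        while self.waiting and len(self.running) < self.max_batch_size:
            self.running.append(self.waiting.popleft())
        return list(self.running)
```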

The Throughput-Latency Trade-off: Why Batching Matters

GPU computation has significant fixed overhead—initializing CUDA kernels, transferring data between CPU and GPU memory, and synchronizing results. If you process one sequence at a time, you pay this overhead for every single request. By batching multiple sequences together, you amortize this overhead across many requests, dramatically improving overall throughput.

However, batching comes with a trade-off. When three prompts are batched together, each must wait for the others to complete before any results are returned. The total time for the batch is determined by the slowest sequence. This means:

  • Larger batches yield higher throughput but potentially higher latency for individual requests
  • Smaller batches yield lower latency but reduced throughput

This is a fundamental tension in inference engine design, and the batch size parameters you configure directly control this trade-off.
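A back-of-the-envelope illustration with invented timing numbers shows how the trade-off plays out:

```python
# Hypothetical costs, purely for illustration: a fixed per-step overhead is
# amortized across the batch, so throughput rises with batch size while the
# time any single request spends waiting on each step also rises.
FIXED_OVERHEAD_MS = 5.0    # assumed kernel-launch + sync cost per step
PER_SEQ_COMPUTE_MS = 1.0   # assumed compute cost per sequence per step

for batch_size in (1, 8, 64):
    step_ms = FIXED_OVERHEAD_MS + batch_size * PER_SEQ_COMPUTE_MS
    throughput = batch_size / step_ms   # sequences advanced per millisecond
    print(f"batch={batch_size:3d}  step={step_ms:5.1f} ms  "
          f"throughput={throughput:.2f} seq/ms")
```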

Prefill vs. Decode: Two Distinct Phases of Generation

Before diving deeper into the Scheduler, we need to understand a crucial distinction. LLM inference happens in two phases:

  • Prefill: Processing the input prompt. All input tokens are processed together to build up the model's internal state. During this phase, the user sees nothing.
  • Decode: Generating output tokens. The model produces one token at a time, each depending on all previous tokens. This is when you see text streaming out.

For a single sequence, there is exactly one prefill phase followed by many decode steps. The Scheduler needs to distinguish between these phases because they have very different computational characteristics—prefill processes many tokens at once, while decode processes just one token per step.
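A simplified sketch of the two phases for a single sequence (the model_runner object and its prefill/decode methods are placeholders, not Nano-vLLM's API):

```python
# Placeholder sketch: one prefill pass over the whole prompt, then a loop
# of single-token decode steps until EOS or the token budget runs out.
def generate_one(model_runner, seq, max_new_tokens: int, eos_token_id: int):
    # Prefill: process all prompt tokens in one forward pass, populating
    # the KV cache. The user sees no output during this phase.
    next_token = model_runner.prefill(seq.token_ids)
    seq.token_ids.append(next_token)

    # Decode: one token per step, each attending to all previous tokens.
    for _ in range(max_new_tokens - 1):
        if next_token == eos_token_id:
            break
        next_token = model_runner.decode(seq)
        seq.token_ids.append(next_token)
    return seq.token_ids
```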

Resource Management: The Block Manager's Role

The Block Manager is where vLLM's memory management innovation lives. To understand it, we first need to introduce a new resource unit: the block.

From Variable-Length Sequences to Fixed-Size Blocks

A sequence is a variable-length array of tokens—it can be 10 tokens or 10,000. But variable-length allocations are inefficient for GPU memory management. The Block Manager solves this by dividing sequences into fixed-size blocks (default: 256 tokens each).

A 700-token sequence would occupy three blocks: two full blocks (256 tokens each) and one partial block (188 tokens, with 68 slots unused). Importantly, tokens from different sequences never share a block—but a long sequence will span multiple blocks.
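The arithmetic is simple enough to sketch:

```python
# Mapping a variable-length sequence onto fixed-size blocks
# (block size 256, matching the default described above).
import math

BLOCK_SIZE = 256

def blocks_needed(num_tokens: int) -> int:
    return math.ceil(num_tokens / BLOCK_SIZE)

full, remainder = divmod(700, BLOCK_SIZE)
print(blocks_needed(700))                          # 3
print(f"{full} full blocks, {remainder} tokens in the last block, "
      f"{BLOCK_SIZE - remainder} slots unused")    # 2 full, 188 used, 68 unused
```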

Prefix Caching: The Hash-to-Block Mapping

Here's where it gets clever. Each block's content is hashed, and the Block Manager maintains a hash-to-block-id mapping. When a new sequence arrives, the system computes hashes for its blocks and checks if any already exist in the cache.

If a block with the same hash exists, the system reuses it by incrementing a reference count—no redundant computation or storage needed. This is particularly powerful for scenarios where many requests share common prefixes (like system prompts in chat applications). The prefix only needs to be computed once; subsequent requests can reuse the cached results.
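A simplified sketch of this bookkeeping follows. The hashing scheme is illustrative: chaining in the previous block's hash so that a hit implies the whole prefix matches is the typical approach, and this sketch drops a block's cache entry as soon as its reference count reaches zero, whereas a real engine can keep freed blocks cached until memory pressure forces eviction.

```python
# Illustrative prefix cache: content hash -> block id, with reference counts.
from hashlib import sha256

class BlockManager:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.hash_to_block: dict[bytes, int] = {}
        self.block_to_hash: dict[int, bytes] = {}
        self.ref_counts: dict[int, int] = {}

    def allocate(self, prev_hash: bytes, block_tokens: tuple[int, ...]) -> int:
        # Chain the previous block's hash into this one so a match implies
        # an identical prefix, not just identical contents of this block.
        h = sha256(prev_hash + repr(block_tokens).encode()).digest()
        if h in self.hash_to_block:
            block_id = self.hash_to_block[h]      # cache hit: reuse the block
            self.ref_counts[block_id] += 1
        else:
            block_id = self.free_blocks.pop()     # cache miss: take a free block
            self.hash_to_block[h] = block_id
            self.block_to_hash[block_id] = h
            self.ref_counts[block_id] = 1
        return block_id

    def free(self, block_id: int) -> None:
        # Metadata-only: the GPU-side KV cache is never zeroed, it is simply
        # overwritten when the block is reused.
        self.ref_counts[block_id] -= 1
        if self.ref_counts[block_id] == 0:
            del self.hash_to_block[self.block_to_hash.pop(block_id)]
            self.free_blocks.append(block_id)
```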

Control Plane vs. Data Plane

A subtle but important point: the Block Manager lives in CPU memory and only tracks metadata—which blocks are allocated, their reference counts, and hash mappings. The actual KV cache data lives on the GPU.

The Block Manager is the control plane; the GPU memory is the data plane. This separation allows fast allocation decisions without touching GPU memory until actual computation happens. When blocks are deallocated, the Block Manager marks them as free immediately, but the GPU memory isn't zeroed—it's simply overwritten when the block is reused. This avoids unnecessary memory operations.

Model Execution: The Model Runner's Responsibilities

The Model Runner is responsible for actually executing the model on GPU(s). When the step loop retrieves a batch of sequences from the Scheduler, it passes them to the Model Runner along with the action (prefill or decode).

Tensor Parallelism: Scaling Beyond Single GPU Limits

When a model is too large for a single GPU, Nano-vLLM supports tensor parallelism (TP)—splitting the model across multiple GPUs. With TP=8, for example, eight GPUs work together to run a single model.

The communication architecture uses a leader-worker pattern:

  • Rank 0 (Leader): Receives commands from the step loop, executes its portion, and coordinates with workers.
  • Ranks 1 to N-1 (Workers): Continuously poll a shared memory buffer for commands from the leader.

When the leader receives a run command, it writes the method name and arguments to shared memory. Workers detect this, read the parameters, and execute the same operation on their respective GPUs. Each worker knows its rank, so it can compute its designated portion of the work.

This shared-memory approach is efficient for single-machine multi-GPU setups, avoiding network overhead.
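A toy version of that handshake, using Python's multiprocessing.shared_memory with a one-byte command slot per worker instead of serialized method names and arguments:

```python
# Leader-worker polling over shared memory (toy layout, no GPUs involved).
from multiprocessing import Process, shared_memory
import time

CMD_IDLE, CMD_RUN, CMD_STOP = 0, 1, 2
NUM_WORKERS = 3   # ranks 1..3; this process plays the rank-0 leader

def worker(rank: int, shm_name: str):
    shm = shared_memory.SharedMemory(name=shm_name)
    slot = rank - 1
    while True:
        cmd = shm.buf[slot]
        if cmd == CMD_RUN:
            # A real worker would run the same forward pass on its own GPU
            # shard; its rank tells it which slice of the weights it owns.
            print(f"rank {rank}: executing my shard")
            shm.buf[slot] = CMD_IDLE   # acknowledge completion
        elif cmd == CMD_STOP:
            break
        time.sleep(0.001)              # poll
    shm.close()

if __name__ == "__main__":
    shm = shared_memory.SharedMemory(create=True, size=NUM_WORKERS)
    for i in range(NUM_WORKERS):
        shm.buf[i] = CMD_IDLE
    procs = [Process(target=worker, args=(r, shm.name))
             for r in range(1, NUM_WORKERS + 1)]
    for p in procs:
        p.start()
    for i in range(NUM_WORKERS):       # leader broadcasts a "run" command
        shm.buf[i] = CMD_RUN
    while any(shm.buf[i] != CMD_IDLE for i in range(NUM_WORKERS)):
        time.sleep(0.001)              # wait for every worker to acknowledge
    for i in range(NUM_WORKERS):
        shm.buf[i] = CMD_STOP
    for p in procs:
        p.join()
    shm.close()
    shm.unlink()
```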

Preparing for Computation

Before invoking the model, the Model Runner prepares the input based on the action:

  • Prepare Prefill: Batches multiple sequences with variable lengths, computing cumulative sequence lengths for efficient attention computation.
  • Prepare Decode: Batches single tokens (one per sequence) with their positions and slot mappings for KV cache access.

This preparation also involves converting CPU-side token data into GPU tensors—the point where data crosses from CPU memory to GPU memory.
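A sketch of what prefill preparation might look like (names are illustrative; the cumulative-lengths layout is what variable-length attention kernels typically expect):

```python
# Flatten a batch of variable-length prompts into one tensor, recording where
# each sequence starts via cumulative lengths, then move everything to the GPU.
import torch

def prepare_prefill(seqs, device: str = "cuda"):
    token_ids, positions, cu_seqlens = [], [], [0]
    for seq in seqs:
        token_ids.extend(seq.token_ids)
        positions.extend(range(len(seq.token_ids)))
        cu_seqlens.append(cu_seqlens[-1] + len(seq.token_ids))
    # The CPU-to-GPU boundary: Python lists become device tensors here.
    return (torch.tensor(token_ids, dtype=torch.int64, device=device),
            torch.tensor(positions, dtype=torch.int64, device=device),
            torch.tensor(cu_seqlens, dtype=torch.int32, device=device))
```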

CUDA Graphs: Eliminating Kernel Launch Overhead

For decode steps (which process just one token per sequence), kernel launch overhead can become significant relative to actual computation. CUDA Graphs address this by recording a sequence of GPU operations once, then replaying them with different inputs.

Nano-vLLM pre-captures CUDA graphs for common batch sizes (1, 2, 4, 8, 16, up to 512), allowing decode steps to execute with minimal launch overhead.
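The capture-once, replay-many pattern, sketched with PyTorch's CUDA graph API and a placeholder model standing in for a decode step:

```python
# Record a fixed kernel sequence for one batch shape, then replay it with
# new data copied into the same input buffer.
import torch

model = torch.nn.Linear(1024, 1024).cuda().eval()
static_input = torch.zeros(8, 1024, device="cuda")   # fixed batch size of 8

# Warm up on a side stream (required before capture), then record once.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_output = model(static_input)

# Replay: copy fresh data into the captured input buffer and relaunch the
# entire recorded kernel sequence with a single call.
static_input.copy_(torch.randn(8, 1024, device="cuda"))
graph.replay()
print(static_output[0, :4])
```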

From Logits to Tokens: The Sampling Process

The model doesn't output a single token. It outputs logits: raw, unnormalized scores for every token in the vocabulary. A softmax turns these scores into a probability distribution, and the final step is sampling: selecting one token from that distribution.

The temperature parameter controls this selection. Mathematically, it adjusts the shape of the probability distribution:

  • Low temperature (approaching 0): The distribution becomes sharply peaked. The highest-probability token is almost always selected, making outputs more deterministic and focused.
  • High temperature: The distribution flattens. Lower-probability tokens have a better chance of being selected, making outputs more diverse and creative.

This is where the "randomness" in LLM outputs comes from, and why the same prompt can produce different responses: the sampling step draws from the temperature-adjusted distribution rather than always picking the single most likely token, introducing controlled variability.
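A toy sketch of temperature sampling over a four-token vocabulary:

```python
# Divide logits by the temperature, softmax into probabilities, draw a token.
import torch

def sample(logits: torch.Tensor, temperature: float) -> int:
    if temperature == 0.0:
        return int(torch.argmax(logits))        # greedy: fully deterministic
    probs = torch.softmax(logits / temperature, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))

logits = torch.tensor([2.0, 1.0, 0.5, -1.0])    # toy vocabulary of 4 tokens
print(sample(logits, temperature=0.1))          # almost always token 0
print(sample(logits, temperature=1.5))          # noticeably more varied
```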

The Bigger Picture: Why This Matters

Understanding these internals matters because every decision in this pipeline affects the trade-offs you'll face when deploying LLMs:

  • Batch size controls throughput vs. latency
  • Block size affects memory efficiency and cache hit rates
  • Tensor parallelism determines scalability limits
  • CUDA graph usage impacts per-token latency

Nano-vLLM's ~1,200 lines of Python distill these core ideas while achieving throughput comparable to the full vLLM implementation. This makes it an ideal lens for understanding inference engine design without getting lost in the complexity of supporting dozens of model architectures and hardware backends.

In Part 2, we'll open the black box of model computation itself, exploring attention mechanisms, KV cache internals, and tensor parallelism at the computation level. Understanding these internals will complete the picture—from prompt string to generated text, with nothing left hidden.
