Anthropic and OpenAI have both introduced 'fast mode' for their coding models, but their approaches are completely different: one keeps serving the same model and optimizes how it is served, while the other trades capability for raw speed.
Anthropic and OpenAI recently introduced "fast mode" for their best coding models, offering significantly higher speeds. However, the two implementations are fundamentally different: Anthropic's fast mode delivers roughly 2.5x the tokens per second (about 170 vs 65), while OpenAI's delivers over 1,000 tokens per second, around 15x their baseline. The key difference? Anthropic serves its actual model, while OpenAI uses a different, less capable model called GPT-5.3-Codex-Spark.
How Anthropic's Fast Mode Works
The core trade-off in AI inference economics is batching. GPUs compute quickly, but moving data through memory is slow: a user's prompt has to be copied onto the GPU before processing can begin, and during generation the model's weights have to be streamed through the compute units for every token produced. Batching multiple users together amortizes those costs and raises throughput, but it forces individuals to wait for the batch to fill.
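A toy model makes the tension concrete. The numbers below are invented purely for illustration (they don't describe any real GPU or provider), but the shape is right: each decode step pays a large fixed cost to stream weights, plus a small marginal cost per request in the batch.

```python
# Toy model of the batching trade-off. All numbers are invented for
# illustration; they don't describe any real GPU or provider.

WEIGHT_STREAM_MS = 20.0  # fixed cost per decode step: streaming the weights
PER_USER_MS = 0.5        # marginal cost per extra request in the batch

def decode_step_ms(batch_size: int) -> float:
    """Time for one decode step with `batch_size` requests batched together."""
    return WEIGHT_STREAM_MS + PER_USER_MS * batch_size

for batch_size in (1, 8, 64):
    step = decode_step_ms(batch_size)
    per_user_tps = 1000.0 / step           # tokens/sec as seen by one user
    total_tps = batch_size * per_user_tps  # tokens/sec across the whole GPU
    print(f"batch={batch_size:3d}  per-user={per_user_tps:6.1f} tok/s  "
          f"GPU total={total_tps:7.1f} tok/s")
```

Bigger batches raise total GPU throughput but lower each individual user's tokens per second; fast modes move toward the small-batch end of that table, and someone has to pay for the lost throughput.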
Anthropic's approach is essentially a "bus pass" that guarantees immediate departure. When you use their fast mode, you pay six times more (effectively covering the cost of other potential passengers) but get your results much faster since you don't wait for batch completion. This low-batch-size inference is why they can serve the actual Opus 4.6 model while maintaining higher speeds.
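Continuing the same invented numbers, the per-token economics fall out of simple arithmetic: a GPU costs roughly the same per hour whether it serves one rider or a full bus, so shrinking the batch multiplies the cost per token.

```python
# Rough per-token cost under the same invented assumptions as the sketch
# above. The $/hour figure is made up; only the direction of the effect matters.

GPU_COST_PER_HOUR = 4.0  # assumed accelerator rental cost
WEIGHT_STREAM_MS = 20.0  # fixed per-step cost (streaming weights)
PER_USER_MS = 0.5        # marginal per-request cost within a batch

def dollars_per_million_tokens(batch_size: int) -> float:
    step_ms = WEIGHT_STREAM_MS + PER_USER_MS * batch_size
    gpu_tokens_per_sec = batch_size * 1000.0 / step_ms
    return GPU_COST_PER_HOUR / (gpu_tokens_per_sec * 3600) * 1_000_000

print(f"batch=64: ${dollars_per_million_tokens(64):.2f} per 1M tokens")  # cheap seats
print(f"batch=1:  ${dollars_per_million_tokens(1):.2f} per 1M tokens")   # whole bus, one rider
```

With these made-up constants the gap comes out larger than Anthropic's actual 6x premium; the real fast mode presumably isn't literally batch-of-one, but the direction of the trade is the point.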
How OpenAI's Fast Mode Works
OpenAI's approach is fundamentally different because they're using Cerebras chips. These aren't ordinary GPUs: they're massive wafer-scale chips of roughly 70 square inches (a typical H100 die is about 1 square inch) that can hold entire models in on-chip SRAM.
The Cerebras chip has 44GB of internal memory, enough for small models (around 20B parameters at fp16 or 40B at int8 quantization) but not enough for GPT-5.3-Codex. This is why OpenAI introduced GPT-5.3-Codex-Spark - a smaller distilled version of their main model that can fit entirely in the Cerebras chip's memory.
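The arithmetic behind that sizing is straightforward. A quick check (ignoring KV cache and activation memory, which eat into the headroom in practice):

```python
# Quick sizing check: how many parameters fit in 44 GB of on-chip SRAM?
# Ignores KV cache and activations, which reduce the real headroom.

SRAM_BYTES = 44e9

def max_params(bytes_per_param: float) -> float:
    return SRAM_BYTES / bytes_per_param

print(f"fp16 (2 bytes/param): ~{max_params(2) / 1e9:.0f}B parameters")  # ~22B
print(f"int8 (1 byte/param):  ~{max_params(1) / 1e9:.0f}B parameters")  # ~44B
```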
By keeping everything in SRAM instead of streaming weights from external memory, inference becomes dramatically faster - about 15x faster in this case. However, this speed comes at the cost of capability, as Spark is notably less capable than the full GPT-5.3-Codex model.
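A rough way to see why is that single-stream decode speed is bounded by how fast the weights can be read per generated token. The bandwidth figures below are publicly quoted vendor numbers used here as assumptions (roughly 3.35 TB/s for an H100's HBM3, about 21 PB/s for the WSE-3's on-chip SRAM), and the model size is the ~40 GB estimate from above.

```python
# Back-of-the-envelope: single-stream decode is roughly bounded by
#   tokens/sec <= memory_bandwidth / bytes_of_weights_read_per_token.
# Bandwidth figures are publicly quoted vendor specs, treated as assumptions.

MODEL_BYTES = 40e9   # ~40B parameters at int8, as estimated above

HBM_BW = 3.35e12     # H100 SXM HBM3 bandwidth, bytes/sec
SRAM_BW = 21e15      # Cerebras WSE-3 on-chip SRAM bandwidth, bytes/sec

print(f"HBM-bound ceiling:  {HBM_BW / MODEL_BYTES:>10,.0f} tok/s")   # ~84
print(f"SRAM-bound ceiling: {SRAM_BW / MODEL_BYTES:>10,.0f} tok/s")  # ~525,000
```

The SRAM ceiling is absurdly high, which is the point: once the weights live on-chip, memory stops being the bottleneck, and compute plus the sequential nature of token-by-token decoding become the limit. That's why the realized gain is "only" around 15x rather than thousands.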
The Technical Trade-offs
Anthropic's approach is simpler but more expensive per user. By reducing batch sizes, they sacrifice overall GPU throughput to improve individual user experience. This makes sense for their use case since they're serving the same high-quality model.
OpenAI's approach is more technically impressive but introduces capability compromises. Getting models to run on Cerebras chips is non-trivial due to their unusual architecture. Training a 20B-40B parameter distilled model that remains "good enough" for coding tasks is also challenging.
The choice between these approaches reflects different priorities: Anthropic values maintaining model quality while improving speed, whereas OpenAI prioritizes maximum speed even if it means using a less capable model.
Is Fast Inference the Next Big Thing?
Despite both labs releasing fast modes, this doesn't necessarily signal that fast inference is becoming the primary focus. Anthropic's implementation appears reactive: they seemingly wanted something to announce before OpenAI's more complex Cerebras-based system shipped. OpenAI, meanwhile, is mainly exploring what's possible with their new partnership.
For many users, including myself, "fast, less-capable inference" isn't particularly useful. The value of AI coding assistants is dominated by accuracy and reliability, not raw speed. Paying 6x for 20% more mistakes is a poor trade when most time is spent fixing errors rather than waiting for responses.
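A back-of-the-envelope version of that argument, with invented numbers for how a coding session splits between waiting on the model and reviewing or fixing its output:

```python
# Hypothetical time budget for a coding session. The fractions are invented
# purely to illustrate the argument; they're not measurements of anything.

WAIT = 0.10    # share of the session spent waiting on the model
FIX = 0.50     # share spent reviewing and fixing model output
OTHER = 0.40   # everything else: thinking, prompting, running tests

speedup = 15.0          # fast mode's generation speedup
extra_mistakes = 0.20   # hypothetical increase in errors from a weaker model

with_fast_mode = WAIT / speedup + FIX * (1 + extra_mistakes) + OTHER
print(f"relative session length: {with_fast_mode:.2f}")  # ~1.01: slightly *longer*
```

With these assumed splits, even a 15x generation speedup is more than eaten by the extra fixing time; the waiting slice was simply too small for speed alone to matter much.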
However, fast inference might become a useful lower-level primitive in AI systems. Claude Code already uses Haiku for some operations, and OpenAI might use Spark similarly. The economics and practical utility of these approaches remain to be seen.
What's clear is that both labs are experimenting with different paths to faster AI, reflecting the ongoing innovation in inference optimization. Whether users prioritize speed or capability will ultimately determine which approach proves more valuable in the long run.
